http://blogs.sun.com/brendan/date/20080218 Monday February 18, 2008

DTrace Links

A few happenings in the increasingly busy world of DTrace:

Posted by brendan [DTrace] ( February 18, 2008 03:48 AM ) Permalink | Comments[0]
http://blogs.sun.com/brendan/date/20070810 Friday August 10, 2007

DTrace Bourne shell (sh) provider

Alan Hargreaves has prototyped a DTrace Bourne shell provider, and is looking ahead to see how this can be integrated with Solaris and OpenSolaris. This is great news for anyone who would like to analyze or troubleshoot their Bourne shell scripts. It is also good news for people (especially system administrators) wanting to learn DTrace, as it may provide a familiar path (sh scripting) for you to learn user-land DTracing.

Why sh, and not csh, tcsh, ksh, bash or zsh? Alan could have done any, we wanted to start with the Bourne shell to both get it done and as an example for the other shell providers to follow (watch this space). So if you really just care about bash or whatever, then learning about the sh provider now should be a useful heads-up of what should be possible.

hello.sh

When learning to DTrace software, it can be best to start with the simple stuff. Confirm that what you think will happen, does indeed happen, then move to progressivly more complex scripts.

Here is hello.sh, a dead simple script to start with,

The following DTrace one-liner will run hello.sh (-c), trace all sh provider probes (sh*:::), and frequency count their probe names (@[probename] = count()).

So builtin-entry and builtin-return were called 5 times, one for each echo. The line probe was also called 5 times, once for each line that the shell executed.

Frequency counting builtins

Each probe provides useful information in their args, such as the script name, builtin or exec name, line number, etc. Here I'll use the builtin-entry probe to frequency count the builtin names,

That found that echo was called 5 times. Nothing suprising there, this is just confirming that the probes behave as they should.

Line number tracing

The line probe allows us to trace line execution. This one liner counts how many times each line was run,

Each of the lines 3 to 7 were called once, as expected. This one-liner should be useful for finding loop hotspots.

And the following one-liner traces line execution live,

Great, that works. Is this getting too easy? lets move on...

func_abc.sh

Now for somthing a little harder (and more interesting). In the following script, func_a() calls func_b() which calls func_c().

# cat -n func_abc.sh   
     1  #!./sh
     2
     3  func_c()
     4  {
     5          echo "Function C"
     6          sleep 1
     7  }
     8
     9  func_b()
    10  {
    11          echo "Function B"
    12          sleep 1
    13          func_c
    14  }
    15
    16  func_a()
    17  {
    18          echo "Function A"
    19          sleep 1
    20          func_b
    21  }
    22
    23  func_a
What order will the lines be executed now? Try thinking about it before checking the answer below.

Tracing line execution

There were some conspicuous one-second pauses during the output, which are sadly lost in this screen shot. I've added the word "pause" so that you know when that happened.

The order of the lines does make sense. What might be confusing is that we see the output text "Function B" followed by the trace of line 20 -- this is due to DTrace output buffering.

Tracing function flow - sh_flow.d

Since we have function-entry and function-return probes, we can trace function execution and indent the output based on the function depth. The following demonstrates sh_flow.d - a script which does just that for any script running on the system (with this sh provider enabled binary), and also traces builtin calls and command executions. sh_flow.d is run in one window, and func_abc.sh runs in another.

The legend is,

Great! the output makes it clear who is calling who.

The sh_flow.d script is,

Tracing function flow with times - sh_flowtime.d

A great feature of DTrace is the availability of high resolution timestamps. Here I've enhanced the previous script to print delta times,

The legend for understanding the delta times is, The legend is,

Hey - we found the latency in the script, it's those 1.02 second calls to "sleep"! :-)

Being able to present this information in such a compelling way is awsome stuff.

I won't include the source to that script just yet (this blog entry will get too long), but I'll encourage Alan to bundle these scripts along with the patch/binary of the sh provider (when he makes that available).

Tracing type times - sh_typetime.d

If your script is rather long, the above output from sh_flowtime.d may get rather long. This condensed summary may be more practical,

It's pretty obvious from that output where the time was spent.

I was experimenting with more exciting forms of that type of script,

Lots more will be possible - Alan only wrote the provider two days ago, so I haven't had much of a chance to write scripts yet!

Real world tracing

Alan literally wrote this provider from scratch in the last few days, so we haven't had a chance to point it at something big yet. Here's at least one example of something familiar, /etc/profile,

Cool. See Alan's blog for updates.



Posted by brendan [DTrace] ( August 10, 2007 12:23 PM ) Permalink | Comments[1]

http://blogs.sun.com/brendan/date/20070731 Tuesday July 31, 2007

iSCSI DTrace provider

I recently tried out the new iSCSI Target Server DTrace provider which Adam wrote (which should be included in Solaris sometime soon). It provides observability of iSCSI traffic from the the server's perspective. This is great news if you have an iSCSI Solaris server, or are planning to have one.

In this entry I'll post a few screenshots of the iSCSI provider in action, both one-liners and scripts.

one-liners

I'll start by listing the probes - which may convey a sense of what is possible to trace. Look down the "NAME" column,
# dtrace -ln 'iscsi*:::'
   ID    PROVIDER            MODULE                          FUNCTION NAME
11696 iscsi100653         iscsitgtd                 send_async_logout async-send
11697 iscsi100653         iscsitgtd                   send_async_scsi async-send
11698 iscsi100653         iscsitgtd                  handle_scsi_data data-receive
11699 iscsi100653         iscsitgtd              iscsi_conn_data_rqst data-request
11700 iscsi100653         iscsitgtd                   send_datain_pdu data-send
11701 iscsi100653         iscsitgtd            iscsi_handle_login_pkt login-command
11702 iscsi100653         iscsitgtd                 send_login_reject login-response
11703 iscsi100653         iscsitgtd            iscsi_handle_login_pkt login-response
11704 iscsi100653         iscsitgtd                 handle_logout_msg logout-command
11705 iscsi100653         iscsitgtd                        conn_state logout-response
11706 iscsi100653         iscsitgtd                   handle_noop_cmd nop-receive
11707 iscsi100653         iscsitgtd                    iscsi_conn_pkt nop-send
11708 iscsi100653         iscsitgtd                   handle_scsi_cmd scsi-command
11709 iscsi100653         iscsitgtd                     send_scsi_rsp scsi-response
11710 iscsi100653         iscsitgtd                   handle_task_mgt task-command
11711 iscsi100653         iscsitgtd                    iscsi_conn_pkt task-response
11712 iscsi100653         iscsitgtd                   handle_text_msg text-command
11713 iscsi100653         iscsitgtd                   handle_text_msg text-response
The easiest one-liner would be to frequency count iSCSI operations,
# dtrace -n 'iscsi*::: { @[probename] = count(); }'
dtrace: description 'iscsi*::: ' matched 18 probes
^C

  nop-receive                                                       3
  nop-send                                                          3
  data-receive                                                     55
  data-request                                                     55
  scsi-response                                                   103
  scsi-command                                                    342
  data-send                                                      2865
The above output shows that the iSCSI server is doing more send operations (2865 data-sends), than receive (55 data-receives).

Now to see who is using this iSCSI server, by frequency counting remote IP addresses,

# dtrace -n 'iscsi*::: { @[args[0]->ci_remote] = count(); }'
dtrace: description 'iscsi*::: ' matched 18 probes
^C

  192.168.1.109                                                  1338
The above shows that 192.168.1.108 triggered 1338 iSCSI operations - which may have been sends, receives, nops, etc. Breaking down further is just a little more DTrace,
# dtrace -qn 'iscsi*::: { @[args[0]->ci_remote, probename] = count(); } 
  END { printa("%32s %18s %@16d\n", @); }'
^C
                   192.168.1.109        nop-receive                2
                   192.168.1.109           nop-send                2
                   192.168.1.109       data-receive               16
                   192.168.1.109       data-request               16
                   192.168.1.109      scsi-response               43
                   192.168.1.109       scsi-command              253
                   192.168.1.109          data-send             2619
Great, while a simple one-liner - it quickly identifies who is connecting and what operations they are doing. Some performance wins are about locating and eliminating unnecessary work, which one-liners such as this help identify.

Now for printing payload bytes by iSCSI operation,

# dtrace -n 'iscsi*::: { @[probename] = sum(args[1]->ii_datalen); }'
dtrace: description 'iscsi*::: ' matched 18 probes
^C

  data-request                                                      0
  scsi-response                                                     0
  data-receive                                                  99328
  scsi-command                                                 140800
  data-send                                                  20672512
The above one-liner prints total bytes by operation, which does lose some data - what was the count by operation, average, min, max? DTrace can answer all of these seperately, or provide a summary as a distribution plot. Here I've used quantize for a power of 2 plot,
# dtrace -n 'iscsi*::: { @[probename] = quantize(args[2]->ii_datalen); }'
dtrace: description 'iscsi*::: ' matched 18 probes
^C

  data-request                                      
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 12       
               1 |                                         0        

  scsi-response                                     
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 55       
               1 |                                         0        

  data-receive                                      
           value  ------------- Distribution ------------- count    
             512 |                                         0        
            1024 |@@@@@@@                                  2        
            2048 |                                         0        
            4096 |@@@@@@@@@@@@@                            4        
            8192 |@@@@@@@@@@@@@@@@@@@@                     6        
           16384 |                                         0        

  scsi-command                                      
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         209      
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |                                         0        
             512 |@@@                                      18       
            1024 |@@                                       10       
            2048 |@@                                       11       
            4096 |@                                        4        
            8192 |@@                                       12       
           16384 |                                         0        

  data-send                                         
           value  ------------- Distribution ------------- count    
             256 |                                         0        
             512 |                                         1        
            1024 |                                         16       
            2048 |                                         21       
            4096 |                                         28       
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  2559     
           16384 |                                         0        
The plots above show that for the data-send operation, the payload size was usually in the 8 to 16 Kbyte bucket.

scripts

Apart from one-lines, DTrace allows powerful scripts to be contstructed. I've put a few on the SolarisInternals DTrace Topics: iSCSI page.

Here I'll post my favourite so far - a script which measures read/write latency (or "response time") in nanoseconds,

# ./iscsirwlat.d 
Tracing... Hit Ctrl-C to end.
^C
Read Latency (ns),

  192.168.1.109                                     
           value  ------------- Distribution ------------- count    
           32768 |                                         0        
           65536 |@@@                                      27       
          131072 |@@                                       17       
          262144 |                                         4        
          524288 |@@@@                                     39       
         1048576 |@@@                                      32       
         2097152 |@@@@@@@@@@@@@@@@@                        164      
         4194304 |@                                        11       
         8388608 |@                                        5        
        16777216 |@                                        10       
        33554432 |@                                        7        
        67108864 |@                                        5        
       134217728 |@@@@@                                    46       
       268435456 |@                                        14       
       536870912 |                                         0        

Write Latency (ns),

  192.168.1.109                                     
           value  ------------- Distribution ------------- count    
           32768 |                                         0        
           65536 |@@                                       14       
          131072 |@@                                       16       
          262144 |@@@                                      20       
          524288 |@@@                                      20       
         1048576 |@@@@@                                    33       
         2097152 |@                                        5        
         4194304 |                                         0        
         8388608 |                                         2        
        16777216 |@                                        10       
        33554432 |@@@                                      18       
        67108864 |@@@@@                                    33       
       134217728 |@@@@@@@@@@@@@                            93       
       268435456 |@@                                       12       
       536870912 |                                         0        
The script is on the DTrace Topics page. It is a glimpse of things to come; how many of those were synchronous requests (which may hurt application performance) versus asynchronous flushing or read ahead (which may not hurt performance)? More DTrace is needed - which will probably include running DTrace on the client side as well...

Posted by brendan [DTrace] ( July 31, 2007 10:07 AM ) Permalink | Comments[0]
http://blogs.sun.com/brendan/date/20070730 Monday July 30, 2007

JavaScript DTrace provider ver 2.0

About 2 months ago I posted an update to the JavaScript DTrace provider to Bugzilla. In case you are interested and missed it, here are some details: The new additions were,

I was finding that the JavaScript provider was only helping with about 4 problems out of 10 (still, better than nothing). The improvements that I added should take that to around 8 out of 10 - so it is now proving to be really useful.

Anyhow, here are some before and after screenshots, tracing a simple JavaScript clock (it's enough to demonstrate the features without bogging down in application specifics),

Old,

# ./js_funcflow.d
C TIME                  FILE                     -- FUNC
1 2007 May 31 06:07:13  clock2.html              -> startTime
1 2007 May 31 06:07:13  clock2.html                -> getHours
1 2007 May 31 06:07:13  clock2.html                <- getHours
1 2007 May 31 06:07:13  clocklib.js                -> padLibZero
1 2007 May 31 06:07:13  clocklib.js                <- padLibZero
1 2007 May 31 06:07:13  clock2.html                -> getMinutes
1 2007 May 31 06:07:13  clock2.html                <- getMinutes
1 2007 May 31 06:07:13  clocklib.js                -> padLibZero
1 2007 May 31 06:07:13  clocklib.js                <- padLibZero
1 2007 May 31 06:07:13  clock2.html                -> getSeconds
1 2007 May 31 06:07:13  clock2.html                <- getSeconds
1 2007 May 31 06:07:13  clocklib.js                -> padLibZero
1 2007 May 31 06:07:13  clocklib.js                <- padLibZero
1 2007 May 31 06:07:13  clock2.html                -> getElementById
1 2007 May 31 06:07:13  clock2.html                <- getElementById
1 2007 May 31 06:07:13  clock2.html                -> setTimeout
1 2007 May 31 06:07:13  clock2.html                <- setTimeout
1 2007 May 31 06:07:13  clock2.html              <- startTime
^C
New,
# ./js_argflow.d
C TIME                  FILE                 -- FUNC()
1 2007 May 31 06:07:34  clock2.html           -> startTime()
1 2007 May 31 06:07:34  clock2.html             -> getHours()
1 2007 May 31 06:07:34  clock2.html             <- startTime:9 = 0x80000000
1 2007 May 31 06:07:34  clocklib.js             -> padLibZero(0x6)
1 2007 May 31 06:07:34  clocklib.js             <- padLibZero:3 = 0x955fd70
1 2007 May 31 06:07:34  clock2.html             -> getMinutes()
1 2007 May 31 06:07:34  clock2.html             <- startTime:10 = 0x80000000
1 2007 May 31 06:07:34  clocklib.js             -> padLibZero(0x7)
1 2007 May 31 06:07:34  clocklib.js             <- padLibZero:3 = 0x955fd50
1 2007 May 31 06:07:34  clock2.html             -> getSeconds()
1 2007 May 31 06:07:34  clock2.html             <- startTime:11 = 0x80000000
1 2007 May 31 06:07:34  clocklib.js             -> padLibZero(0x22)
1 2007 May 31 06:07:34  clocklib.js             <- padLibZero:4 = 0x22
1 2007 May 31 06:07:34  clock2.html             -> getElementById()
1 2007 May 31 06:07:34  clock2.html             <- startTime:12 = 0x80000000
1 2007 May 31 06:07:34  clock2.html             -> setTimeout()
1 2007 May 31 06:07:34  clock2.html             <- startTime:13 = 0x80000000
1 2007 May 31 06:07:34  clock2.html           <- startTime:13 = 0x80000000
^C
Now there is some visibility for function arguments and return values. If arguments are strings, it is possible to print them as such from DTrace.

Old,

# ./js_funccalls.d
Tracing... Hit Ctrl-C to end.
^C
 FILE                             FUNC                                    CALLS
 clock2.html                      getElementById                              1
 clock2.html                      getHours                                    1
 clock2.html                      getMinutes                                  1
 clock2.html                      getSeconds                                  1
 clock2.html                      setTimeout                                  1
 clock2.html                      startTime                                   1
 clocklib.js                      padLibZero                                  3
New,
# ./js_funcinfo.d
Tracing... Hit Ctrl-C to end.
^C
 BASE[FILE:LINE]         FUNCNAME             RUN[FILE:LINE]             CALLS
 clock2.html:7           getElementById       clock2.html:12                 1
 clock2.html:7           getHours             clock2.html:9                  1
 clock2.html:7           getMinutes           clock2.html:10                 1
 clock2.html:7           getSeconds           clock2.html:11                 1
 clock2.html:7           setTimeout           clock2.html:13                 1
 clock2.html:7           startTime            clock2.html:13                 1
 clocklib.js:2           padLibZero           clock2.html:10                 1
 clocklib.js:2           padLibZero           clock2.html:11                 1
 clocklib.js:2           padLibZero           clock2.html:9                  1
The "BASE" info describes where that function was declared or first referred to; the "RUN" info describes where that function was executed. These details are especially useful for tracing anonymous functions.

The scripts used above can be downloaded in a tar file from the OpenSolaris DTrace JavaScript site.

There are a number of people at Sun and at Mozilla helping to put DTrace probes in Mozilla, not just for JavaScript - but also for tracing browser latencies at a level which is useful for developers. For more information see the OpenSolaris page, and on Bugzilla: 370906, 388564.

Posted by brendan [DTrace] ( July 30, 2007 01:02 PM ) Permalink | Comments[0]

http://blogs.sun.com/brendan/date/20070729 Sunday July 29, 2007

DTracing off-CPU time

In this entry I'll demonstrate DTracing off-CPU time.

On my way back from OSCON 2007, I found myself in Portland airport with a laptop, no Internet and half an hour to kill. In fact, the same laptop that I used during the SVOSUG talk in April. It was a chance to finish the demo that I started back then.

For those that were there or listened in, the talk had an unfortunate start with over an hour of audiovisual issues - including problems with the laptop video driver, laptop boot process, the data projector, lights and microphone. Alan DuBoff did a good job of fixing things while I began plan B (presenting from a SunRay), and he fixed the video driver such that it works better with data projectors than any other Acer Ferrari I've tried.

One issue we didn't fix that night was a 20 second wait when starting up certain applications, such as gnome-terminal. It was to make a great live DTrace demo to finish with, but I ran out of time to do it properly (we finished at around 11:30pm). Here it is as a blog entry instead.

The problem

When running gnome-terminal, there is a 20 second wait before the terminal appears on the screen.

Initial check

During the 20 second wait,

DTrace investigation

The gnome-terminal application is off-CPU for some reason, and the CPUs are otherwise idle. There are numerous different ways to begin the investigation with DTrace, including, I'll start with the sched provider (the syscall provider would also make a good starting point). I'll keep restarting gnome-terminal, so that I can DTrace all the events (this is harder if you are trying to DTrace something that is already off-CPU).

The following one-liner runs gnome-terminal and measures the time from that process leaving the CPU to when it returns, and only prints user stack traces if that time was over 1 second,

# dtrace -n 'sched:::off-cpu /pid == $target/ { self->start = timestamp; }
  sched:::on-cpu /self->start && ((timestamp - self->start) > 1000000000)/
  { printf("waited: %d ms\n", (timestamp - self->start) / 1000000); ustack(); }
' -c gnome-terminal
dtrace: description 'sched:::off-cpu ' matched 6 probes
CPU     ID                    FUNCTION:NAME
  0  48718                    resume:on-cpu waited: 20034 ms

              libc.so.1`door_call+0x1a
              libc.so.1`_nsc_trydoorcall+0x213
              libnsl.so.1`_door_getipnodebyname_r+0x8f
              libnsl.so.1`_get_hostserv_inetnetdir_byname+0xb62
              libnsl.so.1`getipnodebyname+0xdf
              libsocket.so.1`get_addr+0x126
              libsocket.so.1`_getaddrinfo+0x414
              libsocket.so.1`getaddrinfo+0x19
              libORBit-2.so.0.1.0`get_netid+0x91
              libORBit-2.so.0.1.0`link_get_local_hostname+0x35
              libORBit-2.so.0.1.0`link_server_setup+0x56
              libORBit-2.so.0.1.0`giop_server_new+0x5e
              libORBit-2.so.0.1.0`ORBit_ORB_start_servers+0x1d8
              libORBit-2.so.0.1.0`IOP_generate_profiles+0x67
              libORBit-2.so.0.1.0`ORBit_marshal_object+0x8e
              libORBit-2.so.0.1.0`ORBit_marshal_value+0x391
              libORBit-2.so.0.1.0`orbit_small_marshal+0xf7
              libORBit-2.so.0.1.0`ORBit_small_invoke_stub+0x11c
              libORBit-2.so.0.1.0`ORBit_small_invoke_stub_n+0x43
              libORBit-2.so.0.1.0`ORBit_c_stub_invoke+0x132

^C
Hmm - looks like that user stack trace was truncated. Time to boost the ustackframes value,
# dtrace -x ustackframes=64 -n '
  sched:::off-cpu /pid == $target/ { self->start = timestamp; }
  sched:::on-cpu /self->start && ((timestamp - self->start) > 1000000000)/
  { printf("waited: %d ms\n", (timestamp - self->start) / 1000000); ustack(); }
' -c gnome-terminal
dtrace: description 'sched:::off-cpu ' matched 6 probes
dtrace: pid 101415 has exited
CPU     ID                    FUNCTION:NAME
  0  48720                    resume:on-cpu waited: 20038 ms

              libc.so.1`door_call+0x1a
              libc.so.1`_nsc_trydoorcall+0x213
              libnsl.so.1`_door_getipnodebyname_r+0x8f
              libnsl.so.1`_get_hostserv_inetnetdir_byname+0xb62
              libnsl.so.1`getipnodebyname+0xdf
              libsocket.so.1`get_addr+0x126
              libsocket.so.1`_getaddrinfo+0x414
              libsocket.so.1`getaddrinfo+0x19
              libORBit-2.so.0.1.0`get_netid+0x91
              libORBit-2.so.0.1.0`link_get_local_hostname+0x35
              libORBit-2.so.0.1.0`link_server_setup+0x56
              libORBit-2.so.0.1.0`giop_server_new+0x5e
              libORBit-2.so.0.1.0`ORBit_ORB_start_servers+0x1d8
              libORBit-2.so.0.1.0`IOP_generate_profiles+0x67
              libORBit-2.so.0.1.0`ORBit_marshal_object+0x8e
              libORBit-2.so.0.1.0`ORBit_marshal_value+0x391
              libORBit-2.so.0.1.0`orbit_small_marshal+0xf7
              libORBit-2.so.0.1.0`ORBit_small_invoke_stub+0x11c
              libORBit-2.so.0.1.0`ORBit_small_invoke_stub_n+0x43
              libORBit-2.so.0.1.0`ORBit_c_stub_invoke+0x132
              libgconf-2.so.4.1.0`ConfigServer_add_client+0x4f
              libgconf-2.so.4.1.0`gconf_get_config_server+0xb9
              libgconf-2.so.4.1.0`gconf_engine_connect+0x24f
              libgconf-2.so.4.1.0`gconf_engine_get_default+0x4c
              libgconf-2.so.4.1.0`gconf_client_get_default+0x2a
              libgnomeui-2.so.0.1401.0`libgnomeui_post_args_parse+0x187
              libgnome-2.so.0.1401.0`gnome_program_postinit+0x61
              libgnome-2.so.0.1401.0`gnome_program_init_common+0x37b
              libgnome-2.so.0.1401.0`gnome_program_initv+0x4d
              libgnome-2.so.0.1401.0`gnome_program_init+0x56
              gnome-terminal`main+0x2d5
              gnome-terminal`_start+0x7a
Good. This stack trace is likely to have led to the process leaving the CPU for the measured 20038 ms (the system was otherwise idle, so it is unlikely to have been kicked off due to scheduling). Reading through the lines, it looks like it is resolving a hostname - a common source of latencies when DNS is misconfigured.

There are a number of lines we could begin studying to confirm what is happening (and arrive at the same answer); I'll start with libnsl.so.1`getipnodebyname+0xdf, which should have a man page entry,

# man getipnodebyname
[...]
     struct hostent *getipnodebyname(const char  *name,  int  af,
     int flags, int *error_num);
[...]
Now to check what hostname is being resolved,
# dtrace -n 'pid$target:libnsl:getipnodebyname:entry { trace(copyinstr(arg0)); }'
  -c gnome-terminal
dtrace: description 'pid$target:libnsl:getipnodebyname:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  50488            getipnodebyname:entry   marlin
^C
dtrace: pid 101758 terminated by SIGINT
Oh - "marlin" is the hostname of this laptop, which should resolve just fine. Checking related configuration files,
# grep hosts /etc/nsswitch.conf
hosts:      files dns
#
# cat -n /etc/hosts
     1  #
     2  # Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
     3  # Use is subject to license terms.
     4  #
     5  #ident  "@(#)hosts      1.1     06/08/04 SMI"
     6  #
     7  127.0.0.1       localhost
     8  192.168.1.166   marlin marlin.sf.fw.jpn.com
Hmm - the name service switch file should cause /etc/hosts (/etc/inet/hosts) to be read first, which contains a valid entry for "marlin".

Lets take a closer look at the getipnodebyname() call, and trace the address family and flags arguments,

# dtrace -n 'pid$target:libnsl:getipnodebyname:entry { printf("%s af=%d flags=%d",
  copyinstr(arg0), arg1, arg2); }' -c gnome-terminal
dtrace: description 'pid$target:libnsl:getipnodebyname:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  50488            getipnodebyname:entry marlin af=26 flags=19
^C
dtrace: pid 101821 terminated by SIGINT
Ok, address family 26 is,
# grep 26 /usr/include/sys/socket.h
#define AF_INET6        26              /* Internet Protocol, Version 6 */
IPv6! Should have checked earlier,
# grep ipnodes /etc/nsswitch.conf
ipnodes:    files dns
#
# cat -n /etc/inet/ipnodes
     1  #
     2  # Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
     3  # Use is subject to license terms.
     4  #
     5  #ident  "@(#)ipnodes    1.1     06/08/04 SMI"
     6  #
     7  ::1             localhost
     8  127.0.0.1       localhost
Adding an entry to the ipnodes file for "marlin" fixed the issue.

Take 2

Was using the sched provider and then pid to trace getipnodebyname() the best path to take? It's hard to say - there are many different ways DTrace can help you solve problems. Lets try some other paths.

Here I'll trace elapsed times from the syscall provider - since if we have an off-CPU issue on an idle system, it will almost certainly be visible at the syscall layer.

The procsystime tool from the DTraceToolkit performs different types of syscall time analysis. I'm using it as it saves a minute or so of typing. Here it runs the gnome-terminal command and produces a report of elapsed times by syscall,

# /opt/DTT/procsystime -e gnome-terminal

Elapsed Times for command gnome-terminal,

         SYSCALL          TIME (ns)
           gtime               1882
      sigpending               1921
     priocntlsys               4200
          sysi86               4354
          getgid               5152
       sysconfig               5606
     getsockname               5904
      systeminfo               7040
      setcontext               7342
          fxstat               7943
     getpeername               9393
       getrlimit              11717
          getuid              13927
           uname              14661
       sigaction              14698
          getpid              19241
      setsockopt              23743
          getcwd              33784
            fsat              41700
          stat64              43024
           readv              45937
          listen              49874
          llseek              52341
           mkdir              52633
            pipe              64612
           chmod              65684
           fcntl              70864
           utime              76677
         fstat64             109420
           ioctl             133301
          access             157803
          unlink             165703
          open64             221120
          accept             276863
            bind             281026
          writev             343976
             brk             344410
         memcntl             347475
           write             358764
      getdents64             422206
          munmap             486137
         connect             588166
     resolvepath             686768
       so_socket             748039
           close             846782
            open            1127761
            read            1191102
            mmap            1592314
           xstat            8521257
         pollsys            9817661
          doorfs        20035299808
The doorfs() call takes the door descriptor as the first argument,
# dtrace -n 'syscall::doorfs:entry /pid == $target/ { trace(arg0); }' -c gnome-terminal
dtrace: description 'syscall::doorfs:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    572                     doorfs:entry                 3
  0    572                     doorfs:entry                 3
  0    572                     doorfs:entry                 3
  0    572                     doorfs:entry                 3
^C
dtrace: pid 101833 terminated by SIGINT
#
# dtrace -n 'syscall::doorfs:entry /pid == $target/ { trace(fds[arg0].fi_pathname); }'
  -c gnome-terminal
dtrace: description 'syscall::doorfs:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    572                     doorfs:entry   /var/run/name_service_door
  0    572                     doorfs:entry   /var/run/name_service_door
  0    572                     doorfs:entry   /var/run/name_service_door
  0    572                     doorfs:entry   /var/run/name_service_door
^C
dtrace: pid 101835 terminated by SIGINT
The door calls were to /var/run/name_service_door, pointing to name resolution and the nscd process (which can be confirmed with more DTrace).

Running procsystime on nscd produces,

# /opt/DTT/procsystime -e -n nscd
Hit Ctrl-C to stop sampling...
^C

Elapsed Times for processes nscd,

         SYSCALL          TIME (ns)
           ioctl               2452
          llseek               2615
         fstat64               3996
       sysconfig              13933
            open              15416
            read              26336
           gtime              52980
       so_socket             323066
           close             349885
           xstat             430465
            send             454944
         connect             795246
          doorfs          360127347
        lwp_park        10009952745
         pollsys        20032365315
       nanosleep       153088010532
nscd is multi-threaded, with many threads sleeping as they wait for work - meaning that large off-CPU elapsed times may have nothing to do with gnome-terminal. The time for pollsys() is interesting, as at 20 seconds it matches the measured time in gnome-terminal.

Analysing pollsys() further (the poll() manpage is similar),

# man poll
[...]
     int poll(struct pollfd fds[], nfds_t nfds, int timeout);
[...]
The first argument is an array with length specified by the second argument. Looping over an array is difficult from DTrace (due to lack of loops), but this doesn't really present a difficulty when troubleshooting. Here I'll dump the struct in hex, and the other values as ints,
# dtrace -n 'syscall::pollsys:entry /execname == "nscd"/ { printf("nfds=%d", arg1);
  tracemem(copyin(arg0, 16), 16); }' -c gnome-terminal
dtrace: description 'syscall::pollsys:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    542                    pollsys:entry nfds=1
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 05 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00  ....@...........

  0    542                    pollsys:entry nfds=1
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 05 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00  ....@...........

  0    542                    pollsys:entry nfds=1
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 05 00 00 00 40 00 42 fd 0b 7b ab 46 f0 d7 c0 0d  ....@.B..{.F....

  0    542                    pollsys:entry nfds=1
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 05 00 00 00 40 00 00 00 0b 7b ab 46 f0 d7 c0 0d  ....@....{.F....

dtrace: pid 102041 has exited
Don't worry - DTrace lets you cast variables as structs if you like; I'm just dumping the data in hex as this is a short one-liner.

The nfds value shows that there is only ever one file descriptor in the array; and the hex dump (on this little endian architecture) shows that the file descriptor number is "5" (the first member of struct pollfd).

Now DTrace is used to print the pathname for that file descriptor,

# dtrace -n 'syscall::pollsys:entry /execname == "nscd"/ 
  { trace(fds[*(uint32_t *)copyin(arg0, 4)].fi_pathname); }' -c gnome-terminal
dtrace: description 'syscall::pollsys:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    542                    pollsys:entry   <unknown>
  0    542                    pollsys:entry   <unknown>
^C
dtrace: pid 102063 terminated by SIGINT
Well, that didn't work. Not all file descriptors have pathnames, of course. It will save some time to borrow pfiles for the next step,
# pfiles `pgrep -x nscd`
101917: /usr/sbin/nscd
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
      O_RDWR
      /devices/pseudo/mm@0:null
   1: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
      O_RDWR
      /devices/pseudo/mm@0:null
   2: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
      O_RDWR
      /devices/pseudo/mm@0:null
   3: S_IFDOOR mode:0777 dev:279,0 ino:0 uid:0 gid:0 size:0
      O_RDWR FD_CLOEXEC  door to nscd[101917]
   4: S_IFSOCK mode:0666 dev:277,0 ino:17874 uid:0 gid:0 size:0
      O_RDWR
        SOCK_RAW
        SO_SNDBUF(8192),SO_RCVBUF(8192)
        sockname: AF_ROUTE
        peername: AF_ROUTE
   5: S_IFSOCK mode:0666 dev:277,0 ino:53988 uid:0 gid:0 size:0
      O_RDWR
        SOCK_DGRAM
        SO_DGRAM_ERRIND,SO_SNDBUF(57344),SO_RCVBUF(57344)
        sockname: AF_INET 192.168.1.78  port: 51697
        peername: AF_INET 192.168.1.5  port: 53
File descriptor 5 was a socket to the remote host 192.168.1.5 port 53 (DNS). This shows that nscd is waiting for 20 seconds on a DNS socket, and we know that gnome-terminal has a 20 second wait on the /var/run/name_service_door file. They are probably related.

Proving that they are related, if needed, is simply more DTrace. This time I'll borrow a script from /usr/demo/dtrace, which shows who is waiting for nscd and how long for,

# dtrace -s /usr/demo/dtrace/nscd.d -c gnome-terminal
dtrace: script '/usr/demo/dtrace/nscd.d' matched 27 probes
dtrace: pid 101594 has exited

  nscd                                                gnome-session
           value  ------------- Distribution ------------- count
           16384 |                                         0
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
           65536 |                                         0

  nscd                                                gnome-terminal
           value  ------------- Distribution ------------- count
           16384 |                                         0
           32768 |@@@@@@@@@@                               1
           65536 |@@@@@@@@@@@@@@@@@@@@                     2
          131072 |                                         0
          262144 |                                         0
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         0
         4194304 |                                         0
         8388608 |                                         0
        16777216 |                                         0
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |                                         0
       536870912 |                                         0
      1073741824 |                                         0
      2147483648 |                                         0
      4294967296 |                                         0
      8589934592 |                                         0
     17179869184 |@@@@@@@@@@                               1
     34359738368 |                                         0
In the 17.1 to 34.3 second bucket is gnome-terminal waiting for nscd - our 20 second wait.

The /usr/demo/dtrace/nscd.d script works by tracing who is nscd is waking up via the sched:::wakeup probe, after having traced the time when threads sleep (see the "sched" chapter in the Dynamic Tracing Guide on docs.sun.com).

The wrap

DTrace provides many ways to solve performance issues or for troubleshooting. Above were only a few techniques for analysing off-CPU time, but there are more available.

Some techniques can get very complex, and require much systems knowledge. Don't worry: if you only figure out one way to solve your problem - you've still solved your problem. (Systems knowledge should help you solve system problems faster).

Posted by brendan [DTrace] ( July 29, 2007 10:39 PM ) Permalink | Comments[1]

http://blogs.sun.com/brendan/date/20070426 Thursday April 26, 2007

SVOSUG - DTrace

I'll be presenting a number of DTrace Topics at SVOSUG tonight, April 26th 2007. Alan posted the announcement here.

I'll be using two presentations (fairly quickly), which are linked at the bottom of these pages:

DTrace Topics: Intro
DTrace Topics: DTraceToolkit

Plus demonstrating various scripts. Whoever can make it - I'll see you soon.

Posted by brendan [Sun] ( April 26, 2007 04:49 PM ) Permalink | Comments[0]

http://blogs.sun.com/brendan/date/20070305 Monday March 05, 2007

Audio Volume CLI

While Solaris has a fancy GNOME based desktop called JDS3, other desktop environments are available and work fine. FVWM2 is a fast alternative with a modest set of features, and is available on the Software Companion CD (which you may already have a copy of).

If you have installed fvwm2 from the companion CD and would like to try it, the easiest way is to enter a fail safe session from the login screen, then run the binary - /opt/sfw/bin/fvwm2. The proper way is to create config files under /etc/dt/config, so that the login screen provides FVWM as an option.

After getting fvwm2 running, I found my volume up/down/mute keys on this Sun type 7 keyboard didn't work. An internet search didn't find any solutions. To get these keys to work, I wrote a short C program to ioctl /dev/audioctl, and added some lines to the .fvwmrc file. I'm writing this quick blog entry to help the next person doing the same Internet search. If there is a better way to do this in Solaris already (like a shipped binary), I missed it!

This is the C program,

/* volumeset.c - set Sun's /dev/audio play volume */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/audioio.h>

void
usage(char *name)
{
        (void) printf("USAGE: %s [+|-]volume_percent\n", name);
        (void) printf("   eg,\n");
        (void) printf("       %s 100     # maximum volume\n", name);
        (void) printf("       %s +5      # plus 5 percent\n", name);
        exit(1);
}

int
main(int argc, char *argv[])
{
        audio_info_t ai;
        int fd, vol, mod, gain;

        if (argc < 2)
                usage(argv[0]);

        switch (argv[1][0]) {
                case '+':
                        mod = 1;
                        vol = atoi(&argv[1][1]);
                        break;
                case '-':
                        mod = -1;
                        vol = atoi(&argv[1][1]);
                        break;
                case '0'...'9':
                        mod = 0;
                        vol = atoi(argv[1]);
                        if (vol > 100 || vol < 0) {
                                (void) printf("ERROR: volume must be "
                                    "between 0 and 100.\n");
                                exit(4);
                        }
                        break;
                default:
                        usage(argv[0]);
        }

        if (mod != 0 && vol == 0)
                usage(argv[0]);

        if ((fd = open("/dev/audioctl", O_RDONLY)) == -1) {
                (void) perror("can't open /dev/audioctl");
                exit(2);
        }

        if (ioctl(fd, AUDIO_GETINFO, &ai) == -1) {
                (void) perror("fetching audio state failed");
                exit(3);
        }

        if (mod == 0)
                gain = (vol * 255) / 100;
        else
                gain = ai.play.gain + (mod * vol * 255) / 100;
        if (gain < 0)
                gain = 0;
        if (gain > 255)
                gain = 255;

        ai.play.gain = gain;
        ai.output_muted = gain == 0 ? 1 : 0;

        if (ioctl(fd, AUDIO_SETINFO, &ai) == -1) {
                (void) perror("setting audio state failed");
                exit(4);
        }

        (void) close(fd);

        return (0);
}
If you don't have Sun's C compiler installed, you can compile it using /usr/sfw/bin/gcc -o volumeset volumeset.c.

The following are the lines I added to ~/.fvwm/.fvwm2rc to bind the audio keys on the top left to the volumeset program (copied to /usr/local/bin); these bindings probably work for type 6 keyboards as well (haven't tried),

Key SunAudioMute A       A       Exec /usr/local/bin/volumeset 0
Key SunAudioLowerVolume A A      Exec /usr/local/bin/volumeset -15
Key SunAudioRaiseVolume A A      Exec /usr/local/bin/volumeset +15
The above lines bind the keys to mute the volume, decrease by 15% or increase by 15% (it may be better to make the mute behave as a toggle, rather than always mute). After restarting fvwm, my audio keys now work fine.

Posted by brendan [Solaris] ( March 05, 2007 04:07 PM ) Permalink | Comments[5]
http://blogs.sun.com/brendan/date/20070227 Tuesday February 27, 2007

DTrace on SDNtv

I was recently interviewed about DTrace for SDNtv, Sun's Developer Network Channel. You can watch the video in a browser or download it and watch it later.

For the interview we discussed what DTrace is, what DTrace can do, and I ran the colortrace demo. It was a short interview aimed at introducing people to DTrace for the first time.

I found SDNtv itself quite interesting: the host I had, Mark Herring, was technically savvy and both understood DTrace and knew what best to ask. It was also filmed in one take - that was a big suprise - I've always assumed such interviews are cut together creatively, but this was a real interview filmed in one shot.

But what is especially interesting is the idea of videoing Sun engineers as they discuss their own products, and making those videos publically available. Not that long ago (6 years?), I remember being a Sun customer in Australia who was trying to learn more about Solaris and the kernel, and in particular (being a code geek and a Unix history geek) I wanted to know who wrote what and why. There was very little public information about Sun's engineers - it seemed that the very names of the engineers were confidential. And, growing up in Australia, there were no conferences to visit where Sun engineers gave presentations, like there are in the US.

When Bryan first emailed me in early 2004 suggesting I check out something new called DTrace, I had no idea who Bryan was. I remember staring at his sig and thinking "I wish I knew who this guy was". A quick Internet search didn't help much (try that now!).

Times have changed. Engineers now have blogs, and discuss source code on OpenSolaris mailing lists. And SDNtv lets you see and hear them speak!

Posted by brendan [DTrace] ( February 27, 2007 08:20 PM ) Permalink | Comments[0]

AMD64 PICs, CPI

I recently wanted to gather some numbers on CPU and memory system performance, for AMD64 CPUs. I reached a point where I searched the Internet for other Solaris AMD64 PIC (Performance Instrumentation Counters) analysis, and found little. I hope to improve this with some blog entries. In this part I'll introduce PIC observability, and demonstrate measuring CPI (cycles per instruction) for different workloads.

To see why PICs are important, the following are the sort of questions that PIC analysis can answer:
  1. What is the Level 2 cache hit rate?
  2. What is the Level 2 cache miss volume?
  3. What is the hit rate and miss volume for the TLB?
  4. What is my memory bus utilization?
Questions 1 and 2 relate to the CPU hardware cache, where Level 2 is the E$ (meaning either "external" cache or "embedded" cache, depending on the CPU architecture). For optimal performance we want to see a high hit rate, and more importantly, a low miss volume.

Question 3 concerns a component of the memory management unit - the translation lookaside buffer (TLB). This processes and caches virtual to physical memory page translations. It can consume a lot of CPU (the worst I've seen is 60%), and it can be tuned. A good document for understanding this further is Taming Your Emu by Richard McDougall.

Question 4 seems obvious - the memory bus can be a bottleneck for system performance, so, how utilized is it? Answering this isn't easy, but it is usually possible by examining CPU PICs.

cpustat
There are many AMD64 CPU PICs available, which can be viewed using tools such as cpustat and cputrack. Running cpustat -h dumps the list:
# cpustat -h
Usage:
        cpustat [-c events] [-p period] [-nstD] [interval [count]]

        -c events specify processor events to be monitored
        -n        suppress titles
        -p period cycle through event list periodically
        -s        run user soaker thread for system-only events
        -t        include tsc register
        -D        enable debug mode
        -h        print extended usage information

        Use cputrack(1) to monitor per-process statistics.

        CPU performance counter interface: AMD Opteron & Athlon64

        event specification syntax:
        [picn=][,attr[n][=]][,[picn=][,attr[n][=]],...]

        event[0-3]: FP_dispatched_fpu_ops FP_cycles_no_fpu_ops_retired 
                 FP_dispatched_fpu_ops_ff LS_seg_reg_load 
                 LS_uarch_resync_self_modify LS_uarch_resync_snoop 
                 LS_buffer_2_full LS_locked_operation LS_retired_cflush 
                 LS_retired_cpuid DC_access DC_miss DC_refill_from_L2 
                 DC_refill_from_system DC_copyback DC_dtlb_L1_miss_L2_hit 
                 DC_dtlb_L1_miss_L2_miss DC_misaligned_data_ref 
                 DC_uarch_late_cancel_access DC_uarch_early_cancel_access 
                 DC_1bit_ecc_error_found DC_dispatched_prefetch_instr 
                 DC_dcache_accesses_by_locks BU_memory_requests 
                 BU_data_prefetch BU_system_read_responses 
                 BU_quadwords_written_to_system BU_cpu_clk_unhalted 
                 BU_internal_L2_req BU_fill_req_missed_L2 BU_fill_into_L2 
                 IC_fetch IC_miss IC_refill_from_L2 IC_refill_from_system 
                 IC_itlb_L1_miss_L2_hit IC_itlb_L1_miss_L2_miss 
                 IC_uarch_resync_snoop IC_instr_fetch_stall 
                 IC_return_stack_hit IC_return_stack_overflow 
                 FR_retired_x86_instr_w_excp_intr FR_retired_uops 
                 FR_retired_branches_w_excp_intr 
                 FR_retired_branches_mispred FR_retired_taken_branches 
                 FR_retired_taken_branches_mispred 
                 FR_retired_far_ctl_transfer FR_retired_resyncs 
                 FR_retired_near_rets FR_retired_near_rets_mispred 
                 FR_retired_taken_branches_mispred_addr_miscop 
                 FR_retired_fpu_instr FR_retired_fastpath_double_op_instr 
                 FR_intr_masked_cycles FR_intr_masked_while_pending_cycles 
                 FR_taken_hardware_intrs FR_nothing_to_dispatch 
                 FR_dispatch_stalls 
                 FR_dispatch_stall_branch_abort_to_retire 
                 FR_dispatch_stall_serialization 
                 FR_dispatch_stall_segment_load 
                 FR_dispatch_stall_reorder_buffer_full 
                 FR_dispatch_stall_resv_stations_full 
                 FR_dispatch_stall_fpu_full FR_dispatch_stall_ls_full 
                 FR_dispatch_stall_waiting_all_quiet 
                 FR_dispatch_stall_far_ctl_trsfr_resync_branc_pend 
                 FR_fpu_exception FR_num_brkpts_dr0 FR_num_brkpts_dr1 
                 FR_num_brkpts_dr2 FR_num_brkpts_dr3 
                 NB_mem_ctrlr_page_access NB_mem_ctrlr_page_table_overflow 
                 NB_mem_ctrlr_turnaround 
                 NB_mem_ctrlr_bypass_counter_saturation NB_ECC_errors 
                 NB_sized_commands NB_probe_result NB_gart_events 
                 NB_ht_bus0_bandwidth NB_ht_bus1_bandwidth 
                 NB_ht_bus2_bandwidth NB_sized_blocks NB_cpu_io_to_mem_io 
                 NB_cache_block_commands 

        attributes: edge pc inv cmask umask nouser sys 

        See Chapter 10 of the "BIOS and Kernel Developer's Guide for the 
        AMD Athlon 64 and AMD Opteron Processors," AMD publication #26094 
There are around fifty names above such as "FP_dispatched_fpu_ops" which describe the PICs available. On my AMD Opteron CPUs you can measure four of these at a time, which can be provided in the arguments to cpustat, eg,
# cpustat -c IC_fetch,DC_access,DC_dtlb_L1_miss_L2_hit,DC_dtlb_L1_miss_L2_miss 0.25
   time cpu event      pic0      pic1      pic2      pic3 
  0.257   0  tick   6406429   8333198     45826      5515 
  0.257   1  tick   3333442   3942694     24682      4409 
  0.507   1  tick   6450964   8229104     44046      5713 
  0.507   0  tick   2359697   2828683     14365      4415 
  0.757   0  tick   2490406   3060416     16458      4901 
  0.757   1  tick   7292986   9530806     68956      6490 
  1.007   0  tick   2514008   3063049     15037      3863 
  1.007   1  tick   6057048   7747580     42415      6083 
^C
In the above example I printed four PICs every 0.25 seconds, for each CPU (I'm on a 2 x virtual CPU server). The CPU column shows that the output is slightly shuffled - a harmless side effect from the way cpustat was coded (it pbinds a libcpc consumer onto each CPU in the available processor set, and all threads write to STDOUT in any order). These PICs are provided by programmable hardware registers - so there is no ideal way around the four-at-a-time limit. You can shuffle measurements between different sets of PICs, which cpustat supports.

Reference Documentation
Since different CPUs provide different PICs, the guide mentioned at the bottom of the cpustat -h output will list what PICs your CPU type provides. It is important to read these guides carefully - for example, PICs that track cache misses may have some exceptions to what is considered a "miss".

I spent a while with AMD's #26094 guide, but I found that the PIC descriptions raise more questions than answers. (try to find basics such as "instruction count")... If you find yourself in a similar situation, it can help to create known workloads and then examine which metrics move by a similar amount. I used this approach to confirm what PICs provided cycle counts and instruction counts.

I did eventually find two good resources on AMD PICs,

You may notice some really interesting PICs mentioned, such as memory locality observability in the newer revs of AMD CPUs.

If you are interested in PIC analysis for any CPU type, see chapter 8 "Performance Counters" in Solaris Performance and Tools, by Richard McDougall, Jim Mauro and myself. One of the metrics we made sure to include in the book was CPI (cycles per instruction), as it proves to be a useful starting point for understanding CPU behavior.

Example - CPI
The cycles per instruction metric (sometimes measured as IPC - instructions per cycle) is a useful ratio and (depending on CPU type) fairly easy to measure. If the measured CPI ratio is low, more instructions can be dispatched in a given time, which usually means higher performance. High CPI means instructions are stalling, usually on main memory bus activity.

The output of cpustat can be formatted with a little scripting; the following script "amd64cpiu" uses a little shell and Perl to aggregate and print the output:

#!/usr/bin/sh
#
# amd64cpiu - measure CPI and Utilization on AMD64 processors.
#
# USAGE: amd64cpiu [interval]
#   eg,
#        amd64cpiu 0.1          # for 0.1 second intervals
#
# CPI is cycles per instruction, a metric that increases due to activity
# such as main memory bus lookups.
#
# ident "@(#)amd64cpiu.sh       1.1     07/02/17 SMI"

interval=${1:-1}        # default interval, 1 second

set -- `kstat -p unix:0:system_misc:ncpus`      # assuming no psets,
cpus=$2                                         # number of CPUs

pics='BU_cpu_clk_unhalted'                      # cycles
pics=$pics,'FR_retired_x86_instr_w_excp_intr'   # instructions

/usr/sbin/cpustat -tc $pics $interval | perl -e '
        printf "%16s %8s %8s\n", "Instructions", "CPI", "%CPU";
        while (<>) {
                next if ++$lines == 1;
                split;
                $total += $_[3];
                $cycles += $_[4];
                $instructions += $_[5];

                if ((($lines - 1) % '$cpus') == 0) {
                        printf "%16u %8.2f %8.2f\n", $instructions,
                            $cycles / $instructions, $total ?
                            100 * $cycles / $total : 0;
                        $total = 0;
                        $cycles = 0;
                        $instructions = 0;
                }
        }
'
This script prints a column for CPI and for percent CPU utilization. I've used the PICs that were suggested in the AMD article - and from testing they do appear to be the best ones for measuring CPI.

Here amd64cpiu is used to examine a CPU bound workload of fast register based instructions,

# ./amd64cpiu.sh 
    Instructions      CPI     %CPU
     16509657954     0.34    97.56
     16550162001     0.33    98.54
     16523746049     0.34    98.41
     16510783100     0.34    98.32
     16497135723     0.34    98.29
^C
The CPI is around 0.34. This is the maximum to be expected from the AMD64 architecture, which attempts to run three instructions per clock cycle.

Now for a memory bound workload of sequential 1 byte memory reads,

# ./amd64cpiu.sh 
    Instructions      CPI     %CPU
      4883935299     1.12    97.60
      4852961204     1.12    97.03
      4884120645     1.13    97.69
      4898818096     1.12    97.92
      4895064839     1.12    97.80
^C
Things are starting to become slower due to the extra overhead of memory requests. Many reads will satisfy from the level 1 cache, some from the slower level 2 cache, and occasionally a cache line will be read from main memory. This additional overhead slows us to 1.13 CPI, and we are running fewer instructions for the same %CPU.

Watch what happens when our memory workload performs 1 byte scattered reads (100 Kbytes apart),

# ./amd64cpiu.sh 
    Instructions      CPI     %CPU
       653300388     8.53    98.36
       648496314     8.53    98.37
       644163952     8.54    97.75
       648941939     8.53    98.35
       648507176     8.53    98.37
^C
Many of the reads will not be in the CPU caches, and so now most are requiring a memory bus lookup. Our CPI is around 8.53, some 25 times slower than register based CPU instructions. Our %CPU is still around the same, but this buys us fewer instructions in total.

As you can see, CPI is shedding light on memory bus activity - it's very cool, and from such a simple metric.

Now for a real application: Here I watch as Sun's C compiler chews through a source tree,

# ./amd64cpiu.sh 
    Instructions      CPI     %CPU
      2624028943     1.26    58.52
      2992167837     1.19    63.17
      2327129316     1.26    52.08
      2046997158     1.27    46.14
      2414376864     1.23    52.80
      3305351199     1.23    70.72
^C
That's not so bad - any memory access instructions must be hitting caches fairly often (something that we can confirm by measuring other PICs).

Beware of output such as the following:

# ./amd64cpiu.sh 
    Instructions      CPI     %CPU
        22695257     1.82     0.73
        22197894     1.75     0.69
        49626271     2.16     1.90
       102731779     2.21     4.04
       104795796     1.49     2.78
^C
The CPUs are fairly idle (less than 5% utilization), and so CPI is less likely to be useful to indicate performance issues.

Suggested Actions - CPI
While many PICs produce interesting measurements, it's much more useful if there is some action we can take based on the results. The following is a list of suggestions based on CPI.

Firstly, to be even considering this list you need to have a known and measured performance issue. If one or more CPUs are 100% busy, then that may be a performance issue and it can be useful to check CPI; if your CPUs are idle, then it probably won't be useful to check. As for measured performance issue - it can be especially helpful to be able to quantify an issue, eg average latency is 150 ms; tools such as DTrace can take these measurements.

I hope this has been helpful. And there are many more cool metrics to observe on AMD64 CPUs - CPI is just the beginning.

Posted by brendan [Performance] ( February 27, 2007 06:51 PM ) Permalink | Comments[1]
http://blogs.sun.com/brendan/date/20070129 Monday January 29, 2007

colortrace

I was recently asked to give a very short but important DTrace demonstration to a wide audience, where I would only have time to run and discuss one script. It is a tough choice - there are so many to pick from, including the 100+ scripts and oneliners in the DTraceToolkit. Also, a short text-based demo may not convey the incredible power of DTrace - if you aren't paying attention it may look like just another tool; DTrace is far more than that.

What a great excuse to use some color escape sequences (yes, I've wanted to do this since I started writing the DTraceToolkit). With a colorized output it won't look like just another tool, and I can draw appropriate attention to a key DTrace feature.

I wrote colortrace.d (or colourtrace.d, for readers from the Commonwealth), which prints a flow indented trace of function entrys and returns for the entire softwark stack. Different layers of the softwark stack have been colorized:

I added a column to print delta times in nanoseconds, to illustrate that yes - we can measure whatever times we want. When Adam previously performed a similar demonstration, he added a color for a Java function layer.

Here I run colortrace.d on a bash shell, and while colortrace.d is running I press the letter "t" on the bash shell:

# ./colortrace.d -p 101695
DELTA(ns)                TYPE    FUNCTION
120805042             syscall   <- read
34461               libc.so.1   <- _read
13474               libc.so.1   <- read
8387                     bash   <- rl_getc
12899                    bash   <- rl_read_key
10044                    bash   -> _rl_dispatch
11760                    bash    -> _rl_dispatch_subseq
7189                     bash     -> rl_insert
7020                     bash      -> _rl_insert_char
6777                     bash       -> _rl_any_typein
8922                     bash       <- _rl_any_typein
7489                     bash       -> rl_insert_text
9230                libc.so.1        -> strlen
12040               libc.so.1        <- strlen
7362                libc.so.1        -> strncpy
6956                libc.so.1        <- strncpy
6996                     bash       <- rl_insert_text
7095                     bash      <- _rl_insert_char
6069                     bash     <- rl_insert
8117                     bash    <- _rl_dispatch_subseq
6484                     bash   <- _rl_dispatch
6992                     bash   -> rl_redisplay
5764                libc.so.1    -> strlen
6577                libc.so.1    <- strlen
5356                libc.so.1    -> strncpy
6249                libc.so.1    <- strncpy
6992                libc.so.1    -> memset
7572                libc.so.1    <- memset
7749                libc.so.1    -> strlen
6118                libc.so.1    <- strlen
6627                     bash    -> _rl_move_cursor_relative
6573                     bash    <- _rl_move_cursor_relative
7533                     bash    -> _rl_output_some_chars
7458                libc.so.1     -> fwrite
6335                libc.so.1      -> _fwrite_unlocked
7085                libc.so.1       -> _realbufend
6167                libc.so.1        -> getxfdat
6250                libc.so.1        <- getxfdat
6490                libc.so.1       <- _realbufend
6306                libc.so.1      <- _fwrite_unlocked
6115                libc.so.1     <- fwrite
6488                     bash    <- _rl_output_some_chars
7628                libc.so.1    -> fflush
6317                libc.so.1     -> _fflush_u
6407                libc.so.1      -> _xflsbuf
5046                libc.so.1       -> _realbufend
52132               libc.so.1        -> getxfdat
5178                libc.so.1        <- getxfdat
5480                libc.so.1       <- _realbufend
7046                libc.so.1       -> _write
29263                 syscall        -> write
12044                  kernel         -> write32
9820                   kernel          -> write
5170                   kernel           -> getf
5514                   kernel            -> set_active_fd
5643                   kernel            <- set_active_fd
10803                  kernel           <- getf
7063                   kernel           -> nbl_need_check
5365                   kernel           <- nbl_need_check
6075                   kernel           -> fop_rwlock
5619                   kernel            -> fs_rwlock
4939                   kernel            <- fs_rwlock
5117                   kernel           <- fop_rwlock
5819                   kernel           -> fop_write
5389                   kernel            -> spec_write
6382                   kernel             -> smark
5471                   kernel              -> gethrestime_sec
5174                   kernel               -> gethrestime
5159                   kernel                -> pc_gethrestime
4724                   kernel                 -> gethrtime
4923                   kernel                  -> tsc_gethrtime
4831                   kernel                  <- tsc_gethrtime
4875                   kernel                 <- gethrtime
5240                   kernel                <- pc_gethrestime
5215                   kernel               <- gethrestime
5368                   kernel              <- gethrestime_sec
5710                   kernel             <- smark
5656                   kernel             -> strwrite
5529                   kernel              -> strwrite_common
5543                   kernel               -> straccess
4976                   kernel                -> mutex_owned
4761                   kernel                <- mutex_owned
5611                   kernel               <- straccess
6066                   kernel               -> strput
4713                   kernel                -> mutex_owned
4705                   kernel                <- mutex_owned
5836                   kernel                -> canputnext
6037                   kernel                <- canputnext
6160                   kernel                -> strmakedata
5542                   kernel                 -> allocb_cred
5513                   kernel                  -> allocb
4982                   kernel                   -> kmem_cache_alloc
5522                   kernel                    -> kmem_cache_alloc_debug
5185                   kernel                     -> verify_and_copy_pattern
5054                   kernel                     <- verify_and_copy_pattern
6464                   kernel                     -> dblk_constructor
4853                   kernel                      -> kmem_cache_alloc
4696                   kernel                       -> kmem_cache_alloc_debug
5188                   kernel                        -> verify_and_copy_pattern
4707                   kernel                        <- verify_and_copy_pattern
5153                   kernel                        -> gethrtime
4843                   kernel                         -> tsc_gethrtime
4587                   kernel                         <- tsc_gethrtime
4836                   kernel                        <- gethrtime
5133                   kernel                        -> getpcstack
4742                   kernel                        <- getpcstack
4849                   kernel                        -> kmem_log_enter
5100                   kernel                        <- kmem_log_enter
5047                   kernel                       <- kmem_cache_alloc_debug
5262                   kernel                      <- kmem_cache_alloc
5773                   kernel                     <- dblk_constructor
4970                   kernel                     -> gethrtime
4610                   kernel                      -> tsc_gethrtime
4504                   kernel                      <- tsc_gethrtime
4975                   kernel                     <- gethrtime
4906                   kernel                     -> getpcstack
4534                   kernel                     <- getpcstack
4991                   kernel                     -> kmem_log_enter
4760                   kernel                     <- kmem_log_enter
4740                   kernel                    <- kmem_cache_alloc_debug
5060                   kernel                   <- kmem_cache_alloc
5642                   kernel                  <- allocb
5394                   kernel                  -> crhold
5467                   kernel                  <- crhold
5435                   kernel                 <- allocb_cred
5531                   kernel                 -> uiomove
5663                   kernel                  -> xcopyin_nta
4871                   kernel                  <- kcopy
5012                   kernel                 <- uiomove
5939                   kernel                <- strmakedata
5853                   kernel                -> stream_willservice
5280                   kernel                <- stream_willservice
6012                   kernel                -> putnext
4937                   kernel                 -> mutex_owned
4680                   kernel                 <- mutex_owned
5381                   kernel                 -> mutex_owned
4694                   kernel                 <- mutex_owned
6160                   kernel                 -> ttcompatwput
5016                   kernel                  -> putnext
4776                   kernel                   -> mutex_owned
4523                   kernel                   <- mutex_owned
4981                   kernel                   -> mutex_owned
4661                   kernel                   <- mutex_owned
5683                   kernel                   -> ldtermwput
5704                   kernel                    -> bcanputnext
5815                   kernel                     -> claimstr
5410                   kernel                     <- claimstr
5598                   kernel                     -> bcanput
6099                   kernel                     <- bcanput
5602                   kernel                     -> releasestr
5013                   kernel                      -> cv_broadcast
5708                   kernel                       -> getpcstack
4690                   kernel                       <- getpcstack
5167                   kernel                       -> apic_intr_enter
5135                   kernel                        -> psm_get_cpu_id
4733                   kernel                        <- psm_get_cpu_id
5540                   kernel                        -> psm_get_cpu_id
4531                   kernel                        <- psm_get_cpu_id
5295                   kernel                        -> psm_get_cpu_id
4503                   kernel                        <- psm_get_cpu_id
4917                   kernel                       <- apic_intr_enter
5296                   kernel                       -> hilevel_intr_prolog
10448                  kernel                        -> apic_intr_exit
5041                   kernel                         -> psm_get_cpu_id
4481                   kernel                         <- psm_get_cpu_id
5119                   kernel                        <- apic_intr_exit
5333                   kernel                       <- hilevel_intr_epilog
5143                   kernel                      <- cv_broadcast
5124                   kernel                     <- releasestr
5350                   kernel                    <- bcanputnext
5851                   kernel                    -> ldtermwmsg
6826                   kernel                     -> msgdsize
5427                   kernel                     <- msgdsize
5652                   kernel                     -> ldterm_output_msg
4981                   kernel                      -> allocb
4817                   kernel                       -> kmem_cache_alloc
5224                   kernel                        -> kmem_cache_alloc_debug
5270                   kernel                         -> verify_and_copy_pattern
5033                   kernel                         <- verify_and_copy_pattern
5303                   kernel                         -> dblk_constructor
4692                   kernel                          -> kmem_cache_alloc
4928                   kernel                           -> kmem_cache_alloc_debug
5243                   kernel                            -> getpcstack
4569                   kernel                            <- getpcstack
5069                   kernel                            -> apic_intr_enter
4830                   kernel                            <- apic_intr_enter
4972                   kernel                            -> hilevel_intr_prolog
40463                  kernel                             -> apic_intr_exit
5107                   kernel                              -> psm_get_cpu_id
4549                   kernel                              <- psm_get_cpu_id
4995                   kernel                             <- apic_intr_exit
5123                   kernel                            <- hilevel_intr_epilog
5633                   kernel                            -> verify_and_copy_pattern
4867                   kernel                            <- verify_and_copy_pattern
5360                   kernel                            -> gethrtime
4633                   kernel                             -> tsc_gethrtime
4566                   kernel                             <- tsc_gethrtime
4999                   kernel                            <- gethrtime
4856                   kernel                            -> getpcstack
4739                   kernel                            <- getpcstack
5058                   kernel                            -> kmem_log_enter
4818                   kernel                            <- kmem_log_enter
4786                   kernel                           <- kmem_cache_alloc_debug
5189                   kernel                          <- kmem_cache_alloc
5063                   kernel                         <- dblk_constructor
4939                   kernel                         -> gethrtime
4694                   kernel                          -> tsc_gethrtime
4706                   kernel                          <- tsc_gethrtime
4749                   kernel                         <- gethrtime
4852                   kernel                         -> getpcstack
4780                   kernel                         <- getpcstack
4838                   kernel                         -> kmem_log_enter
4669                   kernel                         <- kmem_log_enter
5068                   kernel                        <- kmem_cache_alloc_debug
4970                   kernel                       <- kmem_cache_alloc
4925                   kernel                      <- allocb
6118                   kernel                      -> movtuc
6115                   kernel                      <- movtuc
5246                   kernel                      -> freeb
5098                   kernel                       -> dblk_lastfree
5180                   kernel                        -> crfree
5155                   kernel                        <- crfree
5292                   kernel                        -> kmem_cache_free
4946                   kernel                         -> kmem_cache_free_debug
4814                   kernel                          -> kmem_log_enter
4880                   kernel                          <- kmem_log_enter
4998                   kernel                          -> gethrtime
4598                   kernel                           -> tsc_gethrtime
4653                   kernel                           <- tsc_gethrtime
4949                   kernel                          <- gethrtime
4841                   kernel                          -> getpcstack
4515                   kernel                          <- getpcstack
5073                   kernel                          -> kmem_log_enter
4595                   kernel                          <- kmem_log_enter
5352                   kernel                          -> dblk_destructor
5061                   kernel                           -> kmem_cache_free
4670                   kernel                            -> kmem_cache_free_debug
4721                   kernel                             -> kmem_log_enter
4863                   kernel                             <- kmem_log_enter
4914                   kernel                             -> gethrtime
4555                   kernel                              -> tsc_gethrtime
4853                   kernel                              <- tsc_gethrtime
4761                   kernel                             <- gethrtime
4825                   kernel                             -> getpcstack
4714                   kernel                             <- getpcstack
4888                   kernel                             -> kmem_log_enter
4580                   kernel                             <- kmem_log_enter
5219                   kernel                             -> copy_pattern
4980                   kernel                             <- copy_pattern
4978                   kernel                            <- kmem_cache_free_debug
5806                   kernel                           <- kmem_cache_free
4964                   kernel                          <- dblk_destructor
4893                   kernel                          -> copy_pattern
4530                   kernel                          <- copy_pattern
5010                   kernel                         <- kmem_cache_free_debug
4931                   kernel                        <- kmem_cache_free
5002                   kernel                       <- dblk_lastfree
5293                   kernel                      <- freeb
5397                   kernel                     <- ldterm_output_msg
5285                   kernel                     -> msgdsize
4840                   kernel                     <- msgdsize
6236                   kernel                     -> drv_setparm
5415                   kernel                     <- drv_setparm
5231                   kernel                     -> putnext
4820                   kernel                      -> mutex_owned
4617                   kernel                      <- mutex_owned
5286                   kernel                      -> mutex_owned
4526                   kernel                      <- mutex_owned
5851                   kernel                      -> ptemwput
5325                   kernel                       -> bcanputnext
4759                   kernel                        -> claimstr
4458                   kernel                        <- claimstr
5060                   kernel                        -> bcanput
5074                   kernel                        <- bcanput
4841                   kernel                        -> releasestr
4663                   kernel                         -> cv_broadcast
4771