DTrace and the mighty Hercules
DTrace and the mighty Hercules
One of the advanced features of Solaris is DTrace, a powerful tool for troubleshooting system or application problems, very frequently used for performance issues. It gives you a powerful tool for looking inside a system or application - in a non-intrusive way - for the best observability of any OS I know. In fact, a company I visited just a few weeks ago is very interested in moving from Linux to Solaris because of the observability that DTrace provides, along with other Solaris features designed for production work in the enterprise
Anyhow, that got me thinking about using DTrace to look at a workload I sometimes run for hobby purposes, the Hercules System/370, ESA/390, and z/Architecture Emulator. This lets me run a simulated mainframe, on which I can can boot ("IPL") an operating system like Linux and ancient versions of MVS or VM. (Technically it's possible to run current mainframe operating systems, but they are proprietary and not made available for download or licensed operation).
So, I booted up Hercules, which I ported to Solaris by making a few minor changes (incorporated in the Hercules 3.06 download) to see what I could learn about where time is spent when Hercules is running.
First the stethoscope, then the MRI
An important thing about DTrace is that it's not the first place to start when you're trying to understand the performance of a system. I like to say that it's like an MRI or CAT scan in medical diagnosis. You don't start with the advanced tool for a deep "look inside" - first you start with the general tools (like a stethoscope) to get an overall picture of what is going on.
So, the first thing I did was run some standard "*stat" commands while running Hercules:
$ vmstat 3 kthr memory page disk faults cpu r b w swap free re mf pi po fr de sr cd s0 s1 s2 in sy cs us sy id 0 0 0 2728320 1004208 54 333 0 0 0 0 54 8 -1 -1 -1 1147 3458 1700 9 2 89 0 0 0 1867328 129140 6 26 0 0 0 0 0 36 0 0 0 4280 23438 8437 53 6 41 1 0 0 1834364 96368 0 0 0 0 0 0 0 163 0 0 0 4314 20460 8153 52 6 42 0 0 0 1846368 108384 0 0 0 0 0 0 0 58 0 0 0 8277 20873 8582 52 7 41 0 0 0 1852356 114376 0 0 0 0 0 0 0 0 0 0 0 3629 20029 7300 54 4 42 0 0 0 1848048 110092 0 0 0 0 0 0 0 57 0 0 0 4501 24527 9419 53 6 41 0 0 0 1823660 88100 50 2613 0 0 0 0 0 61 0 0 0 3856 21135 7355 62 9 29 ^C $ mpstat 3 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 165 0 8 602 126 877 66 48 20 1 1771 9 2 0 89 1 168 0 28 563 242 853 66 48 20 1 1769 10 2 0 88 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 9 0 1 2322 368 3990 338 143 62 0 10859 55 8 0 37 1 3 0 21 2026 305 4478 573 142 51 0 11870 50 8 0 42 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 0 0 0 5449 390 4138 438 150 98 0 9052 63 6 0 31 1 116 0 3187 2097 317 4313 384 151 68 0 12272 42 8 0 50 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 5 0 0 2038 165 4323 535 164 122 0 11398 52 6 0 42 1 0 0 19 1992 400 4078 438 157 90 0 11289 59 6 0 36 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 0 0 205 2475 479 4313 443 130 67 0 11991 50 9 0 41 1 0 0 223 2208 276 4293 487 135 85 0 10710 55 8 0 37 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 0 0 0 1919 336 3847 502 143 114 0 10301 64 6 0 29 1 181 0 19 1927 392 4269 662 143 115 0 11507 56 7 0 3
Okay, I'm not short on memory, am driving some CPU, and I'm spending most of my CPU time in user state, not in the kernel.
Who is making system calls?
Despite the fact we're not spending most time in the kernel,
we are doing a substantial number of system calls per second (syscl), so for fun let's see
which ones they are - DTrace comes in here.
First, let's see which programs are issuing system calls; there's other
stuff going on and we shouldn't assume it's Hercules. I'm using one of the most frequently seen
DTrace one-liners, which "fires" on any system call, and creates an aggregate of number of system calls,
indexed by the
name of the program (execname built-in DTrace variable) issuing them.
The following command traces each probe that matches any system call (the syscall provider),
and creates a table of counts of system calls indexed by executable program name. By default, the action
of stopping the command (the Control-C) dumps aggregates to the terminal.
(Note: I'm running DTrace from root userid, but that's not necessary - you can
use roles-based authorization to assign the privileges needed to run DTrace to a regular userid.
This is my workstation, so I just used root).
# dtrace -n 'syscall::: { @[execname] = count(); }'
dtrace: description 'syscall::: ' matched 472 probes
^C
inetd 1
gnome-volume-man 3
iiimd 3
mdnsd 5
devfsadm 7
nwam-manager 15
sendmail 19
... lines omitted for brevity ...
dtrace 3170
evolution 3213
firefox-bin 3447
hercules 3951
gnome-terminal 6035
rhythmbox 7628
Which system calls is Hercules making?
Ah, hah - playing music while I work is the leading cause of system calls! Well, let's disregard that, and focus on only the system calls coming from Hercules, and see which calls they are. Another well-known one liner, this time using a predicate that restricts observation only to calls issued by Hercules' binary. Let's index the calls by the name of the system call itself; I did this while running an application that reads from a simulated tape (the tape image is actually a disk file):
# dtrace -n 'syscall::: /execname == "hercules" / { @[probefunc] = count(); }'
dtrace: description 'syscall::: ' matched 472 probes
^C
gtime 16
pollsys 16
ioctl 18
nanosleep 1623
yield 6077
write 12380
llseek 62482
lwp_park 105148
read 116566
Well, that makes a lot of sense: I'm reading from a disk file, and sure enough, there are lots of read system calls.
We can even use DTrace to see which file it is: when read(), write() or poll() are frequent system calls,
aggregate on arg0 to the call, which is the file descriptor number, and then look up which file via pfiles.
# dtrace -n 'syscall::read:entry /execname == "hercules"/ { @[arg0] = count(); }
dtrace: description 'syscall::read:entry ' matched 1 probe
^C
34 2
35 48440
The left column is the file descriptor number, and the second column is the number of times it was seen, so the
vast majority of read system calls was for file descriptor number 35.
Using pfiles I can see which file it was:
# pfiles 6376
6376: /rpool/hercules/misc/hercules-3.06/.libs/hercules -f hercules.cnf
Current rlimit: 256 file descriptors
0: S_IFCHR mode:0620 dev:315,0 ino:820483781 uid:103812 gid:7 rdev:24,4
O_RDWR
/dev/pts/4
1: S_IFIFO mode:0000 dev:312,0 ino:2943 uid:103812 gid:10 size:0
O_RDWR
...
35: S_IFREG mode:0644 dev:182,65562 ino:53251 uid:103812 gid:10 size:1443627008
O_RDWR|O_LARGEFILE
/rpool/sirius/phase7/phase7.awstape
...
Now I know which system call was dominant, and which file it was for (the simulated tape file).
Which part of the program was making those system calls?
But wait, there's more! (just like on late night TV ads). What if I also want to know which program inside the executable was doing the system calls. Well, DTrace has access to both kernel and user stack, so we can issue:
# dtrace -n 'syscall::read:entry /execname == "hercules" / { @[ustack(5)] = count(); }'
dtrace: description 'syscall::read:entry ' matched 1 probe
^C
libc.so.1`__read+0x7
libhercd.so`cckd_read+0x78
2
libc.so.1`__read+0x7
hdt3270.so`console_connection_handler+0x2f9
28
libc.so.1`__read+0x7
libhercu.so`logger_thread+0x170
0x632e7265
1002
libc.so.1`__read+0x7
hdt3420.so`readhdr_awstape+0x53
58414
libc.so.1`__read+0x7
hdt3420.so`read_awstape+0x120
58414
Now I know exactly which functions (and offsets inside function) were the most frequent issuers
of the read. No surprise there: the dominant system call was for reads, and the reads
were issued from the piece of code that handles reading from simulated tape drives. In this case
I could have made an easy guess about what was going on, but DTrace made it possible to instrument
this, in a completely unmodified application.
Looking inside a program's entry points
But this has more implications: DTrace can see the objects within the executable, using the
pid provider, so I can get even more detailed information. First, let's get a list
of the probe points that are exposed when the program is running. In this case, I know that the process
has process id (pid) 6376 (in later examples the pid may differ), so I can expose its contents this way:
# dtrace -l -n pid6376:::entry|more ID PROVIDER MODULE FUNCTION NAME 78992 pid6376 hercules _start entry 78993 pid6376 hercules __fsr entry 78994 pid6376 hercules _mcount entry 78995 pid6376 hercules main entry ... ... snip for brevity ... 79467 pid6376 libherc.so set_or_reset_console_mode entry 79468 pid6376 libherc.so set_screen_pos entry 79469 pid6376 libherc.so erase_to_eol entry 79470 pid6376 libherc.so clear_screen entry 79471 pid6376 libherc.so set_screen_color entry 79472 pid6376 libherc.so translate_keystroke entry 79473 pid6376 libherc.so console_beep entry ... ... snip for brevity ... 79484 pid6376 libherc.so configure_cpu entry 79485 pid6376 libherc.so deconfigure_cpu entry 79486 pid6376 libherc.so get_devblk entry 79487 pid6376 libherc.so ret_devblk entry 79488 pid6376 libherc.so find_device_by_devnum entry 79489 pid6376 libherc.so attach_device entry 79490 pid6376 libherc.so detach_devblk entry ... ... snip for brevity ... 79650 pid6376 libherc.so s370_branch_on_condition entry 79651 pid6376 libherc.so s370_branch_on_count_register entry 79652 pid6376 libherc.so s370_branch_on_count entry 79653 pid6376 libherc.so s370_branch_on_index_high entry 79654 pid6376 libherc.so s370_branch_on_index_low_or_equal entry 79655 pid6376 libherc.so s370_compare_register entry 79656 pid6376 libherc.so s370_compare_logical_register entry ... snip
Hey, I can see the C functions that implement service routines, and the ones that implement the simulated instructions!
How many times do I invoke each function in the program?
Well, if I can do that, then I should be able to see how often they are invoked, and how much time is spent in them. So, I could issue a command that shows me when a given instruction is emulated, like:
dtrace -n 'pid6376:libherc.so:s390_set_storage_key_extended:entry { trace(probefunc);}'
but that will run off the screen REAL FAST, as there will be zillions of times any given instruction
is issued. Aggregates are our friend again, so instead I issue (notes, this was done a different day, different pid):
# dtrace -n 'pid19284:::entry { @a[probefunc] = count(); }'
dtrace: description 'pid19284:::entry ' matched 11135 probes
^C
...
... I snipped out lines for low-frequency functions
...
s390_shift_left_double 341
s390_external_interrupt 343
s390_perform_external_interrupt 343
s390_store_characters_under_mask 346
...
......................I snipped a bunch of lines.....................
...
s390_compare_logical_register 11769
memcpy 12616
s390_exclusive_or_register 12627
s390_branch_and_link 13483
s390_branch_and_save_register 14161
s390_subtract_register 14351
s390_move_immediate 14601
...
......................I snipped a bunch of lines.....................
...
s390_move_character 40638
s390_load_and_test_register 44384
s390_branch_on_condition_register 47965
s390_compare_logical_immediate 49598
sigon 49928
ptt_pthread_trace 73559
s390_compare_logical 76252
s390_compare_register 83654
s390_compare_logical_character 91370
s390_load_register 96943
s390_test_under_mask 132845
s390_load_address 158997
s390_insert_characters_under_mask 168542
s390_store 196300
s390_load 309268
s390_branch_on_condition 725825
Wow - what this tells me is that Branch on Condition (BC) and Load (L)
are the most frequently seen instructions. This will mean something to you if you program in
mainframe assembly language, otherwise not. It gives a good hint that anything that speeds
up processing of these instructions will make a difference.
Quantifying actual time spent in functions
However, we can refine that: the above gives you the number of times you invoked the most frequently used functions, but not how much time was spent in them. So, let's use a different script that remembers the entry and exit times of the probe points, and creates aggregates of both number of times they're invoked, and a histogram of the times inside those functions. This is too complicated for a one-liner at the terminal, so I'm using a script I found (IIRC, in the DTrace Toolkit). What this does is remember the high-precision timestamp on entry to each function to compute the elapsed time spent in it, as well as the number of entries and exits.
#!/usr/sbin/dtrace -s
pid$1:::entry
{
self->ts[self->stack++] = timestamp;
}
pid$1:::return
/ self->ts[self->stack - 1] /
{
this->elapsed = timestamp - self->ts[--self->stack];
@[probefunc] = count();
@a[probefunc] = quantize(this->elapsed);
self->ts[self->stack] = 0;
}
|
Notice that the pid id is passed as an argument to the script. Running this against pid 21080 gives the following output (with low frequency and low elapsed time entries removed):
# dtrace -s quantify.d 21080 ... output snipped ... s390_compare_register 51374 s390_subtract_logical_register 52798 s390_compare 56810 s390_load_and_test_register 59370 s390_compare_logical 68099 s390_and 71375 s390_branch_on_condition_register 72790 ptt_pthread_trace 76804 s390_load_register 135094 s390_add_logical_register 139985 s390_store 199027 s390_test_under_mask 234316 s390_load_address 293952 s390_load 393249 s390_branch_on_condition 746037The first aggregate dumped showed frequency of invocation by function call, and that's similar to other tests.
TM, LA, L, and BC are indeed frequent instructions.
But the aggregate based on time shows interesting different results! For exposition purposes I removed output for function calls related to sleeping or waiting on a condition, as they would be expected to have high elapsed times (a later test could be used to distinguish between elapsed times spent waiting for work to do or I/O to complete - versus time spent waiting for a lock to be released).
Instead, what is interesting is that while s390_branch_on_condition does get invoked a lot,
it doesn't have high elapsed times - the left column represents time in nanoseconds, and this is a fast
instruction in emulation. Other instructions, and a number of support routines, got invoked a lot
less frequently, but represented disproportionate time for the number of invocations!
s390_branch_on_condition
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 727692
2048 | 3758
4096 |@ 14337
8192 | 103
16384 | 37
32768 | 39
65536 | 26
131072 | 40
262144 | 4
524288 | 1
1048576 | 0
s390_execute_b2xx
value ------------- Distribution ------------- count
2048 | 0
4096 |@@@@@@@@@@@@@@@ 1635
8192 |@@@@@@@@@@@@ 1282
16384 | 14
32768 | 1
65536 |@@@@@@@@@ 990
131072 | 7
262144 | 7
524288 |@ 150
1048576 | 22
2097152 | 1
4194304 | 0
8388608 | 0
16777216 | 6
33554432 |@ 141
67108864 | 13
134217728 | 0
sprintf
value ------------- Distribution ------------- count
4096 | 0
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 40
16384 |@@@@@@@@@ 14
32768 | 0
65536 | 0
131072 | 0
262144 | 0
524288 | 0
1048576 | 0
2097152 | 0
4194304 | 0
8388608 | 0
16777216 | 0
33554432 | 0
67108864 | 0
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 |@@@@@ 8
2147483648 | 0
s390_insert_characters_under_mask
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 29837
2048 |@ 844
4096 |@ 621
8192 | 5
16384 | 5
32768 | 8
65536 |@ 675
131072 |@ 749
262144 |@ 601
524288 |@ 569
1048576 |@ 706
2097152 |@ 1547
4194304 |@@ 2278
8388608 |@ 756
16777216 | 402
33554432 |@ 866
67108864 |@@@ 2776
134217728 | 0
In this example, the long durations for some invocations of
s390_insert_characters_under_mask make it disproportionately more expensive than
its frequency of use would indicate. Optimizing the "tail" of long duration calls, and for that
matter reducing the number of calls to sprintf could make a difference in performance.
Summary
In this (slightly contrived) example, I've shown a little bit about how DTrace can be used for performance measurement. We were able to use standard Solaris tools to see how busy the system was, and then used DTrace to get counts of system calls by application and by function. After that, we figured out which functions in the application were making system calls, which file was involved in a filesystem call, and then we got the frequency counts and durations of the "hot" entry points in the application. This illustrates performance observability of an application not otherwise set up for instrumentation. No special compiler options were used, no code was changed, but we were able to see where we were spending our time.This just scratches the surface of what can be done - it's possible to get a lot more information, such as flow of control from function to function, just as an example. With a little investment of effort (especially if you borrow existing DTrace scripts available on the 'net!) you can get incredible insight into the behavior of your applications or entire Solaris systems. There's really nothing remotely like this elsewhere, and I can't recommend it strongly enough.
For a great reference for learning more, go to the
BigAdmin DTrace portal
or OpenSolaris DTrace page.
Posted by jsavit
[Sun] ( February 22, 2009 08:02 PM )
Permalink
