ZFS L2ARC
An exciting new ZFS feature has now become publicly known - the second level ARC, or L2ARC. I've been busy with its development for over a year, however this is my first chance to post about it. This post will show a quick example and answer some basic questions.
Background in a nutshell
The "ARC" is the ZFS main memory cache (in DRAM), which can be accessed with sub microsecond latency. An ARC read miss would normally read from disk, at millisecond latency (especially random reads). The L2ARC sits in-between, extending the main memory cache using fast storage devices - such as flash memory based SSDs (solid state disks).
![]() |
![]() |
![]() |
Some example sizes to put this into perspective, from a lab machine named "walu":
| Layer | Medium | Total Capacity |
| ARC | DRAM | 128 Gbytes |
| L2ARC | 6 x SSDs | 550 Gbytes |
| Storage Pool | 44 Disks | 17.44 Tbytes (mirrored) |
For this server, the L2ARC allows around 650 Gbytes to be stored in the total ZFS cache (ARC + L2ARC), rather than just DRAM with about 120 Gbytes.
A previous ZFS feature (the ZIL) allowed you to add SSD disks as log devices to improve write performance. This means ZFS provides two dimensions for adding flash memory to the file system stack: the L2ARC for random reads, and the ZIL for writes.
Adam has been the mastermind behind our flash memory efforts, and has written an excellent article in Communications of the ACM about flash memory based storage in ZFS; for more background, check it out.
L2ARC Example
To illustrate the L2ARC with an example, I'll use walu - a medium sized server in our test lab, which was briefly described above. Its ZFS pool of 44 x 7200 RPM disks is configured as a 2-way mirror, to provide both good reliability and performance. It also has 6 SSDs, which I'll add to the ZFS pool as L2ARC devices (or "cache devices").I should note - this is an example of L2ARC operation, not a demonstration of the maximum performance that we can achieve (the SSDs I'm using here aren't the fastest I've ever used, nor the largest.)
20 clients access walu over NFSv3, and execute a random read workload with an 8 Kbyte record size across 500 Gbytes of files (which is also its working set).
1) disks only
Since the 500 Gbytes of working set is larger than walu's 128 Gbytes of DRAM, the disks must service many requests. One way to grasp how this workload is performing is to examine the IOPS that the ZFS pool delivers:
walu# zpool iostat pool_0 30
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
pool_0 8.38T 9.06T 95 4 762K 29.1K
pool_0 8.38T 9.06T 1.87K 15 15.0M 30.3K
pool_0 8.38T 9.06T 1.88K 3 15.1M 20.4K
pool_0 8.38T 9.06T 1.89K 16 15.1M 39.3K
pool_0 8.38T 9.06T 1.89K 4 15.1M 23.8K
[...]
|
The pool is pulling about 1.89K ops/sec, which would require about 42 ops per disk of this pool. To examine how this is delivered by the disks, we can either use zpool iostat or the original iostat:
walu# iostat -xnz 10
[...trimmed first output...]
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
43.9 0.0 351.5 0.0 0.0 0.4 0.0 10.0 0 34 c0t5000CCA215C46459d0
47.6 0.0 381.1 0.0 0.0 0.5 0.0 9.8 0 36 c0t5000CCA215C4521Dd0
42.7 0.0 349.9 0.0 0.0 0.4 0.0 10.1 0 35 c0t5000CCA215C45F89d0
41.4 0.0 331.5 0.0 0.0 0.4 0.0 9.6 0 32 c0t5000CCA215C42A4Cd0
45.6 0.0 365.1 0.0 0.0 0.4 0.0 9.2 0 34 c0t5000CCA215C45541d0
45.0 0.0 360.3 0.0 0.0 0.4 0.0 9.4 0 34 c0t5000CCA215C458F1d0
42.9 0.0 343.5 0.0 0.0 0.4 0.0 9.9 0 33 c0t5000CCA215C450E3d0
44.9 0.0 359.5 0.0 0.0 0.4 0.0 9.3 0 35 c0t5000CCA215C45323d0
45.9 0.0 367.5 0.0 0.0 0.5 0.0 10.1 0 37 c0t5000CCA215C4505Dd0
[...etc...]
|
iostat is interesting as it lists the service times: wsvc_t + asvc_t. These I/Os are taking on average between 9 and 10 milliseconds to complete, which the client application will usually suffer as latency. This time will be due to the random read nature of this workload - each I/O must wait as the disk heads seek and the disk platter rotates.
Another way to understand this performance is to examine the total NFSv3 ops delivered by this system (these days I use a GUI to monitor NFSv3 ops, but for this blog post I'll hammer nfsstat into printing something concise):
walu# nfsstat -v 3 1 | sed '/^Server NFSv3/,/^[0-9]/!d' [...] Server NFSv3: calls badcalls 2260 0 Server NFSv3: calls badcalls 2306 0 Server NFSv3: calls badcalls 2239 0 [...] |
That's about 2.27K ops/sec for NFSv3; I'd expect 1.89K of that to be what our pool was delivering, and the rest are cache hits out of DRAM, which is warm at this point.
2) L2ARC devices
Now the 6 SSDs are added as L2ARC cache devices:
walu# zpool add pool_0 cache c7t0d0 c7t1d0 c8t0d0 c8t1d0 c9t0d0 c9t1d0 |
And we wait until the L2ARC is warm.
Time passes ...
Several hours later the cache devices have warmed up enough to satisfy most I/Os which miss main memory. The combined 'capacity/used' column for the cache devices shows that our 500 Gbytes of working set now exists on those 6 SSDs:
walu# zpool iostat -v pool_0 30
[...]
capacity operations bandwidth
pool used avail read write read write
------------------------- ----- ----- ----- ----- ----- -----
pool_0 8.38T 9.06T 30 14 245K 31.9K
mirror 421G 507G 1 0 9.44K 0
c0t5000CCA216CCB905d0 - - 0 0 4.08K 0
c0t5000CCA216CCB74Cd0 - - 0 0 5.36K 0
mirror 416G 512G 0 0 7.66K 0
c0t5000CCA216CCB919d0 - - 0 0 4.34K 0
c0t5000CCA216CCB763d0 - - 0 0 3.32K 0
[... 40 disks truncated ...]
cache - - - - - -
c7t0d0 84.5G 8.63G 2.63K 0 21.1M 11.4K
c7t1d0 84.7G 8.43G 2.62K 0 21.0M 0
c8t0d0 84.5G 8.68G 2.61K 0 20.9M 0
c8t1d0 84.8G 8.34G 2.64K 0 21.1M 0
c9t0d0 84.3G 8.81G 2.63K 0 21.0M 0
c9t1d0 84.2G 8.91G 2.63K 0 21.0M 1.53K
------------------------- ----- ----- ----- ----- ----- -----
|
The pool_0 disks are still serving some requests (in this output 30 ops/sec) but the bulk of the reads are being serviced by the L2ARC cache devices - each providing around 2.6K ops/sec. The total delivered by this ZFS pool is 15.8K ops/sec (pool disks + L2ARC devices), about 8.4x faster than with disks alone.
This is confirmed by the delivered NFSv3 ops:
walu# nfsstat -v 3 1 | sed '/^Server NFSv3/,/^[0-9]/!d' [...] Server NFSv3: calls badcalls 18729 0 Server NFSv3: calls badcalls 18762 0 Server NFSv3: calls badcalls 19000 0 [...] |
walu is now delivering 18.7K ops/sec, which is 8.3x faster than without the L2ARC.
However the real win for the client applications is that of read latency; the disk-only iostat output showed our average was between 9 and 10 milliseconds, the L2ARC cache devices are delivering the following:
walu# iostat -xnz 10
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
[...]
2665.0 0.4 21317.2 0.0 0.7 0.7 0.2 0.2 39 67 c9t0d0
2668.1 0.5 21342.0 3.2 0.6 0.7 0.2 0.2 38 66 c9t1d0
2665.4 0.4 21320.4 0.0 0.7 0.7 0.3 0.3 42 69 c8t0d0
2683.6 0.4 21465.9 0.0 0.7 0.7 0.3 0.3 41 68 c8t1d0
2660.7 0.6 21295.6 3.2 0.6 0.6 0.2 0.2 36 65 c7t1d0
2650.7 0.4 21202.8 0.0 0.6 0.6 0.2 0.2 36 64 c7t0d0
|
Our average service time is between 0.4 and 0.6 ms (wsvt_t + asvc_t columns), which is about 20x faster than what the disks were delivering.
What this means ...
An 8.3x improvement for 8 Kbyte random IOPS across a 500 Gbyte working set is impressive, as is improving storage I/O latency by 20x.But this isn't really about the numbers, which will become dated (these SSDs were manufactured in July 2008, by a supplier who is providing us with bigger and faster SSDs every month).
What's important is that ZFS can make intelligent use of fast storage technology, in different roles to maximize their benefit. When you hear of new SSDs with incredible ops/sec performance, picture them as your L2ARC; or if it were great write throughput, picture them as your ZIL.
The example above was to show that the L2ARC can deliver, over NFS, whatever these SSDs could do. And these SSDs are being used as a second level cache, in-between main memory and disk, to achieve the best price/performance.
Questions
I recently spoke to a customer about the L2ARC and they asked a few questions which may be useful to repeat here:What is L2ARC?
- The L2ARC is best pictured as a cache layer in-between main memory and disk,
using flash memory based SSDs or other fast devices as storage.
It holds non-dirty ZFS data, and is currently intended to
improve the performance of random read workloads.
Isn't flash memory unreliable? What have you done about that?
- It's getting much better, but we have designed the L2ARC to handle errors
safely. The data stored on the L2ARC is checksummed, and if the
checksum is wrong or the SSD reports an error, we defer that read to the
original pool of disks. Enough errors and the L2ARC device will offline itself.
I've even yanked out busy L2ARC devices on live systems as part of testing, and
everything continues to run.
Aren't SSDs really expensive?
- They used to be, but their price/performance has now reached the point where
it makes sense to start using them in the coming months. See Adam's ACM
article
for more details about price/performance.
What about writes - isn't flash memory slow to write to?
- The L2ARC is coded to write to the cache devices asynchronously, so write
latency doesn't affect system performance. This allows us to use "read-bias"
SSDs for the L2ARC, which have the best read latency (and slow write
latency).
What's bad about the L2ARC?
- It was designed to either improve performance or do nothing, so there
isn't anything that should be bad. To explain what I mean by do nothing -
if you use the L2ARC for a streaming or
sequential workload, then the L2ARC will mostly ignore it and not cache it.
This is because the default L2ARC settings assume you are using current SSD
devices, where caching random read workloads is most favourable; with future
SSDs (or other storage technology), we can use the L2ARC for streaming
workloads as well.
Internals
If anyone is interested, I wrote a summary of L2ARC internals as a block comment in usr/src/uts/common/fs/zfs/arc.c, which is also surrounded by the actual implementation code. The block comment is below (see the source for the latest version), and is an excellent reference for how it really works:
/* * Level 2 ARC * * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk. * It uses dedicated storage devices to hold cached data, which are populated * using large infrequent writes. The main role of this cache is to boost * the performance of random read workloads. The intended L2ARC devices * include short-stroked disks, solid state disks, and other media with * substantially faster read latency than disk. * * +-----------------------+ * | ARC | * +-----------------------+ * | ^ ^ * | | | * l2arc_feed_thread() arc_read() * | | | * | l2arc read | * V | | * +---------------+ | * | L2ARC | | * +---------------+ | * | ^ | * l2arc_write() | | * | | | * V | | * +-------+ +-------+ * | vdev | | vdev | * | cache | | cache | * +-------+ +-------+ * +=========+ .-----. * : L2ARC : |-_____-| * : devices : | Disks | * +=========+ `-_____-' * * Read requests are satisfied from the following sources, in order: * * 1) ARC * 2) vdev cache of L2ARC devices * 3) L2ARC devices * 4) vdev cache of disks * 5) disks * * Some L2ARC device types exhibit extremely slow write performance. * To accommodate for this there are some significant differences between * the L2ARC and traditional cache design: * * 1. There is no eviction path from the ARC to the L2ARC. Evictions from * the ARC behave as usual, freeing buffers and placing headers on ghost * lists. The ARC does not send buffers to the L2ARC during eviction as * this would add inflated write latencies for all ARC memory pressure. * * 2. The L2ARC attempts to cache data from the ARC before it is evicted. * It does this by periodically scanning buffers from the eviction-end of * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are * not already there. It scans until a headroom of buffers is satisfied, * which itself is a buffer for ARC eviction. The thread that does this is * l2arc_feed_thread(), illustrated below; example sizes are included to * provide a better sense of ratio than this diagram: * * head --> tail * +---------------------+----------+ * ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->. # already on L2ARC * +---------------------+----------+ | o L2ARC eligible * ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->| : ARC buffer * +---------------------+----------+ | * 15.9 Gbytes ^ 32 Mbytes | * headroom | * l2arc_feed_thread() * | * l2arc write hand <--[oooo]--' * | 8 Mbyte * | write max * V * +==============================+ * L2ARC dev |####|#|###|###| |####| ... | * +==============================+ * 32 Gbytes * * 3. If an ARC buffer is copied to the L2ARC but then hit instead of * evicted, then the L2ARC has cached a buffer much sooner than it probably * needed to, potentially wasting L2ARC device bandwidth and storage. It is * safe to say that this is an uncommon case, since buffers at the end of * the ARC lists have moved there due to inactivity. * * 4. If the ARC evicts faster than the L2ARC can maintain a headroom, * then the L2ARC simply misses copying some buffers. This serves as a * pressure valve to prevent heavy read workloads from both stalling the ARC * with waits and clogging the L2ARC with writes. This also helps prevent * the potential for the L2ARC to churn if it attempts to cache content too * quickly, such as during backups of the entire pool. * * 5. After system boot and before the ARC has filled main memory, there are * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru * lists can remain mostly static. Instead of searching from tail of these * lists as pictured, the l2arc_feed_thread() will search from the list heads * for eligible buffers, greatly increasing its chance of finding them. * * The L2ARC device write speed is also boosted during this time so that * the L2ARC warms up faster. Since there have been no ARC evictions yet, * there are no L2ARC reads, and no fear of degrading read performance * through increased writes. * * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that * the vdev queue can aggregate them into larger and fewer writes. Each * device is written to in a rotor fashion, sweeping writes through * available space then repeating. * * 7. The L2ARC does not store dirty content. It never needs to flush * write buffers back to disk based storage. * * 8. If an ARC buffer is written (and dirtied) which also exists in the * L2ARC, the now stale L2ARC buffer is immediately dropped. * * The performance of the L2ARC can be tweaked by a number of tunables, which * may be necessary for different workloads: * * l2arc_write_max max write bytes per interval * l2arc_write_boost extra write bytes during device warmup * l2arc_noprefetch skip caching prefetched buffers * l2arc_headroom number of max device writes to precache * l2arc_feed_secs seconds between L2ARC writing * * Tunables may be removed or added as future performance improvements are * integrated, and also may become zpool properties. */ |
Jonathan recently linked to this block comment in a blog entry about flash memory, to show that ZFS can incorporate flash into the storage hierarchy, and here is the actual implementation.
Posted by brendan
[Performance] ( July 22, 2008 09:48 PM )
Permalink
| Comments[13]
DTrace Links
A few happenings in the increasingly busy world of DTrace:Posted by brendan [DTrace] ( February 18, 2008 03:48 AM ) Permalink | Comments[0]
DTrace Bourne shell (sh) provider
Alan Hargreaves has prototyped a DTrace Bourne shell provider, and is looking ahead to see how this can be integrated with Solaris and OpenSolaris. This is great news for anyone who would like to analyze or troubleshoot their Bourne shell scripts. It is also good news for people (especially system administrators) wanting to learn DTrace, as it may provide a familiar path (sh scripting) for you to learn user-land DTracing.Why sh, and not csh, tcsh, ksh, bash or zsh? Alan could have done any, we wanted to start with the Bourne shell to both get it done and as an example for the other shell providers to follow (watch this space). So if you really just care about bash or whatever, then learning about the sh provider now should be a useful heads-up of what should be possible.
hello.sh
When learning to DTrace software, it can be best to start with the simple stuff. Confirm that what you think will happen, does indeed happen, then move to progressivly more complex scripts.Here is hello.sh, a dead simple script to start with,
# cat -n hello.sh
1 #!./sh
2
3 echo "Hello World!"
4 echo "Hello World!"
5 echo "Hello World!"
6 echo "Hello World!"
7 echo "Hello World!"
The following DTrace one-liner will run hello.sh (-c), trace all sh provider probes (sh*:::), and frequency count their probe names (@[probename] = count()).
# dtrace -Zn 'sh$target::: { @[probename] = count(); }' -c ./hello.sh
dtrace: description 'sh$target::: ' matched 9 probes
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
dtrace: pid 248847 has exited
script-begin 1
script-end 1
builtin-entry 5
builtin-return 5
line 5
So builtin-entry and builtin-return were called 5 times, one for each echo. The line probe was also called 5 times, once for each line that the shell executed.Frequency counting builtins
Each probe provides useful information in their args, such as the script name, builtin or exec name, line number, etc. Here I'll use the builtin-entry probe to frequency count the builtin names,
# dtrace -Zn 'sh$target:::builtin-entry { @[copyinstr(arg1)] = count(); }' -c ./hello.sh
dtrace: description 'sh$target:::builtin-entry ' matched 1 probe
Hello World!
[...]
dtrace: pid 248851 has exited
echo 5
That found that echo was called 5 times. Nothing suprising there, this is just confirming that the probes behave as they should.Line number tracing
The line probe allows us to trace line execution. This one liner counts how many times each line was run,
# dtrace -Zn 'sh$target:::line { @[arg1] = count(); }' -c ./hello.sh
dtrace: description 'sh$target:::line ' matched 1 probe
Hello World!
[...]
dtrace: pid 248855 has exited
3 1
4 1
5 1
6 1
7 1
Each of the lines 3 to 7 were called once, as expected. This one-liner should be useful for finding loop hotspots.And the following one-liner traces line execution live,
# dtrace -Zn 'sh$target:::line { trace(arg1); }' -c ./hello.sh
dtrace: description 'sh$target:::line ' matched 1 probe
Hello World!
[...]
CPU ID FUNCTION:NAME
0 54539 execute:line 3
0 54539 execute:line 4
0 54539 execute:line 5
0 54539 execute:line 6
0 54539 execute:line 7
dtrace: pid 248869 has exited
Great, that works. Is this getting too easy? lets move on...
func_abc.sh
Now for somthing a little harder (and more interesting). In the following script, func_a() calls func_b() which calls func_c().
# cat -n func_abc.sh
1 #!./sh
2
3 func_c()
4 {
5 echo "Function C"
6 sleep 1
7 }
8
9 func_b()
10 {
11 echo "Function B"
12 sleep 1
13 func_c
14 }
15
16 func_a()
17 {
18 echo "Function A"
19 sleep 1
20 func_b
21 }
22
23 func_a
What order will the lines be executed now? Try thinking about it before checking the answer below.Tracing line execution
# dtrace -Zn 'sh$target:::line { trace(arg1); }' -c ./func_abc.sh
dtrace: description 'sh$target:::line ' matched 1 probe
Function A
CPU ID FUNCTION:NAME
0 54539 execute:line 23
0 54539 execute:line 18
0 54539 execute:line 19 pause
Function B
0 54539 execute:line 20
0 54539 execute:line 11
0 54539 execute:line 12 pause
Function C
0 54539 execute:line 13
0 54539 execute:line 5
0 54539 execute:line 6 pause
dtrace: pid 248873 has exited
There were some conspicuous one-second pauses during the output, which are sadly lost in this screen shot. I've added the word "pause" so that you know when that happened.The order of the lines does make sense. What might be confusing is that we see the output text "Function B" followed by the trace of line 20 -- this is due to DTrace output buffering.
Tracing function flow - sh_flow.d
Since we have function-entry and function-return probes, we can trace function execution and indent the output based on the function depth. The following demonstrates sh_flow.d - a script which does just that for any script running on the system (with this sh provider enabled binary), and also traces builtin calls and command executions. sh_flow.d is run in one window, and func_abc.sh runs in another.
# ./sh_flow.d C TIME FILE -- NAME 0 2007 Aug 10 18:41:10 func_abc.sh -> func_a 0 2007 Aug 10 18:41:10 func_abc.sh > echo 0 2007 Aug 10 18:41:10 func_abc.sh | sleep 0 2007 Aug 10 18:41:11 func_abc.sh -> func_b 0 2007 Aug 10 18:41:11 func_abc.sh > echo 0 2007 Aug 10 18:41:11 func_abc.sh | sleep 0 2007 Aug 10 18:41:12 func_abc.sh -> func_c 0 2007 Aug 10 18:41:12 func_abc.sh > echo 0 2007 Aug 10 18:41:12 func_abc.sh | sleep 0 2007 Aug 10 18:41:13 func_abc.sh <- func_c 0 2007 Aug 10 18:41:13 func_abc.sh <- func_b 0 2007 Aug 10 18:41:13 func_abc.sh <- func_a ^CThe legend is,
| -> | function entry |
| <- | function return |
| > | builtin entry |
| | | command entry |
The sh_flow.d script is,
#!/usr/sbin/dtrace -Zs
/*
* sh_flow.d - snoop Bourne shell execution showing function flow.
* Written for the Solaris sh DTrace provider.
*
* This traces shell activity from all Bourne shells on the system that are
* running with sh provider support.
*
* USAGE: sh_flow.d # hit Ctrl-C to end
*
* This watches shell function entries and returns, and indents child
* function calls. Shell builtins are also printed.
*
* FIELDS:
* C CPU-id
* TIME Time of execution
* FILE Filename that this function belongs to
* NAME Shell function, builtin or command name
*
* If a name isn't available at the time of tracing, "<null>" is printed.
*
* WARNING: Watch the first column carefully, it prints the CPU-id. If it
* changes, then it is very likely that the output has been shuffled.
*/
#pragma D option quiet
#pragma D option switchrate=10
dtrace:::BEGIN
{
depth = 0;
printf("%s %-20s %-22s -- %s\n", "C", "TIME", "FILE", "NAME");
}
sh*:::function-entry
{
depth++;
printf("%d %-20Y %-22s %*s-> %s\n", cpu, walltimestamp,
basename(copyinstr(arg0)), depth*2, "", copyinstr(arg1));
}
sh*:::function-return
{
printf("%d %-20Y %-22s %*s<- %s\n", cpu, walltimestamp,
basename(copyinstr(arg0)), depth*2, "", copyinstr(arg1));
depth--;
}
sh*:::builtin-entry
{
printf("%d %-20Y %-22s %*s > %s\n", cpu, walltimestamp,
basename(copyinstr(arg0)), depth*2, "", copyinstr(arg1));
}
sh*:::command-entry
{
printf("%d %-20Y %-22s %*s | %s\n", cpu, walltimestamp,
basename(copyinstr(arg0)), depth*2, "", copyinstr(arg1));
}
Tracing function flow with times - sh_flowtime.dA great feature of DTrace is the availability of high resolution timestamps. Here I've enhanced the previous script to print delta times,
# ./sh_flowtime.d C TIME FILE DELTA(us) -- NAME 0 2007 Aug 10 18:52:51 func_abc.sh 0 -> func_a 0 2007 Aug 10 18:52:51 func_abc.sh 54 > echo 0 2007 Aug 10 18:52:52 func_abc.sh 1022880 | sleep 0 2007 Aug 10 18:52:52 func_abc.sh 34 -> func_b 0 2007 Aug 10 18:52:52 func_abc.sh 44 > echo 0 2007 Aug 10 18:52:53 func_abc.sh 1029963 | sleep 0 2007 Aug 10 18:52:53 func_abc.sh 44 -> func_c 0 2007 Aug 10 18:52:53 func_abc.sh 43 > echo 0 2007 Aug 10 18:52:54 func_abc.sh 1029863 | sleep 0 2007 Aug 10 18:52:54 func_abc.sh 33 <- func_c 0 2007 Aug 10 18:52:54 func_abc.sh 14 <- func_b 0 2007 Aug 10 18:52:54 func_abc.sh 7 <- func_a ^CThe legend for understanding the delta times is, The legend is,
| -> | function entry | time from the previous event, to the entry of this function |
| <- | function return | time from the previous event, to the completion of this function |
| > | builtin entry | time from the previous event, to when this builtin completed execution |
| | | command entry | time from the previous event, to when this command completed execution |
Being able to present this information in such a compelling way is awsome stuff.
I won't include the source to that script just yet (this blog entry will get too long), but I'll encourage Alan to bundle these scripts along with the patch/binary of the sh provider (when he makes that available).
Tracing type times - sh_typetime.d
If your script is rather long, the above output from sh_flowtime.d may get rather long. This condensed summary may be more practical,
# ./sh_typetime.d Tracing... Hit Ctrl-C to end. ^C FILE TYPE NAME TIME(us) func_abc.sh builtin echo 82 func_abc.sh func func_c 260 func_abc.sh func func_b 285 func_abc.sh func func_a 478 func_abc.sh exec sleep 3080418It's pretty obvious from that output where the time was spent.
I was experimenting with more exciting forms of that type of script,
# ./sh_typetime2.d Tracing... Hit Ctrl-C to end. ^C FILE TYPE NAME COUNT ELAPSED ON-CPU func_abc.sh func func_c 1 296 31 func_abc.sh func func_b 1 346 44 func_abc.sh func func_a 1 529 61 func_abc.sh builtin echo 3 87 77 func_abc.sh exec sleep 3 3085831 53984Lots more will be possible - Alan only wrote the provider two days ago, so I haven't had much of a chance to write scripts yet!
Real world tracing
Alan literally wrote this provider from scratch in the last few days, so we haven't had a chance to point it at something big yet. Here's at least one example of something familiar, /etc/profile,
# ./sh_typetime2.d Tracing... Hit Ctrl-C to end. ^C FILE TYPE NAME COUNT ELAPSED ON-CPU profile builtin [ 1 3 2 profile builtin umask 1 4 3 profile builtin export 1 5 3 profile builtin trap 2 30 22Cool. See Alan's blog for updates.
Posted by brendan
[DTrace] ( August 10, 2007 12:23 PM )
Permalink
| Comments[1]
iSCSI DTrace provider
I recently tried out the new iSCSI Target Server DTrace provider which Adam wrote (which should be included in Solaris sometime soon). It provides observability of iSCSI traffic from the the server's perspective. This is great news if you have an iSCSI Solaris server, or are planning to have one.In this entry I'll post a few screenshots of the iSCSI provider in action, both one-liners and scripts.
one-liners
I'll start by listing the probes - which may convey a sense of what is possible to trace. Look down the "NAME" column,# dtrace -ln 'iscsi*:::' ID PROVIDER MODULE FUNCTION NAME 11696 iscsi100653 iscsitgtd send_async_logout async-send 11697 iscsi100653 iscsitgtd send_async_scsi async-send 11698 iscsi100653 iscsitgtd handle_scsi_data data-receive 11699 iscsi100653 iscsitgtd iscsi_conn_data_rqst data-request 11700 iscsi100653 iscsitgtd send_datain_pdu data-send 11701 iscsi100653 iscsitgtd iscsi_handle_login_pkt login-command 11702 iscsi100653 iscsitgtd send_login_reject login-response 11703 iscsi100653 iscsitgtd iscsi_handle_login_pkt login-response 11704 iscsi100653 iscsitgtd handle_logout_msg logout-command 11705 iscsi100653 iscsitgtd conn_state logout-response 11706 iscsi100653 iscsitgtd handle_noop_cmd nop-receive 11707 iscsi100653 iscsitgtd iscsi_conn_pkt nop-send 11708 iscsi100653 iscsitgtd handle_scsi_cmd scsi-command 11709 iscsi100653 iscsitgtd send_scsi_rsp scsi-response 11710 iscsi100653 iscsitgtd handle_task_mgt task-command 11711 iscsi100653 iscsitgtd iscsi_conn_pkt task-response 11712 iscsi100653 iscsitgtd handle_text_msg text-command 11713 iscsi100653 iscsitgtd handle_text_msg text-responseThe easiest one-liner would be to frequency count iSCSI operations,
# dtrace -n 'iscsi*::: { @[probename] = count(); }'
dtrace: description 'iscsi*::: ' matched 18 probes
^C
nop-receive 3
nop-send 3
data-receive 55
data-request 55
scsi-response 103
scsi-command 342
data-send 2865
The above output shows that the iSCSI server is doing more send operations (2865 data-sends), than receive (55 data-receives).Now to see who is using this iSCSI server, by frequency counting remote IP addresses,
# dtrace -n 'iscsi*::: { @[args[0]->ci_remote] = count(); }'
dtrace: description 'iscsi*::: ' matched 18 probes
^C
192.168.1.109 1338
The above shows that 192.168.1.108 triggered 1338 iSCSI operations - which may have been sends, receives, nops, etc. Breaking down further is just a little more DTrace,
# dtrace -qn 'iscsi*::: { @[args[0]->ci_remote, probename] = count(); }
END { printa("%32s %18s %@16d\n", @); }'
^C
192.168.1.109 nop-receive 2
192.168.1.109 nop-send 2
192.168.1.109 data-receive 16
192.168.1.109 data-request 16
192.168.1.109 scsi-response 43
192.168.1.109 scsi-command 253
192.168.1.109 data-send 2619
Great, while a simple one-liner - it quickly identifies who is connecting and what operations they are doing. Some performance wins are about locating and eliminating unnecessary work, which one-liners such as this help identify.Now for printing payload bytes by iSCSI operation,
# dtrace -n 'iscsi*::: { @[probename] = sum(args[1]->ii_datalen); }'
dtrace: description 'iscsi*::: ' matched 18 probes
^C
data-request 0
scsi-response 0
data-receive 99328
scsi-command 140800
data-send 20672512
The above one-liner prints total bytes by operation, which does lose some data - what was the count by operation, average, min, max? DTrace can answer all of these seperately, or provide a summary as a distribution plot. Here I've used quantize for a power of 2 plot,
# dtrace -n 'iscsi*::: { @[probename] = quantize(args[2]->ii_datalen); }'
dtrace: description 'iscsi*::: ' matched 18 probes
^C
data-request
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 12
1 | 0
scsi-response
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 55
1 | 0
data-receive
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@ 2
2048 | 0
4096 |@@@@@@@@@@@@@ 4
8192 |@@@@@@@@@@@@@@@@@@@@ 6
16384 | 0
scsi-command
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 209
1 | 0
2 | 0
4 | 0
8 | 0
16 | 0
32 | 0
64 | 0
128 | 0
256 | 0
512 |@@@ 18
1024 |@@ 10
2048 |@@ 11
4096 |@ 4
8192 |@@ 12
16384 | 0
data-send
value ------------- Distribution ------------- count
256 | 0
512 | 1
1024 | 16
2048 | 21
4096 | 28
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2559
16384 | 0
The plots above show that for the data-send operation, the payload size was usually in the 8 to 16 Kbyte bucket.
scripts
Apart from one-lines, DTrace allows powerful scripts to be contstructed. I've put a few on the SolarisInternals DTrace Topics: iSCSI page.Here I'll post my favourite so far - a script which measures read/write latency (or "response time") in nanoseconds,
# ./iscsirwlat.d
Tracing... Hit Ctrl-C to end.
^C
Read Latency (ns),
192.168.1.109
value ------------- Distribution ------------- count
32768 | 0
65536 |@@@ 27
131072 |@@ 17
262144 | 4
524288 |@@@@ 39
1048576 |@@@ 32
2097152 |@@@@@@@@@@@@@@@@@ 164
4194304 |@ 11
8388608 |@ 5
16777216 |@ 10
33554432 |@ 7
67108864 |@ 5
134217728 |@@@@@ 46
268435456 |@ 14
536870912 | 0
Write Latency (ns),
192.168.1.109
value ------------- Distribution ------------- count
32768 | 0
65536 |@@ 14
131072 |@@ 16
262144 |@@@ 20
524288 |@@@ 20
1048576 |@@@@@ 33
2097152 |@ 5
4194304 | 0
8388608 | 2
16777216 |@ 10
33554432 |@@@ 18
67108864 |@@@@@ 33
134217728 |@@@@@@@@@@@@@ 93
268435456 |@@ 12
536870912 | 0
The script is on the DTrace Topics page. It is a glimpse of things to come; how many of those were synchronous requests (which may hurt application performance) versus asynchronous flushing or read ahead (which may not hurt performance)? More DTrace is needed - which will probably include running DTrace on the client side as well...
Posted by brendan [DTrace] ( July 31, 2007 10:07 AM ) Permalink | Comments[0]
JavaScript DTrace provider ver 2.0
About 2 months ago I posted an update to the JavaScript DTrace provider to Bugzilla. In case you are interested and missed it, here are some details: The new additions were,
- the ability to identify anonymous functions (js_function-info)
- access to function entry arguments (js_function-args)
- access to function return value (js_function-rval)
Anyhow, here are some before and after screenshots, tracing a simple JavaScript clock (it's enough to demonstrate the features without bogging down in application specifics),
Old,
# ./js_funcflow.d C TIME FILE -- FUNC 1 2007 May 31 06:07:13 clock2.html -> startTime 1 2007 May 31 06:07:13 clock2.html -> getHours 1 2007 May 31 06:07:13 clock2.html <- getHours 1 2007 May 31 06:07:13 clocklib.js -> padLibZero 1 2007 May 31 06:07:13 clocklib.js <- padLibZero 1 2007 May 31 06:07:13 clock2.html -> getMinutes 1 2007 May 31 06:07:13 clock2.html <- getMinutes 1 2007 May 31 06:07:13 clocklib.js -> padLibZero 1 2007 May 31 06:07:13 clocklib.js <- padLibZero 1 2007 May 31 06:07:13 clock2.html -> getSeconds 1 2007 May 31 06:07:13 clock2.html <- getSeconds 1 2007 May 31 06:07:13 clocklib.js -> padLibZero 1 2007 May 31 06:07:13 clocklib.js <- padLibZero 1 2007 May 31 06:07:13 clock2.html -> getElementById 1 2007 May 31 06:07:13 clock2.html <- getElementById 1 2007 May 31 06:07:13 clock2.html -> setTimeout 1 2007 May 31 06:07:13 clock2.html <- setTimeout 1 2007 May 31 06:07:13 clock2.html <- startTime ^CNew,
# ./js_argflow.d C TIME FILE -- FUNC() 1 2007 May 31 06:07:34 clock2.html -> startTime() 1 2007 May 31 06:07:34 clock2.html -> getHours() 1 2007 May 31 06:07:34 clock2.html <- startTime:9 = 0x80000000 1 2007 May 31 06:07:34 clocklib.js -> padLibZero(0x6) 1 2007 May 31 06:07:34 clocklib.js <- padLibZero:3 = 0x955fd70 1 2007 May 31 06:07:34 clock2.html -> getMinutes() 1 2007 May 31 06:07:34 clock2.html <- startTime:10 = 0x80000000 1 2007 May 31 06:07:34 clocklib.js -> padLibZero(0x7) 1 2007 May 31 06:07:34 clocklib.js <- padLibZero:3 = 0x955fd50 1 2007 May 31 06:07:34 clock2.html -> getSeconds() 1 2007 May 31 06:07:34 clock2.html <- startTime:11 = 0x80000000 1 2007 May 31 06:07:34 clocklib.js -> padLibZero(0x22) 1 2007 May 31 06:07:34 clocklib.js <- padLibZero:4 = 0x22 1 2007 May 31 06:07:34 clock2.html -> getElementById() 1 2007 May 31 06:07:34 clock2.html <- startTime:12 = 0x80000000 1 2007 May 31 06:07:34 clock2.html -> setTimeout() 1 2007 May 31 06:07:34 clock2.html <- startTime:13 = 0x80000000 1 2007 May 31 06:07:34 clock2.html <- startTime:13 = 0x80000000 ^CNow there is some visibility for function arguments and return values. If arguments are strings, it is possible to print them as such from DTrace.
Old,
# ./js_funccalls.d Tracing... Hit Ctrl-C to end. ^C FILE FUNC CALLS clock2.html getElementById 1 clock2.html getHours 1 clock2.html getMinutes 1 clock2.html getSeconds 1 clock2.html setTimeout 1 clock2.html startTime 1 clocklib.js padLibZero 3New,
# ./js_funcinfo.d Tracing... Hit Ctrl-C to end. ^C BASE[FILE:LINE] FUNCNAME RUN[FILE:LINE] CALLS clock2.html:7 getElementById clock2.html:12 1 clock2.html:7 getHours clock2.html:9 1 clock2.html:7 getMinutes clock2.html:10 1 clock2.html:7 getSeconds clock2.html:11 1 clock2.html:7 setTimeout clock2.html:13 1 clock2.html:7 startTime clock2.html:13 1 clocklib.js:2 padLibZero clock2.html:10 1 clocklib.js:2 padLibZero clock2.html:11 1 clocklib.js:2 padLibZero clock2.html:9 1The "BASE" info describes where that function was declared or first referred to; the "RUN" info describes where that function was executed. These details are especially useful for tracing anonymous functions.
The scripts used above can be downloaded in a tar file from the OpenSolaris DTrace JavaScript site.
There are a number of people at Sun and at Mozilla helping to put DTrace probes in Mozilla, not just for JavaScript - but also for tracing browser latencies at a level which is useful for developers. For more information see the OpenSolaris page, and on Bugzilla: 370906, 388564.
Posted by brendan
[DTrace] ( July 30, 2007 01:02 PM )
Permalink
| Comments[0]
DTracing off-CPU time
In this entry I'll demonstrate DTracing off-CPU time.On my way back from OSCON 2007, I found myself in Portland airport with a laptop, no Internet and half an hour to kill. In fact, the same laptop that I used during the SVOSUG talk in April. It was a chance to finish the demo that I started back then.
For those that were there or listened in, the talk had an unfortunate start with over an hour of audiovisual issues - including problems with the laptop video driver, laptop boot process, the data projector, lights and microphone. Alan DuBoff did a good job of fixing things while I began plan B (presenting from a SunRay), and he fixed the video driver such that it works better with data projectors than any other Acer Ferrari I've tried.
One issue we didn't fix that night was a 20 second wait when starting up certain applications, such as gnome-terminal. It was to make a great live DTrace demo to finish with, but I ran out of time to do it properly (we finished at around 11:30pm). Here it is as a blog entry instead.
The problem
When running gnome-terminal, there is a 20 second wait before the terminal appears on the screen.Initial check
During the 20 second wait,- vmstat 1 showed idle CPUs
- prstat -m showed that gnome-terminal was not on-CPU
DTrace investigation
The gnome-terminal application is off-CPU for some reason, and the CPUs are otherwise idle. There are numerous different ways to begin the investigation with DTrace, including,- using the sched provider to trace off-CPU to on-CPU event time
- using the syscall provider to trace elapsed time for syscalls
- using the pid provider to trace elapsed time for application binary and library calls
The following one-liner runs gnome-terminal and measures the time from that process leaving the CPU to when it returns, and only prints user stack traces if that time was over 1 second,
# dtrace -n 'sched:::off-cpu /pid == $target/ { self->start = timestamp; }
sched:::on-cpu /self->start && ((timestamp - self->start) > 1000000000)/
{ printf("waited: %d ms\n", (timestamp - self->start) / 1000000); ustack(); }
' -c gnome-terminal
dtrace: description 'sched:::off-cpu ' matched 6 probes
CPU ID FUNCTION:NAME
0 48718 resume:on-cpu waited: 20034 ms
libc.so.1`door_call+0x1a
libc.so.1`_nsc_trydoorcall+0x213
libnsl.so.1`_door_getipnodebyname_r+0x8f
libnsl.so.1`_get_hostserv_inetnetdir_byname+0xb62
libnsl.so.1`getipnodebyname+0xdf
libsocket.so.1`get_addr+0x126
libsocket.so.1`_getaddrinfo+0x414
libsocket.so.1`getaddrinfo+0x19
libORBit-2.so.0.1.0`get_netid+0x91
libORBit-2.so.0.1.0`link_get_local_hostname+0x35
libORBit-2.so.0.1.0`link_server_setup+0x56
libORBit-2.so.0.1.0`giop_server_new+0x5e
libORBit-2.so.0.1.0`ORBit_ORB_start_servers+0x1d8
libORBit-2.so.0.1.0`IOP_generate_profiles+0x67
libORBit-2.so.0.1.0`ORBit_marshal_object+0x8e
libORBit-2.so.0.1.0`ORBit_marshal_value+0x391
libORBit-2.so.0.1.0`orbit_small_marshal+0xf7
libORBit-2.so.0.1.0`ORBit_small_invoke_stub+0x11c
libORBit-2.so.0.1.0`ORBit_small_invoke_stub_n+0x43
libORBit-2.so.0.1.0`ORBit_c_stub_invoke+0x132
^C
Hmm - looks like that user stack trace was truncated. Time to boost the ustackframes value,
# dtrace -x ustackframes=64 -n '
sched:::off-cpu /pid == $target/ { self->start = timestamp; }
sched:::on-cpu /self->start && ((timestamp - self->start) > 1000000000)/
{ printf("waited: %d ms\n", (timestamp - self->start) / 1000000); ustack(); }
' -c gnome-terminal
dtrace: description 'sched:::off-cpu ' matched 6 probes
dtrace: pid 101415 has exited
CPU ID FUNCTION:NAME
0 48720 resume:on-cpu waited: 20038 ms
libc.so.1`door_call+0x1a
libc.so.1`_nsc_trydoorcall+0x213
libnsl.so.1`_door_getipnodebyname_r+0x8f
libnsl.so.1`_get_hostserv_inetnetdir_byname+0xb62
libnsl.so.1`getipnodebyname+0xdf
libsocket.so.1`get_addr+0x126
libsocket.so.1`_getaddrinfo+0x414
libsocket.so.1`getaddrinfo+0x19
libORBit-2.so.0.1.0`get_netid+0x91
libORBit-2.so.0.1.0`link_get_local_hostname+0x35
libORBit-2.so.0.1.0`link_server_setup+0x56
libORBit-2.so.0.1.0`giop_server_new+0x5e
libORBit-2.so.0.1.0`ORBit_ORB_start_servers+0x1d8
libORBit-2.so.0.1.0`IOP_generate_profiles+0x67
libORBit-2.so.0.1.0`ORBit_marshal_object+0x8e
libORBit-2.so.0.1.0`ORBit_marshal_value+0x391
libORBit-2.so.0.1.0`orbit_small_marshal+0xf7
libORBit-2.so.0.1.0`ORBit_small_invoke_stub+0x11c
libORBit-2.so.0.1.0`ORBit_small_invoke_stub_n+0x43
libORBit-2.so.0.1.0`ORBit_c_stub_invoke+0x132
libgconf-2.so.4.1.0`ConfigServer_add_client+0x4f
libgconf-2.so.4.1.0`gconf_get_config_server+0xb9
libgconf-2.so.4.1.0`gconf_engine_connect+0x24f
libgconf-2.so.4.1.0`gconf_engine_get_default+0x4c
libgconf-2.so.4.1.0`gconf_client_get_default+0x2a
libgnomeui-2.so.0.1401.0`libgnomeui_post_args_parse+0x187
libgnome-2.so.0.1401.0`gnome_program_postinit+0x61
libgnome-2.so.0.1401.0`gnome_program_init_common+0x37b
libgnome-2.so.0.1401.0`gnome_program_initv+0x4d
libgnome-2.so.0.1401.0`gnome_program_init+0x56
gnome-terminal`main+0x2d5
gnome-terminal`_start+0x7a
Good. This stack trace is likely to have led to the process leaving the CPU for the measured 20038 ms (the system was otherwise idle, so it is unlikely to have been kicked off due to scheduling). Reading through the lines, it looks like it is resolving a hostname - a common source of latencies when DNS is misconfigured.There are a number of lines we could begin studying to confirm what is happening (and arrive at the same answer); I'll start with libnsl.so.1`getipnodebyname+0xdf, which should have a man page entry,
# man getipnodebyname
[...]
struct hostent *getipnodebyname(const char *name, int af,
int flags, int *error_num);
[...]
Now to check what hostname is being resolved,
# dtrace -n 'pid$target:libnsl:getipnodebyname:entry { trace(copyinstr(arg0)); }'
-c gnome-terminal
dtrace: description 'pid$target:libnsl:getipnodebyname:entry ' matched 1 probe
CPU ID FUNCTION:NAME
0 50488 getipnodebyname:entry marlin
^C
dtrace: pid 101758 terminated by SIGINT
Oh - "marlin" is the hostname of this laptop, which should resolve just fine. Checking related configuration files,
# grep hosts /etc/nsswitch.conf
hosts: files dns
#
# cat -n /etc/hosts
1 #
2 # Copyright 2006 Sun Microsystems, Inc. All rights reserved.
3 # Use is subject to license terms.
4 #
5 #ident "@(#)hosts 1.1 06/08/04 SMI"
6 #
7 127.0.0.1 localhost
8 192.168.1.166 marlin marlin.sf.fw.jpn.com
Hmm - the name service switch file should cause /etc/hosts (/etc/inet/hosts) to be read first, which contains a valid entry for "marlin".Lets take a closer look at the getipnodebyname() call, and trace the address family and flags arguments,
# dtrace -n 'pid$target:libnsl:getipnodebyname:entry { printf("%s af=%d flags=%d",
copyinstr(arg0), arg1, arg2); }' -c gnome-terminal
dtrace: description 'pid$target:libnsl:getipnodebyname:entry ' matched 1 probe
CPU ID FUNCTION:NAME
0 50488 getipnodebyname:entry marlin af=26 flags=19
^C
dtrace: pid 101821 terminated by SIGINT
Ok, address family 26 is,
# grep 26 /usr/include/sys/socket.h #define AF_INET6 26 /* Internet Protocol, Version 6 */IPv6! Should have checked earlier,
# grep ipnodes /etc/nsswitch.conf
ipnodes: files dns
#
# cat -n /etc/inet/ipnodes
1 #
2 # Copyright 2006 Sun Microsystems, Inc. All rights reserved.
3 # Use is subject to license terms.
4 #
5 #ident "@(#)ipnodes 1.1 06/08/04 SMI"
6 #
7 ::1 localhost
8 127.0.0.1 localhost
Adding an entry to the ipnodes file for "marlin" fixed the issue.
Take 2
Was using the sched provider and then pid to trace getipnodebyname() the best path to take? It's hard to say - there are many different ways DTrace can help you solve problems. Lets try some other paths.Here I'll trace elapsed times from the syscall provider - since if we have an off-CPU issue on an idle system, it will almost certainly be visible at the syscall layer.
The procsystime tool from the DTraceToolkit performs different types of syscall time analysis. I'm using it as it saves a minute or so of typing. Here it runs the gnome-terminal command and produces a report of elapsed times by syscall,
# /opt/DTT/procsystime -e gnome-terminal
Elapsed Times for command gnome-terminal,
SYSCALL TIME (ns)
gtime 1882
sigpending 1921
priocntlsys 4200
sysi86 4354
getgid 5152
sysconfig 5606
getsockname 5904
systeminfo 7040
setcontext 7342
fxstat 7943
getpeername 9393
getrlimit 11717
getuid 13927
uname 14661
sigaction 14698
getpid 19241
setsockopt 23743
getcwd 33784
fsat 41700
stat64 43024
readv 45937
listen 49874
llseek 52341
mkdir 52633
pipe 64612
chmod 65684
fcntl 70864
utime 76677
fstat64 109420
ioctl 133301
access 157803
unlink 165703
open64 221120
accept 276863
bind 281026
writev 343976
brk 344410
memcntl 347475
write 358764
getdents64 422206
munmap 486137
connect 588166
resolvepath 686768
so_socket 748039
close 846782
open 1127761
read 1191102
mmap 1592314
xstat 8521257
pollsys 9817661
doorfs 20035299808
The doorfs() call takes the door descriptor as the first argument,
# dtrace -n 'syscall::doorfs:entry /pid == $target/ { trace(arg0); }' -c gnome-terminal
dtrace: description 'syscall::doorfs:entry ' matched 1 probe
CPU ID FUNCTION:NAME
0 572 doorfs:entry 3
0 572 doorfs:entry 3
0 572 doorfs:entry 3
0 572 doorfs:entry 3
^C
dtrace: pid 101833 terminated by SIGINT
#
# dtrace -n 'syscall::doorfs:entry /pid == $target/ { trace(fds[arg0].fi_pathname); }'
-c gnome-terminal
dtrace: description 'syscall::doorfs:entry ' matched 1 probe
CPU ID FUNCTION:NAME
0 572 doorfs:entry /var/run/name_service_door
0 572 doorfs:entry /var/run/name_service_door
0 572 doorfs:entry /var/run/name_service_door
0 572 doorfs:entry /var/run/name_service_door
^C
dtrace: pid 101835 terminated by SIGINT
The door calls were to /var/run/name_service_door, pointing to name resolution and the nscd process (which can be confirmed with more DTrace).Running procsystime on nscd produces,
# /opt/DTT/procsystime -e -n nscd
Hit Ctrl-C to stop sampling...
^C
Elapsed Times for processes nscd,
SYSCALL TIME (ns)
ioctl 2452
llseek 2615
fstat64 3996
sysconfig 13933
open 15416
read 26336
gtime 52980
so_socket 323066
close 349885
xstat 430465
send 454944
connect 795246
doorfs 360127347
lwp_park 10009952745
pollsys 20032365315
nanosleep 153088010532
nscd is multi-threaded, with many threads sleeping as they wait for work - meaning that large off-CPU elapsed times may have nothing to do with gnome-terminal. The time for pollsys() is interesting, as at 20 seconds it matches the measured time in gnome-terminal.Analysing pollsys() further (the poll() manpage is similar),
# man poll
[...]
int poll(struct pollfd fds[], nfds_t nfds, int timeout);
[...]
The first argument is an array with length specified by the second argument. Looping over an array is difficult from DTrace (due to lack of loops), but this doesn't really present a difficulty when troubleshooting. Here I'll dump the struct in hex, and the other values as ints,
# dtrace -n 'syscall::pollsys:entry /execname == "nscd"/ { printf("nfds=%d", arg1);
tracemem(copyin(arg0, 16), 16); }' -c gnome-terminal
dtrace: description 'syscall::pollsys:entry ' matched 1 probe
CPU ID FUNCTION:NAME
0 542 pollsys:entry nfds=1
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
0: 05 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00 ....@...........
0 542 pollsys:entry nfds=1
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
0: 05 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00 ....@...........
0 542 pollsys:entry nfds=1
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
0: 05 00 00 00 40 00 42 fd 0b 7b ab 46 f0 d7 c0 0d ....@.B..{.F....
0 542 pollsys:entry nfds=1
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
0: 05 00 00 00 40 00 00 00 0b 7b ab 46 f0 d7 c0 0d ....@....{.F....
dtrace: pid 102041 has exited
Don't worry - DTrace lets you cast variables as structs if you like; I'm just dumping the data in hex as this is a short one-liner.The nfds value shows that there is only ever one file descriptor in the array; and the hex dump (on this little endian architecture) shows that the file descriptor number is "5" (the first member of struct pollfd).
Now DTrace is used to print the pathname for that file descriptor,
# dtrace -n 'syscall::pollsys:entry /execname == "nscd"/
{ trace(fds[*(uint32_t *)copyin(arg0, 4)].fi_pathname); }' -c gnome-terminal
dtrace: description 'syscall::pollsys:entry ' matched 1 probe
CPU ID FUNCTION:NAME
0 542 pollsys:entry <unknown>
0 542 pollsys:entry <unknown>
^C
dtrace: pid 102063 terminated by SIGINT
Well, that didn't work. Not all file descriptors have pathnames, of course. It will save some time to borrow pfiles for the next step,
# pfiles `pgrep -x nscd`
101917: /usr/sbin/nscd
Current rlimit: 256 file descriptors
0: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
O_RDWR
/devices/pseudo/mm@0:null
1: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
O_RDWR
/devices/pseudo/mm@0:null
2: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
O_RDWR
/devices/pseudo/mm@0:null
3: S_IFDOOR mode:0777 dev:279,0 ino:0 uid:0 gid:0 size:0
O_RDWR FD_CLOEXEC door to nscd[101917]
4: S_IFSOCK mode:0666 dev:277,0 ino:17874 uid:0 gid:0 size:0
O_RDWR
SOCK_RAW
SO_SNDBUF(8192),SO_RCVBUF(8192)
sockname: AF_ROUTE
peername: AF_ROUTE
5: S_IFSOCK mode:0666 dev:277,0 ino:53988 uid:0 gid:0 size:0
O_RDWR
SOCK_DGRAM
SO_DGRAM_ERRIND,SO_SNDBUF(57344),SO_RCVBUF(57344)
sockname: AF_INET 192.168.1.78 port: 51697
peername: AF_INET 192.168.1.5 port: 53
File descriptor 5 was a socket to the remote host 192.168.1.5 port 53 (DNS). This shows that nscd is waiting for 20 seconds on a DNS socket, and we know that gnome-terminal has a 20 second wait on the /var/run/name_service_door file. They are probably related.Proving that they are related, if needed, is simply more DTrace. This time I'll borrow a script from /usr/demo/dtrace, which shows who is waiting for nscd and how long for,
# dtrace -s /usr/demo/dtrace/nscd.d -c gnome-terminal
dtrace: script '/usr/demo/dtrace/nscd.d' matched 27 probes
dtrace: pid 101594 has exited
nscd gnome-session
value ------------- Distribution ------------- count
16384 | 0
32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
65536 | 0
nscd gnome-terminal
value ------------- Distribution ------------- count
16384 | 0
32768 |@@@@@@@@@@ 1
65536 |@@@@@@@@@@@@@@@@@@@@ 2
131072 | 0
262144 | 0
524288 | 0
1048576 | 0
2097152 | 0
4194304 | 0
8388608 | 0
16777216 | 0
33554432 | 0
67108864 | 0
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 0
2147483648 | 0
4294967296 | 0
8589934592 | 0
17179869184 |@@@@@@@@@@ 1
34359738368 | 0
In the 17.1 to 34.3 second bucket is gnome-terminal waiting for nscd - our 20 second wait.The /usr/demo/dtrace/nscd.d script works by tracing who is nscd is waking up via the sched:::wakeup probe, after having traced the time when threads sleep (see the "sched" chapter in the Dynamic Tracing Guide on docs.sun.com).
The wrap
DTrace provides many ways to solve performance issues or for troubleshooting. Above were only a few techniques for analysing off-CPU time, but there are more available.
Some techniques can get very complex, and require much systems knowledge. Don't worry: if you only figure out one way to solve your problem - you've still solved your problem. (Systems knowledge should help you solve system problems faster).
Posted by brendan
[DTrace] ( July 29, 2007 10:39 PM )
Permalink
| Comments[1]
SVOSUG - DTrace
I'll be presenting a number of DTrace Topics at SVOSUG tonight, April 26th 2007. Alan posted the announcement here.I'll be using two presentations (fairly quickly), which are linked at the bottom of these pages:
DTrace Topics: Intro
DTrace Topics: DTraceToolkit
Plus demonstrating various scripts. Whoever can make it - I'll see you soon.
Posted by brendan
[Sun] ( April 26, 2007 04:49 PM )
Permalink
| Comments[0]
Audio Volume CLI
While Solaris has a fancy GNOME based desktop called JDS3, other desktop environments are available and work fine. FVWM2 is a fast alternative with a modest set of features, and is available on the Software Companion CD (which you may already have a copy of).If you have installed fvwm2 from the companion CD and would like to try it, the easiest way is to enter a fail safe session from the login screen, then run the binary - /opt/sfw/bin/fvwm2. The proper way is to create config files under /etc/dt/config, so that the login screen provides FVWM as an option.
After getting fvwm2 running, I found my volume up/down/mute keys on this Sun type 7 keyboard didn't work. An internet search didn't find any solutions. To get these keys to work, I wrote a short C program to ioctl /dev/audioctl, and added some lines to the .fvwmrc file. I'm writing this quick blog entry to help the next person doing the same Internet search. If there is a better way to do this in Solaris already (like a shipped binary), I missed it!
This is the C program,
/* volumeset.c - set Sun's /dev/audio play volume */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/audioio.h>
void
usage(char *name)
{
(void) printf("USAGE: %s [+|-]volume_percent\n", name);
(void) printf(" eg,\n");
(void) printf(" %s 100 # maximum volume\n", name);
(void) printf(" %s +5 # plus 5 percent\n", name);
exit(1);
}
int
main(int argc, char *argv[])
{
audio_info_t ai;
int fd, vol, mod, gain;
if (argc < 2)
usage(argv[0]);
switch (argv[1][0]) {
case '+':
mod = 1;
vol = atoi(&argv[1][1]);
break;
case '-':
mod = -1;
vol = atoi(&argv[1][1]);
break;
case '0'...'9':
mod = 0;
vol = atoi(argv[1]);
if (vol > 100 || vol < 0) {
(void) printf("ERROR: volume must be "
"between 0 and 100.\n");
exit(4);
}
break;
default:
usage(argv[0]);
}
if (mod != 0 && vol == 0)
usage(argv[0]);
if ((fd = open("/dev/audioctl", O_RDONLY)) == -1) {
(void) perror("can't open /dev/audioctl");
exit(2);
}
if (ioctl(fd, AUDIO_GETINFO, &ai) == -1) {
(void) perror("fetching audio state failed");
exit(3);
}
if (mod == 0)
gain = (vol * 255) / 100;
else
gain = ai.play.gain + (mod * vol * 255) / 100;
if (gain < 0)
gain = 0;
if (gain > 255)
gain = 255;
ai.play.gain = gain;
ai.output_muted = gain == 0 ? 1 : 0;
if (ioctl(fd, AUDIO_SETINFO, &ai) == -1) {
(void) perror("setting audio state failed");
exit(4);
}
(void) close(fd);
return (0);
}
If you don't have Sun's C compiler installed, you can compile it using /usr/sfw/bin/gcc -o volumeset volumeset.c.The following are the lines I added to ~/.fvwm/.fvwm2rc to bind the audio keys on the top left to the volumeset program (copied to /usr/local/bin); these bindings probably work for type 6 keyboards as well (haven't tried),
Key SunAudioMute A A Exec /usr/local/bin/volumeset 0 Key SunAudioLowerVolume A A Exec /usr/local/bin/volumeset -15 Key SunAudioRaiseVolume A A Exec /usr/local/bin/volumeset +15The above lines bind the keys to mute the volume, decrease by 15% or increase by 15% (it may be better to make the mute behave as a toggle, rather than always mute). After restarting fvwm, my audio keys now work fine.
Posted by brendan [Solaris] ( March 05, 2007 04:07 PM ) Permalink | Comments[5]
DTrace on SDNtv
I was recently interviewed about DTrace for SDNtv, Sun's Developer Network Channel. You can watch the video in a browser or download it and watch it later.For the interview we discussed what DTrace is, what DTrace can do, and I ran the colortrace demo. It was a short interview aimed at introducing people to DTrace for the first time.
I found SDNtv itself quite interesting: the host I had, Mark Herring, was technically savvy and both understood DTrace and knew what best to ask. It was also filmed in one take - that was a big suprise - I've always assumed such interviews are cut together creatively, but this was a real interview filmed in one shot.
But what is especially interesting is the idea of videoing Sun engineers as they discuss their own products, and making those videos publically available. Not that long ago (6 years?), I remember being a Sun customer in Australia who was trying to learn more about Solaris and the kernel, and in particular (being a code geek and a Unix history geek) I wanted to know who wrote what and why. There was very little public information about Sun's engineers - it seemed that the very names of the engineers were confidential. And, growing up in Australia, there were no conferences to visit where Sun engineers gave presentations, like there are in the US.
When Bryan first emailed me in early 2004 suggesting I check out something new called DTrace, I had no idea who Bryan was. I remember staring at his sig and thinking "I wish I knew who this guy was". A quick Internet search didn't help much (try that now!).
Times have changed. Engineers now have blogs, and discuss source code on OpenSolaris mailing lists. And SDNtv lets you see and hear them speak!
Posted by brendan
[DTrace] ( February 27, 2007 08:20 PM )
Permalink
| Comments[0]
AMD64 PICs, CPI
I recently wanted to gather some numbers on CPU and memory system performance, for AMD64 CPUs. I reached a point where I searched the Internet for other Solaris AMD64 PIC (Performance Instrumentation Counters) analysis, and found little. I hope to improve this with some blog entries. In this part I'll introduce PIC observability, and demonstrate measuring CPI (cycles per instruction) for different workloads. To see why PICs are important, the following are the sort of questions that PIC analysis can answer:- What is the Level 2 cache hit rate?
- What is the Level 2 cache miss volume?
- What is the hit rate and miss volume for the TLB?
- What is my memory bus utilization?
Question 3 concerns a component of the memory management unit - the translation lookaside buffer (TLB). This processes and caches virtual to physical memory page translations. It can consume a lot of CPU (the worst I've seen is 60%), and it can be tuned. A good document for understanding this further is Taming Your Emu by Richard McDougall.
Question 4 seems obvious - the memory bus can be a bottleneck for system performance, so, how utilized is it? Answering this isn't easy, but it is usually possible by examining CPU PICs.
cpustat
There are many AMD64 CPU PICs available, which can be viewed using tools such as cpustat and cputrack. Running cpustat -h dumps the list:# cpustat -h
Usage:
cpustat [-c events] [-p period] [-nstD] [interval [count]]
-c events specify processor events to be monitored
-n suppress titles
-p period cycle through event list periodically
-s run user soaker thread for system-only events
-t include tsc register
-D enable debug mode
-h print extended usage information
Use cputrack(1) to monitor per-process statistics.
CPU performance counter interface: AMD Opteron & Athlon64
event specification syntax:
[picn=][,attr[n][=]][,[picn=][,attr[n][=]],...]
event[0-3]: FP_dispatched_fpu_ops FP_cycles_no_fpu_ops_retired
FP_dispatched_fpu_ops_ff LS_seg_reg_load
LS_uarch_resync_self_modify LS_uarch_resync_snoop
LS_buffer_2_full LS_locked_operation LS_retired_cflush
LS_retired_cpuid DC_access DC_miss DC_refill_from_L2
DC_refill_from_system DC_copyback DC_dtlb_L1_miss_L2_hit
DC_dtlb_L1_miss_L2_miss DC_misaligned_data_ref
DC_uarch_late_cancel_access DC_uarch_early_cancel_access
DC_1bit_ecc_error_found DC_dispatched_prefetch_instr
DC_dcache_accesses_by_locks BU_memory_requests
BU_data_prefetch BU_system_read_responses
BU_quadwords_written_to_system BU_cpu_clk_unhalted
BU_internal_L2_req BU_fill_req_missed_L2 BU_fill_into_L2
IC_fetch IC_miss IC_refill_from_L2 IC_refill_from_system
IC_itlb_L1_miss_L2_hit IC_itlb_L1_miss_L2_miss
IC_uarch_resync_snoop IC_instr_fetch_stall
IC_return_stack_hit IC_return_stack_overflow
FR_retired_x86_instr_w_excp_intr FR_retired_uops
FR_retired_branches_w_excp_intr
FR_retired_branches_mispred FR_retired_taken_branches
FR_retired_taken_branches_mispred
FR_retired_far_ctl_transfer FR_retired_resyncs
FR_retired_near_rets FR_retired_near_rets_mispred
FR_retired_taken_branches_mispred_addr_miscop
FR_retired_fpu_instr FR_retired_fastpath_double_op_instr
FR_intr_masked_cycles FR_intr_masked_while_pending_cycles
FR_taken_hardware_intrs FR_nothing_to_dispatch
FR_dispatch_stalls
FR_dispatch_stall_branch_abort_to_retire
FR_dispatch_stall_serialization
FR_dispatch_stall_segment_load
FR_dispatch_stall_reorder_buffer_full
FR_dispatch_stall_resv_stations_full
FR_dispatch_stall_fpu_full FR_dispatch_stall_ls_full
FR_dispatch_stall_waiting_all_quiet
FR_dispatch_stall_far_ctl_trsfr_resync_branc_pend
FR_fpu_exception FR_num_brkpts_dr0 FR_num_brkpts_dr1
FR_num_brkpts_dr2 FR_num_brkpts_dr3
NB_mem_ctrlr_page_access NB_mem_ctrlr_page_table_overflow
NB_mem_ctrlr_turnaround
NB_mem_ctrlr_bypass_counter_saturation NB_ECC_errors
NB_sized_commands NB_probe_result NB_gart_events
NB_ht_bus0_bandwidth NB_ht_bus1_bandwidth
NB_ht_bus2_bandwidth NB_sized_blocks NB_cpu_io_to_mem_io
NB_cache_block_commands
attributes: edge pc inv cmask umask nouser sys
See Chapter 10 of the "BIOS and Kernel Developer's Guide for the
AMD Athlon 64 and AMD Opteron Processors," AMD publication #26094
There are around fifty names above such as "FP_dispatched_fpu_ops" which describe the PICs available. On my AMD Opteron CPUs you can measure four of these at a time, which can be provided in the arguments to cpustat, eg,
# cpustat -c IC_fetch,DC_access,DC_dtlb_L1_miss_L2_hit,DC_dtlb_L1_miss_L2_miss 0.25 time cpu event pic0 pic1 pic2 pic3 0.257 0 tick 6406429 8333198 45826 5515 0.257 1 tick 3333442 3942694 24682 4409 0.507 1 tick 6450964 8229104 44046 5713 0.507 0 tick 2359697 2828683 14365 4415 0.757 0 tick 2490406 3060416 16458 4901 0.757 1 tick 7292986 9530806 68956 6490 1.007 0 tick 2514008 3063049 15037 3863 1.007 1 tick 6057048 7747580 42415 6083 ^CIn the above example I printed four PICs every 0.25 seconds, for each CPU (I'm on a 2 x virtual CPU server). The CPU column shows that the output is slightly shuffled - a harmless side effect from the way cpustat was coded (it pbinds a libcpc consumer onto each CPU in the available processor set, and all threads write to STDOUT in any order). These PICs are provided by programmable hardware registers - so there is no ideal way around the four-at-a-time limit. You can shuffle measurements between different sets of PICs, which cpustat supports.
Reference Documentation
Since different CPUs provide different PICs, the guide mentioned at the bottom of the cpustat -h output will list what PICs your CPU type provides. It is important to read these guides carefully - for example, PICs that track cache misses may have some exceptions to what is considered a "miss".I spent a while with AMD's #26094 guide, but I found that the PIC descriptions raise more questions than answers. (try to find basics such as "instruction count")... If you find yourself in a similar situation, it can help to create known workloads and then examine which metrics move by a similar amount. I used this approach to confirm what PICs provided cycle counts and instruction counts.
I did eventually find two good resources on AMD PICs,
- Basic Performance Measurements by Paul J. Drongowski of AMD - a summary of the most important PICs and suggestions for interpretation.
- Opteron Hardware Performance Counters by Richard Smith of Sun - a crash course in AMD Architecture from a PICs perspective.
If you are interested in PIC analysis for any CPU type, see chapter 8 "Performance Counters" in Solaris Performance and Tools, by Richard McDougall, Jim Mauro and myself. One of the metrics we made sure to include in the book was CPI (cycles per instruction), as it proves to be a useful starting point for understanding CPU behavior.
Example - CPI
The cycles per instruction metric (sometimes measured as IPC - instructions per cycle) is a useful ratio and (depending on CPU type) fairly easy to measure. If the measured CPI ratio is low, more instructions can be dispatched in a given time, which usually means higher performance. High CPI means instructions are stalling, usually on main memory bus activity.The output of cpustat can be formatted with a little scripting; the following script "amd64cpiu" uses a little shell and Perl to aggregate and print the output:
#!/usr/bin/sh
#
# amd64cpiu - measure CPI and Utilization on AMD64 processors.
#
# USAGE: amd64cpiu [interval]
# eg,
# amd64cpiu 0.1 # for 0.1 second intervals
#
# CPI is cycles per instruction, a metric that increases due to activity
# such as main memory bus lookups.
#
# ident "@(#)amd64cpiu.sh 1.1 07/02/17 SMI"
interval=${1:-1} # default interval, 1 second
set -- `kstat -p unix:0:system_misc:ncpus` # assuming no psets,
cpus=$2 # number of CPUs
pics='BU_cpu_clk_unhalted' # cycles
pics=$pics,'FR_retired_x86_instr_w_excp_intr' # instructions
/usr/sbin/cpustat -tc $pics $interval | perl -e '
printf "%16s %8s %8s\n", "Instructions", "CPI", "%CPU";
while (<>) {
next if ++$lines == 1;
split;
$total += $_[3];
$cycles += $_[4];
$instructions += $_[5];
if ((($lines - 1) % '$cpus') == 0) {
printf "%16u %8.2f %8.2f\n", $instructions,
$cycles / $instructions, $total ?
100 * $cycles / $total : 0;
$total = 0;
$cycles = 0;
$instructions = 0;
}
}
'
This script prints a column for CPI and for percent CPU utilization. I've used the PICs that were suggested in the AMD article - and from testing they do appear to be the best ones for measuring CPI.Here amd64cpiu is used to examine a CPU bound workload of fast register based instructions,
# ./amd64cpiu.sh
Instructions CPI %CPU
16509657954 0.34 97.56
16550162001 0.33 98.54
16523746049 0.34 98.41
16510783100 0.34 98.32
16497135723 0.34 98.29
^C
The CPI is around 0.34. This is the maximum to be expected from the AMD64 architecture, which attempts to run three instructions per clock cycle.Now for a memory bound workload of sequential 1 byte memory reads,
# ./amd64cpiu.sh
Instructions CPI %CPU
4883935299 1.12 97.60
4852961204 1.12 97.03
4884120645 1.13 97.69
4898818096 1.12 97.92
4895064839 1.12 97.80
^C
Things are starting to become slower due to the extra overhead of memory requests. Many reads will satisfy from the level 1 cache, some from the slower level 2 cache, and occasionally a cache line will be read from main memory. This additional overhead slows us to 1.13 CPI, and we are running fewer instructions for the same %CPU.Watch what happens when our memory workload performs 1 byte scattered reads (100 Kbytes apart),
# ./amd64cpiu.sh
Instructions CPI %CPU
653300388 8.53 98.36
648496314 8.53 98.37
644163952 8.54 97.75
648941939 8.53 98.35
648507176 8.53 98.37
^C
Many of the reads will not be in the CPU caches, and so now most are requiring a memory bus lookup. Our CPI is around 8.53, some 25 times slower than register based CPU instructions. Our %CPU is still around the same, but this buys us fewer instructions in total.As you can see, CPI is shedding light on memory bus activity - it's very cool, and from such a simple metric.
Now for a real application: Here I watch as Sun's C compiler chews through a source tree,
# ./amd64cpiu.sh
Instructions CPI %CPU
2624028943 1.26 58.52
2992167837 1.19 63.17
2327129316 1.26 52.08
2046997158 1.27 46.14
2414376864 1.23 52.80
3305351199 1.23 70.72
^C
That's not so bad - any memory access instructions must be hitting caches fairly often (something that we can confirm by measuring other PICs).Beware of output such as the following:
# ./amd64cpiu.sh
Instructions CPI %CPU
22695257 1.82 0.73
22197894 1.75 0.69
49626271 2.16 1.90
102731779 2.21 4.04
104795796 1.49 2.78
^C
The CPUs are fairly idle (less than 5% utilization), and so CPI is less likely to be useful to indicate performance issues.
Suggested Actions - CPI
While many PICs produce interesting measurements, it's much more useful if there is some action we can take based on the results. The following is a list of suggestions based on CPI.Firstly, to be even considering this list you need to have a known and measured performance issue. If one or more CPUs are 100% busy, then that may be a performance issue and it can be useful to check CPI; if your CPUs are idle, then it probably won't be useful to check. As for measured performance issue - it can be especially helpful to be able to quantify an issue, eg average latency is 150 ms; tools such as DTrace can take these measurements.
- Measure other PICs. CPI is a high level metric, and there are many other PICs that will explain why CPI is high or low, such as cache hits and misses, TLB hits and misses, and memory locality.
- If CPI is low,
- Examine the application for unnecessary CPU work (eg, using DTrace)
- Get faster CPUs
- Get more CPUs
- If CPI is high,
- Examine the application for unnecessary memory work
- Recompile your application with optimization, and with Sun's C compiler
- Consider using processor sets to improve memory locality
- Get CPUs with larger caches
- Test different CPU architectures (multi-core/multi-thread may improve performance as bus distance is shorter)
Posted by brendan [Performance] ( February 27, 2007 06:51 PM ) Permalink | Comments[1]
colortrace
I was recently asked to give a very short but important DTrace demonstration to a wide audience, where I would only have time to run and discuss one script. It is a tough choice - there are so many to pick from, including the 100+ scripts and oneliners in the DTraceToolkit. Also, a short text-based demo may not convey the incredible power of DTrace - if you aren't paying attention it may look like just another tool; DTrace is far more than that.
What a great excuse to use some color escape sequences (yes, I've wanted to do this since I started writing the DTraceToolkit). With a colorized output it won't look like just another tool, and I can draw appropriate attention to a key DTrace feature.
I wrote colortrace.d (or colourtrace.d, for readers from the Commonwealth), which prints a flow indented trace of function entrys and returns for the entire softwark stack. Different layers of the softwark stack have been colorized:
- blue - application functions
- green - library functions
- violet - system calls
- red - kernel functions
I added a column to print delta times in nanoseconds, to illustrate that yes - we can measure whatever times we want. When Adam previously performed a similar demonstration, he added a color for a Java function layer.
Here I run colortrace.d on a bash shell, and while colortrace.d is running I press the letter "t" on the bash shell:
# ./colortrace.d -p 101695 DELTA(ns) TYPE FUNCTION 120805042 syscall <- read 34461 libc.so.1 <- _read 13474 libc.so.1 <- read 8387 bash <- rl_getc 12899 bash <- rl_read_key 10044 bash -> _rl_dispatch 11760 bash -> _rl_dispatch_subseq 7189 bash -> rl_insert 7020 bash -> _rl_insert_char 6777 bash -> _rl_any_typein 8922 bash <- _rl_any_typein 7489 bash -> rl_insert_text 9230 libc.so.1 -> strlen 12040 libc.so.1 <- strlen 7362 libc.so.1 -> strncpy 6956 libc.so.1 <- strncpy 6996 bash <- rl_insert_text 7095 bash <- _rl_insert_char 6069 bash <- rl_insert 8117 bash <- _rl_dispatch_subseq 6484 bash <- _rl_dispatch 6992 bash -> rl_redisplay 5764 libc.so.1 -> strlen 6577 libc.so.1 <- strlen 5356 libc.so.1 -> strncpy 6249 libc.so.1 <- strncpy 6992 libc.so.1 -> memset 7572 libc.so.1 <- memset 7749 libc.so.1 -> strlen 6118 libc.so.1 <- strlen 6627 bash -> _rl_move_cursor_relative 6573 bash <- _rl_move_cursor_relative 7533 bash -> _rl_output_some_chars 7458 libc.so.1 -> fwrite 6335 libc.so.1 -> _fwrite_unlocked 7085 libc.so.1 -> _realbufend 6167 libc.so.1 -> getxfdat 6250 libc.so.1 <- getxfdat 6490 libc.so.1 <- _realbufend 6306 libc.so.1 <- _fwrite_unlocked 6115 libc.so.1 <- fwrite 6488 bash <- _rl_output_some_chars 7628 libc.so.1 -> fflush 6317 libc.so.1



