It's been a while coming but it's here at last in build 109 of Solaris Nevada; the DTrace CPU Performance Counter provider (or 'cpc' provider for short). The cpc provider gives you the ability to profile your system by many different types of processor related events; the list of events is processor specific and usually quite large but typically includes events such as cycles executed, instructions executed, cache misses, TLB misses and many more. From a high level view, the cpc provider is essentially the same as the profile provider except that, where the profile provider allows you to profile a system using a fixed, time based source, the cpc provider lets you profile by events related to processor activity.
There's a bunch of things around the implementation that I'd like to talk about but that can probably wait for another day as we'll ease ourselves gently into the world of the cpc provider with some simple examples. Additional sources of information that you may find useful are the DTrace documentation and the cpc provider ARC case one pager. Note that most of the examples I give here are focused on two of the more common event types, instructions or cycles executed. However, we could have just as easily profiled behaviour around the cache hierarchy (l1/l2/l3 hits/misses) for example - we'd merely need to specify different event types to look at.
Anyway, let's start with a simple example to see it in action:
Example 1: User Mode Instruction Profile by Executable
#cat userexecbyinst.d
cpc:::PAPI_tot_ins-user-5000
{
@[execname] = count();
}
END
{
trunc(@, 10);
printa("%-30s %@d\n", @);
}
# dtrace -qs ./userexecbyinst.d
^C
gnome-panel 3760
soffice.bin 5952
nautilus 15255
gnome-terminal 23363
metacity 29627
wnck-applet 36635
isapython2.4 88264
thunderbird-bin 119269
firefox-bin 319431
Xorg 399781
Here we have a profile of our system which looks at executed user mode instructions. The specification of what we wanted to profile is contained in the probename of the cpc probe. The probename is made up of the following, hyphen delimited, fields:
- Event Name - here we are using the generic name, PAPI_tot_ins, to refer to total instructions executed. See my previous post for more on generic performance counter events. A full list of supported event names for the platform you are on can be found by running the `cpustat -h` command.
- Mode - specifies which operating mode we are interested in profiling. In this example we are interested in user mode instructions so we specified 'user'. We could also specify 'kernel' for kernel mode events or 'all' for bother user and kernel mode events.
- Optional Mask - some events can take an optional mask to further subdivide the event. We aren't using one here.
- Count - specifies the number of events that need to occur on a CPU before a probe fires on that CPU. Here we specified that we want a probe to fire every 5000 events (5000 user mode instructions in this case).
So, let's get a bit fruitier and take a different view of things. Here's a system wide view of user mode TLB misses by library:
Example 2: User Mode TLB miss Profile by Library
#cat usertlbmiss.d
#!/usr/sbin/dtrace -qs
cpc:::PAPI_tlb_dm-user-5000
{
@[umod(arg1)] = count();
}
END
{
trunc(@, 10);
printa("%-50A %10@d\n", @);
}
# dtrace -qs ./usertlbmiss.d
^C
libproc.so.1 274
libjemalloc.so 454
libcairo.so.2.20.0 739
libglib-2.0.so.0.1800.0 791
libmozjs.so 964
firefox-bin 1026
libjemalloc.so 1443
libc.so.1 1740
libxul.so 5651
Xorg 6766
Note that probes from the cpc provider export the same arguments as the profile provider. arg0 will be non zero if the event occured while operating in kernel mode. arg1 will be non zero id the event occured while we were operating in user mode.
Example 3: Selective User Mode Cycle Count Profile by Function
As this is DTrace, we can use predicates to prune the data we capture. Let's profile a couple of applications from a particular user but tyhis time profile them by the number of cycles used:
# cat ./userbyappbycycles.d
cpc:::PAPI_tot_cyc-user-5000
/(execname == "firefox-bin" || execname == "thunderbird-bin") && uid == 101/
{
@[ufunc(arg1)] = count();
}
END
{
trunc(@, 10);
printa("%-60A %@10d\n", @);
}
# dtrace -qs ./userbyappbycycles.d
^C
libc.so.1`mutex_unlock 1544
libfontconfig.so.1`FcFontMatch 1671
libc.so.1`clear_lockbyte 1754
libgfx_gtk.so`__1cKnsImageGTKODrawComposited6MrnTnsIRenderingContext_pnRnsIDrawingSurface_iiiiiiiiii_v_ 1987
libc.so.1`mutex_unlock_queue 2324
libc.so.1`mutex_lock_impl 2684
libgklayout.so`__1cPSelectorMatches6FrnRRuleProcessorData_pnNnsCSSSelector_ipnHnsIAtom_W_i_ 3755
libgfx_gtk.so`__1cKnsImageGTKRDrawCompositeTile6MrnTnsIRenderingContext_pnRnsIDrawingSurface_iiiiiiii_v_ 4597
libc.so.1`mutex_lock_impl 5280
libc.so.1`clear_lockbyte 5313
Just bare in mind that all the other user mode application on the system are contributing to the event count here. If you wanted to only profile events from a particular set of applications then you would need to put them in there own processor set to make sure nothing else contributes to the event count.
Example 4: Kernel mode instruction profile by module
Now we'll look at some examples of kernel profiling. I'm using filebench to generate some workload on top of a ZFS filesystem so these examples tend to focus on zfs. To start with, let's look at a kernel profile of which modules are executing the most instructions:
# cat kernbymod.d
cpc:::PAPI_tot_ins-kernel-5000
/arg0 != 0/
{
@[mod(arg0)] = count();
}
END
{
trunc(@, 10);
printa("%-50a %@10d\n", @);
}
# dtrace -qs ./kernbymod.d
^C
dtrace 348
ata 625
specfs 753
rootnex 880
kcf 1551
sha1 5704
procfs 6097
unix 158137
genunix 161461
zfs 575552
Not surprisingly, as we're running an intensive workload on top of a ZFS filesystem, zfs is at the top of the list. It may be interesting to break down the cycle usage within the zfs module so we'll do that next.
Example 5: Kernel mode cycle profile: functions from specified module
# cat kernmodbyfunc.d
#!/usr/sbin/dtrace -s
/*
* Simple script to profile all functions within a given module ('zfs' is
* the lucky chap here). The module range is just extracted from modinfo(1M)
* output.
*/
inline int ZFS_MOD_START = 0xfffffffff77fb000;
inline int ZFS_MOD_END = 0xfffffffff78c7378;
cpc:::PAPI_tot_cyc-kernel-5000
/arg0 >= ZFS_MOD_START && arg0 < ZFS_MOD_END/
{
@[func(arg0)] = count();
}
END
{
trunc(@, 10);
printa("%-50a %@d\n", @);
}
# dtrace -qs ./kernmodbyfunc.d
^C
zfs`arc_write_ready 1459
zfs`arc_write_done 1472
zfs`zio_done 1512
zfs`refcount_is_zero 1836
zfs`zio_execute 1918
zfs`arc_cksum_verify 2013
zfs`dbuf_verify 2167
zfs`lzjb_compress 3356
zfs`zil_commit_writer 3706
zfs`fletcher_2_native 497284
Example 6: Kernel mode cycle profile of a given function
Taking the example above, we may now want to drill down on the fletcher_2_native() function to see the distribution of samples within the function to try and gain any insights into hotspots. The following script may look a bit hacky but it does the job of ensuring that we only store the PC for any events which were in this function.
# cat kernfunc.d
cpc:::PAPI_tot_cyc-kernel-5000
/arg0 >= (uint64_t)&(`fletcher_2_native)
&& arg0 < ((uint64_t)&(`fletcher_2_native) + sizeof(`fletcher_2_native))/
{
@[arg0] = count();
}
END
{
printa("%a %@d\n", @);
}
# dtrace -qs ./kernfunc.d
^C
zfs`fletcher_2_native+0xc 1
zfs`fletcher_2_native+0x10 1
zfs`fletcher_2_native+0x17 1
zfs`fletcher_2_native+0x1b 1
zfs`fletcher_2_native+0x42 1
zfs`fletcher_2_native+0x46 1
zfs`fletcher_2_native+0x49 1
zfs`fletcher_2_native+0x5d 1
zfs`fletcher_2_native+0xa6 1
zfs`fletcher_2_native+0xce 1
zfs`fletcher_2_native+0xd2 1
zfs`fletcher_2_native+0x53 2
zfs`fletcher_2_native+0x5b 2
zfs`fletcher_2_native+0x83 2
zfs`fletcher_2_native+0xa3 2
zfs`fletcher_2_native+0xa8 2
zfs`fletcher_2_native+0x37 3
zfs`fletcher_2_native+0x3b 3
zfs`fletcher_2_native+0xca 4
zfs`fletcher_2_native+0x9c 9
zfs`fletcher_2_native+0xc6 27
zfs`fletcher_2_native+0x2b 35
zfs`fletcher_2_native+0xab 40
zfs`fletcher_2_native+0x1 56
zfs`fletcher_2_native+0x72 79
zfs`fletcher_2_native+0x91 709
zfs`fletcher_2_native+0x80 1331
zfs`fletcher_2_native+0x75 1346
zfs`fletcher_2_native+0x61 1382
zfs`fletcher_2_native+0x79 1409
zfs`fletcher_2_native+0x87 1437
zfs`fletcher_2_native+0x9f 1438
zfs`fletcher_2_native+0x68 1522
zfs`fletcher_2_native+0x99 1754
zfs`fletcher_2_native+0x7d 1818
zfs`fletcher_2_native+0x6f 1839
zfs`fletcher_2_native+0x95 2341
zfs`fletcher_2_native+0x8b 2517
zfs`fletcher_2_native+0x6b 38448
fletcher_2_native() is a simple piece of code that basically contains a single for() loop. If we map out the PCs in the above example then they map onto the instructions in the for loop as you'd expect.
Example 7: Selective kernel mode cycle profile of a given function
The fletcher_2_native code can be called from 3 separate sites. It may be the case that we are only interested in profiles of the code when it has been called from one of these sites. Here's an example of how we can just profile calls into fletcher_2_native when the caller is arc_cksum_verify:
# cat kerncaller-callee.d
/*
* This is an example of selectively profiling a kernel function depending upon
* who its caller is. The example we have here is only profiling the zfs
* function, fletcher_2_native(), when it is called from arc_cksum_verify().
* We can't use thread local variables to predicate on in the cpc probe as it
* is executing in a high level interrupt context`. To work around this we use a
* global associative array and hash on the lwp's pr_addr as this is guaranteed
* to be consistent.
*/
arc_cksum_verify:entry
{
vrfy[curlwpsinfo->pr_addr] = 1;
}
/*
* This is a bit icky to say the least. We're basically checking that the PC
* that generated the event (presented in arg0 for kernel events) is in the
* fletcher_2_native functions address range and also that the underlying
* thread has called fletcher_2_native() from within arc_cksum_verify().
*/
cpc:::PAPI_tot_cyc-kernel-5000
/arg0 >= (uint64_t)&(`fletcher_2_native)
&& arg0 < ((uint64_t)&(`fletcher_2_native) + sizeof(`fletcher_2_native))
&& vrfy[curlwpsinfo->pr_addr] == 1/
{
@[arg0] = count();
}
arc_cksum_verify:return
/vrfy[curlwpsinfo->pr_addr] == 1/
{
vrfy[curlwpsinfo->pr_addr] = 0;
}
END
{
printa("%a %@d\n", @);
}
I've elided the output of the above script as it looks remarkably similar to the previous one! However, it's the technique that was of interest here and not so much the result itself.
So, there's a very quick introduction to the cpc provider. There's a bunch of stuff that I could have mentioned that I didn't so I'll try and do some more entries fleshing out a few other areas and points. I'm just the guy with the spanners bolting this stuff together and it's you guys out there who can do the clever stuff of applying this to real code to gain new insights. Let me know if there is anything that doesn't work as you thought or if you have any ideas around other pieces that could be bolted on here. Have at it!
