Alan Hargreaves' Weblog

The ramblings of an Australian STSC* Staff Engineer

* Systems Technical Support Centre - The group I work for

Tags

(update 1) acoustic birthday blues bugs cec cec2007 cec2008 china cmt contention cringley debugging dogs dtrace earthquake encumbered-binaries extra flash funny google guitar halloween huron install kids linux liveupgrade locking mdb music mysql newyear niagra openjava opensolaris oracle patches percussion performance redhat secondlife security solaris sru sun support sxcr t2 t2000 timeslider ufs upgrade virtualbox windows youtube zfs
pageicon Monday Jun 06, 2005

DTrace - Using (placing) SDT probes

This is a slight re-write of the talk that I gave at the SOSUG meeting in Sydney on May 31.

I've also included in this entry some demonstrations that I've done up using flash and vnc2swf. These are all set up to neither automatically play nor loop. In order to start them you will need to bring up the flash menu (right button insider the demo) and select play. Hopefully I can find a better way to do this in the future.

Updates

June 6 - Thanks to Peter, I now have Start buttons on the demos.

June 7 - A question in the comments propted me to talk a little more about what the assembly looks like and how much space a probe adds to a binary. See the Comments section on this entry.

Overview

What is an SDT Probe?

It stands for Statically Defined Tracing Probe.

In reality, with the exception of fbt and pid probes, all probes are static. They just don't appear under the sdt provider as they have been written to be under other providers. A provider is simply a grouping of probes.

The fact that they are static probes does not detract from the dynamic nature of DTrace, which refers to the fact that they can be dynamically enabled.

These probe points can be placed anywhere in the code in order to make that code more observable.

There are a number of probes that do exist under the sdt provider. On the notebook that I write this on there are around 200.

   ID   PROVIDER    MODULE                          FUNCTION NAME
  460        sdt      unix         page_get_replacement_page page-get
  461        sdt      unix                page_get_cachelist page-get
  462        sdt      unix                 page_get_freelist page-get
  475        sdt      unix                       lgrp_choose lgrp_choose_start
  490        sdt      unix              av_dispatch_autovect interrupt-complete
  491        sdt      unix                       intr_thread interrupt-complete
  492        sdt      unix                            cmnint interrupt-complete
  494        sdt      unix              av_dispatch_autovect interrupt-start
  495        sdt      unix                       intr_thread interrupt-start
  496        sdt      unix                            cmnint interrupt-start
  513        sdt   genunix                   callout_execute callout-end
  514        sdt   genunix                   callout_execute callout-start
  550        sdt   genunix                 kcpc_pcbe_tryload kcpc-pcbe-spec
  588        sdt   genunix                  priv_policy_only priv-ok
  589        sdt   genunix                priv_policy_choice priv-ok
  590        sdt   genunix                       priv_policy priv-ok
  591        sdt   genunix                  priv_policy_only priv-err
  592        sdt   genunix                priv_policy_choice priv-err
  ...

A little background on the output of dtrace -l as seen above.

The first column is the probe number. This number will vary from system to system depending on their order of creation. You generally won't need to know this.

The second column is the provider. As mentioned earlier, the provider is simply a grouping of probes. For example, mib, io, sdt and sched.

Next we have the module. In kernel space, this will be the kernel module in which the probe resides. In user space it will be the object. For example a.out, libc, ...

Then we have the probe function. This is simply the name of the function into which the probe was written.

Finally there is the probe name. The name given to the probe.

So a probe is made up of a 4-tuple of these.

probeprov:probemod:probefunc:probename

Incidentally, the names used in the above layout are the variable names that can be used from within the D language to access the particular information about the probe.

To add a probe to the code we use the DTRACE_PROBE macros by first including <sys/sdt.h>

As a part of loading the module (in kernel space) or building the object (in user space) the function call generated by this macro is replaced by a number of nop instructions.

Kernel SDT probes looks like

    DTRACE_PROBE(name);
    DTRACE_PROBE1(name, type1, arg1);
    DTRACE_PROBE2(name, type1, arg1, type2, arg2);
    DTRACE_PROBE3(name, type1, arg1, type2, arg2, type3, arg3);
    DTRACE_PROBE4(name, type1, arg1, type2, arg2, type3, arg3, type4, arg4);

Each probe has a name and zero or more type/argument pairs. Each of these pairs describe a variable that is passed to the probe.

User SDT probes look a little different.

    DTRACE_PROBE(name);
    DTRACE_PROBE1(name, arg1);
    DTRACE_PROBE2(name, arg1, arg2);
    DTRACE_PROBE3(name, arg1, arg2, arg3);
    DTRACE_PROBE4(name, arg1, arg2, arg3, arg4);

You'll notice that we don't provide the type information in the code. I'll get to how we define them later on.

OK, so let's look at some examples, starting in kernel space.

The io provider

This shows all kinds of nifty stuff when dealing with I/O.

 $ dtrace -l -P io
    ID   PROVIDER            MODULE                 FUNCTION NAME
  521         io           genunix                  biodone done
  522         io           genunix                  biowait wait-done
  523         io           genunix                  biowait wait-start
  532         io           genunix           default_physio start
  533         io           genunix            bdev_strategy start
  534         io           genunix                  aphysio start
 1263         io               nfs                 nfs4_bio done
 1264         io               nfs                 nfs3_bio done
 1265         io               nfs                  nfs_bio done
 1266         io               nfs                 nfs4_bio start
 1267         io               nfs                 nfs3_bio start
 1268         io               nfs                  nfs_bio start

The io provider is implemented in sys/sdt.h as wrappers around the DTRACE_PROBE macros.

#define DTRACE_IO(name)                                         \
        DTRACE_PROBE(__io_##name);

#define DTRACE_IO1(name, type1, arg1)                           \
        DTRACE_PROBE1(__io_##name, type1, arg1);

#define DTRACE_IO2(name, type1, arg1, type2, arg2)              \
        DTRACE_PROBE2(__io_##name, type1, arg1, type2, arg2);

#define DTRACE_IO3(name, type1, arg1, type2, arg2, type3, arg3) \
        DTRACE_PROBE3(__io_##name, type1, arg1, type2, arg2,    \
            type3, arg3);

#define DTRACE_IO4(name, type1, arg1, type2, arg2,              \
    type3, arg3, type4, arg4)                                   \
        DTRACE_PROBE4(__io_##name, type1, arg1, type2, arg2,    \
            type3, arg3, type4, arg4);

The only thing that these wrappers add is the __io_ prefix to the name. The kernel DTrace module picks this up as the provider name.

These probe points have been placed where io stats get updated.

This is probably better demonstrated with an example.

On starting this demonstration you will see the command issued. It will then appear to pause for 10 seconds (because of the tick-10s probe) and then dump a histogram showing the times taken to do each IO involved in saving a Star Office presentation.

The tick-10s probe fires after 10 seconds and we use it to exit the command.

We don't list the probe module or probe func in the wait-start/wait-done probe as we don't care what these values are, we simply want to match these probe names.

The string between the forward slashes is called a predicate. This defines the conditions under which the probe will fire. In the wait-start probe, the condition is that the process name (execname) is "soffice.bin". The wait-done probe will only fire if self->start is non-null.

self->start probably also needs a little explanation. This is a thread local variable. It has scope of the lifetime of the thread and DTrace program. That is, if we had two seperate threads, the value of self->start could be different for each. start is simply the variable name that I chose to use to record the timestamp at which this probe fired.

Assigning 0 to self->start frees the memory assoiated with it.

The quantize function is what produces the histogram.

The mib provider

The mib provider, while not defined in terms of the DTRACE_PROBE macros, is also a static probe.

What makes this one interesting is that this minor change to two macros has an enormous effect on the observability of the network stack.

In order to show this properly, the full definition of BUMP_MIB and UPDATE_MIB from inet/mib2.h are shown.

#define BUMP_MIB(s, x)          {                               \
        extern void __dtrace_probe___mib_##x(int, void *);      \
        void *stataddr = &((s)->x);                             \
        __dtrace_probe___mib_##x(1, stataddr);                  \
        (s)->x++;                                               \
}

#define UPDATE_MIB(s, x, y)     {                               \
        extern void __dtrace_probe___mib_##x(int, void *);      \
        void *stataddr = &((s)->x);                             \
        __dtrace_probe___mib_##x(y, stataddr);                  \
        (s)->x += (y);                                          \
}

The lines could have been written


#define BUMP_MIB(s, x)          {                               \
        DTRACE_PROBE2(__mib_##x, int, 1, void *, &((s)->x))     \
        (s)->x++;                                               \
}

#define UPDATE_MIB(s, x, y)     {                               \
        DTRACE_PROBE2(__mib_##x, int, y, void *, &((s)->x))     \
        (s)->x += (y);                                          \
}

These two macros would previously have been defined simply without the reference to DTRACE_PROBE2. They are used exclusively to increment the network kstats. As a result, we end up with a probe point for every kstat update within the network code. On the machine that I am writing this on, that means 436 probes!

The variables passed in to these macros are

  • s - The address of the kstat structure
  • x - The name of the kstat variable to be increased
  • y - The amount by which the kstat will be increased

This produces a probe with

  • arg0 - The amount by which the kstat will be increased
  • arg1 - The value of the kstat before it is increased

Let's look at an example of tracing a mib probe.

OK, this is a pretty contrived example. What it does is to dump the kernel stack when we first increase the value of the rawipOutDatagrams kstat and inform us of it's current value and what it will be incremented by. I fired this probe by pinging localhost.

User Space

We use a slightly different macro in user space

#define DTRACE_PROBE1(provider, name, arg1) {                          \
        extern void __dtrace_##provider##___##name(unsigned long);     \
        __dtrace_##provider##___##name((unsigned long)arg1);           \
}

The arguments are

  • provider - name of the provider (duh!)
  • name - name of the probe
  • arg1, ... - the actual value the probe will report

Note that we don't define the type. This is done differently in user space and we'll get to that shortly.

Let's take a simple little program (helloworld.c)

#include <stdio.h>
#include <unistd.h>

int
main(int ac, char **av) {
        int i;
        for (i = 0 ; i < 5; i++) {
                printf("Hello World\n");
                sleep(2);
        }
}

and compile and run it.

Which is pretty much as you would expect.

Let's try adding a probe to this. Say we wanted to be able to monitor the loop variable.

#include <stdio.h>
#include <unistd.h>
#include <sys/sdt.h>

int
main(int ac, char **av) {
        int i;
        for (i = 0 ; i < 10; i++) {
                DTRACE_PROBE1(world, loop, i);
                printf("Hello World\n");
                sleep(2);
        }
}

The additions to the code are those in green. Basically We need to include <sys/sdt.h> and add the probe.

But wait, in user space there's more.

We still need to define the types of the arguments and the stability levels. This gets linked into the code later (as you will see in the demonstration).

Let's create msderv.d.

provider world {
        probe loop(int);
};

#pragma D attributes Evolving/Evolving/Common provider world provider
#pragma D attributes Private/Private/Common provider world module
#pragma D attributes Private/Private/Common provider world function
#pragma D attributes Evolving/Evolving/Common provider world name
#pragma D attributes Evolving/Evolving/Common provider world args

The stuff in provider is relatively self explanatory and I'm not going to go into the stability stuff here. Chapter 39 of the reference manual does a good job of that.

In order to build the probes incorporating the provider description we need another step in the build.

Let's build and run it without enabling any probes first.

Exactly as before, which we should expect.

The options to the dtrace command deserve some explanation.

  • -32 - I don't have an opteron notebook, I need a 32 bit object.

  • -G - Generate ELF files containing the embedded DTrace program. The probes specified in the files listed in the -s options are saved into these objects to be linked in to the binary. This also goes looking through for the probe points in the other objects and replaces them with nops so that the default is that the probes are disabled.
  • -s - As previously mentioned, the file associated with this argument is treated as a D program containing the declarations of the probe points.

Let's look at both the counter and the first argument to printf.

i is now observable, but with no overhead unless we are tracing it.

Conclusion

SDT probes are an easy way to make stuff visible.

The helloworld example was trivial, but, imagine being able to place probes like this into large applications or drivers, sendmail might be a good example in user space, Veritas Volume Manager might be a good candidate for kernel.

We get observability without the need for

  • Seperate instrumented binaries
  • Restart
  • Reboot (in the case of drivers/kernel)

With next to no overhead if they are not being observed.

Having placed probes into kernel code for debugging purposes, I can definitely recommend putting SDTs into your code to improve observability and diagnosability.

Technorati Tags: ,

Comments:

Alan,

thanks, great article and the demos.

How much (if any) overhead theese probes adds to the binary size?

For example, in Java2D there's hundreds of primitives which would benefit from adding probes for tracing arguments and other stuff. Most of those primitives are generated by our macros during build time, so dtrace probes could easily be added with just a few lines in a couple of header files, but I'm worried that it'd increase the binary size (which also affects startup time, etc).

Thanks,
Dmitri
Java2D Team

Posted by Dmitri Trembovetski on June 07, 2005 at 03:28 PM EST #

In the demonstration code shown, it adds 5 nop instructions in the place of the probe on x86.

The actual assembly is:

$ mdb helloworld2 -
> main::dis ! ~/target.pl
main            pushl  %ebp
main+1          movl   %esp,%ebp
main+3          subl   $0x8,%esp
main+6          movl   $0x0,-0x8(%ebp)
main+0xd        movl   -0x8(%ebp),%eax
main+0x10       cmpl   $0x5,%eax
main+0x15       jge    +0x3e    <main+0x53>(2f)
main+0x1b 1:    movl   -0x8(%ebp),%eax
main+0x1e       pushl  %eax
<font color=green>main+0x1f       nop
main+0x20       nop
main+0x21       nop
main+0x22       nop
main+0x23       nop</font>
main+0x24       addl   $0x4,%esp
main+0x27       pushl  $0x8050fdc
main+0x2c       call   -0x50c   <PLT:printf>
main+0x31       addl   $0x4,%esp
main+0x34       pushl  $0x2
main+0x36       call   -0x506   <PLT:sleep>
main+0x3b       addl   $0x4,%esp
main+0x3e       movl   -0x8(%ebp),%eax
main+0x41       incl   %eax
main+0x42       movl   %eax,-0x8(%ebp)
main+0x45       movl   -0x8(%ebp),%eax
main+0x48       cmpl   $0x5,%eax
main+0x4d       jl     -0x32    <main+0x1b>(1b)
main+0x53 2:    leave
main+0x54       ret

It depends on how many arguments you are sending to the probe.

I should also mention that in this case the myserv.o code (from the myserv.d) is a grand total of 376 bytes long.

Thanks for bringing that up, I'd forgotten that I wanted to mention it.

Also thank you for commenting on this entry. I put a lot of work into it and I was starting to worry that no-one had looked at it. :)

Oh, by the way <tt><font color=green>target.pl</font></tt> is a small perl script that I wrote some time ago to list branch targets given ::dis output. Very useful when trying to work out conditionals and loops.

Alan.

Posted by Alan Hargreaves on June 07, 2005 at 03:49 PM EST #

Alan, thanks for the clarification on the size increase.

I have another question: our rendering loops are very tight, and highly optimized (and typically compiled with maximum optimization level).

Could adding a probe to such a loop have negative effect on performance even when the probe is not enabled?

For example, suppose I want to trace some variable, which the compiler would've optimized away freeing up a register if it wasn't for being traced by dtrace.

Thanks,
Dmitri
Java2 Team

Posted by Dmitri Trembovetski on June 08, 2005 at 04:22 AM EST #

Post a Comment:
Comments are closed for this entry.