Basic demonstration of DTrace feature of OpenSolaris
This is a demonstration, for a full material on DTrace please refer to the Solaris DTrace guide at: http://docs.sun.com/app/docs/doc/817-6223
D is a C-like event-oriented script that uses operational system instrumentation points as its events.
That points of instrumentation are called Probes.
You can list all the probes of your system using '-l' option.
# pfexec dtrace -l ID PROVIDER MODULE FUNCTION NAME 1 dtrace BEGIN 2 dtrace END 3 dtrace ERROR 4 python723 libpython2.4.so.1.0 PyEval_EvalFrame function-entry 5 python723 libpython2.4.so.1.0 dtrace_entry function-entry [...] 71348 fbt semsys _info return 71349 fbt semsys semexit entry 71350 fbt semsys semexit return 71351 syscall semsys entry 71352 syscall semsys return
You can see that I have 71352 places in my system that I can get statistics right now.
Every Probe in the system can be mapped by a 4-layer hierarchy:
provider : subsystem : function : probeName
More information on that can be found on chap 1 of the DTrace Guide: http://docs.sun.com/app/docs/doc/817-6223/chp-intro?a=view
What is important to understand now is that 'provider' is normally the name of module inside the kernel, 'subsystem' is a kernel subsystem or a user lib, 'function' is the function that we wanna inspect and 'probeName' is what point of that function we wanna get statistics. If I omit any member of this hierarchy the system will understand as '*'(anything)
|
Provider |
The name of the DTrace provider that is publishing this probe. The provider name typically corresponds to the name of the DTrace kernel module that performs the instrumentation to enable the probe. |
|
Subsystem |
If this probe corresponds to a specific program location, the name of the module in which the probe is located. This name is either the name of a kernel module or the name of a user library. |
|
Function |
If this probe corresponds to a specific program location, the name of the program function in which the probe is located. |
|
Name |
The final component of the probe name is a name that gives you some idea of the probe's semantic meaning, such as BEGIN or END. |
So, let's say I wanna the entry point of every system call of the system:
syscall:::entry
The option '-n' is used to run the D script from the shell call, so this command should show the standard output for all syscall probes:
The standard output is to print the id of the CPU, the id of the process, the function name and the probe name:
# pfexec dtrace -n 'syscall:::entry{ }'
CPU ID FUNCTION:NAME
0 70861 ioctl:entry
0 70861 ioctl:entry
0 70999 sysconfig:entry
0 70999 sysconfig:entry
1 70767 read:entry
1 70861 ioctl:entry
1 70861 ioctl:entry
[...]
This can be impressive, but just for a second. In the next moment you will want some way to aggregate this huge amount of information.
To aggregate it by syscall name and count how many times it was called:
# pfexec dtrace -n 'syscall:::entry{ @[probefunc]=count(); }'
^C
fstat 1
getpid 1
mmap 1
schedctl 1
sigpending 1
stat64 1
sysconfig 3
getgid 4
[...]
setcontext 24
writev 39
setitimer 47
p_online 64
clock_gettime 150
read 261
write 301
pollsys 335
lwp_park 379
ioctl 590
For more information on aggergations, refer to chap 9 of the Dtrace guide: http://docs.sun.com/app/docs/doc/817-6223/chp-aggs?a=view
You can short by execname too:
# pfexec dtrace -n 'syscall:::entry{ @[execname,probefunc]=count();}'
^C
dtrace fstat 1
dtrace lwp_park 1
dtrace lwp_sigmask 1
[...]
firefox-bin read 84
Xorg clock_gettime 98
firefox-bin ioctl 132
firefox-bin pollsys 151
rhythmbox ioctl 165
rhythmbox write 169
rhythmbox lwp_park 230
One can say that so far, the only thing that we discover is that I'm listening to music right now, because rhythmbox is making syscalls like crazy.
So, let's say I wanna really understand what is going on when some process calls the syscall, let's say, ioctl. It wouldn't be perfect if we could just ask the process to say every function it enters?
Let's look at this script:
#!/sbin/dtrace -s
#pragma D option flowindent
syscall::ioctl:entry
{
self->follow = 1;
}
fbt:::
/self->follow/
{}
syscall::ioctl:return
/self->follow/
{
self->follow = 0;
exit(0);
}
Each probe description is an event.
Don't bother yourself with the first two lines. The first one is the interpreter, and the second is just a macro for spacing rightly the standard output.
The first will fire when the system enters in the system call ioctl. The second will fire every time the system changes the working function and the thread have a special thread variable follow. FBT is the function boundary provider, and it maps all the functions in the system. What is under the slashes is a filter, and it is called predicate. The last will fire every time a syscall ioctl returns with the thread variable.
Things are getting clearer now.
The flow gos like that:
The first process that calls ioctl will cause the system to put a thread variable named follow in that specific thread and print the standard output (probe details). The next functions that this thread calls will fire a probe, and the next, and the next, and all will print it details until we reach the return of the syscall. This probe will print itself, unset the thread variable and exit the dtrace program.
Without more delays, let's see the output:
# pfexec dtrace -s flw.d dtrace: script 'flw.d' matched 60932 probes CPU FUNCTION 0 -> ioctl 0 -> getf 0 -> set_active_fd 0 <- set_active_fd 0 <- getf 0 -> fop_ioctl 0 -> crgetmapped 0 <- crgetmapped 0 -> spec_ioctl 0 -> cdev_ioctl 0 -> ddi_get_soft_state 0 <- ddi_get_soft_state 0 -> kmem_zalloc 0 -> kmem_cache_alloc 0 -> kmem_depot_alloc 0 <- kmem_depot_alloc 0 -> kmem_slab_alloc 0 <- kmem_slab_alloc 0 <- kmem_cache_alloc 0 <- kmem_zalloc 0 -> kmem_free 0 -> kmem_cache_free 0 <- kmem_cache_free 0 <- kmem_free 0 <- cdev_ioctl 0 <- spec_ioctl 0 <- fop_ioctl 0 -> releasef 0 -> clear_active_fd 0 <- clear_active_fd 0 -> cv_broadcast 0 <- cv_broadcast 0 <- releasef 0 <- ioctl 0 <= ioctl
Now you can search at http://src.opensolaris.org/source/ for each function and really understand what is happening when ioctl is called.
As we could imagine, ioctl deals with io control, some file descriptor configuration, in the middle we can see some memory allocation and of course, all of that are protected by mutual exclusions locks. All that can be seen in the output.
If you run this scripts some more times it won't be long until you get some result like that:
# pfexec dtrace -s flw.d dtrace: script 'flw.d' matched 60932 probes CPU FUNCTION 0 -> do_interrupt 0 -> tlb_service 0 <- tlb_service 0 -> uppc_intr_enter 0 -> uppc_setspl 0 <- uppc_setspl 0 <- uppc_intr_enter 0 -> intr_thread_prolog 0 <- intr_thread_epilog 0 <- dispatch_hardint 0 <- switch_sp_and_call 0 <- do_interrupt 0 -> sys_rtt_common 0 <- sys_rtt_common 0 -> do_interrupt 0 -> tlb_service 0 <- tlb_service 0 -> uppc_intr_enter 0 -> uppc_setspl 0 <- uppc_setspl 0 <- uppc_intr_enter 0 -> intr_thread_prolog 0 <- intr_thread_epilog 0 <- dispatch_hardint 0 <- switch_sp_and_call 0 <- do_interrupt 0 -> sys_rtt_common 0 <- sys_rtt_common 0 -> do_interrupt 0 -> tlb_service 0 <- tlb_service 0 -> uppc_intr_enter 0 -> uppc_setspl 0 <- uppc_setspl 0 <- uppc_intr_enter 0 -> intr_thread_prolog 0 <- intr_thread_epilog 0 <- dispatch_hardint 0 <- switch_sp_and_call 0 <- do_interrupt 0 -> sys_rtt_common 0 <- sys_rtt_common 0 -> do_interrupt 0 -> tlb_service 0 <- tlb_service 0 -> uppc_intr_enter 0 -> uppc_setspl 0 <- uppc_setspl 0 <- uppc_intr_enter 0 -> intr_thread_prolog 0 <- intr_thread_epilog 0 <- dispatch_hardint 0 <- switch_sp_and_call 0 <- do_interrupt 0 -> sys_rtt_common 0 <- sys_rtt_common 0 -> ioctl 0 -> getf 0 -> set_active_fd 0 <- set_active_fd 0 <- getf 0 -> fop_ioctl 0 -> crgetmapped 0 <- crgetmapped 0 -> nm_ioctl 0 -> fop_ioctl 0 -> crgetmapped 0 <- crgetmapped 0 -> fifo_ioctl 0 -> fifo_strioctl 0 -> strioctl 0 -> job_control_type 0 <- job_control_type 0 -> strcopyout 0 <- strcopyout 0 <- strioctl 0 <- fifo_strioctl 0 <- fifo_ioctl 0 <- fop_ioctl 0 <- nm_ioctl 0 <- fop_ioctl 0 -> releasef 0 -> clear_active_fd 0 <- clear_active_fd 0 -> cv_broadcast 0 <- cv_broadcast 0 <- releasef 0 <- ioctl 0 <= ioctl
Probably when ioctl was called, it needed to take some other thread out of the cpu. We can see some calls from the interrupt sequence. TLB is the table that maps virtual memory, and we can see some calls related to stack pointers and that cleary are evidencing context changes.
As we can see, Dtrace is a great way to find bugs, profile, do performance analyses/evaluation. It is also a great way to learn and unsderstand operational systems behavior.
So, Where we can go now?
Community at Opensolaris.org:
opensolaris.org/os/community/dtrace/
Adam Leventhal:
http://blogs.sun.com/ahl
BigAdmin resources:
www.sun.com/bigadmin/content/dtrace/
Sun's How To:
www.sun.com/software/solaris/howtoguides/dtracehowto.jsp
Solaris Internals:
http://www.solarisinternals.com
Please post a comment or send me a email if you have any doubts,
suggestions or if you find something here that isn't that right. We're
a community and your help is very welcome a encouraged. =)

Nice job. Pretty cool to start with.
Posted by Tirthankar on February 11, 2009 at 06:25 PM BRST #
really nice tutorial. Good job dude :)
Posted by Angad Singh on February 12, 2009 at 07:28 PM BRST #
Very informative Tutorial. Great Job.
Posted by Ashwin on February 14, 2009 at 10:53 AM BRST #