DTrace performance
I was exploring the performance cornet of DTrace as such. I've used DTrace several times in analyzing transient states and DTrace is no doubt a wonderful tool. I was trying to understand the mechanism of USDT and what goes on "behind the scenes" when you incorporate a DTrace probes using the USDT mechanism. To show this, I developed a small C program shown below. This program has just one function, printDemo, which takes in a string and prints the input string to the screen. However, we are not interested in what gets displayed, but how DTrace works when you incorporate a probe at this location. So, we ignore any printf() statements or the like.
#include <stdio.h>
void printDemo(char *);
void printDemo(char *x){
}
int main(void){
long long x = 0;
long long count = 1E7;
for(x = 0; x <count;x++){
printDemo("Hallo Demo"
;
}
printf("Done %lld iterations\n", x);
}
This way, we manage to eliminate the time spent on printing the characters to the screen. So, what we are looking at is the point when the function, printDemo, is called and the associated times. The main invokes this function 1E7 times, so we have a significant time window to observe the behaviour. Now, executing the program and measuring the runtime with timex command, we observe that the run is pretty fast.meson:/export/home/mithun/Desktop/USDT >cc demonodt.c -o demono
meson:/export/home/mithun/Desktop/USDT >timex ./demono
Done 10000000 iterations
real 0.15
user 0.14
sys 0.00
Now, let us introduce DTrace probes in our function. The DTrace script, demoprobes.d contains the provider information. The file is shown below:
meson:/export/home/mithun/Desktop/USDT >more demoprobes.d
provider demo{
probe printDemo(char *);
};
From my previous blog entry, you should be knowing the necessary steps in incorporating probes into your application. The modified C-program is shown below:
meson:/export/home/mithun/Desktop/USDT >more demo.c
#include <stdio.h>
#include <sys/sdt.h>
void printDemo(char *);
void printDemo(char *x){
DTRACE_PROBE1(demo, printDemo, x);
}
int main(void){
long long x = 0;
long long count = 1E7;
for(x = 0; x <count;x++){
printDemo("Hallo Demo"
;
}
printf("Done %lld iterations\n", x);
}
The listing is pretty simple and straightforward. Now, I shall compile the setup using a Makefile listed below:
meson:/export/home/mithun/Desktop/USDT >more Makefile
CC=cc
DTRACE=dtrace
demo: demoprobes.o demo.o
$(CC) -o demo demoprobes.o demo.o
demoprobes.o: demo.o
$(DTRACE) -G -s demoprobes.d demo.o
demo.o:
$(CC) -c demo.c
clean:
rm -f *.o
Now, the DTrace probes are built in the application and is now ready to be traced. I shall now DTrace the program execution and measure the execution time. The script I shall use here is shown below:
#!/usr/sbin/dtrace -Zqs
BEGIN{
self->ts = walltimestamp;
}
demo$target:::printDemo
{
@[arg1]=count();
}
END{
self->ts = walltimestamp - self->ts;
printf("Total time used is (us): %d", (self->ts)/1000);
}
The script has BEGIN and END unanchored probes that I use to measure the execution time. On running the script, one could measure the execution time.
meson:/export/home/mithun/Desktop/USDT >./demotest.d -c ./demo
Done 10000000 iterations
Total time used is (us): 19659984
0 10000000
The run takes around 19.65 seconds. I could get the same information by using timex as follows:Total time used is (us): 20859986
0 10000000
real 21.57
user 0.51
sys 0.16
The difference in time between DTrace and timex could be neglected for practical reasons beyond my interest. Approximately, the run with DTrace takes 20 seconds. Thats the information that we are looking at. This is around 130 times more overhead. Where does this come from. To get to this, we need to dissect the binaries produced upon compiling the same source with DTrace enabled and the code without any DTrace probe. The point of interest is the function printDemo. Looking at printDemo without DTrace probes, the assembler output looks like this.
127 printDemo()
128 printDemo: 55 pushl %ebp
129 printDemo+0x1: 8b ec movl %esp,%ebp
130 printDemo+0x3: c9 leave
131 printDemo+0x4: c3 ret
However, with DTrace probes enabled, the assembler output looks like this.
477 printDemo()
478 printDemo: 55 pushl %ebp
479 printDemo+0x1: 8b ec movl %esp,%ebp
480 printDemo+0x3: 6a 00 pushl $0x0
481 printDemo+0x5: 8b 45 08 movl 0x8(%ebp),%eax
482 printDemo+0x8: 50 pushl %eax
483 printDemo+0x9: 90 nop
484 printDemo+0xa: 90 nop
485 printDemo+0xb: 90 nop
486 printDemo+0xc: 90 nop
487 printDemo+0xd: 90 nop
488 printDemo+0xe: 83 c4 08 addl $0x8,%esp
489 printDemo+0x11: c9 leave
490 printDemo+0x12: c3 ret
Between the two assembler listings, you might observe that there are several nop ( lines 483 - 488). These are the points that DTrace uses for tracing. The DTrace framework uses these points to dynamically patch with a call instruction that will enable the DTrace probe. As described in the document here, the principle induces a "disable probe effect". The
call site is replaced with a no-operation, the compiler must
nonetheless
treat the call site as a transfer of control into an unknown function.
As a result, this mechanism can induce disabled probe effect by
creating
artificial register pressure. However, by carefully placing
SDT probes near extant calls to functions, this disabled probe effect
can be made so small as to be unobservable. This explains the overhead observed in our program.
Though there are several shortcomings in my implementation, I believe that the core idea has been conveyed. One should take care and contemplate the need for DTrace probes in certain points in your application, that may result in a degradation. Also, wild tracing could result in a performance snag, though DTrace could indirectly be responsible. With a little care, one could overcome these shortcomings and make an effective use of DTrace facility.
Posted at 02:47PM Dec 09, 2007 by mithun in DTrace | Comments[0]