Friday August 13, 2004 | Speaker To Machines Erik O'Shaughnessy - erik.oshaughnessy AT Sun.COM |
|
Perl+DTrace=Instruction Tracing As promised, an ugly perl script that can be used to generate a custom DTrace script which will generate an instruction trace. For instance:
trogdor:utrace -> ls hello* hello.c utrace.pl* utrace.txt* trogdor:utrace -> utrace.pl -i hello trogdor:utrace -> sudo dtrace -Cs hello.t -c hello ... For the interested, here is the text of the traced program, the DTrace script, and the output generated. The perl script is no where near "feature complete", nor is it an example of good perl programming practice ( parts of it are legible ). But, say it with me, it scratched an itch :). -ejo (2004-08-13 12:54:05.0) Permalink DTrace Instruction Tracing Continues.. Wow, this has me bugged. Using this script ( called itrc.d ):
pid$target:a.out::
{
printf("%d %016llx",vtimestamp,uregs[R_PC]);
}
# dtrace -s itrc.d -c hello3 > rtrc.hello3
a custom SPARC disassembler, and a really awful perl script, I produced:
I think I'm almost done. -ejo (2004-07-29 14:28:54.0) Permalink Adam Leventhal pointed out that the DTrace pid provider can instrument user-land instructions, so my interest was piqued enough to give it a whirl last night. The script I ended up with is just stupidly short, I feel like I'm cheating.
pid$target:a.out::
{
printf("pc %p: %x",uregs[R_PC],*(uint )copyin(uregs[R_PC],4));
}
Using a new DTrace feature, I start the trace and get some output: # dtrace -c ./hello2 -s itrc.d CPU ID FUNCTION:NAME 0 38695 _start:entry pc 1000006a0: 91d02038 0 38696 _start:0 pc 1000006a0: 91d02038 0 38697 _start:4 pc 1000006a4: 91d02038 0 38698 _start:8 pc 1000006a8: 91d02038 ... That doesn't look right. The PC's look good, but those aren't the instructions at those addresses. Or are they? I was a bit foggy last night when I started messing with this, so I assumed that I wasn't using copyin() right. Somehow I was getting the address of the DTrace buffer that copyin returns or something. In the cold light of morning things look different. It occurs to me that 91d02038 looks familiar. Yes I know, it's numbers. The crawling itch in my lizard hacker brain was trying to tell me something: 91d02038 <=> ta %icc, 0x38 What? Your lizard hacker brain doesn't tell you stuff like that? Lucky you. So I was using copyin correctly. I am fetching the instruction at the PC, just not the original instruction. As everybody knows, 0x38 is the DTRACE pid provider trap, so this instruction is the DTrace instrumentation that makes it all work. I'm feeling moderately clever for having figured that out on only two cups of coffee. :) -ejo (2004-07-29 08:17:52.0) Permalink Comments [1] Adventures in Instruction Tracing: Part Duex! Yesterday I began talking about cobbling together a user-land instruction trace tool using a variety of UltraSPARC and Solaris features. Today I'm going to cop out alittle and present the comments for the user trap handler I wrote which is the heart of the tool. For the truly bored, I've provided a hello.c and associated (post-processeed) trace. -ejo
#if 0
illtrap_hndlr - Btrace3 ILLTRAP utrap handler
erik.oshaughnessy@sun.com
Btrace3 instruments the code of a program by substituting control
transfer instructions (CTIs) with specially prepared ILLTRAP
instructions. When the code executes, the ILLTRAP instructions cause
the utrap handler to execute. This allows us to observe the actual
path of execution by noting which ILLTRAP instructions are executed.
It probably is worth mentioning that the ILLTRAP instruction is a
sneaky way of hiding 22 bit constant data in your code. According to
_The SPARC Architecture Manual_ Version 9, David Weaver and Tom
Germond, page 168:
"The ILLTRAP instruction causes a illegal_instruction exception. The
const22 value is ignored by the hardware; specifically, this field is
NOT reserved by the architecure for any future use."
I have taken advantage of this by crafting the ILLTRAP instructions so
that they serve as indices into the array of saved original
instructions. Used in a flat array, a const22 value gives us 2^22-1 or
roughly 4 million points where code can be instrumented. For the
current generation of software this number of probe points appears to
suffice, but it is not difficult to conceive of future applications
where the number of probe points exceeds this flat space.
The first implementation of Btrace played a dangerous shell game,
storing the original instruction back into the executing stream of
instructions and re-inserting the ILLTRAP when handling the next
probed CTI. Despite the odd corner case, this technique worked
surprisingly well in a single- threaded application. It falls apart
when attempting to trace multi- threaded applications since the
complexity of managing multiple stores into the instruction stream and
guranteeing correctness of the instrumented program proved simply
overwhelming.
Btrace3 sidesteps the self-modifying code aspects of Btrace1 and 2 by
emulating the CTI instruction instead of storing the probed
instruction back into the instruction stream. This is possible since
the utrap handler can choose where the trapping context begins
executing after the utrap handler returns. Typical utrap handlers
either skip the trapping instruction or restart the trapping
instruction after modifying the state of the trapping context. A CTI
instruction can be emulated by calculating the target address if the
branch is taken and then determining if the CTI is taken. After
evaluating any conditions predicated by the type of instruction being
emulated, it is a simple matter to return with arbitrary PC and nPC
addresses.
Since the instruction stream remains instrumented during the lifetime
of of the process, the synchronization complexity of the tracing code
is greatly reduced. The only remaining synchronization point is trace
buffer management, which is a relatively well understood problem.
The first Btrace implementation also uncovered the importance of
preserving the state of the trapping context before proceeding with
tracing activities. The two macros UTRAP_PROLOG and UTRAP_EPILOG
preserve and restore the following registers: %ccr, %g1 - %g7.
Additionally, UTRAP_PROLOG writes ASI_PNF to %asi as required by the
__sparc_utrap_install(2) man page before calls are made to ABI
conforming functions. Values are preserved on the stack, making the
utrap handler reasonably re-entrant. The stack also has room for
sundry other values; %tick, %fsr, %l6 and %l7 (PC and nPC) from
trapping context, and the ILLTRAP instruction at [%l6]. The macro
UTRAP_EPILOG restores the state of %ccr, %g1 - %g7 which have may have
been corrupted by the utrap handler activites.
The following is a pseudo pseudocode description of how illtrap_hndlr works:
illtrap_hndlr{
sampled_tick_register = gethrtime()
save trapping context
original instruction = ProbeTable[ trapping ILLTRAP instruction ]
tid = thr_self(); /* re-implemented to avoid linking libthread */
cpuid = getcpuid() /* new in Solaris 9 */
switch( original instruction )
case CALL:
emulate CALL instruction
case BPr:
emulate BPr instruction
case FBfcc:
emulate FBfcc instruction
case FBPfcc:
emulate FBPfcc instruction
case Bicc:
emulate Bicc instruction
case BPcc:
emulate BPcc instruction
case JMPL:
emulate JMPL instruction
default: /* getting to default is an error condition */
unregister illtrap_hndlr
restore trapping context
restart trapping instruction, this time without the handler to
catch it when it falls. the program should die.
/* end of switch */
pick PC and nPC values based on contents of FLG register which
indicates if the branch was taken and if the delay slot instruction was
annulled.
if( !TraceOn )
restore trapping context
start executing at PC and nPC computed during instruction emulation
fi
lock(TrcLock,tid)
if( cTrcSlot > TRCMAX )
write(btrc->fd,TraceBuffer,TRCBUFSZ)
cTrcSlot = 0;
fi
store btfrec_t built in emulate_* to TraceBuffer[cTrcSlot]
cTrcSlot++
unlock(TrcLock)
restore trapping conext
start executing at PC and nPC computed during instruction emulation
}
Of course there's alittle bit of hand waving, but that should give the
reader a basic idea of what's being done and in what order.
#endif
(2004-07-28 13:12:23.0)
Permalink
Anything You Can Do, D Can Do Better! Not sour grapes honest! Jon Haslam saw my fsflush post and couldn't just bask in the glory that was old school hacking. He had to throw the new hotness at the problem. I, for one, welcome our new DTrace overlords! -ejo (2004-07-27 15:20:05.0) Permalink Comments [1] Adventures in Instruction Tracing One thing that performance people like to do is actually know what instructions are executed by a CPU. Crazy, I know. You can disassemble the program to know what specific machine instructions compose the program, but you still don't know what instructions specifically are executed at run-time due to branching in the code induced by the input to the program. So way back when, clever programmers found ways to instrument their code ( or more often, somebody else's code ) in order to determine what instructions a CPU was executing. There are lots of techniques for doing this, but I'd like to talk about the one I implemented and was disclosed in a short paper that was published in the May 2004 issue of Research Disclosure Journal. I'm happy to have been published, but the two page limit did sort of cramp my style.
Branch TracingSome computer architectures provide a spiffy feature, often referred to as "trap on branch". It is possible to build a pretty efficient instruction tracing tool using trap on branch since it interrupts the code under observation only at the end of a basic block, rather than on every instruction executed. On average a SPARCv9 basic block consists of 5 instrutions, so using trap on branch to trace would trap 80% less than trapping on every instruction. Wait for it. Unfortunately, SPARCv9 does not define a trap on branch. Oh if only.
Work With What You've GotDespite the lack of branch on trap in SPARCv9, there are some other architectural goodies defined which the intrepid hacker could exploit. The first thing I looked at was the Instruction Breakpoint Register( Ok, technically not defined by SPARCv9, it's a USIII implementation defined extension ), which caused a breakpoint trap to fire when an instruction matched a user definable mask. Originally intended as a means of hot-patching broken instructions in the field, this feature never really worked as designed and was disabled in USIII+. Another cool feature that doesn't get alot of love is the user trap handler facility, described in __sparc_utrap_install(2) and in the SPARC architecture manual (verson 9). It allows a user to define and install a custom trap handler for a specific (but diverse) set of exceptions. Now that is pretty cool!
You Got Your Debugger In My Instruction Tracer!
Skipping ahead alittle bit, I had the idea of combining an old debugger technique for breakpointing with SPARC user trap handlers to create a poor-man's trap on branch. Many debuggers implement breakpointing by inserting an illegal instruction at the specified address in the program and catching the illegal instruction trap generated when the client program executes it. The debugger then patches up the replaced instruction or emulates it, whichever seemed easier to do at the time. An instruction tracing tool could patch up the
It's Never That EasyWhere to start. First, user trap handling for ILLTRAP was broken as detailed in bug 4650095. Second, there's the small matter of introducing ILLTRAPs into a program. Third, what to do with the CTIs that you've replaced with ILLTRAPs and all the trace output you will generate? I discovered that an LD_AUDIT shared object would be an excellent way to introduce the trace instrumentation into a target application. The runtime linker audit interfaces are described adequately in rtld_audit(3ext) and in excellent detail in the Linker and Libraries Guide. The audit interfaces give the determined hacker a great deal of freedom to mess with an executable before control is passed to it from the runtime linker. Another advantage of this technique is the instrumenting code and the target code all share the same address space, so it's very easy to find and replace all the CTIs in the target code with ILLTRAPs.
A Commercial Plug And Then A Pause For BreathAs an engineer for Sun, I have found these three books to be invaluable: The Linker and Library Guide in particular is an amazing piece of work, especially when you learn that it's written by the same guys who write the linker code. Those guys rock! All that's missing is a blog from Siezo! Tomorrow, I'll post some condensed notes on the design of a user-land instruction tracing tool built on user traps called Btrace3, and why it ultimately was a flop. -ejo (2004-07-27 15:10:15.0) Permalink Yesterday I was reading performance related bug reports, trolling for work, when I stumbled upon 5024143. I don't know alot about fsflush, so I did some reading and found that on10's fsflush is keeping some internal statistics concerning it's operation that it isn't sharing ( a kstat would be a good place to start ). So being a curious sort, I whipped up a libkvm(3LIB) consumer to expose fsflush's stats to userland. fsfstat is born! Caveat: on10 required.
Some Example OutputHere is sample output from fsfstat run on a 12-way 4800 with 28G of real memory, mostly idle ( Burninator! ).
trogdor:fsfstat -> sudo ./fsfstat 1 #3594320 SCANNED PCT EXAMINED LOCKED MODIFIED COALESCE RELEASES TIME(ns) 15:08:44 119810 3.3 235 0 0 0 0 340320 15:08:45 119810 6.7 235 0 0 0 0 324880 15:08:46 119810 10.0 235 0 0 0 0 324800 15:08:47 119810 13.3 235 0 0 0 0 318000 15:08:48 119810 16.7 634 2 0 0 0 501840 15:08:49 119810 20.0 13802 773 6 7 0 4269440 15:08:50 119810 23.3 235 0 0 0 0 335520 15:08:51 119810 26.7 235 0 0 0 0 320800 15:08:52 119810 30.0 235 0 0 0 0 317840 15:08:53 119810 33.3 235 0 0 0 0 317360 15:08:54 119810 36.7 235 0 0 0 0 320640 15:08:55 119810 40.0 235 0 0 0 0 323680 15:08:56 119810 43.3 235 0 0 0 0 320400 15:08:57 119810 46.7 235 0 0 0 0 315440 15:08:58 119810 50.0 235 0 0 0 0 318800 15:08:59 119810 53.3 235 0 0 0 0 319440 15:09:00 119810 56.7 6367 3124 0 0 0 7615280 15:09:01 119810 60.0 110739 54692 0 6 0 118389680 15:09:03 119810 63.3 33125 14333 374 10 0 31190320 15:09:04 239620 70.0 470 0 0 0 0 663520 15:09:05 119810 73.3 235 0 0 0 0 324240 15:09:06 119810 76.7 235 0 0 0 0 320080 15:09:07 119810 80.0 235 0 0 0 0 317280 15:09:08 119810 83.3 235 0 0 0 0 319680 15:09:09 119810 86.7 235 0 0 0 0 318640 #3594320 SCANNED PCT EXAMINED LOCKED MODIFIED COALESCE RELEASES TIME(ns) 15:09:10 119810 90.0 6572 773 18 0 0 3049280 15:09:11 119810 93.3 113238 617 4 0 0 20368080 15:09:12 119810 96.7 103493 0 0 21 0 17344240 15:09:13 119810 100.0 235 0 0 0 0 346960 15:09:14 119810 3.3 235 0 0 0 0 321040 15:09:15 119810 6.7 235 0 0 0 0 324240 15:09:16 119810 10.0 235 0 0 0 0 322960 15:09:17 119810 13.3 235 0 0 0 0 318400 15:09:18 119810 16.7 2139 96 0 0 0 1095680 15:09:19 119810 20.0 12249 679 6 0 0 3521120 15:09:20 119810 23.3 235 0 0 0 0 334400 15:09:21 119810 26.7 235 0 0 0 0 326960 15:09:22 119810 30.0 235 0 0 0 0 325840 15:09:23 119810 33.3 235 0 0 0 0 320240 15:09:24 119810 36.7 235 0 0 0 0 316720 15:09:25 119810 40.0 235 0 0 0 0 317040 15:09:26 119810 43.3 235 0 0 0 0 316640 15:09:27 119810 46.7 235 0 0 0 0 317120 15:09:28 119810 50.0 235 0 0 0 0 317040 15:09:29 119810 53.3 235 0 0 0 0 317600 15:09:30 119810 56.7 13010 6448 0 0 0 15070240 15:09:31 119810 60.0 115114 56225 109 0 0 121045120 15:09:32 119810 63.3 21994 9476 265 0 0 20595040 15:09:33 119810 66.7 235 0 0 0 0 339840 15:09:34 119810 70.0 235 0 0 0 0 319760 The first column is a timestamp, the second column is number of 8k pages scanned. The fourth column is number of pages examined, which can be a smaller number due to larger pages being composed of sets of smaller pages. Locked, modified, coalesce and release I'm still foggy on ( I need to read and understand fsflush.c some more ). The final column is the time in nanoseconds that fsflush ran over the previous interval. I've saved the third column for last because I thought I was being rather clever. The third column tracks how far thru memory fsflush has scanned since fsfstat has been observing it. In this way, we can tell when fsflush has made a complete sweep thru physical memory and is starting over again.
Some Interesting ObservationsUsing the locality information gained by the percentage progress thru memory, we can see a pattern emerging in the sample data. Starting at the 20%, 56% and 90% range, we see three second bursts of activity that recurr. I don't have a definite explanation for this activity, and I wouldn't mind hearing thoughts about what's going on here. Anyway, today is a double-two-fer :) I'm just happy with how this hack turned out. -ejo (2004-07-23 13:56:44.0) Permalink Comments [5] LD_PRELOAD, or how I added timestamps to the output of anything. A pet peeve of mine forever has been the fact that the *stat family of tools don't timestamp their output. Oh sure, iostat has been granted a timestamp but vmstat, mpstat and the rest were left out in the cold. Yes, you could write a wrapper script which time stamps the beginning and the end of a ream of vmstat output but you still have to work to figure out when an event occurred in the middle of a 11,111,111 line capture of performance data. Given my need to explore the odd nooks and crannies for what is possible, I thought it might be fun to write an LD_PRELOAD shared object which interjected a timestamp at the beginning of a line of output. All I'd need to do is provide a new printf and I'd be done right? Wrong! After some messing about, it turned out that write(2) and _write() were the culprits that needed to be overridden to accomplish the mission. It also turned out that I needed to do some line buffering since I wanted the timestamp to be nice and tidy in column 1 of the output. For those of you with access to the internal network, you can find the fruit of my labor here. Those of you who don't can contact me and if my manager approves I could make it more widely available. An example using venerable vmstat: schwa:stamp-1.1 -> vmstat 1 kthr memory page disk faults cpu r b w swap free re mf pi po fr de sr f0 s0 s1 s6 in sy cs us sy id 0 0 0 1729512 1154760 0 116 1 0 0 0 0 0 0 1 0 634 1594 717 1 1 97 0 0 0 1648560 1030352 0 21 0 0 0 0 0 0 0 0 0 633 1323 466 0 0 100 0 0 0 1648560 1030352 0 12 0 0 0 0 0 0 0 0 0 585 1157 425 0 0 100 0 0 0 1648560 1030352 0 15 0 0 0 0 0 0 0 52 0 756 1143 492 0 21 79 0 0 0 1648560 1030352 0 15 0 0 0 0 0 0 0 0 0 722 1658 509 0 0 100 ^C schwa:stamp-1.1 -> setenv LD_PRELOAD stamp.so.1 schwa:stamp-1.1 -> vmstat 1 14:21:28 kthr memory page disk faults cpu 14:21:28 r b w swap free re mf pi po fr de sr f0 s0 s1 s6 in sy cs us sy id 14:21:28 0 0 0 1729504 1154744 0 116 1 0 0 0 0 0 0 1 0 634 1594 717 1 1 97 14:21:29 0 0 0 1648480 1030272 0 23 0 0 0 0 0 0 0 0 0 591 1211 415 0 0 99 14:21:30 0 0 0 1648480 1030272 0 15 0 0 0 0 0 0 0 0 0 577 1179 413 0 1 99 14:21:31 0 0 0 1648480 1030272 0 12 0 0 0 0 0 0 0 0 0 582 1240 438 0 0 100 14:21:32 0 0 0 1648480 1030272 0 15 0 0 0 0 0 0 0 0 0 584 1150 404 0 0 100 14:21:33 0 0 0 1648480 1030272 0 15 0 0 0 0 0 0 0 0 0 644 1543 492 1 0 99 14:21:34 0 0 0 1648480 1030272 0 15 0 0 0 0 0 0 0 0 0 677 1353 477 0 0 100 14:21:35 0 0 0 1648480 1030272 0 18 0 0 0 0 0 0 0 27 0 843 1644 552 0 1 99 14:21:36 0 0 0 1647280 1030032 0 325 0 0 0 0 0 0 0 0 0 777 2056 687 8 0 92 14:21:37 0 0 0 1648480 1030272 0 16 0 0 0 0 0 0 0 0 0 705 1422 526 2 0 98 14:21:38 0 0 0 1648480 1030272 0 40 0 0 0 0 0 0 0 0 0 772 2260 723 3 0 97 ^C Admittedly, not horribly clever but it scratched an itch. -ejo (2004-07-20 12:22:43.0) Permalink Comments [1] |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||