inside the sausage factory Adam Leventhal's Weblog
Adam Leventhal, Fishworks engineer

Thursday Jun 30, 2005

Thanks to everyone who attended Jarod's and my talk this afternoon at JavaOne and especially to those who had to go to the overflow room. The rest of the DTrace team and I were thrilled by the turnout -- we would never have thought that 900+ Java developers would be so interested in DTrace. We spent the next few hours hashing out some ways to get better support for Java in DTrace; we look forward to giving an update at next year's conference.

As promised, here are the slides from the talk:

To get started with the dvm provider on your Solaris 10 system, download the agent here. You may need to set your LD_LIBRARY_PATH to point to wherever you install it. Then invoke java with the -Xrundvmti:all option.

I wasn't able to capture the command history as was requested, but Bryan wrote up a nice post which can be used for the first part of the talk, and here are some of the java-specific commands from today.

# dtrace -n dvm`pgrep java`:::method-entry'{ @[copyinstr(arg0), copyinstr(arg1)] = count() }'

# dtrace -n dvm`pgrep java`:::object-alloc'{ @[jstack(20, 8000)] = count() }'

# dtrace -n dvm`pgrep java`:::object-alloc'/copyinstr(arg0) == "java/awt/Rectangle"/{}'

Try mixing and matching. Check out the Solaris Dynamic Tracing Guide for the exhaustive reference guide. If you have questions or want to figure out some more scripts, feel free to post a question here or -- even better -- on the DTrace forum on opensolaris.org.


Technorati Tags:

Tuesday Jun 28, 2005

In case you missed the DTrace demo during the JavaOne keynote, you can view it in RealPlayer format here. To skip to the DTrace stuff, jump 1:12:30 in. I haven't actually been able to bring myself to watch it -- I'm too self-conscious -- so feel free to comment on it. Things to keep in mind when watching: I can type better when I'm not in front of 7,000 people and I'm well aware that I need a hair cut.

That guy next to me is John Loiacono (not to be confused with Laocoon, the tragic figure of Roman epic).


Technorati Tags:

Monday Jun 27, 2005

This morning I gave a demo of DTrace with the Java agents during the keynote at JavaOne. In the past few hours I've had a lot of great feedback from Java developers -- we've found a bunch of big performance wins already, and I expect we'll find more this week (remember the DTrace challenge). For the demo, I ran /usr/java/demo/jfc/Java2D/Java2Demo.jar with the Java DTrace agents enabled and ran a couple of scripts on it.

The first script just gathered a frequency count for each method invoked -- nothing too subtle:

jmethod.d

#!/usr/sbin/dtrace -s

dvm$1:::method-entry
{
        @[copyinstr(arg0), copyinstr(arg1)] = count();
}

END
{
        printa("%-10@u %s.%s()\n", @);
}

bash-3.00# dtrace -s jmethods.d `pgrep java`
...
574        sun/java2d/SunGraphics2D.getCompClip()
608        sun/java2d/pipe/Region.dimAdd()
648        java/lang/ref/Reference.get()
671        java/awt/geom/AffineTransform.transform()
685        java/awt/Component.getParent_NoClientCode()
685        java/awt/Component.getParent()
702        sun/misc/VM.addFinalRefCount()
798        java/lang/ref/ReferenceQueue.remove()
809        java/lang/ref/Reference.access$200()
923        java/awt/geom/RectIterator.isDone()
1228       sun/dc/pr/Rasterizer.nextTile()
1657       sun/dc/pr/Rasterizer.getTileState()
1692       sun/java2d/pipe/AlphaColorPipe.renderPathTile()
1692       sun/java2d/pipe/AlphaColorPipe.needTile()
1702       sun/java2d/SunGraphics2D.getSurfaceData()
3457       java/lang/Math.min()
^C

The second demo was a little more exciting: this guy followed a thread of control all the way from Java code through the native library code, the system calls, and all the kernel function calls to the lowest levels of the system. Each different layer of the stack is annotated with color -- the first use of color in a DTrace script as far as I know.

follow.d

#!/usr/sbin/dtrace -s

/*
 * This script was used for the DTrace demo during the JavaOne keynote.
 *
 * VT100 escape sequences are used to produce multi-colored output from
 * dtrace(1M). Pink is Java code, red is library code, blue is system calls,
 * and green is kernel function calls.
 */

#pragma D option quiet

dvm$1:::method-entry
/copyinstr(arg0) == "sun/java2d/pipe/AlphaColorPipe" &&
 copyinstr(arg1) == "renderPathTile"/
{
        self->interested = 1;
        self->depth = 8;
}

dvm$1:::method-entry
/self->interested/
{
        printf("\033[01;35m%*.*s -> %s.%s\033[0m\n",
            self->depth, self->depth, "",
            copyinstr(arg0), copyinstr(arg1));
        self->depth += 2;
}

dvm$1:::method-return
/self->interested/
{
        self->depth -= 2;
        printf("\033[01;35m%*.*s <- %s.%s\033[0m\n",
            self->depth, self->depth, "",
            copyinstr(arg0), copyinstr(arg1));
}

dvm$1:::method-return
/self->interested &&
 copyinstr(arg0) == "sun/java2d/pipe/AlphaColorPipe" &&
 copyinstr(arg1) == "renderPathTile"/
{
        self->interested = 0;
        self->depth = 0;
        exit(0);
}

pid$1:::entry
/self->interested && probemod != "libdvmti.so"/
{
        printf("\033[01;31m%*.*s -> %s`%s\033[0m\n",
            self->depth, self->depth, "",
            probemod, probefunc);
        self->depth += 2;
}

pid$1:::return
/self->interested && probemod != "libdvmti.so"/
{
        self->depth -= 2;
        printf("\033[01;31m%*.*s <- %s`%s\033[0m\n",
            self->depth, self->depth, "",
            probemod, probefunc);
}

syscall:::entry
/self->interested/
{
        printf("\033[01;34m%*.*s => %s\033[0m\n",
            self->depth, self->depth, "",
            probefunc);
        self->depth += 2;
}

syscall:::return
/self->interested/
{
        self->depth -= 2;
        printf("\033[01;34m%*.*s <= %s\033[0m\n",
            self->depth, self->depth, "",
            probefunc);
}

fbt:::entry
/self->interested/
{
        printf("\033[32m%*.*s -> %s\033[0m\n",
            self->depth, self->depth, "",
            probefunc);
        self->depth += 2;
}

fbt:::return
/self->interested/
{
        self->depth -= 2;
        printf("\033[32m%*.*s <- %s\033[0m\n",
            self->depth, self->depth, "",
            probefunc);
}

bash-3.00# dtrace -s follow.d `pgrep java`
     -> sun/java2d/pipe/AlphaColorPipe.renderPathTile
       -> copyout
       <- kcopy
       -> sun/java2d/SunGraphics2D.getSurfaceData
...
           <- libc.so.1`_lwp_cond_signal
         <- libjvm.so`__1cNObjectMonitorEexit6MpnGThread__v_
       <- libjvm.so`__1cSObjectSynchronizerIjni_exit6FpnHoopDesc_pnGThread__v_
     <- libjvm.so`jni_MonitorExit
   <- libawt.so`Java_sun_java2d_loops_MaskFill_MaskFill
<- sun/java2d/pipe/AlphaColorPipe.renderPathTile

Now you can give a keynote demo of your very own.


Technorati Tags:

Sunday Jun 26, 2005

I'll be at JavaOne this week demonstrating the new Java/DTrace mash-up. At Monday morning's keynote, I get a 100 second talk-within-a-talk (which itself is a talk-within-a-talk). On Monday evening, Stephen, Mike, Bryan, and I (and possibly some other Solaris folks) will be holding a BoF on OpenSolaris, DTrace and what they mean for Java. We'll be in Golden Gate B1 at the Marriott Hotel from 7:30pm to 8:30pm. I'll be hanging around one of the DTrace or Solaris booths whenever I can to watch Jarod Jenson getting performance wins in Java applications and giving away iPods if he can't (I'm betting he gives away between zero and one iPods). Feel free to come by the booth or stop me in the halls for a demo or to talk about DTrace.


Technorati Tags:

Friday Jun 17, 2005

Something kind of crazy is going on. I got a call from Chris; "Check out what we're going to do at JavaOne," he said. Here's the deal: come to JavaOne, bring your Java application (on a usb device or CD or whatever), and we'll[1] improve your application using DTrace. Or you get a free tchotchke -- possibly one of several iPods. Usually you only see this kind of offer at furniture stores, and there's some catch, but this is the real deal: either your application goes faster because of our clever DTrace trickery or you're up some whizzy gadget. I was dubious until I saw the official announcement.

On the subject of JavaOne, this seems as good a time as any to plug the session I'm giving on Thursday at 1:15pm in Room 121/122 (TS-5211). I'll be talking about the same technology we're (apparently) going to be using on your application.

See you at JavaOne; bring your app.


[1] For some value of the term we -- I'm guessing I am going to be party to this insanity...


Technorati Tags:

Thursday Jun 16, 2005

The OpenSolaris launch has been pretty fun -- I've already had some discussions with customers over the source code. Of course, the first thing people seemed to do with the source code is look for references to "shit" and "fuck". This was titillating to be sure. Unsatisfied with the cheap laugh, ZDNet wanted to draw some conclusions from the profanity:

The much-vaunted dynamic tracing (dtrace) feature of Sun's system may not be as safe to use as most people think.

"This bit me in the ass a couple of times, so lets toss this in as a cursory sanity check," wrote one careful developer in the dtrace section.

I wrote that code in October of 2002. For those of you keeping score at home, that's almost a year before DTrace integrated into Solaris 10 and more than two years before Solaris 10 hit the streets. Here's the larger context of that comment:

    923 	/*
    924 	 * This bit me in the ass a couple of times, so lets toss this
    925 	 * in as a cursory sanity check.
    926 	 */
    927 	ASSERT(pc != rp->r_g7 + 4);
    928 	ASSERT(pc != rp->r_g7 + 8);

This gets pretty deep into the bowels of the pid provider, but the code preceeding these ASSERTs does the work of modifying the registers appropriately to emulate the effects of the traced instruction. For most instructions, we relocate the instruction bits to some per-thread scratch space and execute it there. We keep this scratch space in the user-land per-thread structure which, on SPARC, is always pointed to by the %g7 register (rp->r_g7 in the code above). The tricky thing is that while we change the program counter (%pc) to point to the scratch space, we leave the next program counter (%npc) where it was.

A bug I ran into very early in development was winding up executing the wrong code because I had incorrectly emulated an instruction. One way this manifested itself was that the program counter was set to %g7 + 4 or %g7 + 8. I added these ASSERTs after tracking down the problem -- not because it was a condition that I thought should be handled, but because I wanted everything to stop immediately if it did.

In the nearly three years this code has existed, those ASSERTs have never been tripped. Of course, I didn't expect them to be -- they were a cursory sanity check so I could be sure this aberrant condition wasn't occuring. Of course, if I had omitted the curse this might not have inspired such a puerile thrill.


Technorati Tags:

Tuesday Jun 14, 2005

I think the thing I love most about debugging software is that each tough bug can seem like an insurmountable challenge -- until you figure it out. But until you do, each tough bugs is the hardest problem you've ever had to solve. There are weeks when every morning presents me with a seemingly impossible challenge, and each afternoon I get to spike my keyboard and do a little victory dance before running for the train.

For my first OpenSolaris blog post, I thought I talk about one of my favorite such bugs. This particularly nasty bug had to do with a tricky interaction between VMware and DTrace (pre-production versions of each to be clear). My buddy Keith -- a fellow Brown CS Grad -- gave me a call and told me about some strange behavior he was seeting running DTrace inside of a VMware VM. Keith is a big fan of DTrace, but an intermittant, but reproducable problem was putting a damper on his DTrace enthusiasm. Every once in a while, running DTrace would cause the system to freeze. Because Solaris was running in the virtual environment, he could see that both virtual CPUs where spinning away, but they weren't making any forward progress. After a couple of back and forths over email, I made the trip down to Palo Alto so we could work on the problem together.

Using some custom debugging tools, we were able to figure out where the two virtual CPUs were spinning. One CPU was in xc_common() and the other was in xc_serv() -- code to handle cross calls. So what was going on?

cross calls

Before I can really delve into the problem, I want to give just a brief overview of cross calls. In general terms, a cross call (xcall) is used in a multi-processor (MP) system when one CPU needs to get another CPU to do some work. It works by sending a special type of interrupt which the remote CPU handles. You may have heard the term interprocessor interrupt (IPI) -- same thing. One example of when xcalls are used is when unmapping a memory region. To unmap a region, a process will typically call the munmap(2) system call. Remember that in an MP system, any processor may have run threads in this process so those mappings may be present in that any CPU's TLB. The unmap operation executes on one CPU, but the other CPUs need to remove the relevant mappings from their own TLBs. To accomplish this communication, the kernel uses a xcall.

DTrace uses xcalls synchronize data used by all CPUs by ensuring that all CPUs have reached a certain point in the code. DTrace executes actions with interrupts disabled (an explanation of why this must be so is well beyond the bounds of this discussion) so we can tell that a CPU isn't in probe context if its able to handle our xcall. When DTrace is stopping tracing activity, for example, it will update some data that affects all CPUs and then use a xcall to make sure that every CPU has seen its effects before proceeding:

dtrace_state_stop()

10739    	/*
10740    	 * We'll set the activity to DTRACE_ACTIVITY_DRAINING, and issue a sync
10741    	 * to be sure that every CPU has seen it.  See below for the details
10742    	 * on why this is done.
10743    	 */
10744    	state->dts_activity = DTRACE_ACTIVITY_DRAINING;
10745    	dtrace_sync();

dtrace_sync() sends a xcall to all other CPUs and has them spin in a holding pattern until all CPUs have reached that point at which time the CPU which sent the xcall releases them all (and they go back to whatever they had been doing when they received the interrupt). That's the high level overview; let's go into a little more detail on how xcalls work (well, actually a lot more detail).

xcall implementation

If you follow the sequence of functions called by dtrace_sync() (and I encourage you to do so), you'll find that this eventually calls xc_common() to do the heavy lifting. It's important to note that this call to xc_common() will have the sync argument set to 1. What's that mean? In a text book example of good software engineering, someone did a good job documenting what this value means:

xc_common()

 411    /*
 412     * Common code to call a specified function on a set of processors.
 413     * sync specifies what kind of waiting is done.
 414     *	-1 - no waiting, don't release remotes
 415     *	0 - no waiting, release remotes immediately
 416     *	1 - run service locally w/o waiting for remotes.
 417     *	2 - wait for remotes before running locally
 418     */
 419    static void
 420    xc_common(
 421    	xc_func_t func,
 422    	xc_arg_t arg1,
 423    	xc_arg_t arg2,
 424    	xc_arg_t arg3,
 425    	int pri,
 426    	cpuset_t set,
 427    	int sync)

Before you start beating your brain out trying to figure out what you're missing here, in this particular case, this destinction bewteen sync having the value of 1 and 2 is nil: the service (function pointer specified by the func argument) that we're running is dtrace_sync_func() which does literally nothing.

Let's start picking apart xc_common():

xc_common()

 446    	/*
 447    	 * Request service on all remote processors.
 448    	 */
 449    	for (cix = 0; cix < NCPU; cix++) {
 450    		if ((cpup = cpu[cix]) == NULL ||
 451    		    (cpup->cpu_flags & CPU_READY) == 0) {
 452    			/*
 453    			 * In case CPU wasn't ready, but becomes ready later,
 454    			 * take the CPU out of the set now.
 455    			 */
 456    			CPUSET_DEL(set, cix);
 457    		} else if (cix != lcx && CPU_IN_SET(set, cix)) {
 458    			CPU_STATS_ADDQ(CPU, sys, xcalls, 1);
 459    			cpup->cpu_m.xc_ack[pri] = 0;
 460    			cpup->cpu_m.xc_wait[pri] = sync;
 461    			if (sync > 0)
 462    				cpup->cpu_m.xc_state[pri] = XC_SYNC_OP;
 463    			else
 464    				cpup->cpu_m.xc_state[pri] = XC_CALL_OP;
 465    			cpup->cpu_m.xc_pend[pri] = 1;
 466    			send_dirint(cix, xc_xlat_xcptoipl[pri]);
 467    		}
 468    	}

We take a first pass through all the processors; if the processor is ready to go and is in the set of processors we care about (they all are in the case of dtrace_sync()) we set the remote CPU's ack flag to 0, it's wait flag to sync (remember, 1 in this case), and it's state flag to XC_SYNC_OP and then actually send the interrupt with the call to send_dirint().

Next we wait for the remote CPUs to acknowledge that they've executed the requested service which they do by setting the ack flag to 1:

xc_common()

 479    	/*
 480    	 * Wait here until all remote calls complete.
 481    	 */
 482    	for (cix = 0; cix < NCPU; cix++) {
 483    		if (lcx != cix && CPU_IN_SET(set, cix)) {
 484    			cpup = cpu[cix];
 485    			while (cpup->cpu_m.xc_ack[pri] == 0) {
 486    				ht_pause();
 487    				return_instr();
 488    			}
 489    			cpup->cpu_m.xc_ack[pri] = 0;
 490    		}
 491    	}

That while loop spins waiting for ack to become 1. If you look at the definition of return_instr() it's name is actually more descriptive that you might imagine: it's just a return instruction -- the most trivial function possible. I'm not absolutely certain, but I think it was put there so the compiler wouldn't "optimize" the loop away. The call to the inline function ht_pause() is so that the thread spins in such a way that's considerate on an hyper-threaded CPU. The call to ht_pause() is probably sufficient to prevent the compiler from being overly clever, but the legacy call to return_instr() remains.

Now let's look at the other side of this conversation: what happens on a remote CPU as a result of this interrupt? This code is in xc_serv()

xc_serv()

 138    	/*
 139    	 * Acknowledge that we have completed the x-call operation.
 140    	 */
 141    	cpup->cpu_m.xc_ack[pri] = 1;
 142    

I'm sure it comes as no surprise that after executing the given function, it just sets the ack flag.

Since in this case we're dealing with a synchronous xcall, the remote CPU then needs to just chill out until the CPU that initiated the xcall discovers that all remote CPUs have executed the function and are ready to be released:

xc_serv()

 146    	/*
 147    	 * for (op == XC_SYNC_OP)
 148    	 * Wait for the initiator of the x-call to indicate
 149    	 * that all CPUs involved can proceed.
 150    	 */
 151    	while (cpup->cpu_m.xc_wait[pri]) {
 152    		ht_pause();
 153    		return_instr();
 154    	}
 155    
 156    	while (cpup->cpu_m.xc_state[pri] != XC_DONE) {
 157    		ht_pause();
 158    		return_instr();
 159    	}

And here's the code on the initiating side that releases all the remote CPUs by setting the wait and state flags to the values that the remote CPUs are waiting to see:

xc_common()

 502    	/*
 503    	 * Release any waiting CPUs
 504    	 */
 505    	for (cix = 0; cix < NCPU; cix++) {
 506    		if (lcx != cix && CPU_IN_SET(set, cix)) {
 507    			cpup = cpu[cix];
 508    			if (cpup != NULL && (cpup->cpu_flags & CPU_READY)) {
 509    				cpup->cpu_m.xc_wait[pri] = 0;
 510    				cpup->cpu_m.xc_state[pri] = XC_DONE;
 511    			}
 512    		}
 513    	}

there's a problem

Wait! Without reading ahead in the code, does anyone see the problem?

Back at VMware, Keith hacked up a version of the virtual machine monitor which allowed us to trace certain points in the code and figure out the precise sequence in which they occurred. We traced the entry and return to xc_common() and xc_serv(). Almost every time we'd see something like this:

  • enter xc_common() on CPU 0
  • enter xc_serv() on CPU 1
  • exit xc_serv() on CPU 1
  • exit xc_common() on CPU 0

or:

  • enter xc_common() on CPU 0
  • enter xc_serv() on CPU 1
  • exit xc_common() on CPU 0
  • exit xc_serv() on CPU 1

But the problem happened when we saw a sequence like this:

  • enter xc_common() on CPU 0
  • enter xc_serv() on CPU 1
  • exit xc_common() on CPU 0
  • enter xc_common() on CPU 0

And nothing futher. What was happening was that after releasing remote CPUs, CPU 0 was exiting from the call to xc_common() and calling it again before the remote invocation of xc_serv() on the other CPU had a change to exit.

Recall that one of the the first things that xc_common() does is set the state flag. If the first call to xc_common() sets the state flag to release the remote CPU from xc_sync(), but when things go wrong, xc_common() was overwriting that flag before the remote CPU got a change to see it.

the problem

We were seeing this repeatably under VMware, but no one had seen this at all on real hardware (yet). The machine Keith and I were using was a 2-way box running Linux. On VMware, each virtual CPU is represented by a thread on the native OS so rather than having absolute control of the CPU, the execution was more or less at the whim of the Linux scheduler.

When this code is running unadulterated on physical CPUs, we won't see this problem. It's just a matter of timing -- the remote CPU has many many fewer instructions to execute before the state flag gets overwritten by a second xcall so there's no problem. On VMware, the Linux scheduler might decide that's your second virtual CPU's right to the physical CPU is trumped by moving the hands on your xclock (why not?) so there are no garuantees about how long these operations can take.

the fix

There are actually quite a few ways to fix this problem -- I'm sure you can think of at least one or two off the top of your head. We just need to make sure that subsequent xcalls can't interfere with each other. When we found this, Solaris 10 was wrapping up -- we were still making changes, but only those deemed of the absolute highest importance. Making changes to the xcall code (which is rather delicate and risky to change) for a bug that only manifests itself on virtual hardware (and which VMware could work around using some clever trickery[1]) didn't seem worthy of being designated a show- stopper.

Keith predicted a few possible situations where this same bug could manifest itself on physical CPUs: on hyper-threaded CPUs, or in the presence of service management interrupts. And that prediction turned out to be spot on: a few weeks after root causing the bug under VMware, we hit the same problem on a system with four hyper-threaded chips (8 logical CPUs).

Since at that time we were even closer to shipping Solaris 10, I chose the fix I thought was the safest and least likely to have nasty side effects. After releasing remote CPUs, the code in xc_common() would now wait for remote CPUs to check in -- wait for them to acknowledge receipt of the directive to proceed.

xc_common()

 515    	/*
 516    	 * Wait for all CPUs to acknowledge completion before we continue.
 517    	 * Without this check it's possible (on a VM or hyper-threaded CPUs
 518    	 * or in the presence of Service Management Interrupts which can all
 519    	 * cause delays) for the remote processor to still be waiting by
 520    	 * the time xc_common() is next invoked with the sync flag set
 521    	 * resulting in a deadlock.
 522    	 */
 523    	for (cix = 0; cix < NCPU; cix++) {
 524    		if (lcx != cix && CPU_IN_SET(set, cix)) {
 525    			cpup = cpu[cix];
 526    			if (cpup != NULL && (cpup->cpu_flags & CPU_READY)) {
 527    				while (cpup->cpu_m.xc_ack[pri] == 0) {
 528    					ht_pause();
 529    					return_instr();
 530    				}
 531    				cpup->cpu_m.xc_ack[pri] = 0;
 532    			}
 533    		}
 534    	}

In that comment, I tried to summarize in 6 lines what has just taken me several pages to describe. And maybe I should have said "livelock" -- oh well. Here's the complementary code in xc_serv():

xc_serv()

 170    	/*
 171    	 * Acknowledge that we have received the directive to continue.
 172    	 */
 173    	ASSERT(cpup->cpu_m.xc_ack[pri] == 0);
 174    	cpup->cpu_m.xc_ack[pri] = 1;

conclusions

That was one of my favorite bugs to work on, and it's actually fairly typical of a lot of the bugs I investigate: something's going wrong; figure out why. I think the folks who work on Solaris tend to love that kind of stuff as a rule. We spend tons of time building facilities like DTrace, mdb(1), kmdb, CTF, fancy core files, and libdis so that the hart part of investigating mysterious problems isn't gathering data or testing hypotheses, it's thinking of the questions to answer and inventing new hypotheses. It's my hope that OpenSolaris will attract those types of inquisitive minds that thrive on the (seemingly) unsolvable problem.


[1] This sort of problem is hardly unique to DTrace or to Solaris. Apparently (and not surprisingly) there are problems like this in nearly every operating system where the code implicitly or explicitly relies on the relative timing of certain operations. In these cases, VMware has hacks to do things like execute the shifty code in lock step.

Technorati Tag:
Technorati Tag:
Technorati Tag:
Technorati Tag: