Monday Jun 20, 2005

My friend Tim Haley has a handwritten sign in his office. It says, quite simply, "Don't assume." I don't know the original context, but since Tim was part of the core release staff for Solaris 9, I reckon it had something to do with the whole "innocent until proven guilty" concept that the dwindling minority still embraces.

But I'll twist it around, and tackle an assumption near and dear to my own heart right now. From the first time we learn to program in a high level language, most of us go through the "edit, compile, test" cycle, often ad nauseum. And every time the "test" step fails, we go all the way back to "edit," because the problem is always in the code we wrote.

That's wrong, of course. Compilers are programs, too, and sometimes they've got mistakes. Late in the Solaris 10 development cycle, I got drafted to help out one of the Solaris Volume Manager project teams investigate one of those dreaded, logically impossible bugs. (I was part of the SVM team, but hadn't been working on this particular project.) Basically, along with some other output, the metarecover(1M) command was spitting this out:

The following soft partitions were found and will be added to
your metadevice configuration.
 Name            Size     No. of Extents
    d8202          204800         1

...which is all well and good, except for the fact that the value d8202 really was impossible. Because SVM keeps the 13-bit metadevice number and the 5-bit set number encoded in an 18-bit minor number. And, as any calculator-savvy high school student can confirm, (8202 > 8191).

But maybe we were just printing out the whole value, and not masking off the set number. That would explain things quite nicely, and the fix would be easy. On to the code:

	/* display configuration updates */
	(void) printf(dgettext(TEXT_DOMAIN,
	    "The following soft partitions were found and will be added to\n"
	    "your metadevice configuration.\n"));
	(void) printf("%5s %15s %18s\n",
	    dgettext(TEXT_DOMAIN, "Name"),
	    dgettext(TEXT_DOMAIN, "Size"),
	    dgettext(TEXT_DOMAIN, "No. of Extents"));
	for (i = 0; i < num_sps; i++) {
		(void) printf("%5s%lu %15llu %9d\n", "d",
		    MD_MIN2UNIT(MD_SID(un_array[i])),
		    un_array[i]->un_length, un_array[i]->un_numexts);
	}

Dang! I was wrong. There it is, clear as day. The MD_MIN2UNIT macro did the masking:

	#define NBITSMINOR32 18
	#define MD_BITSSET 5
	#define MD_SET_SHIFT (NBITSMINOR32 - MD_BITSSET)
	#define MD_MAXUNITS (1 << MD_SET_SHIFT)
	#define MD_UNIT_MASK (MD_MAXUNITS - 1)
	#define MD_MIN2UNIT(m) ((m) & MD_UNIT_MASK)

Good science fiction readers will recognize this scenario. Two contradictory facts, with no real middle ground. So we use Occam's Razor to pare away the alternatives, until only one remains. It will seem improbable, until viewed in the context of the invalid assumptions we have made. In this case, our two facts are the code and the output.

Did the code that we compiled really match the snippet above? Go to the bug report, find out when the error was reported. Verify that it can be reproduced reliably. Check the history of the source code. Sure enough, it happens every time, but only on one biweekly build, and the source code in question never changed.

We already verified the second fact, the output, in the previous paragraph. So what's left in between? Only the compiled code. Just to refresh our memory, we started with

		(void) printf("%5s%lu %15llu %9d\n", "d",
		    MD_MIN2UNIT(MD_SID(un_array[i])),
		    un_array[i]->un_length, un_array[i]->un_numexts);

...which, after preprocessing, correctly yields

		(void) printf("%5s%lu %15llu %9d\n", "d",
		     ( ( ( ( un_array [ i ] ) -> c . un_self_id ) ) & ( ( 1 << ( 18 - 5 ) ) - 1 ) ),
		    un_array[i]->un_length, un_array[i]->un_numexts);

But, when we disassemble the code from a system evidencing this failure:

libmeta.so.1`meta_sp_recover_from_wm+0x640:     sethi     %hi(0x3000), %l7
libmeta.so.1`meta_sp_recover_from_wm+0x644:     add       %l7, 0x2c0, %i5
libmeta.so.1`meta_sp_recover_from_wm+0x648:     sethi     %hi(0x3000), %l4
libmeta.so.1`meta_sp_recover_from_wm+0x64c:     ld        [%l3 + %i5], %i4
libmeta.so.1`meta_sp_recover_from_wm+0x650:     add       %l4, 0x2bc, %l5
libmeta.so.1`meta_sp_recover_from_wm+0x654:     clr       %i2
libmeta.so.1`meta_sp_recover_from_wm+0x658:     ld        [%sp + 0x7c], %i5
libmeta.so.1`meta_sp_recover_from_wm+0x65c:     ld        [%l3 + %l5], %l5
libmeta.so.1`meta_sp_recover_from_wm+0x660:     ld        [%i5], %o5
libmeta.so.1`meta_sp_recover_from_wm+0x664:     ld        [%o5 + 0x18], %o2
libmeta.so.1`meta_sp_recover_from_wm+0x668:     mov       %l5, %o0
libmeta.so.1`meta_sp_recover_from_wm+0x66c:     add       %i2, 1, %i2
libmeta.so.1`meta_sp_recover_from_wm+0x670:     ld        [%o5 + 0x78], %o3
libmeta.so.1`meta_sp_recover_from_wm+0x674:     add       %i5, 4, %i5
libmeta.so.1`meta_sp_recover_from_wm+0x678:     ld        [%o5 + 0x7c], %o4
libmeta.so.1`meta_sp_recover_from_wm+0x67c:     ld        [%o5 + 0x74], %o5
libmeta.so.1`meta_sp_recover_from_wm+0x680:     
call      +0x40644      

The interesting lines are 0x664, 0x670, and 0x678, which load the non-local-string printf() arguments into %o2, %o3, and %o4. They treat un_self_id, un_length, and un_numexts identically, with none of the masking specified by the MD_MIN2UNIT() macro (and shown to be happening correctly in the preprocessor expansion above).

Finally, from a stock on10 system that does NOT fail similarly:

libmeta.so.1`meta_sp_recover_from_wm+0x508:     sethi     %hi(0x2c00), %o3
libmeta.so.1`meta_sp_recover_from_wm+0x50c:     ld        [%sp + 0x7c], %l1
libmeta.so.1`meta_sp_recover_from_wm+0x510:     sethi     %hi(0x2c00), %o2
libmeta.so.1`meta_sp_recover_from_wm+0x514:     add       %o3, 0x70, %o0
libmeta.so.1`meta_sp_recover_from_wm+0x518:     ld        [%i2 + %o0], %i5
libmeta.so.1`meta_sp_recover_from_wm+0x51c:     add       %o2, 0x6c, %o1
libmeta.so.1`meta_sp_recover_from_wm+0x520:     sethi     %hi(0x1c00), %g1
libmeta.so.1`meta_sp_recover_from_wm+0x524:     ld        [%i2 + %o1], %l3
libmeta.so.1`meta_sp_recover_from_wm+0x528:     clr       %l0
libmeta.so.1`meta_sp_recover_from_wm+0x52c:     add       %g1, 0x3ff, %l2
libmeta.so.1`meta_sp_recover_from_wm+0x530:     ld        [%l1], %i0
libmeta.so.1`meta_sp_recover_from_wm+0x534:     ld        [%i0 + 0x18], %l7
libmeta.so.1`meta_sp_recover_from_wm+0x538:     mov       %l3, %o0
libmeta.so.1`meta_sp_recover_from_wm+0x53c:     add       %l0, 1, %l0
libmeta.so.1`meta_sp_recover_from_wm+0x540:     ld        [%i0 + 0x78], %o3
libmeta.so.1`meta_sp_recover_from_wm+0x544:     add       %l1, 4, %l1
libmeta.so.1`meta_sp_recover_from_wm+0x548:     ld        [%i0 + 0x74], %o5
libmeta.so.1`meta_sp_recover_from_wm+0x54c:     and       %l7, %l2, %o2
libmeta.so.1`meta_sp_recover_from_wm+0x550:     ld        [%i0 + 0x7c], %o4
libmeta.so.1`meta_sp_recover_from_wm+0x554:     
call      +0x3967c        

It's doing the masking we would expect; 0x1c00 + 0x3ff into %l2, un_self_id into %l7, masking them together into %o2.

We're almost home. We've explained the failure, but not its transient nature. In this case, further investigation revealed use of a transitional compiler during this build. How did I figure that out? The build logs generated by the nightly build script explicitly output the version information for all of the compilers. A cross check of the before, during, and after logs confirmed the transition.

I admit that this is a trivial example of debugging, and that I glossed over some of the details of the compiler switch (in fact, I left out information about which compiler was to blame, and whether this was sparc or x86...) I really wanted to emphasize two points. First: don't assume. Don't jump to conclusions about what's right and wrong, or your mind will be closed, and you'll misinterpret evidence. Second: keep detailed records. I was able to access tons of useful information during this investigation: the binary products of every build, the compiler history, the bug report detailing the symptoms and the environment, the version control log of the code, etc. Take away any part of that, and the difficulty of solving the problem increases dramatically.