OBEY Edicts from CLUSTRON

Tuesday Aug 16, 2005

Not so long ago I was looking through Solaris's shells for memory allocators - functions that perform tasks similar to malloc(3C). These functions often store the size of the allocated block at the beginning of each block; if that size is stored as a 4-byte value, the pointer returned by the allocator may not be aligned on an 8-byte boundary. This is a major problem on SPARC, because it's not uncommon to allocate structs or unions containing types that require 8-byte alignment, especially long long. As it turns out, gcc correctly assumes that long long variables are aligned on 8-byte boundaries and uses the ldd and std instructions to access them; those instructions trap if the address isn't 8-byte aligned. Our Studio compiler doesn't make that assumption; it always issues two ld or st instructions, which need only 4-byte alignment. The result is that programs using this kind of allocator can crash when built with gcc but not with Studio, which is not a pleasant condition.
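To make the failure mode concrete, here is a minimal sketch of the kind of allocator I mean. It is not code from any of the shells; toy_alloc, misalignment, the arena, and the hdr parameter are all made up for illustration, and it assumes a C11 compiler for _Alignas. It is a trivial bump allocator that writes a 4-byte size at the front of each block and returns the address just past a header of hdr bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical arena, explicitly 8-byte aligned (C11 _Alignas). */
static _Alignas(8) unsigned char arena[1024];
static size_t arena_used;

/* Round n up to the next multiple of 8. */
#define ROUNDUP8(n) (((n) + (size_t)7) & ~(size_t)7)

/*
 * Toy bump allocator (illustrative only). Each block starts with a
 * 4-byte size field; hdr is the total header length in bytes, so
 * hdr == 4 models the buggy layout and hdr == 8 the padded one.
 */
static void *toy_alloc(size_t size, size_t hdr)
{
    size_t total = ROUNDUP8(hdr + size);
    unsigned char *base = arena + arena_used;
    uint32_t stored = (uint32_t)size;

    assert(hdr >= sizeof stored);
    assert(arena_used + total <= sizeof arena);

    memcpy(base, &stored, sizeof stored);  /* size lives in the header */
    arena_used += total;                   /* next block starts 8-aligned */
    return base + hdr;                     /* caller's pointer */
}

/* How far p is from the previous 8-byte boundary (0 means aligned). */
static unsigned misalignment(const void *p)
{
    return (unsigned)((uintptr_t)p % 8);
}
```

With a 4-byte header, every pointer toy_alloc hands back sits 4 bytes past an 8-byte boundary, which is exactly where an ldd or std on a long long member would trap. Padding the header to 8 bytes makes misalignment() report 0 for every returned pointer.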

As part of my search, I found that, indeed, the Bourne and Korn shells have some alignment problems. Though these are bugs, we've decided that there's no reliable way to find all possible bugs of this type, so we fixed the ones we found and also worked around the problem in the compiler. This is, if nothing else, a good argument against compilers that "help" programmers by covering up this kind of error. But the best prize of all wasn't the kind of problem I was looking for, but rather this gem from the C shell:

        showall(av);
        printf("i=%d: Out of memory\n", i);
        chdir("/usr/bill/cshcore");
        abort();

This is the systems programming equivalent of finding a live wooly mammoth contentedly smoking a cigar in your recliner. Unfortunately, there's no way to trigger this behaviour, as it's protected by the "debug" preprocessor symbol, which we never set in a normal build. Nevertheless, thanks to OpenSolaris, you can see it for yourself.

We harp incessantly on the need to be able to debug production code, with no recompilation needed, and there are a number of better ways to debug this particular condition. For example, you could use the DTrace pid provider to stop a csh process when nomem() is called, and even provide a backtrace. If that weren't enough, you could then use mdb(1) to debug the problem in greater detail, or gcore(1) to produce a core dump. But the best part, the real joy, if you'll pardon the pun, is the chdir call. Clearly the purpose was to drop core in a predictable location for later analysis by the author. I think you'll find that coreadm(1M), along with other corefile improvements, offers a far more flexible and powerful way to accomplish this - and it complements nicely the other debugging strategies I mentioned above.

Tuesday Aug 02, 2005

Tuesday and Wednesday nights (after the extravaganza on Tuesday and the OpenSolaris BOF on Wednesday) we'll be convening for potent beverages, good food, and unique and amusing company. I'll be at the Lloyd Center DoubleTree in downtown Portland, OR, room 1560. Expect other OpenSolaris personalities to be present. Laura tells me that souvenir shot glasses are among the after-party swag collection, so don't miss out.

Monday Aug 01, 2005

Those of you in or near Portland, Oregon are encouraged to come and see us at OSCON this week. Most of the conference is at the Convention Center this year (use the helpfully-named Convention Center train stop). Sun will have a booth in the exhibit hall starting Wednesday, and we're giving a few talks as well. In particular, join Bryan and me for a free tutorial on building, installing, and developing with OpenSolaris using DTrace, mdb, and more. That will be held Tuesday at 1:30pm in room D140. Then on Wednesday, I'll be giving a short talk on the status of OpenSolaris at 2:35pm in Portland/255, and we'll have a BOF at 8:30pm. Thursday, don't miss Bryan's short talk on DTrace at 4:30pm.

Even if you can't make the conference, you're welcome to join me for a beer. Send me mail at wesolows at eng dot sun dot com if you're interested, or leave a message for me at the 5th Avenue Suites.