Dan Stole the Subject for My Farewell
I had been thinking of what subject line to use for my farewell post, as my internship has come to an end - back to school. Halfway through last week I picked the perfect farewell subject line. Unfortunately, Dan used the Hitchhiker's Guide reference I'd settled on when he left last Friday. He should be most of the way back to the east coast now (he's driving). I've got a plane to catch early Tuesday morning. I started thinking about Spinal Tap references for the title, but nothing quite fitting came to mind.
My project is pretty much wrapped up: the code's cleaned up and is now getting some wider attention from people other than Steve. While waiting for feedback, I spent some time looking at a P2 panic that appeared Wednesday morning, which revealed an odd API quirk and an inconsistency in counting the number of CPUs available to a zone. These bugs and my project RFEs have been transferred to Steve.
I've had a good time here, and I may be back after I graduate in May. Thanks to all the folks I've met here at Sun for showing me a good time, and teaching me a lot. I'm heading back to school having signed the OpenSolaris contributor agreement, so if I have any spare time at school, I may keep working on OpenSolaris while there. In fact, one of the projects I initially passed on, and later found myself wanting, was prototyped by two people previously but needs a bit of work. It was just posted as a new project on opensolaris.org, and I'm thinking of jumping into that.
As soon as I leave, my login account goes away, so this will be the last entry in this blog. I'm [colin at cs dot brown dot edu], or [colin dot s dot gordon at gmail dot com], and I'll keep blogging at http://ahamsandwich.wordpress.com/.
Posted by csg [Personal] ( August 17, 2007 04:19 PM ) Permalink | Comments[0]
Favorite Comments and Constants in OpenSolaris Source
While perusing the code for OpenSolaris, my fellow interns and I have come across some great comments, variable names, and documentation, as any large code base must inevitably have:
- usr/src/uts/common/os/clock.c: "If the TOD chip isn't giving correct time, then set it to the time that was passed in as a rough estimate. If we don't have an estimate, then set the clock back to a time when Oliver North, ALF and Dire Straits were all on the collective brain: 1987." (Found by Dan)
- usr/src/uts/i86pc/os/machdep.c: "This routine is almost correct now, but not quite." (also from Dan)
- usr/src/uts/common/os/condvar.c: Regarding cv_wait_stop(), "This is a horrible kludge. It is evil. It is vile. It is swill. If your code has to call this function then your code is the same."
- usr/src/uts/common/sys/dumphdr.h: The magic number at the start of all core dumps:
#define DUMP_MAGIC 0xdefec8edU /* dump magic number */
- usr/src/lib/watchmalloc/common/malloc.c: While poking around in mdb, I printed the words at the end of an array and found 0xbaddcabb and 0xfeedface. So I opened the OS source browser, searched, and found:
/*
 * Patterns to be copied into freed blocks and allocated blocks.
 * 0xfeedbeef and 0xfeedface are invalid pointer values in all programs.
 */
static uint64_t patterns[2] = {
	0xdeadbeefdeadbeefULL,	/* pattern in a freed block */
	0xbaddcafebaddcafeULL	/* pattern in an allocated block */
};
- Earlier today I found out that the Solaris libc defines a struct uberdata. It ends up being used on the sly in the psig command.
- For a good time, follow int bsd_evil_hack through the grub source.
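The watchmalloc fill patterns quoted in the list above are the kind of thing that makes a debugger dump readable at a glance. Here is a minimal sketch of the idea, assuming only the two 64-bit values from that patterns[] array; the helper names fill_pattern and classify_word and the word_state enum are made up for illustration and are not part of watchmalloc:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two values from the patterns[] array in watchmalloc's malloc.c. */
#define FREED_PATTERN     0xdeadbeefdeadbeefULL
#define ALLOCATED_PATTERN 0xbaddcafebaddcafeULL

/* Hypothetical classification of a single 64-bit word seen in a dump. */
enum word_state { WORD_IN_USE, WORD_FREED, WORD_UNWRITTEN };

/* Hypothetical helper: stamp every 64-bit word of a block with a pattern. */
static void fill_pattern(uint64_t *block, size_t nwords, uint64_t pattern)
{
    assert(block != NULL);
    for (size_t i = 0; i < nwords; i++)
        block[i] = pattern;
}

/* Hypothetical helper: guess what happened to a word from its contents. */
static enum word_state classify_word(uint64_t word)
{
    if (word == FREED_PATTERN)
        return WORD_FREED;      /* block was freed and not reused */
    if (word == ALLOCATED_PATTERN)
        return WORD_UNWRITTEN;  /* allocated, but never written by the program */
    return WORD_IN_USE;         /* anything else is presumably real data */
}
```

Filling a block with ALLOCATED_PATTERN on allocation and FREED_PATTERN on free means that a word printed in mdb says something about the block's history, which is why such values tend to jump out while poking around.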
Posted by csg [Personal] ( August 01, 2007 02:58 PM ) Permalink | Comments[1]
Layers of Complexity
I've been learning a lot lately about the layering of software from the kernel up through userland utilities. There are frequently many more layers of indirection than are immediately obvious. Many of the tests I'm running that still fail do so because of a side effect of one test, which turns off the default binding property of all pools. Normally this is fine to do because pooladm -x; pooladm -d; pooladm -e will clear everything out. Disabling pools wipes out all properties on the default pool and pool_pset, and enabling recreates the properties with default values. When I made disabling a no-op at the kernel level, this prevented these properties from being cleared by the test suite's environment cleanup routines. So a number of tests failed because they couldn't bind to anything. Another, more obnoxious side effect is that sshd is unable to accept new sessions because it can't find a pool to bind to.
Changing the kernel code for pool_set_status() to reinitialize the properties of the default resources should fix things, right? Well, it would if the sequence of events I assumed was correct. What I assumed initially, without thinking about it, was that the sequence of events would be:
- Execute pooladm -d in a shell
- The above command opens /dev/pool and performs the ioctl() for disabling pools
- The ioctl() code calls pool_ioctl(), which dispatches to pool_status(), now modified to do the right thing with disable and enable requests.
psrset command simply calls the function in the syscall layer which carries out the appropriate task. I completely forgot about both libpool and poold. The actual call sequence is:
- Execute pooladm -d in a shell
- The above code calls into libpool.
- The code called executes 'svcadm disable svc:/system/pools:default'
- This runs the shutdown code for the pools service, /lib/svc/method/svc-pools, which executes pooladm -d in a shell with a particular environment variable set
- With this environment variable set, when the command calls into libpool, this time it actually opens /dev/pool and performs the ioctl() for disabling pools
- The ioctl() code calls pool_ioctl(), which dispatches to pool_status(), now modified to do the right thing with disable and enable requests.
libpool saw that pools were already enabled, and didn't restart the pools service. So when the later tests ran which disable pools partway through, the invocation of svcadm saw the pools service was already disabled, and did not execute the additional pooladm -d, so no call was ever made into the kernel to reset these properties.
I had not really noticed before how differently userland and kernelspace were written. The complexity in kernelspace tends to be intricacies - locking, complicated invariants and such. Userspace's complexity seems to be largely layering and indirection. Or perhaps that's just what I perceive most because it's so large, and unlike the kernel, where the definition a symbol corresponds to is always clear (making following flow of control with tools like cscope or OpenGrok usually straightforward), it's not always easy to figure out the flow of control between programs in userspace without some preexisting knowledge.
Fixing this cleaned up a lot of inconclusive tests, and a number of tests which were failing because of properties not being reset. After this and fixing a couple tests which made assumptions about processor sets and pools being mutually exclusive, as of this afternoon the pools test tally has improved from that in the last entry to:
Result Total:
FAIL: 7
PASS: 705
UNSUPPORTED: 5
In addition to this, I've cleaned up some code, merged redundant code paths, done the final implementation for the last class of process sets, and fixed several race conditions and deadlocks related to the conversion of pool reference counting to per-thread. And because two of the failures also fail on the gate (it was one the other week, but I've found bugs in several test cases), there are really only 5 failures left. Current builds also hold up to as many runs of the processor set stress test as I've been able to subject them to, without deadlock or panic. Things are looking good.
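The missed kernel call described earlier in this entry comes down to one layer short-circuiting before the next one is reached. A toy model of that, assuming nothing about the real libpool, SMF, or kernel interfaces (every name below is hypothetical, and the stop method's environment-variable path through libpool is collapsed into a single call):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are real libpool, SMF, or kernel APIs. */
struct system_model {
    bool service_enabled;      /* state of svc:/system/pools:default */
    bool default_binding_set;  /* stands in for the default pool's properties */
    int  kernel_disable_calls; /* how many times the disable "ioctl" was reached */
};

/* Kernel layer: the modified disable path reinitializes the default properties. */
static void kernel_pool_disable(struct system_model *sys)
{
    sys->kernel_disable_calls++;
    sys->default_binding_set = true;
}

/* Service layer: the stop method only runs if the service is currently enabled. */
static void svcadm_disable(struct system_model *sys)
{
    if (!sys->service_enabled)
        return;                /* already disabled: the kernel is never called */
    sys->service_enabled = false;
    kernel_pool_disable(sys);  /* the stop method's own pooladm -d */
}

/* Userland entry point: pooladm -d goes through the service, not the ioctl. */
static void pooladm_disable(struct system_model *sys)
{
    assert(sys != NULL);
    svcadm_disable(sys);
}
```

In this model, calling pooladm_disable() on a system whose service is already marked disabled leaves kernel_disable_calls at zero and the binding property unset, which mirrors the situation where the tests could not bind to anything.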
Posted by csg [Personal] ( July 24, 2007 05:03 PM ) Permalink | Comments[0]
Two Weeks in the Kernel
Steve Lawrence suggested I make some notes about my first couple weeks at Sun, while it was still fresh. This seems to be the most appropriate place. He suggested a couple questions to answer for myself:
- What did I learn? I learned quite a lot. A list of quick things:
- How to use mdb
- How to use DTrace
- The SCM tools used at Sun (currently Teamware, transitioning to Mercurial)
- The build system for all of Solaris (which is the same as OpenSolaris)
- How to set up a Zone
- How to divide up processors between the various ways Solaris has to group processes
- How to use the "One True Keyboard" (UNIX style, Sun Type 7, which has caps lock, control, backspace, backslash, escape, and tilde in what are now unusual places, plus an actual Meta key)
- What were the challenges? Getting used to a new work environment is always a little different. Picking my project for the summer was also a challenge, because there were so many interesting proposals. And of course, once I selected one (more on that when I get the time to write a good technical entry) jumping into a code base as large as Solaris is a lot like jumping into a lake in January in New England.
- What were the new experiences? (good and bad) Builds taking 8 hours was new. This doesn't happen frequently, though, since after the initial build you can rebuild just the portions you updated (as with most build systems), but I'd never worked with such a huge code base before. Being able to see my changes somewhere they were publicly visible, and being able to speak (or blog) to anyone about what I do, has also been a thrill - my main complaint with my previous internship was that I could not (and still cannot, and may never be able to) talk about the details of the work I did to people who do not work at that company. Being able to give people links to code on opensolaris.org is the opposite end of the spectrum. I'm not sure I need the degree of freedom I have, but it's certainly nice.
Posted by csg [Personal] ( June 19, 2007 09:50 PM ) Permalink | Comments[0]
