http://blogs.sun.com/csg/date/20070630 Saturday June 30, 2007

Deleting 14TB, and a Finished Prototype

Before I finished my prototype of merging processor sets and pools, this past week was already off to a great start:

The first, and possibly the most amusing, bug of the week was found by Dan K. (and not in his own code). When we had left last Friday evening, Dan had kicked off a recursive find ... -exec ... to grep in basically every source file for calls to a particular function he was changing for his time independent zones. He redirected the output to a file in the search area, so once find hit the log file and started exec'ing grep on it... you get the idea.

We arrived on Monday to several emails about the build server being out of space. People had deleted old workspaces once this happened, and were now working away again. Dan of course deleted the file (which, due to compression, was only 654 GB or so on disk). This was all well and good.

Then at some point I started doing a file operation, and it wasn't completing. Sending kill signals to the process yielded no results, so it was clearly stuck in the kernel somewhere (actual signal handling for things like kill signals only occurs when threads are entering or exiting syscalls). My officemate was having a similar problem. I walked to Dan's office, and he also had that problem. We went to Dan Price, who determined there was a filesystem issue. The three of us went to see Matt Ahrens, who was already working on the bug, as another engineer had already tracked it down to ZFS. After some amount of standing there, most of us left, as we weren't really contributing anything. We went to lunch, came back, and about an hour later the server began working again. My chmod had finally finished - after two and a half hours.

A short time later we got an email from Matt Ahrens saying that the problem was that a very large file had been deleted, and the deletion was just taking a long time. ZFS tries to cache all the level 1 indirect blocks for a file in memory when deleting. For a 14TB file, that's roughly 16GB - on a machine with 12 GB of RAM. And everyone else's threads were blocked on the thread busy deleting Dan's log file. The ZFS guys knew this sort of thing would be problematic, but had never actually run into it before... and so, they filed Bug 6573681.
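
As a rough sanity check on that memory figure, here is a back-of-the-envelope sketch. It assumes the default 128 KiB ZFS record size and 128-byte block pointers; neither number is stated above, so treat both as assumptions, and the helper name l1_indirect_bytes is made up for illustration.

```python
TIB = 2**40
GIB = 2**30
KIB = 2**10

def l1_indirect_bytes(file_size, record_size=128 * KIB, blkptr_size=128):
    """Bytes of level 1 indirect blocks needed to point at every data block.

    record_size and blkptr_size are assumed defaults, not values from the post.
    """
    data_blocks = -(-file_size // record_size)   # ceiling division
    return data_blocks * blkptr_size

print(l1_indirect_bytes(14 * TIB) / GIB)   # 14.0
```

Under those assumptions the estimate comes out at about 14 GiB, which is in the same range as the roughly 16GB quoted and, either way, more than the 12 GB of RAM on the build server.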

At this point, I have a mostly functional version of a kernel in which processor sets and pools are merged, and both always enabled. There are a couple of minor cases I haven't handled yet, but it's basically finished. I've sent out a proposal for the change for review by a few people who have some experience with those systems, and am gathering feedback. Soon I'll start going through the formal process of filing for a significant change to the kernel. In the meantime, I'm also looking at other possible projects for the rest of the summer - perhaps one of the projects I had considered earlier for my first project, and perhaps a new RFE...



Posted by csg [Sun] ( June 30, 2007 09:08 PM ) Permalink | Comments[0]
http://blogs.sun.com/csg/date/20070623 Saturday June 23, 2007

Processor Sets, and Pools, and Partitions! Oh My!

My summer project is merging the two processor grouping abstractions in the Solaris kernel, whose uses are currently mutually exclusive. This will allow an extremely useful feature of zones, which is currently disabled by default, to always be enabled. So far, I've discovered some interesting things in the course of this work, such as a trick the JVM plays on Solaris to evade resource controls.[Read More]

Posted by csg [Sun] ( June 23, 2007 12:10 AM ) Permalink | Comments[0]
http://blogs.sun.com/csg/date/20070619 Tuesday June 19, 2007

Two Weeks in the Kernel

Steve Lawrence suggested I make some notes about my first couple of weeks at Sun while they were still fresh. This seems to be the most appropriate place. He suggested a couple of questions to answer for myself. Given my previous internship working in filesystems, and my interest in filesystems following that experience, I was also somewhat surprised that I'm not working on ZFS. Though really, I suppose it's in tune with my other goals for this internship - to try something different. To some degree an internship is there to help you decide if the work you do is the sort of work you'd like to do in the long term. I think last summer I reached the conclusion that I enjoy operating systems work - filesystems, kernels and such. I enjoy the problems in those and related areas. So for me this is more about sampling the variety which can be had within that space. And thus far, the Solaris kernel group has been great, and I don't expect that to change.

Posted by csg [Personal] ( June 19, 2007 09:50 PM ) Permalink | Comments[0]
http://blogs.sun.com/csg/date/20070615 Friday June 15, 2007

Integer Overflow and An Inverted Spin-Lock

An unexpected limitation for the C preprocessor to place on code, and how to aggravate a race condition by using additional synchronization... Plus an aside on formatting code as HTML.[Read More]

Posted by csg [Sun] ( June 15, 2007 10:21 PM ) Permalink | Comments[1]
http://blogs.sun.com/csg/date/20070611 Monday June 11, 2007

It's Broken - Prove It

Today I finally did the putback (basically, the commit to the main repository) for my first, fairly trivial bugfix. However, I'm still working on the fix for the remaining race condition. I have the fix worked out, and am pretty certain that it is correct, but the challenge in this bug is in reproducing it. Or, as I don't actually know of anyone hitting this bug, producing it for the first time.

The problem itself is described in the bug report, so you should take a look at that to get a bit of background. Reproducing the bug requires three threads, in two processes. It revolves around the interactions between exitlwps() and kadmin(). exitlwps() is a function called by user processes to kill all threads other than the calling thread. Called concurrently, the first thread into the function takes a lock, sets a flag, unlocks it, and continues cleanup. Subsequent calls take the lock, see the flag, and then exit themselves. kadmin() does a number of things, but the basic flow of kadmin() for the cases in question is:

stuff
if (!trylock(m))
    return              /* someone else holds m; caller sees "success" */
if (exitlwps(0) != 0) {
    unlock(m)
    return              /* error; nothing is shut down */
}
...
shut down or reboot the system
...
unlock(m)
return

So consider this case: one thread takes the lock in kadmin(), another enters kadmin() and fails to take the lock (as this means another thread is shutting down the system already). The second thread returns. Now let the first thread get a nonzero return value from exitlwps(). It will return. This thread has returned on error, and the second thread to enter kadmin() returned success - but the machine is not rebooting or shutting down! Even setting aside the fact that the documentation for this function says that "successful" calls with the flags for rebooting or shutting down should not return, this is a problem. Subsequent attempts to shut down (if they don't hit this race condition) will still succeed (not the case before my first bugfix, where the system could land in a state where it returned "success" from this call repeatedly but not shut down).
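
The interleaving above can be forced deterministically in a toy user-space model. This is only a sketch, not the Solaris code: the class BuggyKadmin, the holding/release test hooks, and the return values 0 and -1 are all hypothetical stand-ins chosen for illustration.

```python
import threading

class BuggyKadmin:
    """Toy user-space model of the flow above (not the Solaris source)."""

    def __init__(self):
        self.m = threading.Lock()
        self.shut_down = False          # stands in for "the system went down"

    def kadmin(self, exitlwps, holding=None, release=None):
        if not self.m.acquire(blocking=False):   # trylock(m) failed
            return 0                    # assumes a shutdown is underway: "success"
        try:
            if holding is not None:
                holding.set()           # test hook: announce that we own m
            if release is not None:
                release.wait()          # test hook: keep m held for a while
            if exitlwps() != 0:
                return -1               # error path; m is unlocked in finally
            self.shut_down = True       # the real call would never return
            return 0
        finally:
            self.m.release()


def demo_race():
    k = BuggyKadmin()
    holding, release = threading.Event(), threading.Event()
    results = {}

    def first():
        # exitlwps() fails for this caller, as in the scenario above
        results["first"] = k.kadmin(lambda: 1, holding, release)

    t = threading.Thread(target=first)
    t.start()
    holding.wait()                            # first caller now holds m
    results["second"] = k.kadmin(lambda: 0)   # trylock fails, returns 0
    release.set()                             # let the first caller go on to fail
    t.join()
    return results, k.shut_down


results, down = demo_race()
print(results["first"], results["second"], down)   # -1 0 False
```

The first caller reports an error, the second reports success, and the model never shuts down, which is the inconsistent outcome described above.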

The fix isn't terribly complicated, though it took some thought (and conversations) to arrive at. The mutex is supplemented with a condition variable and a flag. The winning thread locks the mutex, sets the flag, unlocks, and goes on its way. Subsequent threads lock, see the flag, and sleep on the condition variable. The winning thread either shuts down the system, or clears state and signals the condition variable if it encounters an error - problem solved.
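
The mutex, flag, and condition variable scheme can be sketched in the same toy user-space style. Again this is an illustration under assumptions, not the actual fix: FixedKadmin, in_progress, and the 0 / -1 return values are hypothetical, and having a woken waiter retry the shutdown itself is my guess at the intended behaviour, since the description above does not say what waiters do after the signal.

```python
import threading

class FixedKadmin:
    """Toy model: mutex + flag + condition variable (not the Solaris source)."""

    def __init__(self):
        self.cv = threading.Condition(threading.Lock())
        self.in_progress = False        # the flag set by the winning thread
        self.shut_down = False          # stands in for "the system went down"

    def kadmin(self, exitlwps):
        with self.cv:                   # lock the mutex
            while self.in_progress:     # someone else is the winner: sleep
                self.cv.wait()
            self.in_progress = True     # we are the winner; mutex drops on exit
        if exitlwps() != 0:
            with self.cv:
                self.in_progress = False    # clear state on error...
                self.cv.notify_all()        # ...and wake the sleepers
            return -1
        self.shut_down = True           # the real call would never return,
        return 0                        # so the flag is deliberately left set


def demo_fix():
    k = FixedKadmin()
    holding, release = threading.Event(), threading.Event()
    results = {}

    def failing_exitlwps():
        holding.set()                   # the winner has set the flag by now
        release.wait()                  # stall so the other caller can arrive
        return 1

    def first():
        results["first"] = k.kadmin(failing_exitlwps)

    def second():
        results["second"] = k.kadmin(lambda: 0)

    t1 = threading.Thread(target=first)
    t2 = threading.Thread(target=second)
    t1.start()
    holding.wait()
    t2.start()
    release.set()
    t1.join()
    t2.join()
    return results, k.shut_down


fixed_results, fixed_down = demo_fix()
print(fixed_results["first"], fixed_results["second"], fixed_down)   # -1 0 True
```

Here the second caller can no longer report success while nothing is shutting down: it either sleeps until the first caller gives up or arrives after the flag is cleared, and in both cases it carries out the shutdown itself.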

But to submit a bug, I need to prove it was broken in the first place. I need a way to reproduce the bug. Timing-related bugs are usually very difficult to debug. But they seem to be even more difficult to intentionally cause. The solution I'm currently using is a combination of a hacky C program and DTrace. I have a program which forks. One process spawns an extra thread which calls exit(), and has its original thread make a call to uadmin(), which then calls kadmin(). The other process only nanosleep()s and calls kadmin(). Using DTrace's chill() action, I can delay the first thread to call exitlwps() (which is in the multi-threaded process because of the nanosleep() in the other one). This causes that thread's call to fail, because the exit() call from the other thread succeeds before that exitlwps() call actually does anything. The difficulty now is getting the single-threaded process's attempt to take the mutex in kadmin() to land in the brief window when the first kadmin() call is still holding it. If the attempt comes before or after that window, the test machine reboots, making this a rather obnoxious bug to (fail to) test.

Why don't I just have the chill() in the exitlwps() call wait for 30 seconds? That's in the critical section, and surely long enough for the other process to be rescheduled and have its call spuriously return success. Because DTrace doesn't allow it. In their attempt to save DTrace users from themselves, the DTrace developers intentionally limited how much chill()ing could occur to no more than half of one second - for the sake of the system making progress. Normally this is great - I'm just using DTrace in a way which is the exact opposite of how it was intended to be used. It was meant to find and observe bugs - not to trigger them.

I'm now considering several options:

I'll make my decision tomorrow, unless I think of a better way to do this (though I'm not sure one exists).



Posted by csg [Sun] ( June 11, 2007 11:23 PM ) Permalink | Comments[0]
http://blogs.sun.com/csg/date/20070607 Thursday June 07, 2007

Figuring Things Out

Well, Monday I began my summer internship with the Solaris Kernel Group at Sun. Thus far things seem to be going pretty well - I've fixed one bug in the shutdown code (my changes are still uncommitted). I'm also working on a related race condition in the same section of code. I'm doing these small bug fixes as a way to learn the basics of the build system and such - so far I've learned how to build, how to update a running system, the basics of mdb (which I'm enjoying - more on that later, perhaps), and some DTrace basics. I have yet to pick an overarching summer project, though I'm leaning strongly towards working on something related to Solaris Zones.

Posted by csg [Sun] ( June 07, 2007 07:34 PM ) Permalink | Comments[0]