Deleting 14TB, and a Finished Prototype
Before I finished my prototype of merging processor sets and pools, this past week was already off to a great start:
The first, and possibly the most amusing, bug of the week was found by Dan K. (and not in his own code). When we left last Friday evening, Dan had kicked off a recursive find ... -exec ... to grep basically every source file for calls to a particular function he was changing for his time-independent zones. He redirected the output to a file in the search area, so once find hit the log file and started exec'ing grep on it... you get the idea.
We arrived on Monday to several emails about the build server being out of space. People had deleted old workspaces once this happened, and were now working away again. Dan of course deleted the file (which, due to compression, was only 654 GB or so on disk). This was all well and good.
Then at some point I started a file operation, and it wasn't completing. Sending kill signals to the process yielded no results, so it was clearly stuck in the kernel somewhere (actual signal handling for things like kill signals only occurs when threads are entering or exiting syscalls). My officemate was having a similar problem. I walked to Dan's office, and he also had that problem. We went to Dan Price, who determined there was a filesystem issue. The three of us went to see Matt Ahrens, who was already working on the bug, as another engineer had already tracked it down to ZFS. After some amount of standing there, most of us left, as we weren't really contributing anything.
We went to lunch, came back, and about an hour later the server began working again. My chmod had finally finished - after two and a half hours. A short time later we got an email from Matt Ahrens saying that the problem was that a very large file had been deleted, and the deletion was just taking a long time. ZFS tries to cache all the level 1 indirect blocks for a file in memory when deleting. For a 14TB file, that's roughly 16GB - on a machine with 12 GB of RAM. And everyone else's threads were blocked on the thread busy deleting Dan's log file.
The ZFS guys knew this sort of thing would be problematic, but had never actually run into it before... and so, they filed Bug 6573681.
At this point, I have a mostly functional version of a kernel in which processor sets and pools are merged, and both are always enabled. There are a couple of minor cases I haven't handled yet, but it's basically finished. I've sent out a proposal for the change for review by a few people who have experience with those systems, and am gathering feedback. Soon I'll start going through the formal process of filing for a significant change to the kernel. In the meantime, I'm also looking at other possible projects for the rest of the summer - perhaps one of the projects I had considered earlier for my first project, and perhaps a new RFE...
Posted by csg [Sun] ( June 30, 2007 09:08 PM ) Permalink | Comments[0]
Processor Sets, and Pools, and Partitions! Oh My!
My summer project is merging the two processor grouping abstractions in the Solaris kernel, whose uses are currently mutually exclusive. This will allow an extremely useful feature of zones, which is currently disabled by default, to always be enabled. So far, I've discovered some interesting things in the course of this work, such as a trick the JVM plays on Solaris to evade resource controls.
Posted by csg [Sun] ( June 23, 2007 12:10 AM ) Permalink | Comments[0]
Two Weeks in the Kernel
Steve Lawrence suggested I make some notes about my first couple of weeks at Sun, while they were still fresh. This seems to be the most appropriate place. He suggested a couple of questions to answer for myself:
- What did I learn? I learned quite a lot. A list of quick things:
- How to use mdb
- How to use DTrace
- The SCM tools used at Sun (currently Teamware, transitioning to Mercurial)
- The build system for all of Solaris (which is the same as OpenSolaris)
- How to set up a Zone
- How to divide up processors between the various ways Solaris has to group processes
- How to use the "One True Keyboard" (UNIX style, Sun Type 7, which has caps lock, control, backspace, backslash, escape, and tilde in what are now unusual places, plus an actual Meta key)
- What were the challenges? Getting used to a new work environment is always a little different. Picking my project for the summer was also a challenge, because there were so many interesting proposals. And of course, once I selected one (more on that when I get the time to write a good technical entry), jumping into a code base as large as Solaris is a lot like jumping into a lake in January in New England.
- What were the new experiences? (good and bad) Builds taking 8 hours was new. This doesn't happen frequently, though, since after the initial build you can rebuild just the portions you updated (as with most build systems), but I'd never worked with such a huge code base before. Being able to see my changes somewhere they are publicly visible, and being able to speak (or blog) to anyone about what I do, has also been a thrill - my main complaint with my previous internship was that I could not (and still cannot, and may never be able to) talk about the details of the work I did to people who do not work at that company. Being able to give people links to code on opensolaris.org is the opposite end of the spectrum. I'm not sure I need the degree of freedom I have, but it's certainly nice.
Posted by csg [Personal] ( June 19, 2007 09:50 PM ) Permalink | Comments[0]
Integer Overflow and An Inverted Spin-Lock
An unexpected limitation for the C preprocessor to place on code, and how to aggravate a race condition by using additional synchronization... Plus an aside on formatting code as HTML.
Posted by csg [Sun] ( June 15, 2007 10:21 PM ) Permalink | Comments[1]
It's Broken - Prove It
Today I finally did the putback (basically, the commit to the main repository) for my first, fairly trivial bugfix. However, I'm still working on the fix for the remaining race condition. I have the fix worked out, and am pretty certain that it is correct, but the challenge in this bug is in reproducing it. Or, as I don't actually know of anyone hitting this bug, producing it for the first time.
The problem itself is described in the bug report, so you should take a look at that to get a bit of background. Reproducing the bug requires three threads, in two processes. It revolves around the interactions between exitlwps() and kadmin(). exitlwps() is a function called by user processes to kill all threads other than the calling thread. Called concurrently, the first thread into the function takes a lock, sets a flag, unlocks it, and continues cleanup. Subsequent calls take the lock, see the flag, and then exit themselves. kadmin() does a number of things, but the basic flow of kadmin() for the cases in question is:
    stuff
    if (!trylock(m))
        return
    if (exitlwps(0) != 0) {
        unlock(m)
        return
    }
    ...
    shut down or reboot the system
    ...
    unlock(m)
    return
So consider this case: one thread takes the lock in kadmin(), another enters kadmin() and fails to take the lock (as this means another thread is shutting down the system already). The second thread returns. Now let the first thread get a nonzero return value from exitlwps(). It will return. This thread has returned on error, and the second thread to enter kadmin() returned success - but the machine is not rebooting or shutting down! Even setting aside the fact that the documentation for this function says that "successful" calls with the flags for rebooting or shutting down should not return, this is a problem. Subsequent attempts to shut down (if they don't hit this race condition) will still succeed (not the case before my first bugfix, where the system could land in a state where it returned "success" from this call repeatedly but not shut down).
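The interleaving described above can be replayed deterministically in a small single-threaded model. This is only a sketch for illustration: the struct, the function names, and the return codes are all hypothetical stand-ins, not anything taken from the Solaris source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state involved; not the Solaris data structures. */
struct model {
    bool lock_held;      /* stands in for the mutex m */
    bool shutting_down;  /* set only if a shutdown actually starts */
};

/* First half of kadmin(): try to take the lock without blocking. */
static bool model_kadmin_enter(struct model *s)
{
    if (s->lock_held)
        return false;    /* the caller then returns "success" right away */
    s->lock_held = true;
    return true;
}

/* Second half, run by the lock holder; exitlwps_rc models exitlwps()'s result. */
static int model_kadmin_finish(struct model *s, int exitlwps_rc)
{
    assert(s->lock_held);          /* only the lock holder gets this far */
    if (exitlwps_rc != 0) {
        s->lock_held = false;
        return -1;                 /* error path: no shutdown happens */
    }
    s->shutting_down = true;       /* the real call would not return */
    return 0;
}

/* Replays the problematic ordering; returns true if the bug shows up. */
static bool model_replay_race(void)
{
    struct model s = { false, false };
    bool a_has_lock = model_kadmin_enter(&s);   /* thread A takes the lock */
    bool b_has_lock = model_kadmin_enter(&s);   /* thread B fails to take it */
    int b_result = b_has_lock ? -2 : 0;         /* B reports success */
    int a_result = a_has_lock ? model_kadmin_finish(&s, 1) : -2;
    /* B said success, A said error, and nothing is shutting down. */
    return b_result == 0 && a_result == -1 && !s.shutting_down;
}
```

Calling model_replay_race() walks through exactly the ordering in question: one caller is told the call succeeded, the other gets an error, and the shutdown flag is never set.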
The fix isn't terribly complicated, though it took some thought (and conversations) to arrive at. The mutex is supplemented with a condition variable and a flag. The winning thread locks the mutex, sets the flag, unlocks, and goes on its way. Subsequent threads lock, see the flag, and sleep on the condition variable. The winning thread either shuts down the system, or clears state and signals the condition variable if it encounters an error - problem solved.
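A rough userland sketch of that scheme, using POSIX threads rather than the kernel's own primitives, might look like the following. Every name here is hypothetical, and two details are my assumptions rather than part of the description above: a woken sleeper retries and becomes the new winner, and the error path uses a broadcast rather than a single signal.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userland stand-ins; none of these names come from Solaris. */
static pthread_mutex_t sd_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t sd_cv = PTHREAD_COND_INITIALIZER;
static bool sd_in_progress = false;

/* Become the winning thread, sleeping while another attempt is underway. */
static void sd_enter(void)
{
    int rc = pthread_mutex_lock(&sd_lock);
    assert(rc == 0);
    (void)rc;
    while (sd_in_progress)          /* loop also covers spurious wakeups */
        pthread_cond_wait(&sd_cv, &sd_lock);
    sd_in_progress = true;          /* assumption: a woken sleeper retries */
    pthread_mutex_unlock(&sd_lock);
}

/* Error path for the winner: clear the flag and wake the sleepers. */
static void sd_abort(void)
{
    pthread_mutex_lock(&sd_lock);
    sd_in_progress = false;
    pthread_cond_broadcast(&sd_cv);
    pthread_mutex_unlock(&sd_lock);
}

/* Stub results standing in for exitlwps(); purely illustrative. */
static int stub_exitlwps_fail(void) { return 1; }
static int stub_exitlwps_ok(void) { return 0; }

/* Sketch of the fixed flow; exitlwps_fn models the exitlwps(0) call. */
static int sd_kadmin(int (*exitlwps_fn)(void))
{
    sd_enter();
    if (exitlwps_fn() != 0) {
        sd_abort();
        return -1;                  /* sleepers get their own attempt */
    }
    /* The real call would shut down here and never return, so the flag
     * stays set and any later caller sleeps instead of seeing "success". */
    return 0;
}
```

The point of the design is that a losing thread never returns early: it either sleeps until the machine goes down, or wakes after a failed attempt and tries again itself. Calling sd_kadmin(stub_exitlwps_fail) returns -1 and leaves the flag clear; calling sd_kadmin(stub_exitlwps_ok) afterwards returns 0 and leaves it set.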
But to submit a bug, I need to prove it was broken in the first place. I need a way to reproduce the bug. Timing-related bugs are usually very difficult to debug. But they seem to be even more difficult to intentionally cause. The solution I'm currently using is a combination of a hacky C program and DTrace. I have a program which forks. One process spawns an extra thread which calls exit(), and has the original thread make a call to uadmin(), which then calls kadmin(). The other process only nanosleep()s and calls kadmin(). Using DTrace's chill() action, I can stall the first thread to call exitlwps() (which is in the multi-threaded process because of the nanosleep()). This causes that thread's call to fail because the exit() call from the other thread succeeds before that exitlwps() call actually does anything. The difficulty now is to get the part of the single-threaded process's kadmin() call which attempts to take the mutex to occur in the brief window when the first kadmin() call is still holding the mutex - before or after, and the test machine reboots, making this a rather obnoxious bug to (fail to) test.
Why don't I just have the chill() in the exitlwps() call wait for 30 seconds? That's in the critical section, and surely long enough for the other process to be rescheduled and have its call spuriously return success. Because DTrace doesn't allow it. In their attempt to save DTrace users from themselves, its authors intentionally limited how much chill()ing can occur to no more than half of one second - for the sake of the system making progress. Normally this is great - I'm just using DTrace in a way which is the exact opposite of how it was intended to be used. It was meant to find and observe bugs - not to trigger them.
I'm now considering several options:
- Playing with splitting the chill() call into two calls, which total to less than half a second, and hoping to get it right.
- Playing with the nanosleep time, and hoping to get it right.
- Recompiling DTrace without this limit, which under any other circumstances would be an absolutely horrendous idea.
Posted by csg [Sun] ( June 11, 2007 11:23 PM ) Permalink | Comments[0]
Figuring Things Out
Well, Monday I began my summer internship with the Solaris Kernel Group at Sun. Thus far things seem to be going pretty well - I've fixed one bug in the shutdown code (my changes are still uncommitted). I'm also working on a related race condition in the same section of code. I'm doing these small bug fixes as a way to learn the basics of the build system and such - so far I've learned how to build, how to update a running system, the basics of mdb (which I'm enjoying - more on that later, perhaps), and some DTrace basics. I have yet to pick an overarching summer project, though I'm leaning strongly towards working on something related to Solaris Zones.
Posted by csg [Sun] ( June 07, 2007 07:34 PM ) Permalink | Comments[0]
