http://blogs.sun.com/csg/date/20070809 Thursday August 09, 2007

Processor Sets as Processor Pools Putback Preparations?

I've now gotten some more feedback on my changes, and my project mentor Steve tells me that several others are interested in taking a look at my diffs. Approval for putting back a subsystem rewrite like this, even if it passes all tests and doesn't alter user APIs, takes longer than I have left on my internship. So most likely, if it goes back, it will be Steve's doing. But a number of people seem to really want this put back to the gate, and as a result I'm cleaning stuff up in preparation to hand it over. Also, tomorrow is the last day for my fellow kernel interns, Evan and Dan :-(

Posted by csg [Sun] ( August 09, 2007 01:47 PM )
http://blogs.sun.com/csg/date/20070806 Monday August 06, 2007

Kernel Technical Discussion Slides

The Intern KTD presentations went pretty well today. Slides are here. If I get a chance before I leave, I might post annotated versions.

Posted by csg [Sun] ( August 06, 2007 02:19 PM )
http://blogs.sun.com/csg/date/20070803 Friday August 03, 2007

Coming to a close

Well, I have two more full weeks left on my internship, and things are wrapping up. On Monday morning I'll be giving a KTD (Kernel Technical Discussion) to those interested. We just did a practice run, and I'm feeling pretty good about it. If I can, I'll try to post the slides from my talk (just have to double-check the whole NDA thing, but since I was encouraged to blog about this project, I'm not expecting to hear a "No").

In addition to that, my project has mostly wrapped up. I match the gate on results from the pools test suite. I have not been able to panic my system by poking at it or running the psets stress test. I'm basically going back through, making sure my code is well-commented, and that I clean up a few rough edges before I leave. I unfortunately will probably not be doing a putback of my stuff, at least not personally - the ARC process to approve project integration takes a minimum of a week, plus code review, and satisfying an RTI advocate. However, it's in pretty good shape, is pretty clean code, and if someone chose to put it back relatively soon, they shouldn't have much trouble with it. So I guess I'll be watching commit messages for the next few months (though I hope someone would email me if they put my stuff back! :-p).

Next week I'll probably do a final wrap-up summary of the project, or post annotated versions of my talk.



Posted by csg [Sun] ( August 03, 2007 06:35 PM )
http://blogs.sun.com/csg/date/20070713 Friday July 13, 2007

A Cleaner Prototype - with full pset support

I haven't written in a while, but I've been busy.

At the start of July, I started digging through all of the old processor sets and processor pools correctness and stress tests. Right now I'm basically working on having the pools functionality test pass. Here are the results of testing the gate from a couple weeks ago (output from a test result parsing script):

Result Total:
        FAIL: 1
        PASS: 711
        UNSUPPORTED: 5
UNSUPPORTED means the test doesn't apply to the platform - I haven't looked into them, but I suspect those tests require more than 4 processors to run. Right now, I'll be happy to match the gate.

Just before the 4th, I ran the test suite on a machine running my mostly-finished prototype:

Result Total:
        FAIL: 34
        OTHER_139: 2
        PASS: 552
        UNRESOLVED: 124
        UNSUPPORTED: 5
UNRESOLVED and OTHER mean various random error conditions occurred which prevented the test from completing (e.g. some of the setup for the test failed) but the actual test does not necessarily yield incorrect results. The UNSUPPORTED tests in this run (and the next) are the same as those which were unsupported on the gate build. The prototype this ran on is the one mentioned in my previous entry, which implements support for most of the basic use cases of psets as temporary pools. It lacked support for (read: crashes upon) performing some tasks under conditions the pools framework doesn't like (e.g. deleting a pset which has things bound to it is valid for processor sets, but invalid for a pool) or things without direct analogs in the pools world (e.g. binding individual threads).

Some random stress testing however frequently took the system down, failing an assertion that the number of processes in a pool was greater than 0. It turns out much of the pools framework verifies consistency using counts of how many processes are in a pool - which doesn't make sense when we want to coexist with an API which permits binding (potentially) every thread of a program to a different pool! struct proc contains a pointer to the pool the process is in, used for querying where processes are - which also makes no sense if this varies within a process. So in my latest build, in addition to having various bug fixes and more feature support, the system is converted to having a per-thread pool pointer, and having the pool reference count be a thread count rather than a process count. There is still one race condition (famous last words) to fix in the new refcount code, but it's almost there. The build is also the first which actually turns pools on at boot, and does not permit turning them off - which along with psets and pools coexisting, is the main goal of the whole project.
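The switch from a per-process to a per-thread pool pointer can be pictured with a minimal sketch. The type and function names below (pool_sketch_t, thread_sketch_t, thread_rebind, pool_is_idle) are hypothetical stand-ins, not the real Solaris structures, and the locking a kernel would need around the count (the source of the race mentioned above) is left out on purpose:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real Solaris structures. */
typedef struct pool_sketch {
	int id;
	unsigned int thread_count;	/* counts threads, not processes */
} pool_sketch_t;

typedef struct thread_sketch {
	pool_sketch_t *pool;		/* per-thread pool pointer */
} thread_sketch_t;

/*
 * Move one thread to a new pool, keeping both counts consistent.
 * A real kernel would have to hold a lock here; this sketch does not.
 */
static void
thread_rebind(thread_sketch_t *t, pool_sketch_t *to)
{
	if (t->pool != NULL) {
		/* The consistency check is now per thread. */
		assert(t->pool->thread_count > 0);
		t->pool->thread_count--;
	}
	to->thread_count++;
	t->pool = to;
}

/* A pool has no users only when no thread points at it. */
static int
pool_is_idle(const pool_sketch_t *p)
{
	return (p->thread_count == 0);
}
```

With this shape, two threads of the same process can sit in different pools without tripping a count check, because nothing in the sketch assumes a process lives in exactly one pool.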

The latest build test results from 7/12, before I turned my attention back to the refcount race condition in the psets code:

Result Total:
        FAIL: 16
        OTHER_13: 32
        PASS: 664
        UNSUPPORTED: 5

Of course, some of these tests no longer make sense to have, and some of them simply expect the wrong thing - mainly all of the tests which try to reproduce different behavior with pools turned on or off. Right now my kernel simply lies and always says that enabling or disabling succeeded, though queries for state always return POOL_ENABLED. This accounts for at least a couple of the remaining failures. Many of the OTHER results are hard to interpret - most of the tests giving these results are written differently from most of the others, and don't provide as much information - but this is probably an indication of something giving them errno 13 - EACCES.
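To see why an enable/disable test can no longer pass, here is a minimal sketch of the "always succeeds, always enabled" behavior. Every name in it (the sketch_ functions and SKETCH_ constants) is hypothetical and only mimics the shape of a set-status/get-status pair; it is not the kernel or libpool code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state values for this sketch only. */
enum { SKETCH_POOL_DISABLED = 0, SKETCH_POOL_ENABLED = 1 };

/* Pretend the request worked, whatever was asked for. */
static int
sketch_pool_set_status(int requested)
{
	(void) requested;
	return (0);
}

/* Queries always report that pools are enabled. */
static int
sketch_pool_get_status(int *state)
{
	assert(state != NULL);
	*state = SKETCH_POOL_ENABLED;
	return (0);
}

/*
 * A test in the "disable, then expect disabled" style.
 * Returns 1 if the test would pass, 0 if it would fail.
 */
static int
sketch_disable_test_passes(void)
{
	int state;

	if (sketch_pool_set_status(SKETCH_POOL_DISABLED) != 0)
		return (0);
	if (sketch_pool_get_status(&state) != 0)
		return (0);
	return (state == SKETCH_POOL_DISABLED);
}
```

The disable call reports success, but the follow-up query still says enabled, so a test written against the old on/off behavior fails even though nothing is actually broken.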



Posted by csg [Sun] ( July 13, 2007 02:12 PM )
http://blogs.sun.com/csg/date/20070630 Saturday June 30, 2007

Deleting 14TB, and a Finished Prototype

Before I finished my prototype of merging processor sets and pools, this past week was already off to a great start:

The first, and possibly the most amusing, bug of the week was found by Dan K. (and not in his own code). When we left last Friday evening, Dan had kicked off a recursive find ... -exec ... to grep basically every source file for calls to a particular function he was changing for his time-independent zones. He redirected the output to a file in the search area, so once find hit the log file and started exec'ing grep on it... you get the idea. We arrived on Monday to several emails about the build server being out of space. People had deleted old workspaces once this happened, and were now working away again. Dan of course deleted the file (which, due to compression, was only 654 GB or so on disk). This was all well and good.

Then at some point I started doing a file operation, and it wasn't completing. Sending kill signals to the process yielded no results, so it was clearly stuck in the kernel somewhere (actual signal handling for things like kill signals only occurs when threads are entering or exiting syscalls). My officemate was having a similar problem. I walked to Dan's office, and he had the same problem. We went to Dan Price, who determined there was a filesystem issue. The three of us went to see Matt Ahrens, who was already working on the bug, as another engineer had already tracked it down to ZFS. After some amount of standing there, most of us left, as we weren't really contributing anything. We went to lunch, came back, and about an hour later the server began working again. My chmod had finally finished - after two and a half hours.

A short time later we got an email from Matt Ahrens saying that the problem was that a very large file had been deleted, and the deletion was just taking a long time. ZFS tries to cache all the level 1 indirect blocks for a file in memory when deleting. For a 14TB file, that's roughly 16GB - on a machine with 12 GB of RAM. And everyone else's threads were blocked on the thread busy deleting Dan's log file. The ZFS guys knew this sort of thing would be problematic, but had never actually run into it before... and so, they filed Bug 6573681.

At this point, I have a mostly functional version of a kernel in which processor sets and pools are merged, and both always enabled. There are a couple of minor cases I haven't handled yet, but it's basically finished. I've sent out a proposal for the change for review by a few people who have some experience with those systems, and am gathering feedback. Soon I'll start going through the formal process of filing for a significant change to the kernel. In the meantime, I'm also looking at other possible projects for the rest of the summer - perhaps one of the projects I had considered earlier for my first project, and perhaps a new RFE...



Posted by csg [Sun] ( June 30, 2007 09:08 PM )
http://blogs.sun.com/csg/date/20070623 Saturday June 23, 2007

Processor Sets, and Pools, and Partitions! Oh My!

My summer project is merging the two processor grouping abstractions in the Solaris kernel, whose uses are currently mutually exclusive. This will allow an extremely useful feature of zones, which is currently disabled by default, to always be enabled. So far, I've discovered some interesting things in the course of this work, such as a trick the JVM plays on Solaris to evade resource controls.

Posted by csg [Sun] ( June 23, 2007 12:10 AM )
http://blogs.sun.com/csg/date/20070615 Friday June 15, 2007

Integer Overflow and An Inverted Spin-Lock

An unexpected limitation for the C preprocessor to place on code, and how to aggravate a race condition by using additional synchronization... Plus an aside on formatting code as HTML.

Posted by csg [Sun] ( June 15, 2007 10:21 PM )
http://blogs.sun.com/csg/date/20070611 Monday June 11, 2007

It's Broken - Prove It

Today I finally did the putback (basically, the commit to the main repository) for my first, fairly trivial bugfix. However, I'm still working on the fix for the remaining race condition. I have the fix worked out, and am pretty certain that it is correct, but the challenge in this bug is in reproducing it. Or, as I don't actually know of anyone hitting this bug, producing it for the first time.

The problem itself is described in the bug report, so you should take a look at that to get a bit of background. Reproducing the bug requires three threads, in two processes. It revolves around the interactions between exitlwps() and kadmin(). exitlwps() is a function called by user processes to kill all threads other than the calling thread. Called concurrently, the first thread into the function takes a lock, sets a flag, unlocks it, and continues cleanup. Subsequent callers take the lock, see the flag, and then exit themselves. kadmin() does a number of things, but the basic flow of kadmin() for the cases in question is:

stuff
if (!trylock(m))
        return
if (exitlwps(0) != 0) {
        unlock(m)
        return
}
...
shut down or reboot the system
...
unlock(m)
return

So consider this case: one thread takes the lock in kadmin(), another enters kadmin() and fails to take the lock (as this means another thread is shutting down the system already). The second thread returns. Now let the first thread get a nonzero return value from exitlwps(). It will return. This thread has returned on error, and the second thread to enter kadmin() returned success - but the machine is not rebooting or shutting down! Even setting aside the fact that the documentation for this function says that "successful" calls with the flags for rebooting or shutting down should not return, this is a problem. Subsequent attempts to shut down (if they don't hit this race condition) will still succeed (not the case before my first bugfix, where the system could land in a state where it returned "success" from this call repeatedly but not shut down).
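The interleaving described above can be replayed deterministically in user space. This is only a sketch under stated assumptions: kadmin_winner(), kadmin_loser(), shutdown_lock and system_went_down are hypothetical names, a pthread mutex stands in for the kernel mutex, and the second caller is simply invoked while the first still holds the lock instead of running in a separate thread:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t shutdown_lock = PTHREAD_MUTEX_INITIALIZER;
static int system_went_down = 0;	/* stand-in for the real shutdown */

/* Second caller: loses the trylock and reports success anyway. */
static int
kadmin_loser(void)
{
	if (pthread_mutex_trylock(&shutdown_lock) != 0)
		return (0);	/* assumes the lock holder will shut down */
	pthread_mutex_unlock(&shutdown_lock);
	return (1);		/* got the lock; not the case replayed here */
}

/*
 * First caller: takes the lock, then sees exitlwps() fail.
 * exitlwps_result is passed in so the failure can be forced.
 */
static int
kadmin_winner(int exitlwps_result, int *loser_result)
{
	assert(loser_result != NULL);
	if (pthread_mutex_trylock(&shutdown_lock) != 0)
		return (0);

	/* The second caller arrives while the lock is still held. */
	*loser_result = kadmin_loser();

	if (exitlwps_result != 0) {
		pthread_mutex_unlock(&shutdown_lock);
		return (-1);	/* error: nobody shuts the system down */
	}
	system_went_down = 1;
	pthread_mutex_unlock(&shutdown_lock);
	return (0);
}
```

Calling kadmin_winner() with a nonzero exitlwps_result leaves the first caller with an error, the second caller with "success", and system_went_down still 0, which is exactly the state described above.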

The fix isn't terribly complicated, though it took some thought (and conversations) to arrive at. The mutex is supplemented with a condition variable and a flag. The winning thread locks the mutex, sets the flag, unlocks, and goes on its way. Subsequent threads lock, see the flag, and sleep on the condition variable. The winning thread either shuts down the system, or clears state and signals the condition variable if it encounters an error - problem solved.
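Here is a rough user-space sketch of that scheme using POSIX threads. It is not the kernel code: kadmin_fixed(), shutdown_in_progress and system_went_down are hypothetical names, exitlwps_result is a parameter so the error path can be forced, and having woken waiters retry as the new winner is my assumption about what they do next:

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t shutdown_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t shutdown_cv = PTHREAD_COND_INITIALIZER;
static int shutdown_in_progress = 0;	/* the flag */
static int system_went_down = 0;	/* stand-in for the real shutdown */

static int
kadmin_fixed(int exitlwps_result)
{
	/* Losers sleep here instead of returning "success" early. */
	pthread_mutex_lock(&shutdown_lock);
	while (shutdown_in_progress)
		pthread_cond_wait(&shutdown_cv, &shutdown_lock);
	shutdown_in_progress = 1;	/* this thread is now the winner */
	pthread_mutex_unlock(&shutdown_lock);

	if (exitlwps_result != 0) {
		/* Error: clear state and wake the waiters so one can retry. */
		pthread_mutex_lock(&shutdown_lock);
		assert(shutdown_in_progress);
		shutdown_in_progress = 0;
		pthread_cond_broadcast(&shutdown_cv);
		pthread_mutex_unlock(&shutdown_lock);
		return (-1);
	}

	/*
	 * Success: the flag stays set, so waiters keep sleeping. The real
	 * call would not return; the sketch returns 0 so it can be checked.
	 */
	pthread_mutex_lock(&shutdown_lock);
	system_went_down = 1;
	pthread_mutex_unlock(&shutdown_lock);
	return (0);
}
```

The point of the design is that a losing thread never reports an outcome until the winner has either taken the system down or given up, so "success without a shutdown" can no longer be returned.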

But to submit a bug, I need to prove it was broken in the first place. I need a way to reproduce the bug. Timing-related bugs are usually very difficult to debug. But they seem to be even more difficult to intentionally cause. The solution I'm currently using is a combination of a hacky C program and DTrace. I have a program which forks. One process spawns an extra thread which calls exit(), and has its original thread make a call to uadmin(), which then calls kadmin(). The other process only nanosleep()s and then calls kadmin(). Using DTrace's chill() action, I can sleep the first thread to call exitlwps() (which is in the multi-threaded process because of the nanosleep()). This causes that thread's call to fail because the exit() call from the other thread succeeds before that exitlwps() call actually does anything. The difficulty now is to get the part of the single-threaded process's kadmin() call which attempts to take the mutex to occur in the brief window when the first kadmin() call is still holding the mutex - before or after, and the test machine reboots, making this a rather obnoxious bug to (fail to) test.

Why don't I just have the chill() in the exitlwps() call wait for 30 seconds? That's in the critical section, and surely long enough for the other process to be rescheduled and have its call spuriously return success. Because DTrace doesn't allow it. In their attempt to save DTrace users from themselves, its authors intentionally limited how much chill()ing could occur to no more than half of one second - for the sake of the system making progress. Normally this is great - I'm just using DTrace in a way which is the exact opposite of how it was intended to be used. It was meant to find and observe bugs - not to trigger them.

I'm now considering several options:

I'll make my decision tomorrow, unless I think of a better way to do this (though I'm not sure one exists).



Posted by csg [Sun] ( June 11, 2007 11:23 PM )
http://blogs.sun.com/csg/date/20070607 Thursday June 07, 2007

Figuring Things Out

Well, Monday I began my summer internship with the Solaris Kernel Group at Sun. Thus far things seem to be going pretty well - I've fixed one bug in the shutdown code (my changes are still uncommitted). I'm also working on a related race condition in the same section of code. I'm doing these small bug fixes as a way to learn the basics of the build system and such - so far I've learned how to build, how to update a running system, the basics of mdb (which I'm enjoying - more on that later, perhaps), and some DTrace basics. I have yet to pick an overarching summer project, though I'm leaning strongly towards working on something related to Solaris Zones.

Posted by csg [Sun] ( June 07, 2007 07:34 PM )