http://blogs.sun.com/csg/date/20070817 Friday August 17, 2007

Dan Stole the Subject for My Farewell

I had been thinking of what subject line to use for my farewell post, as my internship has come to an end - back to school. Halfway through last week I picked the perfect farewell subject line. Unfortunately, Dan used the Hitchhiker's Guide reference I'd settled on when he left last Friday. He should be most of the way back to the east coast now (he's driving). I've got a plane to catch early Tuesday morning. I started thinking about Spinal Tap references for the title, but nothing quite fitting came to mind.

My project is pretty much wrapped up: the code is cleaned up, and it is now getting some wider attention from people other than Steve. While waiting for feedback, I spent some time looking at a P2 panic that appeared Wednesday morning, which revealed an odd API quirk and an inconsistency in counting the number of CPUs available to a zone. These bugs and my project RFEs have been transferred to Steve.

I've had a good time here, and I may be back after I graduate in May. Thanks to all the folks I've met here at Sun for showing me a good time, and teaching me a lot. I'm heading back to school having signed the OpenSolaris contributor agreement, so if I have any spare time at school, I may keep working on OpenSolaris while there. In fact, one of the projects I passed on initially, and later found myself wanting, was prototyped by two people previously but needs a bit of work. It was just posted as a new project on opensolaris.org, and I'm thinking of jumping into that.

As soon as I leave, my login account goes away, so this will be the last entry in this blog. I'm [colin at cs dot brown dot edu], or [colin dot s dot gordon at gmail dot com], and I'll keep blogging at http://ahamsandwich.wordpress.com/.



Posted by csg [Personal] ( August 17, 2007 04:19 PM ) Permalink | Comments[0]
http://blogs.sun.com/csg/date/20070809 Thursday August 09, 2007

Processor Sets as Processor Pools Putback Preparations?

I've now gotten some more feedback on my changes, and my project mentor Steve tells me that several others are interested in taking a look at my diffs. Approval for putting back a subsystem rewrite like this, even if it passes all tests and doesn't alter user APIs, takes longer than I have left on my internship. So most likely, if it goes back, it will be Steve's doing. But a number of people seem to really want this put back to the gate, and as a result I'm cleaning stuff up in preparation to hand it over. Also, tomorrow is the last day for my fellow kernel interns, Evan and Dan :-(

Posted by csg [Sun] ( August 09, 2007 01:47 PM ) Permalink | Comments[0]
http://blogs.sun.com/csg/date/20070806 Monday August 06, 2007

Kernel Technical Discussion Slides

The Intern KTD presentations went pretty well today. Slides are here. If I get a chance before I leave, I might post annotated versions.

Posted by csg [Sun] ( August 06, 2007 02:19 PM ) Permalink | Comments[0]
http://blogs.sun.com/csg/date/20070803 Friday August 03, 2007

Coming to a close

Well, I have two more full weeks left on my internship, and things are wrapping up. On Monday morning I'll be giving a KTD (Kernel Technical Discussion) to those interested. We just did a practice run, and I'm feeling pretty good about it. If I can, I'll try to post the slides from my talk (just have to double-check the whole NDA thing, but since I was encouraged to blog about this project, I'm not expecting to hear a "No").

In addition to that, my project has mostly wrapped up. I match the gate on results from the pools test suite. I have not been able to panic my system by poking at it or running the psets stress test. I'm basically going back through, making sure my code is well-commented, and that I clean up a few rough edges before I leave. I unfortunately will probably not be doing a putback of my stuff, at least not personally - the ARC process to approve project integration takes a minimum of a week, plus code review, and satisfying an RTI advocate. However, it's in pretty good shape, is pretty clean code, and if someone chose to put it back relatively soon, they shouldn't have much trouble with it. So I guess I'll be watching commit messages for the next few months (though I hope someone would email me if they put my stuff back! :-p).

Next week I'll probably do a final wrap-up summary of the project, or post annotated versions of my talk.



Posted by csg [Sun] ( August 03, 2007 06:35 PM ) Permalink | Comments[0]
http://blogs.sun.com/csg/date/20070801 Wednesday August 01, 2007

Favorite Comments and Constants in OpenSolaris Source

While perusing the code for OpenSolaris, my fellow interns and I have come across some great comments, variable names, and documentation, as any large code base must inevitably have:

Posted by csg [Personal] ( August 01, 2007 02:58 PM ) Permalink | Comments[1]
http://blogs.sun.com/csg/date/20070724 Tuesday July 24, 2007

Layers of Complexity

I've been learning a lot lately about the layering of software from the kernel up through userland utilities. There are frequently many more layers of indirection than are immediately obvious. Many of the tests I'm running that still fail do so because of a side effect of one test, which turns off the default binding property of all pools. Normally this is fine to do because pooladm -x; pooladm -d; pooladm -e will clear everything out. Disabling pools wipes out all properties on the default pool and pool_pset, and enabling recreates the properties with default values. When I made disabling a no-op at the kernel level, this prevented these properties from being cleared by the test suite's environment cleanup routines. So a number of tests failed because they couldn't bind to anything. Another, more obnoxious side effect is that sshd is unable to accept new sessions because it can't find a pool to bind to.

Changing the kernel code for pool_set_status() to reinitialize the properties of the default resources should fix things, right? Well, it would if the sequence of events I had assumed were correct. What I assumed initially, without thinking about it, was that the sequence of events would be:

Unfortunately, I had been spending too much time in pset.c, where it really is that simple - the psrset command simply calls the function in the syscall layer which carries out the appropriate task. I completely forgot about both libpool and poold. The actual call sequence is:

But this is enclosed in a conditional which only does all of this if you are setting the pools state to a different state. So because pools are permanently enabled, when tests cleaned up after themselves to set up a fresh testing environment, libpool saw that pools were already enabled, and didn't restart the pools service. So when the later tests ran which disable pools partway through, the invocation of svcadm saw the pools service was already disabled, and did not execute the additional pooladm -d, so no call was ever made into the kernel to reset these properties.

I had not really noticed before how differently userland and kernelspace were written. The complexity in kernelspace tends to be intricacies - locking, complicated invariants and such. Userspace's complexity seems to be largely layering and indirection. Or perhaps that's just what I perceive most because it's so large, and unlike the kernel, where the definition a symbol corresponds to is always clear (making following flow of control with tools like cscope or OpenGrok usually straightforward), it's not always easy to figure out the flow of control between programs in userspace without some preexisting knowledge.
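
The "only act on a state change" conditional that hid the kernel call can be modelled in a few lines of C. This is a deliberately simplified sketch, not the real libpool or kernel code: lib_set_status(), kernel_set_status(), current_state and kernel_calls are names made up for illustration, and the real path also goes through svcadm and the pools service.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; not the real libpool or kernel interfaces. */
enum pool_state { POOLS_DISABLED, POOLS_ENABLED };

enum pool_state current_state = POOLS_ENABLED; /* what userland believes */
int kernel_calls = 0;           /* times the kernel was actually asked */

/* Model of the kernel entry point that would reset default properties. */
void kernel_set_status(enum pool_state s)
{
    assert(s == POOLS_DISABLED || s == POOLS_ENABLED);
    kernel_calls++;
    current_state = s;
}

/*
 * Model of the userland layer: it only goes to the kernel when the
 * requested state differs from the current one. Returns true if the
 * kernel was called.
 */
bool lib_set_status(enum pool_state wanted)
{
    if (wanted == current_state)
        return false;   /* "nothing to do", so no property reset either */
    kernel_set_status(wanted);
    return true;
}
```

With pools permanently enabled, a cleanup routine that asks for POOLS_ENABLED never gets past the first branch, so any reset logic placed in the kernel function is simply never reached.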

Fixing this cleaned up a lot of inconclusive tests, and a number of tests which were failing because of properties not being reset. After this and fixing a couple tests which made assumptions about processor sets and pools being mutually exclusive, as of this afternoon the pools test tally has improved from that in the last entry to:
Result Total:
        FAIL: 7
        PASS: 705
        UNSUPPORTED: 5
In addition to this, I've cleaned up some code, merged redundant code paths, done the final implementation for the last class of processor sets, and fixed several race conditions and deadlocks related to the conversion of pool reference counting to per-thread. And because two of the failures also fail on the gate (it was one the other week, but I've found bugs in several test cases), there are really only 5 failures left. Current builds also hold up to as many runs of the processor set stress test as I've been able to subject them to, without deadlock or panic. Things are looking good.

Posted by csg [Personal] ( July 24, 2007 05:03 PM ) Permalink | Comments[0]
http://blogs.sun.com/csg/date/20070713 Friday July 13, 2007

A Cleaner Prototype - with full pset support

I haven't written in a while, but I've been busy.

At the start of July, I started digging through all of the old processor sets and processor pools correctness and stress tests. Right now I'm basically working on having the pools functionality test pass. Here are the results of testing the gate from a couple weeks ago (output from a test result parsing script):

Result Total:
        FAIL: 1
        PASS: 711
        UNSUPPORTED: 5
UNSUPPORTED means the test doesn't apply to the platform - I haven't looked into them, but I suspect those tests require more than 4 processors to run. Right now, I'll be happy to match the gate.

Just before the 4th, I ran the test suite on a machine running my mostly-finished prototype:

Result Total:
        FAIL: 34
        OTHER_139: 2
        PASS: 552
        UNRESOLVED: 124
        UNSUPPORTED: 5
UNRESOLVED and OTHER mean various random error conditions occurred which prevented the test from completing (e.g. some of the setup for the test failed) but the actual test does not necessarily yield incorrect results. The UNSUPPORTED tests in this run (and the next) are the same as those which were unsupported on the gate build. The prototype this ran on is the one mentioned in my previous entry, which implements support for most of the basic use cases of psets as temporary pools. It lacked support for (read: crashes upon) performing some tasks under conditions the pools framework doesn't like (e.g. deleting a pset which has things bound to it is valid for processor sets, but invalid for a pool) or things without direct analogs in the pools world (e.g. binding individual threads).

Some random stress testing however frequently took the system down, failing an assertion that the number of processes in a pool was greater than 0. It turns out much of the pools framework verifies consistency using counts of how many processes are in a pool - which doesn't make sense when we want to coexist with an API which permits binding (potentially) every thread of a program to a different pool! struct proc contains a pointer to the pool the process is in, used for querying where processes are - which also makes no sense if this varies within a process. So in my latest build, in addition to having various bug fixes and more feature support, the system is converted to having a per-thread pool pointer, and having the pool reference count be a thread count rather than a process count. There is still one race condition (famous last words) to fix in the new refcount code, but it's almost there. The build is also the first which actually turns pools on at boot, and does not permit turning them off - which along with psets and pools coexisting, is the main goal of the whole project.
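
The shape of that change can be sketched as follows. The types and names here (struct pool, struct thread, nthreads, thread_bind_pool(), pool_is_idle()) are hypothetical and heavily simplified, not the Solaris definitions, and the locking that the real code needs is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; not the Solaris definitions. */
struct pool {
    int id;
    unsigned int nthreads;  /* reference count, in threads not processes */
};

struct thread {
    struct pool *pool;      /* per-thread pool pointer */
};

/* Move one thread to a new pool, keeping both counts consistent. */
void thread_bind_pool(struct thread *t, struct pool *newpool)
{
    assert(t != NULL && newpool != NULL);
    if (t->pool == newpool)
        return;
    if (t->pool != NULL) {
        /* The kind of consistency check the stress test tripped. */
        assert(t->pool->nthreads > 0);
        t->pool->nthreads--;
    }
    newpool->nthreads++;
    t->pool = newpool;
}

/* A pool can only go away once no thread refers to it. */
int pool_is_idle(const struct pool *p)
{
    return p->nthreads == 0;
}
```

Counting threads means two threads of the same process can sit in different pools without the counts going wrong, which a per-process pointer and count cannot express.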

The latest build test results from 7/12, before I turned my attention back to the refcount race condition in the psets code:

Result Total:
        FAIL: 16
        OTHER_13: 32
        PASS: 664
        UNSUPPORTED: 5

Of course, some of these tests no longer make sense to have, and some of them simply expect the wrong thing - mainly all of the tests which try to reproduce different behavior with pools turned on or off. Right now my kernel simply lies and always says that enabling or disabling succeeded, though queries for state always return POOL_ENABLED. This accounts for at least a couple of the remaining failures. Many of the OTHER results are hard to interpret - most of the tests giving these results are written differently from most of the others, and don't provide as much information - but this is probably an indication of something giving them errno 13 - EACCES.



Posted by csg [Sun] ( July 13, 2007 02:12 PM ) Permalink | Comments[0]
http://blogs.sun.com/csg/date/20070630 Saturday June 30, 2007

Deleting 14TB, and a Finished Prototype

Before I finished my prototype of merging processor sets and pools, this past week was already off to a great start:

The first, and possibly the most amusing, bug of the week was found by Dan K. (and not in his own code). When we left last Friday evening, Dan had kicked off a recursive find ... -exec ... to grep basically every source file for calls to a particular function he was changing for his time independent zones. He redirected the output to a file in the search area, so once find hit the log file and started exec'ing grep on it... you get the idea. We arrived on Monday to several emails about the build server being out of space. People had deleted old workspaces once this happened, and were now working away again. Dan of course deleted the file (which, due to compression, was only 654 GB or so on disk). This was all well and good.

Then at some point I started doing a file operation, and it wasn't completing. Sending kill signals to the process yielded no results, so it was clearly stuck in the kernel somewhere (actual signal handling for things like kill signals only occurs when threads are entering or exiting syscalls). My officemate was having a similar problem. I walked to Dan's office, and he had the same problem. We went to Dan Price, who determined there was a filesystem issue. The three of us went to see Matt Ahrens, who was already working on the bug, as another engineer had already tracked it down to ZFS. After some amount of standing there, most of us left, as we weren't really contributing anything. We went to lunch, came back, and about an hour later the server began working again. My chmod had finally finished - after two and a half hours.

A short time later we got an email from Matt Ahrens saying that the problem was that a very large file had been deleted, and was just taking a long time. ZFS tries to cache all the level 1 indirect blocks for a file in memory when deleting. For a 14TB file, that's roughly 16GB - on a machine with 12 GB of RAM. And everyone else's threads were blocked on the thread busy deleting Dan's log file. The ZFS guys knew this sort of thing would be problematic, but had never actually run into it before... and so, they filed Bug 6573681.
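
As a rough sanity check on that figure, here is a back-of-the-envelope calculation in C. The record size and block pointer size are passed in as parameters because the values I plug in afterwards (128 KiB records, 128-byte block pointers) are my assumptions about the configuration, not numbers from Matt's email; l1_indirect_bytes() is a made-up helper name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of level-1 indirect blocks needed to map a file: one block
 * pointer per data block. Sizes are parameters because the values
 * used in the note below are assumptions, not measured numbers.
 */
uint64_t l1_indirect_bytes(uint64_t file_bytes, uint64_t record_bytes,
    uint64_t blkptr_bytes)
{
    assert(record_bytes > 0);
    /* Round up: a partial last record still needs a block pointer. */
    uint64_t nblocks = (file_bytes + record_bytes - 1) / record_bytes;
    return nblocks * blkptr_bytes;
}
```

Under those assumptions, l1_indirect_bytes(14ULL << 40, 128ULL << 10, 128) comes to 14 GiB for a 14 TiB file, which is in the same ballpark as the quoted 16GB and already more than the server's 12 GB of RAM.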

At this point, I have a mostly functional version of a kernel in which processor sets and pools are merged, and both always enabled. There are a couple minor cases I haven't handled yet, but it's basically finished. I've sent out a proposal for the change for review by a few people who have some experience with those systems, and am gathering feedback. Soon I'll start going through the formal process of filing for a significant change to the kernel. In the meantime, I'm also looking at other possible projects for the rest of the summer - perhaps one of the projects I had considered earlier for my first project, and perhaps a new RFE...



Posted by csg [Sun] ( June 30, 2007 09:08 PM ) Permalink | Comments[0]
http://blogs.sun.com/csg/date/20070623 Saturday June 23, 2007

Processor Sets, and Pools, and Partitions! Oh My!

My summer project is merging the two processor grouping abstractions in the Solaris kernel, whose uses are currently mutually exclusive. This will allow an extremely useful feature of zones, which is currently disabled by default, to always be enabled. So far, I've discovered some interesting things in the course of this work, such as a trick the JVM plays on Solaris to evade resource controls.

Posted by csg [Sun] ( June 23, 2007 12:10 AM ) Permalink | Comments[0]
http://blogs.sun.com/csg/date/20070619 Tuesday June 19, 2007

Two Weeks in the Kernel

Steve Lawrence suggested I make some notes about my first couple weeks at Sun, while it was still fresh. This seems to be the most appropriate place. He suggested a couple questions to answer for myself:

I was also somewhat surprised, given my previous internship working in filesystems and my interest in filesystems following that experience, that I'm not working on ZFS. Though really, I suppose it's in tune with my other goals for this internship - to try something different. To some degree an internship is to help decide if the work you do there is the sort of work you'd like to do in the long term. I think last summer I reached the conclusion that I enjoy operating systems work - filesystems, kernels and such. I enjoy the problems in those and related areas. So for me this is more about sampling the variety which can be had within that space. And thus far, the Solaris kernel group has been great, and I don't expect that to change.

Posted by csg [Personal] ( June 19, 2007 09:50 PM ) Permalink | Comments[0]
http://blogs.sun.com/csg/date/20070615 Friday June 15, 2007

Integer Overflow and An Inverted Spin-Lock

An unexpected limitation for the C preprocessor to place on code, and how to aggravate a race condition by using additional synchronization... Plus an aside on formatting code as HTML.

Posted by csg [Sun] ( June 15, 2007 10:21 PM ) Permalink | Comments[1]
http://blogs.sun.com/csg/date/20070611 Monday June 11, 2007

It's Broken - Prove It

Today I finally did the putback (basically, the commit to the main repository) for my first, fairly trivial bugfix. However, I'm still working on the fix for the remaining race condition. I have the fix worked out, and am pretty certain that it is correct, but the challenge in this bug is in reproducing it. Or, as I don't actually know of anyone hitting this bug, producing it for the first time.

The problem itself is described in the bug report, so you should take a look at that to get a bit of background. Reproducing the bug requires three threads, in two processes. It revolves around the interactions between exitlwps() and kadmin(). exitlwps() is a function called by user processes to kill all threads other than the calling thread. Called concurrently, the first thread into the function takes a lock, sets a flag, unlocks it, and continues cleanup. Subsequent calls take the lock, see the flag, and then exit themselves. kadmin() does a number of things, but the basic flow of kadmin() for the cases in question is:

stuff
if (!trylock(m))
        return          /* another thread is already shutting down */
if (exitlwps(0) != 0) {
        unlock(m)
        return          /* error */
}
...
shut down or reboot the system
...
unlock(m)
return

So consider this case: one thread takes the lock in kadmin(), another enters kadmin() and fails to take the lock (as this means another thread is shutting down the system already). The second thread returns. Now let the first thread get a nonzero return value from exitlwps(). It will return. This thread has returned on error, and the second thread to enter kadmin() returned success - but the machine is not rebooting or shutting down! Even setting aside the fact that the documentation for this function says that "successful" calls with the flags for rebooting or shutting down should not return, this is a problem. Subsequent attempts to shut down (if they don't hit this race condition) will still succeed (not the case before my first bugfix, where the system could land in a state where it returned "success" from this call repeatedly but not shut down).

The fix isn't terribly complicated, though it took some thought (and conversations) to arrive at. The mutex is supplemented with a condition variable and a flag. The winning thread locks the mutex, sets the flag, unlocks, and goes on its way. Subsequent threads lock, see the flag, and sleep on the condition variable. The winning thread either shuts down the system, or clears state and signals the condition variable if it encounters an error - problem solved.
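
A user-space sketch of that scheme, using pthreads, might look like the following. It is a model under assumptions, not the actual kadmin() change: every name in it (guarded_shutdown(), shutdown_in_progress, prepare_ok(), prepare_fails(), fake_shutdown() and so on) is invented for illustration, the prepare callback stands in for the exitlwps() call, and having a woken thread retry as the new winner is my guess at what the sleepers should do.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; a user-space model, not the actual kadmin() fix. */
static pthread_mutex_t shutdown_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t shutdown_cv = PTHREAD_COND_INITIALIZER;
static bool shutdown_in_progress = false;
int shutdowns_done = 0;

/* Stand-ins for exitlwps() succeeding or failing, and for the shutdown. */
int prepare_ok(void) { return 0; }
int prepare_fails(void) { return -1; }
void fake_shutdown(void) { shutdowns_done++; }

/* Returns 0 if this caller shut the system down, -1 if preparation failed. */
int guarded_shutdown(int (*prepare)(void), void (*do_shutdown)(void))
{
    assert(prepare != NULL && do_shutdown != NULL);

    pthread_mutex_lock(&shutdown_lock);
    /* Losers sleep here rather than returning "success" early. */
    while (shutdown_in_progress)
        pthread_cond_wait(&shutdown_cv, &shutdown_lock);
    shutdown_in_progress = true;        /* this thread is the winner */
    pthread_mutex_unlock(&shutdown_lock);

    if (prepare() != 0) {
        /* Error path: clear the flag and wake the sleepers. */
        pthread_mutex_lock(&shutdown_lock);
        shutdown_in_progress = false;
        pthread_cond_broadcast(&shutdown_cv);
        pthread_mutex_unlock(&shutdown_lock);
        return -1;
    }

    do_shutdown();      /* in the kernel, this would never return */
    return 0;
}
```

The point of the flag plus condition variable, compared with the bare trylock, is that a losing thread can no longer report success while the winner is still able to fail.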

But to submit a bug, I need to prove it was broken in the first place. I need a way to reproduce the bug. Timing-related bugs are usually very difficult to debug. But they seem to be even more difficult to cause intentionally. The solution I'm currently using is a combination of a hacky C program and DTrace. I have a program which forks. One process spawns an extra thread which calls exit(), and has its original thread make a call to uadmin(), which then calls kadmin(). The other process only nanosleep()s and calls kadmin(). Using DTrace's chill() action, I can stall the first thread to call exitlwps() (which is in the multi-threaded process because of the nanosleep()). This causes that thread's call to fail because the exit() call from the other thread succeeds before that exitlwps() call actually does anything. The difficulty now is getting the part of the single-threaded process's kadmin() call which attempts to take the mutex to occur in the brief window when the first kadmin() call is still holding the mutex - any earlier or later, and the test machine reboots, making this a rather obnoxious bug to (fail to) test.

Why don't I just have the chill() in the exitlwps() call wait for 30 seconds? That's in the critical section, and surely long enough for the other process to be rescheduled and have its call spuriously return success. Because DTrace doesn't allow it. In their attempt to save DTrace users from themselves, its designers intentionally limited how much chill()ing could occur to no more than half of one second - for the sake of the system making progress. Normally this is great - I'm just using DTrace in a way which is the exact opposite of how it was intended to be used. It was meant to find and observe bugs - not to trigger them.

I'm now considering several options:

I'll make my decision tomorrow, unless I think of a better way to do this (though I'm not sure one exists).



Posted by csg [Sun] ( June 11, 2007 11:23 PM ) Permalink | Comments[0]
http://blogs.sun.com/csg/date/20070607 Thursday June 07, 2007

Figuring Things Out

Well, Monday I began my summer internship with the Solaris Kernel Group at Sun. Thus far things seem to be going pretty well - I've fixed one bug in the shutdown code (my changes are still uncommitted). I'm also working on a related race condition in the same section of code. I'm doing these small bug fixes as a way to learn the basics of the build system and such - so far I've learned how to build, how to update a running system, the basics of mdb (which I'm enjoying - more on that later, perhaps), and some DTrace basics. I have yet to pick an overarching summer project, though I'm leaning strongly towards working on something related to Solaris Zones.

Posted by csg [Sun] ( June 07, 2007 07:34 PM ) Permalink | Comments[0]