It's Broken - Prove It
Today I finally did the putback (basically, the commit to the main repository) for my first, fairly trivial bugfix. However, I'm still working on the fix for the remaining race condition. I have the fix worked out, and am pretty certain that it is correct, but the challenge in this bug is in reproducing it. Or, as I don't actually know of anyone hitting this bug, producing it for the first time.
The problem itself is described in the bug report, so you should take a look at that to get a bit of background. Reproducing the bug requires three threads, in two processes. It revolves around the interactions between exitlwps() and kadmin(). exitlwps() is a function called by user processes to killl all threads other than the calling thread. Called concurrently, the first thread into the function takes a lock, sets a flag, unlocks it, and continues cleanup. Subsequent calls take the lock, see the flag, and then exit themselves. kadmin() does a number for things, but the basic flow of kadmin() for the cases in question is:
stuff
if(!trylock(m))
return
if(exitlwps(0) != 0)
unlock(m)
return
...
shut down or reboot the system
...
unlock(m)
return
So consider this case: one thread takes the lock in kadmin(), another enters kadmin() and fails to take the lock (as this means another thread is shutting down the system already). The second thread returns. Now let the first thread get a nonzero return value from exitlwps(). It will return. This thread has returned on error, and the second thread to enter kadmin() returned success - but the machine is not rebooting or shutting down! Even setting aside the fact that the documentation for this function says that "successful" calls with the flags for rebooting or shutting down should not return, this is a problem. Subsequent attempts to shut down (if they don't hit this race condition) will still succeed (not the case before my first bugfix, where the system could land in a state where it returned "success" from this call repeatedly but not shut down).
The fix isn't terribly complicated, though it took some thought (and conversations) to arrive at. The mutex is supplemented with a condition variable and a flag. The winning thread locks the mutex, sets the flag, unlocks, and goes on its way. Subsequent threads lock, see the flag, and sleep on the condition variable. The winning thread either shuts down the system, or clears state and signals the condition variable if it encounters an error - problem solved.
But to submit a bug, I need to prove it was broken in the first place. I need a way to reproduce the bug. Timing-related bugs are usually very difficult to debug. But they seem to be even more difficult to intentionally cause. The solution I'm currently using is a combination of a hacky C program and DTrace. I have a program which forks. One process spawning an extra thread which calls exit, and has the original thread make a call to uadmin(), which then calls kadmin(). The other process only nanosleep()s and calls kadmin(). Using DTrace's chill() action, I can sleep the first thread to call exitlwps() (which is in the multi-threaded process because of the nanosleep()). This causes that thread's call to fail because the exit() call from the other thread succeeds before that exitlwps() call actually does anything. The difficulty now is to get the part of the single-threaded process's kadmin() call which attempts to take the mutex to occur in the brief window when the first kadmin() call is still holding the mutex - before or after, and the test machine reboots, making this a rather obnoxious bug to (fail to) test.
Why don't I just have the chill() in the exitlwps() call wait for 30 seconds? That's in the critical section, and surely long enough for the other process to be rescheduled and have its call spuriously return success. Because DTrace doesn't allow it. In their attempt to save DTrace users from themselves, they intentionally limited how much chill()ing could occur to no more than half of one second - for the sake of the sytem making progress. Normally this is great - I'm just using DTrace in a way which is the exact opposite of how it was intented to be used. It was meant to find and observe bugs - not to trigger them.
I'm now considering several options:
- Playing with splitting the chill() call into two calls, which total to less than half a second, and hoping to get it right.
- Playing with the nanosleep time, and hoping to get it right.
- Recompiling DTrace without this limit, which under any other circumstances would be an absolutely horrendous idea.
Posted by csg [Sun] ( June 11, 2007 11:23 PM ) Permalink | Comments[0]
