Views on software from Bryan Cantrill's deck chair The Observation Deck

Monday Jun 27, 2005

I spent the day today at JavaOne in the City, where DTrace was featured prominently in the morning keynote: JohnnyL described DTrace in the abstract, and drew a (surprisingly large!) round of applause when he announced that DTrace now supports Java. He then called Adam on stage to do a demo, the most remarkable aspect of which is that it was actually DTrace and not the canned, gussied-up gimcrack that normally passes for a keynote demo. In fact, this demo was so tangibly real, Adam has blogged the scripts from the demo. With these scripts, you can now give Adam's keynote in the comfort of your own home!

In other JavaOne news, Adam, Stephen and I are going to be leading the charge at an OpenSolaris BOF tonight at 7:30pm at the San Francisco Marriot in Golden Gate B1. With the IBM announcement today supporting Solaris, with the new super-fast, super-cheap Solaris Opteron workstation, and with new Solaris technologies like DTrace and Zones, Solaris has never been more attractive to the Java developer; swing by our BOF to see what all the fuss is about!


Technorati tags:

Saturday Jun 25, 2005

Even though the launch of OpenSolaris was well over a week ago, and even though the Opening Day entries have now been sifted through in five different blog entries (here, here, here here and here), there are still some great uncategorized entries. Without further ado...

Phew! I think that about does it. When we first tried to encourage Solaris engineers to blog on Opening Day, I thought we were going to have a hard time convincing engineers to blog -- I knew that providing in-depth, technical content takes a lot of time, and I knew that everyone had other priorities. So when we were planning the launch and talking about the possibility of dealing with a massive amount of Opening Day content, my response was "hurt me with that problem." Well, as it turns out, most engineers didn't need much convincing -- many provided rich, deep content -- and I was indeed hurt with that problem! While it was time consuming to sift through them, hopefully you've enjoyed reading these entries as much as I have. And let it be said once more: welcome to OpenSolaris!


Technorati tags:

Monday Jun 20, 2005

Despite there being now four blog entries to sift through the Opening Day entries (one from me, one from Liane, one from Claire and another from me), there are still some great entries that have gone uncategorized. Many of these entries fall into the category of "debugging war stories" -- engineers describing a particularly tough or interesting bug that they nailed. The proliferation of these kinds of stories among the Opening Day entries is revealing: in some organizations, you're a hero among engineers if you come up with a whizzy new app (even if the implementation is a cesspool), while in others you might gain respect among engineers by getting a big performance win (even if it makes things a bit flakey), but in Solaris development, nothing gains the respect of peers like debugging some diabolical bug that has plagued the operating system for years. So if you're new to OpenSolaris and you're looking to make a name for yourself, a good place to start is in the never-ending search for elusive bugs. To get a flavor for the thrill of the hunt, check out: And here are a few more on the simple (but thorough) satisfaction from fixing old bugs: That about does it for today. There are still many more entries to categorize, but fortunately, I think I can see the light at the end of the tunnel -- or is that just the approach of death from the exhaustion of sifting through all of this content?


Technorati tags:

Friday Jun 17, 2005

If you didn't see it, Liane Praza picked up where my sifting left off, adding a blog entry pointing to more Opening Day entries -- this time in the categories of devices and device configuration, security, networking, and standards. But there are still a ton of entries to categorize, so picking up again in no particular order...

  • System calls. System calls are the among most fundamental mechanisms in operating systems: they are the mechanism by which untrusted, unprivileged software requests a service of trusted, privileged software. We are lucky to have two great entries describing the architecture-specific mechanisms of system calls in Solaris: check out Russ Blaine's entry on system calls on x86, and Gavin Maltby's entry on system calls on SPARC. Then, to understand the architectural-neutral aspects of system calls, head over to Eric Schrock's entry on how to add a system call.

    As a quick aside, that last entry is a great example of how we in Solaris Kernel Development are using blogs to write down information that (believe it or not) has just been an unspoked part of the craft before now. As Tim Bray observed, blogs have become a critical conduit of information for us -- we believe that they are the most scalable way to get information from the people who have it to the people who need it. If (when?) you become an OpenSolaris developer, you can expect some friendly peer pressure to create a blog and join the party.

  • Build process and workspace management. We pride ourselves on a seamless build process, and a couple of entries have gone into various aspects of this in depth. To give you an idea of how seriously we take the build process -- and why -- check out Scott Rotondo's entry on using lint to find security vulnerabilities. In particular, note what Scott says when he added a new lint option that generated 500 new warnings: "I needed to fix all of these before integrating my change to Makefile.master because we require the Solaris source to be lint-clean." To which I add only, "dammit." Next, head over to Jim Carlson's entry describing the work he did to support non-root builds. Jim's entry demonstrates how difficult it is to radically change the build process -- and how he managed to pull it off. Finally, if you want to really let your makefile flag fly, check out Mark Nelson's entry describing the build support for localized messages.

    In terms of workspace management, you'll want to check out Will Fiveash's entry describing our workspace management tool, wx. For a long time, wx was a shell script in Bonwick's home directory. It was incredibly useful, but it was also easy to accidentally blow your brains out. (As Bart is fond of saying, it was "all blade and no handle.") Will's rewrite made for a much more safer, much more sophisticated wx -- and it was a huge help to us in automating the final approach of the DTrace integration.

  • Debuggability. If you read just a couple of the Opening Day entries, you probably noticed a trend: many of the entries were about finding some nasty bug in the system. This is an accurate reflection of our ethos in developing Solaris: the operating system must be reliable above all else, and we view debugging the operating system as our primary responsibility. This responsibility runs deeper than just the act of debugging, because our needs so outstripped existing tools that we designed and built our own -- most notably mdb and DTrace. Fortunately, we ship these tools to you, so you can use them on your own system and on your own applications.

    There are many entries describing these tools and how they were used to tackle a problem. Fittingly, a good place to start is Mike Shapiro's entry describing using mdb to debug a sendmail bug. This bug is described in 4278156, which has one of the greatest bug synopses of all time: "sendmail died in a two SIGALRM fire." 1 For more on the power of mdb, take a look at Eric Saxe's entry on using mdb to debug a scheduling problem, Ashish Mehta's entry on using mdb to debug a race condition, and Eric Kustarz's entry demonstrating an mdb debugger command ("dcmd") that he wrote to retrieve NFSv4 recovery messages postmortem. This last example is a particularly good one because this is exactly the kind of custom debugging infrastructure that mdb's modular architecture makes easy to build. For a comprehensive example of how we have developed subsystem-specific debugging infrastructure, read Sasha Kolbasov's entry on the mdb dcmds related to STREAMS. As Sasha mentions, the place to start for learning to write your own modules is the documentation -- but you can get a flavor for it by reading Yu Xiangning's entry on writing a writing a module for kmdb. kmdb is the in-situ kernel debugger that implements mdb, and when you need it, nothing else will do -- as Dan Mick describes in his entry on debugging with kmdb and moddebug. For more details on kmdb itself, check out Matt Simmons' entry on kmdb's design and implementation. To see how mdb can help debug your application, take a look at Will Fiveash's notes on using debugging application memory problems. Will mentions ::findleaks, a debugger command that I originally implemented for kernel crash dumps, and that Jonathan Adams subsequently ported to work on application core files and -- as he mentions in his entry -- reworking it substantially in the process.

    While mdb is the acme of postmortem debugging, if the manifestation of a bug is non-fatal, it's often more effective to use DTrace to debug it. For an exanple of this, look at Bart Smaalders' entry on using DTrace to debug jitter. It was gratifying to see Bart debug this problem using DTrace, because latency bubbles were actually one of the motivating pathologies behind DTrace. And finally, debuggability doesn't end with tools; subsystems must be designed with debuggability in mind, as Stephen Hahn describes in his entry on designing libuutil for debuggability.

I think that about does it for today. As someone pointed out on Liane's blog, we need a Wiki for this; we agree -- it's on the list of planned enhancements for opensolaris.org. Until then, stay tuned for more sifting...


Technorati tags:

Thursday Jun 16, 2005

For those of you in the Bay Area, I will be giving a talk and demo on DTrace at tonight's BayLISA meeting in Cupertino. Starts at 7:30p, here are the directions. I will be demo'ing DTrace tonight on my laptop running OpenSolaris bits that I downloaded and built myself from outside of Sun's internal network. So if you're interested in DTrace, OpenSolaris or both (and you're in the Bay Area), you might want to check it out...


Technorati tags:

Wednesday Jun 15, 2005

Yesterday was Opening Day for OpenSolaris, and we welcomed OpenSolaris with hundreds of blog entries describing various aspects of the implementation. The breadth and depth of our blogging will hopefully put to rest any notion that open sourcing Solaris isn't a grass-roots effort: if nothing else, it should be clear that we in the trenches are very excited to finally be able to talk about the system that we have poured so much of our lives into -- and to welcome new would-be contributors into the fold.

In our excitement, we may have overwhelmed a tad: there was so much content yesterday, that it would have been impossible for anyone to keep up -- we blogged over 200,000 words (over 800 pages!) yesterday alone. So over the next few days, I want to highlight some entries that you might have missed, broken down by subject area. In no particular order...

  • Fault management. Fault management in Solaris 10 has been completely revolutionized by the new predictive self-healing feature pioneered by my longtime co-conspirator Mike Shapiro. There are two must-read entries in this area: Andy Rudoff's entry providing a predictive self-healing overview, and Dilpreet Bindra's entry going into more depth on PCI error handling. (If for nothing else, read Dilpreet's entry for his Reading of the Vows between OpenSolaris and the Community.)

  • Virtual memory. The virtual memory system is core to any modern operating system, and there are several interesting entries here. Start with Eric Lowe's extensive entry describing page fault handling. As Eric rightly points out, page fault handling is the epicenter of the VM system; one can learn a tremendous amount about the system just by following page fault processing -- and Eric is a great guide on this journey. Once you've read Eric's entry, check out Michael Corcoran's entry on page coalescing, a technique to assure availability of large-sized pages -- which are in turn necessary to increase TLB reach. And discussion of page_t's leads naturally brings you to Rick Mesta entry describing a big performance win by prefetching these structures during boot.

    A less-discussed aspect of virtual memory is the virtual memory layout of the kernel itself. To learn about some of the complexities of this, check out Kit Chow's entry on address space limitations on 32-bit kernels. The limitation that Kit describes is one of the nasty gotchas of running 32-bit x86 in flat mode. As Kit mentions, the best workaround is to run a 64-bit kernel -- but if you're stuck with a 32-bit x86 chip, you'll want to read Kit's suggestions carefully. Kit's entry is a good segue to Prakash Sangappa's entry describing his work on dynamic segkp for 32-bit x86 systems. Prakash's work was critical for getting some more breathing space on 32-bit x86 systems -- saving hundreds of megabytes of precious VA. Of course, the ultimate breathing space is that afforded by 64 bits of VA -- and in this vein check out Nils Nieuwejaar's entry on the kernel address space layout on x64. Both Prakash and Nils quote one of those comments in the kernel source code that you really need to know about if you're going to do serious kernel development: the comment describing the address space layout in i86pc/os/startup.c and sun4/os/startup.c. This comment is one of the canonical ASCII-art comments (more on these eventually), and I usually find these comments in startup.c by searching forward for "----".

  • Linking and Loading. One of the most polished subsystems in Solaris is the linker and loader -- the craftsmanship of the engineers that have built it has been an ongoing inspiration for many of us in Solaris development. To learn more about the linker, start with Rod Evans' entry taking you on a source tour of the link-editors, and then head over to Mike Walker's entry describing library bindings. As long as you're checking out the linker, be sure to look at past entries like Rod's entry tracing of a link-edit. As you can imagine, because the dynamic linker is invoked whenever a dynamically-linked binary is executed, it's a natural place to improve performance -- especially with complicated programs like Mozilla or StarOffice that are linked to hundreds (!) of shared objects. We've certainly found some big wins in the linker over the years, but we've also discovered that it's difficult to help megaprograms without hurting nanoprograms -- and vice versa. For an interesting description of this tradeoff, check out David Safford's entry on dynamic linker performance. If nothing else, you'll see from David's work the research element of operating system development: we often aren't assured of success when we endeavor to improve the system.

  • Scheduling. CPU scheduling is one of the most basic properties of a multitasking operating system. Despite being an old problem, we find ourselves constantly improving and extending this subsystem. To learn about CPU scheduling, start with Bill Kucharski's entry describing the architecture-specific elements of context switching. Then head over to Gavin Maltby's entry describing the short-term prevention of thread migration. (Before Gavin introduced this facility, the only way to prevent migration was to prevent kernel preemption -- an overly blunt mechanism that led to a really nasty latency bubble that I debugged many years ago.)

    If you're going to understand thread dispatching, you'll need to understand the way thread state is manipulated -- and for that you'll want to look at Saurabh Mishra's entry describing thread locks. Thread locks are different from normal synchronization primitives, as you can infer from my own entry describing a bug in user-level priority inheritance -- which is a good segue to a more general problem when dealing with thread control: how does one change the scheduling properties of a running thread? For an idea of how tricky this can be, check out Andrei Dorofeev's entry describing binding processes to resource pools. Andrei's problem was even more challenging than traditional thread manipulation, as he needed to change the scheduling properties of a group of threads atomically. If for no other reason, you should read Andrei's entry to learn of the "curse of disp.c." Speaking of the cursed, wrap up your tour of scheduling with Eric Saxe's entry describing debugging a wedged kernel -- you'll see from Eric's odyssey that scheduling problems can require a lot of brain-bending (and patience) to debug!

Okay, I think that's enough for today -- and yet it barely scratches the surface! I didn't even touch on gigantic topics with many Opening Day entries like security, networking, I/O, filesystems, performance, scheduling, service management, observability, etc. etc. Stay tuned -- or check out the Opening Day entries for yourself...


Technorati tags:

Tuesday Jun 14, 2005

OpenSolaris Sewer Tour Having OpenSolaris available today is an odd phenomenon in some ways. For me, having spent virtually my entire professional engineering life developing Solaris, opening the source is like having tourists suddenly flock to your hometown. And as the proud and unabashed local that I am, I feel compelled to welcome newcomers with a little bit of a personal tour through the source. But I don't want to lead the kind of bus tour that you'll get from the big tour operators (not that you don't want to take in those sights of course); I'd rather take you on something like a sewer tour through the city's underbelly: I want to show you the water mains and the power grid and the telco infrastructure that make Solaris what it is. Solaris is the New York City of operating systems: it is what you don't see that makes what you do see possible -- and like New York City, it's what you don't see that's really interesting...[1]

So, with that preamble, let's do a little source spelunking into the core of Solaris. A word of warning: we're going to go deep into the system, and this won't be for everyone. (Or even, perhaps, anyone?) But if you're ready to have your brain bent a bit, grab yourself a cup of coffee, close your office door and read on...

For today, I want to fully explain the context around this somewhat cryptic comment of mine in turnstile_block in usr/src/uts/common/os/turnstile.c:

        /*
         * Follow the blocking chain to its end, willing our priority to
         * everyone who's in our way.
         */
        while (t->t_sobj_ops != NULL &&
            (owner = SOBJ_OWNER(t->t_sobj_ops, t->t_wchan)) != NULL) {
                if (owner == curthread) {
                        if (SOBJ_TYPE(sobj_ops) != SOBJ_USER_PI) {
                                panic("Deadlock: cycle in blocking chain");
                        }

                        /*
                         * If the cycle we've encountered ends in mp,
                         * then we know it isn't a 'real' cycle because
                         * we're going to drop mp before we go to sleep.
                         * Moreover, since we've come full circle we know
                         * that we must have willed priority to everyone
                         * in our way.  Therefore, we can break out now.
                         */
                        if (t->t_wchan == (void *)mp)
                                break;


There's quite a story behind this comment -- both in human and engineering terms. The human side of the story is that this code was added in the last moments of Solaris 8, in one of the more intense experiences of my engineering career: a week-long collaboration with Jeff Bonwick that required so much shared mental state that he and I both came to call it "the mind-meld." The engineering story is quite a bit more intricate, and requires a lot more background. It starts earlier in Solaris 8, when we developed an infrastructure for user-level priority inheritance -- a mechanism to avoid priority inversion. For those unfamiliar with the problem of priority inversion, it is this: given three threads at three different priorities, if the highest priority thread blocks on a synchronization object held by the lowest priority thread, the middling priority thread could (in a pure priority preemptive system running on a uniprocessor) run in perpetuity. This is an inversion, and one mechanism to solve it is something called priority inheritance. Under priority inheritance, when one thread is going to block on a lower priority thread, the higher priority thread wills its priority to the lower priority thread. That is, the lower priority thread inherits the higher priority for the duration of the critical section.

Now, in Solaris we have long had priority inheritance for kernel synchronization primitives -- indeed, this is one of the architectural differences between SunOS 4.x and Solaris 2.x. And just getting priority inheritance right for kernel synchronization primitives is nasty: one must know who owns a lock, and one must know which lock that owner is blocked on (if any). That is, if a thread is blocking on a lock that itself is owned by a blocked thread, we need to be able to determine what lock the blocked thread is blocked on, and which thread owns that lock. In order to avoid missed wakeups, we must do this without one of the threads in the blocking chain changing its state. (That is, we don't want to conclude that a thread that we're blocking on isn't blocked itself, only to have it block before we go to sleep -- this would create a window for potential priority inversion.) The way that we traditionally prevent a thread from moving underneath us is to grab its thread lock (a clever synchronization mechanism that merits its own lengthy blog entry[2]), but this implies that we're going to be simultaneously holding two locks of the same type. This is a problem because -- in any parallel system -- simultaneously holding more than one lock of the same type is a tricky proposition: there is a possibility of deadlock if the lock acquisition order is not rigidly defined. This is a particular problem for thread locks, because both thread locks can be locks on turnstile hash chains -- and turnstile hash chains can be used by disjoint threads simultaneously. (A turnstile is the mechanism that we use to put a thread to sleep.) Jeff solved this particular problem of turnstile deadlocking in an elegant way; I'll let his comment in turnstile_interlock do the explaining:

/*
 * When we apply priority inheritance, we must grab the owner's thread lock
 * while already holding the waiter's thread lock.  If both thread locks are
 * turnstile locks, this can lead to deadlock: while we hold L1 and try to
 * grab L2, some unrelated thread may be applying priority inheritance to
 * some other blocking chain, holding L2 and trying to grab L1.  The most
 * obvious solution -- do a lock_try() for the owner lock -- isn't quite
 * sufficient because it can cause livelock: each thread may hold one lock,
 * try to grab the other, fail, bail out, and try again, looping forever.
 * To prevent livelock we must define a winner, i.e. define an arbitrary
 * lock ordering on the turnstile locks.  For simplicity we declare that
 * virtual address order defines lock order, i.e. if L1 < L2, then the
 * correct lock ordering is L1, L2.  Thus the thread that holds L1 and
 * wants L2 should spin until L2 is available, but the thread that holds
 * L2 and can't get L1 on the first try must drop L2 and return failure.
 * Moreover, the losing thread must not reacquire L2 until the winning
 * thread has had a chance to grab it; to ensure this, the losing thread
 * must grab L1 after dropping L2, thus spinning until the winner is done.
 * Complicating matters further, note that the owner's thread lock pointer
 * can change (i.e. be pointed at a different lock) while we're trying to
 * grab it.  If that happens, we must unwind our state and try again.
 *
 * On success, returns 1 with both locks held.
 * On failure, returns 0 with neither lock held.
 */

This lock ordering issue is part of what made it difficult to implement priority inheritance for kernel synchronization objects -- but priority inheritance for kernel synchronization objects only solves part of the larger priority inversion problem: in a multithreaded real-time system, one needs priority inheritance for both kernel-level and user-level synchronization objects. And this problem -- user-level priority inheritance -- is the problem that we endeavored to solve in Solaris 8. We assigned an engineer to solve it, and (with extensive guidance from those of us who best understand the guts of scheduling and synchronization), the new facility was integrated in October of 1999.

A few months later -- in December of 1999 -- I was looking at a crash dump from an operating system panic that a colleague had encountered. It was immediately clear that this was some sort of defect in our implementation of user-level priority inheritance, but as I understood the bug, I came to realize that this was no surface problem: this was a design defect. Here is my analysis:

[ bmc, 12/13/99 ]

The following sequence of events can explain the state in the dump (the arrow
denotes an ordering):

        Thread A (300039c8580)                 Thread B (30003c492a0)
	(executing on CPU 10)                   (executing on CPU 4)
+------------------------------------+ +-------------------------------------+
|                                    | |                                     |
|  Calls lwp_upimutex_lock() on      | |                                     |
|  lock 0xff350000                   | |                                     |
|                                    | |                                     |
|  lwp_upimutex_lock() acquires      | |                                     |
|  upibp->upib_lock                  | |                                     |
|                                    | |                                     |
|  lwp_upimutex_lock(), seeing the   | |                                     |
|  lock held, calls turnstile_block()| |                                     |
|                                    | |                                     |
|  turnstile_block():                | |                                     |
|  - Acquires A's thread lock        | |                                     |
|  - Transitions A into TS_SLEEP     | |                                     |
|  - Drops A's thread lock           | |                                     |
|  - Drops upibp->upib_lock          | |                                     |
|  - Calls swtch()                   | |                                     |
|                                    | |                                     |
|                                    | |                                     |
:                                    : :                                     :

   +----------------------------------------------------------------------+
   | Holder of 0xff350000 releases the lock, explicitly handing it off to |
   | thread A (and thus setting upi_owner to 300039c8580)                 |
   +----------------------------------------------------------------------+

:                                    : :                                     :
|                                    | |                                     |
|  Returns from turnstile_block()    | |                                     |
|                                    | |  Calls lwp_upimutex_lock() on       |
|                                    | |  lock 0xff350000                    |
|                                    | |                                     |
|                                    | |  lwp_upimutex_lock() acquires       |
|                                    | |  upibp->upib_lock                   |
|                                    | |                                     |
|                                    | |  Seeing the lock held (by A), calls |
|                                    | |  turnstile_block()                  |
|  Calls lwp_upimutex_owned() to     | |                                     |
|  check for lock hand-off           | |  turnstile_block():                 |
|                                    | |  - Acquires B's thread lock         |
|  lwp_upimutex_owned() attempts     | |  - Transitions B into TS_SLEEP,     |
|  to acquire upibp->upib_lock       | |    setting B's wchan to upimutex    |
|                                    | |    corresponding to 0xff350000      |
|  upibp->upib_lock is held by B;    | |  - Attempts to promote holder of    |
|  calls into turnstile_block()      | |    0xff350000 (Thread A)            |
|  through mutex_vector_enter()      | |  - Acquires A's thread lock         |
|                                    | |  - Adjusts A's priority             |
|  turnstile_block():                | |  - Drops A's thread lock            |
|                          <--------------+                                  |
|  - Acquires A's thread lock        | |  - Drops B's thread lock            |
|  - Attempts to promote holder of   | |                                     |
|    upibp->upib_lock (Thread B)     | |                                     |
|  - Acquires B's thread lock        | |  - Drops upibp->upib_lock           |
|  - Adjusts B's priority            | |                                     | 
|  - Drops B's thread lock           | |                                     |
|  - Seeing that B's wchan is not    | |                                     |
|    NULL, attempts to continue      | |                                     |
|    priority inheritance            | |                                     |
|  - Calls SOBJ_OWNER() on B's wchan | |                                     |
|  - Seeing that owner of B's wchan  | |                                     |
|    is A, panics with "Deadlock:    | |                                     |
|    cycle in blocking chain"        | |                                     |
|                                    | |                                     |
+------------------------------------+ +-------------------------------------+

As the above sequence implies, the problem is in turnstile_block():

        THREAD_SLEEP(t, &tc->tc_lock);
        t->t_wchan = sobj;
        t->t_sobj_ops = sobj_ops;
        ...
        /*
         * Follow the blocking chain to its end, or until we run out of
         * inversions, willing our priority to everyone who's in our way.
         */
        while (inverted && t->t_sobj_ops != NULL &&
            (owner = SOBJ_OWNER(t->t_sobj_ops, t->t_wchan)) != NULL) {
                ...
        }
(1) --> thread_unlock_nopreempt(t);
        /*
         * At this point, "t" may not be curthread. So, use "curthread", from
         * now on, instead of "t".
         */
        if (SOBJ_TYPE(sobj_ops) == SOBJ_USER_PI) {
(2) -->         mutex_exit(mp);
                ...

We're dropping the thread lock of the blocking thread (at (1)) before we drop
the upibp->upib_lock at (2).  From (1) until (2) we are violating one of
the invariants of SOBJ_USER_PI locks:  when sleeping on a SOBJ_USER_PI lock,
_no_ kernel locks may be held; any held kernel locks can yield a deadlock
panic.

Once I understood the problem, it was disconcertingly easy to reproduce: in a few minutes I was able to bang out a test case that panicked the system in the same manner as seen in the crash dump.[3] While I had some ideas on how to fix this, the late date in the release and the seriousness of the problem prompted me to call Jeff at home to discuss.[4] As Jeff and I discussed the problem, we couldn't seem to come up with a potential solution that didn't introduce a new problem. Indeed, the more we talked about the problem the harder it seemed. The essence of the problem is this: for user-level locks, we normally keep track of the state associated with the lock (for example, whether or not there's a waiter) at user-level -- and that information is considered purely advisory by the kernel. (There are several situations in which the waiters bit can't be trusted, and the kernel knows not to trust it in these situations.) To implement priority inheritance for user-level locks, however, one must become much more precise about ownership -- the ownership must be tracked the same way we track ownership for kernel-level synchronization primitives. That is, when we're doing the complicated thread lock dance that I described above, we can't be doing loads from user-level memory to determine ownership.[5] Here's the nasty implication of this: the kernel-level state tracking the ownership of the user-level lock must itself be protected by a lock, and that (in-kernel) lock must itself implement priority inheritance to avoid a potential inversion. This leads us to a deadlock that we did not predict: the in-kernel lock must be acquired and dropped to both acquire the user-level lock and to drop it. That is, there are conditions in which a thread owns the in-kernel lock and wants the user-level lock, and there are conditions in which a thread owns the user-level lock and wants the in-kernel lock. (The upib_lock is the in-kernel lock that implements this. Knowing this, you might want to pause to reread my description of the race condition above.)

The above bug captures one manifestation of this, but Jeff and I began to realize that there must be another manifestation lurking: if one were blocking on the in-kernel lock when the false deadlock was discovered, the kernel would clearly panic. But what if one were blocking on the user-level lock when the false deadlock was discovered? We quickly determined (and a test case confirmed) that in this case, the attempt to acquire the user-level would (erroneously) return EDEADLK. That is, in this case, the code saw that the "deadlock" was induced by a user-level synchronization primitive, and therefore assumed that it was an application-induced deadlock -- a bug in the application. So in this failure mode, a correct program would have one of its calls to pthread_mutex_lock erroneously fail -- a failure mode even more serious than a panic because it could easily lead to the application corrupting its data.

So how to solve these problems? We found this to be a hard problem because we kept trying to find a way to avoid that in-kernel lock. (I have presented the in-kernel lock as a natural constraint on the problem, but that was a conclusion that we only came to with tremendous reluctance.) Whenever one of us came up with some scheme to avoid the lock, the other would find some window invalidating the scheme. After exhausting ourselves on the alternatives, we were forced to the conclusion that the in-kernel lock was a constraint on the problem -- and our focus switched from avoiding the situation to detecting it.

There are two cases to detect: the panic case and the false deadlock case. The false deadlock case is actually pretty easy to detect and handle, because we always find ourselves at the end of the blocking chain -- and we always find that the lock that we own that induced the deadlock is the in-kernel lock passed as a parameter to turnstile_block (mp). Because we know that we have willed our priority to the entire blocking chain, we can just detect this and break out.

The panic case is nastier to deal with. To remind, this case is that the thread owns the user-level synchronization object, and is blocking trying to acquire the in-kernel lock. We might wish to handle this case in a similar way, by adding a case to check if the deadlock ends in the current thread and the last synchronization object in the blocking chain is a user-level synchronization object, then it's a false deadlock. (That is, handle this case by a more general handling of the above case.) This is simple, but it's also wrong: it ignores the possibility of an actual application-level deadlock (that is, an application bug), in which EDEADLK must be returned. To deal with this case, we observe that if a blocking chain runs from in-kernel synchronization objects to user-level synchronization objects, we know that we're in this case. Since we know that we've caught another thread in code in which they can't be preempted, we can fix this by busy-waiting until the lock changes and then restarting the priority inheritance dance. Here's the code to handle this case:[6]

                /*
                 * We now have the owner's thread lock.  If we are traversing
                 * from non-SOBJ_USER_PI ops to SOBJ_USER_PI ops, then we know
                 * that we have caught the thread while in the TS_SLEEP state,
                 * but holding mp.  We know that this situation is transient
                 * (mp will be dropped before the holder actually sleeps on
                 * the SOBJ_USER_PI sobj), so we will spin waiting for mp to
                 * be dropped.  Then, as in the turnstile_interlock() failure
                 * case, we will restart the priority inheritance dance.
                 */
                if (SOBJ_TYPE(t->t_sobj_ops) != SOBJ_USER_PI &&
                    owner->t_sobj_ops != NULL &&
                    SOBJ_TYPE(owner->t_sobj_ops) == SOBJ_USER_PI) {
                        kmutex_t *upi_lock = (kmutex_t *)t->t_wchan;

                        ASSERT(IS_UPI(upi_lock));
                        ASSERT(SOBJ_TYPE(t->t_sobj_ops) == SOBJ_MUTEX);

                        if (t->t_lockp != owner->t_lockp)
                                thread_unlock_high(owner);
                        thread_unlock_high(t);
                        if (loser)
                                lock_clear(&turnstile_loser_lock);

                        while (mutex_owner(upi_lock) == owner) {
                                SMT_PAUSE();
                                continue;
                        }

                        if (loser)
                                lock_set(&turnstile_loser_lock);
                        t = curthread;
                        thread_lock_high(t);
                        continue;
                }

Once these problems were fixed, we thought we were done. But further stress testing revealed that an even darker problem lurked -- one that I honestly wasn't sure that we would be able to solve. I'm actually going to leave the description of this problem (and its solution) for another day -- but for an embarrassing reason: when I went into the code to write this blog entry, I discovered (to my horror) that an addition was made to turnstile.c with an apparent disregard for the subtle intricacies of this subsystem. (Indeed, fear of such a hasty change is exactly why we went to such pains to comment these intricacies in the first place.) That is, this darker bug has been reintroduced in a new manifestation. The bug will be virtually impossible to hit, but there is no question that it's a bug. I'm not going to describe it here, but if you find it (and you can explain it fully), and you're in the job market, we'll fly you out for an interview. (I'm not joking.) Everything you need to know to find the bug is in turnstile.c; if you can find it, pretty soon you'll be leading your own sewer tour...


[1] New York City owes a particularly large debt to what isn't normally seen: its Manhattan schist is an extremely hard variant of granite, without which its buildings' towering heights would be impossible.

[2] Calling Joe Eykholt...

[3] This is one of the most gratifying feelings in software engineering: analyzing a failure postmortem, discovering that the bug should be easily reproduced, writing a test case testing the hypothesis, and then watching the system blow up just as you predicted. Nothing quite compares to this feeling; it's the software equivalent of the walk-off home run.

[4] Jeff was at home because it was a Sunday. And Jeff was up because it was very late on Sunday -- so much so that it was actually early on Monday morning.

[5] I won't go into details about why this is the case -- it's left as an exercise to the reader -- but suffice it to say it's one of the many reasons that we have joked about requiring a license to grab thread lock.

[6] The code dealing with turnstile_loser_lock didn't actually exist when we wrote this case -- that was added to deal with (yet) another problem we discovered as a result of our four day mind-meld. This problem deserves its own blog entry, if only for the great name that Jeff gave it: dueling losers.


Technorati Tag:
Technorati Tag: