rantings of a code minimalist :: meem simplex ::

Sunday Sep 18, 2005

Openly Excited The past few weeks have been some of the most exciting (and hectic!) of my professional career, as I acclimate to the power and responsibility of an increasingly open software development model.

To rewind for a second: Clearview is a project that Sebastien and I are spearheading to rationalize, unify, and enhance the way network interfaces are handled in Solaris. "What?,", you say? Well, by defining and implementing a set of core unifying attributes (or unalienable rights)[1] for all interfaces, a wide variety of administrative, programmatic, and diagnostic problems are addressed. For example, after this work, one will be able to:

  • Observe all IP layer network traffic, including loopback, IPMP group and IP tunnel traffic.
  • Observe all IP layer network traffic flowing to and from a zone.
  • Administrate all network interfaces using dladm(1M).
  • Use VLANs and form link aggregations on all Ethernet devices.
  • Use IPMP with technologies such as DHCP and routing protocols.

... though that list is far from comprehensive (nor does the above list constitute a commitment -- this is work-in-progress, mind you!)

Before I start hearing cries of "vaporware!", let's get back to the part that has me stoked: Clearview is one of the first Solaris projects to directly involve the OpenSolaris community in its design. To wit, our team is in the process of releasing design materials for each component of the Clearview project, and is actively soliciting input from the community at large.

In fact, the first document -- covering our massive overhaul of IPMP -- was sent out a week ago, and has already received some great feedback (keep it coming!) Further, just this morning, Sebastien unveiled our proposed Tunnelling Device Driver -- if you've spent much time with tunnels in Solaris, this will be a sight for sore eyes. Of course, that's not all: in the coming weeks, Phil Kirk will provide our design for IP-Level Observability Devices, and Cathy Zhou will detail our sinister plan to bring the power of dladm to all Solaris network interfaces (and more!)

When we say directly involve, we mean it. That is, this review is instead of, rather than in addition to our traditional Sun-internal design review. As such, other project teams inside Sun are seeing these materials for the first time as well -- and we have encouraged all of them to provide their feedback to the community forum. We hope that this underscores not only our commitment to open up our development process, but our desire to actively involve the community in each step, from design through integration, and beyond.

Speaking of design: we hope these hefty documents make it clear that we do not believe in leaving the design or architecture of Solaris to the whims of late-night hackery. Moreover, although technical documents are rarely known for their humanity, we hope that our materials provide a glimpse into the deep pride we each have in both Solaris and our craft.

Needless to say, your participation is most welcome -- and rest assured that we will continue to seek your involvement as the components proceed through various stages of our software development process.

Footnotes

[1] The rights themselves and their implications are too important to serve as a sidebar to this blog entry, and will surely get one of their own in the future.

Technorati Tag:
Technorati Tag:

Saturday Aug 13, 2005

On Locklessness

Unlike many other Unix kernels, the Solaris kernel is heavily multithreaded, including our networking stack. The conventional way to ensure consistency between multiple threads is to use synchronization primitives such as mutexes and readers-writer locks -- and indeed, these are widely used throughout the Solaris kernel. However, there are times when conventional locks are inappropriate, and more creative approaches to ensuring consistency must be devised. In this blog entry, I discuss a recent issue I encountered while working on the new Nemo[1] network driver framework, and describe an elegant but rarely-used approach.

The Problem

Among its other responsibilities, the Nemo framework provides the glue between the network protocol layer (e.g., IP) and the underlying network device driver. In particular, when a packet needs to be sent to an underlying device, the protocol stack will call the dls_tx() function. The dls_tx() function takes two arguments: the data-link to send the packet on, and the the packet to send. If any part of the packet could not be sent, the remainder is returned. In the original Nemo implementation, dls_tx() was implemented as:

	mblk_t *
	dls_tx(dls_channel_t dc, mblk_t *mp);
	{
	        dls_impl_t      *dip = (dls_impl_t *)dc;
	
	        return (dip->di_tx(dip->di_tx_arg, mp));
	}
Note that dc and dip refer to the same object (the data-link), but the first is opaque so that the caller cannot grovel around in the implementation details[2].

As is apparent, dls_tx() turns around and calls a previously-specified per-datalink transmit function pointer to send packets on that data-link. The transmit function is passed a previously-specified callback argument, and the packet to transmit. The transmit function and callback argument usually specify the underlying network driver's transmit function and a driver-instance data structure associated with that link.

However, when promiscuous-mode[3] is enabled on the underlying device, the Nemo framework interposes its own transmit function and callback argument between dls_tx() and the underlying driver's transmit routine. This interposed function loops back each transmitted packet so that a system will see its own transmitted packets through tools such as snoop(1M)[4].

Unfortunately, this presents a problem: as can be seen in dls_tx(), no locks are held as part of dereferencing di_tx and di_tx_arg. Thus, if one thread is calling dls_tx() while another is enabling or disabling promiscuous mode, it is possible for us to end up invoking the callback function with a mismatched callback argument.

As an example, suppose that the normal di_tx and di_tx_arg are tx and txarg, and the promiscuous di_tx and di_tx_arg are looptx and looptxarg. Then the following could happen:

          thread 1 (in dls_tx())      |  thread 2 (enabling promiscuous)
        ------------------------------|----------------------------------
  time    load di_tx = tx             |
   |                                  |  store di_tx = looptx
   |                                  |  store di_tx_arg = looptxarg
   V      load di_tx_arg = looptxarg  |
                                      |

Clearly, passing the wrong argument to di_tx could trigger panic or other misbehavior and is thus unacceptable. As such, we need to ensure that from the perspective of dls_tx(), di_tx and di_tx_arg are always either tx/txarg, or looptx/looptxarg.

So ... why not lock?

The simplest and most obvious way to address this problem would be to introduce locks, changing dls_tx() to be:

	mblk_t *
	dls_tx(dls_channel_t dc, mblk_t *mp)
        {
	        dls_impl_t      *dip = (dls_impl_t *)dc;
		mblk_t		*mp;

		mutex_enter(&dip->di_tx_lock);
	   	mp = dip->di_tx(dip->di_tx_arg, mp);
		mutex_exit(&dip->di_tx_lock);
		return (mp);
	}

Of course, the same lock would be acquired while assigning di_tx and di_tx_arg, making the assignment of both di_tx and di_tx_arg appear atomic from the perspective of dls_tx(). While this approach will work (and, moreover, is almost always the right way to solve this problem), in this particular case it has some drawbacks. To wit:

  1. All calls to the underlying di_tx function on the given channel are now serialized. That is, if multiple threads call dls_tx() for the same dc, only one thread will be able to call di_tx at a time.
  2. The dls_tx() function is called for every transmitted packet. Thus acquiring and releasing a lock -- even supposing that it is rarely contended -- introduces needless overhead [5].
  3. The dls_tx() function is no longer eligible for compiler tail-call optimization, impacting performance since the stack frame of dls_tx() can no longer be directly reused by di_tx()[6].

Lockless Atomicity

While the first problem could be fixed by switching to a readers-writer lock, this does nothing for the other two problems. Indeed, there appears to be no way to overpower this problem with complexity, so instead we return our requirement:

We need to ensure that from the perspective of dls_tx(), di_tx and di_tx_arg are always either tx/txarg, or looptx/looptxarg.
Now, if all we needed to do was guarantee that only di_tx or di_tx_arg assignments appeared atomic, we'd need no locks at all: all processors that Solaris supports guarantee that assignments to pointer-sized quantities will be atomic [7].

However, this of course means that assignment of structure pointers is also atomic. Thus, we can simply bundle these two fields into a structure:

	typedef struct mac_txinfo_s {
	        mac_tx_t                mt_fn;
	        void                    *mt_arg;
	} mac_txinfo_t;

... and fill in two such structures for each driver: one that contains the tx/txarg pair, and one that contains the looptx/looptxarg pair. Here's the actual code that does this in i_mac_create():

        /*
         * Set up the two possible transmit routines.
         */
        mip->mi_txinfo.mt_fn = mp->m_tx;
        mip->mi_txinfo.mt_arg = mp->m_driver;
        mip->mi_txloopinfo.mt_fn = mac_txloop;
        mip->mi_txloopinfo.mt_arg = mip;

We then replace di_tx and di_tx_arg with a single di_txinfo field:

        const mac_txinfo_t              *di_txinfo;

Again, since di_txinfo is a single pointer, it can be atomically set to point to either mi_txinfo or mi_txloopinfo. Coming back around to dls_tx(), we can now remove the need for locks, and also restore the compiler's ability to tail-call optimize dls_tx():

	mblk_t *
	dls_tx(dls_channel_t dc, mblk_t *mp)
	{
	        const mac_txinfo_t *mtp = ((dls_impl_t *)dc)->di_txinfo;
	
	        return (mtp->mt_fn(mtp->mt_arg, mp));
	}

Yes, that's it -- it's really that elegant. However, it's also a bit subtle: the code is careful to dereference di_txinfo only once. A more mechanical transformation of the original dls_tx() function would reintroduce a variant of the original bug, since di_txinfo might change between the two dereferences:

	mblk_t *
	dls_tx(dls_channel_t dc, mblk_t *mp)
	{
	        dls_impl_t      *dip = (dls_impl_t *)dc;

		/* WRONG */	
	        return (dip->di_txinfo->mt_fn(dip->di_txinfo->mt_arg, mp));
	}

It's also worth noting that even the correct lockless dls_tx() has subtlely different semantics from the version which used mutexes: since there is no explicit synchronization, it's possible a thread in dls_tx() may not see recent changes to the di_txinfo pointer. In this particular case, this is acceptable since:

  • The lifetime of either value is identical, and it is not possible for either value to become invalid while in dls_tx(), due to other invariants in the Nemo framework.
  • The effect of using the wrong function for a few packets is unobservable: either the kernel will attempt to loop back a few packets unnecessarily, or the first few packets after promiscuous-mode is enabled will not be looped back. The latter is unobservable because the application which enabled promiscuous-mode cannot observe the exact instant that it became possible to loop-back packets it transmits.

In many cases, these restrictions may limit the applicability of this approach -- and moreover, may suggest that a more conventional locking approach is needed. Nonetheless, as this discussion has hopefully made clear, there are times when locks are a poor match to the problem at hand, and "going lockless" both improves performance and code elegance.

Footnotes

[1] Nemo is also known as GLDv3, and is our brand new network driver framework that bge and xge currently use. All future Sun-provided network drivers on Solaris will use GLDv3. Once the interfaces stabilize, they will be published for third-party network driver developers.
[2] Throughout the kernel, we prefer this to void * since this enables the compiler to still type-check, thus maintaining type-safety. The actual definition of dls_channel_t is typedef struct dls_t *dls_channel_t, where dls_t is never defined.
[3] Enabling promiscuous-mode allows a network device to receive and send up packets on the network which are not destined for its hardware unicast, multicast, or broadcast addresses. In the old days, this would allow all traffic on the subnet to be observed -- but in these days of switched networks, enabling promiscuous-mode often yields only a few additional packets.
[4] Almost all network devices are deaf while transmitting, so transmitted packets which a host must also receive must be looped back in software.
[5] For instance, on SPARC, a mutex_exit() will perform a membar #LoadStore|#StoreStore, which forces the processor to ensure that all outstanding loads and stores complete before any future store complete.
[6] Those not familiar with the basics of tail-call optimization should read this wiki entry. On SPARC, this optimization can be important because it can avoid needless register spill/fill traps (among other things).
[7] To my knowledge, we have never officially stated this as a guarantee, but tons of kernel code relies on this invariant.

Technorati Tag:
Technorati Tag:

Wednesday Jul 06, 2005

On Bugs and Reproducibility

I love bugs -- specifically, I love putting an end to bugs[1]. I find myself a bit tongue-twisted when asked to articulate just what makes root-causing a bug so engrossing. In brief: it's the thrill of the hunt, the pull of a mystery, the search for a deeper understanding of how the system works, and -- of course -- the gripping fear that some customer somewhere will hit it again if we don't nail the bloody thing[2].

One way to subclass bugs is by reproducibility. That is, can the bug be reproduced on demand? However, as this blog entry will make clear, almost[3] any bug can be reproduced on demand once it is sufficiently understood. Thus, when someone states "this bug is not reproducible", they're likely really saying "this bug is not sufficiently understood". Sadly, one established way a bug can be "resolved" at Sun is for it to be closed out as "not reproducible" -- not just a shameful admission of utter defeat, but a near-guarantee to customers that they will be inconvenienced by the problem yet again. Thus, many of us simply refuse to close out bugs as "not reproducible"[4].

Nonetheless, sometimes bugs are indeed closed out as "not reproducible". In fact, if a bug takes up residence in a subsystem that is poorly understood but widely used, it can thump around for years -- or even decades -- before it is finally smoked out by a particularly determined engineer. Further, often the manifestations of the bug can be so varied that even after diagnosing, it can prove impossible to track down all of the duplicates that are either still open or have been closed out over the years as "not reproducible". Here I discuss my dealings with one such bug -- 4927647.

The Sewer of STREAMS

A classic example of a poorly-understood but widely-used area is the pseudo-terminal subsystem[5]. While the basics of how pseudo-terminals work is well-documented in several texts[6], the STREAMS code that implements it is a arcane and delicate piece of machinery from AT&T that has undergone little change since its initial impact with SVR3.2 some 18 years ago.

Again, that's not to say the code is unimportant. On the contrary, it is a core building block upon which many critical utilities, such as xterms, remote logins (ssh/telnet), and tools like expect(1) are built. Thus, when something goes wrong, someone has to be called in to diagnose it -- and since it's STREAMS code, these days that often means me[7].

Enter 4927647.

Specifically, I got called in because a high-profile customer filed the following bug:

    In expect 5.38.0 on Solaris 9, run the following script: 

        for { set i 1 } { 1 } { incr i } {
            spawn echo foobarbaz
            expect {
                "foobarbaz" {
                   puts "ok"
                   expect eof wait
                }
                eof {
                    puts "$i passes"
                    wait
                    exit 1
                }
            }
        }

    It should loop forever; instead, it exits after a few iterations on
    multiprocessor systems.  It appears to loop "forever" on uniprocessors.

Indeed, running this test I was able to confirm that it did fail on multiprocessor systems -- though usually after thousands of iterations (each iteration taking close to a second), rather than the "few" indicated in the bug report. Without a reproducible test case, this was surely going to be a painful battle.

The Hunt for Reproducibility

Using truss(1), it had already been confirmed that the problem was not with expect, but with the pseudo-terminal implementation itself. Specifically, something was causing the poll(2) to occasionally wake up with POLLERR, and the subsequent read(2) to return EINVAL rather than the number of bytes in the expected message:

    25258:  open("/dev/ptmx", O_RDWR)                       = 4
    [ ... lots happens ... ]
    25258:  poll(0xFFBFEBC0, 1, 10000)                      = 1
    25258:          fd=4  ev=POLLRDNORM|POLLRDBAND rev=POLLERR
    25258:  read(4, 0x00044FE8, 3999)                       Err#22 EINVAL

Of course, with this came the sobering realization that this bug was undoubtedly responsible for countless other sporadic "not reproducible" pseudo-terminal-related bugs, and an accompanying growing list of suspected duplicates. So, the key question was: could we intuit enough about this bug to make it reproducible and finally smoke this thing out?

At the time, DTrace had not yet integrated into Solaris, but was available for internal use and had an ever-growing buzz, so I was eager to see if it could shed some light on this bug. In particular, based on the truss output, my suspicion was that a toxic message was arriving on the /dev/ptmx stream that was causing STRDERR to be set, triggering POLLERR to be returned from strpoll():

        if (sd_flags & (STPLEX | STRDERR | STWRERR)) {
                if (sd_flags & STPLEX) {
                        *reventsp = POLLNVAL;
                        return (EINVAL);
                }
                if (((events & (POLLIN | POLLRDNORM | POLLRDBAND | POLLPRI)) &&
                    (sd_flags & STRDERR)) ||
                    ((events & (POLLOUT | POLLWRNORM | POLLWRBAND)) &&
                    (sd_flags & STWRERR))) {
                        if (!(events & POLLNOERR)) {
                                *reventsp = POLLERR;
                                return (0);
                        }
                }
        }

Further, my suspicion was that most of the time, the application called poll() before this toxic message arrived, thus sidestepping the problem. However, occasionally, the application would get delayed (e.g., by other system activity), causing the toxic message to set STRDERR before the application was able to call poll().

Thus, I used the following simple D script to "force" the application to become stalled:

      #!/usr/sbin/dtrace -ws
     
      syscall::poll:entry
      {
             chill(30000000);
      }

And just like that, the problem became reproducible on the first try! That is, by simply widening the failure window, the problem became immediately reproducible, simultaneously confirming my hypothesis and significantly easing further analysis. (It should be noted that the proper value to chill(), which is specified in nanoseconds, is highly dependent on the underlying hardware -- for my hardware, delaying by 30ms was sufficient. Of course, to prevent widespread system misbehavior, the amount of time one can chill() is bounded.)

The Analysis

With the problem reproducible, further analysis with DTrace and other tools proceeded quickly. In less than an hour, the specifics behind the bug were well-understood, though still extraordinarily twisted.

For those who are interested in such sordid details, here's a crude diagram which shows the data flow (indicated by arrows) and the relevant streams modules (shown in brackets; more about the the modules can be gleaned from their manpages -- e.g., pts(7D)). Note that pckt(7D), while critical to the problem, is not actually on the stream -- hence the dotted lines going through it:

                  expect                   echo
            U
            -------------------------------------------
            K   
                [ stream ]              [ stream ]
                [  head  ]              [  head  ]
                   |  ^                    |  ^
                   |  |                    v  |
                   |  |                 [ ldterm ]
                   |  |                    |  ^
                   :  :                    v  |
                [  pckt  ]              [  ptem  ]
                   :  :                    |  ^
                   v  |                    v  |
                [  ptm   ]   <------>   [  pts   ]

                master side             slave side
Using the above diagram, here are the steps that lead to the problem [8]:
  1. The message (e.g., "foobarbaz" or "something") is sent down the pseudo-terminal slave stream by the echo command.
  2. The pseudo-terminal slave is closed by echo (e.g., through exit(2)) and the dismantling of the stream begins; the message sent in (1) is still in-flight, perhaps even still on the slave side.
  3. Eventually, ldterm's close routine (ldtermclose()) is called, which causes a TCSBRK M_IOCTL to be sent downstream (see Jim Carlson's blog for lots more on that misery).
  4. The ptem module receives the M_IOCTL, acks it, and then passes a copy of of the M_IOCTL downstream for pckt(7M), as documented in ptem(7M).
  5. Shortly after ldtermrput() receives the M_IOCACK, ldtermclose() is allowed to complete, and the following processing elements (ptem and pts) quickly follow suit, being careful to drain (if necessary) the message and the M_IOCTL copy to the master side stream.
  6. Both the M_IOCTL and original data message reach the master side stream. Eventually, the data message makes it to the stream head and a pollwakeup() is issued to notify the application (expect) that data is available. However, as discussed before, for whatever reason the application is delayed.
  7. Since the pckt(7M) module is actually not on the master side, the M_IOCTL reaches the master side's stream head. Since it's not expecting to receive this message, strrput_nondata() sends an M_IOCNAK back downstream.
  8. The M_IOCNAK reaches ptm. However, since the slave-side has been closed, ptmwput() frees the message and sends an M_ERROR upstream to indicate the problem.
  9. The stream head receives the M_ERROR and strrput_nondata() sets STRDERR on the stream.
  10. The application thread finally receives the notification from step (6) and is awoken, but since STRDERR is set, the block of code shown earlier causes strpoll() to return POLLERR rather than POLLIN. Thus, the observed behavior.

All That ... For This?

Despite the tortuous nature of the bug, the fix itself proved to be trivial[9]. In particular, while there are several possible simple fixes, the least risky is to change the ptmwput() logic in step (8) to use the obscure 2-byte form of the M_ERROR message to only mark the write-side as having an error. That is, to change:

         if ((ptmp->pt_state & PTLOCK) || (ptmp->pts_rdq == NULL)) {
                 DBG(("got msg. but no slave\n"));
                 (void) putnextctl1(RD(qp), M_ERROR, EINVAL);
                 PT_EXIT_READ(ptmp);
                 freemsg(mp);
                 return;
         }

... to[10]:

         if ((ptmp->pt_state & PTLOCK) || (ptmp->pts_rdq == NULL)) {
                 DBG(("got msg. but no slave\n"));
                 mp = mexchange(NULL, mp, 2, M_ERROR, -1);
                 if (mp != NULL) {
                         mp->b_rptr[0] = NOERROR;
                         mp->b_rptr[1] = EINVAL;
                         qreply(qp, mp);
                 }
                 PT_EXIT_READ(ptmp);
                 return;
         }

         [ See the putnextctl(9F), qreply(9F), and mexchange(9F)
           manpages to get a better handle on this code. ]

This seems eminently reasonable since ptm is unable to forward a message through the write-side, but the read-side is still usable -- so why wasn't it done this way from day one? The reason is simple: the 2-byte form of the M_ERROR message post-dates the original code.

Again, thanks to the reproducible test case, it was quick and easy to verify that this fixed the problem.

Due Diligence

As I previously mentioned, most of the pseudo-terminal code dates back to AT&T and has simply been hitched to the nearest wrecker and gingerly towed to the latest release. Indeed, inspecting an internal gates, it's clear this bug goes all the way back to the introduction of the STREAMS-based pseudo-terminal subsystem in 1987! While this is certainly an extreme case, it is not rare for apparently-intermittent bugs to terrorize subsystems for years before they are finally root-caused.

In this case, scanning the bug database suggests the conditions needed to reproduce this bug did not become widespread until Solaris 9 -- at which point, numerous bizarre application failures began to be reported, such as:

    4530041 Unable to restore/save backup during DSR upgrade to S9/S10 due to internal error
    4654418 ripv2 test routed.perf.3 failed intermittently
    4695559 spurious failures of tcp.serverclose
    4813669 ripv2 test routed.perf.3 failed intermittently
    4897163 ppp tests often return with "FAIL: Unable to ping 10.0.0.2."
    4892148 unexpected POLLERR causing early EOF from spawned processes in "expect"
    4926624 ssh exits with -1 if stdin is not a terminal
    4985311 rdisc.3 testcase randomly fails
    5012511 ripv2: some test cases in rtquery subdir failed randomly and intermittently

Of course, there were surely many more anomalies due to this bug which were either never filed, "fixed" by other means (workaround), or closed out as "not reproducible" -- we will never know the full scope of the chaos caused by this persistent little bug.

Summary

Hopefully, this treatise has clarified three key points regarding hard-to-reproduce bugs:

  1. Almost any bug can be reproduced once it is properly understood. Resist the temptation to close a bug out as non-reproducible -- it will come back to bite someone!
  2. DTrace is a powerful tool for investigating hard-to-reproduce bugs, and the chill() action is particularly ideal for investigating bugs that you suspect are timing-related.
  3. Once the problem is understood, scanning the bug database for similar symptoms can often reveal years -- or even decades -- of other bits of inexplicable behavior. Like nabbing a serial killer, the search through the cold cases can prove tedious but ultimately provides a sense of closure to others who have been victimized. Moreover, the search provides a basis for understanding the true impact of the bug and thus how many releases the fix needs to be patched back to.

Happy hunting!

Footnotes

[1] I literally put an end to bugs during a recent trip to China -- though sadly I was not able to dine on anything quite this large. (Upon seeing that photo, Jim Carlson quipped: "I guess if you can put a stick through it, it's food." -- a disturbingly accurate observation.)
[2] Of course, as a developer, much of my time is spent on the other side of the fence: introducing new code, which will invariably have a new crop of challenging bugs. Thus, there is another element to bugfixing: paying my dues to society.
[3] There exists a very small subset of bugs that are tied to factors that are beyond software's ability to control, and thus cannot be reproduced on demand, even when they are sufficiently understood.
[4] Unfortunately, our internal "list" of reasons for closing out a bug is currently too limited, forcing "not reproducible" to be pressed into service for "fixed as a side effect of X", or "insufficient information available to evaluate", among others.
[5] Actually, we have two pseudo-terminal subsystems on Solaris: an System V one, and a BSD one. However, the BSD flavor is provided only for compatibility and for purposes of this discussion can be ignored.
[6] An excellent description of pseudo-terminals can be found in the recently-revised Advanced Programming in the Unix Environment.
[7] For the record, I've never formally worked on STREAMS -- but Sasha Kolbasov and myself have become the unofficial caretakers of the much-maligned STREAMS core.
[8] To keep the description from growing too bloated, I have left out the code snippets, but inspecting the linked functions will allow more detailed study of the problem. Of course, the bug has been fixed, so the broken code in ptmwsrv() no longer exists.
[9] It is surprising how often hard bugs have trivial fixes. Surely there is a reason for this, but it's not obvious to me.
[10] Note that this fix makes use of mexchange(9F), which has long been available internally, but which I added to the Solaris DDI (along with a collection of other STREAMS utility routines) for Solaris 10. Hopefully these routines will make writing new STREAMS code a little less painful. Perhaps worthy of a future blog entry.

Technorati Tag:
Technorati Tag:
Technorati Tag:

Tuesday Jun 14, 2005

dsvclockd(1M): Using Doors to Implement Inter-Process Readers/Writer Locks

As a long-time and rabid Solaris developer, the most personally satisfying part of the OpenSolaris launch is being able to finally share some of the creative (and borderline insane) solutions we've devised to real-world problems.

Here, I cover perhaps the most perverse use of Doors in Solaris: as the basis for a robust, multi-threaded, inter-process, inter-machine, readers/writer lock implementation. So, dust off that Morrison Hotel LP, and kill the lights ...

Mojo Rising

One groovy flavor of interprocess communication in Solaris is doors, which first appeared in Sun Lab's SpringOS, was implemented in Solaris 2.5 by Jim Voll, and rose to fame as the mechanism which made our nscd scream. My exposure to doors goes back to 1996, when I stumbled on the nascent facility while trussing [1] getent to root-cause a name-service problem -- the truss was quickly followed by a man door_call, which returned a curious manpage containing a grave warning and a fitting note:

  DESCRIPTION

     This family of system calls provide a new flavor of interprocess
     communication between client and server processes.  The doors mechanism
     is not yet available for public consumption.
 
  WARNING

     Please do not attempt to reverse-engineer the interface and program to
     it. If you do, your program will almost certainly fail to run on future
     versions of Solaris, and may even be broken by a patch. This document
     does not constitute an API.  Doors may not exist or may have a completely
     different set of semantics in a future release.

  NOTES

     This manual page is here solely for the benefit of anyone who noticed
     door_call() in truss(1) output and thought, "Gee, I wonder what that
     does..."
Of course, despite the warning, doors did continue to exist and these days the fundamentals of doors are well-documented, both in our own manpages and in the late Richard Stevens's Unix Network Programming: Volume Two. In fact, if you are not familiar with the basics of doors, I'd suggest at least browsing the door_create() and door_call() manpages before proceeding.

Synchronization Gone Wild

The original DHCP server that appeared in Solaris 2.6 was heavily based on CMU's bootpd server and, among other things, dreadfully slow. As such, not too long ago we embarked on a project to kick off the training wheels and generally make it scale to an enterprise-level workload. The result was a heavily multithreaded DHCP server, and a pluggable datastore infrastructure (see dhcp_modules(5)) -- this infrastructure is summarized in Dave Miner's DHCP Server Tour blog entry, and the implementation of SUNWbinfiles is worthy of an extended future blog entry as well.

One significant problem that arose during the implementation was how to synchronize access to the underlying DHCP data (e.g., lease information). Specifically, to ensure correctness, the multithreaded DHCP server and the DHCP administrative tools needed to synchronize with one another before accessing or modifying the underlying DHCP data.

Our first thought was to use cross-process (USYNC_PROCESS) readers-writer locks (see rwlock_init). However:

  • USYNC_PROCESS readers-writer locks are not robust. This means that if a process unexpectedly terminates with the lock held, all other processes may end up stuck forever -- this is clearly unacceptable [2].
  • USYNC_PROCESS readers-writer locks require setting up a shared memory region to store the locks in each process. Coordinating the management of this region can prove subtle, not to mention that some administrators have been known to lower the maximum shared memory size to unusably low levels. In addition, at the time it was unclear whether the use of shared memory was kosher from the native layer of Java applications (the DHCP administrative tools are all in Java).
  • USYNC_PROCESS readers-writer locks could not help us solve a related thorny issue: coordination with processes on other machines. Specifically, a little-known but supported configuration was to place the DHCP data on an NFS server and allow multiple DHCP servers (and multiple sets of tools) to access or modify that data. As luck would have it, this particular configuration was used extensively by Sun's own IT department [3] .

Another thought was to use file locking API's, but lockf() does not provide shared locks, and flock() is not MT-safe (as I found out the hard way). Furthermore, not all DHCP data requiring synchronization was required to be file-based -- and while that does not preclude use of files as strictly a locking mechanism, it does make it awkward.

The Birth of dsvclockd

Speeding home through a particularly dismal New England winter night, a promising and perverse solution dawned on me: the door_call() routine is synchronous, so why couldn't door_call() itself be used to provide the our desired rw_rdlock()/rw_wrlock() semantics?

A few (mostly sleepless) days later, I had a functioning prototype -- the premise was actually quite simple: a new daemon (later dubbed dsvclockd) arbitrated access to each of the underlying DHCP data tables [4]. Thus, for a random application to request read or write access to a given DHCP table, it need only issue a door_call() to dsvclockd, and pass it the following information:

  • The name of the DHCP table to access.
  • Whether it needed read (shared) or write (exclusive) access.
  • Whether it was willing to block if the DHCP table was not available for immediate access.
  • Whether it needed inter-machine synchronization[3].
Of course, this is all hidden behind a couple of functions inside libdhcpsvc: dsvcd_rdlock(), dsvcd_wrlock() - which are simple wrappers round a shared routine called dsvcd_lock():
   static int
   dsvcd_rdlock(dsvc_synch_t *sp, void **unlock_cookiep)
   {
           return (dsvcd_lock(sp, DSVCD_RDLOCK, unlock_cookiep));
   }
   
   static int
   dsvcd_wrlock(dsvc_synch_t *sp, void **unlock_cookiep)
   {
           return (dsvcd_lock(sp, DSVCD_WRLOCK, unlock_cookiep));
   }
Thus, DHCP table consumers can simply call dsvcd_rdlock() or dsvcd_wrlock() [5] to acquire the appropriate lock, completely unaware that doors are being used under the hood to coordinate their requests -- all they know is that if these routines return successfully, they have the requested access.

Of course, dsvcd_lock() does the actual grunt work of marshalling the needed arguments into a structure passed across the door, and issuing the door_call() to dsvclockd:

   static int
   dsvcd_lock(dsvc_synch_t *sp, dsvcd_locktype_t locktype, void **unlock_cookiep)
   {
        door_arg_t              args;
        dsvcd_lock_request_t    request;
        dsvcd_reply_t           reply;
        door_desc_t             *descp;
        int                     unlockfd;
        int                     i;
        dsvcd_synch_t           *dsp = sp->s_data;

        if (dsp->s_lockfd == -1)
                return (DSVC_NO_LOCKMGR);

        request.lrq_request.rq_version  = DSVCD_DOOR_VERSION;
        request.lrq_request.rq_reqtype  = DSVCD_LOCK;
        request.lrq_locktype            = locktype;
        request.lrq_nonblock            = sp->s_nonblock;
        request.lrq_crosshost           = dsp->s_crosshost;
        request.lrq_conver              = sp->s_datastore->d_conver;

        (void) strlcpy(request.lrq_loctoken, sp->s_loctoken,
            sizeof (request.lrq_loctoken));
        (void) strlcpy(request.lrq_conname, sp->s_conname,
            sizeof (request.lrq_conname));

        args.data_ptr   = (char *)&request;
        args.data_size  = sizeof (dsvcd_lock_request_t);
        args.desc_ptr   = NULL;
        args.desc_num   = 0;
        args.rbuf       = (char *)&reply;
        args.rsize      = sizeof (dsvcd_reply_t);

        if (door_call(dsp->s_lockfd, &args) == -1) {
                /*
                 * If the lock manager went away, we'll get back EBADF.
                 */
                return (errno == EBADF ? DSVC_NO_LOCKMGR : DSVC_SYNCH_ERR);
	}
On the dsvclockd side, the request comes into svc_lock(), which sanity-checks the arguments, looks up the appropriate DHCP table, and then calls either cn_rdlock() or cn_wrlock()[6] to perform the actual lock -- here's the tail end of svc_lock() which acquires the appropriate lock and sends a reply back:
       /*
         * Acquire the actual read or write lock on the container.
         */
        dhcpmsg(MSG_DEBUG, "tid %d: %s locking %s", thr_self(),
            lreq->lrq_locktype == DSVCD_RDLOCK ? "read" : "write", cn->cn_id);

        if (lreq->lrq_locktype == DSVCD_RDLOCK)
                reply.rp_retval = cn_rdlock(cn, lreq->lrq_nonblock);
        else if (lreq->lrq_locktype == DSVCD_WRLOCK)
                reply.rp_retval = cn_wrlock(cn, lreq->lrq_nonblock);

        dhcpmsg(MSG_DEBUG, "tid %d: %s %s lock operation: %s", thr_self(),
            cn->cn_id, lreq->lrq_locktype == DSVCD_RDLOCK ? "read" : "write",
            dhcpsvc_errmsg(reply.rp_retval));

        ds_release_container(ds, cn);
        if (reply.rp_retval != DSVC_SUCCESS) {
                ud_destroy(ud, B_FALSE);
                (void) close(door_desc.d_data.d_desc.d_descriptor);
                (void) door_return((char *)&reply, sizeof (reply), NULL, 0);
                return;
        }

        while (door_return((char *)&reply, sizeof (reply), &door_desc, 1)
            == -1 && errno == EMFILE) {
		/* ... */
 	}

Doors Through Doors: Implementing Unlock

Those of you keeping score at home have no doubt noticed a large hole in the preceding discussion: how does one unlock a previously locked DHCP table? While we could just implement an unlock operation on the same door used for locking, doors affords us a far more elegant, and as we will soon see, downright robust solution.

In particular, one little-used but powerful feature of doors is the ability to pass file descriptors through them. This enables a door server to act as an authentication mechanism, granting file descriptors to privileged resources once the door client has flashed the appropriate credentials. In our case, it also can be used to ensure that only door clients that have pending locks on a given DHCP table can unlock it -- and that a given lock can never be unlocked more than once by accident.

Specifically, as part of performing the lock operation, dsvclockd allocates another door -- the "unlock" door -- which is returned to the client through the original door as part of the lock operation. Thus, there is an open unlock door for every held reader or writer on a lock. This descriptor can be seen in the final door_return() above (door_desc).

On the client-side, this descriptor eventually bubbles up through dsvcd_lock(), and is then hidden inside the unlock_cookiep that is returned to the dsvcd_rdlock() or dsvcd_wrlock() caller. The dsvcd_unlock() routine converts this cookie back into a descriptor and issues the door_call():

  static int
  dsvcd_unlock(dsvc_synch_t *sp, void *unlock_cookie)
  {
          door_arg_t              args;
          dsvcd_unlock_request_t  request;
          dsvcd_reply_t           reply;
          int                     unlockfd = (int)unlock_cookie;
  
          request.urq_request.rq_version = DSVCD_DOOR_VERSION;
          request.urq_request.rq_reqtype = DSVCD_UNLOCK;
  
          args.data_ptr   = (char *)&request;
          args.data_size  = sizeof (dsvcd_unlock_request_t);
          args.desc_ptr   = NULL;
          args.desc_num   = 0;
          args.rbuf       = (char *)&reply;
          args.rsize      = sizeof (dsvcd_reply_t);
  
          if (door_call(unlockfd, &args) == -1) {
                  /*
                   * If the lock manager went away while we had a lock
                   * checked out, regard that as a synchronization error --
                   * it should never happen under correct operation.
                   */
                  return (DSVC_SYNCH_ERR);
          }

DOOR_UNREF_MULTI: Robust Unlocking

So that's certainly elegant, but what about the robustness I promised? Recall that one of our original objections to USYNC_PROCESS readers/writer locks was that if an application holding a lock crashed, other applications were effectively locked out. As it turns out, we can build on the descriptor-based unlocking model to solve this problem, too.

Specifically, doors provides a special mechanism whereby a door's associated procedure is automatically invoked by the kernel when there is only one reference left to it (see DOOR_UNREF and DOOR_UNREF_MULTI in door_create()). Since each lock has its own unlock door, we can make use of this to automatically be notified if a client holding a lock crashes. Here's the logic in dsvclockd's svc_unlock() associated with this case -- note that since even correctly unlocked doors have to be closed, we have to be careful to disambiguate a crash from normal operation.

        /*
         * First handle the case where the lock owner has closed the unlock
         * descriptor, either because they have unlocked the lock and are
         * thus done using the descriptor, or because they crashed.  In the
         * second case, print a message.
         */
        if (req == DOOR_UNREF_DATA) {
                /*
                 * The last reference is ours; we can free the descriptor.
                 */
                (void) mutex_unlock(&ud->ud_lock);
                ud_destroy(ud, B_TRUE);

                /*
                 * Normal case: the caller is closing the unlock descriptor
                 * on a lock they've already unlocked -- just return.
                 */
                if (cn == NULL) {
                        (void) door_return(NULL, 0, NULL, 0);
                        return;
                }

                /*
                 * Error case: the caller has crashed while holding the
                 * unlock descriptor (or is otherwise in violation of
                 * protocol).  Since all datastores are required to be
                 * robust even if unexpected termination occurs, we assume
                 * the container is not corrupt, even if the process
                 * crashed with the write lock held.
                 */
                switch (cn_locktype(cn)) {
                case DSVCD_RDLOCK:
                        dhcpmsg(MSG_WARNING, "process exited while reading "
                            "`%s'; unlocking", cn->cn_id);
                        (void) cn_unlock(cn);
                        break;

                case DSVCD_WRLOCK:
                        dhcpmsg(MSG_WARNING, "process exited while writing "
                            "`%s'; unlocking", cn->cn_id);
                        dhcpmsg(MSG_WARNING, "note that this write operation "
                            "may or may not have succeeded");
                        (void) cn_unlock(cn);
                        break;

                case DSVCD_NOLOCK:
                        dhcpmsg(MSG_CRIT, "unreferenced unheld lock");
                        break;
                }

                (void) door_return(NULL, 0, NULL, 0);
                return;
        }
Finally, let's take a look at the code that allocates the unlock descriptor in dsvclockd -- I've left it for last because, as the comments make clear, it's particularly subtle:
        /*
         * We need another door descriptor which is passed back with the
         * request.  This descriptor is used when the caller wants to
         * gracefully unlock or when the caller terminates abnormally.
         */
        ud = ud_create(cn, &reply.rp_retval);
        if (ud == NULL) {
                ds_release_container(ds, cn);
                (void) door_return((char *)&reply, sizeof (reply), NULL, 0);
                return;
        }

        /*
         * We pass a duped door descriptor with the DOOR_RELEASE flag set
         * instead of just passing the descriptor itself to handle the case
         * where the client has gone away before we door_return().  Since
         * we duped, the door descriptor itself will have a refcount of 2
         * when we go to pass it to the client; if the client does not
         * exist, the DOOR_RELEASE will drop the count from 2 to 1 which
         * will cause a DOOR_UNREF_DATA call.
         *
         * In the regular (non-error) case, the door_return() will handoff
         * the descriptor to the client, bumping the refcount to 3, and
         * then the DOOR_RELEASE will drop the count to 2.  If the client
         * terminates abnormally after this point, the count will drop from
         * 2 to 1 which will cause a DOOR_UNREF_DATA call.  If the client
         * unlocks gracefully, the refcount will still be 2 when the unlock
         * door server procedure is called, and the unlock procedure will
         * unlock the lock and note that the lock has been unlocked (so
         * that we know the DOOR_UNREF_DATA call generated from the client
         * subsequently closing the unlock descriptor is benign).
         *
         * Note that a DOOR_UNREF_DATA call will be generated *any time*
         * the refcount goes from 2 to 1 -- even if *we* cause it to
         * happen, which by default will happen in some of the error logic
         * below (when we close the duped descriptor).  To prevent this
         * scenario, we tell ud_destroy() *not* to cache the unlock
         * descriptor, which forces it to blow away the descriptor using
         * door_revoke(), making the close() that follows benign.
         */
        door_desc.d_attributes = DOOR_DESCRIPTOR|DOOR_RELEASE;
        door_desc.d_data.d_desc.d_descriptor = dup(ud->ud_fd);
        if (door_desc.d_data.d_desc.d_descriptor == -1) {
                dhcpmsg(MSG_ERR, "cannot dup unlock door; denying %s "
                    "lock request", cn->cn_id);
                ud_destroy(ud, B_TRUE);
                ds_release_container(ds, cn);
                reply.rp_retval = DSVC_NO_RESOURCES;
                (void) door_return((char *)&reply, sizeof (reply), NULL, 0);
                return;
        }
Anyone who has been inspired to implement something of like kin: pay particular attention to the final paragraph of the block comment above -- it took me days to track down that bloody bug!

The End

The dsvclockd daemon integrated in Solaris 9, has been in widespread use by both internal and external customers ever since, and has been incident-free to this point[7].

While far from a comprehensive tour of dsvclockd, I hope this overview has illustrated the power of the Solaris doors IPC mechanism, and more importantly has caused you to think about how to use (or abuse!) Solaris doors to solve a thorny inter-process problem or two of your own.

Footnotes:

[1] The use of truss as a verb follows the spirit Roger Faulkner (the author of truss) intended. To wit, he once admitted that one of the primary reasons for naming the tool "truss" was that one is rarely in a good frame of mind when using it -- thus, technicalities aside, truss should be considered a four-letter word, and "oh, truss it!" a cathartic confession.
[2] In the meantime, a USYNC_PROCESS_ROBUST mutex has been added to Solaris. This could probably be in-turn used to implement a USYNC_PROCESS_ROBUST readers/writer lock -- but the other issues still remain.
[3] The discussion of how dsvclockd implements inter-machine synchronization will have to wait for another time -- though brave readers are directed to the block comment atop container.c, which starts unabashedly with:

	/*
	 * Container locking code -- warning: serious pain ahead.
	 */

[4] These DHCP tables are known as "containers" in the source code -- a term that sadly has since become extremely overloaded, especially due to Zones.
[5] In fact, there is an additional abstraction layer that allows pluggable locking strategies to be added -- but we have yet to implement any additional strategies. If I had to do it over again, I'd torch that layer.
[6] Of course, the devil's in the details and the implementation of these routines are proof -- in addition to containing the aforementioned host-locking code, there are some other parlor tricks of note, such as the use of stack-local condition variables in cn_wait_for_lock(). You may also be curious why there is not a USYNC_THREAD readers/writer lock at the core of dsvclockd's implementation -- the problem is that we cannot guarantee that the same thread that write-locked a given DHCP table will be the one that unlocks it (since a random door server worker thread is tasked with the unlock operation).
[7] Though there have been several general doors bugfixes applied to it -- see 4786633and 6223254. Also, the development of dsvclockd uncovered several critical doors bugs -- see 4321534 and 4349741.

Technorati Tag:
Technorati Tag:

Thursday Jun 09, 2005

Hello there!

I've been working at Sun since 1995, when I managed to snag a summer internship at the impressionable age of 18. After 3 more years of college (and internships), I came on-board full-time in 1998 and have been working on core Solaris technologies ever since. Literally a decade has passed, but my interest has changed little: I'm still drawn to studying, characterizing, and ultimately simplifying complicated key low-level subsystems -- particularly those that have been neglected by others.

My main areas of focus have been our DHCP, IP multipathing, and STREAMS implementations -- recently, I've enhanced DHCP to support logical interfaces, spearheaded DHCP event scripting, made IPMP test addresses optional, and massively simplified STREAMS by hauling out over 3500 lines of aging kernel code while also enhancing our debugging infrastructure. I am also fascinated by system stress-testing, developer tools, and code quality, and have authored a collection of internal tools to both improve our product quality and developer efficiency.

Currently, I'm co-leading Project Clearview, which aims to both rationalize and enhance the way network interfaces operate in Solaris, and which I will discuss in much greater depth once OpenSolaris has launched. However, the initial phase of Clearview integrated recently, manifesting itself primarily as a simplified dladm in the next Solaris Express release. In addition to project-work, I take bugs very seriously, and look forward to integrating my 500th bugfix later this year (admittedly a far cry from baseball's famed 500-homer club).

Outside of work, my current passion is cars -- especially of the German persuasion -- and spend most of my personal time modifying, detailing, driving, or photographing them. I also love pinball, exotic foods, the Boston Red Sox, and live music. I live just outside of Boston and am regularly stopped by the police on my way to Sun's Burlington campus.

My plan is to update this blog a few times a month with a variety of pieces that shine some light on our core networking technologies, and Solaris a whole -- past, present, and future -- from a minimalist developer's perspective.