The past few weeks have been some of the most exciting (and hectic!) of
my professional career, as I acclimate to the power and responsibility
of an increasingly open software development model.
To rewind for a second: Clearview is a project that Sebastien and I are spearheading to
rationalize, unify, and enhance the way network
interfaces are handled in Solaris. "What?,", you say? Well, by
defining and implementing a set of core unifying attributes (or
unalienable rights)[1]
for all interfaces, a wide variety of administrative, programmatic, and
diagnostic problems are addressed. For example, after this work, one
will be able to:
- Observe all IP layer network traffic, including loopback, IPMP group
and IP tunnel traffic.
- Observe all IP layer network traffic flowing to and from a zone.
- Administrate all network interfaces using dladm(1M).
- Use VLANs and form link aggregations on all Ethernet devices.
- Use IPMP with technologies such as DHCP and routing protocols.
... though that list is far from comprehensive (nor does the above
list constitute a commitment -- this is work-in-progress, mind you!)
Before I start hearing cries of "vaporware!", let's get back to
the part that has me stoked: Clearview is one of the first Solaris
projects to directly involve the OpenSolaris community in its
design. To wit, our team is in the process of releasing design
materials for each component of the Clearview project, and is
actively soliciting input from the community at large.
In fact, the
first document -- covering our massive overhaul of IPMP -- was sent
out a week ago, and has already received some great feedback (keep it
coming!) Further, just this morning, Sebastien unveiled our proposed
Tunnelling Device Driver -- if you've spent much time with tunnels in
Solaris, this will be a sight for sore eyes. Of course, that's not
all: in the coming weeks, Phil Kirk will provide our design for
IP-Level Observability Devices, and Cathy Zhou will detail our
sinister plan to bring the power of dladm to all Solaris
network interfaces (and more!)
When we say directly involve, we mean it. That is, this review
is instead of, rather than in addition to our traditional
Sun-internal design review. As such, other project teams inside Sun
are seeing these materials for the first time as well -- and we have
encouraged all of them to provide their feedback to the community
forum. We hope that this underscores not only our commitment to open
up our development process, but our desire to actively involve the
community in each step, from design through integration, and beyond.
Speaking of design: we hope these hefty documents make it clear that we
do not believe in leaving the design or architecture of Solaris to the
whims of late-night hackery. Moreover, although technical documents
are rarely known for their humanity, we hope that our materials provide
a glimpse into the deep pride we each have in both Solaris and our craft.
Needless to say, your participation is most welcome -- and rest assured that
we will continue to seek your involvement as the components proceed through
various stages of our software development process.
Footnotes
[1]
The rights themselves and their implications are too important to serve
as a sidebar to this blog entry, and will surely get one of their own
in the future.
Technorati Tag:
OpenSolaris
Technorati Tag:
Solaris
( Sep 18 2005, 01:02:46 PM EDT )
Permalink

Saturday August 13, 2005
On Locklessness
On Locklessness
Unlike many other Unix kernels, the Solaris kernel is heavily
multithreaded, including our networking stack. The conventional way to
ensure consistency between multiple threads is to use synchronization
primitives such as mutexes and readers-writer locks -- and indeed, these
are widely used throughout the Solaris kernel. However, there are times
when conventional locks are inappropriate, and more creative approaches
to ensuring consistency must be devised. In this blog entry, I discuss
a recent issue I encountered while working on the new Nemo[1] network driver framework,
and describe an elegant but rarely-used approach.
The Problem
Among its other responsibilities, the Nemo framework provides the glue
between the network protocol layer (e.g., IP) and the underlying network
device driver. In particular, when a packet needs to be sent to an
underlying device, the protocol stack will call the dls_tx()
function. The dls_tx() function takes two arguments: the
data-link to send the packet on, and the the packet to send. If any
part of the packet could not be sent, the remainder is returned. In the
original Nemo implementation, dls_tx() was implemented as:
mblk_t *
dls_tx(dls_channel_t dc, mblk_t *mp);
{
dls_impl_t *dip = (dls_impl_t *)dc;
return (dip->di_tx(dip->di_tx_arg, mp));
}
Note that dc and dip refer to the same object (the data-link),
but the first is opaque so that the caller cannot grovel around in
the implementation details[2].
As is apparent, dls_tx() turns around and calls a previously-specified
per-datalink transmit function pointer to send packets on that
data-link. The transmit function is passed a previously-specified
callback argument, and the packet to transmit. The transmit function
and callback argument usually specify the underlying network driver's
transmit function and a driver-instance data structure associated with
that link.
However, when promiscuous-mode[3] is enabled on the underlying device,
the Nemo framework interposes its own transmit function and
callback argument between dls_tx() and the underlying driver's
transmit routine. This interposed function loops back each transmitted
packet so that a system will see its own transmitted packets through
tools such as snoop(1M)[4].
Unfortunately, this presents a problem: as can be seen in dls_tx(), no
locks are held as part of dereferencing di_tx and di_tx_arg. Thus,
if one thread is calling dls_tx() while another is enabling or
disabling promiscuous mode, it is possible for us to end up invoking the
callback function with a mismatched callback argument.
As an example, suppose that the normal di_tx and di_tx_arg are tx
and txarg, and the promiscuous di_tx and di_tx_arg are looptx
and looptxarg. Then the following could happen:
thread 1 (in dls_tx()) | thread 2 (enabling promiscuous)
------------------------------|----------------------------------
time load di_tx = tx |
| | store di_tx = looptx
| | store di_tx_arg = looptxarg
V load di_tx_arg = looptxarg |
|
Clearly, passing the wrong argument to di_tx could trigger panic or
other misbehavior and is thus unacceptable. As such, we need to ensure
that from the perspective of dls_tx(), di_tx and di_tx_arg are
always either tx/txarg, or looptx/looptxarg.
So ... why not lock?
The simplest and most obvious way to address this problem would be
to introduce locks, changing dls_tx() to be:
mblk_t *
dls_tx(dls_channel_t dc, mblk_t *mp)
{
dls_impl_t *dip = (dls_impl_t *)dc;
mblk_t *mp;
mutex_enter(&dip->di_tx_lock);
mp = dip->di_tx(dip->di_tx_arg, mp);
mutex_exit(&dip->di_tx_lock);
return (mp);
}
Of course, the same lock would be acquired while assigning di_tx and
di_tx_arg, making the assignment of both di_tx and di_tx_arg
appear atomic from the perspective of dls_tx(). While this approach
will work (and, moreover, is almost always the right way to solve this
problem), in this particular case it has some drawbacks. To wit:
- All calls to the underlying di_tx function on the given
channel are now serialized. That is, if multiple threads
call dls_tx() for the same dc, only one thread will be
able to call di_tx at a time.
- The dls_tx() function is called for every transmitted
packet. Thus acquiring and releasing a lock -- even supposing
that it is rarely contended -- introduces needless overhead
[5].
- The dls_tx() function is no longer eligible for compiler
tail-call optimization, impacting performance since the stack
frame of dls_tx() can no longer be directly reused by
di_tx()[6].
Lockless Atomicity
While the first problem could be fixed by switching to a readers-writer
lock, this does nothing for the other two problems. Indeed, there
appears to be no way to overpower this problem with complexity, so
instead we return our requirement:
We need to ensure that from the perspective of dls_tx(), di_tx and
di_tx_arg are always either tx/txarg, or looptx/looptxarg.
Now, if all we needed to do was guarantee that only
di_tx or
di_tx_arg assignments appeared atomic, we'd need no locks at all: all
processors that Solaris supports guarantee that assignments to
pointer-sized quantities will be atomic
[7].
However, this of course means that assignment of structure pointers is
also atomic. Thus, we can simply bundle these two fields into a
structure:
typedef struct mac_txinfo_s {
mac_tx_t mt_fn;
void *mt_arg;
} mac_txinfo_t;
... and fill in two such structures for each driver: one that contains
the tx/txarg pair, and one that contains the looptx/looptxarg
pair. Here's the actual code that does this in i_mac_create():
/*
* Set up the two possible transmit routines.
*/
mip->mi_txinfo.mt_fn = mp->m_tx;
mip->mi_txinfo.mt_arg = mp->m_driver;
mip->mi_txloopinfo.mt_fn = mac_txloop;
mip->mi_txloopinfo.mt_arg = mip;
We then replace di_tx and di_tx_arg with a single di_txinfo field:
const mac_txinfo_t *di_txinfo;
Again, since di_txinfo is a single pointer, it can be atomically set to
point to either mi_txinfo or mi_txloopinfo. Coming back around to
dls_tx(), we can now remove the need for locks, and also restore the
compiler's ability to tail-call optimize dls_tx():
mblk_t *
dls_tx(dls_channel_t dc, mblk_t *mp)
{
const mac_txinfo_t *mtp = ((dls_impl_t *)dc)->di_txinfo;
return (mtp->mt_fn(mtp->mt_arg, mp));
}
Yes, that's it -- it's really that elegant. However, it's also a bit
subtle: the code is careful to dereference di_txinfo only once. A
more mechanical transformation of the original dls_tx() function would
reintroduce a variant of the original bug, since di_txinfo might
change between the two dereferences:
mblk_t *
dls_tx(dls_channel_t dc, mblk_t *mp)
{
dls_impl_t *dip = (dls_impl_t *)dc;
/* WRONG */
return (dip->di_txinfo->mt_fn(dip->di_txinfo->mt_arg, mp));
}
It's also worth noting that even the correct lockless dls_tx() has
subtlely different semantics from the version which used mutexes: since
there is no explicit synchronization, it's possible a thread in
dls_tx() may not see recent changes to the di_txinfo pointer. In
this particular case, this is acceptable since:
- The lifetime of either value is identical, and it is not possible
for either value to become invalid while in dls_tx(), due to
other invariants in the Nemo framework.
- The effect of using the wrong function for a few packets is
unobservable: either the kernel will attempt to loop back a
few packets unnecessarily, or the first few packets after
promiscuous-mode is enabled will not be looped back. The
latter is unobservable because the application which enabled
promiscuous-mode cannot observe the exact instant that it
became possible to loop-back packets it transmits.
In many cases, these restrictions may limit the applicability of
this approach -- and moreover, may suggest that a more conventional
locking approach is needed. Nonetheless, as this discussion has
hopefully made clear, there are times when locks are a poor match to
the problem at hand, and "going lockless" both improves performance
and code elegance.
Footnotes
[1]
Nemo is also known as GLDv3, and is our brand new network driver
framework that
bge and
xge currently use. All
future Sun-provided network drivers on Solaris will use GLDv3.
Once the interfaces stabilize, they will be published for third-party
network driver developers.
[2]
Throughout the kernel, we prefer this to
void * since this enables
the compiler to still type-check, thus maintaining type-safety.
The actual definition of
dls_channel_t is
typedef
struct dls_t *dls_channel_t, where
dls_t is never defined.
[3]
Enabling promiscuous-mode allows a network device to receive and
send up packets on the network which are not destined for its
hardware unicast, multicast, or broadcast addresses. In the old days,
this would allow all traffic on the subnet to be observed -- but in
these days of switched networks, enabling promiscuous-mode often
yields only a few additional packets.
[4]
Almost all network devices are deaf while transmitting, so transmitted
packets which a host must also receive must be looped back in software.
[5]
For instance, on SPARC, a
mutex_exit() will perform a
membar #LoadStore|#StoreStore, which forces the processor to
ensure that
all outstanding loads and stores complete before
any future store complete.
[6]
Those not familiar with the basics of tail-call optimization should
read this wiki
entry. On SPARC, this optimization can be important because it can
avoid needless register spill/fill traps (among other things).
[7]
To my knowledge, we have never officially stated this as a guarantee,
but tons of kernel code relies on this invariant.
Technorati Tag:
OpenSolaris
Technorati Tag:
Solaris
( Aug 13 2005, 12:59:05 AM EDT )
Permalink

Wednesday July 06, 2005
On Bugs and Reproducibility
On Bugs and Reproducibility
I love bugs -- specifically, I love putting an end to
bugs[1]. I
find myself a bit tongue-twisted when asked to articulate just what makes
root-causing a bug so engrossing. In brief: it's the thrill of the hunt,
the pull of a mystery, the search for a deeper understanding of how
the system works, and -- of course -- the gripping fear that some
customer somewhere will hit it again if we don't nail the bloody thing[2].
One way to subclass bugs is by reproducibility. That is, can the bug be
reproduced on demand? However, as this blog entry will make clear,
almost[3] any bug
can be reproduced on demand once it is sufficiently understood. Thus,
when someone states "this bug is not reproducible", they're likely
really saying "this bug is not sufficiently understood". Sadly, one
established way a bug can be "resolved" at Sun is for it to be closed
out as "not reproducible" -- not just a shameful admission of utter
defeat, but a near-guarantee to customers that they will be
inconvenienced by the problem yet again. Thus, many of us simply
refuse to close out bugs as "not reproducible"[4].
Nonetheless, sometimes bugs are indeed closed out as "not reproducible".
In fact, if a bug takes up residence in a subsystem that is poorly
understood but widely used, it can thump around for years -- or even
decades -- before it is finally smoked out by a particularly determined
engineer. Further, often the manifestations of the bug can be so varied
that even after diagnosing, it can prove impossible to track down all of
the duplicates that are either still open or have been closed out over
the years as "not reproducible". Here I discuss my dealings with one
such bug -- 4927647.
The Sewer of STREAMS
A classic example of a poorly-understood but widely-used area is the
pseudo-terminal subsystem[5]. While the basics of how pseudo-terminals
work is well-documented in several texts[6], the STREAMS code that implements it is a
arcane and delicate piece of machinery from AT&T that has undergone
little change since its initial impact with SVR3.2 some 18 years ago.
Again, that's not to say the code is unimportant. On the
contrary, it is a core building block upon which many critical
utilities, such as xterms, remote logins (ssh/telnet),
and tools like expect(1) are built. Thus, when something goes
wrong, someone has to be called in to diagnose it -- and since
it's STREAMS code, these days that often means me[7].
Specifically, I got called in because a high-profile customer filed
the following bug:
In expect 5.38.0 on Solaris 9, run the following script:
for { set i 1 } { 1 } { incr i } {
spawn echo foobarbaz
expect {
"foobarbaz" {
puts "ok"
expect eof wait
}
eof {
puts "$i passes"
wait
exit 1
}
}
}
It should loop forever; instead, it exits after a few iterations on
multiprocessor systems. It appears to loop "forever" on uniprocessors.
Indeed, running this test I was able to confirm that it did fail on
multiprocessor systems -- though usually after thousands of iterations (each
iteration taking close to a second), rather than the "few" indicated in the
bug report. Without a reproducible test case, this was surely going to
be a painful battle.
The Hunt for Reproducibility
Using truss(1), it had already been confirmed that the problem was not
with expect, but with the pseudo-terminal implementation itself.
Specifically, something was causing the poll(2) to occasionally wake up
with POLLERR, and the subsequent read(2) to return EINVAL rather
than the number of bytes in the expected message:
25258: open("/dev/ptmx", O_RDWR) = 4
[ ... lots happens ... ]
25258: poll(0xFFBFEBC0, 1, 10000) = 1
25258: fd=4 ev=POLLRDNORM|POLLRDBAND rev=POLLERR
25258: read(4, 0x00044FE8, 3999) Err#22 EINVAL
Of course, with this came the sobering realization that this bug was
undoubtedly responsible for countless other sporadic "not reproducible"
pseudo-terminal-related bugs, and an accompanying growing list of
suspected duplicates. So, the key question was: could we intuit enough
about this bug to make it reproducible and finally smoke this thing out?
At the time, DTrace had
not yet integrated into Solaris, but was available for internal use and
had an ever-growing buzz, so I was eager to see if it could shed some
light on this bug. In particular, based on the truss output, my
suspicion was that a toxic message was arriving on the
/dev/ptmx stream that was causing STRDERR to be set,
triggering POLLERR to be returned from strpoll():
if (sd_flags & (STPLEX | STRDERR | STWRERR)) {
if (sd_flags & STPLEX) {
*reventsp = POLLNVAL;
return (EINVAL);
}
if (((events & (POLLIN | POLLRDNORM | POLLRDBAND | POLLPRI)) &&
(sd_flags & STRDERR)) ||
((events & (POLLOUT | POLLWRNORM | POLLWRBAND)) &&
(sd_flags & STWRERR))) {
if (!(events & POLLNOERR)) {
*reventsp = POLLERR;
return (0);
}
}
}
Further, my suspicion was that
most of the time, the application called
poll() before this toxic message arrived, thus sidestepping the problem.
However, occasionally, the application would get delayed (e.g., by other
system activity), causing the toxic message to set
STRDERR before the
application was able to call
poll().
Thus, I used the following simple D script to "force" the application to
become stalled:
#!/usr/sbin/dtrace -ws
syscall::poll:entry
{
chill(30000000);
}
And just like that, the problem became reproducible on the first try!
That is, by simply widening the failure window, the problem became
immediately reproducible, simultaneously confirming my hypothesis and
significantly easing further analysis. (It should be noted that the
proper value to chill(), which is specified in nanoseconds, is highly
dependent on the underlying hardware -- for my hardware, delaying by 30ms
was sufficient. Of course, to prevent widespread system misbehavior,
the amount of time one can chill() is bounded.)
The Analysis
With the problem reproducible, further analysis with DTrace and other tools
proceeded quickly. In less than an hour, the specifics behind the bug were
well-understood, though still extraordinarily twisted.
For those who are interested in such sordid details, here's a crude
diagram which shows the data flow (indicated by arrows) and the relevant
streams modules (shown in brackets; more about the the modules can be
gleaned from their manpages -- e.g., pts(7D)). Note that pckt(7D),
while critical to the problem, is not actually on the stream -- hence the
dotted lines going through it:
expect echo
U
-------------------------------------------
K
[ stream ] [ stream ]
[ head ] [ head ]
| ^ | ^
| | v |
| | [ ldterm ]
| | | ^
: : v |
[ pckt ] [ ptem ]
: : | ^
v | v |
[ ptm ] <------> [ pts ]
master side slave side
Using the above diagram, here are the steps that lead to the problem
[8]:
- The message (e.g., "foobarbaz" or "something") is sent down
the pseudo-terminal slave stream by the echo command.
- The pseudo-terminal slave is closed by echo (e.g., through
exit(2)) and the dismantling of the stream begins; the
message sent in (1) is still in-flight, perhaps even still
on the slave side.
- Eventually, ldterm's close routine (ldtermclose()) is called,
which causes a TCSBRK M_IOCTL to be sent downstream (see
Jim Carlson's blog for lots more on that misery).
- The ptem module receives the M_IOCTL, acks
it, and then passes a copy of of the M_IOCTL downstream for
pckt(7M), as documented in ptem(7M).
- Shortly after ldtermrput() receives the M_IOCACK,
ldtermclose() is allowed to complete, and the
following processing elements (ptem and pts)
quickly follow suit, being careful to drain (if necessary) the
message and the M_IOCTL copy to the master side
stream.
- Both the M_IOCTL and original data message reach the
master side stream. Eventually, the data message makes
it to the stream head and a pollwakeup() is issued to
notify the application (expect) that data is available.
However, as discussed before, for whatever reason the
application is delayed.
- Since the pckt(7M) module is actually not on the master side,
the M_IOCTL reaches the master side's stream head. Since it's
not expecting to receive this message, strrput_nondata() sends
an M_IOCNAK back downstream.
- The M_IOCNAK reaches ptm. However, since the slave-side has
been closed, ptmwput() frees the message and sends an
M_ERROR upstream to indicate the problem.
- The stream head receives the M_ERROR and strrput_nondata()
sets STRDERR on the stream.
- The application thread finally receives the notification
from step (6) and is awoken, but since STRDERR is set, the
block of code shown earlier causes strpoll()
to return POLLERR rather than POLLIN. Thus,
the observed behavior.
All That ... For This?
Despite the tortuous nature of the bug, the fix itself proved to be
trivial[9]. In
particular, while there are several possible simple
fixes, the least risky is to change the ptmwput() logic in step (8) to
use the obscure 2-byte form of the M_ERROR message to only mark the
write-side as having an error. That is, to change:
if ((ptmp->pt_state & PTLOCK) || (ptmp->pts_rdq == NULL)) {
DBG(("got msg. but no slave\n"));
(void) putnextctl1(RD(qp), M_ERROR, EINVAL);
PT_EXIT_READ(ptmp);
freemsg(mp);
return;
}
... to[10]:
if ((ptmp->pt_state & PTLOCK) || (ptmp->pts_rdq == NULL)) {
DBG(("got msg. but no slave\n"));
mp = mexchange(NULL, mp, 2, M_ERROR, -1);
if (mp != NULL) {
mp->b_rptr[0] = NOERROR;
mp->b_rptr[1] = EINVAL;
qreply(qp, mp);
}
PT_EXIT_READ(ptmp);
return;
}
[ See the putnextctl(9F), qreply(9F), and mexchange(9F)
manpages to get a better handle on this code. ]
This seems eminently reasonable since ptm is unable to forward a
message through the write-side, but the read-side is still usable -- so
why wasn't it done this way from day one? The reason is simple: the
2-byte form of the M_ERROR message post-dates the original
code.
Again, thanks to the reproducible test case, it was quick and easy
to verify that this fixed the problem.
Due Diligence
As I previously mentioned, most of the pseudo-terminal code dates back to
AT&T and has simply been hitched to the nearest wrecker and gingerly towed to
the latest release. Indeed, inspecting an internal gates, it's clear
this bug goes all the way back to the introduction of the STREAMS-based
pseudo-terminal subsystem in 1987! While this is certainly an extreme
case, it is not rare for apparently-intermittent bugs to terrorize
subsystems for years before they are finally root-caused.
In this case, scanning the bug database suggests the conditions needed to
reproduce this bug did not become widespread until Solaris 9 -- at which
point, numerous bizarre application failures began to be reported, such as:
4530041 Unable to restore/save backup during DSR upgrade to S9/S10 due to internal error
4654418 ripv2 test routed.perf.3 failed intermittently
4695559 spurious failures of tcp.serverclose
4813669 ripv2 test routed.perf.3 failed intermittently
4897163 ppp tests often return with "FAIL: Unable to ping 10.0.0.2."
4892148 unexpected POLLERR causing early EOF from spawned processes in "expect"
4926624 ssh exits with -1 if stdin is not a terminal
4985311 rdisc.3 testcase randomly fails
5012511 ripv2: some test cases in rtquery subdir failed randomly and intermittently
Of course, there were surely many more anomalies due to this bug which
were either never filed, "fixed" by other means (workaround), or closed
out as "not reproducible" -- we will never know the full scope of the
chaos caused by this persistent little bug.
Summary
Hopefully, this treatise has clarified three key points regarding
hard-to-reproduce bugs:
- Almost any bug can be reproduced once it is properly understood.
Resist the temptation to close a bug out as non-reproducible --
it will come back to bite someone!
- DTrace
is a powerful tool for investigating hard-to-reproduce bugs, and
the chill() action is particularly ideal for
investigating bugs that you suspect are timing-related.
- Once the problem is understood, scanning the bug database for
similar symptoms can often reveal years -- or even decades -- of
other bits of inexplicable behavior. Like nabbing a serial
killer, the search through the cold cases can prove tedious but
ultimately provides a sense of closure to others who have been
victimized. Moreover, the search provides a basis for
understanding the true impact of the bug and thus how many
releases the fix needs to be patched back to.
Happy hunting!
Footnotes
[1]
I literally put an end to bugs during a recent trip to China -- though
sadly I was not able to dine on anything quite this large.
(Upon seeing that photo, Jim
Carlson quipped: "I guess if you can put a stick through it, it's food."
-- a disturbingly accurate observation.)
[2]
Of course, as a developer, much of my time is spent on the other side of
the fence: introducing new code, which will invariably have a new crop of
challenging bugs. Thus, there is another element to bugfixing: paying
my dues to society.
[3]
There exists a very small subset of bugs that are tied to factors
that are beyond software's ability to control, and thus cannot be
reproduced on demand, even when they are sufficiently understood.
[4]
Unfortunately, our internal "list" of reasons for closing out a
bug is currently too limited, forcing "not reproducible" to be
pressed into service for "fixed as a side effect of X", or
"insufficient information available to evaluate", among others.
[5]
Actually, we have two pseudo-terminal subsystems on Solaris: an
System V one, and a BSD one. However, the BSD flavor is provided only
for compatibility and for purposes of this discussion can be ignored.
[6]
An excellent description of pseudo-terminals can be found in
the recently-revised Advanced Programming in the Unix Environment.
[7]
For the record, I've never formally worked on STREAMS -- but Sasha Kolbasov and myself have become the unofficial caretakers of the
much-maligned STREAMS core.
[8]
To keep the description from growing too bloated, I have left out
the code snippets, but inspecting the linked functions will allow
more detailed study of the problem. Of course, the bug has been
fixed, so the broken code in ptmwsrv() no longer exists.
[9]
It is surprising how often hard bugs have trivial fixes. Surely
there is a reason for this, but it's not obvious to me.
[10]
Note that this fix makes use of mexchange(9F), which has long been
available internally, but which I added to the Solaris DDI (along with
a collection of other STREAMS utility routines) for Solaris 10.
Hopefully these routines will make writing new STREAMS code a little
less painful. Perhaps worthy of a future blog entry.
Technorati Tag:
OpenSolaris
Technorati Tag:
Solaris
Technorati Tag:
DTrace
( Jul 06 2005, 08:27:26 PM EDT )
Permalink

Tuesday June 14, 2005
dsvclockd(1M): Using Doors to Implement Inter-Process Readers/Writer Locks
dsvclockd(1M): Using Doors to Implement Inter-Process Readers/Writer Locks
As a long-time and rabid Solaris developer, the most personally satisfying
part of the OpenSolaris launch
is being able to finally share some of the creative (and borderline
insane) solutions we've devised to real-world problems.
Here, I cover perhaps the most perverse use of Doors in Solaris: as the
basis for a robust, multi-threaded, inter-process, inter-machine,
readers/writer lock implementation. So, dust off that Morrison Hotel LP,
and kill the lights ...
Mojo Rising
One groovy flavor of interprocess communication in Solaris is doors, which
first appeared in Sun Lab's SpringOS, was implemented in Solaris
2.5 by Jim Voll, and rose to fame as the mechanism which made our nscd
scream. My exposure to doors goes back to 1996, when I stumbled on the
nascent facility while trussing [1] getent
to root-cause a name-service problem -- the truss was quickly followed by a
man door_call, which returned a curious manpage containing a grave
warning and a fitting note:
DESCRIPTION
This family of system calls provide a new flavor of interprocess
communication between client and server processes. The doors mechanism
is not yet available for public consumption.
WARNING
Please do not attempt to reverse-engineer the interface and program to
it. If you do, your program will almost certainly fail to run on future
versions of Solaris, and may even be broken by a patch. This document
does not constitute an API. Doors may not exist or may have a completely
different set of semantics in a future release.
NOTES
This manual page is here solely for the benefit of anyone who noticed
door_call() in truss(1) output and thought, "Gee, I wonder what that
does..."
Of course, despite the warning, doors did continue to exist and these days the
fundamentals of doors are well-documented, both in our own manpages and in the
late Richard St