On Bugs and Reproducibility
I love bugs -- specifically, I love putting an end to
bugs[1]. I
find myself a bit tongue-twisted when asked to articulate just what makes
root-causing a bug so engrossing. In brief: it's the thrill of the hunt,
the pull of a mystery, the search for a deeper understanding of how
the system works, and -- of course -- the gripping fear that some
customer somewhere will hit it again if we don't nail the bloody thing[2].
One way to subclass bugs is by reproducibility. That is, can the bug be
reproduced on demand? However, as this blog entry will make clear,
almost[3] any bug
can be reproduced on demand once it is sufficiently understood. Thus,
when someone states "this bug is not reproducible", they're likely
really saying "this bug is not sufficiently understood". Sadly, one
established way a bug can be "resolved" at Sun is for it to be closed
out as "not reproducible" -- not just a shameful admission of utter
defeat, but a near-guarantee to customers that they will be
inconvenienced by the problem yet again. Thus, many of us simply
refuse to close out bugs as "not reproducible"[4].
Nonetheless, sometimes bugs are indeed closed out as "not reproducible".
In fact, if a bug takes up residence in a subsystem that is poorly
understood but widely used, it can thump around for years -- or even
decades -- before it is finally smoked out by a particularly determined
engineer. Further, often the manifestations of the bug can be so varied
that even after diagnosing, it can prove impossible to track down all of
the duplicates that are either still open or have been closed out over
the years as "not reproducible". Here I discuss my dealings with one
such bug -- 4927647.
The Sewer of STREAMS
A classic example of a poorly-understood but widely-used area is the
pseudo-terminal subsystem[5]. While the basics of how pseudo-terminals
work is well-documented in several texts[6], the STREAMS code that implements it is a
arcane and delicate piece of machinery from AT&T that has undergone
little change since its initial impact with SVR3.2 some 18 years ago.
Again, that's not to say the code is unimportant. On the
contrary, it is a core building block upon which many critical
utilities, such as xterms, remote logins (ssh/telnet),
and tools like expect(1) are built. Thus, when something goes
wrong, someone has to be called in to diagnose it -- and since
it's STREAMS code, these days that often means me[7].
Specifically, I got called in because a high-profile customer filed
the following bug:
In expect 5.38.0 on Solaris 9, run the following script:
for { set i 1 } { 1 } { incr i } {
spawn echo foobarbaz
expect {
"foobarbaz" {
puts "ok"
expect eof wait
}
eof {
puts "$i passes"
wait
exit 1
}
}
}
It should loop forever; instead, it exits after a few iterations on
multiprocessor systems. It appears to loop "forever" on uniprocessors.
Indeed, running this test I was able to confirm that it did fail on
multiprocessor systems -- though usually after thousands of iterations (each
iteration taking close to a second), rather than the "few" indicated in the
bug report. Without a reproducible test case, this was surely going to
be a painful battle.
The Hunt for Reproducibility
Using truss(1), it had already been confirmed that the problem was not
with expect, but with the pseudo-terminal implementation itself.
Specifically, something was causing the poll(2) to occasionally wake up
with POLLERR, and the subsequent read(2) to return EINVAL rather
than the number of bytes in the expected message:
25258: open("/dev/ptmx", O_RDWR) = 4
[ ... lots happens ... ]
25258: poll(0xFFBFEBC0, 1, 10000) = 1
25258: fd=4 ev=POLLRDNORM|POLLRDBAND rev=POLLERR
25258: read(4, 0x00044FE8, 3999) Err#22 EINVAL
Of course, with this came the sobering realization that this bug was
undoubtedly responsible for countless other sporadic "not reproducible"
pseudo-terminal-related bugs, and an accompanying growing list of
suspected duplicates. So, the key question was: could we intuit enough
about this bug to make it reproducible and finally smoke this thing out?
At the time, DTrace had
not yet integrated into Solaris, but was available for internal use and
had an ever-growing buzz, so I was eager to see if it could shed some
light on this bug. In particular, based on the truss output, my
suspicion was that a toxic message was arriving on the
/dev/ptmx stream that was causing STRDERR to be set,
triggering POLLERR to be returned from strpoll():
if (sd_flags & (STPLEX | STRDERR | STWRERR)) {
if (sd_flags & STPLEX) {
*reventsp = POLLNVAL;
return (EINVAL);
}
if (((events & (POLLIN | POLLRDNORM | POLLRDBAND | POLLPRI)) &&
(sd_flags & STRDERR)) ||
((events & (POLLOUT | POLLWRNORM | POLLWRBAND)) &&
(sd_flags & STWRERR))) {
if (!(events & POLLNOERR)) {
*reventsp = POLLERR;
return (0);
}
}
}
Further, my suspicion was that
most of the time, the application called
poll() before this toxic message arrived, thus sidestepping the problem.
However, occasionally, the application would get delayed (e.g., by other
system activity), causing the toxic message to set
STRDERR before the
application was able to call
poll().
Thus, I used the following simple D script to "force" the application to
become stalled:
#!/usr/sbin/dtrace -ws
syscall::poll:entry
{
chill(30000000);
}
And just like that, the problem became reproducible on the first try!
That is, by simply widening the failure window, the problem became
immediately reproducible, simultaneously confirming my hypothesis and
significantly easing further analysis. (It should be noted that the
proper value to chill(), which is specified in nanoseconds, is highly
dependent on the underlying hardware -- for my hardware, delaying by 30ms
was sufficient. Of course, to prevent widespread system misbehavior,
the amount of time one can chill() is bounded.)
The Analysis
With the problem reproducible, further analysis with DTrace and other tools
proceeded quickly. In less than an hour, the specifics behind the bug were
well-understood, though still extraordinarily twisted.
For those who are interested in such sordid details, here's a crude
diagram which shows the data flow (indicated by arrows) and the relevant
streams modules (shown in brackets; more about the the modules can be
gleaned from their manpages -- e.g., pts(7D)). Note that pckt(7D),
while critical to the problem, is not actually on the stream -- hence the
dotted lines going through it:
expect echo
U
-------------------------------------------
K
[ stream ] [ stream ]
[ head ] [ head ]
| ^ | ^
| | v |
| | [ ldterm ]
| | | ^
: : v |
[ pckt ] [ ptem ]
: : | ^
v | v |
[ ptm ] <------> [ pts ]
master side slave side
Using the above diagram, here are the steps that lead to the problem
[8]:
- The message (e.g., "foobarbaz" or "something") is sent down
the pseudo-terminal slave stream by the echo command.
- The pseudo-terminal slave is closed by echo (e.g., through
exit(2)) and the dismantling of the stream begins; the
message sent in (1) is still in-flight, perhaps even still
on the slave side.
- Eventually, ldterm's close routine (ldtermclose()) is called,
which causes a TCSBRK M_IOCTL to be sent downstream (see
Jim Carlson's blog for lots more on that misery).
- The ptem module receives the M_IOCTL, acks
it, and then passes a copy of of the M_IOCTL downstream for
pckt(7M), as documented in ptem(7M).
- Shortly after ldtermrput() receives the M_IOCACK,
ldtermclose() is allowed to complete, and the
following processing elements (ptem and pts)
quickly follow suit, being careful to drain (if necessary) the
message and the M_IOCTL copy to the master side
stream.
- Both the M_IOCTL and original data message reach the
master side stream. Eventually, the data message makes
it to the stream head and a pollwakeup() is issued to
notify the application (expect) that data is available.
However, as discussed before, for whatever reason the
application is delayed.
- Since the pckt(7M) module is actually not on the master side,
the M_IOCTL reaches the master side's stream head. Since it's
not expecting to receive this message, strrput_nondata() sends
an M_IOCNAK back downstream.
- The M_IOCNAK reaches ptm. However, since the slave-side has
been closed, ptmwput() frees the message and sends an
M_ERROR upstream to indicate the problem.
- The stream head receives the M_ERROR and strrput_nondata()
sets STRDERR on the stream.
- The application thread finally receives the notification
from step (6) and is awoken, but since STRDERR is set, the
block of code shown earlier causes strpoll()
to return POLLERR rather than POLLIN. Thus,
the observed behavior.
All That ... For This?
Despite the tortuous nature of the bug, the fix itself proved to be
trivial[9]. In
particular, while there are several possible simple
fixes, the least risky is to change the ptmwput() logic in step (8) to
use the obscure 2-byte form of the M_ERROR message to only mark the
write-side as having an error. That is, to change:
if ((ptmp->pt_state & PTLOCK) || (ptmp->pts_rdq == NULL)) {
DBG(("got msg. but no slave\n"));
(void) putnextctl1(RD(qp), M_ERROR, EINVAL);
PT_EXIT_READ(ptmp);
freemsg(mp);
return;
}
... to[10]:
if ((ptmp->pt_state & PTLOCK) || (ptmp->pts_rdq == NULL)) {
DBG(("got msg. but no slave\n"));
mp = mexchange(NULL, mp, 2, M_ERROR, -1);
if (mp != NULL) {
mp->b_rptr[0] = NOERROR;
mp->b_rptr[1] = EINVAL;
qreply(qp, mp);
}
PT_EXIT_READ(ptmp);
return;
}
[ See the putnextctl(9F), qreply(9F), and mexchange(9F)
manpages to get a better handle on this code. ]
This seems eminently reasonable since ptm is unable to forward a
message through the write-side, but the read-side is still usable -- so
why wasn't it done this way from day one? The reason is simple: the
2-byte form of the M_ERROR message post-dates the original
code.
Again, thanks to the reproducible test case, it was quick and easy
to verify that this fixed the problem.
Due Diligence
As I previously mentioned, most of the pseudo-terminal code dates back to
AT&T and has simply been hitched to the nearest wrecker and gingerly towed to
the latest release. Indeed, inspecting an internal gates, it's clear
this bug goes all the way back to the introduction of the STREAMS-based
pseudo-terminal subsystem in 1987! While this is certainly an extreme
case, it is not rare for apparently-intermittent bugs to terrorize
subsystems for years before they are finally root-caused.
In this case, scanning the bug database suggests the conditions needed to
reproduce this bug did not become widespread until Solaris 9 -- at which
point, numerous bizarre application failures began to be reported, such as:
4530041 Unable to restore/save backup during DSR upgrade to S9/S10 due to internal error
4654418 ripv2 test routed.perf.3 failed intermittently
4695559 spurious failures of tcp.serverclose
4813669 ripv2 test routed.perf.3 failed intermittently
4897163 ppp tests often return with "FAIL: Unable to ping 10.0.0.2."
4892148 unexpected POLLERR causing early EOF from spawned processes in "expect"
4926624 ssh exits with -1 if stdin is not a terminal
4985311 rdisc.3 testcase randomly fails
5012511 ripv2: some test cases in rtquery subdir failed randomly and intermittently
Of course, there were surely many more anomalies due to this bug which
were either never filed, "fixed" by other means (workaround), or closed
out as "not reproducible" -- we will never know the full scope of the
chaos caused by this persistent little bug.
Summary
Hopefully, this treatise has clarified three key points regarding
hard-to-reproduce bugs:
- Almost any bug can be reproduced once it is properly understood.
Resist the temptation to close a bug out as non-reproducible --
it will come back to bite someone!
- DTrace
is a powerful tool for investigating hard-to-reproduce bugs, and
the chill() action is particularly ideal for
investigating bugs that you suspect are timing-related.
- Once the problem is understood, scanning the bug database for
similar symptoms can often reveal years -- or even decades -- of
other bits of inexplicable behavior. Like nabbing a serial
killer, the search through the cold cases can prove tedious but
ultimately provides a sense of closure to others who have been
victimized. Moreover, the search provides a basis for
understanding the true impact of the bug and thus how many
releases the fix needs to be patched back to.
Happy hunting!
Footnotes
[1]
I literally put an end to bugs during a recent trip to China -- though
sadly I was not able to dine on anything quite this large.
(Upon seeing that photo, Jim
Carlson quipped: "I guess if you can put a stick through it, it's food."
-- a disturbingly accurate observation.)
[2]
Of course, as a developer, much of my time is spent on the other side of
the fence: introducing new code, which will invariably have a new crop of
challenging bugs. Thus, there is another element to bugfixing: paying
my dues to society.
[3]
There exists a very small subset of bugs that are tied to factors
that are beyond software's ability to control, and thus cannot be
reproduced on demand, even when they are sufficiently understood.
[4]
Unfortunately, our internal "list" of reasons for closing out a
bug is currently too limited, forcing "not reproducible" to be
pressed into service for "fixed as a side effect of X", or
"insufficient information available to evaluate", among others.
[5]
Actually, we have two pseudo-terminal subsystems on Solaris: an
System V one, and a BSD one. However, the BSD flavor is provided only
for compatibility and for purposes of this discussion can be ignored.
[6]
An excellent description of pseudo-terminals can be found in
the recently-revised Advanced Programming in the Unix Environment.
[7]
For the record, I've never formally worked on STREAMS -- but Sasha Kolbasov and myself have become the unofficial caretakers of the
much-maligned STREAMS core.
[8]
To keep the description from growing too bloated, I have left out
the code snippets, but inspecting the linked functions will allow
more detailed study of the problem. Of course, the bug has been
fixed, so the broken code in ptmwsrv() no longer exists.
[9]
It is surprising how often hard bugs have trivial fixes. Surely
there is a reason for this, but it's not obvious to me.
[10]
Note that this fix makes use of mexchange(9F), which has long been
available internally, but which I added to the Solaris DDI (along with
a collection of other STREAMS utility routines) for Solaris 10.
Hopefully these routines will make writing new STREAMS code a little
less painful. Perhaps worthy of a future blog entry.
Technorati Tag:
OpenSolaris
Technorati Tag:
Solaris
Technorati Tag:
DTrace