James Carlson's Weblog
Close Hang (or 4028137)
Every once in a while, a bug sticks with you long enough that you remember the ID number without having to think about it, and you find yourself mentioning the number in conversation as though it were universally known. 4028137 is one of those bugs. Like a public works project gone horribly wrong, many people had contributed to this bug over many years without actually coming much closer to a real solution. This entry describes how we got a solution.
The cause of the problemThis problem has been around since we switched over to a System V code base, and it's deceptively simple. Serial ports (especially those used for console lines) seem to "hang" in a state where the process using the port cannot be killed; it becomes <defunct> and hangs around forever. Worse, the serial port itself cannot be used again until the system is rebooted. Other processes trying to open the port become wedged in the specfs layer. Those innocent victims are the first thing the user notices, and they look like this in mdb(1):
cv_wait_sig(0x61f83db8,0x61f83db0,0x62a26860,0x0,0x1,0x66817710) spec_open(0x61f83dc0,0x2202,0x64402d20,0x61f83dc4,0x1d,0x740000) + 158 cnopen(?) + 7c dev_open(0x3190192c,0x2202,0x2,0x64402d20,0x10439000,0x61f83dc4) spec_open(0x61f82aa0,0x2202,0x64402d20,0x61f82aa4,0x0,0x0) + 5c0 vn_open(0x0,0x0,0x2302,0x3f01b6,0x31901a74,0x0) + 2dc copen(0x38de0,0x2302,0x1b6,0x37400,0xef7a3680,0x196e8) + 84
That's spec_open attempting to do LOCK_CSP_SIG(). It's stuck because we have SLOCKED set, which means that there's another thread somewhere in open or close on this same stream. Quickly scanning through the ::threadlist output for spec_open and spec_close, we see this culprit:
cv_wait_sig+0x160 (0x61022dd2,0x61022db0,0x61015048,0x0,0x60fbf5d8,0x60ff7c40) qwait_sig+0x180 (0x61022dc2,0x40,0xc0006,0x61022dd2,0x61022db0,0x1) ldtermclose+0x264 (0x61022d00,0x4,0x60d48ee0,0x60aaaf48,0x60d48f4c,0x80000) qdetach+0x12c (0x61022d00,0x1,0x3,0x6049d9e0,0x1046c1b8,0x61022d58) strclose+0x4b0 (0x6102219e,0x610215b0,0x61021578,0x61022d00,0x61022164,0x610215b4) device_close+0x6c (0x60e8e464,0x3,0x6049d9e0,0x4,0x60e8e4f4,0x740001) spec_close+0xd8 (0x60e8e464,0x3,0x60e8e564,0x60e8e56c,0x60e8e564,0x6049d9e0) closef+0xa8 (0x60b4e428,0x0,0x60ff7c40,0x1,0x3,0x60e8e464) closeall+0x4c (0x1,0x0,0x61015614,0x60b4e428,0x4,0x6100b420) exit+0x134 (0x1,0x0,0x60fbf5d8,0x610151f0,0xef7948a8,0x0) syscall_trap+0x194 (0x0,0x0,0xef791000,0x0,0xef7948a8,0xffffffff)
So how does this happen? In mdb we see that the ldterm(7M) STREAMS module is stuck waiting on a reply from the driver that never arrives. Turning to the code, the ldtermclose routine sends down a TCSBRK ioctl to the driver in order to drain the remaining output data before allowing the close to complete. A quick look at the driver shows that it's stuck on flow control and can't drain the data. This code before the fix looked like this (eliding a few lines of noise):
/*
* Wait for output to drain completely from the modules and
* driver below us before completing the close. Doing so
* guarantees that the proper tty semantics -- in particular,
* flow control processing -- still apply for the remaining
* pending output.
*
* We do this by sending a TCSBRK ioctl downstream and waiting
* for the driver's response. (When issued with a nonzero
* argument, this ioctl does nothing other than wait for
* output to drain, which is precisely what we want.)
*/
/*
* Set up the ioctl.
*/
[...]
iocb = (struct iocblk *)bp->b_rptr;
iocb->ioc_count = sizeof (int);
*(int *)datap->b_rptr = 1; /* arbitrary nonzero value */
datap->b_wptr += sizeof (int);
bp->b_cont = datap;
/*
* Fire it off.
*/
tp->t_state |= TS_IOCWAIT;
tp->t_iocid = iocb->ioc_id;
(void) putq(WR(q), bp);
/*
* Now wait for it. Let our read queue put routine wake us
* up when it arrives.
*/
while (tp->t_state & TS_IOCWAIT) {
if (!qwait_sig(q))
break;
}
So why does ldterm do this? The code first appears in the 1.11 version of the driver from 1990, and it seems the real answer is lost to time, but the likely reason is that some serial drivers written back then may have discarded data on close, and that's not what was wanted. Rather than fix all of the drivers to drain when expected, someone added this ioctl in the ldterm path as a "global" fix.
That's not the whole story, though. Ldterm is stuck in qwait_sig(7F), which should have allowed us to send a signal to wake the process back up. Why isn't that working? Looking further up the stack, we see exit calling closeall which gets us here. The problem is that the closeall invocation happens after exit has disabled all signal reception. We just can't wake up on signals here.
It turns out that both of these are "right." A serial stream must, according to the standards, drain the output data when close is called, and must behave as though close were called on all open descriptors when exit is called. But it's also true that once exit is called, the process is torn down, and there's no legitimate way to deliver any signals that might interrupt the stream-close handling because there's no user left to which to send them.
Towards a solutionFrom here, the fix isn't clear. Over the years, several possibilities were proposed, including:
Just remove the TCSBRK from ldterm. While the use of TCSBRK here was likely flawed from the start (the standards don't say that you must have the ldterm module pushed in order for drain-on-close to work right), removing this is not a fix. This doesn't work because some serial drivers do correctly drain data during close(9E) handling, and those will be just as stuck during exit as ldterm once was. Worse, any broken drivers that fail to drain data as expected will now have latent bugs exposed to the user.
Move the call to closeall before the place where signals are disabled. This doesn't work because it means that an application could unexpectedly get a signal invocation after having called exit, and after all atexit routines have returned, which violates the expected semantics.
"Fix" cv_wait_sig so that it wakes up on a signal even if there's no user thread available. This could possibly be made to work but has at least one ugly (and decidedly non-UNIX) side-effect: the user would have to issue SIGKILL multiple times to get it to work.
Don't drain the data at exit time, and instead just flush it as implied by the standards. This doesn't work at all. A simple cat /etc/motd > /dev/cua/a would never work right: when cat exited, its stdout would be closed, and the output would be discarded.
Keep the driver structures allocated and draining data even though the user is allowed to complete the close. This won't work because it means that we'll be lying to the application, and telling it that we've transmitted data that may never actually make it out onto the wire.
In short, none of these is really right, and all cause other problems. We need a better answer. There are too many constraints here to achieve perfection, so we need to shoot for the least awful behavior. The answer I concocted looks like this:
- When the driver's close(9E) routine is called by a user thread invoking close(2), the user's thread should wait forever for data to drain. If a signal is received, though, we can discard the output data and finish the close. (Note that the user can himself use TCSBRK with a non-zero argument to drain the output stream without losing any enqueued data on signal reception.)
- When it's called by exit(2), the thread should wait as long as we're still making forward progress. If a "long time" (described below) passes without transmitting data, then we just assume we'll never finish, and we can give up.
- Ldterm needs to figure out what sort of underlying driver is present. If it's one that's known to handle close properly (as documented above), then it should just terminate like any good STREAMS module. If not, then it must resort to the legacy TCSBRK behavior.
- A set of special conditions exist under which draining output is
just impossible, and we should always discard instead without waiting
at all. These need to be handled in each serial driver:
- User sent TIOCSBRK without TIOCCBRK before closing. We'll be in break state "forever," so no more output is possible.
- User sent TCXONC with argument 0 ("suspend output") before closing.
- User set the O_NDELAY or O_NONBLOCK flags on the descriptor being closed. We cannot block to drain the data.
This gives us the chance to allow for reasonable behavior out of well-designed applications (those that carefully close descriptors when done with them and use the existing termio(7I) ioctls to good effect), but prevents things such as a ham-handed administrator using SIGKILL on tip(1) from rendering serial ports permanently unusable.
To do all this, we need two new features: a new DDI function (ddi_can_receive_sig(9F)) to determine if it's possible to receive signals in the current context (i.e., we're not being called from process exit) and a new private message (MC_POSIXQUERY, using an ioctl-formatted M_CTL just like the existing MC_CANONQUERY).
The new close logic can be seen in (among other serial drivers) asyclose and async_progress_check.
static void
async_progress_check(void *arg)
{
struct asyncline *async = arg;
struct asycom *asy = async->async_common;
mblk_t *bp;
/*
* We define "progress" as either waiting on a timed break or delay, or
* having had at least one transmitter interrupt. If none of these are
* true, then just terminate the output and wake up that close thread.
*/
mutex_enter(&asy->asy_excl);
mutex_enter(&asy->asy_excl_hi);
if (!(async->async_flags & (ASYNC_BREAK|ASYNC_DELAY|ASYNC_PROGRESS))) {
async->async_ocnt = 0;
async->async_flags &= ~ASYNC_BUSY;
async->async_timer = 0;
bp = async->async_xmitblk;
async->async_xmitblk = NULL;
mutex_exit(&asy->asy_excl_hi);
if (bp != NULL)
freeb(bp);
/*
* Since this timer is running, we know that we're in exit(2).
* That means that the user can't possibly be waiting on any
* valid ioctl(2) completion anymore, and we should just flush
* everything.
*/
flushq(async->async_ttycommon.t_writeq, FLUSHALL);
cv_broadcast(&async->async_flags_cv);
} else {
async->async_flags &= ~ASYNC_PROGRESS;
async->async_timer = timeout(async_progress_check, async,
drv_usectohz(asy_drain_check));
mutex_exit(&asy->asy_excl_hi);
}
mutex_exit(&asy->asy_excl);
}
[...]
/*
* Close routine.
*/
/*ARGSUSED2*/
static int
asyclose(queue_t *q, int flag, cred_t *credp)
{
[...]
/*
* There are two flavors of break -- timed (M_BREAK or TCSBRK) and
* untimed (TIOCSBRK). For the timed case, these are enqueued on our
* write queue and there's a timer running, so we don't have to worry
* about them. For the untimed case, though, the user obviously made a
* mistake, because these are handled immediately. We'll terminate the
* break now and honor his implicit request by discarding the rest of
* the data.
*/
if (async->async_flags & ASYNC_OUT_SUSPEND) {
if (async->async_utbrktid != 0) {
(void) untimeout(async->async_utbrktid);
async->async_utbrktid = 0;
}
mutex_enter(&asy->asy_excl_hi);
lcr = ddi_io_get8(asy->asy_iohandle, asy->asy_ioaddr + LCR);
ddi_io_put8(asy->asy_iohandle,
asy->asy_ioaddr + LCR, (lcr & ~SETBREAK));
mutex_exit(&asy->asy_excl_hi);
async->async_flags &= ~ASYNC_OUT_SUSPEND;
goto nodrain;
}
/*
* If the user told us not to delay the close ("non-blocking"), then
* don't bother trying to drain.
*
* If the user did M_STOP (ASYNC_STOPPED), there's no hope of ever
* getting an M_START (since these messages aren't enqueued), and the
* only other way to clear the stop condition is by loss of DCD, which
* would discard the queue data. Thus, we drop the output data if
* ASYNC_STOPPED is set.
*/
if ((flag & (FNDELAY|FNONBLOCK)) ||
(async->async_flags & ASYNC_STOPPED)) {
goto nodrain;
}
/*
* If there's any pending output, then we have to try to drain it.
* There are two main cases to be handled:
* - called by close(2): need to drain until done or until
* a signal is received. No timeout.
* - called by exit(2): need to drain while making progress
* or until a timeout occurs. No signals.
*
* If we can't rely on receiving a signal to get us out of a hung
* session, then we have to use a timer. In this case, we set a timer
* to check for progress in sending the output data -- all that we ask
* (at each interval) is that there's been some progress made. Since
* the interrupt routine grabs buffers from the write queue, we can't
* trust changes in async_ocnt. Instead, we use a progress flag.
*
* Note that loss of carrier will cause the output queue to be flushed,
* and we'll wake up again and finish normally.
*/
if (!ddi_can_receive_sig() && asy_drain_check != 0) {
async->async_flags &= ~ASYNC_PROGRESS;
async->async_timer = timeout(async_progress_check, async,
drv_usectohz(asy_drain_check));
}
while (async->async_ocnt > 0 ||
async->async_ttycommon.t_writeq->q_first != NULL ||
(async->async_flags & (ASYNC_BUSY|ASYNC_BREAK|ASYNC_DELAY))) {
if (cv_wait_sig(&async->async_flags_cv, &asy->asy_excl) == 0)
break;
}
if (async->async_timer != 0) {
(void) untimeout(async->async_timer);
async->async_timer = 0;
}
nodrain:
If we can't receive signals, then we start a timer to check every 15 seconds to see if we're making any progress. If not, we flush everything and let the close complete. The detection logic for enhanced drivers is in ldtermopen, where we send MC_POSIXQUERY, and ldtermrput, where we handle the MC_HAS_POSIX reply. Old drivers will just discard MC_POSIXQUERY and we'll keep the preallocated TCSBRK drain message. New ones send up MC_HAS_POSIX in reply, and ldterm then discards the preallocated TCSBRK message so it won't be used during close. (The "ldterm_drain_limit" flag below is an undocumented /etc/system flag that allows us to disable this logic just in case it runs into trouble in the field, rather than having to recompile the module and deliver a patch.)
ldtermopen(queue_t *q, dev_t *devp, int oflag, int sflag, cred_t *crp)
[...]
/*
* Find out if the underlying driver supports proper POSIX close
* semantics. If not, we'll have to approximate it using TCSBRK. If
* it does, it will respond with MC_HAS_POSIX, and we'll catch that in
* the ldtermrput routine.
*
* When the ldterm_drain_limit tunable is set to zero, we behave the
* same as old ldterm: don't send this new message, and always use
* TCSBRK during close.
*/
if (ldterm_drain_limit != 0) {
if ((qryp = open_ioctl(q, MC_POSIXQUERY)) == NULL)
goto open_abort;
qryp->b_datap->db_type = M_CTL;
putnext(wq, qryp);
}
[...]
ldtermrput(queue_t *q, mblk_t *mp)
[...]
case MC_HAS_POSIX:
/* no longer any reason to drain from ldterm */
if (ldterm_drain_limit != 0) {
freemsg(tp->t_drainmsg);
tp->t_drainmsg = NULL;
}
break;
Compared to negotiating the shark-infested design waters, the fix itself looks simple.
Finishing the jobThe next big step was to test this fix (during which I found around a dozen other lurking problems) and then chase down follow-on problems either uncovered or caused by this fix, and update documentation and third-party developers about the problem and solution. In fact, the testing alone for this fix took many times the amount of effort than finding, fixing, and documenting the problem, and running through all the development process steps and reviews. It took many weeks on-and-off running both manual and automated test suites. But that story is a (lengthy) blog for another day.
Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: mdb
Posted at 11:26AM Jun 14, 2005 by carlson in Solaris | Comments[0]
Tuesday Jun 14, 2005