Tuesday June 14, 2005
Now that OpenSolaris is finally a reality, engineers can really start talking about the interesting stuff. We spend most of our time dealing with code, and it is quite difficult to talk about what you do without being able to provide examples. Now is the time for real technical blogging instead of hand-waving.
I spent quite a lot of my time at Sun hacking STREAMS internals - things that almost no one ever notices, except for some unlucky folks who run into nasty issues. Here I want to continue the sewer tour started by Bryan Cantrill and take interested visitors to some of the STREAMS sewers.
Our tour will start with a few lines of code in the putnext() function. In the original implementation it was just a simple macro calling the put procedure of the next module in the STREAM:
#define putnext(q, mp) ((*(q)->q_next->q_qinfo->qi_putp)((q)->q_next, (mp)))
In Solaris it evolved into rather complicated code1. Its inner workings deserve quite a few separate blog entries, but for now we will take a side tour to a small piece of code that may catch your attention:
/*
 * If there are writers or exclusive waiters, there is not much
 * we can do. Place the message on the syncq and schedule a
 * background thread to drain it.
 *
 * Also if we are approaching end of stack, fill the syncq and
 * switch processing to a background thread - see comments on
 * top.
 */
if ((flags & (SQ_STAYAWAY|SQ_EXCL|SQ_EVENTS)) ||
    (sq->sq_needexcl != 0) || PUT_STACK_NOTENOUGH()) {
    /*
     * NOTE: qfill_syncq will need QLOCK. It is safe to drop
     * SQLOCK because positive sq_count keeps the syncq from
     * closing.
     */
    mutex_exit(SQLOCK(sq));
    qfill_syncq(sq, qp, mp);
    /*
     * NOTE: after the call to qfill_syncq() qp may be
     * closed, both qp and sq should not be referenced at
     * this point.
     *
     * This ASSERT is located here to prevent stack frame
     * consumption in the DEBUG code.2
     */
    ASSERT(sqciplock == NULL);
    return;
}
The code uses the notion of a syncq, which is a synchronization abstraction used by STREAMS. Whenever a module calls the putnext() function, the code checks for various conditions and, if everything seems all right, it just calls the put procedure of the next module in the STREAM. If something doesn't seem right (e.g. the target module is busy processing other messages), the message is placed on a special queue in the syncq and the framework arranges for a different kernel thread to pass the enqueued message to the next module when the module is ready3. The important observation is that this kernel thread has its own fresh stack. The highlighted code above (the PUT_STACK_NOTENOUGH() check) hijacks this normal STREAMS mechanism to protect against a rather nasty problem - kernel stack overflow.
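The hand-off described above can be modeled in user space. What follows is only a minimal sketch, not the Solaris implementation: `toy_syncq_t`, `toy_putnext`, `toy_drain` and `toy_record` are hypothetical names, a single `busy` flag stands in for the real perimeter checks, and the drain is an ordinary function call rather than a separate kernel thread with its own stack.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SQ_MAX 16   /* capacity of the toy deferred queue (assumption) */

/* Hypothetical stand-in for a syncq: a busy flag plus a FIFO of messages. */
typedef struct toy_syncq {
    int busy;                   /* nonzero: target cannot take messages now */
    int pending[TOY_SQ_MAX];    /* deferred messages, oldest first */
    size_t head, count;
    void (*put)(int msg);       /* put procedure of the next module */
} toy_syncq_t;

/* Example put procedure: records what it receives, in arrival order. */
static int toy_seen[TOY_SQ_MAX];
static size_t toy_nseen;

static void
toy_record(int msg)
{
    if (toy_nseen < TOY_SQ_MAX)
        toy_seen[toy_nseen++] = msg;
}

/* Returns 1 if delivered directly, 0 if deferred, -1 if the queue is full. */
static int
toy_putnext(toy_syncq_t *sq, int msg)
{
    /* Anything already queued must go first, so defer to keep ordering. */
    if (sq->busy || sq->count > 0) {
        if (sq->count == TOY_SQ_MAX)
            return (-1);
        sq->pending[(sq->head + sq->count) % TOY_SQ_MAX] = msg;
        sq->count++;
        return (0);
    }
    sq->put(msg);
    return (1);
}

/* What the background thread would do: deliver deferred messages in order. */
static size_t
toy_drain(toy_syncq_t *sq)
{
    size_t n = 0;

    while (!sq->busy && sq->count > 0) {
        int msg = sq->pending[sq->head];

        sq->head = (sq->head + 1) % TOY_SQ_MAX;
        sq->count--;
        sq->put(msg);
        n++;
    }
    return (n);
}
```

In this model a caller sets `put` to `toy_record`, sends messages with `toy_putnext()`, and calls `toy_drain()` once `busy` is cleared. A message that arrives while older ones are still queued is deferred as well, which is how the sketch keeps messages in the order received.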
Note that STREAMS is a very flexible framework that allows many modules to be linked together in a chain (or even something more complex like a tree or U-pipe). Every new Solaris release includes some new STREAMS modules and drivers that often play in concert to provide some exciting new functionality. The downside is that the chained calls of the modules' put procedures create really long call chains on the stack and sometimes cause kernel panics due to stack overflow.
Soon after the Solaris 9 release I saw quite a few similar panics, all happening because the entire kernel thread stack was consumed by a pile of STREAMS modules. Here is a typical example:
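To see why the call chain grows with the number of modules, here is a small user-space sketch built around the original macro form of putnext(). The types and names (`demo_queue_t`, `demo_putnext`, `demo_put`, `demo_chain_depth`) are hypothetical simplifications; the real queue_t reaches the put procedure through q_qinfo, which is flattened here.

```c
#include <assert.h>
#include <stddef.h>

struct demo_queue;
typedef void (*demo_putp_t)(struct demo_queue *q, int msg);

/* Hypothetical, stripped-down analog of a STREAMS queue. */
typedef struct demo_queue {
    struct demo_queue *q_next;  /* next module in the chain */
    demo_putp_t qi_putp;        /* this module's put procedure */
} demo_queue_t;

/* Same shape as the original macro: call the next module's put procedure. */
#define demo_putnext(q, mp) ((*(q)->q_next->qi_putp)((q)->q_next, (mp)))

static int demo_depth;      /* put procedures currently on the stack */
static int demo_max_depth;  /* deepest nesting observed */

/* A pass-through module: it does nothing but forward the message. */
static void
demo_put(demo_queue_t *q, int msg)
{
    demo_depth++;
    if (demo_depth > demo_max_depth)
        demo_max_depth = demo_depth;
    if (q->q_next != NULL)
        demo_putnext(q, msg);   /* nested call: our frame stays live */
    demo_depth--;
}

/* Link n queues into a chain, push one message, return the max nesting. */
static int
demo_chain_depth(demo_queue_t *qs, size_t n)
{
    size_t i;

    assert(qs != NULL);
    if (n == 0)
        return (0);
    for (i = 0; i < n; i++) {
        qs[i].q_next = (i + 1 < n) ? &qs[i + 1] : NULL;
        qs[i].qi_putp = demo_put;
    }
    demo_depth = demo_max_depth = 0;
    qs[0].qi_putp(&qs[0], 42);
    return (demo_max_depth);
}
```

Every module in the chain keeps its frame on the stack until the last one returns, so the nesting depth equals the number of modules. Real put procedures also call helper functions of their own, which is why actual traces are deeper still.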
402760b0 allocb+0x180()
40276110 ip_wput_frag_copyhdr+0x14()
40276170 ip_wput_frag+0x124()
40276290 ip_wput_ire+0x1a98()
402763e8 ire_send+0x218()
40276448 ire_add_then_send+0x304()
402764b0 ip_newroute+0xb78()
40276600 ire_send+0x198()
40276660 ire_add_then_send+0x304()
402766c8 ip_wput_nondata+0x874()
40276748 putnext+0x400()
402767a8 ar_query_reply+0x150()
40276808 ar_entry_query+0x154()
40276878 ar_rput+0x144()
402768e8 putnext+0x400()
40276948 ip_newroute+0x1760()
40276a98 putnext+0x400()
40276af8 udp_wput+0x63c()
40276b80 putnext+0x400()
40276be0 putnext+0x400()
40276c40 strput+0x57c()
40276d68 kstrputmsg+0x3bc()
40276de8 tli_send+0x20()
40276e48 t_ksndudata+0x284()
40276ea8 clnt_clts_kcallit_addr+0x3dc()
40276f90 clnt_clts_kcallit+0x38()
40277000 rfscall+0x46c()
402770b8 rfs3call+0x68()
40277130 nfs3write+0x11c()
402772b0 nfs3_bio+0x334()
40277310 nfs3_rdwrlbn+0xf0()
40277380 nfs3_sync_putapage+0x34()
402773e0 nfs3_putapage+0x364()
40277470 pvn_vplist_dirty+0x424()
40277568 nfs_putpages+0x174()
402775e0 nfs3_putpage+0xa8()
40277640 nfs_purge_caches+0x98()
402776a0 nfs_cache_check+0xf8()
40277700 nfs3_getattr_otw+0x12c()
40277840 nfs3_validate_caches+0x11c()
40277908 nfs3_getpage+0x44()
40277990 segvn_fault+0xa74()
40277a90 as_fault+0x4c4()
40277b40 pagefault+0x40()
40277ba8 trap+0xd90()
40277c80 utl0+0x4c()
Although each module uses just a small amount of stack, the numbers quickly add up. It is like going to the grocery store to buy a whole bunch of small items. Each one is really cheap, but you end up paying a lot at the counter.
At the time I was fixing some bugs in the syncq implementation and spent a lot of time fooling around with putnext() and its friends, so I spotted the possibility of using putnext()'s existing ability to delegate work to a new kernel thread to avoid such panics. The idea was really simple: in addition to the usual work handout in case the perimeter is busy, perform the same handout when we are too close to blowing the stack. So I added the highlighted code above together with the definition for PUT_STACK_NOTENOUGH:
#define PUT_STACK_NEEDED 5000
#define PUT_STACK_NOTENOUGH() \
(((STACK_BIAS + (uintptr_t)getfp() - \
(uintptr_t)curthread->t_stkbase) < put_stack_needed))
The value of PUT_STACK_NEEDED was chosen experimentally, because it is impossible to predict how much stack the rest of the call chain will use. For example, a simple call to allocb() may, in some unlucky circumstances, trigger a long chain of calls through the kmem and vmem memory allocation layers. So the value of PUT_STACK_NEEDED was chosen to prevent the common panics that we saw during the PIT runs4. Note that the macro compares against put_stack_needed, a tunable variable that defaults to PUT_STACK_NEEDED, so the threshold can be adjusted in the field.
I ran the fix past the real kernel experts: Bryan Cantrill, who had fixed a tricky kernel stack overflow problem before5, and Mike Shapiro. Mike suggested making PUT_STACK_NOTENOUGH a generic kernel function for others to use, while Bryan objected. We had a heated meeting in Mike's office and, as a result, the following comment appeared at the top of the putnext.c file:
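The arithmetic behind the check can be tried outside the kernel. The sketch below is a model under stated assumptions, not the kernel code: `stk_notenough` and `stk_modules_before_handoff` are hypothetical names, the stack is assumed to grow downward, STACK_BIAS is treated as zero, the base address is arbitrary, and every put procedure is assumed to consume the same fixed frame size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of the check for a downward-growing stack: the space left is
 * the distance from the current frame pointer down to the stack base.
 * STACK_BIAS is treated as zero in this model (assumption).
 */
static int
stk_notenough(uintptr_t fp, uintptr_t stkbase, size_t needed)
{
    return ((fp - stkbase) < needed);
}

/*
 * Simulate a chain of nmods put procedures, each consuming frame_size
 * bytes, on a stack of stack_size bytes. Returns how many modules ran
 * on the original stack before the check asked for a hand-off, or
 * nmods if the whole chain fit.
 */
static size_t
stk_modules_before_handoff(size_t stack_size, size_t frame_size,
    size_t needed, size_t nmods)
{
    uintptr_t stkbase = 0x40000000u;        /* arbitrary model address */
    uintptr_t fp = stkbase + stack_size;    /* start at the top */
    size_t i;

    /* The red zone must cover at least one frame, or fp could underflow. */
    assert(needed >= frame_size);

    for (i = 0; i < nmods; i++) {
        if (stk_notenough(fp, stkbase, needed))
            return (i);         /* hand off to a thread with a fresh stack */
        fp -= frame_size;       /* "call" the next put procedure */
    }
    return (nmods);
}
```

With an assumed 128-byte frame, `stk_modules_before_handoff(16384, 128, 5000, 100)` returns 89: the model hands the message off while just under 5000 bytes remain, leaving that much room for whatever the current module still has to call.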
/*
 * Streams with many modules may create long chains of calls via putnext() which
 * may exhaust stack space. When putnext detects that the stack space left is
 * too small (less then PUT_STACK_NEEDED), the call chain is broken and
 * further processing is delegated to the background thread via call to
 * putnext_tail(). Unfortunately there is no generic solution with fixed stack
 * size, and putnext() is recursive function, so this hack is a necessary evil.
 *
 * The redzone value is chosen dependent on the default stack size which is 8K
 * on 32-bit kernels and on x86 and 16K on 64-bit kernels. The values are chosen
 * empirically. For 64-bit kernels it is 5000 and for 32-bit kernels it is 2500.
 * Experiments showed that 2500 is not enough for 64-bit kernels and 2048 is not
 * enough for 32-bit.
 *
 * The redzone value is a tuneable rather then a constant to allow adjustments
 * in the field.
 *
 * The check in PUT_STACK_NOTENOUGH is taken from segkp_map_red() function. It
 * is possible to define it as a generic function exported by seg_kp, but
 *
 * a) It may sound like an open invitation to use the facility indiscriminately.
 * b) It adds extra function call in putnext path.
 *
 * We keep a global counter `put_stack_notenough' which keeps track how many
 * times the stack switching hack was used.
 */
The hack was integrated6 early in Solaris 10 and backported to S9 updates. Later in the S10 timeframe another engineer made the real fix for the stack overflow problem: increasing the stack size on 64-bit kernels (see bug 4922366). It is worth quoting a comment made by Jeff Bonwick in the evaluation of this bug:
Yep. Grow the kernel stack. Memory is cheap. Panics are expensive.
I've been through several kernel stack crises before. They always unfold the same way. Some particular workload goes too deep. We prune a few stack frames to fix the offending code path. Then another one comes up. And another. (Right about now someone suggests that instead of growing the stack for every thread, we should finally bite the bullet and make the kernel stack growable. I dig up my mail archive from the last time we contemplated this, and explain why it's much harder than it sounds.) Eventually the panic rate becomes so high that we have to act. Nobody can figure out how to make dynamic stack growth work reliably, so after one more jurassic outage we accept physics and grow the stack.
So my proposal is that this time, we dispense with all the hand-wringing and just grow the damn thing.
As often happens, the stack protection hack uncovered another long-standing and interesting bug7 in the Solaris qfill_syncq() function, but that is a subject for another blog entry...
1 The added complexity comes from the multi-threaded nature of the kernel. While one thread tries to access q_next, another thread may change it at the same time, which leads to chaos. Solaris STREAMS provides a rich set of synchronization mechanisms, called STREAMS perimeters, which simplify the life of module and driver writers at the cost of internal complexity in the implementation.
2 Modern compilers are very smart and can optimize a function call immediately followed by a return statement so that the callee reuses the stack frame of the caller. This is called tail-call optimization. It saves stack space, but complicates debugging and sometimes even confuses DTrace. The problem is that the back-trace you see does not accurately represent the actual calling sequence in the presence of tail calls. To simplify debugging, the seemingly useless ASSERT prevents such tail-call optimization on DEBUG kernels, while production builds keep all the performance benefits.
3 The STREAMS framework is very careful to pass down all messages in the order received.
4 It turned out that the stack barrier value of 5000 was too aggressive for 8K stacks on 32-bit SPARC systems, so it was adjusted to 2500 on 32-bit platforms.
5 Bryan fixed bug 1259818 back in Solaris 2.6.
6 This was fixed as bug 4525533.
7 Don't spend your time trying to find the bug in the qfill_syncq() function - it was fixed before S10 was released.
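Footnote 2's point about tail calls can be shown with a short user-space sketch. The names here (`tc_fill`, `tc_put_tail`, `tc_put_checked`) are hypothetical, and the standard assert() stands in for the kernel's ASSERT. Whether a given compiler actually performs the optimization depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

static int tc_delivered;    /* counts calls that reached the callee */

/* Stand-in for the callee (qfill_syncq() in the kernel code). */
static void
tc_fill(int msg)
{
    (void) msg;
    tc_delivered++;
}

/*
 * The call is the last thing this function does, so an optimizing
 * compiler may emit it as a jump and reuse this frame. The caller
 * then disappears from any back-trace taken inside tc_fill().
 */
static void
tc_put_tail(int msg)
{
    tc_fill(msg);
    return;
}

/*
 * With assertions enabled there is still work to do after the call,
 * so the call is no longer in tail position and this frame stays
 * visible. Building with NDEBUG removes the assert, which makes the
 * call eligible for tail-call optimization again.
 */
static void
tc_put_checked(int msg, const void *lock)
{
    tc_fill(msg);
    assert(lock == NULL);
}
```

Both functions behave identically from the caller's point of view; the only difference is what a debugger is likely to show while tc_fill() is running.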