Thursday June 16, 2005 It is common for a kernel programmer to postpone processing of some tasks and delegate their execution to another kernel thread. There may be several reasons for doing this:
In all these cases programmer, in essense, needs to execute a piece of code (task) in a different context, where context usually means another kernel thread with different set of locks held and, possibly, a different priority.
Until introduction of task queues in Solaris 8 there was no generic OS facility for such in-kernel context change. Every subsystem used its own ad-hoc mechanisms, usually utilizing ``worker threads'' together with a list of jobs to give them. The task queues interface abstracts common code out of these mechanisms and provides simple way of scheduling asynchronous tasks.
A task queue consists of a list of tasks, together with one or more threads to service the list. If a task queue has a single service thread, all tasks are guaranteed to execute in the order they were dispatched. Otherwise they can be executed in any order. Note that since tasks are placed on a list, execution of one task and should not depend on the execution of another task or a deadlock may occur. A taskq created with a single servicing thread guarantees that all the tasks are serviced in the order in which they are scheduled.
Kernel users should use the documented DDI interface for all taskq operations. These interfaces are defined in the usr/src/uts/common/sys/sunddi.h header file. The exported interface consists of the following functions:
Every taskq created in the system keeps a set of kstat counters associated with it. Try running the following command on your system:
$ kstat -c taskq
module: unix instance: 0
name: ata_nexus_enum_tq class: taskq
crtime 53.877907833
executed 0
maxtasks 0
nactive 1
nalloc 0
priority 60
snaptime 258059.249256749
tasks 0
threads 1
totaltime 0
module: unix instance: 0
name: callout_taskq class: taskq
crtime 0
executed 13956358
maxtasks 4
nactive 4
nalloc 0
priority 99
snaptime 258059.24981709
tasks 13956358
threads 2
totaltime 120247890619
...
The kstat information above includes:
You can use the power of the kstat command to observe how some counter increases over time:
$ kstat -p unix:0:callout_taskq:tasks 1 5
unix:0:callout_taskq:tasks 13994642
unix:0:callout_taskq:tasks 13994711
unix:0:callout_taskq:tasks 13994784
unix:0:callout_taskq:tasks 13994855
unix:0:callout_taskq:tasks 13994926
...
The taskq implementation also provides several useful SDT probes: All the probes described below have two arguments: the taskq pointer and the pointer to the pointer to the taskq_ent_t structure. It can be used to extract the function and the argument from the D script.
Developers can use these probes to collect precise timing information about individual task queues and individual tasks being executed through them. For example, the following script will print what functions were scheduled via task queues for every 10 seconds:
#!/usr/sbin/dtrace -qs
sdt:genunix::taskq-enqueue
{
this->tq = (taskq_t *)arg0;
this->tqe = (taskq_ent_t *) arg1;
@[this->tq->tq_name,
this->tq->tq_instance,
this->tqe->tqent_func] = count();
}
tick-10s
{
printa ("%s(%d): %a called %@d times\n", @);
trunc(@);
}
Running this on my desktop produced the following output1:
callout_taskq(1): genunix`callout_execute called 51 times
callout_taskq(0): genunix`callout_execute called 701 times
kmem_taskq(0): genunix`kmem_update_timeout called 1 times
kmem_taskq(0): genunix`kmem_hash_rescale called 4 times
callout_taskq(1): genunix`callout_execute called 40 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 256 times
callout_taskq(0): genunix`callout_execute called 702 times
kmem_taskq(0): genunix`kmem_update_timeout called 1 times
kmem_taskq(0): genunix`kmem_hash_rescale called 4 times
callout_taskq(1): genunix`callout_execute called 28 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 228 times
callout_taskq(0): genunix`callout_execute called 706 times
callout_taskq(1): genunix`callout_execute called 24 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 141 times
callout_taskq(0): genunix`callout_execute called 708 times
Suppose that two friends, Bob and Alice are staying in the cafeteria line with Alice standing behind Bob. The cashier checks Bobs' tray and it turns out that Bob doesn't have enough money, so he wants to borrow from Alice. But Alice is not sure whether she has enough cash until she knows the cost of her lunch. This is a typical deadlock situation - both Bob and Alice can not make any forward progress waiting for each other. The same kind of deadlock may occur if two tasks A and B are placed on a queue which is served by a single thread when there is a resource dependency between A and B. One way to prevent such a deadlock is to guarantee that A and B are processed by two different threads, so that when A stalls for B the thread processing A will block until B makes enough progress and can provide the needed resource to B.
Dynamic task queues provide exactly such deadlock-free way of scheduling potentially dependent tasks on the same queues. They guarantee that every task is processed by a separate thread. Since the amount of tasks that can be scheduled at the same time is not known in advance, dynamic task queues maintain a dynamic thread pool that grows when the workload increases and shrinks when the workload dries off.
Dynamic task queues can not (yet) be used via the DDI interfaces. Some kernel
subsystems use the internal taskq calls directly to create and use
dynamic task queues. The system also maintains one shared dynamic task queue
called system_taskq. It can be used by specifying
system_taskq as the taskq argument to the
taskq_dispatch() function. It is really a good idea to also add
"TQ_NOSLEEP | TQ_NOQUEUE" to the flags when using
system_taskq.
Each taskq is implemented as a list of tasks protected by a per-taskq lock. One or more worker threads take tasks one by one and execute them by calling f(a) and then sleep, waiting for new entries. A taskq created with a single servicing thread has an important property: it guarantees that all its tasks are executed in the order they are scheduled. When a task queue is created with several servicing threads, task execution order is not predictable.
If you want to look at the actual implementation you need to look at the following files:
The first taskq implementation was done by Jeff Bonwick for Solaris 8. It was successfully used to replace many calls to the low-level thread_create() function. I added Dynamic Task Queues in Solaris 9 and used them to completely re-implement the STREAMS scheduler. In Solaris 10 I added DDI interfaces for task queues and also added kstat counters and DTrace probes.
1 For curious minds: the callout_taskq is used to handle system timers. As an exercise in your DTrace skills, try to figure out what actual timers are firing on each CPU. Hint - use the callout-start SDT probe, which has a pointer to the callout_t structure as its sole argument.
Technorati Tag: Solaris
Technorati Tag: OpenSolaris
Technorati Tag: DTrace
Technorati Tag: Kernel
OpenSolaris includes a very powerful and modular debugger MDB that is an invaluable tool for analyzing crash dumps and live systems. It comes with a comprehensive manual which is a great read. Here is a brief information for brave souls having to debug STREAMS-related issues.
The most useful MDB dcmd for most module and driver developers is
::queue. Typing ::help queue will show the full usage
information:
NAME
queue - filter and display STREAM queue
SYNOPSIS
addr ::queue [-q|v] [-m mod] [-f flag] [-F flag] [-s syncq_addr]
DESCRIPTION
Print queue information for a given queue pointer.
Without the address of a "queue_t" structure given, print information about al
l
queues in the "queue_cache".
Options:
-v: be verbose - print symbolic flags falues
-q: be quiet - print queue pointer only
-f flag: print only queues with flag set
-F flag: print only queues with flag NOT set
-m modname: print only queues with specified module name
-s syncq_addr: print only queues which use specified syncq
Available conversions:
q2rdq: given a queue addr print read queue pointer
q2wrq: given a queue addr print write queue pointer
q2otherq: given a queue addr print other queue pointer
q2syncq: given a queue addr print syncq pointer (::help syncq)
q2stream: given a queue addr print its stream pointer
(see ::help stream and ::help stdata)
To walk q_next pointer of the queue use
queue_addr::walk qnext
ATTRIBUTES
Target: kvm
Module: genunix
Interface Stability: Unstable
If you just type ::queue, MDB will print information about all
opened queue instances 1 in the
system. For example, on my desktop:
> ::queue
ADDR MODULE FLAGS NBLK
d0353008 fifostrrhead 044032 0 00000000
d0353180 ip 204032 0 00000000
d03532f8 consms 202032 0 00000000
d0353470 strrhead 044032 0 00000000
d03535e8 tcp 204032 0 00000000
d0353760 strrhead 044032 0 00000000
d03538d8 conskbd 20c032 0 00000000
d0353a50 wc 242032 0 00000000
d0353bc8 ip 204032 0 00000000
d0353d40 strrhead 044032 0 00000000
d033f010 usbms 002032 0 00000000
d033f188 hid 201032 0 00000000
d033f300 consms 242032 0 00000000
d033f478 rts 108832 0 00000000
...
What you see in the output is the actual read queue address for each open module or instance, the module name, queue flags, number of messages on the module queue and the pointer to the first message in the queue. You can translate hex flag values into symbolic names adding -v option:
> ::queue -v
ADDR MODULE FLAGS NBLK
d0353008 fifostrrhead 044032 0 00000000
|
+--> QWANTR Someone wants to read Q
QREADR This is the reader (first) Q
QUSE This queue in use (allocation)
QMTSAFE stream module is MT-safe
QEND last queue in stream
d0353180 ip 204032 0 00000000
|
+--> QWANTR Someone wants to read Q
QREADR This is the reader (first) Q
QUSE This queue in use (allocation)
QMTSAFE stream module is MT-safe
QISDRV the Queue is attached to a driver
...
You can ask about specific queue by providing its address for the
::queue command:
> d0353180::queue -v
ADDR MODULE FLAGS NBLK
d0353180 ip 204032 0 00000000
|
+--> QWANTR Someone wants to read Q
QREADR This is the reader (first) Q
QUSE This queue in use (allocation)
QMTSAFE stream module is MT-safe
QISDRV the Queue is attached to a driver
What if you want to find all open instances of a specific module? This is easy
with the -m flag. For example, to find all open instances of IP you
can type
> ::queue -m ip
d0353180
d0353bc8
d197e5f8
d2263020
...
Notice that when you use any of the filtering flags, the
::queuedcmd prints only the address of matching queues. You can
still get more detailed information by pipelining the output to the
::queue dcmd:
> ::queue -m ip | ::queue
ADDR MODULE FLAGS NBLK
d0353180 ip 204032 0 00000000
d0353bc8 ip 204032 0 00000000
d197e5f8 ip 244032 0 00000000
d2263020 ip 204032 0 00000000
...
You can use filtering options to find all queues with specific flag value set:
> ::queue -f QISDRV|::queue
ADDR MODULE FLAGS NBLK
d0353180 ip 204032 0 00000000
d03532f8 consms 202032 0 00000000
d03535e8 tcp 204032 0 00000000
d03538d8 conskbd 20c032 0 00000000
d0353a50 wc 242032 0 00000000
...
prints information about all driver queues. What if you want information about modules and not drivers? Easy:
> ::queue -F QISDRV|::queue
ADDR MODULE FLAGS NBLK
d0353008 fifostrrhead 044032 0 00000000
d0353470 strrhead 044032 0 00000000
d0353760 strrhead 044032 0 00000000
d0353d40 strrhead 044032 0 00000000
d033f010 usbms 002032 0 00000000
d033f478 rts 108832 0 00000000
...
Similarly you can find all queues which are flow controlled. Here is an example from a real core dump:
> ::queue -f QFULL|::queue
ADDR MODULE FLAGS NBLK
ffffffff83e9d000 strrhead 04403c 64 ffffffff953c0600
ffffffff82548018 strrhead 04403c 331 ffffffff949c2680
ffffffff85f4ba48 timod 00083c 7 ffffffffbfa04140
ffffffff83acf2e8 timod 00083c 3 ffffffff84659f00
Careful reader will notice that this will print information only about flow
controlled read side queues. How about write side queues? Here comes a useful
little ::q2wrq dcmd:
> ::queue -q | ::q2wrq | ::queue -f QFULL|::queue -v
ADDR MODULE FLAGS NBLK
ffffffff83cb0388 tl 24402c 205 fffffe80e3017040
|
+--> QWANTW Someone wants to write Q
QFULL Q is considered full
QUSE This queue in use (allocation)
QMTSAFE stream module is MT-safe
QEND last queue in stream
QISDRV the Queue is attached to a driver
The ::q2wrq dcmd simply maps the read queue pointer to the write
queue pointer. The ::q2rdq performs the opposite mapping and the
::q2otherq dcmd maps read queue pointer to the write queue pointer
and visa versa.
::q2stream dcmd which maps a queue
pointer to the stream head pointer. You can use a nice ::stream
dcmd to display the whole stream graphically:
> ffffffff83acf2e8::q2stream|::stream
+-----------------------+-----------------------+
| 0xffffffff82548110 | 0xffffffff82548018 |
| strwhead | strrhead |
| | |
| cnt = 0t0 | cnt = 0t57420 |
| flg = 0x00004022 | flg = 0x0004403c |
+-----------------------+-----------------------+
| ^
v |
+-----------------------+-----------------------+
| 0xffffffff83acf3e0 | 0xffffffff83acf2e8 |
| timod | timod |
| | |
| cnt = 0t0 | cnt = 0t528 |
| flg = 0x00000822 | flg = 0x0000083c |
+-----------------------+-----------------------+
| ^
v |
+-----------------------+-----------------------+
| 0xffffffff83acf670 | 0xffffffff83acf578 |
| udp | udp |
| | |
| cnt = 0t0 | cnt = 0t0 |
| flg = 0x00000822 | flg = 0x00000832 |
+-----------------------+-----------------------+
| ^
v |
+-----------------------+-----------------------+
| 0xffffffff82a86128 | 0xffffffff82a86030 |
| ip | ip |
| | |
| cnt = 0t0 | cnt = 0t0 |
| flg = 0x00244022 | flg = 0x00204032 |
+-----------------------+-----------------------+
You can see all read and write queues of the stream and their state. The read side stream head is not very happy - it has 57420 bytes hanging around and blocking another 528 bytes in timod.
>p> Another way to look at the stream is to walk the read or write side using theqnext walker:
> 0xffffffff82548110::walk qnext|::queue
ADDR MODULE FLAGS NBLK
ffffffff82548110 strwhead 004022 0 0000000000000000
ffffffff83acf3e0 timod 000822 0 0000000000000000
ffffffff83acf670 udp 000822 0 0000000000000000
ffffffff82a86128 ip 244022 0 0000000000000000
> 0xffffffff82a86030::walk qnext|::queue
ADDR MODULE FLAGS NBLK
ffffffff82a86030 ip 204032 0 0000000000000000
ffffffff83acf578 udp 000832 0 0000000000000000
ffffffff83acf2e8 timod 00083c 3 ffffffff84659f00
ffffffff82548018 strrhead 04403c 331 ffffffff949c2680
A careful reader will notice that the ::stream dcmd displays how
many bytes bytes are in the queue while ::queue displays
ho many messages are there. Do we want to know what are these messages?
here comes the next useful dcmd.
::mblk dcmd. Typing
::help mblk will show you all the gory details:
NAME
mblk - print an mblk
SYNOPSIS
addr ::mblk [-q|v] [-f|F flag] [-t|T type] [-l|L|B len] [-d dbaddr]
DESCRIPTION
Print mblock information for a given mblk pointer.
Without the address, print information about all mblocks.
Fields printed:
ADDR: mblk address
FL: Flags
TYPE: Type of corresponding dblock
LEN: Data length as b_wptr - b_rptr
BLEN: Dblock space as db_lim - db_base
RPTR: Read pointer
DBLK: Dblock pointer
Options:
-v: be verbose - print symbolic flags falues
-q: be quiet - print mblk pointer only
-d dbaddr: print mblks with specified dblk address
-f flag: print only mblks with flag set
-F flag: print only mblks with flag NOT set
-t type: print only mblks of specified db_type
-T type: print only mblks other then the specified db_type
-l len: tprint only mblks with MBLKL == len
-L len: print only mblks with MBLKL <= len
-G len: print only mblks with MBLKL >= len
-b len: print only mblks with db_lim - db_base == len
ATTRIBUTES
Target: kvm
Module: genunix
Interface Stability: Unstable
It is easy to look at a specific message:
> ffffffff949c2680::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
ffffffff949c2680 0 proto 56 80 ffffffff88f145b0 ffffffff88f14540
This output shows us that this is an M_PROTO message with 56 bytes of
information starting at address ffffffff88f145b02. We know that messages like to hang out together.
How do we print the whole b_cont chain? Like this:
> ffffffff949c2680::walk b_cont|::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
ffffffff949c2680 0 proto 56 80 ffffffff88f145b0 ffffffff88f14540
ffffffff943b5240 0 data 108 144 ffffffff9868878c ffffffff98688700
Now we can see both the M_PROTO and the attached M_DATA messages, and we can see
that UDP is sending 108 bytes of data upstream.
Similarly, the b_next walker will follow the b_next
chain:
> ffffffff84659f00::walk b_next|::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
ffffffff84659f00 0 proto 56 80 ffffffff94257c70 ffffffff94257c00
fffffe80e3078340 0 proto 56 80 ffffffff94257bb0 ffffffff94257b40
ffffffff84658a40 0 proto 56 80 ffffffff942575b0 ffffffff94257540
And, of course, we can combine both in the pipeline:
> ffffffff84659f00::walk b_next|::walk b_cont|::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
ffffffff84659f00 0 proto 56 80 ffffffff94257c70 ffffffff94257c00
ffffffff84659300 0 data 120 208 ffffffff82fc144c ffffffff82fc13c0
fffffe80e3078340 0 proto 56 80 ffffffff94257bb0 ffffffff94257b40
ffffffff846599c0 0 data 120 208 ffffffff8465120c ffffffff84651180
ffffffff84658a40 0 proto 56 80 ffffffff942575b0 ffffffff94257540
ffffffff84658b00 0 data 120 208 ffffffff84651d4c ffffffff84651cc0
If you are really curious, you can see all allocated
messages by simply typing ::mblk:
> ::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
d6d0e000 0 proto 24 64 cfedaf00 cfedaec0
d6d0e020 0 data 0 8192 d377d400 d23d7c00
d6d0e040 0 data 3 64 dc14d580 dc14d540
d6d0e060 0 proto 156 320 cfedbd00 cfedbcc0
d6d0e080 0 data 0 64 d366a9c4 d366a980
d6d0e0a0 0 data 80 320 cfedbe80 cfedbe40
d6d0e0c0 0 proto 24 64 cfe5e0c0 cfe5e080
d6d0e0e0 0 data 115 320 cfe490e7 cfe49080
As with the ::queue command you can filter by the message type. For
example, you can look at M_DATA messages3 only:
> ::mblk -t M_DATA|::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
d6d0e020 0 data 0 8192 d377d400 d23d7c00
d6d0e040 0 data 16 64 dc14d580 dc14d540
d6d0e080 0 data 0 64 d366a9c4 d366a980
d6d0e0a0 0 data 80 320 cfedbe80 cfedbe40
d6d0e0e0 0 data 115 320 cfe490e7 cfe49080
d6d0e140 0 data 0 64 d77c7480 d77c7440
Or you can find all 0-bytes messages:
> ::mblk -l 0|::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
d6d0e020 0 data 0 8192 d377d400 d23d7c00
d6d0e040 0 data 0 64 dc1ab714 dc1ab6c0
d6d0e080 0 data 0 64 d366a9c4 d366a980
d6d0e140 0 data 0 64 d77c7480 d77c7440
...
In my opinion these example shows why MDB is really cool for kernel debugging. It can present huge amount of information in an informative way. Want to know how many 1-byte messages are floating around your system?
> ::mblk -l 0t1 ! wc -l
51
1The
::queue command will print all read queues.
2The UDP module talks a protocol called TPI, so the M_PROTO message is one of the TPI primitives. The first part of the primitive is always its ID, we can get this by typing
> ffffffff88f145b0/D
0xffffffff88f145b0: 20
Looking at usr/src/uts/common/sys/tihdr.h
we see that this is the T_UNITDATA_IND message (makes sense). We can print the
whole message now:
> ffffffff88f145b0::print 'struct T_unitdata_ind'
{
PRIM_type = 0x14
SRC_length = 0x10
SRC_offset = 0x14
OPT_length = 0x14
OPT_offset = 0x24
}
3 Underground hackers special:
> ::mblk -t M_DATA|::print mblk_t b_rptr|/s
Jun 15 09:57:16 lpr[13390]: [ID 575460 FACILITY_AND_PRIORITY] net_response(18)
NOTICE: NFS4 FACT SHEET:
Action: NR_CLIENTID
NFS4 error: NFS4ERR_STALE_CLIENTID
Suspected server reboot.
Jun 15 09:57:16 lpr[13390]: [ID 626751 FACILITY_AND_PRIORITY] net_write(18, 0x8045478, 12)
NOTICE: [NFS4][Server: onnv.sfbay][Mntpt: /ws/onnv-gate/usr]NFS server onnv.sfbay not responding; still trying
·gp0Jun 15 09:57:16 lpr[13390]: [ID 458697 FACILITY_AND_PRIORITY] net_open(bulka.SFBay, 5)
NOTICE: [NFS4][Server: onnv.sfbay][Mntpt: /ws/onnv-gate/usr]NFS server onnv.sfbay ok
...
Prints the content of all the data messages. I can see eyes flashing in the darkness!
Now, when the OpenSolaris is finally a reality, engineers can really start talking about the interesting staff. We spend most of our time dealing with code and it quite difficult to talk about what you do without being able to provide examples. Now is the time for the real technical blogging instead of the hand-waving.
I spent quite a lot of my time at Sun hacking STREAMS internals - things that almost no one ever notices, except for some unlucky folks who run into some nasty issues. Here I want to continue the sewer tour, started by Bryan Cantrill and take interested visitors to some of the STREAMS sewers.
Our tour will start with a few lines of code in the putnext() function. In the original implementation it was just a simple macro calling the put procedure of the next module in the STREAM:
#define putnext(q, mp) ((*(q)->q_next->q_qinfo->qi_putp)((q)->q_next, (mp)))
In Solaris it evolved into a rather complicated code1. Its inner workings deserve quite a few separate blog entries, but now we will take a side tour to a small piece of code that may catch your attention:
/*
* If there are writers or exclusive waiters, there is not much
* we can do. Place the message on the syncq and schedule a
* background thread to drain it.
*
* Also if we are approaching end of stack, fill the syncq and
* switch processing to a background thread - see comments on
* top.
*/
if ((flags & (SQ_STAYAWAY|SQ_EXCL|SQ_EVENTS)) ||
(sq->sq_needexcl != 0) || PUT_STACK_NOTENOUGH()) {
/*
* NOTE: qfill_syncq will need QLOCK. It is safe to drop
* SQLOCK because positive sq_count keeps the syncq from
* closing.
*/
mutex_exit(SQLOCK(sq));
qfill_syncq(sq, qp, mp);
/*
* NOTE: after the call to qfill_syncq() qp may be
* closed, both qp and sq should not be referenced at
* this point.
*
* This ASSERT is located here to prevent stack frame
* consumption in the DEBUG code.2
*/
ASSERT(sqciplock == NULL);
return;
}
The code uses a notion of a syncq, which is a synchronization abstraction used by STREAMS. Whenever some module calls the putnext() function, the code checks for various conditions and if everything seems all-right, it just calls the put procedure of the next module in the STREAM. If something doesn't seem right (e.g. the target module is busy processing other messages), the message is placed on the special queue in the syncq and the framework arranges a different kernel thread to pass the enqueued message to the next module when the module is ready3. The important observation is that this kernel thread executes has its own fresh stack. The highlighted code above hijacks this normal STREAMS mechanism to protect from a rather nasty problem - kernel stack overflow.
Note that STREAMS is a very flexible framework that allows many modules to be linked together in a chain (or even something more complex like a tree or U-pipe). Every new Solaris release includes some new STREAMS modules and drivers that often play in concert to provide some exciting new functionality. The downside is that the chain-calling of module's put procedures creates really long call chains on the stack and sometimes causes kernel panics caused by the stack overflow.
Soon after Solaris 9 release I saw quite a few similar panics, happening because all the kernel thread stack is consumed by a pile of STREAMS modules. Here is a typical example:
402760b0 allocb+0x180()
40276110 ip_wput_frag_copyhdr+0x14()
40276170 ip_wput_frag+0x124()
40276290 ip_wput_ire+0x1a98()
402763e8 ire_send+0x218()
40276448 ire_add_then_send+0x304()
402764b0 ip_newroute+0xb78()
40276600 ire_send+0x198()
40276660 ire_add_then_send+0x304()
402766c8 ip_wput_nondata+0x874()
40276748 putnext+0x400()
402767a8 ar_query_reply+0x150()
40276808 ar_entry_query+0x154()
40276878 ar_rput+0x144()
402768e8 putnext+0x400()
40276948 ip_newroute+0x1760()
40276a98 putnext+0x400()
40276af8 udp_wput+0x63c()
40276b80 putnext+0x400()
40276be0 putnext+0x400()
40276c40 strput+0x57c()
40276d68 kstrputmsg+0x3bc()
40276de8 tli_send+0x20()
40276e48 t_ksndudata+0x284()
40276ea8 clnt_clts_kcallit_addr+0x3dc()
40276f90 clnt_clts_kcallit+0x38()
40277000 rfscall+0x46c()
402770b8 rfs3call+0x68()
40277130 nfs3write+0x11c()
402772b0 nfs3_bio+0x334()
40277310 nfs3_rdwrlbn+0xf0()
40277380 nfs3_sync_putapage+0x34()
402773e0 nfs3_putapage+0x364()
40277470 pvn_vplist_dirty+0x424()
40277568 nfs_putpages+0x174()
402775e0 nfs3_putpage+0xa8()
40277640 nfs_purge_caches+0x98()
402776a0 nfs_cache_check+0xf8()
40277700 nfs3_getattr_otw+0x12c()
40277840 nfs3_validate_caches+0x11c()
40277908 nfs3_getpage+0x44()
40277990 segvn_fault+0xa74()
40277a90 as_fault+0x4c4()
40277b40 pagefault+0x40()
40277ba8 trap+0xd90()
40277c80 utl0+0x4c()
Although each modules uses just a small amount, numbers quickly add up. It is like going to the grocery store to buy a whole bunch of small items. Each one is really cheap, but you end up paying a lot at the counter.
At the time I was fixing some bugs in the syncq implementation and spent a lot of time fooling around putnext() and its friends, so I spotted the possibility to use the existing putnext() ability to delegate work to a new kernel thread to avoid such panics. The idea was really simple: in addition to the usual work handout in case the perimeter is busy, perform the same handout when we are too close to blowing away the stack. So I added the highlighted code above together with the definition for PUT_STACK_NOTENOUGH:
#define PUT_STACK_NEEDED 5000
#define PUT_STACK_NOTENOUGH() \
(((STACK_BIAS + (uintptr_t)getfp() - \
(uintptr_t)curthread->t_stkbase) < put_stack_needed))
The value of PUT_STACK_NEEDED was chosen experimentally. It is impossible to predict how much stack will be used in the future. For example, a simple call to allocb() may, in some unlucky circumstances, trigger a long chain of calls through the kmem and vmem memory allocation layers. So the value of PUT_STACK_NEEDED was chosen to prevent common panics that we saw during the PIT runs 4.
I run the fix through the real kernel experts - Bryan Cantrill, who fixed a tricky kernel stack overflow problem before5 and Mike Shapiro. Mike suggested making PUT_STACK_NOTENOUGH a generic kernel function for others to use, while Bryan objected. We had a heated meeting in Mike's office and as a result, the following comment appeared at the top of putnext.c file:
/* * Streams with many modules may create long chains of calls via putnext() which * may exhaust stack space. When putnext detects that the stack space left is * too small (less then PUT_STACK_NEEDED), the call chain is broken and * further processing is delegated to the background thread via call to * putnext_tail(). Unfortunately there is no generic solution with fixed stack * size, and putnext() is recursive function, so this hack is a necessary evil. * * The redzone value is chosen dependent on the default stack size which is 8K * on 32-bit kernels and on x86 and 16K on 64-bit kernels. The values are chosen * empirically. For 64-bit kernels it is 5000 and for 32-bit kernels it is 2500. * Experiments showed that 2500 is not enough for 64-bit kernels and 2048 is not * enough for 32-bit. * * The redzone value is a tuneable rather then a constant to allow adjustments * in the field. * * The check in PUT_STACK_NOTENOUGH is taken from segkp_map_red() function. It * is possible to define it as a generic function exported by seg_kp, but * * a) It may sound like an open invitation to use the facility indiscriminately. * b) It adds extra function call in putnext path. * * We keep a global counter `put_stack_notenough' which keeps track how many * times the stack switching hack was used. */
The hack was integrated6 early in Solaris 10 and backported to S9 updates. Later in S10 timeframe another engineer did the real fix for the stack overflow problem - increased the stack size on 64-bit kernel (see bug 4922366 ). It is worth mentioning a comment, made by Jeff Bonwick in the evaluation of this bug:
Yep. Grow the kernel stack. Memory is cheap. Panics are expensive.
I've been through several kernel stack crises before. They always unfold the same way. Some particular workload goes too deep. We prune a few stack frames to fix the offending code path. Then another one comes up. And another. (Right about now someone suggests that instead of growing the stack for every thread, we should finally bite the bullet and make the kernel stack growable. I dig up my mail archive from the last time we contemplated this, and explain why it's much harder than it sounds.) Eventually the panic rate becomes so high that we have to act. Nobody can figure out how to make dynamic stack growth work reliably, so after one more jurassic outage we accept physics and grow the stack.
So my proposal is that this time, we dispense with all the hand-wringing and just grow the damn thing.
As it often happens, the stack protection hack uncovered another long-standing and interesting bug7 in Solaris qfill_syncq() function, but this is a subject of another blog... .
1 The added complexity comes from the multi-threaded nature of the kernel. While one thread tries to access q_next, another thread may change it at the same, which leads to chaos. Solaris STREAMS provide a rich set of synchronization mechanism, called STREAMS perimeters which simplify the life of module and driver writers at the cost of internal complexity of the implementation.
2 Modern compilers are very smart and can optimize a function call immediately followed by a return statement so that the callee reuses the stack frame of the caller. This is called tail-call optimization. It saves the stack space, but obscures debugging and sometimes even the DTrace. The problem is that the function call back-trace you see in the stack trace does not accurately represent the actuall calling sequences in the presence of tail-calls. To simplify debugging the seemingly useless ASSERT prevents such tail-call optimization on DEBUG kernels, while keeping all the performance benefits of production builds.
3 The STREAMS framework is very careful to pass down all messages in the order received.
4 It turned out that the stack barrier value of 5000 was too aggressive for 8K stacks on 32-bit sparc systems, so it was adjusted to 2500 on 32-bit platforms.
5 Bryan fixed bug 1259818 back in Solaris 2.6.
6 This was fixed as bug 4525533.
7 Don't spend your time trying to find the bug in qfill_syncq() function - it was fixed before S10 was released.