Weblog

All | CMT | General | NUMA | OpenSolaris | Perl | Photo | Programmers Desk | STREAMS
« Previous month (May 2005) | Main | Next month (Jul 2005) »
20050616 Thursday June 16, 2005

Task Queues in OpenSolaris

Overview of Task Queues

It is common for a kernel programmer to postpone processing of some tasks and delegate their execution to another kernel thread. There may be several reasons for doing this:

In all these cases programmer, in essense, needs to execute a piece of code (task) in a different context, where context usually means another kernel thread with different set of locks held and, possibly, a different priority.

Until introduction of task queues in Solaris 8 there was no generic OS facility for such in-kernel context change. Every subsystem used its own ad-hoc mechanisms, usually utilizing ``worker threads'' together with a list of jobs to give them. The task queues interface abstracts common code out of these mechanisms and provides simple way of scheduling asynchronous tasks.

A task queue consists of a list of tasks, together with one or more threads to service the list. If a task queue has a single service thread, all tasks are guaranteed to execute in the order they were dispatched. Otherwise they can be executed in any order. Note that since tasks are placed on a list, execution of one task and should not depend on the execution of another task or a deadlock may occur. A taskq created with a single servicing thread guarantees that all the tasks are serviced in the order in which they are scheduled.

DDI interface for task queues

Kernel users should use the documented DDI interface for all taskq operations. These interfaces are defined in the usr/src/uts/common/sys/sunddi.h header file. The exported interface consists of the following functions:

ddi_taskq_create():
Creates a new taskq object with specified number of threads servicing it. All threads will run with a single specified priority. The priority may have a special value TASKQ_DEFAULTPRI meaning that the priority will be chosen by the system. The arguments are:

dip
The pointer to the dev_info_t structure. Some subsystems do not have a dip pointer and may pass NULL instead.
name
Descriptive string.
nthreads
Number of threads servicing the task.
pri
Priority of threads servicing the task queue. Drivers and modules should specify TASKQ_DEFAULTPRI.
flags
Should be always zero. This argument is reserved for future extensions.

ddi_taskq_dispatch():
Schedules a task for a specified taskq. A task is just a pair {f, a} where f is a function, accepting a single pointer argument and a is its argument value. Additional flags specify whether dispatch may or may not sleep waiting for resources. Once the task is dispatched it will be scheduled asynchronously at some later time and there is no way to cancel a task that is dispatched but have not been executed yet. All the tasks are executed with a fixed priority specified at the time of taskq creation. The arguments are:

tq
The taskq pointer returned by taskq_create().
func
The callback function to call. The function should accept a single argument.
arg
Argument to the callback function.
flags
Flags controlling the dispatch behavior. Possible flags are:

DDI_SLEEP
Allow sleeping (blocking) until memory is available.
DDI_NOSLEEP
Return DDI_FAILURE immediately if memory is not available.

ddi_taskq_wait():
Blocks the taskq from any new dispatches and waits for all previously scheduled tasks to complete, then unblocks the taskq. This function does not stop any new task dispatches. Its single argument is the taskq to wait for.

ddi_taskq_suspend():
suspends all task execution until ddi_taskq_resume() is called. Although ddi_taskq_suspend() attempts to suspend pending tasks, there are no guarantees that they will be suspended. The only guarantee is that all tasks dispatched after ddi_taskq_suspend() will not be executed. Because it will trigger a deadlock, the function should never be called by a task executing on a taskq. Its single argument is the taskq to suspend.

ddi_taskq_suspended():
returns B_TRUE if taskq is suspended, and B_FALSE otherwise. It is intended to ASSERT that the task queue is suspended. Its single argument is the taskq to check.>

ddi_taskq_resume():
resumes taskq execution. Its single argument is the taskq to resume.

ddi_taskq_destroy():
waits for all pending tasks to complete and destroys the task queue and all associated threads.

Observability

Counters

Every taskq created in the system keeps a set of kstat counters associated with it. Try running the following command on your system:


$ kstat -c taskq
module: unix                            instance: 0     
name:   ata_nexus_enum_tq               class:    taskq
        crtime                          53.877907833
        executed                        0
        maxtasks                        0
        nactive                         1
        nalloc                          0
        priority                        60
        snaptime                        258059.249256749
        tasks                           0
        threads                         1
        totaltime                       0

module: unix                            instance: 0     
name:   callout_taskq                   class:    taskq
        crtime                          0
        executed                        13956358
        maxtasks                        4
        nactive                         4
        nalloc                          0
        priority                        99
        snaptime                        258059.24981709
        tasks                           13956358
        threads                         2
        totaltime                       120247890619
 ...

The kstat information above includes:

You can use the power of the kstat command to observe how some counter increases over time:


$ kstat -p unix:0:callout_taskq:tasks 1 5  
unix:0:callout_taskq:tasks      13994642

unix:0:callout_taskq:tasks      13994711

unix:0:callout_taskq:tasks      13994784

unix:0:callout_taskq:tasks      13994855

unix:0:callout_taskq:tasks      13994926

...

DTrace SDT Probes

The taskq implementation also provides several useful SDT probes: All the probes described below have two arguments: the taskq pointer and the pointer to the pointer to the taskq_ent_t structure. It can be used to extract the function and the argument from the D script.

Developers can use these probes to collect precise timing information about individual task queues and individual tasks being executed through them. For example, the following script will print what functions were scheduled via task queues for every 10 seconds:

#!/usr/sbin/dtrace -qs

sdt:genunix::taskq-enqueue
{
  this->tq  = (taskq_t *)arg0;
  this->tqe = (taskq_ent_t *) arg1;
  @[this->tq->tq_name,
    this->tq->tq_instance,
    this->tqe->tqent_func] = count();
}

tick-10s
{
  printa ("%s(%d): %a called %@d times\n", @);
  trunc(@);
}

Running this on my desktop produced the following output1:


callout_taskq(1): genunix`callout_execute called 51 times
callout_taskq(0): genunix`callout_execute called 701 times
kmem_taskq(0): genunix`kmem_update_timeout called 1 times
kmem_taskq(0): genunix`kmem_hash_rescale called 4 times
callout_taskq(1): genunix`callout_execute called 40 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 256 times
callout_taskq(0): genunix`callout_execute called 702 times
kmem_taskq(0): genunix`kmem_update_timeout called 1 times
kmem_taskq(0): genunix`kmem_hash_rescale called 4 times
callout_taskq(1): genunix`callout_execute called 28 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 228 times
callout_taskq(0): genunix`callout_execute called 706 times
callout_taskq(1): genunix`callout_execute called 24 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 141 times
callout_taskq(0): genunix`callout_execute called 708 times

Dynamic Task Queues

Suppose that two friends, Bob and Alice are staying in the cafeteria line with Alice standing behind Bob. The cashier checks Bobs' tray and it turns out that Bob doesn't have enough money, so he wants to borrow from Alice. But Alice is not sure whether she has enough cash until she knows the cost of her lunch. This is a typical deadlock situation - both Bob and Alice can not make any forward progress waiting for each other. The same kind of deadlock may occur if two tasks A and B are placed on a queue which is served by a single thread when there is a resource dependency between A and B. One way to prevent such a deadlock is to guarantee that A and B are processed by two different threads, so that when A stalls for B the thread processing A will block until B makes enough progress and can provide the needed resource to B.

Dynamic task queues provide exactly such deadlock-free way of scheduling potentially dependent tasks on the same queues. They guarantee that every task is processed by a separate thread. Since the amount of tasks that can be scheduled at the same time is not known in advance, dynamic task queues maintain a dynamic thread pool that grows when the workload increases and shrinks when the workload dries off.

Dynamic task queues can not (yet) be used via the DDI interfaces. Some kernel subsystems use the internal taskq calls directly to create and use dynamic task queues. The system also maintains one shared dynamic task queue called system_taskq. It can be used by specifying system_taskq as the taskq argument to the taskq_dispatch() function. It is really a good idea to also add "TQ_NOSLEEP | TQ_NOQUEUE" to the flags when using system_taskq.

Implementation Notes

Each taskq is implemented as a list of tasks protected by a per-taskq lock. One or more worker threads take tasks one by one and execute them by calling f(a) and then sleep, waiting for new entries. A taskq created with a single servicing thread has an important property: it guarantees that all its tasks are executed in the order they are scheduled. When a task queue is created with several servicing threads, task execution order is not predictable.

If you want to look at the actual implementation you need to look at the following files:

History

The first taskq implementation was done by Jeff Bonwick for Solaris 8. It was successfully used to replace many calls to the low-level thread_create() function. I added Dynamic Task Queues in Solaris 9 and used them to completely re-implement the STREAMS scheduler. In Solaris 10 I added DDI interfaces for task queues and also added kstat counters and DTrace probes.


Footnotes:

1 For curious minds: the callout_taskq is used to handle system timers. As an exercise in your DTrace skills, try to figure out what actual timers are firing on each CPU. Hint - use the callout-start SDT probe, which has a pointer to the callout_t structure as its sole argument.


Technorati Tag:
Technorati Tag:
Technorati Tag:
Technorati Tag:

( Jun 16 2005, 05:00:22 PM PDT ) Permalink Comments [6]

20050615 Wednesday June 15, 2005

MDB for STREAMS

OpenSolaris includes a very powerful and modular debugger MDB that is an invaluable tool for analyzing crash dumps and live systems. It comes with a comprehensive manual which is a great read. Here is a brief information for brave souls having to debug STREAMS-related issues.

Queues

The most useful MDB dcmd for most module and driver developers is ::queue. Typing ::help queue will show the full usage information:


NAME
  queue - filter and display STREAM queue

SYNOPSIS
  addr ::queue [-q|v] [-m mod] [-f flag] [-F flag] [-s syncq_addr]

DESCRIPTION
  
  Print queue information for a given queue pointer.
  
  Without the address of a "queue_t" structure given, print information about al
  l
  queues in the "queue_cache".
  
  Options:
        -v:             be verbose - print symbolic flags falues
        -q:             be quiet - print queue pointer only
        -f flag:        print only queues with flag set
        -F flag:        print only queues with flag NOT set
        -m modname:     print only queues with specified module name
        -s syncq_addr:  print only queues which use specified syncq
  
  Available conversions:
        q2rdq:          given a queue addr print read queue pointer
        q2wrq:          given a queue addr print write queue pointer
        q2otherq:       given a queue addr print other queue pointer
        q2syncq:        given a queue addr print syncq pointer (::help syncq)
        q2stream:       given a queue addr print its stream pointer
                (see ::help stream and ::help stdata)
  
  To walk q_next pointer of the queue use
        queue_addr::walk qnext

ATTRIBUTES

  Target: kvm
  Module: genunix
  Interface Stability: Unstable

If you just type ::queue, MDB will print information about all opened queue instances 1 in the system. For example, on my desktop:


> ::queue
    ADDR MODULE         FLAGS NBLK
d0353008 fifostrrhead  044032    0 00000000
d0353180 ip            204032    0 00000000
d03532f8 consms        202032    0 00000000
d0353470 strrhead      044032    0 00000000
d03535e8 tcp           204032    0 00000000
d0353760 strrhead      044032    0 00000000
d03538d8 conskbd       20c032    0 00000000
d0353a50 wc            242032    0 00000000
d0353bc8 ip            204032    0 00000000
d0353d40 strrhead      044032    0 00000000
d033f010 usbms         002032    0 00000000
d033f188 hid           201032    0 00000000
d033f300 consms        242032    0 00000000
d033f478 rts           108832    0 00000000
 ...

What you see in the output is the actual read queue address for each open module or instance, the module name, queue flags, number of messages on the module queue and the pointer to the first message in the queue. You can translate hex flag values into symbolic names adding -v option:


> ::queue -v
    ADDR MODULE         FLAGS NBLK
d0353008 fifostrrhead  044032    0 00000000
                       |
                       +-->  QWANTR       Someone wants to read Q
                             QREADR       This is the reader (first) Q
                             QUSE         This queue in use (allocation)
                             QMTSAFE      stream module is MT-safe
                             QEND         last queue in stream
d0353180 ip            204032    0 00000000
                       |
                       +-->  QWANTR       Someone wants to read Q
                             QREADR       This is the reader (first) Q
                             QUSE         This queue in use (allocation)
                             QMTSAFE      stream module is MT-safe
                             QISDRV       the Queue is attached to a driver
 ...

You can ask about specific queue by providing its address for the ::queue command:


> d0353180::queue -v                  
    ADDR MODULE         FLAGS NBLK
d0353180 ip            204032    0 00000000
                       |
                       +-->  QWANTR       Someone wants to read Q
                             QREADR       This is the reader (first) Q
                             QUSE         This queue in use (allocation)
                             QMTSAFE      stream module is MT-safe
                             QISDRV       the Queue is attached to a driver

What if you want to find all open instances of a specific module? This is easy with the -m flag. For example, to find all open instances of IP you can type


> ::queue -m ip
d0353180
d0353bc8
d197e5f8
d2263020
  ...

Notice that when you use any of the filtering flags, the ::queuedcmd prints only the address of matching queues. You can still get more detailed information by pipelining the output to the ::queue dcmd:


> ::queue -m ip | ::queue
    ADDR MODULE         FLAGS NBLK
d0353180 ip            204032    0 00000000
d0353bc8 ip            204032    0 00000000
d197e5f8 ip            244032    0 00000000
d2263020 ip            204032    0 00000000
  ...

You can use filtering options to find all queues with specific flag value set:


> ::queue -f QISDRV|::queue
    ADDR MODULE         FLAGS NBLK
d0353180 ip            204032    0 00000000
d03532f8 consms        202032    0 00000000
d03535e8 tcp           204032    0 00000000
d03538d8 conskbd       20c032    0 00000000
d0353a50 wc            242032    0 00000000
  ...

prints information about all driver queues. What if you want information about modules and not drivers? Easy:


> ::queue -F QISDRV|::queue
    ADDR MODULE         FLAGS NBLK
d0353008 fifostrrhead  044032    0 00000000
d0353470 strrhead      044032    0 00000000
d0353760 strrhead      044032    0 00000000
d0353d40 strrhead      044032    0 00000000
d033f010 usbms         002032    0 00000000
d033f478 rts           108832    0 00000000
  ...

Similarly you can find all queues which are flow controlled. Here is an example from a real core dump:


> ::queue -f QFULL|::queue
            ADDR MODULE         FLAGS NBLK
ffffffff83e9d000 strrhead      04403c   64 ffffffff953c0600
ffffffff82548018 strrhead      04403c  331 ffffffff949c2680
ffffffff85f4ba48 timod         00083c    7 ffffffffbfa04140
ffffffff83acf2e8 timod         00083c    3 ffffffff84659f00

Careful reader will notice that this will print information only about flow controlled read side queues. How about write side queues? Here comes a useful little ::q2wrq dcmd:


> ::queue -q | ::q2wrq | ::queue -f QFULL|::queue -v
            ADDR MODULE         FLAGS NBLK
ffffffff83cb0388 tl            24402c  205 fffffe80e3017040
                               |
                               +-->  QWANTW       Someone wants to write Q
                                     QFULL        Q is considered full
                                     QUSE         This queue in use (allocation)
                                     QMTSAFE      stream module is MT-safe
                                     QEND         last queue in stream
                                     QISDRV       the Queue is attached to a driver

The ::q2wrq dcmd simply maps the read queue pointer to the write queue pointer. The ::q2rdq performs the opposite mapping and the ::q2otherq dcmd maps read queue pointer to the write queue pointer and visa versa.

Stream Heads

Once you are comfortable dealing with individual queues you may wonder about the state of a whole open STREAM. The simplest way to find a stream containing a given queue is to apply a ::q2stream dcmd which maps a queue pointer to the stream head pointer. You can use a nice ::stream dcmd to display the whole stream graphically:


> ffffffff83acf2e8::q2stream|::stream

+-----------------------+-----------------------+
| 0xffffffff82548110    | 0xffffffff82548018    | 
| strwhead              | strrhead              | 
|                       |                       |
| cnt = 0t0             | cnt = 0t57420         | 
| flg = 0x00004022      | flg = 0x0004403c      | 
+-----------------------+-----------------------+
            |                       ^
            v                       |
+-----------------------+-----------------------+
| 0xffffffff83acf3e0    | 0xffffffff83acf2e8    | 
| timod                 | timod                 | 
|                       |                       |
| cnt = 0t0             | cnt = 0t528           | 
| flg = 0x00000822      | flg = 0x0000083c      | 
+-----------------------+-----------------------+
            |                       ^
            v                       |
+-----------------------+-----------------------+
| 0xffffffff83acf670    | 0xffffffff83acf578    | 
| udp                   | udp                   | 
|                       |                       |
| cnt = 0t0             | cnt = 0t0             | 
| flg = 0x00000822      | flg = 0x00000832      | 
+-----------------------+-----------------------+
            |                       ^
            v                       |
+-----------------------+-----------------------+
| 0xffffffff82a86128    | 0xffffffff82a86030    | 
| ip                    | ip                    | 
|                       |                       |
| cnt = 0t0             | cnt = 0t0             | 
| flg = 0x00244022      | flg = 0x00204032      | 
+-----------------------+-----------------------+

You can see all read and write queues of the stream and their state. The read side stream head is not very happy - it has 57420 bytes hanging around and blocking another 528 bytes in timod.

>p> Another way to look at the stream is to walk the read or write side using the qnext walker:


> 0xffffffff82548110::walk qnext|::queue 
            ADDR MODULE         FLAGS NBLK
ffffffff82548110 strwhead      004022    0 0000000000000000
ffffffff83acf3e0 timod         000822    0 0000000000000000
ffffffff83acf670 udp           000822    0 0000000000000000
ffffffff82a86128 ip            244022    0 0000000000000000

> 0xffffffff82a86030::walk qnext|::queue 
            ADDR MODULE         FLAGS NBLK
ffffffff82a86030 ip            204032    0 0000000000000000
ffffffff83acf578 udp           000832    0 0000000000000000
ffffffff83acf2e8 timod         00083c    3 ffffffff84659f00
ffffffff82548018 strrhead      04403c  331 ffffffff949c2680

A careful reader will notice that the ::stream dcmd displays how many bytes bytes are in the queue while ::queue displays ho many messages are there. Do we want to know what are these messages? here comes the next useful dcmd.

Messages

Let me introduce a new player - the ::mblk dcmd. Typing ::help mblk will show you all the gory details:


NAME
  mblk - print an mblk

SYNOPSIS
  addr ::mblk [-q|v] [-f|F flag] [-t|T type] [-l|L|B len] [-d dbaddr]

DESCRIPTION
  
  Print mblock information for a given mblk pointer.
  Without the address, print information about all mblocks.
  
  Fields printed:
        ADDR:   mblk address
        FL:     Flags
        TYPE:   Type of corresponding dblock
        LEN:    Data length as b_wptr - b_rptr
        BLEN:   Dblock space as db_lim - db_base
        RPTR:   Read pointer
        DBLK:   Dblock pointer
  
  Options:
        -v:             be verbose - print symbolic flags falues
        -q:             be quiet - print mblk pointer only
        -d dbaddr:      print mblks with specified dblk address
        -f flag:        print only mblks with flag set
        -F flag:        print only mblks with flag NOT set
        -t type:        print only mblks of specified db_type
        -T type:        print only mblks other then the specified db_type
        -l len:         tprint only mblks with MBLKL == len
        -L len:         print only mblks with MBLKL <= len 
        -G len:         print only mblks with MBLKL >= len 
        -b len:         print only mblks with db_lim - db_base == len
  

ATTRIBUTES

  Target: kvm
  Module: genunix
  Interface Stability: Unstable

It is easy to look at a specific message:


> ffffffff949c2680::mblk
            ADDR FL TYPE    LEN   BLEN              RPTR             DBLK
ffffffff949c2680 0  proto   56    80    ffffffff88f145b0 ffffffff88f14540

This output shows us that this is an M_PROTO message with 56 bytes of information starting at address ffffffff88f145b02. We know that messages like to hang out together. How do we print the whole b_cont chain? Like this:


> ffffffff949c2680::walk b_cont|::mblk
            ADDR FL TYPE    LEN   BLEN              RPTR             DBLK
ffffffff949c2680 0  proto   56    80    ffffffff88f145b0 ffffffff88f14540
ffffffff943b5240 0  data    108   144   ffffffff9868878c ffffffff98688700

Now we can see both the M_PROTO and the attached M_DATA messages, and we can see that UDP is sending 108 bytes of data upstream. Similarly, the b_next walker will follow the b_next chain:


> ffffffff84659f00::walk b_next|::mblk
            ADDR FL TYPE    LEN   BLEN              RPTR             DBLK
ffffffff84659f00 0  proto   56    80    ffffffff94257c70 ffffffff94257c00
fffffe80e3078340 0  proto   56    80    ffffffff94257bb0 ffffffff94257b40
ffffffff84658a40 0  proto   56    80    ffffffff942575b0 ffffffff94257540

And, of course, we can combine both in the pipeline:


> ffffffff84659f00::walk b_next|::walk b_cont|::mblk
            ADDR FL TYPE    LEN   BLEN              RPTR             DBLK
ffffffff84659f00 0  proto   56    80    ffffffff94257c70 ffffffff94257c00
ffffffff84659300 0  data    120   208   ffffffff82fc144c ffffffff82fc13c0
fffffe80e3078340 0  proto   56    80    ffffffff94257bb0 ffffffff94257b40
ffffffff846599c0 0  data    120   208   ffffffff8465120c ffffffff84651180
ffffffff84658a40 0  proto   56    80    ffffffff942575b0 ffffffff94257540
ffffffff84658b00 0  data    120   208   ffffffff84651d4c ffffffff84651cc0

If you are really curious, you can see all allocated messages by simply typing ::mblk:


> ::mblk
    ADDR FL TYPE    LEN   BLEN      RPTR     DBLK
d6d0e000 0  proto   24    64    cfedaf00 cfedaec0
d6d0e020 0  data    0     8192  d377d400 d23d7c00
d6d0e040 0  data    3     64    dc14d580 dc14d540
d6d0e060 0  proto   156   320   cfedbd00 cfedbcc0
d6d0e080 0  data    0     64    d366a9c4 d366a980
d6d0e0a0 0  data    80    320   cfedbe80 cfedbe40
d6d0e0c0 0  proto   24    64    cfe5e0c0 cfe5e080
d6d0e0e0 0  data    115   320   cfe490e7 cfe49080

As with the ::queue command you can filter by the message type. For example, you can look at M_DATA messages3 only:


> ::mblk -t M_DATA|::mblk
    ADDR FL TYPE    LEN   BLEN      RPTR     DBLK
d6d0e020 0  data    0     8192  d377d400 d23d7c00
d6d0e040 0  data    16    64    dc14d580 dc14d540
d6d0e080 0  data    0     64    d366a9c4 d366a980
d6d0e0a0 0  data    80    320   cfedbe80 cfedbe40
d6d0e0e0 0  data    115   320   cfe490e7 cfe49080
d6d0e140 0  data    0     64    d77c7480 d77c7440

Or you can find all 0-bytes messages:


> ::mblk -l 0|::mblk
    ADDR FL TYPE    LEN   BLEN      RPTR     DBLK
d6d0e020 0  data    0     8192  d377d400 d23d7c00
d6d0e040 0  data    0     64    dc1ab714 dc1ab6c0
d6d0e080 0  data    0     64    d366a9c4 d366a980
d6d0e140 0  data    0     64    d77c7480 d77c7440
...

In my opinion these example shows why MDB is really cool for kernel debugging. It can present huge amount of information in an informative way. Want to know how many 1-byte messages are floating around your system?


> ::mblk -l 0t1 ! wc -l 
      51

Conclusions

We flexed just a small muscle of the MDB and we already can go a long way in debugging complicated STREAMS issues in the kernel. We can study individual queues, STREAMS, messages, or get a bird's eye view and look at the picture from some distance. Whenever you encounter a difficult debugging problem MDB is your friend (if you know how to talk to it politely).


Footnotes:

1The ::queue command will print all read queues.

2The UDP module talks a protocol called TPI, so the M_PROTO message is one of the TPI primitives. The first part of the primitive is always its ID, we can get this by typing


> ffffffff88f145b0/D                      
0xffffffff88f145b0:             20              

Looking at usr/src/uts/common/sys/tihdr.h we see that this is the T_UNITDATA_IND message (makes sense). We can print the whole message now:


> ffffffff88f145b0::print 'struct T_unitdata_ind'
{
    PRIM_type = 0x14
    SRC_length = 0x10
    SRC_offset = 0x14
    OPT_length = 0x14
    OPT_offset = 0x24
}

3 Underground hackers special:


> ::mblk -t M_DATA|::print mblk_t b_rptr|/s
Jun 15 09:57:16 lpr[13390]: [ID 575460 FACILITY_AND_PRIORITY] net_response(18)
NOTICE: NFS4 FACT SHEET: 
 Action: NR_CLIENTID 
 NFS4 error: NFS4ERR_STALE_CLIENTID 
 Suspected server reboot.   
Jun 15 09:57:16 lpr[13390]: [ID 626751 FACILITY_AND_PRIORITY] net_write(18, 0x8045478, 12)
NOTICE: [NFS4][Server: onnv.sfbay][Mntpt: /ws/onnv-gate/usr]NFS server onnv.sfbay not responding; still trying
·gp0Jun 15 09:57:16 lpr[13390]: [ID 458697 FACILITY_AND_PRIORITY] net_open(bulka.SFBay, 5)
NOTICE: [NFS4][Server: onnv.sfbay][Mntpt: /ws/onnv-gate/usr]NFS server onnv.sfbay ok
...

Prints the content of all the data messages. I can see eyes flashing in the darkness!


Technorati Tag:
Technorati Tag:
Technorati Tag:
Technorati Tag:

( Jun 15 2005, 12:29:08 PM PDT ) Permalink Comments [1]

20050614 Tuesday June 14, 2005

Putnext and stack overflows

Now, when the OpenSolaris is finally a reality, engineers can really start talking about the interesting staff. We spend most of our time dealing with code and it quite difficult to talk about what you do without being able to provide examples. Now is the time for the real technical blogging instead of the hand-waving.

I spent quite a lot of my time at Sun hacking STREAMS internals - things that almost no one ever notices, except for some unlucky folks who run into some nasty issues. Here I want to continue the sewer tour, started by Bryan Cantrill and take interested visitors to some of the STREAMS sewers.

Our tour will start with a few lines of code in the putnext() function. In the original implementation it was just a simple macro calling the put procedure of the next module in the STREAM:


#define putnext(q,  mp) ((*(q)->q_next->q_qinfo->qi_putp)((q)->q_next, (mp)))

In Solaris it evolved into a rather complicated code1. Its inner workings deserve quite a few separate blog entries, but now we will take a side tour to a small piece of code that may catch your attention:


        /*
         * If there are writers or exclusive waiters, there is not much
         * we can do.  Place the message on the syncq and schedule a
         * background thread to drain it.
         *
         * Also if we are approaching end of stack, fill the syncq and
         * switch processing to a background thread - see comments on
         * top.
         */
        if ((flags & (SQ_STAYAWAY|SQ_EXCL|SQ_EVENTS)) ||
            (sq->sq_needexcl != 0) || PUT_STACK_NOTENOUGH()) {

                /*
                 * NOTE: qfill_syncq will need QLOCK. It is safe to drop
                 * SQLOCK because positive sq_count keeps the syncq from
                 * closing.
                 */
                mutex_exit(SQLOCK(sq));

                qfill_syncq(sq, qp, mp);
                /*
                 * NOTE: after the call to qfill_syncq() qp may be
                 * closed, both qp and sq should not be referenced at
                 * this point.
                 *
                 * This ASSERT is located here to prevent stack frame
                 * consumption in the DEBUG code.2
                 */
                ASSERT(sqciplock == NULL);
                return;
        }

The code uses a notion of a syncq, which is a synchronization abstraction used by STREAMS. Whenever some module calls the putnext() function, the code checks for various conditions and if everything seems all-right, it just calls the put procedure of the next module in the STREAM. If something doesn't seem right (e.g. the target module is busy processing other messages), the message is placed on the special queue in the syncq and the framework arranges a different kernel thread to pass the enqueued message to the next module when the module is ready3. The important observation is that this kernel thread executes has its own fresh stack. The highlighted code above hijacks this normal STREAMS mechanism to protect from a rather nasty problem - kernel stack overflow.

Note that STREAMS is a very flexible framework that allows many modules to be linked together in a chain (or even something more complex like a tree or U-pipe). Every new Solaris release includes some new STREAMS modules and drivers that often play in concert to provide some exciting new functionality. The downside is that the chain-calling of module's put procedures creates really long call chains on the stack and sometimes causes kernel panics caused by the stack overflow.

Soon after Solaris 9 release I saw quite a few similar panics, happening because all the kernel thread stack is consumed by a pile of STREAMS modules. Here is a typical example:


  402760b0 allocb+0x180()
  40276110 ip_wput_frag_copyhdr+0x14()
  40276170 ip_wput_frag+0x124()
  40276290 ip_wput_ire+0x1a98()
  402763e8 ire_send+0x218()
  40276448 ire_add_then_send+0x304()
  402764b0 ip_newroute+0xb78()
  40276600 ire_send+0x198()
  40276660 ire_add_then_send+0x304()
  402766c8 ip_wput_nondata+0x874()
  40276748 putnext+0x400()
  402767a8 ar_query_reply+0x150()
  40276808 ar_entry_query+0x154()
  40276878 ar_rput+0x144()
  402768e8 putnext+0x400()
  40276948 ip_newroute+0x1760()
  40276a98 putnext+0x400()
  40276af8 udp_wput+0x63c()
  40276b80 putnext+0x400()
  40276be0 putnext+0x400()
  40276c40 strput+0x57c()
  40276d68 kstrputmsg+0x3bc()
  40276de8 tli_send+0x20()
  40276e48 t_ksndudata+0x284()
  40276ea8 clnt_clts_kcallit_addr+0x3dc()
  40276f90 clnt_clts_kcallit+0x38()
  40277000 rfscall+0x46c()
  402770b8 rfs3call+0x68()
  40277130 nfs3write+0x11c()
  402772b0 nfs3_bio+0x334()
  40277310 nfs3_rdwrlbn+0xf0()
  40277380 nfs3_sync_putapage+0x34()
  402773e0 nfs3_putapage+0x364()
  40277470 pvn_vplist_dirty+0x424()
  40277568 nfs_putpages+0x174()
  402775e0 nfs3_putpage+0xa8()
  40277640 nfs_purge_caches+0x98()
  402776a0 nfs_cache_check+0xf8()
  40277700 nfs3_getattr_otw+0x12c()
  40277840 nfs3_validate_caches+0x11c()
  40277908 nfs3_getpage+0x44()
  40277990 segvn_fault+0xa74()
  40277a90 as_fault+0x4c4()
  40277b40 pagefault+0x40()
  40277ba8 trap+0xd90()
  40277c80 utl0+0x4c()

Although each modules uses just a small amount, numbers quickly add up. It is like going to the grocery store to buy a whole bunch of small items. Each one is really cheap, but you end up paying a lot at the counter.

At the time I was fixing some bugs in the syncq implementation and spent a lot of time fooling around putnext() and its friends, so I spotted the possibility to use the existing putnext() ability to delegate work to a new kernel thread to avoid such panics. The idea was really simple: in addition to the usual work handout in case the perimeter is busy, perform the same handout when we are too close to blowing away the stack. So I added the highlighted code above together with the definition for PUT_STACK_NOTENOUGH:

#define PUT_STACK_NEEDED 5000

#define PUT_STACK_NOTENOUGH()                                   \
        (((STACK_BIAS + (uintptr_t)getfp() -                    \
            (uintptr_t)curthread->t_stkbase) < put_stack_needed))

The value of PUT_STACK_NEEDED was chosen experimentally. It is impossible to predict how much stack will be used in the future. For example, a simple call to allocb() may, in some unlucky circumstances, trigger a long chain of calls through the kmem and vmem memory allocation layers. So the value of PUT_STACK_NEEDED was chosen to prevent common panics that we saw during the PIT runs 4.

I run the fix through the real kernel experts - Bryan Cantrill, who fixed a tricky kernel stack overflow problem before5 and Mike Shapiro. Mike suggested making PUT_STACK_NOTENOUGH a generic kernel function for others to use, while Bryan objected. We had a heated meeting in Mike's office and as a result, the following comment appeared at the top of putnext.c file:

/*
 * Streams with many modules may create long chains of calls via putnext() which
 * may exhaust stack space. When putnext detects that the stack space left is
 * too small (less then PUT_STACK_NEEDED), the call chain is broken and
 * further processing is delegated to the background thread via call to
 * putnext_tail(). Unfortunately there is no generic solution with fixed stack
 * size, and putnext() is recursive function, so this hack is a necessary evil.
 *
 * The redzone value is chosen dependent on the default stack size which is 8K
 * on 32-bit kernels and on x86 and 16K on 64-bit kernels. The values are chosen
 * empirically. For 64-bit kernels it is 5000 and for 32-bit kernels it is 2500.
 * Experiments showed that 2500 is not enough for 64-bit kernels and 2048 is not
 * enough for 32-bit.
 *
 * The redzone value is a tuneable rather then a constant to allow adjustments
 * in the field.
 *
 * The check in PUT_STACK_NOTENOUGH is taken from segkp_map_red() function. It
 * is possible to define it as a generic function exported by seg_kp, but
 *
 * a) It may sound like an open invitation to use the facility indiscriminately.
 * b) It adds extra function call in putnext path.
 *
 * We keep a global counter `put_stack_notenough' which keeps track how many
 * times the stack switching hack was used.
 */

The hack was integrated6 early in Solaris 10 and backported to S9 updates. Later in S10 timeframe another engineer did the real fix for the stack overflow problem - increased the stack size on 64-bit kernel (see bug 4922366 ). It is worth mentioning a comment, made by Jeff Bonwick in the evaluation of this bug:

Yep. Grow the kernel stack. Memory is cheap. Panics are expensive.

I've been through several kernel stack crises before. They always unfold the same way. Some particular workload goes too deep. We prune a few stack frames to fix the offending code path. Then another one comes up. And another. (Right about now someone suggests that instead of growing the stack for every thread, we should finally bite the bullet and make the kernel stack growable. I dig up my mail archive from the last time we contemplated this, and explain why it's much harder than it sounds.) Eventually the panic rate becomes so high that we have to act. Nobody can figure out how to make dynamic stack growth work reliably, so after one more jurassic outage we accept physics and grow the stack.

So my proposal is that this time, we dispense with all the hand-wringing and just grow the damn thing.

Source: Jeff Bonwick comment to bug 4922366

Postscriptum

As it often happens, the stack protection hack uncovered another long-standing and interesting bug7 in Solaris qfill_syncq() function, but this is a subject of another blog... .


Footnotes:

1 The added complexity comes from the multi-threaded nature of the kernel. While one thread tries to access q_next, another thread may change it at the same, which leads to chaos. Solaris STREAMS provide a rich set of synchronization mechanism, called STREAMS perimeters which simplify the life of module and driver writers at the cost of internal complexity of the implementation.

2 Modern compilers are very smart and can optimize a function call immediately followed by a return statement so that the callee reuses the stack frame of the caller. This is called tail-call optimization. It saves the stack space, but obscures debugging and sometimes even the DTrace. The problem is that the function call back-trace you see in the stack trace does not accurately represent the actuall calling sequences in the presence of tail-calls. To simplify debugging the seemingly useless ASSERT prevents such tail-call optimization on DEBUG kernels, while keeping all the performance benefits of production builds.

3 The STREAMS framework is very careful to pass down all messages in the order received.

4 It turned out that the stack barrier value of 5000 was too aggressive for 8K stacks on 32-bit sparc systems, so it was adjusted to 2500 on 32-bit platforms.

5 Bryan fixed bug 1259818 back in Solaris 2.6.

6 This was fixed as bug 4525533.

7 Don't spend your time trying to find the bug in qfill_syncq() function - it was fixed before S10 was released.


Technorati Tag:
Technorati Tag:
Technorati Tag:

( Jun 14 2005, 01:20:00 PM PDT ) Permalink Comments [2]

Calendar

RSS Feeds

Search

Links

Navigation

Referers