Andrew Rutz's blog

Friday Jun 29, 2007

Recursive panic

I will dispell any rumor to the contrary: it is possible to call panic() during the execution of panic(). As is true for any (language that supports it and) function that performs it, this is an "act of recursion" !

I recently stumbled across an instance of a recursive-panic in a core file I was analyzing. After attaching mdb to the core file and running ::msgbuf to dump the console messages written by the kernel to its circular, in-memory buffer, I noticed a "curious, single line" buried within its output:

panic sync timeout
An annotated replay of the "tail" of ::msgbuf's output is shown here, along with our "curious, single line":
# ls
bounds    unix.0    vmcore.0
# mdb -k 0
Loading modules: [ unix krtld ... ]
> ::msgbuf
[...]
panic[cpu17]/thread=3000372c680:
BAD TRAP: type=28 rp=2a101871200 addr=124ed4c mmu_fsr=0

event_pvr_exec-5:
integer divide zero trap:
addr=0x124ed4c
pid=3751, pc=0x124ed4c, sp=0x2a101870aa1, tstate=0x80001606, context=0x1d64
g1-g7: ffff, fc00, 4000, 0, 0, 0, 3000372c680
000002a101870f20 unix:die+9c (28, 2a101871200, 124ed4c, 0, 2a101870fe0, 10000)
  %l0-3: 0000000000000000 0000000000000028 000000000000000f 000000000179ed54
  %l4-7: 0000030000074f40 0000000000000061 000000000179ed53 0000000001074000
000002a101871000 unix:trap+560 (2a101871200, 58, 0, 0, 300019aa000, 3000372c680)
  %l0-3: 0000000000000000 0000060012e8efd0 0000000000000028 00000600133be388
  %l4-7: 0000000000000000 000006001f282700 0000000000000001 00000300019aa180
000002a101871150 unix:ktl0+48 (4000, 40, 0, 0, 3002189493c, 0)
  %l0-3: 0000000000000007 0000000000001400 0000000080001606 00000000010195c4
  %l4-7: 0000000000f0007f 0000000000000100 0000000000000000 000002a101871200
000002a1018712a0 ssd:ssd_get_physical_geometry+204 (3001eda2898, 2a10187140c, 0, 0, 0, 60003df3580)
  %l0-3: 0000030021894938 0000000000000024 0000000000000000 000000007fffffff
  %l4-7: 0000000080000000 0000000080000000 000003002189493c 0000000000004000
000002a101871350 ssd:ssd_resync_geom_caches+b4 (60003df3580, ffffffff80000000, 200, 1, 3ec1, ff)
  %l0-3: 00000000000000ff 000000000000003f 0000030023a7b420 000003000e760780
  %l4-7: 00000000003f0000 0000000000ff0000 0000060003de0b00 0000000000020a2a
000002a101871420 ssd:ssd_validate_geometry+b4 (60003df3580, 1, 3000372c838, 1, 7, fc000092)
  %l0-3: 000002a101871484 0000000000000006 0000000000000200 000000000179ed52
  %l4-7: 0000000080000000 000000000000003c 0000060003de0b00 00000000018b8c00
000002a1018714e0 ssd:ssd_ready_and_valid+2cc (60003df3580, 3000372c680, 60003df36c0, 0, 2, 2)
  %l0-3: 0000000000000000 00000000018acd68 0000030000008ff0 0000060003de0b00
  %l4-7: 0000000000000000 0000000000000000 0000060003de0b00 0000060003de0b00
000002a1018715f0 ssd:ssdopen+260 (8, 60003df3580, 3, 18cfb68, 2, 60003df3650)
  %l0-3: 0000000000000000 0000000000000002 000000760000013b 0000000000000000
  %l4-7: 0000060003de0b00 0000000000000001 00000000018c7c00 0000000000000000
000002a1018716a0 specfs:spec_open+438 (2a101871930, 1, 60002c01ee8, 18c88f0, 600138a0de8, 30010464c80)
  %l0-3: 0000000000000001 0000060016670f40 0000000000000000 00000000ffffffff
  %l4-7: 0000000000000000 00000600166b13e0 00000600138a0e70 000000760000013b
000002a101871760 genunix:fop_open+78 (2a101871930, 2, 60002c01ee8, 1, 60016670f40, 1)
  %l0-3: 0000000000000001 00000300000a4dd8 00000600068a23d8 00000600068c9738
  %l4-7: 0000000000000000 00000600068a23f0 00000600068c9738 0000000000000004
000002a101871810 genunix:vn_openat+4d4 (0, 99c, 0, 1, 0, 1)
  %l0-3: 000000007fffffff 0000000000000000 0000000000000000 0000000000000000
  %l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
000002a1018719d0 genunix:copen+260 (ffffffffffd19553, ffbff60c, 0, 3b99c, 0, 1)
  %l0-3: 00000600133be590 00000600133be5b0 000003501b56956d 0000000000000000
  %l4-7: 00000000018a7c00 000000000000099c 0000000000000012 0000060012e8efd0

syncing file systems...
 
panic[cpu17]/thread=3000372c680:
panic sync timeout

dumping to /dev/md/dsk/d1, offset 64422019072, content: kernel
The characters in green were output by the first call to panic(), which was caused by a division-by-zero trap that occurred in the function ssd_get_physical_geometry(). The output shown in red is the output from the second panic (eg, the panic that occurred while servicing the first panic).

Our "curious, single line" (1), shown below, is found at the articulation between the first and second panics. The first panic calls vfs_syncall() to write any in-memory representation of a filesystem to its associated backing-store (2). As will be seen shortly, this execution times-out, and panic() is called again (3), this time with the argument "panic sync timeout". panic()'s design for re-entrancy ensures that most of its implementation is not repeated. Progress resumes with the writing of the core file to disk (4).


[...]
syncing file systems...             (2)
 
panic[cpu17]/thread=3000372c680:    (3)
panic sync timeout                  (1)

dumping to /dev/md/dsk/d1, offset 64422019072, content: kernel    (4)
panic()'s re-entrancy includes the use of "panic triggers", which are three global booleans which are atomically updated at specific points in panic()'s control flow:
  /*
   * Triggers for panic state transitions:
   */
 int panic_quiesce;                      /* trigger for CALM    -> QUIESCE */
 int panic_sync;                         /* trigger for QUIESCE -> SYNC */
 int panic_dump;                         /* trigger for SYNC    -> DUMP */
panic_trigger() is the function that atomically updates a specified trigger and returns True if the current call to the function caused the first (and only) change from an un-triggered state to a triggered state. If panic_trigger() returns False, then panic() knows that the associated trigger has already been triggered, and that an associated critical section of panic()'s control flow must not be run (again).

The high-level control flow of panic() is:

panic
  vpanic
    panicsys
      ...
      flush_windows()       ;; (on SPARC, write register stack to memory stack
      panic_trigger(&panic_quiesce)  ;; is first time panic has been called ?
      if is-first-panic
        switch thread's stack pointer, %sp, to panic_stack
      save registers to memory stack
      panicsys
        ...
        if (on_panic_stack)     ;; eg, "if is first-panic ..."
          ...
          panic_stopcpus()      ;; idle the current thread on (n-1) cpu's
          ...
          save lots of values in panic_xxx variables   ;; eg, panic_thread
          ...
        if panic_trigger(&panic_sync)    ;; have we not sync'd filesystems ?
          set cpu's interrupt level to ten  ;; CLOCK_LEVEL
          do_polled_io = 1;                 ;; force drivers to use polling vs interrupts
          vfs_syncall()

        if panic_trigger(&panic_dump)   ;; have we not written core file ? 
          set cpu's interrupt level to ten
          do_polled_io = 1;     ;; us polling vs interrupts  
          dumpsys()
        do a reboot    ;;  mdboot(A_REBOOT, ...)
The call-sites to the three triggers are shown in bold characters. As can now be seen, the first panic triggers panic_quiesce and panic_sync, and is in its call to vfs_syncall (to sync all the filesystems). When this sync'ing causes a panic, this second panic does not (re-)trigger either panic_quiesce or panic_sync, thus skipping much of panic's control flow. However, the second panic does trigger panic_dump, causing a call to dumpsys() to create the core file (using polled-IO, vs. interrupts).

"OK, this all 'sounds nice', but how do we know that this is what happened ?"

The trick lies in knowing which stack pointer, %sp, to use when "dumping the stack".

The stack above that is shown in green is the stack at the time the current thread's division-by-zero trap forced control to flow to die()'s call to panic(). The thread that enters panic() (and carries out panic()'s actions) is the same kernel thread (0x3000372c680) that was handling the open() system call (see frame at bottom of stack).

We also know that this same thread was the thread that was executing in vfs_syncall() when it panic'd again, as the same thread address is shown for the second panic:


[...]
syncing file systems...           
 
panic[cpu17]/thread=3000372c680: <----- !!!
panic sync timeout                 

dumping to /dev/md/dsk/d1, offset 64422019072, content: kernel    (4)
We also know that panic() obviously tries hard to ensure that "the dump" shows the state as it existed for any first panic, which means anything in the green part of the stack obviously only represents the first panic.

The trick is to look in what turns out to be the most obvious place, the per-thread state (kthread_t) of the thread that caused the second panic. When a thread is idled, Solaris saves the thread's last PC and stack pointer in kthread_t.t_pcb:

# mdb ...
> 3000372c680::print kthread_t t_pcb
{
    t_pcb.val = [ 0x105bec4, 0x2a100beb2d1 ]
}
> 0x105bec4/ia
panicsys+0x48:  call      -0x1aa88      
panicsys+0x4c:  
If we dump the stack using this stack-pointer, we obtain a lot of insight:
> 3000372c680::print kthread_t t_pcb.val[1] | ::stack
vpanic+0xcc(11d8d20, 2a100bebe78, 22bc8af, 64, 1813c00, 0)
panic+0x1c(11d8d20, 18ab000, 0, 18ab000, 1, 11d8c00)
cyclic_fire+0x64(60003df4eb8, 14b5188803b50, 3b9ac800, 0, 60003dfcb08, 60003e6f6f8)
cbe_level14+0x38(0, 0, e, 300019aa000, 20080, 100b170)
current_thread+0x140(1, 6001708de98, 1, 2710, 10741b8, 183c800)
clnt_cots_kcallit+0x580(6001708de40, 64, 0, 6001708de40, 6001708de60, 6001708deb8)
rfscall+0x58c(60002d52380, 15, 18e9ab0, 3c, 1841dd0, 18371e0)
rfs3call+0x60(60002d52380, 15, 7afa466c, 1842098, 7afa4820, 1842000)
nfs3_commit+0xc0(3001052f500, 1, 30012db13d0, 60002c01ee8, 30016159a38, 300161599c8)
nfs3_dispose+0x2c0(3001052f500, 32300000, 8000, 0, 60002c01ee8, 30016159bd8)
page_release+0xf8(70007de1680, 1, 168, 1836bb8, 1836800, 18a7000)
pvn_getdirty+0x1fc(70007de1680, 400, 55, e1c0fa355, 0, 1)
pvn_vplist_dirty+0x324(0, 400, 0, 0, 10000, 3001052f500)
nfs_putpages+0xcc(3001052f500, 0, 0, 400, 60002c01ee8, 600076241c0)
nfs3_putpage+0xcc(3001052f500, 0, 0, 1, 60002c01ee8, 0)
rflush+0x1c4(30024040000, 60002c01ee8, 30014f97480, 180a9, 7048efec, 3)
nfs3_sync+0x2c(0, 7048f000, 60002c01ee8, 1, 0, 0)
vfs_sync+0x98(18ae560, 185d550, 185dad0, 0, 0, 1)
vfs_syncall+0x48(a, a, 18b3000, 11d8000, 0, 1)
panicsys+0x4e8(3000372c680, 60002c01ee8, 184b800, 1813c00, 1077800, 0)
vpanic+0xcc(10741b8, 2a101870fa8, 2000, 1, 8, 8)
panic+0x1c(10741b8, 28, 2a101871200, 124ed4c, 0, 1c)
die+0x9c(28, 2a101871200, 124ed4c, 0, 2a101870fe0, 10000)
trap+0x560(2a101871200, 58, 0, 0, 300019aa000, 3000372c680)
ktl0+0x48(4000, 40, 0, 0, 3002189493c, 0)
ssd_get_physical_geometry+0x204(3001eda2898, 2a10187140c, 0, 0, 0, 60003df3580)
ssd_resync_geom_caches+0xb4(60003df3580, ffffffff80000000, 200, 1, 3ec1, ff)
ssd_validate_geometry+0xb4(60003df3580, 1, 3000372c838, 1, 7, fc000092)
ssd_ready_and_valid+0x2cc(60003df3580, 3000372c680, 60003df36c0, 0, 2, 2)
ssdopen+0x260(8, 60003df3580, 3, 18cfb68, 2, 60003df3650)
spec_open+0x438(2a101871930, 1, 60002c01ee8, 18c88f0, 600138a0de8, 30010464c80)
fop_open+0x78(2a101871930, 2, 60002c01ee8, 1, 60016670f40, 1)
vn_openat+0x4d4(0, 99c, 0, 1, 0, 1)
copen+0x260(ffffffffffd19553, ffbff60c, 0, 3b99c, 0, 1)
syscall_trap32+0xcc(ffbff60c, 0, 3b99c, 303240ff, 80808080, 1010101)
The green part of stack is the stack we saw when we ran ::msgbuf, and it represents the canonical stack as was present when the first call to panic() was done. The red part of the stack represents where the first call to panic() is within its control-flow. We can see the call to vfs_syncall(), and we can see for this core file that NFS and RFS code are being used to write out pages to one of their filesystems.

We also see current_thread() being called, which is a well-known function within Solaris. It runs as part of executing a high-level interrupt (eg, an interrupt whole PIL (Processor Interrupt Level)) is greater than LOCK_LEVEL (which currently is ten). As an aside, a non-high-level-interrupt's handler will have intr_thread on the stack). current_thread() calls cbe_level14(), which is used in the Cyclics implementation; "level14" gives a clue that this is a handler for a high-level interrupt whose PIL is fourteen.

The blue part of the stack represents the second panic.

To prove that this stack trace generates the "panic sync timeout" message, we disassemble the code at "cyclic_fire+0x64", as this is the last stack frame before the second panic occurs:

> cyclic_fire+0x64
> cyclic_fire+0x64::dis
cyclic_fire+0x3c:               sllx      %g4, 3, %g3
cyclic_fire+0x40:               ldx       [%g3 + %i5], %i1
cyclic_fire+0x44:               cmp       %i1, %l2
cyclic_fire+0x48:               bg,pn     %xcc, +0x80   <cyclic_fire+0xc8>
cyclic_fire+0x4c:               clr       %i3
cyclic_fire+0x50:               sra       %g2, 0, %o1
cyclic_fire+0x54:               sethi     %hi(0x3b9ac800), %i2
cyclic_fire+0x58:               add       %i5, %g3, %l4
cyclic_fire+0x5c:               add       %i2, 0x200, %l3
cyclic_fire+0x60:               mov       %l1, %o0
cyclic_fire+0x64:               call      -0x114        <cyclic_expire>
cyclic_fire+0x68:               mov       %l4, %o2
[...]
We see a call to cyclic_expire(), and from the source code, we know its signature is
static void cyclic_expire(cyc_cpu_t *cpu, cyc_index_t ndx, cyclic_t *cyclic)
From the instruction at cyclic_fire+0x68, we see that register %o2 received its value from local register %l4. If we dump our blue/red/green stack using ::stackregs instead of ::stack, we'll see this stack frame for the call to cyclic_expire:
> 3000372c680::print kthread_t t_pcb.val[1] | ::stackregs
[...]
000002a100beb5f1 cyclic_fire+0x64(60003df4eb8, 14b5188803b50, 3b9ac800, 0, 60003dfcb08, 60003e6f6f8)
  %l0-%l3:      60003c8a810      60003dab8c0    14b5188803da8         3b9aca00
  %l4-%l7:      60003e6f770    14b51995f5d50                5                1
  cbe_level14+0x38:        call      +0xa8338      
[...]
Dumping the memory referenced by %l4 will show us the third argument to cyclic_expire, which is of type cyclic_t:
> 60003e6f770::print -t cyclic_t      
{
    hrtime_t cy_expire = 0x14b5188803b50
    hrtime_t cy_interval = 0x3b9aca00
    int (*)() cy_handler = deadman
    void *cy_arg = 0
    uint32_t cy_pend = 0
    uint16_t cy_flags = 0
    cyc_level_t cy_level = 0x2
}
This cyclic runs the function deadman(), which has a few purposes, one of which is to check if the sync that is done as part of panic() has timed-out. deadman() checks the panic_sync trigger to see if a sync is in progress. If a time-out has occurred, we can see that line 1833 (below) calls panic() with our "curious, single line":
  1794  static void
  1795  deadman(void)
  1796  {
  1797          if (panicstr) {
[...]
  1819
  1820                  /*
  1821                   * If we are generating a crash dump or syncing filesystems and
  1822                   * the corresponding timer is set, decrement it and re-enter
  1823                   * the panic code to abort it and advance to the next state.
  1824                   * The panic states and triggers are explained in panic.c.
  1825                   */
  1826                  if (panic_dump) {
[...]
  1831                  } else if (panic_sync) {
  1832                          if (sync_timeleft && (--sync_timeleft == 0)) {
  1833                                  panic("panic sync timeout");
  1834                                  /*NOTREACHED*/
  1835                          }
  1836                  }
[...]
So, the Big Lesson here is that every line panic() does (and does NOT) print could be important in understanding what really occurred. Read every line of panic's output !!! =:-)

Comments:

Post a Comment:
  • HTML Syntax: NOT allowed

Calendar

Search my blog

Links

Navigation