Andrew Rutz's blog
Recursive panic
I will dispell any rumor to the contrary: it is possible to call panic() during the execution of panic(). As is true for any (language that supports it and) function that performs it, this is an "act of recursion" !
I recently stumbled across an instance of a recursive-panic in a core file I was analyzing. After
attaching mdb to the core file and running ::msgbuf to dump the console
messages written by the kernel to its circular, in-memory buffer, I noticed a
"curious, single line" buried within its output:
panic sync timeoutAn annotated replay of the "tail" of
::msgbuf's output is shown here,
along with our "curious, single line":
# ls bounds unix.0 vmcore.0 # mdb -k 0 Loading modules: [ unix krtld ... ] > ::msgbuf [...] panic[cpu17]/thread=3000372c680: BAD TRAP: type=28 rp=2a101871200 addr=124ed4c mmu_fsr=0 event_pvr_exec-5: integer divide zero trap: addr=0x124ed4c pid=3751, pc=0x124ed4c, sp=0x2a101870aa1, tstate=0x80001606, context=0x1d64 g1-g7: ffff, fc00, 4000, 0, 0, 0, 3000372c680 000002a101870f20 unix:die+9c (28, 2a101871200, 124ed4c, 0, 2a101870fe0, 10000) %l0-3: 0000000000000000 0000000000000028 000000000000000f 000000000179ed54 %l4-7: 0000030000074f40 0000000000000061 000000000179ed53 0000000001074000 000002a101871000 unix:trap+560 (2a101871200, 58, 0, 0, 300019aa000, 3000372c680) %l0-3: 0000000000000000 0000060012e8efd0 0000000000000028 00000600133be388 %l4-7: 0000000000000000 000006001f282700 0000000000000001 00000300019aa180 000002a101871150 unix:ktl0+48 (4000, 40, 0, 0, 3002189493c, 0) %l0-3: 0000000000000007 0000000000001400 0000000080001606 00000000010195c4 %l4-7: 0000000000f0007f 0000000000000100 0000000000000000 000002a101871200 000002a1018712a0 ssd:ssd_get_physical_geometry+204 (3001eda2898, 2a10187140c, 0, 0, 0, 60003df3580) %l0-3: 0000030021894938 0000000000000024 0000000000000000 000000007fffffff %l4-7: 0000000080000000 0000000080000000 000003002189493c 0000000000004000 000002a101871350 ssd:ssd_resync_geom_caches+b4 (60003df3580, ffffffff80000000, 200, 1, 3ec1, ff) %l0-3: 00000000000000ff 000000000000003f 0000030023a7b420 000003000e760780 %l4-7: 00000000003f0000 0000000000ff0000 0000060003de0b00 0000000000020a2a 000002a101871420 ssd:ssd_validate_geometry+b4 (60003df3580, 1, 3000372c838, 1, 7, fc000092) %l0-3: 000002a101871484 0000000000000006 0000000000000200 000000000179ed52 %l4-7: 0000000080000000 000000000000003c 0000060003de0b00 00000000018b8c00 000002a1018714e0 ssd:ssd_ready_and_valid+2cc (60003df3580, 3000372c680, 60003df36c0, 0, 2, 2) %l0-3: 0000000000000000 00000000018acd68 0000030000008ff0 0000060003de0b00 %l4-7: 0000000000000000 0000000000000000 0000060003de0b00 0000060003de0b00 000002a1018715f0 ssd:ssdopen+260 (8, 60003df3580, 3, 18cfb68, 2, 60003df3650) %l0-3: 0000000000000000 0000000000000002 000000760000013b 0000000000000000 %l4-7: 0000060003de0b00 0000000000000001 00000000018c7c00 0000000000000000 000002a1018716a0 specfs:spec_open+438 (2a101871930, 1, 60002c01ee8, 18c88f0, 600138a0de8, 30010464c80) %l0-3: 0000000000000001 0000060016670f40 0000000000000000 00000000ffffffff %l4-7: 0000000000000000 00000600166b13e0 00000600138a0e70 000000760000013b 000002a101871760 genunix:fop_open+78 (2a101871930, 2, 60002c01ee8, 1, 60016670f40, 1) %l0-3: 0000000000000001 00000300000a4dd8 00000600068a23d8 00000600068c9738 %l4-7: 0000000000000000 00000600068a23f0 00000600068c9738 0000000000000004 000002a101871810 genunix:vn_openat+4d4 (0, 99c, 0, 1, 0, 1) %l0-3: 000000007fffffff 0000000000000000 0000000000000000 0000000000000000 %l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 000002a1018719d0 genunix:copen+260 (ffffffffffd19553, ffbff60c, 0, 3b99c, 0, 1) %l0-3: 00000600133be590 00000600133be5b0 000003501b56956d 0000000000000000 %l4-7: 00000000018a7c00 000000000000099c 0000000000000012 0000060012e8efd0 syncing file systems... panic[cpu17]/thread=3000372c680: panic sync timeout dumping to /dev/md/dsk/d1, offset 64422019072, content: kernelThe characters in green were output by the first call to
panic(), which was
caused by a division-by-zero trap that occurred in the function
ssd_get_physical_geometry(). The output shown in red
is the output from the second panic (eg, the panic that occurred while servicing the
first panic).
Our "curious, single line" (1), shown below, is found at the articulation between the
first and second panics. The first panic calls vfs_syncall() to write any in-memory
representation of a filesystem to its associated backing-store (2). As
will be seen shortly, this execution times-out, and panic() is called again (3), this
time with the argument "panic sync timeout". panic()'s design for re-entrancy
ensures that most of its implementation is not repeated. Progress resumes with the
writing of the core file to disk (4).
[...] syncing file systems... (2) panic[cpu17]/thread=3000372c680: (3) panic sync timeout (1) dumping to /dev/md/dsk/d1, offset 64422019072, content: kernel (4)
panic()'s re-entrancy includes the use of "panic triggers", which are three global
booleans which are atomically updated at specific points in panic()'s control flow:
/* * Triggers for panic state transitions: */ int panic_quiesce; /* trigger for CALM -> QUIESCE */ int panic_sync; /* trigger for QUIESCE -> SYNC */ int panic_dump; /* trigger for SYNC -> DUMP */
panic_trigger() is the function that atomically updates a specified trigger and returns
True if the current call to the function caused the first (and only) change from an
un-triggered state to a triggered state. If panic_trigger() returns False, then panic()
knows that the associated trigger has already been triggered, and that an
associated critical section of panic()'s control flow must not be run (again).
The high-level control flow of panic() is:
panic
vpanic
panicsys
...
flush_windows() ;; (on SPARC, write register stack to memory stack
panic_trigger(&panic_quiesce) ;; is first time panic has been called ?
if is-first-panic
switch thread's stack pointer, %sp, to panic_stack
save registers to memory stack
panicsys
...
if (on_panic_stack) ;; eg, "if is first-panic ..."
...
panic_stopcpus() ;; idle the current thread on (n-1) cpu's
...
save lots of values in panic_xxx variables ;; eg, panic_thread
...
if panic_trigger(&panic_sync) ;; have we not sync'd filesystems ?
set cpu's interrupt level to ten ;; CLOCK_LEVEL
do_polled_io = 1; ;; force drivers to use polling vs interrupts
vfs_syncall()
if panic_trigger(&panic_dump) ;; have we not written core file ?
set cpu's interrupt level to ten
do_polled_io = 1; ;; us polling vs interrupts
dumpsys()
do a reboot ;; mdboot(A_REBOOT, ...)
The call-sites to the three triggers are shown in bold characters. As can now be seen,
the first panic triggers panic_quiesce and panic_sync, and is in its call to
vfs_syncall (to sync all the filesystems). When this sync'ing causes a panic, this
second panic does not (re-)trigger either panic_quiesce or panic_sync, thus skipping much of panic's
control flow. However, the second panic does trigger panic_dump, causing a call to dumpsys() to create the
core file (using polled-IO, vs. interrupts).
"OK, this all 'sounds nice', but how do we know that this is what happened ?"
The trick lies in knowing which stack pointer, %sp, to use
when "dumping the stack".
The stack above that is shown in green is the stack
at the time the current thread's division-by-zero trap forced control to flow to
die()'s call to panic(). The thread that enters panic()
(and carries out panic()'s actions) is the same kernel thread (0x3000372c680) that
was handling the open() system call (see frame at bottom of stack).
We also know that this same thread was the thread that was executing in vfs_syncall() when it panic'd again, as the same thread address is shown for the second panic:
[...] syncing file systems... panic[cpu17]/thread=3000372c680: <----- !!! panic sync timeout dumping to /dev/md/dsk/d1, offset 64422019072, content: kernel (4)We also know that panic() obviously tries hard to ensure that "the dump" shows the state as it existed for any first panic, which means anything in the green part of the stack obviously only represents the first panic.
The trick is to look in what turns out to be the most obvious place, the per-thread state (kthread_t) of the thread that caused the second panic. When a thread
is idled, Solaris saves
the thread's last PC and stack pointer in kthread_t.t_pcb:
# mdb ...
> 3000372c680::print kthread_t t_pcb
{
t_pcb.val = [ 0x105bec4, 0x2a100beb2d1 ]
}
> 0x105bec4/ia
panicsys+0x48: call -0x1aa88
panicsys+0x4c:
If we dump the stack using this stack-pointer, we obtain a lot of insight:
> 3000372c680::print kthread_t t_pcb.val[1] | ::stack vpanic+0xcc(11d8d20, 2a100bebe78, 22bc8af, 64, 1813c00, 0) panic+0x1c(11d8d20, 18ab000, 0, 18ab000, 1, 11d8c00) cyclic_fire+0x64(60003df4eb8, 14b5188803b50, 3b9ac800, 0, 60003dfcb08, 60003e6f6f8) cbe_level14+0x38(0, 0, e, 300019aa000, 20080, 100b170) current_thread+0x140(1, 6001708de98, 1, 2710, 10741b8, 183c800) clnt_cots_kcallit+0x580(6001708de40, 64, 0, 6001708de40, 6001708de60, 6001708deb8) rfscall+0x58c(60002d52380, 15, 18e9ab0, 3c, 1841dd0, 18371e0) rfs3call+0x60(60002d52380, 15, 7afa466c, 1842098, 7afa4820, 1842000) nfs3_commit+0xc0(3001052f500, 1, 30012db13d0, 60002c01ee8, 30016159a38, 300161599c8) nfs3_dispose+0x2c0(3001052f500, 32300000, 8000, 0, 60002c01ee8, 30016159bd8) page_release+0xf8(70007de1680, 1, 168, 1836bb8, 1836800, 18a7000) pvn_getdirty+0x1fc(70007de1680, 400, 55, e1c0fa355, 0, 1) pvn_vplist_dirty+0x324(0, 400, 0, 0, 10000, 3001052f500) nfs_putpages+0xcc(3001052f500, 0, 0, 400, 60002c01ee8, 600076241c0) nfs3_putpage+0xcc(3001052f500, 0, 0, 1, 60002c01ee8, 0) rflush+0x1c4(30024040000, 60002c01ee8, 30014f97480, 180a9, 7048efec, 3) nfs3_sync+0x2c(0, 7048f000, 60002c01ee8, 1, 0, 0) vfs_sync+0x98(18ae560, 185d550, 185dad0, 0, 0, 1) vfs_syncall+0x48(a, a, 18b3000, 11d8000, 0, 1) panicsys+0x4e8(3000372c680, 60002c01ee8, 184b800, 1813c00, 1077800, 0) vpanic+0xcc(10741b8, 2a101870fa8, 2000, 1, 8, 8) panic+0x1c(10741b8, 28, 2a101871200, 124ed4c, 0, 1c) die+0x9c(28, 2a101871200, 124ed4c, 0, 2a101870fe0, 10000) trap+0x560(2a101871200, 58, 0, 0, 300019aa000, 3000372c680) ktl0+0x48(4000, 40, 0, 0, 3002189493c, 0) ssd_get_physical_geometry+0x204(3001eda2898, 2a10187140c, 0, 0, 0, 60003df3580) ssd_resync_geom_caches+0xb4(60003df3580, ffffffff80000000, 200, 1, 3ec1, ff) ssd_validate_geometry+0xb4(60003df3580, 1, 3000372c838, 1, 7, fc000092) ssd_ready_and_valid+0x2cc(60003df3580, 3000372c680, 60003df36c0, 0, 2, 2) ssdopen+0x260(8, 60003df3580, 3, 18cfb68, 2, 60003df3650) spec_open+0x438(2a101871930, 1, 60002c01ee8, 18c88f0, 600138a0de8, 30010464c80) fop_open+0x78(2a101871930, 2, 60002c01ee8, 1, 60016670f40, 1) vn_openat+0x4d4(0, 99c, 0, 1, 0, 1) copen+0x260(ffffffffffd19553, ffbff60c, 0, 3b99c, 0, 1) syscall_trap32+0xcc(ffbff60c, 0, 3b99c, 303240ff, 80808080, 1010101)The green part of stack is the stack we saw when we ran
::msgbuf, and it represents
the canonical stack
as was present when the first call to panic() was done. The red part of the stack
represents where the first call to panic() is within its control-flow. We can see
the call to vfs_syncall(), and we can see for this core file that NFS and RFS code are being used to
write out pages to one of their filesystems.
We also see current_thread() being called, which is a well-known function within
Solaris. It runs as part of executing a high-level interrupt (eg, an interrupt
whole PIL (Processor Interrupt Level)) is greater than LOCK_LEVEL (which currently
is ten). As an aside, a non-high-level-interrupt's handler will have intr_thread on the stack). current_thread() calls cbe_level14(), which
is used in the Cyclics implementation; "level14" gives a clue that this is
a handler for a high-level interrupt whose PIL is fourteen.
The blue part of the stack represents the second panic.
To prove that this stack trace generates the "panic sync timeout" message, we disassemble the code at "cyclic_fire+0x64", as this is the last stack frame before the second panic occurs:
> cyclic_fire+0x64 > cyclic_fire+0x64::dis cyclic_fire+0x3c: sllx %g4, 3, %g3 cyclic_fire+0x40: ldx [%g3 + %i5], %i1 cyclic_fire+0x44: cmp %i1, %l2 cyclic_fire+0x48: bg,pn %xcc, +0x80 <cyclic_fire+0xc8> cyclic_fire+0x4c: clr %i3 cyclic_fire+0x50: sra %g2, 0, %o1 cyclic_fire+0x54: sethi %hi(0x3b9ac800), %i2 cyclic_fire+0x58: add %i5, %g3, %l4 cyclic_fire+0x5c: add %i2, 0x200, %l3 cyclic_fire+0x60: mov %l1, %o0 cyclic_fire+0x64: call -0x114 <cyclic_expire> cyclic_fire+0x68: mov %l4, %o2 [...]We see a call to cyclic_expire(), and from the source code, we know its signature is
static void cyclic_expire(cyc_cpu_t *cpu, cyc_index_t ndx, cyclic_t *cyclic)From the instruction at cyclic_fire+0x68, we see that register %o2 received its value from local register %l4. If we dump our blue/red/green stack using
::stackregs instead of ::stack, we'll see
this stack frame for the call to cyclic_expire:
> 3000372c680::print kthread_t t_pcb.val[1] | ::stackregs [...] 000002a100beb5f1 cyclic_fire+0x64(60003df4eb8, 14b5188803b50, 3b9ac800, 0, 60003dfcb08, 60003e6f6f8) %l0-%l3: 60003c8a810 60003dab8c0 14b5188803da8 3b9aca00 %l4-%l7: 60003e6f770 14b51995f5d50 5 1 cbe_level14+0x38: call +0xa8338Dumping the memory referenced by %l4 will show us the third argument to cyclic_expire, which is of type cyclic_t:[...]
> 60003e6f770::print -t cyclic_t
{
hrtime_t cy_expire = 0x14b5188803b50
hrtime_t cy_interval = 0x3b9aca00
int (*)() cy_handler = deadman
void *cy_arg = 0
uint32_t cy_pend = 0
uint16_t cy_flags = 0
cyc_level_t cy_level = 0x2
}
This cyclic runs the function deadman(), which has a few purposes, one of which
is to check if the sync that is done as part of panic() has timed-out.
deadman() checks the panic_sync trigger to see if a sync is in progress.
If a time-out has occurred, we can see that line 1833 (below) calls panic() with
our "curious, single line":
1794 static void
1795 deadman(void)
1796 {
1797 if (panicstr) {
[...]
1819
1820 /*
1821 * If we are generating a crash dump or syncing filesystems and
1822 * the corresponding timer is set, decrement it and re-enter
1823 * the panic code to abort it and advance to the next state.
1824 * The panic states and triggers are explained in panic.c.
1825 */
1826 if (panic_dump) {
[...]
1831 } else if (panic_sync) {
1832 if (sync_timeleft && (--sync_timeleft == 0)) {
1833 panic("panic sync timeout");
1834 /*NOTREACHED*/
1835 }
1836 }
[...]
So, the Big Lesson here is that every line panic() does (and does NOT)
print could be important in understanding what really occurred. Read every line of
panic's output !!! =:-)
Posted at 11:12PM Jun 29, 2007 by Andrew Rutz in debug | Comments[0]