Monday May 07, 2007
Monday May 07, 2007
This infrastructural upgrade has replaced the kernel's legacy CMT scheduling framework (implemented in chip.c/chip.h), which provided for only a single level of optimization (physical processor). The PG infrastructure enables Solaris to optimize for N levels, which is needed in cases where multiple levels/types of hardware sharing exists between logical CPUs in the system. Longer term, we're interested in providing a PG based implementation for the scheduling side of the kernel's NUMA optimizations. In addition to simplifying the implementation, this would potentially get us to the point of a having an affinity/load balancing policy implementation that spans both NUMA and CMT.
The road ahead is an exciting one
Over the next year or two, i'll be focusing my efforts on Solaris platform independent power management policy...which will entail bringing (or coordinating to bring) power management awareness to the platform independent kernel subsystems that deal with power manageable resources. We'll start with the dispatcher. :)
"Tesla" is the code name for the project, which will be run "in the open" via OpenSolaris. Over the last week, i've been working on the logistical aspects of the project (getting content on the page, setting up mail aliases, figuring out the Mercurial SCM, etc.). I'm hoping the project will go live either tomorrow (uh, today) or Tuesday.
"Eric Saxe's Blog" doesn't cut it. We need a proper title. Hmm...:
YOUR ACCESS SUSPENDED PLEASE REPORT TO DILLINGER IMMEDIATELY AUTHORIZATION: MASTER CONTROL PROGRAM END OF LINE
What better way to engross an impressionable, video game loving 7 year boy with computers than to tell him that inside that unassuming white box programs in glowing red and blue suits were battling it out in a video game arena? For better or worse, the nostalgia of the time is still with me 26 years later. :) Campy? Perhaps, but who's to say that without such inspiration i'd be doing what I love today?
Wednesday Dec 07, 2005
Tuesday Dec 06, 2005
esaxe@ontario-mc25$ psrinfo 0 on-line since 10/14/2000 20:54:37 1 on-line since 10/14/2000 20:54:39 2 on-line since 10/14/2000 20:54:39 3 on-line since 10/14/2000 20:54:39 4 on-line since 10/14/2000 20:54:39 5 on-line since 10/14/2000 20:54:39 6 on-line since 10/14/2000 20:54:39 7 on-line since 10/14/2000 20:54:39 8 on-line since 10/14/2000 20:54:39 9 on-line since 10/14/2000 20:54:39 10 on-line since 10/14/2000 20:54:39 11 on-line since 10/14/2000 20:54:39 12 on-line since 10/14/2000 20:54:39 13 on-line since 10/14/2000 20:54:39 14 on-line since 10/14/2000 20:54:39 15 on-line since 10/14/2000 20:54:39 16 on-line since 10/14/2000 20:54:39 17 on-line since 10/14/2000 20:54:39 18 on-line since 10/14/2000 20:54:39 19 on-line since 10/14/2000 20:54:39 20 on-line since 10/14/2000 20:54:39 21 on-line since 10/14/2000 20:54:39 22 on-line since 10/14/2000 20:54:39 23 on-line since 10/14/2000 20:54:39 24 on-line since 10/14/2000 20:54:39 25 on-line since 10/14/2000 20:54:39 26 on-line since 10/14/2000 20:54:39 27 on-line since 10/14/2000 20:54:39 28 on-line since 10/14/2000 20:54:39 29 on-line since 10/14/2000 20:54:39 30 on-line since 10/14/2000 20:54:39 31 on-line since 10/14/2000 20:54:39Yes, I know the system's clock is off by a few years. But seriously, output like this is something i'm used to seeing on monsters like the Sun Fire E25K and Sun Fire E6900 Servers. It was mind expanding indeed to see this sort of output from a small box with but a single physical processor.
Wednesday Nov 30, 2005
Here's Steve's OpenSolaris presentation.
We then followed up with a 30 minute presentation (with demos) of DTrace, which was well received. The presentation I used was derived from another (much longer) presentation Bryan, Mike, and Adam put together. The first part of the presentation provides the foundations for understanding why DTrace is such a key part of Solaris's observability portfolio. The rest of the presentation is structured mainly around the demos which highlight some of the key DTrace features. I would have loved to delve deeper, but there's only so much one can do in 30 minutes.
Here's the DTrace presentation I used.
Here's the corresponding demo scripts.
Monday Nov 28, 2005
Thursday Jun 16, 2005
Tuesday Jun 14, 2005
# mdb -k Loading modules: [ unix krtld genunix specfs ufs sd sgsbbc sgenv ip sctp fcp fctl ssd nca random nfs sppp crypto ptm logindmux lofs cpc ] > ::class SLOT NAME INIT FCN CLASS FCN 0 SYS sys_init sys_classfuncs 1 TS ts_init ts_classfuncs 2 FX fx_init fx_classfuncs 3 IA ia_init ia_classfuncs 4 RT rt_init rt_classfuncs 5 0 0 6 0 0 7 0 0 8 0 0 9 0 0 >Note that mdb -k invokes the debugger on the running kernel, *very* useful for debugging a live system. If you want to debug the running kernel, and have it stop so that you can look at a "snapshot" of the kernel, set breakpoints, data watchpoints, etc, you would use "mdb -K", which would drop you into kmdb. For those who remember kadb, kmdb is it's successor. All hail ::print!!! So moving along, from ::class we see:
esaxe@sark$ ls SCCS/ disp_lock.c ia.c shuttle.c ts_dptbl.c class.c fss.c priocntl.c sysclass.c cpupart.c fx.c rt.c thread.c disp.c fx_dptbl.c rt_dptbl.c ts.cAs an example of how the dispatcher and scheduling classes hook together, let's look at preempt(), the function invoked when the currently running thread on a CPU should yield in favor of a higher priority runnable thread:
/*
* Preempt the currently running thread in favor of the highest
* priority thread. The class of the current thread controls
* where it goes on the dispatcher queues. If panicking, turn
* preemption off.
*/
void
preempt()
{
kthread_t *t = curthread;
klwp_t *lwp = ttolwp(curthread);
if (panicstr)
return;
TRACE_0(TR_FAC_DISP, TR_PREEMPT_START, "preempt_start");
thread_lock(t);
if (t->t_state != TS_ONPROC || t->t_disp_queue != CPU->cpu_disp) {
/*
* this thread has already been chosen to be run on
* another CPU. Clear kprunrun on this CPU since we're
* already headed for swtch().
*/
CPU->cpu_kprunrun = 0;
thread_unlock_nopreempt(t);
TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");
} else {
if (lwp != NULL)
lwp->lwp_ru.nivcsw++;
CPU_STATS_ADDQ(CPU, sys, inv_swtch, 1);
THREAD_TRANSITION(t);
CL_PREEMPT(t);
DTRACE_SCHED(preempt);
thread_unlock_nopreempt(t);
TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");
swtch(); /* clears CPU->cpu_runrun via disp() */
}
}
Typically the "else" route is taken, and we see that the scheduling class
specific preemption routine is called, CL_PREEMPT(). If the thread is in the
TS (timeshare) scheduling class, then this evaluates to ts_preempt(), which ultimately results in either setbackdq() or setfrondq() being called, which place the thread at the front, or the back of
the queue (respectively) of threads that exist at that thread's priority. At the end of
preempt() above, we see swtch() is called (switch threads), which in turn calls disp() (return the highest priority thread that the CPU can
now run (or the idle thread if there's nothing)), and finally resume()
which does the actual context switch. Whew!
# mdb -k Loading modules: [ unix krtld genunix specfs ufs sd sgsbbc sgenv ip sctp fcp fctl ssd nca random nfs sppp ptm logindmux lofs cpc ] > ::cpuinfo ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC 0 0000180c000 1b 0 0 -1 no no t-11 2a10001fcc0 (idle) 1 300423b2000 1b 0 0 -1 no no t-63 2a1002f7cc0 (idle) 2 300423ba000 1b 0 0 -1 no no t-1 2a10037fcc0 (idle) 3 300423c6000 1b 0 0 -1 no no t-91 2a1003c7cc0 (idle) 8 300423ce000 1b 0 0 59 no no t-0 30d4775abc0 mdb 9 300423de000 1b 0 0 -1 no no t-0 2a100539cc0 (idle) 10 300423ea000 1b 0 0 -1 no no t-71 2a1005c1cc0 (idle) 11 300423f2000 1b 0 0 -1 no no t-8 2a100609cc0 (idle) 16 300423fa000 1b 0 0 -1 no no t-10 2a100691cc0 (idle) 17 30042402000 1b 0 0 -1 no no t-191 2a100719cc0 (idle) 18 3004240a000 1b 0 0 -1 no no t-0 2a1007a1cc0 (idle) 19 30042412000 1b 0 0 -1 no no t-0 2a1007e9cc0 (idle) >Well, that's not very exciting. All but CPU 8 is idle, and CPU 8 is running my mdb process thread. Let me kick off a build:
> ::cpuinfo ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC 0 0000180c000 1b 0 0 30 no no t-0 325d59b8820 acomp 1 300423b2000 1b 1 0 30 no no t-0 308e67c02a0 acomp 2 300423ba000 1b 1 0 55 no no t-2 30760fbce60 cg 3 300423c6000 1b 2 0 53 no no t-1 33886adf500 cg 8 300423ce000 1b 1 0 39 no no t-1 35876d042c0 iropt 9 300423de000 1b 2 0 59 no no t-1 30d4775abc0 mdb 10 300423ea000 1b 0 0 42 no no t-1 3081f8cc200 cg 11 300423f2000 1b 2 0 32 no no t-0 319c6969400 cg 16 300423fa000 1b 1 0 53 no no t-0 338853f1ae0 cg 17 30042402000 1b 0 0 43 no no t-0 307215023a0 cg 18 3004240a000 1b 1 0 49 no no t-0 30d4771c5e0 cg 19 30042412000 1b 0 0 43 no no t-1 308e67dbc80 cgThat's better, ::cpuinfo gives you an overview of what threads every CPU is running. ::cpuinfo -v gives you even more. Lets look at CPU 1 in detail:
> 300423b2000::cpuinfo -v
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
1 300423b2000 1b 4 0 0 no no t-5 3075514a440 acomp
| |
RUNNING <--+ +--> PRI THREAD PROC
READY 0 35876d042c0 sh
EXISTS 0 3075514b140 acomp
ENABLE 0 30041c654a0 dmake
0 308e629b120 dmake
So CPU1 has 4 runnable threads, all at priority 0. The current thread
"acomp" has been on for 5 ticks (5/100) of a second, and this CPU has the
right set of flags set for an online and active CPU.> ::walk thread 180e000 2a10001fcc0 2a100017cc0 2a10000fcc0 2a100007cc0 2a10003fcc0 2a100037cc0 2a10002fcc0 2a100027cc0 2a10005fcc0 2a100057cc0 2a10004fcc0 >> More [(It's a long list, and i'll spare you all the output). Now let's pipe the output of ::walk thread to ::findstack, which prints a stack trace for a thread given a thread structure:, , q, n, c, a] ?
> ::walk thread |::findstack stack pointer for thread 180e000: 180b0f1 [ 000000000180b0f1 sched+0x570() ] 000000000180b211 main+0x458() 000000000180b2f1 _start+0x110() stack pointer for thread 2a10001fcc0: 2a10001f021 [ 000002a10001f021 ktl0+0x48() ] 000002a10001f171 disp_getwork+0x164() 000002a10001f221 idle+0xdc() 000002a10001f2d1 thread_start+4() stack pointer for thread 2a100017cc0: 2a100017151 [ 000002a100017151 cv_wait+0x5c() ] 000002a100017201 taskq_thread+0x13c() 000002a1000172d1 thread_start+4() stack pointer for thread 2a10000fcc0: 2a10000f141 [ 000002a10000f141 sema_p+0x25c() ] 000002a10000f1f1 kmdbmod`kctl_wr_thread+0xbc() 000002a10000f2d1 thread_start+4() stack pointer for thread 2a100047cc0: 2a100046f51 000002a100046ff1 clock+0x20() 000002a1000470e1 cyclic_softint+0x374() 000002a100047221 cbe_level10+8() 000002a1000472d1 intr_thread+0x294() 000002a10007f2d1 utl0+0x4c() 000002a10001f171 disp_getwork+0x164() 000002a10001f221 idle+0xdc() 000002a10001f2d1 thread_start+4() >> More [When a system seems hung. I'll typically break into kmdb, and will do a ::walk thread |::findstack to look for interesting stacks. If there's a pile of threads all asleep with mutex_vector_enter() near the top of the stack, then from there i'll dig down to find who owns what locks, and why the various lock owners aren't letting go (or why they can't let go because they can't run)., , q, n, c, a] ?
> ::walk cpu |::print cpu_t cpu_thread |::print kthread_t t_pri t_pri = 0xffff t_pri = 0xffff t_pri = 0xffff t_pri = 0xffff t_pri = 0xffff t_pri = 0xffff t_pri = 0x3b t_pri = 0xffff t_pri = 0xffff t_pri = 0xffff t_pri = 0xffff t_pri = 0xffffSo most of the CPUs are running a priority -1 thread (idle), and one is running a priority 59 thread (my mdb process). You'll find that ::print is your friend:
> cpu0::print cpu_t cpu_disp |::print disp_t
{
disp_lock = 0
disp_npri = 0xaa
disp_q = 0x309c1a19000
disp_q_limit = 0x309c1a19ff0
disp_qactmap = 0x300425ed380
disp_maxrunpri = 0xffff
disp_max_unbound_pri = 0xffff
disp_nrunnable = 0
disp_cpu = cpu0
}
mdb and kmdb know how to print out *every* structure thanks to the CTF data
compiled in. This eliminates the need for clunky adb macros. Relating to the
scheduler/dispatcher, you may want to consider getting familiar with the kthread_t, cpu_t, and disp_t structures.set snooping = 1... or by hotpatching the variable's value early in boot.
> panic[cpu1]/thread=30018dd6cc0: deadman: timed out after 9 seconds of > clock inactivity > > 000002a10005fdf0 genunix:cyclic_fire+64 (300138a9fb8, 20d5a6fe1472, > 3b9ac800, 0, 300121d9698, 300348b0280) > %l0-3: 00000300195648c0 0000030012a7eb80 000020d5a6fe17c0 000000003b9aca00 > %l4-7: 00000300348b0320 000020d5a77b5280 0000000000000014 0000000000000004 > 000002a10005fea0 unix:cbe_level14+38 (0, 0, e, 30006356000, 400080, 100aef8) > %l0-3: 000000000000022e 0000000000010002 0000000001911ab8 0000000000000000 > %l4-7: 0000000000000001 0000000000000001 0000000001911ab0 0000000001910470 > 000002a10005ff50 unix:current_thread+140 (1, 6912166044, 6912166040, > 18af260, f0d0e, f0d0f) > %l0-3: 0000000001007674 000002a102682a31 000000000000000e 0000000000010002 > %l4-7: 00000000018670b0 0000000000000000 000000000000000b 000002a1026832e0 > 000002a102683380 unix:mp_cpu_quiesce+90 (180c000, 200, 1, 0, 3938700, fc00) > %l0-3: 0000000000000000 0000000000000000 0000000000000000 00000000018af260 > %l4-7: 00000000018af000 00000000018aa660 0000000000000000 000000000085daf8 > 000002a102683430 sbdp:sbdp_cpu_poweroff+d4 (180c000, 1c00, 0, 0, 0, > f082e520) > %l0-3: 00000000000000bb 00000000705e3e20 00000000705e3c00 0000030015828000 > %l4-7: 000000007043dc00 000007fff0914000 0000000000001fff 00000000705e3e18 > 000002a102683500 unix:cpu_poweroff+24 (180c000, 10, 22, 2f, 2f, 180c000) > %l0-3: 0000000000000000 00000000705e2900 00000000705e2930 0000000000000030 > %l4-7: 000000000190be28 000002a102683590 fffffffff082e520 00000000705e2800 > 000002a1026835b0 sbd:sbd_pre_detach_cpu+1b0 (300062944c0, 300139c0000, > 180c000, 0, 30006294510, 30012bdd2c8) > %l0-3: 0000030015848000 000000000000008f 0000030015b65658 00000000000bab15 > %l4-7: 0000000000000000 ffffffffffffffff 0000000000000001 0000000000000000 > 000002a102683660 sbd:sbd_dev_unconfigure+50 (300062944c0, 300139c0000, > 1, 0, 0, 2a10268371c) > %l0-3: 0000000000000005 0000000000000001 0000000000000019 000003000f675180 > %l4-7: 0000000000000000 0000000000000001 0000000000000000 0000000000000001 > 000002a102683720 sbd:sbd_exec_op+94 (300062944c0, 0, 445000, 7, 7, 8c) > %l0-3: 0000030015848000 00000000704bf800 0000000000000003 ffffffffffffff1f > %l4-7: ffffffffffffffd3 0000000000445208 0000030006294510 0000000000000000 > 000002a1026837f0 sbd:sbd_ioctl+150 (a500000000, 445208, 293c0, 100003, > 2a102683977, 445000) > %l0-3: 0000000000000003 00000000704bf800 0000000000445208 0000000000445201 > 000002a102683990 genunix:ioctl+184 (3, 30019513580, 293c0, ff1d3108, > ff1ea000, 445208) > %l0-3: 000003001301c590 000003001301c5b0 0000000000000004 0000000000000001 > %l4-7: 000003001301c5b0 0000000000000000 0000000000000000 0000000000000000 > > panic: entering debugger (continue to save dump) [1]> $c kaif_enter(18340f8, 0, 11e0a14, 1828400, 11e0a14, 4c00) panicsys+0x408(18b3410, 18aa2b0, 30006356000, 1, 20040, 1883800) vpanic+0xcc(11d6c20, 2a10005fe78, 0, 0, 43c998540, 1b5872) panic+0x1c(11d6c20, 9, 18c9000, 11d6c00, 18c9000, 0) cyclic_fire+0x64(300138a9fb8, 20d5a6fe1472, 3b9ac800, 0, 300121d9698, 300348b0280) cbe_level14+0x38(0, 0, e, 30006356000, 400080, 100aef8) current_thread+0x140(1, 6912166044, 6912166040, 18af260, f0d0e, f0d0f) mp_cpu_quiesce+0x90(180c000, 200, 1, 0, 3938700, fc00) sbdp`sbdp_cpu_poweroff+0xd4(180c000, 1c00, 0, 0, 0, f082e520) cpu_poweroff+0x24(180c000, 10, 22, 2f, 2f, 180c000) sbd`sbd_pre_detach_cpu+0x1b0(300062944c0, 300139c0000, 180c000, 0, 30006294510, 30012bdd2c8) sbd`sbd_dev_unconfigure+0x50(300062944c0, 300139c0000, 1, 0, 0, 2a10268371c) sbd`sbd_exec_op+0x94(300062944c0, 0, 445000, 7, 7, 8c) sbd`sbd_ioctl+0x150(a500000000, 445208, 293c0, 100003, 2a102683977, 445000) ssm`ssm_ioctl+0xcc(a500000000, 3000313f880, 293c0, 100003, 3001218cf80, 445208) ioctl+0x184(3, 30019513580, 293c0, ff1d3108, ff1ea000, 445208) syscall_trap32+0xcc(3, 445208, 293c0, ff1d3108, ff1ea000, b)Ok, so the golden question. Why wasn't the clock interrupt able to fire so that it could change lbolt to make deadman() happy?
> ::cpuinfo ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC 0 0000180c000 2d 1 9 -1 no yes t-31536 2a1015a3cc0 (idle) 1 0000187e7a0 1b 9 0 49 no no t-1 30018dd6cc0 cfgadm 2 300061d2000 d 10 0 169 no no t-0 2a101433cc0 sched 3 3000631e000 2d 0 0 -1 no no t-31541 2a102053cc0 (idle) 8 30006340000 d 8 0 169 no no t-0 2a101943cc0 sched 10 30008c74000 1d 10 0 169 no no t-0 2a10117bcc0 sched 11 30008864000 d 6 0 169 no no t-0 2a101f2bcc0 sched 16 30008442000 1d 9 0 169 no no t-0 2a101d23cc0 sched 17 3000da1e000 1d 6 0 169 no no t-0 2a100611cc0 sched 18 3000c9fc000 1d 6 0 169 no no t-0 2a101983cc0 sched 19 30005bda000 1d 8 0 169 no no t-0 2a1020a3cc0 schedWho are the priority 169 threads running on 2, 8-19?
> 2a101433cc0::findstack stack pointer for thread 2a101433cc0: 2a101433021 [ 000002a101433021 panic_idle+0x1c() ] 000002a1014330d1 ktl0+0x48() 000002a101433221 cpu_pause+0x5c() 000002a1014332d1 thread_start+4() >Ah, those are pause threads. There are certain lists and data structures walked by the dispatcher code without any locks held. One example is the per processor set list of active CPUs walked in disp_anywork(), a function used by the dispatcher to determine if there's any thing better to run in the system besides idle(). This enables lockless performance in the dispatcher, at the cost of requiring that all CPUs be paused (run high priority threads that just spin) when those lists/structures are changed around (ensuring no other CPUs are executing dispatcher code) ...which of course happens very rarely.
> 30018dd6cc0::findstack stack pointer for thread 30018dd6cc0: 2a102682ae1 000002a102682b81 mp_cpu_quiesce+0x90() 000002a102682c31 sbdp_cpu_poweroff+0xd4() 000002a102682d01 cpu_poweroff+0x24() 000002a102682db1 sbd_pre_detach_cpu+0x1b0() 000002a102682e61 sbd_dev_unconfigure+0x50() 000002a102682f21 sbd_exec_op+0x94() 000002a102682ff1 sbd_ioctl+0x150() 000002a1026830c1 ssm_ioctl+0xcc() 000002a102683191 ioctl+0x184() 000002a1026832e1 syscall_trap32+0xcc() > mp_cpu_quiesce+0x90::dis mp_cpu_quiesce+0x80: nop mp_cpu_quiesce+0x84: clr %g2 mp_cpu_quiesce+0x88: ba +0x34 <mp_cpu_quiesce+0xbc> mp_cpu_quiesce+0x8c: cmp %g2, 0 mp_cpu_quiesce+0x90: call +0x1db71c <drv_usecwait> mp_cpu_quiesce+0x94: mov %i2, %o0 mp_cpu_quiesce+0x98: add %l7, 1, %l7 mp_cpu_quiesce+0x9c: cmp %l7, %i4 mp_cpu_quiesce+0xa0: bl,pt %icc, -0x40 <mp_cpu_quiesce+0x60>Which, looking at the C code puts us here in mp_cpu_quiesce():
/*
* Declare CPU as no longer being READY to process interrupts and
* wait for them to stop. A CPU that is not READY can no longer
* participate in x-calls or x-traps.
*/
cp->cpu_flags &= ~CPU_READY;
CPUSET_DEL(cpu_ready_set, cpuid);
membar_sync();
for (i = 0; i < sanity_limit; i++) {
if (cp->cpu_intr_actv == 0 &&
cp->cpu_thread == cp->cpu_idle_thread) {
found_intr = 0;
break;
}
DELAY(1); <-------------- **
}
So CPU1 is caught in the for loop waiting for CPU0's cpu_intr_actv
bitmask to become empty (cpu_intr_actv is a bitmap of active interrupts that
lives in the CPU structure). Sure enough, it isn't:
> 180c000::print cpu_t cpu_intr_actv cpu_intr_actv = 0x200Taking a close look at CPU0:
> 180c000::cpuinfo -v
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 0000180c000 2d 1 9 -1 no yes t-31536 2a1015a3cc0 (idle)
| | |
RUNNING <--+ | +--> PIL THREAD
QUIESCED | 9 2a101423cc0
EXISTS |
OFFLINE +--> PRI THREAD PROC
168 2a101423cc0 sched
> 2a101423cc0::findstack
stack pointer for thread 2a101423cc0: 2a1014230d1
000002a101423171 scrub_ecache+0x80()
000002a101423221 swtch+0x78()
000002a1014232d1 intr_thread+0x240()
Hey look, there's a level 9 interrupt thread runnable and eagerly
awaiting the CPU to grab it, and run it.
/*
* If CPU is completely quiesced by p_online(2), just wait
* here with minimal bus traffic until put online.
*/
while (cp->cpu_flags & CPU_QUIESCED)
(*idle_cpu)();
What happens if the interrupt thread blocks? You're looking at it.
The thread will remain unrunnable. It's only hope is for the QUIESCED flag
to be cleared should the CPU be brought back online.> 2a101423cc0::findstack stack pointer for thread 2a101423cc0: 2a1014230d1 000002a101423171 scrub_ecache+0x80() 000002a101423221 swtch+0x78() 000002a1014232d1 intr_thread+0x240() >The topmost stack frame already popped off, so really the last bit of code we executed was switch, called by intr_thread+0x240:
intr_thread+0x234: ldub [%o2 + 0xe9], %o5 intr_thread+0x238: brz,pt %o5, +0x1c <intr_thread+0x254> intr_thread+0x23c: mov %o2, %l3 intr_thread+0x240: call +0x486a8 <preempt> intr_thread+0x244: mov %o3, %l2 intr_thread+0x248: mov %l3, %o2 intr_thread+0x24c: mov %l2, %o3Oh oh. Looks like intr_thread+0x240 called preempt(), which then called swtch(). This thread voluntarily gave up the CPU. Why? Code exists in intr_thread() that prevents that pathological scenario of a thread being pinned by an interrupt for an excessive period of time. Some networking interrupt handlers will happily stay in interrupt context as long as there's a steady stream of packets to service, so intr_thread(), once it detects that we've processed 16 or so interrupts without giving the poor pinned thread a chance to run, will force itself, and whomever it's pinning to preempt to give the interrupted thread a chance to migrate elsewhere. Unfortunately, while it may have had the best intentions, it did this on a QUIESCED CPU, which made the ecache scrubber interrupt thread unrunnable. The ecache scrubber interrupt is sent regularly via a cyclic firing on another CPU (the clock, and deadman are both cyclic based).
> ::cycinfo -v
...
CPU CYC_CPU STATE NELEMS ROOT FIRE HANDLER
10 30011394900 online 12 300189d9480 a926ae697c00 do_scrub_offline
6
|
+------------------+------------------+
10 1
| |
+---------+--------+ +---------+---------+
9 5 3 2
| | | |
+----+----+ +----+----+ +----+----+ +----+---+
4 0 11 7 8
| | | | |
+-+-+ +--+-+ +-+-+ +--+-+ +-+-+
ADDR NDX HEAP LEVL PEND FIRE USECINT HANDLER
300189d9480 0 8 low 212603 a926cd692a74 705032 cheetah_nudge_buddy
300189d94a8 1 2 low 14989212 a926ae697c00 10000 do_scrub
300189d94d0 2 6 low 14989212 a926ae697c00 10000 do_scrub
300189d94f8 3 5 low 14989212 a926ae697c00 10000 do_scrub
300189d9520 4 7 high 0 a926c7527474 1000000 deadman
300189d9548 5 4 low 14989212 a926ae697c00 10000 do_scrub_offline
300189d9570 6 0 low 14989212 a926ae697c00 10000 do_scrub_offline
300189d9598 7 10 low 14989212 a926ae697c00 10000 do_scrub_offline
300189d95c0 8 11 low 149892 a926c6410000 1000000 dtrace_state_deadman
300189d95e8 9 3 low 15139104 a926adf27592 9900 dtrace_state_clean
300189d9610 10 1 lock 14989212 a926ae697c00 10000 clock <---
300189d9638 11 9 high 0 a926ae697c00 10000 cbe_hres_tick
...
Wow, CPU10 has the clock cyclic, and there's a *lot* of pending clock
interrupts. Note that the clock fires at PIL 10, but the pause
thread running at that CPU has masked interrupts PIL10 and below. :P
Well, there it is, that's why clock couldn't bump lbolt, and after 9 seconds
the deadman pulled the plug. A pause_thread running on CPU0 (in the past) was
probably also responsible for backing up the number of pending PIL 9 scrub
requests, such that when PIL was lowered, the overwhelming number of pending
PIL 9 interrupts triggered the preempt mechanism in intr_thread().