Debugging Solaris Scheduling Problems, and Other Fun Party Tricks
The day has arrived. At last. This morning you awoke to the sound of birds
singing a pleasant song, sunshine beaming in from outside...and your first
waking thought brought a smile to your face....as it did mine. Today Solaris
is open, and to you we proudly offer it. Welcome.
Great. So now what?
OpenSolaris is
yours, and you'd like to get your feet wet,
right? Well, where should we start? Over the last few years most of my
professional time has been spent understanding, improving and evolving the
scheduler/dispatcher. That's right, many a performance issue have I seen, and
to many a hung system have I cursed. I've learned a lot in that time,
but the time has come to begin dumping a cerebral core of my share of
arcane information that will hopefully save someone else working in this area
of the kernel some time and frustration. That's not to say that the scheduler/
dispatcher code is frustrating, but some of it is "historical", and more often
than not, it's easier to understand why and how the code works if you know
why it was written in the first place.
The Dispatcher and Scheduling Classes Intro
When someone says "hey, I think there's a scheduling problem here", they
might be talking about one (or more) of a few different things:
- Unfairness - Some threads are getting more (or less) CPU time
than they should
- Starvation - An extreme example of the fairness problem, some runnable
thread(s) are barely running, or aren't getting to run at all
- Excess Migrations - Some thread(s) are hopping between CPUs too often.
This can hurt performance since a migrating thread will likely be welcomed
by a cold cache.
- Poor load balancing - Scheduling threads to run on CPUs in such a way
that they are forced to excessively contend for resources...where a
resource could be CPU time, bandwidth, use of a shared component
(e.g. memory controller, pipeline, etc.)
But really, the above deals with both dispatching (where threads run, (on which
CPUs)) and scheduling (when they run (designated by thread priorities)).
In Solaris, the core dispatcher code lives in
disp.c under
usr/src/uts/common/disp. When a thread becomes runnable, this code is responsible for picking the
CPU on which the thread should next run.
When it will get a chance to
run depends on that thread's priority relative to other runnable threads
waiting for that CPU. A threads priority is governed by it's scheduling class.
You can take a look at the current scheduling classes loaded on the system
from within mdb:
# mdb -k
Loading modules: [ unix krtld genunix specfs ufs sd sgsbbc sgenv ip sctp fcp fctl ssd nca random nfs sppp crypto ptm logindmux lofs cpc ]
> ::class
SLOT NAME INIT FCN CLASS FCN
0 SYS sys_init sys_classfuncs
1 TS ts_init ts_classfuncs
2 FX fx_init fx_classfuncs
3 IA ia_init ia_classfuncs
4 RT rt_init rt_classfuncs
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
>
Note that mdb -k invokes the debugger on the running kernel, *very* useful for
debugging a live system. If you want to debug the running kernel, and have it
stop so that you can look at a "snapshot" of the kernel, set breakpoints, data
watchpoints, etc, you would use "mdb -K", which would drop you into kmdb.
For those who remember kadb, kmdb is it's successor. All hail ::print!!!
So moving along, from ::class we see:
- System scheduling class - For kernel service threads. Threads in this
class have no time quantums, and they will remain on the CPU until such
time as they voluntarily yield it. Typical thread priorities: 60 - 99
- Time Share scheduling class - The default class for user threads.
Threads in this class will have their priority lowered as they consume
more CPU time, and increased as they use less.
Thread priorities: 0 - 59
- FX (fixed priority) scheduling class - Threads in this class have fixed
priority. By default it's 0, but privileged users can run threads in
this scheduling class at priorities greater than 0.
Thread priorities: 0 - 60.
- Interactive scheduling class - Used by the windowing system. Very similar
to TS, except that there's a mechanism allowing threads to have a "boost"
of priority. The windowing system uses this to give processes associated
with "in focus" windows higher priority. Thread priorities: 0 - 59
- Real Time scheduling class - Threads in the RT class assume the highest
system priorities, even higher than system. Thread priorities:
100 - 159
You can find more about these scheduling classes from the priocntl(1) man page,
Note that there's a difference between a scheduling class priority, and the
thread's system wide priority. The priorities cited above are system
priorities, which is the global, scheduling class independent scale of
priorities used for all threads by the dispatcher.
In the source base, the code which implements these scheduling classes can
again be found under
usr/src/uts/common/disp.
esaxe@sark$ ls
SCCS/ disp_lock.c ia.c shuttle.c ts_dptbl.c
class.c fss.c priocntl.c sysclass.c
cpupart.c fx.c rt.c thread.c
disp.c fx_dptbl.c rt_dptbl.c ts.c
As an example of how the dispatcher and scheduling classes hook together, let's
look at
preempt(), the function invoked when the currently running
thread on a CPU should yield in favor of a higher priority runnable thread:
/*
* Preempt the currently running thread in favor of the highest
* priority thread. The class of the current thread controls
* where it goes on the dispatcher queues. If panicking, turn
* preemption off.
*/
void
preempt()
{
kthread_t *t = curthread;
klwp_t *lwp = ttolwp(curthread);
if (panicstr)
return;
TRACE_0(TR_FAC_DISP, TR_PREEMPT_START, "preempt_start");
thread_lock(t);
if (t->t_state != TS_ONPROC || t->t_disp_queue != CPU->cpu_disp) {
/*
* this thread has already been chosen to be run on
* another CPU. Clear kprunrun on this CPU since we're
* already headed for swtch().
*/
CPU->cpu_kprunrun = 0;
thread_unlock_nopreempt(t);
TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");
} else {
if (lwp != NULL)
lwp->lwp_ru.nivcsw++;
CPU_STATS_ADDQ(CPU, sys, inv_swtch, 1);
THREAD_TRANSITION(t);
CL_PREEMPT(t);
DTRACE_SCHED(preempt);
thread_unlock_nopreempt(t);
TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");
swtch(); /* clears CPU->cpu_runrun via disp() */
}
}
Typically the "else" route is taken, and we see that the scheduling class
specific preemption routine is called, CL_PREEMPT(). If the thread is in the
TS (timeshare) scheduling class, then this evaluates to
ts_preempt(), which ultimately results in either
setbackdq() or
setfrondq() being called, which place the thread at the front, or the back of
the queue (respectively) of threads that exist at that thread's priority. At the end of
preempt() above, we see
swtch() is called (switch threads), which in turn calls
disp() (return the highest priority thread that the CPU can
now run (or the idle thread if there's nothing)), and finally
resume()
which does the actual context switch. Whew!
There's something important to note here. All this is being done in the context
of the thread that's being preempted! That is, the thread that is being
preempted is actually placing itself on a run queue somewhere, is finding
the next best thread to run, and is then context switching to it! The thread
is told it needs to preempt by the setting of the "cpu_runrun, or cpu_kprunrun"
flags in the CPU structure. That flag may have been set by another, higher
priority thread that placed itself on the run queue (see
cpu_resched()). Whenever the thread takes a trap, returns from a
system call, etc. It checks for this flag and proceeds to preempt itself if it
is set. The idea that the dispatcher isn't some group of one or more
bureaucratic threads "directing traffic" may seem surprising, but it's
implemented this way for performance and scalability. All threads on the
system share in executing dispatcher code, and we must take care that the bits
of policy employed by every thread add up to serve the greater good.
Debugging
So lets talk a bit about debugging. What's that you say, debugging is painful?
Tired of adding printk() statements here and there, re-compiling, and adding
more printk() statements? I feel your pain. But let me assure you that when
that smile erupted on your face this morning, it was for a reason. I enjoy
developing Solaris, and it's not because I think hunting down fantastically
complex and subtle problems with print statements is fun, it's because tools
like (k)mdb and DTrace free me to investigate, observe, and understand what's
going on without having to figure out what's needed to make the problem
observable.
For now, lets look at one of the tools at your disposal:
(k)mdb, dcmds, and CTF
Let's invoke the kernel debugger on my build machine's running kernel to
see what it's up to:
# mdb -k
Loading modules: [ unix krtld genunix specfs ufs sd sgsbbc sgenv ip sctp fcp fctl ssd nca random nfs sppp ptm logindmux lofs cpc ]
> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 0000180c000 1b 0 0 -1 no no t-11 2a10001fcc0 (idle)
1 300423b2000 1b 0 0 -1 no no t-63 2a1002f7cc0 (idle)
2 300423ba000 1b 0 0 -1 no no t-1 2a10037fcc0 (idle)
3 300423c6000 1b 0 0 -1 no no t-91 2a1003c7cc0 (idle)
8 300423ce000 1b 0 0 59 no no t-0 30d4775abc0 mdb
9 300423de000 1b 0 0 -1 no no t-0 2a100539cc0 (idle)
10 300423ea000 1b 0 0 -1 no no t-71 2a1005c1cc0 (idle)
11 300423f2000 1b 0 0 -1 no no t-8 2a100609cc0 (idle)
16 300423fa000 1b 0 0 -1 no no t-10 2a100691cc0 (idle)
17 30042402000 1b 0 0 -1 no no t-191 2a100719cc0 (idle)
18 3004240a000 1b 0 0 -1 no no t-0 2a1007a1cc0 (idle)
19 30042412000 1b 0 0 -1 no no t-0 2a1007e9cc0 (idle)
>
Well, that's not very exciting. All but CPU 8 is idle, and CPU 8 is running
my mdb process thread. Let me kick off a build:
> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 0000180c000 1b 0 0 30 no no t-0 325d59b8820 acomp
1 300423b2000 1b 1 0 30 no no t-0 308e67c02a0 acomp
2 300423ba000 1b 1 0 55 no no t-2 30760fbce60 cg
3 300423c6000 1b 2 0 53 no no t-1 33886adf500 cg
8 300423ce000 1b 1 0 39 no no t-1 35876d042c0 iropt
9 300423de000 1b 2 0 59 no no t-1 30d4775abc0 mdb
10 300423ea000 1b 0 0 42 no no t-1 3081f8cc200 cg
11 300423f2000 1b 2 0 32 no no t-0 319c6969400 cg
16 300423fa000 1b 1 0 53 no no t-0 338853f1ae0 cg
17 30042402000 1b 0 0 43 no no t-0 307215023a0 cg
18 3004240a000 1b 1 0 49 no no t-0 30d4771c5e0 cg
19 30042412000 1b 0 0 43 no no t-1 308e67dbc80 cg
That's better, ::cpuinfo gives you an overview of what threads every
CPU is running. ::cpuinfo -v gives you even more. Lets look at CPU 1 in
detail:
> 300423b2000::cpuinfo -v
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
1 300423b2000 1b 4 0 0 no no t-5 3075514a440 acomp
| |
RUNNING <--+ +--> PRI THREAD PROC
READY 0 35876d042c0 sh
EXISTS 0 3075514b140 acomp
ENABLE 0 30041c654a0 dmake
0 308e629b120 dmake
So CPU1 has 4 runnable threads, all at priority 0. The current thread
"acomp" has been on for 5 ticks (5/100) of a second, and this CPU has the
right set of flags set for an online and active CPU.
There's a whole array of dcmds available to ease looking around the system.
You can get a list using the ::dcmds dcmd. Also of interest are the walkers,
which allow you to easily traverse lists of things....CPUs, threads,
pages, etc. For example, let's walk the list of all active thread structures:
> ::walk thread
180e000
2a10001fcc0
2a100017cc0
2a10000fcc0
2a100007cc0
2a10003fcc0
2a100037cc0
2a10002fcc0
2a100027cc0
2a10005fcc0
2a100057cc0
2a10004fcc0
>> More [, , q, n, c, a] ?
(It's a long list, and i'll spare you all the output). Now let's pipe the
output of ::walk thread to ::findstack, which prints a stack trace for a thread
given a thread structure:
> ::walk thread |::findstack
stack pointer for thread 180e000: 180b0f1
[ 000000000180b0f1 sched+0x570() ]
000000000180b211 main+0x458()
000000000180b2f1 _start+0x110()
stack pointer for thread 2a10001fcc0: 2a10001f021
[ 000002a10001f021 ktl0+0x48() ]
000002a10001f171 disp_getwork+0x164()
000002a10001f221 idle+0xdc()
000002a10001f2d1 thread_start+4()
stack pointer for thread 2a100017cc0: 2a100017151
[ 000002a100017151 cv_wait+0x5c() ]
000002a100017201 taskq_thread+0x13c()
000002a1000172d1 thread_start+4()
stack pointer for thread 2a10000fcc0: 2a10000f141
[ 000002a10000f141 sema_p+0x25c() ]
000002a10000f1f1 kmdbmod`kctl_wr_thread+0xbc()
000002a10000f2d1 thread_start+4()
stack pointer for thread 2a100047cc0: 2a100046f51
000002a100046ff1 clock+0x20()
000002a1000470e1 cyclic_softint+0x374()
000002a100047221 cbe_level10+8()
000002a1000472d1 intr_thread+0x294()
000002a10007f2d1 utl0+0x4c()
000002a10001f171 disp_getwork+0x164()
000002a10001f221 idle+0xdc()
000002a10001f2d1 thread_start+4()
>> More [, , q, n, c, a] ?
When a system seems hung. I'll typically break into kmdb, and will do a ::walk thread |::findstack to look for interesting stacks. If there's a pile of threads
all asleep with
mutex_vector_enter() near the top of the stack,
then from there i'll dig down to find who owns what locks, and why the various
lock owners aren't letting go (or why they can't let go because they can't run).
Let's walk all the CPU structures, and for each of them print out the
cpu_thread field (pointer to the kthread_t for the on_proc) thread, and
for each of those threads, let's print out their t_pri field, the thread's
current priority:
> ::walk cpu |::print cpu_t cpu_thread |::print kthread_t t_pri
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0x3b
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
So most of the CPUs are running a priority -1 thread (idle), and one is
running a priority 59 thread (my mdb process). You'll find that ::print is your friend:
> cpu0::print cpu_t cpu_disp |::print disp_t
{
disp_lock = 0
disp_npri = 0xaa
disp_q = 0x309c1a19000
disp_q_limit = 0x309c1a19ff0
disp_qactmap = 0x300425ed380
disp_maxrunpri = 0xffff
disp_max_unbound_pri = 0xffff
disp_nrunnable = 0
disp_cpu = cpu0
}
mdb and kmdb know how to print out *every* structure thanks to the CTF data
compiled in. This eliminates the need for clunky adb macros. Relating to the
scheduler/dispatcher, you may want to consider getting familiar with the
kthread_t,
cpu_t, and
disp_t structures.
So with a basic knowledge of a few structures, a few dcmds and walkers, you're
on your way! The same tricks we've been doing here are of course relevant on
crash dumps. In fact, let's take a look at a crash dump from a recent
interrupt related scheduling bug...
The ecache_scrubber stuck in run queue limbo bug
A few weeks back Shiv Agarwal pointed me at his Sun-Fire 6900 test rig that
seemed to be hard hanging every now and again while executing some Dynamic
Reconfiguration (DR) stress tests. (DR refers to the machine's ability to
add and remove CPU/Memory/IO boards/Disks, etc from a running system without
requiring reboot). Hard hangs are particularly frustrating, because the
system is so hung that it's impossible to break the system into kmdb
for further analysis. This is where the "deadman" feature in Solaris becomes
useful. When enabled, every second, a high (level 15) interrupt will fire
on every CPU, and will execute the
deadman() function, which watches
kernel variable "lbolt" (a tick counter incremented
each time the
clock interrupt fires).
If the deadman observes that lbolt hasen't changed in a while, it will induce
a panic. The deadman feature can be enabled by setting the kernel variable
"snooping" to 1 in /etc/system:
set snooping = 1
... or by hotpatching the variable's value early in boot.
Shiv enabled the deadman, and when he reproduced the problem instead of hard
hanging the system panicked:
> panic[cpu1]/thread=30018dd6cc0: deadman: timed out after 9 seconds of
> clock inactivity
>
> 000002a10005fdf0 genunix:cyclic_fire+64 (300138a9fb8, 20d5a6fe1472,
> 3b9ac800, 0, 300121d9698, 300348b0280)
> %l0-3: 00000300195648c0 0000030012a7eb80 000020d5a6fe17c0 000000003b9aca00
> %l4-7: 00000300348b0320 000020d5a77b5280 0000000000000014 0000000000000004
> 000002a10005fea0 unix:cbe_level14+38 (0, 0, e, 30006356000, 400080, 100aef8)
> %l0-3: 000000000000022e 0000000000010002 0000000001911ab8 0000000000000000
> %l4-7: 0000000000000001 0000000000000001 0000000001911ab0 0000000001910470
> 000002a10005ff50 unix:current_thread+140 (1, 6912166044, 6912166040,
> 18af260, f0d0e, f0d0f)
> %l0-3: 0000000001007674 000002a102682a31 000000000000000e 0000000000010002
> %l4-7: 00000000018670b0 0000000000000000 000000000000000b 000002a1026832e0
> 000002a102683380 unix:mp_cpu_quiesce+90 (180c000, 200, 1, 0, 3938700, fc00)
> %l0-3: 0000000000000000 0000000000000000 0000000000000000 00000000018af260
> %l4-7: 00000000018af000 00000000018aa660 0000000000000000 000000000085daf8
> 000002a102683430 sbdp:sbdp_cpu_poweroff+d4 (180c000, 1c00, 0, 0, 0,
> f082e520)
> %l0-3: 00000000000000bb 00000000705e3e20 00000000705e3c00 0000030015828000
> %l4-7: 000000007043dc00 000007fff0914000 0000000000001fff 00000000705e3e18
> 000002a102683500 unix:cpu_poweroff+24 (180c000, 10, 22, 2f, 2f, 180c000)
> %l0-3: 0000000000000000 00000000705e2900 00000000705e2930 0000000000000030
> %l4-7: 000000000190be28 000002a102683590 fffffffff082e520 00000000705e2800
> 000002a1026835b0 sbd:sbd_pre_detach_cpu+1b0 (300062944c0, 300139c0000,
> 180c000, 0, 30006294510, 30012bdd2c8)
> %l0-3: 0000030015848000 000000000000008f 0000030015b65658 00000000000bab15
> %l4-7: 0000000000000000 ffffffffffffffff 0000000000000001 0000000000000000
> 000002a102683660 sbd:sbd_dev_unconfigure+50 (300062944c0, 300139c0000,
> 1, 0, 0, 2a10268371c)
> %l0-3: 0000000000000005 0000000000000001 0000000000000019 000003000f675180
> %l4-7: 0000000000000000 0000000000000001 0000000000000000 0000000000000001
> 000002a102683720 sbd:sbd_exec_op+94 (300062944c0, 0, 445000, 7, 7, 8c)
> %l0-3: 0000030015848000 00000000704bf800 0000000000000003 ffffffffffffff1f
> %l4-7: ffffffffffffffd3 0000000000445208 0000030006294510 0000000000000000
> 000002a1026837f0 sbd:sbd_ioctl+150 (a500000000, 445208, 293c0, 100003,
> 2a102683977, 445000)
> %l0-3: 0000000000000003 00000000704bf800 0000000000445208 0000000000445201
> 000002a102683990 genunix:ioctl+184 (3, 30019513580, 293c0, ff1d3108,
> ff1ea000, 445208)
> %l0-3: 000003001301c590 000003001301c5b0 0000000000000004 0000000000000001
> %l4-7: 000003001301c5b0 0000000000000000 0000000000000000 0000000000000000
>
> panic: entering debugger (continue to save dump)
[1]> $c
kaif_enter(18340f8, 0, 11e0a14, 1828400, 11e0a14, 4c00)
panicsys+0x408(18b3410, 18aa2b0, 30006356000, 1, 20040, 1883800)
vpanic+0xcc(11d6c20, 2a10005fe78, 0, 0, 43c998540, 1b5872)
panic+0x1c(11d6c20, 9, 18c9000, 11d6c00, 18c9000, 0)
cyclic_fire+0x64(300138a9fb8, 20d5a6fe1472, 3b9ac800, 0, 300121d9698,
300348b0280)
cbe_level14+0x38(0, 0, e, 30006356000, 400080, 100aef8)
current_thread+0x140(1, 6912166044, 6912166040, 18af260, f0d0e, f0d0f)
mp_cpu_quiesce+0x90(180c000, 200, 1, 0, 3938700, fc00)
sbdp`sbdp_cpu_poweroff+0xd4(180c000, 1c00, 0, 0, 0, f082e520)
cpu_poweroff+0x24(180c000, 10, 22, 2f, 2f, 180c000)
sbd`sbd_pre_detach_cpu+0x1b0(300062944c0, 300139c0000, 180c000, 0, 30006294510,
30012bdd2c8)
sbd`sbd_dev_unconfigure+0x50(300062944c0, 300139c0000, 1, 0, 0, 2a10268371c)
sbd`sbd_exec_op+0x94(300062944c0, 0, 445000, 7, 7, 8c)
sbd`sbd_ioctl+0x150(a500000000, 445208, 293c0, 100003, 2a102683977, 445000)
ssm`ssm_ioctl+0xcc(a500000000, 3000313f880, 293c0, 100003, 3001218cf80, 445208)
ioctl+0x184(3, 30019513580, 293c0, ff1d3108, ff1ea000, 445208)
syscall_trap32+0xcc(3, 445208, 293c0, ff1d3108, ff1ea000, b)
Ok, so the golden question. Why wasn't the clock interrupt able to fire so
that it could change lbolt to make deadman() happy?
Using ::cpuinfo to look around:
> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 0000180c000 2d 1 9 -1 no yes t-31536 2a1015a3cc0 (idle)
1 0000187e7a0 1b 9 0 49 no no t-1 30018dd6cc0 cfgadm
2 300061d2000 d 10 0 169 no no t-0 2a101433cc0 sched
3 3000631e000 2d 0 0 -1 no no t-31541 2a102053cc0 (idle)
8 30006340000 d 8 0 169 no no t-0 2a101943cc0 sched
10 30008c74000 1d 10 0 169 no no t-0 2a10117bcc0 sched
11 30008864000 d 6 0 169 no no t-0 2a101f2bcc0 sched
16 30008442000 1d 9 0 169 no no t-0 2a101d23cc0 sched
17 3000da1e000 1d 6 0 169 no no t-0 2a100611cc0 sched
18 3000c9fc000 1d 6 0 169 no no t-0 2a101983cc0 sched
19 30005bda000 1d 8 0 169 no no t-0 2a1020a3cc0 sched
Who are the priority 169 threads running on 2, 8-19?
> 2a101433cc0::findstack
stack pointer for thread 2a101433cc0: 2a101433021
[ 000002a101433021 panic_idle+0x1c() ]
000002a1014330d1 ktl0+0x48()
000002a101433221 cpu_pause+0x5c()
000002a1014332d1 thread_start+4()
>
Ah, those are pause threads. There are certain lists and data structures
walked by the dispatcher code without any locks held. One example is the
per processor set list of active CPUs walked in
disp_anywork(), a function used by the dispatcher to determine if there's any thing better
to run in the system besides
idle(). This enables lockless
performance in the dispatcher, at the cost of requiring that all CPUs be
paused (run high priority threads that just spin) when those lists/structures
are changed around (ensuring no other CPUs are executing dispatcher code)
...which of course happens
very rarely.
OK, so some of the CPUs are running their pause threads, while CPUs 0 and 3 are
idle(). The thread on CPU1 is in the middle of a DR operation. Looking at
the first argument to
mp_cpu_quiesce() above, we see it's trying to power off and quiesce CPU0 of interrupts.
> 30018dd6cc0::findstack
stack pointer for thread 30018dd6cc0: 2a102682ae1
000002a102682b81 mp_cpu_quiesce+0x90()
000002a102682c31 sbdp_cpu_poweroff+0xd4()
000002a102682d01 cpu_poweroff+0x24()
000002a102682db1 sbd_pre_detach_cpu+0x1b0()
000002a102682e61 sbd_dev_unconfigure+0x50()
000002a102682f21 sbd_exec_op+0x94()
000002a102682ff1 sbd_ioctl+0x150()
000002a1026830c1 ssm_ioctl+0xcc()
000002a102683191 ioctl+0x184()
000002a1026832e1 syscall_trap32+0xcc()
> mp_cpu_quiesce+0x90::dis
mp_cpu_quiesce+0x80: nop
mp_cpu_quiesce+0x84: clr %g2
mp_cpu_quiesce+0x88: ba +0x34 <mp_cpu_quiesce+0xbc>
mp_cpu_quiesce+0x8c: cmp %g2, 0
mp_cpu_quiesce+0x90: call +0x1db71c <drv_usecwait>
mp_cpu_quiesce+0x94: mov %i2, %o0
mp_cpu_quiesce+0x98: add %l7, 1, %l7
mp_cpu_quiesce+0x9c: cmp %l7, %i4
mp_cpu_quiesce+0xa0: bl,pt %icc, -0x40 <mp_cpu_quiesce+0x60>
Which, looking at the C code puts us here in mp_cpu_quiesce():
/*
* Declare CPU as no longer being READY to process interrupts and
* wait for them to stop. A CPU that is not READY can no longer
* participate in x-calls or x-traps.
*/
cp->cpu_flags &= ~CPU_READY;
CPUSET_DEL(cpu_ready_set, cpuid);
membar_sync();
for (i = 0; i < sanity_limit; i++) {
if (cp->cpu_intr_actv == 0 &&
cp->cpu_thread == cp->cpu_idle_thread) {
found_intr = 0;
break;
}
DELAY(1); <-------------- **
}
So CPU1 is caught in the
for loop waiting for CPU0's cpu_intr_actv
bitmask to become empty (cpu_intr_actv is a bitmap of active interrupts that
lives in the CPU structure). Sure enough, it isn't:
> 180c000::print cpu_t cpu_intr_actv
cpu_intr_actv = 0x200
Taking a close look at CPU0:
> 180c000::cpuinfo -v
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 0000180c000 2d 1 9 -1 no yes t-31536 2a1015a3cc0 (idle)
| | |
RUNNING <--+ | +--> PIL THREAD
QUIESCED | 9 2a101423cc0
EXISTS |
OFFLINE +--> PRI THREAD PROC
168 2a101423cc0 sched
> 2a101423cc0::findstack
stack pointer for thread 2a101423cc0: 2a1014230d1
000002a101423171 scrub_ecache+0x80()
000002a101423221 swtch+0x78()
000002a1014232d1 intr_thread+0x240()
Hey look, there's a level 9 interrupt thread runnable and eagerly
awaiting the CPU to grab it, and run it.
What's an interrupt thread? Interrupt service code in Solaris executes
while the existing thread on the processor is "pinned". That is, when an
interrupt happens, the CPU immediately jumps from executing the interrupted
thread to the interrupt service code. When complete, control is restored to the
interrupted thread and life moves on.
It is not often that interrupts lose the CPU prior to finishing execution,
but it does happen, most often because the interrupt tries to grab an adaptive
lock that is already held. When this happens, the interrupt is turned into a
real thread (rather than leaching off the interrupted thread), and blocks as
usual. Interrupt threads, like real time threads are at the top of the food
chain in terms of priorities (With the RT class loaded, 160-169). PIL 10
interrupts and below are permitted to grab adaptive locks, and can therefore
block (and be turned into real threads). Interrupts above PIL 10 may
never block. The dispatcher operates at PIL 11, and if it were possible
to take an interrupt that later blocked while already in the dispatcher all
sorts of nastiness could ensue.
But for our runnable interrupt thread, despite it's eagerness to run again,
it's day will never come. CPU0 is idle(), but looking at the flags we see
that it's OFFLINE and QUIESCED. This is the state your CPU assumes after you
use psradm -f, for example. The CPU isn't permitted to execute any threads,
and strictly speaking it's dangerous for it be servicing interrupts. The
offline CPU will remain parked in this loop in
idle()
/*
* If CPU is completely quiesced by p_online(2), just wait
* here with minimal bus traffic until put online.
*/
while (cp->cpu_flags & CPU_QUIESCED)
(*idle_cpu)();
What happens if the interrupt thread blocks? You're looking at it.
The thread will remain unrunnable. It's only hope is for the QUIESCED flag
to be cleared should the CPU be brought back online.
So did this interrupt block? Not exactly. The ecache scrubbing code is brave,
and wants to scrub the caches of even offline CPUs. It does this by having
one CPU send a cross trap to the target (offline) CPU, which causes the target
to take a PIL 9 interrupt. Again, this is very dicey doing this on offline
CPUs, but looking at the interrupt code executed there's no evidence of it
grabbing any locks. So how did it lose the CPU? Looking at the interrupt
thread stack we find our clue:
> 2a101423cc0::findstack
stack pointer for thread 2a101423cc0: 2a1014230d1
000002a101423171 scrub_ecache+0x80()
000002a101423221 swtch+0x78()
000002a1014232d1 intr_thread+0x240()
>
The topmost stack frame already popped off, so really the last bit of code
we executed was switch, called by intr_thread+0x240:
intr_thread+0x234: ldub [%o2 + 0xe9], %o5
intr_thread+0x238: brz,pt %o5, +0x1c <intr_thread+0x254>
intr_thread+0x23c: mov %o2, %l3
intr_thread+0x240: call +0x486a8 <preempt>
intr_thread+0x244: mov %o3, %l2
intr_thread+0x248: mov %l3, %o2
intr_thread+0x24c: mov %l2, %o3
Oh oh. Looks like intr_thread+0x240 called preempt(), which then called
swtch(). This thread
voluntarily gave up the CPU. Why?
Code exists in
intr_thread() that prevents that pathological
scenario of a thread being pinned by an interrupt for an excessive period of
time. Some networking interrupt handlers will happily stay in interrupt context
as long as there's a steady stream of packets to service, so intr_thread(),
once it detects that we've processed 16 or so interrupts without giving the
poor pinned thread a chance to run, will force itself, and whomever it's
pinning to preempt to give the interrupted thread a chance to migrate
elsewhere. Unfortunately, while it may have had the best intentions, it did
this on a QUIESCED CPU, which made the ecache scrubber interrupt thread
unrunnable. The ecache scrubber interrupt is sent regularly via a
cyclic firing on another CPU (the clock, and deadman are both cyclic based).
So with the ecache scrubber left unrunnable, CPU0 became stuck at PIL 9, which
left CPU1 stuck waiting in a loop for CPU0's PIL to drop. But CPU1 got stuck
after it had already paused CPUs.
(Done via
pause_cpus()
One key point I should mention, the pause thread raises it's CPU's PIL to 10.
Hmm, I wonder where clock should be running...
> ::cycinfo -v
...
CPU CYC_CPU STATE NELEMS ROOT FIRE HANDLER
10 30011394900 online 12 300189d9480 a926ae697c00 do_scrub_offline
6
|
+------------------+------------------+
10 1
| |
+---------+--------+ +---------+---------+
9 5 3 2
| | | |
+----+----+ +----+----+ +----+----+ +----+---+
4 0 11 7 8
| | | | |
+-+-+ +--+-+ +-+-+ +--+-+ +-+-+
ADDR NDX HEAP LEVL PEND FIRE USECINT HANDLER
300189d9480 0 8 low 212603 a926cd692a74 705032 cheetah_nudge_buddy
300189d94a8 1 2 low 14989212 a926ae697c00 10000 do_scrub
300189d94d0 2 6 low 14989212 a926ae697c00 10000 do_scrub
300189d94f8 3 5 low 14989212 a926ae697c00 10000 do_scrub
300189d9520 4 7 high 0 a926c7527474 1000000 deadman
300189d9548 5 4 low 14989212 a926ae697c00 10000 do_scrub_offline
300189d9570 6 0 low 14989212 a926ae697c00 10000 do_scrub_offline
300189d9598 7 10 low 14989212 a926ae697c00 10000 do_scrub_offline
300189d95c0 8 11 low 149892 a926c6410000 1000000 dtrace_state_deadman
300189d95e8 9 3 low 15139104 a926adf27592 9900 dtrace_state_clean
300189d9610 10 1 lock 14989212 a926ae697c00 10000 clock <---
300189d9638 11 9 high 0 a926ae697c00 10000 cbe_hres_tick
...
Wow, CPU10 has the clock cyclic, and there's a *lot* of pending clock
interrupts. Note that the clock fires at PIL 10, but the pause
thread running at that CPU has masked interrupts PIL10 and below. :P
Well, there it is, that's why clock couldn't bump lbolt, and after 9 seconds
the deadman pulled the plug. A pause_thread running on CPU0 (in the past) was
probably also responsible for backing up the number of pending PIL 9 scrub
requests, such that when PIL was lowered, the overwhelming number of pending
PIL 9 interrupts triggered the preempt mechanism in intr_thread().
The fix for this bug was pretty trivial.
intr_thread shouldn't employ it's preempt() mechanism if the CPU is quiesced.
This was bug
6256245.
Fun stuff. In another posting, we'll look at how DTrace can be an indispensable
tool in tracking down scheduling problems, and the joys of the sched provider.
Technorati Tag:
OpenSolaris
Technorati Tag:
Solaris
Technorati Tag:
DTrace
Technorati Tag:
mdb