
Wednesday June 29, 2005
Current activities
It's been a little while since I have written about a Solaris kernel topic. I'll soon blog on how preemption (both user and kernel level) works in Solaris. With the help of Jonathan Chew, Andrei Doffee and Eric Saxe in Solaris Kernel Development, I'm currently working on a project which will let you specify multi-CPU binding and define affinity between processes/lwps. We are still at the stage of drafting and developing a prototype. I'm also getting to learn lgroup (latency group) and HLS (Hierarchical Lgroup Support). lgroups improve performance on NUMA (non-uniform memory access) machines like the E15k, E25k, Serengeti 6800 and so on. It is my pleasure to work with the Solaris Kernel Development engineers on this project. I'm sure I'll learn loads of things as we go along.
In the meantime, we recently cracked a problem in Solaris which delayed the response of a thread when a signal was pending. These days my colleague Sudheer Abdul Salam (we call him hot gun in the Solaris Kernel Sustaining group) partners with all of us when working on a bug. We truly believe in team building and don't hesitate to take help from others. At the same time, we don't hesitate to play pranks either :-)
Now that our group owns picld(1M), I've also been a little busy with a few bugs in this area. picld(1M) has interfaces which allow you to get the platform information in tree form (an abstract configuration of the system). The current users of the picld(1M) interfaces are prtpicl(1M) and SunMC, and I'm sure third-party applications will be using picld too.
This weekend (2nd/3rd July) I'm going for a long drive and a long trek too. I'm hoping that it'll just rain and rain. The weather in Bangalore is rejuvenating us (my uncle and myself), and the Western Ghats are attracting us again with their lush green forests and beautiful mountain ranges.
(2005-06-29 20:28:18.0)

Monday June 27, 2005
Rope leader returned from Mt. Everest Expedition
This is what my Rope leader of 180th Basic Mountaineering Course had to say when he returned from Mt. Everest Expedition this year. We all are praying for his fast recovery...
------------<8@>------------
hi everybody
it's been quite sometime even after icame back from everest that i wrote something here. ofcource everybody knows that my right hand is temporarily unservicible, but my left hand is still fine. onlything i am quite slow with it. anyway, by now everybody is quite aware abt our expedition thru media. i along with three other members tried the peak on 21st. we cud reach only upto 8400mtr and had come back because of technical problem.the oxygen system we were using was not suitable for heigh altitude. i was forced to change my mask thrice. even then i cud not fix it. in that process my inner gloves become cold and next day morning ifound my fingers are numb. 4 of us beaten back by luck rather i shud say we came back because of inexperience. otherwise we were fighting fit and till we started and faced the problem we were determind to do it. but our bitter experience paid dearly to the team. we changed the oxygen system of our second team to the traditional one and on our advice few more things were also sorted out. the second team summited smootly. but sqn/ldr chaitanya caught into blizard while coming back and got lost on the way back camp 3. mountain is merciless sometime.
rt now i have 3rd degree frost bite in all the fingers of my rt hand. middle and ring finger are little serious. so i may have to lose some part of those two fingers. otherwise i am fine. presently getting treated in air force stn hindon and admitted in 11 af hospital in the same place. next time i shall try to postthe phone no of the hospital itself. bye and take care guys.
------------<8@>------------
(2005-06-27 01:02:21.0)

Monday June 13, 2005
Dispatcher locks and Bug 5017148
As part of the OpenSolaris release, I'm going to describe dispatcher
locks, thread locks and a bug which I root-caused last year. The investigation
didn't take much time, but it was an interesting one because doors do magic
in the kernel at the time of handoff to the other thread (client to server
or server to client). So let me begin with what a dispatcher lock is:
1. What's a dispatcher lock
A dispatcher lock is a one-byte lock (disp_lock_t) which is acquired
at high pil (processor interrupt level), namely DISP_LEVEL. DISP_LEVEL
is the interrupt level at which dispatcher operations should be performed.
Other symbolic interrupt levels, viz. CLOCK_LEVEL and LOCK_LEVEL,
are defined in machlock.h.
The following dispatcher lock interfaces are implemented
in disp_lock.c:
disp_lock_init(): initializes a dispatcher lock.
disp_lock_destroy(): destroys a dispatcher lock.
disp_lock_enter(): acquires a dispatcher lock.
disp_lock_exit(): releases a dispatcher lock and checks for kernel preemption.
disp_lock_exit_nopreempt(): releases a dispatcher lock without checking
for kernel preemption.
disp_lock_enter_high(): acquires another dispatcher lock when the thread
is already holding a dispatcher lock.
disp_lock_exit_high(): releases the top level dispatcher lock.
Here are the facts about dispatcher locks:
(a) Being spin locks acquired at a high interrupt level, dispatcher
locks should be held only for a short duration, and code holding one
shouldn't make blocking calls.
(b) While releasing a dispatcher lock, you can be preempted if
cpu_kprunrun (kernel preemption) is set. Use disp_lock_exit_nopreempt()
if you don't want to be preempted.
(c) While holding a dispatcher lock, you are not preemptible.
(d) Since acquiring a dispatcher lock raises pil to DISP_LEVEL, the old
pil is saved in t_oldspl of the thread structure (kthread_t).
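To make facts (b) and (d) concrete, here is a minimal userland sketch of a one-byte spin lock with a simulated pil. It is not the kernel code: every sketch_* name, the value of SKETCH_DISP_LEVEL and the global variables standing in for the pil, t_oldspl and cpu_kprunrun are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins; the real definitions live in the kernel. */
#define SKETCH_DISP_LEVEL 11               /* assumed value, illustration only */

typedef atomic_uchar sketch_disp_lock_t;   /* one byte, like disp_lock_t */

static int sketch_pil;            /* simulated current interrupt level */
static int sketch_oldspl;         /* simulated t_oldspl of the current thread */
static int sketch_kprunrun;       /* simulated cpu_kprunrun */
static int sketch_preempt_calls;  /* how often we were "preempted" */

static void
sketch_preempt(void)
{
    sketch_kprunrun = 0;
    sketch_preempt_calls++;
}

static void
sketch_disp_lock_enter(sketch_disp_lock_t *lp)
{
    int old = sketch_pil;

    if (sketch_pil < SKETCH_DISP_LEVEL)
        sketch_pil = SKETCH_DISP_LEVEL;    /* raise pil before spinning */
    while (atomic_exchange_explicit(lp, 0xff, memory_order_acquire) != 0)
        ;                                  /* spin until the byte was 0 */
    sketch_oldspl = old;                   /* fact (d): remember the old pil */
}

static void
sketch_disp_lock_exit_nopreempt(sketch_disp_lock_t *lp)
{
    assert(atomic_load(lp) == 0xff);       /* must be held */
    atomic_store_explicit(lp, 0, memory_order_release);
    sketch_pil = sketch_oldspl;            /* restore the saved pil */
}

static void
sketch_disp_lock_exit(sketch_disp_lock_t *lp)
{
    sketch_disp_lock_exit_nopreempt(lp);
    if (sketch_kprunrun)                   /* fact (b): preemption check */
        sketch_preempt();
}
```

With sketch_kprunrun set, releasing through sketch_disp_lock_exit() ends in a simulated preemption, while sketch_disp_lock_exit_nopreempt() leaves the flag pending.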
2. What's a thread lock
A thread lock is a per-thread entity which protects t_state
and the state-related flags of a kernel thread. The thread lock hangs off
kthread_t as t_lockp. t_lockp is a pointer to the thread's dispatcher lock,
and the pointer changes whenever the state of the kernel thread changes.
You acquire the thread lock by calling the thread_lock() routine with the
kernel thread pointer. thread_lock() is responsible for getting the correct
dispatcher lock for the thread. The dance done by thread_lock() is
interesting because t_lockp is a pointer and can change while we are
spinning for a dispatcher lock. Hence thread_lock() saves the t_lockp
pointer and, after acquiring the lock, checks that t_lockp still points
to it, which ensures that we acquire the right thread lock.
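The dance can be sketched in portable C11 as a retry loop. This is a simplified illustration, not the kernel's thread_lock(): the sketch_* names and types are hypothetical, and the pil handling is left out.

```c
#include <assert.h>
#include <stdatomic.h>

typedef atomic_uchar sketch_lock_t;        /* one-byte spin lock */

typedef struct sketch_thread {
    sketch_lock_t *_Atomic t_lockp;        /* lock currently protecting us */
} sketch_thread_t;

static void
sketch_lock_enter(sketch_lock_t *lp)
{
    while (atomic_exchange_explicit(lp, 0xff, memory_order_acquire) != 0)
        ;                                  /* spin */
}

static void
sketch_lock_exit(sketch_lock_t *lp)
{
    assert(atomic_load(lp) == 0xff);       /* must be held */
    atomic_store_explicit(lp, 0, memory_order_release);
}

/* Acquire the thread lock of t; returns the lock that was taken. */
static sketch_lock_t *
sketch_thread_lock(sketch_thread_t *t)
{
    for (;;) {
        sketch_lock_t *lp = atomic_load(&t->t_lockp);  /* save the pointer */

        sketch_lock_enter(lp);
        if (lp == atomic_load(&t->t_lockp))
            return lp;         /* unchanged: this really is the thread lock */
        sketch_lock_exit(lp);  /* the thread changed state meanwhile: retry */
    }
}
```

The re-check after acquiring is the important step: without it, a caller could end up holding a lock that no longer protects the thread.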
Now let's take a look at the interfaces in the Solaris kernel, which
are found in disp_lock.c
and thread.h:
thread_lock() is called to acquire the thread lock.
thread_unlock() is called to release the thread lock, and it checks
for kernel preemption.
thread_lock_high() is called to acquire another thread lock
while holding one.
thread_unlock_high() is called to release a thread lock while
still holding another one.
thread_unlock_nopreempt() is called to release the thread lock without
checking for kernel preemption.
3. Various types of thread locks in Solaris Kernel
Now that I've described the thread lock, it's very important
to understand which dispatcher lock is acquired depending upon
the state of the thread. To find this out, you first need to
understand the one-to-one mapping between the state of the thread and
its corresponding dispatcher lock:
TS_RUN (runnable)    ---> disp_lock of the dispatch queue in a CPU (cpu_t)
                          or of the global preemption queue of a CPU partition
TS_ONPROC (running)  ---> cpu_thread_lock in a CPU (cpu_t)
TS_SLEEP (sleep)     ---> sleepq bucket lock or turnstile chain lock
TS_STOPPED (stopped) ---> stop_lock (a global dispatcher lock) for stopped threads
There are two more global dispatcher locks in the Solaris kernel:
shuttle_lock and transition_lock. When the thread lock of a thread points
to shuttle_lock, the thread is sleeping on a door; when it points to
transition_lock, the thread is in transition to another
state (for instance when the state of a thread sleeping on a semaphore
is changed from TS_SLEEP to TS_RUN, or during yield()).
transition_lock is always held and is never released.
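The mapping above can be restated as a small lookup function. This is only a reading aid: the sk_* names, the enum and the two flag parameters are hypothetical, and the strings are labels, not kernel symbols to link against.

```c
#include <assert.h>
#include <string.h>

typedef enum {
    SK_TS_SLEEP, SK_TS_RUN, SK_TS_ONPROC, SK_TS_STOPPED
} sk_state_t;

/* Which dispatcher lock does t_lockp point to, per the mapping above? */
static const char *
sk_thread_lock_name(sk_state_t state, int in_transition, int on_door)
{
    if (in_transition)
        return "transition_lock";   /* always held, never released */
    if (on_door)
        return "shuttle_lock";      /* thread sleeping on a door */
    switch (state) {
    case SK_TS_RUN:
        return "disp_lock";
    case SK_TS_ONPROC:
        return "cpu_thread_lock";
    case SK_TS_SLEEP:
        return "sleepq bucket lock or turnstile chain lock";
    case SK_TS_STOPPED:
        return "stop_lock";
    }
    return "unknown";
}

/* A thread being preempted: ONPROC, then in transition, then RUN. */
static void
sk_check_preempt_path(void)
{
    assert(strcmp(sk_thread_lock_name(SK_TS_ONPROC, 0, 0),
        "cpu_thread_lock") == 0);
    assert(strcmp(sk_thread_lock_name(SK_TS_ONPROC, 1, 0),
        "transition_lock") == 0);
    assert(strcmp(sk_thread_lock_name(SK_TS_RUN, 0, 0), "disp_lock") == 0);
}
```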
4. Examples of thread lock
Now let's understand which thread locks are involved from
wakeup (or unsleep) to onproc (running) of a thread. Let's assume
that T1 (thread 1) is blocked on a condition variable CV1 and T2 (thread
2) signals T1 to wake it up. First cv_signal()
grabs the sleepq bucket lock and decrements the waiters count on CV1. It
then calls sleepq_wakeone_chan()
to wake up T1. sleepq_wakeone_chan()
unlinks T1 from the sleepq list (using t_link of
kthread_t) and calls CL_WAKEUP
(the scheduling class specific wakeup routine). Assuming T1 is in the time
sharing class (TS), ts_wakeup()
gets called. ts_wakeup() in turn calls a dispatcher enqueue routine
(setfrontdq() or setbackdq()), which changes the state of T1 to TS_RUN
and changes t_lockp to point to the disp_lock of the chosen CPU. Finally
sleepq_wakeone_chan() drops the disp_lock of the dispatch queue, and the
sleepq dispatcher lock is released in cv_signal().
Once T1 is chosen to run, disp()
removes T1 from the dispatch queue of the CPU and changes its state
to TS_ONPROC and its t_lockp to the cpu_thread_lock of the CPU.
void
cv_signal(kcondvar_t *cvp)
{
    condvar_impl_t *cp = (condvar_impl_t *)cvp;

    /* make sure the cv_waiters field looks sane */
    ASSERT(cp->cv_waiters <= CV_MAX_WAITERS);
    if (cp->cv_waiters > 0) {
        sleepq_head_t *sqh = SQHASH(cp);
        disp_lock_enter(&sqh->sq_lock);
        ASSERT(CPU_ON_INTR(CPU) == 0);
        if (cp->cv_waiters & CV_WAITERS_MASK) {
            kthread_t *t;
            cp->cv_waiters--;
            t = sleepq_wakeone_chan(&sqh->sq_queue, cp);
            /*
             * If cv_waiters is non-zero (and less than
             * CV_MAX_WAITERS) there should be a thread
             * in the queue.
             */
            ASSERT(t != NULL);
        } else if (sleepq_wakeone_chan(&sqh->sq_queue, cp) == NULL) {
            cp->cv_waiters = 0;
        }
        disp_lock_exit(&sqh->sq_lock);
    }
}
The second example comes from preemption. There are two types of
preemption in the Solaris kernel, viz. user preemption
(cpu_runrun) and kernel preemption (cpu_kprunrun). Assume that T1 is
being preempted in favour of a higher priority thread. T1 calls preempt()
once it realizes that it has to give up the CPU (there are hooks in the
Solaris kernel to determine this). preempt()
first grabs the thread lock on itself (effectively cpu_thread_lock) and calls
THREAD_TRANSITION()
to change t_lockp to transition_lock. Note that the state of T1
is still TS_ONPROC while t_lockp points to transition_lock, because
T1 is in the transition phase (from TS_ONPROC -> TS_RUN). THREAD_TRANSITION()
also releases the previous dispatcher lock, because transition_lock is always
held. preempt()
then calls CL_PREEMPT(), the scheduling class specific preemption routine,
to enqueue T1 on a particular CPU. From here on it's the same as described
in the first example.
void
preempt()
{
    kthread_t *t = curthread;
    klwp_t *lwp = ttolwp(curthread);

    if (panicstr)
        return;

    TRACE_0(TR_FAC_DISP, TR_PREEMPT_START, "preempt_start");

    thread_lock(t);

    if (t->t_state != TS_ONPROC || t->t_disp_queue != CPU->cpu_disp) {
        /*
         * this thread has already been chosen to be run on
         * another CPU. Clear kprunrun on this CPU since we're
         * already headed for swtch().
         */
        CPU->cpu_kprunrun = 0;
        thread_unlock_nopreempt(t);
        TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");
    } else {
        if (lwp != NULL)
            lwp->lwp_ru.nivcsw++;
        CPU_STATS_ADDQ(CPU, sys, inv_swtch, 1);
        THREAD_TRANSITION(t);
        CL_PREEMPT(t);
        DTRACE_SCHED(preempt);
        thread_unlock_nopreempt(t);

        TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");

        swtch();    /* clears CPU->cpu_runrun via disp() */
    }
}
5. An example of a dispatcher lock and Bug 5017148.
Apart from illustrating a dispatcher lock, I'll also describe
a problem which I found a while back. It involves the kernel door
implementation too.
Whenever I look at a crash dump from a system hang, I usually begin
by looking at what the CPUs are doing:
> ::cpuinfo
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 0001041d2b0  1b    1    0  60   no    no    t-0 3001ba04900 cluster
  1 30019fe4030  1d    2    0 101   no    no    t-0 3003d873a40 rgmd
  2 3001a38aab8  1d    1    0 165  yes   yes    t-0 2a1003ebd20 sched
  3 0001041b778  1d    2    0  60  yes   yes    t-0 3004fac3c80 cluster
CPU 0 is spinning for a mutex 0x30001d7cae0 which is held by
thread 0x3004fac3c80 running on CPU 3. Please note that a thread spins
for an adaptive mutex only while the owner is running, and in this case
the owner of the mutex happens to be onproc on CPU 3.
> 0x30001d7cae0$<mutex
0x30001d7cae0: owner/waiters
3004fac3c80
>
CPU 3 is our clock interrupt CPU (run ::cycinfo -v to see
where the clock handler is registered), and thread 0x3004fac3c80
on CPU 3 is spinning in cv_block()
for a sleepq bucket lock (sleepq_head[]).
To find out which sleepq bucket this thread is looking for,
we can look at its wait channel t_wchan
(t_lwpchan.lc_wchan) and apply the hash function SQHASH();
that gave me the right bucket. Since the thread is already holding its
thread lock (effectively cpu_thread_lock of CPU 3) while spinning for the
sleepq bucket lock, clock interrupts are blocked too. This can be verified
from the pending clock interrupts in ::cycinfo -v.
Let's disassemble cv_block() where thread 3004fac3c80 is stuck:
cv_block+0x9c:  add   %i2, 8, %i0
cv_block+0xa0:  call  -0x460e0 <disp_lock_enter_high>
cv_block+0xa4:  mov   %i0, %o0
> 0x3004fac3c80::print kthread_t t_lockp
t_lockp = cpu0+0xb8
> cpu0=J
1041b778        // CPU 3
> 0x3004fac3c80::print kthread_t ! grep wchan
lc_wchan = 0x3006fc52d20
And the sleepq bucket happens to be:
> 0x10471d88::print sleepq_head_t
{
    sq_queue = {
        sq_first = 0x3001b476ee0
    }
    sq_lock = 0xff        <----- dispatcher lock is held
}
Thread 3003d873a40 running on CPU 1 is spinning in thread_lock_high().
> 3003d873a40::findstack
stack pointer for thread 3003d873a40: 2a1025964a1
[ 000002a1025964a1 panic_idle+0x1c() ]
000002a102596551 prom_rtt()
000002a1025966a1 thread_lock_high+0xc()
000002a102596751 sema_p+0x60()
000002a102596801 kobj_open+0x84()
000002a1025968d1 kobj_open_file+0x44()
[.]
000002a102597011 xdoor_proxy+0x20c()
000002a1025971f1 door_call+0x204()
000002a1025972f1 syscall_trap32+0xa8()
>
Now this is an interesting stack. Looking at the sema_p() code,
we see that it first grabs the sleepq bucket lock and then tries
to grab the thread lock.
Since the hash function SQHASH() returns the same index
for 0x3006fc52d20 and 0x300819f3118, sema_p() is stuck
on the thread lock which is held by the thread running on CPU 3, and the
thread running on CPU 3 is stuck because the sleep queue bucket lock is
held by the thread running on CPU 1.
> 0x3003d873a40::print kthread_t t_lockp
t_lockp = cpu0+0xb8
> cpu0+0xb8/x
cpu0+0xb8: ff00
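The circular wait described above can be written down as a tiny wait-for check. This is a toy model under stated assumptions: each CPU holds at most one lock and waits for at most one, and the sk_* names and lock ids are hypothetical labels for the two locks seen in this dump.

```c
#include <assert.h>

enum { SK_NCPU = 4, SK_NONE = -1 };

/* Hypothetical ids for the two dispatcher locks involved. */
enum { SK_CPU3_THREAD_LOCK = 0, SK_SLEEPQ_BUCKET_LOCK = 1 };

typedef struct sk_cpu {
    int holds;   /* lock id held by the thread on this CPU, or SK_NONE */
    int wants;   /* lock id it is spinning for, or SK_NONE */
} sk_cpu_t;

/* Which CPU holds the given lock? SK_NONE if nobody does. */
static int
sk_owner(const sk_cpu_t *cpus, int lock)
{
    for (int i = 0; i < SK_NCPU; i++)
        if (cpus[i].holds == lock)
            return i;
    return SK_NONE;
}

/* Follow "wants -> owner" edges; returning to start means deadlock. */
static int
sk_deadlocked(const sk_cpu_t *cpus, int start)
{
    int cur = start;

    assert(start >= 0 && start < SK_NCPU);
    for (int steps = 0; steps < SK_NCPU; steps++) {
        int owner;

        if (cpus[cur].wants == SK_NONE)
            return 0;
        owner = sk_owner(cpus, cpus[cur].wants);
        if (owner == SK_NONE)
            return 0;
        if (owner == start)
            return 1;
        cur = owner;
    }
    return 0;
}
```

Filling in CPU 1 (holds the sleepq bucket lock, wants cpu0+0xb8) and CPU 3 (holds cpu0+0xb8, wants the sleepq bucket lock) makes sk_deadlocked() report a cycle from either side.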
Now let's find the real cause of this deadlock. Looking at t_cpu
of thread 0x3003d873a40, we see that the thread is running on CPU 1,
yet its t_lockp points to CPU 3's cpu_thread_lock. This is
really nasty, as we would expect it to point to CPU 1's cpu_thread_lock.
> 0x3003d873a40::print kthread_t ! grep cpu
    t_bound_cpu = 0
    t_cpu = 0x30019fe4030
    t_lockp = cpu0+0xb8        // CPU 3's cpu_thread_lock
    t_disp_queue = cpu0+0x78
The cause of this problem is that door_get_server(),
while doing the handoff to the server thread, gets preempted, because
disp_lock_exit() checks for kernel preemption.
static kthread_t *
door_get_server(door_node_t *dp)
{
    [.]
    /*
     * Mark the thread as ONPROC and take it off the list
     * of available server threads. We are committed to
     * resuming this thread now.
     */
    disp_lock_t *tlp = server_t->t_lockp;
    cpu_t *cp = CPU;

    pool->dp_threads = server_t->t_door->d_servers;
    server_t->t_door->d_servers = NULL;
    /*
     * Setting t_disp_queue prevents erroneous preemptions
     * if this thread is still in execution on another processor
     */
    server_t->t_disp_queue = cp->cpu_disp;
    CL_ACTIVE(server_t);
    /*
     * We are calling thread_onproc() instead of
     * THREAD_ONPROC() because compiler can reorder
     * the two stores of t_state and t_lockp in
     * THREAD_ONPROC().
     */
    thread_onproc(server_t, cp);
    disp_lock_exit(tlp);
    return (server_t);
    [.]
As a result, the server thread's t_lockp points to the wrong cpu_thread_lock,
because the client thread had started running on a different CPU by the time
it did shuttle_resume()
to the server thread. door_return()
(which returns the results to the caller) releases the dispatcher lock without
getting preempted, so we didn't notice this problem in door_return().
On the move to crack another problem now... In fact we can't sleep
if we don't take a look at a crash dump :-)
(2005-06-13 12:00:00.0)
Compiler reordering problem
I'm going to write about a compiler reordering problem in the door_return()
function which was observed in July 2002. The customer was able to reproduce
the problem for us, and it took me a while to figure out that it was
a compiler reordering problem. I must thank our customers for being so co-operative
when we get such issues. I must have given out instrumented kernels at least
five times before I found the problem. It's bug 4699850.
The symptom was very clear. The system used to panic in the Solaris kernel
dispatcher routines; one of the symptoms was a panic in dispdeq() while removing
a kernel thread from the dispatch queue of a CPU.
We know that the compiler can reorder C statements if they are independent.
Consider this piece of C code:
#define THREAD_SET_STATE(tp, state, lp) \
    ((tp)->t_state = state, (tp)->t_lockp = lp)
t_lockp is a pointer to a dispatcher lock and we don't know whether lp is
held or not. When a thread is made TS_ONPROC, the t_lockp of that
thread points to the cpu_thread_lock of the CPU (cpu_t). In the C code
above, the two stores can be reordered by the compiler, so
lp should be held while setting the thread's state.
In door_return(),
when the server thread is about to hand off to the client thread to return the
results, it makes the client thread TS_ONPROC and calls shuttle_resume() on the
client thread. The responsibility of shuttle_resume()
is to make the client/server thread TS_ONPROC, while the caller sleeps on the
shuttle_lock sync obj.
While putting a thread onproc, dispatcher routines need not hold cpu_thread_lock,
and hence if door_return() calls THREAD_ONPROC(), we effectively lose the
thread lock on the client thread.
Now let's look at the two stores again. If t_lockp reaches global visibility
before t_state, we can effectively lose the thread lock on the thread. Assume
another thread on a different CPU is sending a signal to the client door thread.
Once the thread lock is lost on the client thread, the thread which is sending
the signal could see the old state of the client thread (in this
case it happens to be TS_SLEEP). Since the state is TS_SLEEP, eat_signal()
will do setrun() on the client thread, which enqueues the client thread on the
dispatch queue of the CPU. As a result, we can see some very strange things
happening, which included the dispdeq() panic.
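The ordering requirement can be expressed in C11 atomics as a release store of the lock pointer after the state store. This is only an analogue under stated assumptions: the post does not show how the kernel's thread_onproc() is implemented, and the sk_* names, types and enum values here are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { SK_TS_SLEEP = 1, SK_TS_ONPROC = 4 };   /* arbitrary sketch values */

typedef struct sk_thread {
    _Atomic int t_state;
    unsigned char *_Atomic t_lockp;    /* pointer to a one-byte lock */
} sk_thread_t;

/* Writer: make the new state visible before the new lock pointer. */
static void
sk_thread_onproc(sk_thread_t *t, unsigned char *cpu_thread_lock)
{
    assert(cpu_thread_lock != NULL);
    atomic_store_explicit(&t->t_state, SK_TS_ONPROC, memory_order_relaxed);
    /* Release store: the t_state store above cannot move below it. */
    atomic_store_explicit(&t->t_lockp, cpu_thread_lock,
        memory_order_release);
}

/*
 * Reader: if it observes the new t_lockp (acquire load), it is
 * guaranteed to observe the new t_state as well. Returns -1 when
 * t_lockp does not (yet) point to expected_lock.
 */
static int
sk_state_via_lockp(sk_thread_t *t, unsigned char *expected_lock)
{
    if (atomic_load_explicit(&t->t_lockp, memory_order_acquire) !=
        expected_lock)
        return -1;
    return atomic_load_explicit(&t->t_state, memory_order_relaxed);
}
```

With plain, unordered stores the reader could see the new t_lockp together with the stale TS_SLEEP state, which is exactly the window the signalling thread fell into.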
The affected code in door_return() is shown here as it appears in OpenSolaris,
with the fix in place (thread_onproc() instead of THREAD_ONPROC()):
int
door_return(caddr_t data_ptr, size_t data_size,
    door_desc_t *desc_ptr, uint_t desc_num, caddr_t sp)
{
    [.]
    tlp = caller->t_lockp;
    /*
     * Setting t_disp_queue prevents erroneous preemptions
     * if this thread is still in execution on another
     * processor
     */
    caller->t_disp_queue = cp->cpu_disp;
    CL_ACTIVE(caller);
    /*
     * We are calling thread_onproc() instead of
     * THREAD_ONPROC() because compiler can reorder
     * the two stores of t_state and t_lockp in
     * THREAD_ONPROC().
     */
    thread_onproc(caller, cp);
    disp_lock_exit_high(tlp);
    shuttle_resume(caller, &door_knob);
    [.]
}
I used TNF (Trace Normal Form) to track down this problem. But
now we have a powerful tool to trace all the way from userland to the kernel,
and of course it's DTrace.
(2005-06-13 12:00:00.0)