Saurabh Mishra's Weblog
All | Biking | General | Photographs | Solaris Operating System | Stock Market | Trekking & Mountaineering

20050613 Monday June 13, 2005

Compiler reordering problem Compiler reordering problem
I'm going to write about a compiler reordering problem in door_return() function which was observed in July 2002. The customer was able to reproduce the problem for us and it took me a  while to figure out that it was a compiler reordering problem. I must thank our customers for being so co-operative when we get such issues. I must have given instrumented kernels for at least five times before I found out the problem. It's bug  4699850.

The symptom was very clear. System used to panic in Solaris Kernel Dispatcher routines and one of the symptom was system panicing in dispdeq() while removing a kernel thread from the dispatch queue of a CPU.

We know that compiler can reorder C statments if they are independent.  Assume this piece of C code:

#define THREAD_SET_STATE(tp, state, lp) \
((tp)->t_state = state, (tp)->t_lockp = lp)

t_lockp is a pointer to a dispatcher lock and we don't know whether lp is held or not. When a thread is made TS_ONPROC, the t_lockp of the corresponding thread points to cpu_thread_lock of CPU (cpu_t). In the above mentioned C code, these stores can be reordered can be re-ordered by compiler, so the lp should be held while calling setting the threads state.

In door_return(), when server thread is about to handoff to client thread to return the results, it makes the client thread TS_ONPROC and calls shuttle_resume() on client thread. The responsibility of shuttle_resume() is to make client/server thread TS_ONPROC and the caller sleeps on shuttle_lock sync obj.

While putting a thread onproc, dispatcher routines need not hold cpu_thread_lock and hence in door_return() if we call THREAD_ONPROC(), we effectively lost thread lock on the client thread.

Now lets look at the two stores again. It t_lockp reaches global visibility before t_state, we can effectively lose thread lock on the thread. Assume another thread on different CPU is sending a signal to client door thread. Once the thread lock is lost on the client thread, the thread which is sending signal to client thread could see the old state of client thread (in this case it happens to be TS_SLEEP). Since the state is TS_SLEEP, eat_signal() will do setrun() on the client thread which enqueues client thread in the dispatch queue of the CPU. As a result, we can see some very strange things happening which also included dispdeq() panic.

The following code in door_return() was faulty:

int
door_return(caddr_t data_ptr, size_t data_size,
door_desc_t *desc_ptr, uint_t desc_num, caddr_t sp)
{
[.]
tlp = caller->t_lockp;
/*
* Setting t_disp_queue prevents erroneous preemptions
* if this thread is still in execution on another
* processor
*/
caller->t_disp_queue = cp->cpu_disp;
CL_ACTIVE(caller);
/*
* We are calling thread_onproc() instead of
* THREAD_ONPROC() because compiler can reorder
* the two stores of t_state and t_lockp in
* THREAD_ONPROC().
*/
thread_onproc(caller, cp);
disp_lock_exit_high(tlp);
shuttle_resume(caller, &door_knob);
[.]
}


I had used TNF (trace normal form) for finding out this problem. But now we have a powerful tool to trace from userland to kernel and of course it's Dtrace.


Technorati Tag:
Technorati Tag:
Technorati Tag:

(2005-06-13 12:00:00.0) Permalink Comments [0]

20050524 Tuesday May 24, 2005

::cpupart -v for mdb(1m)

Most of you would have used ::cpupart in mdb(1m) to determine the partitions you have on your system. For those who don't know what's partition, then it's an objecct (or kernel entity) which consists of set of CPUs and a global dispatch queue (or global preemption queue). In fact processor sets (which are created from userland using psrset(1M) are abstraction of CPU partitions.

One of the thing which I'm currently working on is to introduce a new option to ::cpupart which will print all the runnable threads in the global dispatch queue of a CPU partition. It's very similar to what ::cpuinfo -v does. Here is the sample output :-

On x86 :-
---------
> ::cpupart -v
 ID     ADDR NRUN #CPU CPUS
  0 fec2a1f8  298    2 0-1
                |
                +-->  PRI THREAD   PROC
                      100 d19b1000 sema
                      100 d19aca00 sema
                      100 d19ab200 sema
                      100 d19a7000 sema
                      100 d0b90a00 sema
                      100 d19a5a00 sema
                      100 d19a2e00 sema
                      100 d19aea00 sema
                      100 d19b4200 sema
                      [.]
>

On SPARC :-
-----------
> ::cpupart -v
 ID             ADDR NRUN #CPU CPUS
  0          18a8c50   25    8 4-11
                        |
                        +-->  PRI THREAD      PROC
                              100 3000a7b1660 sema
                              100 3000a7c55e0 sema
                              100 3000a7b0d00 sema
                              100 3000a7c4960 sema
                              100 3000a7b5c80 sema
                              100 3000a7b4380 sema
                              100 300084a2c80 sema-1
                              100 3000826c3a0 sema-1
                              [.]
>

(2005-05-24 19:44:25.0) Permalink Comments [1]

20050216 Wednesday February 16, 2005

calculating time spent in functions

The other day (10th Feb 2005), I was giving a demo of Dtrace to BEA India folks in ITPL Bangalore and thought that having a Dtrace script which could demonstrate following things would be nice :-

(a) time spent in functions called
(b) shows both user and kernel level tracing capabilities in the same function call context.
(c) uses $target and -c to demonstrate tracing when a process is launched.

I hope people would find it interesting. Here is the output from script. Please note that the time spent is expressed in nanoseconds.

# which w
/usr/bin/w
# dtrace -c w -Fs ./userfunc.target.d main
[.]
  5  -> malloc                                        0
  5    -> assert_no_libc_locks_held                   0
  5    <- assert_no_libc_locks_held                        6880
  5    -> lmutex_lock                                 0
  5    <- lmutex_lock                                      7840
  5    -> _malloc_unlocked                            0
  5      -> cleanfree                                 0
  5      <- cleanfree                                      7600
  5      -> realfree                                  0
  5      <- realfree                                       8000
  5    <- _malloc_unlocked                                35360
  5    -> lmutex_unlock                               0
  5    <- lmutex_unlock                                    8240
  5  <- malloc                                            89200
  5  -> sysinfo                                       0              <-------- getting into kernel
  5    -> pre_syscall                                 0
  5      -> syscall_mstate                            0
  5      <- syscall_mstate                                 4560
  5    <- pre_syscall                                     13120
  5    -> systeminfo                                  0
  5    <- systeminfo                                       5760
  5    -> post_syscall                                0
  5      -> clear_stale_fd                            0
  5      <- clear_stale_fd                                 5200
  5      -> syscall_mstate                            0
  5      <- syscall_mstate                                 4000
  5    <- post_syscall                                    21680

  5  <- sysinfo                                           60720
[.]

# cat userfunc.target.d
/*
 * This script calculates time spent in all the functions (userland & kernel)
 * called once a particular function is traced.
 *
 * Please see /usr/demo/dtrace/userfunc.d also.
 *
 * Usage:
 * # dtrace -c  -Fs ./userfunc.target.d 
 */

BEGIN
{
        self->depth = 0;
}

pid$target::$1:entry
{
        self->trace = 1;
        self->depth = 0;
        self->timestamp[self->depth++] = timestamp;
}

pid$target::$1:return
/self->trace/
{
        self->trace = 0;
        trace(timestamp - self->timestamp[0]);
}

fbt:::entry,
pid$target:::entry
/self->trace && self->timestamp[self->depth - 1]/
{
        self->timestamp[self->depth++] = timestamp;
        trace(0);
}

fbt:::return,
pid$target:::return
/self->trace && self->timestamp[self->depth - 1]/
{
        self->depth--;
        trace(timestamp - self->timestamp[self->depth]);
        self->timestamp[self->depth] = 0;
}
(2005-02-16 21:00:19.0) Permalink Comments [0]

20050201 Tuesday February 01, 2005

Small demo of Resource Management, Contracts and Service Management Framework

I thought I should share these small demos with you folks. I've started playing with Resource Management, Zones,
CPU-shares,  Service Management Framework and Contract. I hope you will find it a bit interesting.

Projects 
I first created two projects and following resource controls were added to these projects. projadd(1m) is used to add a project and projects(1) is used to list projects in the system. # projadd -G other -c "Project 1" -K "project.cpu-shares=(privileged,1,none)" proj1 # projmod -a -K "task.max-lwps=(privileged,3,deny)" proj1 # projmod -a -K "rcap.max-rss=20971520" proj1 # projadd -G other -c "Project 2" -K "project.cpu-shares=(privileged,3,none)" -K "rcap.max-rss=10485760" proj2 Now lets assign proj1 to user1 and proj2 to user2 so that whenever user1 and/or user2 login, the enforcement of these resource control ome into effect. # projmod -U user1 proj1 # projmod -U user2 proj2 We need to add these two lines into /etc/user_attr :- # diff /etc/user_attr.org /etc/user_attr 12a13,14 > user1::::project=proj1 > user2::::project=proj2 So this's how our two projects look like :- # projects -l [.] proj1 projid : 100 comment: "Project 1" users : user1 groups : other attribs: project.cpu-shares=(privileged,1,none) rcap.max-rss=20971520 task.max-lwps=(privileged,3,deny) proj2 projid : 101 comment: "Project 2" users : user2 groups : other attribs: project.cpu-shares=(privileged,3,none) rcap.max-rss=10485760 # proj1 specifies three things (a) It assign CPU-shares as 1 (b) resident set size (RSS) can go upto 20m (c) maximum number of lwps we can have in this project is 3 proj2 specifies two things (a) It assign CPU-shares as 3 (b) resident set size can go upto 10m What's the indent of this demo on CPU-shares -- I'd like to show that 25% utilization in proj1 and 75% utilization in proj2 will be enforced when there is competition from other projects. Since I use same workload which is to spin and hog CPU, it's little easy for me to demostrate this fact here. # prstat -J [.] PROJID NPROC SIZE RSS MEMORY TIME CPU PROJECT 101 2 2408K 1816K 0.0% 0:01:33 74% proj2 100 2 2408K 1816K 0.0% 0:00:56 25% proj1 1 2 7504K 6352K 0.0% 0:00:00 0.2% user.root 0 67 336M 183M 0.3% 0:00:27 0.1% system Total: 73 processes, 247 lwps, load averages: 1.75, 0.77, 0.32 Now, lets change the CPU-shares of proj2 to 2 and see what happens :- # prctl -n project.cpu-shares -r -v 2 -i project proj2 # prctl -n project.cpu-shares -i project proj2 project: 101: proj2 NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT project.cpu-shares privileged 2 - none - system 65.5K max none - # prstat -J [.] PROJID NPROC SIZE RSS MEMORY TIME CPU PROJECT 101 2 2408K 1816K 0.0% 0:03:24 65% proj2 100 2 2408K 1816K 0.0% 0:01:44 33% proj1 1 2 7504K 6344K 0.0% 0:00:00 0.2% user.root 0 67 336M 183M 0.3% 0:00:27 0.1% system Resource Management in Zones
Here I show how to create a simply zone (without a logical interface attached to it) and subsequently we will assign shares to newly created zones. # zonecfg -z myzone1 myzone1: No such zone configured Use 'create' to begin configuring a new zone. zonecfg:myzone1> create zonecfg:myzone1> add fs zonecfg:myzone1:fs> set dir=/mnt/local zonecfg:myzone1:fs> set special=/opt/sfw zonecfg:myzone1:fs> set type=lofs zonecfg:myzone1:fs> set options=[ro,nodevices] zonecfg:myzone1:fs> end zonecfg:myzone1> add rctl zonecfg:myzone1:rctl> set name=zone.cpu-shares zonecfg:myzone1:rctl> add value (priv=privileged,limit=1,action=none) zonecfg:myzone1:rctl> end zonecfg:myzone1> add attr zonecfg:myzone1:attr> set name=comment zonecfg:myzone1:attr> set type=string zonecfg:myzone1:attr> set value="first zone" zonecfg:myzone1:attr> end zonecfg:myzone1> set autoboot=true zonecfg:myzone1> set zonepath=/export/home/zone/myzone1 zonecfg:myzone1> verify zonecfg:myzone1> info zonepath: /export/home/zone/myzone1 autoboot: true pool: inherit-pkg-dir: dir: /lib inherit-pkg-dir: dir: /platform inherit-pkg-dir: dir: /sbin inherit-pkg-dir: dir: /usr fs: dir: /mnt/local special: /opt/sfw raw not specified type: lofs options: [ro,nodevices] rctl: name: zone.cpu-shares value: (priv=privileged,limit=1,action=none) attr: name: comment type: string value: "first zone" zonecfg:myzone1> Now lets install and boot the zone. Remember you can boot, reboot and halt zones independently. # zoneadm -z myzone1 install # zoneadm -z myzone1 boot Repeat these steps to create myzone2, but assign 3 shares to myzone2. myzone2 will look like this :- # zonecfg -z myzone2 zonecfg:myzone2> info zonepath: /export/home/zone/myzone2 autoboot: true pool: inherit-pkg-dir: dir: /lib inherit-pkg-dir: dir: /platform inherit-pkg-dir: dir: /sbin inherit-pkg-dir: dir: /usr fs: dir: /mnt/local special: /opt/sfw raw not specified type: lofs options: [ro,nodevices] rctl: name: zone.cpu-shares value: (priv=privileged,limit=3,action=none) attr: name: comment type: string value: "Second Zone" zonecfg:myzone2> Now I run spin program from /mnt/local to put load on each zone. # prstat -Z ZONEID NPROC SIZE RSS MEMORY TIME CPU ZONE 3 9 27M 22M 0.1% 0:11:57 74% myzone2 2 9 27M 21M 0.1% 0:04:17 25% myzone1 0 50 188M 123M 0.3% 0:01:17 0.3% global 1 6 24M 19M 0.1% 0:00:10 0.0% myzone3 Total: 74 processes, 256 lwps, load averages: 2.02, 2.00, 1.73 # prctl -n zone.cpu-shares -i zone myzone1 zone: 2: myzone1 NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT zone.cpu-shares privileged 1 - none - system 65.5K max none - # prctl -n zone.cpu-shares -i zone myzone2 zone: 3: myzone2 NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT zone.cpu-shares privileged 3 - none - system 65.5K max none - # You can use following command to see the status of zones # zoneadm list -cv Contracts
ctrun(1) can be used for restarting the command again in the event of signal, core, hardware error. You don't require ps(1m) and other onitering scripts to see what's happening to your command. See contrat(4) for more details. Here is a small demo of ctrun(1) :- # ctrun -f signal -r 0 /spin # ps -eaf | grep spin root 1175 1083 0 13:27:52 pts/1 0:00 ctrun -f signal -r 0 /spin root 1177 1 20 13:27:52 pts/1 0:05 /spin # kill -9 1177 # ps -eaf | grep spin root 1175 1083 0 13:27:52 pts/1 0:00 ctrun -f signal -r 0 /spin root 1181 1 15 13:28:04 pts/1 0:04 /spin # In case of SIGKILL, ctrun takes care of re-starting the command automatically. Option -r is to specify the re-try count and zero means indefinitely. You can also use -i option to see what events are taking places. For instace if there is a process getting added to the contract or exiting the contract, the newly created contract would post those events. Service Management Framework
This is my first service which I created using Service Management Framewrok. This demonstrates how a simple service can be created and what will happen in an event of service dieing. # svccfg svc:> import /sample.xml svc:> quit # svcs -p spin STATE STIME FMRI online 13:41:45 svc:/system/spin:default 13:41:45 1257 spin # svcs -l spin fmri svc:/system/spin:default name First SMF service enabled true state online next_state none state_time Tue Feb 01 13:41:45 2005 logfile /var/svc/log/system-spin:default.log restarter svc:/system/svc/restarter:default contract_id 181 dependency require_all/none svc:/system/filesystem/local (online) # ps -eaf | grep spin root 1257 1 54 13:41:46 ? 0:17 /spin # kill -9 1257 # svcs -p spin STATE STIME FMRI online 13:42:08 svc:/system/spin:default 13:42:08 1263 spin # ps -eaf | grep spin root 1263 1 45 13:42:08 ? 0:13 /spin # svccfg svc:> select spin svc:/system/spin> listprop general framework general/entity_stability astring Unstable general/single_instance boolean true fs dependency fs/entities fmri svc:/system/filesystem/local fs/grouping astring require_all fs/restart_on astring none fs/type astring service start method start/exec astring /spin start/project astring :default start/resource_pool astring :default start/timeout_seconds count 60 start/type astring method start/working_directory astring :default application framework application/auto_enable boolean true application/stability astring Evolving tm_common_name template tm_common_name/C ustring "First SMF service" svc:/system/spin> Isn't that great? No need to have fancy scripts to moniter the daemons. Moreover you can specify dependencies between the services. For example if you want to make sure that network service is up and running, you would specify this dependency in the dependency property. You can also add methods like stop (whenever a service is stopped; could be using svcadm disable) or refresh (whenever the service configration is read again). These two programs were used for the demo. spin.c ------ main() { for(;;) ; } spin-service.c -------------- int main() { if (fork() == 0) { /* * Child process and let it spin */ for(;;) ; } return (0); } Here are few good links which would come handy (a) Resource Management Configuration Example (b) CPU-shares (c) Resource Controls (d) Overview on Projects and Tasks (e) Resource Caping (f) Resource Management in Solaris Zones
(2005-02-01 18:23:34.0) Permalink Comments [0]

20050106 Thursday January 06, 2005

inter-process mutex

Recently, Surya Prakki (colleague of mine in Kernel Sustaining Group) and I worked on an interesting problem. We initially thought that it would be a random corruption, but later we figured out that it was an application problem. Customer also told us that his application was working fine on Solaris 9 and when they migrated to Solaris 10, they started seeing their application dumping core.

Owner field of mutex_t was getting corrupted with an invalid address which was resulting in application dumping core (SIGSEGV). This problem was reproducible easily on single CPU machine. The application was dieing in mutex_trylock_adaptive() routine when it tried to dereference the owner. The owner field of the mutex had strange data, but when you look at the core dump the owner field was zero. So this surprised us a lot and we believed that there is some race.

All sorts of thoughts came to our mind including corruption of registers when cpu switches context to another thread or some other thread overwriting the member in the mutex due to over/under run of an array.

We first started debugging this problem using procfs watchpoint. We first set a watchpoint on the virtual address of the owner field of mutex_t structure. We used mdb's :w macro for this purpose. Watchpoint used to fire frequently because application had two threads which contend for lock quite often. So we decided to use a script having "$c, $r, :c" in it. But whenever corruption happened, target process never got the corresponding  watchpoint trap.  So it surprised us a lot and we started wondering how this would happen.

We then started using Dtrace and truss to figure what is happening. We were trying to find out what is happening from the point mutex_unlock() clears the owner field till the process dumps core. In this process, we started ruling out the things which we thought in the beginning. We were running out of ideas now when we carefully looked at the mutex_t members and noticed that magic number is correct and type of the mutex is USYNC_THREAD. We then started using Dtrace probes when we context switch to another thread. We wanted to figure out whether context switching is playing any role here or not. During this course, we noticed that another process was getting on to the CPU after the process which dumped core released the mutex. This rang the bell in our mind. We also noticed that the mutex address (virtual address) was same when this context switch happened.

We took a look at the pmap(1) output and noticed that the mutex is from shared memory segment. The other process had used the same key (see shmget(2) system call). What it means is the mutex was used between the processes. We noticed that the corrupted value was a valid address in the other process (a thread address in fact). This surprised us again because we had seen USYNC_THREAD as the type of the mutex and we had *believed* that this mutex is being used between the threads of the same process. This disappointed us a lot. If the mutex is to be used between processes, then the type of the mutex has to be USYNC_PROCESS because one can't really dereference the owner when the mutex is being used between the processes (inter-process mutex).

From the man pages of mutex_init(3THR)

     USYNC_THREAD
           The mutex can synchronize threads only  in  this  pro-
           cess. arg is ignored.

     USYNC_PROCESS
           The mutex can synchronize threads in this process  and
           other  processes.  arg is ignored. The object initial-
           ized with this attribute must be allocated  in  memory
           shared  between  processes,  either in System V shared
           memory (see  shmop(2)) or in memory mapped to  a  file
           (see  mmap(2)). If the object is not allocated in such
           shared  memory,  it  will  not   be   shared   between
           processes.

We asked the submitter of the bug to make this modification and it all worked fine. Customer came back saying our diagnosis is correct and he modified the application accordingly.

Having spent a week or so, the bottom line is that don't take things for granted :)
(2005-01-06 02:06:03.0) Permalink Comments [0]

20041213 Monday December 13, 2004

Solaris 10 presentation to ISV's

The other day, I presented cool features of Solaris 10 to ISV's in Mumbai (on 9th Dec) and New Delhi (on 10th Dec). This technology show was organized by Sun for ISV's and most of the people those who came there were Java Developers. It was very well received. People didn't ask much questions because of lack of time, but most of the questions that came were on Solaris containers and Solaris x86. I couldn't show the demo of Dtrace, Zones and ZFS because of lack of time. Since this talk was meant for ISV's only, the number of people came for the presentation was around 80 only. We showed the demo of 3-D looking glass as well. People were very happy to see the demo. The other talks were on J2EE and Java. Thanks.
(2004-12-13 18:30:14.0) Permalink

20041011 Monday October 11, 2004

Solaris 10 Presentation


Yesterday (11th October), Pramod Batni and I presented cool features of Solaris 10 and Dtrace to Hughes software  (www.hssworld.com). It was well received by the folks there and they were very keen in deploying Solaris Zones (http://wwws.sun.com/software/solaris/10/inside.jsp). They also wanted to use Dtrace (http://wwws.sun.com/software/solaris/10/inside.jsp) for improving performance of their application. Though there were few people who could manage to attend Solaris 10 and Dtrace talk, but people there were willing to adopt new technologies in Solaris 10 and were very keen on more detailed talk on Solaris 10 especially on Solaris Zones and Dtrace. Girish (GSO folk) also gave a presentation on our Processor Road map, Sun Cluster, and Volume Server Products.

This is not our first presentation to an Indian customer. We have presented cool features of Solaris 10 to loads of companies in India. I have presented cool features of Solaris 10 at places like Sun Developer Days (New Delhi), Sun Technology Conference (Bangalore), Infosys (Finacle Division), and Nucleus Software. It had been a great experience for me to present on Solaris 10 at such places. Infact we gave a demo of Dtrace to Infosys folks (developers of Finacle software). Infosys folks were very impressed about Dtrace and wants to use Dtrace for java programs as well.

We are looking for more such companies and developers to adopt Solaris 10.

Thanks,
--
Saurabh Mishra, Solaris Kernel Sustaining and Engineering.

(2004-10-11 23:08:59.0) Permalink Comments [1]


Locations of visitors to this page
archives
links
referers