Weblog

All | General | Solaris
Main | Next page »
20071010 Wednesday October 10, 2007

Dump device configuration - part2 Dump device configuration - part2 To my earlier post on 'why the dump device configured so late', I got a very interesting reply, questioning the claim: 'window will be significantly reduced'. Let me be frank, as of the first post, I didn't conduct any studies as to quantify the gain.
So the mail prompted me to do some tests to quantify the gain.
I want to understand how many function calls get made from the word 'go' [i.e from main()] till dumpinit() gets called; And also figure out how many modules get loaded in the same window.

That seems a pretty arduous task. At the same time I can't spend days trying to just get this data. This is where OpenSolaris rocks - if not for DTrace, I would have given up on this. DTrace kernel module gets loaded very early in the boot right after the kernel starts [i.e unix/genunix/krtld modules are loaded]. DTrace supports a feature called 'Anonymous tracing' - which can be used in this context just to get the data I am interested. This won't get me right from main() but as close to main() as possible.  This option is good enough to get an idea of what the numbers look like.

I needed answers for atleast the following questions :
    - How many function calls are made from the point DTrace can start to the point of dumpinit() ?
    - How many function calls are made from the point root is mounted to the point of dumpinit() ?
    - How many function calls are made from start_other_cpus() [this routine starts the rest of the CPUs in a multi cpu system; Till this point, only the boot cpu does all the work; This opens up scope for race conditions / corruption / hangs - So even if we could capture crash file from this point, that would still be a nice win] to the point of dumpinit() ?

I put together the following simple script to answer the 3rd question.

# cat mp_init.d
BEGIN {
        dumpinit = 0;
}
/* start collecting data here */
fbt::start_other_cpus:return
/dumpinit == 0/
{
        dumpinit = 1;
}
fbt:::entry
/dumpinit == 1/
{
        @[probefunc]=count();
}
/* stop collecting data here */
fbt::dumpinit:return
/dumpinit == 1/
{
        dumpinit = 0;
}
#
And invoke the script as follows :
# dtrace -As ./mp_init.d
dtrace: saved anonymous enabling in /kernel/drv/dtrace.conf
dtrace: added forceload directives to /etc/system
dtrace: run update_drv(1M) or reboot to enable changes
#

Then reboot the system and when system comes back, just run :
#dtrace -ae > mp_init.out

mp_init.out has thousands of lines which look something like :
  tsc_scalehrtime                                              369164
  syscall_exit                                                 375467

Module information can be gotten using 'kmdb' - For this I booted the system under kmdb, put a break point in start_other_cpus() and let it continue.  When the break point is hit, I ran '::modinfo' and saved the o/p; Then I put one more break point in dumpinit() and again let it continue; When the new breakpoint is hit collect the modinfo o/p again and comparing this with the previous one gives you an idea of how many modules got loaded in between.

With simple changes to the script, I got answers for the 1st and 2nd questions as well. Module info can be gotten as  described above.

Then I used awk(1) to post process mp_init.out  and the resulting numbers look like this :
     Q) How many function calls are made from the point DTrace can start to the point of dumpinit() ?
     A) 59 million function calls were made from the point dtrace can start   to the point of dumpinit:return

     Q) How many function calls are made from the point root is mounted to the point of dumpinit() ?
      A) 52 million calls were made from mounting root to dumpinit:return
           - 120 modules got loaded in the same window
    
     Q) How many function calls are made from start_other_cpus()  to the point of dumpinit() ?
     A)  43 million calls were made from start_other_cpus() to dumpinit:return
          - 52 modules got loaded in the same window

The data speaks for itself !! Data modified by 43million routines can be captured in a dump, if we could advance dump device configuration to just before start_other_cpus(). Gain is more if it could be advanced closer to mount of root FS.
( Oct 10 2007, 05:36:01 AM PDT ) Permalink

20070925 Tuesday September 25, 2007

Why is dump device configured so late ?? Why is dump device configured so late ?? The other day I was playing around with kmdb [setting breakpoints, single stepping, clearing bkpts etc..] and at one stage, thought, Let me take a dump and then issued  :

    [0]> $<systemdump
    nopanicdebug:   0               =       0x1

     panic[cpu0]/thread=fffffffffbc25d20: BAD TRAP: type=d (#gp General protection) rp=fffffffffbc46670     addr=f000ff53f000ff00

     fffffffffbc46550 unix:die+ea ()
     fffffffffbc46660 unix:trap+3de ()
     fffffffffbc46670 unix:cmntrap+e9 ()

     skipping system dump - no dump device configured
     rebooting...

It is the last two lines of o/p that matter to this post (so lot was left out intentionally).
I couldn't collect the crash file - the reason being 'no dump device configured'; I felt bad that the information I wanted was gone and I had to repeat the whole cycle to get to that state ...

I know that by default 'primary swap partition' itself acts as a dump device. And swap device is specified as part of install itself and is actually available in /etc/vfstab as (something like) :
        /dev/dsk/c1t0d0s1       -       -       swap    -       no      -

That brings me to the question:  Where in the boot sequence do we configure swap as dump device ?
Looking a the code to see where the above error msg comes from: it turns out to be dumpsys()
          uprintf("skipping system dump - no dump device configured\n");
dumpvp not being set is the reason why dumpsys() prints out this msg;  To understand where it gets set, I put a break point in dumpinit() and then cotinued.
The state when the brkpt is hit :

        [1]> $c
        dumpinit(ffffff0154b1c880, 0, 0)
        swapctl+0x68a(1, fffffd7fffdffc90, ffffff00042acee4)
        uadmin+0x10f(10,
        [1]> ::ptree
        fffffffffbc250b0  sched
           ffffff01513923a8  fsflush
           ffffff015138e3a8  pageout
           ffffff015138c3a8  init
                 ffffff0152b7a3a8  svc.configd
                 ffffff0152aaf3a8  svc.startd
                       ffffff01548f13a8  fs-usr
                            ffffff01549ce3a8  swapadd
                                 ffffff0154a1b3a8  swap
        [1]

That means, in response to the cmd 'swap -1 -a /dev/dsk/c1t0d0s1', dumpvp gets set in the kernel. So if the system were to panic till this happened we would not be able to collect a crash file for subsequent analysis. Running '::modinfo' shows 161 kernel modules loaded (on my test system) - All this state also would be lost.

I am wondering why should it be pushed to this stage - why can't it be done earlier?

The way /etc/system is read, /etc/vfstab could also be read, look out for an entry with FStype as 'swap' and try configuring it as the dumpvp at the earliest possible moment  - Wondering, if this can be done immediatly after /dev, /devices and disk dirver (sd?) modules are loaded.

Even with this change, we can't capture the panics that happen before dumpvp is configured, but that window will be significantly reduced !!!

Moinak was suggesting, explore the option of making it part of bootenv.rc - good point, but that makes it x86 specific - needs more investigation.
Comments please.
( Sep 25 2007, 01:23:43 AM PDT ) Permalink

20070906 Thursday September 06, 2007

How do you access structure members in assembly files... How do you access structure members in assembly files... If you look at the assembly files in OpenSolaris kernel [fortunately there are not many of them], you will see instructions like the one below at quite a few places :

          movq    T_PROCP(%r13), %r14     /* %r14 = proc */  <-- from swtch.s

In this particular case, %r13 holds the thread pointer and it is accessing the proc pointer from it and storing it in %r14. How it is accessing the member is the focus of this post.
If this was a c-source file, we could have done something similar to:
          proc_t *p = t->t_procp;
and compiler will take care of the rest.
But if we need to code it in assembly, we have to explicitly pass the 'offset' of the member [here, t_procp] to the instrcution. But if we were to hard code the 'offset', it would be a nightmare to maintain it, whenever structures changed.

So how OpenSolaris addresses this is pretty neat.
There is a file called offsets.in, which has entries as shown below :

_kthread                      THREAD_SIZE
        t_pcb                   T_LABEL
        t_lock
        .....
        t_procp
        t_link

This file gets compiled into a header file called assym.h and it can be included in any assembly file that makes references to these members. This Makefile takes care of building the header file. Looking at the assym.h, after building it, the entry for t_procp looks like this :

#define T_PROCP 0xdc

This is exactly what we need to feed to the assembly instruction !!

If we need to access a structure member that is not already present in the offsets.in, all we need to do is, just add the 'appropriate structure and its member', compile and start accessing in the assembly code.

What if the structure I am interested in is not of global signifcance - i.e no other module other than mine may care about it - in which case I don't want to clutter the offsets.in but want my own offsets.in which does the same job.
An example for this can be found here offsets.in - the structure and offsets that are declared here are of no significance to sparc so just confine it to x86.

You can use make(1S)'s AS_INC_PATH env varaible in your Makefiles to point to the location of where the generated heder files are present - with this varaible, you don't end up including the header file path for your c-files [confine it to assembly files].
( Sep 06 2007, 09:46:06 PM PDT ) Permalink

20070820 Monday August 20, 2007

lx brand internals - part1 lx brand internals - part1... If you are here, I assume that you already read the brandz design document available  as http://opensolaris.org/os/community/brandz/design/ and are interested in knowing the real low level details of how lx brand does its magic. If so continue...

This post may get modifed in the future as my understanding of things improve :-)

Now that you have got an lx zone [say, lx1]  installed, what happens when you issue, 'zoneadm -z lx1 boot'...
/usr/lib/zones/zoneadmd is the workhorse. It does all the heavy lifting w.r.t setting up the virtual platform; of that server() is the routine that does the actual work.
zsched process of a zone is equivalent to sched of global zone - all kernel threads related to the zone hang off from the zsched process.
zoneadmd creates zsched through server()->zone_ready()->vplat_create()->zone_create()->newproc()
zsched() launches zone_start_init() which will eventually launch the 'init' of the lx brand installed [say, centOS3.0 init].
brand_platform_iter_gmounts() mounts the file sytems mentioned in the config file /usr/lib/brand/lx/platform.xml [starting with keyword 'global']
No Inherited-Package-Dirs for lx brands.
Mount the rest of the file systems from the above xml file [starting with keyword 'mount']
boot/halt scripts allow the brand to perform any custom configuration, etc

lx brand boot callback: lx_support is called from zone_bootup()->do_subproc() to do any additional setup before the zone is booted - say, setting the 'restrat' attribute for init.
Then make the zone_boot() system call – as the comment in zone.c says, this doesn't need to do any work other than to just find the status set by the init() of the brand.

Understanding how init of the brand gets started, throws a lot of light around how brandz works:
zone_start_init() execs /sbin/init of the brand.
elfexec()  realizes that this is a branded process, and hence execs the helper library:
    /Brand/lx1/root/native/usr/lib/lx_brand.so.1
through lx_elfexec() and then maps in linux binary and its interpreter as follows :
 
CPU     ID                    FUNCTION:NAME
  1  48637                  elf32exec:entry   zsched              vp: /Brandz/lx1/root/sbin/init
              genunix`gexec+0x374
              genunix`exec_common+0x471
              genunix`exec_init+0x275
              genunix`start_init_common+0xfb
              genunix`zone_start_init+0x3d
              unix`thread_start+0x8

  1  48637                  elf32exec:entry   zsched                vp:  /Brandz/lx1/root/native/usr/lib/lx_brand.so.1
              lx_brand`lx_elfexec+0xfd
              elfexec`elf32exec+0xf3e
              genunix`gexec+0x374
              genunix`exec_common+0x471
              genunix`exec_init+0x275
              genunix`start_init_common+0xfb
              genunix`zone_start_init+0x3d
              unix`thread_start+0x8

  1  48627            mapexec32_brand:entry   zsched          vp: /Brandz/lx1/root/sbin/init
              lx_brand`lx_elfexec+0x169
              elfexec`elf32exec+0xf3e
              genunix`gexec+0x374
              genunix`exec_common+0x471
              genunix`exec_init+0x275
              genunix`start_init_common+0xfb
              genunix`zone_start_init+0x3d
              unix`thread_start+0x8

  1  48627            mapexec32_brand:entry   zsched           vp: /Brandz/lx1/root/lib/ld-2.3.2.so
              lx_brand`lx_elfexec+0x2aa
              elfexec`elf32exec+0xf3e
              genunix`gexec+0x374
              genunix`exec_common+0x471
              genunix`exec_init+0x275
              genunix`start_init_common+0xfb
              genunix`zone_start_init+0x3d
              unix`thread_start+0x8

AT_ENTRY auxv vector points to _start() of lx_brand.so.1 – solaris linker jumps to this after its boot-strapping.
exec_args() copies the arugments to the user stack. And execution in the userland starts at _rt_boot() and stack looks something like this :

                       
                        ----------------
                        |                              |  <-- %ebp
                        ----------------
                        |          argc              | 
                        ----------------
                        |          argv[]            |
                        ----------------
                        |          NULL            |
                        ----------------
                        |          envp             |
                        ----------------
                        |          NULL            |
                        ----------------
                        |          auxv[]            |
                        ----------------


When the linker turns over control to the entry in the executable [i.e _start() in our case], the stack looks exactly how it was when kernel gave the control to the linker i.e _rt _boot() :
_ rt_ boot()-> setup()->elf_entry_ pt()->_start()

_start() calls lx_init()  which registers the user-space handler() with the lx brand module
Handler is: lx_handler_table() - this looks like this:

LM2`lx_brand.so.1`lx_handler_table:             pushl  $0x0
LM2`lx_brand.so.1`lx_handler_table+5:           jmp    +0x1112  <LM2`lx_brand.so.1`lx_handler>
LM2`lx_brand.so.1`lx_handler_table+0xa:         nop   
LM2`lx_brand.so.1`lx_handler_table+0xb:         nop   
LM2`lx_brand.so.1`lx_handler_table+0xc:         nop   
LM2`lx_brand.so.1`lx_handler_table+0xd:         nop   
LM2`lx_brand.so.1`lx_handler_table+0xe:         nop   
LM2`lx_brand.so.1`lx_handler_table+0xf:         nop   
LM2`lx_brand.so.1`lx_handler_table+0x10:        pushl  $0x10
LM2`lx_brand.so.1`lx_handler_table+0x15:        jmp    +0x1102  <LM2`lx_brand.so.1`lx_handler>
LM2`lx_brand.so.1`lx_handler_table+0x1a:        nop   
LM2`lx_brand.so.1`lx_handler_table+0x1b:        nop

Where we jump into this table from the kernel is dictated by the system call made. lx_handler() takes the system call number as argument and proceeds to emulate it.

Ok...so now the Q is how do we initially get into the kernel when a system call is made by a lx process ??

32bit linux makes use of int80 to make a system call – brandz replaces it with its own handlers for branded processes :
  - brand_interpositioning_enable() installs the new handler on a context switch
  - brand_interpositioning_disable() restores the old handler while switching out
  - The new handler is: brand_sys_int80
Take a closer look at brand_sys_int80; In this post let us focus on 64bit solaris kernel :

Upon int80, the stack looks like this when we enter brand_sys_int80() [refer to 5-27 vol3a of intel
programmers manual]
[1]> <rsp,5/Jn
0xffffff00046c3fd8:
                            feb888bd  <-- %EIP of linux app that caused the 'int 80'    
                            43            <-- Code Segment selector
                            212          <-- EFLAGS
                            8047cec   <-- %esp - stack pointer as of 'int 80'   
                            4b            <-- %SS

brand_sys_int80() switches to the kernel stack of the running thread, and retrieves a lot of brand information
and then jumps to lx_brand_int80_callback(). As of the jump, the stack looks like this :

[1]> <rsp,8/Jn
0xffffff00046c3ee8:
                           feb888bd              <-- return address into app's user land
                           ffffff014ba301c0    <-- Pointer to proc brand data structure
                           ffffff0152573708    <-- lwp brand data
                           ffffff00046c3fd8     <-- Same as above 'rsp'
                           fffffffff7cddc60        <-- lx_brand`lx_brand_int80_callback()

lx_brand_int80_callback() takes the system call number present int %rax, left shifts it by 4 (as each entry
in the lx_handler_table() is 16bytes long] and that is the address it needs to jump to, to emulate the system
call. Then it swaps the 'return address that is available on stack with this newly computed address - such that
when it issues 'return from interrupt [iretq]', it starts off at the new address. The actual 'user land addrss from
the application' is saved into %rax.
The stack as of iretq:
[1]> <rsp,6/Jn
0xffffff00046c3fd8:            
                           fefddb40       
                           43             
                           212            
                           8047cec        
                           4b   
[1]> <rax=J
                           feb888bd

See %rax holds, what used to be on stack !!!! And what is on stack is the address in the lx_handler_table()
where it jumps to...From here on it executes the emulation code and is the content of next post...
( Aug 20 2007, 02:38:51 AM PDT ) Permalink

20061208 Friday December 08, 2006

Enuf error checking done in your code ?? Enuf error checking done in your code ?? I remember having read somewhere sometime back, that, more than 20% of the solaris code just does error checking ...and my first reaction was, wow thats lot and close to paranoia ...and with the passage of time I kind of forgot about it.

Couple of weeks back I got an escalation and, as usual, it started off as, 3rd party application core dumped inside a solaris library (and to be more specific in libaio on solaris10).

libaio allows applications to do IO in an asynchronous way on files that can not make use of kernel async IO (for ex: aioread()/aiowrite() on a device file make use of kernel aio; whereas the same call on a regular file will make use of aio features provided the library) - How library does aio is very simple: it creates a few worker threads which will pick up requests and perform IO in the background using pread() and pwrite() calls.

So when an application issues a aio read or write for the first time on a file descriptor (fd) that can't make use of kaio, library creates a bunch of worker threads to execute the aio requests. Refer to the routine __uaio_init() to see how the initialization is happening.

This post is concerned more with the following few lines of code from __uaio_init() :

    194 	/*
195 * Create the minimum number of read/write workers.
196 */

197 for (i = 0; i < _min_workers; i++)
198 (void) _aio_create_worker(NULL, AIOREAD);

As the comment says, it is creating a bunch of worker threads. But what is interesting is, we are not checking for the return value of _aio_create_worker() routine - This routine creates a user land thread [From solaris9 onwards,  libthread moved to 1:1 model - which means, for every user land thread created, there will be an lwp in the kernel to which it is tied to].

For some reason if this call fails, we continue as if everything is hunky-dory and set __uaio_ok which means initialization is complete. Down the line when it actually tries to take off a worker from the Q, accesses a NULL pointer and thus shoots itself in the foot.

That makes me think that there is never enough error checking !! more is always merrier here...
( Dec 08 2006, 02:54:49 AM PST ) Permalink

20061113 Monday November 13, 2006

One ASSERT per line ... One ASSERT per line... I just came across the following assert() in cpu.c

    666 	ASSERT(t->t_weakbound_cpu == CPU ||
667 (t->t_nomigrate < 0 && t->t_preempt > 0) ||
668 CPU_ON_INTR(CPU) || t->t_flag & T_INTR_THREAD ||
669 getpil() >= DISP_LEVEL);

A system with debug kernel would panic, if any of the above conditions were to fail. For me, I believe, on sparc system, it would take 2mins to decipher which of the conditions failed.
And on amd64, definitely longer.

If only these asserts were one per line :
    - the panic message would tell me what had gone wrong - letting me focus on the core of the problem, rather than which one failed.
    - Definitely improves the readability of the code, which is what solaris is known for.

If only somebody could find an RTI advocate, who will approve the changes, I will definitely be indebted to him/them.
( Nov 13 2006, 03:37:39 AM PST ) Permalink Comments [1]

20060324 Friday March 24, 2006

libaio hang ?? libaio2.html Recently I came across a core file from an application linked with libaio. Customer said the application hung and he used gcore(1) to collect the core file.

After initial investigation it turned out that an mmap() call from libaio failed as there is not enough virtual memory for the process left. libaio issues mmap() to set aside stack for user land threads that it creates which service the 'async IO' requests.

As it is a 32bit process, I assumed, at one point the process may have grown close to 4Gb and mmap() call around that time failed with 'no memory'.

With these points, we put together a test case whch calls malloc() till a failure happens, and then issue an aio call which results in mmap() failing. So we throught the application wass hitting the 32bit limit.

One smart customer facing engineer collected the output of 'ulimit -a' and there it showed that the 'vmemory' was limited to ~600Mb. Interestingly size of the core filed collected was also around the same.
So the recommendation was to set it to 'unlimited'; if not, atleast raise it to a signifcant number.

But the point to keep in mind, while setting to 'unlimited' is, it needs to be ensured that there is enough swap
space configured on the system. Otherwise a run away process can starve others.

As the problem happened on s9, I can't print the stack here.


( Mar 24 2006, 02:34:35 AM PST ) Permalink

20050921 Wednesday September 21, 2005

data loss root caused... moddebug fun on DEBUG builds... Few weeks back I was looking at a data loss issue reported by a customer and in this  post I describe how it is root caused.

This initially came in as a NFSv3 issue.  Customer doesn't see the problem in case of ufs. But sees it only if he uses nfs. Even with nfs, he doesn't see the problem if  'noac' mount option is used (noac suppresses caching). Ofcourse customer doesn't  want to use noac, as  performance is penalized.  Even when he switched NFS servers, the problem persisted.  Customer doesn't see the problem with solaris10 on a sparc box but sees it on an opteron box with same solaris 10.  He sees it with  both 32bit and 64bit kernel.

Couple of enterprising 'nfs' engineers, after looking at the data from customer site, cooked up a 10 line reproducible test case and proved that the problem is not in NFS land but in the 'vm' layer beneath.

Armed with a simple test case,  a killer weapon called DTrace and a killer system called V40z I took 5hrs to zero-in on the problematic routine.

What the test case does is: it opens an existing file, lseek's to an offset: 2633940  [== (321*8192) + (4096+212)] and does a write of 60bytes and then seeks back to the beginning of the file and starts reading the whole file in 8k chunks. And when finished reading, closes the file.
Thats all and last 60bytes written is lost, somehow. And the problem happens every time.

Whenever  a user reads or writes to a file, the data is read/copied into the kernel address space before being pushed to userland or the FS.  Where to copy into the kernel is controlled  by a segment driver called: segmap.  What it does is, given a vnode and an offset it churns out a kernel VA where a page can be mapped to, such that FileSystems can copy the data from this page or into the page depending upon whether its a write or read request.

With 64bit kernel, 64Mb of kernel VA is set aside for segmap. And each 'smap' structure maps 8Kb of this VA. Thus at boot time 8192  'smap'  structures are created to map the whole 64Mb.  Primarily  'smap' structure contains the vnode/offset members, beside some other information.
Even in 64bit land, page size on an x86 box is still 4Kb only.  Thus given a segmap addres, kernel needs 2pages to map 8k size.

If you refer back to the offset where the write is happening, its actually happening to a 2nd page of a segmap address.
The stack trace when we load a PTE for the segmap address as part of write is:


hat_memload:entry adr: fffffe80c0ddb000 pp: (pfn:dfc87)
genunix`segmap_fault+0x1c9
genunix`as_fault+0x205
unix`pagefault+0x8b
unix`trap+0xdbc
unix`_cmntrap+0x11b
unix`kcopy+0x2c
genunix`uiomove+0xdf
nfs`writerp+0x87
nfs`nfs3_write+0x3ad
genunix`fop_write+0xb
 
So what this means is we got an address:  fffffe80c0ddb000   from segmap for the vnode/offset we passed. And the page number kernel set aside for this is: dfc87. Ones we load a translation entry for this page, and when the write happens to the page,  processor marks the page dirty in one of the bits of the PTE.

What happens ususally is, fsflush or say pageout, finds that this page is dirty (refer to pvn_getdirty()), syncs these flags from PTE into the page structure and wirtes the contents to the backign store before disposing the page.

What the test case does next is, seeks back to the beginning and read the page in 8k chunks :

As the thread reads the pages in 8k chunks, when it comes to offset (321*8192),  in the next read it has to  return (4102+212+the 60bytes that we wrote last),  as segmap already has an entry for this vnode/offset it returns the segmap address the corresponds to the first page:
	segmap_fault:entry addr: fffffe80c0dda000, len: 8192
genunix`segmap_fault+0x1c9
genunix`segmap_getmapflt+0x3ab
nfs`nfs3_read+0x1da
genunix`fop_read+0xb
genunix`read+0x188
genunix`read32+0xe
unix`sys_syscall32+0xd9

Its understandable that this address and the earlier segmap address are 4k apart.

The catch is in segmap_getmapflt() - What it does is, it does a hat_probe() on the beginning address, as this is the first time we are accessing this address, there is no valid PTE for this address and ones hat_probe() returns zero saying there is no valid PTE for this address,  it thinks same is valid even for the 2nd page thats part of this segmap address.

So it now goes ahead and installs a new PTE for both the pages.  And thus a valid entry for the 2nd page is lost :

	x86pte_set:return new pte: 80000000dfc87103, old pte: 80000000dfc87163
	unix`hati_pte_map+0x93                                   
        unix`hat_kmap_load+0b7
unix`hat_memload+0x87
genunix`segmap_fault+0x1c9 genunix`segmap_getmapflt+0x3ab
nfs`nfs3_read+0x1da genunix`fop_read+0xb genunix`read+0x188 genunix`read32+0xe unix`sys_syscall32+0xd9

x86pte_set() installs a new pte and returns the old pte - as we can see the page involved is: dfc87.
The returned pte has bits 0x60 (which stand for PT_MOD | PT_REF).  Ones the new pte is installed, the 60bytes written  is guanranteed to be lost and are lost.
The problem doesn't happen in sparc land as the page size is 8Kb.
As you may have guessed by now, if the test case reads in 4k chunks, the problem would not have happened.

This problem is fixed both in s10 and nevada.
( Sep 21 2005, 09:43:45 PM PDT ) Permalink Comments [13]

20050905 Monday September 05, 2005

my experiments with reiserfs - part1 moddebug fun on DEBUG builds... The other day I was playing around on my home system (booted in RHL) and decided to check out reiserfs and this post briefly talks about couple of points about my initial experiments with reiserfs:

As I don't have a spare disk on my system, I decided to create reiserfs on a 64Mb USB memory stick.

As I have had prior exposure to solaris-ufs, I thought, having a journal or not is a mount time option. But it turned out that, whether I eventually mount it with log or nolog, mkreiserfs sets aside space for log.  So with default journal size, I was left with less than 50% of the stick capacity for my use. I thought this is too much and then re-read the man page, and figured there is  this '-s' option which sets journal size. So I recreated the FS with '-s 513' (as per the man page min is 513), but  this time mount fails to mount the stick saying its not 'reiserfs' partition; which has come as a surprise
as mkreisefs didn't complain at the time of FS creation.

I would have loved to see an option to mkreiserfs to avoid journal. This could have been small disk friendly...
So I didn't have a choice but to settle with less than 50% of disk space for my files.

debugreiserfs(1) is a definite hackers love. Very nice command if you are curious about where (to the granularity of 'byte') exactly is data/metadata.

As the disk space left is very less, reiserfs didn't bother to use internal nodes to store the tree. root block itself is a leaf node, where the directory entries and stat data is available. Whereas I wanted to see the 'formatted internal nodes' in action as I populate the FS.

Looks like my only choice is to get a bigger stick!!! or are there any work arounds ??
( Sep 05 2005, 05:21:44 AM PDT ) Permalink Comments [2]

20050811 Thursday August 11, 2005

moddebug fun... moddebug fun... There is a kernel variable 'moddebug' which can be set to different values
mentioned in usr/src/uts/common/sys/modctl.h :

#define MODDEBUG_LOADMSG         0x80000000      /* print "[un]loading..." msg */
#define MODDEBUG_ERRMSG            0x40000000      /* print detailed error msgs */
#define MODDEBUG_LOADMSG2       0x20000000      /* print 2nd level msgs */
....

This variable controls the verbosity level of messages printed by the kernel (module
loading and unloading subsystem of the kernel, to be precise).

This value can be set either through /etc/system or on a running system using
'mdb -kw'.  Advantage with the later option is, system doesn't need a reboot and will
become effective immdiately.

Modules get loaded on demand or 'cos of a  modload(1M) request.
And  unloaded in response to modunload(1M) or whenever there is shortage
of memory or by 'mod auto unload thread' in case of DEBUG kernels.

Messages are printed onto the console as well as to the messages file.
It will be fun to watch what all modules are getting loaded and unloaded -

# mdb -kw
Loading modules: [ unix krtld ... ]
> moddebug/W 80000000
moddebug:       0               =       0x80000000
> $q

Because I am not running a debug kernel and the possibilty of memory shortage is
less, I will issue a modunload and see what happens:

# modunload -i 0
# dmesg
...
Aug 11 14:52:07 t1 genunix: [ID 762576 kern.notice] uninstalled elfexec
Aug 11 14:52:07 t1 genunix: [ID 659089 kern.notice] unloading elfexec, module id 62, loadcnt 2.
Aug 11 14:52:07 t1 genunix: [ID 762576 kern.notice] uninstalled intpexec
Aug 11 14:52:07 t1 genunix: [ID 659089 kern.notice] unloading intpexec, module id 101, loadcnt 3.
Aug 11 14:52:23 t1 genunix: [ID 943528 kern.notice] load 'exec/intpexec' id 101 loaded @ 0x120460f/0x78117f60 size 1049/156
Aug 11 14:52:23 t1 genunix: [ID 131579 kern.notice] installing intpexec, module id 101.
Aug 11 14:52:23 t1 genunix: [ID 943528 kern.notice] load 'exec/elfexec' id 62 loaded @ 0x1325e6f/0x14dcce7 size 18763/259
Aug 11 14:52:23 t1 genunix: [ID 131579 kern.notice] installing elfexec, module id 62.
#

Modules 'intpexec' and 'elfexec' were unloaded; but had to be loaded immediatly
because I ran 'dmesg' command which happens to be a shell script which in turn
calls ELF binaries !!!
( Aug 11 2005, 02:55:45 AM PDT ) Permalink Comments [1]

20050802 Tuesday August 02, 2005

swap space full; processes are dumping core... swap full; processes are dumping core... Last week, I was looking at a few core files, along with a colleague of mine.
Fujitsu was doing some kind of stress testing on an s10 box and they were
seeing core dumps from random processes.

The stress test involved filling up the whole 'swap' space on the system and then
issuing 'init 6'.

Some of the cores are because of SIGBUS and some are from SIGSEGV.
All the cores have atleast one thread with following kind of stack :

fec7bbe0 libc_psr.so.1`memset+0x88(fec7bca8, ff1bc214, fc150c00, 
b400000, fc000, fb4fc000)
fec7bc40 libc.so.1`_thrp_create+0x1f0(0, fc000, 0, 0, 80, fec7bed4)
fec7be68 libc.so.1`pthread_create+0x1e8(fec7bf3c, 0, 32e7c, 15a608, 0,
ff1e8bc0)
fec7bed8 startd_thread_create+0x10(32e7c, 15a608, 0, f, 52000, 15a608)
fec7bf40 restarter_event_thread+0x1f4(3eb800, 57a460, f5ef8, f5ef8,
56f50, 56f1c)
fec7bfa0 libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)

After an hour of doing the usual things (look at the regs, look at the instructions that caused these
signals etc), it suddenly occurred to us, that memory for thread stacks is allocated through mmap(2)
system call, and is passed an important option:  MAP_NORESERVE - which says don't reserve
swap space for this mapping. If this stack ever needs to be written out to swap, system looks for
swap space that time and if space is not available then, process will terminate with either SIGBUS
or SIGSEGV as mentioned clearly in the mmap() man page...So this is what is causing the random
MT processes to dump core in our case.  Spent 1hr on a behaviour which is clearly documented...
hence, I thought nobody should repeat this mistake and so this post...

I believe, this flag was introduced in the first place for the following reason:
default stack size of a userland thread is 1Mb, and not many threads could potentially have stacks
that run into multiple pages (with a page size of 8Kb); So if mmap() sets aside swap space as well,
for every thread created, with a few thousand threads itself, swap requirements of the system will run
into few Gbytes whereas these threads may never actually need to be paged out at all...
( Aug 02 2005, 01:04:11 AM PDT ) Permalink Comments [2]

20050727 Wednesday July 27, 2005

UFS file system defragmentation... UFS defragmentation... A while back I worked on a utility to defragment a Unix File System (UFS).
I haven't integrated it yet into the source base for two reasons:

1. Couldn't run it on any customer system - to give it a taste of real world.

2. This involves modifying the block numbers underneath a file. And looks
   like cluster folks have a project, which depends on block numbers not
   changing underneath a file.

But as this project hasn't made it, this constraint can be overlooked (I hope).

In this post, I will explain a little bit about why I did this, what it does
and what it doesn't do.
Few customers reported the following problem:

# df -k .
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d40      26614583 24092573 2255865    92%    /stats
#
# dd bs=1024 if=/dev/zero of=test count=8
write: No space left on device
5+0 records in
5+0 records out
#

What this means is, even though there is ~2.2Gb of space is avaialbe,
'dd' failed to create a file of size 8k.
This is because the file system is so fragmented, that it could not
locate 8Kb of contiguous space and hence writes from 'dd' failed.

UFS terminology:
Being a block based file system, allocation unit under ufs is a block of
size 8k. To minimize wastage of space when files of small size are created,
a block is actually divided into framgnets of 1k size (this can be changed
while creating a UFS FS. see mkfs(1M)).
Fragments are the ones which are actually numbered.
What this means is a block is nothing but 8contiguous fragments and the
fragment number of the leading fragment should be a multiple of 8.

So if we need to store 8 files, each of size, say, 200bytes, ufs sets aside
8fragments instead of 8blocks and this results in huge space saving (8*1k Vs
8*8k). What does the math say, if we have a million of such small files?

In the above case, we also got the 'fstyp -v' o/p of the file system:

# fstyp -v /dev/md/rdsk/d40 | sed -e '/^$/,$d'
ufs
magic   11954   format  dynamic time    Tue Aug 27 19:20:27 2002
sblkno  16      cblkno  24      iblkno  32      dblkno  808
sbsize  2048    cgsize  8192    cgoffset 128    cgmask  0xffffffe0
ncg     522     size    27028032        blocks  26614583
bsize   8192    shift   13      mask    0xffffe000
fsize   1024    shift   10      mask    0xfffffc00
^^^^^^^^^^^^
frag    8       shift   3       fsbtodb 1
minfree 1%      maxbpg  2048    optim   time
maxcontig 16    rotdelay 0ms    rps     120
csaddr  808     cssize  9216    shift   9       mask    0xfffffe00
ntrak   19      nsect   248     spc     4712    ncyl    11472
cpg     22      bpg     6479    fpg     51832   ipg     6208
nindir  2048    inopb   64      nspf    2
nbfree  0       ndir    38886   nifree  1891931 nffree  4858090
^^^^^^^^^^                                                 ^^^^^^^^^^^^^^^
cgrotor 81      fmod    0       ronly   0


From this o/p, we can see that number of free blocks (nbfree) on the
system is 0. And all the free space is available in the form of fragments
(nffree) - that means no 8 fragments are contiguous.

How did the FS get into this state? The system serves as a web server and it
creates and deletes tons of small files and thus the fragmentation over
a period of time.

Traditionally the best way to avoid this is, have the frament size and block
size match such that 'df -k' o/p and 'dd' action are consistent. So if
fragment size was 8k in the above example, 'df -k' would have showed 100%
full, so you can't expect 'dd' to succeed anyway. What that means is allocate
an 8k sized block even if the request was for just a few  100bytes !!! So
you may end up buying more disk space.

The other way to fix this problem was, ones we start getting 'dd' failures,
take the file system offline - use the traditional tools ufsdump(1M)/
ufsrestore(1M) to dump the files to a backup media and restore it back.
As you can see, there are two disadvantages with this approach: unavailability
of the FS for a while and a backup device.

So one approach to fix the problem is, reconsider the block allocation
techniques of ufs (which will be an expensive proposition). And the other
approach is, defragment the file system on the fly. This is what I have
implemented.

Initially I thought of adding a ufs kernel thread to do the defragmentation
whenever the frament/block ratio reaches a threshold; but on popular advise,
I switched to a utility which is called 'fsdefrag' - which can be launched
from a crontab file if required.

What this does is, it sweeps through the FS checking for  all the cylinder
groups (another ufs thing), starts from the worst affected cylinder group to
the relatively better ones, and tries to reallocate the fragments set aside for
an inode such that lot of contiguous fragments can be made available at the end
of the exercise. For the inode that is touched, its fragment numbers will
change to the newer ones and old ones get freed after data is copied.
There are some checks made in deciding whether a given inode's data block is a
an appropriate one or not.

What this is not:
Once I say, defragmentation, first thing that comes to the mind of database
folks is, their ~2Gb file, which has blocks spread all over the FS will be
made contiguous such that access times improve in a big way. This utility
doesn't look at files which occupy more than 96k (in UFS, upto 96kb, files
are allocated direct blocks i.e block numbers are stored in the inode
itself). So if you have read this far and this paragraph let you down, I am
extremely sorry.

If you are interested in this, we can have a discussion on 'ufs: discuss' forum
off opensolaris.org/

( Jul 27 2005, 07:22:22 AM PDT ) Permalink Comments [2]

20050712 Tuesday July 12, 2005

system crash file layout kernel memory corruption detected !! and root caused

Sometime back while working on a pet project (may be next blog), I studied the layout of
a system crash file and this blog is about that. This is applicable for crash files from
sun4u platform (but, things should not be very much different in x86/64 land, but I don't
bet on it).

In this document i am explaing the layout of the system crash file on the
swap disk and how it is present in the vmcore.* subsequent to running
a 'savecore'.
Use: 1. Getting to know the layout
2. How we can extract unix.* from vcore.* and thus avoid copying over
unix.* while downloading cores (Not a major gain, but for fun...)
3. How we can avoid kvm_* calls and do the vitual to physical translation
on our own and then read the appropriate location in vmcore.* (again
for fun...)

No sooner the system comes back after a crash, crash dump is avialable
on primary swap device and it starts at:

offset = (size of the device - 64k);
'offset' is truncated to get aligned on a 64k boundary

and grows upwards. At the offset, calculated as above, there will be a
dumphdr_t structure which describes the dump.

No need to know the name of the primary swap device, the following lines
do the job:

dumpfd=open("/dev/dump", O_RDONLY, 0444)
ioctl(dumpfd, DIOCGETDEV, dumpfile) <- this gets dump device into
"dumpfile"
dumpfd=open(dumpfile, O_RDONLY, 0444)

Seek to 64k:
llseek(dumpfd, -DUMP_OFFSET, SEEK_END) & -DUMP_OFFSET;

And read the dumphdr_t:

pread64(dumpfd, &dumphdr, sizeof(dumphdr_t), endoff);

dumphdr contains info to describe the various sections of crash dump,
their start points, their sizes and some more info like crash time, utsname...

sections:

1. dump_start of dumphdr denotes the beginning of crash and there will be a
blank page (size: 8192) to start with (Even this contains one copy of
dumphdr_t, so its not exactly blank :-))

2. Then starts the symbol table in compressed form (its no longer compressed !!)
which will subsequently become 'unix.*' upon running savecore. Its exact
size is available in dump_ksyms_size/dump_ksyms_csize of dumphdr.

3. This is followed by the page translation map. Its an array of structures
of type mem_vtop:

struct mem_vtop {
struct as *m_as;
void *m_va;
pfn_t m_pfn;
}

How does the system populate this? It walks through all the segements of
an address space (assuming we are dumping only kernel pages, then it will
just be 'kas') and for each pages of VA mapped, it will find out underlying
page frame number and updates the above structure. This is the key.
In our debuggers, whenever we try to access a VA, debugger just looks at
this map, finds out the PFN, seek to the appropriate page/offset and read
whatever is present there (So we should be able to write our own read
routines instead of using kvm_read() for fun).

How big is it? its determined by the 'dump_nvtop' member of dumphdr.
(size is: dump_nvtop*sizeof(mem_vtop_t))

4. This is followed by dumping all the Page Frame Numbers (PFN) of all the
memory pages in use, in an array form.

How big is this section? Its determined by the 'dump_npages' member of
dumphdr_t.
(size is: dump_npages*sizeof(pfn_t))

5. This is followed by the contents of all the PFNs noted above, in compressed
form. Each compressed buffer is preceded by the compressed buffer size
(which is a uint32_t).

Big picture:


| |
------------------------- ^ start
| | |
| | pagesize (8192)
| | |
ksyms start ------------------------- -
| | ^
| | |
| | dump_ksyms_csize/dump_ksyms_size
| | |
| | v
page xlation map------------------------- -
start | | |
| | |
| |
| | dump_nvtop * sizeof(mem_vtop_t)
| |
| | |
| | |
| | v
pfn start ------------------------- -
| | ^
| | |
| | |
| |
| | dump_npages * sizeof(pfn_t)
| |
| | |
| | |
Acutal dump strt------------------------- v
| |
| |
| |
| |
| |
------------------------- <--- this offset is 64k aligned
| dumphdr stored here |
| |
_________________________
| | ^
| | |
| | 64k
| | |
------------------------- v End of swap device



Layout of vmcore.*:
-------------------

1. First page will be a blank page.

2. Kernel symbol table is read from swap device into the vmcore.*, uncompress
it (nothing happens here, as it was not compressed in the first hand) and
then write to it unix.*;
Which means we don't need to pull over unix.* while copying over the crash
dumps, we can extract it from vmcore.* (some more fun....). All we need
to do is: open vmcore.*; seek by 8192; and then read dumphdr.dump_ksyms_size
bytes into unix.*

(Now the question is: as we don't have access to swap device, how do we
find out the value of dumphdr.dump_ksyms_size?

The thing is point-1 is wrong - First page is not exactly blank - but it
contains coredhr - so read dumphdr_t bytes from offset:0 and then get
corehdr.dump_ksyms_size bytes, populate unix.* with that many bytes from
vmcore.* - Now we have a unix.* !!

3. Pfn table is dumped next. As the above diagram indicates, we need to
dump (dump_npages * sizeof(pfn_t)) bytes from appropriate offset i.e
dumphdr.dump_pfn

4. Now we create the page transalation info by reading the contents from
swap device at offset: 'page xlation map start' - in swap device data
type is mem_vtop_t and in core file its: dump_map_t

struct dump_map {
offset_t dm_first;
offset_t dm_next;
offset_t dm_data;
struct as *dm_as;
uintptr_t dm_va;
}

There will be 'dump_nvtop' mem_vtop_t structures in swap device.
'dump_nvtop' can be higher than or equal or 'dump_npages' - as a page
could have been mapped at multiple VAes.

Now we read each mem_vtop_t, look at its PFN, then find out its location
in pfn_map (created in step 3) and update the 'dm_data' variable as
follows:

dm_data = (dump_data + (location * PAGESIZE))

For 'dump_data_start', please refer to the below figure.

5. Then the actual pages are dumped. These pages are stored in compressed
form. How they are stored in the dump device is: compressed size followed
by the compressed contents.
To extract them, something like this is done:

pread(dumpfd, &csize, sizeof (uint32_t), dumpoff);
^^^^^ <- this gets us the compressed size
dumpoff += sizeof (uint32_t); <- go past the data type

pread(dumpfd, inbuf, csize, dumpoff) <- read in the compressed buf;

decompress(inbuf, outbuf, csize, pagesize); <- uncompress it
^^^^^^^^ <- expected uncompressed size

write(corefd, outbuf, pagesize,corehdr.dump_data+x*pagesize) <- write
the uncompressed buf to the vmcore.* file

Big picture (vmcore.X):


offset: 0 --------------------------
| corehdr_t |^
| ||
| |8192
| ||
corehdr. | |v
dump_kyms --------------------------
| |^
| Kernel ||
| Symbols (A.K.A) ||
| unix.* |unix.* size aligned to page size
| ||
| ||
| |v
dump_pfn --------------------------
| |^
| Number of elements: ||
| (dump_npages) |pfn_table (PFNs of all dumped pages)
| || (aligned to page size)
| |v
dump_map --------------------------
| |^
| Array of dump_map_t ||
| structures ||
| (dump_nvtop) |Xlation map
| ||
| ||
| |v
dump_start --------------------------
| |^
| ||
| ||
| |Contents of Physical pages
| |are here
| ||
| ||
| |v
--------------------------

So given this picture how do we do the xlation given a VA:

1. First search for the VA in the tranlation map starting at: corehdr.dump_map
2. Get the 'dm_data' of the corresponding dump_map_t
3. Seek to that offset, and just read.


( Jul 12 2005, 11:50:47 PM PDT ) Permalink Comments [2]

20050630 Thursday June 30, 2005

softcall processing in solaris kernel memory corruption detected !! and root caused

Points to note before i move to softcall processing (this is for sun4u only):

1. There exists an intr_vector array (each element of which is a 'struct intr_vector'
(refer to ivintr.h)). HW drivers register the interrupts that the HW
generates with this array. The index into this array will be the interrupt
number. And subsequently whenever that piece of HW interrups the CPU, it will
send this number.

2. A part of this array is used for softints. Softints are something which
are generated by the CPU itself (using SOFTINT register).

3. Again there exists an array called softcalls[], each element of which is
a 'struct softcall' (refer to common/os/softint.c)

Softcall is a mechanism whereby kernel can execute a function at PIL-1.
How do we do this?

1. The first step is to make an entry in the above array (i.e intr_vector[])
using add_softintr(). add_softintr() returns the index of this entry.

e.g: siron_inum=add_softintr(PIL_1, softlevel1,0)

2. From now on, we make a call to softcall() to execute a function at PIL-1.
What softcall does is: it adds an entry for this function in a softcall array -
and then makes a call to siron().

e.g: softcall(callout_execute, ct);

What siron() does is, it calls setsoftint()

e.g: setsoftint(siron_inum)

what setsoftint() does is, it checks whether the pending flag is set
for this number (in this case its siron_inum) in the intr_vector, and if its
not set, then it sets a bit in the SOFTINT register/sets the pending flags,
which triggers a level-1 interrupt when the PIL on the cpu goes below 1.

(one more thing it does is, takes off one free intr_req structure from the
freelist, fills it up with the interrupt number (i.e siron_number in this case)
and adds it to the list corresponding to this PIL - i.e to
cpu.machcpu.intr_pool[])

3. Now when a level-1 interrupt occurs,
pil_interrup() {
Takes off the first intr_req from the intr_pool[], note the
intr number and return the intr_req struct to the free pool.

intr_thread() {
SERVE_INTR(this is a macro) {
looks up the interrupt number in the intr_vector
and calls the handler
(i.e softlevel1 at index siron_inum)

softlevel1() {

softint() { }
}
}
}
}
}

what softint() does is, goes through the softcall array(i.e softcalls[])
and calls all the handlers registered (e.g callout_execute() at line:28)

Caveats:

1. siron() doesn't call setsoftint() if a global flag siron_pending is already
set. This gets unset in softlevel1(). For this reason, we can't guarantee on
which cpu a real-time callout runs.

(explained in detail: some other thread running on a different CPU may have called
softcall() and this resulted in siron_pending getting set and then it sets a
bit in SOFTINT registe of that cpu;
subsequently when clock runs and calls softcall(callout_execute), this results in
adding an entry to the softcall table, but as siron_pending is already set, this
just returns).
Subsequently when a level-1 interrupt happens on the first cpu, it will process
besides other entries, RT callouts as well.

2. One more thing to note is we are calling setsoftint() with siron_inum, and if
pending flag is alread set for this index in intr_vector[], it just retuns.

For this reason, in cyclics, it creates its own interrupt numbers for each cpu
for level1 and level10 interrupts.

Refer to cbe_configure() of sun4u/io/cbe.c
This way we don't depend on a softint that has to happen on a differet CPU.

3. cyclics also makes use of SOFTINT feature to generate soft interrupts, but
it doesn't go through the softcall array but what it does is:
add_softintr(PIL_1, (intrfunc)cbe_level1, 0);
cbe_softint() -> setsoftint(data->cbe_level1_inum);
cbe_level1() -> cyclic_softint() -> (*handler)(arg)
More on this later..

( Jun 30 2005, 10:36:44 PM PDT ) Permalink Comments [2]

20050614 Tuesday June 14, 2005

kernel memory corruption detected !! and root caused kernel memory corruption detected !! and root caused

I worked the following really interesting bug (rather, root caused) a while back. This bug was reported in one of the late builds of Solaris 10 and it came with a reproducible test case (I love bugs that are reproducible :-)).
This results in system panic and panics/hangs are not anyway palatable. So this had to be fixed before FCS (First Customer Shipment).

Solaris has the facility to collect the crash file  (even though  I take this for granted,  it  came as a surprise to me that some contemporary unix clones don't have this facility) - so whenever a solaris system panics/hangs,  it takes a snapshot of the kernel pages (dumpadm(1)) onto the swap device, which is later extracted into /var/crash/vmcore.? while coming up.

This is what is spewed on the console when the problem happened:


kernel memory allocator: buffer modified after being freed
modification occurred at offset 0x748 (0xdeadbeefdeadbeef replaced by
0xdeadbed7deadbeef)
buffer=c9d701f8 bufctl=cb9ebce8 cache: process_cache
previous transaction on buffer c9d701f8:
thread=d8994000 time=T-0.006178876 slab=cbdef500 cache: process_cache
kmem_cache_free_debug+f5
kmem_cache_free+2c
pid_exit+cb
freeproc+471
waitid+22a
waitsys+19

panic[cpu0]/thread=d86ce800: kernel heap corruption detected

cd2a9ea4 genunix:kmem_error+3e6 (0, c8a0b6f0, c9d701)
cd2a9ed4 genunix:kmem_cache_alloc_debug+2cf (c8a0b6f0, c9d701f8,)
cd2a9f00 genunix:kmem_cache_alloc+129 (c8a0b6f0, 0)
cd2a9f34 genunix:getproc+5b (cd2a9f6c, 0)
cd2a9f70 genunix:cfork+8f (0, 1, cd2a9fac, fe8)
cd2a9f80 genunix:fork1+c (1cce90, fec1f3d4, 3)

Solaris kernel memory allocator provides excellent debug features and one such is, it fills up a freed data structure with 'deadbeef' pattern such that when reallocating the same data structure later on, it can verify that contents are still 'deadbeef' and if not, it is an indication that some thread accessed the freed memory - sign for trouble.

The same thing happened above, first stack points out the path leading to the freeing of the buffer, along with when it was freed, what cache the buffer belongs to.

The 2nd stack points out, what problem it detected with the buffer when it tried allocating that. Ok, so its clear that process structure (proc_t of usr/src/uts/common/sys/proc.h) got modified after being freed, at offset 0x74c and this corresponds to:

  > ::print -a proc_t
[.]
748 p_server_threads {
748 dp_threads
74c dp_cv {
74c _opaque <--- this one.
}
}

As the problem is reproducible, I collected multiple crash files and all showed the same symptom - i.e corruption at the same offset in the proc_t - So this confirms that it is not some random memory corruption but some thread which knows what it's modifying.

Referring to the test case, what it does is, it creates a door server and forks off a few children which will bombard the server with door calls. Doors is an interesting/faster IPC mechanim available in solaris. All the server threads which serve a door hang off from p_server_threads (refer to the door_create(3DOOR) man page for details on private door for which a separate pool of server threads is maintained).

Given that the problem is
            - happening at offset 0x74c
            - test case creates a door server and slams it hard
            - and we now know what 'p_server_threads' is all about,
it is clear that this process structure once was a door server process, and door clients could potentially be doing a cv_signal()/cv_broadcast() after the server is gone.

But how is this possible? Given that as part of exit, server revokes the doors in proc_exit()

        /*
* revoke any doors created by the process.
*/
if (p->p_door_list)
door_exit();

As part of door_exit():

        
for (dp = p->p_door_list; dp != NULL; dp = dp->door_list) {
dp->door_target = NULL;
dp->door_flags |= DOOR_REVOKED;
if (dp->door_flags & DOOR_PRIVATE)
cv_broadcast(&dp->door_servers.dp_cv);
}
cv_broadcast(&p->p_server_threads.dp_cv);
/* Clear the list */
p->p_door_list = NULL;



it is setting the DOOR_REVOKED flag and doing a broadcast on the dp_cv to wake up any client threads waiting for server threads in door_get_server():
        while ((server_t = pool->dp_threads) == NULL) {
if (!cv_wait_sig_swap(&pool->dp_cv, &door_knob))
return (NULL); /* Got a signal */
if (DOOR_INVALID(dp))
return (NULL); /* Target is invalid now */

So once the clients are woken up from cv_wait_sig_swap() because of broadcast from door_exit(), they check for the validity of door and return NULL as the door is just revoked.

So on the surface, this doesn't look possible - but it is happening !!

To confirm that it is the cv_signal() from the clients that is modifying the server's proc structure after it is freed, I built a kernel with process_cache, aligned on page boundary. It is quite simple in solaris - kmem_cache_create() accepts alignment as an option, so just pass on the page size to the following process_cache create in main()

        /*
* Create kmem cache for proc structures
*/
process_cache = kmem_cache_create("process_cache", sizeof (proc_t),
PAGESIZE , NULL, NULL, NULL, NULL, NULL, 0);

and added a call to hat_unload() in pid_exit() for the process structure.

The new kernel panicked with:

0xc7a4fca8:     misc40: 
0xca529ae8: Page fault (#pf)
0xca5295a8: Bad kernel fault at addr=0xd112674c
0xca529e68: pid=112566, pc=0xfe88fceb, sp=0xd0e67000, eflags=0x10202
0xca529768:
0xca536b2b: d14eedc4 unix:die+b1 (e, d14eee60, d112674c, 3)
0xca5367ab: d14eee4c unix:trap+a33 (d14eee60, d112674c, 3)
0xca53626b: d14eee60 unix:_cmntrap+c9 ()
0xc8c79cab: d14eeeb0 genunix:cv_signal+9 (d112674c)
0xc8c7976b: d14eeed8 genunix:cv_wait_sig_swap+228 (d112674c, c8d92fe0)
0xc84225eb: d14eeef8 doorfs:door_get_server+4d (ceba9080)
0xc84227ab: d14eef70 doorfs:door_call+c8 (3, c27f17a8, d14eefac, fe801925, 3, c27f17a8)
0xc8422ceb: d14eef80 doorfs:doorfs+9f (3, c27f17a8, 0, 0, 0, 3)


Aha, All along I was looking for cv_signal() calls from within doorfs code, but here is a  cv_signal() coming from cv_wait_sig_swap() - interesting. Referring to the code, cv_wait_sig_swap() does not call cv_signal() unless TS_SIGNALLED flag is set - and this is getting set only in sleepq_wakeone_chan() and cv_signal()  is the only caller of sleepq_wakeone_chan() - panic thread has this flag set :

> d1604400::print kthread_t t_schedflag
t_schedflag = 0x13 (TS_LOAD|TS_DONT_SWAP|TS_SIGNALLED)
>

and the only cv_signal() on dp_cv, is coming from door_release_server() !!

So after more code reading, it turns out that this is what is happening;

1- door server threads in door_release_server() signal the waiting clients.
2- client threads get onto the run Q. At this stage door server doesn't not have any clue that there are some client threads in door_call on run Q (this is a stress test, where hundreds of processes are running on a 2way system, so things change very fast).
3- Server calls door_revoke_all() to revoke all the doors and one of the server threads through door_slam() sets the door_target to NULL. cv_broadcast() in door_revoke_all() sets, any other waiting client threads to run; and exits normally.
4- If any of the client threads, made runnable at step-3, get to run, it results in door_call() to fail, and as per the test case, if a door_call() of a thread fails, it issues an exit() rather than a thr_exit().
5- exit() results in setting SEXITLWPS flag being set on client proc.
6- Now if threads made runnable at step-2, get onto the cpu, they will execute the following piece of code in cv_wait_sig_swap() :

	signalled = (t->t_schedflag & TS_SIGNALLED);
t->t_flag &= ~T_WAKEABLE;
/* TS_DONT_SWAP set by disp() */
ASSERT(curthread->t_schedflag & TS_DONT_SWAP);
mutex_enter(mp);
if (ISSIG_PENDING(t, lwp, p)) {
mutex_exit(mp);
if (issig(FORREAL))
rval = 0;
mutex_enter(mp);
}
if (lwp->lwp_sysabort || MUSTRETURN(p, t))
rval = 0;
lwp->lwp_asleep = 0;
lwp->lwp_sysabort = 0;
if (rval == 0 && signalled) /* avoid consuming the cv_signal() */
cv_signal(cvp);



signalled is TRUE as these threads are cv_signal()ed; 'rval' is '0' as issig(FORREAL) returns TRUE becasue of SEXITLWPS. The last cv_signal()  results in modifying the proc structure which is long gone. So depending on how many threads were cv_signal()ed in door_release_server() 'deadbeef' will come down by that number. Changing the test case to call thr_exit() instead of exit() upon failure of door_call() is preventing this problem...

Here is the test case that reproduces the above problem.
host> cat misc40.c
#include <stdio.h>
#include <errno.h>
#include <sys/door.h
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>

#define NPROC 10
#define NTHREAD 100
#define NITERS 100
#define PASS 0
#define FAIL -1

int nthreads = NTHREAD;
int nproc = NPROC;
int niter = NITERS;
int did_lock;

void * convert(void *);
void lock_server_stuff(void *, char *, int, door_desc_t *, int);
void do_threads();
pid_t childs[NPROC];

/*
* Slam a server as hard as possible
*/
int
main(int argc, char *argv[])
{
char buf[1024];
int ret;
int i;
int was_fail;
int childstat;

if(argc > 1)
niter = atoi(argv[1]);
if(argc > 2)
nproc = atoi(argv[2]);
if(argc > 3)
nthreads = atoi(argv[3]);

sprintf(buf," NPROC %d NTHREAD %d NITERS %d", NPROC, NTHREAD, NITERS);

if ((did_lock = door_create(lock_server_stuff, 0xdeadbeef, DOOR_UNREF)) < 0) {
sprintf(buf," did_lock %d errno %d", did_lock, errno);

return FAIL;
}

for (i = 0; i < nproc; i++) {
if ((childs[i] = fork1()) == 0) {
do_threads();
}
}

/* Wait for an unref to be delivered */
was_fail = 0;
for (i = 0; i < nproc; i++) {
ret = waitpid(childs[i], &childstat, 0);
if (ret != childs[i] ||
!WIFEXITED(childstat) ||
WEXITSTATUS(childstat) != PASS) {

was_fail++;
}
}

/*
* Wait 10 seconds. The server should get argp == DOOR_UNREF_DATA
* and exit(PASS)
*/
sleep(10);

/* Server wasn't exit */
if (was_fail) {
return FAIL;
}
return PASS;
}

char *s1 =
"After Seven Days, He was Quite Tired, So God Said, \
Let There be Day, Just for Picnics, with Wine and Bread,\
He Gathered Up Some People he Had Made,\
Created Blankets and laid back in the Shade.\
The People Sipped their wine , And what with God there,\
Asked him Questions, Like:\
Do you have to Eat? Or get your Hair Cut,\
In Heaven?\
And if your Eye got poked Out in This Life,\
Would it be waiting there in Heaven with your Wife?\
God Shuffled his Feet,\
Glanced around, at them\
The People Cleared their throats,\
And stared right back,\
At him";

/* Client code */
void *
convert(void * frog)
{
char buf[1024];
int ret;
int i;
char buffer[1024];

for (i = 0; i < niter; i++) {
door_arg_t da;

da.rbuf = buffer;
da.rsize = sizeof(buffer);
da.data_ptr = s1;
da.data_size = strlen(s1) + 1;
da.desc_ptr = 0;
da.desc_num = 0;

/* After 'lock_server_stuff: exit(PASS)' door_call return -1 */
/* (errno is not change) */
errno = 0;
if ((ret = door_call(did_lock, &da)) < 0 && errno != 0) {
sprintf(buf," ret %d errno %d (i %d)", ret, errno, i);

exit (FAIL);
}
}
}

int invocations;

/* Server code */
void
lock_server_stuff(void *c, char *argp, int a_size, door_desc_t *dp, int n_did)
{
char *p = argp;

if (argp == DOOR_UNREF_DATA) {

exit(PASS);
}

invocations++; /* Not protected */
/* Convert the string to upper case */
while (*p) {
if (*p >= 'a' && *p <= 'z')
*p -= ('a' - 'A');
p++;
}
door_return(argp, a_size, 0, 0);
}


void
do_threads()
{
char buf[1024];
int max_threads;
int i;

thr_setconcurrency(20);
/*
* Create a bunch of threads to pound on the server
*/
for (i = 0; i < nthreads; i++) {
if (thr_create(NULL, 8192, convert, NULL, 0, 0) != 0) {
sprintf(buf," max_threads %d ", i);
break;
}
}
max_threads = i;
for (i = 0; i < max_threads; i++) {
thr_join(0, 0, 0);
}

exit(PASS);
}
host>

This is fixed in build_71 of s10 as part of bug# 5105724.

Technorati Tag:
Technorati Tag:
( Jun 14 2005, 08:44:22 AM PDT ) Permalink Comments [3]

Calendar

RSS Feeds

Search

Links

Navigation

Referers