Dan Mick's Little Shop of Hints

Wednesday June 15, 2005

Diagnosing kernel hangs/panics with kmdb and moddebug

If you experience hangs or panics during Solaris boot, whether it's during installation or after you've already installed, using the kernel debugger can be a big help in collecting the first set of "what happened" information.

The kernel debugger is named "kmdb" in Solaris 10 and later, and is invoked by supplying the '-k' switch in the kernel boot arguments. A common request from a kernel engineer starting to examine a problem is "try booting with kmdb".

Sometimes it's useful to either set a breakpoint to pause the kernel startup and examine something, or to just set a kernel variable to enable or disable a feature, or enable debugging output. If you use -k to invoke kmdb, but also supply the '-d' switch, the debugger will be entered before the kernel really starts to do anything of consequence, so that you can set kernel variables or breakpoints.

So "booting with the -kd flags" is the key to "booting under the kernel debugger". Now, how do we do that?

Kernel debugging with GRUB-boot systems

On modern Solaris and OpenSolaris systems, GRUB is used to boot; to enable the kernel debugger, you add -kd arguments to the "kernel" (or "kernel$") line in the GRUB menu entry. When presented with the GRUB menu, hit 'e' to edit the entry, highlight the kernel line, and hit 'e' again to edit it; add the -kd arguments just after the /platform/i86pc/kernel/$ISADIR/unix argument, so that it says

kernel$ /platform/i86pc/kernel/$ISADIR/unix -kd
and then hit 'b' to boot that edited menu entry. '-k' means "start the debugger"; '-d' means "immediately enter the debugger after loading the kernel". After some booting status, you'll see the kernel debugger announce itself like this:
[0]>

(The number in square brackets is the CPU that is running the kernel debugger; that number might change for later entries into the debugger.)
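
If you end up doing this often, the same edit can presumably be kept as a separate entry in the GRUB menu file instead of being retyped at each boot. The following is only a sketch of what such an entry might look like: the title is made up, and the module$ line is an assumption about a typical installation, so copy the remaining lines from your existing working entry rather than from here.

```
title Solaris (kmdb)
kernel$ /platform/i86pc/kernel/$ISADIR/unix -kd
module$ /platform/i86pc/$ISADIR/boot_archive
```

Only the kernel$ line differs from a normal entry; everything after "unix" is passed to the kernel as boot arguments.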

Now we're in the kernel debugger

There are two good reasons to run under the kernel debugger:
  1. If we panic, the panic can be examined before reboot; you can get stack backtraces and get some idea of which section of code might be at fault.
  2. Now we can set kernel variables, set breakpoints, etc. to affect the kernel run.
Obviously, there's a lot you can do in a kernel debugger, and I'm only touching on it here, but here are three good ones:
  1. For investigating hangs: try turning on module debugging output. You can set the value of a kernel variable by using the '/W' command ("write a 32-bit value"). Here's how you set moddebug to 0x80000000, and then continue execution of the kernel:
    [0]> moddebug/W 80000000
    [0]> :c
    
    That will give you debug output for each kernel module that loads. (See /usr/include/sys/modctl.h, near the bottom, for moddebug flag information. I find 0x80000000 is the only one I ever really use.)
  2. To collect information about panics: when the kernel panics, it will drop into the debugger and print some interesting information. Usually the most interesting thing to look at first is the stack backtrace, which shows, most recent call first, all the functions that were active at the time of the panic. To generate a stack backtrace, use
    [0]> $c
    

    A few other very useful informational commands during a panic are

    ::msgbuf
    which will show you the last things the kernel printed onscreen, and
    ::status
    which shows a summary of the state of the machine in panic.
  3. If you're running the kernel while the kernel debugger is active, and you experience a hang, you may be able to break into the debugger to examine the system state. You can do this by pressing the <F1> and <A> keys at the same time (a sort of "F1-shifted-A" keypress). (On SPARC systems, this key sequence is <Stop>-<A>.) This should give you the same debugger prompt as above, although on a multi-CPU system the CPU number in the prompt may be something other than 0. Once in the kernel debugger, you can get a stack backtrace as above; you can also use ::switch to change to another CPU and get a stack backtrace there, which might shed more light on the hang. For instance, if you break into the debugger on CPU 1, you could switch to CPU 0 with
    [1]> 0::switch
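
Strung together, the commands from the list above might look something like the following on a two-CPU machine that hangs during boot. This is only a sketch: no output is shown, the CPU numbers are examples, and the one command not taken from the list is moddebug/X, which in mdb syntax prints the 32-bit value of that variable in hex and is used here just to confirm that the write took effect.

```
[0]> moddebug/W 80000000
[0]> moddebug/X
[0]> :c

[1]> $c
[1]> ::msgbuf
[1]> 0::switch
[0]> $c
```

The first three lines are typed at the initial -kd prompt. The rest assume you have broken back into the debugger after the hang and happened to land on CPU 1, so you take a backtrace there, look at the recent console messages, and then repeat the backtrace on CPU 0.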
    

There's obviously a lot more you can do with the kernel debugger, but these small tips will sometimes help you get from "I have no idea what to do" to "I have a few ideas to try that might let me continue to boot or install", which can make all the difference.

Posted Jun 15 2005, 04:26:17 PM PDT
