|
Observing the Solaris Kernel
This past weekend I had the pleasure of presenting at the
SIGCSE conference
in Houston. For the unacquainted the audience is computer science
educators, so my focus was on using OpenSolaris as a vehicle for
teaching operating system internals.
For the conference I prepared some slides and a demo that attempt to peel off
the first layer of the observability onion into the Solaris kernel. Aside
from DTrace (which by now the whole world has heard about) I also make
extensive use of the excellent mdb kernel debugging facilities. The demos
in particular are worth sharing so I am posting them here for all to see
and use as you see fit.
First,
the slides (also in
StarOffice; if you steal them for your own use all I ask is that you
provide a pointer back to this blog entry). The main theme of my presentation
is that OpenSolaris brings to the table free access to the source code, while
the powerful observability tools allow quick insight into the dynamics of
the kernel without mastering the source.
The first demo starts from a simple C program, follows the open() syscall
flow down to the process file table with DTrace, and then drills all the way
down (using the kernel debugger) into the inode at the file system level.
This is accomplished by leveraging access to the source code, and for the
demo I deliberately chose a part of the system to look at I had never
been in before. This demo took about twenty minutes to create and about
twenty minutes to walk through.
The second demo shows how mdb can be used to examine the anatomy
of a stopped process. I ran out of time before getting to the second demo;
I'll post the second demo in another blog entry in a few days, so keep an
eye out for it.
First Demo
Let's begin with the following program:
/*
* A simple program which creates a file, writes one byte to it,
* closes it, and exits.
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
int
main(void)
{
int fd;
const char buf[] = "\0";
if ((fd = open("simple1.out", O_CREAT|O_WRONLY, 0600)) < 0) {
perror("Unable to create file");
exit (1);
}
if (write(fd, buf, 1) < 0) {
perror("Unable to write file");
exit (2);
}
return (0);
}
Let's take a closer look at the open(2) system call. Running truss(1)
shows that the system call we're interested in is actually the third,
since there are two implicit open()s performed automatically (one by
the runtime linker looking for a config file, and the other to map in
the C library).
% truss -t open ./simple1
open("/var/ld/ld.config", O_RDONLY) Err#2 ENOENT
open("/lib/libc.so.1", O_RDONLY) = 3
open("simple1.out", O_WRONLY|O_CREAT, 0600) = 3
Armed with this much knowledge I use a really simple DTrace script
to see what functions open() calls.
#!/usr/sbin/dtrace -Fs
open:entry
/ execname == "simple1" /
{
self->tracing++;
}
:return
/ self->tracing > 0 /
{
trace(0);
}
open:return
/ self->tracing > 0 /
{
self->tracing--;
}
:entry
/ self->tracing > 0 /
{
trace(0);
}
Running this script yields quite a bit of output, of which the first
few calls are interesting because we can see that the system call is
setting up the new file descriptor which is eventually returned.
% CPU FUNCTION
0 => open 0
0 -> open32 0
0 -> copen 0
0 -> falloc 0
0 -> ufalloc 0
0 -> ufalloc_file 0
0 -> fd_find 0
...
Going over to opensolaris.org and pulling up
the source for fd_find() we can see that the file descriptor is
an index into an array, fi_list[], which is embedded in the process u
structure. Armed with that new info, I setup a breakpoint into the debugger
from the end of open using DTrace.
#!/usr/sbin/dtrace -wFs
open:return
/ execname == "simple1" /
{ breakpoint(); }
Starting the script as root and then running the simple1 program
as my user ID in another window results in the following session (recall
we're looking for the third open call, so we ignore the first two times we
hit the breakpoint):
[console] root@rutamaya > ./simple1_files.d
dtrace: script './simple1_files.d' matched 4 probes
dtrace: allowing destructive actions
dtrace: breakpoint action at probe syscall::open:return (ecb fffffe8501e671e0)
kmdb: target stopped at:
kaif_enter+8: popfq
[0]> :c
dtrace: breakpoint action at probe syscall::open:return (ecb fffffe8501e671e0)
kmdb: target stopped at:
kaif_enter+8: popfq
[0]> :c
dtrace: breakpoint action at probe syscall::open:return (ecb fffffe8501e671e0)
kmdb: target stopped at:
kaif_enter+8: popfq
[0]> <gsbase::print cpu_t cpu_thread->t_procp->p_user.u_finfo.fi_list
cpu_thread->t_procp->p_user.u_finfo.fi_list = 0xfffffea7aa34b800
[0]> 0xfffffea7aa34b800,4::print uf_entry_t uf_file->f_vnode
uf_file->f_vnode = 0xfffffe813aa86c80
uf_file->f_vnode = 0xfffffe813aa86c80
uf_file->f_vnode = 0xfffffe813aa86c80
uf_file->f_vnode = 0xffffffffb282ce00
%gsbase holds a pointer to the current CPU structure; from there I can
get the running thread, access the process pointer, follow the u
structure to the u_finfo which contains the array of open files
(again indexed by file descriptor, as we discovered from the source). Once
I have the pointer of this array I can follow the struct file to get
the vnode for each open file.
Going back to the source and looking at the
struct vnode I can see that the v_type and v_path
fields are interesting so let's take a look at them..
[0]> 0xfffffea7aa34b800,6::print uf_entry_t uf_file|::grep >0|::print "struct file" f_vnode->v_type f_vnode->v_path
f_vnode->v_type = 4 (VCHR)
f_vnode->v_path = 0xfffffea79ece81b8 "/devices/pseudo/pts@0:1"
f_vnode->v_type = 4 (VCHR)
f_vnode->v_path = 0xfffffea79ece81b8 "/devices/pseudo/pts@0:1"
f_vnode->v_type = 4 (VCHR)
f_vnode->v_path = 0xfffffea79ece81b8 "/devices/pseudo/pts@0:1"
f_vnode->v_type = 1 (VREG)
f_vnode->v_path = 0xfffffe80ea212e40 "/home/elowe/proj/SIGCSE/simple1.out"
This prints each element of the array of uf_entry_t structures in the
u structure if the pointer is not NULL, and pipes the dot (which is
a struct file to the ::print dcmd to examine the fields we're
interested in. The first three file descriptors are stdin/stdout/stderr
and all point back to my virtual terminal; the fourth is the file being
written to by the simple1 program.
Since I'm using NFS (and I can see from my earlier DTrace output that
nfs3_open() is in the call path) I decided it might be interesting to
poke at the underlying inode.
The source shows that nfs3_open() gets
an rnode_t by casting the vnode's v_data directly into
the file system's internal inode representation. A little bit ago we had
the vnode pointer in our mitts:
[0]> 0xfffffea7aa34b800,4::print uf_entry_t uf_file->f_vnode
...
uf_file->f_vnode = 0xffffffffb282ce00
So now we can keep digging:
[0]> 0xffffffffb282ce00::print vnode_t
{
..
v_data = 0xfffffe83625e9d10
v_vfsp = 0xffffffff885eaf00
..
v_op = 0xffffffff86d24c40
..
[0]> 0xffffffff86d24c40::print struct vnodeops
{
vnop_name = 0xffffffffc01c5a20 "nfs3"
vop_open = nfs`nfs3_open
..
[0]> 0xfffffe83625e9d10::print rnode_t
{
..
r_size = 0x1
r_attr = {
va_mask = 0xbfff
va_type = 1 (VREG)
va_mode = 0x180
va_uid = 0x1ebfa
va_gid = 0xa
..
[0]> 180=O
0600
[0]> 0x1ebfa=D
125946
[0]> a=D
10
[0]> :c
We can see that the file mask is 0600, the UID is 0x1ebfa hex (which
is 125946 in decimal), and the group ID is 10 decimal.
% ls -l simple1.out
-rw------- 1 elowe staff 1 Feb 27 11:34 simple1.out
Technorati Tags:
[OpenSolaris
Solaris]
(2006-03-06 14:13:37.0/2006-03-06 14:06:36.0)
Permalink
Trackback: http://blogs.sun.com/elowe/entry/observing_the_solaris_kernel
|