|
Diskset Import - An Introduction to the source
One of my most significant contributions to Solaris 10 (along with
Steve Peng) was adding support for import/export of disksets to SVM.
So, what is import/export of disksets? Simply put, you've got a bunch
of disks encapsulated in an SVM diskset, and you want to disconnect
them from one host, connect them to a different host, and get your SVM
configuration back. Why might you want to do this? You might be
consolidating your storage, or, in case of a disaster, moving your
storage from one server to another.
SVM stores its configuration information for the local set in a regular
metadb (the one shown by metadb(1M) without arguments). The
diskset-related configuration is stored in a diskset metadb (the one
shown by the 'metadb -s <diskset>' command) that resides on most (if
not all) of the disks that are part of that diskset. Additionally, the
local set metadb knows about the disksets, including where to find the
diskset metadbs.
The problem with moving storage from one server to another is that you
lose the local metadb and thus don't know where to find the diskset
metadbs (and the associated configuration). To implement diskset
import, we needed to figure out which of the newly connected disks on
the target system have a diskset metadb on them, read the configuration
in from that metadb, and populate the kernel structures with it. That
was the scope of the problem in a nutshell.
We started out by writing the code to scan the disks for diskset metadbs
(entirely in userland). If you want to follow the conversation with code
references, pull up metaimport.c. This is essentially the source of
metaimport(1M). The code starts out by scanning the available set of
disks, pruning the disks that are in use, and then calling
meta_get_set_info for each drive that's left. This is the heart of the
scanning code. It checks to see if a diskset metadb exists on the
passed-in disk and, if one exists, reads it in and does a sanity check
on the metadata read in. It also does the work of figuring out the new
disk names: a disk named c1t1d1 on the source system might be named
c2t2d22 on the target system, and the related metadata in the diskset
metadb needs to be corrected to reflect the fresh state of affairs.
Upon its return, meta_get_set_info has a list of disks that comprise a
diskset.
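The scan-and-prune loop described above can be sketched in portable C. This is only an illustration under stated assumptions: disk_t, probe_diskset_metadb, scan_for_set and needs_rename are hypothetical names invented for this sketch and are not part of libmeta or metaimport.c, and the "probe" here just checks a field instead of reading anything from disk.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a disk visible on the importing host. */
typedef struct disk {
    const char *new_name;   /* name on the target host, e.g. "c2t2d22" */
    const char *old_name;   /* name recorded in the metadb, or NULL */
    const char *setname;    /* diskset found on the disk, or NULL */
    int in_use;             /* nonzero if the host already uses the disk */
} disk_t;

/*
 * Hypothetical probe: does this disk carry a diskset metadb?
 * A real implementation would read and sanity-check the on-disk data.
 */
int
probe_diskset_metadb(const disk_t *dp)
{
    return (dp->setname != NULL);
}

/*
 * Collect the disks that belong to diskset "setname" into out[],
 * skipping disks that are in use or carry no diskset metadb.
 * Returns the number of disks collected.
 */
size_t
scan_for_set(const disk_t *disks, size_t ndisks, const char *setname,
    const disk_t **out, size_t max_out)
{
    size_t i, n = 0;

    for (i = 0; i < ndisks && n < max_out; i++) {
        if (disks[i].in_use)
            continue;               /* prune in-use disks */
        if (!probe_diskset_metadb(&disks[i]))
            continue;               /* no diskset metadb here */
        if (strcmp(disks[i].setname, setname) == 0)
            out[n++] = &disks[i];
    }
    return (n);
}

/* Nonzero if the recorded name is stale and the metadata needs fixing. */
int
needs_rename(const disk_t *dp)
{
    return (dp->old_name != NULL &&
        strcmp(dp->old_name, dp->new_name) != 0);
}
```

In this model a disk recorded as c1t1d1 that now shows up as c2t2d22 is reported by needs_rename, which is the point where the real code rewrites the names stored in the diskset metadb.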
Once we've built up the list of disksets and the disks that comprise
each of those disksets, we pass all of this information to meta_imp_set,
which does the real work of populating the information in the kernel
via ioctls. The MD_DB_USEDEV ioctl creates the kernel structures (akin
to what happens when creating the initial configuration). The
MD_IOCIMP_LOAD ioctl then snarfs in the detailed configuration; the
heart of this code is in md_imp_snarf_set. The ops vector for each of
the modules (stripe, mirror, etc.) was expanded to include an import
op. So, for example, the stripe ops vector now looked something like
this -
md_ops_t stripe_md_ops = {
    stripe_open,            /* open */
    stripe_close,           /* close */
    md_stripe_strategy,     /* strategy */
    NULL,                   /* print */
    stripe_dump,            /* dump */
    NULL,                   /* read */
    NULL,                   /* write */
    md_stripe_ioctl,        /* stripe_ioctl, */
    stripe_snarf,           /* stripe_snarf */
    stripe_halt,            /* stripe_halt */
    NULL,                   /* aread */
    NULL,                   /* awrite */
    stripe_imp_set,         /* import set */
    stripe_named_services
};
The import op for each module handles creating the detailed
configuration as well as updating out-of-date information.
md_imp_snarf_set calls the import op for each of the modules that
appear in the diskset configuration. So, if there's a stripe in the
diskset configuration, stripe_imp_set gets called, and so on.
Subsequently, the code does exactly what it says -
    /*
     * Fixup
     * (1) locator block
     * (2) locator name block if necessary
     * (3) master block
     * (4) directory block
     * calls appropriate writes to push changes out
     */
    if ((err = md_imp_db(setno)) != 0)
        goto cleanup;

    /*
     * Create set in MD_LOCAL_SET
     */
    if ((err = md_imp_create_set(setno)) != 0)
        goto cleanup;
It fixes up another set of out-of-date information and creates the
appropriate structures in the local set to inform the local set about
the diskset configuration and where to find it. That's it: we're done
with our job in the kernel and we return to userland.
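The per-module dispatch through the import op, together with the goto-cleanup error style seen in the snippet above, can be sketched as follows. This is a minimal model and not the kernel code: mod_ops_t is a trimmed-down hypothetical ops vector with only the import slot, and import_set and the sketch_* functions are names made up for the illustration.

```c
#include <stddef.h>

/* Hypothetical ops vector: only the import slot is modelled here. */
typedef struct mod_ops {
    const char *name;
    int (*imp_set)(int setno);  /* import op; returns 0 on success */
} mod_ops_t;

/* Call counters so the effect of the dispatch can be observed. */
int stripe_calls, mirror_calls;

int sketch_stripe_imp(int setno) { (void)setno; stripe_calls++; return (0); }
int sketch_mirror_imp(int setno) { (void)setno; mirror_calls++; return (0); }
int sketch_failing_imp(int setno) { (void)setno; return (5); }

/*
 * Call the import op of every module in mods[] for the given set.
 * Stops at the first failing op, records that cleanup ran, and
 * returns that op's error code. Returns 0 if every op succeeded.
 */
int
import_set(const mod_ops_t *mods, size_t nmods, int setno, int *cleaned_up)
{
    int err = 0;
    size_t i;

    *cleaned_up = 0;
    for (i = 0; i < nmods; i++) {
        if (mods[i].imp_set == NULL)
            continue;           /* module provides no import op */
        if ((err = mods[i].imp_set(setno)) != 0)
            goto cleanup;
    }
    return (0);

cleanup:
    *cleaned_up = 1;            /* real code would undo partial work here */
    return (err);
}
```

Adding one slot to the ops vector lets the generic import code stay unaware of how a stripe or a mirror lays out its records; each module fixes up its own configuration.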
In userland, the only other thing that needs to be done is to inform
the RPC daemon that stores the knowledge about disksets (rpc.metad)
about the existence of this imported set. This is accomplished via the
clnt_resnarf_set routine.
So there you have it - a 15,000 ft overview of the implementation of
diskset import.
( Jun 14 2005, 12:52:07 PM EDT / Jun 14 2005, 12:38:33 PM EDT )
Debugging on Sparc
While debugging x86/x64 crash dumps has been talked about fairly
extensively in various places (most recently here), I haven't come
across any resources that talk about debugging sparc dumps (other than
numerous bug reports). Now that OpenSolaris is live, it'll be
relatively easier for developers outside Sun to debug problems. Thus
the motivation behind this entry.
Most of the time when you get a crash dump from a kernel panic, it's
either during development (in which case it's easy to debug because
you *know* exactly what caused the code to fail) or while the code is
in production. It's harder to debug a dump obtained from a production
machine, primarily because you first need to find out what caused the
code to fail and then need to simulate the failure in the lab.
A lot of the time, finding the root cause entails figuring out what
parameters were passed to functions and what the local variables
looked like at a certain point in time. I'll walk through an example to
demonstrate how function arguments and local variables can be excavated
from a dump.
Parameter passing on sparc - A brief overview
Unlike x86, which passes function arguments on the stack, and x64,
which passes function arguments (at least most of them) in registers,
sparc uses register windows to pass parameters. The caller places
arguments in %o0, %o1 .. %o5, and after the callee's save instruction
they show up in the callee's %i0, %i1 .. %i5, with %i0 holding the
first parameter and so on. If there are more than six input parameters
to a function, parameters after the sixth are passed on the stack. %i6
contains the frame pointer (%fp). Local variables are allocated at an
offset to the frame pointer.
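The callee finds its arguments in %i0 .. %i5 because the caller wrote them to its own %o0 .. %o5, and the two sets are the same physical registers viewed through adjacent windows. A toy model in portable C makes the overlap visible. It is a deliberate simplification: regfile_t, reg_in, reg_local, reg_out, window_save and window_restore are hypothetical names, the window count is arbitrary, and window overflow (spilling to the stack) is not modelled.

```c
#include <stdint.h>

#define NWINDOWS 4  /* arbitrary for the sketch */

/*
 * Each window owns 16 slots (8 ins followed by 8 locals). Its outs are
 * simply the ins of the next window, which is how the overlap arises.
 */
typedef struct regfile {
    uint64_t slot[16 * (NWINDOWS + 1)];
    int cwp;    /* index of the current window in this model */
} regfile_t;

uint64_t *
reg_in(regfile_t *rf, int n)
{
    return (&rf->slot[16 * rf->cwp + n]);
}

uint64_t *
reg_local(regfile_t *rf, int n)
{
    return (&rf->slot[16 * rf->cwp + 8 + n]);
}

uint64_t *
reg_out(regfile_t *rf, int n)
{
    return (&rf->slot[16 * (rf->cwp + 1) + n]);
}

/* Model of "save": the callee gets a fresh window. */
void
window_save(regfile_t *rf)
{
    rf->cwp++;
}

/* Model of "restore": back to the caller's window. */
void
window_restore(regfile_t *rf)
{
    rf->cwp--;
}
```

The same overlap explains return values: whatever the callee leaves in its %i0 is what the caller reads from %o0 after the restore.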
Stack Format
The frame structure is defined in the system file
usr/include/sys/frame.h
and it looks as follows -
struct frame {
    long    fr_local[8];        /* saved locals */
    long    fr_arg[6];          /* saved arguments [0 - 5] */
    struct frame    *fr_savfp;  /* saved frame pointer */
    long    fr_savpc;           /* saved program counter */
#if !defined(__sparcv9)
    char    *fr_stret;          /* struct return addr */
#endif  /* __sparcv9 */
    long    fr_argd[6];         /* arg dump area */
    long    fr_argx[1];         /* array of args past the sixth */
};
So the input parameters are in the fr_arg array.
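To see where fr_arg sits inside a 64-bit frame, the layout can be mirrored with fixed-width fields so that the offsets come out the same on any host. This is a sketch under assumptions: frame64_t and arg_offset are hypothetical names, and the struct mirrors only the sparcv9 variant of struct frame (no fr_stret), with every field taken to be 8 bytes wide.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the sparcv9 struct frame using 64-bit fields. */
typedef struct frame64 {
    int64_t  fr_local[8];   /* saved locals */
    int64_t  fr_arg[6];     /* saved arguments [0 - 5] */
    uint64_t fr_savfp;      /* saved frame pointer (as a plain address) */
    int64_t  fr_savpc;      /* saved program counter */
    int64_t  fr_argd[6];    /* arg dump area */
    int64_t  fr_argx[1];    /* args past the sixth */
} frame64_t;

/* Byte offset of saved argument n (0 - 5) from the start of the frame. */
size_t
arg_offset(int n)
{
    return (offsetof(frame64_t, fr_arg) + (size_t)n * sizeof (int64_t));
}
```

With this layout the first saved argument is 64 bytes into the frame, and the third (what would be %i2) is at offset 80.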
Excavating arguments with an NFSv4 bug
Using bug 6268686 as an example and referencing OpenSolaris, let's look
at the stack trace that resulted in the panic -
> $C
000002a1012203d1 vpanic(1295800, 7aabd868, 7aabd880, 851, 2400, 2a1012210fc)
000002a101220481 assfail+0x74(7aabd868, 7aabd880, 851, 18c6000, 1295800, 0)
000002a101220531 nfs4_make_dotdot+0x4f4(2a101220df8, 2388873b24c20, fffffffffffffff8, 301412eb920, 2a101221238, 1)
000002a101220941 nfs4lookupnew_otw+0x7d8(301ef0d4dc0, 2a101221530, 2a101221528, 301412eb920, df8475800, 38285c955c0)
000002a101220a71 nfs4_lookup+0x114(301ef0d4dc0, 2a101221530, 2a101221528, 301412eb920, 0, 391f52d44a8)
000002a101220b41 fop_lookup+0x28(301ef0d4dc0, 2a101221530, 2a101221528, 7aa69c2c, 0, 600045703c0)
000002a101220c01 lookuppnvp+0x344(2a1012217f0, 0, 600045703c0, 2a101221528, 2a101221530, 6000008dbc0)
000002a101220e41 lookuppnat+0x120(301ef0d4dc0, 0, 1, 0, 2a101221930, 0)
000002a101220f01 lookupnameat+0x5c(0, 0, 1, 0, 2a101221930, 0)
000002a101221011 vn_openat+0x164(1, 400, 1, 1, 0, 1)
000002a1012211d1 copen+0x260(ffffffffffd19553, 87aa3, 0, 50400, 0, 1)
000002a1012212e1 syscall_trap32+0x1e8(87aa3, 0, 50400, 0, 0, 0)
To set the context for this bug, we were trying to look up a directory,
and it so happened that we ended up calling nfs4_make_dotdot to get an
rnode. The comments in the code explain fairly well under what
circumstances this function is called -
/*
 * nfs4_make_dotdot() - find or create a parent vnode of a non-root node.
 *
 * Our caller has a filehandle for ".." relative to a particular
 * directory object. We want to find or create a parent vnode
 * with that filehandle and return it.
.. snip
Like the comments say, we had a filehandle for ".." relative to the
directory object we were trying to look up. So, to start off, what was
the pathname we were trying to look up? To determine this, we'd like to
know the arguments passed into the nfs4_make_dotdot function. Check the
source: the function is defined in uts/common/fs/nfs/nfs4_subr.c as -
int
nfs4_make_dotdot(nfs4_sharedfh_t *fhp, hrtime_t t, vnode_t *dvp, cred_t *cr,
    vnode_t **vpp, int need_start_op)
The interesting bit is the passed-in directory vnode pointer, dvp, which
is the third argument and so arrives in the i2 register. If we can find
dvp, we'll also know the path we're playing with here.
64-bit sparc has a notion of stack bias, and you need to add the stack
bias (0x7ff) to the frame pointer in order to get the actual address of
the stack frame. Applying that to the frame pointer for nfs4_make_dotdot
and dumping out the frame, we have -
> 000002a101220531+0x7ff::print struct frame
{
    fr_local = [ 0, 0, 0x381a4c49000, 0x2a101221108, 0x2a101220ee0, 0x2a101221118, 0x7aabd800, 0x7aabd800 ]
    fr_arg = [ 0x2a101220df8, 0x2388873b24c20, 0xfffffffffffffff8, 0x301412eb920, 0x2a101221238, 0x1 ]
    fr_savfp = 0x2a101220941
    fr_savpc = 0x7aa6b474
    fr_argd = [ 0x1, 0x5bc679f3060, 0, 0x2a101221250, 0x200000000, 0x5bc679f3178 ]
    fr_argx = [ 0 ]
}
i2 here (fr_arg[2], 0xfffffffffffffff8) looks bogus, darn! Let's back up
one function higher to nfs4lookupnew_otw and see if we can fish dvp out
of its frame easily.
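The address arithmetic behind the +0x7ff in the frame dump can be sketched in a few lines of portable C. The names STACK_BIAS_V9, unbias and saved_in_addr are hypothetical, and the offset of 64 assumes the saved ins follow eight 8-byte saved locals, as in the struct frame shown earlier.

```c
#include <stdint.h>

#define STACK_BIAS_V9 0x7ffULL  /* 2047, the 64-bit sparc stack bias */

/* Turn a biased frame pointer into the real address of the frame. */
uint64_t
unbias(uint64_t biased_fp)
{
    return (biased_fp + STACK_BIAS_V9);
}

/*
 * Address at which saved input register %iN (N = 0 - 5) lives in the
 * frame: 64 bytes of saved locals come first, then 8 bytes per in.
 */
uint64_t
saved_in_addr(uint64_t biased_fp, int n)
{
    return (unbias(biased_fp) + 64 + 8 * (uint64_t)n);
}
```

For the nfs4_make_dotdot frame pointer 0x2a101220531 this gives a frame address of 0x2a101220d30, with the saved %i2 at 0x2a101220d80.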
Quick look at the source in uts/common/fs/nfs/nfs4_vnops.c
and -
static int nfs4lookupnew_otw(vnode_t *dvp, char *nm, vnode_t **vpp, cred_t *cr)
The same dvp we're looking for should be in i0 provided it's not been
overwritten. Dump out the frame -
> 000002a101220941+0x7ff::print struct frame
{
    fr_local = [ 0x391f52d4450, 0x381a4c49000, 0x2388873c342a0, 0x600074432a8, 0x2388873b24c20, 0, 0x1, 0x391f52d44e8 ]
    fr_arg = [ 0x301ef0d4dc0, 0x2a101221530, 0x2a101221528, 0x301412eb920, 0xdf8475800, 0x38285c955c0 ]
    fr_savfp = 0x2a101220a71
    fr_savpc = 0x7aa69d40
    fr_argd = [ 0x38f6901b110, 0x311fe247dc0, 0x2a101221704, 0x2a1012216fc, 0x2a101220c71, 0x7aa7ad20 ]
    fr_argx = [ 0x2a101221264 ]
}
Quick check to see if it's been overwritten -
> nfs4lookupnew_otw::dis!grep i0
[ .. elided ]
It's not overwritten, we're in luck! Double check to see if it's a
vnode.
> 0x301ef0d4dc0::whatis
301ef0d4dc0 is 301ef0d4dc0+0, bufctl 301ecba50c8 allocated from vn_cache
It sure is a vnode. Dumping out the path is now easy -
> 301ef0d4dc0::print vnode_t v_data |::print rnode4_t r_svnode.sv_name
r_svnode.sv_name = 0x391d8a24090
> 0x391d8a24090::print nfs4_fname_t fn_parent fn_name
fn_parent = 0x3b6565ba920
fn_name = 0x3292727cbe0 "uts"
> 0x3b6565ba920::print nfs4_fname_t fn_parent fn_name
fn_parent = 0x353987b1550
fn_name = 0x4f42a102fe0 "src"
> 0x353987b1550::print nfs4_fname_t fn_parent fn_name
fn_parent = 0
We're operating on ./src/uts, and this wasn't handled correctly in the
lookup handling routine (it's fixed now).
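The manual walk up the fn_parent pointers is a leaf-to-root traversal of a linked list, with the path assembled in the opposite order. A small sketch in portable C shows the idea. fname_t and fname_path are hypothetical: the struct carries only the two fields printed in the session above and is not the kernel's nfs4_fname_t. Since the name of the topmost node is not shown in that output, the example chain below models only the "src" and "uts" components.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical two-field stand-in for the fields printed above. */
typedef struct fname {
    const struct fname *fn_parent;  /* NULL at the top of the chain */
    const char *fn_name;            /* this path component */
} fname_t;

/*
 * Write the path from the top of the chain down to fn into buf,
 * components separated by '/'. Returns 0 on success, or -1 if buf
 * is too small.
 */
int
fname_path(const fname_t *fn, char *buf, size_t buflen)
{
    size_t cur, len;

    if (buflen == 0)
        return (-1);
    if (fn == NULL) {
        buf[0] = '\0';              /* past the top: start empty */
        return (0);
    }
    /* Emit the ancestors first so the path reads root to leaf. */
    if (fname_path(fn->fn_parent, buf, buflen) != 0)
        return (-1);

    cur = strlen(buf);
    len = strlen(fn->fn_name);
    if (cur + (cur > 0 ? 1 : 0) + len + 1 > buflen)
        return (-1);
    if (cur > 0)
        buf[cur++] = '/';
    memcpy(buf + cur, fn->fn_name, len + 1);    /* copies the NUL too */
    return (0);
}
```

Given a node named "uts" whose parent is a node named "src", fname_path produces "src/uts", which matches the order in which the components were read off by hand.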
As I mentioned earlier, local variables are stored at an offset to the
frame pointer. Now that we have the frame pointer, we can dig out the
local variables. The variable of interest in this case was the error
structure declared on the stack in nfs4_make_dotdot.
A close look at the disassembly of the function and we can see -
nfs4_make_dotdot+0x27c: mov %l2, %o0
nfs4_make_dotdot+0x280: mov 0xc, %o3
nfs4_make_dotdot+0x284: call +0x15c58 nfs4_end_fop
nfs4_make_dotdot+0x288: mov %l7, %o5
nfs4_make_dotdot+0x28c: ba -0xd0 nfs4_make_dotdot+0x1bc
nfs4_make_dotdot+0x290: cmp %i5, 0
nfs4_make_dotdot+0x294: add %fp, 0x797, %o4
nfs4_make_dotdot+0x298: mov %l2, %o0
nfs4_make_dotdot+0x29c: mov 0xc, %o3
nfs4_make_dotdot+0x2a0: mov %l1, %i1
nfs4_make_dotdot+0x2a4: call +0x15c38 nfs4_end_fop
nfs4_make_dotdot+0x2a8: mov %l1, %o5
nfs4_make_dotdot+0x2ac: ld [%fp + 0x7bb], %i3    <------
nfs4_make_dotdot+0x2b0: cmp %i3, 0
that it's stored at fp + 0x7bb (fp is the fr_savfp in
nfs4_make_dotdot's frame).
Dump it out -
> 0x2a101220941+0x7bb::print nfs4_error_t
{
    error = 0
    stat = 0t10006 (NFS4ERR_SERVERFAULT)
    rpc_status = 0 (RPC_SUCCESS)
}
This reveals a secondary problem in the code which is that there are no
checks for errors like NFS4ERR_SERVERFAULT (again, now fixed).
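One detail worth spelling out is that the 0x7bb in the disassembly already includes the stack bias, so it is added to the biased frame pointer directly and 0x7ff is not added again. The arithmetic can be sketched in portable C; local_addr and offset_from_unbiased are hypothetical helper names used only for this illustration.

```c
#include <stdint.h>

/* Address of a local that the code references as [%fp + off]. */
uint64_t
local_addr(uint64_t biased_fp, int64_t off)
{
    return (biased_fp + (uint64_t)off);
}

/* The same offset expressed relative to the unbiased frame address. */
int64_t
offset_from_unbiased(int64_t off)
{
    return (off - 0x7ff);
}
```

With fp = 0x2a101220941 and an offset of 0x7bb the local lives at 0x2a1012210fc, which is 0x44 bytes below the unbiased frame address.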
( Jun 14 2005, 12:53:42 PM EDT / Jun 14 2005, 11:50:01 AM EDT )
|