Alok Aggarwal's Weblog


Tuesday June 14, 2005

 Diskset Import - An Introduction to the source

One of my most significant contributions (along with Steve Peng) to
Solaris 10 was adding support for import/export of disksets to SVM. So, what
is import/export of disksets? Simply put, you've got a bunch of disks
encapsulated in an SVM diskset and you want to disconnect them from one
host, connect them to a different host, and get your SVM configuration
back. Why might you want to do this? You might be consolidating your storage
or, in case of a disaster, moving your storage from one server to
another.

SVM stores its configuration information for the local set in a regular
metadb (the one that can be seen by metadb(1M) without arguments). The diskset
related configuration is stored in a diskset metadb (one that can be seen by
'metadb -s <diskset>' command) that resides on most (if not all) of the disks that are
a part of that diskset. Additionally, the local set metadb has knowledge about
the disksets including information on where to find the diskset metadbs.

The problem with moving storage from one server to another is that you lose
the local metadb and thus don't know where to find the diskset metadbs (and the
associated configuration). To implement diskset import, we needed to
figure out which of the newly connected disks on the target system have a diskset
metadb on them, read the configuration in from that metadb, and populate the
kernel structures with that configuration. That was the
scope of the problem in a nutshell.

We started out by writing the code to scan the disks for diskset metadbs
(entirely in userland). If you want to follow the conversation with code
references, pull up metaimport.c. This is essentially the source of
metaimport(1M). The code starts out by scanning the available set of disks and
pruning the disks that are in use; then, for each drive that's left, it
calls meta_get_set_info - this is the heart of the scanning code. It checks
to see if a diskset metadb exists on the passed-in disk and, if one exists,
reads it in and does a sanity check on the metadata it read. It
also does the work of figuring out the new disk names: a disk named c1t1d1
on the source system might be named c2t2d22 on the target system, and the
related metadata in the diskset metadb has to be corrected to reflect
the new state of affairs. Upon its return, meta_get_set_info has a list
of disks that comprise a diskset.
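
The scanning step described above can be pictured as a filter over the candidate disks. The following is only a sketch under stated assumptions: struct disk, its fields, and scan_for_set are hypothetical names invented for illustration, and none of this is the actual metaimport.c or meta_get_set_info code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one disk seen during the scan. */
struct disk {
    const char *name;      /* name on this host, e.g. "c2t2d22" */
    int in_use;            /* nonzero if the disk is already in use */
    const char *set_name;  /* diskset named by its metadb, NULL if none */
};

/*
 * Collect the disks that belong to diskset `set`: prune disks that are
 * in use, skip disks without a diskset metadb, and keep the ones whose
 * metadb names `set`. Returns the number of entries stored in `out`
 * (never more than `max`).
 */
size_t
scan_for_set(const struct disk *disks, size_t ndisks, const char *set,
    const struct disk **out, size_t max)
{
    size_t found = 0;

    for (size_t i = 0; i < ndisks && found < max; i++) {
        if (disks[i].in_use)
            continue;               /* pruned: disk is busy */
        if (disks[i].set_name == NULL)
            continue;               /* no diskset metadb here */
        if (strcmp(disks[i].set_name, set) == 0)
            out[found++] = &disks[i];
    }
    return (found);
}
```

The real code also has to read and sanity check the metadb contents and fix up device names, which this sketch leaves out entirely.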

Once we've built up the list of disksets and the disks that comprise each
of those disksets, we pass all of this information to meta_imp_set, which does
the real work of populating the information in the kernel via ioctls. The
MD_DB_USEDEV ioctl creates the kernel structures (akin to what happens when
creating the initial configuration). The MD_IOCIMP_LOAD ioctl then snarfs in
the detailed configuration; the heart of this code is in md_imp_snarf_set.
The ops vector for each of the modules (stripe, mirror, etc.) was expanded to
include an import op. So, for example, the stripe ops vector now looked
something like this -

md_ops_t stripe_md_ops = {
        stripe_open,            /* open */
        stripe_close,           /* close */
        md_stripe_strategy,     /* strategy */
        NULL,                   /* print */
        stripe_dump,            /* dump */
        NULL,                   /* read */
        NULL,                   /* write */
        md_stripe_ioctl,        /* stripe_ioctl, */
        stripe_snarf,           /* stripe_snarf */
        stripe_halt,            /* stripe_halt */
        NULL,                   /* aread */
        NULL,                   /* awrite */
        stripe_imp_set,         /* import set */
        stripe_named_services
};

The import op for each of the modules handled creating detailed configuration
as well as updating out-of-date information.
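
To make the idea of a per-module import op concrete, here is a cut-down sketch of a table of function pointers and a loop that calls the import slot of each entry. Everything in it is an assumption made for illustration: mini_ops_t, the demo_* functions, import_all, and the convention that an import op returns a count of records are all hypothetical and are not taken from the SVM sources.

```c
#include <stddef.h>

/* Hypothetical, cut-down ops vector: only the import slot is modeled. */
typedef struct mini_ops {
    const char *name;
    int (*imp_set)(int setno);  /* import op, NULL if the module has none */
} mini_ops_t;

/* Stand-in import ops; each pretends to import a fixed number of records. */
static int demo_stripe_imp(int setno) { (void)setno; return (2); }
static int demo_mirror_imp(int setno) { (void)setno; return (1); }

static const mini_ops_t demo_modules[] = {
    { "stripe", demo_stripe_imp },
    { "mirror", demo_mirror_imp },
    { "noimport", NULL },
};

/*
 * Call the import op of every module that provides one and return the
 * total number of records the ops reported.
 */
int
import_all(const mini_ops_t *mods, size_t nmods, int setno)
{
    int total = 0;

    for (size_t i = 0; i < nmods; i++) {
        if (mods[i].imp_set == NULL)
            continue;           /* module has nothing to import */
        total += mods[i].imp_set(setno);
    }
    return (total);
}
```

The appeal of the design is that the common code never needs to know which module types exist; adding a new type only means filling in one more slot.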

md_imp_snarf_set calls the import op for each of the modules that appear in
the diskset configuration. So, if there's a stripe in the diskset configuration,
stripe_imp_set gets called, and so on. Subsequently, the code does exactly
what it says :)

        /*
         * Fixup
         * (1) locator block
         * (2) locator name block if necessary
         * (3) master block
         * (4) directory block
         * calls appropriate writes to push changes out
         */
        if ((err = md_imp_db(setno)) != 0)
                goto cleanup;

        /*
         * Create set in MD_LOCAL_SET
         */
        if ((err = md_imp_create_set(setno)) != 0)
                goto cleanup;


It fixes up another set of out-of-date information and creates the appropriate
structures in the local set to inform the local set about the diskset
configuration and where to find it. That's it, we're done with our job in the
kernel and we return to userland.

In userland, the only other thing that needs to be done is to inform the
rpc daemon that stores the knowledge about disksets (rpc.metad) about the
existence of this imported set. This is accomplished via the clnt_resnarf_set routine.

So there you have it - a 15,000 ft overview of the implementation of diskset
import.




( Jun 14 2005, 12:52:07 PM EDT / Jun 14 2005, 12:38:33 PM EDT )
Trackback: http://blogs.sun.com/aalok/entry/diskset_import_an_introduction_to

 Debugging on Sparc


While debugging x86/x64 crash dumps has been talked about fairly extensively
in various places (most recently here), I haven't come across any
resources that talk about debugging sparc dumps (other than numerous
bug reports). Now that OpenSolaris is live, it'll be easier for
developers outside Sun to debug problems. Thus, the motivation behind
this entry.

Most of the time when you get a crash dump from a kernel panic, it's
either during development (in which case it's easy to debug because
you *know* exactly what caused the code to fail) or it's while the code is
in production. It's harder to debug when you're given a dump obtained
from a production machine primarily because first you need to find out
what caused the code to fail and second you need to simulate the failure
in the lab.

Much of the time, finding the root cause entails figuring out what
parameters were passed to functions and what the local variables
looked like at a certain point in time. I'll walk through an example to
demonstrate how function arguments and local variables can be excavated
from a dump.

Parameter passing on sparc -  A brief overview

Unlike x86, which passes function arguments on the stack, and x64, which passes
function arguments (at least most of them) in registers, sparc uses
register windows to pass parameters. The caller places arguments in
%o0, %o1 .. %o5; once the callee's save instruction has rotated the register
window, the callee sees them as
%i0, %i1 .. %i5, with %i0 holding the first parameter and so on. If there
are more than six input parameters to a function, parameters after the
sixth are passed on the stack. %i6 contains the frame pointer (%fp).

Local variables are allocated at an offset to the frame pointer.
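
As a rough mental model of a register window, the sketch below treats a window as three banks of eight registers and models a function call by copying the caller's out registers into the callee's in registers. This is an assumption-laden simplification: struct window and enter_callee are hypothetical names, real hardware overlaps the banks rather than copying them, and the callee's locals and outs are not actually zeroed.

```c
#include <stdint.h>
#include <string.h>

/* Simplified model: one register window as three banks of eight. */
struct window {
    uint64_t in[8];     /* %i0..%i7 */
    uint64_t local[8];  /* %l0..%l7 */
    uint64_t out[8];    /* %o0..%o7 */
};

/*
 * Model the visible effect of the callee's `save`: what the caller put
 * in its outs is what the callee finds in its ins. The remaining banks
 * are zeroed here purely to keep the model deterministic.
 */
struct window
enter_callee(const struct window *caller)
{
    struct window callee;

    memset(&callee, 0, sizeof (callee));
    memcpy(callee.in, caller->out, sizeof (callee.in));
    return (callee);
}
```

In this model the caller's %o6 (its stack pointer) shows up as the callee's %i6, which is why %i6 serves as the frame pointer.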

Stack Format

The frame structure is defined in the system file usr/include/sys/frame.h
and it looks as follows -

struct frame {
        long    fr_local[8];            /* saved locals */
        long    fr_arg[6];              /* saved arguments [0 - 5] */
        struct frame *fr_savfp;         /* saved frame pointer */
        long    fr_savpc;               /* saved program counter */
#if !defined(__sparcv9)
        char    *fr_stret;              /* struct return addr */
#endif  /* __sparcv9 */
        long    fr_argd[6];             /* arg dump area */
        long    fr_argx[1];             /* array of args past the sixth */
};

So the input parameters are in the fr_arg array.
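
To see where the saved arguments land inside a 64-bit frame, here is a small sketch that mirrors the sparcv9 variant of the structure with fixed-width fields so that the offsets come out the same on any host. struct frame64, arg_offset and frame_arg are hypothetical names used only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-independent mirror of the sparcv9 layout (no fr_stret field). */
struct frame64 {
    uint64_t fr_local[8];   /* saved locals */
    uint64_t fr_arg[6];     /* saved arguments [0 - 5] */
    uint64_t fr_savfp;      /* saved frame pointer */
    uint64_t fr_savpc;      /* saved program counter */
    uint64_t fr_argd[6];    /* arg dump area */
    uint64_t fr_argx[1];    /* args past the sixth */
};

/* Byte offset, from the start of the frame, of saved argument n. */
size_t
arg_offset(unsigned n)
{
    return (offsetof(struct frame64, fr_arg) + n * sizeof (uint64_t));
}

/* Fetch saved argument n (0..5) into *val; returns -1 if n is too big. */
int
frame_arg(const struct frame64 *f, unsigned n, uint64_t *val)
{
    if (n >= 6)
        return (-1);
    *val = f->fr_arg[n];
    return (0);
}
```

With eight 8-byte locals in front, the first saved argument sits 64 bytes into the frame and the saved frame pointer 112 bytes in.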

Excavating arguments with an NFSv4 bug

Using bug 6268686 as an example and referencing OpenSolaris, let's look
at the stack trace that resulted in the panic -

> $C

000002a1012203d1 vpanic(1295800, 7aabd868, 7aabd880, 851, 2400, 2a1012210fc)
000002a101220481 assfail+0x74(7aabd868, 7aabd880, 851, 18c6000, 1295800, 0)
000002a101220531 nfs4_make_dotdot+0x4f4(2a101220df8, 2388873b24c20,
fffffffffffffff8, 301412eb920, 2a101221238, 1)
000002a101220941 nfs4lookupnew_otw+0x7d8(301ef0d4dc0, 2a101221530,
2a101221528, 301412eb920, df8475800, 38285c955c0)
000002a101220a71 nfs4_lookup+0x114(301ef0d4dc0, 2a101221530, 2a101221528,
301412eb920, 0, 391f52d44a8)
000002a101220b41 fop_lookup+0x28(301ef0d4dc0, 2a101221530, 2a101221528,
7aa69c2c, 0, 600045703c0)
000002a101220c01 lookuppnvp+0x344(2a1012217f0, 0, 600045703c0, 2a101221528,
2a101221530, 6000008dbc0)
000002a101220e41 lookuppnat+0x120(301ef0d4dc0, 0, 1, 0, 2a101221930, 0)
000002a101220f01 lookupnameat+0x5c(0, 0, 1, 0, 2a101221930, 0)
000002a101221011 vn_openat+0x164(1, 400, 1, 1, 0, 1)
000002a1012211d1 copen+0x260(ffffffffffd19553, 87aa3, 0, 50400, 0, 1)
000002a1012212e1 syscall_trap32+0x1e8(87aa3, 0, 50400, 0, 0, 0)

To set the context for this bug, we were trying to look up a directory, and it
so happened that we ended up calling nfs4_make_dotdot to get an rnode. The
comments in the code explain fairly well under what circumstances this function
is called -

/*
* nfs4_make_dotdot() - find or create a parent vnode of a non-root node.
*
* Our caller has a filehandle for ".." relative to a particular
* directory object. We want to find or create a parent vnode
* with that filehandle and return it.
.. snip

Like the comments say, we had a filehandle for ".." relative to the directory
object we were trying to look up. So, to start off, what was the pathname we were
trying to look up? To determine this, we'd like to know what arguments were
passed into the nfs4_make_dotdot function. Checking the source, the function
is defined in uts/common/fs/nfs/nfs4_subr.c as -

int
nfs4_make_dotdot(nfs4_sharedfh_t *fhp, hrtime_t t, vnode_t *dvp,
cred_t *cr, vnode_t **vpp, int need_start_op)

The interesting bit is the passed-in directory vnode pointer, dvp. It is the
third argument, so it arrives in the %i2 register. If we can find dvp, we'll
also know the path we're playing with here.

64-bit sparc has a notion of stack bias: the value held in the frame pointer
is offset from the real address, and you need to add the stack bias (0x7ff) to
the frame pointer in order to get at the actual data of the stack frame.
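
The bias arithmetic is simple enough to write down. The sketch below assumes the SPARC V9 bias of 2047 (0x7ff); the names V9_STACK_BIAS, unbias and looks_biased are made up for this example, and the odd-address check is only a heuristic based on the frame pointers seen in the trace above.

```c
#include <stdint.h>

#define V9_STACK_BIAS 0x7ffULL  /* 2047, the 64-bit sparc stack bias */

/* Turn a biased %sp/%fp value into the address of the frame data. */
uint64_t
unbias(uint64_t biased_ptr)
{
    return (biased_ptr + V9_STACK_BIAS);
}

/*
 * Heuristic only: a biased 64-bit stack pointer is odd, because the
 * real frame address is aligned and the bias itself is odd.
 */
int
looks_biased(uint64_t ptr)
{
    return ((ptr & 1) != 0);
}
```

For the nfs4_make_dotdot frame pointer 0x2a101220531, unbias() gives 0x2a101220d30, which is the address mdb ends up reading when 0x7ff is added by hand.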

Applying that to the frame pointer for nfs4_make_dotdot and dumping out
the frame, we have -

> 000002a101220531+0x7ff::print struct frame
{
fr_local = [ 0, 0, 0x381a4c49000, 0x2a101221108, 0x2a101220ee0,
0x2a101221118, 0x7aabd800, 0x7aabd800 ]
fr_arg = [ 0x2a101220df8, 0x2388873b24c20, 0xfffffffffffffff8,
0x301412eb920, 0x2a101221238, 0x1 ]
fr_savfp = 0x2a101220941
fr_savpc = 0x7aa6b474
fr_argd = [ 0x1, 0x5bc679f3060, 0, 0x2a101221250, 0x200000000,
0x5bc679f3178 ]
fr_argx = [ 0 ]
}

%i2 here looks bogus, darn! Let's back up one function higher to
nfs4lookupnew_otw and see if we can fish dvp out of its frame easily.
A quick look at the source in uts/common/fs/nfs/nfs4_vnops.c and -

static int
nfs4lookupnew_otw(vnode_t *dvp, char *nm, vnode_t **vpp, cred_t *cr)

The same dvp we're looking for should be in %i0, provided it hasn't been
overwritten. Dump out the frame -

> 000002a101220941+0x7ff::print struct frame
{
fr_local = [ 0x391f52d4450, 0x381a4c49000, 0x2388873c342a0, 0x600074432a8,
0x2388873b24c20, 0, 0x1, 0x391f52d44e8 ]
fr_arg = [ 0x301ef0d4dc0, 0x2a101221530, 0x2a101221528, 0x301412eb920,
0xdf8475800, 0x38285c955c0 ]
fr_savfp = 0x2a101220a71
fr_savpc = 0x7aa69d40
fr_argd = [ 0x38f6901b110, 0x311fe247dc0, 0x2a101221704, 0x2a1012216fc,
0x2a101220c71, 0x7aa7ad20 ]
fr_argx = [ 0x2a101221264 ]
}
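
Moving from one frame to its caller is just a matter of following fr_savfp. The sketch below walks a tiny table of frames instead of real dump memory; struct dump_frame, lookup_frame and walk_up are hypothetical names, and in a real session mdb does the memory reads.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a frame that has been read out of a dump. */
struct dump_frame {
    uint64_t fp;        /* biased frame pointer naming this frame */
    uint64_t fr_arg0;   /* first saved argument */
    uint64_t fr_savfp;  /* biased frame pointer of the caller */
};

/* Find the frame named by `fp`, or NULL if it is not in the table. */
static const struct dump_frame *
lookup_frame(const struct dump_frame *frames, size_t n, uint64_t fp)
{
    for (size_t i = 0; i < n; i++) {
        if (frames[i].fp == fp)
            return (&frames[i]);
    }
    return (NULL);
}

/*
 * Start at frame `fp` and follow fr_savfp `hops` times. Returns the
 * frame reached, or NULL if the chain leaves the known frames.
 */
const struct dump_frame *
walk_up(const struct dump_frame *frames, size_t n, uint64_t fp,
    unsigned hops)
{
    const struct dump_frame *f = lookup_frame(frames, n, fp);

    while (f != NULL && hops-- > 0)
        f = lookup_frame(frames, n, f->fr_savfp);
    return (f);
}
```

Fed with the two frames dumped above, one hop up from 0x2a101220531 lands on the nfs4lookupnew_otw frame, whose first saved argument is 0x301ef0d4dc0.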

Quick check to see if %i0 has been overwritten -

> nfs4lookupnew_otw::dis!grep i0
[ .. elided ]

It's not overwritten, we're in luck! Double check to see if it's a vnode.

> 0x301ef0d4dc0::whatis
301ef0d4dc0 is 301ef0d4dc0+0, bufctl 301ecba50c8 allocated from vn_cache

It sure is a vnode. Dumping out the path is now easy -

> 301ef0d4dc0::print vnode_t v_data |::print rnode4_t r_svnode.sv_name
r_svnode.sv_name = 0x391d8a24090
> 0x391d8a24090::print nfs4_fname_t fn_parent fn_name
fn_parent = 0x3b6565ba920
fn_name = 0x3292727cbe0 "uts"
> 0x3b6565ba920::print nfs4_fname_t fn_parent fn_name
fn_parent = 0x353987b1550
fn_name = 0x4f42a102fe0 "src"
> 0x353987b1550::print nfs4_fname_t fn_parent fn_name
fn_parent = 0

We're operating on ./src/uts and this isn't handled correctly in the lookup
handling routine (it's fixed now).
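
The manual walk over fn_parent and fn_name above amounts to following a linked list from the leaf to the root and printing the names in reverse. Here is a minimal sketch of that idea; struct fname and fname_path are hypothetical stand-ins (not the real nfs4_fname_t), and the root's name is assumed to be "." since the dump above does not show it.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a name component with a parent link. */
struct fname {
    const struct fname *parent; /* NULL at the top of the chain */
    const char *name;
};

/* Append the components from the root down to `fn`; 0 on success. */
static int
build_path(const struct fname *fn, char *buf, size_t len, size_t *used)
{
    int n;

    if (fn == NULL)
        return (0);
    if (build_path(fn->parent, buf, len, used) != 0)
        return (-1);
    /* Every component except the root gets a leading slash. */
    n = snprintf(buf + *used, len - *used, "%s%s",
        fn->parent != NULL ? "/" : "", fn->name);
    if (n < 0 || (size_t)n >= len - *used)
        return (-1);            /* would not fit */
    *used += (size_t)n;
    return (0);
}

/* Write the full path of `leaf` into buf; returns 0, or -1 if too long. */
int
fname_path(const struct fname *leaf, char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0)
        return (-1);
    buf[0] = '\0';
    return (build_path(leaf, buf, len, &used));
}
```

With a three-entry chain named ".", "src" and "uts", fname_path() on the "uts" entry produces "./src/uts".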

As I mentioned earlier, local variables are stored at an offset from the
frame pointer. Now that we have the frame pointer, we can dig out the local
variables. The variable of interest in this case was the error structure
declared on the stack in nfs4_make_dotdot.

A close look at the disassembly of the function and we can see -

nfs4_make_dotdot+0x27c: mov %l2, %o0
nfs4_make_dotdot+0x280: mov 0xc, %o3
nfs4_make_dotdot+0x284: call +0x15c58 nfs4_end_fop
nfs4_make_dotdot+0x288: mov %l7, %o5
nfs4_make_dotdot+0x28c: ba -0xd0 nfs4_make_dotdot+0x1bc
nfs4_make_dotdot+0x290: cmp %i5, 0
nfs4_make_dotdot+0x294: add %fp, 0x797, %o4
nfs4_make_dotdot+0x298: mov %l2, %o0
nfs4_make_dotdot+0x29c: mov 0xc, %o3
nfs4_make_dotdot+0x2a0: mov %l1, %i1
nfs4_make_dotdot+0x2a4: call +0x15c38 nfs4_end_fop
nfs4_make_dotdot+0x2a8: mov %l1, %o5
nfs4_make_dotdot+0x2ac: ld [%fp + 0x7bb], %i3 <------
nfs4_make_dotdot+0x2b0: cmp %i3, 0

that it's stored at %fp + 0x7bb (%fp here is the fr_savfp in nfs4_make_dotdot's
frame). Dump it out -

> 0x2a101220941+0x7bb::print nfs4_error_t
{
error = 0
stat = 0t10006 (NFS4ERR_SERVERFAULT)
rpc_status = 0 (RPC_SUCCESS)
}

This reveals a secondary problem in the code which is that there are no
checks for errors like NFS4ERR_SERVERFAULT (again, now fixed).
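
The address used to dump the error structure can be double-checked with a little arithmetic. Because the frame pointer is biased by 0x7ff, an offset of 0x7bb in the disassembly really means 0x44 bytes below the true frame address. The helper names below are hypothetical and exist only to spell out the calculation.

```c
#include <stdint.h>

#define V9_STACK_BIAS 0x7ff     /* 64-bit sparc stack bias */

/* Address of a local, given the biased %fp and the offset in the code. */
uint64_t
local_addr(uint64_t biased_fp, int64_t disasm_off)
{
    return (biased_fp + (uint64_t)disasm_off);
}

/* The same offset expressed relative to the real (unbiased) frame. */
int64_t
true_offset(int64_t disasm_off)
{
    return (disasm_off - V9_STACK_BIAS);
}
```

For the saved frame pointer 0x2a101220941 and offset 0x7bb this gives 0x2a1012210fc, which also happens to be the last argument shown for vpanic in the stack trace.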




( Jun 14 2005, 12:53:42 PM EDT / Jun 14 2005, 11:50:01 AM EDT )
Trackback: http://blogs.sun.com/aalok/entry/debugging_on_sparc

