« September 2008 »
SunMonTueWedThuFriSat
 
1
2
3
4
5
6
7
8
9
10
11
13
14
15
18
24
    
       
Today
XML

Neat blogs

Navigation

Editing

Powered by Roller Weblogger.

statcounter.com

clustrmaps.com

Locations of visitors to this page

technorati.com

20080912 Friday September 12, 2008
Decoding data transfer across nfssys() syscall

I've got this new code to XDR encode in userland and XDR decode in the kernel. I can run the kernel and debug issues live, but that is bulky. So, instead, I'm going to write some tools to help me. First off, I already have routines for printing a policy, so I've already got the hard part out of the way.

Or do I?

You can't just call printf() in the kernel. And while you can use syslog() or cmn_err(), they have the nasty habit of inserting a line feed after every call. So this

        printf("Hello ");
        printf("Nurse!\n");

Would yield

Hello
Nurse!

On a strict translation. So, the easiest way to proceed is to collect it all in a buffer:

void
spe_print_policy_list(spe_policy_t *policies)
{
        spe_policy_t    *sp;
        char            *buf;
        int             offset;

        buf = kmem_zalloc(MAX_PRINT_BUF + 1, KM_SLEEP);
        offset = 0;
        for (sp = policies; sp; sp = sp->next) {
                spe_print_policy(sp, buf, &offset);
                cmn_err(CE_WARN, "%s", buf);
                buf[0] ='\0';
                offset = 0;
        }
        free(buf, MAX_PRINT_BUF + 1);
}

Note that this is debug code and probably will be thrown away. So, I still make sure to do garbage cleanup correctly, but I don't care about whether it should be CE_WARN or CE_NOTE.

Why am I passing an offset?

Well, we have the following functions, most of which emit information:

spe_bitcount(unsigned int n)
spe_print_address(uint_t addr, char *buf)
spe_print_netmask(uint_t mask, char *buf)
spe_print_leaf(spe_leaf_t *sl, char *buf)
spe_print_thunk(spe_thunk_t st, char *buf)
spe_print_attribute(spe_interior_t *si, char *buf)
spe_print_policy(spe_policy_t *sp, char *buf)
spe_print_policy_list(spe_policy_t *policies)

If I pass in just the buffer, I could advance it every time I wrote:

                len = sprintf(buf, "%c%u", sep, rev);
                spe_print_fool(fool, &buf[len + 1]);

Note, I wrote that second line off the top of my head, don't trust it!

But that doesn't help you during a tree traversal. You print out the left side, come back out and you don't know much (or if at all) spe_print_fool changed the offset. So, instead I also pass in the offset to be used:

void
spe_print_netmask(uint_t mask, char *buf, int *poffset)
{
        int     t;
        int     len;

        t = spe_bitcount(mask);
        len = sprintf(&buf[*poffset], "/%u", t);
        *poffset = len + 1;
}

This way, the caller of spe_print_netmask will know where to write.

Now, this may be embarrassing, but we really need to debug this debug code before we add it into the kernel. I.e., we want to make sure we are focused on debugging our real code and not the support code. Luckily, we can write some stand-alone code and test it out there.

Since I know the print code used to dump out a policy correctly, I know know that this hunch was correct:

[th199096@aus1500-home ~/tmp]> ./a.out 
10, 200
20, 1
30
40,/

And here was the input:

10, 16, 64k, 35865356092523570, uid == 200096
20, 32, 2k, 35865356092523570, uid == 1066
30, 64, 1k, 35865356092523570, uid == 0
40, 2, 2k, 35865356092523570, subnet == 10.1.233.0/24

It is easy to see in the debugger that I reset poffset instead of incrementing it:

96                                      len = sprintf(&buf[*poffset], "%s",
8: *poffset = 28
7: len = 0
(gdb) 
98                                      *poffset = len + 1;
8: *poffset = 28
7: len = 3
(gdb) 
94                      for (i = 0; i < Attribute_count; i++) {
8: *poffset = 4
7: len = 3

That should be *poffset += len + 1.

Better, but not a go:

[th199096@aus1500-home ~/tmp]> ./a.out
10, 16, 64000, 4054008882l 
20, 32, 2000, 4054008882l 
30, 64, 1000, 4054008882l 
40, 2, 2000, 4054008882l 

The offending line starts with:

        len = sprintf(&buf[*poffset], "%u, %u, %u, %ul ", sp->sp_id, sp->sp_stripe_count,
            sp->sp_interlace, sp->sp_guuids[0]);

I suspect that the %ul is coming out as %ul. I.e., print an unsigned followed by an 'l'. We see that the data is correct in memory:

(gdb) p *sp
$2 = {sp_id = 10, sp_stripe_count = 16, sp_interlace = 64000,
    sp_attr_expr = 0x80687e0, sp_name = 0x0, sp_guuids = {35865356092523570, 0, 
    0, 0, 0, 0, 0, 0}, next = 0x80699f0}

Crud, that is a bad formatting and I've already fixed it once! Looks like I lost some code. :-< (see here for prior head banging Putting that throw away code to good use). And I'll bet the remaining problem has something to do with code changes I made due to the move to xdr!

And I would be wrong. Things look okay, but no change:

96                                      len = sprintf(&buf[*poffset], "%s",
5: spe_attribute_table[i].at_attr = SPE_ATTR_UID
4: buf = 0x8066748 "10, 16, 64000, 35865356092523570 "
3: sl->sl_attr = SPE_ATTR_UID
1: i = 13

I now suspect I am off by 1!


(gdb) p &buf[33]
$5 = 0x8066769 ""
(gdb) p &buf[34]
$6 = 0x806676a "uid"
(gdb) p spe_attribute_table[i].at_name
$7 = 0x8054ded "uid"

Yes, and now Thunderbirds are go!

[th199096@aus1500-home ~/tmp]> ./a.out
10, 16, 64000, 35865356092523570 uid == 200096
20, 32, 2000, 35865356092523570 uid == 1066
30, 64, 1000, 35865356092523570 uid == 0
35, 2, 2000, 35865356092523570 subnet ==  10.1.233.0/24 && gid == 55
40, 2, 2000, 35865356092523570 subnet ==  10.1.233.0/24

Note I added a new test case to handle a more complex rule!

Lessons Learned

  1. Test when you write code.
  2. Test your test code.
  3. Add new test cases, even when you think things are okay.
  4. Build up from simple tested code to more complex tested code.
    1. Assume your prior code works, but don't be afraid to question that when you stress the code.
    2. Building block approach to debugging
  5. If your environment is too complex for easy testing, write a stand-alone test suite.
  6. Don't be afraid to throw away your test code!
  7. Don't be afraid to reuse it.
    1. Which also means write it as well as you can.
    2. You might end up putting it in production and you don't want an assumption to become a customer bug.

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily
Setting up a pnfs community

Over in Latest - pNFS Admin Documentation, Lisa Week has a wiki entry which tells you how to configure a pNFS community for testing our bits. I do things a little different when I'm testing bits to put out on OpenSolaris.

The primary difference is that I need a NFSv4 share and I also need to configure the exports to support the NFS cthon tests. I.e., I need root access.

Configuring the mds

[root@pnfs-9-10 ~]> zpool create -f data /dev/dsk/c1t0d0s7
[root@pnfs-9-10 ~]> zfs create data/nfs4
[root@pnfs-9-10 ~]> zfs set sharenfs=rw,anon=0 data/nfs4
[root@pnfs-9-10 ~]> mdsadm -o add -t auth -a ip=10.1.233.50
adding: IP Addr - 10.1.233.50

This has the hidden effect of making sure the NFS server module is loaded in memory. The act of sharing will do that. I also have anon=0 which effectively allows root from any machine.

Configuring the ds

[root@pnfs-9-08 ~]> zpool create -f data /dev/dsk/c2t0d0s7
[root@pnfs-9-08 ~]> zfs create data/nfs4
[root@pnfs-9-08 ~]> zfs set sharenfs=rw,anon=0 data/nfs4
[root@pnfs-9-08 ~]> zfs create -t pnfsdata data/pnfs
[root@pnfs-9-08 ~]> dservadm addstor data/pnfs
[root@pnfs-9-08 ~]> dservadm addmds 10.1.233.52.8.1
[root@pnfs-9-08 ~]> dservadm enable

Again, while I appear to skip some steps from the Latest - pNFS Admin Documentation, they are being done implicitly. The zfs set sharenfs takes place of the share commands in the wiki. But please follow the wiki if you are unsure about what to do.

So I've set up the DS and also enabled it such that I can test NFSv4 against it as well.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily
Sun's rpcgen(1) is now recursive

At a previous employer, I had worked with someone who was incredulous that Sun's implementation rpcgen(1) did not eliminate tail recursion. After all, he had worked there and had that code change sitting in a workspace when he left.

So at the time, I manually fixed a XDR call that was blowing the stack.

Since then, I've occasionally had to relearn XDR and rpcgen. And I've always assumed that I had to manually do the tail recursion code. The other day I was in a hurry and I decide to generate output from both Linux and Solaris to see the differences. I thought that surely the Linux implementation would be optimal.

We can see there is a simple linked-list which would be NULL terminated:

#define SPED_MAX_GUUIDS         8
typedef struct spe_policy {
        uint_t                  sp_id;
        uint_t                  sp_stripe_count;
        uint_t                  sp_interlace;
        spe_interior_t          *sp_attr_expr;
        char                    *sp_name;
        uint64_t                sp_guuids[SPED_MAX_GUUIDS];
        struct spe_policy       *next;
} spe_policy_t;

I was looking at the diff and I was struck by how simple and elegant the few lines of Linux generated C code were:

bool_t
xdr_spe_policy (XDR *xdrs, spe_policy *objp)
{
	register int32_t *buf;

	int i;
	 if (!xdr_uint32_t (xdrs, &objp->sp_id))
		 return FALSE;
	 if (!xdr_count4 (xdrs, &objp->sp_stripe_count))
		 return FALSE;
	 if (!xdr_uint32_t (xdrs, &objp->sp_interlace))
		 return FALSE;
	 if (!xdr_pointer (xdrs, (char **)&objp->sp_attr_expr, sizeof (spe_interior), (xdrproc_t) xdr_spe_interior))
		 return FALSE;
	 if (!xdr_pointer (xdrs, (char **)&objp->sp_name, sizeof (char), (xdrproc_t) xdr_char))
		 return FALSE;
	 if (!xdr_vector (xdrs, (char *)objp->sp_guuids, SPED_MAX_GUUIDS,
		sizeof (uint64_t), (xdrproc_t) xdr_uint64_t))
		 return FALSE;
	 if (!xdr_pointer (xdrs, (char **)&objp->next, sizeof (spe_policy), (xdrproc_t) xdr_spe_policy))
		 return FALSE;
	return TRUE;
}

And I was looking at the Solaris generated output and was worried about the apparent complexity:

bool_t
xdr_spe_policy(XDR *xdrs, spe_policy_t *objp)
{
        rpc_inline_t *buf;

        spe_policy_t *tmp_spe_policy;
        bool_t more_data = TRUE;
        bool_t first_objp = TRUE;


        if (xdrs->x_op == XDR_DECODE) {
                while (more_data) {
                        if (!xdr_uint32_t(xdrs, &objp->sp_id))
                                return (FALSE);
...
                        if (!xdr_bool(xdrs, &more_data))
                                return (FALSE);

                        if (!more_data) {
                                objp->next = NULL;
                                break;
                        }

                        if (objp->next == NULL) {
                                objp->next = (spe_policy_t *)
                                        mem_alloc(sizeof (spe_policy_t));
                                if (objp->next == NULL)
                                        return (FALSE);
                                bzero(objp->next, sizeof (spe_policy_t));
                        }
                        objp = objp->next;
                }
        } else if (xdrs->x_op == XDR_ENCODE) {
                while (more_data) {
                        if (!xdr_uint32_t(xdrs, &objp->sp_id))
                                return (FALSE);
...
                        if (!xdr_vector(xdrs,
                            (char *)objp->sp_guuids,
                            SPED_MAX_GUUIDS,
                            sizeof (uint64_t),
                            (xdrproc_t)xdr_uint64_t))
                                return (FALSE);
                        objp = objp->next;
                        if (objp == NULL)
                                more_data = FALSE;
                        if (!xdr_bool(xdrs, &more_data))
                                return (FALSE);
                }
        } else {
                while (more_data) {
                        if (!xdr_uint32_t(xdrs, &objp->sp_id))
                                return (FALSE);
...
                        if (!xdr_vector(xdrs,
                            (char *)objp->sp_guuids,
                            SPED_MAX_GUUIDS,
                            sizeof (uint64_t),
                            (xdrproc_t)xdr_uint64_t))
                                return (FALSE);
                        tmp_spe_policy = objp;
                        objp = objp->next;
                        if (objp == NULL)
                                more_data = FALSE;
                        if (!first_objp)
                                mem_free(tmp_spe_policy_t,
                                    sizeof (spe_policy_t));
                        else
                                first_objp = FALSE;
                }

        }
        return (TRUE);
}

I didn't look at the actual code, I just assumed the worst. So I went searching for a version of rpcgen that would work for me. I also was kinda looking for the bug id of the code my friend had sitting in a private workspace.

The first hit in google on the keywords "rpcgen tail recursion" was Bug ID: 4023944 rpcgen generates linked list routines that blow stacks. Hmm, that is on opensolaris.org, which is good.

When I opened it, I was struck by several facts:

  1. State     10-Fix Delivered (Fix available in build)
  2. Fixed In     snv_14
  3. Responsible Engineer     Bill Ricker
    Hey, I know this guy! I met him at Connectathon several times.
  4. One of the problems leading to this fix was a blown stack in mountd.

I was a victim of low expectations. I assumed that the Solaris rpcgen(1) still did not handle tail recursion (it hadn't been a priority to putback when my friend had a working solution), so I didn't inspect the generated code to see that it did do so. I assumed that fewer lines of code meant elegant and more functional. Instead it meant a blown stack at some point.

So thanks Bill for getting that change made and protecting stacks everywhere!


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily
User space to kernel via xdr

I've been away from the spe code for some time doing bug work, gatekeeping, and web design. I'm back to it and trying to get it ready to integrate into our development gate. I've decided to move the spe decision making into the kernel and as such I'll need to pass the policies from user space to the kernel.

Traditionally, you have to define 64 and 32 bit versions of your data structures (user land is typically 32 bit address space). Your application packs in the data and sends it to the kernel via syscall. And then you have to copy your data from user space to the kernel.

The STRUCT_* macros in the following code chunk show the steps you take:

  1. You declare a local variable, u_sped in this case, which allows two views of the memory (see STRUCT_DECL(9F))
  2. You determine whether the user land address space is 32 or 64 bit
  3. You configure your local variable to use the current model
  4. You copy the memory from user land (sped_in in this case) to the kernel address space
  5. You get a local copy out of there.
  6. You get a local copy out of there.
  7. This one is different, it is a variable length "string"
void
nfs4_sped_svc(struct nfssped_args *sped_in)
{
        spe_policy_t    *sp = NULL;
        nfssped_op_t    opcode;
        char            *buf = NULL;
        size_t          len;
        size_t          tlen = 0;

        XDR             xdrs;

        model_t         model;

        STRUCT_DECL(nfssped_args, u_sped);  /* 1 */

        model = get_udatamodel();  /* 2 */

        /*
         * Initialize the data pointers.
         */
        STRUCT_INIT(u_sped, model); /* 3 */
        if (copyin(sped_in, STRUCT_BUF(u_sped), STRUCT_SIZE(u_sped))) {   /* 4 */
                set_errno(EFAULT);
                return;
        }

        opcode = STRUCT_FGET(u_sped, nsa_opcode);  /* 5 */
        len = STRUCT_FGET(u_sped, nsa_xdr_len);    /* 6 */

        if (len) {
                buf = kmem_zalloc(len, KM_SLEEP);

                if (copyinstr(STRUCT_FGETP(u_sped, nsa_xdr),         /* 7 */
                    buf, len, &tlen)) {
                        goto err_out;
                }
        }

One problem is that you need to know how big the buffer is to extract from it. If you have a fixed number of elements, i.e., 1, this is an easy problem. But I want to send an arbitrary number of elements. I could cycle through a linked list, sending one at a time, but that sounds gross.

But XDR was created to handle differences in network byte order, machine address ranges, variable number of data sets, etc.

So we take a pass in userland to encode the data into XDR, which also tells us how large the resulting data will be. And then we can unpack it in the kernel.

The largest issue I faced was that you typically start out with XDR notation and use rpcgen to produce headers and code. I already had the data structures, so I took my header and turned it into a .x file. And I had a hard time getting a union in there:

typedef union {
        uid_t           uid;
        gid_t           gid;
        int             i;
        char            *sz;
        spe_network_t   net;
} spe_data_t;

I would have twigged onto what was needed if I had started earlier - discriminated unions. So now I need to provide a way to tell the XDR code how to determine which field was to be used. Luckily, I knew that:

typedef enum {
        SPE_DATA_ADDR,
        SPE_DATA_GID,
        SPE_DATA_INT,
        SPE_DATA_NETNAME,
        SPE_DATA_NETWORK,
        SPE_DATA_STRING,
        SPE_DATA_UID
} spe_type_t;

I rearranged the .x file to get:

enum spe_type {
        SPE_DATA_ADDR,
        SPE_DATA_GID,
        SPE_DATA_INT,
        SPE_DATA_NETNAME,
        SPE_DATA_NETWORK,
        SPE_DATA_STRING,
        SPE_DATA_UID
};

union spe_data switch (spe_type data_type) {
        case SPE_DATA_UID:
                uid_t           uid;
        case SPE_DATA_GID:
                gid_t           gid;
        case SPE_DATA_INT:
                int             i;
        case SPE_DATA_NETNAME:
        case SPE_DATA_STRING:
                char            *sz;
        case SPE_DATA_ADDR:
        case SPE_DATA_NETWORK:
                struct spe_network   net;
};

But that produced the following for the header file:

struct spe_data {
        spe_type data_type;
        union {
                uid_t uid;
                gid_t gid;
                int i;
                char *sz;
                struct spe_network net;
        } spe_data_u;
};
typedef struct spe_data spe_data;

I wanted a nice clean separation, so I fought this for a while. But it is logical. In the end I went with:

typedef struct {
        spe_type_t      sd_type;
        union {
                uid_t           uid;
                gid_t           gid;
                int             i;
                char            *sz;
                spe_network_t   net;
        } sd_u;
} spe_data_t;

It fits my naming style and I also avoided doing the following:

#define uid sd_u.uid
#define gid sd_u.gid

Of course, with that, I should pick more unique field names. And what I hate about this method is that if you are sitting there with one window open to code and another on a debugger, you can't examine foo.uid. You have to dig through the headers to find that you really need to be looking at foo.sd_u.uid. I find it easier to avoid the macros.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily