I've got this new code to XDR encode in userland and XDR decode in the kernel. I can run the kernel and debug issues live, but that is bulky. So, instead, I'm going to write some tools to help me. First off, I already have routines for printing a policy, so I've already got the hard part out of the way.
Or do I?
You can't just call printf() in the kernel. And while you can use syslog() or cmn_err(), they have the nasty habit of inserting a line feed after every call. So this
printf("Hello ");
printf("Nurse!\n");
Would yield
Hello Nurse!
On a strict translation. So, the easiest way to proceed is to collect it all in a buffer:
void
spe_print_policy_list(spe_policy_t *policies)
{
spe_policy_t *sp;
char *buf;
int offset;
buf = kmem_zalloc(MAX_PRINT_BUF + 1, KM_SLEEP);
offset = 0;
for (sp = policies; sp; sp = sp->next) {
spe_print_policy(sp, buf, &offset);
cmn_err(CE_WARN, "%s", buf);
buf[0] ='\0';
offset = 0;
}
free(buf, MAX_PRINT_BUF + 1);
}
Note that this is debug code and probably will be thrown away. So, I still make sure to do garbage cleanup correctly, but I don't care about whether it should be CE_WARN or CE_NOTE.
Why am I passing an offset?
Well, we have the following functions, most of which emit information:
spe_bitcount(unsigned int n) spe_print_address(uint_t addr, char *buf) spe_print_netmask(uint_t mask, char *buf) spe_print_leaf(spe_leaf_t *sl, char *buf) spe_print_thunk(spe_thunk_t st, char *buf) spe_print_attribute(spe_interior_t *si, char *buf) spe_print_policy(spe_policy_t *sp, char *buf) spe_print_policy_list(spe_policy_t *policies)
If I pass in just the buffer, I could advance it every time I wrote:
len = sprintf(buf, "%c%u", sep, rev);
spe_print_fool(fool, &buf[len + 1]);
Note, I wrote that second line off the top of my head, don't trust it!
But that doesn't help you during a tree traversal. You print out the left side, come back out and you don't know much (or if at all) spe_print_fool changed the offset. So, instead I also pass in the offset to be used:
void
spe_print_netmask(uint_t mask, char *buf, int *poffset)
{
int t;
int len;
t = spe_bitcount(mask);
len = sprintf(&buf[*poffset], "/%u", t);
*poffset = len + 1;
}
This way, the caller of spe_print_netmask will know where to write.
Now, this may be embarrassing, but we really need to debug this debug code before we add it into the kernel. I.e., we want to make sure we are focused on debugging our real code and not the support code. Luckily, we can write some stand-alone code and test it out there.
Since I know the print code used to dump out a policy correctly, I know know that this hunch was correct:
[th199096@aus1500-home ~/tmp]> ./a.out 10, 200 20, 1 30 40,/
And here was the input:
10, 16, 64k, 35865356092523570, uid == 200096 20, 32, 2k, 35865356092523570, uid == 1066 30, 64, 1k, 35865356092523570, uid == 0 40, 2, 2k, 35865356092523570, subnet == 10.1.233.0/24
It is easy to see in the debugger that I reset poffset instead of incrementing it:
96 len = sprintf(&buf[*poffset], "%s",
8: *poffset = 28
7: len = 0
(gdb)
98 *poffset = len + 1;
8: *poffset = 28
7: len = 3
(gdb)
94 for (i = 0; i < Attribute_count; i++) {
8: *poffset = 4
7: len = 3
That should be *poffset += len + 1.
Better, but not a go:
[th199096@aus1500-home ~/tmp]> ./a.out 10, 16, 64000, 4054008882l 20, 32, 2000, 4054008882l 30, 64, 1000, 4054008882l 40, 2, 2000, 4054008882l
The offending line starts with:
len = sprintf(&buf[*poffset], "%u, %u, %u, %ul ", sp->sp_id, sp->sp_stripe_count,
sp->sp_interlace, sp->sp_guuids[0]);
I suspect that the %ul is coming out as %ul. I.e., print an unsigned followed by an 'l'. We see that the data is correct in memory:
(gdb) p *sp
$2 = {sp_id = 10, sp_stripe_count = 16, sp_interlace = 64000,
sp_attr_expr = 0x80687e0, sp_name = 0x0, sp_guuids = {35865356092523570, 0,
0, 0, 0, 0, 0, 0}, next = 0x80699f0}
Crud, that is a bad formatting and I've already fixed it once! Looks like I lost some code. :-< (see here for prior head banging Putting that throw away code to good use). And I'll bet the remaining problem has something to do with code changes I made due to the move to xdr!
And I would be wrong. Things look okay, but no change:
96 len = sprintf(&buf[*poffset], "%s", 5: spe_attribute_table[i].at_attr = SPE_ATTR_UID 4: buf = 0x8066748 "10, 16, 64000, 35865356092523570 " 3: sl->sl_attr = SPE_ATTR_UID 1: i = 13
I now suspect I am off by 1!
(gdb) p &buf[33] $5 = 0x8066769 "" (gdb) p &buf[34] $6 = 0x806676a "uid" (gdb) p spe_attribute_table[i].at_name $7 = 0x8054ded "uid"
Yes, and now Thunderbirds are go!
[th199096@aus1500-home ~/tmp]> ./a.out 10, 16, 64000, 35865356092523570 uid == 200096 20, 32, 2000, 35865356092523570 uid == 1066 30, 64, 1000, 35865356092523570 uid == 0 35, 2, 2000, 35865356092523570 subnet == 10.1.233.0/24 && gid == 55 40, 2, 2000, 35865356092523570 subnet == 10.1.233.0/24
Note I added a new test case to handle a more complex rule!
Over in Latest - pNFS Admin Documentation, Lisa Week has a wiki entry which tells you how to configure a pNFS community for testing our bits. I do things a little different when I'm testing bits to put out on OpenSolaris.
The primary difference is that I need a NFSv4 share and I also need to configure the exports to support the NFS cthon tests. I.e., I need root access.
[root@pnfs-9-10 ~]> zpool create -f data /dev/dsk/c1t0d0s7 [root@pnfs-9-10 ~]> zfs create data/nfs4 [root@pnfs-9-10 ~]> zfs set sharenfs=rw,anon=0 data/nfs4 [root@pnfs-9-10 ~]> mdsadm -o add -t auth -a ip=10.1.233.50 adding: IP Addr - 10.1.233.50
This has the hidden effect of making sure the NFS server module is loaded in memory. The act of sharing will do that. I also have anon=0 which effectively allows root from any machine.
[root@pnfs-9-08 ~]> zpool create -f data /dev/dsk/c2t0d0s7 [root@pnfs-9-08 ~]> zfs create data/nfs4 [root@pnfs-9-08 ~]> zfs set sharenfs=rw,anon=0 data/nfs4 [root@pnfs-9-08 ~]> zfs create -t pnfsdata data/pnfs [root@pnfs-9-08 ~]> dservadm addstor data/pnfs [root@pnfs-9-08 ~]> dservadm addmds 10.1.233.52.8.1 [root@pnfs-9-08 ~]> dservadm enable
Again, while I appear to skip some steps from the Latest - pNFS Admin Documentation, they are being done implicitly. The zfs set sharenfs takes place of the share commands in the wiki. But please follow the wiki if you are unsure about what to do.
So I've set up the DS and also enabled it such that I can test NFSv4 against it as well.
At a previous employer, I had worked with someone who was incredulous that Sun's implementation rpcgen(1) did not eliminate tail recursion. After all, he had worked there and had that code change sitting in a workspace when he left.
So at the time, I manually fixed a XDR call that was blowing the stack.
Since then, I've occasionally had to relearn XDR and rpcgen. And I've always assumed that I had to manually do the tail recursion code. The other day I was in a hurry and I decide to generate output from both Linux and Solaris to see the differences. I thought that surely the Linux implementation would be optimal.
We can see there is a simple linked-list which would be NULL terminated:
#define SPED_MAX_GUUIDS 8
typedef struct spe_policy {
uint_t sp_id;
uint_t sp_stripe_count;
uint_t sp_interlace;
spe_interior_t *sp_attr_expr;
char *sp_name;
uint64_t sp_guuids[SPED_MAX_GUUIDS];
struct spe_policy *next;
} spe_policy_t;
I was looking at the diff and I was struck by how simple and elegant the few lines of Linux generated C code were:
bool_t
xdr_spe_policy (XDR *xdrs, spe_policy *objp)
{
register int32_t *buf;
int i;
if (!xdr_uint32_t (xdrs, &objp->sp_id))
return FALSE;
if (!xdr_count4 (xdrs, &objp->sp_stripe_count))
return FALSE;
if (!xdr_uint32_t (xdrs, &objp->sp_interlace))
return FALSE;
if (!xdr_pointer (xdrs, (char **)&objp->sp_attr_expr, sizeof (spe_interior), (xdrproc_t) xdr_spe_interior))
return FALSE;
if (!xdr_pointer (xdrs, (char **)&objp->sp_name, sizeof (char), (xdrproc_t) xdr_char))
return FALSE;
if (!xdr_vector (xdrs, (char *)objp->sp_guuids, SPED_MAX_GUUIDS,
sizeof (uint64_t), (xdrproc_t) xdr_uint64_t))
return FALSE;
if (!xdr_pointer (xdrs, (char **)&objp->next, sizeof (spe_policy), (xdrproc_t) xdr_spe_policy))
return FALSE;
return TRUE;
}
And I was looking at the Solaris generated output and was worried about the apparent complexity:
bool_t
xdr_spe_policy(XDR *xdrs, spe_policy_t *objp)
{
rpc_inline_t *buf;
spe_policy_t *tmp_spe_policy;
bool_t more_data = TRUE;
bool_t first_objp = TRUE;
if (xdrs->x_op == XDR_DECODE) {
while (more_data) {
if (!xdr_uint32_t(xdrs, &objp->sp_id))
return (FALSE);
...
if (!xdr_bool(xdrs, &more_data))
return (FALSE);
if (!more_data) {
objp->next = NULL;
break;
}
if (objp->next == NULL) {
objp->next = (spe_policy_t *)
mem_alloc(sizeof (spe_policy_t));
if (objp->next == NULL)
return (FALSE);
bzero(objp->next, sizeof (spe_policy_t));
}
objp = objp->next;
}
} else if (xdrs->x_op == XDR_ENCODE) {
while (more_data) {
if (!xdr_uint32_t(xdrs, &objp->sp_id))
return (FALSE);
...
if (!xdr_vector(xdrs,
(char *)objp->sp_guuids,
SPED_MAX_GUUIDS,
sizeof (uint64_t),
(xdrproc_t)xdr_uint64_t))
return (FALSE);
objp = objp->next;
if (objp == NULL)
more_data = FALSE;
if (!xdr_bool(xdrs, &more_data))
return (FALSE);
}
} else {
while (more_data) {
if (!xdr_uint32_t(xdrs, &objp->sp_id))
return (FALSE);
...
if (!xdr_vector(xdrs,
(char *)objp->sp_guuids,
SPED_MAX_GUUIDS,
sizeof (uint64_t),
(xdrproc_t)xdr_uint64_t))
return (FALSE);
tmp_spe_policy = objp;
objp = objp->next;
if (objp == NULL)
more_data = FALSE;
if (!first_objp)
mem_free(tmp_spe_policy_t,
sizeof (spe_policy_t));
else
first_objp = FALSE;
}
}
return (TRUE);
}
I didn't look at the actual code, I just assumed the worst. So I went searching for a version of rpcgen that would work for me. I also was kinda looking for the bug id of the code my friend had sitting in a private workspace.
The first hit in google on the keywords "rpcgen tail recursion" was Bug ID: 4023944 rpcgen generates linked list routines that blow stacks. Hmm, that is on opensolaris.org, which is good.
When I opened it, I was struck by several facts:
I was a victim of low expectations. I assumed that the Solaris rpcgen(1) still did not handle tail recursion (it hadn't been a priority to putback when my friend had a working solution), so I didn't inspect the generated code to see that it did do so. I assumed that fewer lines of code meant elegant and more functional. Instead it meant a blown stack at some point.
So thanks Bill for getting that change made and protecting stacks everywhere!
I've been away from the spe code for some time doing bug work, gatekeeping, and web design. I'm back to it and trying to get it ready to integrate into our development gate. I've decided to move the spe decision making into the kernel and as such I'll need to pass the policies from user space to the kernel.
Traditionally, you have to define 64 and 32 bit versions of your data structures (user land is typically 32 bit address space). Your application packs in the data and sends it to the kernel via syscall. And then you have to copy your data from user space to the kernel.
The STRUCT_* macros in the following code chunk show the steps you take:
void
nfs4_sped_svc(struct nfssped_args *sped_in)
{
spe_policy_t *sp = NULL;
nfssped_op_t opcode;
char *buf = NULL;
size_t len;
size_t tlen = 0;
XDR xdrs;
model_t model;
STRUCT_DECL(nfssped_args, u_sped); /* 1 */
model = get_udatamodel(); /* 2 */
/*
* Initialize the data pointers.
*/
STRUCT_INIT(u_sped, model); /* 3 */
if (copyin(sped_in, STRUCT_BUF(u_sped), STRUCT_SIZE(u_sped))) { /* 4 */
set_errno(EFAULT);
return;
}
opcode = STRUCT_FGET(u_sped, nsa_opcode); /* 5 */
len = STRUCT_FGET(u_sped, nsa_xdr_len); /* 6 */
if (len) {
buf = kmem_zalloc(len, KM_SLEEP);
if (copyinstr(STRUCT_FGETP(u_sped, nsa_xdr), /* 7 */
buf, len, &tlen)) {
goto err_out;
}
}
One problem is that you need to know how big the buffer is to extract from it. If you have a fixed number of elements, i.e., 1, this is an easy problem. But I want to send an arbitrary number of elements. I could cycle through a linked list, sending one at a time, but that sounds gross.
But XDR was created to handle differences in network byte order, machine address ranges, variable number of data sets, etc.
So we take a pass in userland to encode the data into XDR, which also tells us how large the resulting data will be. And then we can unpack it in the kernel.
The largest issue I faced was that you typically start out with XDR notation and use rpcgen to produce headers and code. I already had the data structures, so I took my header and turned it into a .x file. And I had a hard time getting a union in there:
typedef union {
uid_t uid;
gid_t gid;
int i;
char *sz;
spe_network_t net;
} spe_data_t;
I would have twigged onto what was needed if I had started earlier - discriminated unions. So now I need to provide a way to tell the XDR code how to determine which field was to be used. Luckily, I knew that:
typedef enum {
SPE_DATA_ADDR,
SPE_DATA_GID,
SPE_DATA_INT,
SPE_DATA_NETNAME,
SPE_DATA_NETWORK,
SPE_DATA_STRING,
SPE_DATA_UID
} spe_type_t;
I rearranged the .x file to get:
enum spe_type {
SPE_DATA_ADDR,
SPE_DATA_GID,
SPE_DATA_INT,
SPE_DATA_NETNAME,
SPE_DATA_NETWORK,
SPE_DATA_STRING,
SPE_DATA_UID
};
union spe_data switch (spe_type data_type) {
case SPE_DATA_UID:
uid_t uid;
case SPE_DATA_GID:
gid_t gid;
case SPE_DATA_INT:
int i;
case SPE_DATA_NETNAME:
case SPE_DATA_STRING:
char *sz;
case SPE_DATA_ADDR:
case SPE_DATA_NETWORK:
struct spe_network net;
};
But that produced the following for the header file:
struct spe_data {
spe_type data_type;
union {
uid_t uid;
gid_t gid;
int i;
char *sz;
struct spe_network net;
} spe_data_u;
};
typedef struct spe_data spe_data;
I wanted a nice clean separation, so I fought this for a while. But it is logical. In the end I went with:
typedef struct {
spe_type_t sd_type;
union {
uid_t uid;
gid_t gid;
int i;
char *sz;
spe_network_t net;
} sd_u;
} spe_data_t;
It fits my naming style and I also avoided doing the following:
#define uid sd_u.uid #define gid sd_u.gid
Of course, with that, I should pick more unique field names. And what I hate about this method is that if you are sitting there with one window open to code and another on a debugger, you can't examine foo.uid. You have to dig through the headers to find that you really need to be looking at foo.sd_u.uid. I find it easier to avoid the macros.