Just got back from my first Oklahoma City OpenSolaris User Group meeting. It was fun. Said gave a presentation on sizing provisioning for attaching storage to VMware's ESX. It provided a good overview on VMotion and SVMotion.
But of more interest to me was Said's interest in serving the customer. Not only did he try to shape the presentation to those in the audience, he wanted to learn how to convey information better in his blog. I hope I was able to help.
One thing that I learned about customer service and blogging from him was that he flat out told the audience, if you have a question, post a comment in one of my blog entries and I'll get back to you. I.e., he is flipping the push model of the author delivering in blogging to a pull model of the reader driving content. I was floored by this and came away wondering how to have a free floating request section such that comment fields stay true to the blog article and new content can be driven.
Also of interest was the audience member who basically asked when Sun's xVM was going to support NFSv4.1. We don't even have it close to shipping and already people want it in configurations we aren't thinking about!
I'm being a complete OpenStorage fanboy: I just saw the integration notice for 6760398 Moving NDMP to open source go by and, without even knowing exactly what it entailed, I was happy.
So happy that I downloaded the Mercurial source onto my Linux server to make sure I could already see the change:
[thud@adept onnv-gate]> hg history | more
changeset: 7917:5c4442486198
tag: tip
user: Reza Sabdar <Reza (dot) Sabdar (at) Sun (dot) COM>
date: Thu Oct 23 11:42:48 2008 -0700
summary: 6760398 Moving NDMP to open source
Another big step for OpenStorage:
Author: Reza Sabdar <Reza (dot) Sabdar (at) Sun (dot) COM>
Repository: /export/onnv-gate
Total changesets: 1
Changeset: 5c4442486198
Comments: 6760398 Moving NDMP to open source
Files:
added:
usr/src/cmd/ndmpadm/Makefile
usr/src/cmd/ndmpadm/ndmpadm.h
usr/src/cmd/ndmpadm/ndmpadm_main.c
usr/src/cmd/ndmpadm/ndmpadm_print.c
usr/src/cmd/ndmpd/LICENSE
usr/src/cmd/ndmpd/LICENSE.descrip
usr/src/cmd/ndmpd/Makefile
usr/src/cmd/ndmpd/include/bitmap.h
usr/src/cmd/ndmpd/include/cstack.h
usr/src/cmd/ndmpd/include/ndmpd_door.h
usr/src/cmd/ndmpd/include/ndmpd_prop.h
usr/src/cmd/ndmpd/include/tlm.h
usr/src/cmd/ndmpd/include/tlm_buffers.h
usr/src/cmd/ndmpd/include/traverse.h
usr/src/cmd/ndmpd/ndmp.xml
usr/src/cmd/ndmpd/ndmp/Makefile.rpcgen
usr/src/cmd/ndmpd/ndmp/ndmp.x
usr/src/cmd/ndmpd/ndmp/ndmpd.h
usr/src/cmd/ndmpd/ndmp/ndmpd_callbacks.c
usr/src/cmd/ndmpd/ndmp/ndmpd_chkpnt.c
usr/src/cmd/ndmpd/ndmp/ndmpd_comm.c
usr/src/cmd/ndmpd/ndmp/ndmpd_common.h
usr/src/cmd/ndmpd/ndmp/ndmpd_config.c
usr/src/cmd/ndmpd/ndmp/ndmpd_connect.c
usr/src/cmd/ndmpd/ndmp/ndmpd_data.c
usr/src/cmd/ndmpd/ndmp/ndmpd_door.c
usr/src/cmd/ndmpd/ndmp/ndmpd_dtime.c
usr/src/cmd/ndmpd/ndmp/ndmpd_fhistory.c
usr/src/cmd/ndmpd/ndmp/ndmpd_handler.c
usr/src/cmd/ndmpd/ndmp/ndmpd_log.c
usr/src/cmd/ndmpd/ndmp/ndmpd_log.h
usr/src/cmd/ndmpd/ndmp/ndmpd_main.c
usr/src/cmd/ndmpd/ndmp/ndmpd_mark.c
usr/src/cmd/ndmpd/ndmp/ndmpd_mover.c
usr/src/cmd/ndmpd/ndmp/ndmpd_prop.c
usr/src/cmd/ndmpd/ndmp/ndmpd_scsi.c
usr/src/cmd/ndmpd/ndmp/ndmpd_tape.c
usr/src/cmd/ndmpd/ndmp/ndmpd_tar.c
usr/src/cmd/ndmpd/ndmp/ndmpd_tar3.c
usr/src/cmd/ndmpd/ndmp/ndmpd_util.c
usr/src/cmd/ndmpd/svc-ndmp
usr/src/cmd/ndmpd/tlm/tlm_backup_reader.c
usr/src/cmd/ndmpd/tlm/tlm_bitmap.c
usr/src/cmd/ndmpd/tlm/tlm_buffers.c
usr/src/cmd/ndmpd/tlm/tlm_hardlink.c
usr/src/cmd/ndmpd/tlm/tlm_info.c
usr/src/cmd/ndmpd/tlm/tlm_init.c
usr/src/cmd/ndmpd/tlm/tlm_lib.c
usr/src/cmd/ndmpd/tlm/tlm_proto.h
usr/src/cmd/ndmpd/tlm/tlm_restore_writer.c
usr/src/cmd/ndmpd/tlm/tlm_traverse.c
usr/src/cmd/ndmpd/tlm/tlm_util.c
usr/src/cmd/ndmpstat/Makefile
usr/src/cmd/ndmpstat/ndmpstat_main.c
usr/src/lib/libndmp/Makefile
usr/src/lib/libndmp/Makefile.com
usr/src/lib/libndmp/amd64/Makefile
usr/src/lib/libndmp/common/libndmp.c
usr/src/lib/libndmp/common/libndmp.h
usr/src/lib/libndmp/common/libndmp_base64.c
usr/src/lib/libndmp/common/libndmp_door_data.c
usr/src/lib/libndmp/common/libndmp_error.c
usr/src/lib/libndmp/common/libndmp_prop.c
usr/src/lib/libndmp/common/llib-lndmp
usr/src/lib/libndmp/common/mapfile-vers
usr/src/lib/libndmp/i386/Makefile
usr/src/lib/libndmp/sparc/Makefile
usr/src/lib/libndmp/sparcv9/Makefile
modified:
usr/src/Makefile.lint
usr/src/cmd/Makefile
usr/src/lib/Makefile
usr/src/pkgdefs/SUNWndmpu/Makefile
usr/src/tools/opensolaris/license-list
I'm getting a signal when I use my snoop on a ds to mds packet trace:
[th199096@jhereg snoop]> ./snoop -V -i ~/ds2tmds.snoop > xxx
WARNING: received signal 11 from packet 4
And packet 4 is:
4 0.00596 pnfs-9-25 -> pnfs-9-26.Central.Sun.COM CTL-DS C DS_EXIBI
I've gone from handcoding the XDR to generating it automatically, and both had this error. Time to see what is going on. I generated the packet trace with '-x0,2000' so I get to see the output:
0: 001b 242d e629 001b 242d e641 0800 4500 ..$-.)..$-.A..E.
16: 00a8 d3dd 4000 4006 0000 0a01 e943 0a01 ....@.@......C..
32: e944 03fc 0801 1f9e 8f80 17cc 1d01 5018 .D............P.
48: c1e8 0000 0000 8000 007c d058 4d4c 0000 .........|.XML..
64: 0000 0000 0002 0001 9641 0000 0001 0000 .........A......
80: 0002 0000 0001 0000 0020 48f4 fa0b 0000 ......... H.....
96: 0009 706e 6673 2d39 2d32 3500 0000 0000 ..pnfs-9-25.....
112: 0000 0000 0000 0000 0000 0000 0000 0000 ................
128: 0000 ffff ff02 eef0 3b00 0000 0027 706e ........;....'pn
144: 6673 2d39 2d32 353a 2073 7663 3a2f 6e65 fs-9-25: svc:/ne
160: 7477 6f72 6b2f 6473 6572 763a 6465 6661 twork/dserv:defa
176: 756c 743a 0000 ult:..
I'm going to go back and forth in the code to look at this. First, I've tracked down where the FMRI is appearing:
case NFS4_SETPORT:
    uaddr = get_uaddr(nconf, addr);
    if (uaddr == NULL) {
        dserv_log(do_all_handle, LOG_INFO,
            gettext("NFS4_SETPORT: get_uaddr failed"));
        return (1);
    }
    (void) strlcpy(setportargs.dsa_uaddr, uaddr,
        sizeof (setportargs.dsa_uaddr));
    (void) strlcpy(setportargs.dsa_proto, nconf->nc_proto,
        sizeof (setportargs.dsa_proto));
    (void) strlcpy(setportargs.dsa_name, getenv("SMF_FMRI"),
        sizeof (setportargs.dsa_name));
This is in usr/src/cmd/dserv/dservd/tbind_sup.c and is a dservd call into the kernel. It ends up in usr/src/uts/common/dserv/dserv_mds.c: (minus some unpacking)
int
dserv_mds_addport(const char *uaddr, const char *proto, const char *aname)
{
    ...
    (void) sprintf(in, "%s: %s:", uts_nodename(), aname);
    inst->dmi_name = dserv_strdup(in);
    bzero(&res, sizeof (res));
    args.ds_ident.boot_verifier = inst->dmi_verifier;
    args.ds_ident.instance.instance_len = strlen(inst->dmi_name) + 1;
    args.ds_ident.instance.instance_val = inst->dmi_name;
The defaults are also set:
dserv_mds_instance_init(dserv_mds_instance_t *inst)
{
    inst->dmi_ds_id = 0;
    inst->dmi_mds_addr = NULL;
    inst->dmi_mds_netid = NULL;
    inst->dmi_verifier = (uintptr_t)curthread;
    inst->dmi_teardown_in_progress = B_FALSE;
}
So, if we knew the curthread, we could spot check to see that this went across okay. We also need to know if this has to be unique or not. If so, could we get a dup here?
So how does this data go across the wire? We need to look in the XDR (usr/src/head/rpcsvc/ds_prot.x):
struct identity {
    ds_verifier boot_verifier;
    opaque instance;
};
/*
 * DS_EXIBI - Exchange Identity and Boot Instance
 *
 * ds_ident : An identiifier that the MDS can use to distinguish
 * between data-server instances.
 */
struct DS_EXIBIargs {
    identity ds_ident;
};
So we see the boot_verifier followed by the instance. BTW: MAXPATHLEN might be too small here as we add the nodename.
And an opaque is a length and an array. Hmm, the hand-coded usr/src/cmd/cmd-inet/usr.sbin/snoop/nfs4_xdr.c calls xdr_opaque, while the machine generated code does:
bool_t
xdr_identity(XDR *xdrs, identity *objp)
{
    rpc_inline_t *buf;
    if (!xdr_ds_verifier(xdrs, &objp->boot_verifier))
        return (FALSE);
    if (!xdr_bytes(xdrs, (char **)&objp->instance.instance_val,
        (u_int *) &objp->instance.instance_len, MAXPATHLEN))
        return (FALSE);
    return (TRUE);
}
And that makes a difference:
[th199096@jhereg snoop]> ./snoop -v -i ~/ds2tmds.snoop > xxx
[th199096@jhereg snoop]>
But wait, we don't see the signal, but we do see:
CTL-DS: ----- Sun CTL-DS -----
CTL-DS:
CTL-DS: Proc = 2 (Exchange Identity and Boot Instance)
CTL-DS: ---- short frame ---
And a debug statement shows that the length looks off:
CTL-DS: ----- Sun CTL-DS -----
CTL-DS:
CTL-DS: Proc = 2 (Exchange Identity and Boot Instance)
CTL-DS: xdr_identity bombed, len = 0
CTL-DS: ---- short frame ---
Hmm, I manually set the length before the call to xdr_opaque. So back to the raw data. We know right before the nodename, we should find the length.
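One way to find that length without squinting at the dump is to walk the frame by hand. The following is a minimal sketch, not part of snoop: it assumes the hex column above is complete (182 bytes), that the IPv4 and TCP headers carry no options (the 0x45 and 0x50 bytes in the dump say so), and that ds_verifier is a 64-bit value on the wire, which is a guess rather than something read out of ds_prot.x. The function name decode_exibi is made up for illustration.

```python
import struct

# Hex column of the packet 4 dump quoted above (182 bytes in total).
HEX_DUMP = """
001b 242d e629 001b 242d e641 0800 4500
00a8 d3dd 4000 4006 0000 0a01 e943 0a01
e944 03fc 0801 1f9e 8f80 17cc 1d01 5018
c1e8 0000 0000 8000 007c d058 4d4c 0000
0000 0000 0002 0001 9641 0000 0001 0000
0002 0000 0001 0000 0020 48f4 fa0b 0000
0009 706e 6673 2d39 2d32 3500 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 ffff ff02 eef0 3b00 0000 0027 706e
6673 2d39 2d32 353a 2073 7663 3a2f 6e65
7477 6f72 6b2f 6473 6572 763a 6465 6661
756c 743a 0000
"""


def pad4(n):
    """XDR pads variable-length data to a multiple of 4 bytes."""
    return (n + 3) & ~3


def decode_exibi(frame):
    """Walk Ethernet/IPv4/TCP/RPC headers and pull out the DS_EXIBI args."""
    off = 14 + 20 + 20                       # Ethernet + IPv4 + TCP, no options
    (mark,) = struct.unpack_from(">I", frame, off)
    off += 4                                 # RPC record mark
    xid, mtype, rpcvers, prog, vers, proc = struct.unpack_from(">6I", frame, off)
    off += 24
    for _ in range(2):                       # credential, then verifier
        _flavor, length = struct.unpack_from(">2I", frame, off)
        off += 8 + pad4(length)
    # ASSUMPTION: ds_verifier is an unsigned 64-bit integer.
    (boot_verifier,) = struct.unpack_from(">Q", frame, off)
    off += 8
    (instance_len,) = struct.unpack_from(">I", frame, off)
    off += 4
    return {
        "fragment_len": mark & 0x7FFFFFFF,
        "proc": proc,
        "boot_verifier": boot_verifier,
        "instance_len": instance_len,
        "instance": frame[off:off + instance_len],
    }


frame = bytes.fromhex("".join(HEX_DUMP.split()))
info = decode_exibi(frame)
print("proc=%d verifier=%#x len=%d instance=%r" % (
    info["proc"], info["boot_verifier"], info["instance_len"], info["instance"]))
```

Under those assumptions the length word sits right before the nodename and the count includes the trailing NUL, which matches the strlen() + 1 in dserv_mds_addport. The verifier also looks like a kernel address, as you would expect from curthread.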
Hmm, my allergies are killing my thought process. A signal 11 is SIGSEGV.
I'm back after a night's rest. I recompiled snoop with gcc and I think I've found the problem after staring at it in gdb:
127 switch (xdrs->x_op) {
128 case XDR_DECODE:
129 if (nodesize == 0)
130 return (TRUE);
131 if (sp == NULL)
(gdb)
132 *cpp = sp = (char *)mem_alloc(nodesize);
133 /* FALLTHROUGH */
134
135 case XDR_ENCODE:
136 sprintf(get_line(0, 0), "tdh_xdr_bytes calling xdr_opaque with %d!", nodesize);
137 return (xdr_opaque(xdrs, sp, nodesize));
138
139 case XDR_FREE:
140 if (sp != NULL) {
141 mem_free(sp, nodesize);
(gdb) p sp
$9 = 0x80c74a6 "\203�\020\203}\020"
(gdb)
We need to be allocating memory here. But sp is not NULL, so the allocation on line 132 is skipped, and whatever sp is pointing to is junk:
bool_t
xdr_identity(XDR *xdrs, identity *objp)
{
    rpc_inline_t *buf;
    if (!xdr_ds_verifier(xdrs, &objp->boot_verifier)) {
        sprintf(get_line(0, 0), "xdr_identity bombed for verifier = %d", objp->boot_verifier);
        return (FALSE);
    }
    sprintf(get_line(0, 0), "xdr_identity okay for verifier = %lx", objp->boot_verifier);
    if (!tdh_xdr_bytes(xdrs, (char **)&objp->instance.instance_val,
        (u_int *) &objp->instance.instance_len, MAXPATHLEN)) {
        sprintf(get_line(0, 0), "xdr_identity bombed, len = %d", objp->instance.instance_len);
        return (FALSE);
    }
    return (TRUE);
}
And we can see I am just grabbing it off the stack:
static void
ds_exibi_sa(char *line)
{
    DS_EXIBIargs eargs;
    if (!xdr_DS_EXIBIargs(&xdrm, &eargs))
        longjmp(xdr_err, 1);
    sprintf(line, "V = %d I = (%.20s)", eargs.ds_ident.boot_verifier,
        utf8localize((utf8string *)&eargs.ds_ident.instance));
    xdr_free(xdr_DS_EXIBIargs, (char *)&eargs);
}
A quick memset and retest:
[th199096@jhereg snoop]> ./snoop -v -i ~/ds2mds2.snoop > zzz
WARNING: received signal 11 from packet 4
[th199096@jhereg snoop]> ./snoop -v -i ~/ds2mds2.snoop > zzz
[th199096@jhereg snoop]>
And we can see the difference!
I'm in the midst of debugging a snoop implementation and I wanted to recompile with gcc and use gdb. I saved the output from the make command and basically used vi to put each .o file on a single line:
[th199096@jhereg snoop]> more files.make
nfs4_xdr.o
snoop.o
snoop_aarp.o
snoop_adsp.o
snoop_aecho.o
snoop_apple.o
Note that I could strip off the '.o's manually, but typically I would leave them there. What I want to do is take the filename and use it twice in a command. I.e.
% gcc -g -c -o nfs4_xdr.o nfs4_xdr.c
So I decided to use Python to learn a bit more about it:
[th199096@jhereg snoop]> more tran.py
#!/usr/sfw/bin/python
# Emit a shell script that rebuilds snoop with gcc -g, one compile per object.
l1 = []
print "#!/bin/sh -x"
print "# Make no changes here, machine generated!"
print "rm snoop *.o"
for line in open("files.make"):
    # "foo.o\n" splits into "foo" and "o\n"; the newline leaves with ext.
    [name, ext] = line.split('.')
    print "gcc -g -DUSE_FOR_SNOOP -c -I/builds/th199096/snoop/proto/root_i386/usr/include" \
        " -I. -I/builds/th199096/snoop/usr/src/common/net/dhcp -o %s.o %s.c" % (name, name)
    l1.append(name)
# Link step: the trailing ',' keeps print on the same output line.
print 'gcc -g -DTEXT_DOMAIN="SUNW_OST_OSCMD" -D_TS_ERRN -Bdirect -o snoop ',
for name in l1:
    print "%s.o" % (name),
print "-L/builds/th199096/snoop/proto/root_i386/lib -L/builds/th199096/snoop/proto/root_i386/usr/lib" \
    " -ldhcputil -ldlpi -lsocket -lnsl -ltsol"
One thing that jumped out was that, since I threw away the 'ext', I didn't have to worry about stripping off the '\n'. I also made use of the ',' on the end of the print statements to keep a line going.
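The split trick is easy to check in isolation. This is a small sketch in Python 3 (the script above is Python 2); the helper name stem is made up here, and os.path.splitext is offered only as one alternative that also copes with names containing more than one dot.

```python
import os.path

line = "nfs4_xdr.o\n"

# The approach used in tran.py: the newline rides along with the extension.
name, ext = line.split(".")
print(repr(name), repr(ext))


def stem(entry):
    """Hypothetical helper: drop the newline, then the last extension only."""
    return os.path.splitext(entry.strip())[0]


print(stem("snoop_aarp.o\n"))
print(stem("odd.name.o\n"))   # split('.') would yield three parts here
```

With split('.'), a file name containing a second dot would make the two-element unpacking raise ValueError, so the simple form is only safe for a list like files.make.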
I liked the ease of adding to the list of filenames. And in general, I found it easy to make a quick change and retest.
Could I have done this another way, say with the Makefile? Sure, but it wouldn't have been a learning experience. And off I go, the gdb prompt is calling me!
So another patch set for a Linux Dynamic Pseudo Root has been submitted by Steve Dickson to the Linux NFSv4 mailing list:
The following patch series gives rpc.mountd the ability to allocate a dynamic pseudo root, so the 'fsid=0' export option is no longer required. This allows v2, v3 and v4 clients mounts without any changes to the server's exports list.
One anomaly of the Linux NFS server is that it requires a pseudo root to be defined. Currently the only way a pseudo root can be defined is by setting the fsid to zero (i.e. fsid=0).
So if we wanted to make v4 the default mounting version and have things just work like v2/v3 all of the existing exports configurations would have to change (i.e. a 'fsid=0' would have to be added) to support a v4 mounts, which, imho, is unacceptable. So this patch series address this problem.
Steve has really highlighted a huge gap in seamless integration of the Linux NFSv4 implementation into automounters, etc. The path to an export should not change based on the version of the protocol.
Hmm, strike that. After re-reading it, I'm not sure whether this patch eliminates my concern about the mount path. I.e., above he talks about adding a 'fsid=0' on the server and not about what the client has to do about the path.
Time to ask him!
Update: Steve says it does address the mount path issue we've seen in the past.
I just got a framework in place for a table driven approach for snoop to decode the Control Protocol used between our DS and MDS servers for pNFS.
Here you can see the DS talking to the MDS and the MDS sending a NULL query:
[th199096@pnfs-9-25 ~]> sudo snoop.ctl -i /root/ds2tmds.snoop | grep CTL
4 0.00596 pnfs-9-25 -> pnfs-9-26.Central.Sun.COM CTL-DS C DS_EXIBI
6 0.00000 pnfs-9-26.Central.Sun.COM -> pnfs-9-25 CTL-DS R DS_EXIBI
8 0.00009 pnfs-9-25 -> pnfs-9-26.Central.Sun.COM CTL-DS C DS_REPORTAVAIL
13 0.00000 pnfs-9-26.Central.Sun.COM -> pnfs-9-25 CTL-MDS C MDS_NULL
15 2.99097 pnfs-9-25 -> pnfs-9-26.Central.Sun.COM CTL-MDS R MDS_NULL
17 0.00012 pnfs-9-26.Central.Sun.COM -> pnfs-9-25 CTL-DS R DS_REPORTAVAIL
Guess I'll have to fill in the guts of this to see what is really being said here!
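For readers who have not seen a table-driven decoder, here is a rough sketch of the idea in Python rather than snoop's C. Only the mapping of procedure 2 to DS_EXIBI is taken from the snoop output in these posts; the function names and the UNKNOWN_n fallback are invented for illustration, and the remaining procedure numbers would have to come from ds_prot.x rather than from guesses.

```python
# One table per control-protocol program: proc number -> (name, description).
# Only proc 2 is known from the snoop output quoted in these posts.
CTL_DS_PROCS = {
    2: ("DS_EXIBI", "Exchange Identity and Boot Instance"),
}


def proc_entry(table, proc):
    """Look up a procedure; fall back to a placeholder for unknown numbers."""
    return table.get(proc, ("UNKNOWN_%d" % proc, "unknown procedure"))


def summary_line(label, table, proc, is_call):
    """One-line form, e.g. 'CTL-DS C DS_EXIBI' for a call."""
    name, _desc = proc_entry(table, proc)
    return "%s %s %s" % (label, "C" if is_call else "R", name)


def detail_line(label, table, proc):
    """Verbose form, e.g. 'CTL-DS: Proc = 2 (...)'."""
    _name, desc = proc_entry(table, proc)
    return "%s: Proc = %d (%s)" % (label, proc, desc)


print(summary_line("CTL-DS", CTL_DS_PROCS, 2, True))
print(detail_line("CTL-DS", CTL_DS_PROCS, 2))
```

The appeal of the table is that adding a procedure becomes a one-line change, and the summary and verbose output stay in step because both read the same entry.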
I decided to do a massive snoop session as I brought 2 DSes online with 2 storage pools each. I wanted to see the transactions go across the wire. Then I found out that snoop doesn't know about our control protocol. Guess what I'm working on now?
Anyway, I have a special kernel that just sends along the new ds_path information - no spe. I decided I could at least look in the kernel and see what I had:
> ::walk mds_DS_guid_entry_cache | ::print struct rfs4_dbe data | ::print -a ds_guid_info_t
{
ffffff0315aebc18 dbe = 0xffffff0315aebba8
ffffff0315aebc20 ds_ownerp = 0xffffff0315aede48
ffffff0315aebc28 ds_guid_next = {
ffffff0315aebc28 list_next = 0
ffffff0315aebc30 list_prev = 0
}
ffffff0315aebc38 ds_guid = {
ffffff0315aebc38 stor_type = 1 (ZFS)
ffffff0315aebc40 ds_guid_u = {
ffffff0315aebc40 zfsguid = {
ffffff0315aebc40 zfsguid_len = 0x10
ffffff0315aebc48 zfsguid_val = 0xffffff02f0d4aca8
}
}
}
ffffff0315aebc50 ds_attr_len = 0
ffffff0315aebc58 ds_attr_val = 0
ffffff0315aebc60 ds_path = {
ffffff0315aebc60 utf8string_len = 0
ffffff0315aebc68 utf8string_val = 0
}
}
{
ffffff0315aebd10 dbe = 0xffffff0315aebca0
ffffff0315aebd18 ds_ownerp = 0xffffff0315aede48
ffffff0315aebd20 ds_guid_next = {
ffffff0315aebd20 list_next = 0
ffffff0315aebd28 list_prev = 0
}
ffffff0315aebd30 ds_guid = {
ffffff0315aebd30 stor_type = 1 (ZFS)
ffffff0315aebd38 ds_guid_u = {
ffffff0315aebd38 zfsguid = {
ffffff0315aebd38 zfsguid_len = 0x10
ffffff0315aebd40 zfsguid_val = 0xffffff02f0c9f1c0
}
}
}
ffffff0315aebd48 ds_attr_len = 0
ffffff0315aebd50 ds_attr_val = 0
ffffff0315aebd58 ds_path = {
ffffff0315aebd58 utf8string_len = 0
ffffff0315aebd60 utf8string_val = 0
}
}
{
ffffff0315aebe08 dbe = 0xffffff0315aebd98
ffffff0315aebe10 ds_ownerp = 0xffffff0315aedf58
ffffff0315aebe18 ds_guid_next = {
ffffff0315aebe18 list_next = 0
ffffff0315aebe20 list_prev = 0
}
ffffff0315aebe28 ds_guid = {
ffffff0315aebe28 stor_type = 1 (ZFS)
ffffff0315aebe30 ds_guid_u = {
ffffff0315aebe30 zfsguid = {
ffffff0315aebe30 zfsguid_len = 0x10
ffffff0315aebe38 zfsguid_val = 0xffffff02f0c9f0a8
}
}
}
ffffff0315aebe40 ds_attr_len = 0
ffffff0315aebe48 ds_attr_val = 0
ffffff0315aebe50 ds_path = {
ffffff0315aebe50 utf8string_len = 0
ffffff0315aebe58 utf8string_val = 0
}
}
{
ffffff0315aebf00 dbe = 0xffffff0315aebe90
ffffff0315aebf08 ds_ownerp = 0xffffff0315aedf58
ffffff0315aebf10 ds_guid_next = {
ffffff0315aebf10 list_next = 0
ffffff0315aebf18 list_prev = 0
}
ffffff0315aebf20 ds_guid = {
ffffff0315aebf20 stor_type = 1 (ZFS)
ffffff0315aebf28 ds_guid_u = {
ffffff0315aebf28 zfsguid = {
ffffff0315aebf28 zfsguid_len = 0x10
ffffff0315aebf30 zfsguid_val = 0xffffff02f398d790
}
}
}
ffffff0315aebf38 ds_attr_len = 0
ffffff0315aebf40 ds_attr_val = 0
ffffff0315aebf48 ds_path = {
ffffff0315aebf48 utf8string_len = 0
ffffff0315aebf50 utf8string_val = 0
}
}
>
And no path strings. At least I have 4 entries, which I was expecting!
A DS communicates the data server storage associated with it to the MDS. (We'll look at this in more depth later.) It does that via an RPC call -- DS_REPORTAVAIL. Here are the associated structures used for the XDR:
Notice that ds_path has been added for the SPE to allow for human readable names to be mapped to guids.
If you also have been meaning to learn more about RBAC, a good start would be: Introducing pfexec, a Convenient Utility in the OpenSolaris OS By Joerg Moellenkamp, with contributions from Marina Sum, October 13, 2008.
As a group, we decided to change ds_addr_t to ds_addrlist_t to avoid confusion with struct ds_addr. The OpenSolaris gate has those changes already.
Time to pick back up on that analysis, but remembering that ds_addr is different than ds_addr_t.
Note, we are in usr/src/uts/common/fs/nfs/nfs41_state.c...
So mds_gather_devs does the work of stuffing the layout. It gets called for every entry found in the instp->ds_addr_tab:
968 ds_addr_t *dp = (ds_addr_t *)entry;
...
974 if (gap->dex < gap->max_devs_needed) {
975 gap->lo_arg.lo_devs[gap->dex] = rfs4_dbe_getid(dp->dbe);
976 gap->dev_ptr[gap->dex] = dp;
977 gap->dex++;
978 }
So we keep on reading ds_addr_t data structures until we have enough.
Now, how is that table populated? We are looping over these entries in the NFSv4 state tables:
1060 rw_enter(&instp->ds_addr_lock, RW_READER);
1061 rfs4_dbe_walk(instp->ds_addr_tab, mds_gather_devs, &args);
1062 rw_exit(&instp->ds_addr_lock);
So we need to look for instp->ds_addr_tab or instp->ds_addr_idx. And in usr/src/uts/common/fs/nfs/ds_srv.c, we find mds_ds_addr_update which does:
616 ds_status
617 mds_ds_addr_update(ds_owner_t *dop, struct ds_addr *dap)
618 {
619 struct mds_adddev_args darg;
620 bool_t create = FALSE;
621 ds_addr_t *devp;
...
626 if ((devp = (ds_addr_t *)rfs4_dbsearch(mds_server->ds_addr_uaddr_idx,
627 (void *)dap->addr.na_r_addr,
628 &create, NULL, RFS4_DBS_VALID)) != NULL) {
629 MDS_SET_DS_FLAGS(devp->dev_flags, dap->validuse);
630 rw_exit(&mds_server->ds_addr_lock);
631 return (stat);
632 }
Note how we are calling the ds_addr_t a devp; perhaps a better structure name might be ds_dev_addr_t.
So, if we find one in mds_server->ds_addr_tab (via the mds_server->ds_addr_uaddr_idx which is a secondary index to ds_addr_idx), then we return. Else:
636 darg.dev_netid = kstrdup(dap->addr.na_r_netid);
637 darg.dev_addr = kstrdup(dap->addr.na_r_addr);
638
639 /* make it */
640 devp = (ds_addr_t *)rfs4_dbcreate(mds_server->ds_addr_idx,
641 (void *)&darg);
642
643 if (devp) {
644 devp->ds_owner = dop;
645 MDS_SET_DS_FLAGS(devp->dev_flags, dap->validuse);
646 list_insert_tail(&dop->ds_addr_list, devp);
647 } else
648 stat = DSERR_INVAL;
we grab the info out of the ds_addr and create a new entry. Note that it is devp->ds_owner which is likely to have the addressing info I am interested in.
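The search-then-create pattern in mds_ds_addr_update can be modeled in a few lines. This is a toy Python stand-in, not the rfs4 database code: the class and field names are hypothetical, the universal address is a sample value, and treating MDS_SET_DS_FLAGS as a plain assignment is an assumption.

```python
class DsAddrTable:
    """Toy model of a table with a primary and a secondary index."""

    def __init__(self):
        self.by_id = {}      # primary index (think ds_addr_idx)
        self.by_uaddr = {}   # secondary index (think ds_addr_uaddr_idx)
        self._next_id = 1

    def update(self, owner, netid, uaddr, validuse):
        """Refresh the entry for uaddr if it exists, else create it."""
        entry = self.by_uaddr.get(uaddr)
        if entry is not None:
            entry["flags"] = validuse        # ASSUMPTION: flags are replaced
            return entry
        entry = {"id": self._next_id, "netid": netid, "uaddr": uaddr,
                 "flags": validuse, "owner": owner}
        self._next_id += 1
        self.by_id[entry["id"]] = entry
        self.by_uaddr[uaddr] = entry
        owner["ds_addr_list"].append(entry)  # owner keeps its address list
        return entry


owner = {"identity": "pnfs-9-25", "ds_addr_list": []}
tab = DsAddrTable()
first = tab.update(owner, "tcp", "10.1.233.67.8.1", 1)
again = tab.update(owner, "tcp", "10.1.233.67.8.1", 3)
print(first is again, len(tab.by_id), len(owner["ds_addr_list"]))
```

The second call finds the existing entry through the secondary index, so a DS that reports the same address twice does not grow the table or the owner's list.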
98 typedef struct {
99 rfs4_dbe_t *dbe;
100 time_t last_access;
101 char *identity;
102 ds_id ds_id;
103 ds_verifier verifier;
104 uint32_t dsi_flags;
105 list_t ds_addr_list;
106 list_t ds_guid_list;
107 } ds_owner_t;
So we have lists of ds_addr and ds_guid. But that ds_guid_list is currently only created and never populated.
Time to digress and attack this from a different angle.
This may no longer be accurate, but Robert Gordon, before he passed on (to another company), left us with this image (from Server Design Document):
http://opensolaris.org/os/project/nfsv41/documentation/nfsv41_server/d13_layout_devices.jpg
This says quite clearly that while it may be the spe's job to generate layouts, in order to do so you need to construct a device list. Up until now, I've been working on a months-old statement that I need to "just generate the stripe width, stripe unit size, and an array of guids". Implicit in that is that someone else would do the logic, because it was trivial, to morph that into a layout.
And you know, I keep on looking for an explicit mapping to occur between the selection of the layout and the device list - it is the title of this series of blog articles. It may not be occurring because of the maturity of the code. I.e., everything up to now is predicated on there being a fixed number of DSes and a fixed number of data stores. And relationships just work in that if you only have 1 entry in a list because there is only 1 data store, then all of the other associated lists will also only have 1 entry.
There is still a lot of work to do to make this implementation a product.
Anyway, the picture spells out a lot of what is in the spec. The other way to attack this would be to look at a snoop trace during a create.
But any way you slice it, there is no magic happening to tie a guid to a device list.
I'm going to have to expand the scope of my project.
In my task list for spe, a large item has been how to tie it into the current code base - you might have seen me reference it as translating data path to guid. To do that, I've had to understand what the current code is doing and the limitations in that code. I've also had to question exactly what it is we want done.
The Simple Policy Engine (spe) tells the pNFS MetaData Server (MDS) how to lay out the stripes on the data servers (DS) at file creation time. If you think of RAID, a file is striped across disks and we need to know how many disks it is striped across and what the width of the stripe is. Then, to determine which disk a particular piece of data is on, we can divide the file offset by the stripe width and take the result modulo the number of disks.
This is simplistic, but is also the basic concept behind layout creation in pNFS. A huge difference is that we need to tell the client not only the stripe count and width, but the machine addresses of the DSes. It is a little bit more complex than that as each DS might have several data stores associated with it, a data store might be moved to a different DS, etc. We capture that complexity in the globally unique id (guid) assigned to each data store. But conceptually, let's consider just the base case of each DS having only one data store and it is always on that DS.
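The base case can be written down as a couple of lines of arithmetic. This is a sketch of simple round-robin striping only; the function name, the 32 KB stripe unit, and the four data server names are illustrative values, not anything the MDS hands out.

```python
def device_index(offset, stripe_unit, stripe_count):
    """Which device holds the byte at offset, for plain round-robin stripes."""
    return (offset // stripe_unit) % stripe_count


STRIPE_UNIT = 32 * 1024                 # illustrative stripe unit in bytes
DEVICES = ["ds0", "ds1", "ds2", "ds3"]  # hypothetical data servers

for offset in (0, 32 * 1024, 100 * 1024, 128 * 1024):
    idx = device_index(offset, STRIPE_UNIT, len(DEVICES))
    print(offset, DEVICES[idx])
```

The modulo is what makes the fifth stripe unit wrap back around to the first device.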
So the NFSv4.1 protocol defines an OPEN operation and a LAYOUTGET operation. It doesn't define how an implementation will determine which data sets are put into the layout.
In the current OpenSolaris implementation, these two operations result in the following call chains:
"OPEN" -> mds_op_open -> mds_do_opennull -> mds_create_file
"LAYOUTGET" -> mds_op_layout_get -> mds_fetch_layout -> mds_get_flo -> fake_spe
In my development gate, a call to spe_allocate currently occurs in mds_create_file.
The relevant files to look at are: usr/src/uts/common/fs/nfs/nfs41_srv.c and usr/src/uts/common/fs/nfs/nfs41_state.c.
Note: I will be quoting routines in the above two files. Over time, those files will change and will not match up with what I quote.
The interesting stuff in layout creation occurs in mds_fetch_layout:
Note that we are starting with nfs41_srv.c.
8320 if (mds_get_flo(cs, &lp) != NFS4_OK)
8321 return (NFS4ERR_LAYOUTUNAVAILABLE);
And in mds_get_flo:
8269 mutex_enter(&cs->vp->v_lock);
8270 fp = (rfs4_file_t *)vsd_get(cs->vp, cs->instp->vkey);
8271 mutex_exit(&cs->vp->v_lock);
8272
8273 /* Odd.. no rfs4_file_t for the vnode.. */
8274 if (fp == NULL)
8275 return (NFS4ERR_LAYOUTUNAVAILABLE);
This basically states that the file must have been created and must be in memory. This is not a panic for at least the following reasons:
8277 /* do we have a odl already ? */
8278 if (fp->flp == NULL) {
8279 /* Nope, read from disk */
8280 if (mds_get_odl(cs->vp, &fp->flp) != NFS4_OK) {
8281 /*
8282 * XXXXX:
8283 * XXXXX: No ODL, so lets go query PE
8284 * XXXXX:
8285 */
8286 fake_spe(cs->instp, &fp->flp);
8287
8288 if (fp->flp == NULL)
8289 return (NFS4ERR_LAYOUTUNAVAILABLE);
8290 }
8291 }
Note that an odl is an on-disk layout. And the statement on 8278 is how I will tie the spe in with this code. During an OPEN, I can simply set fp->flp and bypass this logic. If there is any error, then this field will be NULL and we can grab a simple default layout here. So I'll probably rename fake_spe to be mds_generate_default_flo.
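The intended control flow is easier to see with the kernel details stripped away. The following Python model is only a sketch of the ordering: a layout set at OPEN time wins, then the on-disk layout, then a default. The function and parameter names are hypothetical, and returning None stands in for NFS4ERR_LAYOUTUNAVAILABLE.

```python
def fetch_layout(fp, read_odl, default_layout):
    """Return the layout for a file state, trying each source in order."""
    if fp.get("flp") is None:              # nothing cached, e.g. spe failed
        fp["flp"] = read_odl()             # try the on-disk layout first
        if fp["flp"] is None:
            fp["flp"] = default_layout()   # today this is fake_spe
    return fp["flp"]                       # None means no layout available


# A file whose layout was already set by the policy engine during OPEN.
policy_file = {"flp": {"loid": 7}}
print(fetch_layout(policy_file, lambda: None, lambda: {"loid": 1}))

# A file with no layout yet and nothing on disk falls back to the default.
plain_file = {"flp": None}
print(fetch_layout(plain_file, lambda: None, lambda: {"loid": 1}))
```

The layout ids 7 and 1 are placeholders; the point is only that the fallback path never runs when the OPEN path has already filled in the field.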
So understanding what fake_spe does will help me understand what the real spe will have to do:
8236 int key = 1;
...
8241 *flp = NULL;
8242
8243 rw_enter(&instp->mds_layout_lock, RW_READER);
8244 lp = (mds_layout_t *)rfs4_dbsearch(instp->mds_layout_idx,
8245 (void *)(uintptr_t)key, &create, NULL, RFS4_DBS_VALID);
8246 rw_exit(&instp->mds_layout_lock);
8247
8248 if (lp == NULL)
8249 lp = mds_gen_default_layout(instp, mds_max_lo_devs);
8250
8251 if (lp != NULL)
8252 *flp = lp;
The current code only ever has 1 layout in memory. Hence, the key is 1. We'll need to see how that layout is generated. And that occurs in mds_gen_default_layout. Note how simplistic this code is - if for any reason the layout is deleted from the table, it is simply added back in here. Right now, the only reason the layout would be deleted is if a DS reboots (look at ds_exchange in ds_srv.c).
This is the code that builds up the layout and stuffs it in memory:
Note that we have switched into nfs41_state.c.
1046 int mds_default_stripe = 32;
1047 int mds_max_lo_devs = 20;
...
1052 struct mds_gather_args args;
1053 mds_layout_t *lop;
1054
1055 bzero(&args, sizeof (args));
1056
1057 args.max_devs_needed = MIN(max_devs_needed,
1058 MIN(mds_max_lo_devs, 99));
1059
1060 rw_enter(&instp->ds_addr_lock, RW_READER);
1061 rfs4_dbe_walk(instp->ds_addr_tab, mds_gather_devs, &args);
1062 rw_exit(&instp->ds_addr_lock);
1063
1064 /*
1065 * if we didn't find any devices then we do no service
1066 */
1067 if (args.dex == 0)
1068 return (NULL);
1069
1070 args.lo_arg.loid = 1;
1071 args.lo_arg.lo_stripe_unit = mds_default_stripe * 1024;
1072
1073 rw_enter(&instp->mds_layout_lock, RW_WRITER);
1074 lop = (mds_layout_t *)rfs4_dbcreate(instp->mds_layout_idx,
1075 (void *)&args);
1076 rw_exit(&instp->mds_layout_lock);
We first walk across the instp->ds_addr_tab and look for effectively 20 entries. Note that max_devs_needed is always 20 for this code and so will be args.max_devs_needed.
I think the check on 1067 is incorrect and a result of the current implementation normally being on a community with 1 DS. It should be the case that args.dex is greater than or equal to max_devs_needed. Actually, we need to be passing in how many devices we will have D (the ones assigned to a policy) and how many we need to use S, with S <= D. The args.dex will have to be >= S.
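To make the proposed check concrete, here is a Python sketch of the gather step with the stricter test. It is a model, not the kernel code: the table is a plain list of D device names, stripe_count plays the role of S, and gen_layout is a made-up name.

```python
def gather_devs(ds_addr_tab, max_devs_needed):
    """Collect entries until enough are found, like the walk callback."""
    devs = []
    for entry in ds_addr_tab:              # the walk visits every entry
        if len(devs) < max_devs_needed:    # same shape as the test on 974
            devs.append(entry)
    return devs


def gen_layout(ds_addr_tab, stripe_count, stripe_unit=32 * 1024):
    """Build a layout only if S devices were found among the D available."""
    devs = gather_devs(ds_addr_tab, stripe_count)
    if len(devs) < stripe_count:           # proposed check, instead of == 0
        return None
    return {"stripe_unit": stripe_unit, "devs": devs}


print(gen_layout(["ds0", "ds1", "ds2"], 2))   # D = 3, S = 2: succeeds
print(gen_layout(["ds0"], 2))                 # D = 1, S = 2: no layout
```

With the current "== 0" test, the second call would have produced a one-device layout even though two were asked for.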
Note that on 1070, we assign it the only layout id which will ever be generated. And if we play things right, we could store this layout id back in the policy and avoid regenerating the layout if at all possible.
Finally we stuff the newly created layout into the table.
So mds_gather_devs does the work of stuffing the layout. It gets called for every entry found in the instp->ds_addr_tab:
974 if (gap->dex < gap->max_devs_needed) {
975 gap->lo_arg.lo_devs[gap->dex] = rfs4_dbe_getid(dp->dbe);
976 gap->dev_ptr[gap->dex] = dp;
977 gap->dex++;
978 }
So we keep on reading ds_addr_t data structures until we have enough.
Now, how is that table populated? You can look for ds_addr_idx over in usr/src/uts/common/fs/nfs/ds_srv.c, but basically, for each data store that a DS registers, one of these is created.
The upshot of all this is that if a pNFS community has N data stores, then the layout generated for the current implementation will have a stripe count of N.
Note that we are back in nfs41_srv.c.
Okay, we've generated the layout and start to generate the otw (over the wire) layout:
8332
8333 mds_set_deviceid(lp->dev_id, &otw_flo.nfl_deviceid);
8334
Crap, it is sending the device id across the wire! I'm going to have to rethink my approach. Instead of storing a policy as a device list and picking which devices I want out of that list (i.e., a Round Robin (RR) scheduler), I'm going to have to store each generated set as a new device list.
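A rough Python sketch of what "store each generated set as a new device list" could look like: a round-robin pick of S devices out of a policy's D, with each distinct set registered under its own id so that a repeat pick reuses the id. Every name here is hypothetical and none of it reflects how the MDS actually numbers device lists.

```python
def round_robin(policy_devs, count, start):
    """Pick count devices from the policy's list, beginning at start."""
    n = len(policy_devs)
    return [policy_devs[(start + i) % n] for i in range(count)]


class DeviceLists:
    """Hypothetical registry: one id per distinct ordered device set."""

    def __init__(self):
        self._ids = {}

    def id_for(self, devs):
        key = tuple(devs)
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1   # new set, new id
        return self._ids[key]


policy = ["guid-a", "guid-b", "guid-c"]
registry = DeviceLists()
for start in range(4):
    devs = round_robin(policy, 2, start)
    print(start, devs, registry.id_for(devs))
```

Because the fourth pick repeats the first set, it maps back to the same id instead of creating a new device list.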
I don't understand the process like I thought I did.
Going back to mds_gather_devs, it is not stuffing data stores into a table as I thought. Instead, it is stuffing DS network addresses into a table.
What I'm missing is how the ds_addr entries map back to data stores. Okay, this code in mds_gen_default_layout does it:
1073 rw_enter(&instp->mds_layout_lock, RW_WRITER);
1074 lop = (mds_layout_t *)rfs4_dbcreate(instp->mds_layout_idx,
1075 (void *)&args);
1076 rw_exit(&instp->mds_layout_lock);
We have just gotten the device list via the walk over mds_gather_devs. And now we effectively call mds_layout_create on 1074.
1104 ds_addr_t *dp;
1105 struct mds_gather_args *gap = (struct mds_gather_args *)arg;
1106 struct mds_addlo_args *alop = &gap->lo_arg;
...
1119 lp->layout_type = LAYOUT4_NFSV4_1_FILES;
1120 lp->stripe_unit = alop->lo_stripe_unit;
1121
1122 for (i = 0; alop->lo_devs[i] && i < 100; i++) {
1123 lp->devs[i] = alop->lo_devs[i];
1124 dp = mds_find_ds_addr(instp, alop->lo_devs[i]);
1125 /* lets hope this doesn't occur */
1126 if (dp == NULL)
1127 return (FALSE);
1128 gap->dev_ptr[i] = dp;
1129 }
Okay, alop->lo_devs is the array we built in mds_gather_devs. Yes, yes, that is true.
I just figured out where all of my confusion is coming from - the code has struct ds_addr and ds_addr_t. In the xdr code, struct ds_addr is just an address (usr/src/head/rpcsvc/ds_prot.x):
338 /*
339 * ds_addr -
340 *
341 * A structure that is used to specify an address and
342 * its usage.
343 *
344 * addr:
345 *
346 * The specific address on the DS.
347 *
348 * validuse:
349 *
350 * Bitmap associating the netaddr defined in "addr"
351 * to the protocols that are valid for that interface.
352 */
353 struct ds_addr {
354 struct netaddr4 addr;
355 ds_addruse validuse;
356 };
But in the code I've been looking at, ds_addr_t is a different structure (see usr/src/uts/common/nfs/mds_state.h):
133 /*
134 * ds_addr:
135 *
136 * This list is updated via the control-protocol
137 * message DS_REPORTAVAIL.
138 *
139 * FOR NOW: We scan this list to automatically build the default
140 * layout and the multipath device struct (mds_mpd)
141 */
142 typedef struct {
143 rfs4_dbe_t *dbe;
144 netaddr4 dev_addr;
145 struct knetconfig *dev_knc;
146 struct netbuf *dev_nb;
147 uint_t dev_flags;
148 ds_owner_t *ds_owner;
149 list_node_t ds_addr_next;
150 } ds_addr_t;
This is pure evil because we typically equate foo_t as being typedef struct foo foo_t. As you can see, I've been fighting that in the above analysis.
I'm going to file an issue on this naming convention and leave the analysis here. I'll come back to it and rewrite it as if I knew all along that I was using a ds_addr_t and not a struct ds_addr.
I just merged the nfs41-gate with the snv_100 tagged onnv-gate. This caused me to bump the closedv tag to 2 in the nfs41-gate.
You can refresh your copies of our closed-bins at http://www.opensolaris.org/os/project/nfsv41/downloads/.
BTW: While the pushes are automatic, I'm still trying to get the notification to be automatic.
6751438 mirror mounted mountpoints panic when umounted, finally got through the code review process, the RTI process, etc. I wanted to get this into snv_101 because that is the candidate to become OpenSolaris 2008.11. I want a stable mirror mount experience out there for users.
The other bug fix I have in the works has stalled, which I'm really okay with at this point. This is the 6738223 Can not share a single IP address bug. Anyway, I've gotten two positive reviews and one reviewer asking questions. If I were tricky, I could try to submit an RTI, but the fact of the matter is that the third reviewer can still be satisfied.
Also, there is no way I want to integrate this into snv_101. While the fix is trivial, it can wait for snv_102. I.e., I don't see a business need to put it back right now.
Some of the key differences between the bugs and why one has to go in:
So I feel one is a strong must and the other is a nice to have. And I don't think 'nice to have' meets the entry criteria for snv_101.
I ran into another machine which got stuck on that 'hg style' issue I reported in Getting around a tool repository which is not updating and I got mad enough to start looking at it. We've still got path dependencies in BFU to stuff inside SWAN and I was hoping we could get away from it with the new tools.
I started trying to figure out where the Python script was that was being run. I wanted to find the hard-coded reference to /ws/onnv-tools/onbld/etc/hgstyle. Knowing that I had just seen a Flag Day on this, I looked in Flag day: new default output style for Mercurial. And the solution was staring right at me!
Mark had coded it correctly, no path hardcoding! I could simply change:
[ui]
username=Mark J. Nelson
style=/ws/onnv-tools/onbld/etc/hgstyle
to be:
[ui]
username=Mark J. Nelson
style=/opt/onbld/etc/hgstyle
Sweet, great to have my faith restored and an effective solution.
But now I really need to fix the automount maps in the lab. We keep on tripping over this issue with a stale repository, and we have a working one we should be using.