That last bug was actually pretty easy to solve. It turned out to be more code which had never been tested before. If we look at the code in usr/src/uts/common/fs/nfs/dserv_server.c (which will be modified by the time you read this), we see:
711 mutex_enter(&inst->dmi_content_lock);
712 error = find_open_root_objset(inst, sid, &root_objset);
713 if (error) {
714 error = (error == ENOENT) ? EIO : error;
715 goto out;
716 }
717
718 error = find_open_mdsfs_objset(inst, dataset_id, root_objset,
719 &(dnd->dnd_objset));
720 if (error == 0 || error != ENOENT) {
721 if (error == 0)
722 dnd->dnd_flags |= DSERV_NNODE_FLAG_OBJSET;
723 goto out;
724 }
The EAGAIN in Hard day of debugging the hard stuff, but now have multiple datasets on the same DS was returned on 712 by find_open_root_objset(). The new panic was occurring down in the call to find_open_mdsfs_objset() on 718. Before my changes, find_open_root_objset() always returned the first pNFS dataset root. So if you had two, you only ever saw the one.
I decided to skip right to the code and it ended up being easy to spot in find_open_root_objset(), I started there because I hadn't changed find_open_mdsfs_objset() and it was known to work.
541 static int
542 find_open_root_objset(dserv_mds_instance_t *inst, mds_sid mds_sid,
543 open_root_objset_t **root_objset)
544 {
...
603 /*
604 * Find the root pNFS object set.
605 */
606 for (tmp_root = list_head(&inst->dmi_datasets); tmp_root != NULL;
607 tmp_root = list_next(&inst->dmi_datasets, tmp_root)) {
608 if (ds_guid.dg_zpool_guid ==
609 tmp_root->oro_ds_guid.dg_zpool_guid &&
610 ds_guid.dg_objset_guid ==
611 tmp_root->oro_ds_guid.dg_objset_guid) {
612 /*
613 * This is our root pNFS object set!
614 */
615 found_root_objset = 1;
616 break;
617 }
618 }
*root_objset is never set when it is found. So, when we use it on 718, it is garbage. The fix is a simple assignment. Hmm, I could replace tmp_root with *root_objset, which is what the original author seemed to think was going on here.
But in any event, here is the 'pee' in pNFS:
[root@pnfs-17-21 ~]> mount -o vers=4 pnfs-17-24:/pnfs2/pnfs /pnfs/pnfs-17-24
[root@pnfs-17-21 ~]> cp /etc/passwd /pnfs/pnfs-17-24/qwhoei
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/qwhoei
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 1
Layout [0]:
Layout obtained at: Sat Jul 25 02:20:00:343367 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
[root@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/qwhoei
-rw-r--r-- 1 root root 881 Jul 25 02:20 /pnfs/pnfs-17-24/qwhoei
Well, actually, there really isn't any parallel activity going on here. The file is 881 bytes and the default stripe size is 32k. So it all goes to stripe 0. But there is a lot going on in the background which was touched by my changes.
I could either start testing the kspe to get a smaller stripe size or I can use mkfile to get a file of 2*4*32k such that there are two writes to each stripe.
[root@pnfs-17-21 ~]> mkfile 256k chunky
[root@pnfs-17-21 ~]> ls -la chunky
-rw------T 1 root root 262144 Jul 25 02:34 chunky
[root@pnfs-17-21 ~]> cp chunky /pnfs/pnfs-17-24/pChunky
[root@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/pChunky
-rw------- 1 root root 262144 Jul 25 02:34 /pnfs/pnfs-17-24/pChunky
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/pChunky
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 8
Layout [0]:
Layout obtained at: Sat Jul 25 02:34:22:325669 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
But I have no idea if the file is identical. I'll test that later...
I can test that with using a large text file, say 440k:
[root@pnfs-17-21 ~]> cp nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/Pnfs4_vnops.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
Layout obtained at: Sat Jul 25 02:55:17:818063 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
[root@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/Pnfs4_vnops.c
-rw-r--r-- 1 root root 409540 Jul 25 02:55 /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c
So we have pNFS!
The good news:
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/bubble
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 0
Layout [0]:
Layout obtained at: Fri Jul 24 22:33:22:698731 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
As you can see, the first and second stripe go onto the same DS. How can we make sure that these are for different datasets though? Well, snoop should tell us!
NFS: Op = 50 (LAYOUTGET) NFS: Status = 0 (NFS4_OK) NFS: State ID hash = 0721 NFS: len = 12 val = 710000004306000000000000 NFS: State ID Sequence ID = 1 NFS: Layout [0]: NFS: Return layout on close = FALSE NFS: Layout offset = 0 NFS: Layout length = 18446744073709551615 NFS: Layout iomode = LAYOUTIOMODE4_RW NFS: Layout type = LAYOUT4_NFSV4_1_FILES NFS: Device ID hash = [03F9] NFS: (16) 000000000000000000000000C8000000 NFS: Util = 0x8001 NFS: stripe_unit = 512 NFS: UFLG_DENSE NFS: First stripe = 0 NFS: fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000 NFS: fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000 NFS: fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000 NFS: fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000 NFS:
fh[0] and fh[1] differ in the mds_sid portion of the file handle: '10D336969B13CB6C0E' vs '1086FE860F0E0B87D4'. And the 0x10 at the start is probably the length. Amazingly, the mds_sid is also the only difference with fh[2] and fh[3]. That is because the mds_sid is what tells the DS which dataset to act on. The combination of the mds_sid and the ip addresses in the device info are what lets the client send the correct data to the correct DS:
NFS: Op = 47 (GETDEVICEINFO) NFS: Status = 0 (NFS4_OK) NFS: Layout type = LAYOUT4_NFSV4_1_FILES NFS: Device address: NFS: stripe_indices: 0, 1, 2, 3 NFS: multipath_ds_list[0]: [0] tcp:10.1.233.192.183.161 NFS: multipath_ds_list[1]: [0] tcp:10.1.233.192.183.161 NFS: multipath_ds_list[2]: [0] tcp:10.1.233.193.183.161 NFS: multipath_ds_list[3]: [0] tcp:10.1.233.193.183.161 NFS:
Now, I said good news, which implies bad news. And here it is:
[thud@pnfs-17-21 ~]> cp /etc/passwd /pnfs/pnfs-17-24/bubble cp: /pnfs/pnfs-17-24/bubble: write: Permission denied [thud@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/bubble -rw-r--r-- 1 thud staff 0 Jul 24 22:33 /pnfs/pnfs-17-24/bubble
So the create is working, but the data is not appearing. I'm thinking an issue with the client talking to the DSes. And with a new filehandle, I wonder if it is not being decoded correctly?
Well, it is client to DS communication, as a snoop shows us:
NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Compound) NFS: Tag = pnfs write NFS: Status = 10012 (NFS4ERR_LOCKED) NFS: Number of results = 3 NFS: NFS: Op = 53 (SEQUENCE) NFS: Status = 0 (NFS4_OK) NFS: Session ID hash = [3716] NFS: (16) 00000000FFFFFFFFD700000064000000 NFS: Sequence ID = 1 NFS: Slot ID = 0 NFS: Highest slotid = 200 NFS: Target highest slotid = 200 NFS: status = 0 NFS: NFS: Op = 22 (PUTFH) NFS: Status = 0 (NFS4_OK) NFS: NFS: Op = 38 (WRITE) NFS: Status = 10012 (NFS4ERR_LOCKED) NFS:
I just fixed an issue in the encoding of the ds_filehandle, so perhaps I have it in reverse? And no, the decoding was okay. We need to look for NFS4ERR_LOCKED. So either locking code is getting to me or it is EAGAIN - which is bad to track.
Well, not so bad, we know it is seen during the WRITE.
Nothing was panning out, so I went to the webrev (http://cr.opensolaris.org/~tdh/perse/) and I'm suspicious of this code in usr/src/uts/common/fs/nfs/dserv_server.c :
578 /*
579 * Use the MDS SID (from the file handle) to find the real data
580 * server guid (zpool guid + id of the root pNFS object set).
581 */
582 for (sid_map = list_head(&inst->dmi_mds_sids); sid_map != NULL;
583 sid_map = list_next(&inst->dmi_mds_sids, sid_map)) {
584 if ((mds_sid.len == sid_map->msm_mds_storid.len) &&
585 (memcmp(mds_sid.val, sid_map->msm_mds_storid.val,
586 sid_map->msm_mds_storid.len) == 0)) {
587 found_mds_sid = 1;
588 ds_guid = sid_map->msm_ds_guid;
589 break;
590 }
591 }
592
593 /*
594 * If we have no record of the given MDS SID it may mean that
595 * we haven't been able to do the REPORTAVAIL for this particular
596 * resource. Therefore, just tell the client to try again later.
597 * In the future, we will attempt to ask the MDS for this information
598 * via DS_MAP_MDSSID.
599 */
600 if (found_mds_sid != 1)
601 return (EAGAIN);
Until my changeset, the mds_sid was always 0s. And that piece of code was surrounded by an '#ifdef 0'. Great, I have my culprit!
I added some debug statements to show what the DS was seeing as it looped across the sid_map:
Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sid_map is NOT NULL Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: fh[16] 312CE43624C45DB0000000000000001E Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sm[16] 768279474C9ACF441E00000000000000 Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: fh[16] 312CE43624C45DB0000000000000001E Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sm[16] B05DC42436E42C311E00000000000000 Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: no root objset found!
I'm very suspicious that the reply from the DS_REPORTAVAIL (MDS to DS) is xdr encoding the mds_sid and the one in the file handle is not being encoded. I think the fh mds_sid above is the same as the second sid_map one.
The root cause here was that the mds_sid used to be sent across as 0s. As such, when the MDS did an xdr encode and the DS did not do an xdr decode, it really didn't matter. And the mds_sid was actually ignored on the DS.
Since we don't encode the mds_sid when we send it in the ds_filehandle, I decided to not encode it in the DS_REPORTAVAIL result. As such, we will be able to look at network traces and see if they match in both places.
This analysis took several hours. I keep on getting blind-sided by thinking existing code works correctly. I looked down several side paths I didn't mention (like the MDS treating the mds_sid array like a pointer and not an array of mds_sids).
I can also prove that I've fixed the bug, as I have a new one on the DS, which is downstream of the old one:
find_open_root_objset: sid_map is NOT NULL find_open_root_objset: fh[16] D336969B13CB6C0E000000000000001E find_open_root_objset: sm[16] 86FE860F0E0B87D4000000000000001E panic[cpu1]/thread=ffffff01d7880000: BAD TRAP: type=e (#pf Page fault) rp=ffffff0007b3a0e0 addr=100428 occurred in module "genunix" due to an illegal access to a user address ... ffffff0007b3a1e0 genunix:list_head+10 () ffffff0007b3a230 nfssrv:find_open_mdsfs_objset+40 () ffffff0007b3a6e0 nfssrv:dserv_nnode_data_getobjset+158 () ffffff0007b3a750 nfssrv:dserv_nnode_io_prep+85 () ffffff0007b3a7a0 nfssrv:nnop_io_prep+41 () ffffff0007b3a960 nfssrv:mds_op_write+134 () ffffff0007b3a9b0 nfssrv:rfs41_op_dispatch+72 ()
But it is late and I don't know if I'll blog about finding out what this one does...