The good news:
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/bubble
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 0
Layout [0]:
Layout obtained at: Fri Jul 24 22:33:22:698731 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
As you can see, the first and second stripe go onto the same DS. How can we make sure that these are for different datasets though? Well, snoop should tell us!
NFS: Op = 50 (LAYOUTGET) NFS: Status = 0 (NFS4_OK) NFS: State ID hash = 0721 NFS: len = 12 val = 710000004306000000000000 NFS: State ID Sequence ID = 1 NFS: Layout [0]: NFS: Return layout on close = FALSE NFS: Layout offset = 0 NFS: Layout length = 18446744073709551615 NFS: Layout iomode = LAYOUTIOMODE4_RW NFS: Layout type = LAYOUT4_NFSV4_1_FILES NFS: Device ID hash = [03F9] NFS: (16) 000000000000000000000000C8000000 NFS: Util = 0x8001 NFS: stripe_unit = 512 NFS: UFLG_DENSE NFS: First stripe = 0 NFS: fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000 NFS: fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000 NFS: fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000 NFS: fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000 NFS:
fh[0] and fh[1] differ in the mds_sid portion of the file handle: '10D336969B13CB6C0E' vs '1086FE860F0E0B87D4'. And the 0x10 at the start is probably the length. Amazingly, the mds_sid is also the only difference with fh[2] and fh[3]. That is because the mds_sid is what tells the DS which dataset to act on. The combination of the mds_sid and the ip addresses in the device info are what lets the client send the correct data to the correct DS:
NFS: Op = 47 (GETDEVICEINFO) NFS: Status = 0 (NFS4_OK) NFS: Layout type = LAYOUT4_NFSV4_1_FILES NFS: Device address: NFS: stripe_indices: 0, 1, 2, 3 NFS: multipath_ds_list[0]: [0] tcp:10.1.233.192.183.161 NFS: multipath_ds_list[1]: [0] tcp:10.1.233.192.183.161 NFS: multipath_ds_list[2]: [0] tcp:10.1.233.193.183.161 NFS: multipath_ds_list[3]: [0] tcp:10.1.233.193.183.161 NFS:
Now, I said good news, which implies bad news. And here it is:
[thud@pnfs-17-21 ~]> cp /etc/passwd /pnfs/pnfs-17-24/bubble cp: /pnfs/pnfs-17-24/bubble: write: Permission denied [thud@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/bubble -rw-r--r-- 1 thud staff 0 Jul 24 22:33 /pnfs/pnfs-17-24/bubble
So the create is working, but the data is not appearing. I'm thinking an issue with the client talking to the DSes. And with a new filehandle, I wonder if it is not being decoded correctly?
Well, it is client to DS communication, as a snoop shows us:
NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Compound) NFS: Tag = pnfs write NFS: Status = 10012 (NFS4ERR_LOCKED) NFS: Number of results = 3 NFS: NFS: Op = 53 (SEQUENCE) NFS: Status = 0 (NFS4_OK) NFS: Session ID hash = [3716] NFS: (16) 00000000FFFFFFFFD700000064000000 NFS: Sequence ID = 1 NFS: Slot ID = 0 NFS: Highest slotid = 200 NFS: Target highest slotid = 200 NFS: status = 0 NFS: NFS: Op = 22 (PUTFH) NFS: Status = 0 (NFS4_OK) NFS: NFS: Op = 38 (WRITE) NFS: Status = 10012 (NFS4ERR_LOCKED) NFS:
I just fixed an issue in the encoding of the ds_filehandle, so perhaps I have it in reverse? And no, the decoding was okay. We need to look for NFS4ERR_LOCKED. So either locking code is getting to me or it is EAGAIN - which is bad to track.
Well, not so bad, we know it is seen during the WRITE.
Nothing was panning out, so I went to the webrev (http://cr.opensolaris.org/~tdh/perse/) and I'm suspicious of this code in usr/src/uts/common/fs/nfs/dserv_server.c :
578 /*
579 * Use the MDS SID (from the file handle) to find the real data
580 * server guid (zpool guid + id of the root pNFS object set).
581 */
582 for (sid_map = list_head(&inst->dmi_mds_sids); sid_map != NULL;
583 sid_map = list_next(&inst->dmi_mds_sids, sid_map)) {
584 if ((mds_sid.len == sid_map->msm_mds_storid.len) &&
585 (memcmp(mds_sid.val, sid_map->msm_mds_storid.val,
586 sid_map->msm_mds_storid.len) == 0)) {
587 found_mds_sid = 1;
588 ds_guid = sid_map->msm_ds_guid;
589 break;
590 }
591 }
592
593 /*
594 * If we have no record of the given MDS SID it may mean that
595 * we haven't been able to do the REPORTAVAIL for this particular
596 * resource. Therefore, just tell the client to try again later.
597 * In the future, we will attempt to ask the MDS for this information
598 * via DS_MAP_MDSSID.
599 */
600 if (found_mds_sid != 1)
601 return (EAGAIN);
Until my changeset, the mds_sid was always 0s. And that piece of code was surrounded by an '#ifdef 0'. Great, I have my culprit!
I added some debug statements to show what the DS was seeing as it looped across the sid_map:
Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sid_map is NOT NULL Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: fh[16] 312CE43624C45DB0000000000000001E Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sm[16] 768279474C9ACF441E00000000000000 Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: fh[16] 312CE43624C45DB0000000000000001E Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sm[16] B05DC42436E42C311E00000000000000 Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: no root objset found!
I'm very suspicious that the reply from the DS_REPORTAVAIL (MDS to DS) is xdr encoding the mds_sid and the one in the file handle is not being encoded. I think the fh mds_sid above is the same as the second sid_map one.
The root cause here was that the mds_sid used to be sent across as 0s. As such, when the MDS did an xdr encode and the DS did not do an xdr decode, it really didn't matter. And the mds_sid was actually ignored on the DS.
Since we don't encode the mds_sid when we send it in the ds_filehandle, I decided to not encode it in the DS_REPORTAVAIL result. As such, we will be able to look at network traces and see if they match in both places.
This analysis took several hours. I keep on getting blind-sided by thinking existing code works correctly. I looked down several side paths I didn't mention (like the MDS treating the mds_sid array like a pointer and not an array of mds_sids).
I can also prove that I've fixed the bug, as I have a new one on the DS, which is downstream of the old one:
find_open_root_objset: sid_map is NOT NULL find_open_root_objset: fh[16] D336969B13CB6C0E000000000000001E find_open_root_objset: sm[16] 86FE860F0E0B87D4000000000000001E panic[cpu1]/thread=ffffff01d7880000: BAD TRAP: type=e (#pf Page fault) rp=ffffff0007b3a0e0 addr=100428 occurred in module "genunix" due to an illegal access to a user address ... ffffff0007b3a1e0 genunix:list_head+10 () ffffff0007b3a230 nfssrv:find_open_mdsfs_objset+40 () ffffff0007b3a6e0 nfssrv:dserv_nnode_data_getobjset+158 () ffffff0007b3a750 nfssrv:dserv_nnode_io_prep+85 () ffffff0007b3a7a0 nfssrv:nnop_io_prep+41 () ffffff0007b3a960 nfssrv:mds_op_write+134 () ffffff0007b3a9b0 nfssrv:rfs41_op_dispatch+72 ()
But it is late and I don't know if I'll blog about finding out what this one does...