« December 2009
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
  
       
Today
XML

Neat blogs

Navigation

Editing

Powered by Roller Weblogger.

statcounter.com

clustrmaps.com

Locations of visitors to this page

technorati.com

20090725 Saturday July 25, 2009
Hard day of debugging the hard stuff, but now have multiple datasets on the same DS

The good news:

[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/bubble
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 0
Layout [0]:
        Layout obtained at: Fri Jul 24 22:33:22:698731 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [3]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK

As you can see, the first and second stripe go onto the same DS. How can we make sure that these are for different datasets though? Well, snoop should tell us!

NFS:  Op = 50 (LAYOUTGET)
NFS:  Status = 0 (NFS4_OK)
NFS:  State ID hash = 0721
NFS:      len = 12    val = 710000004306000000000000
NFS:      State ID Sequence ID = 1
NFS:  Layout [0]:
NFS:  Return layout on close = FALSE
NFS:  Layout offset = 0
NFS:  Layout length = 18446744073709551615
NFS:  Layout iomode = LAYOUTIOMODE4_RW
NFS:  Layout type = LAYOUT4_NFSV4_1_FILES
NFS:  Device ID hash = [03F9]
NFS:      (16) 000000000000000000000000C8000000
NFS:  Util = 0x8001
NFS:      stripe_unit = 512
NFS:      UFLG_DENSE
NFS:  First stripe = 0
NFS:  fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000
NFS:  fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000
NFS:  fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000
NFS:  fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000
NFS:

fh[0] and fh[1] differ in the mds_sid portion of the file handle: '10D336969B13CB6C0E' vs '1086FE860F0E0B87D4'. And the 0x10 at the start is probably the length. Amazingly, the mds_sid is also the only difference with fh[2] and fh[3]. That is because the mds_sid is what tells the DS which dataset to act on. The combination of the mds_sid and the ip addresses in the device info are what lets the client send the correct data to the correct DS:

NFS:  Op = 47 (GETDEVICEINFO)
NFS:  Status = 0 (NFS4_OK)
NFS:  Layout type = LAYOUT4_NFSV4_1_FILES
NFS:  Device address:
NFS:      stripe_indices: 0, 1, 2, 3
NFS:      multipath_ds_list[0]: [0] tcp:10.1.233.192.183.161
NFS:      multipath_ds_list[1]: [0] tcp:10.1.233.192.183.161
NFS:      multipath_ds_list[2]: [0] tcp:10.1.233.193.183.161
NFS:      multipath_ds_list[3]: [0] tcp:10.1.233.193.183.161
NFS:

Now, I said good news, which implies bad news. And here it is:

[thud@pnfs-17-21 ~]> cp /etc/passwd /pnfs/pnfs-17-24/bubble
cp: /pnfs/pnfs-17-24/bubble: write: Permission denied
[thud@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/bubble
-rw-r--r--   1 thud     staff          0 Jul 24 22:33 /pnfs/pnfs-17-24/bubble

So the create is working, but the data is not appearing. I'm thinking an issue with the client talking to the DSes. And with a new filehandle, I wonder if it is not being decoded correctly?

Well, it is client to DS communication, as a snoop shows us:

NFS:  ----- Sun NFS -----
NFS:
NFS:  Proc = 1 (Compound)
NFS:  Tag = pnfs write
NFS:  Status = 10012 (NFS4ERR_LOCKED)
NFS:  Number of results = 3
NFS:
NFS:  Op = 53 (SEQUENCE)
NFS:  Status = 0 (NFS4_OK)
NFS:  Session ID hash = [3716]
NFS:      (16) 00000000FFFFFFFFD700000064000000
NFS:  Sequence ID = 1
NFS:  Slot ID = 0
NFS:  Highest slotid = 200
NFS:  Target highest slotid = 200
NFS:  status = 0
NFS:
NFS:  Op = 22 (PUTFH)
NFS:  Status = 0 (NFS4_OK)
NFS:
NFS:  Op = 38 (WRITE)
NFS:  Status = 10012 (NFS4ERR_LOCKED)
NFS:

I just fixed an issue in the encoding of the ds_filehandle, so perhaps I have it in reverse? And no, the decoding was okay. We need to look for NFS4ERR_LOCKED. So either locking code is getting to me or it is EAGAIN - which is bad to track.

Well, not so bad, we know it is seen during the WRITE.

Nothing was panning out, so I went to the webrev (http://cr.opensolaris.org/~tdh/perse/) and I'm suspicious of this code in usr/src/uts/common/fs/nfs/dserv_server.c :

578         /*
 579          * Use the MDS SID (from the file handle) to find the real data
 580          * server guid (zpool guid + id of the root pNFS object set).
 581          */
 582         for (sid_map = list_head(&inst->dmi_mds_sids); sid_map != NULL;
 583             sid_map = list_next(&inst->dmi_mds_sids, sid_map)) {
 584                 if ((mds_sid.len == sid_map->msm_mds_storid.len) &&
 585                     (memcmp(mds_sid.val, sid_map->msm_mds_storid.val,
 586                     sid_map->msm_mds_storid.len) == 0)) {
 587                         found_mds_sid = 1;
 588                         ds_guid = sid_map->msm_ds_guid;
 589                         break;
 590                 }
 591         }
 592 
 593         /*
 594          * If we have no record of the given MDS SID it may mean that
 595          * we haven't been able to do the REPORTAVAIL for this particular
 596          * resource.  Therefore, just tell the client to try again later.
 597          * In the future, we will attempt to ask the MDS for this information
 598          * via DS_MAP_MDSSID.
 599          */
 600         if (found_mds_sid != 1)
 601                 return (EAGAIN);

Until my changeset, the mds_sid was always 0s. And that piece of code was surrounded by an '#ifdef 0'. Great, I have my culprit!

I added some debug statements to show what the DS was seeing as it looped across the sid_map:

Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sid_map is NOT NULL
Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: fh[16] 312CE43624C45DB0000000000000001E
Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sm[16] 768279474C9ACF441E00000000000000
Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: fh[16] 312CE43624C45DB0000000000000001E
Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sm[16] B05DC42436E42C311E00000000000000
Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: no root objset found!

I'm very suspicious that the reply from the DS_REPORTAVAIL (MDS to DS) is xdr encoding the mds_sid and the one in the file handle is not being encoded. I think the fh mds_sid above is the same as the second sid_map one.

The root cause here was that the mds_sid used to be sent across as 0s. As such, when the MDS did an xdr encode and the DS did not do an xdr decode, it really didn't matter. And the mds_sid was actually ignored on the DS.

Since we don't encode the mds_sid when we send it in the ds_filehandle, I decided to not encode it in the DS_REPORTAVAIL result. As such, we will be able to look at network traces and see if they match in both places.

This analysis took several hours. I keep on getting blind-sided by thinking existing code works correctly. I looked down several side paths I didn't mention (like the MDS treating the mds_sid array like a pointer and not an array of mds_sids).

I can also prove that I've fixed the bug, as I have a new one on the DS, which is downstream of the old one:

find_open_root_objset: sid_map is NOT NULL
find_open_root_objset: fh[16] D336969B13CB6C0E000000000000001E
find_open_root_objset: sm[16] 86FE860F0E0B87D4000000000000001E

panic[cpu1]/thread=ffffff01d7880000: BAD TRAP: type=e (#pf Page fault) rp=ffffff0007b3a0e0 addr=100428 occurred in module "genunix" due to an illegal access to a user address
...
ffffff0007b3a1e0 genunix:list_head+10 ()
ffffff0007b3a230 nfssrv:find_open_mdsfs_objset+40 ()
ffffff0007b3a6e0 nfssrv:dserv_nnode_data_getobjset+158 ()
ffffff0007b3a750 nfssrv:dserv_nnode_io_prep+85 ()
ffffff0007b3a7a0 nfssrv:nnop_io_prep+41 ()
ffffff0007b3a960 nfssrv:mds_op_write+134 ()
ffffff0007b3a9b0 nfssrv:rfs41_op_dispatch+72 ()

But it is late and I don't know if I'll blog about finding out what this one does...


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Trackback URL: http://blogs.sun.com/tdh/entry/hard_day_of_debugging_the
Comments:

Post a Comment:

Name:
E-Mail:
URL:

Your Comment:

HTML Syntax: NOT allowed