One of the hardest things about developing software is testing it. As a developer, you get too close to the code and what you expect it to do. So I was very happy to get a question in email about why was a set of policies not working correctly:
I tried to reset the SPE policies to the following:
# cat /etc/policies.spe
101, 1, 8k, DS2, path == /diskpool/JUNK/TEST/P1
102, 2, 4k, DS1, path == /diskpool/JUNK/TEST/P2
103, 3, 2k, mix3, path == /diskpool/JUNK/TEST/P3
104, 4, 1k, mix4, path == /diskpool/JUNK/TEST/P4
#105, 5, 128k, default, path == /diskpool/JUNK/TEST/P5 || path == /export/JUNK
111, 5, 32k, default, path == /diskpool/JUNK/TEST/P5
112, 5, 64k, default, path == /export/JUNK
and have the client uses the path to verify the stripe. For stripe 1,2,3
things seem working OK; however when I tried with strip 4 or 5 with
the policy above, I got "nfsstat -l" to list stripe 10 and count 32k:
pnfs-minipit1-3:~#
pnfs-minipit1-3:~# /mntpit1-3: mount -o vers=4 pnfs-minipit1-5:/diskpool/JUNK/TEST/P5 /mnt
pnfs-minipit1-3:~# nfsstat -m /mnt
/mnt from pnfs-minipit1-5:/diskpool/JUNK/TEST/P5
Flags: vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=1048576,wsize=1048576,retrans=5,timeo=600
Attr cache: acregmin=3,acregmax=60,acdirmin=30,acdirmax=60
pnfs-minipit1-3:~# mkfile -v 1m /mnt/kk
/mnt/kk 1048576 bytes
pnfs-minipit1-3:~# nfsstat -l /mnt/kk
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 33
Layout [0]:
Layout obtained at: Sun Aug 30 00:23:47:199247 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 10, stripe unit: 32768
Stripe [0]:
tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
Stripe [1]:
tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
Stripe [2]:
tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
Stripe [3]:
tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
Stripe [4]:
tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
Stripe [5]:
tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
Stripe [6]:
tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
Stripe [7]:
tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
Stripe [8]:
tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
Stripe [9]:
tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
pnfs-minipit1-3:~#
What did I do wrong?
Before we even look at the issue, what did the person asking for help do right
In short, I'm confident that the person took all reasonable steps to solve the problem on their own.
So, now let's try to figure out what is going on. I have a DTrace script which will help out here. Ouch, I should have shared that earlier. Anyway, spe.d should help. I'll start it up and monitor the results as I do a file create:
[root@pnfs-minipit1-5 ~]> ./spe.d dtrace: script './spe.d' matched 8 probes CPU ID FUNCTION:NAME 1 3420 nfs41_spe_allocate:spe-i-check_open 200096 (10) from 10.1.233.191 is checking /diskpool/JUNK/TEST/P5/foollu2 1 3419 nfs41_spe_allocate:spe-i-policy_eval Policy 101 did not match with error 0 from 10.1.233.191 1 3419 nfs41_spe_allocate:spe-i-policy_eval Policy 102 did not match with error 0 from 10.1.233.191 1 3419 nfs41_spe_allocate:spe-i-policy_eval Policy 103 did not match with error 0 from 10.1.233.191 1 3419 nfs41_spe_allocate:spe-i-policy_eval Policy 104 did not match with error 0 from 10.1.233.191 1 3419 nfs41_spe_allocate:spe-i-policy_eval Policy 111 matched with error 0 from 10.1.233.191 1 63564 mds_ds_path_to_mds_sid:return mds_sid[0] = pnfs-minipit1-6:pNFSpool1/p1DS2 1 63564 mds_ds_path_to_mds_sid:return mds_sid[1] = pnfs-minipit1-6:pNFSpool2/p2DS2 1 63564 mds_ds_path_to_mds_sid:return mds_sid[2] = pnfs-minipit1-6:pNFSpool3/p3DS2 1 63564 mds_ds_path_to_mds_sid:return mds_sid[3] = pnfs-minipit1-7:pNFSpool1/p1DS1 1 63564 mds_ds_path_to_mds_sid:return ERROR - could not find a matching pgi for pnfs-minipit1-7:pNFSpool1/p2DS1 1 56851 nfs41_spe_allocate:return No matching policy
So we did find a matching policy, but we only found 5 datasets and the policy requires 5. So, why isn't that dataset in memory?
[root@pnfs-minipit1-5 ~]> echo "::walk mds_DS_guid_entry_cache|::print struct rfs4_dbe dbe_data|::print ds_guid_info_t ds_dataset_name.utf8string_val|::eval /s" | mdb -k 0xffffff01e73e5fc0: pnfs-minipit1-6:pNFSpool3/p3DS2�����f 0xffffff01e7227078: pnfs-minipit1-6:pNFSpool3/p3DS1�����f 0xffffff01e7227120: pnfs-minipit1-6:pNFSpool2/p2DS2�����f 0xffffff01e72271c8: pnfs-minipit1-6:pNFSpool2/p2DS1�����f 0xffffff01e7227270: pnfs-minipit1-6:pNFSpool1/p1DS2�����f 0xffffff01e7227318: pnfs-minipit1-6:pNFSpool1/p1DS1�����f 0xffffff01e5d62e70: pnfs-minipit1-7:pNFSpool2/p2DS2�����f 0xffffff01e6597628: pnfs-minipit1-7:pNFSpool2/p2DS1�����f 0xffffff01e4e40468: pnfs-minipit1-7:pNFSpool1/p1DS2�����f 0xffffff01e5f84cb8: pnfs-minipit1-7:pNFSpool1/p1DS1�����f [root@pnfs-minipit1-5 ~]> echo "::walk mds_DS_guid_entry_cache|::print struct rfs4_dbe dbe_invalid" | mdb -k dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0
Hmm, I need to figure out a better way to process UTF8 stings in mdb, but it appears to be there and in a valid entry. I couldn't figure this out until I lined up the dataset name output:
pnfs-minipit1-7:pNFSpool2/p2DS1 pnfs-minipit1-7:pNFSpool1/p2DS1
The naming convention shows the error, 'p2DS1' should be on pool #2, but it is on pool #1: pNFSpool1.
If this were implemented in SMF instead of a as a flat file, we would be able to do checking when we create the policy to catch this type of configuration error. But for now, DTrace and mdb can catch it.
For me, getting mdb to dump the dataset names was the hardest part. I had to experiment and search the web. I'm glad for all of the effort here, I'll use that trick again and again.
Removes are failing and I don't think this is the classic issue we have had with this operation. I know that at the last BAT, the removes were succeeding enough to cause a panic. And we don't seem to be getting to that code even.
Snoop shows the client sending the remove, but the MDS is not sending anything to the DS. Barring the MDS sending to the wrong DS, I've put together a simple DTrace script to see what is going on:
[root@pnfs-17-24 ~]> more remove.d
#!/usr/sbin/dtrace -s
::do_ctl_mds_remove:return
{
printf("rc = %d", args[1]);
}
::ctl_mds_clnt_remove_file:return
{
printf("rc = %d", args[1]);
}
:::nfss-e-vop_fid_pseudo_failed
{
}
:::nfss-i-layout_is_null_cannot_remove
{
}
::mds_op_remove:entry
{
self->resop = (nfs_resop4 *)arg1;
self->resp = self->resop->nfs_resop4_u.opremove;
}
::mds_op_remove:return
{
printf("rc = %d", self->resp.status);
}
I don't like the hoops I have to go through to get the status on mds_op_remove, but I have to do that because that function is of type void. I'm going to change them to return a status. It doesn't cost much and makes simple DTrace scripting easier.
Anyway, here is our result:
[root@pnfs-17-24 ~]> ./remove.d dtrace: script './remove.d' matched 6 probes CPU ID FUNCTION:NAME 0 30503 do_ctl_mds_remove:nfss-e-vop_fid_pseudo_failed 0 62050 do_ctl_mds_remove:return rc = 28 0 62052 mds_op_remove:return rc = 0 ^C
What is that 28?
[thud@ultralord nfssrv]> grep 28 /usr/include/sys/errno.h #define ENOSPC 28 /* No space left on device */
Now I want to know what vop_fid_pseudo() is returning, but only when called in this context. I can use the thread properties like this:
::vop_fid_pseudo:entry
/self->live == 1/
{
self->vp = (vnode_t *)arg0;
self->fidp = (fid_t *)arg1;
}
::vop_fid_pseudo:return
/self->live == 1/
{
printf("fid len = %d\trc = %d",
self->fidp->un._fid.len, args[1]);
}
Only if self->live is set will I see output:
[root@pnfs-17-24 ~]> ./remove.d dtrace: script './remove.d' matched 8 probes CPU ID FUNCTION:NAME 0 62468 vop_fid_pseudo:return fid len = 10 rc = 28 0 30503 do_ctl_mds_remove:nfss-e-vop_fid_pseudo_failed 0 62050 do_ctl_mds_remove:return rc = 28 0 62052 mds_op_remove:return rc = 0 ^C
And now I need to know where that error is set:
49 vop_fid_pseudo(vnode_t *vp, fid_t *fidp)
50 {
51 struct vattr va;
52 int error;
53
54 error = VOP_FID(vp, fidp, NULL);
55
56 /*
57 * XXX nfs4_fid() does nothing and returns EREMOTE.
58 * XXX nfs3_fid()/nfs_fid() returns nfs filehandle as its fid
59 * which has a bigger length than local fid.
60 * NFS_FH4MAXDATA is the size of
61 * fhandle4_t.fh_xdata[NFS_FH4MAXDATA].
62 *
63 * Note: nfs[2,3,4]_fid() only gets called for diskless clients.
64 */
65 if (error == EREMOTE ||
66 (error == 0 && fidp->fid_len > NFS_FH4MAXDATA)) {
67
68 va.va_mask = AT_NODEID;
69 error = VOP_GETATTR(vp, &va, 0, CRED(), NULL);
70 if (error)
71 return (error);
72
73 fidp->fid_len = sizeof (va.va_nodeid);
74 bcopy(&va.va_nodeid, fidp->fid_data, fidp->fid_len);
75 return (0);
76 }
77
78 return (error);
79 }
And that means I need to know what VOP_GETATTR() is returning. I'll pull the same trick as before, but I'll need to be looking at, well, I have to figure that out now, don't I?
The trick is to let the vnode tell me:
int
fop_getattr(
vnode_t *vp,
...
err = (*(vp)->v_op->vop_getattr)(vp, vap, flags, cr, ct);
So I need to find the pointer to the function:
printf("vop_getattr = %p", self->vp->v_op->vop_getattr);
And then I need to find out what that is:
0 62467 vop_fid_pseudo:entry vop_getattr = fffffffff883c180 ... [root@pnfs-17-24 ~]> mdb -k Loading modules: [ unix genunix specfs dtrace mac cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs mpt sd sockfs ip hook neti sctp arp usba stmf fctl nca lofs idm cpc random zfs nfs fcip logindmux ptm sppp ] > fffffffff883c180::dis zfs_getattr: pushq %rbp
Okay, we at least know which filesystem is complaining. I probably could have done this differently, but this way works. The only error in that function comes from zfs_zaccess():
0 53373 zfs_zaccess:return rc = 0 0 53125 zfs_getattr:return rc = 0 0 53125 zfs_getattr:return rc = 0
Which means I was reading the code incorrectly! Note that these are being called by the code, my check forces that to be true, but the error must be elsewhere. It must be the VOP_FID() which is causing the issue!
4323 static int
4324 zfs_fid(vnode_t *vp, fid_t *fidp, caller_context_t *ct)
4325 {
4326 znode_t *zp = VTOZ(vp);
4327 zfsvfs_t *zfsvfs = zp->z_zfsvfs;
4328 uint32_t gen;
4329 uint64_t object = zp->z_id;
4330 zfid_short_t *zfid;
4331 int size, i;
4332
4333 ZFS_ENTER(zfsvfs);
4334 ZFS_VERIFY_ZP(zp);
4335 gen = (uint32_t)zp->z_gen;
4336
4337 size = (zfsvfs->z_parent != zfsvfs) ? LONG_FID_LEN : SHORT_FID_LEN;
4338 if (fidp->fid_len < size) {
4339 fidp->fid_len = size;
4340 ZFS_EXIT(zfsvfs);
4341 return (ENOSPC);
4342 }
Some quick DTrace will tell me I'm correct:
::zfs_fid:entry
/self->live == 1/
{
zp = (znode_t *)(self->vp)->v_data;
zfsvfs = zp->z_zfsvfs;
lfid = sizeof (zfid_long_t) - sizeof (uint16_t);
sfid = sizeof (zfid_short_t) - sizeof (uint16_t);
size = (zfsvfs->z_parent != zfsvfs) ? lfid : sfid;
printf("fidlen %s size! (%d, %d, %d, %d)",
self->fidp->un._fid.len < size ? "<" : "not <",
self->fidp->un._fid.len, size, lfid, sfid);
}
...
0 53162 zfs_fid:entry fidlen < size! (0, 10, 20, 10)
0 53163 zfs_fid:return rc = 28
So is this new? Or something I introduced? I don't see how the code ever worked here. We pass a fid_t off the stack, i.e., uninitialized, into vop_fid_pseudo() and when we call zfs_fid(), that check should trigger every time. The comments on lines 57-61 lead me to think that vop_fid_pseudo() is not intended for types other than NFS. In any event, the other callers of it do:
makefh4(nfs_fh4 *fh, vnode_t *vp, struct exportinfo *exi)
...
fid_t fid;
...
bzero(&fid, sizeof (fid));
fid.fid_len = MAXFIDSZ;
And the file should have been removed!
0 67453 zfs_fid:return rc = 0 0 81693 vop_fid_pseudo:entry vop_getattr = fffffffff8088180 0 67452 zfs_fid:entry fidlen not < size! (64, 10, 20, 10) 0 67453 zfs_fid:return rc = 0 0 81694 vop_fid_pseudo:return fid len = 10 rc = 0
From the snoop we can both see the MDS communicating with the DS and that I need to go teach snoop about a remove!
CTL-MDS: ----- Sun CTL-MDS ----- CTL-MDS: CTL-MDS: Proc = 10 (Remove object(s) or entire fsid at the DS) CTL-MDS:
While I fixed a valid bug (and we are past the code that was panicking at the BAT, I doubt I fixed the problem we had been seeing:
Before a huge copy:
[root@pnfs-17-22 ~]> zfs list NAME USED AVAIL REFER MOUNTPOINT pnfs1 9.35G 6.28G 23K /pnfs1
After a huge copy:
[root@pnfs-17-22 ~]> zfs list NAME USED AVAIL REFER MOUNTPOINT pnfs1 12.5G 3.09G 23K /pnfs1
And after the delete:
[root@pnfs-17-22 ~]> zfs list NAME USED AVAIL REFER MOUNTPOINT pnfs1 12.5G 3.09G 23K /pnfs1
Note, the DS did report that it tried to delete the file:
[root@pnfs-17-22 ~]> ./remove.d dtrace: script './remove.d' matched 2 probes CPU ID FUNCTION:NAME 0 62434 ctl_mds_srv_remove:return status = 0 1 62434 ctl_mds_srv_remove:return status = 0 1 62434 ctl_mds_srv_remove:return status = 0 0 62434 ctl_mds_srv_remove:return status = 0 1 62434 ctl_mds_srv_remove:return status = 0 1 62434 ctl_mds_srv_remove:return status = 0 0 62434 ctl_mds_srv_remove:return status = 0
Amazingly enough, I had forgotten I was running that DTrace script. So the issue is that ctl_mds_srv_remove() reports success, but the file contents are still there.
I'm starting to wonder if I haven't hit my head on a known ZFS bug that is fixed in 118. The pNFS code is based on 117. I'll investigate.
So, my unit testing is all done and it is time to integrate. Bzzt! Wrong answer - we have to do regression testing. In our case, it is called pNFS/miniPIT and was mainly crafted by Helen Chao. And it was bombing on my new code.
First of all, the core was confusing:
[root@pnfs-minipit2-4 ~]> panic[cpu0]/thread=30008a61560: 000000047ad BAD TRAP: type=31 rp=2a100f596d0 addr=14 mmu_fsr=0 occurred in module "nfssrv" due to a NULL pointer dereference nfsd: trap type = 0x31 addr=0x14 pid=100528, pc=0x7afb3d64, sp=0x2a100f58f71, tstate=0x4414001604, context=0x29b g1-g7: 7be42af4, 2, 2000, 0, 0, 29b, 30008a61560
The NULL pointer wasn't anything to do with the function being called. Secondly, I focused on trying to reproduce it inside the test harness - it wasn't until I made a simple test case that I progressed.
Third, I focused on the server being 32 bit instead of the real difference - it was not setup as a MDS - therefore it had no DSes reporting to it. I'll show why that is important:
4463 error = nfs41_spe_allocate(&spe_va, claddr,
4464 vp->v_path, &lc, TRUE);
4465 if (error) {
4466 /*
4467 * XXX: Until we get the SMF code
4468 * in place, we handle all errors by
4469 * using the default layout of the
4470 * old prototype code
4471 *
4472 * At that point, we should return the
4473 * given error.
4474 *
4475 * XXX: Any way *plo is NULL here?
4476 */
4477 *plo = mds_gen_default_layout(cs->instp);
4478
4479 /*
4480 * Record the layout, don't get bent out of shape
4481 * if it fails, we'll try again at checkstate time.
4482 */
4483 (void) mds_put_layout(*plo, vp);
4484
4485 return (NFS4_OK);
4486 }
4487
4488 /*
4489 * XXX: Any way *plo is NULL here?
4490 */
4491 *plo = mds_add_layout(&lc);
The fourth, and major problem, is that I had valid comments and I ignored them. Yes, it was certainly valid for the *plo to be NULL after 4477 but probably not 4491. But they are both easy enough to code for!
The issue is that a NFSv4.1 server, i.e., no pNFS, also goes through this code path. We need to handle not having a layout:
4474 * At that point, we should return the
4475 * given error.
4476 */
4477 *plo = mds_gen_default_layout(cs->instp);
4478 if (*plo == NULL) {
4479 status = NFS4ERR_LAYOUTUNAVAILABLE;
4480 } else {
4481 /*
4482 * Record the layout, don't get
4483 * bent out of shape if it fails,
4484 * we'll try again at checkstate time.
4485 */
4486 (void) mds_put_layout(*plo, vp);
4487 }
4488
4489 return (status);
Which appears okay, except with just this change, files will not be created. We want to return NFS4ERR_LAYOUTUNAVAILABLE, mainly for DTrace probing, but in the caller, we want to do:
4892 } else {
4893 status = mds_createfile_get_layout(req, vp, cs, &ct, plo);
4894
4895 /*
4896 * Allow mds_createfile_get_layout() to be verbose
4897 * in what it presents as a status, but be aware
4898 * that it is permissible to not generate a
4899 * layout.
4900 */
4901 if (status == NFS4ERR_LAYOUTUNAVAILABLE) {
4902 status = NFS4_OK;
4903 }
4904 }
I've been fixing up a refcnt bug I found in recall_read_delegations(). One of the things I found out while in Austin was how to unit test those changes. We need a NFSv4.0 and a NFSv4.1 client both reading a file from a NFSv4.1 server. Note that we can't have a pNFS server, due to we do not yet support a mount option to force a 4.1 over a 4.0 mount. And we may never.
Anyway, you have two clients with a read delegation and force a recall by touching the files from another client.
I know it is a good test case because the server consistently crashes:
panic[cpu0]/thread=ffffff02d457db40: kernel heap corruption detected ffffff001054de90 genunix:kmem_error+4a9 () ffffff001054deb0 genunix:kmem_free+d0 () ffffff001054df30 nfssrv:recall_read_delegations+8e8 () ffffff001054df90 nfssrv:deleg_rd_setattr+45 () ffffff001054e000 genunix:vnext_setattr+84 () ffffff001054e060 nfssrv:deleg_rd_setattr+66 () ffffff001054e0e0 genunix:vhead_setattr+9f () ffffff001054e140 genunix:fop_setattr+ad ()
Firefox rules! My UPS just failed and I thought I lost all of this. Yeah Firefox for saving the day!
I added DTrace probes, printf() statements when the probes failed to fire, and finally delays in the code. I couldn't see anything. I finally decided to remove all of that code and worked on cleaning the original code up. There was a rfs4_file_t pointer (fp), array of pointers (fpa), and last active pointer (fap). I was getting confused.
I fixed all that and while the panic was still occurring, the code was easier to read. I finally started adding some ASSERTs and nailed down the problem!
If we look at the original code:
89 /*
90 * Is there more than one file structure for this vp?
91 * Get the vsd for each instance of the server, if it exists.
92 */
93 fpa = kmem_zalloc((sizeof (rfs4_file_t *) * nsi_count), KM_SLEEP);
94 DTRACE_NFSV4_1(rrd__i__fpa_alloc, int, nsi_count);
95
96 fpa[0] = fp;
97 cnt = 1;
98 mutex_enter(&vp->v_lock);
99 for (instp = list_head(&nsi_head); instp != NULL;
100 instp = list_next(&nsi_head, &instp->nsi_list)) {
101 fpa[cnt] = (rfs4_file_t *)vsd_get(vp, instp->vkey);
102 if (fpa[cnt] && (fpa[cnt] != fp))
103 cnt++;
104 }
105 mutex_exit(&vp->v_lock);
I said that before 101, it should be the case that cnt is less than nsi_count. I added an ASSERT and it triggered right away. I looked back at the code and while I still thought the ASSERT was valid, I saw why it could trigger in a good code path. Consider a case where the number of items in the list is greater than nsi_count. Then consider that we immediately fill in the remaining items in fpa, leaving some items in the list. We then read an item, but we don't mean to store it.
That is where the bug actually comes in I believe in the author's mind. Either there never is more than nsi_count entries in the list or not considering there can be more valid ones. Another point here is that there is no way to guarantee that the first item returned in the list is fp.
My fix is this:
103 for (instp = list_head(&nsi_head); instp != NULL;
104 instp = list_next(&nsi_head, &instp->nsi_list)) {
105 rfs4_file_t *temp;
106
107 temp = (rfs4_file_t *)vsd_get(vp, instp->vkey);
108 if (temp && (temp != fp)) {
109 ASSERT(cnt < nsi_count);
110 fpa[cnt++] = temp;
111 }
112 }
It works as a great defensive fix, but I'm concerned it misses the bigger issue of we should either count the items first in the list for the allocation or we should determine why we have more entries than we expect. I'll have to confer with my colleague once he gets back from vacation...
The code walk through in Austin was great for me - I ended up resolving all of my open issues and several bugs were found. I need to do another round of unit testing and then start running our regression test suite.
I also got some great help in putting the spe daemon into smf:
[root@pnfs-17-24 nfs]> svcadm enable svc:/network/nfs/spe
[root@pnfs-17-24 nfs]> ps -ef | grep spe
root 100882 100854 0 11:02:47 pts/1 0:00 grep spe
root 100880 100004 0 11:02:30 ? 0:00 /usr/lib/nfs/sped
Not sure I have the dependencies right, but that may be because I have such a horribly hacked together test rig. I'll have to do a fresh install to make sure things start right the first time.
I'll be in Austin this week getting approval for the adoption of mds_sids to manage devices, layouts (in core and on disk), and file handles. Oh, and to add kspe!
For those of you interested, the code is over here: http://cr.opensolaris.org/~tdh/perse/. I've removed the spe debug printfs and added DTRACE probes as appropriate.
On my todo list is to perhaps remove some debug DTrace probes (don't see a real need for in shipping code, kinda just kernel printfs) and to determine whether kspe lives in the nfs or nfssrv module. I made a half-hearted attempt to stuff it back into the nfs module and got reminded why I had it in nfssrv in the first place. I.e., need to get at mds_sid info and to abstract the kspe globals such that we aren't in the way of zonification. That last is the kicker - everything in the nfs module needs to know everything about kspe and it gets messy.
The memleaks on the DSes are plugged:
> ::findleaks
CACHE LEAKED BUFCTL CALLER
ffffff01d459d5a0 1 ffffff01eae60088 cralloc_flags+0x21
ffffff01c682a860 1 ffffff01d75d65e8 dserv_mds_do_reportavail+0x210
ffffff01c68262e0 4 ffffff01f3c3f1c0 mds_compound+0x54
ffffff01c682b2e0 3 ffffff01ecc0e6e8 mds_compound+0x193
ffffff01c6828020 2 ffffff02032b4d30 mds_get_server_impl_id+0x30
ffffff01c68262e0 2 ffffff01f3c3f0e8 mds_get_server_impl_id+0x58
ffffff01c6826b20 2 ffffff01dfb46cc0 mds_get_server_impl_id+0x8a
ffffff01c6828860 1 ffffff01d7ba0740 modinstall+0x129
ffffff01c682b2e0 1 ffffff01ddfd1b60 modinstall+0x129
ffffff01c6828860 1 ffffff01d7ba0c50 modinstall+0x129
ffffff01c68265a0 2 ffffff01e159f890 tohex+0x32
ffffff01c6828020 2 ffffff01e3e6e398 xdr_array+0xae
ffffff01c68282e0 1 ffffff0201571818 xdr_bytes+0x70
ffffff01c68285a0 1 ffffff02590b2538 xdr_bytes+0x70
------------------------------------------------------------------------
Total 24 buffers, 2356 bytes
And on the MDS, we see the same:
> ::findleaks
CACHE LEAKED BUFCTL CALLER
ffffff01c68262e0 4 ffffff01d6e497b8 mds_compound+0x54
ffffff01c682b2e0 3 ffffff01d69cdcb0 mds_compound+0x193
ffffff01c6828020 1 ffffff0202d71db8 mds_get_server_impl_id+0x30
ffffff01c6828020 1 ffffff03ea11d4c0 mds_get_server_impl_id+0x30
ffffff01c68262e0 2 ffffff01ebc1a008 mds_get_server_impl_id+0x58
ffffff01c6826b20 2 ffffff0202808588 mds_get_server_impl_id+0x8a
ffffff01c68265a0 2 ffffff01f92d2458 tohex+0x32
ffffff01c6828020 2 ffffff0202d71c08 xdr_array+0xae
ffffff01c68285a0 1 ffffff07c864e128 xdr_bytes+0x70
ffffff01c68282e0 1 ffffff02027ff390 xdr_bytes+0x70
------------------------------------------------------------------------
Total 19 buffers, 1496 bytes
Besides that nasty interaction on the DS XDR code, we can see that I've fixed the rpc_init_taglist() leak on both systems. BTW - I thought that one may have been in Nevada, but I checked and it is only in the nfs41-gate. Sweet, that means I don't have to backport it into a gate which would cause another week of testing.
The code is ready to go, I have a code walk through next week, where I need to iron out the following issues:
Looks like my new code is not complete:
> ::findleaks
CACHE LEAKED BUFCTL CALLER
ffffff01c682a860 1 ffffff01d79523d0 dserv_mds_do_reportavail+0x210
ffffff01c68262e0 4 ffffff01ee2a9118 mds_compound+0x54
ffffff01c682b2e0 3 ffffff01e8721738 mds_compound+0x193
ffffff01c6828020 1 ffffff01f274dc00 mds_get_server_impl_id+0x30
ffffff01c68262e0 1 ffffff01e96acb40 mds_get_server_impl_id+0x58
ffffff01c6826b20 1 ffffff01e128de70 mds_get_server_impl_id+0x8a
ffffff01c6828860 1 ffffff01d87c66d8 modinstall+0x129
ffffff01c682b2e0 1 ffffff01ddb51748 modinstall+0x129
ffffff01c6828860 1 ffffff01d7f8f9b0 modinstall+0x129
ffffff01c6828020 1 ffffff01dd7db798 rpc_init_taglist+0x25
ffffff01c6828020 12741 ffffff01f180fdf8 rpc_init_taglist+0x25
ffffff01c6828020 1 ffffff01e3ede5e8 rpc_init_taglist+0x25
ffffff01c6828020 1 ffffff09ba2e4da0 rpc_init_taglist+0x25
ffffff01c6828020 23152 ffffff01e2e37cc8 rpc_init_taglist+0x25
ffffff01c68265a0 1 ffffff01ee74bd30 tohex+0x32
ffffff01c6828020 2 ffffff01d632b880 xdr_array+0xae
ffffff01c6828020 1 ffffff01fe11ec00 xdr_array+0xae
ffffff01c68282e0 1 ffffff01e5e4c2b0 xdr_bytes+0x70
ffffff01c68262e0 1659 ffffff01eaae3700 xdr_bytes+0x70
ffffff01c68285a0 1 ffffff01fe136de0 xdr_bytes+0x70
ffffff01c68262e0 1571800 ffffff01e7dde3b8 xdr_bytes+0x70
------------------------------------------------------------------------
Total 1609375 buffers, 26900440 bytes
> ffffff01e7dde3b8$<bufctl_audit
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
ffffff01e7dde3b8 ffffff01ea091bc8 4d6b30e7ab91 ffffff01d8245b40
ffffff01c68262e0 ffffff01c6b37000 ffffff01cc88be60
kmem_cache_alloc_debug+0x283
kmem_cache_alloc+0xa9
kmem_alloc+0xa3
xdr_bytes+0x70
xdr_mds_sid+0x21
xdr_ds_fh_v1+0x68
xdr_ds_fh+0x3f
xdr_decode_nfs41_fh+0xdd
xdr_snfs_argop4+0x5e
xdr_COMPOUND4args_srv+0xf4
svc_authany_wrap+0x22
svc_cots_kgetargs+0x41
dispatch_dserv_nfsv41+0x5d
svc_getreq+0x20d
svc_run+0x197
By the way, those leaks of 1 or 2, those are probably active memory when I forced the core.
So this is the second bug I claimed to have fixed earlier today. Of note is that we never saw a panic, so something at least is correct. And, I decided to fix the rpc_init_taglist bug while I am at it.
I'm going to need to add some DTrace to track down what is happening here...
Aargh! I say, aargh! nfs4_xdr.c belongs to the nfs module and not the nfssrv module. For quick turn around, I've only been rebuilding nfssrv and not the whole kernel. It was only when just changing nfs_xdr.c and trying a dmake in src/uts/intel/nfssrv that I noticed nothing happened. My code may be golden after all! If it compiles that is.
Okay, I did some other changes, but here is my compiling code:
4059 case OP_PUTFH: {
4060 nfs_fh4 *obj = &array[i].nfs_argop4_u.opputfh.object;
4061
4062 if (obj->nfs_fh4_val == NULL)
4063 continue;
4064
4065 DTRACE_NFSV4_1(xdr__i__op_putfh_version, uint32_t,
4066 minorversion);
4067 if (minorversion != 0) {
4068 struct mds_ds_fh *dsfh =
4069 (struct mds_ds_fh *)obj->nfs_fh4_val;
4070
4071 DTRACE_NFSV4_1(xdr__i__op_putfh_type,
4072 nfs41_fh_type_t, dsfh->type);
4073
4074 /*
4075 * Is it really a DS filehandle?
4076 */
4077 if (dsfh->type == FH41_TYPE_DMU_DS) {
4078 mds_sid *sid = &dsfh->fh.v1.mds_sid;
4079
4080 DTRACE_NFSV4_1(xdr__i__op_putfh_sid,
4081 mds_sid *, sid);
4082
4083 if (sid->val) {
4084 kmem_free(sid->val, sid->len);
4085 }
4086 }
4087 }
4088
4089 kmem_free(obj->nfs_fh4_val, obj->nfs_fh4_len);
4090 continue;
4091 }
And I added this simple DTrace script:
[root@pnfs-17-22 ~]> more ds.d
#!/usr/sbin/dtrace -s
nfsv4:::xdr-i-op_putfh_version
{
printf("xdr decode a FH -- version == %u",
(uint32_t)arg0);
}
nfsv4:::xdr-i-op_putfh_type
{
printf("xdr decode a FH -- type == %s",
(int)arg0 == 2 ? "DS" : "regular");
}
nfsv4:::xdr-i-op_putfh_sid
{
sid = (mds_sid *)arg0;
printf("xdr decode a FH -- sid == %s",
sid == NULL ? "(null)" : "valid");
}
Which shows:
[root@pnfs-17-22 ~]> ./ds.d dtrace: script './ds.d' matched 3 probes CPU ID FUNCTION:NAME 0 2834 xdr_snfs_argop4_free:xdr-i-op_putfh_version xdr decode a FH -- version == 1 0 2833 xdr_snfs_argop4_free:xdr-i-op_putfh_type xdr decode a FH -- type == DS 0 2832 xdr_snfs_argop4_free:xdr-i-op_putfh_sid xdr decode a FH -- sid == valid 0 2834 xdr_snfs_argop4_free:xdr-i-op_putfh_version xdr decode a FH -- version == 1 0 2833 xdr_snfs_argop4_free:xdr-i-op_putfh_type xdr decode a FH -- type == DS 0 2832 xdr_snfs_argop4_free:xdr-i-op_putfh_sid xdr decode a FH -- sid == valid
But I still have to check back later to see if there are memory leaks!
I've been trying to show how you would use kmdb and ::findleaks to track down memory leaks. You need to do this with XDR code, even the machine generated stuff. You also need to do it before you integrate and not after. I've fixed two leaks that were pre-existing. They would probably go until either someone had a regression test session flunk because of accumulated memory leaks (the mds_sid leaks would do it) or we sat down to find them before shipping code.
The other thing about memory leaks is that you have to test after you fix them, you might find more, find out your fix didn't work, or find out your fix uncovered others.
And perhaps it is time to remind you of my other disclaimer, I don't hide my braindead mistakes. I show them in hopes that someone can learn from them - even if it is just me. :->
Once we deliver the pNFS server, you really, really would not want to do what I am about to show you. But, I've got orphaned data on my DSes and I need to clean them up. Before my latest server changes enabled on disk layouts, you only had to make changes on the DS, but now you will also have to nuke some state on the MDS.
First we want to disable any traffic from the clients. I could do that by unmounting all shares or rebooting them.
Second, we want to reboot the MDS - that will cause all state to either be flushed to disk or nuke it out of memory.
Third, on each DS, I want to do:
[root@pnfs-17-22 pnfs-17-22]> dservadm liststor
storage:
pnfs1/ds1
pnfs2/ds2
pnfs1/ds8
[root@pnfs-17-22 pnfs-17-22]> dservadm dropstor pnfs1/ds1 [root@pnfs-17-22 pnfs-17-22]> dservadm dropstor pnfs2/ds2 [root@pnfs-17-22 pnfs-17-22]> dservadm dropstor pnfs1/ds8
[root@pnfs-17-22 pnfs-17-22]> zfs list NAME USED AVAIL REFER MOUNTPOINT pnfs1 15.6G 0 23K /pnfs1 pnfs1/ds1 7.89G 0 15K - pnfs1/ds1/DB41524C083AB6A2 3.49G 0 3.49G - pnfs1/ds1/EF25FBCF08FA1BD7 4.39G 0 4.39G - pnfs1/ds8 7.74G 0 15K - pnfs1/ds8/DB41524C083AB6A2 3.48G 0 3.48G - pnfs1/ds8/EF25FBCF08FA1BD7 4.26G 0 4.26G - pnfs1/scratch 21K 0 21K /pnfs1/scratch pnfs2 7.89G 12.0G 22K /pnfs2 pnfs2/ds2 7.89G 12.0G 15K - pnfs2/ds2/DB41524C083AB6A2 3.59G 12.0G 3.59G - pnfs2/ds2/EF25FBCF08FA1BD7 4.30G 12.0G 4.30G - pnfs2/scratch 21K 12.0G 21K /pnfs2/scratch
We are looking for unamed filesystems under our datasets.
[root@pnfs-17-22 pnfs-17-22]> zfs destroy pnfs1/ds1/DB41524C083AB6A2 ... [root@pnfs-17-22 pnfs-17-22]> zfs destory pnfs2/ds2/EF25FBCF08FA1BD7
Note, if you get the following:
[root@pnfs-17-23 ~]> zfs destroy pnfs1/ds3/DB41524C083AB6A2 cannot destroy 'pnfs1/ds3/DB41524C083AB6A2': dataset is busy
Then you need to first reboot the DS!
Once all DSes have been scrubbed, it is time to scrub the MDS. We need to remove all of the layouts which refer to files which are no longer there!
[root@pnfs-17-24 layouts]> cd /var/nfs/v4_state/layouts/ [root@pnfs-17-24 layouts]> ls -la total 16 drwxr-xr-x 5 daemon daemon 512 Jul 29 03:17 . drwxr-xr-x 5 daemon daemon 512 Jul 23 16:30 .. drwxr-xr-x 2 daemon daemon 2560 Jul 31 01:27 2d90003 drwxr-xr-x 2 daemon daemon 1536 Jul 28 20:43 2d90005 drwxr-xr-x 2 daemon daemon 1024 Jul 31 01:27 2d90006 [root@pnfs-17-24 layouts]> rm -rf 2d*
At this point, you want to reboot the MDS. Why? Why not!
And then you want to re-enable the datasets on the DSes:
[root@pnfs-17-22 pnfs-17-22]> dservadm addstor pnfs1/ds1 [root@pnfs-17-22 pnfs-17-22]> dservadm addstor pnfs2/ds2 [root@pnfs-17-22 pnfs-17-22]> dservadm addstor pnfs1/ds8
And then, you guessed it, reboot the DSes. Actually this one serves a purpose - I don't think the DS understands to send a DS_REPORTAVAIL when new storage is added. So this will force a clean state!
We've got a bug in our remove code - I didn't introduce it in my recent set of changes, we can just say it has always been there. The bug is that when we remove a file, we eventually fill up the dataset on the DSes.
I'm going to want to fix it, mainly because I want to get my head around how we are going to examine files on the DSes.
By this, I mean that the inode information lives on the MDS. You can browse the directory structure on it and see what files are there. You can't do that on the DS. The DS may have the concept of an inode, but it doesn't care at all about directory structure, file names, etc. The first file may have a layout of 4 and the second a layout of 8. And it may be the case that those files only intersect on this dataset.
To find dangling files, conceptually we need to run scanners on both the MDS and the DSes. For each file file on the MDS, we can look at the odl and check to see if there is a corresponding file on the DS. On the DS, we have to make sure that the MDS has that file existing and the dataset is in the corresponding layout.
Scratch that, we just need a scanner on the DS - I'm starting to describe a parallelized fsck. So, on the DS, we scan the datasets and for each file we find, we check with the MDS:
A colleague just pointed out 6792701 Removing large holey file does not free space which I like as a culprit for a couple of reasons:
But I'm not sure that they are holey. Our server sets NFL4_UFLG_DENSE, which should inform the clients to not create holes.
In any event, the bug will give me some clues as to how to triage the issue.
Okay, my fix for the memory leaks on the MDS went fine. Now I've got a problem on the DS:
> ::findleaks
CACHE LEAKED BUFCTL CALLER
ffffff01c682a860 1 ffffff01d49a0068 dserv_mds_do_reportavail+0x210
ffffff01c68262e0 520364 ffffff01ea054458 dserv_nnode_from_fh_ds+0x5b
ffffff01c68262e0 1258470 ffffff01f3ed5db0 dserv_nnode_from_fh_ds+0x5b
ffffff01c68262e0 1343031 ffffff01f3ed5e88 dserv_nnode_from_fh_ds+0x77
ffffff01c68262e0 435756 ffffff01ee4b65f8 dserv_nnode_from_fh_ds+0x77
ffffff01c68262e0 1 ffffff01de76e558 mds_compound+0x54
ffffff01c68262e0 3 ffffff01e83df668 mds_compound+0x54
ffffff01c682b2e0 4 ffffff01d99cea30 mds_compound+0x193
ffffff01c6828020 2 ffffff01dcd9f8f8 mds_get_server_impl_id+0x30
ffffff01c68262e0 2 ffffff01dd4350d8 mds_get_server_impl_id+0x58
ffffff01c6826b20 2 ffffff01d7153290 mds_get_server_impl_id+0x8a
ffffff01c6828860 1 ffffff01d8805120 modinstall+0x129
ffffff01c6828860 1 ffffff01d7fe7a98 modinstall+0x129
ffffff01c682b2e0 1 ffffff01ddd5a870 modinstall+0x129
ffffff01c6828020 1 ffffff0f82dd4b68 rpc_init_taglist+0x25
ffffff01c6828020 20145 ffffff020b8337a0 rpc_init_taglist+0x25
ffffff01c6828020 1 ffffff01d613e538 rpc_init_taglist+0x25
ffffff01c6828020 11460 ffffff01e20b1de8 rpc_init_taglist+0x25
ffffff01c6828020 1 ffffff0f82de32e0 rpc_init_taglist+0x25
ffffff01c6828020 1 ffffff0f82de1708 rpc_init_taglist+0x25
ffffff01c6828020 2 ffffff01ddda3018 rpc_init_taglist+0x25
ffffff01c68265a0 2 ffffff01dc8f41b0 tohex+0x32
ffffff01c6828020 3 ffffff01dcda06c0 xdr_array+0xae
ffffff01c6828020 1 ffffff01df04e7f8 xdr_array+0xae
ffffff01c68262e0 1083945 ffffff01edde37b0 xdr_bytes+0x70
ffffff01c68282e0 1 ffffff01f31f9e60 xdr_bytes+0x70
ffffff01c68285a0 1 ffffff01e1798598 xdr_bytes+0x70
ffffff01c68262e0 694948 ffffff01f3ed5c00 xdr_bytes+0x70
------------------------------------------------------------------------
Total 5368151 buffers, 86904136 bytes
Even though we still see a high number of leaks in xdr_bytes(), which we expect to indicate that the mds_sid is not getting cleaned up, I'll argue that this is to be expected with the high number of leaks in dserv_nnode_from_fh_ds(). I.e., if the file handle is not being nuked, we can't expect the mds_sid component to also be zapped.
> ffffff01ea054458$<bufctl_audit
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
ffffff01ea054458 ffffff01e9f71d50 89f9dc8f5e7 ffffff01d7aa93a0
ffffff01c68262e0 ffffff01c747f9c0 ffffff01cfd78cd8
kmem_cache_alloc_debug+0x283
kmem_cache_alloc+0xa9
kmem_alloc+0xa3
dserv_nnode_from_fh_ds+0x5b
nnode_from_fh_v41+0x45
mds_op_putfh+0xe3
rfs41_op_dispatch+0x72
mds_compound+0x1d4
rfs41_dispatch+0x17c
dispatch_dserv_nfsv41+0x93
svc_getreq+0x20d
svc_run+0x197
svc_do_run+0x81
nfssys+0xa0e
The issue is actually very simple to understand - prior to my changes, the dnk_sid field was not being used. But that field is a mds_sid, which requires a memory allocation to use:
426 static nnode_error_t
427 dserv_nnode_from_fh_ds(nnode_t **npp, mds_ds_fh *fhp)
428 {
429 dserv_nnode_key_t dskey;
430 nnode_key_t key;
431 uint32_t hash;
432
433 if (fhp->vers < 1)
434 return (ESTALE); /* XXX badhandle */
435 if (fhp->fh.v1.mds_fid.len < 8) /* XXX stupid */
436 return (ESTALE);
437 if (fhp->fh.v1.mds_fid.len > DS_MAXFIDSZ)
438 return (ESTALE);
439 /* XXX cannot do sid until mds_sid is sane */
440 dskey.dnk_sid = NULL;
441 dskey.dnk_fid = &fhp->fh.v1.mds_fid;
442
443 hash = dserv_nnode_hash(&dskey);
444
445 key.nk_keydata = &dskey;
446 key.nk_compare = dserv_nnode_compare;
447
448 return (nnode_find_or_create(npp, &key, hash, fhp, dserv_nnode_build));
449 }
I changed this to be:
425 static nnode_error_t
426 dserv_nnode_from_fh_ds(nnode_t **npp, mds_ds_fh *fhp)
427 {
428 dserv_nnode_key_t dskey;
429 nnode_key_t key;
430 uint32_t hash;
431
432 if (fhp->vers < 1)
433 return (ESTALE); /* XXX badhandle */
434 if (fhp->fh.v1.mds_fid.len < 8) /* XXX stupid */
435 return (ESTALE);
436 if (fhp->fh.v1.mds_fid.len > DS_MAXFIDSZ)
437 return (ESTALE);
438
439 /*
440 * XXX: Should we be mallocing in the middle of a
441 * cache? Cache entry does not have enough space.
442 * What is the use of the mds_sid here?
443 */
444 dskey.dnk_sid = kmem_alloc(sizeof (mds_sid), KM_SLEEP);
445 dskey.dnk_sid->len = fhp->fh.v1.mds_sid.len;
446 dskey.dnk_sid->val = kmem_alloc(dskey.dnk_sid->len,
447 KM_SLEEP);
448
449 bcopy(fhp->fh.v1.mds_sid.val, dskey.dnk_sid->val,
450 dskey.dnk_sid->len);
451
452 dskey.dnk_fid = &fhp->fh.v1.mds_fid;
453
454 hash = dserv_nnode_hash(&dskey);
455
456 key.nk_keydata = &dskey;
457 key.nk_compare = dserv_nnode_compare;
458
459 return (nnode_find_or_create(npp, &key, hash, fhp, dserv_nnode_build));
460 }
The problem here is that both dskey and key are pulled off of the stack. We don't expect them to have a life after dserv_nnode_from_fh_ds() and before my changes, there was no need to release any memory.
The key is only used here in the find case:
637 rw_enter(&bucket->nb_lock, rw);
638 found = avl_find(&bucket->nb_tree, &key, &where);
639 if (found) {
And indeed, we can easily follow where the compare function has to be from line 457 above:
310 static int
311 dserv_nnode_compare(const void *va, const void *vb)
312 {
313 const dserv_nnode_key_t *a = va;
314 const dserv_nnode_key_t *b = vb;
315 int rc;
316
317 NFS_AVL_COMPARE(a->dnk_sid->len, b->dnk_sid->len);
318 rc = memcmp(a->dnk_sid->val, b->dnk_sid->val, a->dnk_sid->len);
319 NFS_AVL_RETURN(rc);
320
321 NFS_AVL_COMPARE(a->dnk_fid->len, b->dnk_fid->len);
322 rc = memcmp(a->dnk_fid->val, b->dnk_fid->val, a->dnk_fid->len);
323 NFS_AVL_RETURN(rc);
324
325 return (0);
326 }
And I added that code on 317-319!
Edit Note: And I just realized, I'm supplying only one of the two keys. Where does the other come from? It turns out that it does get created in:
static nnode_error_t
dserv_nnode_build(nnode_seed_t *seed, void *vfh)
{
dserv_nnode_key_t *key = NULL;
...
key->dnk_sid = kmem_alloc(sizeof (mds_sid), KM_SLEEP);
key->dnk_sid->len = fhp->fh.v1.mds_sid.len;
key->dnk_sid->val = kmem_alloc(key->dnk_sid->len,
KM_SLEEP);
And that code does take care of releasing the memory correctly with dserv_nnode_key_free()!
It is not used in the create case:
674 rc = nnode_build(npp, data, nnbuild); 675 if (rc == 0) 676 avl_insert(&bucket->nb_tree, *npp, where);
So the fix will be:
457 ne = nnode_find_or_create(npp, &key, hash, fhp, dserv_nnode_build); 458 459 kmem_free(dskey.dnk_sid->val, dskey.dnk_sid->len); 460 kmem_free(dskey.dnk_sid, sizeof (mds_sid)); 461 462 return (ne); 463 }
Ouch, I thought I was done! Remember earlier I stated that the xdr_bytes would be from this code? Well I decided to challenge that statement:
> ffffff01edde37b0$<bufctl_audit
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
ffffff01edde37b0 ffffff01edc60168 89f9c4733bc ffffff01d7aa93a0
ffffff01c68262e0 ffffff01c9a25b80 ffffff01cca20030
kmem_cache_alloc_debug+0x283
kmem_cache_alloc+0xa9
kmem_alloc+0xa3
xdr_bytes+0x70
xdr_mds_sid+0x21
xdr_ds_fh_v1+0x68
xdr_ds_fh+0x3f
xdr_decode_nfs41_fh+0xdd
xdr_snfs_argop4+0x5e
xdr_COMPOUND4args_srv+0xf4
svc_authany_wrap+0x22
svc_cots_kgetargs+0x41
dispatch_dserv_nfsv41+0x5d
svc_getreq+0x20d
svc_run+0x197
In retrospect, "D'oh!", we can see we are not decoding the xdr here. But then, in my defense, I made that statement before we looked at the code!
Ahh, looks like we have some hand written XDR code:
4034 static bool_t
4035 xdr_snfs_argop4_free(XDR *xdrs, nfs_argop4 **arrayp, int len)
4036 {
...
4056 case OP_PUTFH:
4057 if (array[i].nfs_argop4_u.opputfh.object.nfs_fh4_val !=
4058 NULL) {
4059 kmem_free(array[i].nfs_argop4_u.opputfh.object.
4060 nfs_fh4_val,
4061 array[i].nfs_argop4_u.opputfh.object.
4062 nfs_fh4_len);
4063 }
Which does not free the mds_sid portion of the file handle. This hand encoding is a nightmare to unravel. I'm going to have to pass in the minorversion:
5022 (void) xdr_snfs_argop4_free(xdrs, &objp->array, 5023 objp->array_len, objp->minorversion);
to determine whether or not to check for the mds_sid. If it is a minorversion, then I'll need to check whether it is a normal NFSv41 filehandle or a DS filehandle:
4057 case OP_PUTFH:
4058 if (array[i].nfs_argop4_u.opputfh.object.nfs_fh4_val ==
4059 NULL)
4060 continue;
4061
4062 if (minorversion != 0) {
4063 struct mds_ds_fh *dsfh =
4064 (struct mds_ds_fh *)array[i].nfs_argop4_u.
4065 opputfh.object.nfs_fh4_val;
4066
4067 /*
4068 * Is it really a DS filehandle?
4069 */
4070 if (fh4->type == FH41_TYPE_DMU_DS) {
4071 mds_sid *sid = &dsfh->fh.v1.mds_sid;
4072 if (sid->val) {
4073 kmem_free(sid->val, sid->len);
4074 }
4075 }
4076 }
4077
4078 kmem_free(array[i].nfs_argop4_u.opputfh.object.
4079 nfs_fh4_val,
4080 array[i].nfs_argop4_u.opputfh.object.
4081 nfs_fh4_len);
The key to understanding this code (and the headache) is to look at the definitions of our filehandles:
typedef struct {
nfs41_fh_type_t type;
nfs41_fh_vers_t vers;
union {
nfs41_fh_v1_t v1;
/* new versions will be added here */
} fh;
} nfs41_fh_fmt_t;
struct mds_ds_fh {
nfs41_fh_type_t type; /* MDS or DS? */
ds_fh_version vers; /* OTW version number */
union {
ds_fh_v1 v1;
/* new versions will be added here */
} fh;
};
typedef struct mds_ds_fh mds_ds_fh;
You can typecast nfs_fh4_val in several ways, i.e.:
4063 struct mds_ds_fh *dsfh = 4064 (struct mds_ds_fh *)array[i].nfs_argop4_u. 4065 opputfh.object.nfs_fh4_val;
but you always have to check the type to make sure it is valid. The key to understanding this comes out of nfs41_filehandle_xdr.c:
132 bool_t
133 xdr_decode_nfs41_fh(XDR *xdrs, nfs_fh4 *objp)
134 {
135
136 uint_t otw_len;
137 uint_t type;
138
139 ASSERT(xdrs->x_op == XDR_DECODE);
140
141 objp->nfs_fh4_val = NULL;
142 objp->nfs_fh4_len = 0;
...
150 /* Get the filehandle type */
151 if (!xdr_enum(xdrs, (enum_t *)&type))
152 return (FALSE);
153
154 switch (type) {
155 case FH41_TYPE_NFS: {
156 nfs41_fh_fmt_t *nfhp = NULL;
157
158 nfhp = kmem_zalloc(sizeof (nfs41_fh_fmt_t), KM_SLEEP);
159 nfhp->type = FH41_TYPE_NFS;
...
170 case FH41_TYPE_DMU_DS: {
171 struct mds_ds_fh *dfhp = NULL;
172
173 dfhp = kmem_zalloc(sizeof (struct mds_ds_fh), KM_SLEEP);
174 dfhp->type = FH41_TYPE_DMU_DS;
Since both structures have the type as the first field, after the typecast, we will be able to determine if that was valid to do.
So the first memory leak I fixed was one I had added. The second one was a pre-existing one. Running findleaks has saved my bacon here.
And of course, I'm going to have to check one more time to make sure that my fixes plugged the leaks!
I've update the spe implementation notes to version 1.1. The big changes from the version 1.0 from last year are:
One of the things I've been doing is spamming the MDS with file creates and deletes. I've been doing it for a day and now I want to see if the new code is leaking memory:
> ::findleaks
CACHE LEAKED BUFCTL CALLER
ffffff01c68262e0 1 ffffff0203ff9e70 mds_compound+0x54
ffffff01c68262e0 3 ffffff01eb07ada0 mds_compound+0x54
ffffff01c682b2e0 3 ffffff01de7f1818 mds_compound+0x193
ffffff01c6828020 1 ffffff020136ba50 mds_get_server_impl_id+0x30
ffffff01c68262e0 1 ffffff0203ff9a38 mds_get_server_impl_id+0x58
ffffff01c6826b20 1 ffffff01eb624ca8 mds_get_server_impl_id+0x8a
ffffff01c6828860 1 ffffff01e3fa6898 modinstall+0x129
ffffff01c6828860 1 ffffff01d82cd960 modinstall+0x129
ffffff01c6828860 1 ffffff01d61493a0 modinstall+0x129
ffffff01c682b2e0 1 ffffff02f1f72da8 modinstall+0x129
ffffff01c6828860 1 ffffff01e3fa6970 modinstall+0x129
ffffff01c682b2e0 1 ffffff01de7f13e0 modinstall+0x129
ffffff01c6828020 1 ffffff01fed32cd8 rpc_init_taglist+0x25
ffffff01c6828020 2 ffffff01e6bb68c0 rpc_init_taglist+0x25
ffffff01c6828020 1 ffffff020136bdb0 rpc_init_taglist+0x25
ffffff01c6828020 1311 ffffff01d742e4a0 rpc_init_taglist+0x25
ffffff01c68265a0 1 ffffff01f6a1ecb0 tohex+0x32
ffffff01c6828020 2 ffffff020136b390 xdr_array+0xae
ffffff01c68285a0 1 ffffff01d67cc068 xdr_bytes+0x70
ffffff01c68262e0 1154050 ffffff01eb07c7c8 xdr_bytes+0x70
ffffff01c68282e0 1 ffffff0200f380e0 xdr_bytes+0x70
ffffff01c68262e0 3979053 ffffff01f9f569b0 xdr_bytes+0x70
------------------------------------------------------------------------
Total 5134439 buffers, 82195096 bytes
I knew about the rpc_init_taglist() from before my changes, so that leaves the xdr_bytes, and yes, those are leaking:
> ffffff01f9f569b0$<bufctl_audit
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
ffffff01f9f569b0 ffffff01fa04fc68 2a7a9936fecc ffffff01d84574e0
ffffff01c68262e0 ffffff01c70614c0 0
kmem_cache_alloc_debug+0x283
kmem_cache_alloc+0x164
kmem_alloc+0xa3
xdr_bytes+0x70
xdr_mds_sid+0x21
xdr_ds_fh_v1+0x68
xdr_ds_fh+0x3f
xdr_ds_fh_fmt+0x3b
get_mds_ds_fh+0x46
ds_checkstate+0x3f
nfs_ds_cp_dispatch+0x9e
svc_getreq+0x20d
svc_run+0x197
svc_do_run+0x81
nfssys+0xa0e
> ffffff01f9f569b0$<bufctl_audit
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
ffffff01f9f569b0 ffffff01fa04fc68 2a7a9936fecc ffffff01d84574e0
ffffff01c68262e0 ffffff01c70614c0 0
kmem_cache_alloc_debug+0x283
kmem_cache_alloc+0x164
kmem_alloc+0xa3
xdr_bytes+0x70
xdr_mds_sid+0x21
xdr_ds_fh_v1+0x68
xdr_ds_fh+0x3f
xdr_ds_fh_fmt+0x3b
get_mds_ds_fh+0x46
ds_checkstate+0x3f
nfs_ds_cp_dispatch+0x9e
svc_getreq+0x20d
svc_run+0x197
svc_do_run+0x81
nfssys+0xa0e
BTW: I had enabled detailed accounting earlier with setting in /etc/system:
set kmem_flags=0xf
I think what is going on is that I assumed the space in the ds_filehandle was fixed and that there was no allocation going on for the mds_sid when it was xdr decoded. There are several complications to this:
Okay, from code inspection, the DS would have the same issue. I created a free routine and went through and applied it everywhere. Now when we add a new field with a memory allocation, we can make a change in only 1 place to clean it up.
Also, the leak was happening before - we just never bothered to check for it or track it down.
Well, I ran into problems in dumping the contents of the mds_sid array, DTrace does not iterate. To fix this, I decided to use a thread counter in the code for mds_ds_path_to_mds_sid:entry. I then increment it in the mds_ds_path_to_mds_sid:return. This allows me to enumerate the data set names.
#!/usr/sbin/dtrace -s
nfsv4:::spe-i-check_open
{
printf("%d (%d) from %s is checking %s",
(uid_t)arg0, (gid_t)arg1, stringof(arg3), stringof(arg2));
}
nfsv4:::spe-i-policy_eval
{
printf("Policy %d %s with error %d from %s",
(uint_t)arg0, (boolean_t)arg1 ? "matched" : "did not match",
(int)arg2, stringof(arg3));
}
::nfs41_spe_allocate:entry
{
self->addr = (struct netbuf *)arg2;
self->stripe_count = (count4 *)arg4;
self->unit_size = (uint32_t *)arg5;
self->mds_sids = (mds_sid **)arg6;
self->loaded_sids = 0;
}
::nfs41_spe_allocate:return
/args[1] == 0/
{
printf("Policy has %d stripes and %u block size",
*self->stripe_count, *self->unit_size);
}
::nfs41_spe_allocate:return
/args[1] != 0/
{
printf("No matching policy");
}
::mds_ds_path_to_mds_sid:entry
{
self->ustring = (utf8string *)arg0;
self->ss_name = stringof(self->ustring->utf8string_val);
self->mds_sid = (struct mds_sid *)arg1;
}
::mds_ds_path_to_mds_sid:return
/args[1] == 0/
{
ss_name = (char *)alloca(self->ustring->utf8string_len + 1);
bcopy(self->ustring->utf8string_val, ss_name,
self->ustring->utf8string_len);
ss_name[self->ustring->utf8string_len + 1] = '\0';
printf("mds_sid[%d] = %s", self->loaded_sids++,
stringof(ss_name));
}
::mds_ds_path_to_mds_sid:return
/args[1] != 0/
{
}
The remaining problem was that my first attempt to dump the utf8string resulted in garbage at the end of the output. Which makes sense, as such strings are not NUL terminated. Here is the original code:
::mds_ds_path_to_mds_sid:entry
{
u = (utf8string *)arg0;
self->ss_name = stringof(u->utf8string_val);
self->mds_sid = (struct mds_sid *)arg1;
}
::mds_ds_path_to_mds_sid:return
/args[1] == 0/
{
printf("mds_sid[%d] = %s", self->loaded_sids++, self->ss_name);
}
There was an interesting thread of discussion over at: [dtrace-discuss] getting extra characters with printf(copyin(a, b)). I tried the approaches suggested by both Chad and Adam, but no luck. I get on getting:
dtrace: error on enabled probe ID 7 (ID 63585: fbt:nfssrv:mds_ds_path_to_mds_sid:return): invalid address (0xffffff01f6a103f8) in action #2 at DIF offset 56
Which to be fair, I more than expected. The issue is that the copyin and copyinstr work on user land memory and the stuff I want to dump is in the kernel. Where that thread really helped me though was it pointed me to using bcopy(). With that in play, things went well:
1 3502 nfs41_spe_allocate:spe-i-check_open 0 (0) from 10.1.233.12 is checking /pnfs1/nfs41/pnfs-4-02/2/spek.txt 1 3501 nfs41_spe_allocate:spe-i-policy_eval Policy 1000 did not match with error 0 from 10.1.233.12 1 3501 nfs41_spe_allocate:spe-i-policy_eval Policy 2000 matched with error 0 from 10.1.233.12 1 63585 mds_ds_path_to_mds_sid:return mds_sid[0] = pnfs-17-22:pnfs1/ds1n 1 63585 mds_ds_path_to_mds_sid:return mds_sid[1] = pnfs-17-22:pnfs1/ds8n 1 63585 mds_ds_path_to_mds_sid:return mds_sid[2] = pnfs-17-22:pnfs2/ds2n 1 63585 mds_ds_path_to_mds_sid:return mds_sid[3] = pnfs-17-23:pnfs1/ds3n 1 63585 mds_ds_path_to_mds_sid:return mds_sid[4] = pnfs-17-23:pnfs1/ds7n 1 63585 mds_ds_path_to_mds_sid:return mds_sid[5] = pnfs-17-23:pnfs2/ds4f 1 63585 mds_ds_path_to_mds_sid:return mds_sid[6] = pnfs-4-01:pnfs1/d10n 1 63585 mds_ds_path_to_mds_sid:return mds_sid[7] = pnfs-4-01:pnfs1/ds4 1 62825 nfs41_spe_allocate:return Policy has 8 stripes and 1000 block size
It is time to get rid of the remaining debug printfs in the kspe code. The answer is DTrace.
I've added the following:
/*
* Give the client, uid, gid, and path.
*/
DTRACE_NFSV4_4(spe__i__check_open, uid_t, pat.uid,
gid_t, pat.gid, char *, dir_path, char *, address);
Which will allow me to determine the relevant parameters to doing a check in the policies. And:
/*
* Give the policy id, match, result, and client.
*/
DTRACE_NFSV4_4(spe__i__policy_eval, uint_t, sp->sp_id,
boolean_t, b, int, result, char *, address);
Will allow me to see the result of each policy lookup. I could probably do this with a function return, I would just have to pass in the address. But this is just as much an exercise in learning more about DTrace as anything else.
The last piece I will do as a function return. But I will lose the client address!
[root@pnfs-17-24 ~]> more spe.d
#!/usr/sbin/dtrace -s
nfsv4:::spe-i-check_open
{
co_uid = (uid_t)arg0;
co_gid = (gid_t)arg1;
co_path = (char *)arg2;
co_addr = (char *)arg3;
printf("%d (%d) from %s is checking %s",
co_uid, co_gid, stringof(co_path), stringof(co_addr));
}
nfsv4:::spe-i-policy_eval
{
pe_id = (uint_t)arg0;
pe_match = (boolean_t)arg1;
pe_res = (int)arg2;
pe_addr = (char *)arg3;
printf("Policy %d %s with error %d from %s",
pe_id, pe_match ? "matched" : "did not match",
pe_res, stringof(pe_addr));
}
Yields:
[root@pnfs-17-24 ~]> ./spe.d dtrace: script './spe.d' matched 2 probes CPU ID FUNCTION:NAME 0 3502 nfs41_spe_allocate:spe-i-check_open 0 (0) from /pnfs2/pnfs/pnfs-4-02/spek4.txt is checking 0.0.0.0 0 3501 nfs41_spe_allocate:spe-i-policy_eval Policy 1000 did not match with error 0 from 0.0.0.0
Okay, my address and path is backwards on the first one. And the addresses look wrong. The uid and gid are actually probably okay, I'm doing my tests as root after all.
And a quick sanity check shows I am right:
[root@pnfs-17-24 ~]> ./spe.d dtrace: script './spe.d' matched 2 probes CPU ID FUNCTION:NAME 1 3502 nfs41_spe_allocate:spe-i-check_open 3533 (10) from 0.0.0.0 is checking /pnfs1/nfs41/ds1tmds3.snoop 1 3501 nfs41_spe_allocate:spe-i-policy_eval Policy 1000 did not match with error 0 from 0.0.0.0 1 3501 nfs41_spe_allocate:spe-i-policy_eval Policy 2000 matched with error 0 from 0.0.0.0
Turns out that the passed in address is bogus:
[root@pnfs-17-24 ~]> Jul 28 21:12:24 pnfs-17-24 nfssrv: Address is 4294901760 for FFFF0000
This was working at one point, maybe. :->
I ended up doing some further investigation and added another probe:
nfsv4:::mds-i-create_address
{
ca_req = (struct svc_req *)arg0;
ca_addr = (struct netbuf *)arg1;
a = (char *)ca_req->rq_xprt->xp_xpc.xpc_rtaddr.buf;
iaddr = 0;
iaddr = (a[0] << 24) | (a[1] << 16) | (a[2] << 8) | a[3];
r1 = (iaddr & (0xff << 24)) >> 24;
r2 = (iaddr & (0xff << 16)) >> 16;
r3 = (iaddr & (0xff << 8)) >> 8;
r4 = iaddr & 0xff;
printf("Address is %u.%u.%u.%u", r1, r2, r3, r4);
}
It is late and DTrace does not do iteration. The code mimics what I do in C. It gives a consistent answer:
0 3564 mds_createfile_get_layout:mds-i-create_address Address is 255.255.255.254 0 3478 nfs41_spe_allocate:spe-i-check_open 0 (0) from 255.255.255.254 is checking /pnfs1/nfs41/ddsssss
Hmm, it is not changing as I change the client I'm making the request from.
I don't think it is working in OpenSolaris...
I think the bug isn't in the code, but in the way I am unpacking the netbuf. I'm looking at 4 bytes and the buffer is typically 16 bytes.
Yes, I ended up looking at our (OpenSolaris) source for __inet_taddr2uaddr() to figure out what was going on. I needed to go from:
1094 /*
1095 * XXX: Assuming IPv4!
1096 */
1097 a = (char *)addr->buf;
1098 for (i = 0; i < 4; i++) {
1099 iaddr |= a[i] << ((3-i) * 8);
1100 }
1101
1102 pat.addr = iaddr;
To:
1111 struct sockaddr_in *sa;
...
1115 sa = (struct sockaddr_in *)(addr->buf);
1116
1117 /*
1118 * XXX: Assuming IPv4!
1119 */
1120 a = (unsigned char *)&sa->sin_addr;
1121 for (i = 0; i < 4; i++) {
1122 iaddr |= a[i] << ((3-i) * 8);
1123 }
1124
1125 pat.addr = iaddr;
And we see this output:
0 19974 nfs41_spe_allocate:spe-i-check_open 0 (0) from 10.1.233.12 is checking /pnfs1/nfs41/jk19
Well, I still need to finish up the DTRace script to dump the found layout upon return. And I should figure out how to dump the client address via DTrace.
The important lesson here is to test the stuff you think you've already tested. I know I tested the address logic in the stand alone prototype. And I thought I also tested it in a special kernel version, but I must not have done so. In fact, it makes sense with some of the debug logic I pulled out earlier this week.
In all, I'm happy, I fixed a very nasty bug.