The memleaks on the DSes are plugged:
> ::findleaks
CACHE LEAKED BUFCTL CALLER
ffffff01d459d5a0 1 ffffff01eae60088 cralloc_flags+0x21
ffffff01c682a860 1 ffffff01d75d65e8 dserv_mds_do_reportavail+0x210
ffffff01c68262e0 4 ffffff01f3c3f1c0 mds_compound+0x54
ffffff01c682b2e0 3 ffffff01ecc0e6e8 mds_compound+0x193
ffffff01c6828020 2 ffffff02032b4d30 mds_get_server_impl_id+0x30
ffffff01c68262e0 2 ffffff01f3c3f0e8 mds_get_server_impl_id+0x58
ffffff01c6826b20 2 ffffff01dfb46cc0 mds_get_server_impl_id+0x8a
ffffff01c6828860 1 ffffff01d7ba0740 modinstall+0x129
ffffff01c682b2e0 1 ffffff01ddfd1b60 modinstall+0x129
ffffff01c6828860 1 ffffff01d7ba0c50 modinstall+0x129
ffffff01c68265a0 2 ffffff01e159f890 tohex+0x32
ffffff01c6828020 2 ffffff01e3e6e398 xdr_array+0xae
ffffff01c68282e0 1 ffffff0201571818 xdr_bytes+0x70
ffffff01c68285a0 1 ffffff02590b2538 xdr_bytes+0x70
------------------------------------------------------------------------
Total 24 buffers, 2356 bytes
And on the MDS, we see the same:
> ::findleaks
CACHE LEAKED BUFCTL CALLER
ffffff01c68262e0 4 ffffff01d6e497b8 mds_compound+0x54
ffffff01c682b2e0 3 ffffff01d69cdcb0 mds_compound+0x193
ffffff01c6828020 1 ffffff0202d71db8 mds_get_server_impl_id+0x30
ffffff01c6828020 1 ffffff03ea11d4c0 mds_get_server_impl_id+0x30
ffffff01c68262e0 2 ffffff01ebc1a008 mds_get_server_impl_id+0x58
ffffff01c6826b20 2 ffffff0202808588 mds_get_server_impl_id+0x8a
ffffff01c68265a0 2 ffffff01f92d2458 tohex+0x32
ffffff01c6828020 2 ffffff0202d71c08 xdr_array+0xae
ffffff01c68285a0 1 ffffff07c864e128 xdr_bytes+0x70
ffffff01c68282e0 1 ffffff02027ff390 xdr_bytes+0x70
------------------------------------------------------------------------
Total 19 buffers, 1496 bytes
Besides that nasty interaction in the DS XDR code, we can see that I've fixed the rpc_init_taglist() leak on both systems. BTW, I thought that one might have been in Nevada, but I checked and it is only in the nfs41-gate. Sweet, that means I don't have to backport it into a gate, which would cause another week of testing.
The code is ready to go. I have a code walkthrough next week, where I need to iron out the following issues:
Looks like my new code is not complete:
> ::findleaks
CACHE LEAKED BUFCTL CALLER
ffffff01c682a860 1 ffffff01d79523d0 dserv_mds_do_reportavail+0x210
ffffff01c68262e0 4 ffffff01ee2a9118 mds_compound+0x54
ffffff01c682b2e0 3 ffffff01e8721738 mds_compound+0x193
ffffff01c6828020 1 ffffff01f274dc00 mds_get_server_impl_id+0x30
ffffff01c68262e0 1 ffffff01e96acb40 mds_get_server_impl_id+0x58
ffffff01c6826b20 1 ffffff01e128de70 mds_get_server_impl_id+0x8a
ffffff01c6828860 1 ffffff01d87c66d8 modinstall+0x129
ffffff01c682b2e0 1 ffffff01ddb51748 modinstall+0x129
ffffff01c6828860 1 ffffff01d7f8f9b0 modinstall+0x129
ffffff01c6828020 1 ffffff01dd7db798 rpc_init_taglist+0x25
ffffff01c6828020 12741 ffffff01f180fdf8 rpc_init_taglist+0x25
ffffff01c6828020 1 ffffff01e3ede5e8 rpc_init_taglist+0x25
ffffff01c6828020 1 ffffff09ba2e4da0 rpc_init_taglist+0x25
ffffff01c6828020 23152 ffffff01e2e37cc8 rpc_init_taglist+0x25
ffffff01c68265a0 1 ffffff01ee74bd30 tohex+0x32
ffffff01c6828020 2 ffffff01d632b880 xdr_array+0xae
ffffff01c6828020 1 ffffff01fe11ec00 xdr_array+0xae
ffffff01c68282e0 1 ffffff01e5e4c2b0 xdr_bytes+0x70
ffffff01c68262e0 1659 ffffff01eaae3700 xdr_bytes+0x70
ffffff01c68285a0 1 ffffff01fe136de0 xdr_bytes+0x70
ffffff01c68262e0 1571800 ffffff01e7dde3b8 xdr_bytes+0x70
------------------------------------------------------------------------
Total 1609375 buffers, 26900440 bytes
> ffffff01e7dde3b8$<bufctl_audit
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
ffffff01e7dde3b8 ffffff01ea091bc8 4d6b30e7ab91 ffffff01d8245b40
ffffff01c68262e0 ffffff01c6b37000 ffffff01cc88be60
kmem_cache_alloc_debug+0x283
kmem_cache_alloc+0xa9
kmem_alloc+0xa3
xdr_bytes+0x70
xdr_mds_sid+0x21
xdr_ds_fh_v1+0x68
xdr_ds_fh+0x3f
xdr_decode_nfs41_fh+0xdd
xdr_snfs_argop4+0x5e
xdr_COMPOUND4args_srv+0xf4
svc_authany_wrap+0x22
svc_cots_kgetargs+0x41
dispatch_dserv_nfsv41+0x5d
svc_getreq+0x20d
svc_run+0x197
By the way, those leaks of 1 or 2 buffers are probably memory that was still in active use when I forced the core.
So this is the second bug I claimed to have fixed earlier today. Of note is that we never saw a panic, so something at least is correct. And I decided to fix the rpc_init_taglist bug while I was at it.
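One way to separate that noise from the real leaks is to total the LEAKED column per caller. The following is only a sketch in userland C that reads rows in the four-column layout shown above; the helper names (parse_leak_row, leaked_by_caller) are made up for illustration and are not part of mdb.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One row of ::findleaks output: CACHE LEAKED BUFCTL CALLER. */
struct leak_row {
	unsigned long long cache;
	unsigned long leaked;
	unsigned long long bufctl;
	char caller[64];
};

/*
 * Parse one row. Returns 1 for a data row and 0 for anything else
 * (the column header, the dashed rule, the Total line, the prompt).
 */
static int
parse_leak_row(const char *line, struct leak_row *row)
{
	return (sscanf(line, "%llx %lu %llx %63s", &row->cache,
	    &row->leaked, &row->bufctl, row->caller) == 4);
}

/* Sum the LEAKED column over every row attributed to one caller. */
static unsigned long
leaked_by_caller(const char *const *lines, size_t nlines,
    const char *caller)
{
	unsigned long total = 0;
	struct leak_row row;
	size_t i;

	assert(lines != NULL && caller != NULL);

	for (i = 0; i < nlines; i++) {
		if (parse_leak_row(lines[i], &row) &&
		    strcmp(row.caller, caller) == 0)
			total += row.leaked;
	}
	return (total);
}
```

Fed the rows from the dump above, a caller whose total stays at 1 or 2 is likely in-flight memory, while xdr_bytes+0x70 and rpc_init_taglist+0x25 stand out by several orders of magnitude.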
I'm going to need to add some DTrace to track down what is happening here...
Aargh! I say, aargh! nfs4_xdr.c belongs to the nfs module and not the nfssrv module. For quick turnaround, I've only been rebuilding nfssrv and not the whole kernel. It was only when I changed just nfs4_xdr.c and tried a dmake in src/uts/intel/nfssrv that I noticed nothing happened. My code may be golden after all! If it compiles, that is.
Okay, I did some other changes, but here is my compiling code:
	case OP_PUTFH: {
		nfs_fh4 *obj = &array[i].nfs_argop4_u.opputfh.object;

		if (obj->nfs_fh4_val == NULL)
			continue;

		DTRACE_NFSV4_1(xdr__i__op_putfh_version, uint32_t,
		    minorversion);
		if (minorversion != 0) {
			struct mds_ds_fh *dsfh =
			    (struct mds_ds_fh *)obj->nfs_fh4_val;

			DTRACE_NFSV4_1(xdr__i__op_putfh_type,
			    nfs41_fh_type_t, dsfh->type);

			/*
			 * Is it really a DS filehandle?
			 */
			if (dsfh->type == FH41_TYPE_DMU_DS) {
				mds_sid *sid = &dsfh->fh.v1.mds_sid;

				DTRACE_NFSV4_1(xdr__i__op_putfh_sid,
				    mds_sid *, sid);

				if (sid->val) {
					kmem_free(sid->val, sid->len);
				}
			}
		}

		kmem_free(obj->nfs_fh4_val, obj->nfs_fh4_len);
		continue;
	}
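The shape of the fix is: free the buffer nested inside the decoded file handle before freeing the handle itself. Here is a minimal userland sketch of that pattern, assuming malloc/free in place of kmem_alloc/kmem_free and a simple counter in place of ::findleaks. The names sid_model, fh_model, and the fh_* and model_* helpers are hypothetical stand-ins, not the real nfs41-gate structures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static long outstanding;	/* live allocations, a poor man's ::findleaks */

static void *
model_alloc(size_t len)
{
	void *p = calloc(1, len);	/* callers pass len > 0 */

	assert(p != NULL);
	outstanding++;
	return (p);
}

static void
model_free(void *p)
{
	if (p != NULL) {
		outstanding--;
		free(p);
	}
}

/* Hypothetical stand-in for mds_sid. */
struct sid_model {
	unsigned int len;
	char *val;		/* second allocation, made by the decoder */
};

/* Hypothetical stand-in for the decoded DS file handle. */
struct fh_model {
	int is_ds;		/* models type == FH41_TYPE_DMU_DS */
	struct sid_model sid;
};

/* Decode: one buffer for the handle, one more for the sid bytes. */
static struct fh_model *
fh_decode(const char *sid_bytes, unsigned int len)
{
	struct fh_model *fh = model_alloc(sizeof (*fh));

	fh->is_ds = 1;
	fh->sid.len = len;
	fh->sid.val = model_alloc(len);
	memcpy(fh->sid.val, sid_bytes, len);
	return (fh);
}

/* The leak: only the outer buffer is released. */
static void
fh_free_shallow(struct fh_model *fh)
{
	model_free(fh);
}

/* The fix: release the nested buffer first, then the outer one. */
static void
fh_free_deep(struct fh_model *fh)
{
	if (fh->is_ds && fh->sid.val != NULL)
		model_free(fh->sid.val);
	model_free(fh);
}
```

Calling fh_decode() followed by fh_free_shallow() leaves outstanding at 1 for every handle, which is the per-PUTFH leak seen under xdr_bytes+0x70; with fh_free_deep() the counter returns to 0.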
And I added this simple DTrace script:
[root@pnfs-17-22 ~]> more ds.d
#!/usr/sbin/dtrace -s

nfsv4:::xdr-i-op_putfh_version
{
	printf("xdr decode a FH -- version == %u",
	    (uint32_t)arg0);
}

nfsv4:::xdr-i-op_putfh_type
{
	printf("xdr decode a FH -- type == %s",
	    (int)arg0 == 2 ? "DS" : "regular");
}

nfsv4:::xdr-i-op_putfh_sid
{
	sid = (mds_sid *)arg0;
	printf("xdr decode a FH -- sid == %s",
	    sid == NULL ? "(null)" : "valid");
}
Which shows:
[root@pnfs-17-22 ~]> ./ds.d
dtrace: script './ds.d' matched 3 probes
CPU     ID                    FUNCTION:NAME
  0   2834 xdr_snfs_argop4_free:xdr-i-op_putfh_version xdr decode a FH -- version == 1
  0   2833 xdr_snfs_argop4_free:xdr-i-op_putfh_type xdr decode a FH -- type == DS
  0   2832 xdr_snfs_argop4_free:xdr-i-op_putfh_sid xdr decode a FH -- sid == valid
  0   2834 xdr_snfs_argop4_free:xdr-i-op_putfh_version xdr decode a FH -- version == 1
  0   2833 xdr_snfs_argop4_free:xdr-i-op_putfh_type xdr decode a FH -- type == DS
  0   2832 xdr_snfs_argop4_free:xdr-i-op_putfh_sid xdr decode a FH -- sid == valid
But I still have to check back later to see if there are memory leaks!
I've been trying to show how you would use kmdb and ::findleaks to track down memory leaks. You need to do this with XDR code, even the machine-generated stuff. You also need to do it before you integrate and not after. I've fixed two leaks that were pre-existing. They would probably have gone unnoticed until either someone had a regression test session flunk because of accumulated memory leaks (the mds_sid leaks would do it) or we sat down to find them before shipping code.
The other thing about memory leaks is that you have to test again after you fix them. You might find more, find out your fix didn't work, or find out your fix uncovered others.
And perhaps it is time to remind you of my other disclaimer: I don't hide my braindead mistakes. I show them in hopes that someone can learn from them - even if it is just me. :->