« August 2009 »
SunMonTueWedThuFriSat
      
3
4
5
6
7
8
9
11
13
14
15
17
18
19
20
21
22
23
24
25
26
28
29
31
     
Today
XML

Neat blogs

Navigation

Editing

Powered by Roller Weblogger.

statcounter.com

clustrmaps.com

Locations of visitors to this page

technorati.com

20090830 Sunday August 30, 2009
Investigating mountd

We've had some interesting question on nfs-discuss lately about mountd and I thought I'd use an userland script I wrote, snoop, and DTrace to show some interesting properties about mountd.

My perl script will send UDP mount requests to a server and spoof the client IP. I want to control what I send and I'll sometimes spoof a request from a non-existent machine.

BTW - you will notice I don't talk about what the share is that much, unless I state otherwise, it is:

[root@silver ~]> share | grep tdh
-@tank/home     /export/zfs/tdh   rw   ""  

Need IP to name mappings

We can try a host without a name mapping:

[tdh@slayer ~/src]> host 10.10.20.41
Host 41.20.10.10.in-addr.arpa. not found: 3(NXDOMAIN)
[tdh@slayer ~/src]> sudo ./udp_raw.pl src_addr 10.10.20.42

Note that since we don't have a client to receive the reply, we'll snoop it:

 30   0.01297  10.10.20.42 -> silver       MOUNT3 C Mount /export/zfs/tdh
 31   0.01712       silver -> thens.internal.excfb.com DNS C 42.20.10.10.in-addr.arpa. Internet PTR ?
 32   0.05912 thens.internal.excfb.com -> silver       DNS R  Error: 3(Name Error)
 33   0.00522       silver -> thens.internal.excfb.com DNS C 42.20.10.10.in-addr.arpa. Internet PTR ?
 34   0.00056 thens.internal.excfb.com -> silver       DNS R  Error: 3(Name Error)
 35   0.00777       silver -> 10.10.20.42  MOUNT3 R Mount Permission denied

BTW - right off the bat we can see that mountd tries to resolve the client IP. What happens if there is a reverse entry?

[tdh@slayer ~/src]> host 192.168.4.14
14.4.168.192.in-addr.arpa domain name pointer blast-4-14.internal.excfb.com.
[tdh@slayer ~/src]> sudo ./udp_raw.pl src_addr 192.168.4.14

Well?

 37   0.03198 blast-4-14.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh
 38   0.00089       silver -> thens.internal.excfb.com DNS C 14.4.168.192.in-addr.arpa. Internet PTR ?
 39   0.00051 thens.internal.excfb.com -> silver       DNS R 14.4.168.192.in-addr.arpa. Internet PTR blast-4-14.internal.excfb.com.
 40   0.00290       silver -> blast-4-14.internal.excfb.com MOUNT3 R Mount OK FH=01CC Auth=unix

So a client has to have a reverse mapping from IP to host name before we allow a mount to succeed. And we can see that in the source code for usr/src/cmd/fs.d/nfs/mountd/mountd.c:

    875 	getclientsnames(transp, &nb, &clnames);
    876 	if (clnames == NULL || nb == NULL) {
    877 		/*
    878 		 * We failed to get a name for the client, even 'anon',
    879 		 * probably because we ran out of memory. In this situation
    880 		 * it doesn't make sense to allow the mount to succeed.
    881 		 */
    882 		error = EACCES;
    883 		goto reply;
    884 	}

Need to match exactly on hostname

What if I change the share to be:

[root@silver ~]> zfs set sharenfs=rw=blast4-14:blast4-15 tank/home/tdh

Will it work or fail?

[tdh@slayer ~/src]> sudo ./udp_raw.pl src_addr 192.168.4.14
[tdh@slayer ~/src]> sudo ./udp_raw.pl src_addr 192.168.4.15

Note that I wanted to show something warm in the cache and something cold:

 24   0.03500 blast-4-14.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh
 25   0.00129       silver -> blast-4-14.internal.excfb.com MOUNT3 R Mount Permission denied
 41   0.03706 blast-4-15.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh (retransmit)
 42   0.01419       silver -> thens.internal.excfb.com DNS C 15.4.168.192.in-addr.arpa. Internet PTR ?
 43   0.00048 thens.internal.excfb.com -> silver       DNS R 15.4.168.192.in-addr.arpa. Internet PTR blast-4-15.internal.excfb.com.
 44   0.00089       silver -> blast-4-15.internal.excfb.com MOUNT3 R Mount Permission denied

Two points here, the DNS cache is not flushed when a share reloads, but the nfs auth cache must be. If it were not flushed, we would have gotten permission granted.

Okay, I can show you what is going on by this example:

[root@silver ~]> zfs set sharenfs=rw=blast4-14.internal.excfb.com:blast4-15 tank/home/tdh

And;

 24   0.03769 blast-4-14.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh
 25   0.00122       silver -> blast-4-14.internal.excfb.com MOUNT3 R Mount Permission denied
 41   0.03115 blast-4-15.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh (retransmit)
 42   0.00092       silver -> blast-4-15.internal.excfb.com MOUNT3 R Mount Permission denied

Okay, we should have gotten permission granted for the first!

[root@silver ~]> zfs set sharenfs=rw=blast4-17.internal.excfb.com:blast4-15 tank/home/tdh

And that fails as well:

 20   0.03105 blast-4-17.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh
 21   0.00086       silver -> thens.internal.excfb.com DNS C 17.4.168.192.in-addr.arpa. Internet PTR ?
 22   0.00056 thens.internal.excfb.com -> silver       DNS R 17.4.168.192.in-addr.arpa. Internet PTR blast-4-17.internal.excfb.com.
 23   0.00163       silver -> blast-4-17.internal.excfb.com MOUNT3 R Mount Permission denied

Ohhh! Smug mode off

Okay, I'm interested in what the function in_access_list will tell me. I've got a small DTrace script, which I iterated over the development off, which will let me know what is going on here:


!/usr/sbin/dtrace -Fs

/*
 * Thanks to Peter Harvey;
 * http://blogs.sun.com/peteh/entry/dereferencing_user_space_pointers_in
 *
 *      # ./mountd.d `pgrep -x mountd`
 *
 */

dtrace:::BEGIN
{
        printf("Sampling... Hit Ctrl-C to end.\n");
}

pid$1::check_client_new:return
{
        printf("Access permission is %d", arg1);
}

pid$1::in_access_list:entry
{
        self->trace_me = 1;
        printf("Access list is %s", copyinstr(arg2));
}

pid$1::in_access_list:return
{
        self->trace_me = 0;
        printf("Access permission is %d", arg1);
}

pid$1::strcasecmp:entry
/self->trace_me == 1/
{
        printf("host vs list entry: |%s|, |%s|\n", copyinstr(arg0),
                copyinstr(arg1));
}

pid$1::strcasecmp:entry
/self->trace_me == 1/
{
        printf("Comparison is %d", arg1);
}

Note that I need to use strcasecmp because I can't iterate over the array. I get as a result:

[root@silver ~]> ./mount_trace.sh
+ pgrep -x mountd 
+ /root/mountd.d 634 
dtrace: script '/root/mountd.d' matched 6 probes
CPU FUNCTION                                 
  0 | :BEGIN                                  Sampling... Hit Ctrl-C to end.

  0  -> in_access_list                        Access list is blast4-17.internal.excfb.com:blast4-15
  0    -> strcasecmp                          host vs list entry: |blast4-17.internal.excfb.com|, |blast-4-17.internal.excfb.com|

  0     | strcasecmp:entry                    Comparison is 135156944
  0     | strcasecmp:entry                    host vs list entry: |blast4-15|, |blast-4-17.internal.excfb.com|

  0     | strcasecmp:entry                    Comparison is 135156944
  0    <- in_access_list                      Access permission is 0
^C

D'oh! My share is wrong, wrong I say!

If instead I try:

[root@silver ~]> zfs set sharenfs=rw=blast-4-17.internal.excfb.com:blast-4-15 tank/home/tdh

We expect blast-4-15 to fail and blast-4-17 to succeed see:

 28   0.03211 blast-4-15.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh
 29   0.00150       silver -> blast-4-15.internal.excfb.com MOUNT3 R Mount Permission denied
 49   0.03460 blast-4-17.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh (retransmit)
 50   0.00087       silver -> blast-4-17.internal.excfb.com MOUNT3 R Mount OK FH=01CC Auth=unix

Which shows again, that you need an exact match and we don't append the domain name to end of the hostname. What would happen if added this to end of the server's /etc/hosts?

192.168.4.14    blast-4-15

I expect it to work. Does it?

[root@silver ~]> grep MOUNT xxx 
 24   0.03176 blast-4-15.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh
 27   0.00124       silver -> blast-4-15.internal.excfb.com MOUNT3 R Mount Permission denied
 49   0.00160 blast-4-17.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh (retransmit)
 52   0.00098       silver -> blast-4-17.internal.excfb.com MOUNT3 R Mount OK FH=01CC Auth=unix

Hmm, I bet the name entry is cached, which I can show by trying one which is not cached:

[root@silver ~]> zfs set sharenfs=rw=blast-4-17.internal.excfb.com:blast-4-31 tank/home/tdh

And

[root@silver ~]> grep MOUNT xxx
 24   0.03633 blast-4-17.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh
 27   0.00102       silver -> blast-4-17.internal.excfb.com MOUNT3 R Mount OK FH=01CC Auth=unix
 45   0.02997   blast-4-31 -> silver       MOUNT3 C Mount /export/zfs/tdh (retransmit)
 46   0.00189       silver -> blast-4-31   MOUNT3 R Mount OK FH=01CC Auth=unix

And we can see the lack of caching because of two clues, the name output in snoop, i.e., "blast-4-31", and the lack of DNS activity between the two packets.

Summary

Some of the behaviour shocked me and I made some stupid mistakes that were hard to figure out what I had done. As an exercise in triage, it was great. I now have the beginnings of a DTrace debugging tool that I can point people at if they need some help. I'm very, very happy about that part!


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily
Question: Why isn't a policy working?

One of the hardest things about developing software is testing it. As a developer, you get too close to the code and what you expect it to do. So I was very happy to get a question in email about why was a set of policies not working correctly:

I tried to reset the SPE policies to the following:

# cat /etc/policies.spe
101, 1, 8k, DS2, path == /diskpool/JUNK/TEST/P1
102, 2, 4k, DS1, path == /diskpool/JUNK/TEST/P2
103, 3, 2k, mix3, path == /diskpool/JUNK/TEST/P3
104, 4, 1k, mix4, path == /diskpool/JUNK/TEST/P4
#105, 5, 128k, default, path == /diskpool/JUNK/TEST/P5 || path == /export/JUNK
111, 5, 32k, default, path == /diskpool/JUNK/TEST/P5
112, 5, 64k, default, path == /export/JUNK

and have the client uses the path to verify the stripe.  For stripe 1,2,3
things seem working OK; however when I tried with strip 4 or 5 with
the policy above, I got "nfsstat -l" to list stripe 10 and count 32k:

pnfs-minipit1-3:~#
pnfs-minipit1-3:~# /mntpit1-3: mount -o vers=4 pnfs-minipit1-5:/diskpool/JUNK/TEST/P5 /mnt
pnfs-minipit1-3:~# nfsstat -m /mnt
/mnt from pnfs-minipit1-5:/diskpool/JUNK/TEST/P5
 Flags:         vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=1048576,wsize=1048576,retrans=5,timeo=600
 Attr cache:    acregmin=3,acregmax=60,acdirmin=30,acdirmax=60

pnfs-minipit1-3:~# mkfile -v 1m /mnt/kk
/mnt/kk 1048576 bytes
pnfs-minipit1-3:~# nfsstat -l /mnt/kk
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 33
Layout [0]:
        Layout obtained at: Sun Aug 30 00:23:47:199247 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 10, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
        Stripe [1]:
                tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
        Stripe [2]:
                tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
        Stripe [3]:
                tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
        Stripe [4]:
                tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
        Stripe [5]:
                tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
        Stripe [6]:
                tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
        Stripe [7]:
                tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
        Stripe [8]:
                tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
        Stripe [9]:
                tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
pnfs-minipit1-3:~#

What did I do wrong?

Before we even look at the issue, what did the person asking for help do right

  1. Described the problem.
    1. What did they expect to see?
    2. What did they actually see?
  2. Showed me how to reproduce the problem.

In short, I'm confident that the person took all reasonable steps to solve the problem on their own.

So, now let's try to figure out what is going on. I have a DTrace script which will help out here. Ouch, I should have shared that earlier. Anyway, spe.d should help. I'll start it up and monitor the results as I do a file create:

[root@pnfs-minipit1-5 ~]> ./spe.d
dtrace: script './spe.d' matched 8 probes
CPU     ID                    FUNCTION:NAME
  1   3420 nfs41_spe_allocate:spe-i-check_open 200096 (10) from 10.1.233.191 is checking /diskpool/JUNK/TEST/P5/foollu2
  1   3419 nfs41_spe_allocate:spe-i-policy_eval Policy 101 did not match with error 0 from 10.1.233.191
  1   3419 nfs41_spe_allocate:spe-i-policy_eval Policy 102 did not match with error 0 from 10.1.233.191
  1   3419 nfs41_spe_allocate:spe-i-policy_eval Policy 103 did not match with error 0 from 10.1.233.191
  1   3419 nfs41_spe_allocate:spe-i-policy_eval Policy 104 did not match with error 0 from 10.1.233.191
  1   3419 nfs41_spe_allocate:spe-i-policy_eval Policy 111 matched with error 0 from 10.1.233.191
  1  63564    mds_ds_path_to_mds_sid:return mds_sid[0] = pnfs-minipit1-6:pNFSpool1/p1DS2
  1  63564    mds_ds_path_to_mds_sid:return mds_sid[1] = pnfs-minipit1-6:pNFSpool2/p2DS2
  1  63564    mds_ds_path_to_mds_sid:return mds_sid[2] = pnfs-minipit1-6:pNFSpool3/p3DS2
  1  63564    mds_ds_path_to_mds_sid:return mds_sid[3] = pnfs-minipit1-7:pNFSpool1/p1DS1
  1  63564    mds_ds_path_to_mds_sid:return ERROR - could not find a matching pgi for pnfs-minipit1-7:pNFSpool1/p2DS1
  1  56851        nfs41_spe_allocate:return No matching policy

So we did find a matching policy, but we only found 5 datasets and the policy requires 5. So, why isn't that dataset in memory?

[root@pnfs-minipit1-5 ~]> echo "::walk mds_DS_guid_entry_cache|::print struct rfs4_dbe dbe_data|::print ds_guid_info_t ds_dataset_name.utf8string_val|::eval /s" | mdb -k
0xffffff01e73e5fc0:             pnfs-minipit1-6:pNFSpool3/p3DS2�����f
0xffffff01e7227078:             pnfs-minipit1-6:pNFSpool3/p3DS1�����f
0xffffff01e7227120:             pnfs-minipit1-6:pNFSpool2/p2DS2�����f
0xffffff01e72271c8:             pnfs-minipit1-6:pNFSpool2/p2DS1�����f
0xffffff01e7227270:             pnfs-minipit1-6:pNFSpool1/p1DS2�����f
0xffffff01e7227318:             pnfs-minipit1-6:pNFSpool1/p1DS1�����f
0xffffff01e5d62e70:             pnfs-minipit1-7:pNFSpool2/p2DS2�����f
0xffffff01e6597628:             pnfs-minipit1-7:pNFSpool2/p2DS1�����f
0xffffff01e4e40468:             pnfs-minipit1-7:pNFSpool1/p1DS2�����f
0xffffff01e5f84cb8:             pnfs-minipit1-7:pNFSpool1/p1DS1�����f
[root@pnfs-minipit1-5 ~]> echo "::walk mds_DS_guid_entry_cache|::print struct rfs4_dbe dbe_invalid" | mdb -k
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0 

Hmm, I need to figure out a better way to process UTF8 stings in mdb, but it appears to be there and in a valid entry. I couldn't figure this out until I lined up the dataset name output:

pnfs-minipit1-7:pNFSpool2/p2DS1
pnfs-minipit1-7:pNFSpool1/p2DS1

The naming convention shows the error, 'p2DS1' should be on pool #2, but it is on pool #1: pNFSpool1.

If this were implemented in SMF instead of a as a flat file, we would be able to do checking when we create the policy to catch this type of configuration error. But for now, DTrace and mdb can catch it.

For me, getting mdb to dump the dataset names was the hardest part. I had to experiment and search the web. I'm glad for all of the effort here, I'll use that trick again and again.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090827 Thursday August 27, 2009
Beating my head against named on Fedora!

I added 3 new subdomains to my home network for testing. I added the records to my chroot'ed named at /var/named/chroot/var/named/named.conf. I just did reverse pointers and I couldn't get it to work:

[root@adept var]# host 192.168.4.120
Host 120.4.168.192.in-addr.arpa. not found: 3(NXDOMAIN)

I did this with a simple Perl script, so I debugged the heck out of it and checked for tabs galore. I finally added forward lookups, which worked:

[root@adept var]# host blast-4-120
blast-4-120.internal.excfb.com has address 192.168.4.120

Heck, I've been burnt by a bad link in /etc before, so I checked it:

[root@adept var]# ls -al /etc/named.conf
lrwxrwxrwx 1 root named 21 2008-02-25 16:15 /etc/named.conf -> /var/named/named.conf
[root@adept var]# ls -la /var/named/named.conf
lrwxrwxrwx 1 root named 38 2008-02-25 16:24 /var/named/named.conf -> /var/named/chroot/var/named/named.conf

I even diff'ed them to be really, really sure. I ran named manually with '-g', fixed the warnings I got and then found out it didn't handle the chroot nicely. I looked at the init file and gave up on understanding it.

I couldn't find a log file for it, so I sent a SIGHUP to look for a database dump. I added logging to the config file and never saw any output. I never found that database dump.

But I did find an option that said where it should be:

       dump-file "/var/named/data/cache_dump.db";

I then asked myself, is there another copy of the config file?

[root@adept var]# ps -ef | grep named
named     4047     1  0 00:30 ?        00:00:00 /usr/sbin/named -u named -t /var/named/chroot
root      4207  2886  0 01:00 pts/3    00:00:00 grep named
[root@adept var]# cd /var
[root@adept var]# find . -name named.conf
./named/named.conf
./named/chroot/etc/named.conf
./named/chroot/var/named/named.conf
[root@adept var]# ls -la ./named/named.conf
lrwxrwxrwx 1 root named 38 2008-02-25 16:24 ./named/named.conf -> /var/named/chroot/var/named/named.conf
[root@adept var]# ls -la ./named/chroot/etc/named.conf
-rw-r--r-- 1 root named 2741 2008-02-25 20:49 ./named/chroot/etc/named.conf

Why yes, yes there is and it doesn't have my new zones!

[root@adept etc]# pwd
/var/named/chroot/etc
[root@adept etc]# mv named.conf named.conf.fracked
[root@adept etc]# ln -s ../var/named/named.conf .
[root@adept etc]# ls -la ../var/named/named.conf
-rw-r----- 1 root named 4920 2009-08-27 00:13 ../var/named/named.conf
[root@adept etc]# service named restart
Stopping named:                                            [  OK  ]
Starting named:                                            [  OK  ]
[root@adept etc]#
[root@adept etc]# host 192.168.4.120
120.4.168.192.in-addr.arpa domain name pointer blast-4-120.internal.excfb.com.

Now what was I doing before I fell down this rat hole?


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090816 Sunday August 16, 2009
Removes are failing, time to debug

Removes are failing and I don't think this is the classic issue we have had with this operation. I know that at the last BAT, the removes were succeeding enough to cause a panic. And we don't seem to be getting to that code even.

Snoop shows the client sending the remove, but the MDS is not sending anything to the DS. Barring the MDS sending to the wrong DS, I've put together a simple DTrace script to see what is going on:

[root@pnfs-17-24 ~]> more remove.d
#!/usr/sbin/dtrace -s

::do_ctl_mds_remove:return
{
        printf("rc = %d", args[1]);
}

::ctl_mds_clnt_remove_file:return
{
        printf("rc = %d", args[1]);
}

:::nfss-e-vop_fid_pseudo_failed
{
}

:::nfss-i-layout_is_null_cannot_remove
{
}

::mds_op_remove:entry
{
        self->resop = (nfs_resop4 *)arg1;
        self->resp = self->resop->nfs_resop4_u.opremove;
}

::mds_op_remove:return
{

        printf("rc = %d", self->resp.status);
}

I don't like the hoops I have to go through to get the status on mds_op_remove, but I have to do that because that function is of type void. I'm going to change them to return a status. It doesn't cost much and makes simple DTrace scripting easier.

Anyway, here is our result:

[root@pnfs-17-24 ~]> ./remove.d
dtrace: script './remove.d' matched 6 probes
CPU     ID                    FUNCTION:NAME
  0  30503 do_ctl_mds_remove:nfss-e-vop_fid_pseudo_failed 
  0  62050         do_ctl_mds_remove:return rc = 28
  0  62052             mds_op_remove:return rc = 0
^C

What is that 28?

[thud@ultralord nfssrv]> grep 28 /usr/include/sys/errno.h 
#define	ENOSPC	28	/* No space left on device		*/

Now I want to know what vop_fid_pseudo() is returning, but only when called in this context. I can use the thread properties like this:

::vop_fid_pseudo:entry
/self->live == 1/
{
        self->vp = (vnode_t *)arg0;
        self->fidp = (fid_t *)arg1;
}

::vop_fid_pseudo:return
/self->live == 1/
{
        printf("fid len = %d\trc = %d",
            self->fidp->un._fid.len, args[1]);
}

Only if self->live is set will I see output:

[root@pnfs-17-24 ~]> ./remove.d
dtrace: script './remove.d' matched 8 probes
CPU     ID                    FUNCTION:NAME
  0  62468            vop_fid_pseudo:return fid len = 10        rc = 28
  0  30503 do_ctl_mds_remove:nfss-e-vop_fid_pseudo_failed 
  0  62050         do_ctl_mds_remove:return rc = 28
  0  62052             mds_op_remove:return rc = 0
^C

And now I need to know where that error is set:

  49 vop_fid_pseudo(vnode_t *vp, fid_t *fidp)
  50 {
  51         struct vattr va;
  52         int error;
  53 
  54         error = VOP_FID(vp, fidp, NULL);
  55 
  56         /*
  57          * XXX nfs4_fid() does nothing and returns EREMOTE.
  58          * XXX nfs3_fid()/nfs_fid() returns nfs filehandle as its fid
  59          * which has a bigger length than local fid.
  60          * NFS_FH4MAXDATA is the size of
  61          * fhandle4_t.fh_xdata[NFS_FH4MAXDATA].
  62          *
  63          * Note: nfs[2,3,4]_fid() only gets called for diskless clients.
  64          */
  65         if (error == EREMOTE ||
  66             (error == 0 && fidp->fid_len > NFS_FH4MAXDATA)) {
  67 
  68                 va.va_mask = AT_NODEID; 
  69                 error = VOP_GETATTR(vp, &va, 0, CRED(), NULL);
  70                 if (error)
  71                         return (error);
  72 
  73                 fidp->fid_len = sizeof (va.va_nodeid);
  74                 bcopy(&va.va_nodeid, fidp->fid_data, fidp->fid_len);
  75                 return (0);
  76         }
  77 
  78         return (error);
  79 }

And that means I need to know what VOP_GETATTR() is returning. I'll pull the same trick as before, but I'll need to be looking at, well, I have to figure that out now, don't I?

The trick is to let the vnode tell me:

int
fop_getattr(
        vnode_t *vp,
...
        err = (*(vp)->v_op->vop_getattr)(vp, vap, flags, cr, ct);

So I need to find the pointer to the function:

        printf("vop_getattr = %p", self->vp->v_op->vop_getattr);

And then I need to find out what that is:

  0  62467             vop_fid_pseudo:entry vop_getattr = fffffffff883c180
...
[root@pnfs-17-24 ~]> mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs mpt sd sockfs ip hook neti sctp arp usba stmf fctl nca lofs idm cpc random zfs nfs fcip logindmux ptm sppp ]
> fffffffff883c180::dis
zfs_getattr:                    pushq  %rbp

Okay, we at least know which filesystem is complaining. I probably could have done this differently, but this way works. The only error in that function comes from zfs_zaccess():

  0  53373               zfs_zaccess:return rc = 0
  0  53125               zfs_getattr:return rc = 0
  0  53125               zfs_getattr:return rc = 0

Which means I was reading the code incorrectly! Note that these are being called by the code, my check forces that to be true, but the error must be elsewhere. It must be the VOP_FID() which is causing the issue!

4323 static int
4324 zfs_fid(vnode_t *vp, fid_t *fidp, caller_context_t *ct)
4325 {
4326         znode_t         *zp = VTOZ(vp);
4327         zfsvfs_t        *zfsvfs = zp->z_zfsvfs;
4328         uint32_t        gen;
4329         uint64_t        object = zp->z_id;
4330         zfid_short_t    *zfid;
4331         int             size, i;
4332 
4333         ZFS_ENTER(zfsvfs);
4334         ZFS_VERIFY_ZP(zp);
4335         gen = (uint32_t)zp->z_gen;
4336 
4337         size = (zfsvfs->z_parent != zfsvfs) ? LONG_FID_LEN : SHORT_FID_LEN;
4338         if (fidp->fid_len < size) {
4339                 fidp->fid_len = size;
4340                 ZFS_EXIT(zfsvfs);
4341                 return (ENOSPC);
4342         }

Some quick DTrace will tell me I'm correct:

::zfs_fid:entry
/self->live == 1/
{
        zp = (znode_t *)(self->vp)->v_data;
        zfsvfs = zp->z_zfsvfs;

        lfid = sizeof (zfid_long_t) - sizeof (uint16_t);
        sfid = sizeof (zfid_short_t) - sizeof (uint16_t);

        size = (zfsvfs->z_parent != zfsvfs) ? lfid : sfid;

        printf("fidlen %s size! (%d, %d, %d, %d)",
            self->fidp->un._fid.len < size ? "<" : "not <",
            self->fidp->un._fid.len, size, lfid, sfid);
}
...
  0  53162                    zfs_fid:entry fidlen < size! (0, 10, 20, 10)
  0  53163                   zfs_fid:return rc = 28

So is this new? Or something I introduced? I don't see how the code ever worked here. We pass a fid_t off the stack, i.e., uninitialized, into vop_fid_pseudo() and when we call zfs_fid(), that check should trigger every time. The comments on lines 57-61 lead me to think that vop_fid_pseudo() is not intended for types other than NFS. In any event, the other callers of it do:

makefh4(nfs_fh4 *fh, vnode_t *vp, struct exportinfo *exi)
...
        fid_t fid;
...
        bzero(&fid, sizeof (fid));
        fid.fid_len = MAXFIDSZ;

And the file should have been removed!

  0  67453                   zfs_fid:return rc = 0
  0  81693             vop_fid_pseudo:entry vop_getattr = fffffffff8088180
  0  67452                    zfs_fid:entry fidlen not < size! (64, 10, 20, 10)
  0  67453                   zfs_fid:return rc = 0
  0  81694            vop_fid_pseudo:return fid len = 10        rc = 0

From the snoop we can both see the MDS communicating with the DS and that I need to go teach snoop about a remove!

CTL-MDS:  ----- Sun CTL-MDS -----
CTL-MDS:
CTL-MDS:  Proc = 10 (Remove object(s) or entire fsid at the DS)
CTL-MDS:

While I fixed a valid bug (and we are past the code that was panicking at the BAT, I doubt I fixed the problem we had been seeing:

Before a huge copy:

[root@pnfs-17-22 ~]> zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
pnfs1                       9.35G  6.28G    23K  /pnfs1

After a huge copy:

[root@pnfs-17-22 ~]> zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
pnfs1                       12.5G  3.09G    23K  /pnfs1

And after the delete:

[root@pnfs-17-22 ~]> zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
pnfs1                       12.5G  3.09G    23K  /pnfs1

Note, the DS did report that it tried to delete the file:

[root@pnfs-17-22 ~]> ./remove.d 
dtrace: script './remove.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0  62434        ctl_mds_srv_remove:return status = 0
  1  62434        ctl_mds_srv_remove:return status = 0
  1  62434        ctl_mds_srv_remove:return status = 0
  0  62434        ctl_mds_srv_remove:return status = 0
  1  62434        ctl_mds_srv_remove:return status = 0
  1  62434        ctl_mds_srv_remove:return status = 0
  0  62434        ctl_mds_srv_remove:return status = 0

Amazingly enough, I had forgotten I was running that DTrace script. So the issue is that ctl_mds_srv_remove() reports success, but the file contents are still there.

I'm starting to wonder if I haven't hit my head on a known ZFS bug that is fixed in 118. The pNFS code is based on 117. I'll investigate.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090812 Wednesday August 12, 2009
Regression testing saves my bacon

So, my unit testing is all done and it is time to integrate. Bzzt! Wrong answer - we have to do regression testing. In our case, it is called pNFS/miniPIT and was mainly crafted by Helen Chao. And it was bombing on my new code.

First of all, the core was confusing:

[root@pnfs-minipit2-4 ~]> 
panic[cpu0]/thread=30008a61560: 000000047ad
BAD TRAP: type=31 rp=2a100f596d0 addr=14 mmu_fsr=0 occurred in module "nfssrv" due to a NULL pointer dereference

nfsd: trap type = 0x31
addr=0x14
pid=100528, pc=0x7afb3d64, sp=0x2a100f58f71, tstate=0x4414001604, context=0x29b
g1-g7: 7be42af4, 2, 2000, 0, 0, 29b, 30008a61560

The NULL pointer wasn't anything to do with the function being called. Secondly, I focused on trying to reproduce it inside the test harness - it wasn't until I made a simple test case that I progressed.

Third, I focused on the server being 32 bit instead of the real difference - it was not setup as a MDS - therefore it had no DSes reporting to it. I'll show why that is important:

4463         error = nfs41_spe_allocate(&spe_va, claddr,
4464             vp->v_path, &lc, TRUE);
4465         if (error) {
4466                 /*
4467                  * XXX: Until we get the SMF code
4468                  * in place, we handle all errors by
4469                  * using the default layout of the
4470                  * old prototype code
4471                  *
4472                  * At that point, we should return the
4473                  * given error.
4474                  *
4475                  * XXX: Any way *plo is NULL here?
4476                  */
4477                 *plo = mds_gen_default_layout(cs->instp);
4478 
4479                 /*
4480                  * Record the layout, don't get bent out of shape
4481                  * if it fails, we'll try again at checkstate time.
4482                  */
4483                 (void) mds_put_layout(*plo, vp);
4484 
4485                 return (NFS4_OK);
4486         }
4487 
4488         /*
4489          * XXX: Any way *plo is NULL here?
4490          */
4491         *plo = mds_add_layout(&lc);

The fourth, and major problem, is that I had valid comments and I ignored them. Yes, it was certainly valid for the *plo to be NULL after 4477 but probably not 4491. But they are both easy enough to code for!

The issue is that a NFSv4.1 server, i.e., no pNFS, also goes through this code path. We need to handle not having a layout:

4474                  * At that point, we should return the
4475                  * given error.
4476                  */
4477                 *plo = mds_gen_default_layout(cs->instp);
4478                 if (*plo == NULL) {
4479                         status = NFS4ERR_LAYOUTUNAVAILABLE;
4480                 } else {
4481                         /*
4482                          * Record the layout, don't get
4483                          * bent out of shape if it fails,
4484                          * we'll try again at checkstate time.
4485                          */
4486                         (void) mds_put_layout(*plo, vp);
4487                 }
4488 
4489                 return (status);

Which appears okay, except with just this change, files will not be created. We want to return NFS4ERR_LAYOUTUNAVAILABLE, mainly for DTrace probing, but in the caller, we want to do:

4892         } else {
4893                 status = mds_createfile_get_layout(req, vp, cs, &ct, plo);
4894 
4895                 /*
4896                  * Allow mds_createfile_get_layout() to be verbose
4897                  * in what it presents as a status, but be aware
4898                  * that it is permissible to not generate a
4899                  * layout.
4900                  */
4901                 if (status == NFS4ERR_LAYOUTUNAVAILABLE) {
4902                         status = NFS4_OK;
4903                 }
4904         }

Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090810 Monday August 10, 2009
Tracking down a nasty delegation return bug

I've been fixing up a refcnt bug I found in recall_read_delegations(). One of the things I found out while in Austin was how to unit test those changes. We need a NFSv4.0 and a NFSv4.1 client both reading a file from a NFSv4.1 server. Note that we can't have a pNFS server, due to we do not yet support a mount option to force a 4.1 over a 4.0 mount. And we may never.

Anyway, you have two clients with a read delegation and force a recall by touching the files from another client.

I know it is a good test case because the server consistently crashes:

panic[cpu0]/thread=ffffff02d457db40: kernel heap corruption detected

ffffff001054de90 genunix:kmem_error+4a9 ()
ffffff001054deb0 genunix:kmem_free+d0 ()
ffffff001054df30 nfssrv:recall_read_delegations+8e8 ()
ffffff001054df90 nfssrv:deleg_rd_setattr+45 ()
ffffff001054e000 genunix:vnext_setattr+84 ()
ffffff001054e060 nfssrv:deleg_rd_setattr+66 ()
ffffff001054e0e0 genunix:vhead_setattr+9f ()
ffffff001054e140 genunix:fop_setattr+ad ()

Firefox rules! My UPS just failed and I thought I lost all of this. Yeah Firefox for saving the day!

I added DTrace probes, printf() statements when the probes failed to fire, and finally delays in the code. I couldn't see anything. I finally decided to remove all of that code and worked on cleaning the original code up. There was a rfs4_file_t pointer (fp), array of pointers (fpa), and last active pointer (fap). I was getting confused.

I fixed all that and while the panic was still occurring, the code was easier to read. I finally started adding some ASSERTs and nailed down the problem!

If we look at the original code:

 89         /*
 90          * Is there more than one file structure for this vp?
 91          * Get the vsd for each instance of the server, if it exists.
 92          */
 93         fpa = kmem_zalloc((sizeof (rfs4_file_t *) * nsi_count), KM_SLEEP);
 94         DTRACE_NFSV4_1(rrd__i__fpa_alloc, int, nsi_count);
 95 
 96         fpa[0] = fp;
 97         cnt = 1;
 98         mutex_enter(&vp->v_lock);
 99         for (instp = list_head(&nsi_head); instp != NULL;
100             instp = list_next(&nsi_head, &instp->nsi_list)) {
101                 fpa[cnt] = (rfs4_file_t *)vsd_get(vp, instp->vkey);
102                 if (fpa[cnt] && (fpa[cnt] != fp))
103                         cnt++;
104         }
105         mutex_exit(&vp->v_lock);

I said that before 101, it should be the case that cnt is less than nsi_count. I added an ASSERT and it triggered right away. I looked back at the code and while I still thought the ASSERT was valid, I saw why it could trigger in a good code path. Consider a case where the number of items in the list is greater than nsi_count. Then consider that we immediately fill in the remaining items in fpa, leaving some items in the list. We then read an item, but we don't mean to store it.

That is where the bug actually comes in I believe in the author's mind. Either there never is more than nsi_count entries in the list or not considering there can be more valid ones. Another point here is that there is no way to guarantee that the first item returned in the list is fp.

My fix is this:

103         for (instp = list_head(&nsi_head); instp != NULL;
104             instp = list_next(&nsi_head, &instp->nsi_list)) {
105                 rfs4_file_t     *temp;
106 
107                 temp = (rfs4_file_t *)vsd_get(vp, instp->vkey);
108                 if (temp && (temp != fp)) {
109                         ASSERT(cnt < nsi_count);
110                         fpa[cnt++] = temp;
111                 }
112         }

It works as a great defensive fix, but I'm concerned it misses the bigger issue of we should either count the items first in the list for the allocation or we should determine why we have more entries than we expect. I'll have to confer with my colleague once he gets back from vacation...


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily
Code walk through was a success

The code walk through in Austin was great for me - I ended up resolving all of my open issues and several bugs were found. I need to do another round of unit testing and then start running our regression test suite.

I also got some great help in putting the spe daemon into smf:

[root@pnfs-17-24 nfs]> svcadm enable svc:/network/nfs/spe
[root@pnfs-17-24 nfs]> ps -ef | grep spe
    root 100882 100854   0 11:02:47 pts/1       0:00 grep spe
    root 100880 100004   0 11:02:30 ?           0:00 /usr/lib/nfs/sped

Not sure I have the dependencies right, but that may be because I have such a horribly hacked together test rig. I'll have to do a fresh install to make sure things start right the first time.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090802 Sunday August 02, 2009
Off to Austin for a code walkthrough

I'll be in Austin this week getting approval for the adoption of mds_sids to manage devices, layouts (in core and on disk), and file handles. Oh, and to add kspe!

For those of you interested, the code is over here: http://cr.opensolaris.org/~tdh/perse/. I've removed the spe debug printfs and added DTRACE probes as appropriate.

On my todo list is to perhaps remove some debug DTrace probes (don't see a real need for in shipping code, kinda just kernel printfs) and to determine whether kspe lives in the nfs or nfssrv module. I made a half-hearted attempt to stuff it back into the nfs module and got reminded why I had it in nfssrv in the first place. I.e., need to get at mds_sid info and to abstract the kspe globals such that we aren't in the way of zonification. That last is the kicker - everything in the nfs module needs to know everything about kspe and it gets messy.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090801 Saturday August 01, 2009
memleaks plugged!

The memleaks on the DSes are plugged:

> ::findleaks
CACHE             LEAKED           BUFCTL CALLER
ffffff01d459d5a0       1 ffffff01eae60088 cralloc_flags+0x21
ffffff01c682a860       1 ffffff01d75d65e8 dserv_mds_do_reportavail+0x210
ffffff01c68262e0       4 ffffff01f3c3f1c0 mds_compound+0x54
ffffff01c682b2e0       3 ffffff01ecc0e6e8 mds_compound+0x193
ffffff01c6828020       2 ffffff02032b4d30 mds_get_server_impl_id+0x30
ffffff01c68262e0       2 ffffff01f3c3f0e8 mds_get_server_impl_id+0x58
ffffff01c6826b20       2 ffffff01dfb46cc0 mds_get_server_impl_id+0x8a
ffffff01c6828860       1 ffffff01d7ba0740 modinstall+0x129
ffffff01c682b2e0       1 ffffff01ddfd1b60 modinstall+0x129
ffffff01c6828860       1 ffffff01d7ba0c50 modinstall+0x129
ffffff01c68265a0       2 ffffff01e159f890 tohex+0x32
ffffff01c6828020       2 ffffff01e3e6e398 xdr_array+0xae
ffffff01c68282e0       1 ffffff0201571818 xdr_bytes+0x70
ffffff01c68285a0       1 ffffff02590b2538 xdr_bytes+0x70
------------------------------------------------------------------------
           Total      24 buffers, 2356 bytes

And on the MDS, we see the same:

> ::findleaks
CACHE             LEAKED           BUFCTL CALLER
ffffff01c68262e0       4 ffffff01d6e497b8 mds_compound+0x54
ffffff01c682b2e0       3 ffffff01d69cdcb0 mds_compound+0x193
ffffff01c6828020       1 ffffff0202d71db8 mds_get_server_impl_id+0x30
ffffff01c6828020       1 ffffff03ea11d4c0 mds_get_server_impl_id+0x30
ffffff01c68262e0       2 ffffff01ebc1a008 mds_get_server_impl_id+0x58
ffffff01c6826b20       2 ffffff0202808588 mds_get_server_impl_id+0x8a
ffffff01c68265a0       2 ffffff01f92d2458 tohex+0x32
ffffff01c6828020       2 ffffff0202d71c08 xdr_array+0xae
ffffff01c68285a0       1 ffffff07c864e128 xdr_bytes+0x70
ffffff01c68282e0       1 ffffff02027ff390 xdr_bytes+0x70
------------------------------------------------------------------------
           Total      19 buffers, 1496 bytes

Besides that nasty interaction on the DS XDR code, we can see that I've fixed the rpc_init_taglist() leak on both systems. BTW - I thought that one may have been in Nevada, but I checked and it is only in the nfs41-gate. Sweet, that means I don't have to backport it into a gate which would cause another week of testing.

The code is ready to go, I have a code walk through next week, where I need to iron out the following issues:

  1. Do we leave in the print routines for spe debugging in the kernel?
    • Small footprint
    • But not needed
      • Could use dtrace to debug now that bulk of unit testing is done.
      • No need to print from kernel, may eventually want to write a file.
      • All printing can be done with sped.
  2. Is sped a loader or a daemon right now?
    • Only loads policies and nppols
    • Would have to re-run to change them
    • Not SMF-ized
  3. Biggest issue - is kspe part of nfssrv or nfs?
    • Put in nfssrv module because of need to get at mds_sids.
    • Not sure if client-side spe will ever work, because of issue with getting path from vnode.
    • Want to do this right, the first integration!

  4. Originally posted on Kool Aid Served Daily
    Copyright (C) 2009, Kool Aid Served Daily
One of those memory leaks is still there on the DS

Looks like my new code is not complete:


> ::findleaks
CACHE             LEAKED           BUFCTL CALLER
ffffff01c682a860       1 ffffff01d79523d0 dserv_mds_do_reportavail+0x210
ffffff01c68262e0       4 ffffff01ee2a9118 mds_compound+0x54
ffffff01c682b2e0       3 ffffff01e8721738 mds_compound+0x193
ffffff01c6828020       1 ffffff01f274dc00 mds_get_server_impl_id+0x30
ffffff01c68262e0       1 ffffff01e96acb40 mds_get_server_impl_id+0x58
ffffff01c6826b20       1 ffffff01e128de70 mds_get_server_impl_id+0x8a
ffffff01c6828860       1 ffffff01d87c66d8 modinstall+0x129
ffffff01c682b2e0       1 ffffff01ddb51748 modinstall+0x129
ffffff01c6828860       1 ffffff01d7f8f9b0 modinstall+0x129
ffffff01c6828020       1 ffffff01dd7db798 rpc_init_taglist+0x25
ffffff01c6828020   12741 ffffff01f180fdf8 rpc_init_taglist+0x25
ffffff01c6828020       1 ffffff01e3ede5e8 rpc_init_taglist+0x25
ffffff01c6828020       1 ffffff09ba2e4da0 rpc_init_taglist+0x25
ffffff01c6828020   23152 ffffff01e2e37cc8 rpc_init_taglist+0x25
ffffff01c68265a0       1 ffffff01ee74bd30 tohex+0x32
ffffff01c6828020       2 ffffff01d632b880 xdr_array+0xae
ffffff01c6828020       1 ffffff01fe11ec00 xdr_array+0xae
ffffff01c68282e0       1 ffffff01e5e4c2b0 xdr_bytes+0x70
ffffff01c68262e0    1659 ffffff01eaae3700 xdr_bytes+0x70
ffffff01c68285a0       1 ffffff01fe136de0 xdr_bytes+0x70
ffffff01c68262e0 1571800 ffffff01e7dde3b8 xdr_bytes+0x70
------------------------------------------------------------------------
           Total 1609375 buffers, 26900440 bytes
> ffffff01e7dde3b8$<bufctl_audit
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
ffffff01e7dde3b8 ffffff01ea091bc8     4d6b30e7ab91 ffffff01d8245b40
                 ffffff01c68262e0 ffffff01c6b37000 ffffff01cc88be60
                 kmem_cache_alloc_debug+0x283
                 kmem_cache_alloc+0xa9
                 kmem_alloc+0xa3
                 xdr_bytes+0x70
                 xdr_mds_sid+0x21
                 xdr_ds_fh_v1+0x68
                 xdr_ds_fh+0x3f
                 xdr_decode_nfs41_fh+0xdd
                 xdr_snfs_argop4+0x5e
                 xdr_COMPOUND4args_srv+0xf4
                 svc_authany_wrap+0x22
                 svc_cots_kgetargs+0x41
                 dispatch_dserv_nfsv41+0x5d
                 svc_getreq+0x20d
                 svc_run+0x197

By the way, those leaks of 1 or 2, those are probably active memory when I forced the core.

So this is the second bug I claimed to have fixed earlier today. Of note is that we never saw a panic, so something at least is correct. And, I decided to fix the rpc_init_taglist bug while I am at it.

I'm going to need to add some DTrace to track down what is happening here...

Aargh! I say, aargh! nfs4_xdr.c belongs to the nfs module and not the nfssrv module. For quick turn around, I've only been rebuilding nfssrv and not the whole kernel. It was only when just changing nfs_xdr.c and trying a dmake in src/uts/intel/nfssrv that I noticed nothing happened. My code may be golden after all! If it compiles that is.

Okay, I did some other changes, but here is my compiling code:

4059                 case OP_PUTFH: {
4060                         nfs_fh4 *obj = &array[i].nfs_argop4_u.opputfh.object;
4061 
4062                         if (obj->nfs_fh4_val == NULL)
4063                                 continue;
4064 
4065                         DTRACE_NFSV4_1(xdr__i__op_putfh_version, uint32_t,
4066                             minorversion);
4067                         if (minorversion != 0) {
4068                                 struct mds_ds_fh        *dsfh =
4069                                     (struct mds_ds_fh *)obj->nfs_fh4_val;
4070 
4071                                 DTRACE_NFSV4_1(xdr__i__op_putfh_type,
4072                                     nfs41_fh_type_t, dsfh->type);
4073 
4074                                 /*
4075                                  * Is it really a DS filehandle?
4076                                  */
4077                                 if (dsfh->type == FH41_TYPE_DMU_DS) {
4078                                         mds_sid *sid = &dsfh->fh.v1.mds_sid;
4079 
4080                                         DTRACE_NFSV4_1(xdr__i__op_putfh_sid,
4081                                             mds_sid *, sid);
4082 
4083                                         if (sid->val) {
4084                                                 kmem_free(sid->val, sid->len);
4085                                         }
4086                                 }
4087                         }
4088 
4089                         kmem_free(obj->nfs_fh4_val, obj->nfs_fh4_len);
4090                         continue;
4091                 }

And I added this simple DTrace script:

[root@pnfs-17-22 ~]> more ds.d 
#!/usr/sbin/dtrace -s

nfsv4:::xdr-i-op_putfh_version
{
        printf("xdr decode a FH -- version == %u",
            (uint32_t)arg0);
}

nfsv4:::xdr-i-op_putfh_type
{
        printf("xdr decode a FH -- type == %s",
            (int)arg0 == 2 ? "DS" : "regular");
}

nfsv4:::xdr-i-op_putfh_sid
{
        sid = (mds_sid *)arg0;

        printf("xdr decode a FH -- sid == %s",
            sid == NULL ? "(null)" : "valid");
}

Which shows:

[root@pnfs-17-22 ~]> ./ds.d
dtrace: script './ds.d' matched 3 probes
CPU     ID                    FUNCTION:NAME
  0   2834 xdr_snfs_argop4_free:xdr-i-op_putfh_version xdr decode a FH -- version == 1
  0   2833 xdr_snfs_argop4_free:xdr-i-op_putfh_type xdr decode a FH -- type == DS
  0   2832 xdr_snfs_argop4_free:xdr-i-op_putfh_sid xdr decode a FH -- sid == valid
  0   2834 xdr_snfs_argop4_free:xdr-i-op_putfh_version xdr decode a FH -- version == 1
  0   2833 xdr_snfs_argop4_free:xdr-i-op_putfh_type xdr decode a FH -- type == DS
  0   2832 xdr_snfs_argop4_free:xdr-i-op_putfh_sid xdr decode a FH -- sid == valid

But I still have to check back later to see if there are memory leaks!

I've been trying to show how you would use kmdb and ::findleaks to track down memory leaks. You need to do this with XDR code, even the machine generated stuff. You also need to do it before you integrate and not after. I've fixed two leaks that were pre-existing. They would probably go until either someone had a regression test session flunk because of accumulated memory leaks (the mds_sid leaks would do it) or we sat down to find them before shipping code.

The other thing about memory leaks is that you have to test after you fix them, you might find more, find out your fix didn't work, or find out your fix uncovered others.

And perhaps it is time to remind you of my other disclaimer, I don't hide my braindead mistakes. I show them in hopes that someone can learn from them - even if it is just me. :->


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily