We've had some interesting question on nfs-discuss lately about mountd and I thought I'd use an userland script I wrote, snoop, and DTrace to show some interesting properties about mountd.
My perl script will send UDP mount requests to a server and spoof the client IP. I want to control what I send and I'll sometimes spoof a request from a non-existent machine.
BTW - you will notice I don't talk about what the share is that much, unless I state otherwise, it is:
[root@silver ~]> share | grep tdh -@tank/home /export/zfs/tdh rw ""
We can try a host without a name mapping:
[tdh@slayer ~/src]> host 10.10.20.41 Host 41.20.10.10.in-addr.arpa. not found: 3(NXDOMAIN) [tdh@slayer ~/src]> sudo ./udp_raw.pl src_addr 10.10.20.42
Note that since we don't have a client to receive the reply, we'll snoop it:
30 0.01297 10.10.20.42 -> silver MOUNT3 C Mount /export/zfs/tdh 31 0.01712 silver -> thens.internal.excfb.com DNS C 42.20.10.10.in-addr.arpa. Internet PTR ? 32 0.05912 thens.internal.excfb.com -> silver DNS R Error: 3(Name Error) 33 0.00522 silver -> thens.internal.excfb.com DNS C 42.20.10.10.in-addr.arpa. Internet PTR ? 34 0.00056 thens.internal.excfb.com -> silver DNS R Error: 3(Name Error) 35 0.00777 silver -> 10.10.20.42 MOUNT3 R Mount Permission denied
BTW - right off the bat we can see that mountd tries to resolve the client IP. What happens if there is a reverse entry?
[tdh@slayer ~/src]> host 192.168.4.14 14.4.168.192.in-addr.arpa domain name pointer blast-4-14.internal.excfb.com. [tdh@slayer ~/src]> sudo ./udp_raw.pl src_addr 192.168.4.14
Well?
37 0.03198 blast-4-14.internal.excfb.com -> silver MOUNT3 C Mount /export/zfs/tdh 38 0.00089 silver -> thens.internal.excfb.com DNS C 14.4.168.192.in-addr.arpa. Internet PTR ? 39 0.00051 thens.internal.excfb.com -> silver DNS R 14.4.168.192.in-addr.arpa. Internet PTR blast-4-14.internal.excfb.com. 40 0.00290 silver -> blast-4-14.internal.excfb.com MOUNT3 R Mount OK FH=01CC Auth=unix
So a client has to have a reverse mapping from IP to host name before we allow a mount to succeed. And we can see that in the source code for usr/src/cmd/fs.d/nfs/mountd/mountd.c:
875 getclientsnames(transp, &nb, &clnames);
876 if (clnames == NULL || nb == NULL) {
877 /*
878 * We failed to get a name for the client, even 'anon',
879 * probably because we ran out of memory. In this situation
880 * it doesn't make sense to allow the mount to succeed.
881 */
882 error = EACCES;
883 goto reply;
884 }
What if I change the share to be:
[root@silver ~]> zfs set sharenfs=rw=blast4-14:blast4-15 tank/home/tdh
Will it work or fail?
[tdh@slayer ~/src]> sudo ./udp_raw.pl src_addr 192.168.4.14 [tdh@slayer ~/src]> sudo ./udp_raw.pl src_addr 192.168.4.15
Note that I wanted to show something warm in the cache and something cold:
24 0.03500 blast-4-14.internal.excfb.com -> silver MOUNT3 C Mount /export/zfs/tdh 25 0.00129 silver -> blast-4-14.internal.excfb.com MOUNT3 R Mount Permission denied 41 0.03706 blast-4-15.internal.excfb.com -> silver MOUNT3 C Mount /export/zfs/tdh (retransmit) 42 0.01419 silver -> thens.internal.excfb.com DNS C 15.4.168.192.in-addr.arpa. Internet PTR ? 43 0.00048 thens.internal.excfb.com -> silver DNS R 15.4.168.192.in-addr.arpa. Internet PTR blast-4-15.internal.excfb.com. 44 0.00089 silver -> blast-4-15.internal.excfb.com MOUNT3 R Mount Permission denied
Two points here, the DNS cache is not flushed when a share reloads, but the nfs auth cache must be. If it were not flushed, we would have gotten permission granted.
Okay, I can show you what is going on by this example:
[root@silver ~]> zfs set sharenfs=rw=blast4-14.internal.excfb.com:blast4-15 tank/home/tdh
And;
24 0.03769 blast-4-14.internal.excfb.com -> silver MOUNT3 C Mount /export/zfs/tdh 25 0.00122 silver -> blast-4-14.internal.excfb.com MOUNT3 R Mount Permission denied 41 0.03115 blast-4-15.internal.excfb.com -> silver MOUNT3 C Mount /export/zfs/tdh (retransmit) 42 0.00092 silver -> blast-4-15.internal.excfb.com MOUNT3 R Mount Permission denied
Okay, we should have gotten permission granted for the first!
[root@silver ~]> zfs set sharenfs=rw=blast4-17.internal.excfb.com:blast4-15 tank/home/tdh
And that fails as well:
20 0.03105 blast-4-17.internal.excfb.com -> silver MOUNT3 C Mount /export/zfs/tdh 21 0.00086 silver -> thens.internal.excfb.com DNS C 17.4.168.192.in-addr.arpa. Internet PTR ? 22 0.00056 thens.internal.excfb.com -> silver DNS R 17.4.168.192.in-addr.arpa. Internet PTR blast-4-17.internal.excfb.com. 23 0.00163 silver -> blast-4-17.internal.excfb.com MOUNT3 R Mount Permission denied
Ohhh! Smug mode off
Okay, I'm interested in what the function in_access_list will tell me. I've got a small DTrace script, which I iterated over the development off, which will let me know what is going on here:
!/usr/sbin/dtrace -Fs
/*
* Thanks to Peter Harvey;
* http://blogs.sun.com/peteh/entry/dereferencing_user_space_pointers_in
*
* # ./mountd.d `pgrep -x mountd`
*
*/
dtrace:::BEGIN
{
printf("Sampling... Hit Ctrl-C to end.\n");
}
pid$1::check_client_new:return
{
printf("Access permission is %d", arg1);
}
pid$1::in_access_list:entry
{
self->trace_me = 1;
printf("Access list is %s", copyinstr(arg2));
}
pid$1::in_access_list:return
{
self->trace_me = 0;
printf("Access permission is %d", arg1);
}
pid$1::strcasecmp:entry
/self->trace_me == 1/
{
printf("host vs list entry: |%s|, |%s|\n", copyinstr(arg0),
copyinstr(arg1));
}
pid$1::strcasecmp:entry
/self->trace_me == 1/
{
printf("Comparison is %d", arg1);
}
Note that I need to use strcasecmp because I can't iterate over the array. I get as a result:
[root@silver ~]> ./mount_trace.sh + pgrep -x mountd + /root/mountd.d 634 dtrace: script '/root/mountd.d' matched 6 probes CPU FUNCTION 0 | :BEGIN Sampling... Hit Ctrl-C to end. 0 -> in_access_list Access list is blast4-17.internal.excfb.com:blast4-15 0 -> strcasecmp host vs list entry: |blast4-17.internal.excfb.com|, |blast-4-17.internal.excfb.com| 0 | strcasecmp:entry Comparison is 135156944 0 | strcasecmp:entry host vs list entry: |blast4-15|, |blast-4-17.internal.excfb.com| 0 | strcasecmp:entry Comparison is 135156944 0 <- in_access_list Access permission is 0 ^C
D'oh! My share is wrong, wrong I say!
If instead I try:
[root@silver ~]> zfs set sharenfs=rw=blast-4-17.internal.excfb.com:blast-4-15 tank/home/tdh
We expect blast-4-15 to fail and blast-4-17 to succeed see:
28 0.03211 blast-4-15.internal.excfb.com -> silver MOUNT3 C Mount /export/zfs/tdh 29 0.00150 silver -> blast-4-15.internal.excfb.com MOUNT3 R Mount Permission denied 49 0.03460 blast-4-17.internal.excfb.com -> silver MOUNT3 C Mount /export/zfs/tdh (retransmit) 50 0.00087 silver -> blast-4-17.internal.excfb.com MOUNT3 R Mount OK FH=01CC Auth=unix
Which shows again, that you need an exact match and we don't append the domain name to end of the hostname. What would happen if added this to end of the server's /etc/hosts?
192.168.4.14 blast-4-15
I expect it to work. Does it?
[root@silver ~]> grep MOUNT xxx 24 0.03176 blast-4-15.internal.excfb.com -> silver MOUNT3 C Mount /export/zfs/tdh 27 0.00124 silver -> blast-4-15.internal.excfb.com MOUNT3 R Mount Permission denied 49 0.00160 blast-4-17.internal.excfb.com -> silver MOUNT3 C Mount /export/zfs/tdh (retransmit) 52 0.00098 silver -> blast-4-17.internal.excfb.com MOUNT3 R Mount OK FH=01CC Auth=unix
Hmm, I bet the name entry is cached, which I can show by trying one which is not cached:
[root@silver ~]> zfs set sharenfs=rw=blast-4-17.internal.excfb.com:blast-4-31 tank/home/tdh
And
[root@silver ~]> grep MOUNT xxx 24 0.03633 blast-4-17.internal.excfb.com -> silver MOUNT3 C Mount /export/zfs/tdh 27 0.00102 silver -> blast-4-17.internal.excfb.com MOUNT3 R Mount OK FH=01CC Auth=unix 45 0.02997 blast-4-31 -> silver MOUNT3 C Mount /export/zfs/tdh (retransmit) 46 0.00189 silver -> blast-4-31 MOUNT3 R Mount OK FH=01CC Auth=unix
And we can see the lack of caching because of two clues, the name output in snoop, i.e., "blast-4-31", and the lack of DNS activity between the two packets.
Some of the behaviour shocked me and I made some stupid mistakes that were hard to figure out what I had done. As an exercise in triage, it was great. I now have the beginnings of a DTrace debugging tool that I can point people at if they need some help. I'm very, very happy about that part!
One of the hardest things about developing software is testing it. As a developer, you get too close to the code and what you expect it to do. So I was very happy to get a question in email about why was a set of policies not working correctly:
I tried to reset the SPE policies to the following:
# cat /etc/policies.spe
101, 1, 8k, DS2, path == /diskpool/JUNK/TEST/P1
102, 2, 4k, DS1, path == /diskpool/JUNK/TEST/P2
103, 3, 2k, mix3, path == /diskpool/JUNK/TEST/P3
104, 4, 1k, mix4, path == /diskpool/JUNK/TEST/P4
#105, 5, 128k, default, path == /diskpool/JUNK/TEST/P5 || path == /export/JUNK
111, 5, 32k, default, path == /diskpool/JUNK/TEST/P5
112, 5, 64k, default, path == /export/JUNK
and have the client uses the path to verify the stripe. For stripe 1,2,3
things seem working OK; however when I tried with strip 4 or 5 with
the policy above, I got "nfsstat -l" to list stripe 10 and count 32k:
pnfs-minipit1-3:~#
pnfs-minipit1-3:~# /mntpit1-3: mount -o vers=4 pnfs-minipit1-5:/diskpool/JUNK/TEST/P5 /mnt
pnfs-minipit1-3:~# nfsstat -m /mnt
/mnt from pnfs-minipit1-5:/diskpool/JUNK/TEST/P5
Flags: vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=1048576,wsize=1048576,retrans=5,timeo=600
Attr cache: acregmin=3,acregmax=60,acdirmin=30,acdirmax=60
pnfs-minipit1-3:~# mkfile -v 1m /mnt/kk
/mnt/kk 1048576 bytes
pnfs-minipit1-3:~# nfsstat -l /mnt/kk
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 33
Layout [0]:
Layout obtained at: Sun Aug 30 00:23:47:199247 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 10, stripe unit: 32768
Stripe [0]:
tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
Stripe [1]:
tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
Stripe [2]:
tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
Stripe [3]:
tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
Stripe [4]:
tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
Stripe [5]:
tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
Stripe [6]:
tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
Stripe [7]:
tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
Stripe [8]:
tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
Stripe [9]:
tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
pnfs-minipit1-3:~#
What did I do wrong?
Before we even look at the issue, what did the person asking for help do right
In short, I'm confident that the person took all reasonable steps to solve the problem on their own.
So, now let's try to figure out what is going on. I have a DTrace script which will help out here. Ouch, I should have shared that earlier. Anyway, spe.d should help. I'll start it up and monitor the results as I do a file create:
[root@pnfs-minipit1-5 ~]> ./spe.d dtrace: script './spe.d' matched 8 probes CPU ID FUNCTION:NAME 1 3420 nfs41_spe_allocate:spe-i-check_open 200096 (10) from 10.1.233.191 is checking /diskpool/JUNK/TEST/P5/foollu2 1 3419 nfs41_spe_allocate:spe-i-policy_eval Policy 101 did not match with error 0 from 10.1.233.191 1 3419 nfs41_spe_allocate:spe-i-policy_eval Policy 102 did not match with error 0 from 10.1.233.191 1 3419 nfs41_spe_allocate:spe-i-policy_eval Policy 103 did not match with error 0 from 10.1.233.191 1 3419 nfs41_spe_allocate:spe-i-policy_eval Policy 104 did not match with error 0 from 10.1.233.191 1 3419 nfs41_spe_allocate:spe-i-policy_eval Policy 111 matched with error 0 from 10.1.233.191 1 63564 mds_ds_path_to_mds_sid:return mds_sid[0] = pnfs-minipit1-6:pNFSpool1/p1DS2 1 63564 mds_ds_path_to_mds_sid:return mds_sid[1] = pnfs-minipit1-6:pNFSpool2/p2DS2 1 63564 mds_ds_path_to_mds_sid:return mds_sid[2] = pnfs-minipit1-6:pNFSpool3/p3DS2 1 63564 mds_ds_path_to_mds_sid:return mds_sid[3] = pnfs-minipit1-7:pNFSpool1/p1DS1 1 63564 mds_ds_path_to_mds_sid:return ERROR - could not find a matching pgi for pnfs-minipit1-7:pNFSpool1/p2DS1 1 56851 nfs41_spe_allocate:return No matching policy
So we did find a matching policy, but we only found 5 datasets and the policy requires 5. So, why isn't that dataset in memory?
[root@pnfs-minipit1-5 ~]> echo "::walk mds_DS_guid_entry_cache|::print struct rfs4_dbe dbe_data|::print ds_guid_info_t ds_dataset_name.utf8string_val|::eval /s" | mdb -k 0xffffff01e73e5fc0: pnfs-minipit1-6:pNFSpool3/p3DS2�����f 0xffffff01e7227078: pnfs-minipit1-6:pNFSpool3/p3DS1�����f 0xffffff01e7227120: pnfs-minipit1-6:pNFSpool2/p2DS2�����f 0xffffff01e72271c8: pnfs-minipit1-6:pNFSpool2/p2DS1�����f 0xffffff01e7227270: pnfs-minipit1-6:pNFSpool1/p1DS2�����f 0xffffff01e7227318: pnfs-minipit1-6:pNFSpool1/p1DS1�����f 0xffffff01e5d62e70: pnfs-minipit1-7:pNFSpool2/p2DS2�����f 0xffffff01e6597628: pnfs-minipit1-7:pNFSpool2/p2DS1�����f 0xffffff01e4e40468: pnfs-minipit1-7:pNFSpool1/p1DS2�����f 0xffffff01e5f84cb8: pnfs-minipit1-7:pNFSpool1/p1DS1�����f [root@pnfs-minipit1-5 ~]> echo "::walk mds_DS_guid_entry_cache|::print struct rfs4_dbe dbe_invalid" | mdb -k dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0 dbe_invalid = 0
Hmm, I need to figure out a better way to process UTF8 stings in mdb, but it appears to be there and in a valid entry. I couldn't figure this out until I lined up the dataset name output:
pnfs-minipit1-7:pNFSpool2/p2DS1 pnfs-minipit1-7:pNFSpool1/p2DS1
The naming convention shows the error, 'p2DS1' should be on pool #2, but it is on pool #1: pNFSpool1.
If this were implemented in SMF instead of a as a flat file, we would be able to do checking when we create the policy to catch this type of configuration error. But for now, DTrace and mdb can catch it.
For me, getting mdb to dump the dataset names was the hardest part. I had to experiment and search the web. I'm glad for all of the effort here, I'll use that trick again and again.