A colleague came to me with an interesting question. They had a share on a server which allowed root access and could not get chown(1) to work via NFS. They wanted to know why our NFS server implementation denied it.
I immediately ran the test and it worked right off the bat! So now the question was what was different on the two systems? First of all, I tried it with a Linux client:
root@silent:~# mount slammer:/export/slammer /slammer/slammer/ root@silent:~# cd /slammer/slammer/ root@silent:/slammer/slammer# touch sil root@silent:/slammer/slammer# chown thud sil root@silent:/slammer/slammer# ls -la sil -rw-r--r--+ 1 thud root 0 2009-11-11 12:40 sil
To eliminate that as the culprit, I tried an OpenSolaris client:
[root@silver tdh]> mount slammer:/export/slammer /slammer/slammer/ [root@silver tdh]> cd /slammer/slammer/ [root@silver slammer]> touch ver [root@silver slammer]> chown thud ver chown: ver: Permission denied
While it appears to be an issue between OpenSolaris and Linux, it is far from the case. The issue is at the NFS protocol layer and I can easily show it by flipping the success on each client!
Working OpenSolaris:
[root@silver slammer]> cd [root@silver ~]> umount /slammer/slammer/ [root@silver ~]> mount -o vers=3 slammer:/export/slammer /slammer/slammer/ [root@silver ~]> cd /slammer/slammer/ [root@silver slammer]> touch ver_3 [root@silver slammer]> chown thud ver_3 [root@silver slammer]> ls -la ver_3 -rw-r--r-- 1 thud root 0 Nov 11 12:50 ver_3
Broken Linux:
root@silent:~# mount -t nfs4 slammer:/export/slammer /slammer/slammer/ root@silent:~# cd /slammer/slammer/ root@silent:/slammer/slammer# touch sil_4 root@silent:/slammer/slammer# chown thud sil_4 root@silent:/slammer/slammer# ls -la sil_4 -rw-r--r-- 1 thud 4294967294 0 2009-11-11 12:53 sil_4
But wait, you are saying it worked for Linux. Well, no, it did not. If we look at this on the server, we see:
slammer# ls -la sil* ver* -rw-r--r-- 1 2008 root 0 Nov 11 18:40 sil -rw-r--r-- 1 nobody root 0 Nov 11 18:53 sil_4 -rw-r--r-- 1 nobody nobody 0 Nov 11 18:44 ver -rw-r--r-- 1 2008 root 0 Nov 11 18:50 ver_3
So while the Linux client thinks that it worked, it did not. We can confirm that by remounting with NFSv3:
root@silent:/slammer/slammer# ls -la sil* ver* -rw-r--r--+ 1 thud root 0 2009-11-11 12:40 sil -rw-r--r--+ 1 4294967294 root 0 2009-11-11 12:53 sil_4 -rw-r--r--+ 1 4294967294 4294967294 0 2009-11-11 12:44 ver -rw-r--r--+ 1 thud root 0 2009-11-11 12:50 ver_3
And for fun, if we remount with NFSv4:
root@silent:/slammer/slammer# cd root@silent:~# umount /slammer/slammer/ root@silent:~# mount -t nfs4 slammer:/export/slammer /slammer/slammer/ root@silent:~# cd /slammer/slammer/ root@silent:/slammer/slammer# ls -la sil* ver* -rw-r--r-- 1 4294967294 4294967294 0 2009-11-11 12:40 sil -rw-r--r-- 1 4294967294 4294967294 0 2009-11-11 12:53 sil_4 -rw-r--r-- 1 4294967294 4294967294 0 2009-11-11 12:44 ver -rw-r--r-- 1 4294967294 4294967294 0 2009-11-11 12:50 ver_3
The first thing we normally check in such cases is the mapid domain:
slammer# cat /var/run/nfs4_domain internal.excfb.com [root@silver slammer]> cat /var/run/nfs4_domain internal.excfb.com root@silent:/slammer/slammer# cat /proc/net/rpc/nfs4.idtoname/content #domain type id [name]
So I need to get the Linux box in the right domain! Actually, I need to modify '/etc/default/nfs-common' on this Ubuntu system and tell it to run mapid!
Much better:
root@silent:/slammer/slammer# touch sil_4_2 root@silent:/slammer/slammer# chown thud sil_4_2 chown: changing ownership of `sil_4_2': Invalid argument
Now the Linux and OpenSolaris clients are agreeing to behave the same way with the server! What is the issue here?
In NFSv4, we change the owner with a SETATTR compound op:
NFS: Op = 34 (SETATTR) NFS: State ID hash = SPECIAL_0 NFS: len = 12 val = 000000000000000000000000 NFS: State ID Sequence ID = 00000000 NFS: 0x00 NFS: 0x00 NFS: 0x00 NFS: 0x00 NFS: 0x10 OWNER NFS: 0x00 NFS: 0x00 NFS: 0x00 NFS: Owner = thud@internal.excfb.com
And in NFSv3, we try with this:
NFS: Proc = 2 (Set file attributes) NFS: File handle = [86D5] NFS: A76A323408831F040A001500000000003F0C00000A000300000000002D000000 NFS: Mode = (not set) NFS: User ID = 2008 NFS: Group ID = (not set) NFS: Size = (not set) NFS: Access time = (do not set) NFS: Modification time = (do not set) NFS:
So in the NFSv4 case, we need to be able to map 'thud@internal.excfb.com' to an user id. If that mapping does not exist, then we disallow the action. In NFSv3, we already have the user id, and we don't care if it is undefined on the server:
slammer# grep thud /etc/passwd slammer# ls -la sil* ver* -rw-r--r-- 1 2008 root 0 Nov 11 18:40 sil -rw-r--r-- 1 nobody root 0 Nov 11 18:53 sil_4 -rw-r--r-- 1 root root 0 Nov 11 19:18 sil_4_2 -rw-r--r-- 1 nobody nobody 0 Nov 11 18:44 ver -rw-r--r-- 1 2008 root 0 Nov 11 18:50 ver_3
If we add the user id of thud, we see:
slammer# useradd -u 2009 -g 10 -d /export/slammer thud slammer# ls -la sil* ver* -rw-r--r-- 1 2008 root 0 Nov 11 18:40 sil -rw-r--r-- 1 nobody root 0 Nov 11 18:53 sil_4 -rw-r--r-- 1 root root 0 Nov 11 19:18 sil_4_2 -rw-r--r-- 1 nobody nobody 0 Nov 11 18:44 ver -rw-r--r-- 1 2008 root 0 Nov 11 18:50 ver_3
Yes, I did an off by one error on purpose! What happens if we try to chown(1) via NFSv4 now:
root@silent:/slammer/slammer# chown thud base root@silent:/slammer/slammer# ls -la base sil* ver* -rw-r--r-- 1 thud root 0 2009-11-11 13:29 base -rw-r--r-- 1 nobody root 0 2009-11-11 12:40 sil -rw-r--r-- 1 nobody root 0 2009-11-11 12:53 sil_4 -rw-r--r-- 1 root root 0 2009-11-11 13:18 sil_4_2 -rw-r--r-- 1 nobody nogroup 0 2009-11-11 12:44 ver -rw-r--r-- 1 nobody root 0 2009-11-11 12:50 ver_3
Those 'nobody's should be 'thud' and that 'thud' should not map! What does NFSv3 say:
root@silent:/slammer/slammer# ls -la base sil* ver* -rw-r--r--+ 1 2009 root 0 2009-11-11 13:29 base -rw-r--r--+ 1 thud root 0 2009-11-11 12:40 sil -rw-r--r--+ 1 4294967294 root 0 2009-11-11 12:53 sil_4 -rw-r--r--+ 1 root root 0 2009-11-11 13:18 sil_4_2 -rw-r--r--+ 1 4294967294 4294967294 0 2009-11-11 12:44 ver -rw-r--r--+ 1 thud root 0 2009-11-11 12:50 ver_3
The NFSv4 listing shows you the username space of the server and the NFSv3 listing shows you the username space of the client.
The big difference is that NFSv4 enforces that both systems understand the name. It doesn't mandate that the names have the same userid, just that they have a mapping.
To get everything in sync, to have both NFSv3 and NFSv4 work the same, use either NIS or LDAP to maintain your user mappings.
We've been seeing an intermittent bug where when we have a mirrormount and we unmount the parent, it fails to do so. We don't know why, but we thought we had a reproducible test case. It was in the miniPIT test suite. The problem with that is that it is huge and thus hard to debug.
At a first blush, this is easy to conceptualize:
[root@pnfs-18-09 ~]> mount -o vers=4 pnfs-18-11:/a_mnt /mnt [root@pnfs-18-09 ~]> cd /mnt [root@pnfs-18-09 /mnt]> cd stress/ [root@pnfs-18-09 stress]> nfsstat -m `pwd` /mnt/stress from pnfs-18-11:/a_mnt/stress Flags: vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,mirrormount,rsize=1048576,wsize=1048576,retrans=5,timeo=600 Attr cache: acregmin=3,acregmax=60,acdirmin=30,acdirmax=60 [root@pnfs-18-09 stress]> cd [root@pnfs-18-09 ~]> umount /mnt
We see that the mirrormount did succeed, but we also see that the unmount did as well.
It turned out that the test case we had was on UFS and not ZFS. I was wondering how the test suite could easily create a myriad of UFS filesystems on the fly - after all, there aren't that many slices available. My colleague Helen Chao showed me how we do it with mkfile and lofiadm:
[root@pnfs-18-11 ~]> mkfile 1G /export/FILE2
[root@pnfs-18-11 ~]> mkfile 1G /export/FILE1
[root@pnfs-18-11 ~]> lofiadm -a /export/FILE1
/dev/lofi/1
[root@pnfs-18-11 ~]> lofiadm -a /export/FILE2
/dev/lofi/2
[root@pnfs-18-11 ~]> newfs /dev/lofi/1
newfs: construct a new file system /dev/rlofi/1: (y/n)? y
/dev/rlofi/1: 2097000 sectors in 3495 cylinders of 1 tracks, 600 sectors
1023.9MB in 219 cyl groups (16 c/g, 4.69MB/g, 2240 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 9632, 19232, 28832, 38432, 48032, 57632, 67232, 76832, 86432,
2006432, 2016032, 2025632, 2035232, 2044832, 2054432, 2064032, 2073632,
2083232, 2092832
[root@pnfs-18-11 ~]> newfs /dev/lofi/2
newfs: construct a new file system /dev/rlofi/2: (y/n)? y
/dev/rlofi/2: 2097000 sectors in 3495 cylinders of 1 tracks, 600 sectors
1023.9MB in 219 cyl groups (16 c/g, 4.69MB/g, 2240 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 9632, 19232, 28832, 38432, 48032, 57632, 67232, 76832, 86432,
2006432, 2016032, 2025632, 2035232, 2044832, 2054432, 2064032, 2073632,
2083232, 2092832
[root@pnfs-18-11 ~]> mkdir /a_mnt
[root@pnfs-18-11 ~]> mount /dev/lofi/1 /a_mnt
[root@pnfs-18-11 ~]> mkdir /a_mnt/stress
[root@pnfs-18-11 ~]> mount /dev/lofi/2 /a_mnt/stress
[root@pnfs-18-11 ~]> share -F nfs /a_mnt
[root@pnfs-18-11 ~]> share -F nfs /a_mnt/stress
I probably overkilled the file size in the mkfile since I'm not writing to them, but in all, this is a very neat piece of hackery she does here. I really like how she follows the Unix paradigm of using small tools to do complex tasks.
If you've ever wondered how the NFS protocol is maintained, designed, discussed, etc, now is a great time to watch it happen live. James Lentini has proposed a method to allow client to initiate a file "copy" from one location to another on the same server. With the traditional method, the client has to read the file from the server and then write it back. With the COPY proposal, all of that overhead is removed.
That part is pretty pedestrian and not being debated. What is being debated is a server-to-server copy proposal. You can see the benefits here:
/
/
Existing approach / New COPY procedures
Client Source Destination / Client Source Destination
+ + + / + + +
| | | / | | |
1 |>--get FH------>| | / |>-COPY_NOTIFY-->| |
| /| | / | /| |
| / | | / | / | |
| / | | / | / | |
2 |<-----------+ | | / |<-----------+ | |
| | | / | | |
| | | / | | |
3 |>--get FH----------------->| / |>--COPY------------------->|
| | /| / | | /|
| | / | / | | / |
4 |<-----------------------+ | / |<-----------------------+ |
| | | / | | |
5 |>--read-------->| | / | |<--read--<|
| /| | / | |\ |
| / | | / | | \ |
| / | | / | | \ |
6 |<-----------+ | | / | | +----->|
| | | / | | |
7 |>--write------------------>| / | |<--read--<|
| | /| / | |\ |
| | / | / | | \ |
| | / | / | | \ |
8 |<----------------------+ | / | | +----->|
| | | / | | |
9 |>--read-------->| | / | |<--read--<|
| /| | / | |\ |
| / | | / | | \ |
| / | | / | | \ |
10 |<-----------+ | | / | | +----->|
| | | / | | |
Don't read anything into the length of a horizontal request, i.e., the read at time 5 is not twice as fast as the write at time 7. Assume that everything is just 1 network hop away.
Also, ignore that we can have multiple reads or writes in flight. In the end, you can see that we will issue almost 1/2 the number of over the wire calls if we let the servers handle the copy.
And the mailing list is full of lively discussion about how to do this - there isn't a disagreement that it is a bad thing to do. Instead the discussion is all about how to implement it.
Rather than bias you towards my view, you can go watch the action in either the thread archives or join the working group alias nfsv4 -- NFSv4 Working Group.
I decided to undertake the editing of RFC3530bis, which is a second effort on RFC3530. The intent is to make the document cleaner, but to not introduce any over-the-wire protocol changes. (We have NFSv4.1 for that.)
The main effort so far has been in taking the text version of RFC3530 and converting it to XML. The reason to do that is that it really makes it easy to organize the layout. Effectively, this is just like keeping my thesis in LaTeX. Plus I always love compiling a document. The previous version of the document had been maintained in troff and the known source was out of sync with the official document.
So now you can look at the documents here:
The first major change was the pulling of the XDR description of NFSv4.0 out of the main document and into a separate one. That allows for the easy creation of your own .x file from that document.
The NFS tests we employ at Connectathon are available as a tarball at NFS Testsuite. And they are all read-only. Someone somewhere has a SCCS repository for it.
All of this was setup some time ago, way before git, Mercurial, etc. We should be able to do better than this for managing this code. As the source tree is lost, we've had a problem updating changes. The last official change occurred in 2004.
So I created a Mercurial tree on hg.opensolaris.org. You can grab a copy of the source via:
hg clone ssh://anon@hg.opensolaris.org/hg/nfsv41/cthon
which does not assume you have an OpenSolaris account. You could update the tree with something along the lines of:
hg push -R /home/nfs4hg/src/cthon04 \ -e "ssh -q -F /home/nfs4hg/opensolaris/config"
which assumes you do have an OpenSolaris account and that in this case you have setup a secure and passwordless ssh key. (I'm not sure those last two really belong together.)
We've posted the talk schedule for Connectathon 2009. They will be presented for a 3 day period of Feb 23rd - 25th.
It is pretty much set, but you might see some minor adjustments being made.
The talks are open to the public and will probably be a bit different this year. In the past, they tended to focus on:
What did we do over the last year?
Whereas we are now trying to change the focus to
What are we going to do over the next year?
The change is subtle, but we expect less slides and more audience participation. The impact of that will be that the slides posted to the site (after the conference) will probably not stand on their own as well. In short, I'm urging you to get out and attend the talks.
So another patch set for a Linux Dynamic Pseudo Root has been submitted by Steve Dickson to the Linux NFSv4 mailing list:
The following patch series gives rpc.mountd the ability to allocate a dynamic pseudo root, so the 'fsid=0' export option is no longer required. This allows v2, v3 and v4 clients mounts without any changes to the server's exports list. One anomaly of the Linux NFS server is that it requires a pseudo root to be defined. Currently the only way a pseudo root can be defined is by setting the fsid to zero (i.e. fsid=0). So if we wanted to make v4 the default mounting version and have things just work like v2/v3 all of the existing exports configurations would have to change (i.e. a 'fsid=0' would have to be added) to support a v4 mounts, which, imho, is unacceptable. So this patch series address this problem.
Steve has really highlighted a huge gap in seamless integration of the Linux NFSv4 implementation into automounters, etc. The path to an export should not change based on the version of the protocol.
Hmm, strike that, from a re-reading of I'm not sure if this patch eliminates my concern about the mount path or not. I.e., above he talks about adding a 'fsid=0' on the server and not what the client has to do about the path.
Time to ask him!
Update: Steve says it does address the mount path issue we've seen in the past.
6751438 mirror mounted mountpoints panic when umounted, finally got through the code review process, the RTI process, etc. I wanted to get this into snv_101 because that is the candidate to become OpenSolaris 2008.11. I want a stable mirror mount experience out there for users.
The other bug fix I have in the works has passed stalled, which I'm really okay with at this point. This is the 6738223 Can not share a single IP address bug. Anyway, I've gotten two positive reviews and one reviewer asking questions. If I were tricky, I could try to submit an RTI, but the fact of the matter is that the third reviewer can still be satisfied.
Also, there is no way I want to integrate this into snv_101. While the fix is trivial, it can wait for snv_102. I.e., I don't see a business need to put it back right now.
Some of the key differences between the bugs and why one has to go in:
So I feel one is a strong must and the other is a nice to have. And I don't think 'nice to have' meets the entry criteria for snv_101.
I just announced on nfs41-discuss that the closed-binaries went live! See Mercurial Repository created.
When I configured my community, I followed all of the steps outlined at Setting up a pnfs community except for the mdsadm command on the MDS server:
[root@pnfs-9-10 ~]> mdsadm -o add -t auth -a ip=10.1.233.50 adding: IP Addr - 10.1.233.50
I experienced the following on my MDS console, which are being investigated but do not appear to be fatal:
[root@pnfs-9-14 ~]> Oct 3 17:00:11 pnfs-9-14 /usr/lib/nfs/nfsd[101025]: write failed for /var/nfs/v4_state/mds_/010.001.233.053-6448e6939a: write(669938620) returned -1 errno=14 ss_len=669938600 Oct 3 17:05:55 pnfs-9-14 nfssrv: NOTICE: op_destroy_session: SP4_NONE
And ditto for these on the DS console:
[root@pnfs-9-13 ~]> dservadm enable [root@pnfs-9-13 ~]> Oct 3 16:52:29 pnfs-9-13 dserv[101033]: bad cmd: 3 Oct 3 16:52:29 pnfs-9-13 last message repeated 1 time Oct 3 16:52:32 pnfs-9-13 dserv: WARNING: CLNT_CALL() ds protocol to mds failed: 5 Oct 3 16:52:32 pnfs-9-13 dserv[101033]: ioctl failed: I/O error sahre sahre: Command not found. [root@pnfs-9-13 ~]> share -@data/nfs4 /data/nfs4 anon=0,sec=sys,rw "" [root@pnfs-9-13 ~]> Oct 3 17:00:27 pnfs-9-13 /usr/lib/nfs/nfsd[101019]: write failed for /var/nfs/v4_state/mds_/010.001.233.053-6448e693c4: write(419178265) returned -1 errno=14 ss_len=419178245
The group has so much to do and it feels like so little time to do it. I don't think anyone just codes. I'm looking at my action list and it is all over the place:
[thud@adept src]> hg incoming comparing with ssh://anon@hg.opensolaris.org/hg/nfsv41/nfs41-gate searching for changes changeset: 7742:9fab48a31a4a tag: tip user: Thomas Haynesdate: Thu Oct 02 21:19:03 2008 -0500 summary: Test of push to osol [thud@adept src]> hg pull -u pulling from ssh://anon@hg.opensolaris.org/hg/nfsv41/nfs41-gate searching for changes adding changesets adding manifests adding file changes added 1 changesets with 1 changes to 1 files 1 files updated, 0 files merged, 0 files removed, 0 files unresolved
In usr/src/uts/common/fs/nfs/, there are some historical naming conventions that can help you understand where you are in the code:
You may end up in code which doesn't follow these conventions, i.e., the spe code I will be adding, some of the mirror mount code, etc. But if you are look at a stack trace, you would be able to see this type of code being called by a function following the above patterns.
Earlier I wrote about Debugging a refcnt. I'm still not convinced I have the answer, but I am convinced I'm on the right track. So I've made a some changes to the code base to make things better/correct:
291 static void
292 nfs4_ephemeral_tree_incr(nfs4_ephemeral_tree_t *net)
293 {
294 ASSERT(mutex_owned(&net->net_cnt_lock));
295 net->net_refcnt++;
296 ASSERT(net->net_refcnt != 0);
297 }
298
299 static void
300 nfs4_ephemeral_tree_hold(nfs4_ephemeral_tree_t *net)
301 {
302 mutex_enter(&net->net_cnt_lock);
303 nfs4_ephemeral_tree_incr(net);
304 mutex_exit(&net->net_cnt_lock);
305 }
...
1859 } else {
1860 nfs4_ephemeral_tree_incr(net);
1861 }
734 } else {
735 net = mi->mi_ephemeral_tree;
736 nfs4_ephemeral_tree_hold(net);
737
738 mutex_exit(&mi->mi_lock);
2006 if (was_locked == FALSE) 2007 mutex_exit(&net->net_tree_lock);
I've also introduced some key debug logic to help me figure out if my conjecture about what is happening is correct or not inside nfs4_ephemeral_umount_activate:
1732 void
1733 nfs4_ephemeral_umount_activate(mntinfo4_t *mi, bool_t *pmust_unlock,
1734 nfs4_ephemeral_tree_t **pnet)
1735 {
1736 nfs4_ephemeral_tree_t *checker = mi->mi_ephemeral;
1737
1738 /*
1739 * Now we need to get rid of the ephemeral data if it exists.
1740 */
1741 mutex_enter(&mi->mi_lock);
1742 if (mi->mi_ephemeral) {
1743 /*
1744 * If we are the root node of an ephemeral branch
1745 * which is being removed, then we need to fixup
1746 * pointers into and out of the node.
1747 */
1748 if (!(mi->mi_flags & MI4_EPHEMERAL_RECURSED))
1749 nfs4_ephemeral_umount_cleanup(mi->mi_ephemeral);
1750
1751 ASSERT(mi->mi_ephemeral != NULL);
1752
1753 kmem_free(mi->mi_ephemeral, sizeof (*mi->mi_ephemeral));
1754 mi->mi_ephemeral = NULL;
1755 }
1756 mutex_exit(&mi->mi_lock);
1757
1758 if (*pmust_unlock) {
1759 ASSERT(checker != *pnet);
1760 }
1761
1762 nfs4_ephemeral_umount_unlock(pmust_unlock, pnet);
1763 }
I didn't want to change 1753 to be an assignment to NULL, which would mimic the case for mi->mi_ephemeral_tree. And I don't want to change this code too much.
So I save off a local copy of mi->mi_ephemeral right off the bat. I allow the free and NULLing to occur. And then, only in the case where we must unlock (because the code is more likely to be valid in the other case), do I check to make sure that the saved address is different than the node we are about to unlock. Note that if the mi->mi_ephemeral pointer were NULL when we entered the function, we will not trigger here. Actually, if we do, we have more problems. :->
And if this assert does trigger, I'll have credible evidence I know what is going on and I'll have a good handle on how to fix it.
Now if my build would just finish...
Update:
Okay, I'm way off base and probably need to take a break from this:
../../common/fs/nfs/nfs4_stub_vnops.c:1736: warning: initialization from incompatible pointer type "../../common/fs/nfs/nfs4_stub_vnops.c", line 1736: warning: assignment type mismatch:
I realized earlier that the types were different, but I didn't let that sink in. The core is consistent:
> *ffffff0004f33bb8::print nfs4_ephemeral_tree_t
{
net_mount = 0xffffff0523c36000
net_root = 0
net_next = 0
net_tree_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_cnt_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_status = 0
net_refcnt = 0
}
> ffffff0524673000::print mntinfo4_t mi_ephemeral_tree | ::print nfs4_ephemeral_tree_t
{
net_mount = 0xffffff0523c36000
net_root = 0
net_next = 0
net_tree_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_cnt_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_status = 0
net_refcnt = 0
}
The mntinfo4 does point to the tree.
Okay, I took my first advice to add more comments and I have a new piece of code to question in nfs4_ephemeral_umount:
1763 int is_recursed = FALSE;
1764 int was_locked = FALSE;
...
1770 *pmust_unlock = FALSE;
...
1817 if (!is_recursed) {
...
1845 was_locked = TRUE;
1846 } else {
1847 net->net_refcnt++;
1848 ASSERT(net->net_refcnt != 0);
...
1859 if (was_locked == FALSE) {
...
1866 if (mutex_tryenter(&net->net_tree_lock)) {
1867 *pmust_unlock = TRUE;
1868 } else {
So, if !is_recursed, then we can end up on line 1847 to bump the refcnt. At this point, *pmust_unlock is FALSE and was_locked is also FALSE. If we grab the net->net_tree_lock right away on 1866, then crud, it is okay.
But the other case is not. Assume that is_recursed is TRUE. That means we do not bump the refcnt and at 1859, the following values hold: was_locked is FALSE and *pmust_unlock is FALSE. Now if we grab the mutex right away, we end up with *pmust_unlock as TRUE. In my examination of the code, that then implies we *must* drop the refcnt at some exit point. But we just saw that we never grabbed it in this scenario.
This would account for us trying to rele a refcnt we never held. We return TRUE to nfs4_free_mount and it sends this into nfs4_ephemeral_umount_activate and thus into nfs4_ephemeral_umount_unlock.
I'm going to introduce more state in a *pmust_rele to keep track of whether I need to rele or not.
A refcnt is a way to keep track of how many outstanding holds are being used on a data structure. Say that you need to reference something in the data structure, but you don't want to do it while holding a lock. You do want to make sure that the data is still there when you go back. What you do is bump the refcnt and have the cleanup code never free that data until the refcnt goes back to 0.
Some issues are what happens if it never goes back to 0? Or if it accidentally goes back to 0 before you are done with it? The answers are that you leak memory and crash, in that order.
I'm debugging the scenario where the refcnt goes back to 0 before you release your hold. Before I dive into looking at that, let's look at the relevant code in the mirror mount source: usr/src/uts/common/fs/nfs/nfs4_stub_vnops.c (Note that this code will change over time, I'm linking to the copy of the gate.)
We bump the count here:
291 static void
292 nfs4_ephemeral_tree_hold(nfs4_ephemeral_tree_t *net)
293 {
294 mutex_enter(&net->net_cnt_lock);
295 net->net_refcnt++;
296 ASSERT(net->net_refcnt != 0);
297 mutex_exit(&net->net_cnt_lock);
298 }
The first thing to note is that we grab net_cnt_lock before we do anything. How do we even know that is valid? Well, we grab a lock on the mountinfo and check to see if there is an ephemeral node:
696 mutex_enter(&mi->mi_lock);
...
701 if (mi->mi_ephemeral_tree == NULL) {
...
727 } else {
728 net = mi->mi_ephemeral_tree;
729 mutex_exit(&mi->mi_lock);
730
731 nfs4_ephemeral_tree_hold(net);
Oh, that is very naughty indeed - what I said and what the code says disagree. What do other parts of the code say?
1422 mutex_enter(&mi->mi_lock); ... 1444 nfs4_ephemeral_tree_hold(net); ... 1528 mutex_exit(&mi->mi_lock);
And a special case in nfs4_ephemeral_harvest_forest
2098 mutex_enter(&ntg->ntg_forest_lock);
2099 for (net = ntg->ntg_forest; net != NULL; net = next) {
2100 next = net->net_next;
2101
2102 nfs4_ephemeral_tree_hold(net);
I expect the code at 2102 to not hold the mntinfo4. We don't have our hands on that info at the time. But I believe the code at 731 to be a bug that will eventually bit us.
But back to the hold code:
295 net->net_refcnt++;
296 ASSERT(net->net_refcnt != 0);
We bump the refcnt and immediately assert that it is not 0. As refcnt is an unsigned int, this will catch the case where we wrap over the size of an integer. We don't ever expect to hit this case.
It is in the rele case that we are more concerned:
300 /*
301 * We need a safe way to decrement the refcnt whilst the
302 * lock is being held.
303 */
304 static void
305 nfs4_ephemeral_tree_decr(nfs4_ephemeral_tree_t *net)
306 {
307 ASSERT(mutex_owned(&net->net_cnt_lock));
308 ASSERT(net->net_refcnt != 0);
309 net->net_refcnt--;
310 }
311
312 static void
313 nfs4_ephemeral_tree_rele(nfs4_ephemeral_tree_t *net)
314 {
315 mutex_enter(&net->net_cnt_lock);
316 nfs4_ephemeral_tree_decr(net);
317 mutex_exit(&net->net_cnt_lock);
318 }
First note that either we hold the lock ourself and call nfs4_ephemeral_tree_decr or we let nfs4_ephemeral_tree_rele grab the lock for us. The assert on 307 checks that this is valid. The assert on 308 checks to see that we are not already at 0 when we decrement. I.e., because the refcnt is unsigned, once we drop below 0, we can not do a check like:
ASSERT(net->net_refcnt >= 0);
By the way, we do not need to be holding mi->mi_lock everywhere we rele or decr. The reason we need to when we hold is that we need to know the memory is valid. With the other two operations, by definition, the memory must be valid.
To prove that statement, we need to show that the ephemeral tree can not be disassociated from the mntinfo4 without the mi->mi_lock being held. The only place we do that is in nfs4_ephemeral_umount:
2003 /* 2004 * At this point, the tree should no 2005 * longer be associated with the 2006 * mntinfo4. We need to pull it off 2007 * there and let the harvester take 2008 * care of it once the refcnt drops. 2009 */ 2010 mutex_enter(&mi->mi_lock); 2011 mi->mi_ephemeral_tree = NULL; 2012 mutex_exit(&mi->mi_lock);
Okay, further validation that we should have the lock held at 731.
The final piece of the puzzle is what happens in the harvester? I.e., we decided in 2011 to disassociate the tree with the mntinfo4. But we didn't release the memory. How do we do that? Well, we look in nfs4_ephemeral_harvest_forest and we also see why we don't care about holding mi->mi_lock there.
2098 mutex_enter(&ntg->ntg_forest_lock);
2099 for (net = ntg->ntg_forest; net != NULL; net = next) {
2100 next = net->net_next;
2101
2102 nfs4_ephemeral_tree_hold(net);
When we hold ntg->ntg_forest_lock, no other thread is allowed to manipulate the forest of ephemeral trees. What that means is that the current forest is locked and the memory is not going away. Hence we don't need to hold mi->mi_lock. The other code has to hold it because the reference in the mntinfo4 can go away (see line 2011).
2140 while (e) {
...
2247 e = prior;
2248 }
2249
2250 check_done:
2251
2252 /*
2253 * At this point we are done processing this tree.
2254 *
2255 * If the tree is invalid and we are the only reference
2256 * to it, then we push it on the local linked list
2257 * to remove it at the end. We avoid that action now
2258 * to keep the tree processing going along at a fair clip.
2259 *
2260 * Else, even if we are the only reference, we drop
2261 * our hold on the current tree and allow it to be
2262 * reused as needed.
2263 */
2264 mutex_enter(&net->net_cnt_lock);
2265 if (net->net_refcnt == 1 &&
2266 net->net_status & NFS4_EPHEMERAL_TREE_INVALID) {
2267 nfs4_ephemeral_tree_decr(net);
2268 net->net_status &= ~NFS4_EPHEMERAL_TREE_LOCKED;
2269 mutex_exit(&net->net_cnt_lock);
2270 mutex_exit(&net->net_tree_lock);
2271
2272 if (prev)
2273 prev->net_next = net->net_next;
2274 else
2275 ntg->ntg_forest = net->net_next;
2276
2277 net->net_next = harvest;
2278 harvest = net;
2279 continue;
2280 }
2281
2282 nfs4_ephemeral_tree_decr(net);
2283 net->net_status &= ~NFS4_EPHEMERAL_TREE_LOCKED;
2284 mutex_exit(&net->net_cnt_lock);
2285 mutex_exit(&net->net_tree_lock);
2286
2287 prev = net;
2288 }
2289 mutex_exit(&ntg->ntg_forest_lock);
2248 completes the a while processing loop started in 2140 above. Among other things, this loop decides if a mirror mount has been idle long enough to automatically release it. It does this in a recursion elimination to visit every child and sibling node of the root of the ephemeral tree.
The interesting code starts on lines 2265 and 2266. If the refcnt is 1, then the hold we established on 2102 is the only reference to this node. Then, if the tree has been invalidated, either by the harvest loop above (this time or earlier) or by a manual unmount, then we know it is safe to release this node. We decrement the refcnt at 2267. This is important because if we leave it at 1 and push it on the harvest list, then another node can do a rele and we think it is okay. We remove it from the forst in lines 2272 to 2275 and push it onto a local linked list harvest on lines 2277 and 2278. We then go on to process the next node.
When we have finished all trees in the forest, they are all in one of 3 states:
With all of that background out of the way, let us look at a coredump:
> $c
vpanic()
assfail+0x7e(fffffffff8490b98, fffffffff8490b70, 134)
nfs4_ephemeral_tree_decr+0x64(ffffff020db7fc18)
nfs4_ephemeral_umount_unlock+0x3f(ffffff0004f33c14, ffffff0004f33bb8)
nfs4_ephemeral_umount_activate+0x7b(ffffff0524673000, ffffff0004f33c14, ffffff0004f33bb8)
nfs4_free_mount+0x2fe(ffffff047ae40dc8, 400, ffffff01974f2b28)
nfs4_free_mount_thread+0x24(ffffff016623a9f8)
thread_start+8()
> *ffffff0004f33bb8::print nfs4_ephemeral_tree_t
{
net_mount = 0xffffff0523c36000
net_root = 0
net_next = 0
net_tree_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_cnt_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_status = 0
net_refcnt = 0
}
> ffffff0004f33c14::print boolean_t
1 (B_TRUE)
This is the assert on line 308. Some things we can tell right away is that this should be a manual unmount, since we came through nfs4_free_mount and not the harvester. The flags are 0x400, which means MS_FORCE - a forced umount call. Also, while the refcnt is 0, it has not been harvested. In that case, we would expect to see that the state would include NFS4_EPHEMERAL_TREE_INVALID. And indeed, the mntinfo4 (see net_mount) does still point to this node:
> ffffff0523c36000::print mntinfo4_t mi_ephemeral_tree
mi_ephemeral_tree = 0xffffff020db7fc18
> ffffff0523c36000::print mntinfo4_t mi_ephemeral_tree | ::print nfs4_ephemeral_tree_t
{
net_mount = 0xffffff0523c36000
net_root = 0
net_next = 0
net_tree_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_cnt_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_status = 0
net_refcnt = 0
}
So, who did us in? Well, the first thought I had was this piece of code in nfs4_ephemeral_umount : (which we would have just called)
1990 nfs4_ephemeral_tree_decr(net); 1991 mutex_exit(&net->net_cnt_lock); 1992 1993 if (was_locked == FALSE) 1994 mutex_exit(&net->net_tree_lock);
We only hold the refcnt in here if we do not set was_locked to TRUE:
1764 int was_locked = FALSE;
...
1845 was_locked = TRUE;
1846 } else {
1847 net->net_refcnt++;
1848 ASSERT(net->net_refcnt != 0);
1849 }
1850
1851 mutex_exit(&net->net_cnt_lock);
Hmm, we also avoid using nfs4_ephemeral_tree_hold. We should probably create a symmetric function for incr like we did for decr.
Anyway, I fixed the code at 19909 and we still see the panic. The new one has the same signature. By the way, I found this probable error by matching holds to reles. I stopped with just the simple cases when I found the above issue.
Okay, what does nfs4_free_mount() do?
4039 /* 4040 * If we got an error, then do not nuke the 4041 * tree. Either the harvester is busy reclaiming 4042 * this node or we ran into some busy condition. 4043 * 4044 * The harvester will eventually come along and cleanup. 4045 * The only problem would be the root mount point. 4046 * 4047 * Since the busy node can occur for a variety 4048 * of reasons and can result in an entry staying 4049 * in df output but no longer accessible from the 4050 * directory tree, we are okay. 4051 */ 4052 if (!nfs4_ephemeral_umount(mi, flag, cr, 4053 &must_unlock, &eph_tree)) 4054 nfs4_ephemeral_umount_activate(mi, &must_unlock, 4055 &eph_tree); 4056
So we call nfs4_ephemeral_umount, which had that suspect piece of code. We know that we will not set NFS4_EPHEMERAL_TREE_LOCKED ourselves. We can see that it was not held, because pmust_unlock was set to TRUE. Also, the status tells us we did not mark the tree as NFS4_EPHEMERAL_TREE_INVALID and the fact that the mntinfo4 pointer is valid tells us we did not disassociate the tree from it. All of which tells me fixing line 1990 would probably have no impact on the execution path. I.e., I'm starting to suspect we don't have a problem with a race case.
The only way to return 0 and not disassociate the tree is to go down this path back in nfs4_ephemeral_umount:
1945 if (!is_derooting) {
1946 /*
1947 * Only work on children if the caller has not already
1948 * done so.
1949 */
1950 if (!is_recursed) {
1951 ASSERT(eph != NULL);
1952
1953 error = nfs4_ephemeral_unmount_engine(eph,
1954 FALSE, flag, cr);
1955 if (error)
1956 goto is_busy;
1957 }
1958 } else {
And as a matter of fact, the evidence suggests that no other code did this either. nfs4_ephemeral_unmount_engine must have been error free.
So either we have decremented the refcnt twice for our one hold or we decremented when we did not have a hold.
Heh, I've gone down many a code branch, only to backtrack as I found the code was correct. Perhaps we should return to our debug session? nfs4_ephemeral_umount_activate takes a pointer to a mntinfo4 and a pointer to a pointer of a nfs4_ephemeral_tree_t:
nfs4_ephemeral_umount_activate+0x7b(ffffff0524673000, ffffff0004f33c14, ffffff0004f33bb8)
> *ffffff0004f33bb8::print nfs4_ephemeral_tree_t
{
net_mount = 0xffffff0523c36000
net_root = 0
net_next = 0
net_tree_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_cnt_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_status = 0
net_refcnt = 0
}
> ffffff0524673000::print mntinfo4_t mi_ephemeral
mi_ephemeral = 0
The ephemeral node is not pointing to the same thing as the function. And I have more faith in the function, because it had to just execute:
1733 if (mi->mi_ephemeral) {
1734 /*
1735 * If we are the root node of an ephemeral branch
1736 * which is being removed, then we need to fixup
1737 * pointers into and out of the node.
1738 */
1739 if (!(mi->mi_flags & MI4_EPHEMERAL_RECURSED))
1740 nfs4_ephemeral_umount_cleanup(mi->mi_ephemeral);
1741
1742 ASSERT(mi->mi_ephemeral != NULL);
1743
1744 kmem_free(mi->mi_ephemeral, sizeof (*mi->mi_ephemeral));
1745 mi->mi_ephemeral = NULL;
1746 }
1747 mutex_exit(&mi->mi_lock);
1748
1749 nfs4_ephemeral_umount_unlock(pmust_unlock, pnet);
I.e., if we pass in a mntinfo4, either the ephemeral tree is already NULL or we free it and NULL it out. And oh crap, if mi->mi_ephemeral was a pointer to the same memory location as *pnet, we just screwed the pooch and caused a core. I.e., the only place we should be removing this memory is at:
2010 mutex_enter(&mi->mi_lock); 2011 mi->mi_ephemeral_tree = NULL; 2012 mutex_exit(&mi->mi_lock);
Well, this entry is long enough as it is. I'll convince myself that removing 1744 is the correct thing to do and then test that fix. Note, this kmem_free might explain why systems do not always core. If the memory is garbage, all it has to be is non-zero for the assert to not trigger. Whether or not other nasty things could have happened is to be seen later.
UPDATE: Hmm, one is a mi->mi_ephemeral and the other is a mi->mi_ephemeral_tree. The end result may stand, but my analysis may be off. Back with more later!
So all of my blog entries from this week were from a BakeAThon or BAT. This one was at the Sun offices in Austin, TX.
Here is a picture of Jeff Smith and I hard at work:
You can see Rob Thurlow getting a bagel and Karen Rochford running some tests for a colleague who couldn't attend.
Jeff is helping me understand why I was getting crap in the vattr structure which was supplying the gid and uid. I had the code running before and it worked. Fast forward 2 months and it does not work. It may have been working because I was testing as root and the random garbage was being initialized to 0. I should have kept some test logs and I shouldn't have had such a long layoff.
Anyway, I was talking to Jeff because he had made some changes since my last time to test. We decided that the vattr structure I was grabbing was no longer valid, even within the same structure. We get an open create call, we pull the credentials from the RPC layer, and we create the file. I could see that the UID and GID made it to disk correctly, but they were garbage when I was trying to match them against a policy.
I added the following code to make sure I had the correct vattr:
vattr_t spe_va;
spe_va.va_mask = AT_GID|AT_UID;
error = VOP_GETATTR(vp, &spe_va, 0, cs->cr, &ct);
if (error) {
/*
* TDH: Need to do proxy IO.
*/
return(status);
}
We first make sure we are going to get the uid and gid. Then we call a getattr on the vnode. While we should have this information already, the vnode should be in memory and this should be fast. Hmm, perhaps we will need to revisit this thought to make sure we don't ever go to disk.
Anyway, it works and that is what matters with a prototype. Right?
At a previous employer, I had worked with someone who was incredulous that Sun's implementation rpcgen(1) did not eliminate tail recursion. After all, he had worked there and had that code change sitting in a workspace when he left.
So at the time, I manually fixed a XDR call that was blowing the stack.
Since then, I've occasionally had to relearn XDR and rpcgen. And I've always assumed that I had to manually do the tail recursion code. The other day I was in a hurry and I decide to generate output from both Linux and Solaris to see the differences. I thought that surely the Linux implementation would be optimal.
We can see there is a simple linked-list which would be NULL terminated:
#define SPED_MAX_GUUIDS 8
typedef struct spe_policy {
uint_t sp_id;
uint_t sp_stripe_count;
uint_t sp_interlace;
spe_interior_t *sp_attr_expr;
char *sp_name;
uint64_t sp_guuids[SPED_MAX_GUUIDS];
struct spe_policy *next;
} spe_policy_t;
I was looking at the diff and I was struck by how simple and elegant the few lines of Linux generated C code were:
bool_t
xdr_spe_policy (XDR *xdrs, spe_policy *objp)
{
register int32_t *buf;
int i;
if (!xdr_uint32_t (xdrs, &objp->sp_id))
return FALSE;
if (!xdr_count4 (xdrs, &objp->sp_stripe_count))
return FALSE;
if (!xdr_uint32_t (xdrs, &objp->sp_interlace))
return FALSE;
if (!xdr_pointer (xdrs, (char **)&objp->sp_attr_expr, sizeof (spe_interior), (xdrproc_t) xdr_spe_interior))
return FALSE;
if (!xdr_pointer (xdrs, (char **)&objp->sp_name, sizeof (char), (xdrproc_t) xdr_char))
return FALSE;
if (!xdr_vector (xdrs, (char *)objp->sp_guuids, SPED_MAX_GUUIDS,
sizeof (uint64_t), (xdrproc_t) xdr_uint64_t))
return FALSE;
if (!xdr_pointer (xdrs, (char **)&objp->next, sizeof (spe_policy), (xdrproc_t) xdr_spe_policy))
return FALSE;
return TRUE;
}
And I was looking at the Solaris generated output and was worried about the apparent complexity:
bool_t
xdr_spe_policy(XDR *xdrs, spe_policy_t *objp)
{
rpc_inline_t *buf;
spe_policy_t *tmp_spe_policy;
bool_t more_data = TRUE;
bool_t first_objp = TRUE;
if (xdrs->x_op == XDR_DECODE) {
while (more_data) {
if (!xdr_uint32_t(xdrs, &objp->sp_id))
return (FALSE);
...
if (!xdr_bool(xdrs, &more_data))
return (FALSE);
if (!more_data) {
objp->next = NULL;
break;
}
if (objp->next == NULL) {
objp->next = (spe_policy_t *)
mem_alloc(sizeof (spe_policy_t));
if (objp->next == NULL)
return (FALSE);
bzero(objp->next, sizeof (spe_policy_t));
}
objp = objp->next;
}
} else if (xdrs->x_op == XDR_ENCODE) {
while (more_data) {
if (!xdr_uint32_t(xdrs, &objp->sp_id))
return (FALSE);
...
if (!xdr_vector(xdrs,
(char *)objp->sp_guuids,
SPED_MAX_GUUIDS,
sizeof (uint64_t),
(xdrproc_t)xdr_uint64_t))
return (FALSE);
objp = objp->next;
if (objp == NULL)
more_data = FALSE;
if (!xdr_bool(xdrs, &more_data))
return (FALSE);
}
} else {
while (more_data) {
if (!xdr_uint32_t(xdrs, &objp->sp_id))
return (FALSE);
...
if (!xdr_vector(xdrs,
(char *)objp->sp_guuids,
SPED_MAX_GUUIDS,
sizeof (uint64_t),
(xdrproc_t)xdr_uint64_t))
return (FALSE);
tmp_spe_policy = objp;
objp = objp->next;
if (objp == NULL)
more_data = FALSE;
if (!first_objp)
mem_free(tmp_spe_policy_t,
sizeof (spe_policy_t));
else
first_objp = FALSE;
}
}
return (TRUE);
}
I didn't look at the actual code, I just assumed the worst. So I went searching for a version of rpcgen that would work for me. I also was kinda looking for the bug id of the code my friend had sitting in a private workspace.
The first hit in google on the keywords "rpcgen tail recursion" was Bug ID: 4023944 rpcgen generates linked list routines that blow stacks. Hmm, that is on opensolaris.org, which is good.
When I opened it, I was struck by several facts:
I was a victim of low expectations. I assumed that the Solaris rpcgen(1) still did not handle tail recursion (it hadn't been a priority to putback when my friend had a working solution), so I didn't inspect the generated code to see that it did do so. I assumed that fewer lines of code meant elegant and more functional. Instead it meant a blown stack at some point.
So thanks Bill for getting that change made and protecting stacks everywhere!