A refcnt is a way to keep track of how many outstanding holds are being used on a data structure. Say that you need to reference something in the data structure, but you don't want to do it while holding a lock. You do want to make sure that the data is still there when you go back. What you do is bump the refcnt and have the cleanup code never free that data until the refcnt goes back to 0.
Some issues are what happens if it never goes back to 0? Or if it accidentally goes back to 0 before you are done with it? The answers are that you leak memory and crash, in that order.
I'm debugging the scenario where the refcnt goes back to 0 before you release your hold. Before I dive into looking at that, let's look at the relevant code in the mirror mount source: usr/src/uts/common/fs/nfs/nfs4_stub_vnops.c (Note that this code will change over time, I'm linking to the copy of the gate.)
We bump the count here:
291 static void
292 nfs4_ephemeral_tree_hold(nfs4_ephemeral_tree_t *net)
293 {
294 mutex_enter(&net->net_cnt_lock);
295 net->net_refcnt++;
296 ASSERT(net->net_refcnt != 0);
297 mutex_exit(&net->net_cnt_lock);
298 }
The first thing to note is that we grab net_cnt_lock before we do anything. How do we even know that is valid? Well, we grab a lock on the mountinfo and check to see if there is an ephemeral node:
696 mutex_enter(&mi->mi_lock);
...
701 if (mi->mi_ephemeral_tree == NULL) {
...
727 } else {
728 net = mi->mi_ephemeral_tree;
729 mutex_exit(&mi->mi_lock);
730
731 nfs4_ephemeral_tree_hold(net);
Oh, that is very naughty indeed - what I said and what the code says disagree. What do other parts of the code say?
1422 mutex_enter(&mi->mi_lock); ... 1444 nfs4_ephemeral_tree_hold(net); ... 1528 mutex_exit(&mi->mi_lock);
And a special case in nfs4_ephemeral_harvest_forest
2098 mutex_enter(&ntg->ntg_forest_lock);
2099 for (net = ntg->ntg_forest; net != NULL; net = next) {
2100 next = net->net_next;
2101
2102 nfs4_ephemeral_tree_hold(net);
I expect the code at 2102 to not hold the mntinfo4. We don't have our hands on that info at the time. But I believe the code at 731 to be a bug that will eventually bit us.
But back to the hold code:
295 net->net_refcnt++;
296 ASSERT(net->net_refcnt != 0);
We bump the refcnt and immediately assert that it is not 0. As refcnt is an unsigned int, this will catch the case where we wrap over the size of an integer. We don't ever expect to hit this case.
It is in the rele case that we are more concerned:
300 /*
301 * We need a safe way to decrement the refcnt whilst the
302 * lock is being held.
303 */
304 static void
305 nfs4_ephemeral_tree_decr(nfs4_ephemeral_tree_t *net)
306 {
307 ASSERT(mutex_owned(&net->net_cnt_lock));
308 ASSERT(net->net_refcnt != 0);
309 net->net_refcnt--;
310 }
311
312 static void
313 nfs4_ephemeral_tree_rele(nfs4_ephemeral_tree_t *net)
314 {
315 mutex_enter(&net->net_cnt_lock);
316 nfs4_ephemeral_tree_decr(net);
317 mutex_exit(&net->net_cnt_lock);
318 }
First note that either we hold the lock ourself and call nfs4_ephemeral_tree_decr or we let nfs4_ephemeral_tree_rele grab the lock for us. The assert on 307 checks that this is valid. The assert on 308 checks to see that we are not already at 0 when we decrement. I.e., because the refcnt is unsigned, once we drop below 0, we can not do a check like:
ASSERT(net->net_refcnt >= 0);
By the way, we do not need to be holding mi->mi_lock everywhere we rele or decr. The reason we need to when we hold is that we need to know the memory is valid. With the other two operations, by definition, the memory must be valid.
To prove that statement, we need to show that the ephemeral tree can not be disassociated from the mntinfo4 without the mi->mi_lock being held. The only place we do that is in nfs4_ephemeral_umount:
2003 /* 2004 * At this point, the tree should no 2005 * longer be associated with the 2006 * mntinfo4. We need to pull it off 2007 * there and let the harvester take 2008 * care of it once the refcnt drops. 2009 */ 2010 mutex_enter(&mi->mi_lock); 2011 mi->mi_ephemeral_tree = NULL; 2012 mutex_exit(&mi->mi_lock);
Okay, further validation that we should have the lock held at 731.
The final piece of the puzzle is what happens in the harvester? I.e., we decided in 2011 to disassociate the tree with the mntinfo4. But we didn't release the memory. How do we do that? Well, we look in nfs4_ephemeral_harvest_forest and we also see why we don't care about holding mi->mi_lock there.
2098 mutex_enter(&ntg->ntg_forest_lock);
2099 for (net = ntg->ntg_forest; net != NULL; net = next) {
2100 next = net->net_next;
2101
2102 nfs4_ephemeral_tree_hold(net);
When we hold ntg->ntg_forest_lock, no other thread is allowed to manipulate the forest of ephemeral trees. What that means is that the current forest is locked and the memory is not going away. Hence we don't need to hold mi->mi_lock. The other code has to hold it because the reference in the mntinfo4 can go away (see line 2011).
2140 while (e) {
...
2247 e = prior;
2248 }
2249
2250 check_done:
2251
2252 /*
2253 * At this point we are done processing this tree.
2254 *
2255 * If the tree is invalid and we are the only reference
2256 * to it, then we push it on the local linked list
2257 * to remove it at the end. We avoid that action now
2258 * to keep the tree processing going along at a fair clip.
2259 *
2260 * Else, even if we are the only reference, we drop
2261 * our hold on the current tree and allow it to be
2262 * reused as needed.
2263 */
2264 mutex_enter(&net->net_cnt_lock);
2265 if (net->net_refcnt == 1 &&
2266 net->net_status & NFS4_EPHEMERAL_TREE_INVALID) {
2267 nfs4_ephemeral_tree_decr(net);
2268 net->net_status &= ~NFS4_EPHEMERAL_TREE_LOCKED;
2269 mutex_exit(&net->net_cnt_lock);
2270 mutex_exit(&net->net_tree_lock);
2271
2272 if (prev)
2273 prev->net_next = net->net_next;
2274 else
2275 ntg->ntg_forest = net->net_next;
2276
2277 net->net_next = harvest;
2278 harvest = net;
2279 continue;
2280 }
2281
2282 nfs4_ephemeral_tree_decr(net);
2283 net->net_status &= ~NFS4_EPHEMERAL_TREE_LOCKED;
2284 mutex_exit(&net->net_cnt_lock);
2285 mutex_exit(&net->net_tree_lock);
2286
2287 prev = net;
2288 }
2289 mutex_exit(&ntg->ntg_forest_lock);
2248 completes the a while processing loop started in 2140 above. Among other things, this loop decides if a mirror mount has been idle long enough to automatically release it. It does this in a recursion elimination to visit every child and sibling node of the root of the ephemeral tree.
The interesting code starts on lines 2265 and 2266. If the refcnt is 1, then the hold we established on 2102 is the only reference to this node. Then, if the tree has been invalidated, either by the harvest loop above (this time or earlier) or by a manual unmount, then we know it is safe to release this node. We decrement the refcnt at 2267. This is important because if we leave it at 1 and push it on the harvest list, then another node can do a rele and we think it is okay. We remove it from the forst in lines 2272 to 2275 and push it onto a local linked list harvest on lines 2277 and 2278. We then go on to process the next node.
When we have finished all trees in the forest, they are all in one of 3 states:
With all of that background out of the way, let us look at a coredump:
> $c
vpanic()
assfail+0x7e(fffffffff8490b98, fffffffff8490b70, 134)
nfs4_ephemeral_tree_decr+0x64(ffffff020db7fc18)
nfs4_ephemeral_umount_unlock+0x3f(ffffff0004f33c14, ffffff0004f33bb8)
nfs4_ephemeral_umount_activate+0x7b(ffffff0524673000, ffffff0004f33c14, ffffff0004f33bb8)
nfs4_free_mount+0x2fe(ffffff047ae40dc8, 400, ffffff01974f2b28)
nfs4_free_mount_thread+0x24(ffffff016623a9f8)
thread_start+8()
> *ffffff0004f33bb8::print nfs4_ephemeral_tree_t
{
net_mount = 0xffffff0523c36000
net_root = 0
net_next = 0
net_tree_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_cnt_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_status = 0
net_refcnt = 0
}
> ffffff0004f33c14::print boolean_t
1 (B_TRUE)
This is the assert on line 308. Some things we can tell right away is that this should be a manual unmount, since we came through nfs4_free_mount and not the harvester. The flags are 0x400, which means MS_FORCE - a forced umount call. Also, while the refcnt is 0, it has not been harvested. In that case, we would expect to see that the state would include NFS4_EPHEMERAL_TREE_INVALID. And indeed, the mntinfo4 (see net_mount) does still point to this node:
> ffffff0523c36000::print mntinfo4_t mi_ephemeral_tree
mi_ephemeral_tree = 0xffffff020db7fc18
> ffffff0523c36000::print mntinfo4_t mi_ephemeral_tree | ::print nfs4_ephemeral_tree_t
{
net_mount = 0xffffff0523c36000
net_root = 0
net_next = 0
net_tree_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_cnt_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_status = 0
net_refcnt = 0
}
So, who did us in? Well, the first thought I had was this piece of code in nfs4_ephemeral_umount : (which we would have just called)
1990 nfs4_ephemeral_tree_decr(net); 1991 mutex_exit(&net->net_cnt_lock); 1992 1993 if (was_locked == FALSE) 1994 mutex_exit(&net->net_tree_lock);
We only hold the refcnt in here if we do not set was_locked to TRUE:
1764 int was_locked = FALSE;
...
1845 was_locked = TRUE;
1846 } else {
1847 net->net_refcnt++;
1848 ASSERT(net->net_refcnt != 0);
1849 }
1850
1851 mutex_exit(&net->net_cnt_lock);
Hmm, we also avoid using nfs4_ephemeral_tree_hold. We should probably create a symmetric function for incr like we did for decr.
Anyway, I fixed the code at 19909 and we still see the panic. The new one has the same signature. By the way, I found this probable error by matching holds to reles. I stopped with just the simple cases when I found the above issue.
Okay, what does nfs4_free_mount() do?
4039 /* 4040 * If we got an error, then do not nuke the 4041 * tree. Either the harvester is busy reclaiming 4042 * this node or we ran into some busy condition. 4043 * 4044 * The harvester will eventually come along and cleanup. 4045 * The only problem would be the root mount point. 4046 * 4047 * Since the busy node can occur for a variety 4048 * of reasons and can result in an entry staying 4049 * in df output but no longer accessible from the 4050 * directory tree, we are okay. 4051 */ 4052 if (!nfs4_ephemeral_umount(mi, flag, cr, 4053 &must_unlock, &eph_tree)) 4054 nfs4_ephemeral_umount_activate(mi, &must_unlock, 4055 &eph_tree); 4056
So we call nfs4_ephemeral_umount, which had that suspect piece of code. We know that we will not set NFS4_EPHEMERAL_TREE_LOCKED ourselves. We can see that it was not held, because pmust_unlock was set to TRUE. Also, the status tells us we did not mark the tree as NFS4_EPHEMERAL_TREE_INVALID and the fact that the mntinfo4 pointer is valid tells us we did not disassociate the tree from it. All of which tells me fixing line 1990 would probably have no impact on the execution path. I.e., I'm starting to suspect we don't have a problem with a race case.
The only way to return 0 and not disassociate the tree is to go down this path back in nfs4_ephemeral_umount:
1945 if (!is_derooting) {
1946 /*
1947 * Only work on children if the caller has not already
1948 * done so.
1949 */
1950 if (!is_recursed) {
1951 ASSERT(eph != NULL);
1952
1953 error = nfs4_ephemeral_unmount_engine(eph,
1954 FALSE, flag, cr);
1955 if (error)
1956 goto is_busy;
1957 }
1958 } else {
And as a matter of fact, the evidence suggests that no other code did this either. nfs4_ephemeral_unmount_engine must have been error free.
So either we have decremented the refcnt twice for our one hold or we decremented when we did not have a hold.
Heh, I've gone down many a code branch, only to backtrack as I found the code was correct. Perhaps we should return to our debug session? nfs4_ephemeral_umount_activate takes a pointer to a mntinfo4 and a pointer to a pointer of a nfs4_ephemeral_tree_t:
nfs4_ephemeral_umount_activate+0x7b(ffffff0524673000, ffffff0004f33c14, ffffff0004f33bb8)
> *ffffff0004f33bb8::print nfs4_ephemeral_tree_t
{
net_mount = 0xffffff0523c36000
net_root = 0
net_next = 0
net_tree_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_cnt_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_status = 0
net_refcnt = 0
}
> ffffff0524673000::print mntinfo4_t mi_ephemeral
mi_ephemeral = 0
The ephemeral node is not pointing to the same thing as the function. And I have more faith in the function, because it had to just execute:
1733 if (mi->mi_ephemeral) {
1734 /*
1735 * If we are the root node of an ephemeral branch
1736 * which is being removed, then we need to fixup
1737 * pointers into and out of the node.
1738 */
1739 if (!(mi->mi_flags & MI4_EPHEMERAL_RECURSED))
1740 nfs4_ephemeral_umount_cleanup(mi->mi_ephemeral);
1741
1742 ASSERT(mi->mi_ephemeral != NULL);
1743
1744 kmem_free(mi->mi_ephemeral, sizeof (*mi->mi_ephemeral));
1745 mi->mi_ephemeral = NULL;
1746 }
1747 mutex_exit(&mi->mi_lock);
1748
1749 nfs4_ephemeral_umount_unlock(pmust_unlock, pnet);
I.e., if we pass in a mntinfo4, either the ephemeral tree is already NULL or we free it and NULL it out. And oh crap, if mi->mi_ephemeral was a pointer to the same memory location as *pnet, we just screwed the pooch and caused a core. I.e., the only place we should be removing this memory is at:
2010 mutex_enter(&mi->mi_lock); 2011 mi->mi_ephemeral_tree = NULL; 2012 mutex_exit(&mi->mi_lock);
Well, this entry is long enough as it is. I'll convince myself that removing 1744 is the correct thing to do and then test that fix. Note, this kmem_free might explain why systems do not always core. If the memory is garbage, all it has to be is non-zero for the assert to not trigger. Whether or not other nasty things could have happened is to be seen later.
UPDATE: Hmm, one is a mi->mi_ephemeral and the other is a mi->mi_ephemeral_tree. The end result may stand, but my analysis may be off. Back with more later!
I think the refcount code in ZFS is a very good implementation. It has a very cool debug feature where it keeps link lists of the callers that bumped and unbumped the refcnt, but in the nonDEBUG kernel case it is all done with atomic instructions.
There is actually nothing ZFS specific in the implementation and it should probably be moved to genunix.
Posted by Darren Moffat on October 01, 2008 at 06:37 AM CDT #
Darren,
Thanks for the heads up. I almost went down this route with this bug. Glad to know if I do I don't have to reinvent the wheel.
Tom
Posted by Tom Haynes on October 01, 2008 at 08:29 AM CDT #