Earlier I wrote about Debugging a refcnt. I'm still not convinced I have the answer, but I am convinced I'm on the right track. So I've made a some changes to the code base to make things better/correct:
291 static void
292 nfs4_ephemeral_tree_incr(nfs4_ephemeral_tree_t *net)
293 {
294 ASSERT(mutex_owned(&net->net_cnt_lock));
295 net->net_refcnt++;
296 ASSERT(net->net_refcnt != 0);
297 }
298
299 static void
300 nfs4_ephemeral_tree_hold(nfs4_ephemeral_tree_t *net)
301 {
302 mutex_enter(&net->net_cnt_lock);
303 nfs4_ephemeral_tree_incr(net);
304 mutex_exit(&net->net_cnt_lock);
305 }
...
1859 } else {
1860 nfs4_ephemeral_tree_incr(net);
1861 }
734 } else {
735 net = mi->mi_ephemeral_tree;
736 nfs4_ephemeral_tree_hold(net);
737
738 mutex_exit(&mi->mi_lock);
2006 if (was_locked == FALSE) 2007 mutex_exit(&net->net_tree_lock);
I've also introduced some key debug logic to help me figure out if my conjecture about what is happening is correct or not inside nfs4_ephemeral_umount_activate:
1732 void
1733 nfs4_ephemeral_umount_activate(mntinfo4_t *mi, bool_t *pmust_unlock,
1734 nfs4_ephemeral_tree_t **pnet)
1735 {
1736 nfs4_ephemeral_tree_t *checker = mi->mi_ephemeral;
1737
1738 /*
1739 * Now we need to get rid of the ephemeral data if it exists.
1740 */
1741 mutex_enter(&mi->mi_lock);
1742 if (mi->mi_ephemeral) {
1743 /*
1744 * If we are the root node of an ephemeral branch
1745 * which is being removed, then we need to fixup
1746 * pointers into and out of the node.
1747 */
1748 if (!(mi->mi_flags & MI4_EPHEMERAL_RECURSED))
1749 nfs4_ephemeral_umount_cleanup(mi->mi_ephemeral);
1750
1751 ASSERT(mi->mi_ephemeral != NULL);
1752
1753 kmem_free(mi->mi_ephemeral, sizeof (*mi->mi_ephemeral));
1754 mi->mi_ephemeral = NULL;
1755 }
1756 mutex_exit(&mi->mi_lock);
1757
1758 if (*pmust_unlock) {
1759 ASSERT(checker != *pnet);
1760 }
1761
1762 nfs4_ephemeral_umount_unlock(pmust_unlock, pnet);
1763 }
I didn't want to change 1753 to be an assignment to NULL, which would mimic the case for mi->mi_ephemeral_tree. And I don't want to change this code too much.
So I save off a local copy of mi->mi_ephemeral right off the bat. I allow the free and NULLing to occur. And then, only in the case where we must unlock (because the code is more likely to be valid in the other case), do I check to make sure that the saved address is different than the node we are about to unlock. Note that if the mi->mi_ephemeral pointer were NULL when we entered the function, we will not trigger here. Actually, if we do, we have more problems. :->
And if this assert does trigger, I'll have credible evidence I know what is going on and I'll have a good handle on how to fix it.
Now if my build would just finish...
Update:
Okay, I'm way off base and probably need to take a break from this:
../../common/fs/nfs/nfs4_stub_vnops.c:1736: warning: initialization from incompatible pointer type "../../common/fs/nfs/nfs4_stub_vnops.c", line 1736: warning: assignment type mismatch:
I realized earlier that the types were different, but I didn't let that sink in. The core is consistent:
> *ffffff0004f33bb8::print nfs4_ephemeral_tree_t
{
net_mount = 0xffffff0523c36000
net_root = 0
net_next = 0
net_tree_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_cnt_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_status = 0
net_refcnt = 0
}
> ffffff0524673000::print mntinfo4_t mi_ephemeral_tree | ::print nfs4_ephemeral_tree_t
{
net_mount = 0xffffff0523c36000
net_root = 0
net_next = 0
net_tree_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_cnt_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_status = 0
net_refcnt = 0
}
The mntinfo4 does point to the tree.
Okay, I took my first advice to add more comments and I have a new piece of code to question in nfs4_ephemeral_umount:
1763 int is_recursed = FALSE;
1764 int was_locked = FALSE;
...
1770 *pmust_unlock = FALSE;
...
1817 if (!is_recursed) {
...
1845 was_locked = TRUE;
1846 } else {
1847 net->net_refcnt++;
1848 ASSERT(net->net_refcnt != 0);
...
1859 if (was_locked == FALSE) {
...
1866 if (mutex_tryenter(&net->net_tree_lock)) {
1867 *pmust_unlock = TRUE;
1868 } else {
So, if !is_recursed, then we can end up on line 1847 to bump the refcnt. At this point, *pmust_unlock is FALSE and was_locked is also FALSE. If we grab the net->net_tree_lock right away on 1866, then crud, it is okay.
But the other case is not. Assume that is_recursed is TRUE. That means we do not bump the refcnt and at 1859, the following values hold: was_locked is FALSE and *pmust_unlock is FALSE. Now if we grab the mutex right away, we end up with *pmust_unlock as TRUE. In my examination of the code, that then implies we *must* drop the refcnt at some exit point. But we just saw that we never grabbed it in this scenario.
This would account for us trying to rele a refcnt we never held. We return TRUE to nfs4_free_mount and it sends this into nfs4_ephemeral_umount_activate and thus into nfs4_ephemeral_umount_unlock.
I'm going to introduce more state in a *pmust_rele to keep track of whether I need to rele or not.
I was trying to integrate a branch merge of the nfs41-gate and ON. And I'm keeping tight notes because I'm trying to get Piyush to be the gatekeeper. Anyway, here is what went wrong:
Reparent to the gate: % hg reparent ssh://aus1500-home//pool/ws/nfs41-gate And push: % hg push pushing to ssh://aus1500-home//pool/ws/nfs41-gate searching for changes Are you sure you wish to push? [y/N]: y remote: abort: could not lock repository /pool/ws/nfs41-gate: Permission denied abort: unexpected response: empty string
This worked last time I tried it. And Rich Brown was able to make a change this morning. Now, I could just blame him, but that is too easy. I use google and find [Volo-dev] I am unable to push my changes to the gate. I don't get past the comment:
You are trying to push back the changes as user "anon".
Before I realize what is going on. I forgot to reparent the gate correctly:
% hg reparent ssh://nfs4hg@aus1500-home//pool/ws/nfs41-gate % hg push pushing to ssh://nfs4hg@aus1500-home//pool/ws/nfs41-gate searching for changes Are you sure you wish to push? [y/N]: y pushing to ssh://nfs4hg@aus1500-home//pool/ws/nfs41-gate searching for changes Are you sure you wish to push? [y/N]: y remote: adding changesets remote: adding manifests ... remote: Preparing gk email... remote: ...gk email sent
All is well and Rich is off the hook!
A refcnt is a way to keep track of how many outstanding holds are being used on a data structure. Say that you need to reference something in the data structure, but you don't want to do it while holding a lock. You do want to make sure that the data is still there when you go back. What you do is bump the refcnt and have the cleanup code never free that data until the refcnt goes back to 0.
Some issues are what happens if it never goes back to 0? Or if it accidentally goes back to 0 before you are done with it? The answers are that you leak memory and crash, in that order.
I'm debugging the scenario where the refcnt goes back to 0 before you release your hold. Before I dive into looking at that, let's look at the relevant code in the mirror mount source: usr/src/uts/common/fs/nfs/nfs4_stub_vnops.c (Note that this code will change over time, I'm linking to the copy of the gate.)
We bump the count here:
291 static void
292 nfs4_ephemeral_tree_hold(nfs4_ephemeral_tree_t *net)
293 {
294 mutex_enter(&net->net_cnt_lock);
295 net->net_refcnt++;
296 ASSERT(net->net_refcnt != 0);
297 mutex_exit(&net->net_cnt_lock);
298 }
The first thing to note is that we grab net_cnt_lock before we do anything. How do we even know that is valid? Well, we grab a lock on the mountinfo and check to see if there is an ephemeral node:
696 mutex_enter(&mi->mi_lock);
...
701 if (mi->mi_ephemeral_tree == NULL) {
...
727 } else {
728 net = mi->mi_ephemeral_tree;
729 mutex_exit(&mi->mi_lock);
730
731 nfs4_ephemeral_tree_hold(net);
Oh, that is very naughty indeed - what I said and what the code says disagree. What do other parts of the code say?
1422 mutex_enter(&mi->mi_lock); ... 1444 nfs4_ephemeral_tree_hold(net); ... 1528 mutex_exit(&mi->mi_lock);
And a special case in nfs4_ephemeral_harvest_forest
2098 mutex_enter(&ntg->ntg_forest_lock);
2099 for (net = ntg->ntg_forest; net != NULL; net = next) {
2100 next = net->net_next;
2101
2102 nfs4_ephemeral_tree_hold(net);
I expect the code at 2102 to not hold the mntinfo4. We don't have our hands on that info at the time. But I believe the code at 731 to be a bug that will eventually bit us.
But back to the hold code:
295 net->net_refcnt++;
296 ASSERT(net->net_refcnt != 0);
We bump the refcnt and immediately assert that it is not 0. As refcnt is an unsigned int, this will catch the case where we wrap over the size of an integer. We don't ever expect to hit this case.
It is in the rele case that we are more concerned:
300 /*
301 * We need a safe way to decrement the refcnt whilst the
302 * lock is being held.
303 */
304 static void
305 nfs4_ephemeral_tree_decr(nfs4_ephemeral_tree_t *net)
306 {
307 ASSERT(mutex_owned(&net->net_cnt_lock));
308 ASSERT(net->net_refcnt != 0);
309 net->net_refcnt--;
310 }
311
312 static void
313 nfs4_ephemeral_tree_rele(nfs4_ephemeral_tree_t *net)
314 {
315 mutex_enter(&net->net_cnt_lock);
316 nfs4_ephemeral_tree_decr(net);
317 mutex_exit(&net->net_cnt_lock);
318 }
First note that either we hold the lock ourself and call nfs4_ephemeral_tree_decr or we let nfs4_ephemeral_tree_rele grab the lock for us. The assert on 307 checks that this is valid. The assert on 308 checks to see that we are not already at 0 when we decrement. I.e., because the refcnt is unsigned, once we drop below 0, we can not do a check like:
ASSERT(net->net_refcnt >= 0);
By the way, we do not need to be holding mi->mi_lock everywhere we rele or decr. The reason we need to when we hold is that we need to know the memory is valid. With the other two operations, by definition, the memory must be valid.
To prove that statement, we need to show that the ephemeral tree can not be disassociated from the mntinfo4 without the mi->mi_lock being held. The only place we do that is in nfs4_ephemeral_umount:
2003 /* 2004 * At this point, the tree should no 2005 * longer be associated with the 2006 * mntinfo4. We need to pull it off 2007 * there and let the harvester take 2008 * care of it once the refcnt drops. 2009 */ 2010 mutex_enter(&mi->mi_lock); 2011 mi->mi_ephemeral_tree = NULL; 2012 mutex_exit(&mi->mi_lock);
Okay, further validation that we should have the lock held at 731.
The final piece of the puzzle is what happens in the harvester? I.e., we decided in 2011 to disassociate the tree with the mntinfo4. But we didn't release the memory. How do we do that? Well, we look in nfs4_ephemeral_harvest_forest and we also see why we don't care about holding mi->mi_lock there.
2098 mutex_enter(&ntg->ntg_forest_lock);
2099 for (net = ntg->ntg_forest; net != NULL; net = next) {
2100 next = net->net_next;
2101
2102 nfs4_ephemeral_tree_hold(net);
When we hold ntg->ntg_forest_lock, no other thread is allowed to manipulate the forest of ephemeral trees. What that means is that the current forest is locked and the memory is not going away. Hence we don't need to hold mi->mi_lock. The other code has to hold it because the reference in the mntinfo4 can go away (see line 2011).
2140 while (e) {
...
2247 e = prior;
2248 }
2249
2250 check_done:
2251
2252 /*
2253 * At this point we are done processing this tree.
2254 *
2255 * If the tree is invalid and we are the only reference
2256 * to it, then we push it on the local linked list
2257 * to remove it at the end. We avoid that action now
2258 * to keep the tree processing going along at a fair clip.
2259 *
2260 * Else, even if we are the only reference, we drop
2261 * our hold on the current tree and allow it to be
2262 * reused as needed.
2263 */
2264 mutex_enter(&net->net_cnt_lock);
2265 if (net->net_refcnt == 1 &&
2266 net->net_status & NFS4_EPHEMERAL_TREE_INVALID) {
2267 nfs4_ephemeral_tree_decr(net);
2268 net->net_status &= ~NFS4_EPHEMERAL_TREE_LOCKED;
2269 mutex_exit(&net->net_cnt_lock);
2270 mutex_exit(&net->net_tree_lock);
2271
2272 if (prev)
2273 prev->net_next = net->net_next;
2274 else
2275 ntg->ntg_forest = net->net_next;
2276
2277 net->net_next = harvest;
2278 harvest = net;
2279 continue;
2280 }
2281
2282 nfs4_ephemeral_tree_decr(net);
2283 net->net_status &= ~NFS4_EPHEMERAL_TREE_LOCKED;
2284 mutex_exit(&net->net_cnt_lock);
2285 mutex_exit(&net->net_tree_lock);
2286
2287 prev = net;
2288 }
2289 mutex_exit(&ntg->ntg_forest_lock);
2248 completes the a while processing loop started in 2140 above. Among other things, this loop decides if a mirror mount has been idle long enough to automatically release it. It does this in a recursion elimination to visit every child and sibling node of the root of the ephemeral tree.
The interesting code starts on lines 2265 and 2266. If the refcnt is 1, then the hold we established on 2102 is the only reference to this node. Then, if the tree has been invalidated, either by the harvest loop above (this time or earlier) or by a manual unmount, then we know it is safe to release this node. We decrement the refcnt at 2267. This is important because if we leave it at 1 and push it on the harvest list, then another node can do a rele and we think it is okay. We remove it from the forst in lines 2272 to 2275 and push it onto a local linked list harvest on lines 2277 and 2278. We then go on to process the next node.
When we have finished all trees in the forest, they are all in one of 3 states:
With all of that background out of the way, let us look at a coredump:
> $c
vpanic()
assfail+0x7e(fffffffff8490b98, fffffffff8490b70, 134)
nfs4_ephemeral_tree_decr+0x64(ffffff020db7fc18)
nfs4_ephemeral_umount_unlock+0x3f(ffffff0004f33c14, ffffff0004f33bb8)
nfs4_ephemeral_umount_activate+0x7b(ffffff0524673000, ffffff0004f33c14, ffffff0004f33bb8)
nfs4_free_mount+0x2fe(ffffff047ae40dc8, 400, ffffff01974f2b28)
nfs4_free_mount_thread+0x24(ffffff016623a9f8)
thread_start+8()
> *ffffff0004f33bb8::print nfs4_ephemeral_tree_t
{
net_mount = 0xffffff0523c36000
net_root = 0
net_next = 0
net_tree_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_cnt_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_status = 0
net_refcnt = 0
}
> ffffff0004f33c14::print boolean_t
1 (B_TRUE)
This is the assert on line 308. Some things we can tell right away is that this should be a manual unmount, since we came through nfs4_free_mount and not the harvester. The flags are 0x400, which means MS_FORCE - a forced umount call. Also, while the refcnt is 0, it has not been harvested. In that case, we would expect to see that the state would include NFS4_EPHEMERAL_TREE_INVALID. And indeed, the mntinfo4 (see net_mount) does still point to this node:
> ffffff0523c36000::print mntinfo4_t mi_ephemeral_tree
mi_ephemeral_tree = 0xffffff020db7fc18
> ffffff0523c36000::print mntinfo4_t mi_ephemeral_tree | ::print nfs4_ephemeral_tree_t
{
net_mount = 0xffffff0523c36000
net_root = 0
net_next = 0
net_tree_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_cnt_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_status = 0
net_refcnt = 0
}
So, who did us in? Well, the first thought I had was this piece of code in nfs4_ephemeral_umount : (which we would have just called)
1990 nfs4_ephemeral_tree_decr(net); 1991 mutex_exit(&net->net_cnt_lock); 1992 1993 if (was_locked == FALSE) 1994 mutex_exit(&net->net_tree_lock);
We only hold the refcnt in here if we do not set was_locked to TRUE:
1764 int was_locked = FALSE;
...
1845 was_locked = TRUE;
1846 } else {
1847 net->net_refcnt++;
1848 ASSERT(net->net_refcnt != 0);
1849 }
1850
1851 mutex_exit(&net->net_cnt_lock);
Hmm, we also avoid using nfs4_ephemeral_tree_hold. We should probably create a symmetric function for incr like we did for decr.
Anyway, I fixed the code at 19909 and we still see the panic. The new one has the same signature. By the way, I found this probable error by matching holds to reles. I stopped with just the simple cases when I found the above issue.
Okay, what does nfs4_free_mount() do?
4039 /* 4040 * If we got an error, then do not nuke the 4041 * tree. Either the harvester is busy reclaiming 4042 * this node or we ran into some busy condition. 4043 * 4044 * The harvester will eventually come along and cleanup. 4045 * The only problem would be the root mount point. 4046 * 4047 * Since the busy node can occur for a variety 4048 * of reasons and can result in an entry staying 4049 * in df output but no longer accessible from the 4050 * directory tree, we are okay. 4051 */ 4052 if (!nfs4_ephemeral_umount(mi, flag, cr, 4053 &must_unlock, &eph_tree)) 4054 nfs4_ephemeral_umount_activate(mi, &must_unlock, 4055 &eph_tree); 4056
So we call nfs4_ephemeral_umount, which had that suspect piece of code. We know that we will not set NFS4_EPHEMERAL_TREE_LOCKED ourselves. We can see that it was not held, because pmust_unlock was set to TRUE. Also, the status tells us we did not mark the tree as NFS4_EPHEMERAL_TREE_INVALID and the fact that the mntinfo4 pointer is valid tells us we did not disassociate the tree from it. All of which tells me fixing line 1990 would probably have no impact on the execution path. I.e., I'm starting to suspect we don't have a problem with a race case.
The only way to return 0 and not disassociate the tree is to go down this path back in nfs4_ephemeral_umount:
1945 if (!is_derooting) {
1946 /*
1947 * Only work on children if the caller has not already
1948 * done so.
1949 */
1950 if (!is_recursed) {
1951 ASSERT(eph != NULL);
1952
1953 error = nfs4_ephemeral_unmount_engine(eph,
1954 FALSE, flag, cr);
1955 if (error)
1956 goto is_busy;
1957 }
1958 } else {
And as a matter of fact, the evidence suggests that no other code did this either. nfs4_ephemeral_unmount_engine must have been error free.
So either we have decremented the refcnt twice for our one hold or we decremented when we did not have a hold.
Heh, I've gone down many a code branch, only to backtrack as I found the code was correct. Perhaps we should return to our debug session? nfs4_ephemeral_umount_activate takes a pointer to a mntinfo4 and a pointer to a pointer of a nfs4_ephemeral_tree_t:
nfs4_ephemeral_umount_activate+0x7b(ffffff0524673000, ffffff0004f33c14, ffffff0004f33bb8)
> *ffffff0004f33bb8::print nfs4_ephemeral_tree_t
{
net_mount = 0xffffff0523c36000
net_root = 0
net_next = 0
net_tree_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_cnt_lock = {
_opaque = [ 0xffffff0004f33c80 ]
}
net_status = 0
net_refcnt = 0
}
> ffffff0524673000::print mntinfo4_t mi_ephemeral
mi_ephemeral = 0
The ephemeral node is not pointing to the same thing as the function. And I have more faith in the function, because it had to just execute:
1733 if (mi->mi_ephemeral) {
1734 /*
1735 * If we are the root node of an ephemeral branch
1736 * which is being removed, then we need to fixup
1737 * pointers into and out of the node.
1738 */
1739 if (!(mi->mi_flags & MI4_EPHEMERAL_RECURSED))
1740 nfs4_ephemeral_umount_cleanup(mi->mi_ephemeral);
1741
1742 ASSERT(mi->mi_ephemeral != NULL);
1743
1744 kmem_free(mi->mi_ephemeral, sizeof (*mi->mi_ephemeral));
1745 mi->mi_ephemeral = NULL;
1746 }
1747 mutex_exit(&mi->mi_lock);
1748
1749 nfs4_ephemeral_umount_unlock(pmust_unlock, pnet);
I.e., if we pass in a mntinfo4, either the ephemeral tree is already NULL or we free it and NULL it out. And oh crap, if mi->mi_ephemeral was a pointer to the same memory location as *pnet, we just screwed the pooch and caused a core. I.e., the only place we should be removing this memory is at:
2010 mutex_enter(&mi->mi_lock); 2011 mi->mi_ephemeral_tree = NULL; 2012 mutex_exit(&mi->mi_lock);
Well, this entry is long enough as it is. I'll convince myself that removing 1744 is the correct thing to do and then test that fix. Note, this kmem_free might explain why systems do not always core. If the memory is garbage, all it has to be is non-zero for the assert to not trigger. Whether or not other nasty things could have happened is to be seen later.
UPDATE: Hmm, one is a mi->mi_ephemeral and the other is a mi->mi_ephemeral_tree. The end result may stand, but my analysis may be off. Back with more later!