« February 2010
SunMonTueWedThuFriSat
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
      
       
Today
XML

Neat blogs

Navigation

Editing

Powered by Roller Weblogger.

statcounter.com

clustrmaps.com

Locations of visitors to this page

technorati.com

20080925 Thursday September 25, 2008
More on that refcnt debugging

Earlier I wrote about Debugging a refcnt. I'm still not convinced I have the answer, but I am convinced I'm on the right track. So I've made a some changes to the code base to make things better/correct:

  1. Introduced a incr function to make sure that all bumps are correct:
       291  static void
       292  nfs4_ephemeral_tree_incr(nfs4_ephemeral_tree_t *net)
       293  {
       294          ASSERT(mutex_owned(&net->net_cnt_lock));
       295          net->net_refcnt++;
       296          ASSERT(net->net_refcnt != 0);
       297  }
       298
       299  static void
       300  nfs4_ephemeral_tree_hold(nfs4_ephemeral_tree_t *net)
       301  {
       302          mutex_enter(&net->net_cnt_lock);
       303          nfs4_ephemeral_tree_incr(net);
       304          mutex_exit(&net->net_cnt_lock);
       305  }
    ...
      1859                  } else {
      1860                          nfs4_ephemeral_tree_incr(net);
      1861                  }
    
  2. Fixed up the one locking issue:
       734          } else {
       735                  net = mi->mi_ephemeral_tree;
       736                  nfs4_ephemeral_tree_hold(net);
       737
       738                  mutex_exit(&mi->mi_lock);
    
  3. And I still have the change which I thought would fix the bug:
      2006                  if (was_locked == FALSE)
      2007                          mutex_exit(&net->net_tree_lock);
    

    As a matter of fact, after struggling to work my way through all of the cases in nfs4_ephemeral_umount, I need to add more documentation about the interactions of the booleans *pmust_unlock, is_recursed, and was_locked. I could add more booleans to be explicit about what is going on, but I think better comments will suffice.

I've also introduced some key debug logic to help me figure out if my conjecture about what is happening is correct or not inside nfs4_ephemeral_umount_activate:

  1732  void
  1733  nfs4_ephemeral_umount_activate(mntinfo4_t *mi, bool_t *pmust_unlock,
  1734      nfs4_ephemeral_tree_t **pnet)
  1735  {
  1736          nfs4_ephemeral_tree_t   *checker = mi->mi_ephemeral;
  1737
  1738          /*
  1739           * Now we need to get rid of the ephemeral data if it exists.
  1740           */
  1741          mutex_enter(&mi->mi_lock);
  1742          if (mi->mi_ephemeral) {
  1743                  /*
  1744                   * If we are the root node of an ephemeral branch
  1745                   * which is being removed, then we need to fixup
  1746                   * pointers into and out of the node.
  1747                   */
  1748                  if (!(mi->mi_flags & MI4_EPHEMERAL_RECURSED))
  1749                          nfs4_ephemeral_umount_cleanup(mi->mi_ephemeral);
  1750
  1751                  ASSERT(mi->mi_ephemeral != NULL);
  1752
  1753                  kmem_free(mi->mi_ephemeral, sizeof (*mi->mi_ephemeral));
  1754                  mi->mi_ephemeral = NULL;
  1755          }
  1756          mutex_exit(&mi->mi_lock);
  1757
  1758          if (*pmust_unlock) {
  1759                  ASSERT(checker != *pnet);
  1760          }
  1761
  1762          nfs4_ephemeral_umount_unlock(pmust_unlock, pnet);
  1763  }

I didn't want to change 1753 to be an assignment to NULL, which would mimic the case for mi->mi_ephemeral_tree. And I don't want to change this code too much.

So I save off a local copy of mi->mi_ephemeral right off the bat. I allow the free and NULLing to occur. And then, only in the case where we must unlock (because the code is more likely to be valid in the other case), do I check to make sure that the saved address is different than the node we are about to unlock. Note that if the mi->mi_ephemeral pointer were NULL when we entered the function, we will not trigger here. Actually, if we do, we have more problems. :->

And if this assert does trigger, I'll have credible evidence I know what is going on and I'll have a good handle on how to fix it.

Now if my build would just finish...

Update:

Okay, I'm way off base and probably need to take a break from this:

../../common/fs/nfs/nfs4_stub_vnops.c:1736: warning: initialization from incompatible pointer type
"../../common/fs/nfs/nfs4_stub_vnops.c", line 1736: warning: assignment type mismatch:

I realized earlier that the types were different, but I didn't let that sink in. The core is consistent:

> *ffffff0004f33bb8::print nfs4_ephemeral_tree_t
{
    net_mount = 0xffffff0523c36000
    net_root = 0
    net_next = 0
    net_tree_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_cnt_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_status = 0
    net_refcnt = 0
}
> ffffff0524673000::print mntinfo4_t mi_ephemeral_tree | ::print nfs4_ephemeral_tree_t
{
    net_mount = 0xffffff0523c36000
    net_root = 0
    net_next = 0
    net_tree_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_cnt_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_status = 0
    net_refcnt = 0
}

The mntinfo4 does point to the tree.

Okay, I took my first advice to add more comments and I have a new piece of code to question in nfs4_ephemeral_umount:

   1763 	int			is_recursed = FALSE;
   1764 	int			was_locked = FALSE;
...
   1770 	*pmust_unlock = FALSE;
...
   1817 	if (!is_recursed) {
...
   1845 			was_locked = TRUE;
   1846 		} else {
   1847 			net->net_refcnt++;
   1848 			ASSERT(net->net_refcnt != 0);
...
   1859 	if (was_locked == FALSE) {
...
   1866 		if (mutex_tryenter(&net->net_tree_lock)) {
   1867 			*pmust_unlock = TRUE;
   1868 		} else {

So, if !is_recursed, then we can end up on line 1847 to bump the refcnt. At this point, *pmust_unlock is FALSE and was_locked is also FALSE. If we grab the net->net_tree_lock right away on 1866, then crud, it is okay.

But the other case is not. Assume that is_recursed is TRUE. That means we do not bump the refcnt and at 1859, the following values hold: was_locked is FALSE and *pmust_unlock is FALSE. Now if we grab the mutex right away, we end up with *pmust_unlock as TRUE. In my examination of the code, that then implies we *must* drop the refcnt at some exit point. But we just saw that we never grabbed it in this scenario.

This would account for us trying to rele a refcnt we never held. We return TRUE to nfs4_free_mount and it sends this into nfs4_ephemeral_umount_activate and thus into nfs4_ephemeral_umount_unlock.

I'm going to introduce more state in a *pmust_rele to keep track of whether I need to rele or not.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Trackback URL: http://blogs.sun.com/tdh/entry/more_on_that_refcnt_debugging
Comments:

Post a Comment:

Name:
E-Mail:
URL:

Your Comment:

HTML Syntax: NOT allowed