Welcome to the Event Horizon

Friday Sep 22, 2006

Hello again,

In my previous life at Sun, I was a software engineer working on Leadville. More specifically, I spent almost two years working on the FibreChannel stack.

From the time I started working on Leadville, there were numerous recurring panics that all revolved around one central issue. That issue was called the "pd mutex", aptly named after the mutex that is supposed to protect a structure that represents a remote port on the SAN. That structure used to be called the "p"ort "d"evice structure (thus the name pd_mutex), although it has since been appropriately renamed to fc_remote_port_t.

These pd_mutex bugs would pop up on a regular basis because the structures weren't being properly accounted for within the stack.

By the nature of SANs, devices come and go. A host reboots, a user allocates a new LUN on an array, any number of solicited or unsolicited events cause remote ports to appear and disappear. In theory, when a remote port disappears, the structure that represents that remote port also disappears. In reality, it isn't that simple.

I knew there was a fundamental root cause to all the open bugs that were related to "pd_mutex". My goal was to reduce all the open bugs down to that common root cause. The following bug is the one I chose to engineer a solution for:

Bug #4792071

That bug had been open for approximately two years already when I took it over. An initial look at the code made it clear that although there was a field within the remote port structure that was supposed to maintain a reference count, that field was rarely used. I came to the conclusion that the root cause of all the pd_mutex bugs was that there was no adequate reference counting associated with the structure. Thus began two months of in-depth analysis, bug fix development, and testing to ensure that this bug and the other eight or so related bugs no longer exhibited the panics associated with the pd_mutex problem.

After the fix was integrated, the issues slowed to a trickle, but since then there have been several new bugs that have arisen that also point to issues with regard to proper bookkeeping of the remote port structure. One of those recent bugs is the following:

Bug #6255534

Unfortunately, the bug listing in opensolaris isn't very helpful, but this is where talk of threads and synchronization comes into play. This bug is a great example for discussion of fundamental locking methodology in a multi-threaded environment. As the synopsis of the bug indicates, a system panic occurred during Leadville testing while target-side cable pulls were being performed.

A brief code inspection revealed a race condition to be the root cause of the panic. At the time of the panic, there were two threads operating with the same remote port structure. One thread was running at interrupt context and was attempting to destroy a packet that had evidently failed transmission at an earlier time. The second thread was handling a state change notification, presumably one indicating that the remote port had disappeared.

Now, as a bit of background, part of my fix for bug ID 4792071 was to implement a flag in addition to consistent and proper use of a reference counter for each remote port structure. The flag is an indication that the transport layer has notified the ULP that a remote port exists. Thus, there are now two conditions that must be true in order for a remote port structure to be deallocated. First, the reference count must be zero. Second, the flag indicating that ULPs are aware of the remote port must be clear. Then, and only then will a remote port structure be freed.

With that in mind, have a quick look at the relevant piece of code for each thread just prior to the panic. The following code fragments can be seen in the opensolaris NWS source (nws-src-20060724).

Thread 1 (fctl/src/fctl.c:882 in function fc_ulp_uninit_packet):

...
    mutex_enter(&pd->pd_mutex);                                     
                                                                                
    ASSERT(pd->pd_ref_count > 0);                                   
    pd->pd_ref_count--;                                             
                                                                                                                                          
    if (pd->pd_state == PORT_DEVICE_INVALID &&                      
        pd->pd_ref_count == 0) {                                    
            fc_remote_node_t *node = pd->pd_remote_nodep;           
                                                                                
            mutex_exit(&pd->pd_mutex);                              
                                                                                                                             
            if ((fctl_destroy_remote_port(port, pd) == 0) &&        
                (node != NULL)) {                                   
                    fctl_destroy_remote_node(node);                 
            }                                                       
            return (rval);                                          
    }
...

Thread 2 (fctl/src/fctl.c:2907 in function fctl_ulp_statec_cb):

...
    mutex_enter(&pd->pd_mutex);                             
                                                                                
    pd->pd_ref_count--;                                     
    ASSERT(pd->pd_ref_count >= 0);                          
                                                                                
    if (clist->clist_map[count].map_state !=                
        PORT_DEVICE_INVALID) {                                                     
            mutex_exit(&pd->pd_mutex);                      
            continue;                                       
    }                                                       
                                                                                
    node = pd->pd_remote_nodep;                             
    pd->pd_aux_flags &= ~PD_GIVEN_TO_ULPS;                  
                                                                                
    mutex_exit(&pd->pd_mutex);                              
                                                                                                                                    
    if ((fctl_destroy_remote_port(port, pd) == 0) &&        
        (node != NULL)) {                                   
            fctl_destroy_remote_node(node);                 
    }
...

Both threads end up calling fctl_destroy_remote_node, so here is the relevant portion of that function:

int                                                                             
fctl_destroy_remote_port(fc_local_port_t *port, fc_remote_port_t *pd)           
{                                                                               
        fc_remote_node_t        *rnodep;                                        
        int                     rcount = 0;                                     
                                                                                
        mutex_enter(&pd->pd_mutex);                                             
                                                                                                                                                 
        if ((pd->pd_ref_count > 0) ||                                           
            (pd->pd_aux_flags & PD_GIVEN_TO_ULPS)) {                            
                pd->pd_aux_flags |= PD_NEEDS_REMOVAL;                           
                pd->pd_type = PORT_DEVICE_OLD;                                  
                mutex_exit(&pd->pd_mutex);                                      
                return (1);                                                     
        }                                                                       
                                                                                
        pd->pd_type = PORT_DEVICE_OLD;                                          
                                                                                
        rnodep = pd->pd_remote_nodep;                                           
                                                                                
        mutex_exit(&pd->pd_mutex);                                              
                                                                                
        if (rnodep != NULL) {                                                                                                            
            rcount = fctl_unlink_remote_port_from_remote_node(rnodep, pd);  
        }                                                                       
                                                                                
        mutex_enter(&port->fp_mutex);                                           
        mutex_enter(&pd->pd_mutex);                                             
                                                                                
        fctl_delist_did_table(port, pd);                                        
        fctl_delist_pwwn_table(port, pd);                                       
                                                                                
        mutex_exit(&pd->pd_mutex);                                              
        fctl_dealloc_remote_port(pd);                                           
                                                                                
        mutex_exit(&port->fp_mutex);                                            
                                                                                
        return (rcount);                                                        
}

As can be seen above, fctl_destroy_remote_port checks the PD_GIVEN_TO_ULPS flag and the reference count (pd_ref_count) prior to actually deallocating the remote port structure.

In bug ID 6255534, thread 1 caused the deallocation of the remote port structure. Thread 2 caused the panic at fctl_destroy_remote_port+4. If both conditions must be met in order for the remote port structure to be deallocated, how did it happen that thread 2 had a reference to a deallocated structure?

Here is a chronological flow for each thread (based on my analysis):

Thread 1                            Thread 2

Acquire pd_mutex                    .
Decrement pd_ref_count              .
Release pd_mutex                    .
Call fctl_destroy_remote_port       Acquire pd_mutex
.                                   Decrement pd_ref_count
.                                   Clear PD_GIVEN_TO_ULPS flag
.                                   Release pd_mutex
Acquire pd_mutex                    Call fctl_destroy_remote_port
Check if deallocation is ok         .
Release pd_mutex                    .
Deallocate remote port              .
                            Acquire pd_mutex
                                    panic

Once the cause of the panic is evident from an understanding of the above timeline, the solution to the problem should be clear. After thread 2 acquired the pd_mutex, decremented the reference count and cleared the PD_GIVEN_TO_ULPS flag, thread 1 was able to acquire the mutex. At that point, fctl_destroy_remote_port determined (incorrectly) that it was appropriate to deallocate the remote port structure.

As I am no longer responsible for Leadville support, I suggested that the most straight-forward fix to this bug was to require that the pd_mutex be held while calling fctl_destroy_remote_port. If this restriction were put into place, the race condition would be eliminated.

When this suggestion was made, the following question was raised:

"Even if the mutex is held across the call the fctl_destroy_remote_port, wouldn't the panic still occur when the second thread tried to obtain the pd_mutex?"

My response was "There is no longer a second thread".

It is getting pretty late on Friday night, so I suppose this is a good time to wrap this up. I can't believe I spent time on Friday night working on this entry. Perhaps this could shed some light.

Until next time,

David

Comments:

Thank you for your blog - as an OpenSolaris developer (and Sun customer), just wanted to let you know I appreciated your thoroughness.

Posted by Alistairre Bowen on September 22, 2006 at 11:17 PM MDT #

Post a Comment:
Comments are closed for this entry.