Welcome to the Event Horizon

Wednesday Feb 06, 2008

If you're reading this, you may have seen the recent announcement on the storage-discuss mailing list regarding COMSTAR and the fact that there is now support for a SAS port provider. In case you missed it, here is the thread.

There is also an opensolaris page that discusses the usage model for mptt.

There is one note that isn't mentioned in the mptt discussion link above. First off, if your SAS initiator is not another LSI 1068/1068e based HBA (Sun-branded or not), there is no need to read further. Second, if you already have a COMSTAR SAS configuration set up and are not having any issues discovering your COMSTAR LUs, then again, no need to continue unless it's simply for morbid curiosity.

OK. So, you are running SAS on COMSTAR, you're using a 1068 or 1068e based HBA as the initiator, and you're having trouble seeing the COMSTAR targets you've taken the time to create, register, and map. Fear not.

Now, I can't exactly explain why this seems to be a problem in some configurations and not in others. However, this change has fixed my configuration when I've run into this problem.

First question: Is your SAS initiator running on SPARC or x86 (x64)? Read the appropriate section below.

"Fixing" your 1068[e] initiator HBA running on x64 for COMSTAR

Fortunately, the HBA option ROM (BIOS) will allow you to make the necessary change.

Step 1: Reboot the system.
Step 2: Hit Ctrl-C when prompted to enter the LSI Logic Configuration Utility.
Step 3: From the main screen, choose your initiator HBA by highlighting the appropriate entry and hitting ENTER.
Step 4: Choose the "Advanced Adapter Properties" option and hit ENTER.
Step 5: Choose the "Advanced Timing Properties" option and hit ENTER.
Step 6: Look for the "Direct Attached Max Targets To Spinup" option. Chances are, the value there is going to be "0". Highlight that value and hit the space bar to increment the value. Choosing 8 should be reasonable.
Step 7: Hit ESC three times to get to the exit menu.
Step 8: Choose "Save changes then exit this menu" and hit ENTER.
Step 9: From the Adapter List (main) page, hit ESC, then choose "Save changes and reboot."

When your system comes back up, you should be seeing all of your COMSTAR LUs.

"Fixing" your 1068[e] initiator HBA running on SPARC for COMSTAR

For the SPARC initiator, you're going to need an LSI-provided program called "lsiutil". If you've ever installed the LSI initiator driver (itmpt), you may have this on your system already. Look for /usr/bin/lsiutil. If you don't have it, you can get it here. Choose the "Solaris SPARC" download. You should end up downloading a file called SAS_Solaris_8-9_SPARC.zip or something similar. If you already have lsiutil, skip down to step 6 below.

Step 1: Download the file as referenced above.
Step 2: unzip the zip file into some subdirectory, then cd into that directory.
Step 3: You'll likely have a file called "itmpt_5.07.04_sparc.tar.Z", although the version number may be different.
Step 4: Extract lsiutil by doing the following: "zcat itmpt_5.07.04_sparc.tar.Z | tar xf - install/ITImpt/reloc/usr/bin/lsiutil"
Step 5: As root, "cp install/ITImpt/reloc/usr/bin/lsiutil /usr/bin/"
Step 6: Run lsiutil
Step 7: Select the appropriate device from the available list and hit ENTER. If you are using the Solaris mpt initiator driver and you don't see any mpt ports listed, your version of lsiutil may be too old. If this is the case, you'll need to download a newer version from the link above. The current version at the link above is 1.52.17 (November 14, 2007) and it does work.
Step 8: Choose option 9. It may not show up in the list of options unless you hit "e" and ENTER, but you don't have to do that.
Step 9: At the "Enter page type:" prompt, type in 2 and hit ENTER.
Step 10: At the "Enter page number:" prompt, type in 1 and hit ENTER.
Step 11: At the "Read NVRAM or current values?" prompt, just hit ENTER for the default (NVRAM). You should see output similar to the following:

Enter page type:  [0-255 or RETURN to quit] 2
Enter page number:  [0-255 or RETURN to quit] 1
Read NVRAM or current values?  [0=NVRAM, 1=Current, default is 0] 

0000 : 22010803
0004 : 00000700
0008 : 00500230
000c : 00000000
0010 : 00000000
0014 : 00000018
0018 : 000a000a
001c : 000a000a

Do you want to make changes?  [Yes or No, default is No] 
Step 12: Type in Yes and hit ENTER to make changes.
Step 13: At the "Enter offset of value to change:" prompt, type in 8 and hit ENTER.
Step 14: Look at the value at offset 8 above. In this example, it's 00500230. If you see something like 00508230, you shouldn't need to change anything, and shouldn't be having a problem. Otherwise, you want to change bits 12 through 15 (nibble 3, counting from 0 at the least significant end) to 8. For this example, you would enter 00508230 and hit ENTER. That is, keep all the values the same except for nibble 3.
Step 15: At the next "Enter offset of value to change" prompt, hit ENTER to quit.
Step 16: At the "Do you want to write your changes?" prompt, type in Yes and hit ENTER.
Step 17: Hit ENTER to quit option 9, then hit 0 followed by ENTER and 0 followed by ENTER again to quit lsiutil.
Step 18: Reboot your initiator system.

When the system comes back up, you should see your COMSTAR SAS LUs.
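
If you would rather compute the new value for offset 8 than edit the hex digits by hand, the nibble replacement from step 14 can be sketched in C. This is only an illustration of the arithmetic; the helper name set_nibble is made up for this example and is not part of lsiutil.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Replace one 4-bit nibble of a 32-bit word, leaving the rest untouched.
 * Nibble 0 is the least significant, so nibble 3 covers bits 12 through 15.
 * set_nibble is a hypothetical helper, not an lsiutil function.
 */
static uint32_t
set_nibble(uint32_t word, unsigned nibble, uint32_t value)
{
        unsigned shift = nibble * 4;

        assert(nibble < 8 && value <= 0xF);
        return ((word & ~((uint32_t)0xF << shift)) | (value << shift));
}
```

With the example dump above, set_nibble(0x00500230, 3, 0x8) gives 0x00508230, which is the value to type in at the prompt. A word that already reads 00508230 comes back unchanged.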

If you have any issues relating to mptt, device discovery, or even COMSTAR in general, feel free to send a message to storage-discuss.

Good luck, and happy SASing.

Friday Sep 22, 2006

Hello again,

In my previous life at Sun, I was a software engineer working on Leadville. More specifically, I spent almost two years working on the Fibre Channel stack.

From the time I started working on Leadville, there were numerous recurring panics that all revolved around one central issue. That issue was called the "pd mutex", aptly named after the mutex that is supposed to protect a structure that represents a remote port on the SAN. That structure used to be called the "p"ort "d"evice structure (thus the name pd_mutex), although it has since been appropriately renamed to fc_remote_port_t.

These pd_mutex bugs would pop up on a regular basis because the structures weren't being properly accounted for within the stack.

By the nature of SANs, devices come and go. A host reboots, a user allocates a new LUN on an array, any number of solicited or unsolicited events cause remote ports to appear and disappear. In theory, when a remote port disappears, the structure that represents that remote port also disappears. In reality, it isn't that simple.

I knew there was a fundamental root cause to all the open bugs that were related to "pd_mutex". My goal was to reduce all the open bugs down to that common root cause. The following bug is the one I chose to engineer a solution for:

Bug #4792071

That bug had been open for approximately two years already when I took it over. An initial look at the code made it clear that although there was a field within the remote port structure that was supposed to maintain a reference count, that field was rarely used. I came to the conclusion that the root cause of all the pd_mutex bugs was that there was no adequate reference counting associated with the structure. Thus began two months of in-depth analysis, bug fix development, and testing to ensure that this bug and the other eight or so related bugs no longer exhibited the panics associated with the pd_mutex problem.

After the fix was integrated, the issues slowed to a trickle. Since then, however, several new bugs have arisen that also point to improper bookkeeping of the remote port structure. One of those recent bugs is the following:

Bug #6255534

Unfortunately, the bug listing in opensolaris isn't very helpful, but this is where talk of threads and synchronization comes into play. This bug is a great example for discussion of fundamental locking methodology in a multi-threaded environment. As the synopsis of the bug indicates, a system panic occurred during Leadville testing while target-side cable pulls were being performed.

A brief code inspection revealed a race condition to be the root cause of the panic. At the time of the panic, there were two threads operating with the same remote port structure. One thread was running at interrupt context and was attempting to destroy a packet that had evidently failed transmission at an earlier time. The second thread was handling a state change notification, presumably one indicating that the remote port had disappeared.

Now, as a bit of background, part of my fix for bug ID 4792071 was to implement a flag in addition to consistent and proper use of a reference counter for each remote port structure. The flag is an indication that the transport layer has notified the ULP that a remote port exists. Thus, there are now two conditions that must be true in order for a remote port structure to be deallocated. First, the reference count must be zero. Second, the flag indicating that ULPs are aware of the remote port must be clear. Then, and only then will a remote port structure be freed.
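
As a rough illustration of that rule, here is a minimal userland sketch of the two-condition test. The struct, the flag value, and the function name are all stand-ins invented for this example; the real structure is fc_remote_port_t and the real flag is PD_GIVEN_TO_ULPS.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for PD_GIVEN_TO_ULPS; the value is arbitrary. */
#define MODEL_GIVEN_TO_ULPS     0x1

/* Reduced model of a remote port: only the two fields the rule needs. */
struct model_port {
        int             ref_count;      /* outstanding references */
        unsigned        aux_flags;      /* carries the "ULPs know" flag */
};

/* A port may be freed only when BOTH conditions hold. */
static bool
model_can_free(const struct model_port *p)
{
        return (p->ref_count == 0 &&
            (p->aux_flags & MODEL_GIVEN_TO_ULPS) == 0);
}
```

A port with a zero reference count but the flag still set is not freeable, and neither is a port with the flag clear but references outstanding.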

With that in mind, have a quick look at the relevant piece of code for each thread just prior to the panic. The following code fragments can be seen in the opensolaris NWS source (nws-src-20060724).

Thread 1 (fctl/src/fctl.c:882 in function fc_ulp_uninit_packet):

...
    mutex_enter(&pd->pd_mutex);                                     
                                                                                
    ASSERT(pd->pd_ref_count > 0);                                   
    pd->pd_ref_count--;                                             
                                                                                                                                          
    if (pd->pd_state == PORT_DEVICE_INVALID &&                      
        pd->pd_ref_count == 0) {                                    
            fc_remote_node_t *node = pd->pd_remote_nodep;           
                                                                                
            mutex_exit(&pd->pd_mutex);                              
                                                                                                                             
            if ((fctl_destroy_remote_port(port, pd) == 0) &&        
                (node != NULL)) {                                   
                    fctl_destroy_remote_node(node);                 
            }                                                       
            return (rval);                                          
    }
...

Thread 2 (fctl/src/fctl.c:2907 in function fctl_ulp_statec_cb):

...
    mutex_enter(&pd->pd_mutex);                             
                                                                                
    pd->pd_ref_count--;                                     
    ASSERT(pd->pd_ref_count >= 0);                          
                                                                                
    if (clist->clist_map[count].map_state !=                
        PORT_DEVICE_INVALID) {                                                     
            mutex_exit(&pd->pd_mutex);                      
            continue;                                       
    }                                                       
                                                                                
    node = pd->pd_remote_nodep;                             
    pd->pd_aux_flags &= ~PD_GIVEN_TO_ULPS;                  
                                                                                
    mutex_exit(&pd->pd_mutex);                              
                                                                                                                                    
    if ((fctl_destroy_remote_port(port, pd) == 0) &&        
        (node != NULL)) {                                   
            fctl_destroy_remote_node(node);                 
    }
...

Both threads end up calling fctl_destroy_remote_port, so here is the relevant portion of that function:

int                                                                             
fctl_destroy_remote_port(fc_local_port_t *port, fc_remote_port_t *pd)           
{                                                                               
        fc_remote_node_t        *rnodep;                                        
        int                     rcount = 0;                                     
                                                                                
        mutex_enter(&pd->pd_mutex);                                             
                                                                                                                                                 
        if ((pd->pd_ref_count > 0) ||                                           
            (pd->pd_aux_flags & PD_GIVEN_TO_ULPS)) {                            
                pd->pd_aux_flags |= PD_NEEDS_REMOVAL;                           
                pd->pd_type = PORT_DEVICE_OLD;                                  
                mutex_exit(&pd->pd_mutex);                                      
                return (1);                                                     
        }                                                                       
                                                                                
        pd->pd_type = PORT_DEVICE_OLD;                                          
                                                                                
        rnodep = pd->pd_remote_nodep;                                           
                                                                                
        mutex_exit(&pd->pd_mutex);                                              
                                                                                
        if (rnodep != NULL) {                                                                                                            
            rcount = fctl_unlink_remote_port_from_remote_node(rnodep, pd);  
        }                                                                       
                                                                                
        mutex_enter(&port->fp_mutex);                                           
        mutex_enter(&pd->pd_mutex);                                             
                                                                                
        fctl_delist_did_table(port, pd);                                        
        fctl_delist_pwwn_table(port, pd);                                       
                                                                                
        mutex_exit(&pd->pd_mutex);                                              
        fctl_dealloc_remote_port(pd);                                           
                                                                                
        mutex_exit(&port->fp_mutex);                                            
                                                                                
        return (rcount);                                                        
}

As can be seen above, fctl_destroy_remote_port checks the PD_GIVEN_TO_ULPS flag and the reference count (pd_ref_count) prior to actually deallocating the remote port structure.

In bug ID 6255534, thread 1 caused the deallocation of the remote port structure. Thread 2 caused the panic at fctl_destroy_remote_port+4. If both conditions must be met in order for the remote port structure to be deallocated, how did it happen that thread 2 had a reference to a deallocated structure?

Here is a chronological flow for each thread (based on my analysis):

Thread 1                            Thread 2

Acquire pd_mutex                    .
Decrement pd_ref_count              .
Release pd_mutex                    .
Call fctl_destroy_remote_port       Acquire pd_mutex
.                                   Decrement pd_ref_count
.                                   Clear PD_GIVEN_TO_ULPS flag
.                                   Release pd_mutex
Acquire pd_mutex                    Call fctl_destroy_remote_port
Check if deallocation is ok         .
Release pd_mutex                    .
Deallocate remote port              .
.                                   Acquire pd_mutex
.                                   panic

With the timeline above in hand, both the cause of the panic and its solution should be clear. Thread 1 released the pd_mutex between decrementing the reference count and calling fctl_destroy_remote_port. In that window, thread 2 acquired the pd_mutex, decremented the reference count, and cleared the PD_GIVEN_TO_ULPS flag. By the time thread 1 reacquired the mutex inside fctl_destroy_remote_port, both conditions for deallocation appeared to be met, so the function determined (incorrectly) that it was appropriate to deallocate the remote port structure, even though thread 2 was still about to use it.
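
The interleaving can be replayed deterministically in a small single-threaded model. This is only a sketch of the timeline, not of the Leadville code: every name is invented for the example, the starting reference count of 2 is an assumption chosen so that both decrements bring it to zero, and the other tests made by the callers (such as the pd_state check) are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define RACE_GIVEN_TO_ULPS      0x1     /* stand-in for PD_GIVEN_TO_ULPS */

struct race_port {
        int             ref_count;
        unsigned        aux_flags;
        bool            freed;          /* models the deallocation */
};

/* Models only the two-condition check; returns true if it "freed" p. */
static bool
race_destroy(struct race_port *p, bool *touched_freed)
{
        if (p->freed) {
                /* The real code would panic here acquiring pd_mutex. */
                *touched_freed = true;
                return (false);
        }
        if (p->ref_count > 0 || (p->aux_flags & RACE_GIVEN_TO_ULPS))
                return (false);
        p->freed = true;
        return (true);
}

/* Replays the timeline step by step; returns true if a freed port is used. */
static bool
race_replay(void)
{
        struct race_port p = { 2, RACE_GIVEN_TO_ULPS, false }; /* assumed */
        bool touched_freed = false;

        p.ref_count--;                          /* thread 1, then drops lock */
        p.ref_count--;                          /* thread 2 */
        p.aux_flags &= ~RACE_GIVEN_TO_ULPS;     /* thread 2, then drops lock */
        (void) race_destroy(&p, &touched_freed);        /* thread 1 frees */
        (void) race_destroy(&p, &touched_freed);        /* thread 2 arrives */
        return (touched_freed);
}
```

In this model race_replay() returns true: the first destroy call sees a zero count and a clear flag, and the second call arrives at a port that is already gone.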

As I am no longer responsible for Leadville support, I suggested that the most straightforward fix to this bug was to require that the pd_mutex be held while calling fctl_destroy_remote_port. If this restriction were put into place, the race condition would be eliminated.

When this suggestion was made, the following question was raised:

"Even if the mutex is held across the call to fctl_destroy_remote_port, wouldn't the panic still occur when the second thread tried to obtain the pd_mutex?"

My response was "There is no longer a second thread".
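
To make that answer a little more concrete, here is a userland sketch of the suggested locking discipline using a POSIX mutex. It is not the Leadville fix itself: the names are invented for this example, a "freed" flag stands in for the real deallocation, and real code would also have to deal with the lock living inside the structure being freed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define SK_GIVEN_TO_ULPS        0x1     /* stand-in for PD_GIVEN_TO_ULPS */

struct sk_port {
        pthread_mutex_t lock;           /* plays the role of pd_mutex */
        int             ref_count;
        unsigned        aux_flags;
        bool            freed;          /* stands in for deallocation */
};

/* Caller must hold p->lock. Returns true if the port was torn down. */
static bool
sk_destroy_locked(struct sk_port *p)
{
        if (p->ref_count > 0 || (p->aux_flags & SK_GIVEN_TO_ULPS))
                return (false);
        p->freed = true;
        return (true);
}

/*
 * Drop one reference and attempt the teardown in a single critical
 * section, so no other thread can change the count or the flag between
 * the decrement and the check.
 */
static bool
sk_release(struct sk_port *p, bool clear_ulp_flag)
{
        bool destroyed;

        pthread_mutex_lock(&p->lock);
        assert(!p->freed && p->ref_count > 0);
        p->ref_count--;
        if (clear_ulp_flag)
                p->aux_flags &= ~SK_GIVEN_TO_ULPS;
        destroyed = sk_destroy_locked(p);
        pthread_mutex_unlock(&p->lock);
        return (destroyed);
}
```

Whichever caller runs first sees either a nonzero count or the flag still set and backs off. Only the last caller through tears the port down, and by then nobody else is left holding a reference to it.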

It is getting pretty late on Friday night, so I suppose this is a good time to wrap this up. I can't believe I spent time on Friday night working on this entry. Perhaps this could shed some light.

Until next time,

David