Wednesday June 15, 2005

Timing is Everything


At a testing seminar I attended a couple of years ago, I found myself surrounded by a large number of engineers responsible for testing applications, whereas I've typically worked on operating system components. As we discussed situations we'd encountered on the job, I realized that there's often quite a gulf between the two types of testing. One area of difference was in dealing with timing-related issues. The application test engineers were typically concerned with the timing of user-initiated operations, whereas I had to consider both user-initiated actions and the timing of various operations happening below the surface. Most of the other engineers attending the seminar were surprised at the level of detail and precision that OS testing goes into.

For this post I'll use the Solaris Volume Manager (SVM) as an example, since that's what I've spent the most time on. SVM allows you to combine multiple physical devices into a single logical device (known as a metadevice in SVM parlance) or split a single physical device into multiple metadevices.

One of the more complicated metadevice types in SVM is RAID5. RAID (Redundant Array of Inexpensive Disks) level 5 stripes data across multiple disks and provides data redundancy by adding one extra drive and generating and storing parity based on the data. The parity is spread across all disks, not concentrated on one. The following crude ASCII diagram gives an idea of how data and parity are laid out on a RAID5 metadevice:
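A minimal sketch of such a layout, generated in Python. The rotation rule (parity moves one component to the left on each row) and the name raid5_layout are assumptions made for illustration; the actual SVM layout may rotate parity differently.

```python
def raid5_layout(num_components, num_rows):
    """Return rows of labels with exactly one parity chunk ('P') per row.

    Assumption: parity moves one component to the left on each row.
    Real implementations, including SVM, may rotate differently.
    """
    rows = []
    next_data = 0
    for row in range(num_rows):
        parity_col = (num_components - 1 - row) % num_components
        labels = []
        for col in range(num_components):
            if col == parity_col:
                labels.append("P")
            else:
                labels.append(f"d{next_data}")
                next_data += 1
        rows.append(labels)
    return rows


if __name__ == "__main__":
    # One line per row of the stripe, one column per component.
    for labels in raid5_layout(num_components=3, num_rows=3):
        print("  ".join(f"{label:>3}" for label in labels))
```

Each row holds one parity chunk, and over successive rows the parity lands on every component in turn, which is what keeps any single disk from becoming the parity bottleneck.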

The data in a RAID5 is split up into chunks based on the 'interlace size' for the device. When a write occurs to the device, the SVM code breaks the data up into chunks of this size and calculates parity. The chunks of data and parity are written to the components (disk slices) that make up the RAID5 device. The advantages of a RAID5 are that you get read performance gains from the striping of the data across multiple disks, and that you can lose any one component and still access all of the data by combining the remaining data with the parity info. The disadvantage is that calculating and writing the parity information slows down write operations. Also, the RAID5 must be 'zeroed out' (zeroes written at all addresses) before it can be used, to ensure that the data and parity match.
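To illustrate the chunking and the recovery of a lost component, here is a small sketch. It assumes byte-wise XOR parity, which is the usual RAID5 scheme; the function names and the 8-byte interlace are made up for the example and are not taken from the SVM code.

```python
from functools import reduce


def xor_chunks(chunks):
    """XOR equal-length byte chunks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def split_into_chunks(data, interlace):
    """Split data into interlace-sized chunks, zero-padding the last one."""
    padded = data + b"\x00" * (-len(data) % interlace)
    return [padded[i:i + interlace] for i in range(0, len(padded), interlace)]


def reconstruct(surviving_chunks, parity):
    """Rebuild the single missing data chunk from the survivors and parity."""
    return xor_chunks(surviving_chunks + [parity])


if __name__ == "__main__":
    chunks = split_into_chunks(b"timing is everything", interlace=8)
    parity = xor_chunks(chunks)

    # Pretend the component holding chunk 1 has failed.
    lost = 1
    survivors = chunks[:lost] + chunks[lost + 1:]
    assert reconstruct(survivors, parity) == chunks[lost]

    # A freshly zeroed device is trivially consistent: zeroes XOR to zeroes.
    assert xor_chunks([bytes(8), bytes(8)]) == bytes(8)
```

The last assertion shows why zeroing works as an initial state: with every data chunk at zero, the matching parity is also zero, so data and parity agree before the first write.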

The following example shows data being written to a RAID5 device. Lower-case characters, 'd' and 'p', are used to show the existing data and parity information on the RAID5. Upper-case letters, 'D' and 'P', show the new data and parity information being written to the RAID5.

When no writes are being made to the RAID5, the data and parity on the metadevice are static so the timing of a failure doesn't really matter much. Things start to get interesting when the device is being written to. Consider what would happen if the system encountered a problem (power failure, panic, etc.) at the middle stage where the write was only partially complete. The data and parity no longer match, so if a component failed the data that would be reconstructed by using the parity would be corrupt. If it were possible to do an 'atomic' operation in which we could be guaranteed that either all three components had been updated or none of them had then we wouldn't have a problem. Unfortunately we have no method of doing so.
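To make the problem concrete, here is a small sketch of an interrupted write on a three-component stripe. It assumes byte-wise XOR parity (the usual RAID5 scheme); the component names and chunk values are invented for the example.

```python
def xor_chunks(chunks):
    """XOR equal-length byte chunks together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


def torn_write_demo():
    # Old, consistent stripe: two data chunks plus their parity.
    old_d0, old_d1 = b"aaaa", b"bbbb"
    stripe = {"c0": old_d0, "c1": old_d1, "c2": xor_chunks([old_d0, old_d1])}

    # The new write intends to replace both data chunks and the parity.
    new_d0, new_d1 = b"AAAA", b"ZZZZ"

    # Crash in the middle: only component c0 received its new contents.
    stripe["c0"] = new_d0

    # Later, component c1 fails and is rebuilt from c0 and the stale parity.
    rebuilt_d1 = xor_chunks([stripe["c0"], stripe["c2"]])
    return rebuilt_d1, old_d1, new_d1


if __name__ == "__main__":
    rebuilt, old, new = torn_write_demo()
    # The rebuilt chunk is neither the old data nor the new data.
    print(rebuilt, old, new)
```

The rebuilt chunk matches neither the old contents nor the intended new contents, and nothing on the device flags it as wrong.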

Since we can't guarantee simultaneous updates to all components, we need some other mechanism to keep the data and parity consistent. In the case of an SVM RAID5 metadevice, this is handled by having a prewrite area on each RAID component:

The prewrite area takes up a small amount of space at the start of each RAID5 component. When writing to a RAID5, data is first written to the prewrite area. Once the prewrite areas for all components have been updated, the data can start to be written into the desired blocks within each component:

Note that the use of a prewrite area adds still more overhead to RAID5 writes, as all of the data gets written twice (first to the prewrite area, then to the final destination).

Now let's look at what happens if we encounter a failure at any given point in the above example:

This explanation is a bit oversimplified (e.g. we must obviously store the target address for the write as well as the data) but it should give you an idea of how a RAID5 metadevice ensures that the data and parity match.
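The failure cases can also be walked through in code. The following is a toy model, not SVM's implementation: it assumes XOR parity, a full-stripe write, and a recovery rule of "replay the prewrite areas if every component's entry is present, otherwise discard them". The class and function names are hypothetical.

```python
def xor_chunks(chunks):
    """XOR equal-length byte chunks together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


class Crash(Exception):
    """Simulated power failure or panic."""


class Stripe:
    """Toy stripe: one final block and one prewrite slot per component.

    The slot index stands in for the target address that a real
    prewrite entry would have to record.
    """

    def __init__(self, data_chunks):
        self.final = list(data_chunks) + [xor_chunks(data_chunks)]
        self.prewrite = [None] * len(self.final)

    def write(self, new_data_chunks, crash_after=None):
        """Apply a full-stripe write, crashing after `crash_after` steps."""
        new = list(new_data_chunks) + [xor_chunks(new_data_chunks)]
        # Phase 1 stages every chunk; phase 2 copies to the final blocks.
        ops = [(self.prewrite, i, c) for i, c in enumerate(new)]
        ops += [(self.final, i, c) for i, c in enumerate(new)]
        for done, (area, i, chunk) in enumerate(ops):
            if done == crash_after:
                raise Crash()
            area[i] = chunk
        if crash_after == len(ops):
            raise Crash()
        self.prewrite = [None] * len(self.final)

    def recover(self):
        """Assumed rule: replay a complete prewrite, discard a partial one."""
        if all(entry is not None for entry in self.prewrite):
            self.final = list(self.prewrite)
        self.prewrite = [None] * len(self.final)

    def is_consistent(self):
        return xor_chunks(self.final[:-1]) == self.final[-1]


def crash_at_every_point(old, new):
    """Crash after each possible step and report what recovery leaves."""
    outcomes = []
    total_steps = 2 * (len(old) + 1)
    for crash_after in range(total_steps + 1):
        stripe = Stripe(old)
        try:
            stripe.write(new, crash_after=crash_after)
        except Crash:
            pass
        stripe.recover()
        assert stripe.is_consistent()
        data = stripe.final[:-1]
        if data == list(new):
            outcomes.append("new")
        elif data == list(old):
            outcomes.append("old")
        else:
            outcomes.append("corrupt")
    return outcomes


if __name__ == "__main__":
    print(crash_at_every_point([b"aaaa", b"bbbb"], [b"AAAA", b"ZZZZ"]))
```

In this model a crash before the last prewrite slot is filled leaves the old data in place, and a crash at any later point ends up with the new data after recovery. Either way, data and parity agree.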

To validate the functionality of RAID5 we need to be able to simulate failures at or in between each of these points. This requires 'white box/glass box' testing, where the test engineers have visibility into the internal workings of the product code in order to test it adequately. Understanding the order in which these steps occur, and having a synchronization method such that failures can be simulated at critical points, is necessary to make sure we've covered all the bases. Black box testing, where we're aware of the interfaces and the general operation but not the internal details, could easily miss a bug that will only manifest itself if a system or disk failure happens at one specific point or transition that's not visible to the user. If you look at the SVM code you'll find references to 'notify' -- these are 'hooks' built into the product code to allow an outside entity (like automated tests) to be informed of internal SVM events and state changes so that they can synchronize their actions with the SVM activity.
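The general idea of such hooks can be sketched as follows. This is an illustration of the pattern only: the Notifier class, the event names, and raid_write are all hypothetical and do not reflect the actual SVM notify interface.

```python
class Notifier:
    """Hypothetical event hook registry; not SVM's actual interface."""

    def __init__(self):
        self._listeners = {}

    def subscribe(self, event, callback):
        self._listeners.setdefault(event, []).append(callback)

    def notify(self, event, **details):
        for callback in self._listeners.get(event, []):
            callback(event, **details)


class InjectedFailure(Exception):
    """Raised by a test callback to simulate a failure at a chosen point."""


def raid_write(notifier, components, log):
    """Stand-in for product code that announces each internal step."""
    for name in components:
        notifier.notify("prewrite_start", component=name)
        log.append(f"prewrite {name}")
    notifier.notify("prewrite_done")
    for name in components:
        notifier.notify("final_write_start", component=name)
        log.append(f"final {name}")
    notifier.notify("write_done")


def run_until_failure(event, component=None):
    """Run one write, failing when `event` fires (optionally for one component)."""

    def callback(seen_event, **details):
        if component is None or details.get("component") == component:
            raise InjectedFailure(f"{seen_event} {details}")

    notifier = Notifier()
    notifier.subscribe(event, callback)
    log = []
    try:
        raid_write(notifier, ["c0", "c1", "c2"], log)
    except InjectedFailure:
        pass
    return log


if __name__ == "__main__":
    # Fail just before the final write to component c1.
    print(run_until_failure("final_write_start", component="c1"))
```

Because the product code announces each step, a test can pick the exact transition at which to inject a failure instead of relying on lucky timing.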

This isn't to say that black box testing isn't valuable. White box testing tends to be very 'development centric', as you're looking at how the code works and probing areas where you suspect there could be issues. However, while you may do a good job of proving that the specific implementation works or doesn't, that doesn't necessarily tell you whether the code meets the larger goal of satisfying a customer need. By contrast, black box testing approaches the product more from a user/customer point of view, helping you determine if it fulfills its intended purpose. Testing from both of these angles is better than using only one or the other.

The requirement to keep the system and data in a coherent and usable state after failures is one of the things that differentiates operating system validation from application testing. In most cases applications don't have stringent demands placed on their ability to recover after an error. As long as the application will start back up, the user will be satisfied, and simply keeping a backup copy of any document modified by the application is usually sufficient to avoid any significant impact on the user. For operating systems, however, it's critical to be able to recover without having corrupted any data and without requiring major intervention on the part of the user, who probably doesn't have a lot of OS expertise. Identifying windows of vulnerability and being able to probe them with some degree of precision is a difficult but necessary part of operating system validation.

