Weblog

20070928 Friday September 28, 2007

lofi tests, where art thou? lofi (loopback filesystem) tests are nowhere to be found ( Sep 28 2007, 10:40:19 AM PDT ) Permalink

20050713 Wednesday July 13, 2005

A customer would never do that...

http://opensolaris.org

Not too long ago I was in a briefing where our marketing folks were providing data generated from visits to current and potential Sun customers. One of the items that was brought up elicited a response of "A customer would never do that!" when I related it to a fellow engineer. Never mind that several customers had specifically told us they do exactly that; it was hard to believe because it didn't match up with this engineer's experience. Unfortunately most of us in development don't really have a good idea of how customers use (or would like to use) our products.

I won't go into detail on this specific instance because I'm not sure if that customer data can be shared outside of the company. However, I will use an example from a previous company I worked at to illustrate the point:


A previous employer of mine was also a UNIX systems shop, doing both the hardware and software. Loadable kernel modules hadn't come to this company yet, which meant that you had to build kernels specifically for the hardware you had installed in the machine. For a given machine you would install the OS with a generic kernel, then install the packages that contained the drivers for the specific hardware options present in the system. Once all the necessary packages were installed, you would build a new kernel on the machine and then reboot onto the new kernel to make all of your hardware functional. If you added a new piece of hardware later on, you would need to install the associated software package, rebuild, and reboot.

Installing the required packages was somewhat tedious, as the full set of packages spanned several 9-track tape reels. You didn't want to install packages you didn't need (due to the time required and limited disk space) so quickly identifying the correct reels was a big time saver. The order and number of packages per tape could vary quite a bit depending on what options were chosen when the tape was created, so installation could be a pain. We talked to the release folks to determine if there was something we could do to simplify life with the tape packages. Several options were discussed, but because "Customers get pre-installed systems and re-installs are very rare" the decision was made not to expend the effort to improve the process.

Sometime after that I was sent out to a customer site to assist the field engineers in the first installation of some brand new hardware. Once the system was physically put together, one of the field engineers proceeded to start a new install. "You don't need to do that," I told him, "It was preinstalled at the factory."

"I know," he replied, "but we've seen so many misconfigured systems that we never trust the factory install. When we put in a new system, we always reinstall from scratch."

And it got even worse. In talking to the site manager, I found that much of the time they loaded the machines with test versions of their database to run sanity checks to satisfy themselves that the new system was working properly. Once they were done, they would then typically re-install again from scratch to be certain that they'd purged all of the test setup before putting the machine into production.

So, while back in the R&D part of the company we thought that re-installs in the field were extremely rare, in reality many of the machines were installed twice at the customer site before going into production. And we didn't even have to go to a customer site to learn this; our own field support people could have told us. We thought we had a reasonable grasp on what customers did and good communication with our field folks, but we were wrong.


One of the things I hope the OpenSolaris community can provide us with is a better connection to how people in the real world use or would like to use our products. We have a policy of "Eating our own dog food", or using our own machines to act as our mail servers, home directory servers, etc. While helpful, we've learned the hard way that how we set things up in a development environment is often very different from how it's done "out in the wild". The more channels we have for learning the customer reality the better, so hopefully comments from the OpenSolaris community will supplement and reinforce the information that gets fed back through marketing so that we can better design products that work the way people need them to.

OpenSolaris

Solaris ( Jul 13 2005, 01:15:47 PM PDT ) Permalink Comments [0]

20050712 Tuesday July 12, 2005

A bit of SVM history (answering Francis' question)

http://opensolaris.org


I finally realized I should check for comments on my blog and found that Francis Liu asked the following question back on June 19th (with regards to my RAID 0+1 vs. RAID 1+0 and SVM blog):

One question, is the description true for all versions of DiskSuite? Or is it only true for SVM (ie Solaris 10 or OpenSolaris)? I've seen documents that say that this sort of thing only applies to SDS 4.2.1 with patches and later.

While I started working with DiskSuite for the 4.2.1 release, I know that the convoluted mirror/stripe interaction dates back to well before then. I arrived on the SDS scene in 1999 when development was ramping back up after a period of inactivity. Several years earlier Sun had closed down the Rocky Mountain Technical Center (RMTC) in Colorado Springs. One of the casualties was the SDS development group. At the time upper management seemed to think that continued development of SDS wasn't a good investment and so while the product continued to be available there wasn't ongoing development. This decision was reversed some time later, but as the vast majority of the RMTC engineers had left Sun an SDS development group needed to be restarted mostly from scratch.

I know that the current mirror/concat/stripe paradigm existed in the SDS 4.1 release (which dates to before the dissolution of RMTC) and I suspect its roots go back much further than that. The mirror and concat/stripe devices are so intertwined that my guess is that they were designed together in the very early days of the code base that eventually became SVM. My understanding is that mirror/concat/stripe devices are the earliest ODS/SDS/SVM devices, followed later by trans devices (since discontinued in favor of logging ufs) and RAID5. Soft Partitions are a relatively late addition, being made after I joined the SDS team.

A bit more SVM history: The earliest references I've seen to the product refer to it as ODS (for "Online: DiskSuite"). At some point the name was changed to SDS ("Solstice DiskSuite"). In both cases the volume manager was an "add on" product rather than being built into Solaris. One of the major disadvantages of this was that upgrade didn't understand SDS metadevices, so if your system had a mirrored root you had to eliminate the mirror before upgrading and then recreate the root mirror after the upgrade. The decision was made to integrate SDS into base Solaris for the S9 release and the initial name chosen was Solaris Logical Volume Manager (SLVM, pronounced "sliv-em"). Before S9 actually shipped the decision was made to shorten the name to Solaris Volume Manager (SVM, or "siv-em"). While all the release docs were changed to show SVM instead of SLVM, a few references to SLVM persisted in error messages and comments in the code for some time after the S9 release before finally being eradicated.

I've heard that the original base code for ODS/SDS/SVM was bought by Sun from another company back in the late eighties or early nineties but have never seen any confirmation of that.


OpenSolaris

Solaris ( Jul 12 2005, 05:02:09 AM PDT ) Permalink Comments [0]

20050615 Wednesday June 15, 2005

Timing is Everything

http://opensolaris.org



At a testing seminar I attended a couple of years ago I found myself surrounded by a large number of engineers responsible for testing applications whereas I've typically worked on operating system components. As we discussed situations we'd encountered on the job, I realized that there's often quite a gulf between the two types of testing. One of the areas of difference was in dealing with timing-related issues. The application test engineers were typically concerned with the timing of user-initiated operations, whereas I had to consider both user-initiated actions and the timing of various operations happening down below the surface. Most of the other engineers attending the seminar were surprised at the level of detail and precision that OS testing went into.

For this blog I'll use the Solaris Volume Manager (SVM) as an example as that's what I've spent the most time on. SVM allows you to combine multiple physical devices into a single logical device (known as a metadevice in SVM parlance) or split a single physical device into multiple metadevices.

One of the more complicated metadevice types in SVM is RAID5. RAID (Redundant Array of Inexpensive Disks) level 5 stripes data across multiple disks and provides data redundancy by adding one extra drive and generating and storing parity based on the data. The parity is spread across all disks, not concentrated on one. The following crude ASCII diagram gives an idea of how data and parity are laid out on a RAID5 metadevice:

The data in a RAID5 is split up into chunks based on the 'interlace size' for the device. When a write occurs to the device, the SVM code breaks the data up into chunks of this size and calculates parity. The chunks of data and parity are written to the components (disk slices) that make up the RAID5 device. The advantages of a RAID5 are that you get read performance gains from the striping of the data across multiple disks, and that you can lose any one component and still be able to access all data by looking at the remaining data and the parity info. The disadvantage is that calculating and writing the parity information slows down write operations. Also, the RAID5 must be 'zeroed out' (write zeroes at all addresses) before it can be used to ensure that the data and parity match.
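To make the parity idea concrete, here is a minimal sketch in Python. It assumes the conventional RAID5 parity (a bytewise XOR of the data chunks in a row); the four-component layout, the chunk contents, and the helper name xor_blocks are made up for illustration and are not taken from the SVM code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One row of a hypothetical 4-component RAID5: three data chunks plus parity.
data_chunks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_chunks)

# Simulate losing the second component and rebuilding its chunk from the
# surviving data chunks and the parity chunk.
survivors = [data_chunks[0], data_chunks[2], parity]
rebuilt = xor_blocks(survivors)

# A freshly zeroed device is trivially consistent: zero data gives zero parity.
zero_parity = xor_blocks([bytes(4), bytes(4), bytes(4)])
```

Here rebuilt equals the lost chunk, which is why a RAID5 can survive the loss of any one component, and zero_parity shows why zeroing the device leaves data and parity in agreement.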

The following example shows data being written to a RAID5 device. Lower-case characters, 'd' and 'p', are used to show the existing data and parity information on the RAID5. Upper-case letters, 'D' and 'P', show the new data and parity information being written to the RAID5.

When no writes are being made to the RAID5, the data and parity on the metadevice are static so the timing of a failure doesn't really matter much. Things start to get interesting when the device is being written to. Consider what would happen if the system encountered a problem (power failure, panic, etc.) at the middle stage where the write was only partially complete. The data and parity no longer match, so if a component failed the data that would be reconstructed by using the parity would be corrupt. If it were possible to do an 'atomic' operation in which we could be guaranteed that either all three components had been updated or none of them had then we wouldn't have a problem. Unfortunately we have no method of doing so.

Since we can't guarantee simultaneous updates to all components, we need some other mechanism to keep the data and parity consistent. In the case of an SVM RAID5 metadevice, this is handled by having a prewrite area on each RAID component:

The prewrite area takes up a small amount of space at the start of each RAID5 component. When writing to a RAID5, data is first written to the prewrite area. Once the prewrite areas for all components have been updated, the data can start to be written into the desired blocks within each component:

Note that the use of a prewrite area adds still more overhead to RAID5 writes, as all of the data gets written twice (first to the prewrite area, then to the final destination).

Now let's look at what happens if we encounter a failure at any given point in the above example:

This explanation is a bit oversimplified (e.g. we must obviously store the target address for the write as well as the data) but it should give you an idea of how a RAID5 metadevice ensures that the data and parity match.
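The prewrite scheme can be modelled in a few lines of Python. This is a toy sketch, not the SVM implementation: the class and function names are hypothetical, each chunk is a single integer, and the recovery rule (replay the write only if every component holds a staged record, otherwise discard it) is my assumption about one reasonable policy.

```python
class Crash(Exception):
    """Simulated power failure or panic."""

class Component:
    def __init__(self, nblocks):
        self.prewrite = None           # staged (address, chunk) record, if any
        self.blocks = [0] * nblocks    # zeroed out, so data and parity match

def raid5_write(components, address, chunks, crash_after=None):
    """Write one chunk per component, crashing after crash_after completed steps."""
    steps = 0
    def step():
        nonlocal steps
        if crash_after is not None and steps == crash_after:
            raise Crash()
        steps += 1
    for comp, chunk in zip(components, chunks):   # phase 1: stage in prewrite areas
        step()
        comp.prewrite = (address, chunk)
    for comp in components:                       # phase 2: copy to final destination
        step()
        comp.blocks[address] = comp.prewrite[1]
    for comp in components:                       # phase 3: retire the staged records
        step()
        comp.prewrite = None

def recover(components):
    """Assumed policy: replay a fully staged write, discard a partial one."""
    if all(comp.prewrite is not None for comp in components):
        for comp in components:
            address, chunk = comp.prewrite
            comp.blocks[address] = chunk
    for comp in components:
        comp.prewrite = None

def consistent(components, address):
    """Data and parity match when the XOR across all components is zero."""
    total = 0
    for comp in components:
        total ^= comp.blocks[address]
    return total == 0

# Crash at every possible point of a 3-component write (two data chunks + parity).
results = []
for crash_point in range(10):
    comps = [Component(4) for _ in range(3)]
    try:
        raid5_write(comps, 0, [5, 6, 5 ^ 6], crash_after=crash_point)
    except Crash:
        pass
    recover(comps)
    results.append(consistent(comps, 0))
```

Every entry in results is True: whichever step the simulated crash lands on, recovery leaves the row holding either all of the old data and parity or all of the new.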

To validate the functionality of RAID5 we need to be able to simulate failures at or in between each of these points. This requires 'white box/glass box' testing where the test engineers have visibility into the internal workings of the product code in order to test it adequately. Understanding the order in which these steps occur, and having a synchronization method such that failures can be simulated at critical points, is necessary to make sure we've covered all the bases. Black box testing, where we're aware of the interfaces and the general operation but not the internal details, could easily miss a bug that will only manifest itself if a system or disk failure happens at one specific point or transition that's not visible to the user. If you look at the SVM code you'll find references to 'notify' -- these are 'hooks' built into the product code to allow an outside entity (like automated tests) to be informed of internal SVM events and state changes so that they can synchronize their actions with the SVM activity.
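The general shape of such a hook can be sketched in Python. Everything here is hypothetical (the Notifier class, the event names, the two-sided write used as stand-in product code); it shows the idea of a test synchronizing with internal events, not the actual SVM notify interface.

```python
class InjectedFailure(Exception):
    """Raised by a test hook to simulate a failure at a chosen point."""

class Notifier:
    def __init__(self):
        self._hooks = {}

    def register(self, event, callback):
        self._hooks.setdefault(event, []).append(callback)

    def notify(self, event):
        for callback in self._hooks.get(event, []):
            callback(event)

def two_sided_write(notifier, sides, value):
    """Stand-in product code: announce each internal transition as it happens."""
    notifier.notify("write_start")
    sides[0] = value
    notifier.notify("first_side_done")
    sides[1] = value
    notifier.notify("write_done")

def fail_on(event):
    """Test-side hook that injects a failure when the event fires."""
    raise InjectedFailure(event)

# The test asks to be told when the first side is done and fails the system there.
notifier = Notifier()
notifier.register("first_side_done", fail_on)
sides = [0, 0]
try:
    two_sided_write(notifier, sides, 7)
except InjectedFailure:
    pass
```

After the injected failure sides is [7, 0], exactly the half-finished state a black box test would struggle to produce on demand.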

This isn't to say that black box testing isn't valuable. White box testing tends to be very 'development centric', as you're looking at how the code works and probing areas where you suspect there could be issues. However, while you may do a good job of proving that the specific implementation works or doesn't, it doesn't necessarily tell you if the code serves the larger goal of meeting a customer need. By contrast black box testing approaches the product more from a user/customer point of view, helping you determine if it fulfills its intended purpose. Testing from both of these angles is better than using only one or the other.

The requirement to keep the system and data in a coherent and usable state after failures is one of the things that differentiates operating system validation and application testing. In most cases applications don't have stringent demands placed on their ability to recover after an error; as long as the application will start back up, the user will be satisfied, and simply keeping a backup copy of any document modified by the application is usually sufficient to avoid any significant impact on the user. For operating systems, however, it's critical to be able to recover without having corrupted any data and without requiring major intervention on the part of the user who probably doesn't have a lot of OS expertise. Identifying windows of vulnerability and being able to probe them with some degree of precision is a difficult but necessary part of operating system validation.


OpenSolaris

Solaris ( Jun 15 2005, 06:09:13 AM PDT ) Permalink Comments [0]

20050614 Tuesday June 14, 2005

RAID 0+1 vs. RAID 1+0 and SVM

http://opensolaris.org



Six years ago when I first started working on Solaris Volume Manager's earlier incarnation (known as SDS) I was confused about whether it implemented RAID 0+1 or RAID 1+0. The answer ended up being more complicated than simply one or the other. The same implementation has carried forward into the current version of SVM. Since this question still comes up with some regularity I thought it was worth spending some time describing how this particular part of SVM works.

Background

RAID stands for 'Redundant Array of Inexpensive Disks', and the different numbers correspond to differing ways of placing data on the disks. There are two basic RAID levels that pertain to this subject in general plus an additional logical device type that's involved when you're dealing with SVM:

Since RAID0 improves performance, and RAID1 provides redundancy, someone came up with the idea to combine them. Fast and reliable. Two great tastes that taste great together!

When combining these two types of 'logical' devices there's a choice to be made -- do you mirror two stripes, or do you stripe across multiple mirrors? There are pros and cons to each approach:

SVM specifics

So, does SVM do RAID 0+1 or RAID 1+0? The answer is, "Yes." So it gives you a choice between the two? The answer is "No."

Obviously further explanation is necessary...

In SVM, mirror devices cannot be created from "bare" disks. You are required to create the mirror on top of another type of SVM metadevice, known as a concat/stripe*. SVM combines concatenations and stripes into a single metadevice type, in which one or more stripes are concatenated together. When used to build a mirror these concat/stripe logical devices are known as submirrors. If you want to expand the size of a mirror device you can do so by concatenating additional stripe(s) onto the concat/stripe devices that are serving as submirrors.
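To picture how a concatenation of stripes addresses its blocks, here is a small Python sketch. The Stripe fields, the locate function, and the sizes are all invented for illustration, and the sketch assumes each slice's size is a whole multiple of the interlace; the real SVM layout has more to it.

```python
from dataclasses import dataclass

@dataclass
class Stripe:
    components: int    # number of slices in this stripe
    blocks_each: int   # usable blocks per slice (assumed multiple of interlace)
    interlace: int     # blocks written to one slice before moving to the next

    @property
    def size(self):
        return self.components * self.blocks_each

def locate(stripes, logical_block):
    """Map a logical block to (stripe index, component index, block on component)."""
    for index, stripe in enumerate(stripes):
        if logical_block < stripe.size:
            chunk, within = divmod(logical_block, stripe.interlace)
            row, component = divmod(chunk, stripe.components)
            return index, component, row * stripe.interlace + within
        logical_block -= stripe.size
    raise ValueError("block is beyond the end of the metadevice")

# A two-slice stripe with a single-slice stripe concatenated onto the end.
submirror = [Stripe(components=2, blocks_each=8, interlace=4),
             Stripe(components=1, blocks_each=8, interlace=4)]
first = locate(submirror, 5)
second = locate(submirror, 18)

# Growing the device: concatenate another stripe; existing addresses don't move.
submirror.append(Stripe(components=2, blocks_each=8, interlace=4))
still_first = locate(submirror, 5)
```

Because new stripes are only ever added at the end, the mapping for existing blocks is unchanged when the device grows, which is what makes expanding a submirror by concatenation straightforward.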

So, in SVM, you are always required to set up a stripe (concat/stripe) in order to create a mirror. On the surface this makes it appear that SVM does RAID 0+1. However, once you understand a bit about the SVM mirror code, you'll find RAID 1+0 lurking under the covers.

SVM mirrors are logically divided up into regions. The state of each mirror region is recorded in state database replicas** stored on disk. By individually recording the state of each region in the mirror, SVM can be smart about how it performs a resync. Following a disk failure or an unusual event (e.g. a power failure occurs after the first side of a mirror has been written to but before the matching write to the second side can be accomplished), SVM can determine which regions are out-of-sync and only synchronize them, not the entire mirror. This is known as an optimized resync.
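A toy model of region tracking, in Python, shows why this saves work. The Mirror class, the region size, and the in-memory dirty list (standing in for the region state kept in the replicas) are assumptions for illustration; in particular, clearing the flag immediately after the second write is a simplification.

```python
class Mirror:
    def __init__(self, nblocks, region_size):
        self.region_size = region_size
        self.sides = [[0] * nblocks, [0] * nblocks]
        nregions = -(-nblocks // region_size)   # ceiling division
        self.dirty = [False] * nregions         # stands in for on-disk region state

    def write(self, block, value, fail_before_second_side=False):
        region = block // self.region_size
        self.dirty[region] = True               # record intent before writing
        self.sides[0][block] = value
        if fail_before_second_side:
            return                              # simulated power failure
        self.sides[1][block] = value
        self.dirty[region] = False

    def optimized_resync(self):
        """Copy only the out-of-sync regions; return how many blocks were copied."""
        copied = 0
        for region, is_dirty in enumerate(self.dirty):
            if not is_dirty:
                continue
            start = region * self.region_size
            end = min(start + self.region_size, len(self.sides[0]))
            self.sides[1][start:end] = self.sides[0][start:end]
            copied += end - start
            self.dirty[region] = False
        return copied

mirror = Mirror(nblocks=100, region_size=10)
mirror.write(3, 1)                                   # completes on both sides
mirror.write(42, 9, fail_before_second_side=True)    # interrupted mid-write
copied = mirror.optimized_resync()
```

Only the one region containing block 42 is copied (10 blocks out of 100), after which both sides of the mirror match again.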

The optimized resync mechanisms allow SVM to gain the redundancy benefits of RAID 1+0 while keeping the administrative benefits of RAID 0+1. If one of the drives in a concat/stripe device fails, only those mirror regions that correspond to data stored on the failed drive will lose redundancy. The SVM mirror code understands the layout of the concat/stripe submirrors and can therefore determine which resync regions reside on which underlying devices. For all regions of the mirror not affected by the failure, SVM will continue to provide redundancy, so a second disk failure won't necessarily prove fatal.

So, in a nutshell, SVM provides a RAID 0+1 style administrative interface but effectively implements RAID 1+0 functionality. Administrators get the best of each type, the relatively simple administration of RAID 0+1 plus the greater resilience of RAID 1+0 in the case of multiple device failures.
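The difference in resilience is easy to check by brute force. This Python sketch uses a hypothetical four-disk setup (the disk names and pairings are mine) and counts which two-disk failures each classic layout survives; it models textbook RAID 0+1 and RAID 1+0, not SVM's region-level behavior.

```python
from itertools import combinations

disks = ["a0", "a1", "b0", "b1"]

def raid01_survives(failed):
    """Mirror of two stripes (a0+a1, b0+b1): one whole stripe must be intact."""
    stripes = [{"a0", "a1"}, {"b0", "b1"}]
    return any(not (stripe & failed) for stripe in stripes)

def raid10_survives(failed):
    """Stripe across two mirrors (a0|b0, a1|b1): each mirror needs one survivor."""
    mirrors = [{"a0", "b0"}, {"a1", "b1"}]
    return all(mirror - failed for mirror in mirrors)

# Try every possible pair of failed disks against both layouts.
pairs = [set(pair) for pair in combinations(disks, 2)]
raid01_ok = sum(raid01_survives(pair) for pair in pairs)
raid10_ok = sum(raid10_survives(pair) for pair in pairs)
```

Of the six possible double failures, the RAID 0+1 layout survives two and the RAID 1+0 layout survives four.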


* concat/stripe logical devices (metadevices)

The following example shows a concat/stripe metadevice that's serving as a submirror to a mirror metadevice. Note that the metadevice is a concatenation of three separate stripes:

** State database replicas

SVM stores configuration and state information in a 'state database' in memory. Copies of this state database are stored on disk, where they are referred to as state database replicas. The primary purpose of the state database replicas is to provide non-volatile copies of the state database so that the SVM configuration is persistent across reboots. A secondary purpose of the replicas is to provide a 'scratch pad' to keep track of mirror region states.


OpenSolaris

Solaris ( Jun 14 2005, 08:10:53 AM PDT ) Permalink Comments [7]

20050608 Wednesday June 08, 2005

The beatings will continue until the blogging improves

Well, it's come to this. Blogging is all the rage at Sun right now and there's a lot of pressure to blog. I've held out up until now, but have decided to give in to the peer and management pressure. The fact that I'm choosing to start blogging right before APR (annual performance review) time is of course purely coincidental. :-) I do get to count this as an accomplishment, right?

Presumably an introduction is in order, on the off chance that somebody actually ends up reading this...

I've been with Sun as a test/QA engineer for nearly a decade now. I started out in the Computer Systems division qualifying SCSI HBAs. With the creation of Network Storage my group got sucked into the new division, but I soon returned to Computer Systems as a lead in the software QA group for desktops. After a while I moved to the Solaris division to work on Solaris Volume Manager and I've been there for the past six years.

I expect most of my ramblings will be with regards to test engineering at Sun, how we try to help ship quality products in the extremely short time frames that this industry forces upon you, and general observations on my experiences at Sun compared to other places I've worked. Hopefully I'll spare you tales of my family life, my poor taste in movies and music, etc. However, as long as the pressure to blog continues, I make no guarantees as to the type of content I'll post. Gotta be ready with everything and anything in case they impose some sort of word count quota.

Getting involved with blogging at Sun is kind of strange. You hear tons about it, there's a lot of interest in having employees do it, and there are both overt and subtle suggestions to get a blog going. However, when you do sign up for a blogging account, you're presented with a bunch of warnings. Basically, you're given all sorts of reasons why screwing up could cost Sun money and possibly cost you your job and you're asked, "Are you sure you want to do this?" The corporate culture seems to be heading towards "Blog or die!" while the legalese is suggesting "Blog and die!"

Seems like a good time to go read Catch-22... ( Jun 08 2005, 06:44:55 AM PDT ) Permalink Comments [1]
