Storage Developers Conference '06
As the chair of the SNIA Technical Council, I had the pleasure of introducing most of the speakers at this week's Storage Developers Conference in San Jose. More than 300 people showed up for the event, which included plug-fests on iSCSI and CIFS as well as great talks on all sorts of storage topics for developers.
My favorite talk (I'm slightly biased) was the four-hour tutorial on ZFS given by Jeff Bonwick and Bill Moore. They covered nearly all of the features of ZFS and dug into the design decisions and tradeoffs the developers made along the way. They also told some interesting anecdotes that punched up the entertainment value and illustrated some important concepts. One was about a firmware bug in an adapter that would periodically trash one side or the other of a set of mirrored disks. The bug was fixed by upgrading the adapter firmware, but in the meantime it initiated a new ZFS developer into the many possible sources of data corruption. Of course, ZFS quite happily corrected the data and logged the checksum errors it was seeing all the while.
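To make that self-healing behavior concrete, here is a toy sketch in Python of a mirror that verifies a checksum on every read and repairs a bad copy from a good one. It is only an illustration of the idea, not how ZFS is implemented: the `Mirror` class, its method names, the use of SHA-256, and the in-memory dictionaries standing in for disks are all my own assumptions.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum of a block (SHA-256 here purely for illustration)."""
    return hashlib.sha256(data).hexdigest()


class Mirror:
    """Hypothetical N-way mirror; checksums are stored apart from the data."""

    def __init__(self, sides: int = 2):
        self.sides = [dict() for _ in range(sides)]  # one dict per "disk"
        self.checksums = {}                          # block_id -> expected checksum
        self.error_log = []                          # (block_id, side) per repair

    def write(self, block_id: str, data: bytes) -> None:
        self.checksums[block_id] = checksum(data)
        for side in self.sides:
            side[block_id] = data

    def read(self, block_id: str) -> bytes:
        expected = self.checksums[block_id]
        failed = []
        for i, side in enumerate(self.sides):
            data = side[block_id]
            if checksum(data) == expected:
                # Found a good copy: rewrite every side that failed before it
                # and log each checksum error.
                for j in failed:
                    self.sides[j][block_id] = data
                    self.error_log.append((block_id, j))
                return data
            failed.append(i)
        raise IOError(f"block {block_id!r}: no mirror side matches its checksum")


mirror = Mirror()
mirror.write("blk0", b"important data")
mirror.sides[0]["blk0"] = b"trashed by firmware"  # simulate the adapter bug
recovered = mirror.read("blk0")
```

In this sketch the read returns the original data, side 0 is rewritten from side 1, and the repair is recorded in `error_log`. Note that it only repairs sides it tried before finding a good copy.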
Another great talk was given by Paul Strong, formerly of Sun and now working at eBay. eBay is a big user of grids: they have 4000 servers arranged in a grid just to handle the item searches their customers run every day. To handle the scale and reduce complexity, they settle on a fixed configuration of storage to servers and then clone their way to scalability. Of course, this introduces its own manageability problems. Paul says that the scale of their systems breaks nearly every commercial tool they try and test, so he is off designing his own Grid Management Console to solve the problem (sounds like fun). It has an AJAX-driven user interface. They wrap a bunch of APIs in order to disaggregate the existing tools, and a new tool must replace 5 or 6 other tools before it is considered an improvement.
Speaking of scalability, we also had a talk from Joshua Redstone on Google's GFS. GFS runs in user space on a custom Linux kernel, but it completely abstracts away the inherent reliability problems that arise from using cheap, non-redundant systems with direct-attached storage. Joshua showed some great pictures of the early Google data centers and how they have improved over time. They have upwards of 1000 machines in each filesystem cluster and pools of 1000 or more clients using that cluster to perform services. Their filesystems are more than a petabyte each, with a read/write load greater than 10 GB/sec. GFS itself reminds me of a SAN filesystem (like Sun's QFS), but the network is plain Ethernet, with what they call chunkservers providing the data on demand. The client retrieves the metadata from a (replicated) GFS Master and then accesses the data directly from the chunkservers. GFS handles fault tolerance and data replication as well as load balancing. They are working on a distributed GFS Master so that it does not become a bottleneck.
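The read path described above (metadata from the master, data straight from the chunkservers) can be sketched roughly as follows. This is a toy model based only on that description, not Google's API: the class names, the `lookup` and `read` methods, the tiny `CHUNK_SIZE`, and the choice to always read from the first replica are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 8  # toy value so the example is easy to follow; real chunks are far larger


@dataclass
class ChunkServer:
    """Hypothetical chunkserver: holds chunk data keyed by chunk handle."""
    chunks: dict = field(default_factory=dict)

    def read(self, handle: str, offset: int, length: int) -> bytes:
        return self.chunks[handle][offset:offset + length]


@dataclass
class Master:
    """Hypothetical master: metadata only, never file data."""
    # (path, chunk_index) -> (chunk_handle, [names of servers holding a replica])
    table: dict = field(default_factory=dict)

    def lookup(self, path: str, chunk_index: int):
        return self.table[(path, chunk_index)]


class Client:
    def __init__(self, master: Master, servers: dict):
        self.master = master
        self.servers = servers  # server name -> ChunkServer

    def read(self, path: str, offset: int, length: int) -> bytes:
        out = bytearray()
        while length > 0:
            index, within = divmod(offset, CHUNK_SIZE)
            # Step 1: ask the master where the chunk lives (metadata only).
            handle, replicas = self.master.lookup(path, index)
            # Step 2: fetch the bytes directly from a chunkserver.
            n = min(length, CHUNK_SIZE - within)
            data = self.servers[replicas[0]].read(handle, within, n)
            if not data:
                break  # read past the end of the stored data
            out += data
            offset += len(data)
            length -= len(data)
        return bytes(out)


servers = {
    "cs1": ChunkServer({"h0": b"abcdefgh"}),
    "cs2": ChunkServer({"h1": b"ijklmnop"}),
}
master = Master({("/f", 0): ("h0", ["cs1"]), ("/f", 1): ("h1", ["cs2"])})
client = Client(master, servers)
result = client.read("/f", 6, 4)  # spans the boundary between the two chunks
```

The point of the split is that the master only answers small metadata lookups, so the bulk data traffic is spread across the chunkservers.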
There were many other good talks as well, including Marcellus Tabor describing Yahoo's design of their Storage Resource Manager for managing the large number of NAS servers in their environment. The slides from the talks are available to conference attendees now and will be made available to the public after a few months. The SNIA Technical Council and staff are already starting to plan next year's Storage Developers Conference (likely in September again).

