inside the sausage factory Adam Leventhal's Weblog
Adam Leventhal, Fishworks engineer

Monday Dec 21, 2009

When I first wrote about triple-parity RAID in ZFS and the Sun Storage 7000 series, I alluded to a looming requirement for triple-parity RAID due to a growing disparity between disk capacity and throughput. I've written an article in ACM Queue examining this phenomenon in detail, and making the case for triple-parity RAID. Dominic Kay helped me sift through hard drive data for the past ten years to build a model for how long it takes to fully populate a drive. I've reproduced a graph here from the paper that displays the timing data for a few common drive types; the trends are quite clear.

The time to populate a drive is directly relevant for RAID rebuild. As disks in RAID systems take longer to reconstruct, the reliability of the total system decreases due to increased periods running in a degraded state. Today that can be four hours or longer; that could easily grow to days or weeks. RAID-6 grew out of a need for a system more reliable than what RAID-5 could offer. We are approaching a time when RAID-6 is no more reliable than RAID-5 once was. At that point, we will again need to refresh the reliability of RAID, and RAID-7, triple-parity RAID, will become the new standard.

Triple-Parity RAID and Beyond
ADAM LEVENTHAL, SUN MICROSYSTEMS
As hard-drive capacities continue to outpace their throughput, the time has come for a new level of RAID. How much longer will current RAID techniques persevere? The RAID levels were codified in the late 1980s; double-parity RAID, known as RAID-6, is the current standard for high-availability, space-efficient storage. The incredible growth of hard-drive capacities, however, could impose serious limitations on the reliability even of RAID-6 systems. Recent trends in hard drives show that triple-parity RAID must soon become pervasive. In 2005, Scientific American reported on Kryder's law, which predicts that hard-drive density will double annually. While the rate of doubling has not quite maintained that pace, it has been close.

Problematically for RAID, hard-disk throughput has failed to match that exponential rate of growth. Today repairing a high-density disk drive in a RAID group can easily take more than four hours, and the problem is getting significantly more pronounced as hard-drive capacities continue to outpace their throughput. As the time required for rebuilding a disk increases, so does the likelihood of data loss. The ability of hard-drive vendors to maintain reliability while pushing to higher capacities has already been called into question in this magazine. Perhaps even more ominously, in a few years, reconstruction will take so long as to effectively strip away a level of redundancy. What follows is an examination of RAID, the rate of capacity growth in the hard-drive industry, and the need for triple-parity RAID as a response to diminishing reliability.

[...]

Wednesday Dec 09, 2009

The Hybrid Storage Pool integrates flash into the storage hierarchy in two specific ways: as a massive read cache and as fast log devices. For read cache devices, Readzillas, there's no need for redundant configurations; it's a clean cache so the data necessarily also resides on disk. For log devices, Logzillas, redundancy is essential, but how that translates to their configuration can be complicated. How to decide whether to stripe or mirror?

ZFS intent log devices

Logzillas are used as ZFS intent log devices (slogs in ZFS jargon). For certain synchronous write operations, data is written to the Logzilla so the operation can be acknowledged to the client quickly before the data is later streamed out to disk. Rather than the milliseconds of latency for disks, Logzillas respond in about 100μs. If there's a power failure or system crash before the data can be written to disk, the log will be replayed when the system comes back up; this is the only scenario in which Logzillas are read. Under normal operation they are effectively write-only. Unlike Readzillas, Logzillas are integral to data integrity: they are relied upon to preserve acknowledged writes in the case of a system failure.

A common misconception is that a non-redundant Logzilla configuration introduces a single point of failure into the system. This is not the case, since the data contained on the log devices is also held in system memory. Though that memory is indeed volatile, data loss could only occur if both the Logzilla failed and the system failed within a fairly small time window.
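
The argument can be made concrete with a toy model. This is only a sketch: the class and function names (ToyPool, surviving_writes) are hypothetical and are not ZFS interfaces. It simply tracks where an acknowledged but not-yet-flushed write lives, and shows that the write is lost only when the log device and the system fail together.

```python
class ToyPool:
    """Toy model of a pool with one non-redundant log device (not ZFS code)."""

    def __init__(self):
        self.memory = []   # acknowledged writes still held in DRAM
        self.log = []      # the same writes on the log device
        self.disk = []     # writes that have been streamed out to disk

    def sync_write(self, record):
        # A synchronous write is acknowledged once it is in memory and on the log.
        self.memory.append(record)
        self.log.append(record)

    def flush(self):
        # Pending writes are streamed to disk; the log entries are no longer needed.
        self.disk.extend(self.memory)
        self.memory.clear()
        self.log.clear()

    def fail_log_device(self):
        # The log contents are gone, but memory still holds the pending writes.
        self.log.clear()

    def crash_and_reboot(self):
        # Memory is volatile; on reboot whatever is on the log is replayed.
        self.memory.clear()
        self.disk.extend(self.log)
        self.log.clear()
        return list(self.disk)


def surviving_writes(log_device_fails, system_crashes):
    pool = ToyPool()
    pool.sync_write("a")
    pool.flush()
    pool.sync_write("b")          # acknowledged, not yet on disk
    if log_device_fails:
        pool.fail_log_device()
    if system_crashes:
        return pool.crash_and_reboot()
    pool.flush()
    return list(pool.disk)


for log_fails in (False, True):
    for crashes in (False, True):
        print(log_fails, crashes, surviving_writes(log_fails, crashes))
```

Only the case where both arguments are True loses write "b"; in the other three cases either memory or the log still holds it.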

Logzilla configuration

While a Logzilla doesn't represent a single point of failure, redundant configurations are still desirable in many situations. The Sun Storage 7000 series implements the Hybrid Storage Pool, and offers several different redundant disk configurations. Some of those configurations add a single level of redundancy: mirroring and single-parity RAID. Others provide additional redundancy: triple-mirroring, double-parity RAID and triple-parity RAID. For disk configurations that provide double disk redundancy or better, the best practice is to mirror Logzillas to achieve a similar level of reliability. For singly redundant disk configurations, non-redundant Logzillas might suffice, but there are conditions such as a critically damaged JBOD that could affect both Logzilla and controller more or less simultaneously. Mirrored Logzillas add protection against such scenarios.
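
As a starting point only, the rule of thumb above can be written down as a small lookup. The profile labels and the function name here are illustrative, not appliance identifiers, and the result is a default to be weighed against the cost and data-loss considerations of a particular deployment.

```python
# Levels of disk redundancy for the profiles named in the text.
# The keys are illustrative labels, not appliance identifiers.
DISK_REDUNDANCY = {
    "mirror": 1,
    "single-parity RAID": 1,
    "triple mirror": 2,
    "double-parity RAID": 2,
    "triple-parity RAID": 3,
}


def suggested_logzilla_layout(profile):
    """Return 'mirrored' or 'striped' as a default, per the rule of thumb."""
    return "mirrored" if DISK_REDUNDANCY[profile] >= 2 else "striped"


for name in DISK_REDUNDANCY:
    print(name, "->", suggested_logzilla_layout(name))
```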

Note that the Logzilla configuration screen (pictured) includes a column for No Single Point of Failure (NSPF). Logzillas are never truly a single point of failure as previously discussed; instead, this column refers to the arrangement of Logzillas in JBODs. A value of true indicates that the configuration is resilient against JBOD failure.

The most important factors to consider when deciding between mirrored or striped Logzillas are the consequences of potential data loss. In a failure of Logzillas and controller, data will not be corrupted, but the last 5-30 seconds worth of transactions could be lost. For example, while it typically makes sense to mirror Logzillas for triple-parity RAID configurations, it may be that the data stored is less important and the implications for data loss not worthy of the cost of another Logzilla device. Conversely, while a mirrored or single-parity RAID disk configuration provides only a single level of redundancy, the implications of data loss might be such that the redundancy of volatile system memory is insufficient. Just as it's important to choose the appropriate disk configuration for the right balance of performance, capacity, and reliability, it's at least as important to take care and gather data to make an informed decision about Logzilla configurations.

Wednesday Sep 16, 2009

Today we shipped our 2009.Q3 release. Amidst the many great new features, enhancements and bug fixes, we've added new storage profiles for triple-parity RAID and three-way mirroring. Here's an example on a 9 JBOD system of what you'll see in the updated storage configuration screen:



Note that the new Triple parity RAID, wide stripes option replaces the old Double parity RAID, wide stripes configuration. With RAID stripes that can easily be more than 40 disks wide, and resilver times that can be quite long as a result, we decided that the additional protection of triple-parity RAID trumped the very small space-efficiency advantage of double-parity RAID.

Ryan Matthews has updated the space calculator for the 7310 and 7410 to include the new profiles. Download the new update and give it a shot.

Wednesday Aug 12, 2009

At the Flash Memory Summit today, Sun's own Michael Cornwell delivered a keynote excoriating the overall direction of NAND flash and SSDs. In particular, he spoke of the "lithography death march" as NAND vendors push to deliver the most cost-efficient solution while making huge sacrifices in reliability and performance.

On Wednesday, August 12, I'll be giving two short talks as part of sessions on flash-enabled power savings and data center applications:

In the evening from 7:30 to 9:00, I'll be hosting a table discussion of software as it pertains to flash. We'll be talking about uses of flash such as the Hybrid Storage Pool, how software can enable the use of MLC flash in the enterprise, the role of the flash translation layer, and anything else that comes up.

Tuesday Jul 21, 2009

Double-parity RAID, or RAID-6, is the de facto industry standard for storage; when I started talking about triple-parity RAID for ZFS earlier this year, the need wasn't always immediately obvious. Double-parity RAID, of course, provides protection from up to two failures (data corruption or the whole drive) within a RAID stripe. The necessity of triple-parity RAID arises from the observation that while hard drive capacity has roughly followed Kryder's law, doubling annually, hard drive throughput has improved far more modestly. Accordingly, the time to populate a replacement drive in a RAID stripe is increasing rapidly. Today, a 1TB SAS drive takes about 4 hours to fill at its theoretical peak throughput; in a real-world environment that number can easily double, and 2TB and 3TB drives expected this year and next won't move data much faster. Those long periods spent in a degraded state increase the exposure to the bit errors and other drive failures that would in turn lead to data loss. The industry moved to double-parity RAID because one parity disk was insufficient; longer resilver times mean that we're spending more and more time back at single-parity. From that it's clear that double-parity will soon become insufficient. (I'm working on an article that examines these phenomena quantitatively so stay tuned... update Dec 21, 2009: you can find the article here)
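
A rough back-of-the-envelope sketch of that arithmetic follows. The only input taken from the text is "1TB in about 4 hours"; the decimal units and the assumption that throughput stays flat for the 2TB and 3TB drives are simplifications for illustration, not measured data.

```python
def hours_to_fill(capacity_bytes, throughput_bytes_per_sec):
    """Time to write an entire drive at a constant rate, in hours."""
    return capacity_bytes / throughput_bytes_per_sec / 3600


TB = 10**12  # decimal terabyte, as drive vendors count
MB = 10**6

# Throughput implied by "1TB in about 4 hours".
implied = TB / (4 * 3600)
print(round(implied / MB, 1), "MB/s")   # 69.4 MB/s

# Assumption: throughput stays at that rate, so fill time scales with capacity.
for capacity_tb in (1, 2, 3):
    hours = hours_to_fill(capacity_tb * TB, implied)
    print(capacity_tb, "TB:", round(hours, 1), "hours")
```

Doubling those figures for real-world conditions, as suggested above, puts a 3TB drive at roughly a full day in a degraded state.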

Last week I integrated triple-parity RAID into ZFS. You can take a look at the implementation and the details of the algorithm here, but rather than describing the specifics, I wanted to describe its genesis. For double-parity RAID-Z, we drew on the work of Peter Anvin which was also the basis of RAID-6 in Linux. This work was more or less a tutorial for systems programmers, simplifying some of the more subtle underlying mathematics with an eye towards optimization. While a systems programmer by trade, I have a background in mathematics so was interested to understand the foundational work. James S. Plank's paper A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems describes a technique for generalized N+M RAID. Not only was it simple to implement, but it could easily be made to perform well. I struggled for far too long trying to make the code work before discovering trivial flaws with the math itself. A bit more digging revealed that the author himself had published Note: Correction to the 1997 Tutorial on Reed-Solomon Coding 8 years later addressing those same flaws.

Predictably, the mathematically accurate version was far harder to optimize, stifling my enthusiasm for the generalized case. My more serious concern was that the double-parity RAID-Z code suffered some similar systemic flaw. This fear was quickly assuaged as I verified that the RAID-6 algorithm was sound. Further, from this investigation I was able to find a related method for doing triple-parity RAID-Z that was nearly as simple as its double-parity cousin. The math is a bit dense, but the key observation was that given that 3 is the smallest factor of 255 (the largest value representable by an unsigned byte) it was possible to find exactly 3 different seed or generator values after which there were collections of failures that formed uncorrectable singularities. Using that technique I was able to implement a triple-parity RAID-Z scheme that performed nearly as well as the double-parity version.
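
To give a feel for the arithmetic involved, here is a minimal sketch of three parity bytes computed over GF(2^8). The reducing polynomial 0x11d and the generator values 1, 2 and 4 are assumptions made for this sketch, as are all the function names; the real implementation may index, order and optimize things differently. Only single-column recovery is shown; recovering two or three lost columns means solving a small linear system over the same field.

```python
def gf_mul(a, b, poly=0x11d):
    """Multiply two bytes in GF(2^8): carry-less multiply, reduced by poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_pow(g, n):
    """Raise g to the n-th power in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, g)
    return result


def gf_inv(a):
    """Inverse of a nonzero byte: a^254, because a^255 == 1."""
    return gf_pow(a, 254)


def parity(data, g):
    """Parity byte: XOR of g^i * data[i]; g == 1 gives plain XOR parity."""
    p = 0
    for i, d in enumerate(data):
        p ^= gf_mul(gf_pow(g, i), d)
    return p


def recover_one(data, lost, g, p):
    """Recover data[lost] from the other bytes and the parity built with g."""
    partial = 0
    for i, d in enumerate(data):
        if i != lost:
            partial ^= gf_mul(gf_pow(g, i), d)
    return gf_mul(p ^ partial, gf_inv(gf_pow(g, lost)))


stripe = [0x12, 0x34, 0x56, 0x78, 0x9a]   # one byte from each data disk
P, Q, R = (parity(stripe, g) for g in (1, 2, 4))

# Any one of the three parity bytes is enough to rebuild a single lost byte.
print(recover_one(stripe, 3, 2, Q) == stripe[3])
```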

As far as generic N-way RAID-Z goes, it's still something I'd like to add to ZFS. Triple-parity will suffice for quite a while, but we may want more parity sooner for a variety of reasons. Plank's revised algorithm is an excellent start. The test will be if it can be made to perform well enough or if some new clever algorithm will need to be devised. Now, as for what to call these additional RAID levels, I'm not sure. RAID-7 or RAID-8 seem a bit ridiculous and RAID-TP and RAID-QP aren't any better. Fortunately, in ZFS triple-parity RAID is just raidz3.

A little over three years ago, I integrated double-parity RAID-Z into ZFS, a feature expected of enterprise class storage. This was in the early days of Fishworks when much of our focus was on addressing functional gaps. The move to triple-parity RAID-Z comes in the wake of a number of our unique advancements to the state of the art such as DTrace-powered Analytics and the Hybrid Storage Pool as the Sun Storage 7000 series products meet and exceed the standards set by the industry. Triple-parity RAID-Z will, of course, be a feature included in the next major software update for the 7000 series (2009.Q3).

Wednesday May 27, 2009

Today we're introducing a new member to the Sun Unified Storage family: the Sun Storage 7310. The 7310 is a scalable system from 12TB with a single half-populated J4400 JBOD up to 96TB with 4 JBODs. You can combine two 7310 head units to form a cluster. The base configuration includes a single quad-core CPU, 16GB of DRAM, a SAS HBA, and two available PCIe slots for NICs, backup cards, or the Fishworks cluster card. The 7310 can be thought of as a smaller capacity, lower cost version of the Sun Storage 7410. Like the 7410 it uses high density, low power disks as primary storage and can be enhanced with Readzilla and Logzilla flash accelerators for high performance. Like all the 7000 series products, the 7310 includes all protocols and software features without license fees.

The 7310 is an entry-level clusterable, scalable storage server, but the performance is hardly entry-level. Brendan Gregg from the Fishworks team has detailed the performance of the 7410, and has published the results of those tests on the new 7310. Our key metrics are cached reads from DRAM, uncached reads from disk, and writes to disk all over two 10GbE links with 20 client systems. As shown in the graph, the 7310 is an absolute champ, punching well above its weight. The numbers listed are in units of MB/s. Notice that the recent 2009.Q2 software update brought significant performance improvements to the 7410, and that the 7310 holds its own. For owners of entry-level systems from other vendors, check for yourself, but the 7310 is a fire-breather.

Added to the low-end 7110, the dense, expandable 7210, the high-end clusterable, expandable 7410, the 7310 fills an important role in the 7000 series product line: an entry-level clusterable, expandable system, with impressive performance, and an attractive price. If the specs and performance have piqued your interest, try out the user interface on the 7000 series with the Sun Storage 7000 simulator.

Tuesday May 26, 2009

As flash memory has become more and more prevalent in storage from the consumer to the enterprise, people have been charmed by the performance characteristics, but get stuck on the longevity. SSDs based on SLC flash are typically rated at 100,000 to 1,000,000 write/erase cycles while MLC-based SSDs are rated for significantly less. For conventional hard drives, the distinct yet similar increase in failures over time has long been solved by mirroring (or other redundancy techniques). When applying this same solution to SSDs, a common concern is that two identical SSDs with identical firmware storing identical data would run out of write/erase cycles for a given cell at the same moment and thus data reliability would not be increased via mirroring. While the logic might seem reasonable, permit me to dispel that specious argument.

The operating system and filesystem

From the level of most operating systems or filesystems, an SSD appears like a conventional hard drive and is treated more or less identically (Solaris' ZFS being a notable exception). As with hard drives, SSDs can report predicted failures through SMART. For reasons described below, SSDs already keep track of the wear of cells, but one could imagine even the most trivial SSD firmware keeping track of the rapidly approaching write/erase cycle limit and notifying the OS or FS via SMART, which would in turn notify the user. Well in advance of actual data loss, the user would have an opportunity to replace either or both sides of the mirror as needed.

SSD firmware

Proceeding down the stack to the level of the SSD firmware, there are two relevant features to understand: wear-leveling, and excess capacity. There is not a static mapping between the virtual offset of an I/O to an SSD and the physical flash cells that are chosen by the firmware to record the data. For a variety of reasons (flash cell early mortality, write performance, bad cell remapping) it is necessary for the SSD firmware to remap data all over its physical flash cells. In fact, hard drives have a similar mechanism by which they hold sectors in reserve and remap them to fill in for defective sectors. SSDs have the added twist that they want to maximize the longevity of their cells, each of which will ultimately decay over time. To do this, the firmware ensures that a given cell isn't written far more frequently than any other cell, a process called wear-leveling for obvious reasons.

To summarize, subsequent writes to the same LBA, the same virtual location, on an SSD could land on different physical cells for the several reasons listed. The firmware is, more often than not, deterministic; thus two identical SSDs with the exact same physical media and I/O stream (as in a mirror) would behave identically. In practice, though, minor timing variations in the commands from operating software, and differences in the media (described below), ensure that the identical SSDs will behave differently. As time passes, those differences are magnified such that two SSDs that started with the same mapping between virtual offsets and physical media will quickly and completely diverge.

Flash hardware and physics

Identical SSDs with identical firmware still have their own physical flash memory, which can vary in quality. To break the problem apart a bit, an SSD is composed of many cells, and each cell's ability to retain data slowly degrades as it's exercised. Each cell is in fact a physical component of an integrated circuit. Flash memory differs from many other integrated circuits in that it requires far higher voltages. It is this high voltage that causes the oxide layer to gradually degrade over time. Further, all cells are not created equal: microscopic variations in the thickness and consistency of the physical medium can make some cells more resilient and others less; some cells might be DOA, while others might last significantly longer than the norm. By analogy, if you install new light bulbs in a fixture, they might burn out in the same month, but how often do they fail on the same day? The variability of flash cells impacts the firmware's management of the underlying cells, but more trivially it means that two SSDs in a mirror would experience data loss in corresponding regions at different rates.
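
A toy simulation illustrates the point. Everything here is invented for illustration: the four-cell "SSDs", the endurance numbers, and the least-worn-cell policy are stand-ins, not how any real firmware works. Both devices receive the same stream of writes and run the same deterministic wear-leveling, yet they lose their first cell at different times and in different places because their cells differ.

```python
def writes_until_first_failure(endurance):
    """Count writes until some cell wears out, under simple wear-leveling.

    endurance[i] is the number of write/erase cycles cell i can take.
    Each write goes to the least-worn cell (ties go to the lowest index).
    Returns (number of writes, index of the cell that wore out).
    """
    wear = [0] * len(endurance)
    writes = 0
    while True:
        cell = wear.index(min(wear))
        wear[cell] += 1
        writes += 1
        if wear[cell] >= endurance[cell]:
            return writes, cell


# Two "identical" devices whose cells vary slightly in quality (made-up values).
ssd_a = [1000, 950, 1100, 1020]
ssd_b = [1000, 1050, 980, 1010]

print(writes_until_first_failure(ssd_a))   # (3798, 1)
print(writes_until_first_failure(ssd_b))   # (3919, 2)
```

Feeding the same endurance list to both calls gives identical results, which is the deterministic-firmware case; it is the physical variation that breaks the symmetry.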

Wrapping up

As with conventional hard drives, mirroring SSDs is a good idea to preserve data integrity. The operating system, filesystem, SSD firmware, and physical properties of the flash medium make this approach sound both in theory and in practice. Flash is a new exciting technology and changes many of the assumptions derived from decades of experience with hard drives. As always proceed with care — especially when your data is at stake — but get the facts, and in this case the wisdom of conventional hard drives still applies.

Monday May 04, 2009

On the heels of the 2009.Q2.0.0 release, we've posted an update to the Sun Storage 7000 simulator. The simulator contains the exact same software as the other members of the 7000 series, but runs inside a VM rather than on actual hardware. It supports all the same features, and has all the same UI components; just remember that an actual 7000 series appliance is going to perform significantly better than a VM running a puny laptop CPU. Download the simulator here.

The new version of the simulator contains two enhancements. First, it comes with the 2009.Q2.0.0 release pre-installed. The Q2 release is the first to provide full support for the simulator, and as I wrote here you can simply upgrade your old simulator. In addition, while the original release of the simulator could only be run on VMware we now support both VMware and VirtualBox (version 2.2.2 or later). When we first launched the 7000 series back in November, we intended to support the simulator on VirtualBox, but a couple of issues thwarted us, in particular lack of OVF support and host-only networking. The recent 2.2.2 release of VirtualBox brought those missing features, so we're pleased to be able to support both virtualization platforms.

As OVF support is new in VirtualBox, here's a quick installation guide for the simulator. After uncompressing the SunStorageVBox.zip archive, select "Import Appliance...", and select "Sun Storage VirtualBox.ovf". Clicking through will bring up a progress bar. Be warned: this can take a while depending on the speed of your CPU and hard drive.

When that completes, you will see the "Sun Storage VirtualBox" VM in the VirtualBox UI. You may need to adjust settings such as the amount of allocated memory, or extended CPU features. Run the VM and follow the instructions when it boots up. You'll be prompted for some simple network information. If you're unsure how to fill in some of the fields, here are some pointers:

  • Host Name - whatever you want
  • DNS Domain - "localdomain"
  • Default Router - the same as the IP address but put 1 as the final octet
  • DNS Server - the same as the IP address but put 1 as the final octet
  • Password - whatever you want and something you can remember

When you complete that form, wait until you're given a URL to copy into a web browser. Note that you'll need to use the version of the URL with the IP address (unless you've added an entry to your DNS server). In the above example, that would be: https://192.168.56.101:215/. From the web browser, complete the appliance configuration, and then you can start serving up data, observing activity with Storage Analytics, and kicking the tires on a functional replica of a 7000 series appliance.
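
If it helps, the pointers above for the router and DNS fields amount to a one-line transformation of the appliance's IP address. The helper name below is made up for this sketch, and it assumes an IPv4 address on a host-only network of the kind described.

```python
import ipaddress


def final_octet_one(ip):
    """Return the IPv4 address ip with its final octet replaced by 1."""
    addr = ipaddress.IPv4Address(ip)   # raises ValueError if ip is not valid IPv4
    octets = str(addr).split(".")
    octets[-1] = "1"
    return ".".join(octets)


appliance_ip = "192.168.56.101"
print("Default Router:", final_octet_one(appliance_ip))   # 192.168.56.1
print("DNS Server:", final_octet_one(appliance_ip))       # 192.168.56.1
print("URL: https://{}:215/".format(appliance_ip))        # https://192.168.56.101:215/
```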

Monday Apr 27, 2009

Today we released version 2009.Q2.0.0, the first major software update for the Sun Storage 7000 series. It includes a bunch of new features, bug fixes, and improvements. Significantly for users of the Sun Storage 7000 simulator, the virtual machine version of the 7000 series, this is the first update that supports the VMs. As with a physical 7000 series appliance, upgrade by navigating to Maintenance > System, and click the + icon next to Available Updates. Remember not to ungzip the update binary — the appliance will do that itself. We'll be releasing an update VM preinstalled with the new bits so stay tuned.

Note: There were actually two releases of the VMware simulator. The first one came right around our initial launch, and the version string is ak-2008.11.07. This version cannot be upgraded so you'll need to download the updated simulator whose version is ak-2008.11.21. As noted above, we'll soon be releasing an updated VM with 2009.Q2.0.0 (ak-2009.04.10.0.0) preinstalled.

Tuesday Mar 10, 2009

We're announcing a couple of new things in the flash SSD space. First, we now support the Intel X25-E SSD in a bunch of our servers. This can be used to create a Hybrid Storage Pool like in the Sun Storage 7000 series, or as just a little flash for high performance / low power / tough environmentals.

Second, we're introducing a new open standard with the Open Flash Module. This creates a new form factor for SSDs bringing flash even closer to the CPU for higher performance and tighter system integration. SSDs in HDD form factors were a reasonable idea to gain market acceptance in much the same way as you first listened to your iPod over your car stereo with that weird tape adapter. Now the iPod is a first class citizen in many cars and, with the Open Flash Module, flash has found a native interface and form factor. This is a building block that we're very excited about, and it was designed specifically for use with ZFS and the Hybrid Storage Pool. Stay tuned: these flash miniDIMMs as they're called will be showing up in some interesting places soon enough. Speaking personally, this represents an exciting collaboration of hardware and software, and it's gratifying to see Sun showing real leadership around flash through innovation.

Saturday Mar 07, 2009

Today at The First Workshop on Integrating Solid-state Memory into the Storage Hierarchy (WISH 2009) I gave a short talk about our experience integrating flash into the storage hierarchy and the interaction with SSDs. In the talk I discussed the recent history of flash SSDs as well as some key areas for future improvements. You can download it here. The workshop was terrific with some great conversations about the state of solid state storage and its future directions; thank you to the organizers and participants.

Friday Mar 06, 2009

In May of 2007 I was lined up to give my first customer presentation of what would become the Sun Storage 7000 series. I inherited a well-worn slide deck describing the product, but we had seen the reactions of prospective customers who saw the software live and had a chance to interact with features such as Analytics; no slides would elicit that kind of response. So with some tinkering, I hacked up our installer and shoe-horned the prototype software into a virtual machine. The live demonstration was a hit despite some rocky software interactions.

As the months passed, our software became increasingly aware of our hardware platforms; the patches I had used for the virtual machine version fell into disrepair. Racing toward the product launch, neither I nor anyone else in the Fishworks group had the time to nurse it back to health. I found myself using months old software for a customer demo — a useful tool, but embarrassing given the advances we had made. We knew that the VM was going to be great for presentations, and we had talked about releasing a version to the general public, but that, we thought, was something that we could sort out after the product launch.

In the brief calm after the frenetic months finishing the product and just a few days before the launch in Las Vegas, our EVP of storage, John Fowler, paid a visit to the Fishworks office. When we mentioned the VM version, his eyes lit up at the thought of how it would help storage professionals. Great news, but we realized that the next few days had just become much busier.

Creating the VM version was a total barn-raising. Rather than a one-off with sharp edges, adequate for a canned demo, we wanted to hand a product to users that would simulate exactly a Sun Storage 7000 series box. In about three days, everyone in the group pitched in to build what was essentially a brand new product and platform complete with a hardware view conjured from bits of our actual appliances.

After a frenetic weekend in November, the Sun Unified Storage Simulator was ready in time for the launch. You can download it here for VMware. We had prepared versions for VirtualBox as well as VMware, preferring VirtualBox since it's a Sun product; along the way we found some usability issues with the VirtualBox version — we were pushing both products beyond their design center and VMware handled it better. Rest assured that we're working to resolve those issues and we'll release the simulator for VirtualBox just as soon as it's ready. Note that we didn't limit the functionality at all; what you see is exactly what you'll get with an actual 7000 series box (though the 7000 series will deliver much better performance than a laptop). Analytics, replication, compression, CIFS, iSCSI are all there; give it a try and see what you think.

Monday Mar 02, 2009

In my last blog post I responded to Barry Burke, author of the Storage Anarchist blog. I was under the perhaps naive impression that Barry was an independent voice in the blogosphere. In fact, he's merely Storage Anarchist by night; by day he's the mild-mannered chief strategy officer for EMC's Symmetrix Products Group, a fact notable for its absence from Barry's blog. In my post, I observed that Barry had apparently picked his horse in the flash race and Chris Caldwell commented that "it would appear that not only has he chosen his horse, but that he's planted squarely on its back wearing an EMC jersey." Indeed.

While looking for some mention of his employment with EMC, I found this petard from Barry Burke, chief strategy officer for EMC's Symmetrix Products Group:

And [the "enterprise" differentiation] does matter – recall this video of a Fishworks JBOD suffering a 100x impact on response times just because the guy yells at a drive. You wouldn't expect that to happen with an enterprise class disk drive, and with enterprise-class drives in an enterprise-class array, it won't.

Barry, we wondered the same thing so we got some time on what you'd consider an enterprise-class disk drive in an enterprise-class array from an enterprise-class vendor. The results were nearly identical (of course, measuring latency on other enterprise-class solutions isn't nearly as easy). It turns out drives don't like being shouted at (it's shock, not the traditional RV drives compensate for). That enterprise-class rig was not an EMC Symmetrix though I'd salivate over the opportunity to shout at one.

Thursday Feb 26, 2009

Barry Burke, the Storage Anarchist, has written an interesting roundup ("don't miss the amazing vendor flash dance") covering the flash strategies of some players in the server and storage spaces. Sun's position on flash comes out a bit mangled, but Barry can certainly be forgiven for missing the mark since Sun hasn't always communicated its position well. Allow me to clarify our version of the flash dance.

Barry's conclusion that Sun sees flash as well-suited for the server isn't wrong — of course it's harder to drive high IOPS and low latency outside a single box. However we've also proven not only that we see a big role for flash in storage, but that we're innovating in that realm with the Hybrid Storage Pool (HSP) an architecture that seamlessly integrates flash into the storage hierarchy. Rather than a Ron Popeil-esque sales pitch, let me take you through the genesis of the HSP.

The HSP is something we started to develop a bit over two years ago. By January of 2007, we had identified that a ZFS intent-log device using flash would greatly improve the performance of the nascent Sun Storage 7000 series in a way that was simpler and more efficient than some other options. We started getting our first flash SSD samples in February of that year. With SSDs on the brain, we started contemplating other uses and soon came up with the idea of using flash as a secondary caching tier between the DRAM cache (the ZFS ARC) and disk. We dubbed this the L2ARC.

At that time we knew that we'd be using mostly 7200 RPM disks in the 7000 series. Our primary goal with flash was to greatly improve the performance of synchronous writes and we addressed this with the flash log device that we call Logzilla. With the L2ARC we solved the other side of the performance equation by improving read IOPS by leaps and bounds over what hard drives of any rotational speed could provide. By August of 2007, Brendan had put together the initial implementation of the L2ARC, and, combined with some early SSD samples — Readzillas — our initial enthusiasm was borne out. Yes, it's a caching tier so some workloads will do better than others, but customers have been very pleased with their results.

These two distinct uses of flash comprise the Hybrid Storage Pool. In April 2008 we gave our first public talk about the HSP at the IDF in Shanghai, and a year and a bit after Brendan's proof of concept we shipped the 7410 with Logzilla and Readzilla. It's important to note that this system achieves remarkable price/performance through its marriage of commodity disks with flash. Brendan has done a terrific job of demonstrating the performance enabled by the HSP on that system.

While we were finishing the product, the WSJ reported that EMC was starting to use flash drives in their products. I was somewhat deflated initially until it became clear that EMC's solution didn't integrate flash into the storage hierarchy nearly as seamlessly or elegantly as we had with the HSP; instead they had merely replaced their fastest, most expensive drives with faster and even more expensive SSDs. I'll disagree with the Storage Anarchist's conclusion: EMC did not start the flash revolution nor are they leading the way (though I don't doubt they are, as Barry writes, "Taking Our Passion, And Making It Happen"). EMC though has done a great service to the industry by extolling the virtues of SSDs and, presumably, to EMC customers by providing a faster tier for HSM.

In the same article, Barry alludes to some of the problems with EMC's approach using SSDs from STEC:

STEC rates their ZeusIOPS drives at something north of 50,000 read IOPS each, but as I have explained before, this is a misleading number because it’s for 512-byte blocks, read-only, without the overhead of RAID protection. A more realistic expectation is that the drives will deliver somewhere around 5-6000 4K IOPS (4K is a more typical I/O block size).

The Hybrid Storage Pool avoids the bottlenecks associated with a tier 0 approach, drives much higher IOPS, scales, and makes highly efficient economical use of the resources from flash to DRAM and disk. Further, I think we'll be able to debunk this notion that the enterprise needs its own class of flash devices by architecting commodity flash to build an enterprise solution. There are a lot of horses in this race; Barry has clearly already picked his, but the rest of you may want to survey the field.

Monday Feb 23, 2009

The organizers of the OpenSolaris Storage Summit asked me to give a presentation about Hybrid Storage Pools and ZFS. You can download the presentation titled ZFS, Cache, and Flash. In it, I talk about flash as a new caching tier in the storage hierarchy, some of the innovations in ZFS to enable the HSP, and an aside into how we implement an HSP in the Sun Storage 7410.