inside the sausage factory Adam Leventhal's Weblog
Adam Leventhal, Fishworks engineer

Monday Nov 10, 2008

The Sun Storage 7000 Series launches today, and with it Sun has the world's first complete product that seamlessly adds flash into the storage hierarchy in what we call the Hybrid Storage Pool. The HSP represents a departure from convention, and a new way of thinking designing a storage system. I've written before about the principles of the HSP, but now that it has been formally announced I can focus on the specifics of the Sun Storage 7000 Series and how it implements the HSP.

Sun Storage 7410: The Cadillac of HSPs

The best example of the HSP in the 7000 Series is the 7410. This product combines a head unit (or two for high availability) with as many as 12 J4400 JBODs. By itself, this is a pretty vanilla box: big, economical, 7200 RPM drives don't win any races, and the maximum of 128GB of DRAM is certainly a lot, but some workloads will be too big to fit in that cache. With flash, however, this box turns into quite the speed demon.

Logzilla

The write performance of 7200 RPM drive isn't terrific. The appalling thing is that the next best solution — 15K RPM drives — aren't really that much better: a factor of two or three at best. To blow the doors off, the Sun Storage 7410 allows up to four write-optimized flash drives per JBOD each of which is capable of handling 10,000 writes per second. We call this flash device Logzilla.

Logzilla is a flash-based SSD that contains a pretty big DRAM cache backed by a supercapacitor so that the cache can effectively be treated as nonvolatile. We use Logzilla as a ZFS intent log device so that synchronous writes are directed to Logzilla and clients incur only that 100μs latency. This may sound a lot like how NVRAM is used to accelerate storage devices, and it is, but there are some important advantages of Logzilla. The first is capacity: most NVRAM maxes out at 4GB. That might seem like enough, but I've talked to enough customers to realize that it really isn't and that performance cliff is an awful long way down. Logzilla is an 18GB device which is big enough to hold the necessary data while ZFS syncs it out to disk even running full tilt. The second problem with NVRAM scalability: once you've stretched your NVRAM to its limit there's not much you can do. If your system supports it (and most don't) you can add another PCI card, but those slots tend to be valuable resources for NICs and HBAs, and even then there's necessarily a pretty small number to which you could conceivably scale. Logzilla is an SSD sitting in a SAS JBOD so it's easy to plug more devices into ZFS and use them as a growing pool of intent log devices.

Readzilla

The standard practice in storage systems is to use the available DRAM as a read cache for data that is likely to be frequently accessed, and the 7000 Series does the same. In fact, it can do quite a better job of it because, unlike most storage systems which stop at 64GB of cache, the 7410 has up to 256GB of DRAM to use as a read cache. As I mentioned before, that's still not going to be enough to cache the entire working set for a lot of use cases. This is where we at Fishworks came up with the innovative solution of using flash as a massive read cache. The 7410 can accomodate up to six 100GB, read-optimized, flash SSDs; accordingly, we call this device Readzilla.

With Readzilla, a maximum 7410 configuration can have 256GB of DRAM providing sub-μs latency to cached data and 600GB worth of Readzilla servicing read requests in around 50-100μs. Forgive me for stating the obvious: that's 856GB of cache &mdash. That may not suffice to cache all workloads, but it's certainly getting there. As with Logzilla, a wonderful property of Readzilla is its scalability. You can change the number of Readzilla devices to match your workload. Further, you can choose the right combination of DRAM and Readzilla to provide the requisite service times with the appopriate cost and power use. Readzilla is cheaper and less power-hungry than DRAM so applications that don't need the blazing speed of DRAM can prefer the more economical flash cache. It's a flexible solution that can be adapted to specific needs.

Putting It All Together

We started with DRAM and 7200 RPM disks, and by adding Logzilla and Readzilla the Sun Storage 7410 also has great write and read IOPS. Further, you can design the specific system you need with just the right balance of write IOPS, read IOPS, throughput, capacity, power-use, and cost. Once you have a system, the Hybrid Storage Pool lets you solve problems with targeted solutions. Need capacity? Add disk. Out of read IOPS? Toss in another Readzilla or two. Write bogging down? Another Logzilla will net you another 10,000 write IOPS. In the old model, of course, all problems were simple because the solution was always the same: buy more fast drives. The HSP in the 7410 lets you address the specific problem you're having without paying for a solution to three other problems that you don't have.

Of course, this means that administrators need to better understand the performance limiters, and fortunately the Sun Storage 7000 Series has a great answer to that in Analytics. Pop over to Bryan's blog where he talks all about that feature of the Fishworks software stack and how to use it to find performance problems on the 7000 Series. If you want to read more details about Hybrid Storage Pools and how exactly all this works, take a look my article on the subject in CACM, as well as this post about the L2ARC (the magic behind using Readzilla) and a nice marketing pitch on HSPs.

Monday Oct 20, 2008

I've written about Hybrid Storage Pools (HSPs) here several times as well as in an article that appeared in the ACM's Queue and CACM publications. Now the folks in Sun marketing on the occasion of our joint SSD announcement with Intel have distilled that down to a four page glossy, and they've done a terrific job. I suggest taking a look.

The concept behind the HSP is a simple one: combine disk, flash, and DRAM into a single coherent and seamless data store that makes optimal use of each component and its economic niche. The mechanics of how this happens required innovation from the Fishworks and ZFS groups to integrate flash as a new tier in storage hierarchy for use in our forthcoming line of storage products. The impact of the HSP is pure economics: it delivers superior capacity and performance for a lower cost and smaller power footprint. That's the marketing pitch; if you want to wade into the details, check out the links above.

Tuesday Jul 01, 2008

As I mentioned in my previous post, I wrote an article about the hybrid storage pool (HSP); that article appears in the recently released July issue of Communications of the ACM. You can find it here. In the article, I talk about a novel way of augmenting the traditional storage stack with flash memory as a new level in the hierarchy between DRAM and disk, as well as the ways in which we've adapted ZFS and optimized it for use with flash.

So what's the impact of the HSP? Very simply, the article demonstrates that, considering the axes of cost, throughput, capacity, IOPS and power-efficiency, HSPs can match and exceed what's possible with either drives or flash alone. Further, an HSP can be built or modified to address specific goals independently. For example, it's common to use 15K RPM drives to get high IOPS; unfortunately, they're expensive, power-hungry, and offer only a modest improvement. It's possible to build an HSP that can match the necessary IOPS count at a much lower cost both in terms of the initial investment and the power and cooling costs. As another example, people are starting to consider all-flash solutions to get very high IOPS with low power consumption. Using flash as primary storage means that some capacity will be lost to redundancy. An HSP can provide the same IOPS, but use conventional disks to provide redundancy yielding a significantly lower cost.

My hope — perhaps risibly naive — is that HSPs will mean the eventual death of the 15K RPM drive. If it also puts to bed the notion of flash as general purpose mass storage, well, I'd be happy to see that as well.

Wednesday Jun 11, 2008

Jonathan had a terrific post yesterday that does an excellent job of presenting Sun's strategy for flash for the next few years. With my colleagues at Fishworks, an advanced product development team, I've spent more than a year working with flash and figuring out ways to integrate flash into ZFS, the storage hierarchy, and our future storage products — a fact to which John Fowler, EVP of storage, alluded recently. Flash opens surprising new vistas; it's exciting to see Sun leading in this field, and it's frankly exciting to be part of it.

Jonathan's post sketches out some of the basic ideas on how we're going to be integrating flash into ZFS to create what we call hybrid storage pools that combine flash with conventional (cheap) disks to create an aggregate that's cost-effective, power-efficient, and high-performing by capitalizing on the strengths of the component technologies (not unlike a hybrid car). We presented some early results at IDF which has already been getting a bit of buzz. Next month I have an article in Communications of the ACM that provides many more details on what exactly a hybrid pool is and how exactly it works. I've pulled out some excerpts from that article and included them below as a teaser and will be sure to post an update when the full article is available in print and online.

While its prospects are tantalizing, the challenge is to find uses for flash that strike the right balance of cost and performance. Flash should be viewed not as a replacement for existing storage, but rather as a means to enhance it. Conventional storage systems mix dynamic memory (DRAM) and hard drives; flash is interesting because it falls in a sweet spot between those two components for both cost and performance in that flash is significantly cheaper and denser than DRAM and also significantly faster than disk. Flash accordingly can augment the system to form a new tier in the storage hierarchy – perhaps the most significant new tier since the introduction of the disk drive with RAMAC in 1956.

...

A brute force solution to improve latency is to simply spin the platters faster to reduce rotational latency, using 15k RPM drives rather than 10k RPM or 7,200 RPM drives. This will improve both read and write latency, but only by a factor of two or so. ...

...

ZFS provides for the use of a separate intent-log device, a slog in ZFS jargon, to which synchronous writes can be quickly written and acknowledged to the client before the data is written to the storage pool. The slog is used only for small transactions while large transactions use the main storage pool – it's tough to beat the raw throughput of large numbers of disks. The flash-based log device would be ideally suited for a ZFS slog. ... Using such a device with ZFS in a test system, latencies measure in the range of 80-100µs which approaches the performance of NVRAM while having many other benefits. ...

...

By combining the use of flash as an intent-log to reduce write latency with flash as a cache to reduce read latency, we can create a system that performs far better and consumes less power than other system of similar cost. It's now possible to construct systems with a precise mix of write-optimized flash, flash for caching, DRAM, and cheap disks designed specifically to achieve the right balance of cost and performance for any given workload with data automatically handled by the appropriate level of the hierarchy. ... Most generally, this new flash tier can be thought of as a radical form of hierarchical storage management (HSM) without the need for explicit management.

Updated July, 1: I've posted the link to the article in my subsequent blog post.