Today we shipped our 2009.Q3 release. Amidst the many great new features, enhancements and bug fixes, we've added new storage profiles for triple-parity RAID and three-way mirroring. Here's an example on a 9 JBOD system of what you'll see in the updated storage configuration screen:
Note that the new Triple parity RAID, wide stripes option replaces the old Double parity RAID, wide stripes configuration. With RAID stripes that can easily be more than 40 disks wide, and resilver times that can be quite long as a result, we decided that the additional protection of triple-parity RAID trumped the very small space-efficiency advantage of double-parity RAID.
Ryan Matthews has updated the space calculator for the 7310 and 7410 to include the new profiles. Download the new update and give it a shot.
At the Flash Memory Summit today, Sun's own Michael Cornwell delivered a keynote excoriating the overall direction of NAND flash and SSDs. In particular, he spoke of the "lithography death march" as NAND vendors push to deliver the most cost-efficient solution while making huge sacrifices in reliability and performance.
On Wednesday, August 12, I'll be giving two short talks as part of sessions on flash-enabled power savings and data center applications:
In the evening from 7:30 to 9:00, I'll be hosting a table discussion of software as it pertains to flash. We'll be talking about uses of flash such as the Hybrid Storage Pool, how software can enable the use of MLC flash in the enterprise, the role of the flash translation layer, and anything else that comes up.
Today we're introducing a new member to the Sun Unified Storage family: the Sun Storage 7310. The 7310 is a scalable system from 12TB with a single half-populated J4400 JBOD up to 96TB with 4 JBODs. You can combine two 7310 head units to form a cluster. The base configuration includes a single quad-core CPU, 16GB of DRAM, a SAS HBA, and two available PCIe slots for NICs, backup cards, or the Fishworks cluster card. The 7310 can be thought of as a smaller capacity, lower cost version of the Sun Storage 7410. Like the 7410 it uses high density, low power disks as primary storage and can be enhanced with Readzilla and Logzilla flash accelerators for high performance. Like all the 7000 series products, the 7310 includes all protocols and software features without license fees.
The 7310 is an entry-level clusterable, scalable storage server, but the performance is hardly entry-level. Brendan Gregg from the Fishworks team has detailed the performance of the 7410, and has published the results of those tests on the new 7310. Our key metrics are cached reads from DRAM, uncached reads from disk, and writes to disk all over two 10GbE links with 20 client systems. As shown in the graph, the 7310 is an absolute champ, punching well above its weight. The numbers listed are in units of MB/s. Notice that the recent 2009.Q2 software update brought significant performance improvements to the 7410, and that the 7310 holds its own. For owners of entry-level systems from other vendors, check for yourself, but the 7310 is a fire-breather.
Added to the low-end 7110, the dense, expandable 7210, the high-end clusterable, expandable 7410, the 7310 fills an important role in the 7000 series product line: an entry-level clusterable, expandable system, with impressive performance, and an attractive price. If the specs and performance have piqued your interest, try out the user interface on the 7000 series with the Sun Storage 7000 simulator.
As flash memory has become more and more prevalent in storage from the consumer to the enterprise, people have been charmed by the performance characteristics but get stuck on the longevity. SSDs based on SLC flash are typically rated at 100,000 to 1,000,000 write/erase cycles while MLC-based SSDs are rated for significantly less. For conventional hard drives, the distinct yet similar increase in failures over time has long been solved by mirroring (or other redundancy techniques). When applying this same solution to SSDs, a common concern is that two identical SSDs with identical firmware storing identical data would run out of write/erase cycles for a given cell at the same moment and thus data reliability would not be increased via mirroring. While the logic might seem reasonable, permit me to dispel that specious argument.
The operating system and filesystem
From the level of most operating systems or filesystems, an SSD appears like a conventional hard drive and is treated more or less identically (Solaris' ZFS being a notable exception). As with hard drives, SSDs can report predicted failures through SMART. For reasons described below, SSDs already keep track of the wear of cells, but one could imagine even the most trivial SSD firmware keeping track of the rapidly approaching write/erase cycle limit and notifying the OS or FS via SMART, which would in turn notify the user. Well in advance of actual data loss, the user would have an opportunity to replace either or both sides of the mirror as needed.
SSD firmware
Proceeding down the stack to the level of the SSD firmware, there are two relevant features to understand: wear-leveling, and excess capacity. There is not a static mapping between the virtual offset of an I/O to an SSD and the physical flash cells that are chosen by the firmware to record the data. For a variety of reasons (flash cell early mortality, write performance, bad cell remapping) it is necessary for the SSD firmware to remap data all over its physical flash cells. In fact, hard drives have a similar mechanism by which they hold sectors in reserve and remap them to fill in for defective sectors. SSDs have the added twist that they want to maximize the longevity of their cells, each of which will ultimately decay over time. To do this, the firmware ensures that a given cell isn't written far more frequently than any other cell, a process called wear-leveling for obvious reasons.
To summarize, subsequent writes to the same LBA, the same virtual location, on an SSD could land on different physical cells for the several reasons listed. The firmware is, more often than not, deterministic; thus two identical SSDs with the exact same physical media and I/O stream (as in a mirror) would behave identically. In practice, though, minor timing variations in the commands from operating software, and differences in the media (described below), ensure that the identical SSDs will behave differently. As time passes, those differences are magnified such that two SSDs that started with the same mapping between virtual offsets and physical media will quickly and completely diverge.
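To make the remapping concrete, here is a minimal sketch of a toy wear-leveling layer. It is not how any real SSD firmware works; the ToyFTL class, its fields, and the "least-worn free block" policy are all assumptions chosen for illustration. It shows that repeated writes to one LBA wander across every physical block while the erase counts stay nearly even.

```python
class ToyFTL:
    """Toy flash translation layer (hypothetical): every write is
    redirected to the least-worn free physical block."""

    def __init__(self, physical_blocks):
        self.erase_counts = [0] * physical_blocks  # wear per physical block
        self.mapping = {}                          # LBA -> physical block
        self.free = set(range(physical_blocks))    # blocks holding no live data

    def write(self, lba):
        # Choose the least-worn free block; ties go to the lowest index.
        target = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(target)
        old = self.mapping.get(lba)
        if old is not None:
            # The block holding the stale copy is erased and returned to the pool.
            self.erase_counts[old] += 1
            self.free.add(old)
        self.mapping[lba] = target
        return target


ftl = ToyFTL(physical_blocks=8)
# Write the same logical address 80 times and record where each write lands.
placements = [ftl.write(lba=0) for _ in range(80)]
print(sorted(set(placements)), ftl.erase_counts)
```

Even though the host only ever touched LBA 0, no single block absorbs the wear. A real device layers garbage collection, bad-block handling, and timing-dependent decisions on top of this, which is where two mirrored devices start to diverge.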
Flash hardware and physics
Identical SSDs with identical firmware still have their own physical flash memory, which can vary in quality. To break the problem apart a bit, an SSD is composed of many cells, and each cell's ability to retain data slowly degrades as it's exercised. Each cell is in fact a physical component of an integrated circuit. Flash memory differs from many other integrated circuits in that it requires far higher voltages. It is this high voltage that causes the oxide layer to gradually degrade over time. Further, all cells are not created equal: microscopic variations in the thickness and consistency of the physical medium can make some cells more resilient and others less; some cells might be DOA, while others might last significantly longer than the norm. By analogy, if you install new light bulbs in a fixture, they might burn out in the same month, but how often do they fail on the same day? The variability of flash cells impacts the firmware's management of the underlying cells, but more trivially it means that two SSDs in a mirror would experience data loss in corresponding regions at different rates.
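The light bulb analogy can be sketched as a small simulation. This is only an illustration under loud assumptions: the normal distribution, the 10,000-cycle spread, and the cell_endurance helper are all hypothetical, and only the 100,000-cycle mean comes from the SLC ratings quoted earlier.

```python
import random


def cell_endurance(cells, rng, mean=100_000, spread=10_000):
    """Draw a write/erase limit for each cell.
    The normal distribution and its spread are assumptions for illustration."""
    return [max(0, round(rng.gauss(mean, spread))) for _ in range(cells)]


rng = random.Random(2009)  # fixed seed so the run is repeatable
side_a = cell_endurance(1000, rng)  # one side of the mirror
side_b = cell_endurance(1000, rng)  # the "identical" other side

# Corresponding cells that would wear out on exactly the same cycle.
same_cycle = sum(1 for a, b in zip(side_a, side_b) if a == b)
# Gap, in cycles, between the first cell failure on each side.
first_failure_gap = abs(min(side_a) - min(side_b))
print(same_cycle, first_failure_gap)
```

Under these assumptions it is rare for a given pair of corresponding cells to share an endurance limit, which is the point of the argument: the two sides of a mirror do not fail in lockstep.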
Wrapping up
As with conventional hard drives, mirroring SSDs is a good idea to preserve data integrity. The operating system, filesystem, SSD firmware, and physical properties of the flash medium make this approach sound both in theory and in practice. Flash is a new exciting technology and changes many of the assumptions derived from decades of experience with hard drives. As always proceed with care — especially when your data is at stake — but get the facts, and in this case the wisdom of conventional hard drives still applies.
On the heels of the 2009.Q2.0.0 release, we've posted an update to the Sun Storage 7000 simulator. The simulator contains the exact same software as the other members of the 7000 series, but runs inside a VM rather than on actual hardware. It supports all the same features, and has all the same UI components; just remember that an actual 7000 series appliance is going to perform significantly better than a VM running a puny laptop CPU. Download the simulator here.
The new version of the simulator contains two enhancements. First, it comes with the 2009.Q2.0.0 release pre-installed. The Q2 release is the first to provide full support for the simulator, and as I wrote here you can simply upgrade your old simulator. In addition, while the original release of the simulator could only be run on VMware we now support both VMware and VirtualBox (version 2.2.2 or later). When we first launched the 7000 series back in November, we intended to support the simulator on VirtualBox, but a couple of issues thwarted us, in particular lack of OVF support and host-only networking. The recent 2.2.2 release of VirtualBox brought those missing features, so we're pleased to be able to support both virtualization platforms.
As OVF support is new in VirtualBox, here's a quick installation guide for the simulator. After uncompressing the SunStorageVBox.zip archive, select "Import Appliance...", and select "Sun Storage VirtualBox.ovf". Clicking through will bring up a progress bar. Be warned: this can take a while depending on the speed of your CPU and hard drive.
When that completes, you will see the "Sun Storage VirtualBox" VM in the VirtualBox UI. You may need to adjust settings such as the amount of allocated memory, or extended CPU features. Run the VM and follow the instructions when it boots up. You'll be prompted for some simple network information. If you're unsure how to fill in some of the fields, here are some pointers:
- Host Name - whatever you want
- DNS Domain - "localdomain"
- Default Router - the same as the IP address but put 1 as the final octet
- DNS Server - the same as the IP address but put 1 as the final octet
- Password - whatever you want and something you can remember
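If you would rather compute the router and DNS values than work them out by hand, the rule of thumb above can be sketched in a few lines. The suggested_router helper is a hypothetical name, and the sketch assumes an IPv4 address on a network where the host at .1 plays both roles, as on a typical host-only network.

```python
import ipaddress


def suggested_router(ip):
    """Return the given IPv4 address with its final octet replaced by 1."""
    addr = ipaddress.IPv4Address(ip)
    # Clear the last octet, then set it to 1.
    return str(ipaddress.IPv4Address((int(addr) & 0xFFFFFF00) | 1))


print(suggested_router("192.168.56.101"))  # 192.168.56.1
```

The same value goes in both the Default Router and DNS Server fields.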
When you complete that form, wait until you're given a URL to copy into a web browser. Note that you'll need to use the version of the URL with the IP address (unless you've added an entry to your DNS server). In the above example, that would be: https://192.168.56.101:215/. From the web browser, complete the appliance configuration, and then you can start serving up data, observing activity with Storage Analytics, and kicking the tires on a functional replica of a 7000 series appliance.
Today we released version 2009.Q2.0.0, the first major software update for the Sun Storage 7000 series. It includes a bunch of new features, bug fixes, and improvements. Significantly for users of the Sun Storage 7000 simulator, the virtual machine version of the 7000 series, this is the first update that supports the VMs. As with a physical 7000 series appliance, upgrade by navigating to Maintenance > System, and click the + icon next to Available Updates. Remember not to ungzip the update binary — the appliance will do that itself. We'll be releasing an update VM preinstalled with the new bits so stay tuned.
Note: There were actually two releases of the VMware simulator. The first one came right around our initial launch, and the version string is ak-2008.11.07. This version cannot be upgraded so you'll need to download the updated simulator whose version is ak-2008.11.21. As noted above, we'll soon be releasing an updated VM with 2009.Q2.0.0 (ak-2009.04.10.0.0) preinstalled.
We're announcing a couple of new things in the flash SSD space. First, we now support the Intel X25-E SSD in a bunch of our servers. This can be used to create a Hybrid Storage Pool like in the Sun Storage 7000 series, or as just a little flash for high performance / low power / tough environmentals.
Second, we're introducing a new open standard with the Open Flash Module. This creates a new form factor for SSDs bringing flash even closer to the CPU for higher performance and tighter system integration. SSDs in HDD form factors were a reasonable idea to gain market acceptance in much the same way as you first listened to your iPod over your car stereo with that weird tape adapter. Now the iPod is a first class citizen in many cars and, with the Open Flash Module, flash has found a native interface and form factor. This is a building block that we're very excited about, and it was designed specifically for use with ZFS and the Hybrid Storage Pool. Stay tuned: these flash miniDIMMs as they're called will be showing up in some interesting places soon enough. Speaking personally, this represents an exciting collaboration of hardware and software, and it's gratifying to see Sun showing real leadership around flash through innovation.
In May of 2007 I was lined up to give my first customer presentation of what
would become the Sun Storage 7000 series. I inherited a well-worn slide deck
describing the product, but we had seen the
reactions of prospective customers who saw the software live and had a chance
to interact with features such as Analytics; no slides
would elicit that kind of response. So with some tinkering, I hacked up our
installer and shoe-horned the prototype software into a virtual machine. The
live demonstration was a hit despite some rocky software interactions.
As the months passed, our software became increasingly aware of our hardware platforms;
the patches I had used for the virtual machine version fell into
disrepair.
Racing toward the product launch, neither I nor anyone else in
the Fishworks group had the time to nurse it back to health.
I found myself using months old software for a customer demo
— a useful tool, but embarrassing given the advances we had made.
We knew that the VM was going to be great for presentations, and we had
talked about releasing a version to the general public, but that, we thought,
was something that we could sort out after the product launch.
In the brief calm after the frenetic months finishing the product and just a few days before the
launch in Las Vegas, our EVP of
storage, John Fowler, paid a visit to the Fishworks office. When we mentioned
the VM version, his eyes lit up at the thought of how it would help storage professionals.
Great news, but we realized that the next few days had just become much busier.
Creating the VM version was a total barn-raising. Rather than a one-off
with sharp edges, adequate for a canned demo, we wanted to hand a
product to users that would simulate exactly
a Sun Storage 7000 series box. In about three days, everyone in the
group pitched in to build what was essentially a brand new product and platform complete with a hardware view conjured from bits of our actual appliances.
After a frenetic weekend in November, the Sun Unified Storage Simulator was ready in time for the launch. You can download it here for VMware. We had prepared versions for VirtualBox as well as VMware, preferring VirtualBox since it's a Sun product; along the way we found some usability issues with the VirtualBox version — we were pushing both products beyond their design center and VMware handled it better. Rest assured that we're working to resolve those issues and we'll release the simulator for VirtualBox just as soon as it's ready. Note that we didn't limit the functionality at all; what you see is exactly what you'll get with an actual 7000 series box (though the 7000 series will deliver much better performance than a laptop). Analytics, replication, compression, CIFS, iSCSI are all there; give it a try and see what you think.
In my last blog post I responded to Barry Burke, author of the Storage Anarchist blog. I was under the perhaps naive impression that Barry was an independent voice in the blogosphere. In fact, he's merely Storage Anarchist by night; by day he's the mild-mannered chief strategy officer for EMC's Symmetrix Products Group, a fact notable for its absence from Barry's blog. In my post, I observed that Barry had apparently picked his horse in the flash race and Chris Caldwell commented that "it would appear that not only has he chosen his horse, but that he's planted squarely on its back wearing an EMC jersey." Indeed.
While looking for some mention of his employment with EMC, I found this petard from Barry Burke, chief strategy officer for EMC's Symmetrix Products Group:
And [the "enterprise" differentiation] does matter – recall this video of a Fishworks JBOD suffering a 100x impact on response times just because the guy yells at a drive. You wouldn't expect that to happen with an enterprise class disk drive, and with enterprise-class drives in an enterprise-class array, it won't.
Barry, we wondered the same thing so we got some time on what you'd consider an enterprise-class disk drive in an enterprise-class array from an enterprise-class vendor. The results were nearly identical (of course, measuring latency on other enterprise-class solutions isn't nearly as easy). It turns out drives don't like being shouted at (it's shock, not the traditional RV drives compensate for). That enterprise-class rig was not an EMC Symmetrix though I'd salivate over the opportunity to shout at one.
Barry Burke, the Storage Anarchist, has written an interesting roundup ("don't miss the amazing vendor flash dance") covering the flash strategies of some players in the server and storage spaces. Sun's position on flash comes out a bit mangled, but Barry can certainly be forgiven for missing the mark since Sun hasn't always communicated its position well. Allow me to clarify our version of the flash dance.
Barry's conclusion that Sun sees flash as well-suited for the server isn't wrong — of course it's harder to drive high IOPS and low latency outside a single box. However we've also proven not only that we see a big role for flash in storage, but that we're innovating in that realm with the Hybrid Storage Pool (HSP) an architecture that seamlessly integrates flash into the storage hierarchy. Rather than a Ron Popeil-esque sales pitch, let me take you through the genesis of the HSP.
The HSP is something we started to develop a bit over two years ago. By January of 2007, we had identified that a ZFS intent-log device using flash would greatly improve the performance of the nascent Sun Storage 7000 series in a way that was simpler and more efficient than some other options. We started getting our first flash SSD samples in February of that year. With SSDs on the brain, we started contemplating other uses and soon came up with the idea of using flash as a secondary caching tier between the DRAM cache (the ZFS ARC) and disk. We dubbed this the L2ARC.
At that time we knew that we'd be using mostly 7200 RPM disks in the 7000 series. Our primary goal with flash was to greatly improve the performance of synchronous writes and we addressed this with the flash log device that we call Logzilla. With the L2ARC we solved the other side of the performance equation by improving read IOPS by leaps and bounds over what hard drives of any rotational speed could provide. By August of 2007, Brendan had put together the initial implementation of the L2ARC, and, combined with some early SSD samples — Readzillas — our initial enthusiasm was borne out. Yes, it's a caching tier so some workloads will do better than others, but customers have been very pleased with their results.
These two distinct uses of flash comprise the Hybrid Storage Pool. In April 2008 we gave our first public talk about the HSP at the IDF in Shanghai, and a year and a bit after Brendan's proof of concept we shipped the 7410 with Logzilla and Readzilla. It's important to note that this system achieves remarkable price/performance through its marriage of commodity disks with flash. Brendan has done a terrific job of demonstrating the performance enabled by the HSP on that system.
While we were finishing the product, the WSJ reported that EMC was starting to use flash drives in their products. I was somewhat deflated initially until it became clear that EMC's solution didn't integrate flash into the storage hierarchy nearly as seamlessly or elegantly as we had with the HSP; instead they had merely replaced their fastest, most expensive drives with faster and even more expensive SSDs. I'll disagree with the Storage Anarchist's conclusion: EMC did not start the flash revolution nor are they leading the way (though I don't doubt they are, as Barry writes, "Taking Our Passion, And Making It Happen"). EMC though has done a great service to the industry by extolling the virtues of SSDs and, presumably, to EMC customers by providing a faster tier for HSM.
In the same article, Barry alludes to some of the problems with EMC's approach using SSDs from STEC:
STEC rates their ZeusIOPS drives at something north of 50,000 read IOPS each, but as I have explained before, this is a misleading number because it’s for 512-byte blocks, read-only, without the overhead of RAID protection. A more realistic expectation is that the drives will deliver somewhere around 5-6000 4K IOPS (4K is a more typical I/O block size).
The Hybrid Storage Pool avoids the bottlenecks associated with a tier 0 approach, drives much higher IOPS, scales, and makes highly efficient, economical use of the resources from flash to DRAM and disk. Further, I think we'll be able to debunk this notion that the enterprise needs its own class of flash devices by architecting commodity flash to build an enterprise solution. There are a lot of horses in this race; Barry has clearly already picked his, but the rest of you may want to survey the field.
The debate, calmly waged, on the best use of flash in the enterprise can be
summarized as whether flash should be a replacement for disk, acting as
primary storage, or it should be regarded as a new, and complementary tier in
the storage hierarchy, acting as a massive read cache. The market leaders in
storage have weighed in on the issue, and have declared incontrovertibly that,
yes, both are the right answer, but there's some bias underlying that
equanimity.
Chuck Hollis, EMC's Global Marketing CTO, writes, that
"flash
as cache will eventually become less interesting as part of the overall
discussion... Flash as storage? Well, that's going to be really
interesting."
Standing boldly with a foot in each camp, Dave Hitz, founder and EVP at Netapp, thinks that
"Flash is
too expensive to replace disk right away, so first we'll see a new generation of
storage systems that combine the two: flash for performance and disk for
capacity."
So what are these guys really talking about, what does the landscape look like,
and where does Sun fit in all this?
Flash as primary storage (a.k.a. tier 0)
Integrating flash efficiently into a storage system isn't obvious; the simplest
way is as a direct replacement for disks. This is why most of the flash we use
today in enterprise systems comes in units that look and act just like hard
drives: SSDs are designed to be drop in replacements. Now, a flash SSD is
quite different than a hard drive — rather than a servo spinning
platters while a head chatters back and forth, an SSD has floating gates
arranged in blocks... actually it's probably simpler to list what they have
in common, and that's just the form factor and interface (SATA, SAS, FC).
Hard drives have all kinds of properties that don't make sense in the world of
SSDs (e.g. I've seen an SSD that reports its RPM telemetry as 1),
and SSDs have their own quirks with no direct analog (read/write asymmetry,
limited write cycles, etc). SSD vendors, however, manage to pound these round
pegs into their square holes, and produce something that can stand in for an
existing hard drive. Array vendors are all too happy to attain buzzword
compliance by stuffing these SSDs into their products.
Storage vendors already know how to deal with a caste system for disks: they
striate them in layers with fast, expensive 15K RPM disks as tier 1, and
slower, cheaper disks filling out the chain down to tape. What to do with
these faster, more expensive disks? Tier-0 of course! An astute Netapp
blogger asks,
"when
the industry comes up with something even faster... are we going to have
tier -1" — great question.
What's wrong with that approach? Nothing. It works; it's simple; and we (the
computing industry) basically know how to manage a bunch of tiers of storage
with something called
hierarchical
storage management.
The trouble with HSM is the burden of the M. This solution kicks the problem
down the road, leaving administrators to figure out where to put data, what
applications should have priority, and when to migrate data.
Flash as a cache
The other school of thought around flash is to use it not as a replacement
for hard drives, but rather as a massive cache for reading frequently accessed
data. As I wrote back in June for CACM,
"this
new flash tier can be thought of as a radical form of hierarchical storage
management (HSM) without the need for explicit management." Tersely,
HSM without the M. This idea forms a major component of what we at Sun
are calling the
Hybrid
Storage Pool (HSP), a mechanism for integrating flash with disk and DRAM
to form a new, and —
I
argue — superior storage solution.
Let's set aside the specifics of how we implement the HSP in
ZFS — you can
read about that
elsewhere.
Rather, I'll compare the use of flash as a cache to flash as a replacement
for disk independent of any specific solution.
The case for cache
It's easy to see why using flash as primary storage is attractive. Flash is
faster than the fastest disks by at least a factor of 10 for writes and a
factor of 100 for reads measured in IOPS.
Replacing disks with flash though isn't without nuance;
there are several inhibitors, primary among
them is cost. The cost of flash continues to drop, but it's still much more
expensive than cheap disks, and will continue to be for quite a while. With
flash as primary storage, you still need data redundancy — SSDs can and
do fail — and while we could use RAID with single- or
double-device redundancy, that would cleave the available IOPS by a factor of
the stripe width. The reason to migrate to flash is for performance so it
wouldn't make much sense to hold the majority of that performance back with
RAID.
The remaining option, therefore, is to mirror SSDs whereby the already high
cost is doubled.
It's hard to argue with results; all-flash solutions do rip. If money were
no object that may well be the best solution (but if cost truly wasn't a
factor, everyone would strap batteries to DRAM and call it a day).
Can flash as a cache do better? Say we need to store 50TB of data. With an
all-flash pool, we'll need to buy SSDs that can hold roughly 100TB of data if
we want to mirror for optimal performance, and maybe 60TB if we're willing to
accept a
far more modest performance improvement over conventional hard drives. Since
we're already resigned to cutting a pretty hefty check, we have quite a bit
of money to play with to design a hybrid solution.
If we were to provision our system with
50TB of flash and 60TB of hard drives we'd have enough cache to retain every
byte of active data in flash while the disks provide the necessary
redundancy. As writes come in the filesystem would populate the flash while
it writes data persistently to disk. The performance of this system would be
epsilon away from the mirrored flash solution as read requests would only go
to disk in the case of faults from the flash devices. Note that we never rely on
correctness from the flash; it's the hard drives that provide reliability.
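To put rough numbers on the comparison above, here is a back-of-the-envelope sketch. The capacities (100TB mirrored flash, 60TB parity-protected flash, 50TB flash plus 60TB disk) come from the text; the relative prices are purely hypothetical placeholders, with disk normalized to 1 and flash assumed to cost ten times as much per TB.

```python
# Hypothetical relative prices per TB; only the ratio matters here.
FLASH_PER_TB = 10.0
DISK_PER_TB = 1.0


def cost(flash_tb, disk_tb):
    """Relative cost of a configuration with the given raw capacities."""
    return flash_tb * FLASH_PER_TB + disk_tb * DISK_PER_TB


options = {
    "mirrored all-flash": cost(flash_tb=100, disk_tb=0),
    "parity RAID all-flash": cost(flash_tb=60, disk_tb=0),
    "hybrid (flash cache + disk)": cost(flash_tb=50, disk_tb=60),
}

for name, relative_cost in sorted(options.items(), key=lambda kv: kv[1]):
    print(f"{name}: {relative_cost:.0f}")
```

With these capacities the hybrid undercuts mirrored flash whenever flash costs more than 1.2 times as much as disk per TB, and undercuts the parity-protected flash pool once the ratio exceeds 6. Plug in real prices before drawing any conclusions.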
The hybrid solution is cheaper, and it's also far more flexible. If a smaller
working set accounted for a disproportionally large number of reads, the total
IOPS capacity of the all-flash solution could be underused. With flash as a
cache, data could be migrated to dynamically distribute load, and additional
cache could be used to enhance the performance of the working set. It would be
possible to use some of the same techniques with an all-flash storage pool, but
it could be tricky. The luxury of a cache is that the looser constraints allow
for more aggressive data manipulation.
Building on the idea of concentrating the use of flash for hot data,
it's easy to see how flash as a cache can improve
performance even without every byte present in the cache. Most data doesn't
require 50μs random access latency over the entire dataset; users would see a
significant performance improvement with just the active subset in a flash
cache.
Of course, this means
that software needs to be able to anticipate what data is in use which probably
inspired this comment from Chuck Hollis: "cache is cache — we all know
what it can and can't do." That may be so, but comparing an ocean of flash for
primary storage to a thimbleful of cache reflects fairly obtuse thinking.
Caching algorithms will always be imperfect, but the massive scale to which we
can grow a flash cache radically alters the landscape.
Even when a working set is too large to be cached, it's possible for a hybrid
solution to pay huge dividends.
Over at Facebook, Jason Sobel
(a colleague of mine in college)
produced an interesting
presentation
on their use of storage (take a look at Jason's penultimate slide for his take
on SSDs).
Their datasets are so vast and sporadically accessed that the latency of
actually loading a picture, say, off of hard drives isn't actually the biggest
concern, rather it's the time it takes to read the indirect blocks, the
metadata. At Facebook, they've taken great pains to reduce the number of
dependent disk accesses from fifteen down to about three.
In a case such as theirs, it would never be economical to store or cache the full
dataset on flash and the working set is similarly too large as data access can
be quite unpredictable.
It could, however, be possible to cache all of their metadata in flash.
This would reduce the latency to an infrequently accessed image by nearly a
factor of three. Today in ZFS this is a manual setting per-filesystem, but it
would be possible to evolve a caching algorithm to detect a condition where this
was the right policy and make the adjustment dynamically.
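The "nearly a factor of three" figure can be checked with simple arithmetic. This sketch assumes that the three dependent accesses split into two metadata reads followed by one data read, and that a random disk access costs 8ms; both are assumptions for illustration. The 50μs flash latency is the figure used earlier in this post.

```python
DISK_MS = 8.0    # assumed latency of one random disk access
FLASH_MS = 0.05  # 50 microseconds, as quoted earlier


def image_latency_ms(metadata_reads, metadata_ms, data_ms=DISK_MS):
    """Dependent accesses are serialized, so their latencies simply add up."""
    return metadata_reads * metadata_ms + data_ms


all_disk = image_latency_ms(metadata_reads=2, metadata_ms=DISK_MS)
metadata_in_flash = image_latency_ms(metadata_reads=2, metadata_ms=FLASH_MS)
speedup = all_disk / metadata_in_flash
print(f"{all_disk:.1f}ms vs {metadata_in_flash:.1f}ms, {speedup:.2f}x faster")
```

The image itself still comes off a hard drive, so the single remaining disk access dominates; the speedup approaches but never quite reaches three.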
Using flash as a cache offers the potential to do better, and to
make more efficient and more economical use of flash. Sun, and the industry
as a whole have only just started to build the software designed to realize
that potential.
Putting products before words
At Sun, we've just released our first line of products that offer complete
flash integration with the Hybrid Storage Pool; you can read about that in
my blog post
on the occasion of our product launch. On the eve
of that launch, Netapp announced their own offering: a flash-laden PCI card that
plays much the same part as their DRAM-based Performance Acceleration Module
(PAM). This will apparently be available
sometime
in 2009.
EMC offers a tier 0 solution that employs very fast and very expensive flash
SSDs.
What we have in ZFS today isn't perfect.
Indeed, the Hybrid Storage Pool casts the state of the art forward, and we'll be
catching up with solutions to the hard questions it raises for at least a few
years. Only then will we realize the full potential of flash as a cache.
What we have today, though, integrates flash in a way that changes the landscape
of storage economics and delivers cost efficiencies that haven't been seen
before. If the drive manufacturers don't already, it can't be long until they
hear the death knell for 15K RPM drives loud and clear.
Perhaps it's cynical or solipsistic to conclude that the timing of Dave
Hitz's and Chuck Hollis' blogs was designed to coincide with the release of
our new product and perhaps take some of the wind out of our sails,
but I will, as the commenters on Dave's blog have, take it as a sign
that we're on the right track. For the moment, I'll put my faith in
this bit of marketing material enigmatically referenced in a number of NetApp
blogs on the subject of flash:
In today's competitive environment, bringing a product or service to market
faster than the competition can make a significant difference. Releasing a
product to market in a shorter time can give you first-mover advantage and
result in larger market share and higher revenues.
The Sun Storage 7410 is our expandable storage appliance that can be hooked up to anywhere from one to twelve JBODs, each with 24 1TB disks. With all those disks we provide several different options for how to arrange them into your storage pool: double-parity RAID-Z, wide-stripe double-parity RAID-Z, mirror, striped, and single-parity RAID-Z with narrow stripes. Each of these options has a different mix of availability, performance, and capacity, described both in the UI and in the installation documentation. With the wide array of supported configurations, it can be hard to know how much usable space each will support.
To address this, I wrote a Python script that presents a hypothetical hardware configuration to an appliance and reports back the available options. We use the logic on the appliance itself to ensure that the results are completely accurate, as the same algorithms would be applied as when the physical pallet of hardware shows up. This, of course, requires you to have an appliance available to query; fortunately, you can run a virtual instance of the appliance on your laptop.
You can download sizecalc.py here; you'll need Python installed on the system where you run it. Note that the script uses XML-RPC to interact with the appliance, and consequently it relies on unstable interfaces that are subject to change. Others are welcome to interact with the appliance at the XML-RPC layer, but note that it's unstable and unsupported. If you're interested in scripting the appliance, take a look at Bryan's recent post. Feel free to post comments here if you have questions, but there's no support for the script, implied, explicit, unofficial or otherwise.
Running the script by itself produces a usage help message:
$ ./sizecalc.py
usage: ./sizecalc.py [ -h <half jbod count> ] <appliance name or address>
<root password> <jbod count>
Remember that you need a Sun Storage 7000 appliance (even a virtual one) to execute the capacity calculation. In this case, I'll specify a physical appliance running in our lab, and I'll start with a single JBOD (note that I've redacted the root password, but of course you'll need to type in the actual root password for your appliance):
$ ./sizecalc.py catfish ***** 1
type         NSPF   width  spares  data drives  capacity (TB)
raidz2       False  11     2       22           18
raidz2 wide  False  23     1       23           21
mirror       False  2      2       22           11
stripe       False  0      0       24           24
raidz1       False  4      4       20           15
Note that with only one JBOD no configurations support NSPF (No Single Point of Failure) since that one JBOD is always a single point of failure. If we go up to three JBODs, we'll see that we have a few more options:
$ ./sizecalc.py catfish ***** 3
type         NSPF   width  spares  data drives  capacity (TB)
raidz2       False  13     7       65           55
raidz2       True   6      6       66           44
raidz2 wide  False  34     4       68           64
raidz2 wide  True   6      6       66           44
mirror       False  2      4       68           34
mirror       True   2      4       68           34
stripe       False  0      0       72           72
raidz1       False  4      4       68           51
In this case we have to give up a bunch of capacity in order to attain NSPF. Now let's look at the largest configuration we support today with twelve JBODs:
$ ./sizecalc.py catfish ***** 12
type         NSPF   width  spares  data drives  capacity (TB)
raidz2       False  14     8       280          240
raidz2       True   14     8       280          240
raidz2 wide  False  47     6       282          270
raidz2 wide  True   20     8       280          252
mirror       False  2      4       284          142
mirror       True   2      4       284          142
stripe       False  0      0       288          288
raidz1       False  4      4       284          213
raidz1       True   4      4       284          213
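The capacity column in these tables appears to follow directly from the width and data-drive columns, given 1TB disks. The sketch below is only my reading of the output, not the appliance's actual algorithm; the function name and the parity table are made up for illustration. It assumes that the "data drives" count includes parity drives and that each stripe gives up one or two drives to parity:

```python
# Parity drives per stripe for each RAID-Z profile (assumed from the profile names).
PARITY_PER_STRIPE = {"raidz1": 1, "raidz2": 2, "raidz2 wide": 2}


def capacity_tb(profile, width, data_drives, disk_tb=1):
    """Estimate capacity in TB from one row of the size calculator output."""
    if profile == "stripe":
        # No redundancy: every drive holds data.
        return data_drives * disk_tb
    if profile == "mirror":
        # Each group of `width` drives stores a single copy of the data.
        return data_drives // width * disk_tb
    stripes = data_drives // width
    return (data_drives - stripes * PARITY_PER_STRIPE[profile]) * disk_tb


# Rows taken from the twelve-JBOD output: (profile, width, data drives).
for row in [("raidz2", 14, 280), ("raidz2 wide", 47, 282),
            ("mirror", 2, 284), ("stripe", 0, 288), ("raidz1", 4, 284)]:
    print(row, "->", capacity_tb(*row), "TB")
```

This shows why wide stripes buy back capacity: six 47-disk stripes give up only 12 drives to parity, while twenty 14-disk stripes give up 40.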
The size calculator also allows you to model a system with Logzilla devices, write-optimized flash devices that form a key part of the Hybrid Storage Pool. After you specify the number of JBODs in the configuration, you can include a list of how many Logzillas are in each JBOD. For example, the following invocation models twelve JBODs with four Logzillas in each of the first two JBODs:
$ ./sizecalc.py catfish ***** 12 4 4
type         NSPF   width  spares  data drives  capacity (TB)
raidz2       False  13     7       273          231
raidz2       True   13     7       273          231
raidz2 wide  False  55     5       275          265
raidz2 wide  True   23     4       276          252
mirror       False  2      4       276          138
mirror       True   2      4       276          138
stripe       False  0      0       280          280
raidz1       False  4      4       276          207
raidz1       True   4      4       276          207
A very common area of confusion has been how to size Sun Storage 7410 systems, and the relationship between the physical storage and the delivered capacity. I hope that this little tool will help to answer those questions. A side benefit should be still more interest in the virtual version of the appliance — a subject I've been meaning to post about so stay tuned.
Update December 14, 2008: A couple of folks requested that the script allow for modeling half-JBOD allocations because the 7410 allows you to split JBODs between heads in a cluster. To accommodate this, I've added a -h option that takes as its parameter the number of half JBODs. For example:
$ ./sizecalc.py -h 12 192.168.18.134 ***** 0
type         NSPF   width  spares  data drives  capacity (TB)
raidz2       False  14     4       140          120
raidz2       True   14     4       140          120
raidz2 wide  False  35     4       140          132
raidz2 wide  True   20     4       140          126
mirror       False  2      4       140          70
mirror       True   2      4       140          70
stripe       False  0      0       144          144
raidz1       False  4      4       140          105
raidz1       True   4      4       140          105
Update February 4, 2009: Ryan Matthews and I collaborated on a new version of the size calculator that now lists the raw space available in TB (decimal as quoted by drive manufacturers for example) as well as the usable space in TiB (binary as reported by many system tools). The latter also takes account of the sliver (1/64th) reserved by ZFS:
$ ./sizecalc.py 192.168.18.134 ***** 12
type         NSPF   width  spares  data drives  raw (TB)  usable (TiB)
raidz2       False  14     8       280          240.00    214.87
raidz2       True   14     8       280          240.00    214.87
raidz2 wide  False  47     6       282          270.00    241.73
raidz2 wide  True   20     8       280          252.00    225.61
mirror       False  2      4       284          142.00    127.13
mirror       True   2      4       284          142.00    127.13
stripe       False  0      0       288          288.00    257.84
raidz1       False  4      4       284          213.00    190.70
raidz1       True   4      4       284          213.00    190.70
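The relationship between the two columns can be reproduced with a few lines of Python. This is a sketch of the arithmetic as described above (decimal TB to binary TiB, less the 1/64th reserved by ZFS), not code taken from sizecalc.py; the names are mine:

```python
TB = 10**12            # decimal terabyte, as quoted by drive manufacturers
TIB = 2**40            # binary tebibyte, as reported by many system tools
ZFS_RESERVED = 1 / 64  # the sliver held back by ZFS


def usable_tib(raw_tb):
    """Convert raw decimal TB to usable binary TiB after the ZFS reservation."""
    return raw_tb * TB / TIB * (1 - ZFS_RESERVED)


# Raw capacities from the twelve-JBOD output above.
for raw in (240, 270, 252, 142, 288, 213):
    print(f"{raw:7.2f} TB raw -> {usable_tib(raw):7.2f} TiB usable")
```

Most of the roughly 10% gap between the columns comes from the change of units; the ZFS reservation accounts for only about 1.6%.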
Update June 17, 2009: Ryan Matthews with help from has again revised the size calculator to model both adding expansion JBODs and to account for the now expandable Sun Storage 7210. Take a look at Ryan's post for usage information. Here's an example of the output:
$ ./sizecalc.py 172.16.131.131 *** 1 h1 add 1 h add 1
Sun Storage 7000 Size Calculator Version 2009.Q2
type         NSPF   width  spares  data drives  raw (TB)  usable (TiB)
mirror       False  2      5       42           21.00     18.80
raidz1       False  4      11      36           27.00     24.17
raidz2       False  10-11  4       43           35.00     31.33
raidz2 wide  False  10-23  3       44           38.00     34.02
stripe       False  0      0       47           47.00     42.08
Update September 16, 2009: Ryan Matthews updated the size calculator for the 2009.Q3 release. The update includes the new triple-parity RAID wide stripe and three-way mirror profiles:
$ ./sizecalc.py boga *** 4
Sun Storage 7000 Size Calculator Version 2009.Q3
type         NSPF   width  spares  data drives  raw (TB)  usable (TiB)
mirror       False  2      4       92           46.00     41.18
mirror       True   2      4       92           46.00     41.18
mirror3      False  3      6       90           30.00     26.86
mirror3      True   3      6       90           30.00     26.86
raidz1       False  4      4       92           69.00     61.77
raidz1       True   4      4       92           69.00     61.77
raidz2       False  13     5       91           77.00     68.94
raidz2       True   8      8       88           66.00     59.09
raidz2 wide  False  46     4       92           88.00     78.78
raidz2 wide  True   8      8       88           66.00     59.09
raidz3 wide  False  46     4       92           86.00     76.99
raidz3 wide  True   11     8       88           64.00     57.30
stripe       False  0      0       96           96.00     85.95
** As of 2009.Q3, the raidz2 wide profile has been deprecated.
** New configurations should use the raidz3 wide profile.