Ramblings from Richard's Ranch

A story of two MTTDL models

Wednesday Jan 17, 2007

Mean Time to Data Loss (MTTDL) is a metric that we find useful for comparing data storage systems. I think it is particularly useful for determining what sort of data protection you may want to use for a RAID system. For example, suppose you have a Sun Fire X4500 (aka Thumper) server with 48 internal disk drives. What would be the best way to configure the disks for redundancy? Previously, I explored space versus MTTDL and space versus unscheduled Mean Time Between System Interruptions (U_MTBSI) for the X4500 running ZFS. The same analysis works for SVM or LVM, too.

For this blog, I want to explore the calculation of MTTDL for a bunch of disks. It turns out that there are multiple models for calculating MTTDL. The one described previously here is the simplest and only considers the Mean Time Between Failure (MTBF) of a disk and the Mean Time to Repair (MTTR) of the repair and reconstruction process. I'll call that model #1, which solves for MTTDL[1]. To quickly recap:

For non-protected schemes (dynamic striping, RAID-0):
MTTDL[1] = MTBF / N
For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):
MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
For double parity schemes (3-way mirror, raidz2, RAID-6):
MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
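
The three MTTDL[1] formulas above can be sketched in a few lines of Python. This is only an illustration: the function name mttdl1, the parity argument, and the example MTBF and MTTR values are assumptions of mine, not numbers from any data sheet.

```python
HOURS_PER_YEAR = 24 * 365

def mttdl1(mtbf_hours, mttr_hours, n_disks, parity):
    """MTTDL[1] in hours for one redundancy set of n_disks.

    parity: 0 = no protection, 1 = single parity, 2 = double parity.
    """
    n = n_disks
    if parity == 0:
        return mtbf_hours / n
    if parity == 1:
        return mtbf_hours ** 2 / (n * (n - 1) * mttr_hours)
    if parity == 2:
        return mtbf_hours ** 3 / (n * (n - 1) * (n - 2) * mttr_hours ** 2)
    raise ValueError("parity must be 0, 1, or 2")

# Assumed example inputs, chosen only to exercise the formulas.
mtbf = 1_000_000.0  # hours
mttr = 24.0         # hours

for label, n, parity in [
    ("2-disk stripe", 2, 0),
    ("2-way mirror", 2, 1),
    ("3-way mirror", 3, 2),
]:
    years = mttdl1(mtbf, mttr, n, parity) / HOURS_PER_YEAR
    print(f"{label}: {years:.3g} years")
```

MTBF and MTTR must be in the same time unit (hours here), and the result comes back in that unit.
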
You can often get MTBF data from your drive vendor and you can measure or estimate your MTTR with reasonable accuracy. But MTTDL[1] does not consider the Unrecoverable Error Rate (UER) for read operations on disk drives. It turns out that the UER is often easier to get from the disk drive data sheets, because sometimes the drive vendors don't list MTBF (or Annual Failure Rate, AFR) for all of their drive models. Typically, UER will be 1 per 10^14 bits read for consumer class drives and 1 per 10^15 for enterprise class drives. This can be alarming, because you could also say that consumer class drives should see 1 UER per 12.5 TBytes of data read. Today, 500-750 GByte drives are readily available and 1 TByte drives are announced. Most people will be unhappy if they get an unrecoverable read error once every dozen or so times they read the whole disk. Worse yet, if we have the data protected with RAID and we have to replace a drive, we really do hope that the data reconstruction completes correctly. To add to our nightmare, the UER does not decrease by adding disks. If we can't rely on the data to be correctly read, we can't be sure that our data reconstruction will succeed, and we'll have data loss. Clearly, we need a model which takes this into account. Let's call that model #2, for MTTDL[2]:
First, we calculate the probability of unsuccessful reconstruction due to a UER for N disks of a given size (unit conversion omitted):
Precon_fail = (N-1) * size / UER
For single-disk failure protection:
MTTDL[2] = MTBF / (N * Precon_fail)
For double-disk failure protection:
MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
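
Here is a minimal Python sketch of model #2, including the unit conversion that the formulas above leave out. The function names, the bytes-to-bits conversion, and the cap of the probability at 1.0 are my own choices for illustration, as are the example inputs.

```python
def p_recon_fail(n_disks, disk_bytes, uer_bits):
    """Precon_fail = (N-1) * size / UER, with size converted to bits.

    uer_bits is the number of bits read per unrecoverable error
    (for example 1e14). The result is capped at 1.0, which is an
    assumption of this sketch so that it stays a valid probability.
    """
    bits_to_read = (n_disks - 1) * disk_bytes * 8
    return min(1.0, bits_to_read / uer_bits)

def mttdl2(mtbf_hours, mttr_hours, n_disks, parity, disk_bytes, uer_bits):
    """MTTDL[2] in hours; parity is 1 (single) or 2 (double)."""
    n = n_disks
    p = p_recon_fail(n, disk_bytes, uer_bits)
    if parity == 1:
        return mtbf_hours / (n * p)
    if parity == 2:
        return mtbf_hours ** 2 / (n * (n - 1) * mttr_hours * p)
    raise ValueError("model #2 needs parity 1 or 2 (no reconstruction in RAID-0)")

# Assumed example: 2-way mirror of 500 GByte consumer class drives.
p = p_recon_fail(n_disks=2, disk_bytes=500e9, uer_bits=1e14)
hours = mttdl2(1_000_000.0, 24.0, 2, 1, 500e9, 1e14)
print(f"Precon_fail = {p:.3f}, MTTDL[2] = {hours / (24 * 365):.3g} years")
```

With these assumed inputs, Precon_fail comes out at about 0.04, since rebuilding the mirror means reading one 500 GByte disk (4 * 10^12 bits) against a UER of 1 per 10^14 bits.
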
Comparing the MTTDL[1] model to the MTTDL[2] model shows some interesting aspects of design. First, there is no MTTDL[2] model for RAID-0 because there is no data reconstruction – any failure and you lose data. Second, the MTTR doesn't enter into the MTTDL[2] model until you get to double-disk failure scenarios. You could nitpick about this, but as you'll soon see, it really doesn't make any difference for our design decision process. Third, you can see that the Precon_fail is a function of the size of the data set. This is because the UER doesn't change as you grow the data set. Or, to look at it from a different direction, if you use consumer class drives with 1 UER for 10^14 bits, and you have 12.5 TBytes of data, the probability of an unrecoverable read during the data reconstruction is 1. Ugh. If the Precon_fail is 1, then the MTTDL[2] model looks a lot like the RAID-0 model and friends don't let friends use RAID-0! Maybe you could consider a smaller sized data set to offset this risk. Let's see how that looks in pictures.
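
As a quick aside before the graphs, the 12.5 TByte figure is easy to check with a little arithmetic. The sketch below assumes decimal units (1 TByte = 10^12 bytes), as drive vendors use; the helper names are hypothetical.

```python
GB = 1e9   # decimal gigabyte, as used on drive data sheets
TB = 1e12  # decimal terabyte

def bytes_per_expected_error(uer_bits):
    """Bytes read, on average, per unrecoverable read error."""
    return uer_bits / 8

def full_reads_per_expected_error(disk_bytes, uer_bits):
    """How many whole-disk reads it takes, on average, to hit one error."""
    return bytes_per_expected_error(uer_bits) / disk_bytes

for name, uer in [("consumer (1 per 10^14 bits)", 1e14),
                  ("enterprise (1 per 10^15 bits)", 1e15)]:
    print(f"{name}: {bytes_per_expected_error(uer) / TB:g} TBytes per error")

for size in (500 * GB, 750 * GB, 1 * TB):
    reads = full_reads_per_expected_error(size, 1e14)
    print(f"{size / GB:g} GByte consumer drive: {reads:.1f} full reads per error")
```

For a 1 TByte consumer class drive this works out to 12.5 full reads per expected error, which is where "once every dozen or so times" comes from.
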

[Figure: MTTDL models for 2-way mirror]

2-way mirroring is an example of a configuration which provides single-disk failure protection. Each data point represents the space available when using 2-way mirrors in a zpool. Since this is for an X4500, we consider 46 total available disks and any disks not used for data are available as spares. In this graph you can clearly see that the MTTDL[1] model encourages the use of hot spares. More importantly, although the results of the calculations of the two models are around 5 orders of magnitude different, the overall shape of the curve remains the same. Keep in mind that we are talking years here, perhaps 10 million years, which is well beyond the 5-year expected life span of a disk. This is the nature of the beast when using a constant MTBF. For models which consider the change in MTBF as the device ages, you should never see such large numbers. But the wish for more accurate models does not change the relative merits of the design decision, which is what we really care about – the best RAID configuration given a bunch of disks. Should I use single disk failure protection or double disk failure protection? To answer that, let's look at the model for raidz2.

[Figure: MTTDL models for raidz2]

From this graph you can see that double disk protection is clearly better than the single disk protection above, regardless of which model we choose. Good, this makes sense. You can also see that with raidz2 we have a larger number of disk configuration options. A 3-disk raidz2 set is somewhat similar to a 3-way mirror with the best MTTDL, but doesn't offer much available space. A 4-disk set will offer better space, but not quite as good MTTDL. This pattern continues through 8 disks/set. Judging from the graphs, you should see that a 3-disk set will offer approximately an order of magnitude better MTTDL than an 8-disk set, for either MTTDL model. This is because the UER remains constant while the data to be reconstructed increases.
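
The order-of-magnitude gap between 3-disk and 8-disk raidz2 sets can be roughed out from the formulas alone, because MTBF, MTTR, disk size and UER all cancel in the ratio. The sketch below rests on two assumptions of mine that the formulas above do not state: the pool's MTTDL is the per-set MTTDL divided by the number of sets, and all 46 disks are carved into as many whole sets as fit (leftovers become spares). The graphs may have been built differently.

```python
def set_cost(n_disks, model):
    """Denominator terms of the double parity formulas that depend on N.

    Model 1: N * (N-1) * (N-2)
    Model 2: N * (N-1) * (N-1), the extra (N-1) coming from Precon_fail.
    """
    n = n_disks
    if model == 1:
        return n * (n - 1) * (n - 2)
    if model == 2:
        return n * (n - 1) * (n - 1)
    raise ValueError("model must be 1 or 2")

def relative_pool_mttdl(n_small, n_large, total_disks, model):
    """How many times better the small-set pool is than the large-set pool.

    Assumes pool MTTDL = set MTTDL / number of sets.
    """
    sets_small = total_disks // n_small
    sets_large = total_disks // n_large
    return (set_cost(n_large, model) * sets_large) / (
        set_cost(n_small, model) * sets_small
    )

for model in (1, 2):
    ratio = relative_pool_mttdl(3, 8, 46, model)
    print(f"MTTDL[{model}]: 3-disk sets are about {ratio:.1f}x better than 8-disk sets")
```

Under these assumptions the ratio is roughly 19x for MTTDL[1] and 11x for MTTDL[2], which is in line with the "approximately an order of magnitude" read off the graphs.
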
I hope that these models give you an insight into how you can model systems for RAS. In my experience, most people get all jazzed up with the space and forget that they are often making a space vs. RAS trade-off. You can use these models to help you make good design decisions when configuring RAID systems. Since the graphs use Space on the X-axis, it is easy to look at the design trade-offs for a given amount of available space.

Just one more teaser... there are other MTTDL models, but it is unclear if they would help make better decisions, and I'll explore those in another blog.

Comments:

[Trackback] Now his blog is on my list of independently subscribed Sun Blogs: Richard goes into the math of the "Mean time to data loss" calculation. A real must-read!

Posted by c0t0d0s0.org on January 17, 2007 at 10:22 PM PST #

Hey Richard. Great post! Just a couple quick thoughts. First, a 3rd MTTDL model might take into account backup periodicity and the window of vulnerability of new data before it is backed up (unrecoverable data). Also, you might add a factor to UER to take into account the utilization of disks. A 12TB disk that averages 50% utilization has only half the probability of a data loss event. Finally, a model might take into account the stats about data reuse. If 70% of stored data is never again accessed, then, like the tree that falls in woods with no one to hear, does a bad bit in a never again accessed data location matter? Of course, Murphy will ensure that it will matter! :-) Anyway, good stuff. Thanks.

Posted by Dave Brillhart on January 18, 2007 at 03:22 AM PST #

Hi Dave, you are a great straight man! Stay tuned...

Posted by Richard Elling on January 18, 2007 at 10:10 AM PST #

Richard, Thanks for continuing your honest discussion with customers.
With respect to ZFS on these RAID Z sets I am interested to know whether the changes in ZFS allow it to recover from disk unrecoverable read error scenarios.
ZFS actually addresses the disks, unlike other FS or FS using hardware RAID controllers, and as such will be able to identify the cluster that reported the unrecoverable error. Can ZFS use its integrated checksum to repair a single cluster error?
If so then the MTTDL for ZFS would be substantially better than competing filesystems.
The unrecoverable read error rate on the SATA disk platters is actually much higher than 1E14 but is masked by the sector CRC (thus the BER events being at sector level not bytes read).
If ZFS can recover a single sector error in a file as another layer of recovery the probability of data loss drops dramatically.
It is also interesting to compare the specifications of SATA and FC disks from the same manufacturer and observe the same platters and electronics. The SATA disks have slightly larger unformatted space but a higher BER whilst the FC disks are slightly smaller and have a lower BER. Do you also suspect that the apparent size loss for FC is down to more CRC per sector and more spare sectors for bad sector remap?

Posted by Liam Newcombe on January 18, 2007 at 03:15 PM PST #

If you had some time, help or your opinions would be appreciated: I have a 939 Gigabyte motherboard, with 4 SATAII ports on the nForce4 chipset, and 4 SATA ports off the SIL3114 controller. I recently purchased 5, 320gig SATAII drives... http://tinyurl.com/yf5z9o I wanted to install Solaris86 on this machine and turn it into a reliable home NAS server. ZFS looks very attractive, but I don't believe it can be used for a boot drive. How would you set up this system? I can purchase additional SATA or IDE hard drives... For example, I could get 3 more 320gig SATAII drives, and fill all the SATA ports, and hook up an IDE drive as the system boot drive.

Posted by Rob on January 22, 2007 at 09:37 AM PST #
