Ramblings from Richard's Ranch

ZFS RAID recommendations: space vs MTTDL

Thursday Jan 11, 2007

It is not always obvious what the best RAID set configuration is for a given set of disks. This becomes even harder to see as the number of disks grows large, as on a Sun Fire X4500 (aka Thumper) server. By default, the X4500 ships with 46 disks available for data. This leads to hundreds of possible permutations of RAID sets. Which would be best? One analysis is the trade-off between space and Mean Time To Data Loss (MTTDL). For this blog, I will try to stick with ZFS terminology in the text, but the principles apply to other RAID systems, too.

The space calculation is straightforward.  Configure the RAID sets and sum the total space available.

MTTDL is one attribute of Reliability, Availability, and Serviceability (RAS) that we can also calculate relatively easily. For large numbers of disks, MTTDL is particularly useful because we only need to consider the reliability of the disks, and not the other parts of the system (fodder for a later blog :-). While this doesn't tell the whole RAS story, it is a very good method for evaluating a big bunch of disks. The equations are fairly straightforward:

For non-protected schemes (dynamic striping, RAID-0):

MTTDL = MTBF / N

For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):

MTTDL = MTBF^2 / (N * (N-1) * MTTR)

For double parity schemes (3-way mirror, raidz2, RAID-6):

MTTDL = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)

Where N is the number of disks in the set, MTBF is the Mean Time Between Failures, and MTTR is the Mean Time To Recover. You can get MTBF values from disk data sheets, which are usually readily available. You could also adjust them for your situation or based upon your actual experience. At Sun, we have many years of field failure data for disks and use design values which are consistent with our experiences. YMMV, of course. For MTTR you need to consider the logistical repair time, which is usually the time required to identify the failed disk and physically replace it. You also need to consider the data reconstruction time, which may be long for large disks, depending on how rapidly ZFS or your logical volume manager (LVM) will reconstruct the data. Obviously, a spreadsheet or tool helps ease the computational burden.
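As a minimal sketch, the three equations above can be written as small Python functions. The function names and the MTBF, MTTR, and N values in the example are illustrative assumptions, not Sun design values. MTBF and MTTR are taken in hours, so the result is in hours, too.

```python
def mttdl_no_protection(mtbf, n):
    """Dynamic striping / RAID-0: MTTDL = MTBF / N."""
    return mtbf / n


def mttdl_single_parity(mtbf, n, mttr):
    """2-way mirror, raidz, RAID-1, RAID-5: MTBF^2 / (N * (N-1) * MTTR)."""
    return mtbf ** 2 / (n * (n - 1) * mttr)


def mttdl_double_parity(mtbf, n, mttr):
    """3-way mirror, raidz2, RAID-6: MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)."""
    return mtbf ** 3 / (n * (n - 1) * (n - 2) * mttr ** 2)


# Assumed example inputs (hypothetical, for illustration only).
MTBF_HOURS = 1_000_000.0   # assumed data sheet value
MTTR_HOURS = 24.0          # assumed repair plus reconstruction time
N_DISKS = 6                # assumed number of disks in the set

for label, hours in (
    ("stripe", mttdl_no_protection(MTBF_HOURS, N_DISKS)),
    ("raidz", mttdl_single_parity(MTBF_HOURS, N_DISKS, MTTR_HOURS)),
    ("raidz2", mttdl_double_parity(MTBF_HOURS, N_DISKS, MTTR_HOURS)),
):
    print(f"{label:7s} MTTDL = {hours:.3e} hours")
```

Only the relative ordering of the results is meaningful here; the absolute numbers depend entirely on the assumed inputs.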

Note: the reconstruction time for ZFS is a function of the amount of data, not the size of the disk. Traditional LVMs or hardware RAID arrays have no knowledge of which parts of the disk hold data and therefore have to reconstruct the entire disk rather than just the data. In the best case (0% used), ZFS will reconstruct the data almost instantaneously. In the worst case (100% used), ZFS will have to reconstruct the entire disk, just like a traditional LVM. This is one of the advantages of ZFS over traditional LVMs: faster reconstruction time, lower MTTR, better MTTDL.
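To make the difference concrete, here is a rough sketch of a reconstruction time estimate. It assumes a simple linear model (time proportional to the amount copied at a constant rate), which is a simplification and not how ZFS actually schedules its work. The disk size, fill level, and rebuild rate are made-up example values.

```python
def reconstruction_hours(disk_gb, used_fraction, rebuild_mb_per_s, data_aware):
    """Estimate reconstruction time in hours under an assumed linear model.

    data_aware=True  -> only the used data is reconstructed (ZFS-like)
    data_aware=False -> the whole disk is reconstructed (traditional LVM)
    """
    if not 0.0 <= used_fraction <= 1.0:
        raise ValueError("used_fraction must be between 0 and 1")
    gb_to_copy = disk_gb * used_fraction if data_aware else disk_gb
    seconds = gb_to_copy * 1000.0 / rebuild_mb_per_s  # 1 GB taken as 1000 MB
    return seconds / 3600.0


# Assumed example: 500 GB disk, 30% full, 50 MB/s sustained rebuild rate.
zfs_like = reconstruction_hours(500, 0.30, 50, data_aware=True)
whole_disk = reconstruction_hours(500, 0.30, 50, data_aware=False)
print(f"data-aware: {zfs_like:.2f} h, whole disk: {whole_disk:.2f} h")
```

In this model the 0% case comes out as exactly zero; in practice there is some small fixed overhead, hence "almost instantaneously."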

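Note: if you have a multi-level RAID set, such as RAID-1+0, then you need to use both the single parity and no protection MTTDL calculations to get the MTTDL of the top-level volume: apply the single parity calculation to each mirror, then the no protection calculation across the mirrors.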
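A sketch of the two-step calculation for a stripe of 2-way mirrors follows. It assumes that each mirror can be treated as a single "disk" whose MTBF is the mirror's MTTDL when applying the no protection formula; the function name and input values are hypothetical.

```python
def mttdl_striped_mirrors(mtbf, mttr, mirrors):
    """MTTDL of a RAID-1+0 style pool built from `mirrors` 2-way mirrors.

    Step 1: single parity formula for one 2-way mirror (N = 2).
    Step 2: no protection formula across the mirrors, using the
            per-mirror MTTDL in place of MTBF (an assumption of this sketch).
    """
    per_mirror = mtbf ** 2 / (2 * (2 - 1) * mttr)
    return per_mirror / mirrors


# Assumed example: 23 mirrors (46 disks), MTBF 1,000,000 h, MTTR 24 h.
print(f"{mttdl_striped_mirrors(1_000_000.0, 24.0, 23):.3e} hours")
```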

So, I took a large number of possible ZFS configurations for an X4500 and calculated the space and MTTDL for the zpool. The interesting thing is that the various RAID protection schemes fall out in clumps. For example, you would expect that a 3-way mirror has better MTTDL and less available space than a 2-way mirror. As you vary the configurations, you can see the changes in space and MTTDL, but you would never expect a 2-way mirror to have better MTTDL than a 3-way mirror. The result is that if you plot the available space against the MTTDL, then the various RAID configurations will tend to clump together.

[Figure: X4500 MTTDL vs Space]

The most obvious conclusion from the above data is that you shouldn't simply use dynamic striping or RAID-0. Friends don't let friends use RAID-0!

You will also notice that I've omitted the values on the MTTDL axis. You may have noticed that the MTTDL axis uses a log scale, so that should give you a clue as to the magnitude of the differences. The real reason I've omitted the values is that they are a rat hole opportunity with a high entrance probability. It really doesn't matter if the MTTDL calculation shows that you should see a trillion years of MTTDL, because the expected lifetime of a disk is on the order of 5 years. I don't really expect any disk to last more than a decade or two. What you should take away from this is that bigger MTTDL is better, and you get a much bigger MTTDL as you increase the number of redundant copies of the data. It is better to stay out of the MTTDL value rat hole.

The other obvious conclusion is that you should use hot spares. The reason for this is that when a hot spare is used, the MTTR is decreased because we don't have to wait for the physical disk to be replaced before we start data reconstruction on a spare disk. The time you must wait for the data to be reconstructed and available is time during which you are exposed to another failure which may cause data loss. In general, you always want to increase MTBF (the numerator) and decrease MTTR (the denominator) to get high RAS.
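Here is a small sketch of the hot spare effect for a single parity set. It assumes MTTR is simply the logistical repair time plus the reconstruction time, and that a hot spare removes the logistical part entirely. All of the numbers are made-up examples.

```python
def mttr_hours(reconstruct_hours, logistical_hours, hot_spare):
    """Assumed model: a hot spare lets reconstruction start immediately."""
    return reconstruct_hours + (0.0 if hot_spare else logistical_hours)


def mttdl_single_parity(mtbf, n, mttr):
    """MTTDL = MTBF^2 / (N * (N-1) * MTTR), as given earlier."""
    return mtbf ** 2 / (n * (n - 1) * mttr)


# Assumed example: 8 h to reconstruct, 40 h to get a replacement disk installed.
MTBF_HOURS, N_DISKS = 1_000_000.0, 6
with_spare = mttdl_single_parity(MTBF_HOURS, N_DISKS, mttr_hours(8.0, 40.0, True))
without_spare = mttdl_single_parity(MTBF_HOURS, N_DISKS, mttr_hours(8.0, 40.0, False))
print(f"MTTDL improvement with a hot spare: {with_spare / without_spare:.1f}x")
```

Because MTTR is squared in the double parity equation, the same reduction in MTTR helps raidz2 even more than raidz.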

The most interesting result of this analysis is that the RAID configurations will tend to clump together. For example, there isn't much difference between the MTTDL of a 5-disk raidz zpool versus a 6-disk raidz zpool.

But if you look at this data another way, there is a huge difference in the RAS. For example, suppose you want 15,000 GBytes of space in your X4500. You could use either raidz or raidz2, with or without spares. Clearly, you would have better RAS if you choose raidz2 with spares than with any of the other options for the space requirement. Whether you use 6, 7, 8, or 9 disks in your raidz2 set makes less difference in MTTDL.
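The following sketch shows one way to enumerate candidate configurations for a space requirement. It is not the tool used for the plot. It assumes 500 GB disks (the disk size is not stated above), equal-width sets, and made-up MTBF and MTTR values. The pool MTTDL is taken as the per-set MTTDL divided by the number of sets.

```python
TOTAL_DISKS = 46   # data disks available in the X4500
DISK_GB = 500      # assumed disk size, not stated in the text


def pool_space_gb(width, parity, sets, disk_gb=DISK_GB):
    """Usable space: each set contributes (width - parity) data disks."""
    return sets * (width - parity) * disk_gb


def set_mttdl(mtbf, width, parity, mttr):
    """Per-set MTTDL from the single or double parity equation."""
    if parity == 1:
        return mtbf ** 2 / (width * (width - 1) * mttr)
    if parity == 2:
        return mtbf ** 3 / (width * (width - 1) * (width - 2) * mttr ** 2)
    raise ValueError("parity must be 1 (raidz) or 2 (raidz2)")


def candidates(required_gb, spares, mtbf, mttr):
    """List (name, width, sets, space_gb, pool_mttdl), best MTTDL first."""
    results = []
    for parity, name in ((1, "raidz"), (2, "raidz2")):
        for width in range(parity + 2, 12):
            sets = (TOTAL_DISKS - spares) // width
            if sets == 0:
                continue
            space = pool_space_gb(width, parity, sets)
            if space >= required_gb:
                pool_mttdl = set_mttdl(mtbf, width, parity, mttr) / sets
                results.append((name, width, sets, space, pool_mttdl))
    return sorted(results, key=lambda r: r[4], reverse=True)


# Assumed inputs: 2 spares, MTBF 1,000,000 h, MTTR 24 h.
for name, width, sets, space, mttdl in candidates(15_000, 2, 1_000_000.0, 24.0):
    print(f"{sets} x {width}-disk {name}: {space} GB, MTTDL {mttdl:.2e} h")
```

In this sketch the spares only reduce the number of disks available for sets; their effect on MTTR has to be reflected in the MTTR value you pass in.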

 

There are other considerations when choosing the ZFS or RAID configurations which I plan to address in later blogs. For now, I hope that this will encourage you to think about how you might approach the space and RAS trade-offs for your storage configurations.

 

Comments:

A scatter plot is good, but a table (in addition to it) would be even better. And why is it 46 drives rather than 48?

Posted by Igor on January 11, 2007 at 05:53 PM PST #

46 disks because normally you'll put the system on two mirrored disks and use the rest of the disks for your data. Now it would be interesting to see how MTTDL would change if one large pool were created from many smaller RAID groups, like: 4x 11-disk raidz2 + 2x hot spares

Posted by Robert Milkowski on January 12, 2007 at 02:06 AM PST #

Very useful, thanks! How about adding throughput and going for a 3-d plot? Also, I'd like a pony.

Posted by Dick Davies on January 12, 2007 at 02:46 AM PST #

[Trackback] Richard Elling posted a good article about the "space to mean time to data loss" trade-off considerations: ZFS RAID recommendations: space vs MTTDL

Posted by c0t0d0s0.org on January 12, 2007 at 02:49 AM PST #

Hi Robert, I do have the data for your config. Actually, I have data for all possible configs. The list shown is for a select few configs. I hate to tease so much, but it is in the blog pipeline, so stay tuned.

Posted by Richard Elling on January 12, 2007 at 08:20 AM PST #

Nice analysis, Richard. A couple of other factors come to mind based on some of our experience in Storage. First off, we find that many data loss events are caused by service or configuration errors, for example the wrong disc drive being pulled when an alert is posted. I affectionately refer to this as "i.i.d. -Not!" In some of our modeling, we've used a factor called Pcd (probability of Correct Diagnosis) to help address some of this issue. Another thing we are seeing is improvements in reconstruction times, where for example data is copied (instead of using parity reconstruction) from the failing drive when possible. Many drive failures are not catastrophic, and this approach reduces processing and bandwidth requirements. I note the previous call for a 3D view that also considers the performance tradeoffs. That would be great. I too would like a pony!

Posted by Bob Wood on January 12, 2007 at 10:45 AM PST #

Interesting post, Richard. I have also been modeling the data reliability of the Thumper using dual parity ZFS to persuade a customer to stop buying overpriced and underperforming storage from a well known vendor.
Please correct me if I have missed it, but you do not appear to be considering the Bit Error Rate (BER) for the SATA disks in this calculation. This has a substantial impact on the data reliability of both single and dual parity configurations by preventing rebuild from the no parity remaining state.
Whilst the BERs appear to be low (1 cluster failure per 1E14 bits read) for the SATA disks used in the X4500, they grow in significance quickly with the size of the arrays available on the Thumper (~7E13 bits). This is actually a major differentiator for ZFS on a Thumper against most of the big-iron vendors' SATA solutions, which only do single parity and are only really safe for use in RAID10.

Posted by Liam Newcombe on January 14, 2007 at 07:23 AM PST #

Hi Liam, you are correct. For this calculation of MTTDL, I do not consider BER (UER). However, I do have that data and will post how that looks soon.

Posted by Richard Elling on January 14, 2007 at 12:37 PM PST #

Wouldn't ZFS checksumming help prevent silent errors creeping in through a higher SATA BER?

Posted by Kai Howells on January 15, 2007 at 03:37 PM PST #

Hi Kai, ZFS checksumming will, with very high probability, detect such silent errors. In some of the ZFS white papers and presentations you'll see this referenced as "bit rot." Though that is just one mechanism contributing to BER, the principles are the same. What ZFS can do with the corrupted data depends on how it is configured. Clearly, if you want all of your data to be accessible, then you need to use some form of redundancy such as mirroring, raidz, or raidz2.

Posted by Richard Elling on February 01, 2007 at 03:11 PM PST #
