Ramblings from Richard's Ranch

ZFS RAID recommendations: space, performance, and MTTDL

Tuesday Jan 30, 2007

In this blog I connect the space vs MTTDL models with a performance model. This gives enough information for you to make RAID configuration trade-offs for various systems. The examples here target the Sun Fire X4500 (aka Thumper) server, but the models work for other systems too. In particular, the model results apply generically to RAID implementations on just a bunch of disks (JBOD) systems.

The best thing about a model is that it is a simplification of real life.
The worst thing about a model is that it is a simplification of real life.

Small, Random Read Performance Model

For this analysis, we will use a small, random read performance model. The calculations for the model can be made with data which is readily available from disk data sheets. We calculate the expected I/O operations per second (iops) based on the average read seek and rotational speed of the disk. We don't consider the command overhead, as it is generally small for modern drives and is not always specified in disk data sheets.

maximum rotational latency = 60,000 (ms/min) / rotational speed (rpm)

iops = 1000 (ms/s) / (average read seek time (ms) + (maximum rotational latency (ms) / 2))

Since most disks use consistent rotational speeds, this small table may help you to see what the rotational speed contribution will be.

Rotational Speed (rpm)    Maximum Rotational Latency (ms)
 4,200                    14.3
 5,400                    11.1
 7,200                     8.3
10,000                     6.0
15,000                     4.0

For example, if we have a 73 GByte, 2.5" Seagate Savvio SAS drive which has a 4.1 ms average read seek and rotational speed of 10,000 rpm:

iops = 1000 / (4.1 + (6.0 / 2)) = 140.8

By comparison, a 750 GByte, 3.5" Seagate Barracuda SATA drive which has an 8.5 ms average read seek and rotational speed of 7,200 rpm:

iops = 1000 / (8.5 + (8.3 / 2)) = 79.0

I purposely used those two examples because people are always wondering why we tend to prefer smaller, faster, and (unfortunately) more expensive drives over larger, slower, less expensive drives - a 78% performance improvement is rather significant. The 3.5" drives also use about 25-75% more power than their smaller cousins, largely due to the rotating mass. Small is beautiful in a SWaP sense.
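The two calculations above can be reproduced with a few lines of Python. This is only a minimal sketch of the model as stated here. The function and variable names are my own, and the seek times and rotational speeds are the data sheet values quoted above.

```python
def max_rotational_latency_ms(rpm):
    """Time for one full revolution, in milliseconds."""
    return 60_000 / rpm


def small_random_read_iops(avg_read_seek_ms, rpm):
    """Expected small, random read iops for one disk.

    Uses average seek plus half a revolution and ignores command
    overhead, as the model in the text does.
    """
    return 1000 / (avg_read_seek_ms + max_rotational_latency_ms(rpm) / 2)


# Data sheet values quoted in the text.
sas_10k = small_random_read_iops(avg_read_seek_ms=4.1, rpm=10_000)
sata_7200 = small_random_read_iops(avg_read_seek_ms=8.5, rpm=7_200)
improvement_pct = (sas_10k / sata_7200 - 1) * 100

print(f"2.5 inch SAS, 10,000 rpm: {sas_10k:.1f} iops")
print(f"3.5 inch SATA, 7,200 rpm: {sata_7200:.1f} iops")
print(f"improvement: {improvement_pct:.0f}%")
```

Because the sketch does not round the rotational latency to 8.3 ms first, the SATA result comes out a hair under the 79.0 shown above. The roughly 78% difference is unchanged.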

Next we use the RAID set configuration information to calculate the total small, random read iops for the zpool or volume. Here we need to talk about sets of disks which may make up a multi-level zpool or volume. For example, RAID-1+0 is a stripe (RAID-0) of mirrored sets (RAID-1). RAID-0 is a stripe of disks.

  • For dynamic striping (RAID-0), add the iops for each set or disk. On average the iops are spread randomly across all sets or disks, gaining concurrency.

  • For mirroring (RAID-1), add the iops for each set or disk. For reads, any set or disk can satisfy a read, so we also get concurrency.

  • For single parity raidz (RAID-5), the set operates at the performance of one disk. See below.

  • For double parity raidz2 (RAID-6), the set operates at the performance of one disk. See below.

For example, if you have 6 disks, then there are many different ways you can configure them, with varying performance calculations:

RAID Configuration (6 disks)                       Small, Random Read Performance Relative to a Single Disk
6-disk dynamic stripe (RAID-0)                     6
3-set dynamic stripe, 2-way mirror (RAID-1+0)      6
2-set dynamic stripe, 3-way mirror (RAID-1+0)      6
6-disk raidz (RAID-5)                              1
2-set dynamic stripe, 3-disk raidz (RAID-5+0)      2
2-way mirror, 3-disk raidz (RAID-5+1)              2
6-disk raidz2 (RAID-6)                             1
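The rules above compose naturally for multi-level configurations. Here is a rough Python sketch that applies them recursively to the 6-disk examples. The tuple representation and the helper names (stripe, mirror, raidz, raidz2) are invented for illustration and have nothing to do with how ZFS describes a zpool.

```python
DISK = ("disk",)


def stripe(*members):
    return ("stripe", members)


def mirror(*members):
    return ("mirror", members)


def raidz(n_disks):
    return ("raidz", n_disks)


def raidz2(n_disks):
    return ("raidz2", n_disks)


def relative_read_iops(config):
    """Small, random read iops relative to a single disk."""
    kind = config[0]
    if kind == "disk":
        return 1
    if kind in ("raidz", "raidz2"):
        # A parity set operates at the performance of one disk.
        return 1
    if kind in ("stripe", "mirror"):
        # Stripes and mirrors add the iops of their members.
        return sum(relative_read_iops(m) for m in config[1])
    raise ValueError(f"unknown configuration type: {kind}")


configs = {
    "6-disk dynamic stripe (RAID-0)": stripe(*[DISK] * 6),
    "3-set dynamic stripe, 2-way mirror (RAID-1+0)": stripe(*[mirror(DISK, DISK)] * 3),
    "2-set dynamic stripe, 3-way mirror (RAID-1+0)": stripe(*[mirror(DISK, DISK, DISK)] * 2),
    "6-disk raidz (RAID-5)": raidz(6),
    "2-set dynamic stripe, 3-disk raidz (RAID-5+0)": stripe(raidz(3), raidz(3)),
    "2-way mirror, 3-disk raidz (RAID-5+1)": mirror(raidz(3), raidz(3)),
    "6-disk raidz2 (RAID-6)": raidz2(6),
}

results = {name: relative_read_iops(cfg) for name, cfg in configs.items()}
for name, value in results.items():
    print(f"{name}: {value}")
```

Multiply any result by the single-disk iops from the data sheet calculation to get an estimate for the whole zpool or volume.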

Clearly, using mirrors improves both performance and data reliability. Using stripes increases performance, at the cost of data reliability. raidz and raidz2 offer data reliability, at the cost of performance. This leads us down a rathole...

The Parity Performance Rathole

Many people expect that data protection schemes based on parity, such as raidz (RAID-5) or raidz2 (RAID-6), will offer the performance of striped volumes, except for the parity disk. In other words, they expect that a 6-disk raidz zpool would have the same small, random read performance as a 5-disk dynamic stripe. Similarly, they expect that a 6-disk raidz2 zpool would have the same performance as a 4-disk dynamic stripe. ZFS doesn't work that way, today. ZFS uses a checksum to validate the contents of a block of data written. The block is spread across the disks (vdevs) in the set. In order to validate the checksum, ZFS must read the blocks from more than one disk, thus not taking advantage of spreading unrelated, random reads concurrently across the disks. In other words, the small, random read performance of a raidz or raidz2 set is, essentially, the same as the single disk performance. The benefit of this design is that writes should be more reliable and faster because you don't have the RAID-5 write hole or read-modify-write performance penalty.
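To put numbers on the gap between expectation and model, here is a tiny Python sketch using the 6-disk examples from the paragraph above. Both functions are my own shorthand for the two points of view, not measurements of any real system.

```python
def expected_relative_iops(disks, parity_disks):
    """What many people expect: a stripe across the non-parity disks."""
    return disks - parity_disks


def modeled_relative_iops(disks, parity_disks):
    """What the model in this blog uses for raidz and raidz2.

    Each block is spread across the set, so the set behaves like
    one disk for small, random reads regardless of its width.
    """
    return 1


comparison = {}
for name, disks, parity in [("6-disk raidz", 6, 1), ("6-disk raidz2", 6, 2)]:
    comparison[name] = (
        expected_relative_iops(disks, parity),
        modeled_relative_iops(disks, parity),
    )
    expected, modeled = comparison[name]
    print(f"{name}: expected {expected}x a single disk, modeled {modeled}x")
```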

Many people also think that this is a design deficiency. As a RAS guy, I value the data validation offered by the checksum over the performance supposedly gained by RAID-5. Reasonable people can disagree, but perhaps some day a clever person will solve this for ZFS.

So, what do other logical volume managers or RAID arrays do? The results seem mixed. I have seen some RAID array performance characterization data which is very similar to the ZFS performance for parity sets. I have heard anecdotes that other implementations will read the blocks and only reconstruct a failed block as needed. The problem is, how do such systems know that a block has failed? Anecdotally, it seems that some of them trust what is read from the disk. To implement a per-disk block checksum verification, you'd still have to perform at least two reads from different disks, so it seems to me that you are trading off data integrity for performance. In ZFS, data integrity is paramount. Perhaps there is more room for research here, or perhaps it is just one of those engineering trade-offs that we must live with.

Other Performance Models

I'm also looking for other performance models which can be applied to generic disks with data that is readily available to the public. The reason that the small, random read iops model works is that it doesn't need to consider caching or channel resource utilization. Adding these variables would require some knowledge of the configuration topology and the cache policies (which may also change with firmware updates). I've kicked around the idea of a total disk bandwidth model which will describe a range of possible bandwidths based upon the media speed of the drives, but it is not clear to me that it will offer any satisfaction. Drop me a line if you have a good model or further thoughts on this topic.

You should be cautious about extrapolating the performance results described here to other workloads. You could consider this to be a worst-case model because it assumes 0% disk cache hits. I would hope that most workloads exhibit better performance, but rather than guessing (hoping) the best way to find out is to run the workload and measure the performance. If you characterize a number of different configurations, then you might build your own performance graphs which fit your workload.

Putting It All Together

Now we have a method to compare a variety of different ZFS or RAID disk configurations by evaluating space, performance, and MTTDL. First, let's look at single parity schemes such as 2-way mirrors and raidz on the Sun Fire X4500 (aka Thumper) server.

Single Parity Model Results 

Here you can see that a 2-way mirror (RAID-1, RAID-1+0) has better performance and MTTDL than raidz for any specific space requirement except for the case where we run out of hot spares for the 2-way mirror (using all 46 disks for data). By contrast, all of the raidz configurations here have hot spares. You can use this to help make design trade-offs by prioritizing space, performance, and MTTDL.
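If you want to tabulate the space and performance side of this comparison yourself, a rough Python sketch follows. It covers only space and small, random read iops, both in units of one disk, and leaves MTTDL out because that model is not described in this entry. The usable-space rule (set width minus parity disks) and the particular set widths and hot spare counts are assumptions chosen for illustration, not the configurations plotted in the graph.

```python
TOTAL_DISKS = 46  # disks available for data and hot spares on the X4500


def mirror_layout(ways, hot_spares, total_disks=TOTAL_DISKS):
    """Space and relative read iops for a stripe of N-way mirrors."""
    sets = (total_disks - hot_spares) // ways
    return {
        "sets": sets,
        "space": sets,             # one disk of usable space per mirror set
        "read_iops": sets * ways,  # any disk in any set can satisfy a read
    }


def raidz_layout(width, parity, hot_spares, total_disks=TOTAL_DISKS):
    """Space and relative read iops for a stripe of raidz or raidz2 sets."""
    sets = (total_disks - hot_spares) // width
    return {
        "sets": sets,
        "space": sets * (width - parity),  # assumed: width minus parity disks
        "read_iops": sets,                 # each set performs like one disk
    }


# Hypothetical candidate layouts, for illustration only.
candidates = {
    "2-way mirror, 0 hot spares": mirror_layout(ways=2, hot_spares=0),
    "2-way mirror, 2 hot spares": mirror_layout(ways=2, hot_spares=2),
    "5-disk raidz, 1 hot spare": raidz_layout(width=5, parity=1, hot_spares=1),
    "9-disk raidz, 1 hot spare": raidz_layout(width=9, parity=1, hot_spares=1),
}

for name, result in candidates.items():
    print(f"{name}: space={result['space']} disks, "
          f"read iops={result['read_iops']}x a single disk")
```

Multiply the space figure by the disk capacity and the iops figure by the single-disk iops to get absolute numbers for a particular drive.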

You'll also note that I did not label the left-side Y axis (MTTDL) again, but I did label the right-side Y axis (small, random read iops). I did this with mixed emotions. I left the MTTDL axis values unlabeled for the reasons I explained previously, but I did label the performance axis so that you can do a rough comparison to the double parity graph below. Note that in the double parity graph, the MTTDL axis is in units of millions of years, rather than the years used in the graph above.

Double Parity Model Results

Here you can see the same sort of comparison between 3-way mirrors and raidz2 sets. The mirrors still outperform the raidz2 sets and have better MTTDL.

Some people ask me which layout I prefer. For the vast majority of cases, I prefer mirroring. Then, of course, we get into some sort of discussion about how raidz, raidz2, RAID-5, RAID-6 or some other configuration offers more GBytes/$. For my 2007 New Year's resolution I've decided to help make the world a happier place. If you want to be happier, you should use mirroring with at least one hot spare.

Conclusion

We can make design trade-offs between space, performance, and MTTDL for disk storage systems. As with most engineering decisions, there often is not a clear best solution given all of the possible solutions. By using some simple models, we can see the trade-offs more clearly.


Comments:

Sure, a 15K RPM drive gives a 78% performance improvement, but at a huge price premium (over 200%). You might as well just buy twice as many 7200 RPM drives and hope that the extra power/space cost is less than the savings in capital cost. Of course, you can't order Thumper with 80GB drives...

Posted by Wes Felter on January 30, 2007 at 01:26 PM PST #

Wes, the 73 GByte drives are being replaced with 146 GByte drives, but the critical change in my example is the size. We now have 1 TByte 3.5" drives, and anticipate 500 GByte 2.5" drives in the next year or so. This also reminds me of the age old problem: inexpensive, fast, or reliable. Pick any two.

Posted by Richard Elling on January 30, 2007 at 02:11 PM PST #

Just to make sure I have things right: Given (by the ZFS layer) a block D of data to store, RAID-Z will first split the block into several smaller blocks D_1..D_n as needed and calculate the parity block P from those. (n is the stripe width for this write.) Then D_1..D_n and P are written to separate disks. When you request data from block D, ZFS has to read D_1..D_n and calculate the checksum over that to ensure data integrity, taking up bandwidth on those n disks. This makes read performance of a RAID-Z pool be the same as that of a single disk, even if you only needed a small read from block D. RAID-Z as it stands now in fact reinforces the layer principle behind "traditional" filesystem+volume manager systems.

Now consider ditto blocks. They implement a flexible RAID-1 mirror, by merging the filesystem and volume layers. What if ZFS had parity blocks?

Try this scenario: Given data to store, that data is stored in regular ZFS blocks, and a parity block is calculated. The data and parity blocks are laid out across the available disks in the pool. When you need data from one of those blocks, only one block needs to be read to be able to calculate its checksum. If it is corrupt, the other data blocks and the parity block can be used to recreate its contents.

That would implement something akin to RAID-5 storage, but "the ZFS way", merging filesystem and volume layers. The read performance would be the same as that of ditto blocks. On top of that, you would be able to store unrelated data blocks in the same stripe, thus ensuring maximum disk bandwidth and minimum parity size overhead. If an application asks for fsync(), a smaller stripe could be written. No RAID-5 write hole, RAID-1 read performance, tunable redundancy.

Am I missing something? Is there something that would prevent this scenario? Wout.

Posted by Wout Mertens on January 31, 2007 at 05:19 AM PST #

Doesn't ZFS provide checksumming of data for all redundancy levels, even if I use a level without redundancy (such as RAID0)?

As far as I understand, the checksum is then stored near the data, or one level up the metadata hierarchy. Why isn't that scheme used for RAID-Z too, that is, store a checksum in the data block itself, on the same disk?
That way one could have the data verification provided by the checksum, and still be able to employ read concurrency. I don't think silent data corruption that is consistent with the checksum is very likely.

Posted by Florian Laws on January 31, 2007 at 08:13 AM PST #

Wout's comment has been redirected to the OpenSolaris ZFS forum at http://www.opensolaris.org/jive/thread.jspa?threadID=23093&tstart=0
The conversation is interesting...

Posted by Richard Elling on January 31, 2007 at 12:14 PM PST #

Florian,
Yes, ZFS does provide checksumming. Anecdotally, about once a month someone posts to the ZFS forum that ZFS detected some corruption which was previously undetected. In other words, our previous thoughts about silent data corruption being very rare may not be correct. Now that we are looking for corruption, we are finding it more often and from more causes than we'd like. I believe that we will see this sort of checksumming become more and more common in the future.

Posted by Richard Elling on January 31, 2007 at 12:21 PM PST #

What I was trying to say was:
Since ZFS already has in-block checksumming, why not use this for detecting silent corruption even on RAID-Z, instead of going to several disks?

Posted by Florian Laws on January 31, 2007 at 01:35 PM PST #

The confusion occurs in the operational definition of block. ZFS checksums are done at the file system block, not the disk block. Perhaps a picture would help, I'll see what I can dig up.

Posted by Richard Elling on January 31, 2007 at 02:54 PM PST #
