Friday May 04, 2007
OpenSolaris build 61 (or later) is now available for download. ZFS
has added a new feature that will improve data protection: redundant
copies for data (aka ditto blocks for data). Previously, ZFS stored redundant copies of metadata.
Now this feature is available for data, too.
This represents a new feature which is unique to ZFS: you can set
the data protection policy on a per-file system basis, beyond that
offered by the underlying device or volume. For single-device
systems, like my laptop with its single disk drive, this is very
powerful. I can have a different data protection policy for the files
that I really care about (my personal files) than the files that I
really don't care about or that can be easily reloaded from the OS
installation DVD. For systems with multiple disks assembled in a RAID
configuration, the data protection is not quite so obvious. Let's
explore this feature, look under the hood, and then analyze some
possible configurations.
Using Copies
To change the number of data copies, set the copies
property. For example, suppose I have a zpool named "zwimming."
The default number of data copies is 1. But you can change that to 2
quite easily.
# zfs set copies=2 zwimming
The copies property works for all new writes, so I recommend that
you set that policy when you create the file system or immediately
after you create a zpool.
You can verify the copies setting by looking at the properties.
# zfs get copies zwimming
NAME PROPERTY VALUE SOURCE
zwimming copies 2 local
ZFS will account for the space used. For example, suppose I create
three new file systems and copy some data to them. You can then see
that the space used reflects the number of copies. If you use quotas,
then the copies will be charged against the quotas, too.
# zfs create -o copies=1 zwimming/single
# zfs create -o copies=2 zwimming/dual
# zfs create -o copies=3 zwimming/triple
# cp -rp /usr/share/man1 /zwimming/single
# cp -rp /usr/share/man1 /zwimming/dual
# cp -rp /usr/share/man1 /zwimming/triple
# zfs list -r zwimming
NAME              USED  AVAIL  REFER  MOUNTPOINT
zwimming         48.2M   310M  33.5K  /zwimming
zwimming/dual    16.0M   310M  16.0M  /zwimming/dual
zwimming/single  8.09M   310M  8.09M  /zwimming/single
zwimming/triple  23.8M   310M  23.8M  /zwimming/triple
This makes sense. Each file system has one, two, or three copies
of the data and will use correspondingly one, two, or three times as
much space to store the data.
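As a rough check of that accounting, here is a minimal Python sketch. It assumes that space used scales linearly with the copies value, which ignores metadata and other overhead, so the observed numbers from the zfs list output above come in slightly under the estimate. The function name is hypothetical and is not part of ZFS.

```python
# Rough space accounting for the ZFS copies property (illustrative only).
# Assumption: space charged scales linearly with the copies value; real
# pools also store metadata, so observed numbers differ slightly.

def expected_used(logical_size: float, copies: int) -> float:
    """Estimate the space charged for data stored with a given copies value."""
    if copies not in (1, 2, 3):
        raise ValueError("copies must be 1, 2, or 3")
    return logical_size * copies

# USED column (MBytes) taken from the zfs list output above.
observed = {1: 8.09, 2: 16.0, 3: 23.8}

for copies, used in observed.items():
    estimate = expected_used(observed[1], copies)
    ratio = used / observed[1]
    print(f"copies={copies}: observed {used:.2f}M, "
          f"estimate {estimate:.2f}M, ratio to single {ratio:.2f}")
```

The same multiplier applies to quotas, since the copies are charged against them.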
Under the Covers
ZFS will spread the ditto blocks across the vdev or vdevs to
provide spatial diversity. Bill
Moore has previously blogged about this, or you can see
it in the code for yourself. From a RAS perspective, this is a
good thing. We want to reduce the possibility that a single failure,
such as a drive head impact with media, could disturb both copies of
our data. If we have multiple disks, ZFS will try to spread the
copies across multiple disks. This is different than mirroring, in
subtle ways. The actual placement is ultimately based upon available
space. Let's look at some simplified examples. First, for the default
file system configuration settings on a single disk.

Note that there are two copies of the metadata, by default. If we
have two or more copies of the data, the number of metadata copies is
three.
Suppose you have a 2-disk stripe. In that case, ZFS will try to
spread the copies across the disks.

Since the copies are created above the zpool, a mirrored zpool
will faithfully mirror the copies.
Since the copies policy is set at the file system level, not the
zpool level, a single zpool may contain multiple file systems, each
with different policies. In other words, you could have data which is
not copied allocated along with data that is copied.
Because each file system can carry its own copies policy, you can give important data more protection than unimportant data within the same zpool. This offers many more permutations of configurations for you to weigh in your designs.
RAS Modeling
It is obvious that increasing the number of data copies will
effectively reduce the amount of available space accordingly. But how
will this affect reliability? To answer that question we use the
MTTDL[2]
model I previously described, with the following changes:
First, we calculate the probability of unsuccessful
reconstruction due to a UER for N disks of a given size (unit
conversion omitted). The number of copies decreases this probability.
This makes sense as we could use another copy of the data for
reconstruction and to completely fail, we'd need to lose all copies:
Precon_fail = ((N-1) * size / UER)^copies
For single-disk failure protection:
MTTDL[2] = MTBF / (N *
Precon_fail)
For double-disk failure protection:
MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
Note that as the number of copies increases, Precon_fail
approaches zero quickly. This will increase the MTTDL. We want higher
MTTDL, so this is a good thing.
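To make the effect of copies concrete, here is a minimal Python sketch of the formulas above. The function names, the 500 GByte disk size, and the 1,000,000 hour MTBF are hypothetical values chosen for illustration. UER is expressed as bits read per unrecoverable error, and the per-copy probability is capped at 1, which is an assumption added here because the simple ratio can otherwise exceed 1.

```python
# Sketch of the MTTDL[2] model with data copies (illustrative values only).

def precon_fail(n_disks, disk_bytes, bits_per_error, copies=1):
    """Probability of unsuccessful reconstruction due to a UER.

    bits_per_error is the UER expressed as bits read per error (e.g. 1e14).
    The cap at 1.0 is an assumption added for this sketch.
    """
    bits_read = (n_disks - 1) * disk_bytes * 8
    p_one_copy = min(1.0, bits_read / bits_per_error)
    return p_one_copy ** copies

def mttdl2_single(mtbf_hours, n_disks, p_fail):
    """MTTDL[2] for single-disk failure protection, in hours."""
    return mtbf_hours / (n_disks * p_fail)

def mttdl2_double(mtbf_hours, mttr_hours, n_disks, p_fail):
    """MTTDL[2] for double-disk failure protection, in hours."""
    return mtbf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours * p_fail)

if __name__ == "__main__":
    # Hypothetical 2-way mirror of 500 GByte consumer class drives.
    for copies in (1, 2, 3):
        p = precon_fail(2, 500e9, 1e14, copies)
        hours = mttdl2_single(1_000_000, 2, p)
        print(f"copies={copies}: Precon_fail={p:.6f}, "
              f"MTTDL[2]={hours / 8760:,.0f} years")
```

Each extra copy multiplies Precon_fail by the single-copy probability again, which is why MTTDL[2] grows so quickly.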
OK, now that we can calculate available space and MTTDL, let's
look at some configurations for 46 disks available on a Sun
Fire X4500 (aka Thumper). We'll look at single parity schemes, to
reduce the clutter, but double parity schemes will show the same,
relative improvements.
You can see that we are trading off space for MTTDL. You can also
see that for raidz zpools, having more disks in the sets reduces the
MTTDL. It gets more interesting to see that the 2-way mirror with
copies=2 is very similar in space and MTTDL to the 5-disk raidz with
copies=3. Hmm. Also, the 2-way mirror with copies=1 is similar in
MTTDL to the 7-disk raidz with copies=2, though the mirror
configurations allow more space. This information may be useful as
you make trade-offs. Since the copies parameter is set per file
system, you can still set the data protection policy for important
data separately from unimportant data. This might be a good idea for
some situations where you might have permanent originals (e.g., CDs,
DVDs) and want to apply a different data protection policy.
In the future, once we have a better feel for the real performance
considerations, we'll be able to add a performance component into the
analysis.
Single Device Revisited
Now that we see how data protection is improved, let's revisit the
single device case. I use the term device here because there is a
significant change occurring in storage as we replace disk drives
with solid state, non-volatile memory devices (e.g., flash disks and
future MRAM or PRAM devices). A large number of enterprise customers
demand dual disk drives for mirroring root file systems in servers.
However, there is also a growing demand for solid state boot devices,
and we
have some Sun servers with this option. Some believe that by
2009, the majority of laptops will also have solid state devices
instead of disk drives. In the interim, there are also hybrid disk
drives.
What effect will these devices have on data retention? We know
that if the entire device completely fails, then the data is most
likely unrecoverable. In real life, these devices can suffer many
failures which result in data loss, but which are not complete device
failures. For disks, we see the most common failure is an
unrecoverable read where data is lost from one or more sectors (bar 1
in the graph below). For flash memories, there is an endurance issue
where repeated writes to a cell may reduce the probability of reading
the data correctly. If you only have one copy of the data, then the
data is lost, never to be read correctly again.
We captured disk error codes returned from a number of disk drives
in the field. The Pareto chart below shows the relationship between
the error codes. Bar 1 is the unrecoverable read which accounts for
about 24% of the errors recorded. The violet bars show recoverable
errors which did succeed. Examples of successfully recovered errors
are: write error - recovered with block reallocation, read error -
recovered by ECC using normal retries, etc. The recovered errors do
not (immediately) indicate a data loss event, so they are largely
transparent to applications. We worry more about the unrecoverable
errors.
Approximately 1/3 of the errors were unrecoverable. If such an
error occurs in ZFS metadata, then ZFS will try to read an alternate
metadata copy and repair the metadata. If the data has multiple
copies, then it is likely that we will not lose any data. This is a
more detailed view of the storage device because we are not treating
all failures as a full device failure.
Both real and anecdotal evidence suggests that unrecoverable
errors can occur while the device is still largely operational. ZFS
has the ability to survive such errors without data loss. Very cool.
Murphy's Law will ultimately catch up with you, though. In the case
where ZFS cannot recover the data, ZFS will tell you which file is
corrupted. You can then decide whether or not you should recover it
from backups or source media.
Another Single Device
Now that I've got you to think of the single device as a single
device, I'd like to extend the thought to RAID arrays. There is much
confusion amongst people about whether ZFS should or should not be
used with RAID arrays. If you search, you'll find comments and
recommendations both for and against using hardware RAID for ZFS. The main
argument is centered around the ability of ZFS to correct errors. If
you have a single device backed by a RAID array with some sort of
data protection, then previous versions of ZFS could not recover data
which was lost. Hold it right there, fella! Do I mean that RAID
arrays and the channel from the array to main memory can have errors?
Yes, of course! We have seen cases where errors were introduced
somewhere along the path between disk media to main memory where data
was lost or corrupted. Prior to ZFS, these were silent errors and
blissfully ignored. With ZFS, the checksum now detects these errors
and tries to recover. If you don't believe me, then watch the ZFS
forum on opensolaris.org where we get reports like this about
once a month or so. With ZFS copies, you can now recover from such
errors without changing the RAID array configuration.
If ZFS can correct a data error, it will attempt to do so. You now
have the option to improve your data protection even when using a
single RAID LUN. And this is the same mechanism we can use for a
single disk or flash drive: data copies. You can implement the copies
on a per-file system basis and thus have different data protection
policies even though the data is physically stored on a RAID LUN in a
hardware RAID array. I really hope we can put to rest the "ZFS
prefers JBOD" argument and just concentrate our efforts on
implementing the best data protection policies for the requirements.
ZFS with data copies is another tool in your toolbelt to improve your
life, and the life of your data.
Wednesday Jan 31, 2007
Wrapping up the thread on space, performance, and MTTDL, I thought that you might like to see one graph which would show the entire design space I've been using. Here it is:
This shows the data I've previously blogged about in scale. You can easily see that for MTTDL, double parity protection is better than single parity protection which is better than striping (no parity protection). Mirroring is also better than raidz or raidz2 for MTTDL and small, random read iops. I call this the "all-in" slide because, in a sense, it puts everything in one pot.
While this sort of analysis is useful, the problem is that there are more dimensions of the problem. I will show you some of the other models we use to evaluate and model systems in later blogs, but it might not be so easy to show so many useful factors on one graph. I'll try my best...
Tuesday Jan 30, 2007
In this blog I connect the space vs MTTDL models with a performance model. This gives enough information for you to make RAID configuration trade-offs for various systems. The examples here target the Sun Fire X4500 (aka Thumper) server, but the models work for other systems too. In particular, the model results apply generically to RAID implementations on just a bunch of disks (JBOD) systems.
The best thing about a model is that it is a simplification of
real life.
The worst thing about a model is that it is a
simplification of real life.
Small, Random Read Performance Model
For this analysis, we will use a small, random read performance
model. The calculations for the model can be made with data which is
readily available from disk data sheets. We calculate the expected
I/O operations per second (iops) based on the average read seek and
rotational speed of the disk. We don't consider the command overhead,
as it is generally small for modern drives and is not always
specified in disk data sheets.
maximum rotational latency = 60,000
(ms/min) / rotational speed (rpm)
iops = 1000 (ms/s) / (average read
seek time (ms) + (maximum rotational latency (ms) / 2))
Since most disks use consistent rotational speeds, this small
table may help you to see what the rotational speed contribution will
be.
| Rotational Speed (rpm) | Maximum Rotational Latency (ms) |
| 4,200                  | 14.3                            |
| 5,400                  | 11.1                            |
| 7,200                  | 8.3                             |
| 10,000                 | 6.0                             |
| 15,000                 | 4.0                             |
For example, if we have a 73
GByte, 2.5" Seagate Savvio SAS drive which has a 4.1 ms
average read seek and rotational speed of 10,000 rpm:
iops = 1000 / (4.1 + (6.0 / 2)) =
140.8
By comparison, a 750
GByte, 3.5" Seagate Barracuda SATA drive which has a 8.5 ms
average read seek and rotational speed of 7,200 rpm:
iops = 1000 / (8.5 + (8.3 / 2)) = 79.0
I purposely used those two examples because people are always
wondering why we tend to prefer smaller, faster, and (unfortunately)
more expensive drives over larger, slower, less expensive drives - a
78% performance improvement is rather significant. The 3.5"
drives also use about 25-75% more power than their smaller cousins,
largely due to the rotating mass. Small is beautiful in a SWaP
sense.
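The two drive examples can be reproduced with a short Python sketch of the model. The function names are hypothetical. The sketch keeps full precision for the rotational latency, so the 7,200 rpm result comes out a hair under the figure computed from the rounded 8.3 ms table value.

```python
# Small, random read iops model using only data sheet values.

def max_rotational_latency_ms(rpm: float) -> float:
    """Time for one full revolution, in milliseconds."""
    return 60_000 / rpm

def small_random_read_iops(avg_read_seek_ms: float, rpm: float) -> float:
    """Expected iops: average seek plus half a revolution per I/O."""
    avg_rotational_latency_ms = max_rotational_latency_ms(rpm) / 2
    return 1000 / (avg_read_seek_ms + avg_rotational_latency_ms)

if __name__ == "__main__":
    sas = small_random_read_iops(4.1, 10_000)   # 2.5" SAS example
    sata = small_random_read_iops(8.5, 7_200)   # 3.5" SATA example
    print(f"SAS:  {sas:.1f} iops")
    print(f"SATA: {sata:.1f} iops")
    print(f"improvement: {(sas / sata - 1) * 100:.0f}%")
```

Command overhead and cache hits are left out, as in the model itself.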
Next we use the RAID set configuration information to calculate
the total small, random read iops for the zpool or volume. Here we
need to talk about sets of disks which may make up a multi-level
zpool or volume. For example, RAID-1+0 is a stripe (RAID-0) of
mirrored sets (RAID-1). RAID-0 is a stripe of disks.
For dynamic striping (RAID-0), add the iops for each set or
disk. On average the iops are spread randomly across all sets or
disks, gaining concurrency.
For mirroring (RAID-1), add the iops for each set or disk.
For reads, any set or disk can satisfy a read, so we also get
concurrency.
For single parity raidz (RAID-5), the set operates at the
performance of one disk. See below.
For double parity raidz2 (RAID-6), the set operates at the
performance of one disk. See below.
For example, if you have 6 disks, then there are many different
ways you can configure them, with varying performance calculations:
| RAID Configuration (6 disks)                  | Small, Random Read Performance Relative to a Single Disk |
| 6-disk dynamic stripe (RAID-0)                | 6 |
| 3-set dynamic stripe, 2-way mirror (RAID-1+0) | 6 |
| 2-set dynamic stripe, 3-way mirror (RAID-1+0) | 6 |
| 6-disk raidz (RAID-5)                         | 1 |
| 2-set dynamic stripe, 3-disk raidz (RAID-5+0) | 2 |
| 2-way mirror, 3-disk raidz (RAID-5+1)         | 2 |
| 6-disk raidz2 (RAID-6)                        | 1 |
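The combination rules can be sketched as a small recursive function in Python. The nested tuple representation and the function name are hypothetical, chosen only for illustration, and the sketch assumes identical disks so that every result is relative to one disk.

```python
# Relative small, random read performance of a nested RAID layout.
# A layout is either the string "disk" or a tuple (kind, children).

def relative_read_iops(layout) -> int:
    """Return small, random read performance relative to a single disk."""
    if layout == "disk":
        return 1
    kind, children = layout
    if kind in ("stripe", "mirror"):
        # Reads are spread across all members, so the iops add up.
        return sum(relative_read_iops(child) for child in children)
    if kind in ("raidz", "raidz2"):
        # The whole set operates at the performance of one disk.
        return 1
    raise ValueError(f"unknown layout kind: {kind}")

if __name__ == "__main__":
    layouts = {
        "6-disk stripe": ("stripe", ["disk"] * 6),
        "3 x 2-way mirror": ("stripe", [("mirror", ["disk"] * 2)] * 3),
        "2 x 3-way mirror": ("stripe", [("mirror", ["disk"] * 3)] * 2),
        "6-disk raidz": ("raidz", ["disk"] * 6),
        "2 x 3-disk raidz": ("stripe", [("raidz", ["disk"] * 3)] * 2),
        "mirror of 3-disk raidz": ("mirror", [("raidz", ["disk"] * 3)] * 2),
        "6-disk raidz2": ("raidz2", ["disk"] * 6),
    }
    for name, layout in layouts.items():
        print(f"{name}: {relative_read_iops(layout)}")
```

Multiply the result by the single-disk iops from the data sheet model to get an absolute estimate.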
Clearly, using mirrors improves both performance and data
reliability. Using stripes increases performance, at the cost of data
reliability. raidz and raidz2 offer data reliability, at the cost of
performance. This leads us down a rathole...
The Parity Performance Rathole
Many people expect that data protection schemes based on parity,
such as raidz (RAID-5) or raidz2 (RAID-6), will offer the performance
of striped volumes, except for the parity disk. In other words, they
expect that a 6-disk raidz zpool would have the same small, random
read performance as a 5-disk dynamic stripe. Similarly, they expect
that a 6-disk raidz2 zpool would have the same performance as a
4-disk dynamic stripe. ZFS doesn't work that way, today. ZFS uses a
checksum to validate the contents of a block of data written. The
block is spread across the disks (vdevs) in the set. In order to
validate the checksum, ZFS must read the blocks from more than one
disk, thus not taking advantage of spreading unrelated, random reads
concurrently across the disks. In other words, the small, random read
performance of a raidz or raidz2 set is, essentially, the same as the
single disk performance. The benefit of this design is that writes should be more reliable and faster because you don't have the RAID-5 write hole or read-modify-write performance penalty.
Many people also think that this is a design deficiency. As a RAS
guy, I value the data validation offered by the checksum over the
performance supposedly gained by RAID-5. Reasonable people can
disagree, but perhaps some day a clever person will solve this for
ZFS.
So, what do other logical volume managers or RAID arrays do? The
results seem mixed. I have seen some RAID array performance
characterization data which is very similar to the ZFS performance
for parity sets. I have heard anecdotes that other implementations
will read the blocks and only reconstruct a failed block as
needed. The problem is, how do such systems know that a block has
failed? Anecdotally,
it seems that some of them trust what is read from the disk. To
implement a per-disk block checksum verification, you'd still have to
perform at least two reads from different disks, so it seems to me
that you are trading off data integrity for performance. In ZFS, data
integrity is paramount. Perhaps there is more room for research here,
or perhaps it is just one of those engineering trade-offs that we
must live with.
Other Performance Models
I'm also looking for other performance models which can be applied
to generic disks with data that is readily available to the public.
The reason that the small, random read iops model works is that it
doesn't need to consider caching or channel resource utilization.
Adding these variables would require some knowledge of the
configuration topology and the cache policies (which may also change
with firmware updates.) I've kicked around the idea of a total disk
bandwidth model which will describe a range of possible bandwidths
based upon the media speed of the drives, but it is not clear to me
that it will offer any satisfaction. Drop me a line if you have a
good model or further thoughts on this topic.
You should be cautious about extrapolating the performance results
described here to other workloads. You could consider this to be a
worst-case model because it assumes 0% disk cache hits. I would hope
that most workloads exhibit better performance, but rather than
guessing (hoping) the best way to find out is to run the workload and
measure the performance. If you characterize a number of different
configurations, then you might build your own performance graphs
which fit your workload.
Putting It All Together
Now we have a method to compare a variety of different ZFS or RAID
disk configurations by evaluating space, performance, and MTTDL.
First, let's look at single parity schemes such as 2-way mirrors and
raidz on the Sun
Fire X4500 (aka Thumper) server.
Here you can see that a 2-way mirror (RAID-1, RAID-1+0) has better
performance and MTTDL than raidz for any specific space requirement
except for the case where we run out of hot spares for the 2-way
mirror (using all 46 disks for data). By contrast, all of the raidz
configurations here have hot spares. You can use this to help make
design trade-offs by prioritizing space, performance, and MTTDL.
You'll also note that I did not label the left-side Y axis (MTTDL)
again, but I did label the right-side Y axis (small, random read
iops). I did this with mixed emotion. I didn't label the MTTDL axis
values as I explained previously. But I did label the performance
axis so that you can do a rough comparison to the double parity graph
below. Note that in the double parity graph, the MTTDL axis is in
units of Millions of years, instead of years above.

Here you can see the same sort of comparison between 3-way mirrors
and raidz2 sets. The mirrors still outperform the raidz2 sets and have better MTTDL.
Some people ask me which layout I prefer. For the vast majority of cases, I prefer mirroring. Then, of course, we get into some sort of discussion about how raidz, raidz2, RAID-5, RAID-6 or some other configuration offers more GBytes/$. For my 2007 New Year's resolution I've decided to help make the world a happier place. If you want to be happier, you should use mirroring with at least one hot spare.
Conclusion
We can make design trade-offs between space, performance, and
MTTDL for disk storage systems. As with most engineering decisions,
there often is not a clear best solution given all of the possible
solutions. By using some simple models, we can see the trade-offs
more clearly.
Wednesday Jan 17, 2007
Mean Time to Data Loss (MTTDL) is a metric that we find useful for
comparing data storage systems. I think it is particularly useful for
determining what sort of data protection you may want to use for a
RAID system. For example, suppose you have a Sun Fire X4500 (aka
Thumper) server with 48 internal disk drives. What would be the best
way to configure the disks for redundancy? Previously, I explored
space versus MTTDL and space versus unscheduled Mean Time Between
System Interruptions (U_MTBSI) for the X4500 running ZFS. The same
analysis works for SVM or LVM, too.
For this blog, I want to explore the calculation of MTTDL for a
bunch of disks. It turns out, there are multiple models for
calculating MTTDL. The one described previously here is the simplest
and only considers the Mean Time Between Failure (MTBF) of a disk and
the Mean Time to Repair (MTTR) of the repair and reconstruction
process. I'll call that model #1 which solves for MTTDL[1]. To
quickly recap:
For non-protected schemes (dynamic striping, RAID-0):
MTTDL[1] = MTBF / N
For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):
MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
For double parity schemes (3-way mirror, raidz2, RAID-6):
MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
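A minimal Python sketch of the three MTTDL[1] cases follows. The function name and the example MTBF and MTTR values are hypothetical, picked only to show the arithmetic. MTBF and MTTR must use the same time unit, and the result comes out in that unit.

```python
# Sketch of the MTTDL[1] model (illustrative values only).

def mttdl1(mtbf: float, n_disks: int, parity: int, mttr: float = 0.0) -> float:
    """MTTDL[1] for 0 (stripe), 1 (single) or 2 (double) parity schemes."""
    n = n_disks
    if parity == 0:
        return mtbf / n
    if parity == 1:
        return mtbf ** 2 / (n * (n - 1) * mttr)
    if parity == 2:
        return mtbf ** 3 / (n * (n - 1) * (n - 2) * mttr ** 2)
    raise ValueError("parity must be 0, 1, or 2")

if __name__ == "__main__":
    mtbf_hours = 1_000_000   # hypothetical data sheet MTBF
    mttr_hours = 24          # hypothetical repair and reconstruction time
    hours_per_year = 8760
    for label, n, parity in (("2-disk stripe", 2, 0),
                             ("2-way mirror", 2, 1),
                             ("3-way mirror", 3, 2)):
        years = mttdl1(mtbf_hours, n, parity, mttr_hours) / hours_per_year
        print(f"{label}: MTTDL[1] = {years:,.0f} years")
```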
You can
often get MTBF data from your drive vendor and you can measure or
estimate your MTTR with reasonable accuracy. But MTTDL[1] does not
consider the Unrecoverable Error Rate (UER) for read operations on
disk drives. It turns out that the UER is often easier to get from
the disk drive data sheets, because sometimes the drive vendors don't
list MTBF (or Annual Failure Rate, AFR) for all of their drive
models. Typically, UER will be 1 per 10^14 bits read for consumer
class drives and 1 per 10^15 for enterprise class drives. This can be
alarming, because you could also say that consumer class drives
should see 1 UER per 12.5 TBytes of data read. Today, 500-750 GByte
drives are readily available and 1 TByte drives are announced. Most
people will be unhappy if they get an unrecoverable read error once
every dozen or so times they read the whole disk. Worse yet, if we
have the data protected with RAID and we have to replace a drive, we
really do hope that the data reconstruction completes correctly. To
add to our nightmare, the UER does not decrease by adding disks. If
we can't rely on the data to be correctly read, we can't be sure that
our data reconstruction will succeed, and we'll have data loss.
Clearly, we need a model which takes this into account. Let's call
that model #2, for MTTDL[2]:
First, we calculate the probability of unsuccessful
reconstruction due to a UER for N disks of a given size (unit
conversion omitted):
Precon_fail = (N-1) * size /
UER
For single-disk failure protection:
MTTDL[2] = MTBF / (N *
Precon_fail)
For double-disk failure protection:
MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
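The unit conversion that the Precon_fail formula omits can be sketched in Python. The function names are hypothetical, sizes use decimal TBytes (10^12 bytes), and UER is expressed as bits read per unrecoverable error.

```python
# Unit conversions behind the UER figures (decimal TBytes assumed).

def tbytes_per_error(bits_per_error: float) -> float:
    """TBytes read, on average, per unrecoverable read error."""
    return bits_per_error / 8 / 1e12

def full_reads_per_error(bits_per_error: float, disk_tbytes: float) -> float:
    """How many whole-disk reads fit in one expected error interval."""
    return tbytes_per_error(bits_per_error) / disk_tbytes

def precon_fail(n_disks: int, disk_tbytes: float, bits_per_error: float) -> float:
    """Precon_fail = (N-1) * size / UER, with units made explicit."""
    return (n_disks - 1) * disk_tbytes / tbytes_per_error(bits_per_error)

if __name__ == "__main__":
    consumer, enterprise = 1e14, 1e15
    print(f"consumer:   {tbytes_per_error(consumer)} TBytes per error")
    print(f"enterprise: {tbytes_per_error(enterprise)} TBytes per error")
    print(f"1 TByte disk: {full_reads_per_error(consumer, 1.0)} full reads per error")
    print(f"2-way mirror, 750 GByte disks: Precon_fail = "
          f"{precon_fail(2, 0.75, consumer):.3f}")
```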
Comparing the MTTDL[1] model to
the MTTDL[2] model shows some interesting aspects of design. First,
there is no MTTDL[2] model for RAID-0 because there is no data
reconstruction – any failure and you lose data. Second, the MTTR
doesn't enter into the MTTDL[2] model until you get to double-disk
failure scenarios. You could nit pick about this, but as you'll soon
see, it really doesn't make any difference for our design decision
process. Third, you can see that the Precon_fail is a function of the
size of the data set. This is because the UER doesn't change as you
grow the data set. Or, to look at it from a different direction, if
you use consumer class drives with 1 UER for 10^14 bits, and you have
12.5 TBytes of data, the probability of an unrecoverable read during
the data reconstruction is 1. Ugh. If the Precon_fail is 1, then the
MTTDL[2] model looks a lot like the RAID-0 model and friends don't
let friends use RAID-0! Maybe you could consider a smaller sized
data set to offset this risk. Let's see how that looks in pictures.

2-way mirroring is an example of
a configuration which provides single-disk failure protection. Each
data point represents the space available when using 2-way mirrors in
a zpool. Since this is for a X4500, we consider 46 total available
disks and any disks not used for data are available as spares. In
this graph you can clearly see that the MTTDL[1] model encourages the
use of hot spares. More importantly, although the results of the
calculations of the two models are around 5 orders of magnitude
different, the overall shape of the curve remains the same. Keep in
mind that we are talking years here, perhaps 10 million years, which
is well beyond the 5-year expected life span of a disk. This is the
nature of the beast when using a constant MTBF. For models which
consider the change in MTBF as the device ages, you should never see
such large numbers. But the wish for more accurate models does not
change the relative merits of the design decision, which is
what we really care about – the best RAID configuration given a
bunch of disks. Should I use single disk failure protection or double
disk failure protection? To answer that, let's look at the model for
raidz2.
From this graph you can see that
double disk protection is clearly better than single disk protection
above, regardless of which model we choose. Good, this makes sense.
You can also see that with raidz2 we have a larger number of disk
configuration options. A 3-disk raidz2 set is somewhat similar to a
3-way mirror with the best MTTDL, but doesn't offer much available
space. A 4-disk set will offer better space, but not quite as good
MTTDL. This pattern continues through 8 disks/set. Judging from the
graphs, you should see that a 3-disk set will offer approximately an
order of magnitude better MTTDL than an 8-disk, for either MTTDL
model. This is because the UER remains constant while the data to be
reconstructed increases.
I hope that these models give
you an insight into how you can model systems for RAS. In my
experience, most people get all jazzed up with the space and forget
that they are often making a space vs. RAS trade-off. You can use
these models to help you make good design decisions when configuring
RAID systems. Since the graphs use Space on the X-axis, it is easy to look at the design trade-offs for a given amount of available space.
Just one more teaser... there are other MTTDL models,
but it is unclear if they would help make better decisions, and I'll
explore those in another blog.