Tuesday Apr 08, 2008
My colleague Christine asked me some questions about my holey files posts. These are really good questions, and I'm just a little surprised that more people didn't ask them... hey, that is what the comments section is for! So, I thought I would reply publicly, to help stimulate some conversation.
Q1. How could you have a degraded pool and data corruption w/o a repair?
I assume this pool must be raidz or mirror.
A1. No, this was a simple pool, not protected at the pool level. I used the ZFS copies parameter to set the number of redundant data copies to 2. For more information on how copies works, see my post with pictures.
There is another, hidden question here. How did I install Indiana such that it uses copies=2? By opening a shell and becoming root prior to beginning the install, I was able to set the copies=2 property just after the storage pool was created. By default, it gets inherited by any subsequent file system creation. Simple as that. OK, so it isn't that simple. I've also experimented with better ways to intercept the zpool create, but am not really happy with my hacks thus far. A better solution is for the installer to pick up a set of properties, but it doesn't, at least for now.
Q2. Can a striped pool be in a degraded state? Wouldn't a device
faulting in that pool render it unusable and therefore faulted?
A2. Yes, a striped storage pool can be in a degraded state. To understand this, you need to know the definitions of DEGRADED and FAULTED. Fortunately, they are right there in the zpool manual page.
DEGRADED
    One or more top-level vdevs is in the degraded state because one or
    more component devices are offline. Sufficient replicas exist to
    continue functioning.
...
FAULTED
    One or more top-level vdevs is in the faulted state because one or
    more component devices are offline. Insufficient replicas exist to
    continue functioning.
...
By default, there are multiple replicas, so for a striped volume it is possible to be in a DEGRADED state. However, I expect that the more common case will be a FAULTED state. In other words, I do tend to recommend a more redundant storage pool: mirror, raidz, raidz2.
Q3. What does filling the corrupted part with zero do for me? It doesn't
fix it, those bits weren't zero to begin with.
A3. Filling with zeros will just make sure that the size of the "recovered" file is the same as the original. Some applications get to data in a file via a seek to an offset (random access), so this is how you would want to recover the file. For applications which process files sequentially, it might not matter.
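To make the random-access point concrete, here is a minimal Python sketch. The byte strings and the 4-byte block size are made up for illustration, not taken from the experiment. It compares a recovery that drops the bad block with one that zero-fills it; only the zero-filled copy keeps the later data at its original offset.

```python
BLOCK = 4  # hypothetical block size, tiny so the result is easy to read

original = b"AAAABBBBCCCCDDDD"   # four blocks; pretend block 1 (BBBB) is unreadable
bad_block = 1

def recover(data, bad, fill_with_zeros):
    """Rebuild the file block by block, zero-filling or skipping the bad block."""
    out = bytearray()
    for start in range(0, len(data), BLOCK):
        if start // BLOCK == bad:
            if fill_with_zeros:
                out += bytes(BLOCK)      # placeholder keeps later offsets intact
            continue                     # the damaged bytes themselves are gone
        out += data[start:start + BLOCK]
    return bytes(out)

zero_filled = recover(original, bad_block, fill_with_zeros=True)
omitted = recover(original, bad_block, fill_with_zeros=False)

# An application that seeks to offset 8 expects to find the "CCCC" block there.
offset = 8
print(zero_filled[offset:offset + BLOCK])  # b'CCCC'
print(omitted[offset:offset + BLOCK])      # b'DDDD'
```

The zero-filled copy is also the same length as the original, which is the property A3 is after.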
Thursday Mar 13, 2008
Bob Netherton took a look at my last post on corrupted file recovery (?)
and asked whether I had considered using the noerror option to dd. Yes,
I did experiment with dd and the noerror option.
The noerror option is described in dd(1) as:
noerror
    Does not stop processing on an input error. When an input error
    occurs, a diagnostic message is written on standard error, followed
    by the current input and output block counts in the same format as
    used at completion. If the sync conversion is specified, the missing
    input is replaced with null bytes and processed normally. Otherwise,
    the input block will be omitted from the output.
This looks like the perfect solution, rather than my dd and iseek
script. But I didn't post this because, quite simply, I don't really
understand what I get out of it.
Recall that I had a corrupted file which is 2.9 MBytes in size.
Somewhere around 1.1 MBytes into the file, the data is corrupted and
fails the ZFS checksum test.
# zpool scrub zpl_slim
# zpool status -v zpl_slim
  pool: zpl_slim
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
        Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire
        pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h2m with 1 errors on Tue Mar 11 13:12:42 2008
config:
        NAME        STATE     READ WRITE CKSUM
        zpl_slim    DEGRADED     0     0     9
          c2t0d0s0  DEGRADED     0     0     9
errors: Permanent errors have been detected in the following files:
        /mnt/root/lib/amd64/libc.so.1
# ls -ls /mnt/root/lib/amd64/libc.so.1
4667 -rwxr-xr-x 1 root bin 2984368 Oct 31 18:04 /mnt/root/lib/amd64/libc.so.1
I attempted to use dd with the noerror flag using several
different block sizes to see what I could come up with. Here are
those results:
# for i in 1k 8k 16k 32k 128k 256k 512k
> do
> dd if=libc.so.1 of=/tmp/whii.$i bs=$i conv=noerror
> done
read: I/O error
1152+0 records in
1152+0 records out
...
grond# ls -ls /tmp/whii*
3584 -rw-r--r-- 1 root root 1835008 Mar 13 11:27 /tmp/whii.128k
2464 -rw-r--r-- 1 root root 1261568 Mar 13 11:27 /tmp/whii.16k
2320 -rw-r--r-- 1 root root 1184768 Mar 13 11:27 /tmp/whii.1k
4608 -rw-r--r-- 1 root root 2359296 Mar 13 11:27 /tmp/whii.256k
2624 -rw-r--r-- 1 root root 1343488 Mar 13 11:27 /tmp/whii.32k
7168 -rw-r--r-- 1 root root 3670016 Mar 13 11:27 /tmp/whii.512k
2384 -rw-r--r-- 1 root root 1220608 Mar 13 11:27 /tmp/whii.8k
hmmm... all of these files are of
different sizes, so I'm really unsure what I've ended up with. None
of them are the same size as the original file, which is a bit
unexpected.
# dd if=libc.so.1 of=/tmp/whaa.1k bs=1k conv=noerror
read: I/O error
1152+0 records in
1152+0 records out
read: I/O error
1153+0 records in
1153+0 records out
read: I/O error
1154+0 records in
1154+0 records out
read: I/O error
1155+0 records in
1155+0 records out
read: I/O error
1156+0 records in
1156+0 records out
read: I/O error
1157+0 records in
1157+0 records out
# ls -ls /tmp/whaa.1k
2320 -rw-r--r-- 1 root root 1184768 Mar 13 11:12 /tmp/whaa.1k
hmmm... well, dd
did copy some of the file, but seemed to give up after around 5
attempts and I only seemed to get the first 1.1 MBytes of the file.
What is going on here? A quick look at the dd
source (open source is a good thing) shows that there is a
definition of BADLIMIT which is how many times dd
will try before giving up. The default compilation sets BADLIMIT to
5. Aha! A quick download of the dd
code and I set BADLIMIT to be really huge and tried again.
# bigbaddd if=libc.so.1 of=/tmp/whbb.1k bs=1k conv=noerror
read: I/O error
1152+0 records in
1152+0 records out
...
read: I/O error
3458+0 records in
3458+0 records out
^C I give up
# ls -ls /tmp/whbb.1k
6920 -rw-r--r-- 1 root root 3543040 Mar 13 11:47 /tmp/whbb.1k
As dd
processes the input file, it doesn't really do a seek, so it can't
really get past the corruption. It is getting something, because od
shows that the end of the whbb.1k
file is not full of nulls. But I really don't believe this is the
data in a form which could be useful. And I really can't explain why
the new file is much larger than the original. I suspect that dd
gets stuck at the corrupted area and does not seek beyond it. In any
case, it appears that letting dd
do the dirty work by itself will not achieve the desired results.
This is, of course, yet another opportunity...
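A little arithmetic on the numbers above, sketched here in Python, at least pins down where the trouble starts. The record counts and file sizes are copied from the output in this post; the 128 kByte figure is the default ZFS record size, which I am assuming applies to this file. It does not explain the oversized output files.

```python
KIB = 1024
RECORD = 128 * KIB            # assumed: default ZFS recordsize

first_bad_record = 1152       # dd bs=1k reported its first error after 1152 records
error_offset = first_bad_record * KIB
print(error_offset)                   # 1179648
print(error_offset % RECORD)          # 0, so the error starts on a 128k boundary
print(error_offset // RECORD)         # 9, i.e. the 10th 128k block

# whaa.1k is exactly 1157 one-kByte records long
print(1157 * KIB)                     # 1184768

file_size = 2984368                   # original libc.so.1
whbb_size = 3543040                   # output of the BADLIMIT-patched dd
print(whbb_size - file_size)          # 558672 bytes more than the original
```

So the first failing read sits exactly nine 128 kByte records into the file, which lines up with the 10th block that the iseek script in my last post could not read.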
Wednesday Mar 12, 2008
I was RASing around with ZFS the other day, and managed to find a
file which was corrupted.
# zpool scrub zpl_slim
# zpool status -v zpl_slim
  pool: zpl_slim
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
        Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire
        pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h2m with 1 errors on Tue Mar 11 13:12:42 2008
config:
        NAME        STATE     READ WRITE CKSUM
        zpl_slim    DEGRADED     0     0     9
          c2t0d0s0  DEGRADED     0     0     9
errors: Permanent errors have been detected in the following files:
        /mnt/root/lib/amd64/libc.so.1
# ls -ls /mnt/root/lib/amd64/libc.so.1
4667 -rwxr-xr-x 1 root bin 2984368 Oct 31 18:04 /mnt/root/lib/amd64/libc.so.1
argv! Of course, this particular file
is easily extracted from the original media; it doesn't contain
anything unique. For those who might be concerned that it is the C
runtime library, and thus very critical to running Solaris, the
machine in use is only 32-bit, so the 64-bit (amd64) version of this
file is never used. But suppose this were an important file for me
and I wanted to recover something from it? This is a more interesting
challenge...
First, let's review a little bit about
how ZFS works. By default, when ZFS writes anything, it generates a
checksum which is recorded someplace else, presumably safe.
Actually, the checksum is recorded at least twice, just to be doubly
sure it is correct. And that record is also checksummed. Back to the
story, the checksum is computed on a block, not for the whole file.
This is an important distinction which will come into play later. If
we perform a storage pool scrub, ZFS will find the broken file and
report it to you (see above), which is a good thing -- much better
than simply ignoring it, like many other file systems will do.
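The per-block point can be sketched in a few lines of Python. This is an illustration of the idea only, not ZFS code: it uses SHA-256 from hashlib and a tiny hypothetical block size, whereas ZFS has its own checksum algorithms and much larger blocks. What it shows is that a checksum per block tells you which block is bad, not merely that the file is bad.

```python
import hashlib

BLOCK = 8  # hypothetical block size in bytes; ZFS blocks are far larger

def block_checksums(data):
    """Return one checksum per fixed-size block of data."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

def bad_blocks(data, expected):
    """Return the indexes of blocks whose checksum no longer matches."""
    return [n for n, (got, want) in enumerate(zip(block_checksums(data), expected))
            if got != want]

original = bytes(range(40))                 # five 8-byte blocks
stored = block_checksums(original)          # recorded when the data was written

damaged = bytearray(original)
damaged[19] ^= 0xFF                         # flip bits in one byte of block 2
print(bad_blocks(bytes(damaged), stored))   # [2]
```

With a single whole-file checksum, the same damage would condemn all five blocks.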
OK, so we know that somewhere in the
midst of this 2.8 MByte file, we have some corruption. But can we at
least recover the bits that aren't corrupted? The answer is yes.
But if you try a copy, then it bails with an error.
# cp /mnt/root/lib/amd64/libc.so.1 /tmp
/mnt/root/lib/amd64/libc.so.1: I/O error
Since the copy was not successful,
there is no destination file, not even a partial file. It turns out
that cp
uses mmap(2) to map the
input file and copies it to the output file with a big write(2).
Since the write doesn't complete correctly, it complains and removes
the output file. What we need is something less clever, dd.
# dd if=/mnt/root/lib/amd64/libc.so.1 of=/tmp/whee
read: I/O error
2304+0 records in
2304+0 records out
# ls -ls /tmp/whee
2304 -rw-r--r-- 1 root root 1179648 Mar 12 18:53 /tmp/whee
OK, from this experiment we know that
we can get about 1.2 MBytes by directly copying with dd. But this
isn't all, or even half of the file. We can get a little more clever
than that. To make it simpler, I wrote a little ksh
script:
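#!/bin/ksh
# usage: getaround.ksh input-file output-prefix
# Copy each 128 kByte block of the input to its own numbered file.
integer i=0
while ((i < 23))        # block count is hardwired: 23 blocks covers this file
do
        typeset -RZ2 j=$i       # zero-padded, two-digit suffix: 00, 01, ...
        dd if=$1 of=$2.$j bs=128k iseek=$i count=1
        i=i+1
done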
This script will write each of the
first 23 128kByte blocks from the first argument (a file) to a unique
filename as a number appended to the second argument. dd
is really dumb and doesn't offer much error handling which is why I
hardwired the count into the script. An enterprising soul with a
little bit of C programming skill could do something more complex
which handles the more general case. Ok, that was difficult to
understand, and I wrote it. To demonstrate, and I apologize in advance for
the redundant verbosity:
# ./getaround.ksh libc.so.1 /tmp/zz
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
read: I/O error
0+0 records in
0+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
0+1 records in
0+1 records out
# ls -ls /tmp/zz.*
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.00
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.01
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.02
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.03
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.04
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.05
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.06
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.07
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.08
  0 -rw-r--r-- 1 root root      0 Mar 12 19:00 /tmp/zz.09
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.10
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.11
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.12
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.13
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.14
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.15
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.16
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.17
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.18
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.19
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.20
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.21
200 -rw-r--r-- 1 root root 100784 Mar 12 19:00 /tmp/zz.22
So we can clearly see that the 10th
(128kByte) block is corrupted, but the rest of the blocks are ok. We
can now reassemble the file with a zero-filled block.
# dd if=/dev/zero of=/tmp/zz.09 bs=128k count=1
1+0 records in
1+0 records out
# cat /tmp/zz.* > /tmp/zz
# ls -ls /tmp/zz
5832 -rw-r--r-- 1 root root 2984368 Mar 12 19:03 /tmp/zz
Now I have recreated the file with a
zero-filled hole where the data corruption was. Just for grins, if
you try to compare with the previous file, you should get what you
expect.
# cmp libc.so.1 /tmp/zz+
cmp: EOF on libc.so.1
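For the enterprising soul mentioned earlier, here is roughly what the more general version might look like. This is a sketch in Python rather than C, and not something I ran against the damaged pool: the function name salvage and its parameters are made up for illustration. It assumes that a read of a damaged block fails with an I/O error while reads at other offsets succeed, which is what the dd experiments above showed.

```python
import os

def salvage(src_path, dst_path, block_size=128 * 1024, fill=b"\0"):
    """Copy src to dst block by block, filling unreadable blocks.

    fill must be a single byte. Returns the list of block numbers
    whose read failed with an I/O error.
    """
    bad = []
    size = os.stat(src_path).st_size
    fd = os.open(src_path, os.O_RDONLY)
    try:
        with open(dst_path, "wb") as dst:
            for block, offset in enumerate(range(0, size, block_size)):
                want = min(block_size, size - offset)
                try:
                    data = os.pread(fd, want, offset)   # explicit offset, like dd iseek=
                except OSError:
                    data = b""
                    bad.append(block)
                # Pad failed or short reads so every later block keeps its offset.
                dst.write(data + fill * (want - len(data)))
    finally:
        os.close(fd)
    return bad
```

Calling salvage("libc.so.1", "/tmp/zz") would do the work of the script, the zero-fill, and the cat in one pass, without the hardwired block count. Passing fill=b" " pads with spaces instead of nulls.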
How is this useful?
Personally, I'm not sure this will be
very useful for many corruption cases. As a RAS guy, I advocate many
verified copies of important data placed on diverse systems and
media. But most folks aren't so inclined. Every time we talk about
this on the zfs-discuss alias, somebody will say that they don't care
about corruption in the middle of their mp3 files. I'm no audiophile,
but I prefer my mp3s to be hole-less. So I did this little exercise
to show how you can regain full access to the non-corrupted bits of a
corrupted file in a more-or-less easy way. Consider this a proof of
concept. There are many possible variations, such as filling with
spaces instead of nulls
when you are missing parts of a text file -- opportunities abound.
Wednesday Oct 03, 2007
Adaptec has put together a nice webinar called Nearline Data Drives and Error Handling. If you work with disks or are contemplating building your own home data server, I recommend that you take 22 minutes to review the webinar. As a systems vendor, we are often asked why we made certain design decisions to favor data over costs, and I think this webinar does a good job of showing how some of the complexity of systems design covers a large number of decision points. Here in the RAS Engineering group we tend to gravitate towards the best reliability and availability of systems, which still requires a staggering number of design trade-offs. Rest assured that we do our best to make these decisions with your data in mind.
For the ZFSers in the world, this webinar also provides some insight into how RAID systems like ZFS are designed, and why end-to-end data protection is vitally important.
Enjoy! And if you don't want your Starbuck's gift card, send it to me :-)
Tuesday Sep 18, 2007
Jeff Bonwick recently blogged about why ZFS uses space maps for keeping track of allocations. In my recent blog on looking at ZFS I teased you with a comment about the space map floating near the Channel Islands. Now that Jeff has explained how they work, I'll show you what they look like as viewed from space.

This is a view of a space map for a ZFS file system which was created as a recursive copy of the /usr directory followed by a recursive remove of the /usr/share directory. This allows you to see how some space is allocated and some space is free.
I wrote an add-on to NASA's Worldwind to parse zdb output looking for the space map information. Each allocation appears as a green rectangle with a starting offset and length mapped onto a square field floating above the earth. The allocations are green and the frees are yellow. The frees are also floating 100m above the allocations, though it is not easy to see from this view. Each map entry also has an optional user-facing icon which shows up as a shadowed green or yellow square. I snagged these from the StarOffice bullets images. If you hover the mouse over an icon, then a tool tip will appear showing the information about the space. In this example, the tooltip says "Free, txg=611, pass=1, offset=53fe000, size=800"
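As a rough idea of what is involved, here is a small Python sketch. It is not the add-on itself, which is written against the Worldwind for Java SDK. It parses a tooltip string in the format quoted above and maps an offset onto a square field. The helper names, the hexadecimal reading of offset and size, the row-by-row mapping, and the 1 GByte device size are all assumptions for illustration.

```python
def parse_tooltip(text):
    """Split a tooltip such as 'Free, txg=611, pass=1, offset=53fe000, size=800'."""
    kind, *fields = [part.strip() for part in text.split(",")]
    entry = {"kind": kind}
    for field in fields:
        key, value = field.split("=")
        # Assumption: offset and size are hexadecimal, txg and pass are decimal.
        entry[key] = int(value, 16) if key in ("offset", "size") else int(value)
    return entry

def to_grid(offset, device_size, side):
    """Map a byte offset to a (row, column) cell on a side x side square field."""
    cell = offset * side * side // device_size
    return divmod(cell, side)

entry = parse_tooltip("Free, txg=611, pass=1, offset=53fe000, size=800")
print(entry["kind"], entry["txg"], hex(entry["offset"]), entry["size"])
# Hypothetical 1 GByte device drawn on a 1024 x 1024 field.
print(to_grid(entry["offset"], device_size=1 << 30, side=1024))
```

Folding the one-dimensional offset into rows is only one possible layout; any mapping that keeps neighbouring offsets close together on the field would serve.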
I can think of about a half dozen cool extensions to make for this, such as showing metaslab boundaries. I also need to trim the shadow field to fit; it extends too far on the right. So much to do, so little time...
Wednesday Aug 29, 2007
I'm walking a line - I'm thinking about I/O in motion
I'm walking a line - Just barely enough to be living
Get outta the way - No time to begin
This isn't the time - So nothing was biodone
Not talking about - Not many at all
I'm turning around - No trouble at all
You notice there's nothing around you, around you
I'm walking a line - Divide and dissolve.
[theme song for this post is Houses in Motion by the Talking Heads]
Previously, I mentioned a movie. OK, so perhaps it isn't a movie, but an animated GIF.

This is a time-lapse animation of some of the data shown in my previous blog on ZFS usage of mirrors. Here we're looking at one second intervals and the I/O to the slow disk of a two-disk mirrored ZFS file system. The workload is a recursive copy of the /usr/share directory into this file system.
The yellow areas on the device field are write I/O operations. For each time interval, the new I/O operations are shown with their latency elevators. Shorter elevators mean lower latency. Green elevators mean the latency is 10ms or less, yellow until 25ms, and red beyond 25ms. This provides some insight into the way the slab allocator works for ZFS. If you look closely, you can also see the redundant uberblock updates along the lower-right side near the edge. If you can't see that in the small GIF, click on the GIF for a larger version which is easier to see.
ZFS makes redundant copies of the metadata. By preference, these will be placed in a different slab. You can see this in the animation as there are occasionally writes further out than the bulk of the data writes. As the disk begins to fill, the gaps become filled. Interestingly, the writes to the next slab (metadata) do not have much latency - they are in the green zone. This is a simple IDE disk, so there is a seek required by these writes. This should help allay one of the fears of ZFS, that the tendency to have data spread out will be a performance problem - I see no clear evidence of that here.
I have implemented this as a series of Worldwind layers. This isn't really what Worldwind was designed to do, so there are some inefficiencies in the implementation, or it may be that there is still some trick I have yet to learn. But it is functional in that you can see I/Os in motion.
Wednesday Aug 29, 2007
A few months ago, I blogged about why I wasn't at JavaOne and mentioned that I was looking at some JOGL code. Now I'm ready to show you some cool pictures which provide a view into how ZFS uses disks.
The examples here show a mirrored disk pair. I created a mirrored zpool and use the default ZFS settings. I then did a recursive copy of /usr/share into the ZFS file system. This is a write-mostly workload.
There are several problems with trying to visualize this sort of data:
- There is a huge number of data points. A 500 GByte disk has about a billion blocks. Mirror that and you are trying to visualize two billion data points. My workstation screen size is only 1.92 million pixels (1600x1200) so there is no way that I could see this much data.
- If I look at an ASCII table of this data, then it may be hundreds of pages long. Just for fun, try looking at the output of zdb -dddddd to get an idea of how the data might look in ASCII, but I'll warn you in advance, try this only on a small zpool located on a non-production system.
- One dimensional views of the data are possible. Actually, this is what zdb will show for you. There is some reasoning here because a device is accessed as a single set of blocks using an offset and size for read or write operations. But this doesn't scale well, especially to a billion data points.
- Two dimensional views are also possible, where we basically make a two dimensional array of the one dimensional data. This does hide some behaviour, as disks are two dimensional, but they are stacks of circles of different sizes. These physical details are cleverly hidden and subject to change on a per-case basis. So, perhaps we can see some info in two dimensions that would help us understand what is happening.
- Three dimensional views can show even more data. This is where JOGL comes in: it is a 3-D library for Java.
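The scale problem in the first bullet is easy to check with a few lines of Python. The 512-byte block size is my assumption; the disk size and screen resolution are the ones quoted in the list.

```python
disk_bytes = 500 * 10**9      # a 500 GByte disk, decimal gigabytes
sector = 512                  # assumed block size in bytes
blocks = disk_bytes // sector
pixels = 1600 * 1200

print(blocks)                 # 976562500, roughly a billion
print(pixels)                 # 1920000
print(2 * blocks // pixels)   # 1017 blocks per pixel for the mirrored pair
```

At about a thousand blocks per pixel, a flat one-to-one picture is out of the question.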
It is clear that some sort of 3-D visualization system could help provide some insight into this massive amount of data. So I did it.
Where is the data going?

This is a view of the two devices in the mirror after they have been filled by the recursive copy. Yellow blocks indicate write operations, green blocks are read operations. Since this was a copy into the file system, there aren't very many reads. I would presume that your browser window is not of sufficient resolution to show the few, small reads anyway, so you'll just have to trust me.
What you should be able to see, even at a relatively low resolution, is that we're looking at a 2-D representation of each device from a 3-D viewpoint. Zooming, panning, and moving the viewpoint allows me to observe more or less detail.
To gather this data, I used TNF tracing. I could also write a dtrace script to do the same thing. But I decided to use TNF data because it has been available since Solaris 8 (7-8 years or so) and I have an archive of old TNF traces that I might want to take a look at some day. So what you see here are the I/O operations for each disk during the experiment.
How long did it take? (Or, what is the latency?)
The TNF data also contains latency information. The latency is measured as the difference in time between the start of the I/O and its completion. Using the 3rd dimension, I put the latency in the Z-axis.

Ahhh... this view tells me something interesting. The latency is shown as a line emitting from the starting offset of the block being written. You can see some regularity over the space as ZFS will coalesce writes into 128 kByte I/Os. The pattern is more clearly visible on the device on the right.
But wait! What about all of the red? I color the latency line green when the latency is less than 10ms, yellow until 25ms, and red for latency > 25ms. The height of the line is a multiple of its actual latency. Wow! The device on the left has a lot of red, it sure looks slow. And it is. On the other hand, the device on the right sure looks fast. And it is. But this view is still hard to see, even when you can fly around and look at it from different angles. So, I added some icons...

I put icons at the top of the line. If I hover the mouse over an icon, it will show a tooltip which contains more information about that data point. In this case, the tooltip says, "Write, block=202688, size=64, flags=3080101, time=87.85" The size is in blocks, the flags are defined in a header file somewhere, and the time is latency in milliseconds. So we wrote 32 kBytes at block 202,688 in 87.85 ms. This is becoming useful! By cruising around, it becomes apparent that for this slow device, small writes are faster than large writes, which is pretty much what you would expect.
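The color rule and the tooltip arithmetic are simple enough to write down. This Python sketch is illustrative rather than the Worldwind code; the 512-byte block size is an assumption, chosen because it makes 64 blocks come out to the 32 kBytes mentioned above.

```python
def latency_color(ms):
    """Color rule from the post: green up to 10 ms, yellow up to 25 ms, red beyond."""
    if ms <= 10:
        return "green"
    if ms <= 25:
        return "yellow"
    return "red"

SECTOR = 512  # assumed bytes per block, which makes 64 blocks equal 32 kBytes

size_blocks, latency_ms = 64, 87.85    # values from the tooltip above
print(size_blocks * SECTOR // 1024)    # 32 (kBytes)
print(latency_color(latency_ms))       # red
```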
Finding a place in the world
Now for the kicker. I implemented this as an add-on to NASA's Worldwind.
I floated my devices at 10,000 m above the ocean off the west coast of San Diego! By leveraging the Worldwind for Java SDK, I was able to implement my visualization by writing approximately 2,000 lines of code. This is a pretty efficient way of extending a GIS tool into non-GIS use, while leveraging the fact that GIS tools are inherently designed to look at billions of data points in 3-D.
More details of the experiment
The two devices are intentionally very different from a performance perspective. The device on the left is an old, slow, relatively small IDE disk. The device on the right is a ramdisk.
I believe that this technique can lead to a better view of how systems work under the covers, even beyond disk devices. I've got some cool ideas, but not enough days in the hour to explore them all. Drop me a line if you've got a cool idea.
The astute observer will notice another view of the data just to the north of the devices. This is the ZFS space map allocation of one of the mirror vdevs. More on that later... I've got a movie to put together...
Tuesday May 08, 2007
I'm not going to JavaONE this year. And I'm a little sad about that, but I'll make it one day. I was part of the very first JavaONE prelude. "Prelude" is the operative term because at that time Sun employees were actively discouraged from attending. This was a dain bramaged policy which has since been fixed, but at the time the idea was to fill the event with customers and developers. Now it is just full of every sort of person. Y'all have fun up there!
So to celebrate this year's JavaONE, I'm learning JOGL. Why? Well, I've got some data that I'd like to visualize and I've not found a reasonable tool for doing it in an open, shareable manner. Stay tuned...
Friday May 04, 2007
OpenSolaris build 61 (or later) is now available for download. ZFS
has added a new feature that will improve data protection: redundant
copies for data (aka ditto blocks for data). Previously, ZFS stored redundant copies of metadata.
Now this feature is available for data, too.
This represents a new feature which is unique to ZFS: you can set
the data protection policy on a per-file system basis, beyond that
offered by the underlying device or volume. For single-device
systems, like my laptop with its single disk drive, this is very
powerful. I can have a different data protection policy for the files
that I really care about (my personal files) than the files that I
really don't care about or that can be easily reloaded from the OS
installation DVD. For systems with multiple disks assembled in a RAID
configuration, the data protection is not quite so obvious. Let's
explore this feature, look under the hood, and then analyze some
possible configurations.
Using Copies
To change the number of data copies, set the copies
property. For example, suppose I have a zpool named "zwimming."
The default number of data copies is 1. But you can change that to 2
quite easily.
# zfs set copies=2 zwimming
The copies property works for all new writes, so I recommend that
you set that policy when you create the file system or immediately
after you create a zpool.
You can verify the copies setting by looking at the properties.
# zfs get copies zwimming
NAME PROPERTY VALUE SOURCE
zwimming copies 2 local
ZFS will account for the space used. For example, suppose I create
three new file systems and copy some data to them. You can then see
that the space used reflects the number of copies. If you use quotas,
then the copies will be charged against the quotas, too.
# zfs create -o copies=1 zwimming/single
# zfs create -o copies=2 zwimming/dual
# zfs create -o copies=3 zwimming/triple
# cp -rp /usr/share/man1 /zwimming/single
# cp -rp /usr/share/man1 /zwimming/dual
# cp -rp /usr/share/man1 /zwimming/triple
# zfs list -r zwimming
NAME              USED  AVAIL  REFER  MOUNTPOINT
zwimming         48.2M   310M  33.5K  /zwimming
zwimming/dual    16.0M   310M  16.0M  /zwimming/dual
zwimming/single  8.09M   310M  8.09M  /zwimming/single
zwimming/triple  23.8M   310M  23.8M  /zwimming/triple
This makes sense. Each file system has one, two, or three copies
of the data and will use correspondingly one, two, or three times as
much space to store the data.
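A quick check of the ratios in Python, using the USED figures from the zfs list output above:

```python
used_mb = {"single": 8.09, "dual": 16.0, "triple": 23.8}   # USED column above
copies = {"single": 1, "dual": 2, "triple": 3}

for name in ("single", "dual", "triple"):
    ratio = used_mb[name] / used_mb["single"]
    print(f"{name}: copies={copies[name]} ratio={ratio:.2f}")
# single: copies=1 ratio=1.00
# dual: copies=2 ratio=1.98
# triple: copies=3 ratio=2.94
```

The ratios land just under 2 and 3 rather than exactly on them.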
Under the Covers
ZFS will spread the ditto blocks across the vdev or vdevs to
provide spatial diversity. Bill
Moore has previously blogged about this, or you can see
it in the code for yourself. From a RAS perspective, this is a
good thing. We want to reduce the possibility that a single failure,
such as a drive head impact with media, could disturb both copies of
our data. If we have multiple disks, ZFS will try to spread the
copies across multiple disks. This is different than mirroring, in
subtle ways. The actual placement is ultimately based upon available
space. Let's look at some simplified examples. First, for the default
file system configuration settings on a single disk.

Note that there are two copies of the metadata, by default. If we
have two or more copies of the data, the number of metadata copies is
three.
Suppose you have a 2-disk stripe. In that case, ZFS will try to
spread the copies across the disks.

Since the copies are created above the zpool, a mirrored zpool
will faithfully mirror the copies.
Since the copies policy is set at the file system level, not the
zpool level, a single zpool may contain multiple file systems, each
with different policies. In other words, you could have data which is
not copied allocated along with data that is copied.
Using different policies for different file systems allows you to have different data protection policies, allows you to improve data protection, and offers many more permutations of configurations for you to weigh in your designs.
RAS Modeling
It is obvious that increasing the number of data copies will
effectively reduce the amount of available space accordingly. But how
will this affect reliability? To answer that question we use the
MTTDL[2]
model I previously described, with the following changes:
First, we calculate the probability of unsuccessful
reconstruction due to a UER for N disks of a given size (unit
conversion omitted). The number of copies decreases this probability.
This makes sense as we could use another copy of the data for
reconstruction and to completely fail, we'd need to lose all copies:
$$ P_{recon\_fail} = \left( \frac{(N-1) \cdot size}{UER} \right)^{copies} $$
For single-disk failure protection:
$$ MTTDL[2] = \frac{MTBF}{N \cdot P_{recon\_fail}} $$
For double-disk failure protection:
$$ MTTDL[2] = \frac{MTBF^{2}}{N \cdot (N-1) \cdot MTTR \cdot P_{recon\_fail}} $$
Note that as the number of copies increases, Precon_fail
approaches zero quickly. This will increase the MTTDL. We want higher
MTTDL, so this is a good thing.
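To get a feel for how the copies exponent moves the numbers, the model can be written as a few Python functions. This is a sketch under stated assumptions: the function names are mine, size and UER are both expressed in bits (UER as bits read per unrecoverable error) so the omitted unit conversion is not needed, and the input values are hypothetical, chosen only to show the trend.

```python
def p_recon_fail(n, size_bits, uer_bits, copies=1):
    """Probability that reconstruction fails because of an unrecoverable read."""
    return ((n - 1) * size_bits / uer_bits) ** copies

def mttdl_single(mtbf, n, p_fail):
    """MTTDL[2] with single-disk failure protection."""
    return mtbf / (n * p_fail)

def mttdl_double(mtbf, mttr, n, p_fail):
    """MTTDL[2] with double-disk failure protection."""
    return mtbf ** 2 / (n * (n - 1) * mttr * p_fail)

# Hypothetical inputs, chosen only to show the trend.
size_bits = 500e9 * 8      # a 500 GByte disk
uer_bits = 1e15            # one unrecoverable read per 1e15 bits read
mtbf_hours = 1_000_000
n = 2                      # a 2-way mirror

for copies in (1, 2, 3):
    p = p_recon_fail(n, size_bits, uer_bits, copies)
    print(copies, p, mttdl_single(mtbf_hours, n, p))
```

Because the base of the exponent is well below one for these inputs, each additional copy shrinks the failure probability by the same factor and raises the MTTDL by its inverse.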
OK, now that we can calculate available space and MTTDL, let's
look at some configurations for 46 disks available on a Sun
Fire X4500 (aka Thumper). We'll look at single parity schemes, to
reduce the clutter, but double parity schemes will show the same,
relative improvements.
You can see that we are trading off space for MTTDL. You can also
see that for raidz zpools, having more disks in the sets reduces the
MTTDL. It gets more interesting to see that the 2-way mirror with
copies=2 is very similar in space and MTTDL to the 5-disk raidz with
copies=3. Hmm. Also, the 2-way mirror with copies=1 is similar in
MTTDL to the 7-disk raidz with copies=2, though the mirror
configurations allow more space. This information may be useful as
you make trade-offs. Since the copies parameter is set per file
system, you can still set the data protection policy for important
data separately from unimportant data. This might be a good idea for
some situations where you might have permanent originals (eg. CDs,
DVDs) and want to apply a different data protection policy.
In the future, once we have a better feel for the real performance
considerations, we'll be able to add a performance component into the
analysis.
Single Device Revisited
Now that we see how data protection is improved, let's revisit the
single device case. I use the term device here because there is a
significant change occurring in storage as we replace disk drives
with solid state, non-volatile memory devices (eg. flash disks and
future MRAM or PRAM devices). A large number of enterprise customers
demand dual disk drives for mirroring root file systems in servers.
However, there is also a growing demand for solid state boot devices,
and we
have some Sun servers with this option. Some believe that by
2009, the majority of laptops will also have solid state devices
instead of disk drives. In the interim, there are also hybrid disk
drives.
What effect will these devices have on data retention? We know
that if the entire device completely fails, then the data is most
likely unrecoverable. In real life, these devices can suffer many
failures which result in data loss, but which are not complete device
failures. For disks, we see the most common failure is an
unrecoverable read where data is lost from one or more sectors (bar 1
in the graph below). For flash memories, there is an endurance issue
where repeated writes to a cell may reduce the probability of reading
the data correctly. If you only have one copy of the data, then the
data is lost, never to be read correctly again.
We captured disk error codes returned from a number of disk drives
in the field. The Pareto chart below shows the relationship between
the error codes. Bar 1 is the unrecoverable read which accounts for
about 24% of the errors recorded. The violet bars show recoverable
errors which did succeed. Examples of successfully recovered errors
are: write error - recovered with block reallocation, read error -
recovered by ECC using normal retries, etc. The recovered errors do
not (immediately) indicate a data loss event, so they are largely
transparent to applications. We worry more about the unrecoverable
errors.
Approximately 1/3 of the errors were unrecoverable. If such an
error occurs in ZFS metadata, then ZFS will try to read an alternate
metadata copy and repair the metadata. If the data has multiple
copies, then it is likely that we will not lose any data. This is a
more detailed view of the storage device because we are not treating
all failures as a full device failure.
Both real and anecdotal evidence suggests that unrecoverable
errors can occur while the device is still largely operational. ZFS
has the ability to survive such errors without data loss. Very cool.
Murphy's Law will ultimately catch up with you, though. In the case
where ZFS cannot recover the data, ZFS will tell you which file is
corrupted. You can then decide whether or not you should recover it
from backups or source media.
Another Single Device
Now that I've got you to think of the single device as a single
device, I'd like to extend the thought to RAID arrays. There is much
confusion amongst people about whether ZFS should or should not be
used with RAID arrays. If you search, you'll find comments and
recommendations both for and against using hardware RAID for ZFS. The main
argument is centered around the ability of ZFS to correct errors. If
you have a single device backed by a RAID array with some sort of
data protection, then previous versions of ZFS could not recover data
which was lost. Hold it right there, fella! Do I mean that RAID
arrays and the channel from the array to main memory can have errors?
Yes, of course! We have seen cases where errors were introduced
somewhere along the path from disk media to main memory where data
was lost or corrupted. Prior to ZFS, these were silent errors and
blissfully ignored. With ZFS, the checksum now detects these errors
and tries to recover. If you don't believe me, then watch the ZFS
forum on opensolaris.org where we get reports like this about
once a month or so. With ZFS copies, you can now recover from such
errors without changing the RAID array configuration.
If ZFS can correct a data error, it will attempt to do so. You now
have the option to improve your data protection even when using a
single RAID LUN. And this is the same mechanism we can use for a
single disk or flash drive: data copies. You can implement the copies
on a per-file system basis and thus have different data protection
policies even though the data is physically stored on a RAID LUN in a
hardware RAID array. I really hope we can put to rest the "ZFS
prefers JBOD" argument and just concentrate our efforts on
implementing the best data protection policies for the requirements.
ZFS with data copies is another tool in your toolbelt to improve your
life, and the life of your data.
Wednesday Apr 18, 2007
eWeek has published a nice article describing Sun's new, low-cost RAID array: the Sun StorageTek ST2500 Low Cost Array. This is an interesting new product that has broad appeal and will be a heck of a good box to run under ZFS.
But I'm worried about eWeek. It seems that they've lost track of time. Many of us have been running on internet time for most of our lives. This quote makes me wonder if eWeek forgot to update their timezone data:
"The ZFS [the speedy Zeta[sic] file system, recently released to the open-source community by Sun] is very interesting, and people are looking at it," [Henry] Baltazar told eWeek.
I'm pretty sure Henry Baltazar is running on internet time, and provided a very nice quote. But whoever added the editorial clarification at eWeek spelled Zettabyte wrong and is running on island time. ZFS was released to the open-source community on June 14, 2005 - nearly 2 years ago in real time. Even in real time, 2 years can hardly be considered "recent." Sigh.
Thursday Apr 05, 2007
The FreeBSD team has added ZFS to the FreeBSD-7.0 release! This is excellent news and all of us are happy to share with the FreeBSD community. Pawel Jakub Dawidek has posted this note to the FreeBSD and ZFS community. This will greatly expand the use of ZFS and will no doubt lead to more innovative developments in the community. Well done!
Wednesday Jan 31, 2007
Wrapping up the thread on space, performance, and MTTDL, I thought that you might like to see one graph which would show the entire design space I've been using. Here it is:
This shows the data I've previously blogged about in scale. You can easily see that for MTTDL, double parity protection is better than single parity protection which is better than striping (no parity protection). Mirroring is also better than raidz or raidz2 for MTTDL and small, random read iops. I call this the "all-in" slide because, in a sense, it puts everything in one pot.
While this sort of analysis is useful, the problem is that there are more dimensions of the problem. I will show you some of the other models we use to evaluate and model systems in later blogs, but it might not be so easy to show so many useful factors on one graph. I'll try my best...
Tuesday Jan 30, 2007
In this blog I connect the space vs MTTDL models with a performance model. This gives enough information for you to make RAID configuration trade-offs for various systems. The examples here target the Sun Fire X4500 (aka Thumper) server, but the models work for other systems too. In particular, the model results apply generically to RAID implementations on just a bunch of disks (JBOD) systems.
The best thing about a model is that it is a simplification of
real life.
The worst thing about a model is that it is a
simplification of real life.
Small, Random Read Performance Model
For this analysis, we will use a small, random read performance
model. The calculations for the model can be made with data which is
readily available from disk data sheets. We calculate the expected
I/O operations per second (iops) based on the average read seek and
rotational speed of the disk. We don't consider the command overhead,
as it is generally small for modern drives and is not always
specified in disk data sheets.
maximum rotational latency (ms) = 60,000 (ms/min) / rotational speed (rpm)
iops = 1000 (ms/s) / (average read seek time (ms) + (maximum rotational latency (ms) / 2))
Since most disks use consistent rotational speeds, this small
table may help you to see what the rotational speed contribution will
be.
| Rotational Speed (rpm) | Maximum Rotational Latency (ms) |
|---|---|
| 4,200 | 14.3 |
| 5,400 | 11.1 |
| 7,200 | 8.3 |
| 10,000 | 6.0 |
| 15,000 | 4.0 |
For example, if we have a 73
GByte, 2.5" Seagate Savvio SAS drive which has a 4.1 ms
average read seek and rotational speed of 10,000 rpm:
iops = 1000 / (4.1 + (6.0 / 2)) = 140.8
By comparison, a 750
GByte, 3.5" Seagate Barracuda SATA drive which has a 8.5 ms
average read seek and rotational speed of 7,200 rpm:
iops = 1000 / (8.5 + (8.3 / 2)) = 79.0
I purposely used those two examples because people are always
wondering why we tend to prefer smaller, faster, and (unfortunately)
more expensive drives over larger, slower, less expensive drives - a
78% performance improvement is rather significant. The 3.5"
drives also use about 25-75% more power than their smaller cousins,
largely due to the rotating mass. Small is beautiful in a SWaP
sense.
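The two calculations above can be reproduced with a few lines of Python. This is only a sketch of the model as stated; the function names are mine. Note that it uses the unrounded rotational latency, so the 7,200 rpm result comes out near 78.9 rather than the 79.0 obtained from the rounded 8.3 ms figure.

```python
def max_rotational_latency_ms(rpm):
    """Time for one full revolution, in milliseconds."""
    return 60_000 / rpm


def small_random_read_iops(avg_read_seek_ms, rpm):
    """Expected iops: average seek plus half a revolution per read."""
    return 1000 / (avg_read_seek_ms + max_rotational_latency_ms(rpm) / 2)


sas_iops = small_random_read_iops(4.1, 10_000)   # 2.5" SAS example
sata_iops = small_random_read_iops(8.5, 7_200)   # 3.5" SATA example
print(round(sas_iops, 1), round(sata_iops, 1))
print(f"improvement: {(sas_iops / sata_iops - 1) * 100:.0f}%")
```

Command overhead and cache hits are ignored here, as in the model itself.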
Next we use the RAID set configuration information to calculate
the total small, random read iops for the zpool or volume. Here we
need to talk about sets of disks which may make up a multi-level
zpool or volume. For example, RAID-1+0 is a stripe (RAID-0) of
mirrored sets (RAID-1). RAID-0 is a stripe of disks.
For dynamic striping (RAID-0), add the iops for each set or
disk. On average the iops are spread randomly across all sets or
disks, gaining concurrency.
For mirroring (RAID-1), add the iops for each set or disk.
For reads, any set or disk can satisfy a read, so we also get
concurrency.
For single parity raidz (RAID-5), the set operates at the
performance of one disk. See below.
For double parity raidz2 (RAID-6), the set operates at the
performance of one disk. See below.
For example, if you have 6 disks, then there are many different
ways you can configure them, with varying performance calculations:

| RAID Configuration (6 disks) | Small, Random Read Performance Relative to a Single Disk |
|---|---|
| 6-disk dynamic stripe (RAID-0) | 6 |
| 3-set dynamic stripe, 2-way mirror (RAID-1+0) | 6 |
| 2-set dynamic stripe, 3-way mirror (RAID-1+0) | 6 |
| 6-disk raidz (RAID-5) | 1 |
| 2-set dynamic stripe, 3-disk raidz (RAID-5+0) | 2 |
| 2-way mirror, 3-disk raidz (RAID-5+1) | 2 |
| 6-disk raidz2 (RAID-6) | 1 |
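The rules behind the table can be written down as a small Python sketch. The function name and the set-type labels are hypothetical, chosen only for this illustration; the arithmetic follows the rules stated above (stripes and mirrors add the iops of their members, a raidz or raidz2 set counts as one disk).

```python
def relative_read_iops(n_sets, disks_per_set, set_type):
    """Small, random read performance relative to a single disk.

    set_type is one of 'disk', 'mirror', 'raidz', 'raidz2' (labels are
    assumptions for this sketch). The sets are combined by striping or
    mirroring, both of which add the iops of each set.
    """
    if set_type in ("raidz", "raidz2"):
        per_set = 1              # a parity set performs like one disk
    elif set_type in ("disk", "mirror"):
        per_set = disks_per_set  # any disk in the set can satisfy a read
    else:
        raise ValueError(f"unknown set type: {set_type}")
    return n_sets * per_set


configs = [
    ("6-disk dynamic stripe", 6, 1, "disk"),
    ("3-set stripe of 2-way mirrors", 3, 2, "mirror"),
    ("2-set stripe of 3-way mirrors", 2, 3, "mirror"),
    ("6-disk raidz", 1, 6, "raidz"),
    ("2 sets of 3-disk raidz", 2, 3, "raidz"),
    ("6-disk raidz2", 1, 6, "raidz2"),
]
for name, n_sets, per_set, kind in configs:
    print(name, relative_read_iops(n_sets, per_set, kind))
```

Multiply the result by the single-disk iops from the data sheet calculation to get an estimate for the whole zpool or volume.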
Clearly, using mirrors improves both performance and data
reliability. Using stripes increases performance, at the cost of data
reliability. raidz and raidz2 offer data reliability, at the cost of
performance. This leads us down a rathole...
The Parity Performance Rathole
Many people expect that data protection schemes based on parity,
such as raidz (RAID-5) or raidz2 (RAID-6), will offer the performance
of striped volumes, except for the parity disk. In other words, they
expect that a 6-disk raidz zpool would have the same small, random
read performance as a 5-disk dynamic stripe. Similarly, they expect
that a 6-disk raidz2 zpool would have the same performance as a
4-disk dynamic stripe. ZFS doesn't work that way, today. ZFS uses a
checksum to validate the contents of a block of data written. The
block is spread across the disks (vdevs) in the set. In order to
validate the checksum, ZFS must read the blocks from more than one
disk, thus not taking advantage of spreading unrelated, random reads
concurrently across the disks. In other words, the small, random read
performance of a raidz or raidz2 set is, essentially, the same as the
single disk performance. The benefit of this design is that writes should be more reliable and faster because you don't have the RAID-5 write hole or read-modify-write performance penalty.
Many people also think that this is a design deficiency. As a RAS
guy, I value the data validation offered by the checksum over the
performance supposedly gained by RAID-5. Reasonable people can
disagree, but perhaps some day a clever person will solve this for
ZFS.
So, what do other logical volume managers or RAID arrays do? The
results seem mixed. I have seen some RAID array performance
characterization data which is very similar to the ZFS performance
for parity sets. I have heard anecdotes that other implementations
will read the blocks and only reconstruct a failed block as
needed. The problem is, how do such systems know that a block has
failed? Anecdotally,
it seems that some of them trust what is read from the disk. To
implement a per-disk block checksum verification, you'd still have to
perform at least two reads from different disks, so it seems to me
that you are trading off data integrity for performance. In ZFS, data
integrity is paramount. Perhaps there is more room for research here,
or perhaps it is just one of those engineering trade-offs that we
must live with.
Other Performance Models
I'm also looking for other performance models which can be applied
to generic disks with data that is readily available to the public.
The reason that the small, random read iops model works is that it
doesn't need to consider caching or channel resource utilization.
Adding these variables would require some knowledge of the
configuration topology and the cache policies (which may also change
with firmware updates.) I've kicked around the idea of a total disk
bandwidth model which will describe a range of possible bandwidths
based upon the media speed of the drives, but it is not clear to me
that it will offer any satisfaction. Drop me a line if you have a
good model or further thoughts on this topic.
You should be cautious about extrapolating the performance results
described here to other workloads. You could consider this to be a
worst-case model because it assumes 0% disk cache hits. I would hope
that most workloads exhibit better performance, but rather than
guessing (hoping) the best way to find out is to run the workload and
measure the performance. If you characterize a number of different
configurations, then you might build your own performance graphs
which fit your workload.
Putting It All Together
Now we have a method to compare a variety of different ZFS or RAID
disk configurations by evaluating space, performance, and MTTDL.
First, let's look at single parity schemes such as 2-way mirrors and
raidz on the Sun
Fire X4500 (aka Thumper) server.
Here you can see that a 2-way mirror (RAID-1, RAID-1+0) has better
performance and MTTDL than raidz for any specific space requirement
except for the case where we run out of hot spares for the 2-way
mirror (using all 46 disks for data). By contrast, all of the raidz
configurations here have hot spares. You can use this to help make
design trade-offs by prioritizing space, performance, and MTTDL.
You'll also note that I did not label the left-side Y axis (MTTDL)
again, but I did label the right-side Y axis (small, random read
iops). I did this with mixed emotion. I didn't label the MTTDL axis
values as I explained previously. But I did label the performance
axis so that you can do a rough comparison to the double parity graph
below. Note that in the double parity graph, the MTTDL axis is in
units of Millions of years, instead of years above.

Here you can see the same sort of comparison between 3-way mirrors
and raidz2 sets. The mirrors still outperform the raidz2 sets and have better MTTDL.
Some people ask me which layout I prefer. For the vast majority of cases, I prefer mirroring. Then, of course, we get into some sort of discussion about how raidz, raidz2, RAID-5, RAID-6 or some other configuration offers more GBytes/$. For my 2007 New Year's resolution I've decided to help make the world a happier place. If you want to be happier, you should use mirroring with at least one hot spare.
Conclusion
We can make design trade-offs between space, performance, and
MTTDL for disk storage systems. As with most engineering decisions,
there often is not a clear best solution given all of the possible
solutions. By using some simple models, we can see the trade-offs
more clearly.
Wednesday Jan 17, 2007
Mean Time to Data Loss (MTTDL) is a metric that we find useful for
comparing data storage systems. I think it is particularly useful for
determining what sort of data protection you may want to use for a
RAID system. For example, suppose you have a Sun Fire X4500 (aka
Thumper) server with 48 internal disk drives. What would be the best
way to configure the disks for redundancy? Previously, I explored
space versus MTTDL and space versus unscheduled Mean Time Between
System Interruptions (U_MTBSI) for the X4500 running ZFS. The same
analysis works for SVM or LVM, too.
For this blog, I want to explore the calculation of MTTDL for a
bunch of disks. It turns out, there are multiple models for
calculating MTTDL. The one described previously here is the simplest
and only considers the Mean Time Between Failure (MTBF) of a disk and
the Mean Time to Repair (MTTR) of the repair and reconstruction
process. I'll call that model #1 which solves for MTTDL[1]. To
quickly recap:
For non-protected schemes (dynamic striping, RAID-0):
MTTDL[1] = MTBF / N
For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):
MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
For double parity schemes (3-way mirror, raidz2, RAID-6):
MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
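As a rough sketch, the three MTTDL[1] cases can be folded into one Python function. The function name, the parity argument, and the sample MTBF and MTTR values are assumptions for illustration only.

```python
HOURS_PER_YEAR = 24 * 365


def mttdl1(mtbf_hours, mttr_hours, n_disks, parity):
    """MTTDL[1] in hours.

    parity is the number of disk failures the scheme survives:
    0 for striping, 1 for single parity, 2 for double parity.
    """
    if parity not in (0, 1, 2):
        raise ValueError("parity must be 0, 1, or 2")
    if n_disks <= parity:
        raise ValueError("need more disks than parity level")
    if parity == 0:
        return mtbf_hours / n_disks
    if parity == 1:
        return mtbf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)
    return mtbf_hours ** 3 / (
        n_disks * (n_disks - 1) * (n_disks - 2) * mttr_hours ** 2
    )


# Hypothetical inputs: 1,000,000 hour MTBF and a 24 hour MTTR.
for label, n, parity in (("stripe", 2, 0), ("2-way mirror", 2, 1),
                         ("3-way mirror", 3, 2)):
    print(label, mttdl1(1_000_000, 24, n, parity) / HOURS_PER_YEAR, "years")
```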
You can
often get MTBF data from your drive vendor and you can measure or
estimate your MTTR with reasonable accuracy. But MTTDL[1] does not
consider the Unrecoverable Error Rate (UER) for read operations on
disk drives. It turns out that the UER is often easier to get from
the disk drive data sheets, because sometimes the drive vendors don't
list MTBF (or Annual Failure Rate, AFR) for all of their drive
models. Typically, UER will be 1 per 10^14 bits read for consumer
class drives and 1 per 10^15 for enterprise class drives. This can be
alarming, because you could also say that consumer class drives
should see 1 UER per 12.5 TBytes of data read. Today, 500-750 GByte
drives are readily available and 1 TByte drives are announced. Most
people will be unhappy if they get an unrecoverable read error once
every dozen or so times they read the whole disk. Worse yet, if we
have the data protected with RAID and we have to replace a drive, we
really do hope that the data reconstruction completes correctly. To
add to our nightmare, the UER does not decrease by adding disks. If
we can't rely on the data to be correctly read, we can't be sure that
our data reconstruction will succeed, and we'll have data loss.
Clearly, we need a model which takes this into account. Let's call
that model #2, for MTTDL[2]:
First, we calculate the probability of unsuccessful
reconstruction due to a UER for N disks of a given size (unit
conversion omitted):
Precon_fail = (N-1) * size / UER
For single-disk failure protection:
MTTDL[2] = MTBF / (N * Precon_fail)
For double-disk failure protection:
MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
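The unit conversion omitted above is just bits to bytes. A minimal Python sketch, assuming decimal units (1 TByte = 10^12 bytes) and clamping the probability to 1, with function names of my own choosing:

```python
def bytes_per_unrecoverable_error(uer_bits):
    """Convert a UER of 1 error per uer_bits bits into bytes read per error."""
    return uer_bits / 8


def recon_fail_probability(n_disks, size_bytes, uer_bits):
    """Precon_fail for N disks of a given size (clamped to 1 by assumption)."""
    data_to_read = (n_disks - 1) * size_bytes
    return min(1.0, data_to_read / bytes_per_unrecoverable_error(uer_bits))


# Consumer class drive: 1 error per 10^14 bits is 12.5 TBytes per error.
print(bytes_per_unrecoverable_error(1e14) / 1e12, "TBytes per error")
# A 2-way mirror of 750 GByte drives must read 750 GBytes to reconstruct.
print(recon_fail_probability(2, 750e9, 1e14))
```

The result feeds directly into the MTTDL[2] formulas as Precon_fail.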
Comparing the MTTDL[1] model to
the MTTDL[2] model shows some interesting aspects of design. First,
there is no MTTDL[2] model for RAID-0 because there is no data
reconstruction – any failure and you lose data. Second, the MTTR
doesn't enter into the MTTDL[2] model until you get to double-disk
failure scenarios. You could nit pick about this, but as you'll soon
see, it really doesn't make any difference for our design decision
process. Third, you can see that the Precon_fail is a function of the
size of the data set. This is because the UER doesn't change as you
grow the data set. Or, to look at it from a different direction, if
you use consumer class drives with 1 UER for 10^14 bits, and you have
12.5 TBytes of data, the probability of an unrecoverable read during
the data reconstruction is 1. Ugh. If the Precon_fail is 1, then the
MTTDL[2] model looks a lot like the RAID-0 model and friends don't
let friends use RAID-0! Maybe you could consider a smaller sized
data set to offset this risk. Let's see how that looks in pictures.

2-way mirroring is an example of
a configuration which provides single-disk failure protection. Each
data point represents the space available when using 2-way mirrors in
a zpool. Since this is for a X4500, we consider 46 total available
disks and any disks not used for data are available as spares. In
this graph you can clearly see that the MTTDL[1] model encourages the
use of hot spares. More importantly, although the results of the
calculations of the two models are around 5 orders of magnitude
different, the overall shape of the curve remains the same. Keep in
mind that we are talking years here, perhaps 10 million years, which
is well beyond the 5-year expected life span of a disk. This is the
nature of the beast when using a constant MTBF. For models which
consider the change in MTBF as the device ages, you should never see
such large numbers. But the wish for more accurate models does not
change the relative merits of the design decision, which is
what we really care about – the best RAID configuration given a
bunch of disks. Should I use single disk failure protection or double
disk failure protection? To answer that, let's look at the model for
raidz2.
From this graph you can see that
double disk protection is clearly better than single disk protection
above, regardless of which model we choose. Good, this makes sense.
You can also see that with raidz2 we have a larger number of disk
configuration options. A 3-disk raidz2 set is somewhat similar to a
3-way mirror with the best MTTDL, but doesn't offer much available
space. A 4-disk set will offer better space, but not quite as good
MTTDL. This pattern continues through 8 disks/set. Judging from the
graphs, you should see that a 3-disk set will offer approximately an
order of magnitude better MTTDL than an 8-disk, for either MTTDL
model. This is because the UER remains constant while the data to be
reconstructed increases.
I hope that these models give
you an insight into how you can model systems for RAS. In my
experience, most people get all jazzed up with the space and forget
that they are often making a space vs. RAS trade-off. You can use
these models to help you make good design decisions when configuring
RAID systems. Since the graphs use Space on the X-axis, it is easy to look at the design trade-offs for a given amount of available space.
Just one more teaser... there are other MTTDL models,
but it is unclear if they would help make better decisions, and I'll
explore those in another blog.