Tuesday Apr 08, 2008
My colleague Christine asked me some questions about my holey files posts. These are really good questions, and I'm just a little surprised that more people didn't ask them... hey, that is what the comments section is for! So, I thought I would reply publicly, to help stimulate some conversation.
Q1. How could you have a degraded pool and data corruption w/o a repair?
I assume this pool must be raidz or mirror.
A1. No, this was a simple pool, not protected at the pool level. I used the ZFS copies parameter to set the number of redundant data copies to 2. For more information on how copies works, see my post with pictures.
There is another, hidden question here. How did I install Indiana such that it uses copies=2? By opening a shell and becoming root prior to beginning the install, I was able to set the copies=2 property just after the storage pool was created. By default, it gets inherited by any subsequent file system creation. Simple as that. OK, so it isn't that simple. I've also experimented with better ways to intercept the zpool create, but am not really happy with my hacks thus far. A better solution is for the installer to pick up a set of properties, but it doesn't, at least for now.
Q2. Can a striped pool be in a degraded state? Wouldn't a device
faulting in that pool render it unusable and therefore faulted?
A2. Yes, a striped storage pool can be in a degraded state. To understand this, you need to know the definitions of DEGRADED and FAULTED. Fortunately, they are right there in the zpool manual page.
DEGRADED
    One or more top-level vdevs is in the degraded state because one or
    more component devices are offline. Sufficient replicas exist to
    continue functioning.
...
FAULTED
    One or more top-level vdevs is in the faulted state because one or
    more component devices are offline. Insufficient replicas exist to
    continue functioning.
...
By default, there are multiple replicas, so for a striped volume it is possible to be in a DEGRADED state. However, I expect that the more common case will be a FAULTED state. In other words, I do tend to recommend a more redundant storage pool: mirror, raidz, raidz2.
Q3. What does filling the corrupted part with zero do for me? It doesn't
fix it, those bits weren't zero to begin with.
A3. Filling with zeros will just make sure that the size of the "recovered" file is the same as the original. Some applications get to data in a file via a seek to an offset (random access), so this is how you would want to recover the file. For applications which process files sequentially, it might not matter.
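To make the offset argument concrete, here is a small sketch using a toy 12-byte file with 4-byte blocks. The file names and block size are invented for the demonstration, and it uses the portable `skip=` operand rather than the Solaris `iseek=`. It compares two ways of "recovering" a file whose middle block is lost: dropping the block, and replacing it with nulls.

```shell
#!/bin/sh
# Sketch: why a zero-filled hole keeps later data at its original offset.
# All file names here are made up for the demonstration.
work=$(mktemp -d)

# Build a 3-block "original": 4 bytes each of A, B and C.
printf 'AAAABBBBCCCC' > "$work/orig"

# Pretend the middle block was unreadable.
# Recovery 1: drop it.  Recovery 2: replace it with 4 null bytes.
{ printf 'AAAA'; printf 'CCCC'; } > "$work/dropped"
{ printf 'AAAA'; dd if=/dev/zero bs=4 count=1 2>/dev/null; printf 'CCCC'; } > "$work/zeroed"

# A random-access reader seeks to byte offset 8 and reads 4 bytes.
read_at_8() { dd if="$1" bs=4 skip=2 count=1 2>/dev/null; }

orig_read=$(read_at_8 "$work/orig")
dropped_read=$(read_at_8 "$work/dropped")
zeroed_read=$(read_at_8 "$work/zeroed")
zeroed_size=$(wc -c < "$work/zeroed" | tr -d ' ')

echo "orig=$orig_read dropped=$dropped_read zeroed=$zeroed_read size=$zeroed_size"
rm -rf "$work"
```

The zero-filled copy still returns CCCC at offset 8 and is still 12 bytes long, while the copy with the block dropped has nothing at that offset any more.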
Thursday Mar 13, 2008
Bob Netherton took a look at my last post on corrupted file recovery (?) and asked whether I had considered using the noerror option to dd. Yes, I did experiment with dd and the noerror option. The noerror option is described in dd(1) as:
noerror
    Does not stop processing on an input error. When an input error
    occurs, a diagnostic message is written on standard error, followed
    by the current input and output block counts in the same format as
    used at completion. If the sync conversion is specified, the missing
    input is replaced with null bytes and processed normally. Otherwise,
    the input block will be omitted from the output.
This looks like the perfect solution, rather than my dd and iseek
script. But I didn't post this because, quite simply, I don't really
understand what I get out of it.
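The sync half of that description can be seen in isolation, without needing a corrupted file. The sketch below does not reproduce an I/O error; it only shows, on a deliberately short input block, the null padding that the manual page says sync applies. The 3-byte input and 8-byte block size are arbitrary choices for the demo.

```shell
#!/bin/sh
# Sketch: what conv=sync does to a short input block.
# A 3-byte input read with bs=8 is a "short" block; sync pads it to 8 bytes
# with nulls, while plain dd copies only the 3 bytes it actually read.
plain_size=$(printf 'abc' | dd bs=8 2>/dev/null | wc -c | tr -d ' ')
sync_size=$(printf 'abc' | dd bs=8 conv=sync 2>/dev/null | wc -c | tr -d ' ')
echo "plain=$plain_size sync=$sync_size"
```

By the manual page's description, an unreadable block under conv=noerror,sync would be replaced by a full block of nulls in the same way, which would keep the following data at its original offset.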
Recall that I had a corrupted file which is 2.9 MBytes in size.
Somewhere around 1.1 MBytes into the file, the data is corrupted and
fails the ZFS checksum test.
|
# zpool scrub zpl_slim
# zpool status -v zpl_slim
  pool: zpl_slim
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h2m with 1 errors on Tue Mar 11 13:12:42 2008
config:

        NAME        STATE     READ WRITE CKSUM
        zpl_slim    DEGRADED     0     0     9
          c2t0d0s0  DEGRADED     0     0     9

errors: Permanent errors have been detected in the following files:

        /mnt/root/lib/amd64/libc.so.1
# ls -ls /mnt/root/lib/amd64/libc.so.1
4667 -rwxr-xr-x 1 root bin 2984368 Oct 31 18:04 /mnt/root/lib/amd64/libc.so.1
|
I attempted to use dd with the noerror flag using several
different block sizes to see what I could come up with. Here are
those results:
|
# for i in 1k 8k 16k 32k 128k 256k 512k
> do
> dd if=libc.so.1 of=/tmp/whii.$i bs=$i conv=noerror
> done
read: I/O error
1152+0 records in
1152+0 records out
...
grond# ls -ls /tmp/whii*
3584 -rw-r--r-- 1 root root 1835008 Mar 13 11:27 /tmp/whii.128k
2464 -rw-r--r-- 1 root root 1261568 Mar 13 11:27 /tmp/whii.16k
2320 -rw-r--r-- 1 root root 1184768 Mar 13 11:27 /tmp/whii.1k
4608 -rw-r--r-- 1 root root 2359296 Mar 13 11:27 /tmp/whii.256k
2624 -rw-r--r-- 1 root root 1343488 Mar 13 11:27 /tmp/whii.32k
7168 -rw-r--r-- 1 root root 3670016 Mar 13 11:27 /tmp/whii.512k
2384 -rw-r--r-- 1 root root 1220608 Mar 13 11:27 /tmp/whii.8k
|
hmmm... all of these files are of
different sizes, so I'm really unsure what I've ended up with. None
of them are the same size as the original file, which is a bit
unexpected.
|
# dd if=libc.so.1 of=/tmp/whaa.1k bs=1k conv=noerror
read: I/O error
1152+0 records in
1152+0 records out
read: I/O error
1153+0 records in
1153+0 records out
read: I/O error
1154+0 records in
1154+0 records out
read: I/O error
1155+0 records in
1155+0 records out
read: I/O error
1156+0 records in
1156+0 records out
read: I/O error
1157+0 records in
1157+0 records out
# ls -ls /tmp/whaa.1k
2320 -rw-r--r-- 1 root root 1184768 Mar 13 11:12 /tmp/whaa.1k
|
hmmm... well, dd
did copy some of the file, but seemed to give up after around 5
attempts and I only seemed to get the first 1.1 MBytes of the file.
What is going on here? A quick look at the dd
source (open source is a good thing) shows that there is a
definition of BADLIMIT which is how many times dd
will try before giving up. The default compilation sets BADLIMIT to
5. Aha! A quick download of the dd
code and I set BADLIMIT to be really huge and tried again.
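With BADLIMIT in hand, the odd file sizes from the block-size experiment can at least be checked arithmetically. The sketch below is one possible reading, not something confirmed from the dd source. It assumes the unreadable region starts at byte 1179648 (the 1152 one-KByte records read before the first error) and that dd writes one full block for each failed read until it has failed BADLIMIT times. The function name `predict` is made up for the example.

```shell
#!/bin/sh
# Sketch: do the odd output sizes follow from BADLIMIT?
# Assumptions (mine, not confirmed by the dd source): the unreadable region
# starts at byte 1179648, and dd writes one full block per failed read
# until it has failed BADLIMIT (5) times.
bad_start=1179648
badlimit=5

predict() {
    bs=$1
    good=$((bad_start / bs))          # whole blocks read before the first error
    echo $(( (good + badlimit) * bs ))
}

for kb in 1 8 16 32 128 256 512; do
    echo "bs=${kb}k predicted=$(predict $((kb * 1024)))"
done
```

Under those assumptions the predicted sizes are 1184768, 1220608, 1261568, 1343488, 1835008, 2359296 and 3670016 bytes, which match the seven /tmp/whii.* files in the listing. That would explain why every size is a whole multiple of its block size and why none matches the original.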
|
# bigbaddd if=libc.so.1 of=/tmp/whbb.1k bs=1k conv=noerror
read: I/O error
1152+0 records in
1152+0 records out
...
read: I/O error
3458+0 records in
3458+0 records out
^C
I give up
# ls -ls /tmp/whbb.1k
6920 -rw-r--r-- 1 root root 3543040 Mar 13 11:47 /tmp/whbb.1k
|
As dd
processes the input file, it doesn't really do a seek, so it can't
really get past the corruption. It is getting something, because od
shows that the end of the whbb.1k
file is not full of nulls. But I really don't believe this is the
data in a form which could be useful. And I really can't explain why
the new file is much larger than the original. I suspect that dd
gets stuck at the corrupted area and does not seek beyond it. In any
case, it appears that letting dd
do the dirty work by itself will not achieve the desired results.
This is, of course, yet another opportunity...
Wednesday Mar 12, 2008
I was RASing around with ZFS the other day, and managed to find a
file which was corrupted.
|
# zpool scrub zpl_slim
# zpool status -v zpl_slim
  pool: zpl_slim
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h2m with 1 errors on Tue Mar 11 13:12:42 2008
config:

        NAME        STATE     READ WRITE CKSUM
        zpl_slim    DEGRADED     0     0     9
          c2t0d0s0  DEGRADED     0     0     9

errors: Permanent errors have been detected in the following files:

        /mnt/root/lib/amd64/libc.so.1
# ls -ls /mnt/root/lib/amd64/libc.so.1
4667 -rwxr-xr-x 1 root bin 2984368 Oct 31 18:04 /mnt/root/lib/amd64/libc.so.1
|
argv! Of course, this particular file
is easily extracted from the original media; it doesn't contain
anything unique. For those who might be concerned that it is the C
runtime library, and thus very critical to running Solaris, the
machine in use is only 32-bit, so the 64-bit (amd64) version of this
file is never used. But suppose this were an important file for me
and I wanted to recover something from it? This is a more interesting
challenge...
First, let's review a little bit about
how ZFS works. By default, when ZFS writes anything, it generates a
checksum which is recorded someplace else, presumably safe.
Actually, the checksum is recorded at least twice, just to be doubly
sure it is correct. And that record is also checksummed. Back to the
story, the checksum is computed on a block, not for the whole file.
This is an important distinction which will come into play later. If
we perform a storage pool scrub, ZFS will find the broken file and
report it to you (see above), which is a good thing -- much better
than simply ignoring it, like many other file systems will do.
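The block-versus-file distinction can be illustrated with ordinary tools. In this sketch, cksum(1) and a 4-byte block size are stand-ins picked for the demo; ZFS uses its own checksum algorithms and much larger blocks, and the file names are invented. The point is only that a checksum per block says which block is damaged, where a single whole-file checksum could only say that something is wrong.

```shell
#!/bin/sh
# Sketch: per-block checksums localize damage; a whole-file checksum cannot.
# cksum(1) and the 4-byte block size are stand-ins chosen for the demo;
# ZFS uses its own checksum algorithms and much larger blocks.
work=$(mktemp -d)
printf 'AAAABBBBCCCC' > "$work/good"
printf 'AAAABxBBCCCC' > "$work/bad"     # one changed byte in block 1

block_sum() {   # block_sum FILE INDEX -> checksum of that 4-byte block
    dd if="$1" bs=4 skip="$2" count=1 2>/dev/null | cksum | cut -d' ' -f1
}

damaged=""
for i in 0 1 2; do
    if [ "$(block_sum "$work/good" "$i")" != "$(block_sum "$work/bad" "$i")" ]; then
        damaged="$damaged $i"
    fi
done
echo "damaged blocks:$damaged"
rm -rf "$work"
```

Only block 1 is reported, so blocks 0 and 2 can still be trusted. That is the property the recovery relies on.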
OK, so we know that somewhere in the
midst of this 2.8 MByte file, we have some corruption. But can we at
least recover the bits that aren't corrupted? The answer is yes.
But if you try a copy, then it bails with an error.
|
# cp /mnt/root/lib/amd64/libc.so.1 /tmp
/mnt/root/lib/amd64/libc.so.1: I/O error
|
Since the copy was not successful,
there is no destination file, not even a partial file. It turns out
that cp
uses mmap(2) to map the
input file and copies it to the output file with a big write(2).
Since the write doesn't complete correctly, it complains and removes
the output file. What we need is something less clever, dd.
|
# dd if=/mnt/root/lib/amd64/libc.so.1 of=/tmp/whee
read: I/O error
2304+0 records in
2304+0 records out
# ls -ls /tmp/whee
2304 -rw-r--r-- 1 root root 1179648 Mar 12 18:53 /tmp/whee
|
OK, from this experiment we know that
we can get about 1.2 MBytes by directly copying with dd. But this
isn't all, or even half of the file. We can get a little more clever
than that. To make it simpler, I wrote a little ksh
script:
|
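#!/bin/ksh
# Copy the first 23 128-KByte blocks of $1 into separate files $2.00 .. $2.22
integer i=0
while ((i < 23))
do
        typeset -RZ2 j=$i       # zero-padded, two-digit block number
        dd if=$1 of=$2.$j bs=128k iseek=$i count=1
        i=i+1
done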
|
This script will write each of the
first 23 128kByte blocks from the first argument (a file) to a unique
filename as a number appended to the second argument. dd
is really dumb and doesn't offer much error handling which is why I
hardwired the count into the script. An enterprising soul with a
little bit of C programming skill could do something more complex
which handles the more general case. OK, that was difficult to
understand, and I wrote it. To demonstrate (I apologize in advance for
the redundant verbosity):
|
# ./getaround.ksh libc.so.1 /tmp/zz
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
read: I/O error
0+0 records in
0+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
0+1 records in
0+1 records out
# ls -ls /tmp/zz.*
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.00
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.01
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.02
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.03
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.04
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.05
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.06
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.07
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.08
  0 -rw-r--r-- 1 root root      0 Mar 12 19:00 /tmp/zz.09
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.10
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.11
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.12
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.13
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.14
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.15
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.16
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.17
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.18
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.19
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.20
256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.21
200 -rw-r--r-- 1 root root 100784 Mar 12 19:00 /tmp/zz.22
|
So we can clearly see that the 10th
(128kByte) block is corrupted, but the rest of the blocks are ok. We
can now reassemble the file with a zero-filled block.
|
# dd if=/dev/zero of=/tmp/zz.09 bs=128k count=1
1+0 records in
1+0 records out
# cat /tmp/zz.* > /tmp/zz
# ls -ls /tmp/zz
5832 -rw-r--r-- 1 root root 2984368 Mar 12 19:03 /tmp/zz
|
Now I have recreated the file with a
zero-filled hole where the data corruption was. Just for grins, if
you try to compare with the previous file, you should get what you
expect.
|
# cmp libc.so.1 /tmp/zz+
cmp: EOF on libc.so.1
|
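For the more general case, the block-by-block copy and the zero fill can be folded into one loop that works out the block count from the file size. This is a rough sketch under stated assumptions, not a tested recovery tool: it assumes a POSIX sh, a dd that accepts `skip=` (the Solaris `iseek=` would do the same job), and that a read error leaves dd either failing or writing a short block. The function name `recover_blocks` and its arguments are hypothetical.

```shell
#!/bin/sh
# Sketch of a more general getaround: copy a file block by block and
# substitute a block of nulls wherever a read fails or comes up short.
# usage: recover_blocks SOURCE DEST [BLOCKSIZE_IN_BYTES]
recover_blocks() {
    src=$1; dst=$2; bs=${3:-131072}
    # Take the size from the directory listing so the data is not read here.
    size=$(ls -l "$src" | awk '{print $5}')
    nblocks=$(( (size + bs - 1) / bs ))
    tmp=$(mktemp)
    : > "$dst"
    filled=0
    i=0
    while [ "$i" -lt "$nblocks" ]; do
        # Every block is bs bytes except possibly the last one.
        want=$(( size - i * bs ))
        [ "$want" -gt "$bs" ] && want=$bs
        if ! dd if="$src" of="$tmp" bs="$bs" skip="$i" count=1 2>/dev/null ||
           [ "$(ls -l "$tmp" | awk '{print $5}')" -ne "$want" ]; then
            # Unreadable or short block: replace it with nulls of the same length.
            dd if=/dev/zero of="$tmp" bs="$want" count=1 2>/dev/null
            filled=$((filled + 1))
        fi
        cat "$tmp" >> "$dst"
        i=$((i + 1))
    done
    rm -f "$tmp"
    echo "$nblocks blocks processed, $filled zero-filled"
}

# Example call (paths taken from the listings above):
# recover_blocks /mnt/root/lib/amd64/libc.so.1 /tmp/zz
```

On a healthy file this simply reproduces the input. On the file above it should, if the assumptions hold, give the same result as the manual steps: 23 blocks with one of them zero-filled.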
How is this useful?
Personally, I'm not sure this will be
very useful for many corruption cases. As a RAS guy, I advocate many
verified copies of important data placed on diverse systems and
media. But most folks aren't so inclined. Every time we talk about
this on the zfs-discuss alias, somebody will say that they don't care
about corruption in the middle of their mp3 files. I'm no audiophile,
but I prefer my mp3s to be hole-less. So I did this little exercise
to show how you can regain full access to the non-corrupted bits of a
corrupted file in a more-or-less easy way. Consider this a proof of
concept. There are many possible variations, such as filling with
spaces instead of nulls
when you are missing parts of a text file -- opportunities abound.
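As a small sketch of the spaces variation: the filler block can be built by pushing /dev/zero through tr(1). The 128 KByte size matches the block size used in the script earlier; the variable names are made up for the example.

```shell
#!/bin/sh
# Sketch: build a 128 KByte filler block of spaces instead of nulls,
# for patching a hole in a text file.
filler=$(mktemp)
dd if=/dev/zero bs=128k count=1 2>/dev/null | tr '\0' ' ' > "$filler"

filler_size=$(wc -c < "$filler" | tr -d ' ')
non_space=$(tr -d ' ' < "$filler" | wc -c | tr -d ' ')
echo "size=$filler_size non-space bytes=$non_space"
```

The resulting file can stand in for the missing block (for example, copied over /tmp/zz.09 before the cat step) so that a text viewer shows blank space rather than null bytes.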
Don't forget 'conv=sync'; that may help. (otherw...
The ",sync" part is important in the con...
I agree that if dd actually handled the EIO proper...