Tuesday August 12, 2008 | Constantin's Blooog |
Useful stuff for your blog-reading pleasure.
ZFS saved my data. Right now.
For storage, I use Western Digital's MyBook Essential Edition USB drives because they are the cheapest ones I could find from a well-known brand. The packaging says "Put your life on it!". How fitting.
Last week, I had a team meeting and a colleague introduced us to some performance tuning techniques. When we started playing with iostat(1M), I logged into my server to do some stress tests. That was when my server said something like this:
constant@condorito:~$ zpool status
(data from other pools omitted)
pool: santiago
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub completed after 16h28m with 0 errors on Fri Aug 8 11:19:37 2008
config:
NAME STATE READ WRITE CKSUM
santiago DEGRADED 0 0 0
mirror DEGRADED 0 0 0
c10t0d0 DEGRADED 0 0 135 too many errors
c9t0d0 DEGRADED 0 0 20 too many errors
mirror ONLINE 0 0 0
c8t0d0 ONLINE 0 0 0
c7t0d0 ONLINE 0 0 0
errors: No known data errors
This tells us 3 important things:
Over the weekend, I ordered myself a new disk (sheesh, they dropped EUR 5 in price already after just 5 days...) and after a "
constant@condorito:~$ zpool status
(data from other pools omitted)
pool: santiago
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: resilver in progress for 1h13m, 6.23% done, 18h23m to go
config:
NAME STATE READ WRITE CKSUM
santiago DEGRADED 0 0 0
mirror DEGRADED 0 0 0
replacing DEGRADED 0 0 0
c10t0d0 DEGRADED 0 0 135 too many errors
c11t0d0 ONLINE 0 0 0
c9t0d0 DEGRADED 0 0 20 too many errors
mirror ONLINE 0 0 0
c8t0d0 ONLINE 0 0 0
c7t0d0 ONLINE 0 0 0
errors: No known data errors
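Reading the CKSUM column by eye works for one pool, but the same check is easy to script. The following is a minimal Python sketch, not part of the original setup. It assumes the config section has the five-column layout shown above with plain integer counters, and the names `SAMPLE`, `ROW` and `devices_with_errors` are made up for illustration.

```python
import re

# Config section copied from the status output above.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
santiago DEGRADED 0 0 0
mirror DEGRADED 0 0 0
replacing DEGRADED 0 0 0
c10t0d0 DEGRADED 0 0 135 too many errors
c11t0d0 ONLINE 0 0 0
c9t0d0 DEGRADED 0 0 20 too many errors
mirror ONLINE 0 0 0
c8t0d0 ONLINE 0 0 0
c7t0d0 ONLINE 0 0 0
"""

# Assumed row layout: name, state, three integer counters, optional free-text note.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s+(.*))?$")


def devices_with_errors(config_text):
    """Return (name, read, write, cksum) for every row with a nonzero counter."""
    flagged = []
    for line in config_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header line or anything that does not fit the assumed layout
        name = match.group(1)
        read, write, cksum = (int(match.group(i)) for i in (3, 4, 5))
        if read or write or cksum:
            flagged.append((name, read, write, cksum))
    return flagged


if __name__ == "__main__":
    for name, read, write, cksum in devices_with_errors(SAMPLE):
        print(f"{name}: read={read} write={write} cksum={cksum}")
```

On the output above, this flags c10t0d0 (135 checksum errors) and c9t0d0 (20 checksum errors) and skips everything else. Counters printed in any other form than plain integers would simply not match and would need a looser pattern.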
The next step for me is to send the c10t0d0 drive back and ask for a replacement under warranty (it's only a couple of months old). After receiving c10's replacement, I'll consider sending in c9 for replacement (depending on how the next scrub goes).
Which makes me wonder: How will drive manufacturers react to a new wave of warranty cases based on drive errors that were not easily detectable before?
[1] To the guys at Drobo: Of course you're invited to implement ZFS into the next revision of your products. It's open source. In fact, Drobo and ZFS would make a perfect team!
This entry was created on 2008-08-12 06:44:22.0 PST and is associated with the following tags:
corruption
data
drobo
integrity
opensolaris
solaris
storage
zfs
This is Tom from Data Robotics, makers of Drobo. ZFS is definitely an interesting technology and one that we'll explore as it could be complementary to Drobo as you point out. Right now though we haven't seen very wide adoption of ZFS and it is still a little rough around the edges from a user interface point of view (especially relative to Drobo) :)
Also, in the situation you pointed out, your data would have been just fine on a Drobo. Please see the following blog posting to learn more:
http://www.drobospace.com/blog/entry/11007/How-Does-Drobo-Protect-My-Data-/
See the "soft failures" section.
That said, I do understand that ZFS has some unique features--so thanks very much for the feedback. Feel free to email me with more thoughts!
Posted by Tom on August 12, 2008 at 07:42 PM CEST #
It's quite simple: if you care about your data, you store it on sensible file systems.
And before something like ReiserFS breaks, I'd rather delete my data myself ;)
Posted by Sebastian on August 12, 2008 at 09:43 PM CEST #
So ... before you return disks and foist the issue onto the disk vendors, what common components in the data path can you identify that may be responsible for the corruption? You've been running happily for a while and then corruption appears on two distinct devices in a similar timeframe ;-)
Common driver, PCI bus, PCI card, USB controller, etc ... there are a number of documented cases of power supply issues causing corruption which was picked up by ZFS checksumming for instance ...
Just a thought, but a bit of SGR may be called for!
Posted by Craig Morgan on August 12, 2008 at 10:20 PM CEST #
Hi Constantin, sorry to hear of your drive errors. As it's the summer, and you have USB disks enclosed in a small box, could it be that the drives are overheating? What happens if you run this script to check the drive temps?
http://breden.org.uk/2008/05/16/home-fileserver-drive-temps/
I'm running my drives around 40 degrees C and, so far, in 7 months of operation, I have not seen any read, write or checksum errors after scrubbing the pool.
Cheers,
Simon
Posted by Simon Breden on August 12, 2008 at 11:05 PM CEST #
Hi,
first of all: thank you all for your comments. I wasn't expecting that many comments in such a short time!
Tom, thanks for commenting, and let me point out that Drobo is truly a brilliant concept. You mention the ability of drives to detect broken blocks upon reading (soft failures). This kind of detection relies on the data block being transported correctly to the drive. As Craig pointed out, the issue might have been the power supply jamming the bus or the USB connection. Also, the drive only uses 8 bits per block for error detection, which is not enough to detect multiple bit errors per block. In this particular case, the drive did not report read errors (I checked the logfiles). The checksum errors that ZFS reported were at the ZFS level and not the drive level, so ZFS could detect what the drive couldn't.
In a possible Drobo scenario, the controller inside the Drobo box could implement ZFS and provide better data protection. Even better would be a ZFS implementation at the driver level on the host, but that may be difficult to gain acceptance for among the user base.
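The difference between drive-level and filesystem-level detection can be shown with a toy model. This is only a sketch under stated assumptions, not how ZFS is implemented: SHA-256 stands in for the Fletcher checksum ZFS uses by default, the class `MirroredBlock` and its `scrub` method are hypothetical names, and the "drives" are just byte arrays that never report a read error.

```python
import hashlib


def checksum(block: bytes) -> bytes:
    # Stand-in checksum for this sketch; ZFS defaults to a Fletcher checksum.
    return hashlib.sha256(block).digest()


class MirroredBlock:
    """Toy model: two copies of one block, with the checksum kept apart from both."""

    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]
        self.expected = checksum(data)  # stored separately from the data itself
        self.cksum_errors = [0, 0]      # per-copy counters, like the CKSUM column

    def scrub(self) -> bytes:
        """Verify every copy, count mismatches, and repair bad copies from a good one."""
        good = None
        bad = []
        for index, copy in enumerate(self.copies):
            if checksum(bytes(copy)) == self.expected:
                if good is None:
                    good = bytes(copy)
            else:
                self.cksum_errors[index] += 1
                bad.append(index)
        if good is None:
            raise IOError("no copy matches the stored checksum")
        for index in bad:
            self.copies[index] = bytearray(good)  # rewrite the damaged copy
        return good


if __name__ == "__main__":
    block = MirroredBlock(b"important data")
    block.copies[0][3] ^= 0x01  # silent bit flip, e.g. somewhere on the bus
    print(block.scrub(), block.cksum_errors)
```

The flipped bit never produces a read error from the "drive", yet the scrub notices it, counts it against the first copy, and repairs that copy from the mirror. That matches the pattern in the status output: zero READ and WRITE errors, nonzero CKSUM errors, and no known data errors.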
To Craig: Yes, the errors could have been caused by common components such as the motherboard, the USB controller etc. Incidentally, both drives that exhibited errors are 1 TB drives, while the other two drives are only 512 GB. Also, one of the failed drives reported read errors (that were retryable) a while ago, so in this particular case I'm thinking that 1 TB is not (yet?) as reliable as 512 GB. Let's see what WD says.
To Simon: Thanks for the pointer. I'll download and install the monitoring scripts and check the temperature. My server is in a basement and the drives are set up with their holes to the top, so I think I'm ok here, but it never hurts to check.
Again, thanks and keep up the good comments!
Cheers,
Constantin
Posted by Constantin Gonzalez on August 13, 2008 at 09:58 AM CEST #