|
I've been using ZFS internally for awhile now. For someone who used to administer several machines with Solaris Volume Manager (SVM), UFS, and a pile of aging JBOD disks, my experience so far is easily summed up: "Dude this so @#%& simple, so reliable, and so much more powerful, how did I never live without it??"
So, you can imagine my excitement when ZFS finally hit the gate. The very next day I BFU'ed my workstation, created a ZFS pool, setup a few filesystems and (four commands later, I might add) started beating on it.
Imagine my surprise when my machine stayed up less than two hours!!
No, this wasn't a bug in ZFS... it was a fatal checksum error. One of those "you might want to know that your data just went away" sort of errors. Of course, I had been running UFS on this disk for about a year, and apparently never noticed the silent data corruption. But then I reached into the far recesses of my brain, and I recalled a few strange moments -- like the one time when I did a bringover into a workspace on the disk, and I got a conflict on a file I hadn't changed. Or the other time after a reboot I got a strange panic in UFS while it was unrolling the log. At the time I didn't think much of these things -- I just deleted the file and got another copy from the golden source -- or rebooted and didn't see the problem recur -- but it makes sense to me now. ZFS, with its end-to-end checksums, had discovered in less than two hours what I hadn't known for almost a year -- that I had bad hardware, and it was slowly eating away at my data.
Figuring that I had a bad disk on my hands, I popped a few extra SATA drives in, clobbered the disk and this time set myself up a three-disk vdev using raidz. I copied my data back over, started banging on it again, and after a few minutes, lo and behold, the checksum errors began to pour in:
elowe@oceana% zpool status
pool: junk
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool online' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
junk ONLINE 0 0 0
raidz ONLINE 0 0 0
c0d0 ONLINE 0 0 0
c3d0 ONLINE 0 0 1
c3d1 ONLINE 0 0 0
A checksum error on a different disk! The drive wasn't at fault after all.
I emailed the internal ZFS interest list with my saga, and quickly got a response. Another user, also running a Tyan 2885 dual-Opteron workstation like mine, had experienced data corruption with SATA disks. The root cause? A faulty power supply.
Since my data is still intact, and the performance isn't hampered at all, I haven't bothered to fix the problem yet. I've been running over a week now with a faulty setup which is still corrupting data on its way to the disk, and have yet to see a problem with my data, since ZFS handily detects and corrects these errors on the fly.
Eventually I suppose I'll get around to replacing that faulty power supply...
Technorati Tags:
[OpenSolaris
Solaris
ZFS]
(2005-11-17 15:06:06.0/2005-11-16 10:34:14.0)
Permalink
Trackback: http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta
|
Posted by Asgeir S. Nilsen on November 29, 2005 at 02:45 PM CST #
Posted by Brian Hechinger on January 11, 2006 at 08:35 AM CST #
Posted by Humberto Ramirez on August 04, 2006 at 12:19 AM CDT #