I woke up today to find that one of the disks in my home server had failed overnight. I was actually able to work this out while still in bed, because I could hear it clicking and whirring pathetically as it tried to spin up - not a nice way to start your day. As I write this I'm in the process of filing an RMA to get the disk replaced, which promises to be a painful, drawn-out process, but hey - at least my data is still safe thanks to ZFS (so long as none of my other disks decide to break - not inconceivable, seeing as they're all identical...).
However, hardware faults aren't always audible, so I was pleased to see that my script for detecting hardware faults and then emailing me had triggered. Here's what I got sent:
-------- Original Message --------
Subject: Hardware failed on zebedee
Date: Sat, 25 Apr 2009 13:54:02 +0200
From: lamsey@zebedee
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Apr 25 09:32:07 43d4b6e4-1219-e9d5-bac5-f829b8fb2f2a ZFS-8000-D3    Major
Fault class : fault.fs.zfs.device
Description : A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for
more information.
Response : No automated response will occur.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
Logging into my server and running zpool status -x showed me which disk was at fault (c4d0). A bit of searching in the output of prtconf -v then gave me the serial numbers of the disks that were still working, so after cracking the box open I could identify the broken physical disk by a process of elimination.
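If you'd rather not eyeball the zpool status output, the faulty device name can be pulled out with a bit of awk. This is only a rough sketch: list_bad_devices is a made-up helper name, the sample text is invented for illustration (it is not what my pool actually printed), and on Solaris you may need nawk rather than the old /usr/bin/awk.

```shell
#!/bin/sh
# Sketch only: list_bad_devices is a hypothetical helper, and the sample
# text below is invented for illustration, not captured from a real pool.

# Print the NAME column of every row whose STATE column is one of the
# listed "unusable" states. DEGRADED is left out of this sketch because in
# the sample it marks the pool and the raidz1 vdev rather than a single
# disk; add it to the pattern if you want those rows as well.
list_bad_devices() {
    awk '$2 ~ /^(FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ { print $1 }'
}

sample='  pool: shared
 state: DEGRADED
config:
        NAME        STATE     READ WRITE CKSUM
        shared      DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c3d0    ONLINE       0     0     0
            c4d0    UNAVAIL      0     0     0  cannot open'

# Prints: c4d0
printf '%s\n' "$sample" | list_bad_devices
```

On a real box you would feed it live output instead, e.g. zpool status -x | list_bad_devices.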
So, how do I achieve the above? The answer is actually incredibly simple. The content of the email is just the output of fmadm faulty, a command which interrogates Solaris' FMA (Fault Management Architecture) feature to see if there are any hardware issues on a system. Wrap it up in a script (the below is based on one I found on the 'net eons ago and can no longer find), and you end up with something like:
lamsey@zebedee:bin$ cat check_hardware.ksh
#!/bin/ksh
# Public domain. Use as you wish.

EMAIL=liam@lamsey.co.uk
TMPFILE=/tmp/fmadm.output.$$

# run fmadm and cut away the first two lines (headers)
/usr/bin/pfexec /usr/sbin/fmadm faulty | /usr/bin/sed 1,2d > $TMPFILE

# Check if the file size is greater than zero. This means we got
# some output from fmadm and therefore some hardware may be bad.
# Using HTML here means we can use <pre> to preserve formatting.
if [ -s $TMPFILE ]; then
    (
        /usr/bin/echo "Subject: Hardware failed on `hostname`"
        /usr/bin/echo "From: lamsey@zebedee"
        /usr/bin/echo "MIME-Version: 1.0"
        /usr/bin/echo "Content-Type: text/html"
        /usr/bin/echo "Content-Disposition: inline"
        /usr/bin/echo
        /usr/bin/echo '<pre>'
        # don't just use the temp file, it's missing headers
        /usr/bin/pfexec /usr/sbin/fmadm faulty
        /usr/bin/echo '</pre>'
    ) | /usr/local/bin/msmtp -a 1and1 $EMAIL
fi

# clean up the temp file
/usr/bin/rm -f $TMPFILE
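If you want to try the "strip the headers, then see if anything is left" logic without a Solaris box (or without waiting for a disk to die), the same check can be sketched as a pair of functions that read from stdin. The names strip_headers and has_faults are my own invention for this example, and the sample text is a shortened stand-in for real fmadm output.

```shell
#!/bin/sh
# Sketch only: strip_headers and has_faults are hypothetical helper names,
# not part of check_hardware.ksh.

# Drop the first two lines (headers) of fmadm-style output read from stdin.
strip_headers() {
    sed 1,2d
}

# Succeed (exit 0) if anything is left once the first two lines are gone,
# which is the same condition the [ -s $TMPFILE ] test checks for.
has_faults() {
    [ -n "$(strip_headers)" ]
}

# Shortened stand-in for real fmadm faulty output.
sample='--------------- -------- ----------- ---------
TIME            EVENT-ID MSG-ID      SEVERITY
--------------- -------- ----------- ---------
Apr 25 09:32:07 43d4b6e4 ZFS-8000-D3 Major'

if printf '%s\n' "$sample" | has_faults; then
    echo "faults found"
else
    echo "no faults"
fi
```

With the sample above this prints "faults found"; feed it empty input or only two header lines and it prints "no faults".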
Simply slap a call to check_hardware.ksh into your crontab, ideally running at least once a day, and you're good to go. Note that I use msmtp for sending emails automatically as it's a heck of a lot easier to configure than sendmail (which is important if you use an ISP like o2 which blocks outgoing SMTP traffic, preventing you from using sendmail in its out-of-the-box configuration). It doesn't come with Solaris though, so you'll need to compile it yourself if you want to do the same (very simple, it builds fine with configure / make / make install).
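For reference, a crontab entry along these lines would do the job. The path and the time of day are assumptions for the sake of the example (adjust them to wherever you keep the script and whenever suits you); add it via crontab -e as the user that should run the check.

```
# minute hour day-of-month month day-of-week command
# Run the hardware check every day at 07:00 (path is hypothetical).
0 7 * * * /export/home/lamsey/bin/check_hardware.ksh
```

Running it more often just means swapping the hour field for a list such as 1,7,13,19.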
Edit (01/5/09): I received the replacement disk today (took them long enough...). Slammed it into the server, issued a quick zpool replace shared c4d0 command (the pool name goes before the device), and all is good with the world again :-)
lamsey@zebedee:~$ zpool status shared
pool: shared
state: ONLINE
scrub: resilver completed after 3h13m with 0 errors on Fri May 1 17:12:23 2009
config:
        NAME        STATE     READ WRITE CKSUM
        shared      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3d0    ONLINE       0     0     0  217M resilvered
            c3d1    ONLINE       0     0     0  217M resilvered
            c4d0    ONLINE       0     0     0  236G resilvered
            c4d1    ONLINE       0     0     0  217M resilvered
            c6d1    ONLINE       0     0     0  217M resilvered
errors: No known data errors
