Friday February 04, 2005
System failures

So, starting last Friday, I had a series of embarrassing system failures on the servers I run in my office, here at the Catnip Coast. I feel they are embarrassing because they all were preventable - and I really should have been thinking better, and recovered more quickly.

Quick background: for a few years now, I've run a DNS/web/mail server at home for my family and friends. This serves a handful of domains I've registered for myself (holyhippie .net, .org, .com; and catnipcoast.com) and one I registered for my college friends (backtable.org). There are about a dozen people with mail accounts on this system; three people outside my house use it on a semi-regular basis. My wife uses this as her primary mail account, and I use it as a secondary account.

The first incarnation of this server was on a SparcStation 5 named fnord. He ran the site quite well, until about two years ago, when I got an Ultra 2 (named muscat) to replace him. At the time, I also had in my office the hardware for the third incarnation: a SunFire v120 named bocana.

Muscat has the ability to have two internal, hot-pluggable hard drives. I had originally set it up so the system was mirrored across both drives; in case of a failure in one, it wouldn't take the whole system down. A few months ago, the bearings on one of the drives started to go, and it developed an unbearable high-pitched whine. Rather than live with it whining at me, I just pulled the drive, and let muscat run on just one.

The first event: Friday, a bit before 11AM, a power failure happens. The first thing I notice is that a bunch of things on my desk suddenly shut off, but my laptop stays on. I think "Power to the house is out". It takes a couple of seconds to realize that was not the case, and that instead the UPS had failed (Uninterruptible Power Supply: a battery for computers, so that if the power fails, the computer won't).

Now, the effect of the UPS failing was to take down exactly the machines I needed to be up: muscat, the DSL router, my wireless, my monitor ... many things are now dead in the water, and I start scrambling to re-plug everything in.

Once I have things plugged in, I try to bring muscat back up. He won't boot. The failure is odd and cryptic. It seems to me like the software doing the mirroring of the root drive is having a brain fart. Much cursing on my part ensues; see, I don't have a CD-ROM drive for muscat now (he can only use SCSI), and I don't have on hand another way to boot him to where I can start recovering.

So, I start shuffling things around, and making sure that things won't go astray. Bocana was sitting there, waiting to be configured in muscat's place, so I start setting bocana up to take over muscat's job.

Saturday, I go on a quest to the closest Fry's (a 40 minute drive away) to find a SCSI CD-ROM. I figure there is no other retail place in the area likely to carry such a thing; and I'm surprised to find that Fry's doesn't. While I'm there, I have a realization: muscat's hard drives probably will plug into bocana. I also pick up a new UPS for home. I figured that the battery for the old UPS might be replaceable, but right now it was less hassle (and not that expensive) to just get a new one.

Saturday evening, when I get home, I try it out. Sure enough, I can take the drives out of muscat and plug them into bocana. I do a bit of fiddling, but I can't figure out how to repair the drive to make it bootable again. Oh well - at least I can get the data off the drive, and I do just that. I put a bit more work in, and bocana can now do almost all the DNS, web, DHCP and firewall stuff that muscat used to do. I'm happy at the progress, so I leave the LDAP and mail stuff for later.

Sunday, it's time to do projects around the house. Up for today: replacing the dimmer in the dining room. This means that I have to turn off breakers until I find the right one. It turns out that my office is on the same breaker as the dining room overhead light. The new UPS is plugged in, but nothing is plugged into it yet.

When I get back to my office after finishing work, I try to boot bocana. It won't boot. The failure is odd and cryptic - and seems like the software doing the mirroring of the root drive is having a brain fart. This time, the cursing is at about twice the volume as before.

I now have two dead systems, no way to boot either one (bocana doesn't have a CD-ROM drive either), and no other systems I can plug the hard drives into.

I spend a lot of time Sunday night packing up bocana and muscat, and getting ready to drive to the office - a mere 2.5 hour drive. I leave Monday, 7:30AM.

Once I get to the office, tons of stuff is happening. I'm running to various meetings, and stealing time in between to work on bocana. I find a CD-ROM in another system in my lab that will go into bocana - so at least I can boot it now. I still can't figure out how to recover the drives. However, since bocana's data was mirrored, I can experiment on one half of the mirror without risking everything. I work on this until 7PM, then drive home. Meanwhile, Valkyrie has called a couple of times, politely complaining that she can't check her email.

Sometime around now, I have a realization - I've been trying to solve this problem the wrong way. I've been trying to get the mirroring fixed, and haven't been able to. What I can do instead is tell the system to ignore the mirroring, and treat the disk partition that holds one half of the mirror as if it were just a normal filesystem.
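For the curious, here is a rough sketch of the idea, not the exact steps I took. It assumes the mirroring is Solaris Volume Manager-style, where the filesystem table (`/etc/vfstab`) mounts metadevices such as `/dev/md/dsk/d0` instead of plain slices such as `/dev/dsk/c0t0d0s0`. The function name, the `d0` metadevice and the `c0t0d0s0` slice are all made up for illustration; a real system would have its own names, and its root device setting would also need to point back at the plain slice, which this sketch does not cover.

```python
def unmirror_vfstab(vfstab_text, submirror_slices):
    """Return vfstab text with metadevice paths replaced by plain slices.

    submirror_slices maps a metadevice name (e.g. "d0") to the slice that
    holds one half of that mirror (e.g. "c0t0d0s0"). Both names are
    hypothetical examples, not values from any real system.
    """
    out = []
    for line in vfstab_text.splitlines():
        fields = line.split()
        # Leave comments and blank lines exactly as they are.
        if not fields or fields[0].startswith("#"):
            out.append(line)
            continue
        new_fields = []
        for field in fields:
            # Handle both the block (dsk) and raw (rdsk) device paths.
            for kind in ("dsk", "rdsk"):
                prefix = f"/dev/md/{kind}/"
                if field.startswith(prefix):
                    name = field[len(prefix):]
                    if name in submirror_slices:
                        field = f"/dev/{kind}/{submirror_slices[name]}"
            new_fields.append(field)
        out.append("\t".join(new_fields))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "/dev/md/dsk/d0\t/dev/md/rdsk/d0\t/\tufs\t1\tno\t-\n"
    print(unmirror_vfstab(sample, {"d0": "c0t0d0s0"}), end="")
```

The function only transforms text and never touches a disk, so it can be tried on a copy of the file first. Metadevices that are not in the mapping are left alone.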

Things start falling together once I've realized this. I can now boot bocana off of one of his old drives, without mirroring. I'm back to where I was Sunday morning - and feverishly work on finishing the job of moving all of the services that muscat used to provide to bocana. It takes a while, but around 3AM Tuesday, everything seems finished and working.

The embarrassing part - both of the key realizations that led to fixing things (that I could put muscat's drives in bocana, and how to turn off mirroring) are things I have known for a while, and done before. I should have realized them right off, and been able to resurrect muscat on Friday, without having to spend many hours on the road and more hacking away.

Regardless, things are fixed now, and all should be working fine. In the process, I got to get familiar with Solaris 10 - and there's some really cool new stuff in it.

Copyright (C) 2003, Capitan Holy Hippie's ramblings