Friday May 16, 2008 | Notes from a Carbon Based Life Form thoughts, opinions, and drivel. 100% free, guaranteed. |
|
diskread: reading beyond end of ramdisk (& how I recovered) We had to do a maintenance to replace a NEM module in a Sun Blade 8000 Modular System. Two of my team mates went on down to the datacenter on other business and graciously offered to SWAP the NEM for me. The pulled the old one out, stuck the new one in. That's as simple as it should have been. Should have been. I wish. Instead, the chassis started to freak out, cycling it's power over and over, and somehow was taking the CMM with it. In between one set of cycles, I was able to connect to the CMM via console and paste in a bunch of commands to shut down chassis power. I let it sit for a moment, then began to power up the system. First the chassis, then the individual blades. One blade came up, no problem. The next two, though, were very much less than happy, spitting out errors like:diskread: reading beyond end of ramdisk start = 0x2000, size = 0x2000 failed to read superblock diskread: reading beyond end of ramdisk start = 0x2000, size = 0x2000 failed to read superblock panic: cannot mount boot archive Press any key to rebootThe GRUB menu was coming up OK, though, so I pressed the trusty any key, booted into Solaris 10 Failsafe mode. This was no picnic either. SunOS Release 5.10 Version Generic_120012-14 32-bit
Copyright 1983-2007 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Booting to milestone "milestone/single-user:default".
Configuring devices.
Searching for installed OS instances...
NOTICE: /a: unexpected free inode 5825, run fsck(1M)
/dev/dsk/c2t0d0s0 is under md control, skipping.
To manually recover the boot archive on a root mirror,mount the first
side (the one that the system boots from) and run:
bootadm update-archive -R
My immediate thought was "WTF? No installed OS instance found?" Closer inspection revealed that it had in fact found two possibilities, but one c2t1d0s0 was inconsistent and needed a fsck, and the second c2t1d0s0 was under md control, and so being skipped.
An fsck of /dev/dsk/c2t0d0s0 revealed a few inconsistencies. Here's an example. I think this was actually the 3rd of 4 fscks I ran on this dev:
bash-3.00# fsck /dev/dsk/c2t0d0s0 ** /dev/rdsk/c2t0d0s0 ** Last Mounted on / ** Phase 1 - Check Blocks and Sizes ** Phase 2 - Check Pathnames ** Phase 3a - Check Connectivity ** Phase 3b - Verify Shadows/ACLs ** Phase 4 - Check Reference Counts UNREF FILE I=1457 OWNER=root MODE=100644 SIZE=657 MTIME=May 15 18:01 2008 RECONNECT? y UNREF FILE I=1458 OWNER=root MODE=100644 SIZE=675 MTIME=May 15 18:06 2008 RECONNECT? y ** Phase 5 - Check Cylinder Groups CORRECT BAD CG SUMMARIES? y CORRECTED SUMMARY FOR CG 0 FRAG BITMAP WRONG FIX? y FRAG BITMAP WRONG (CORRECTED) CORRECTED SUMMARY FOR CG 4 CORRECTED SUMMARY FOR CG 12 CORRECTED SUMMARY FOR CG 30 CORRECTED SUMMARY FOR CG 70 CORRECT GLOBAL SUMMARY SALVAGE? y Log was discarded, updating cyl groups 46737 files, 1720899 used, 24099860 free (21460 frags, 3009800 blocks, 0.1% fragmentation) ***** FILE SYSTEM WAS MODIFIED *****So far, so good. Let's reboot, and see if we an come up in a multi-user state. So ... reboot ... wait ... wait ... CRAP! Same panic as our previous boot. We're missing something. A further delve into google reveals that I need to recreate the ramdisks for boot. A boot into failsafe mode again, allows me to fsck c2t0d0s0, which is mounted on /a, and remount it -o rw. bootadm update-archive fails, due to fs inconsistency. Another fcsk, we're in single user, nothing is using that disk, so I just ran the fsck without remounting -o ro. Now, let's skip bootadm and just move straight along to /boot/solaris/bin/create_ramdisk. bash-3.00# /boot/solaris/bin/create_ramdisk -R /a Creating ram disk for /a updating /a/platform/i86pc/boot_archive...this may take a minuteThat's it! That's the little piece of magic that fixed it. After that, I was able to reboot, and the server came right up into runlevel 3. Not without a few minor errors, but at least it was up. SunOS Release 5.10 Version Generic_127112-11 64-bit Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved. Use is subject to license terms. Hostname: generic NOTICE: /: unexpected free inode 9193, run fsck(1M) -o f NOTICE: /: unexpected free inode 5961, run fsck(1M) -o f WARNING: /: unexpected allocated inode 9637, run fsck(1M) -o f Loading smf(5) service descriptions: 1/1 /dev/md/rdsk/d60 is clean /dev/md/rdsk/d30 is clean /dev/md/rdsk/d20 is clean generic console login:At this point it was pretty simple to complete the fix, which, not wanting to reboot into failsafe mode and fsck a bunch more to recover from the unexpected free and allocated inodes, I wrote a script to: by turns, detach each have of the root mirror, clear the detached metadevice, newfs the raw device, re-create the metadevice, and attach it once again to the mirror. Let it sit long enough to complete the resync, and repeat the same steps on the other half of the mirror. #!/bin/sh # # fix-mirror.sh # # 05-16-2008 Tim KennedyNow these blades are happy once again. We'll see how long that lasts or if they continue to have problems of any sort. My hope is for the former. Have a good weekend. Posted by tkblog ( May 16 2008, 10:46:57 PM EDT ) Permalink Comments:
Post a Comment: Comments are closed for this entry. |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||