http://andrew-gray.com/unixfaq/solaris_disks.shtml


fsck is used to check and resolve problems with filesystems. If you have
corruption on one or more filesystems then read on. fsck is not used
to check the functions of disks - under Solaris use format(1M) for that.


Above all remember this;

You MUST NOT run fsck on a mounted filesystem

If you're in a hurry, skip down to "Interacting with Fsck".
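
A minimal sketch of a pre-flight check for the rule above. It assumes a mount table whose first two whitespace-separated fields are the device and the mount point, as in /etc/mnttab on Solaris or /proc/mounts on Linux. The name is_mounted and its arguments are made up for illustration; nothing like it ships with fsck.

```shell
#!/bin/sh
# is_mounted DEVICE [TABLE]
# Succeeds (exit 0) if DEVICE is listed as a mounted device in TABLE.
# TABLE defaults to /etc/mnttab; pass /proc/mounts on Linux.
# Hypothetical helper. On older Solaris use nawk or /usr/xpg4/bin/awk,
# because the stock awk there does not understand -v.
is_mounted() {
    dev=$1
    table=${2:-/etc/mnttab}
    # fsck is normally given the raw device (/dev/rdsk/...), but the mount
    # table lists the block device (/dev/dsk/...), so normalise first.
    blk=$(printf '%s\n' "$dev" | sed 's|/rdsk/|/dsk/|')
    awk -v d="$blk" '$1 == d { found = 1 } END { exit !found }' "$table"
}

# Typical use before reaching for fsck:
#   if is_mounted /dev/rdsk/c0t3d0s3; then
#       echo "still mounted - umount it first"
#   fi
```

The check is only as good as the mount table it reads, so treat a "not mounted" answer as a hint rather than a guarantee.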


Most people's first experience with fsck comes after their system has
crashed and they're faced with cryptic and daunting questions from it.
This is unfortunate because they're probably under considerable pressure
to get the system running again and don't know what to do. If you're new
to Unix and responsible for one or more systems I would encourage you
to find an unimportant workstation and experiment with fsck a little -
umount a filesystem and fsck it. If the machine doesn't have any data
on it you could pull the power and see what happens when the machine reboots...


This FAQ focuses on Solaris, though most of it is also applicable
to other Unix variants, including Linux.


How fsck normally works


Unix, any Unix, will refuse to mount a filesystem that was not unmounted
cleanly. This is because it may be corrupt and mounting a corrupt filesystem
will likely cause the system to crash.


When the system boots all filesystems are checked to see whether they
are Clean. The term simply means that the filesystem was unmounted
properly after its last use. If the filesystem is Dirty then fsck
will be called in to check it out in more detail. Some Unix variants such
as Linux will also run fsck after the filesystem has been mounted N times
- N is the maximum mount count.
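
On Linux ext2/ext3 filesystems you can see how close you are to that forced check. The sketch below assumes the "Mount count:" and "Maximum mount count:" lines printed by 'tune2fs -l', and assumes that a maximum of 0 or -1 means the check is switched off. The function name check_due is hypothetical.

```shell
#!/bin/sh
# check_due
# Reads `tune2fs -l` style output on stdin and succeeds (exit 0) when the
# mount count has reached the maximum mount count, i.e. a boot-time fsck
# would be expected. Hypothetical helper for illustration only.
check_due() {
    awk -F: '
        /^Mount count:/         { count = $2 + 0 }
        /^Maximum mount count:/ { max = $2 + 0 }
        # a maximum of 0 or less is treated as "check disabled"
        END { exit !(max > 0 && count >= max) }
    '
}

# Typical use (the device name is only an example):
#   tune2fs -l /dev/sda1 | check_due && echo "expect an fsck at next boot"
```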


Modern Unix systems run fsck automatically in what is known as Preen
mode. In this mode fsck will fix minor problems that do not result in
data loss - such as the Clean/Dirty state flag. If it finds any problems
that may result in data loss it will flip into Interactive mode
- this is how most people first encounter fsck.
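
The preen-then-interactive sequence can be rehearsed by hand. This is a rough sketch, not what the boot scripts actually run: it assumes the '-o p' preen option of Solaris fsck_ufs (many other systems spell it '-p'), and the function name and the FSCK override variable are invented here so you can try it without touching a disk.

```shell
#!/bin/sh
# FSCK names the command to run; override it to rehearse safely.
FSCK=${FSCK:-fsck}

# preen_then_ask DEVICE
# Try a non-interactive preen pass first, and fall back to an interactive
# run only if preen exits non-zero (it found something it will not fix
# without asking). Hypothetical helper for illustration only.
preen_then_ask() {
    dev=$1
    if $FSCK -o p "$dev"; then
        echo "preen: nothing needing a decision on $dev"
        return 0
    fi
    echo "preen gave up on $dev, switching to interactive mode"
    $FSCK "$dev"
}

# Typical use on an unmounted slice:
#   preen_then_ask /dev/rdsk/c0t3d0s3
```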


Interacting with Fsck


When you first encounter fsck it seems as though only people with a
PhD in computer science should be dealing with it - the messages are that
cryptic.


It's really not that hard; tell someone to deal with the panicking users,
close the door, and turn your phone off. You need to concentrate on this....


Take note of these points;


You must not mount a corrupt filesystem.
Some systems (including older Solaris systems) will let you mount
a corrupt filesystem after fsck has been run on it. Doing so will almost
certainly cause the system to crash later and your corruption might
be even worse.

Most interaction with fsck consists of answering Yes or No
to a series of questions that, in essence, mean 'Shall I fix this
corruption?'. Newcomers are inclined to answer No because they don't
understand the implications. If you answer No even once, the filesystem
corruption may not be cleared. You must run fsck again in this instance.
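
The "run it again" advice can be wrapped in a small loop. This is a sketch under one big assumption: that your fsck exits 0 only when it had nothing left to fix. Exit codes differ between systems, so check your man page first. The function name and the FSCK override are hypothetical.

```shell
#!/bin/sh
# FSCK names the command to run; override it to rehearse safely.
FSCK=${FSCK:-fsck}

# fsck_until_clean DEVICE [MAX_PASSES]
# Re-run fsck until a pass exits 0, giving up after MAX_PASSES (default 3).
# Each pass is still interactive; this only saves retyping the command.
fsck_until_clean() {
    dev=$1
    max=${2:-3}
    pass=1
    while [ "$pass" -le "$max" ]; do
        echo "fsck pass $pass on $dev"
        if $FSCK "$dev"; then
            return 0
        fi
        pass=$((pass + 1))
    done
    echo "still not clean after $max passes" >&2
    return 1
}

# Typical use on an unmounted slice:
#   fsck_until_clean /dev/rdsk/c0t1d0s6
```

If it gives up after a few passes, stop answering questions and start thinking about the backup tape.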

Minor Corruption


I define minor corruption as a case where you've not lost data, but fsck
can't tell that for itself.

An example of fsck encountering a minor corruption is shown below;


sun (ksh) # fsck /dev/rdsk/c0t3d0s3
** /dev/rdsk/c0t3d0s3
** Last mounted on /usr
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
UNREF FILE I=343651 OWNER=root MODE=100644
SIZE=0 MTIME=Jun 13 09:43 2003
CLEAR? y

** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? y

25947 files, 588044 used, 133186 free (11674 frags, 15189 blocks, 1.6%
fragmentation)


Here fsck found an unreferenced file - that's an inode with no directory
entry pointing to it. There's no name on the file because filenames are
stored in directories. The only information shown is the inode number
(I=343651), size, ownership, permissions and modification time. This inode
refers to a file that is empty. Also as the Inode number is a high one
it's very unlikely that this file is important - we answer Y (yes) to
the CLEAR? question.


The superblock's free block count ("FREE BLK COUNT") will likely
always be wrong if fsck made any modification to the file system on earlier
phases. We answer Y (yes) to tell fsck to correct it.


Fsck's preen mode could not be expected to resolve this problem automatically
- it is possible that an empty file could be significant. We made a judgement
here, as you may have to.


Mid-Level Corruption


If you get to this point then you have lost at least one and possibly
several files. If you're lucky you've only lost a few files that were
open when the system crashed. At worst you've lost several directories
and with them all the files in them. Send out for the backup tape, you're
going to need it.


The following example of corruption showing loss of real data has been
abridged for inclusion here;


** /dev/rdsk/c0t1d0s6
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
UNKNOWN FILE TYPE I=97
CLEAR? yes

UNALLOCATED I=10 OWNER=root MODE=0
SIZE=0 MTIME=Jan 1 07:00 1970
NAME=?

REMOVE? yes

** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
UNREF DIR I=213509 OWNER=root MODE=40755
SIZE=512 MTIME=Mar 13 17:16 1999
RECONNECT? yes

** Phase 4 - Check Reference Counts
LINK COUNT DIR I=35722 OWNER=bin MODE=40755
SIZE=512 MTIME=Mar 13 17:24 1999 COUNT 5 SHOULD BE 4
ADJUST? yes

** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? yes

2683 files, 164403 used, 504020 free (1804 frags, 62777 blocks,
0.2% fragmentation)

***** FILE SYSTEM WAS MODIFIED *****


Phase#1 shows that we've lost two files; we have no idea of their
size or contents. The I=10 entry is probably suspect because all the values
are zero - root is UID 0, and the 1st Jan 1970 also equates to an epoch
time of 0. I=10 is a very low inode number - in general the lower the
number the more serious the problem. Both lost inodes could have been
directories - there is no way of knowing. There has been serious corruption
of the inode table here.


Phase#3 reveals an unconnected directory.
This is a directory that has no entry in any other directory. That should
only be true of the root inode, which I=213509 certainly
is not. The 'RECONNECT? yes' causes fsck to make an entry in the lost+found
directory; the name will be '#213509'. Once the filesystem is mounted
you can 'cd <mountpoint>/lost+found/#213509' and investigate what
the directory contains and possibly identify where in the filesystem it
should be.
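
Once the filesystem is mounted, a quick inventory of lost+found saves some typing. The sketch below assumes the '#<inode>' naming described above; the function name list_recovered is made up for illustration.

```shell
#!/bin/sh
# list_recovered MOUNTPOINT
# Print one line per reconnected entry in MOUNTPOINT/lost+found, saying
# whether it is a directory or a file and which inode number it came from.
# Hypothetical helper; assumes fsck named the entries '#<inode>'.
list_recovered() {
    lf=$1/lost+found
    for entry in "$lf"/\#*; do
        [ -e "$entry" ] || continue      # the pattern matched nothing
        if [ -d "$entry" ]; then
            kind=dir
        else
            kind=file
        fi
        name=$(basename "$entry")
        # strip the leading '#' to leave just the inode number
        printf '%s inode=%s\n' "$kind" "${name#?}"
    done
}

# Typical use (the mount point is only an example):
#   list_recovered /export
```

From there 'ls -la' inside each recovered directory usually gives enough clues to work out where it used to live.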


Phase#4 shows a directory with an incorrect link count. The inode
holding the directory has a link count of 5, but fsck could only find
4 directory entries pointing to it. This is probably the least serious
error shown on this run.
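
To see where a number like "SHOULD BE 4" comes from: on UFS-style filesystems a directory is referenced once by its name in the parent, once by its own '.', and once by the '..' of each subdirectory. The sketch below recomputes that figure for a healthy directory. The function name is hypothetical, and some other filesystems count directory links differently.

```shell
#!/bin/sh
# expected_dir_links DIR
# Print the link count a UFS-style filesystem should record for DIR:
# 2 (its entry in the parent plus its own ".") + 1 per subdirectory.
# Hypothetical helper for illustration only.
expected_dir_links() {
    dir=$1
    count=2
    # the three patterns cover normal names, dot-names and names like "..x"
    for sub in "$dir"/* "$dir"/.[!.]* "$dir"/..?*; do
        # unmatched patterns and symlinks fail this test and are skipped
        if [ -d "$sub" ] && [ ! -L "$sub" ]; then
            count=$((count + 1))     # this subdirectory's ".." points back
        fi
    done
    echo "$count"
}

# Typical use: compare against the second column of `ls -ld DIR`
#   expected_dir_links /usr/lib
```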


The filesystem is probably safe to mount, though to be 100% sure you
ought to fsck it again.


Assuming this is the only corrupt filesystem you can either 'exit' single
user mode, or simply reboot the machine.


After the machine boots you need to decide what to do with this filesystem.
This is a judgement call that you must make and which depends on many factors
outside the scope of this FAQ. Personally, faced with the above fsck results,
and unless the filesystem was totally unimportant, I would consider the
overall level of damage sufficient to warrant a full restore.


You shouldn't spend too long trying to fix this level of corruption; if
more than half a dozen files have gone west you need to be considering
restoring the whole filesystem from backup.


Severe Corruption


At this level you may have lost the entire filesystem. It's really a case
of seeing what you can salvage rather than getting the filesystem back
on its feet. If it's a filesystem that the system can boot without
then you might consider removing it from /etc/vfstab (/etc/fstab
on Linux) so that you can boot the system multi-user.
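
Rather than deleting the line outright, you can comment it out so it is easy to put back. A minimal sketch, assuming the usual vfstab layout with the device to mount in the first field; the function name is made up, and it writes to standard output so nothing is changed until you have looked at the result.

```shell
#!/bin/sh
# disable_vfstab_entry DEVICE [FILE]
# Print FILE (default /etc/vfstab) with the entry for DEVICE commented out.
# Hypothetical helper. On older Solaris use nawk or /usr/xpg4/bin/awk,
# because the stock awk there does not understand -v.
disable_vfstab_entry() {
    dev=$1
    file=${2:-/etc/vfstab}
    awk -v d="$dev" '$1 == d { print "#" $0; next } { print }' "$file"
}

# Typical use: write a candidate file, check it, then copy it into place.
#   cp /etc/vfstab /etc/vfstab.orig
#   disable_vfstab_entry /dev/dsk/c0t1d0s6 > /tmp/vfstab.new
#   diff /etc/vfstab /tmp/vfstab.new
```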


If you run fsck on what you consider to
be a 'good' filesystem, and see something like this, then you have severe
corruption;


sun# fsck /dev/rdsk/c0t1d0s1
** /dev/rdsk/c0t1d0s1 (NO WRITE)
BAD SUPER BLOCK: MAGIC NUMBER WRONG
USE AN ALTERNATE SUPER-BLOCK TO SUPPLY NEEDED INFORMATION;
eg. fsck [-F ufs] -o b=# [special ...]
where # is the alternate super block. SEE fsck_ufs(1M).

fsck did not identify this partition as containing a filesystem. Double
check that you entered the correct device file, assuming you did...


Using alternate superblocks


When you create a filesystem with newfs it pumps out a long list of numbers
- super-block locations. The super-block contains key information about
a filesystem, without it you don't have a usable filesystem. Solaris creates
a backup super-block at the start of every cylinder group and there is
always one at block #32. Try this, who knows....


sun (ksh) # fsck -o b=32 /dev/rdsk/c0t1d0s6
Alternate super block location: 32.
** /dev/rdsk/c0t1d0s6
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? y

2746 files, 169956 used, 498467 free (1051 frags, 62177 blocks,
0.1% fragmentation)
***** FILE SYSTEM WAS MODIFIED *****


Well that doesn't happen very often! Looks like the superblock itself
was the only thing corrupted. It lives near the start of the slice, so perhaps
something wrote there?


Officially you are supposed to record the super-block numbers when you
create a filesystem, but no-one ever does. Assuming the filesystem was created
with default parameters you can get a list of super-block backups by running
newfs with the '-N' option;


sun (ksh) # newfs -N /dev/rdsk/c0t1d0s1
/dev/rdsk/c0t1d0s1: 237000 sectors in 50 cylinders of 20 tracks,
237 sectors
115.7MB in 4 cyl groups (16 c/g, 37.03MB/g, 17792 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 76112, 152192, 228272,

This yields the next superblock backup at block 76112, which you can try
if you weren't as lucky as me, though to be honest if things are that
bad it's probably a waste of time.
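
If you do want to work through the backups, the list can be pulled out of the newfs output rather than copied by hand. A sketch that assumes the 'newfs -N' output format shown above; the function name is hypothetical.

```shell
#!/bin/sh
# list_backup_superblocks
# Read `newfs -N` output on stdin and print the backup super-block
# numbers, one per line. Hypothetical helper for illustration only.
list_backup_superblocks() {
    awk '
        /super-block backups/ { grab = 1; next }
        grab {
            gsub(/,/, " ")                 # "32, 76112," -> "32  76112 "
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+$/) print $i
        }
    '
}

# Typical use. A for loop is used instead of "| while read" so that fsck
# can still read your answers from the terminal:
#   for sb in $(newfs -N /dev/rdsk/c0t1d0s1 | list_backup_superblocks); do
#       fsck -F ufs -o b="$sb" /dev/rdsk/c0t1d0s1 && break
#   done
```

Remember the list is only valid if newfs -N is given the same parameters the filesystem was originally created with.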

Comments:

This is a really nice blog entry, but depending on who you ask it is at least 3 months too late, and some would say it's about 5 years too late.

The future has arrived: ZFS, perhaps you have heard of it. It doesn't require or even allow you to run fsck. Solaris 10 update 6, aka 10/08, now enables the root filesystem to be ZFS.

Even before this, thanks to transaction logs, it was extremely rare to need to fsck a file system by hand rather than have it happen automatically at boot.

Posted by James Dickens on January 08, 2009 at 01:41 PM PST #

That would be great. Actually, several minutes ago my Solaris system was shut down uncleanly, and I hit a problem that I solved using fsck. If ZFS can keep us away from this kind of problem, that will be very nice. Looking forward to the new update!

Posted by Yunpu Zhu on January 08, 2009 at 01:51 PM PST #


This blog copyright 2009 by Yunpu Zhu