Weblog

Wednesday July 27, 2005

UFS file system defragmentation

A while back I worked on a utility to defragment a Unix File System (UFS).
I haven't integrated it into the source base yet, for two reasons:

1. I couldn't run it on any customer system to give it a taste of the real
   world.

2. It involves modifying the block numbers underneath a file, and it looks
   like the cluster folks have a project that depends on block numbers not
   changing underneath a file.

But as that project hasn't made it, this constraint can be overlooked (I hope).

In this post, I will explain a little bit about why I did this, what it does
and what it doesn't do.

A few customers reported the following problem:

# df -k .
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d40      26614583 24092573 2255865    92%    /stats
#
# dd bs=1024 if=/dev/zero of=test count=8
write: No space left on device
5+0 records in
5+0 records out
#

What this means is that even though ~2.2 GB of space is available,
'dd' failed to create a file of size 8k.
This is because the file system is so fragmented that it could not
locate 8k of contiguous space, and hence the writes from 'dd' failed.

UFS terminology:
Being a block-based file system, the allocation unit under ufs is a block of
size 8k. To minimize wasted space when small files are created,
a block is actually divided into fragments of 1k size (this can be changed
while creating a UFS file system; see mkfs(1M)).
Fragments are the units that are actually numbered.
What this means is that a block is nothing but 8 contiguous fragments, and the
fragment number of the leading fragment must be a multiple of 8.
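
The numbering rule above can be sketched in a few lines of Python. This is
only an illustration, assuming the default geometry of 8k blocks and 1k
fragments; the function names are made up for this post and are not ufs
kernel routines.

```python
FRAGS_PER_BLOCK = 8  # assumed geometry: 8k block / 1k fragment

def is_block_start(fragno: int) -> bool:
    """A full block can only begin at a fragment number that is a multiple of 8."""
    return fragno % FRAGS_PER_BLOCK == 0

def block_fragments(fragno: int) -> range:
    """Fragment numbers covered by the block that contains fragno."""
    start = fragno - (fragno % FRAGS_PER_BLOCK)
    return range(start, start + FRAGS_PER_BLOCK)

print(is_block_start(808))         # True (808 = 101 * 8)
print(is_block_start(812))         # False
print(list(block_fragments(812)))  # [808, 809, 810, 811, 812, 813, 814, 815]
```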

So if we need to store 8 files, each of size, say, 200 bytes, ufs sets aside
8 fragments instead of 8 blocks, and this results in a huge space saving (8*1k
vs 8*8k). What does the math say if we have a million such small files?
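
To put a number on the million-file question, here is a rough calculation.
It assumes every file fits in exactly one allocation unit and ignores inodes
and other metadata, so treat the figures as a ballpark rather than what a
real file system would report.

```python
FRAG_SIZE = 1024   # bytes per fragment
BLOCK_SIZE = 8192  # bytes per block

def space_needed(nfiles: int, unit: int) -> int:
    """Bytes consumed when each small file occupies exactly one allocation unit."""
    return nfiles * unit

n = 1_000_000
with_frags = space_needed(n, FRAG_SIZE)
with_blocks = space_needed(n, BLOCK_SIZE)

print(round(with_frags / 2**30, 2))   # 0.95 (GiB, one fragment per file)
print(round(with_blocks / 2**30, 2))  # 7.63 (GiB, one block per file)
print(with_blocks - with_frags)       # 7168000000 (bytes saved)
```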

In the above case, we also got the 'fstyp -v' output of the file system:

# fstyp -v /dev/md/rdsk/d40 | sed -e '/^$/,$d'
ufs
magic   11954   format  dynamic time    Tue Aug 27 19:20:27 2002
sblkno  16      cblkno  24      iblkno  32      dblkno  808
sbsize  2048    cgsize  8192    cgoffset 128    cgmask  0xffffffe0
ncg     522     size    27028032        blocks  26614583
bsize   8192    shift   13      mask    0xffffe000
fsize   1024    shift   10      mask    0xfffffc00
^^^^^^^^^^^^
frag    8       shift   3       fsbtodb 1
minfree 1%      maxbpg  2048    optim   time
maxcontig 16    rotdelay 0ms    rps     120
csaddr  808     cssize  9216    shift   9       mask    0xfffffe00
ntrak   19      nsect   248     spc     4712    ncyl    11472
cpg     22      bpg     6479    fpg     51832   ipg     6208
nindir  2048    inopb   64      nspf    2
nbfree  0       ndir    38886   nifree  1891931 nffree  4858090
^^^^^^^^^^                                                 ^^^^^^^^^^^^^^^
cgrotor 81      fmod    0       ronly   0


From this output, we can see that the number of free blocks (nbfree) on the
file system is 0. All the free space is available in the form of fragments
(nffree), which means no 8 free fragments line up as a contiguous block.
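
The same check can be scripted against saved 'fstyp -v' output. The sketch
below is a hypothetical helper, not part of any Solaris tool: it simply
searches the text for the nbfree and nffree counters and reports the
situation described above.

```python
import re

def read_counter(fstyp_text: str, name: str) -> int:
    """Pull one numeric superblock field (e.g. 'nbfree') out of 'fstyp -v' text."""
    match = re.search(r"\b" + re.escape(name) + r"\s+(\d+)", fstyp_text)
    if match is None:
        raise KeyError(name)
    return int(match.group(1))

def no_full_blocks_left(fstyp_text: str) -> bool:
    """True when free space exists only as fragments (nbfree == 0, nffree > 0)."""
    nbfree = read_counter(fstyp_text, "nbfree")
    nffree = read_counter(fstyp_text, "nffree")
    return nbfree == 0 and nffree > 0

# One line copied from the output shown earlier in this post.
sample = "nbfree  0       ndir    38886   nifree  1891931 nffree  4858090"
print(no_full_blocks_left(sample))  # True
```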

How did the FS get into this state? The system serves as a web server, and it
creates and deletes tons of small files; hence the fragmentation over
a period of time.

Traditionally, the best way to avoid this is to make the fragment size and
block size match, so that the 'df -k' output and the 'dd' behaviour are
consistent. If the fragment size were 8k in the above example, 'df -k' would
have shown 100% full, so you wouldn't expect 'dd' to succeed anyway. The cost
is that an 8k block gets allocated even if the request was for just a few
hundred bytes, so you may end up buying more disk space.

The other way to fix this problem is, once we start getting 'dd' failures, to
take the file system offline and use the traditional tools ufsdump(1M) and
ufsrestore(1M) to dump the files to backup media and restore them.
As you can see, there are two disadvantages with this approach: the FS is
unavailable for a while, and a backup device is needed.

So one approach to fixing the problem is to reconsider the block allocation
techniques of ufs (which would be an expensive proposition). The other
approach is to defragment the file system on the fly. This is what I have
implemented.

Initially I thought of adding a ufs kernel thread to do the defragmentation
whenever the fragment/block ratio reaches a threshold, but on popular advice
I switched to a utility called 'fsdefrag', which can be launched
from a crontab file if required.

What it does is sweep through all the cylinder groups of the FS (another ufs
thing), working from the worst affected cylinder group to the relatively
better ones, and try to reallocate the fragments set aside for an inode such
that a lot of contiguous fragments become available at the end of the
exercise. For an inode that is touched, its fragment numbers change to the
new ones, and the old ones are freed after the data is copied.
Some checks are made in deciding whether a given inode's data block is
an appropriate candidate or not.
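
To make the idea concrete, here is a toy model of the sweep. It is not the
fsdefrag code: each cylinder group is reduced to a list of booleans (True =
free fragment), the "worst affected" ranking is my own guess at a reasonable
metric, and the relocation step just packs used fragments together without
looking at inodes or doing any of the real checks.

```python
FRAGS_PER_BLOCK = 8

def free_full_blocks(free_map):
    """Count aligned runs of 8 free fragments (True = free) in one cylinder group."""
    count = 0
    for start in range(0, len(free_map) - FRAGS_PER_BLOCK + 1, FRAGS_PER_BLOCK):
        if all(free_map[start:start + FRAGS_PER_BLOCK]):
            count += 1
    return count

def scattered_free_frags(free_map):
    """Free fragments that are not part of any completely free block."""
    return sum(free_map) - free_full_blocks(free_map) * FRAGS_PER_BLOCK

def sweep_order(groups):
    """Indices of cylinder groups, most scattered free space first (assumed metric)."""
    return sorted(range(len(groups)),
                  key=lambda i: scattered_free_frags(groups[i]),
                  reverse=True)

def compact(free_map):
    """Toy relocation: pack every used fragment toward the front of the group."""
    used = len(free_map) - sum(free_map)
    return [False] * used + [True] * (len(free_map) - used)

cg0 = [True, False] * 8         # 16 fragments, every other one free
cg1 = [False] * 8 + [True] * 8  # one used block, one free block
print(sweep_order([cg0, cg1]))                                # [0, 1]
print(free_full_blocks(cg0), free_full_blocks(compact(cg0)))  # 0 1
```

In the example, cg0 has 8 free fragments but no usable block; after packing,
the same 8 free fragments form one full block.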

What this is not:
Once I say defragmentation, the first thing that comes to the mind of database
folks is that their ~2 GB file, which has blocks spread all over the FS, will
be made contiguous such that access times improve in a big way. This utility
doesn't look at files that occupy more than 96k (in UFS, files up to 96k
are allocated direct blocks, i.e. the block numbers are stored in the inode
itself). So if you have read this far and this paragraph let you down, I am
extremely sorry.
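
Where does 96k come from? A UFS inode holds 12 direct block pointers, and
12 * 8k = 96k. The small check below is only an illustration of that limit;
the function name is hypothetical and this is not how fsdefrag itself is
written.

```python
NDADDR = 12        # direct block pointers held in a UFS inode
BLOCK_SIZE = 8192  # bytes, the block size used throughout this post

def uses_only_direct_blocks(file_size: int) -> bool:
    """True if the file fits entirely in the inode's direct block pointers."""
    return file_size <= NDADDR * BLOCK_SIZE

print(NDADDR * BLOCK_SIZE // 1024)          # 96
print(uses_only_direct_blocks(200))         # True
print(uses_only_direct_blocks(2 * 2**30))   # False
```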

If you are interested in this, we can have a discussion on the 'ufs: discuss'
forum at opensolaris.org/

( Jul 27 2005, 07:22:22 AM PDT )
