
Wednesday July 27, 2005
UFS file system defragmentation...
A while back I worked on a utility to defragment a Unix File System
(UFS). I haven't integrated it into the source base yet, for two
reasons:
1. I couldn't run it on any customer system to give it a taste of the
   real world.
2. It involves modifying the block numbers underneath a file, and it
   looks like the cluster folks have a project that depends on block
   numbers not changing underneath a file. But as that project hasn't
   made it in, this constraint can be overlooked (I hope).
In this post I will explain a little about why I did this, what it
does, and what it doesn't do.
A few customers reported the following problem:
# df -k .
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d40     26614583 24092573 2255865    92%    /stats
#
# dd bs=1024 if=/dev/zero of=test count=8
write: No space left on device
5+0 records in
5+0 records out
#
What this means is that, even though ~2.2 GB of space is available,
'dd' failed to create an 8 KB file. The file system is so fragmented
that it could not locate 8 KB of contiguous space, and hence the writes
from 'dd' failed.
UFS terminology:
UFS is a block-based file system, and its allocation unit is a block of
8 KB. To minimize wasted space when small files are created, a block is
divided into fragments of 1 KB (this can be changed when creating a UFS
file system; see mkfs(1M)). Fragments are what actually get numbered.
This means a block is nothing but 8 contiguous fragments, and the
fragment number of the leading fragment must be a multiple of 8.
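
The alignment rule is what makes "plenty of free space, but no free
block" possible. Here is a minimal sketch, assuming a free map modelled
as a plain list of booleans (one per fragment); the names has_free_block
and block_of are made up for illustration and are not UFS kernel
functions.

```python
FRAG = 8  # fragments per block (the "frag" value of the file system)

def has_free_block(free_map, frag=FRAG):
    """Return True if some block-aligned run of `frag` fragments is free.

    free_map is a list of booleans indexed by fragment number (True = free).
    """
    for start in range(0, len(free_map) - frag + 1, frag):
        if all(free_map[start:start + frag]):
            return True
    return False

def block_of(fragno, frag=FRAG):
    """Leading (block-aligned) fragment number of the block holding fragno."""
    return fragno - (fragno % frag)

# 16 fragments, 9 of them free in one contiguous run (fragments 4..12),
# yet the run straddles the block boundary at 8, so no whole block is free.
free_map = [False] * 4 + [True] * 9 + [False] * 3
print(has_free_block(free_map))  # False
print(block_of(13))              # 8
```

So even 9 contiguous free fragments cannot satisfy a request for a
full block if they do not start on a multiple of 8.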
So if we need to store 8 files of, say, 200 bytes each, UFS sets aside
8 fragments instead of 8 blocks, which is a huge space saving (8*1k vs
8*8k). What does the math say if we have a million such small files?
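
The arithmetic can be worked out with a few lines of Python. This is
only a sketch of the rounding involved: it assumes each file is rounded
up to whole allocation units and ignores inodes and other metadata. The
function name space_used is made up for illustration.

```python
BSIZE = 8192   # block size in bytes
FSIZE = 1024   # fragment size in bytes

def space_used(nfiles, file_size, alloc_unit):
    """Bytes consumed when each file is rounded up to whole allocation units."""
    units_per_file = -(-file_size // alloc_unit)  # ceiling division
    return nfiles * units_per_file * alloc_unit

for n in (8, 1_000_000):
    frag_bytes = space_used(n, 200, FSIZE)
    block_bytes = space_used(n, 200, BSIZE)
    print(n, frag_bytes, block_bytes, block_bytes - frag_bytes)
# 8 8192 65536 57344
# 1000000 1024000000 8192000000 7168000000
```

For a million 200-byte files that is roughly 1 GB with 1k fragments
versus roughly 8 GB with whole 8k blocks.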
In the above case, we also got the 'fstyp -v' output for the file system:
# fstyp -v /dev/md/rdsk/d40 | sed -e '/^$/,$d'
ufs
magic   11954   format  dynamic time    Tue Aug 27 19:20:27 2002
sblkno  16      cblkno  24      iblkno  32      dblkno  808
sbsize  2048    cgsize  8192    cgoffset 128    cgmask  0xffffffe0
ncg     522     size    27028032        blocks  26614583
bsize   8192    shift   13      mask    0xffffe000
fsize   1024    shift   10      mask    0xfffffc00
^^^^^^^^^^^^
frag    8       shift   3       fsbtodb 1
minfree 1%      maxbpg  2048    optim   time
maxcontig 16    rotdelay 0ms    rps     120
csaddr  808     cssize  9216    shift   9       mask    0xfffffe00
ntrak   19      nsect   248     spc     4712    ncyl    11472
cpg     22      bpg     6479    fpg     51832   ipg     6208
nindir  2048    inopb   64      nspf    2
nbfree  0       ndir    38886   nifree  1891931 nffree  4858090
^^^^^^^^^^                                      ^^^^^^^^^^^^^^^
cgrotor 81      fmod    0       ronly   0
From this output we can see that the number of free blocks (nbfree) on
the file system is 0, and all the free space is available in the form
of fragments (nffree). That means there is no place on the file system
where 8 contiguous, block-aligned fragments are free.
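
Spotting this condition can be automated. The following is a rough
sketch, assuming the 'fstyp -v' text has already been captured into a
string; the helper names field and free_space_summary are hypothetical,
and SAMPLE holds only the relevant lines from the output above.

```python
import re

# A few lines copied from the fstyp -v output shown above.
SAMPLE = """\
bsize   8192    shift   13      mask    0xffffe000
fsize   1024    shift   10      mask    0xfffffc00
nbfree  0       ndir    38886   nifree  1891931 nffree  4858090
"""

def field(text, name):
    """Return the integer that follows `name` in fstyp -v style output."""
    m = re.search(rf"\b{name}\s+(\d+)", text)
    if m is None:
        raise KeyError(name)
    return int(m.group(1))

def free_space_summary(text):
    """Report the free block/fragment counters and flag the bad case."""
    nbfree = field(text, "nbfree")
    nffree = field(text, "nffree")
    return {
        "nbfree": nbfree,
        "nffree": nffree,
        # Free space exists, but none of it as a whole block.
        "only_fragments_free": nbfree == 0 and nffree > 0,
    }

print(free_space_summary(SAMPLE))
# {'nbfree': 0, 'nffree': 4858090, 'only_fragments_free': True}
```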
How did the FS get into this state? The system serves as a web server;
it creates and deletes tons of small files, hence the fragmentation
over a period of time.
Traditionally, the best way to avoid this is to make the fragment size
and the block size match, so that the 'df -k' output and the behavior
of 'dd' are consistent. If the fragment size had been 8k in the above
example, 'df -k' would have shown 100% full, so you would not expect
'dd' to succeed anyway. But that means allocating an 8k block even if
the request is for just a few hundred bytes! So you may end up buying
more disk space.
The other way to fix this problem is, once we start getting 'dd'
failures, to take the file system offline and use the traditional tools
ufsdump(1M)/ufsrestore(1M) to dump the files to backup media and
restore them. As you can see, there are two disadvantages to this
approach: the FS is unavailable for a while, and a backup device is
needed.
So one approach to fixing the problem is to reconsider the block
allocation techniques of UFS (which would be an expensive proposition).
The other approach is to defragment the file system on the fly. This is
what I have implemented.
Initially I thought of adding a UFS kernel thread to do the
defragmentation whenever the fragment/block ratio reaches a threshold;
but on popular advice, I switched to a utility called 'fsdefrag', which
can be launched from a crontab file if required.
What it does is sweep through the FS, checking all the cylinder groups
(another UFS thing), working from the worst-affected cylinder group to
the relatively better ones, and try to reallocate the fragments set
aside for an inode so that a lot of contiguous fragments become
available at the end of the exercise. For an inode that is touched, its
fragment numbers change to the new ones, and the old ones are freed
after the data is copied. Some checks are made to decide whether a
given inode's data block is an appropriate one to move or not.
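
To make the "worst affected first" ordering concrete, here is a toy
sketch. It is not the fsdefrag code: the scoring rule (free fragments
that are not part of a whole free block) and the names free_blocks,
fragmentation and order_worst_first are assumptions made for
illustration, and each cylinder group is modelled as a simple list of
booleans (True = free fragment).

```python
FRAG = 8  # fragments per block

def free_blocks(free_map):
    """Count block-aligned runs of FRAG fragments that are entirely free."""
    return sum(
        all(free_map[i:i + FRAG])
        for i in range(0, len(free_map) - FRAG + 1, FRAG)
    )

def fragmentation(free_map):
    """Free fragments that are NOT part of a whole free block."""
    return sum(free_map) - FRAG * free_blocks(free_map)

def order_worst_first(cgs):
    """cgs maps cylinder-group number -> free map. Worst-affected first."""
    return sorted(cgs, key=lambda cg: fragmentation(cgs[cg]), reverse=True)

cgs = {
    0: [True] * 8 + [False] * 8,                 # one whole free block
    1: [True, False] * 8,                        # 8 free fragments, all loose
    2: [False] * 4 + [True] * 4 + [False] * 8,   # 4 loose free fragments
}
print(order_worst_first(cgs))  # [1, 2, 0]
```

A real pass would then walk the small files in each group in that
order and move their fragments, which this sketch does not attempt.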
What this is not:
When I say defragmentation, the first thing that comes to the mind of
database folks is that their ~2 GB file, which has blocks spread all
over the FS, will be made contiguous so that access times improve in a
big way. This utility doesn't look at files that occupy more than 96k
(in UFS, files up to 96 KB use only direct blocks, i.e. their block
numbers are stored in the inode itself). So if you have read this far
and this paragraph let you down, I am extremely sorry.
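
The 96k figure follows from the 12 direct block pointers in a UFS
inode times the 8k block size. A small sketch of that size check,
assuming it is a plain comparison (the utility's actual checks are not
shown in this post, and is_candidate is a made-up name):

```python
NDADDR = 12    # direct block pointers held in a UFS inode
BSIZE = 8192   # block size in bytes (bsize in the fstyp output)

DIRECT_LIMIT = NDADDR * BSIZE  # 98304 bytes, i.e. 96 KB

def is_candidate(size_bytes):
    """True if the file fits entirely in direct blocks."""
    return size_bytes <= DIRECT_LIMIT

print(DIRECT_LIMIT)                  # 98304
print(is_candidate(200))             # True
print(is_candidate(2 * 1024 ** 3))   # False (a ~2 GB database file)
```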
If you are interested in this, we can have a discussion on the 'ufs:
discuss' forum on opensolaris.org.
( Jul 27 2005, 07:22:22 AM PDT )