Wednesday July 27, 2005

UFS file system defragmentation

A while back I worked on a utility to defragment a Unix File System (UFS).
I haven't integrated it into the source base yet, for two reasons:

1. Couldn't run it on any customer system, to give it a taste of the real world.

2. This involves modifying the block numbers underneath a file, and it looks
   like the cluster folks have a project which depends on block numbers not
   changing underneath a file.

But as this project hasn't made it, this constraint can be overlooked (I hope).

In this post, I will explain a little bit about why I did this, what it does
and what it doesn't do.
A few customers reported the following problem:

# df -k .
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d40      26614583 24092573 2255865    92%    /stats
#
# dd bs=1024 if=/dev/zero of=test count=8
write: No space left on device
5+0 records in
5+0 records out
#

What this means is, even though ~2.2GB of space is available,
'dd' failed to create a file of size 8k.
This is because the file system is so fragmented that it could not
locate 8KB of contiguous space, and hence the writes from 'dd' failed.

UFS terminology:
Being a block based file system, the allocation unit under ufs is a block of
size 8k. To minimize wastage of space when small files are created,
a block is actually divided into fragments of 1k size (this can be changed
while creating a UFS FS; see mkfs(1M)).
Fragments are the ones which are actually numbered.
What this means is that a block is nothing but 8 contiguous fragments, and the
fragment number of the leading fragment should be a multiple of 8.

So if we need to store 8 files, each of size, say, 200 bytes, ufs sets aside
8 fragments instead of 8 blocks, and this results in a huge space saving (8*1k vs
8*8k). What does the math say if we have a million such small files?
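
To put numbers on that question, here is a small sketch. It is not a UFS routine; the function name space_used is made up for illustration, and it ignores inode and metadata overhead. It simply rounds each file up to a whole number of allocation units.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Space consumed by nfiles files of file_size bytes each when every file
 * is rounded up to whole allocation units of unit_size bytes.
 * Illustrative helper only; inode and metadata overhead are ignored.
 */
static uint64_t
space_used(uint64_t nfiles, uint64_t file_size, uint64_t unit_size)
{
	assert(unit_size > 0);
	uint64_t units_per_file = (file_size + unit_size - 1) / unit_size;
	return (nfiles * units_per_file * unit_size);
}
```

For a million 200-byte files, space_used(1000000, 200, 1024) gives 1,024,000,000 bytes (about 1GB), while space_used(1000000, 200, 8192) gives 8,192,000,000 bytes (about 8GB).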

In the above case, we also got the 'fstyp -v' output of the file system:

# fstyp -v /dev/md/rdsk/d40 | sed -e '/^$/,$d'
ufs
magic   11954   format  dynamic time    Tue Aug 27 19:20:27 2002
sblkno  16      cblkno  24      iblkno  32      dblkno  808
sbsize  2048    cgsize  8192    cgoffset 128    cgmask  0xffffffe0
ncg     522     size    27028032        blocks  26614583
bsize   8192    shift   13      mask    0xffffe000
fsize   1024    shift   10      mask    0xfffffc00
^^^^^^^^^^^^
frag    8       shift   3       fsbtodb 1
minfree 1%      maxbpg  2048    optim   time
maxcontig 16    rotdelay 0ms    rps     120
csaddr  808     cssize  9216    shift   9       mask    0xfffffe00
ntrak   19      nsect   248     spc     4712    ncyl    11472
cpg     22      bpg     6479    fpg     51832   ipg     6208
nindir  2048    inopb   64      nspf    2
nbfree  0       ndir    38886   nifree  1891931 nffree  4858090
^^^^^^^^^^                                                 ^^^^^^^^^^^^^^^
cgrotor 81      fmod    0       ronly   0


From this output, we can see that the number of free blocks (nbfree) on the
file system is 0, and all the free space is available in the form of fragments
(nffree). That means no block-aligned run of 8 contiguous fragments is free.
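
The relationship between the two counters can be sketched with a simplified model. This is an assumption-laden illustration: free_map is a plain byte-per-fragment array rather than the on-disk cylinder group bitmap, and count_free is a made-up name. A block counts as free only when all 8 fragments of an 8-aligned run are free; everything else is counted as loose free fragments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	FRAGS_PER_BLOCK	8	/* 'frag' in the fstyp output above */

/*
 * free_map[i] is nonzero when fragment i is free (simplified model, not
 * the on-disk format).  Counts whole free blocks in *nbfree and free
 * fragments that are not part of a whole free block in *nffree.
 */
static void
count_free(const uint8_t *free_map, size_t nfrags,
    size_t *nbfree, size_t *nffree)
{
	assert(free_map != NULL && nbfree != NULL && nffree != NULL);
	*nbfree = 0;
	*nffree = 0;
	for (size_t base = 0; base + FRAGS_PER_BLOCK <= nfrags;
	    base += FRAGS_PER_BLOCK) {
		size_t free_here = 0;

		for (size_t i = 0; i < FRAGS_PER_BLOCK; i++)
			free_here += (free_map[base + i] != 0);
		if (free_here == FRAGS_PER_BLOCK)
			(*nbfree)++;		/* aligned run of 8: a free block */
		else
			*nffree += free_here;	/* loose fragments only */
	}
}
```

Note that 8 free fragments in a row that straddle a block boundary still yield nbfree of 0 in this model, which is the situation the failing 'dd' ran into.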

How did the FS get into this state? The system serves as a web server; it
creates and deletes tons of small files, and the fragmentation builds up over
a period of time.

Traditionally the best way to avoid this is to have the fragment size and block
size match, so that the 'df -k' output and the 'dd' behaviour are consistent. So if
the fragment size was 8k in the above example, 'df -k' would have shown 100%
full, and you wouldn't expect 'dd' to succeed anyway. But that means allocating
an 8k block even if the request was for just a few hundred bytes !!! So
you may end up buying more disk space.

The other way to fix this problem was, once we start getting 'dd' failures,
to take the file system offline and use the traditional tools ufsdump(1M)/
ufsrestore(1M) to dump the files to backup media and restore them.
As you can see, there are two disadvantages with this approach: the FS is
unavailable for a while, and a backup device is needed.

So one approach to fix the problem is to reconsider the block allocation
techniques of ufs (which will be an expensive proposition). The other
approach is to defragment the file system on the fly. This is what I have
implemented.

Initially I thought of adding a ufs kernel thread to do the defragmentation
whenever the fragment/block ratio reaches a threshold; but on popular advice,
I switched to a utility called 'fsdefrag', which can be launched
from a crontab file if required.

What this does is sweep through all the cylinder groups (another ufs thing)
of the FS, starting from the worst affected cylinder group and moving to
the relatively better ones, and try to reallocate the fragments set aside for
an inode such that a lot of contiguous fragments become available at the end
of the exercise. For an inode that is touched, its fragment numbers
change to the new ones, and the old ones get freed after the data is copied.
There are some checks made in deciding whether a given inode's data block is
an appropriate candidate or not.
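
The "worst affected first" ordering could be sketched as a sort over per-cylinder-group summaries. This is only a guess at one reasonable metric, not how fsdefrag actually ranks groups: struct cg_summary, its field names and the choice of "most loose free fragments first" are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-cylinder-group summary; names are illustrative only. */
struct cg_summary {
	int	cgno;	/* cylinder group number */
	long	nbfree;	/* whole free blocks in this group */
	long	nffree;	/* free fragments outside whole free blocks */
};

/* qsort comparator: most loose fragments first, ties broken by cgno. */
static int
cg_worst_first(const void *a, const void *b)
{
	const struct cg_summary *x = a;
	const struct cg_summary *y = b;

	if (x->nffree != y->nffree)
		return (x->nffree > y->nffree ? -1 : 1);
	return ((x->cgno > y->cgno) - (x->cgno < y->cgno));
}

/* Reorder the array so the most fragmented groups are visited first. */
static void
order_cylinder_groups(struct cg_summary *cgs, size_t ncg)
{
	assert(cgs != NULL || ncg == 0);
	if (ncg > 1)
		qsort(cgs, ncg, sizeof (cgs[0]), cg_worst_first);
}
```

A real tool would probably also weigh how many whole blocks could be recovered per group; the sketch only shows the visiting order.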

What this is not:
Once I say defragmentation, the first thing that comes to the mind of database
folks is that their ~2GB file, which has blocks spread all over the FS, will be
made contiguous such that access times improve in a big way. This utility
doesn't look at files which occupy more than 96k (in UFS, up to 96KB, files
are allocated direct blocks, i.e. block numbers are stored in the inode
itself). So if you have read this far and this paragraph let you down, I am
extremely sorry.
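
The 96k figure falls out of the inode layout: a UFS inode holds 12 direct block pointers, and 12 * 8k = 96k. A minimal sketch of such a size check follows; the function name is made up, and whether the utility tests exactly this condition is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define	NDADDR	12	/* direct block pointers in a UFS inode */

/* Nonzero when a file of 'size' bytes fits entirely in direct blocks. */
static int
fits_in_direct_blocks(uint64_t size, uint32_t bsize)
{
	assert(bsize > 0);
	return (size <= (uint64_t)NDADDR * bsize);
}
```

With bsize 8192 the limit is 98304 bytes, so a 96k file qualifies and anything one byte larger needs an indirect block.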

If you are interested in this, we can have a discussion on the 'ufs: discuss' forum
off opensolaris.org/

( Jul 27 2005, 07:22:22 AM PDT )

Tuesday July 12, 2005

System crash file layout

Some time back, while working on a pet project (maybe the next blog), I studied the layout of
a system crash file, and this blog is about that. This is applicable to crash files from the
sun4u platform (things should not be very different in x86/64 land, but I wouldn't
bet on it).

In this document I am explaining the layout of the system crash file on the
swap disk and how it is present in the vmcore.* subsequent to running
'savecore'.
Use:
1. Getting to know the layout
2. How we can extract unix.* from vmcore.* and thus avoid copying over
unix.* while downloading cores (Not a major gain, but for fun...)
3. How we can avoid kvm_* calls and do the virtual to physical translation
on our own and then read the appropriate location in vmcore.* (again
for fun...)

As soon as the system comes back after a crash, the crash dump is available
on the primary swap device and it starts at:

offset = (size of the device - 64k);
'offset' is truncated to align it on a 64k boundary

and grows upwards. At the offset calculated above, there will be a
dumphdr_t structure which describes the dump.
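
The offset arithmetic can be worked through in isolation. In this sketch DUMP_OFFSET is defined locally as 64k to match the text, and dumphdr_offset is a made-up helper name; it only does the subtraction and the truncation, nothing device specific.

```c
#include <assert.h>
#include <stdint.h>

#define	DUMP_OFFSET	65536	/* 64k, as described above */

/*
 * Given the size of the dump device in bytes, return the 64k-aligned
 * offset at which the dump header is expected.
 */
static int64_t
dumphdr_offset(int64_t devsize)
{
	assert(devsize >= DUMP_OFFSET);
	/* step back 64k, then clear the low 16 bits to truncate to 64k */
	return ((devsize - DUMP_OFFSET) & -(int64_t)DUMP_OFFSET);
}
```

For a device whose size is already a multiple of 64k the header sits exactly 64k before the end; otherwise it lands on the previous 64k boundary.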

No need to know the name of the primary swap device; the following lines
do the job:

/* ask the dump driver which device holds the dump */
dumpfd = open("/dev/dump", O_RDONLY);
ioctl(dumpfd, DIOCGETDEV, dumpfile);  /* dump device path lands in "dumpfile" */
close(dumpfd);
dumpfd = open(dumpfile, O_RDONLY);    /* now open the dump device itself */

Seek to the 64k-aligned offset near the end of the device:
endoff = llseek(dumpfd, -DUMP_OFFSET, SEEK_END) & -DUMP_OFFSET;

And read the dumphdr_t:

pread64(dumpfd, &dumphdr, sizeof(dumphdr_t), endoff);

dumphdr contains info to describe the various sections of crash dump,
their start points, their sizes and some more info like crash time, utsname...

sections:

1. dump_start of dumphdr denotes the beginning of the crash dump, and there is a
blank page (size: 8192) to start with (even this contains one copy of
dumphdr_t, so it's not exactly blank :-))

2. Then starts the symbol table in compressed form (it's no longer compressed !!)
which will subsequently become 'unix.*' upon running savecore. Its exact
size is available in dump_ksyms_size/dump_ksyms_csize of dumphdr.

3. This is followed by the page translation map. It's an array of structures
of type mem_vtop:

struct mem_vtop {
	struct as *m_as;	/* address space */
	void *m_va;		/* virtual address */
	pfn_t m_pfn;		/* page frame number backing m_va */
};

How does the system populate this? It walks through all the segments of
an address space (assuming we are dumping only kernel pages, it will
just be 'kas') and for each page of VA mapped, it finds the underlying
page frame number and fills in the above structure. This is the key.
In our debuggers, whenever we try to access a VA, the debugger just looks at
this map, finds the PFN, seeks to the appropriate page/offset and reads
whatever is present there (so we should be able to write our own read
routines instead of using kvm_read(), for fun).

How big is it? It's determined by the 'dump_nvtop' member of dumphdr.
(size is: dump_nvtop*sizeof(mem_vtop_t))

4. This is followed by the Page Frame Numbers (PFNs) of all the
memory pages in use, in array form.

How big is this section? It's determined by the 'dump_npages' member of
dumphdr_t.
(size is: dump_npages*sizeof(pfn_t))

5. This is followed by the contents of all the PFNs noted above, in compressed
form. Each compressed buffer is preceded by the compressed buffer size
(which is a uint32_t).

Big picture:


                  |                       |
                  -------------------------  ^  start
                  |                       |  |
                  |                       |  pagesize (8192)
                  |                       |  |
ksyms start       -------------------------  -
                  |                       |  ^
                  |                       |  |
                  |                       |  dump_ksyms_csize/dump_ksyms_size
                  |                       |  |
                  |                       |  v
page xlation map  -------------------------  -
start             |                       |  ^
                  |                       |  |
                  |                       |  dump_nvtop * sizeof(mem_vtop_t)
                  |                       |  |
                  |                       |  v
pfn start         -------------------------  -
                  |                       |  ^
                  |                       |  |
                  |                       |  dump_npages * sizeof(pfn_t)
                  |                       |  |
Actual dump start -------------------------  v
                  |                       |
                  |                       |
                  |                       |
                  -------------------------  <--- this offset is 64k aligned
                  |  dumphdr stored here  |
                  |                       |
                  _________________________
                  |                       |  ^
                  |                       |  |
                  |                       |  64k
                  |                       |  |
                  -------------------------  v  End of swap device



Layout of vmcore.*:
-------------------

1. First page will be a blank page.

2. The kernel symbol table is read from the swap device into vmcore.*,
uncompressed (nothing happens here, as it was not compressed in the first
place) and then written to unix.*.
This means we don't need to pull over unix.* while copying the crash
dumps; we can extract it from vmcore.* (some more fun....). All we need
to do is: open vmcore.*; seek by 8192; and then read dumphdr.dump_ksyms_size
bytes into unix.*

Now the question is: as we don't have access to the swap device, how do we
find out the value of dumphdr.dump_ksyms_size?

The thing is, point 1 is wrong. The first page is not exactly blank: it
contains the corehdr. So read sizeof(dumphdr_t) bytes from offset 0, take
corehdr.dump_ksyms_size from it, and copy that many bytes from
vmcore.* into unix.*. Now we have a unix.* !!
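
The copy itself is plain POSIX I/O. Below is a minimal sketch, assuming the caller has already read the header and knows the symbol table offset (8192 in the text) and its size (corehdr.dump_ksyms_size). The names copy_range and extract_unix are made up for illustration, and parsing dumphdr_t is deliberately left out since that needs the Solaris headers.

```c
#define	_XOPEN_SOURCE	700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy len bytes starting at offset off of infd to outfd; 0 on success. */
static int
copy_range(int infd, int outfd, off_t off, size_t len)
{
	char buf[8192];

	assert(infd >= 0 && outfd >= 0);
	while (len > 0) {
		size_t want = len < sizeof (buf) ? len : sizeof (buf);
		ssize_t got = pread(infd, buf, want, off);

		if (got <= 0)
			return (-1);	/* I/O error or file too short */
		for (ssize_t done = 0; done < got; ) {
			ssize_t w = write(outfd, buf + done,
			    (size_t)(got - done));
			if (w <= 0)
				return (-1);
			done += w;
		}
		off += got;
		len -= (size_t)got;
	}
	return (0);
}

/*
 * Hypothetical helper: ksyms_off and ksyms_size would come from the
 * header stored at offset 0 of vmcore.*.
 */
static int
extract_unix(const char *vmcore_path, const char *unix_path,
    off_t ksyms_off, size_t ksyms_size)
{
	int rc = -1;
	int in = open(vmcore_path, O_RDONLY);

	if (in < 0)
		return (-1);
	int out = open(unix_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (out >= 0) {
		rc = copy_range(in, out, ksyms_off, ksyms_size);
		if (close(out) != 0)
			rc = -1;
	}
	(void) close(in);
	return (rc);
}
```

A call such as extract_unix("vmcore.0", "unix.0", 8192, ksyms_size) would then produce the symbol file, with ksyms_size taken from the header.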

3. Pfn table is dumped next. As the above diagram indicates, we need to
dump (dump_npages * sizeof(pfn_t)) bytes from appropriate offset i.e
dumphdr.dump_pfn

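4. Now we create the page translation info by reading the contents from the
swap device at offset 'page xlation map start'. On the swap device the data
type is mem_vtop_t, and in the core file it's dump_map_t:

struct dump_map {
	offset_t dm_first;	/* first entry in the hash chain */
	offset_t dm_next;	/* next entry in the hash chain */
	offset_t dm_data;	/* offset of the page contents in vmcore.* */
	struct as *dm_as;	/* address space */
	uintptr_t dm_va;	/* virtual address */
};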

There will be 'dump_nvtop' mem_vtop_t structures in the swap device.
'dump_nvtop' can be greater than or equal to 'dump_npages', as a page
could have been mapped at multiple VAs.

Now we read each mem_vtop_t, look at its PFN, find its location
in the pfn table (created in step 3) and update the 'dm_data' field as
follows:

dm_data = (dump_data + (location * PAGESIZE))

For 'dump_data' (the start of the page contents), please refer to the figure below.
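
The dm_data formula can be sketched as a lookup plus a multiply. Assumptions in this sketch: sketch_pfn_t stands in for the kernel's pfn_t, the function names are made up, and a simple linear search is used to find the location (a sorted table would allow a binary search instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sketch_pfn_t;	/* stand-in for the kernel's pfn_t */

/* Index of pfn in the PFN table, or -1 if that page was not dumped. */
static long
pfn_location(const sketch_pfn_t *pfn_table, size_t npages, sketch_pfn_t pfn)
{
	assert(pfn_table != NULL || npages == 0);
	for (size_t i = 0; i < npages; i++) {
		if (pfn_table[i] == pfn)
			return ((long)i);
	}
	return (-1);
}

/* dm_data = dump_data + location * pagesize, or -1 if pfn is absent. */
static int64_t
page_data_offset(int64_t dump_data, const sketch_pfn_t *pfn_table,
    size_t npages, sketch_pfn_t pfn, int64_t pagesize)
{
	long location = pfn_location(pfn_table, npages, pfn);

	if (location < 0)
		return (-1);
	return (dump_data + (int64_t)location * pagesize);
}
```

Since the pages are written to vmcore.* in PFN-table order, the index of a PFN in the table is all that is needed to locate its contents.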

5. Then the actual pages are dumped. On the dump device each page is
stored in compressed form: the compressed size followed by the
compressed contents.
To extract them, something like this is done:

/* read the compressed size that precedes each buffer */
pread(dumpfd, &csize, sizeof (uint32_t), dumpoff);
dumpoff += sizeof (uint32_t);          /* go past the size field */

pread(dumpfd, inbuf, csize, dumpoff);  /* read in the compressed buf */
dumpoff += csize;                      /* advance to the next record */

/* uncompress it; pagesize is the expected uncompressed size */
decompress(inbuf, outbuf, csize, pagesize);

/* write the uncompressed buf to the vmcore.* file */
pwrite(corefd, outbuf, pagesize, corehdr.dump_data + x * pagesize);
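
The "size, then contents" framing can be exercised on its own, without the decompression step. This is a sketch under assumptions: read_record and count_records are made-up names, the size prefix is read in host byte order, records are assumed to start at offset 0 of an ordinary file, and payloads larger than 8192 bytes are rejected.

```c
#define	_XOPEN_SOURCE	700
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read the record at *offp: a uint32_t size followed by that many bytes.
 * On success the payload is in buf, *offp points past the record and the
 * payload size is returned; -1 on error or if the payload will not fit.
 */
static ssize_t
read_record(int fd, off_t *offp, void *buf, size_t bufsize)
{
	uint32_t csize;

	assert(offp != NULL && buf != NULL);
	if (pread(fd, &csize, sizeof (csize), *offp) !=
	    (ssize_t)sizeof (csize))
		return (-1);
	if (csize > bufsize)
		return (-1);
	if (pread(fd, buf, csize, *offp + (off_t)sizeof (csize)) !=
	    (ssize_t)csize)
		return (-1);
	*offp += (off_t)(sizeof (csize) + csize);
	return ((ssize_t)csize);
}

/* Walk every record in the file at path; returns the count or -1. */
static long
count_records(const char *path)
{
	char buf[8192];
	off_t off = 0;
	long n = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return (-1);
	off_t end = lseek(fd, 0, SEEK_END);
	while (off < end) {
		if (read_record(fd, &off, buf, sizeof (buf)) < 0) {
			n = -1;
			break;
		}
		n++;
	}
	(void) close(fd);
	return (n);
}
```

In a real extractor each payload returned by read_record would be handed to the decompression routine before being written out.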

Big picture (vmcore.X):


offset: 0    --------------------------
             |       corehdr_t        | ^
             |                        | |
             |                        | 8192
             |                        | |
corehdr.     |                        | v
dump_ksyms   --------------------------
             |                        | ^
             |  Kernel                | |
             |  Symbols (A.K.A)       | |
             |  unix.*                | unix.* size aligned to page size
             |                        | |
             |                        | |
             |                        | v
dump_pfn     --------------------------
             |                        | ^
             |  Number of elements:   | |
             |  (dump_npages)         | pfn_table (PFNs of all dumped pages)
             |                        | | (aligned to page size)
             |                        | v
dump_map     --------------------------
             |                        | ^
             |  Array of dump_map_t   | |
             |  structures            | |
             |  (dump_nvtop)          | Xlation map
             |                        | |
             |                        | |
             |                        | v
dump_data    --------------------------
             |                        | ^
             |                        | |
             |                        | |
             |                        | Contents of Physical pages
             |                        | are here
             |                        | |
             |                        | |
             |                        | v
             --------------------------

So given this picture, how do we do the translation for a given VA?

1. First search for the VA in the translation map starting at: corehdr.dump_map
2. Get the 'dm_data' of the corresponding dump_map_t
3. Seek to that offset, and just read.
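
The three steps above can be sketched over an in-memory copy of the map. Everything here is a simplification: struct va_map_entry keeps only the two dump_map_t fields the lookup needs, the address space (dm_as) is ignored, entries are assumed to be page-granular, and a linear search is used instead of following the dm_first/dm_next chains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for dump_map_t; illustrative only. */
struct va_map_entry {
	uintptr_t	va;	/* page-aligned virtual address (dm_va) */
	int64_t		data;	/* file offset of the page (dm_data) */
};

/*
 * Return the vmcore.* file offset holding the byte at virtual address
 * va, or -1 if no map entry covers it.  pagesize must be a power of two.
 */
static int64_t
va_to_offset(const struct va_map_entry *map, size_t n, uintptr_t va,
    uintptr_t pagesize)
{
	assert(pagesize != 0 && (pagesize & (pagesize - 1)) == 0);
	uintptr_t page = va & ~(pagesize - 1);	/* start of containing page */

	for (size_t i = 0; i < n; i++) {
		if (map[i].va == page)
			return (map[i].data + (int64_t)(va - page));
	}
	return (-1);
}
```

The returned offset can be passed straight to pread() on the vmcore.* file, which is all a home-grown replacement for kvm_read() would need for a single page.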


( Jul 12 2005, 11:50:47 PM PDT )
