Tuesday July 12, 2005

System crash file layout

Some time back, while working on a pet project (maybe the subject of my next blog), I
studied the layout of a system crash file, and this blog is about that. It applies to crash
files from the sun4u platform (things should not be very different in x86/64 land, but I
wouldn't bet on it).

In this document I am explaining the layout of the system crash file on the
swap disk and how it appears in vmcore.* after running 'savecore'.
Uses: 1. Getting to know the layout.
2. Extracting unix.* from vmcore.*, and thus avoiding copying over
unix.* while downloading cores (not a major gain, but fun...).
3. Avoiding kvm_* calls by doing the virtual to physical translation
on our own and then reading the appropriate location in vmcore.* (again,
for fun...).
As soon as the system comes back up after a crash, the crash dump is available
on the primary swap device, and it starts at:

offset = (size of the device - 64k);
'offset' is truncated to get aligned on a 64k boundary

and grows upwards. At the offset, calculated as above, there will be a
dumphdr_t structure which describes the dump.

There is no need to know the name of the primary swap device; the following
lines do the job:

    dumpfd = open("/dev/dump", O_RDONLY, 0444);
    ioctl(dumpfd, DIOCGETDEV, dumpfile);    <- this puts the dump device name
                                               into "dumpfile"
    dumpfd = open(dumpfile, O_RDONLY, 0444);

Seek to the 64k-aligned offset near the end of the device:

    endoff = llseek(dumpfd, -DUMP_OFFSET, SEEK_END) & -DUMP_OFFSET;

And read the dumphdr_t from there:

    pread64(dumpfd, &dumphdr, sizeof (dumphdr_t), endoff);
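
The arithmetic behind that seek can be tried out without a dump device. The following is only a sketch: it assumes DUMP_OFFSET is 65536 (the "64k" above), and dumphdr_offset() is a name I made up for illustration, not something from the Solaris sources.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: DUMP_OFFSET is 64k, as described in the text above. */
#define DUMP_OFFSET 65536

/*
 * Hypothetical helper: given the size of the dump device in bytes, return
 * the offset at which the dumphdr_t is expected, i.e. (size - 64k)
 * truncated down to a 64k boundary.
 */
int64_t
dumphdr_offset(int64_t devsize)
{
    assert(devsize >= DUMP_OFFSET);
    /*
     * -DUMP_OFFSET has its low 16 bits clear and every higher bit set,
     * so the AND rounds down to a multiple of 64k.
     */
    return (devsize - DUMP_OFFSET) & -(int64_t)DUMP_OFFSET;
}
```

For a 1,000,000 byte device this gives 917504, which is 14 * 64k: the last 64k boundary that still leaves at least 64k before the end of the device.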

dumphdr contains the info that describes the various sections of the crash dump:
their start points, their sizes, and some more info like crash time, utsname...

Sections:

1. dump_start of dumphdr denotes the beginning of the crash dump, and there is a
blank page (size: 8192) to start with (even this contains one copy of
dumphdr_t, so it's not exactly blank :-))

2. Then starts the symbol table, nominally in compressed form (it's no longer
compressed !!), which subsequently becomes 'unix.*' upon running savecore. Its
exact size is available in dump_ksyms_size/dump_ksyms_csize of dumphdr.

3. This is followed by the page translation map. It's an array of structures
of type mem_vtop:

    struct mem_vtop {
        struct as *m_as;
        void *m_va;
        pfn_t m_pfn;
    };

How does the system populate this? It walks through all the segments of
an address space (assuming we are dumping only kernel pages, that will
just be 'kas'), and for each page of VA mapped it finds the underlying
page frame number and fills in the above structure. This is the key.
In our debuggers, whenever we try to access a VA, the debugger just looks at
this map, finds the PFN, seeks to the appropriate page/offset and reads
whatever is present there (so we should be able to write our own read
routines instead of using kvm_read(), for fun).

How big is it? It is determined by the 'dump_nvtop' member of dumphdr.
(size is: dump_nvtop*sizeof(mem_vtop_t))

4. This is followed by the Page Frame Numbers (PFNs) of all the memory pages
in use, dumped as an array.

How big is this section? It is determined by the 'dump_npages' member of
dumphdr_t.
(size is: dump_npages*sizeof(pfn_t))

5. This is followed by the contents of all the PFNs noted above, in compressed
form. Each compressed buffer is preceded by the compressed buffer size
(which is a uint32_t).
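
To make sections 3 and 4 a bit more concrete, here is a small sketch of the two size formulas and of a VA to PFN lookup over an in-memory copy of the translation map. It is an illustration only: pfn_t is a stand-in typedef (the real one comes from the Solaris headers), struct as is left opaque, and vtop_bytes(), pfn_bytes() and vtop_lookup() are names I invented. The linear search is the simplest thing that works, not necessarily what any debugger does.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long pfn_t;    /* stand-in for the Solaris pfn_t */
struct as;                      /* opaque address space; only its pointer is used */

typedef struct mem_vtop {
    struct as *m_as;
    void *m_va;
    pfn_t m_pfn;
} mem_vtop_t;

/* Size in bytes of the translation map: dump_nvtop * sizeof (mem_vtop_t). */
size_t
vtop_bytes(size_t nvtop)
{
    return nvtop * sizeof (mem_vtop_t);
}

/* Size in bytes of the PFN table: dump_npages * sizeof (pfn_t). */
size_t
pfn_bytes(size_t npages)
{
    return npages * sizeof (pfn_t);
}

/*
 * Hypothetical lookup: scan the map for the (as, va) pair and store the
 * matching PFN in *pfnp. Returns 0 if found, -1 if the VA is not mapped.
 */
int
vtop_lookup(const mem_vtop_t *map, size_t nvtop, const struct as *as,
    const void *va, pfn_t *pfnp)
{
    assert(map != NULL && pfnp != NULL);
    for (size_t i = 0; i < nvtop; i++) {
        if (map[i].m_as == as && map[i].m_va == va) {
            *pfnp = map[i].m_pfn;
            return 0;
        }
    }
    return -1;
}
```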

Big picture:

                   |                         |
 start ----------->+-------------------------+  -
                   |                         |  ^
                   |                         |  | pagesize (8192)
                   |                         |  v
 ksyms start ----->+-------------------------+  -
                   |                         |  ^
                   |                         |  | dump_ksyms_csize/
                   |                         |  | dump_ksyms_size
                   |                         |  v
 page xlation ---->+-------------------------+  -
 map start         |                         |  ^
                   |                         |  | dump_nvtop *
                   |                         |  | sizeof(mem_vtop_t)
                   |                         |  v
 pfn start ------->+-------------------------+  -
                   |                         |  ^
                   |                         |  | dump_npages *
                   |                         |  | sizeof(pfn_t)
                   |                         |  v
 Actual dump ----->+-------------------------+  -
 start             |                         |
                   |                         |
                   |                         |
                   +-------------------------+ <--- this offset is 64k aligned
                   | dumphdr stored here     |
                   |                         |
                   +-------------------------+  -
                   |                         |  ^
                   |                         |  | 64k
                   |                         |  v
                   +-------------------------+  End of swap device



Layout of vmcore.*:
-------------------

1. The first page is a blank page.

2. The kernel symbol table is read from the swap device into vmcore.*,
uncompressed (nothing happens here, as it was not compressed in the first
place), and then also written to unix.*.
This means we don't need to pull over unix.* while copying over the crash
dumps; we can extract it from vmcore.* (some more fun....). All we need
to do is: open vmcore.*; seek to offset 8192; and then copy
dumphdr.dump_ksyms_size bytes into unix.*

(Now the question is: as we don't have access to the swap device, how do we
find out the value of dumphdr.dump_ksyms_size?

The thing is, point 1 is wrong: the first page is not exactly blank, it
contains the corehdr. So read sizeof(dumphdr_t) bytes from offset 0 into
corehdr, then copy corehdr.dump_ksyms_size bytes from vmcore.* (starting at
offset 8192) into unix.* - now we have a unix.* !!)

3. The pfn table is dumped next. As the above diagram indicates, we need to
copy (dump_npages * sizeof(pfn_t)) bytes from the appropriate offset on the
swap device, i.e. dumphdr.dump_pfn

4. Now we create the page translation info by reading the contents from the
swap device at offset 'page xlation map start'. On the swap device the data
type is mem_vtop_t, and in the core file it is dump_map_t:

    struct dump_map {
        offset_t dm_first;
        offset_t dm_next;
        offset_t dm_data;
        struct as *dm_as;
        uintptr_t dm_va;
    };

There will be 'dump_nvtop' mem_vtop_t structures on the swap device.
'dump_nvtop' can be greater than or equal to 'dump_npages', as a page
could have been mapped at multiple VAs.

Now we read each mem_vtop_t, look at its PFN, find its location in the
pfn table (created in step 3) and set the 'dm_data' member as follows:

    dm_data = (dump_data + (location * PAGESIZE))

For 'dump_data' (the start of the page contents), please refer to the
figure below.

5. Then the actual pages are dumped. On the dump device these pages are stored
in compressed form: the compressed size followed by the compressed contents.
To extract them, something like this is done (x being the index of the page
being written):

    pread(dumpfd, &csize, sizeof (uint32_t), dumpoff);
                  ^^^^^^ <- this gets us the compressed size
    dumpoff += sizeof (uint32_t);           <- go past the size field

    pread(dumpfd, inbuf, csize, dumpoff);   <- read in the compressed buf
    dumpoff += csize;                       <- go past the compressed data

    decompress(inbuf, outbuf, csize, pagesize);   <- uncompress it
                                     ^^^^^^^^ <- expected uncompressed size

    pwrite(corefd, outbuf, pagesize, corehdr.dump_data + x * pagesize);
                                            <- write the uncompressed buf
                                               to the vmcore.* file
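
Coming back to the unix.* trick from step 2: once corehdr has been read from offset 0 of vmcore.*, the extraction is just a ranged copy. Below is a minimal sketch of such a copy using plain stdio. copy_range() is a made-up helper name, and the sketch deliberately leaves out reading the dumphdr_t itself, since that needs the Solaris <sys/dumphdr.h>. It uses fseek() with a long offset, which I assume is fine here because the symbol table sits right after the first page.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: copy 'len' bytes starting at offset 'off' of 'in'
 * to the current position of 'out'.
 * Returns 0 on success, -1 on a seek/read/write error or short input.
 */
int
copy_range(FILE *in, FILE *out, long off, size_t len)
{
    char buf[8192];

    assert(in != NULL && out != NULL);
    if (fseek(in, off, SEEK_SET) != 0)
        return -1;
    while (len > 0) {
        size_t want = len < sizeof (buf) ? len : sizeof (buf);
        size_t got = fread(buf, 1, want, in);
        if (got == 0)
            return -1;          /* read error or unexpected end of file */
        if (fwrite(buf, 1, got, out) != got)
            return -1;
        len -= got;
    }
    return 0;
}
```

With vmcore.* opened as 'in' and a fresh unix.* opened as 'out', the call would be something like copy_range(in, out, 8192, corehdr.dump_ksyms_size).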

Big picture (vmcore.X):

 offset: 0 ------->+--------------------------+  -
                   | corehdr_t                |  ^
                   |                          |  | 8192
                   |                          |  v
 corehdr. -------->+--------------------------+  -
 dump_ksyms        | Kernel                   |  ^
                   | Symbols (A.K.A)          |  | unix.* size aligned to
                   | unix.*                   |  | page size
                   |                          |  v
 dump_pfn -------->+--------------------------+  -
                   | Number of elements:      |  ^
                   | (dump_npages)            |  | pfn_table (PFNs of all
                   |                          |  | dumped pages)
                   |                          |  | (aligned to page size)
                   |                          |  v
 dump_map -------->+--------------------------+  -
                   | Array of dump_map_t      |  ^
                   | structures               |  | Xlation map
                   | (dump_nvtop)             |  |
                   |                          |  v
 dump_data ------->+--------------------------+  -
                   |                          |  ^
                   |                          |  | Contents of Physical pages
                   |                          |  | are here
                   |                          |  |
                   |                          |  v
                   +--------------------------+
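
The offsets in this picture can be computed from the section sizes by rounding each section up to a page boundary. The sketch below does just that. The figure states the page alignment for the symbol table and the pfn table; rounding the translation map up as well is my assumption. core_layout_t, round_up() and core_layout() are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the section offsets shown in the figure. */
typedef struct core_layout {
    uint64_t ksyms;     /* corehdr.dump_ksyms */
    uint64_t pfn;       /* corehdr.dump_pfn */
    uint64_t map;       /* corehdr.dump_map */
    uint64_t data;      /* corehdr.dump_data */
} core_layout_t;

/* Round n up to the next multiple of align. */
uint64_t
round_up(uint64_t n, uint64_t align)
{
    return (n + align - 1) / align * align;
}

/*
 * Lay the sections out back to back after the first page, each one
 * padded to a page boundary.
 */
core_layout_t
core_layout(uint64_t pagesize, uint64_t ksyms_size,
    uint64_t pfn_table_bytes, uint64_t map_bytes)
{
    core_layout_t l;

    assert(pagesize > 0);
    l.ksyms = pagesize;                 /* first page holds the corehdr */
    l.pfn = l.ksyms + round_up(ksyms_size, pagesize);
    l.map = l.pfn + round_up(pfn_table_bytes, pagesize);
    l.data = l.map + round_up(map_bytes, pagesize);
    return l;
}
```

For example, with 8192 byte pages, a 10000 byte symbol table, a 100 byte pfn table and an 8192 byte map, the symbols start at 8192, the pfn table at 24576, the map at 32768 and the page contents at 40960.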

So, given this picture, how do we do the translation for a given VA?

1. First search for the VA in the translation map starting at: corehdr.dump_map
2. Get the 'dm_data' of the corresponding dump_map_t
3. Seek to that offset, and just read.
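
The three steps above can be sketched as a function that turns a VA into a file offset within vmcore.*. Several things here are my assumptions and not taken from savecore: the map entries are taken to be keyed by page-aligned VA, the offset within the page is added to dm_data, and the search is a plain linear scan over an in-memory copy of the map (the dm_first/dm_next members are not used at all). offset_t is a stand-in typedef, and va_to_offset() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t offset_t;       /* stand-in for the Solaris offset_t */
struct as;                      /* opaque address space; only its pointer is used */

typedef struct dump_map {
    offset_t dm_first;
    offset_t dm_next;
    offset_t dm_data;
    struct as *dm_as;
    uintptr_t dm_va;
} dump_map_t;

/*
 * Hypothetical translation: return the offset in vmcore.* at which the
 * byte for (as, va) lives, or -1 if no map entry covers that page.
 * pagesize must be a power of two.
 */
offset_t
va_to_offset(const dump_map_t *map, size_t nmap, const struct as *as,
    uintptr_t va, uintptr_t pagesize)
{
    assert(pagesize != 0 && (pagesize & (pagesize - 1)) == 0);
    uintptr_t page = va & ~(pagesize - 1);      /* page-aligned VA */
    uintptr_t off = va & (pagesize - 1);        /* offset within the page */

    for (size_t i = 0; i < nmap; i++) {
        if (map[i].dm_as == as && map[i].dm_va == page)
            return map[i].dm_data + (offset_t)off;
    }
    return -1;
}
```

The returned offset is what one would hand to pread() on the vmcore.* file descriptor in place of a kvm_read() call.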


