Weblog

All | General | Solaris
« Previous day (Sep 20, 2005) | Main | Next day (Sep 22, 2005) »
20050921 Wednesday September 21, 2005

data loss root caused... moddebug fun on DEBUG builds... Few weeks back I was looking at a data loss issue reported by a customer and in this  post I describe how it is root caused.

This initially came in as a NFSv3 issue.  Customer doesn't see the problem in case of ufs. But sees it only if he uses nfs. Even with nfs, he doesn't see the problem if  'noac' mount option is used (noac suppresses caching). Ofcourse customer doesn't  want to use noac, as  performance is penalized.  Even when he switched NFS servers, the problem persisted.  Customer doesn't see the problem with solaris10 on a sparc box but sees it on an opteron box with same solaris 10.  He sees it with  both 32bit and 64bit kernel.

Couple of enterprising 'nfs' engineers, after looking at the data from customer site, cooked up a 10 line reproducible test case and proved that the problem is not in NFS land but in the 'vm' layer beneath.

Armed with a simple test case,  a killer weapon called DTrace and a killer system called V40z I took 5hrs to zero-in on the problematic routine.

What the test case does is: it opens an existing file, lseek's to an offset: 2633940  [== (321*8192) + (4096+212)] and does a write of 60bytes and then seeks back to the beginning of the file and starts reading the whole file in 8k chunks. And when finished reading, closes the file.
Thats all and last 60bytes written is lost, somehow. And the problem happens every time.

Whenever  a user reads or writes to a file, the data is read/copied into the kernel address space before being pushed to userland or the FS.  Where to copy into the kernel is controlled  by a segment driver called: segmap.  What it does is, given a vnode and an offset it churns out a kernel VA where a page can be mapped to, such that FileSystems can copy the data from this page or into the page depending upon whether its a write or read request.

With 64bit kernel, 64Mb of kernel VA is set aside for segmap. And each 'smap' structure maps 8Kb of this VA. Thus at boot time 8192  'smap'  structures are created to map the whole 64Mb.  Primarily  'smap' structure contains the vnode/offset members, beside some other information.
Even in 64bit land, page size on an x86 box is still 4Kb only.  Thus given a segmap addres, kernel needs 2pages to map 8k size.

If you refer back to the offset where the write is happening, its actually happening to a 2nd page of a segmap address.
The stack trace when we load a PTE for the segmap address as part of write is:


hat_memload:entry adr: fffffe80c0ddb000 pp: (pfn:dfc87)
genunix`segmap_fault+0x1c9
genunix`as_fault+0x205
unix`pagefault+0x8b
unix`trap+0xdbc
unix`_cmntrap+0x11b
unix`kcopy+0x2c
genunix`uiomove+0xdf
nfs`writerp+0x87
nfs`nfs3_write+0x3ad
genunix`fop_write+0xb
 
So what this means is we got an address:  fffffe80c0ddb000   from segmap for the vnode/offset we passed. And the page number kernel set aside for this is: dfc87. Ones we load a translation entry for this page, and when the write happens to the page,  processor marks the page dirty in one of the bits of the PTE.

What happens ususally is, fsflush or say pageout, finds that this page is dirty (refer to pvn_getdirty()), syncs these flags from PTE into the page structure and wirtes the contents to the backign store before disposing the page.

What the test case does next is, seeks back to the beginning and read the page in 8k chunks :

As the thread reads the pages in 8k chunks, when it comes to offset (321*8192),  in the next read it has to  return (4102+212+the 60bytes that we wrote last),  as segmap already has an entry for this vnode/offset it returns the segmap address the corresponds to the first page:
	segmap_fault:entry addr: fffffe80c0dda000, len: 8192
genunix`segmap_fault+0x1c9
genunix`segmap_getmapflt+0x3ab
nfs`nfs3_read+0x1da
genunix`fop_read+0xb
genunix`read+0x188
genunix`read32+0xe
unix`sys_syscall32+0xd9

Its understandable that this address and the earlier segmap address are 4k apart.

The catch is in segmap_getmapflt() - What it does is, it does a hat_probe() on the beginning address, as this is the first time we are accessing this address, there is no valid PTE for this address and ones hat_probe() returns zero saying there is no valid PTE for this address,  it thinks same is valid even for the 2nd page thats part of this segmap address.

So it now goes ahead and installs a new PTE for both the pages.  And thus a valid entry for the 2nd page is lost :

	x86pte_set:return new pte: 80000000dfc87103, old pte: 80000000dfc87163
	unix`hati_pte_map+0x93                                   
        unix`hat_kmap_load+0b7
unix`hat_memload+0x87
genunix`segmap_fault+0x1c9 genunix`segmap_getmapflt+0x3ab
nfs`nfs3_read+0x1da genunix`fop_read+0xb genunix`read+0x188 genunix`read32+0xe unix`sys_syscall32+0xd9

x86pte_set() installs a new pte and returns the old pte - as we can see the page involved is: dfc87.
The returned pte has bits 0x60 (which stand for PT_MOD | PT_REF).  Ones the new pte is installed, the 60bytes written  is guanranteed to be lost and are lost.
The problem doesn't happen in sparc land as the page size is 8Kb.
As you may have guessed by now, if the test case reads in 4k chunks, the problem would not have happened.

This problem is fixed both in s10 and nevada.
( Sep 21 2005, 09:43:45 PM PDT ) Permalink Comments [13]

Calendar

RSS Feeds

Search

Links

Navigation

Referers