
Wednesday September 21, 2005
data loss root caused...
moddebug fun on DEBUG builds...
Few weeks back I was looking at a data loss issue reported by a
customer and in this post I describe how it is root caused.
This initially came in as a NFSv3 issue. Customer doesn't see the
problem in case of ufs. But sees it only if he uses nfs. Even with nfs,
he doesn't see the problem if 'noac' mount option is used (noac
suppresses caching). Ofcourse customer doesn't want to use noac,
as performance is penalized. Even when he switched NFS
servers, the problem persisted. Customer doesn't see the problem
with solaris10 on a sparc box but sees it on an opteron box with same
solaris 10. He sees it with both 32bit and 64bit kernel.
Couple of enterprising 'nfs' engineers, after looking at the data from
customer site, cooked up a 10 line reproducible test case and proved
that the problem is not in NFS land but in the 'vm' layer beneath.
Armed with a simple test case, a killer weapon called DTrace and
a killer system called V40z I took 5hrs to zero-in on the problematic
routine.
What the test case does is: it opens an existing file, lseek's to an
offset: 2633940 [== (321*8192) + (4096+212)] and does a write of
60bytes and then seeks back to the beginning of the file and starts
reading the whole file in 8k chunks. And when finished reading, closes
the file.
Thats all and last 60bytes written is lost, somehow. And the problem
happens every time.
Whenever a user reads or writes to a file, the data is
read/copied into the kernel address space before being pushed to
userland or the FS. Where to copy into the kernel is controlled
by a segment driver called: segmap. What it does is, given
a vnode and an offset it churns out a kernel VA where a page can be
mapped to, such that FileSystems can copy the data from this page or
into the page depending upon whether its a write or read request.
With 64bit kernel, 64Mb of kernel VA is set aside for segmap. And each
'smap' structure maps 8Kb of this VA. Thus at boot time 8192
'smap' structures are created to map the whole 64Mb.
Primarily 'smap' structure contains the vnode/offset
members, beside some other information.
Even in 64bit land, page size on an x86 box is still 4Kb only.
Thus given a segmap addres, kernel needs 2pages to map 8k size.
If you refer back to the offset where the write is happening, its
actually happening to a 2nd page of a segmap address.
The stack trace when we load a PTE for the segmap address as part of
write is:
hat_memload:entry adr: fffffe80c0ddb000 pp: (pfn:dfc87)
genunix`segmap_fault+0x1c9
genunix`as_fault+0x205
unix`pagefault+0x8b
unix`trap+0xdbc
unix`_cmntrap+0x11b
unix`kcopy+0x2c
genunix`uiomove+0xdf
nfs`writerp+0x87
nfs`nfs3_write+0x3ad
genunix`fop_write+0xb
So what this means is we got an address: fffffe80c0ddb000
from segmap for the vnode/offset we passed. And the page number kernel
set aside for this is: dfc87. Ones we load a translation entry for this
page, and when the write happens to the page, processor marks the
page dirty in one of the bits of the PTE.
What happens ususally is, fsflush or say pageout, finds that this page
is dirty (refer to pvn_getdirty()), syncs these flags from PTE into the
page structure and wirtes the contents to the backign store before
disposing the page.
What the test case does next is, seeks back to the beginning and read
the page in 8k chunks :
As the thread reads the pages in 8k chunks, when it comes to offset
(321*8192), in the next read it has to return (4102+212+the
60bytes that we wrote last), as segmap already has an entry for
this vnode/offset it returns the segmap address the corresponds to the
first page:
segmap_fault:entry addr: fffffe80c0dda000, len: 8192
genunix`segmap_fault+0x1c9
genunix`segmap_getmapflt+0x3ab
nfs`nfs3_read+0x1da
genunix`fop_read+0xb
genunix`read+0x188
genunix`read32+0xe
unix`sys_syscall32+0xd9
Its understandable that this address and the earlier segmap address are
4k apart.
The catch is in segmap_getmapflt() - What it does is, it does a
hat_probe() on the beginning address, as this is the first time we are
accessing this address, there is no valid PTE for this address and ones
hat_probe() returns zero saying there is no valid PTE for this address,
it thinks same is valid even for the 2nd page thats part of this
segmap address.
So it now goes ahead and installs a new PTE for both the pages.
And thus a valid entry for the 2nd page is lost :
x86pte_set:return new pte: 80000000dfc87103, old pte: 80000000dfc87163
unix`hati_pte_map+0x93
unix`hat_kmap_load+0b7
unix`hat_memload+0x87
genunix`segmap_fault+0x1c9
genunix`segmap_getmapflt+0x3ab
nfs`nfs3_read+0x1da
genunix`fop_read+0xb
genunix`read+0x188
genunix`read32+0xe
unix`sys_syscall32+0xd9
x86pte_set() installs a new pte and returns the old pte - as we can see
the page involved is: dfc87.
The returned pte has bits 0x60 (which stand for PT_MOD | PT_REF).
Ones the new pte is installed, the 60bytes written is
guanranteed to be lost and are lost.
The problem doesn't happen in sparc land as the page size is 8Kb.
As you may have guessed by now, if the test case reads in 4k chunks,
the problem would not have happened.
This problem is fixed both in s10 and nevada.
( Sep 21 2005, 09:43:45 PM PDT )
Permalink
Trackback URL: http://blogs.sun.com/sprakki/entry/data_loss_root_caused
Posted by hjk on January 22, 2006 at 06:29 AM PST #
Posted by rewrew on July 08, 2006 at 07:15 PM PDT #
Posted by wer on July 24, 2006 at 08:12 AM PDT #
Posted by wer on August 02, 2006 at 07:05 AM PDT #
Posted by wrewmkewkdre on August 09, 2006 at 08:32 PM PDT #
Posted by werwer on August 20, 2006 at 08:51 PM PDT #
Posted by rwerwe on September 17, 2006 at 06:59 PM PDT #
Posted by werwe on September 28, 2006 at 11:32 PM PDT #
Posted by fdasfdsa on October 11, 2006 at 06:32 PM PDT #
Posted by sfwer on October 11, 2006 at 11:38 PM PDT #
Posted by fdasfdsa on October 12, 2006 at 01:29 AM PDT #
Posted by fdadsfdas on October 14, 2006 at 09:55 PM PDT #