Debugging a UFS file truncation bug
I decided for my
first
OpenSolaris blog I would
talk about a particularly difficult to find UFS bug that I worked on
about a year ago. This bug came along early in my new career in UFS
and the Solaris kernel. It was a bug that taught me a huge amount
about UFS, Solaris VM paging, delayed writes, DTrace and other
miscellaneous kernel things. The fix was trivial, a one parameter
change in a function call, but tracking it down was difficult. The
trivial fix bugs are sometimes some of the most interesting to work
on. The evaluation is where it gets interesting.
The bug is
5047967
ufs_itrunc does not account for pagesize < blocksize. This
was originally known as the 'bmap' bug. Some of the details below
show why it was named this. For more gory details you can read the
bug report. Subsequently, during the course of my
working
on it I found it was not a
bmap_write
but indeed a
ufs_itrunc
bug and I changed the synopsis to reflect this.
This bug was
found running a filesystem stress test called fsx.c. I cannot give
you source to this test, but the basic things that this test did were
to create a file, truncate it, grow it, write to it, write past the
end to create and populate the holes. Then it validated that the data
at every block was indeed what was expected. The beginning output
from the test showed all was OK. Later the output will show where the
error occurred.
So far, so good:
141(141 mod 256): TRUNCATE UP from 0xc0f9 to 0x2ad8a******WWWW
142(142 mod 256): TRUNCATE DOWN from 0x2ad8a to 0x1c4f8 ******WWWW
143(143 mod 256): READ 0xe4c3 thru 0xfa85 (0x15c3 bytes)
144(144 mod 256): WRITE 0x34119 thru 0x3c577 (0x845f bytes) HOLE ***WWWW
145(145 mod 256): READ 0x1c9ef thru 0x1e0e9 (0x16fb bytes) ***RRRR***
So, I was pondering how I could possibly see what was happening with
this bug, and what the values were we actually passed in to the
appropriate function calls. Enter the miracle known as DTrace.
Output from my DTrace script validating that the data passed to
these functions is:
length ufs_itrunc: 1c4f8
inode: 7
inode orig size: 2ad8a
boff: 4f8
bmask: ffffe000
0 14386 bmap_write:entry
off: 1c4f7
i size: 4f8
0 9098 fbzero:entry
fbzero off: 1c000
fbzero size: 2000
You can see that the length passed in to ufs_itrunc equals the
new end point as shown in the last truncate down. This is all good.
Then, the call to bmap_write is correct since when we call
this we pass in length-1, and get 1c4f7. All good so far. Then the
call to fbzero
should have zeroed the blocks between 1c000-1e000, which should have
covered the range of the truncate, that is from up to 1c4f8. However,
the failure output from fsx.c shows:
READ BAD DATA: offset = 0x1c9ef, size = 0x16fb
OFFSET GOOD BAD RANGE
0x1d000 0x0000 0x80ea 0x ff4
That starting at 0x1d000 we have bad data. The call to the truncate
up above, prior to the truncate down, looked like it also had the
accurate data passed in(DTrace output not shown for this). The
initial thought on this bug was that we had a race since the file was
not opened with O_DSYNC. This turned out to be not the case. Read the
bug report for more details. In short, ftruncate does not use
the O_DSYNC flag at all so calls coming in to ufs_itrunc from
there are never synchronous.
So, what was the real problem?
Well it was thought to be a problem when we were trying to fill in a
hole in the file when creating the hole. It is, indirectly as you
will see with the details below. We had truncated the file down, then
written it again, past the new end point, thus creating a hole. The
output of fsx.c for this process shows:
142(142 mod 256): TRUNCATE DOWN from 0x2ad8a to 0x1c4f8 ******WWWW
143(143 mod 256): READ 0xe4c3 thru 0xfa85 (0x15c3 bytes)
144(144 mod 256): WRITE 0x34119 thru 0x3c577 (0x845f bytes) HOLE ***WWWW <--right here we should zero the pages from 1c4f9->34119
145(145 mod 256): READ 0x1c9ef thru 0x1e0e9 (0x16fb bytes) ***RRRR***
From the DTrace output below it is clear we were not doing that. It
is kind of long output, but an important piece of the debugging
story, so it is important for me to include it here. There is a bit
of extra data in my output, partly because I had just started
learning DTrace too. I hadn't quite mastered the DTrace script
writing skills to narrow down my output a bit. I am much better now
:-). The data I got from this DTrace script made me *think* we were
having a problem in ufs_getpage.
Again, we were kind of, but in an indirect way...
DTrace output:
2 15344 rdip:entry
2 11884 segmap_getmapflt:entry
off + mapon: 1c4f9
n: 1b07
!pagecreate: 1
2 15061 ufs_getpage:entry offset: 1c000
length: 2000
plsz: 2000
rw: 1
2 1463 page_lookup:entry pgoff: 1c000
2 1464 page_lookup:return page_lookup return: c6388058<--we find the 1c000 offset page
2 1463 page_lookup:entry pgoff: 1d000<--have incremented offset by 4k, and look again
2 1464 page_lookup:return page_lookup return: 0<--don't find this page
2 15065 ufs_getpage_miss:entry offset: 1d000<--get a miss so go searching
length: 1000
plsz: 1000
rw: 1
2 15650 bmap_read:entry offset: 1d000
lbn: e
boff: 1000
2 8248 bread_common:entry blkno: 2000
2 8249 bread_common:return
2 14933 findextent:entry bshift: d<--should never have gotten here from bmap_read. This whole block is a hole.
n: 11
2 14934 findextent:return len: 1
2 15651 bmap_read:return
2 10888 pvn_read_kluster:entry
2 1373 page_create_va:entry offset: 1d000
2 1374 page_create_va:return pp: c63736a8<--we create the page here, but do not zero it.
tid: 1
2 10889 pvn_read_kluster:return
3 15062 ufs_getpage:return
So, we return from ufs_getpage without zeroing the 1d000 page.
Why? The code snippet below shows how we were getting in to this
situation. The code in ufs_getpage_miss
that zero pages that are not supported by a backing store, i.e.
UFS_HOLE is:
/*
* Figure out whether the page can be created, or must be
* must be read from the disk.
*/
if (rw == S_CREATE)
crpage = 1;
else {
contig = 0;
if (err = bmap_read(ip, off, &bn, &contig))
return (err);
crpage = (bn == UFS_HOLE);<--bmap_read should have returned bp==UFS_HOLE
}
if (crpage) {<---we don't come through here as a result of crpage == 0
if ((pp = page_create_va(vp, off, PAGESIZE, PG_WAIT, seg,
addr)) == NULL) {
return (ufs_fault(vp,
"ufs_getpage_miss: page_create == NULL"));
}
if (rw != S_CREATE)
pagezero(pp, 0, PAGESIZE);
} else {
u_offset_t io_off;
uint_t xlen;
struct buf *bp;
ufsvfs_t *ufsvfsp = ip->i_ufsvfs;
/*
* If access is not in sequential order, we read from disk
* in bsize units.
*
* We limit the size of the transfer to bsize if we are reading
* from the beginning of the file. Note in this situation we
* will hedge our bets and initiate an async read ahead of
* the second block.
*/
if (!seq || off == 0)
contig = MIN(contig, bsize);
pp = pvn_read_kluster(vp, off, seg, addr, &io_off,
&io_len, off, contig, 0); <--Read the data from disk.
However, as seen in the dtrace output above, bmap_read
does not return that 1d000 is a hole in the file. The disk has
garbage on it which we somehow think is good data and was not
recognized as a hole. Therefore we return the garbage we read.
Remember the old adage, 'Garbage In, Garbage Out'. Exactly what was
happening here.
I modified fsx.c to output when it found a hole that did not have
zero data. This is the output:
1c4f9 - 34119******
byte #: 1d000 = 80ea
byte #: 1d001 = ea80
byte #: 1d002 = 806b
byte #: 1d003 = 6b80
byte #: 1d004 = 8065
byte #: 1d005 = 6580
byte #: 1d006 = 80ba
byte #: 1d007 = ba80
byte #: 1d008 = 80a2
byte #: 1d009 = a280
byte #: 1d00a = 801c
byte #: 1d00b = 1c80
byte #: 1d00c = 80d6
...
byte #: 1dfff = 5700
1 page worth of data.
So, why is the read to the disk blocks
at this address not showing a hole? And, why was it always the 2nd
page in a filesystem block that had this problem? And why didn't it
happen every time we truncated to a non filesystem block boundary?
These questions and their answers were the key to finding the true
problem.
Remember from above, this bug only
happened when pagesize < filesystem blocksize. A filesystem
blocksize is usually 8k, but can be modified to be 4k. When pagesize
was 4k and filesystem block size was 8k we saw the bug. In the 8k
block/4k page size scenario truncate a file to a boundary such as the
picture below:
Now, re-grow the file to include the
whole block. ufs_itrunc will zero the whole block from the
end of the truncate which will zero a partial page and a whole page,
when pagesize < filesystem blocksize. This was happening as shown
by this code snippet from ufs_itrunc:
line 1346:
bsize = (int)lblkno(fs, length - 1) >= NDADDR ?
fs->fs_bsize : fragroundup(fs, boff);
pvn_vpzero(ITOV(oip), length, (size_t)(bsize - boff));
(void) pvn_vplist_dirty(ITOV(oip), length, ufs_putapage,
B_INVAL | B_TRUNC, CRED());
We zero (bsize-boff ) bytes worth of data with the call to
pvn_vpzero.
This is the value passed in for the 3rd parameter, called
zbyes in the pvn_vpzero function. It is important to
note that this size is a function of filesystem block size not page
size. So, we are zeroing the difference between where we last wrote
data and the end of a filesystem block.
The call to pvn_vplist_dirty
should mark the pages associated with this zeroing as dirty
for later write. This is where we were getting in to trouble. To keep
from writing useless data beyond the EOF, the pvn_vplist_dirty
call is passed file length and B_INVAL | B_TRUNC. Since it only
knows page boundaries it will of course throw away the second page of
the block as it was asked to by the ufs_itrunc code. Note we
passed in 'length' to the 2nd parameter of the
pvn_vplist_dirty function call. This parameter represents the
the offset for which to process the vnodes pages whose offset >=
off. We were telling it only to worry about up to the end of the
file, and in the case where pagesize == blocksize, we marked the page
covering the offset to the next filesystem block size boundary and it
works.
But, when pagesize < filesystem blocksize, we did not mark the
last page in the filesystem block as dirty(because the offset is
within the first page), therefore never pushed it, and therefore had
bad data on disk. Now, when the file grows to encompass the whole
block containing the thrown away page there will be garbage in the
file where that page resided.
The fix was trivial and is shown:
bsize = (int)lblkno(fs, length - 1) >= NDADDR ?
fs_bsize : fragroundup(fs, boff);
pvn_vpzero(ITOV(oip), length, (size_t)(bsize - boff));
/*
* Ensure full fs block is marked as dirty.
*/
(void) pvn_vplist_dirty(ITOV(oip), length + (bsize – boff), <---modified the value passed in to the
2nd parameter (offset) to account for the fs block size boundary
ufs_putapage, B_INVAL | B_TRUNC, CRED());
Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: DTrace