Tuesday Jun 14, 2005

Debugging a UFS file truncation bug
I decided for my first OpenSolaris blog I would talk about a particularly difficult to find UFS bug that I worked on about a year ago. This bug came along early in my new career in UFS and the Solaris kernel. It was a bug that taught me a huge amount about UFS, Solaris VM paging, delayed writes, DTrace and other miscellaneous kernel things. The fix was trivial, a one parameter change in a function call, but tracking it down was difficult. The trivial fix bugs are sometimes some of the most interesting to work on. The evaluation is where it gets interesting.

The bug is 5047967 ufs_itrunc does not account for pagesize < blocksize. This was originally known as the 'bmap' bug. Some of the details below show why it was named this. For more gory details you can read the bug report. Subsequently, during the course of my working on it I found it was not a bmap_write but indeed a ufs_itrunc bug and I changed the synopsis to reflect this.

This bug was found running a filesystem stress test called fsx.c. I cannot give you source to this test, but the basic things that this test did were to create a file, truncate it, grow it, write to it, write past the end to create and populate the holes. Then it validated that the data at every block was indeed what was expected. The beginning output from the test showed all was OK. Later the output will show where the error occurred.

So far, so good:

141(141 mod 256): TRUNCATE UP from 0xc0f9 to 0x2ad8a******WWWW 
142(142 mod 256): TRUNCATE DOWN from 0x2ad8a to 0x1c4f8 ******WWWW 
143(143 mod 256): READ 0xe4c3 thru 0xfa85 (0x15c3 bytes) 
144(144 mod 256): WRITE 0x34119 thru 0x3c577 (0x845f bytes) HOLE ***WWWW 
145(145 mod 256): READ 0x1c9ef thru 0x1e0e9 (0x16fb bytes) ***RRRR*** 

So, I was pondering how I could possibly see what was happening with this bug, and what the values were we actually passed in to the appropriate function calls. Enter the miracle known as DTrace.

Output from my DTrace script validating that the data passed to these functions is:

length ufs_itrunc: 1c4f8 
inode: 7 
inode orig size: 2ad8a 
boff: 4f8 
bmask: ffffe000
 
0 14386      bmap_write:entry 
off: 1c4f7 
i size: 4f8 

0 9098         fbzero:entry
fbzero off: 1c000 
fbzero size: 2000 

You can see that the length passed in to ufs_itrunc equals the new end point as shown in the last truncate down. This is all good. Then, the call to bmap_write is correct since when we call this we pass in length-1, and get 1c4f7. All good so far. Then the call to fbzero should have zeroed the blocks between 1c000-1e000, which should have covered the range of the truncate, that is from up to 1c4f8. However, the failure output from fsx.c shows:

READ BAD DATA: offset = 0x1c9ef, size = 0x16fb 
OFFSET GOOD BAD RANGE 
0x1d000 0x0000 0x80ea 0x ff4 

That starting at 0x1d000 we have bad data. The call to the truncate up above, prior to the truncate down, looked like it also had the accurate data passed in(DTrace output not shown for this). The initial thought on this bug was that we had a race since the file was not opened with O_DSYNC. This turned out to be not the case. Read the bug report for more details. In short, ftruncate does not use the O_DSYNC flag at all so calls coming in to ufs_itrunc from there are never synchronous.

So, what was the real problem? Well it was thought to be a problem when we were trying to fill in a hole in the file when creating the hole. It is, indirectly as you will see with the details below. We had truncated the file down, then written it again, past the new end point, thus creating a hole. The output of fsx.c for this process shows:

142(142 mod 256): TRUNCATE DOWN from 0x2ad8a to 0x1c4f8 ******WWWW 
143(143 mod 256): READ 0xe4c3 thru 0xfa85 (0x15c3 bytes) 
144(144 mod 256): WRITE 0x34119 thru 0x3c577 (0x845f bytes) HOLE ***WWWW <--right here we should zero the pages from 1c4f9->34119 
145(145 mod 256): READ 0x1c9ef thru 0x1e0e9 (0x16fb bytes) ***RRRR*** 

From the DTrace output below it is clear we were not doing that. It is kind of long output, but an important piece of the debugging story, so it is important for me to include it here. There is a bit of extra data in my output, partly because I had just started learning DTrace too. I hadn't quite mastered the DTrace script writing skills to narrow down my output a bit. I am much better now :-). The data I got from this DTrace script made me *think* we were having a problem in ufs_getpage. Again, we were kind of, but in an indirect way...

DTrace output:

2  15344                       rdip:entry
2  11884               segmap_getmapflt:entry 
off + mapon: 1c4f9
n: 1b07
!pagecreate: 1

2  15061               ufs_getpage:entry offset: 1c000
length: 2000
plsz: 2000
rw: 1

2   1463               page_lookup:entry pgoff: 1c000

2   1464               page_lookup:return page_lookup return: c6388058<--we find the 1c000 offset page

2   1463               page_lookup:entry pgoff: 1d000<--have incremented offset by 4k, and look again

2   1464               page_lookup:return page_lookup return: 0<--don't find this page

2  15065               ufs_getpage_miss:entry offset: 1d000<--get a miss so go searching
length: 1000
plsz: 1000
rw: 1

2  15650               bmap_read:entry offset: 1d000
lbn: e
boff: 1000

2   8248               bread_common:entry blkno: 2000

2   8249               bread_common:return
2  14933               findextent:entry bshift: d<--should never have gotten here from bmap_read. This whole block is a hole.
n: 11

2  14934               findextent:return len: 1

2  15651               bmap_read:return
2  10888               pvn_read_kluster:entry
2   1373               page_create_va:entry offset: 1d000

2   1374               page_create_va:return pp: c63736a8<--we create the page here, but do not zero it.
tid: 1
2  10889               pvn_read_kluster:return
3  15062               ufs_getpage:return



So, we return from ufs_getpage without zeroing the 1d000 page. Why? The code snippet below shows how we were getting in to this situation. The code in ufs_getpage_miss that zero pages that are not supported by a backing store, i.e. UFS_HOLE is:

         /*
         * Figure out whether the page can be created, or must be
         * must be read from the disk.
         */
        if (rw == S_CREATE)
                crpage = 1;
        else {
                contig = 0;
                if (err = bmap_read(ip, off, &bn, &contig))
                        return (err);
                crpage = (bn == UFS_HOLE);<--bmap_read should have returned bp==UFS_HOLE
        }

        if (crpage) {<---we don't come through here as a result of crpage == 0
                if ((pp = page_create_va(vp, off, PAGESIZE, PG_WAIT, seg,
                    addr)) == NULL) {
                        return (ufs_fault(vp,
                                    "ufs_getpage_miss: page_create == NULL"));
                }

                if (rw != S_CREATE)
                        pagezero(pp, 0, PAGESIZE);
        } else {
                u_offset_t      io_off;
                uint_t  xlen;
                struct buf      *bp;
                ufsvfs_t        *ufsvfsp = ip->i_ufsvfs;

                /*
                 * If access is not in sequential order, we read from disk
                 * in bsize units.
                 *
                 * We limit the size of the transfer to bsize if we are reading
                 * from the beginning of the file. Note in this situation we
                 * will hedge our bets and initiate an async read ahead of
                 * the second block.
                 */
                if (!seq || off == 0)
                        contig = MIN(contig, bsize);

                pp = pvn_read_kluster(vp, off, seg, addr, &io_off,
                    &io_len, off, contig, 0); <--Read the data from disk.

However, as seen in the dtrace output above, bmap_read does not return that 1d000 is a hole in the file. The disk has garbage on it which we somehow think is good data and was not recognized as a hole. Therefore we return the garbage we read. Remember the old adage, 'Garbage In, Garbage Out'. Exactly what was happening here.

I modified fsx.c to output when it found a hole that did not have zero data. This is the output:

1c4f9 - 34119******
 byte #: 1d000 = 80ea
 byte #: 1d001 = ea80
 byte #: 1d002 = 806b
 byte #: 1d003 = 6b80
 byte #: 1d004 = 8065
 byte #: 1d005 = 6580
 byte #: 1d006 = 80ba
 byte #: 1d007 = ba80
 byte #: 1d008 = 80a2
 byte #: 1d009 = a280
 byte #: 1d00a = 801c
 byte #: 1d00b = 1c80
 byte #: 1d00c = 80d6
...
 byte #: 1dfff = 5700

1 page worth of data.

So, why is the read to the disk blocks at this address not showing a hole? And, why was it always the 2nd page in a filesystem block that had this problem? And why didn't it happen every time we truncated to a non filesystem block boundary? These questions and their answers were the key to finding the true problem.


Remember from above, this bug only happened when pagesize < filesystem blocksize. A filesystem blocksize is usually 8k, but can be modified to be 4k. When pagesize was 4k and filesystem block size was 8k we saw the bug. In the 8k block/4k page size scenario truncate a file to a boundary such as the picture below:


Now, re-grow the file to include the whole block. ufs_itrunc will zero the whole block from the end of the truncate which will zero a partial page and a whole page, when pagesize < filesystem blocksize. This was happening as shown by this code snippet from ufs_itrunc:


line 1346:
                bsize = (int)lblkno(fs, length - 1) >= NDADDR ?
                    fs->fs_bsize : fragroundup(fs, boff);
                pvn_vpzero(ITOV(oip), length, (size_t)(bsize - boff));
 
               (void) pvn_vplist_dirty(ITOV(oip), length, ufs_putapage,
                   B_INVAL | B_TRUNC, CRED());

We zero (bsize-boff ) bytes worth of data with the call to pvn_vpzero. This is the value passed in for the 3rd parameter, called zbyes in the pvn_vpzero function. It is important to note that this size is a function of filesystem block size not page size. So, we are zeroing the difference between where we last wrote data and the end of a filesystem block.

The call to pvn_vplist_dirty should mark the pages associated with this zeroing as dirty for later write. This is where we were getting in to trouble. To keep from writing useless data beyond the EOF, the pvn_vplist_dirty call is passed file length and B_INVAL | B_TRUNC. Since it only knows page boundaries it will of course throw away the second page of the block as it was asked to by the ufs_itrunc code. Note we passed in 'length' to the 2nd parameter of the pvn_vplist_dirty function call. This parameter represents the the offset for which to process the vnodes pages whose offset >= off. We were telling it only to worry about up to the end of the file, and in the case where pagesize == blocksize, we marked the page covering the offset to the next filesystem block size boundary and it works.

But, when pagesize < filesystem blocksize, we did not mark the last page in the filesystem block as dirty(because the offset is within the first page), therefore never pushed it, and therefore had bad data on disk. Now, when the file grows to encompass the whole block containing the thrown away page there will be garbage in the file where that page resided.

The fix was trivial and is shown:

            bsize = (int)lblkno(fs, length - 1) >= NDADDR ?
                        fs_bsize : fragroundup(fs, boff);
                 pvn_vpzero(ITOV(oip), length, (size_t)(bsize - boff));
                /*
                 * Ensure full fs block is marked as dirty.
                 */
                (void) pvn_vplist_dirty(ITOV(oip), length + (bsize – boff), <---modified the value passed in to the 
		2nd parameter (offset) to account for the fs block size boundary
                   ufs_putapage, B_INVAL | B_TRUNC, CRED());

Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: DTrace








Comments:

Post a Comment:
  • HTML Syntax: NOT allowed