Jeff Bonwick's Blog |
Monday Dec 12, 2005
SEEK_HOLE and SEEK_DATA for sparse files
A sparse file is a file that contains much less data than its size would suggest. If you create a new file and then do a 1-byte write to the billionth byte, for example, you've just created a 1GB sparse file. By convention, reads from unwritten parts of a file return zeroes. File-based backup and archiving programs like tar, cpio, and rsync can detect runs of zeroes and ignore them, so that the archives they produce aren't filled with zeroes. Still, this is terribly inefficient. If you're a backup program, what you really want is a list of the non-zero segments in the file. But historically there's been no way to get this information without looking at filesystem-specific metadata. Desperately seeking segments As part of the ZFS project, we introduced two general extensions to lseek(2): SEEK_HOLE and SEEK_DATA. These allow you to quickly discover the non-zero regions of sparse files. Quoting from the new man page:
o If whence is SEEK_HOLE, the offset of the start of the
next hole greater than or equal to the supplied offset
is returned. The definition of a hole is provided near
the end of the DESCRIPTION.
o If whence is SEEK_DATA, the file pointer is set to the
start of the next non-hole file region greater than or
equal to the supplied offset.
[...]
A "hole" is defined as a contiguous range of bytes in a
file, all having the value of zero, but not all zeros in a
file are guaranteed to be represented as holes returned with
SEEK_HOLE. Filesystems are allowed to expose ranges of zeros
with SEEK_HOLE, but not required to. Applications can use
SEEK_HOLE to optimise their behavior for ranges of zeros,
but must not depend on it to find all such ranges in a file.
The existence of a hole at the end of every data region
allows for easy programming and implies that a virtual hole
exists at the end of the file. Applications should use
fpathconf(_PC_MIN_HOLE_SIZE) or pathconf(_PC_MIN_HOLE_SIZE)
to determine if a filesystem supports SEEK_HOLE. See fpath-
conf(2).
For filesystems that do not supply information about holes,
the file will be represented as one entire data region.
Any filesystem can support SEEK_HOLE / SEEK_DATA. Even a filesystem like UFS, which has no special support for sparseness, can walk its block pointers much faster than it can copy out a bunch of zeroes. Even if the search algorithm is linear-time, at least the constant is thousands of times smaller. Sparse file navigation in ZFS A file in ZFS is a tree of blocks. To maximize the performance of SEEK_HOLE and SEEK_DATA, ZFS maintains in each block pointer a fill count indicating the number of data blocks beneath it in the tree (see below). Fill counts reveal where holes and data reside, so that ZFS can find both holes and data in guaranteed logarithmic time.
Portability At this writing, SEEK_HOLE and SEEK_DATA are Solaris-specific. I encourage (implore? beg?) other operating systems to adopt these lseek(2) extensions verbatim (100% tax-free) so that sparse file navigation becomes a ubiquitous feature that every backup and archiving program can rely on. It's long overdue. Technorati Tags: OpenSolaris Solaris ZFS Posted at 06:42PM Dec 12, 2005 by bonwick in ZFS | Comments[18] |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Do I understand correctly, that ZFS "tree of blocks" is more of a radix-tree than B-tree? I wonder why this solution was choosen, because few problems with radix-trees immediately come to mind, e.g., one cannot use extents, only individual block pointers on the leaf level, and tree for a file with a huge hole in the middle will be unnecessarily tall.
Also, tree fanout of 3 is unbelievably small. How comes only 3 block pointers fit into whole block?
Posted by nikita on December 13, 2005 at 04:24 AM PST #
And I encourage Sun to implement this with UFS and to make tools like cpio sparse-file aware.
This is particularly important in an environment that uses live upgrade and large UID's. A good example of a sparse file is /var/adm/lastlog. In my environment, there are a few users clustered in the 0 - 100 range, a few more in 1,000 - 20,000, and clusters around 100,000,000, 212,000,000 and 500,000,000. This means that lastlog looks like:
So, if I use standard live upgrade commands or flash archive commands a file like this will balloon to the full 13G. This makes it pretty hard to maintain several boot environments on a 36 GB disk.
FWIW, the RFE has been filed and I was told "will not fix". My suggestion in the RFE was to switch the lu/flash tools to use GNU cpio with --sparse, which is presumably a lot less work than updating Sun's cpio to be sparse-file aware.
Posted by Mike Gerdts on December 13, 2005 at 10:13 AM PST #
Posted by Bill Sommerfeld on December 13, 2005 at 07:37 PM PST #
Posted by Glenn on December 14, 2005 at 12:43 PM PST #
Let me add to my own comment. Your implementation of SEEK_HOLE as not moving the file pointer effectively presumes what it is being used for -- namely, measuring the length of a data segment, probably for use in a backup program. But my use may be instead to find the next place to safely store a piece of data. In that case, I want SEEK_HOLE to move the file pointer, and I want SEEK_DATA to not move the file pointer, since I'll use SEEK_DATA after SEEK_HOLE to measure the size of the hole.
All of which is just to point out the inherent bias you've assumed in your definitions of these new operations, rather than consistently following the existing conventions. A clean way out would be to have SEEK_HOLE and SEEK_DATA both move the file pointer, and FIND_HOLE and FIND_DATA return the equivalent offset without moving the file pointer.
Posted by Glenn on December 14, 2005 at 01:07 PM PST #
Posted by Glenn on December 14, 2005 at 02:03 PM PST #
Posted by Jeff Bonwick on December 19, 2005 at 04:53 AM PST #
Posted by Glenn on December 19, 2005 at 04:47 PM PST #
Posted by Tim Mooney on December 21, 2005 at 10:15 AM PST #
Yes, the man page needs an edit to make it clearer that SEEK_HOLE moves the seek pointer.
Also, didn't Sun's old "Backup Copilot" product have a mechanism for finding out where the holes were in files? As I remember, it had a version of "dump" that read files rather than reading the raw disk, so you could do online backups, which meant that if it was to dump holes as such rather than as blocks of zeroes, it would either have to check for regions of zeroes and assume they're holes or ask the kernel where the holes were (or whether a given region includes a hole).
I think it might've been a kernel add-on that was later integrated into SunOS 4.1.x (presumably so that Backup Copilot would run on top of a vanilla kernel and wouldn't have to install a patched <tt>/vmunix</tt>) and, if I remember from when I last saw the source (over 11 years ago), it might've used an <tt>fcntl()</tt>; some might find that cleaner than <tt>lseek()</tt>.
Check out the 4.1.3 or 4.1.4 sources (if Sun still has them around...), or check out the Sun paper “Engineering a Commercial Backup Program” from the proceedings of the 5th LISA conference.
Posted by Guy Harris on December 23, 2005 at 04:47 PM PST #
The result would be a modified flock structure with l_start set to the first byte of the hole and l_len set to the hole length. You'd iterate the file by adding the returned l_len to the returned l_start, setting l_len back to 0, and calling it again to get the next hole.
Posted by Terry Lambert on January 03, 2006 at 05:26 AM PST #
Posted by Björn on February 02, 2006 at 04:50 PM PST #
Posted by kenneth topp on February 14, 2006 at 09:19 PM PST #
Posted by Ragnar Sundblad on March 05, 2006 at 04:07 AM PST #
Posted by André Goddard Rosa on June 18, 2006 at 08:34 AM PDT #
Posted by Jeff Johnson on August 22, 2006 at 12:47 PM PDT #
Posted by 199.72.20.100 on August 22, 2006 at 12:48 PM PDT #
Posted by Anonymous on October 19, 2006 at 11:36 AM PDT #