Jeff Bonwick's Blog

Wednesday Dec 14, 2005

ZFS BoF at FAST

The ZFS team is hosting a BoF at FAST in San Francisco at 9PM tonight.

We'll begin with an in-depth look at what ZFS is and how it works. This will be followed by the customary Airing of Grievances. (I'll get you a free ZFS T-shirt if you can get a Festivus pole past security and into our BoF. (Seriously.))

Quite a few of us will be there, so come on up and meet the folks who put the dot in vmcore.0.

Monday Dec 12, 2005

SEEK_HOLE and SEEK_DATA for sparse files

A sparse file is a file that contains much less data than its size would suggest. If you create a new file and then do a 1-byte write to the billionth byte, for example, you've just created a 1GB sparse file. By convention, reads from unwritten parts of a file return zeroes.

File-based backup and archiving programs like tar, cpio, and rsync can detect runs of zeroes and ignore them, so that the archives they produce aren't filled with zeroes. Still, this is terribly inefficient. If you're a backup program, what you really want is a list of the non-zero segments in the file. But historically there's been no way to get this information without looking at filesystem-specific metadata.

Desperately seeking segments

As part of the ZFS project, we introduced two general extensions to lseek(2): SEEK_HOLE and SEEK_DATA. These allow you to quickly discover the non-zero regions of sparse files. Quoting from the new man page:

       o  If whence is SEEK_HOLE, the offset of the start of  the
          next  hole greater than or equal to the supplied offset
          is returned. The definition of a hole is provided  near
          the end of the DESCRIPTION.

       o  If whence is SEEK_DATA, the file pointer is set to  the
          start  of the next non-hole file region greater than or
          equal to the supplied offset.

        [...]

     A "hole" is defined as a contiguous  range  of  bytes  in  a
     file,  all  having the value of zero, but not all zeros in a
     file are guaranteed to be represented as holes returned with
     SEEK_HOLE. Filesystems are allowed to expose ranges of zeros
     with SEEK_HOLE, but not required to.  Applications  can  use
     SEEK_HOLE  to  optimise  their behavior for ranges of zeros,
     but must not depend on it to find all such ranges in a file.
     The  existence  of  a  hole  at the end of every data region
     allows for easy programming and implies that a virtual  hole
     exists  at  the  end  of  the  file. Applications should use
     fpathconf(_PC_MIN_HOLE_SIZE) or  pathconf(_PC_MIN_HOLE_SIZE)
     to  determine if a filesystem supports SEEK_HOLE. See fpath-
     conf(2).

     For filesystems that do not supply information about  holes,
     the file will be represented as one entire data region.

Any filesystem can support SEEK_HOLE / SEEK_DATA. Even a filesystem like UFS, which has no special support for sparseness, can walk its block pointers much faster than it can copy out a bunch of zeroes. Even if the search algorithm is linear-time, at least the constant is thousands of times smaller.

Sparse file navigation in ZFS

A file in ZFS is a tree of blocks. To maximize the performance of SEEK_HOLE and SEEK_DATA, ZFS maintains in each block pointer a fill count indicating the number of data blocks beneath it in the tree (see below). Fill counts reveal where holes and data reside, so that ZFS can find both holes and data in guaranteed logarithmic time.

  L4                           6
              -----------------------------------
  L3          5                0                1
         -----------      -----------      -----------
  L2     3    0    2                       0    0    1
        ---  ---  ---                     ---  ---  ---
  L1    111       101                               001


An indirect block tree for a sparse file, showing the fill count in each block pointer. In this example there are three block pointers per indirect block. The lowest-level (L1) block pointers describe either one block of data or one block-sized hole, indicated by a fill count of 1 or 0, respectively. At L2 and above, each block pointer's fill count is the sum of the fill counts of its children. At any level, a non-zero fill count indicates that there's data below. A fill count less than the maximum (3L-1 in this example) indicates that there are holes below. From any offset in the file, ZFS can find the next hole or next data in logarithmic time by following the fill counts in the indirect block tree.

Portability

At this writing, SEEK_HOLE and SEEK_DATA are Solaris-specific. I encourage (implore? beg?) other operating systems to adopt these lseek(2) extensions verbatim (100% tax-free) so that sparse file navigation becomes a ubiquitous feature that every backup and archiving program can rely on. It's long overdue.


Technorati Tags:

Friday Dec 09, 2005

ZFS End-to-End Data Integrity

The job of any filesystem boils down to this: when asked to read a block, it should return the same data that was previously written to that block. If it can't do that -- because the disk is offline or the data has been damaged or tampered with -- it should detect this and return an error.

Incredibly, most filesystems fail this test. They depend on the underlying hardware to detect and report errors. If a disk simply returns bad data, the average filesystem won't even detect it.

Even if we could assume that all disks were perfect, the data would still be vulnerable to damage in transit: controller bugs, DMA parity errors, and so on. All you'd really know is that the data was intact when it left the platter. If you think of your data as a package, this would be like UPS saying, "We guarantee that your package wasn't damaged when we picked it up." Not quite the guarantee you were looking for.

In-flight damage is not a mere academic concern: even something as mundane as a bad power supply can cause silent data corruption.

Arbitrarily expensive storage arrays can't solve the problem. The I/O path remains just as vulnerable, but becomes even longer: after leaving the platter, the data has to survive whatever hardware and firmware bugs the array has to offer.

And if you're on a SAN, you're using a network designed by disk firmware writers. God help you.

What to do? One option is to store a checksum with every disk block. Most modern disk drives can be formatted with sectors that are slightly larger than the usual 512 bytes -- typically 520 or 528. These extra bytes can be used to hold a block checksum. But making good use of this checksum is harder than it sounds: the effectiveness of a checksum depends tremendously on where it's stored and when it's evaluated.

In many storage arrays (see the Dell|EMC PowerVault paper for a typical example with an excellent description of the issues), the data is compared to its checksum inside the array. Unfortunately this doesn't help much. It doesn't detect common firmware bugs such as phantom writes (the previous write never made it to disk) because the data and checksum are stored as a unit -- so they're self-consistent even when the disk returns stale data. And the rest of the I/O path from the array to the host remains unprotected. In short, this type of block checksum provides a good way to ensure that an array product is not any less reliable than the disks it contains, but that's about all.

NetApp's block-appended checksum approach appears similar but is in fact much stronger. Like many arrays, NetApp formats its drives with 520-byte sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it compares the checksum to the data just like an array would, but there's a key difference: it does this comparison after the data has made it through the I/O path, so it validates that the block made the journey from platter to memory without damage in transit.

This is a major improvement, but it's still not enough. A block-level checksum only proves that a block is self-consistent; it doesn't prove that it's the right block. Reprising our UPS analogy, "We guarantee that the package you received is not damaged. We do not guarantee that it's your package."

The fundamental problem with all of these schemes is that they don't provide fault isolation between the data and the checksum that protects it.

ZFS Data Authentication

End-to-end data integrity requires that each data block be verified against an independent checksum, after the data has arrived in the host's memory. It's not enough to know that each block is merely consistent with itself, or that it was correct at some earlier point in the I/O path. Our goal is to detect every possible form of damage, including human mistakes like swapping on a filesystem disk or mistyping the arguments to dd(1). (Have you ever typed "of=" when you meant "if="?)

A ZFS storage pool is really just a tree of blocks. ZFS provides fault isolation between data and checksum by storing the checksum of each block in its parent block pointer -- not in the block itself. Every block in the tree contains the checksums for all its children, so the entire pool is self-validating. [The uberblock (the root of the tree) is a special case because it has no parent; more on how we handle that in another post.]

When the data and checksum disagree, ZFS knows that the checksum can be trusted because the checksum itself is part of some other block that's one level higher in the tree, and that block has already been validated.

ZFS uses its end-to-end checksums to detect and correct silent data corruption. If a disk returns bad data transiently, ZFS will detect it and retry the read. If the disk is part of a mirror or RAID-Z group, ZFS will both detect and correct the error: it will use the checksum to determine which copy is correct, provide good data to the application, and repair the damaged copy.

As always, note that ZFS end-to-end data integrity doesn't require any special hardware. You don't need pricey disks or arrays, you don't need to reformat drives with 520-byte sectors, and you don't have to modify applications to benefit from it. It's entirely automatic, and it works with cheap disks.

But wait, there's more!

The blocks of a ZFS storage pool form a Merkle tree in which each block validates all of its children. Merkle trees have been proven to provide cryptographically-strong authentication for any component of the tree, and for the tree as a whole. ZFS employs 256-bit checksums for every block, and offers checksum functions ranging from the simple-and-fast fletcher2 (the default) to the slower-but-secure SHA-256. When using a cryptographic hash like SHA-256, the uberblock checksum provides a constantly up-to-date digital signature for the entire storage pool.

Which comes in handy if you ask UPS to move it.


Technorati Tags:


Archives
Languages:
Links
Referrers