Jeff Bonwick's Blog

Monday Dec 31, 2007

How to Lose a Customer

For over a year I have been the proud and happy owner of a Garmin GPS unit -- the Nuvi 360.  I have practically been a walking billboard for the company.  Go ahead, ask me about my Nuvi!

That changed today, permanently.  When I powered on the Nuvi this morning, it alerted me that its map database was over a year old and should be updated.  That makes sense, I thought -- indeed, how nice of them to remind me!  So I brought the Nuvi inside, plugged it into my Mac, and went to Garmin's website to begin the update.

Wait a minute, what's this?  They want to charge $69 for the update!  Excuse me?  This isn't new functionality I'm getting, it's a bug fix.  The product I bought is a mapping device.  Its maps are now "out of date", as Garmin puts it -- well, yes, in the same way that the phlogiston theory is "out of date".  The old maps are wrong, which means that the product has become defective and should be fixed.  Given the (somewhat pathetic) fact that the Nuvi doesn't automatically update its maps from Web or satellite sources, the least Garmin could do to keep their devices operating correctly in the field is provide regular, free fixes to the map database.  I didn't buy a GPS unit so I could forever navigate 2005 America.

But wait, it gets better.

You might imagine that getting the update would require supplying a credit card number to get a license key, downloading the map update, and then using the key to activate it.  Nope!  You have to order a physical DVD from Garmin, which takes 3-5 weeks to ship.  3-5 weeks!  Any reason they can't include a first-class postage stamp as part of the $69 shakedown?  And seriously, if you work for Garmin and you're reading this, check out this cool new technology.  It really works.  Swear to God.  You're soaking in it.

Assuming you ordered the DVD, you would not discover until after it arrived -- because this is mentioned nowhere on Garmin's website -- that the DVD will only work for one device.  Yes, that's right -- after going to all the trouble to get a physical copy of the map update, you have to get on their website to activate it, and it's only good for one unit.  So to update my wife's unit as well as my own, I'd have to order two DVDs, for $138.  That's offensive.  Even the RIAA doesn't expect me to buy two copies of every CD just because I'm married.  And the only reason I know about this is because I checked Amazon first, and found many reviewers had learned the hard way and were livid about it.  Garmin's policy is bad, but their failure to disclose it is even worse.

Moreover, the 2008 map update isn't a one-time purchase.  There's an update every year, so it's really a $138/year subscription.  That's $11.50/month.  For maps.  For a mapping device.  That I already paid for.

What does one get for this $11.50/month map subscription?  According to the reviews on Amazon, not much.  Major construction projects that were completed several years ago aren't reflected in the 2008 maps, and Garmin still hasn't fixed the long-standing bug that any store that's part of a mall isn't in their database.  (Want to find the nearest McDonald's?  No dice.  You just have to know that the nearest McDonald's is in the XYZ Shopping Center, and ask for directions to that.  This is really annoying in practice.)

I can get better information from Google maps, continuously updated, with integrated real-time traffic data, for free, forever -- and my iPhone will happily use that data to plot time-optimal routes.  (In fact, all the iPhone needs is the right antenna and a SIRF-3 chipset to make dedicated GPS devices instantly obsolete.  This is so obvious it can't be more than a year out.  I can live with the stale maps until then, and have a $138 down payment on the GPS iPhone earning interest while I wait.)

And so, starting today, that's exactly what I'll do.

I don't mind paying a reasonable fee for services rendered.  I do mind getting locked into a closed-source platform and being forced to pay monopoly rents for a proprietary, stale and limited version of data that's already available to the general public.  That business model is so over.

Everything about this stinks, Garmin.  You tell me, unexpectedly, that I have to pay for routine map updates.  You make the price outrageous.  You don't actually disclose what's in the update.  (Several Amazon reviewers say the new maps are actually worse.)  You make the update hard to do.  You needlessly add to our landfills by creating single-use DVDs.  You have an unreasonable licensing policy.  And you hide that policy until after the purchase.

Way to go, Garmin.  You have pissed off a formerly delighted customer, and that is generally a one-way ticket.  You have lost both my business and my respect.  I won't be coming back.  Ever.

Friday Sep 14, 2007

Space Maps


Every filesystem must keep track of two basic things: where your data is, and where the free space is.

In principle, keeping track of free space is not strictly necessary: every block is either allocated or free, so the free space can be computed by assuming everything is free and then subtracting out everything that's allocated; and the allocated space can be found by traversing the entire filesystem from the root.  Any block that cannot be found by traversal from the root is, by definition, free.

In practice, finding free space this way would be insufferable because it would take far too long for any filesystem of non-trivial size.  To make the allocation and freeing of blocks fast, the filesystem needs an efficient way to keep track of free space.  In this post we'll examine the most common methods, why they don't scale well, and the new approach we devised for ZFS.

Bitmaps

The most common way to represent free space is by using a bitmap.  A bitmap is simply an array of bits, with the Nth bit indicating whether the Nth block is allocated or free.  The overhead for a bitmap is quite low: 1 bit per block.  For a 4K blocksize, that's 1/(4096*8) = 0.003%.  (The 8 comes from 8 bits per byte.)

For a 1GB filesystem, the bitmap is 32KB -- something that easily fits in memory, and can be scanned quickly to find free space.  For a 1TB filesystem, the bitmap is 32MB -- still stuffable in memory, but no longer trivial in either size or scan time.  For a 1PB filesystem, the bitmap is 32GB, and that simply won't fit in memory on most machines.  This means that scanning the bitmap requires reading it from disk, which is slower still.

Clearly, this doesn't scale.

One seemingly obvious remedy is to break the bitmap into small chunks, and keep track of the number of bits set in each chunk.  For example, for a 1PB filesystem using 4K blocks, the free space can be divided into a million bitmaps, each 32KB in size.  The summary information (the million integers indicating how much space is in each bitmap) fits in memory, so it's easy to find a bitmap with free space, and it's quick to scan that bitmap.

But there's still a fundamental problem: the bitmap(s) must be updated not only when a new block is allocated, but also when an old block is freed.  The filesystem controls the locality of allocations (it decides which blocks to put new data into), but it has no control over the locality of frees.  Something as simple as 'rm -rf' can cause blocks all over the platter to be freed.  With our 1PB filesystem example, in the worst case, removing 4GB of data (a million 4K blocks) could require each of the million bitmaps to be read, modified, and written out again.  That's two million disk I/Os to free a measly 4GB -- and that's just not reasonable, even as worst-case behavior.

More than any other single factor, this is why bitmaps don't scale: because frees are often random, and bitmaps that don't fit in memory perform pathologically when they are accessed randomly.

B-trees

Another common way to represent free space is with a B-tree of extents.  An extent is a contiguous region of free space described by two integers: offset and length.  The B-tree sorts the extents by offset so that contiguous space allocation is efficient.  Unfortunately, B-trees of extents suffer the same pathology as bitmaps when confronted with random frees.

What to do?

Deferred frees

One way to mitigate the pathology of random frees is to defer the update of the bitmaps or B-trees, and instead keep a list of recently freed blocks.  When this deferred free list reaches a certain size, it can be sorted, in memory, and then freed to the underlying bitmaps or B-trees with somewhat better locality.  Not ideal, but it helps.

But what if we went further?

Space maps:  log-structured free lists

Recall that log-structured filesystems long ago posed this question: what if, instead of periodically folding a transaction log back into the filesystem, we made the transaction log be the filesystem?

Well, the same question could be asked of our deferred free list: what if, instead of folding it into a bitmap or B-tree, we made the deferred free list be the free space representation?

That is precisely what ZFS does.  ZFS divides the space on each virtual device into a few hundred regions called metaslabs.  Each metaslab has an associated space map, which describes that metaslab's free space.  The space map is simply a log of allocations and frees, in time order.  Space maps make random frees just as efficient as sequential frees, because regardless of which extent is being freed, it's represented on disk by appending the extent (a couple of integers) to the space map object -- and appends have perfect locality.  Allocations, similarly, are represented on disk as extents appended to the space map object (with, of course, a bit set indicating that it's an allocation, not a free).

When ZFS decides to allocate blocks from a particular metaslab, it first reads that metaslab's space map from disk and replays the allocations and frees into an in-memory AVL tree of free space, sorted by offset.  This yields a compact in-memory representation of free space that supports efficient allocation of contiguous space.  ZFS also takes this opportunity to condense the space map: if there are many allocation-free pairs that cancel out, ZFS replaces the on-disk space map with the smaller in-memory version.

Space maps have several nice properties:

  • They don't require initialization: a space map with no entries indicates that there have been no allocations and no frees, so all space is free.

  • They scale: because space maps are append-only, only the last block of the space map object needs to be in memory to ensure excellent performance, no matter how much space is being managed.

  • They have no pathologies: space maps are efficient to update regardless of the pattern of allocations and frees.

  • They are equally efficient at finding free space whether the pool is empty or full (unlike bitmaps, which take longer to scan as they fill up).

Finally, note that when a space map is completely full, it is represented by a single extent.  Space maps therefore have the appealing property that as your storage pool approaches 100% full, the space maps start to evaporate, thus making every last drop of disk space available to hold useful information.

Sunday Jun 24, 2007

ZFS License Announcement

My oldest son, Andrew, recently gave me something cool for the back of my car:

 

That's Andrew(11), David(8), and Galen(6).  If you look closely, you can see the CEO potential...

What, you were looking for yet another CDDL vs. GPL thread? 

Friday May 04, 2007

Rampant Layering Violation?

Andrew Morton has famously called ZFS a "rampant layering violation" because it combines the functionality of a filesystem, volume manager, and RAID controller.  I suppose it depends what the meaning of the word violate is.  While designing ZFS we observed that the standard layering of the storage stack induces a surprising amount of unnecessary complexity and duplicated logic.  We found that by refactoring the problem a bit -- that is, changing where the boundaries are between layers -- we could make the whole thing much simpler.

An example from mathematics (my actual background) provides a useful prologue.

Suppose you had to compute the sum, from n=1 to infinity, of 1/n(n+1).

Expanding that out term by term, we have:

        1/(1*2) + 1/(2*3) + 1/(3*4) + 1/(4*5) + ...

That is,

        1/2 + 1/6 + 1/12 + 1/20 + ...

What does that infinite series add up to?  It may seem like a hard problem, but that's only because we're not looking at it right.  If you're clever, you might notice that there's a different way to express each term:

        1/n(n+1) = 1/n - 1/(n+1)

For example,

        1/(1*2) = 1/1 - 1/2
        1/(2*3) = 1/2 - 1/3
        1/(3*4) = 1/3 - 1/4

Thus, our sum can be expressed as:

        (1/1 - 1/2) + (1/2 - 1/3) + (1/3 - 1/4) + (1/4 - 1/5) + ...

Now, notice the pattern: each term that we subtract, we add back.  Only in Congress does that count as work.  So if we just rearrange the parentheses -- that is, if we rampantly violate the layering of the original problem by using associativity to refactor the arithmetic across adjacent terms of the series -- we get this:

        1/1 + (-1/2 + 1/2) + (-1/3 + 1/3) + (-1/4 + 1/4) + ...

or

        1/1 + 0 + 0 + 0 + ...

In others words,

        1.

Isn't that cool?

Mathematicians have a term for this.  When you rearrange the terms of a series so that they cancel out, it's called telescoping -- by analogy with a collapsable hand-held telescope.  In a nutshell, that's what ZFS does: it telescopes the storage stack.  That's what allows us to have a filesystem, volume manager, single- and double-parity RAID, compression, snapshots, clones, and a ton of other useful stuff in just 80,000 lines of code.

A storage system is more complex than this simple analogy, but at a high level the same idea really does apply.  You can think of any storage stack as a series of translations from one naming scheme to another -- ultimately translating a filename to a disk LBA (logical block address).  Typically it looks like this:

        filesystem(upper): filename to object (inode)
        filesystem(lower): object to volume LBA
        volume manager: volume LBA to array LBA
        RAID controller: array LBA to disk LBA

This is the stack we're about to refactor.

First, note that the traditional filesystem layer is too monolithic.  It would be better to separate the filename-to-object part (the upper half) from the object-to-volume-LBA part (the lower half) so that we could reuse the same lower-half code to support other kinds of storage, like objects and iSCSI targets, which don't have filenames.  These storage classes could then speak to the object layer directly.  This is more efficient than going through something like /dev/lofi, which makes a POSIX file look like a device.  But more importantly, it provides a powerful new programming model -- object storage -- without any additional code.

Second, note that the volume LBA is completely useless. Adding a layer of indirection often adds flexibility, but not in this case: in effect we're translating from English to French to German when we could just as easily translate from English to German directly.  The intermediate French has no intrinsic value.  It's not visible to applications, it's not visible to the RAID array, and it doesn't provide any administrative function.  It's just overhead.

So ZFS telescoped that entire layer away.  There are just three distinct layers in ZFS: the ZPL (ZFS POSIX Layer), which provides traditional POSIX filesystem semantics; the DMU (Data Management Unit), which provides a general-purpose transactional object store; and the SPA (Storage Pool Allocator), which provides virtual block allocation and data transformations (replication, compression, and soon encryption).  The overall ZFS translation stack looks like this:

        ZPL: filename to object
        DMU: object to DVA (data virtual address)
        SPA: DVA to disk LBA

The DMU provides both file and block access to a common pool of physical storage.  File access goes through the ZPL, while block access is just a direct mapping to a single DMU object.  We're also developing new data access methods that use the DMU's transactional capabilities in more interesting ways -- more about that another day.

The ZFS architecture eliminates an entire layer of translation -- and along with it, an entire class of metadata (volume LBAs).  It also eliminates the need for hardware RAID controllers.  At the same time, it provides a useful new interface -- object storage -- that was previously inaccessible because it was buried inside a monolithic filesystem.

I certainly don't feel violated.  Do you?

Thursday Apr 12, 2007

A Near-Death Experience

Evidently, my previous post was just a tad too cheerful for some folks' taste.  But I speak with the optimism of a man who has cheated death.  And ironically, Pete's reference to George Cameron had a lot to do with it.

Several years ago, George and a few other Sun folks went off to form 3par, a new storage company.  They all had Solaris expertise, and understood its advantages, so they wanted to use it inside their box.  But we weren't open-source at the time, and our licensing terms really sucked.  Both of us -- George at 3par, and me at Sun -- tried for months to arrange something reasonable.  We failed.  So finally -- because Sun literally gave them no choice -- 3par went with Linux.

I couldn't believe it.  A cool new company wanted to use our product, and instead of giving them a hand, we gave them the finger.

For many of us, that was the tipping point.  If we had any reservations about open-sourcing Solaris, that ended them.  It was a gamble, to be sure, but the alternative was certain death.  Even if the 3par situation had ended differently, it was clear that we needed to change our business practices.  To do that, we'd first have to change our culture.

But cultures don't change easily -- it usually takes some traumatic event.  In Sun's case, watching our stock shed 95% of its value did the trick.  It was that total collapse of confidence -- that near-death experience -- that opened us up to things that had previously seemed too dangerous.  We had to face a number of hard questions, including the most fundamental ones: Can we make a viable business out of this wreckage?  Why are we doing SPARC?  Why not AMD and Intel?  Why Solaris?  Why not Linux and Windows?  Where are we going with Java?  And not rah-rah why, but really, why?

In each case, asking the question with a truly open mind changed the answer.  We killed our more-of-the-same SPARC roadmap and went multi-core, multi-thread, and low-power instead.  We started building AMD and Intel systems.  We launched a wave of innovation in Solaris (DTrace, ZFS, zones, FMA, SMF, FireEngine, CrossBow) and open-sourced all of it.  We started supporting Linux and Windows.  And most recently, we open-sourced Java.  In short, we changed just about everything.  Including, over time, the culture.

Still, there was no guarantee that open-sourcing Solaris would change anything.  It's that same nagging fear you have the first time you throw a party: what if nobody comes?  But in fact, it changed everything: the level of interest, the rate of adoption, the pace of communication.  Most significantly, it changed the way we do development.  It's not just the code that's open, but the entire development process.  And that, in turn, is attracting developers and ISVs whom we couldn't even have spoken to a few years ago.  The openness permits us to have the conversation; the technology makes the conversation interesting.

After coming so close to augering into the ground, it's immensely gratifying to see the Solaris revival now underway.  So if I sometimes sound a bit like the proud papa going on and on about his son, well, I hope you can forgive me.

Oh, and Pete, if you're reading this -- George Cameron is back at Sun now, three doors down the hall from me.  Small valley!

Tuesday Apr 10, 2007

Solaris Inside

When you choose an OS for your laptop, many things affect your decision: application support, availability of drivers, ease of use, and so on.

But if you were developing a storage appliance, what would you want from the operating system that runs inside it?

The first thing you notice is all the things you don't care about: graphics cards, educational software, photoshop... none of it matters. What's left, then?  What do you really need from a storage OS? And why isn't Linux the answer?  Well, let's think about that.

You need something rock-solid, so it doesn't break or corrupt data.

You need something that scales, so you can take advantage of all those cores the microprocessor folks will be giving you.

You need really good tools for performance analysis, so you can figure out how to make your application scale as well as the OS does.

You need extensive hardware diagnostic support, so that when parts of the box fail or are about to fail, you can take appropriate action.

You need reliable crash dumps and first-rate debugging tools so you can perform first-fault diagnosis when something goes wrong.

And you need a community of equally serious developers who can help you out.

OpenSolaris gives you all of these: a robust kernel that scales to thousands of threads and spindles; DTrace, the best performance analysis tool on the planet; FMA (Fault Management Architecture) to monitor the hardware and predict and manage failures; mdb to analyze software problems; and of course the OpenSolaris community, a large, vibrant, professional, high signal-to-noise environment.

The other operating systems one might consider are so far behind on so many of these metrics, it just seems like a no-brainer.

Let's put it this way: if I ever leave Sun to do a storage startup, I'll have a lot of things to think about.  Choosing the OS won't be one of them.  OpenSolaris is the ideal storage development platform.

The General-Purpose Storage Revolution

It happened so slowly, most people didn't notice until it was over.

I'm speaking, of course, of the rise of general-purpose computing during the 1990s.  It was not so long ago that you could choose from a truly bewildering variety of machines.  Symbolics, for example, made hardware specifically designed to run Lisp programs.  We debated SIMD vs. MIMD, dataflow vs. control flow, VLIW, and so on.  Meanwhile, those boring little PCs just kept getting faster.  And more capable.  And cheaper.  By the end of the decade, even the largest supercomputers were just clusters of PCs. A simple, general-purpose computing device crushed all manner of clever, sophisticated, highly specialized systems.

And the thing is, it had nothing to do with technology. It was all about volume economics.  It was inevitable.

With that in mind, I bring news that is very good for you, very good for Sun, and not so good for our competitors:  the same thing that happened to compute in the 1990s is happening to storage, right now. Now, as then, the fundamental driver is volume economics, and we see it playing out at all levels of the stack: the hardware, the operating system, and the interconnect.

First, custom RAID hardware can't keep up with general-purpose CPUs. A single Opteron core can XOR data at about 6 GB/sec.  There's just no reason to dedicate special silicon to this anymore.  It's expensive, it wastes power, and it was always a compromise: array-based RAID can't provide the same end-to-end data integrity that host-based RAID can. No matter how good the array is, a flaky cable or FC port can still flip bits in transit.  A host-based RAID solution like RAID-Z in ZFS can both detect and correct silent data corruption, no matter where it arises.

Second, custom kernels can't keep up with volume operating systems. I try to avoid naming specific competitors in this blog -- it seems tacky -- but think about what's inside your favorite storage box. Is it open source?  Does it have an open developer community? Does it scale?  Can the vendor make it scale?  Do they even get a vote?

The latter question is becoming much more important due to trends in CPU design.  The clock rate party of the 1990s, during which we went from 20MHz to 2GHz -- a factor of 100 -- is over.  Seven years into the new decade we're not even 2x faster in clock rate, and there's no sign of that changing soon.  What we are getting, however, is more transistors.  We're using them to put multiple cores on each chip and multiple threads on each core (so the chip can do something useful during load stalls) -- and this trend will only accelerate.

Which brings us back to the operating system inside your storage device. Does it have any prayer of making good use of a 16-core, 64-thread CPU?

Third, custom interconnects can't keep up with Ethernet.  In the time that Fibre Channel went from 1Gb to 4Gb -- a factor of 4 -- Ethernet went from 10Mb to 10Gb -- a factor of 1000.  That SAN is just slowing you down.

Today's world of array products running custom firmware on custom RAID controllers on a Fibre Channel SAN is in for massive disruption. It will be replaced by intelligent storage servers, built from commodity hardware, running an open operating system, speaking over the real network.

You've already seen the first instance of this: Thumper (the x4500) is a 4-CPU, 48-disk storage system with no hardware RAID controller. The storage is all managed by ZFS on Solaris, and exported directly to your real network over standard protocols like NFS and iSCSI.

And if you think Thumper was disruptive, well... stay tuned.

Thursday Jan 11, 2007

Out of the mouths of babes...

After sizing up the computers we have at home, my son Andrew made the following declaration: "I want Solaris security, Mac interface, and Windows compatibility."  Age 10.  Naturally, sensing a teachable moment, I explained to him what virtualization is all about -- bootcamp, Parallels, Xen, etc.  And the thing is, he really gets it.  I can't wait to see what his generation is capable of.

Saturday Nov 04, 2006

ZFS Block Allocation

Block allocation is central to any filesystem.  It affects not only performance, but also the administrative model (e.g. stripe configuration) and even some core capabilities like transactional semantics, compression, and block sharing between snapshots.  So it's important to get it right.

There are three components to the block allocation policy in ZFS:

  1. Device selection (dynamic striping)
  2. Metaslab selection
  3. Block selection

By design, these three policies are independent and pluggable.  They can be changed at will without altering the on-disk format, which gives us lots of flexibility in the years ahead.

So... let's go allocate a block!

1.  Device selection (aka dynamic striping).  Our first task is device selection.  The goal is to spread the load across all devices in the pool so that we get maximum bandwidth without needing any notion of stripe groups.  You add more disks, you get more bandwidth.  We call this dynamic striping -- the point being that it's done on the fly by the filesystem, rather than at configuration time by the administrator.

There are many ways to select a device.  Any policy would work, including just picking one at random.  But there are several practical considerations:

  1. If a device was recently added to the pool, it'll be relatively empty.  To address such imbalances, we bias the allocation slightly in favor of underutilized devices.  This keeps space usage uniform across all devices.

  2. All else being equal, round-robin is a fine policy, but it's critical to get the granularity right.  If the granularity is too coarse (e.g. 1GB), we'll only get one device worth of bandwidth when doing sequential reads and writes.  If the granularity is too fine (e.g. one block), we'll waste any read buffering the device can do for us.  In practice, we've found that switching from one device to the next every 512K works well for the current generation of disk drives.

  3. That said, for intent log blocks, it's better to round-robin between devices each time we write a log block.  That's because they're very short-lived, so we don't expect to ever need to read them; therefore it's better to optimize for maximum IOPs when writing log blocks.  Neil Perrin integrated support for this earlier today.

  4. More generally, we'll probably want different striping policies for different types of data: large/sequential, small/random, transient (like the intent log), and dnodes (clumping for spatial density might be good).  This is fertile ground for investigation.

  5. If one of the devices is slow or degraded for some reason, it should be avoided.  This is work in progress.

2. Metaslab selection.  We divide each device into a few hundred regions, called metaslabs, because the overall scheme was inspired by the slab allocator. Having selected a device, which metaslab should we use?  Intuitively it seems that we'd always want the one with the most free space, but there are other factors to consider:

  1. Modern disks have uniform bit density and constant angular velocity.  Therefore, the outer recording zones are faster (higher bandwidth) than the inner zones by the ratio of outer to inner track diameter, which is typically around 2:1.  We account for this by assigning a higher weight to the free space in lower-LBA metaslabs.  In effect, this means that we'll select the metaslab with the most free bandwidth rather than simply the one with the most free space.

  2. When a pool is relatively empty, we want to keep allocations in the outer (faster) regions of the disk; this improves both bandwidth and latency (by minimizing seek distances).  Therefore, we assign a higher weight to metaslabs that have been used before.

All of these considerations can be seen in the function metaslab_weight().  Having defined a weighting scheme, the selection algorithm is simple:  always select the metaslab with the highest weight.

3. Block selection.  Having selected a metaslab, we must choose a block within that metaslab.  The current allocation policy is a simple variation on first-fit; it  seems likely that we can do better.  In the future I expect that we'll  have not only a better algorithm, but a whole collection of algorithms, each optimized for a specific workload.  Anticipating this, the block allocation code is fully vectorized; see space_map_ops_t for details.

The mechanism (as opposed to policy) for keeping track of free space in a metaslab is a new data structure called a space map, which I'll describe in the next post.

Saturday Sep 16, 2006

ZFS Internals at Storage Developer Conference



On Monday, September 18, Bill Moore and I will host a four-hour deep dive into ZFS internals at the 2006 Storage Developer Conference in San Jose.  The four-hour format is a lot of fun -- both for us and the audience -- because it allows enough time to explore the architectural choices and trade-offs, explain how things really work, and weave in some amusing stories along the way.  Come join the festivities!

Friday May 26, 2006

ZFS on FUSE/Linux



You've probably heard by now that Apple is planning to port ZFS to MacOS.

And that Matt Dillon is porting ZFS to DragonFly BSD.

Today I'm pleased to report that ZFS is being ported to FUSE/Linux.

Very cool.  This is another step toward the ultimate goal: ubiquity.  Consumer devices need data integrity just as badly as enterprises do.  With ZFS, even the humblest device -- an iPod, a digital camera -- can provide fast, reliable storage using nothing but commodity hardware and free, open-source software.  And it's about time, don't you think?


Thursday May 04, 2006

You say zeta, I say zetta

On a lighter note, what is the Z in ZFS?

I've seen the (non-)word 'zetabyte' many times in the press. It's also been noted that zeta is the last letter of the Greek alphabet, which will no doubt surprise many Greeks.

So, here's the real story. When I began the (nameless) project, it needed a name. I had no idea what to call it. I knew that the Big Idea was to bring virtual memory concepts to disks, so things like VSS (Virtual Storage System) came to mind. The problem is, everything I came up with sounded like vapor. I don't know about you, but when I hear "Tiered Web Deployment Virtualization Engine" I assume it's either a vast steaming pile of complexity or a big fluffy cloud of nothing -- which is correct north of 98% of the time. So I didn't want a name that screamed "BS!" out of the gate. That actually ruled out a lot of stuff.

At first I was hesitant to pick a name that ended in FS, because that would pigeon-hole the project in a way that's not quite accurate. It's much more than a filesystem, but then again, it is partly a filesystem, and at least people know what that is. It's a real thing. It's not a Tiered Web Deployment Virtualization Engine. I figured it would be better to correct any initial misimpression than to make no impression at all.

So the hunt was on for something-FS. The first thing I did was to Google all 26 three-letter options, from AFS to ZFS. As it turns out, they've all been used by current or previous products, often many times over. But ZFS hadn't been used much, and certainly not by anything mainstream like (for example) AFS, DFS, NFS or UFS.

I should mention that I ruled out BFS or BonwickFS, because it implies either sole authorship or a failure to share credit. Neither would be good.

So in the end, I picked ZFS for the simplest of reasons: it sounds cool. It doesn't come with any baggage, like YFS (Why FS? Ha ha!). And you can even say it in hex (2f5).

The next task was to retrofit the acronym -- that is, I had to come up with something for the Z to stand for. I actually got out my trusty old 1980 Random House Unabridged and looked through the entire Z section, and found nothing that spoke to me. Zero had a certain appeal (as in zero administrative hassle), but then again -- eh.

OK, what does the web have to say? I don't recall exactly how I stumbled on the zetta prefix, but it seemed ideal: ZFS was to be a 128-bit filesystem, and the next unit after exabyte (the 64-bit limit is 16EB) is zettabyte. Perfect: the zettabyte filesystem.

A bit of trivia (in case the rest of this post isn't quite trivial enough for you): the prefix 'zetta' is not of Greek or Latin origin -- at least, not directly. The original SI prefixes were as follows:

	milli	kilo	(1000^1)
	micro	mega	(1000^2)
	nano	giga	(1000^3)
	pico	tera	(1000^4)
	femto	peta	(1000^5)
	atto	exa	(1000^6)

The demand for more extreme units actually began at the small end, so the SI people needed names for 1000^-7 and 1000^-8. For seven, they took the Latin "sept" and made "zepto" (septo wouldn't work because s for second is ubiquitous); for eight, they took the Greek "okto" and made "yocto" (octo wouldn't work because o looks like 0). For symmetry with zepto and yocto on the small side, they defined zetta and yotta on the big side. (I also suspect they were drinking heavily, and thought yotta was a real hoot.)

But zettabyte wasn't perfect, actually. We (we were a team by now) found that when you call it the zettabyte filesystem, you have to explain what a zettabyte is, and by then the elevator has reached the top floor and all people know is that you're doing large capacity. Which is true, but it's not the main point.

So we finally decided to unpimp the name back to ZFS, which doesn't stand for anything. It's just a pseudo-acronym that vaguely suggests a way to store files that gets you a lot of points in Scrabble.

And it is, of course, "the last word in filesystems".

Tuesday May 02, 2006

Smokin' Mirrors

Resilvering -- also known as resyncing, rebuilding, or reconstructing -- is the process of repairing a damaged device using the contents of healthy devices. This is what every volume manager or RAID array must do when one of its disks dies, gets replaced, or suffers a transient outage.

For a mirror, resilvering can be as simple as a whole-disk copy. For RAID-5 it's only slightly more complicated: instead of copying one disk to another, all of the other disks in the RAID-5 stripe must be XORed together. But the basic idea is the same.

In a traditional storage system, resilvering happens either in the volume manager or in RAID hardware. Either way, it happens well below the filesystem.

But this is ZFS, so of course we just had to be different.

In a previous post I mentioned that RAID-Z resilvering requires a different approach, because it needs the filesystem metadata to determine the RAID-Z geometry. In effect, ZFS does a 'cp -r' of the storage pool's block tree from one disk to another. It sounds less efficient than a straight whole-disk copy, and traversing a live pool safely is definitely tricky (more on that in a future post). But it turns out that there are so many advantages to metadata-driven resilvering that we've chosen to use it even for simple mirrors.

The most compelling reason is data integrity. With a simple disk copy, there's no way to know whether the source disk is returning good data. End-to-end data integrity requires that each data block be verified against an independent checksum -- it's not enough to know that each block is merely consistent with itself, because that doesn't catch common hardware and firmware bugs like misdirected reads and phantom writes.

By traversing the metadata, ZFS can use its end-to-end checksums to detect and correct silent data corruption, just like it does during normal reads. If a disk returns bad data transiently, ZFS will detect it and retry the read. If it's a 3-way mirror and one of the two presumed-good disks is damaged, ZFS will use the checksum to determine which one is correct, copy the data to the new disk, and repair the damaged disk.

A simple whole-disk copy would bypass all of this data protection. For this reason alone, metadata-driven resilvering would be desirable even it it came at a significant cost in performance.

Fortunately, in most cases, it doesn't. In fact, there are several advantages to metadata-driven resilvering:

Live blocks only. ZFS doesn't waste time and I/O bandwidth copying free disk blocks because they're not part of the storage pool's block tree. If your pool is only 10-20% full, that's a big win.

Transactional pruning. If a disk suffers a transient outage, it's not necessary to resilver the entire disk -- only the parts that have changed. I'll describe this in more detail in a future post, but in short: ZFS uses the birth time of each block to determine whether there's anything lower in the tree that needs resilvering. This allows it to skip over huge branches of the tree and quickly discover the data that has actually changed since the outage began.

What this means in practice is that if a disk has a five-second outage, it will only take about five seconds to resilver it. And you don't pay extra for it -- in either dollars or performance -- like you do with Veritas change objects. Transactional pruning is an intrinsic architectural capability of ZFS.

Top-down resilvering. A storage pool is a tree of blocks. The higher up the tree you go, the more disastrous it is to lose a block there, because you lose access to everything beneath it.

Going through the metadata allows ZFS to do top-down resilvering. That is, the very first thing ZFS resilvers is the uberblock and the disk labels. Then it resilvers the pool-wide metadata; then each filesystem's metadata; and so on down the tree. Throughout the process ZFS obeys this rule: no block is resilvered until all of its ancestors have been resilvered.

It's hard to overstate how important this is. With a whole-disk copy, even when it's 99% done there's a good chance that one of the top 100 blocks in the tree hasn't been copied yet. This means that from an MTTR perspective, you haven't actually made any progress: a second disk failure at this point would still be catastrophic.

With top-down resilvering, every single block copied increases the amount of discoverable data. If you had a second disk failure, everything that had been resilvered up to that point would be available.

Priority-based resilvering. ZFS doesn't do this one yet, but it's in the pipeline. ZFS resilvering follows the logical structure of the data, so it would be pretty easy to tag individual filesystems or files with a specific resilver priority. For example, on a file server you might want to resilver calendars first (they're important yet very small), then /var/mail, then home directories, and so on.


What I hope to convey with each of these posts is not just the mechanics of how a particular feature is implemented, but to illustrate how all the parts of ZFS form an integrated whole. It's not immediately obvious, for example, that transactional semantics would have anything to do with resilvering -- yet transactional pruning makes recovery from transient outages literally orders of magnitude faster. More on how that works in the next post.

Wednesday Dec 14, 2005

ZFS BoF at FAST

The ZFS team is hosting a BoF at FAST in San Francisco at 9PM tonight.

We'll begin with an in-depth look at what ZFS is and how it works. This will be followed by the customary Airing of Grievances. (I'll get you a free ZFS T-shirt if you can get a Festivus pole past security and into our BoF. (Seriously.))

Quite a few of us will be there, so come on up and meet the folks who put the dot in vmcore.0.

Monday Dec 12, 2005

SEEK_HOLE and SEEK_DATA for sparse files

A sparse file is a file that contains much less data than its size would suggest. If you create a new file and then do a 1-byte write to the billionth byte, for example, you've just created a 1GB sparse file. By convention, reads from unwritten parts of a file return zeroes.

File-based backup and archiving programs like tar, cpio, and rsync can detect runs of zeroes and ignore them, so that the archives they produce aren't filled with zeroes. Still, this is terribly inefficient. If you're a backup program, what you really want is a list of the non-zero segments in the file. But historically there's been no way to get this information without looking at filesystem-specific metadata.

Desperately seeking segments

As part of the ZFS project, we introduced two general extensions to lseek(2): SEEK_HOLE and SEEK_DATA. These allow you to quickly discover the non-zero regions of sparse files. Quoting from the new man page:

       o  If whence is SEEK_HOLE, the offset of the start of  the
          next  hole greater than or equal to the supplied offset
          is returned. The definition of a hole is provided  near
          the end of the DESCRIPTION.

       o  If whence is SEEK_DATA, the file pointer is set to  the
          start  of the next non-hole file region greater than or
          equal to the supplied offset.

        [...]

     A "hole" is defined as a contiguous  range  of  bytes  in  a
     file,  all  having the value of zero, but not all zeros in a
     file are guaranteed to be represented as holes returned with
     SEEK_HOLE. Filesystems are allowed to expose ranges of zeros
     with SEEK_HOLE, but not required to.  Applications  can  use
     SEEK_HOLE  to  optimise  their behavior for ranges of zeros,
     but must not depend on it to find all such ranges in a file.
     The  existence  of  a  hole  at the end of every data region
     allows for easy programming and implies that a virtual  hole
     exists  at  the  end  of  the  file. Applications should use
     fpathconf(_PC_MIN_HOLE_SIZE) or  pathconf(_PC_MIN_HOLE_SIZE)
     to  determine if a filesystem supports SEEK_HOLE. See fpath-
     conf(2).

     For filesystems that do not supply information about  holes,
     the file will be represented as one entire data region.

Any filesystem can support SEEK_HOLE / SEEK_DATA. Even a filesystem like UFS, which has no special support for sparseness, can walk its block pointers much faster than it can copy out a bunch of zeroes. Even if the search algorithm is linear-time, at least the constant is thousands of times smaller.

Sparse file navigation in ZFS

A file in ZFS is a tree of blocks. To maximize the performance of SEEK_HOLE and SEEK_DATA, ZFS maintains in each block pointer a fill count indicating the number of data blocks beneath it in the tree (see below). Fill counts reveal where holes and data reside, so that ZFS can find both holes and data in guaranteed logarithmic time.

  L4                           6
              -----------------------------------
  L3          5                0                1
         -----------      -----------      -----------
  L2     3    0    2                       0    0    1
        ---  ---  ---                     ---  ---  ---
  L1    111       101                               001


An indirect block tree for a sparse file, showing the fill count in each block pointer. In this example there are three block pointers per indirect block. The lowest-level (L1) block pointers describe either one block of data or one block-sized hole, indicated by a fill count of 1 or 0, respectively. At L2 and above, each block pointer's fill count is the sum of the fill counts of its children. At any level, a non-zero fill count indicates that there's data below. A fill count less than the maximum (3L-1 in this example) indicates that there are holes below. From any offset in the file, ZFS can find the next hole or next data in logarithmic time by following the fill counts in the indirect block tree.

Portability

At this writing, SEEK_HOLE and SEEK_DATA are Solaris-specific. I encourage (implore? beg?) other operating systems to adopt these lseek(2) extensions verbatim (100% tax-free) so that sparse file navigation becomes a ubiquitous feature that every backup and archiving program can rely on. It's long overdue.


Technorati Tags:


Archives
Links
Referrers