Jeff Bonwick's Blog
Tuesday Jun 10, 2008
ZFS in MacOS X Snow Leopard
Cheers to Noel, Don, Bertrand, and all the great folks at Apple. Now when can I get this in Time Capsule? :-)
Posted at 02:54AM Jun 10, 2008 by bonwick in ZFS | Comments[11]
Monday May 26, 2008
Casablanca
Chocolate on my peanut butter?
No, peanut butter on my chocolate!
All I can say for the moment is... stay tuned.
Posted at 03:49PM May 26, 2008 by bonwick in ZFS | Comments[69]
Monday Dec 31, 2007
How to Lose a Customer
For over a year I have been the proud and happy owner of a Garmin GPS unit -- the Nuvi 360. I have practically been a walking billboard for the company. Go ahead, ask me about my Nuvi!
But wait, it gets better. Moreover, the 2008 map update isn't a one-time purchase. There's an update every year, so it's really a $138/year subscription. That's $11.50/month. For maps. For a mapping device. That I already paid for.
I can get better information from Google Maps, continuously updated, with integrated real-time traffic data, for free, forever -- and my iPhone will happily use that data to plot time-optimal routes. (In fact, all the iPhone needs is the right antenna and a SIRF-3 chipset to make dedicated GPS devices instantly obsolete. This is so obvious it can't be more than a year out. I can live with the stale maps until then, and have a $138 down payment on the GPS iPhone earning interest while I wait.)
And so, starting today, that's exactly what I'll do. I don't mind paying a reasonable fee for services rendered. I do mind getting locked into a closed-source platform and being forced to pay monopoly rents for a proprietary, stale and limited version of data that's already available to the general public. That business model is so over.
Everything about this stinks, Garmin. You tell me, unexpectedly, that I have to pay for routine map updates. You make the price outrageous. You don't actually disclose what's in the update. (Several Amazon reviewers say the new maps are actually worse.) You make the update hard to do. You needlessly add to our landfills by creating single-use DVDs. You have an unreasonable licensing policy. And you hide that policy until after the purchase.
Way to go, Garmin. You have pissed off a formerly delighted customer, and that is generally a one-way ticket. You have lost both my business and my respect. I won't be coming back. Ever.
Posted at 04:29AM Dec 31, 2007 by bonwick in General | Comments[11]
Friday Sep 14, 2007
Space Maps
Bitmaps
The most common way to represent free space is by using a bitmap. A bitmap is simply an array of bits, with the Nth bit indicating whether the Nth block is allocated or free. The overhead for a bitmap is quite low: 1 bit per block. For a 4K blocksize, that's 1/(4096*8) = 0.003%. (The 8 comes from 8 bits per byte.)
For a 1GB filesystem, the bitmap is 32KB -- something that easily fits in memory, and can be scanned quickly to find free space. For a 1TB filesystem, the bitmap is 32MB -- still stuffable in memory, but no longer trivial in either size or scan time. For a 1PB filesystem, the bitmap is 32GB, and that simply won't fit in memory on most machines. This means that scanning the bitmap requires reading it from disk, which is slower still. Clearly, this doesn't scale.
One seemingly obvious remedy is to break the bitmap into small chunks, and keep track of the number of bits set in each chunk. For example, for a 1PB filesystem using 4K blocks, the free space can be divided into a million bitmaps, each 32KB in size. The summary information (the million integers indicating how much space is in each bitmap) fits in memory, so it's easy to find a bitmap with free space, and it's quick to scan that bitmap.
But there's still a fundamental problem: the bitmap(s) must be updated not only when a new block is allocated, but also when an old block is freed. The filesystem controls the locality of allocations (it decides which blocks to put new data into), but it has no control over the locality of frees. Something as simple as 'rm -rf' can cause blocks all over the platter to be freed. With our 1PB filesystem example, in the worst case, removing 4GB of data (a million 4K blocks) could require each of the million bitmaps to be read, modified, and written out again. That's two million disk I/Os to free a measly 4GB -- and that's just not reasonable, even as worst-case behavior.
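The bitmap arithmetic above is easy to check mechanically. The following is a minimal sketch, not anything taken from ZFS; the helper names `bitmap_bytes` and `worst_case_free_ios` are hypothetical, and sizes are assumed to be binary units (1KB = 2^10 bytes).

```python
# Binary units, assumed for this illustration.
KB, MB, GB, TB, PB = (2**10, 2**20, 2**30, 2**40, 2**50)
BLOCK_SIZE = 4 * KB  # 4K blocks, as in the text


def bitmap_bytes(fs_bytes, block_size=BLOCK_SIZE):
    """Size in bytes of a one-bit-per-block free-space bitmap."""
    blocks = fs_bytes // block_size
    return blocks // 8  # 8 bits per byte


def worst_case_free_ios(blocks_freed, chunk_count):
    """Worst case: every freed block lands in a different bitmap chunk,
    and each touched chunk costs one read plus one write."""
    touched = min(blocks_freed, chunk_count)
    return 2 * touched


# 1PB filesystem split into 32KB bitmap chunks.
chunk_count = bitmap_bytes(PB) // (32 * KB)
# Number of 4K blocks in 4GB of freed data.
blocks_in_4gb = (4 * GB) // BLOCK_SIZE

if __name__ == "__main__":
    for name, size in [("1GB", GB), ("1TB", TB), ("1PB", PB)]:
        print(name, "filesystem ->", bitmap_bytes(size), "byte bitmap")
    print("chunks:", chunk_count)
    print("worst-case I/Os:", worst_case_free_ios(blocks_in_4gb, chunk_count))
```

With binary units the "million" bitmaps and "two million" I/Os come out as 2^20 and 2^21 respectively, which matches the rounded figures in the text.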
More than any other single factor, this is why bitmaps don't scale: because frees are often random, and bitmaps that don't fit in memory perform pathologically when they are accessed randomly.
B-trees
Another common way to represent free space is with a B-tree of extents. An extent is a contiguous region of free space described by two integers: offset and length. The B-tree sorts the extents by offset so that contiguous space allocation is efficient. Unfortunately, B-trees of extents suffer the same pathology as bitmaps when confronted with random frees.
What to do?
Deferred frees
One way to mitigate the pathology of random frees is to defer the update of the bitmaps or B-trees, and instead keep a list of recently freed blocks. When this deferred free list reaches a certain size, it can be sorted, in memory, and then freed to the underlying bitmaps or B-trees with somewhat better locality. Not ideal, but it helps. But what if we went further?
Space maps: log-structured free lists
Recall that log-structured filesystems long ago posed this question: what if, instead of periodically folding a transaction log back into the filesystem, we made the transaction log be the filesystem? That is precisely what ZFS does.
ZFS divides the space on each virtual device into a few hundred regions called metaslabs. Each metaslab has an associated space map, which describes that metaslab's free space. The space map is simply a log of allocations and frees, in time order. Space maps make random frees just as efficient as sequential frees, because regardless of which extent is being freed, it's represented on disk by appending the extent (a couple of integers) to the space map object -- and appends have perfect locality. Allocations, similarly, are represented on disk as extents appended to the space map object (with, of course, a bit set indicating that it's an allocation, not a free).
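To make the append-only idea concrete, here is a small sketch of a space map as a time-ordered log of extents. It is an illustration under stated assumptions, not the ZFS on-disk format: the class name `SpaceMap`, the `replay` method, and the choice to start from an entirely free metaslab are all mine, and `replay` uses a simple in-memory per-block array purely to keep the example short.

```python
from dataclasses import dataclass, field

ALLOC, FREE = "A", "F"  # the "bit" distinguishing allocations from frees


@dataclass
class SpaceMap:
    """Hypothetical space map for one metaslab of `size` blocks."""
    size: int
    log: list = field(default_factory=list)  # (kind, offset, length), in time order

    def alloc(self, offset, length):
        # An allocation is just an append; no existing record is touched.
        self.log.append((ALLOC, offset, length))

    def free(self, offset, length):
        # A free is also just an append, wherever the extent happens to be.
        self.log.append((FREE, offset, length))

    def replay(self):
        """Replay the log and return the free extents as (offset, length) pairs.
        Assumes the metaslab starts out entirely free."""
        is_free = [True] * self.size
        for kind, offset, length in self.log:
            for block in range(offset, offset + length):
                is_free[block] = (kind == FREE)
        extents, start = [], None
        for block, free_now in enumerate(is_free):
            if free_now and start is None:
                start = block
            elif not free_now and start is not None:
                extents.append((start, block - start))
                start = None
        if start is not None:
            extents.append((start, self.size - start))
        return extents


if __name__ == "__main__":
    sm = SpaceMap(size=16)
    sm.alloc(0, 8)   # allocate blocks 0..7
    sm.free(6, 1)    # "random" frees, in arbitrary order
    sm.free(2, 2)
    print(sm.log)
    print(sm.replay())
```

The point of the sketch is that `free(6, 1)` and `free(2, 2)` cost the same as any other update: each is one record appended to the end of the log, regardless of where the freed extent lives.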
Finally, note that when a space map is completely full, it is represented by a single extent. Space maps therefore have the appealing property that as your storage pool approaches 100% full, the space maps start to evaporate, thus making every last drop of disk space available to hold useful information.
Posted at 05:03AM Sep 14, 2007 by bonwick in ZFS | Comments[12]
Sunday Jun 24, 2007
ZFS License Announcement
My oldest son, Andrew, recently gave me something cool for the back of my car:
That's Andrew (11), David (8), and Galen (6). If you look closely, you can see the CEO potential...
What, you were looking for yet another CDDL vs. GPL thread?
Posted at 05:59PM Jun 24, 2007 by bonwick in ZFS | Comments[7]
Friday May 04, 2007
Rampant Layering Violation?
Andrew Morton has famously called ZFS a "rampant layering violation" because it combines the functionality of a filesystem, volume manager, and RAID controller. I suppose it depends what the meaning of the word violate is. While designing ZFS we observed that the standard layering of the storage stack induces a surprising amount of unnecessary complexity and duplicated logic. We found that by refactoring the problem a bit -- that is, changing where the boundaries are between layers -- we could make the whole thing much simpler.
Suppose you had to compute the sum, from n=1 to infinity, of 1/(n(n+1)). Expanding that out term by term, we have:
1/(1*2) + 1/(2*3) + 1/(3*4) + 1/(4*5) + ...
That is,
1/2 + 1/6 + 1/12 + 1/20 + ...
What does that infinite series add up to? It may seem like a hard problem, but that's only because we're not looking at it right. If you're clever, you might notice that there's a different way to express each term:
1/(n(n+1)) = 1/n - 1/(n+1)
For example,
1/(1*2) = 1/1 - 1/2
Thus, our sum can be expressed as:
(1/1 - 1/2) + (1/2 - 1/3) + (1/3 - 1/4) + (1/4 - 1/5) + ...
Now, notice the pattern: each term that we subtract, we add back. Only in Congress does that count as work. So if we just rearrange the parentheses -- that is, if we rampantly violate the layering of the original problem by using associativity to refactor the arithmetic across adjacent terms of the series -- we get this:
1/1 + (-1/2 + 1/2) + (-1/3 + 1/3) + (-1/4 + 1/4) + ...
or
1/1 + 0 + 0 + 0 + ...
In other words, 1. Isn't that cool?
Mathematicians have a term for this. When you rearrange the terms of a series so that they cancel out, it's called telescoping -- by analogy with a collapsible hand-held telescope. In a nutshell, that's what ZFS does: it telescopes the storage stack. That's what allows us to have a filesystem, volume manager, single- and double-parity RAID, compression, snapshots, clones, and a ton of other useful stuff in just 80,000 lines of code.
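For readers who prefer symbols, the telescoping argument above can be written with partial sums. This is only a restatement of the same computation; the notation S_N for the sum of the first N terms is mine, not the post's.

```latex
\[
S_N \;=\; \sum_{n=1}^{N} \frac{1}{n(n+1)}
    \;=\; \sum_{n=1}^{N} \left( \frac{1}{n} - \frac{1}{n+1} \right)
    \;=\; 1 - \frac{1}{N+1},
\qquad\text{so}\qquad
\lim_{N \to \infty} S_N = 1 .
\]
```

Working with the finite sum first sidesteps any worry about regrouping an infinite series: every intermediate term cancels inside S_N, and only the first and last survive.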
A storage system is more complex than this simple analogy, but at a high level the same idea really does apply. You can think of any storage stack as a series of translations from one naming scheme to another -- ultimately translating a filename to a disk LBA (logical block address). Typically it looks like this:
filesystem(upper): filename to object (inode)
This is the stack we're about to refactor.
ZPL: filename to object
The DMU provides both file and block access to a common pool of physical storage. File access goes through the ZPL, while block access is just a direct mapping to a single DMU object. We're also developing new data access methods that use the DMU's transactional capabilities in more interesting ways -- more about that another day.
Posted at 03:10AM May 04, 2007 by bonwick in ZFS | Comments[33]
Thursday Apr 12, 2007
A Near-Death Experience
Evidently, my previous post was just a tad too cheerful for some folks' taste. But I speak with the optimism of a man who has cheated death. And ironically, Pete's reference to George Cameron had a lot to do with it. Several years ago, George and a few other Sun folks went off to form 3par, a new storage company. They all had Solaris expertise, and understood its advantages, so they wanted to use it inside their box. But we weren't open-source at the time, and our licensing terms really sucked. Both of us -- George at 3par, and me at Sun -- tried for months to arrange something reasonable. We failed. So finally -- because Sun literally gave them no choice -- 3par went with Linux. I couldn't believe it. A cool new company wanted to use our product, and instead of giving them a hand, we gave them the finger. For many of us, that was the tipping point. If we had any reservations about open-sourcing Solaris, that ended them. It was a gamble, to be sure, but the alternative was certain death. Even if the 3par situation had ended differently, it was clear that we needed to change our business practices. To do that, we'd first have to change our culture. But cultures don't change easily -- it usually takes some traumatic event. In Sun's case, watching our stock shed 95% of its value did the trick. It was that total collapse of confidence -- that near-death experience -- that opened us up to things that had previously seemed too dangerous. We had to face a number of hard questions, including the most fundamental ones: Can we make a viable business out of this wreckage? Why are we doing SPARC? Why not AMD and Intel? Why Solaris? Why not Linux and Windows? Where are we going with Java? And not rah-rah why, but really, why? In each case, asking the question with a truly open mind changed the answer. We killed our more-of-the-same SPARC roadmap and went multi-core, multi-thread, and low-power instead. We started building AMD and Intel systems. 
We launched a wave of innovation in Solaris (DTrace, ZFS, zones, FMA, SMF, FireEngine, CrossBow) and open-sourced all of it. We started supporting Linux and Windows. And most recently, we open-sourced Java. In short, we changed just about everything. Including, over time, the culture.
Still, there was no guarantee that open-sourcing Solaris would change anything. It's that same nagging fear you have the first time you throw a party: what if nobody comes? But in fact, it changed everything: the level of interest, the rate of adoption, the pace of communication. Most significantly, it changed the way we do development. It's not just the code that's open, but the entire development process. And that, in turn, is attracting developers and ISVs whom we couldn't even have spoken to a few years ago. The openness permits us to have the conversation; the technology makes the conversation interesting.
After coming so close to augering into the ground, it's immensely gratifying to see the Solaris revival now underway. So if I sometimes sound a bit like the proud papa going on and on about his son, well, I hope you can forgive me.
Oh, and Pete, if you're reading this -- George Cameron is back at Sun now, three doors down the hall from me. Small valley!
Posted at 02:15AM Apr 12, 2007 by bonwick in General
Tuesday Apr 10, 2007
Solaris Inside
When you choose an OS for your laptop, many things affect your decision: application support, availability of drivers, ease of use, and so on.
Posted at 04:23AM Apr 10, 2007 by bonwick in General | Comments[6]
The General-Purpose Storage Revolution
It happened so slowly, most people didn't notice until it was over.
Posted at 04:12AM Apr 10, 2007 by bonwick in General | Comments[4]
Thursday Jan 11, 2007
Out of the mouths of babes...
After sizing up the computers we have at home, my son Andrew made the following declaration: "I want Solaris security, Mac interface, and Windows compatibility." Age 10.
Naturally, sensing a teachable moment, I explained to him what virtualization is all about -- Boot Camp, Parallels, Xen, etc. And the thing is, he really gets it. I can't wait to see what his generation is capable of.
Posted at 10:54PM Jan 11, 2007 by bonwick in General | Comments[1]
Saturday Nov 04, 2006
ZFS Block Allocation
Block allocation is central to any filesystem. It affects not only performance, but also the administrative model (e.g. stripe configuration) and even some core capabilities like transactional semantics, compression, and block sharing between snapshots. So it's important to get it right.
By design, these three policies are independent and pluggable. They can be changed at will without altering the on-disk format, which gives us lots of flexibility in the years ahead.
So... let's go allocate a block!
1. Device selection (aka dynamic striping). Our first task is device selection. The goal is to spread the load across all devices in the pool so that we get maximum bandwidth without needing any notion of stripe groups. You add more disks, you get more bandwidth. We call this dynamic striping -- the point being that it's done on the fly by the filesystem, rather than at configuration time by the administrator.
There are many ways to select a device. Any policy would work, including just picking one at random. But there are several practical considerations:
2. Metaslab selection. We divide each device into a few hundred regions, called metaslabs, because the overall scheme was inspired by the slab allocator. Having selected a device, which metaslab should we use? Intuitively it seems that we'd always want the one with the most free space, but there are other factors to consider:
All of these considerations can be seen in the function metaslab_weight(). Having defined a weighting scheme, the selection algorithm is simple: always select the metaslab with the highest weight.
3. Block selection. Having selected a metaslab, we must choose a block within that metaslab. The current allocation policy is a simple variation on first-fit; it seems likely that we can do better. In the future I expect that we'll have not only a better algorithm, but a whole collection of algorithms, each optimized for a specific workload. Anticipating this, the block allocation code is fully vectorized; see space_map_ops_t for details.
The mechanism (as opposed to policy) for keeping track of free space in a metaslab is a new data structure called a space map, which I'll describe in the next post.
Posted at 06:51PM Nov 04, 2006 by bonwick in ZFS | Comments[1]
Saturday Sep 16, 2006
ZFS Internals at Storage Developer Conference
Posted at 07:32PM Sep 16, 2006 by bonwick in ZFS | Comments[3]
Friday May 26, 2006
ZFS on FUSE/Linux
Posted at 12:03AM May 26, 2006 by bonwick in ZFS | Comments[7]
Thursday May 04, 2006
You say zeta, I say zetta
On a lighter note, what is the Z in ZFS? I've seen the (non-)word 'zetabyte' many times in the press. It's also been noted that zeta is the last letter of the Greek alphabet, which will no doubt surprise many Greeks. So, here's the real story. When I began the (nameless) project, it needed a name. I had no idea what to call it. I knew that the Big Idea was to bring virtual memory concepts to disks, so things like VSS (Virtual Storage System) came to mind. The problem is, everything I came up with sounded like vapor. I don't know about you, but when I hear "Tiered Web Deployment Virtualization Engine" I assume it's either a vast steaming pile of complexity or a big fluffy cloud of nothing -- which is correct north of 98% of the time. So I didn't want a name that screamed "BS!" out of the gate. That actually ruled out a lot of stuff. At first I was hesitant to pick a name that ended in FS, because that would pigeon-hole the project in a way that's not quite accurate. It's much more than a filesystem, but then again, it is partly a filesystem, and at least people know what that is. It's a real thing. It's not a Tiered Web Deployment Virtualization Engine. I figured it would be better to correct any initial misimpression than to make no impression at all. So the hunt was on for something-FS. The first thing I did was to Google all 26 three-letter options, from AFS to ZFS. As it turns out, they've all been used by current or previous products, often many times over. But ZFS hadn't been used much, and certainly not by anything mainstream like (for example) AFS, DFS, NFS or UFS. I should mention that I ruled out BFS or BonwickFS, because it implies either sole authorship or a failure to share credit. Neither would be good. So in the end, I picked ZFS for the simplest of reasons: it sounds cool. It doesn't come with any baggage, like YFS (Why FS? Ha ha!). And you can even say it in hex (2f5). 
The next task was to retrofit the acronym -- that is, I had to come up with something for the Z to stand for. I actually got out my trusty old 1980 Random House Unabridged and looked through the entire Z section, and found nothing that spoke to me. Zero had a certain appeal (as in zero administrative hassle), but then again -- eh. OK, what does the web have to say? I don't recall exactly how I stumbled on the zetta prefix, but it seemed ideal: ZFS was to be a 128-bit filesystem, and the next unit after exabyte (the 64-bit limit is 16EB) is zettabyte. Perfect: the zettabyte filesystem.
A bit of trivia (in case the rest of this post isn't quite trivial enough for you): the prefix 'zetta' is not of Greek or Latin origin -- at least, not directly. The original SI prefixes were as follows:
But zettabyte wasn't perfect, actually. We (we were a team by now) found that when you call it the zettabyte filesystem, you have to explain what a zettabyte is, and by then the elevator has reached the top floor and all people know is that you're doing large capacity. Which is true, but it's not the main point.
So we finally decided to unpimp the name back to ZFS, which doesn't stand for anything. It's just a pseudo-acronym that vaguely suggests a way to store files that gets you a lot of points in Scrabble. And it is, of course, "the last word in filesystems".
Posted at 04:33AM May 04, 2006 by bonwick in ZFS | Comments[3]
Tuesday May 02, 2006
Smokin' Mirrors
Resilvering -- also known as resyncing, rebuilding, or reconstructing -- is the process of repairing a damaged device using the contents of healthy devices. This is what every volume manager or RAID array must do when one of its disks dies, gets replaced, or suffers a transient outage. For a mirror, resilvering can be as simple as a whole-disk copy. For RAID-5 it's only slightly more complicated: instead of copying one disk to another, all of the other disks in the RAID-5 stripe must be XORed together. But the basic idea is the same. In a traditional storage system, resilvering happens either in the volume manager or in RAID hardware. Either way, it happens well below the filesystem. But this is ZFS, so of course we just had to be different. In a previous post I mentioned that RAID-Z resilvering requires a different approach, because it needs the filesystem metadata to determine the RAID-Z geometry. In effect, ZFS does a 'cp -r' of the storage pool's block tree from one disk to another. It sounds less efficient than a straight whole-disk copy, and traversing a live pool safely is definitely tricky (more on that in a future post). But it turns out that there are so many advantages to metadata-driven resilvering that we've chosen to use it even for simple mirrors. The most compelling reason is data integrity. With a simple disk copy, there's no way to know whether the source disk is returning good data. End-to-end data integrity requires that each data block be verified against an independent checksum -- it's not enough to know that each block is merely consistent with itself, because that doesn't catch common hardware and firmware bugs like misdirected reads and phantom writes. By traversing the metadata, ZFS can use its end-to-end checksums to detect and correct silent data corruption, just like it does during normal reads. If a disk returns bad data transiently, ZFS will detect it and retry the read. 
If it's a 3-way mirror and one of the two presumed-good disks is damaged, ZFS will use the checksum to determine which one is correct, copy the data to the new disk, and repair the damaged disk. A simple whole-disk copy would bypass all of this data protection. For this reason alone, metadata-driven resilvering would be desirable even if it came at a significant cost in performance. Fortunately, in most cases, it doesn't. In fact, there are several advantages to metadata-driven resilvering:
Live blocks only. ZFS doesn't waste time and I/O bandwidth copying free disk blocks because they're not part of the storage pool's block tree. If your pool is only 10-20% full, that's a big win.
Transactional pruning. If a disk suffers a transient outage, it's not necessary to resilver the entire disk -- only the parts that have changed. I'll describe this in more detail in a future post, but in short: ZFS uses the birth time of each block to determine whether there's anything lower in the tree that needs resilvering. This allows it to skip over huge branches of the tree and quickly discover the data that has actually changed since the outage began. What this means in practice is that if a disk has a five-second outage, it will only take about five seconds to resilver it. And you don't pay extra for it -- in either dollars or performance -- like you do with Veritas change objects. Transactional pruning is an intrinsic architectural capability of ZFS.
Top-down resilvering. A storage pool is a tree of blocks. The higher up the tree you go, the more disastrous it is to lose a block there, because you lose access to everything beneath it. Going through the metadata allows ZFS to do top-down resilvering. That is, the very first thing ZFS resilvers is the uberblock and the disk labels. Then it resilvers the pool-wide metadata; then each filesystem's metadata; and so on down the tree.
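Transactional pruning and top-down ordering can both be pictured as a single tree walk. The sketch below is a toy model, not ZFS code: `Block`, `birth_txg`, and `resilver_order` are hypothetical names, and it assumes that in a copy-on-write tree a parent is always rewritten when any descendant changes, so a parent born before the outage cannot have newer blocks beneath it.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Toy block-tree node; birth_txg is the transaction in which it was written."""
    name: str
    birth_txg: int
    children: list = field(default_factory=list)


def resilver_order(root, outage_start_txg):
    """Return block names in the order a top-down, pruned walk would copy them."""
    order = []

    def visit(block):
        # Assumption: a block born before the outage is already intact on the
        # returning disk, and (by copy-on-write) so is everything beneath it.
        if block.birth_txg < outage_start_txg:
            return  # prune this entire branch
        order.append(block.name)   # parent is handled before any child
        for child in block.children:
            visit(child)

    visit(root)
    return order


if __name__ == "__main__":
    tree = Block("uberblock", 100, [
        Block("fs-meta", 100, [
            Block("fileA", 40),
            Block("fileB", 100),
        ]),
        Block("old-branch", 30, [Block("old-file", 10)]),
    ])
    print(resilver_order(tree, outage_start_txg=90))
```

In this toy tree only three of the six blocks are visited: the whole `old-branch` subtree is skipped without ever looking at its children, which is the source of the speedup described above.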
Throughout the process ZFS obeys this rule: no block is resilvered until all of its ancestors have been resilvered. It's hard to overstate how important this is. With a whole-disk copy, even when it's 99% done there's a good chance that one of the top 100 blocks in the tree hasn't been copied yet. This means that from an MTTR perspective, you haven't actually made any progress: a second disk failure at this point would still be catastrophic. With top-down resilvering, every single block copied increases the amount of discoverable data. If you had a second disk failure, everything that had been resilvered up to that point would be available.
Priority-based resilvering. ZFS doesn't do this one yet, but it's in the pipeline. ZFS resilvering follows the logical structure of the data, so it would be pretty easy to tag individual filesystems or files with a specific resilver priority. For example, on a file server you might want to resilver calendars first (they're important yet very small), then /var/mail, then home directories, and so on.
What I hope to convey with each of these posts is not just the mechanics of how a particular feature is implemented, but to illustrate how all the parts of ZFS form an integrated whole. It's not immediately obvious, for example, that transactional semantics would have anything to do with resilvering -- yet transactional pruning makes recovery from transient outages literally orders of magnitude faster. More on how that works in the next post.
Posted at 04:13AM May 02, 2006 by bonwick in ZFS | Comments[1]