Multithreaded Musings
Stand back - I'm a scientist!
About Me
Known throughout Sun as a man of infinite wit, of jovial attitude, and of making things up about himself at the slightest whim.
Tuesday Jan 05, 2010
Seven Years of Good Luck: Splitting Mirrors

The Problem

Imagine you had a nice zfs pool holding all the data for your business application. Regular back-ups are a must, right? But imagine further that you want to back it up without impacting the application, without going directly to the pool and reading the data, without having to incur the overhead of all those additional I/O operations. What's the solution?

Traditionally, it has been the practice to mirror the data locally, break the mirror, and then move the "broken-off" disks to a new machine for backup. This is possible to do with ZFS but, until now, has been very awkward:

Consider the case of a pool, "tank", composed of a mirror of two disks, "c0t0d0" and "c0t1d0". First, remove the disk from the pool:

# zpool offline tank c0t0d0
# zpool detach tank c0t0d0
Then, physically move disk c0t0d0 to a new machine, and use zpool import -f to find it. The -f is necessary because the pool still thinks it's imported on the original machine:
# zpool import -f tank
But even after that, the "tank" pool on the new machine still remembers being attached to the other disk, so some clean-up is in order:
# zpool detach tank c0t1d0
A bit awkward, but still possible, right? Now extend that to a pool with many n-way mirrors, and you can see how this is risky and prone to error. And there are certain configs where this won't actually work, which I'll go into detail on later.

In any case, using the technique above, the pool cannot be imported on the same machine because of something called a "Pool GUID". Using the offline/detach sequence means the kernel will find two copies of the GUID: one in the original pool and one in the offlined/detached disk, and this stops the GUID from honouring the "U" bit in its acronym.

The Solution

With the integration of PSARC 2009/511, we've introduced a new command: "zpool split". In the simplest case, zpool split takes two arguments: the existing pool to split disks from, and a name for the new pool. Consider again the "tank" example above. The two disks, c0t0d0 and c0t1d0, are mirrors and each is identical to the other. Running the command:

# zpool split tank vat
will result in two pools: the original pool "tank" with the c0t0d0 disk, and the new pool "vat" with the c0t1d0 disk.

That's it. The c0t1d0 disk can immediately be removed and plugged into a new machine. A "zpool import vat" will find it on the new machine and import it.

Behind the scenes, several things went on: first, zfs evaluated the configuration -- only certain configurations will work -- and chose a disk to split off. Next, the in-memory data was flushed out to the mirror. Incidentally, this is one of many reasons it's REALLY important to have disks that honour the Flush Write Cache command instead of ignoring it. After flushing the data out, the disk can be detached from the pool and given a new label with a new Pool GUID. By generating a new pool GUID, zfs allows the pool to be imported on the same machine that it was split from. (See below for more detail.)

In fact, there's an option, -R, that tells the split command to go ahead and import it after the split is complete:

# zpool split -R /vatroot tank vat
This command imports the "vat" pool under the altroot directory /vatroot. The only reason for having to specify an altroot is if there are any non-default mountpoints on any of the datasets in "tank". Because "vat" is an exact copy of tank, all the dataset properties will be exactly the same. If all the mountpoints are the default mountpoints (e.g. tank/foo is mounted at /tank/foo), then there is no need for an altroot. However, if dataset tank/foo is mounted at /etc/foo instead, then the "vat" pool's vat/foo dataset will also have a mountpoint of /etc/foo, and they will conflict. So the simplest thing to do for split is to require an altroot if the split-off pool is to be mounted.

By specifying an altroot /vatroot, the dataset vat/foo will instead be mounted under /vatroot/etc/foo, and there will be no conflict. Moreover, when the disk is moved to a new machine, it can be mounted without the need for an altroot, and all the mountpoints will be correct.
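To make the altroot arithmetic concrete, here is a toy model in Python. It is only a sketch of the path prefixing described above, not how zfs computes mount paths internally, and the function name effective_mountpoint is made up for this illustration.

```python
import posixpath

def effective_mountpoint(mountpoint, altroot=None):
    """Return where a dataset would appear, given an optional altroot.

    Toy model: the altroot is simply prefixed to the dataset's
    mountpoint property. With no altroot the mountpoint is used as-is.
    """
    if not altroot:
        return mountpoint
    # Strip the leading slash so join() does not discard the altroot.
    return posixpath.join(altroot, mountpoint.lstrip("/"))

# tank/foo has a non-default mountpoint of /etc/foo; vat/foo inherits it.
print(effective_mountpoint("/etc/foo"))              # same path as tank/foo
print(effective_mountpoint("/etc/foo", "/vatroot"))  # moved under the altroot
```

Without an altroot both datasets claim /etc/foo; with /vatroot the split-off copy lands at /vatroot/etc/foo and the conflict goes away.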

The split code has another bit of flexibility in it: you can specify which disks to split off. Normally the split code simply chooses the "last" disk in each mirror to use for the new pool. But if there are specific disks that you're planning on moving to the new machine, you simply put them on the command line. For example, let's say you had created a pool with the command: "zpool create tank mirror c0t0d0 c1t0d0 c2t0d0 mirror c0t1d0 c1t1d0 c2t1d0". If you just run "zpool split tank vat", then the "vat" pool will be composed of c2t0d0 and c2t1d0, leaving the remaining four disks as part of the "tank" pool.

Let's say you wanted to use controller 1's disks to move. The command would instead be:

# zpool split tank vat c1t0d0 c1t1d0
and the split code will use those disks instead. The "vat" pool will get c1t0d0 and c1t1d0, leaving the other four as part of "tank".
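The selection rule can be sketched in a few lines of Python. This is a model of the behaviour described in this post, not the libzfs code; the function name choose_split_disks is hypothetical, and the "exactly one requested disk per mirror" check is an assumption made for the sketch.

```python
def choose_split_disks(mirrors, requested=None):
    """Pick one disk per mirror for the new pool.

    mirrors:   list of lists of disk names, one inner list per mirror.
    requested: optional list of disk names given on the command line.
    Default (as described in the post): the last disk of each mirror.
    """
    chosen = []
    for mirror in mirrors:
        if requested:
            picks = [disk for disk in mirror if disk in requested]
            # Assumption for this sketch: one requested disk per mirror.
            if len(picks) != 1:
                raise ValueError("need exactly one requested disk in %r" % (mirror,))
            chosen.append(picks[0])
        else:
            chosen.append(mirror[-1])
    return chosen

tank = [["c0t0d0", "c1t0d0", "c2t0d0"],
        ["c0t1d0", "c1t1d0", "c2t1d0"]]
print(choose_split_disks(tank))                        # default choice
print(choose_split_disks(tank, ["c1t0d0", "c1t1d0"]))  # controller 1's disks
```

The first call returns c2t0d0 and c2t1d0, the second c1t0d0 and c1t1d0, matching the two examples above.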

To verify this before doing any actual splitting, you can use the -n option: this option goes through all the configuration validation that is normally done for zpool split, but does not do the actual split. Instead, this displays what the new pool would look like if the split command were to succeed.

The Gory Details

So is it that simple, then? Just offline and detach the mirrors, and give them a new pool GUID? Not quite, alas. Not if you want to do it right. We need to contend with real-world problems, and therefore we cannot assume that all operations succeed, or even that the machine doesn't die part way through the process. The typical way to handle these situations is to use something known as a "three-phase commit", which boils down to: (1) state your intentions, (2) perform the operation, and (3) remove the statement of intentions. That way, after any one of the phases, your system is in a known state, and you can either roll back to the previous state, or forge ahead and complete the task.

For split, these steps are: (1) create a list of disks being split off, and then offline them, (2) create a new pool using those offlined disks, and (3) go back to the original pool and detach the disks. If we die after step 1, then on resume we know we can just bring the disks back online, and we've successfully rolled back. If we die after step 2, the new pool is already created, so all we have to do when we start back up again is complete step 3.

The obvious question to ask is: what happens if we die part-way through one of the steps? That's where ZFS's transaction model saves us: we only commit operations at the end of each step, so if we die part-way through a step, it's as if that step never got started. However, there are two things that are not covered by the transaction model and, unfortunately, the split code is required to touch one of them explicitly: the vdev label. The other non-transactional block of data is the /etc/zfs/zpool.cache file, which we don't have to worry about for splitting, because the innards of the kernel handle that for us.

What the vdev label holds is a number of pieces of information, and these include the pool to which it belongs, and the other members of the top-level vdev to which it belongs. The vdev label is not part of the pool data. It resides outside the pool, and is written to separately from the pool. In order to keep the label from being corrupt, not only does it get checksummed, but it gets written in four different locations on the disk. If the zfs kernel has to change the vdev configuration, a new nvlist is generated with all the in-memory configuration information, and then that is written out on the next sync operation to all four label locations. The bottom line is that it is possible for the vdev label's idea of the configuration to be out of sync with the configuration stored within the pool, known as the "spa config", depending on the timing of the writes and when the disk loses power.
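The "checksummed and written four times" idea can be shown with a toy model in Python. This is only an illustration of redundancy plus checksums: CRC32 is a stand-in, the data layout is invented, and the function names are hypothetical. The real label format and checksum are different.

```python
import zlib

LABEL_COPIES = 4  # the label is written in four places on the disk

def write_labels(config_bytes):
    """Return four (checksum, payload) copies of the same label."""
    return [(zlib.crc32(config_bytes), config_bytes) for _ in range(LABEL_COPIES)]

def read_label(copies):
    """Return the payload of the first copy whose checksum still matches."""
    for checksum, payload in copies:
        if zlib.crc32(payload) == checksum:
            return payload
    raise IOError("all label copies are corrupt")

copies = write_labels(b"pool=tank")
# Simulate one damaged copy: the stored checksum no longer matches.
copies[0] = (copies[0][0], b"garbage")
print(read_label(copies))
```

One damaged copy is skipped and the next intact copy is used, which is the whole point of keeping several.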

So, for step (1), as stated above we offline the disks and generate a list of disks to be split off. This list is written to the spa config. If we die at this point then, on reboot, the vdev labels have not yet been updated, so the split is incomplete. The remedy is to throw away the list and put the disks back online.

For step (2) we update the vdev labels on the offlined disks, and generate a new spa config for them. If we die at this point, then, on reboot, we see the vdev labels are updated, and we remove the disks from the original pool in order to complete the split.

For step (3) we just clean things up, removing the list of disks from the spa config, and removing the disks themselves from the in-core data structures.
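The recovery decision described in the three steps above boils down to two questions asked at reboot. Here is a minimal Python sketch of that decision, assuming the state can be summarised by two booleans; the function name and the returned strings are made up for illustration and are not taken from spa.c.

```python
def recover_split(split_list_in_spa_config, labels_updated):
    """Decide what to do on reboot after an interrupted split.

    split_list_in_spa_config: step (1) committed the list of disks.
    labels_updated:           step (2) rewrote the vdev labels.
    """
    if not split_list_in_spa_config:
        # Step (1) never committed (or step (3) already cleaned up).
        return "nothing to do"
    if not labels_updated:
        # Died after step (1): throw away the list, bring disks back.
        return "roll back: discard list, online disks"
    # Died after step (2): the new pool exists, so finish the job.
    return "roll forward: detach disks from original pool, remove list"

for state in [(False, False), (True, False), (True, True)]:
    print(state, "->", recover_split(*state))
```

Either way the pool ends up in a known state: fully un-split or fully split.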

So how does this actually work? You can see the heart of the code in spa.c, with the new function spa_vdev_split_mirror.

One tricky area is how we deal with log devices. Log (and "cache" and "spare") disks are not part of the split and normally this would not be a problem. However, due to the way block pointers work, it is possible to generate a configuration that cannot be easily split. Consider the following sequence of commands:

# zpool create tank mirror c0t0d0 c1t0d0
# zpool add tank log c0t0d1
# zpool add tank mirror c0t1d0 c1t1d0

For the first line, zfs creates a new pool. The mirror is the top-level vdev, and it has a vdev ID of 0. The second line adds a log device. This device is also a top-level vdev, and gets an ID of 1. Finally, the third line adds a mirror as a new stripe to the pool. It gets a vdev ID of 2. A zpool status command confirms this numbering:

# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
        logs
          c0t0d1    ONLINE       0     0     0

errors: No known data errors
#

What this means is that any block pointer pointing to data on the second stripe is going to have a vdev ID of 2. If the second and third commands were swapped -- in other words, if the stripe was added before the log device was added -- then the stripe would be "mirror-1" instead of "mirror-2", and the vdev ID would be 1. But we need to be able to handle both cases.

Why is this important? After a split, there will only be two stripes and no logs. Somehow, we need to ensure that these are vdev ID 0 and vdev ID 2, or all the block pointers that point to ID 2 will cause zfs to panic. How do we tell zfs to skip over ID 1?

With George Wilson's putback of CR 6574286, he introduced the concept of "holes". A hole is a top-level vdev that cannot be allocated from, and provides no data, but it takes up a slot in the vdev list. This made it possible for log devices to be removed, with the hole device taking over. The split code leverages this feature to do its form of log device removal, inserting holes in the new config wherever log devices are. And of course, it's smart about it: if the log device is the last device in the configuration, there's no need to put in a hole. This is done in libzfs, in the new function zpool_vdev_split. Look how the "lastlog" variable is used.
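The hole-insertion rule is easy to model. The Python below is a sketch of the idea only, not a transcription of zpool_vdev_split: the function name is hypothetical, vdevs are reduced to the strings "mirror" and "log", and dropping several trailing logs at once is a generalisation made for the sketch.

```python
def split_top_level_vdevs(top_level):
    """Build the new pool's top-level vdev list.

    top_level: list of "mirror" / "log" strings in vdev-ID order.
    Logs are not carried over; a "hole" keeps the slot so that later
    vdev IDs do not shift. Trailing logs need no hole at all.
    Assumes the pool has at least one non-log vdev.
    """
    # Index of the last non-log vdev; anything after it can be dropped.
    last_data = max(i for i, kind in enumerate(top_level) if kind != "log")
    return ["hole" if kind == "log" else kind
            for kind in top_level[:last_data + 1]]

# Log added between the two mirrors: slot 1 must be preserved.
print(split_top_level_vdevs(["mirror", "log", "mirror"]))
# Log added last: nothing after it depends on its slot.
print(split_top_level_vdevs(["mirror", "mirror", "log"]))
```

In the first case the second mirror keeps vdev ID 2 because a hole occupies ID 1; in the second case the list simply ends after the two mirrors.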

And that's it. That's splitting in a nutshell. Or at least a few nutshells. I'm guessing the information density in this post is pretty high, but splitting up zpools is rather complex. There are still things it doesn't do that would be nice to do in the future, such as splitting off mirrors from mirrors, or even rejoining a split config to its parent. I hope I get to work on those.

Posted at 04:30PM Jan 05, 2010 by Mark Musante in ZFS  |  Comments[3]

Thursday Dec 10, 2009
A ZFS Taxonomy
Nevada Build   Version Introduced   Features Added
36             1                    ZFS introduced*
38             2                    Ditto blocks for meta-data
42             3                    Hot spares
                                    Double-parity raidz
62             4                    ZFS command history
               5                    Added gzip compression
               6                    ZFS boot
68             7                    Added slogs
69             8                    Delegation
77             9                    CIFS support
                                    Filesystem-specific quotas
78             10                   L2-arc
94             11                   Improved scrub/resilver performance
96             12                   Snapshot user properties
98             13                   New "snapused" readonly property
103            14                   Added passthrough-x to aclinherit property
114            15                   User and group quotas
116            16                   STMF property for COMSTAR
120            17                   Triple-parity RAIDZ
121            18                   Snapshot Holds
125            19                   Slog Device Removal
128            20                   Zero-length encoding compression
               21                   Deduplication
               22                   Received Properties
* Note: while ZFS was nominally introduced in build 27, the on-disk format was still undergoing changes until build 36, despite the fact that they were all labelled "version 1".
Posted at 07:13PM Dec 10, 2009 by Mark Musante in ZFS  | 

Tuesday Dec 08, 2009
Backing up a zvol

Over at Spiceworks, Michael2024 asks, "Anybody know how to get rsync to backup a ZFS zvol?"

My response is: "That's the wrong question." In fact, someone replied to Michael2024 already saying that rsync was not the right tool, but no one suggested the best tool for backing up zvols: snapshots.

"But Mark," you say (because we're on first-name terms, and that is in fact my first name). "The snapshot is right there on the device that I'm trying to back up! How can that possibly help me?"

I'm glad you asked.

If you try to "back up" a zvol using a tool like dd, you're going to have to copy the whole volume, even the blocks that contain no data. But zvols are ZFS constructs which means they follow the copy-on-write paradigm which, in turn, means that ZFS needs to know what's data and what's not.

So that means that any snapshot will only contain the data that is actually on the disk. That's right: a snapshot of a 100TB volume that has 10MB of data will only contain those 10MB of data. And therefore, any "zfs send" stream will only contain real data and not a bunch of unwritten garbage.

To demonstrate, let's create a 100MB volume and snapshot it:

-bash-4.0# zfs create -V 100m tank/vol
-bash-4.0# zfs snapshot tank/vol@snap
How big is the send stream? Easy enough to check:
-bash-4.0# zfs send tank/vol@snap | wc -c
4256
Just a smidge over 4k. Let's write some data:
-bash-4.0# dd if=/dev/random of=/dev/zvol/rdsk/tank/vol bs=1k count=10
10+0 records in
10+0 records out
-bash-4.0# zfs snapshot tank/vol@snap2
-bash-4.0# zfs send tank/vol@snap2 | wc -c
21264
OK, we wrote 10k of data, and the send stream is 20k. With such a small amount of data, the overhead is about half the stream. But, what if we write to the same blocks again?
-bash-4.0# dd if=/dev/random of=/dev/zvol/rdsk/tank/vol bs=1k count=10
10+0 records in
10+0 records out
-bash-4.0# zfs snapshot tank/vol@snap3
-bash-4.0# zfs send tank/vol@snap3 | wc -c
21264
The exact same amount! So ZFS knows exactly how much data there is on the zvol. Let's write 1MB instead:
-bash-4.0# dd if=/dev/random of=/dev/zvol/rdsk/tank/vol bs=1k count=1024
1024+0 records in
1024+0 records out
-bash-4.0# zfs snapshot tank/vol@snap4
-bash-4.0# zfs send tank/vol@snap4 | wc -c
1092768
-bash-4.0#
And now the overhead is quite a bit smaller than the data, around 4%.
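The overhead figures can be checked with a little arithmetic. This Python snippet uses only the stream sizes printed by wc -c above and the amount of data written by dd; the helper name overhead is made up for the example.

```python
def overhead(stream_bytes, data_bytes):
    """Return (extra bytes, extra bytes as a fraction of the stream)."""
    extra = stream_bytes - data_bytes
    return extra, float(extra) / stream_bytes

runs = [("10k", 21264, 10 * 1024),
        ("1MB", 1092768, 1024 * 1024)]
for label, stream, data in runs:
    extra, frac = overhead(stream, data)
    print("%s: %d bytes of overhead, %.1f%% of the stream" % (label, extra, 100 * frac))
```

For the 10k write that is 11024 bytes, or 51.8% of the stream; for the 1MB write it is 44192 bytes, or 4.0% of the stream.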

The question then is: which is more efficient? Doing a full block-by-block copy using something "higher up in the stack" (quoting from Michael2024 there), or creating another pool and doing a "zfs send | zfs recv"? On top of that, add the under-appreciated feature of incremental send streams, and you have a full backup solution that does not require any external tools.
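For the incremental case, the pipeline is a send of the difference between two snapshots, fed into a receive on the other pool. The Python helper below only builds the command line as a string so you can see its shape; it does not run anything. The helper name and the target "backup/vol" (a dataset on a second pool) are assumptions for the example, and it presumes the earlier snapshot has already been received on the target.

```python
def incremental_backup_cmd(dataset, prev_snap, new_snap, target):
    """Build a `zfs send -i ... | zfs recv ...` pipeline as a string.

    Sends only the changes between prev_snap and new_snap of `dataset`
    and receives them into `target`.
    """
    return "zfs send -i %s@%s %s@%s | zfs recv %s" % (
        dataset, prev_snap, dataset, new_snap, target)

print(incremental_backup_cmd("tank/vol", "snap3", "snap4", "backup/vol"))
```

Only the blocks that changed between snap3 and snap4 travel down the pipe, which is what makes regular backups cheap.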

I would respond on the Spiceworks website, but alas it is both members-only and requires you to download a Windows client just to register. Lame!

Posted at 06:52PM Dec 08, 2009 by Mark Musante in ZFS  |  Comments[4]

Thursday Mar 26, 2009
Cloning Humans: Bad; Cloning Filesystems: Good

Imagine you could snapshot yourself and generate a clone from that snapshot. On the surface, that would seem like a cool idea, but scratch the surface and you quickly run into problems. These problems could be ethical (what rights does your clone have to your own possessions?), or they could be practical (how can the planet sustain the population explosion?). Fortunately, with zfs datasets, snapshotting and cloning not only are permitted, they are actively encouraged.

Over at the Xerces Tech site, a recent article outlines how to use zfs clones in order to safely do an apt-get on their Nexenta box. Take a look at the article called Unbreakable upgrades, ZFS and apt-get to see how it's done.

Posted at 05:00PM Mar 26, 2009 by Mark Musante in ZFS  | 

Sunday Feb 15, 2009
ZFS Mandates
Making ZFS required for court and police NAS/SAN devices is one thing, but it would be nice if people used ZFS because it is so obviously better, not because its use is mandated.
Posted at 06:18PM Feb 15, 2009 by Mark Musante in ZFS  | 

Wednesday Jan 28, 2009
Building a ZFS server

Have you ever wanted to try building your own ZFS-based file server? Jermaine Maree has done just that, and blogged about it. Start with part 1.

Posted at 03:02PM Jan 28, 2009 by Mark Musante in ZFS  |  Comments[1]

Thursday Jan 01, 2009
Don't shout at your JBODs
They don't like it!
Posted at 07:14PM Jan 01, 2009 by Mark Musante in Storage  | 

Tuesday Dec 23, 2008
Environmental Disaster
A TVA ash pond has flooded the Tennessee River valley. This is a huge disaster, but I didn't hear about it through the normal news channels. Why not?
Posted at 08:38PM Dec 23, 2008 by Mark Musante in Personal  |  Comments[1]

Wednesday Nov 19, 2008
ZFS can boost performance

Even a suboptimal configuration can result in a performance boost. The most interesting thing, I think, is the ease with which the zpool was created.

I wonder what kind of performance numbers this user would see with the 7110 compared with the Dell PowerVault. The 7110 can hold 14 146GB SAS drives, whereas the Dell uses 14 146GB SCSI drives, so comparing power utilization would be interesting as well.

Posted at 03:24AM Nov 19, 2008 by Mark Musante in ZFS  | 

Thursday Oct 23, 2008
ZFS User Directories on OS X
Check out this blog post on setting up OS X with zfs-based user directories.
Posted at 02:58AM Oct 23, 2008 by Mark Musante in ZFS  | 

Tuesday Oct 07, 2008
Why it's going to take me forever...

One of the things I'm trying to do, besides holding down a full-time job at Sun and raising three kids (another full-time job, even with my wife's help), is to learn Japanese. I'm not in any great rush, so I make time where I can and I try to spend at least a few minutes a day on it. I've started by trying to learn the kana writing system before I move on to the more complex kanji. There are two primary kinds of kana, hiragana and katakana. The former is used when writing Japanese, and the latter is used when writing "foreign" words, including words that have become part of the Japanese language but were borrowed from, for example, English.

Sun, as you're probably aware, encourages blogging from its employees around the globe, and more than a few of these are from Japan. Here's an example of one: キャンパス アンバサダ (I hope that came through).

Here's how the entry looks in my browser - I'll include it here in case your browser doesn't show the characters correctly

What I've been doing, in order to practice my kana, is to pick out the katakana symbols from the entries and see if I can work out what word it is in English. Here's the title of the entry I linked to, broken down kana by kana:

Kana   English pronunciation
first word
キャ   kya
ン     n or m
パ     pa
ス     su
second word
ア     a
ン     n or m
バ     ba
サ     sa
ダ     da

Spelling it phonetically, we get kyanpasu or kyampasu for the first word, and anbasada or ambasada for the second. And I was puzzled. Contrast that with another katakana word that appears in the blog entry: "オリエンテーション", which is o-ri-e-n-teh-sho-n ... orientehshon ... orientation. Obvious, right? What's kyanpasu? Words that end with ス tend to have the final 'oo' sound dropped, so it becomes kyanpas when saying it aloud. I give up, so Google Translate tells me: Campus. This is where I hit my head on the desk. Why does Campus start with キャ? The a in kya sounds like the a in father, not the a in campus, so maybe that's why kya is used to differentiate it from the カ character ('ka') which also sounds like the a in father?

Of course, once I saw the first word was 'campus', it was easy to figure out that the second word is Ambassador. Which I would have spelled, apparently incorrectly, アンバサドル.

Posted at 03:01PM Oct 07, 2008 by Mark Musante in Hobbies  | 

Tuesday Aug 19, 2008
Cache on hand

When we think of a cache, we think of a way of storing information "closer" to the place it's needed. Most general-purpose CPUs, for example, have an on-board cache which is used to avoid accessing main memory - after all, the memory that's on the same die as the CPU is going to be quicker to access than the RAM chips. Filesystems use caches of RAM to make disk accesses appear to be quicker, as RAM chips are much faster than moving a mechanical arm across a spinning disk. If we're lucky, and if we've got a good caching algorithm, we can get an impressive speed boost by keeping the right bits in RAM. Likewise, CPUs get a speed boost by keeping the right instructions and data on chip.
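As an illustration of what "keeping the right bits" can mean, here is a small least-recently-used cache in Python. LRU is just one common policy, picked for the example; nothing here claims to be what any particular CPU, filesystem or browser does, and the class and parameter names are made up.

```python
from collections import OrderedDict

class LRUCache:
    """Keep at most `capacity` items, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key, load):
        """Return the cached value, calling load(key) only on a miss."""
        if key in self.items:
            # Hit: mark the entry as most recently used.
            self.items.move_to_end(key)
            return self.items[key]
        # Miss: fetch from the slow backing store and remember it.
        value = load(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            # Evict the entry that has gone unused the longest.
            self.items.popitem(last=False)
        return value

slow_reads = []
def read_block(n):
    slow_reads.append(n)   # stands in for a disk or network access
    return n * 2

cache = LRUCache(2)
for block in [1, 2, 1, 3, 1]:
    cache.get(block, read_block)
print(slow_reads)
```

Five requests cost only three slow reads here, because block 1 stays in the cache while block 2 is evicted to make room for block 3.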

Caching is not limited to CPUs and filesystems, of course. Most browsers maintain a cache of pages, of images, of CSS files, of JavaScript, and of any other bit of information that is useful for displaying web pages. By using a local on-disk cache (some of which is going to be in RAM anyway, thanks to the filesystem), browsing appears to be much quicker than it would if the browser had to re-load every single image from a distant web site. The browser does check to see if any files need to be retrieved again (see HTTP's 304 Not Modified response code), so there is some over-the-wire activity, but that's about it.

All of this is a long-winded way of saying I was amused by the second bullet item here (from Apple Insider):

Either the cache is poorly implemented, or the users reporting this information are confused.

Posted at 12:56PM Aug 19, 2008 by Mark Musante in Algorithms  | 

Wednesday Jul 16, 2008
Good n' plenty

This is a great example of doing well AND doing good. Proof positive they're not mutually exclusive.

Posted at 03:46PM Jul 16, 2008 by Mark Musante in Sun  | 

Monday Jun 16, 2008
ZFS In The Wild, Part 5

It's been over a year since I last posted sightings of ZFS around the web, so it's high time I offered another list.

Posted at 07:59PM Jun 16, 2008 by Mark Musante in ZFS  | 

Friday Jun 13, 2008
Time Flies

It was nearly a year ago that I first made this screenshot:

Since then, I have done quite a number of different things, all related in some way to getting zfs to install and boot. Some of these things also involved teaching Live Upgrade to understand zfs datasets.

But now I'm starting to see that screenshot elsewhere, virtually unchanged from that fateful day long ago when I used the original to help design the changes needed for the text based installer.

Here are some: The Sect of Rama | Number 9 | Otmanix' Blog | Osamu Sayama's Weblog

It's really exciting to see it get out there and for it to be used outside of the development and test teams. Of course, there are some CRs being filed, and there are some things we'll need to address, but it's great nevertheless.

Posted at 04:20AM Jun 13, 2008 by Mark Musante in Bloggers  |  Comments[1]