Tuesday Dec 23, 2008

A Different Approach

A week or so ago, I wrote about a way to get around the current limitation of mixing flash and ZFS root in Solaris 10 10/08. Well, here's a much better approach.

I was visiting with a customer last week and they were very excited to move forward quickly with ZFS boot in their Solaris 10 environment, even to the point of using this as a reason to encourage people to upgrade. However, when they realized that it was impossible to use Flash with Jumpstart and ZFS boot, they were disappointed. Their entire deployment infrastructure is built around using not just Flash, but Secure WANboot. This means that they have no alternative to Flash; the images deployed via Secure WANBoot are always flash archives. So, what to do?

It occurred to me that in general, the upgrade procedure from a pre-10/08 update of Solaris 10 to Solaris 10 10/08 with a ZFS root disk is a two-step process. First, you have to upgrade to Solaris 10 10/08 on UFS and then use lucreate to copy that environment to a new ZFS ABE. Why not use this approach in Jumpstart?

Turns out that it works quite nicely. This is a framework for how to do that. You likely will want to expand on it, since one thing this does not do is give you any indication of progress once it starts the conversion. Here's the general approach:

  • Create your flash archive for Solaris 10 10/08 as you usually would. Make sure you include all the appropriate LiveUpgrade patches in the flash archive.
  • Use Jumpstart to deploy this flash archive to one disk in the target system.
  • Use a finish script to add a conversion program to run when the system reboots for the first time. It is necessary to make this script run once the system has rebooted so that the LU commands run within the context of the fully built new system.

Details of this approach

Our goal when complete is to have the flash archive installed as it always has been, but to have it running from a ZFS root pool, preferably a mirrored ZFS pool. The conversion script requires two phases to complete this conversion. The first phase creates the ZFS boot environment and the second phase mirrors the root pool. In this example, our flash archive is called s10u6s.flar. We will install the initial flash archive onto the disk c0t1d0 and build our initial root pool on c0t0d0.

Here is the Jumpstart profile used in this example:


install_type    flash_install
archive_location nfs nfsserver:/export/solaris/Solaris10/flash/s10u6s.flar
partitioning    explicit
filesys         c0t1d0s1        1024    swap
filesys         c0t1d0s0        free    /

We specify a simple finish script for this system to copy our conversion script into place:

cp ${SI_CONFIG_DIR}/S99xlu-phase1 /a/etc/rc2.d/S99xlu-phase1

You see what we have done: We put a new script into place to run at the end of rc2 during the first boot. We name the script so that it is the last thing to run. The x in the name makes sure that this will run after other S99 scripts that might be in place. As it turns out, the luactivate that we will do puts its own S99 script in place, and we want to come after that. Naming ours S99x makes it happen later in the boot sequence.
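Here's a small sketch of why the x matters, assuming a POSIX shell such as bash or ksh. rc2 runs the S* scripts in the order the shell expands the glob, so a name that sorts later runs later. Every file name below other than S99xlu-phase1 is a made-up stand-in; in particular, S99lu-example is not the real name of the script that luactivate installs.

```shell
# Sketch: show where S99xlu-phase1 lands among other start scripts.
# All names except S99xlu-phase1 are hypothetical stand-ins.
LC_ALL=C; export LC_ALL            # sort by byte order, as at boot time
demo_dir=$(mktemp -d)              # scratch directory instead of /etc/rc2.d
for name in S99lu-example S99xlu-phase1 S72example S99example; do
    : > "$demo_dir/$name"          # create an empty placeholder script
done
# The shell expands S* in sorted order, which is the order rc2 would use.
rc_order=$(cd "$demo_dir" && echo S*)
echo "$rc_order"
rm -rf "$demo_dir"
```

With these names, S99xlu-phase1 comes out last because x sorts after e and l. If your system has S99 scripts starting with y or z, you would need a different prefix.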

So, what does this magic conversion script do? Let me outline it for you:

  • Create a new ZFS pool that will become our root pool
  • Create a new boot environment in that pool using lucreate
  • Activate the new boot environment
  • Add the script to be run during the second phase of the conversion
  • Clean up a bit and reboot

That's Phase 1. Phase 2 has its own script, run at the same point in the next boot, that finishes the mirroring of the root pool. If you are satisfied with a non-mirrored pool, you can stop here and leave phase 2 out. Or you might prefer to make this step a manual process once the system is built. But, here's what happens in Phase 2:

  • Delete the old boot environment
  • Add a boot block to the disk we just freed. This example is SPARC, so use installboot. For x86, you would do something similar with installgrub.
  • Attach the disk we freed from the old boot environment as a mirror of the device used to build the new root zpool.
  • Clean up and reboot.
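The boot block step is the only part of Phase 2 that differs between SPARC and x86. Here's a minimal sketch of picking the right command, assuming a POSIX shell. bootblock_cmd is a hypothetical helper that only prints the command; the installgrub paths are the usual stage1 and stage2 locations on Solaris 10 x86, so check them on your own system before relying on this.

```shell
# Sketch: print the boot block command for a disk slice on either architecture.
# bootblock_cmd is a hypothetical helper. arch is what `uname -p` reports and
# platform is what `uname -i` reports on the target system.
bootblock_cmd() {
    arch=$1 platform=$2 slice=$3
    case "$arch" in
        sparc)
            echo "installboot -F zfs /usr/platform/$platform/lib/fs/zfs/bootblk /dev/rdsk/$slice" ;;
        i386)
            echo "installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/$slice" ;;
        *)
            echo "unsupported architecture: $arch" >&2
            return 1 ;;
    esac
}

bootblock_cmd sparc SUNW,Sun-Fire-480R c0t1d0s0
bootblock_cmd i386 i86pc c0t1d0s0
```

On the target itself you would call it as bootblock_cmd `uname -p` `uname -i` c0t1d0s0 and run the line it prints.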

I have been thinking it might be worthwhile to add a third phase to start a zpool scrub, which will force the newly attached drive to be resilvered when it reboots. The first time something goes to use this drive, it will notice that it has not been synced to the master drive and will resilver it, so this is sort of optional.

The reason we add bootability explicitly to this drive is because currently, when a mirror is attached to a root zpool, a boot block is not automatically installed. If the master drive were to fail and you were left with only the mirror, this would leave the system unbootable. By adding a boot block to it, you can boot from either drive.

So, here's my simple little script that got installed as /etc/rc2.d/S99xlu-phase1. Just to make the code a little easier for me to follow, I first create the script for phase 2, then do the work of phase 1.


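# Phase 2 is written out first. The heredoc is unquoted, so `uname -i`
# is expanded now, while phase 1 runs, not when phase 2 runs.
# Phase 2: delete the UFS BE, make c0t1d0s0 bootable, attach it as a
# mirror of the root pool, remove itself, and reboot.
cat > /etc/rc2.d/S99xlu-phase2 << EOF
ludelete -n s10u6-ufs
installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c0t1d0s0
zpool attach -f rpool c0t0d0s0 c0t1d0s0
rm /etc/rc2.d/S99xlu-phase2
init 6
EOF

# Phase 1: point the dump device at swap, create the root pool on the
# free disk, copy the running UFS BE into it, activate it, and reboot.
dumpadm -d swap
zpool create -f rpool c0t0d0s0
lucreate -c s10u6-ufs -n s10u6 -p rpool
luactivate -n s10u6
rm /etc/rc2.d/S99xlu-phase1
init 6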
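As mentioned earlier, the script gives no indication of progress. Here's one way that could be added, as a minimal sketch assuming a POSIX shell. The step function, the LOG path, and the DRYRUN switch are all hypothetical names introduced for illustration. With DRYRUN=1 (the default here) nothing is executed, which makes it safe to try on any machine.

```shell
# Sketch: run each conversion step with a progress message and stop on failure.
# LOG, DRYRUN and step are hypothetical names. DRYRUN=1 only prints commands.
LOG=${LOG:-/var/tmp/xlu-convert.log}
DRYRUN=${DRYRUN:-1}

step() {
    desc=$1; shift
    echo "xlu: $desc" | tee -a "$LOG"
    if [ "$DRYRUN" = 1 ]; then
        echo "    would run: $*" | tee -a "$LOG"
        return 0
    fi
    # Run the command, keep its output in the log, and bail out if it fails.
    "$@" >> "$LOG" 2>&1 || { echo "xlu: FAILED: $desc" | tee -a "$LOG"; exit 1; }
}

step "creating root pool"         zpool create -f rpool c0t0d0s0
step "copying UFS BE into rpool"  lucreate -c s10u6-ufs -n s10u6 -p rpool
step "activating new BE"          luactivate -n s10u6
```

Where the messages become visible during boot depends on how rc script output is routed on your system, so the log file is the dependable record. Stopping on the first failure also keeps the script from rebooting into a half-built environment.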

I think that this is a much better approach than the one I offered before, using ZFS send. This approach uses standard tools to create the new environment and it allows you to continue to use Flash as a way to deploy archives. The dependency is that you must have two drives on the target system. I think that's not going to be a hardship, since most folks will use two drives anyway. You will have to keep them as separate drives rather than using hardware mirroring. The underlying assumption is that you previously used SVM or VxVM to mirror those drives.

So, what do you think? Better? Is this helpful? Hopefully, this is a little Christmas present for someone! Merry Christmas and Happy New Year!

Friday Dec 05, 2008

Ancient History

Gather round kiddies and let Grandpa tell you a tale of how we used to clone systems before we had Jumpstart and Flash, when we had to carry water in leaky buckets 3 miles through snow up to our knees, uphill both ways.

Long ago, a customer of mine needed to deploy 600(!) SPARCstation 5 desktops all running SunOS 4.1.4. Even then, this was an old operating system, since Solaris 2.6 had recently been released. But it was what their application required. And we only had a few days to build and deploy these systems.

Remember that Jumpstart did not exist for SunOS 4.1.4, and Flash did not exist for Solaris 2.6. So, our approach was to build a system, a golden image, the way we wanted it to be deployed and then use ufsdump to save the contents of the filesystems. Then, we were able to use Jumpstart from a Solaris 2.6 server to boot each of these workstations. Instead of having a Jumpstart profile, we only used a finish script that partitioned the disks and restored the ufsdump images. So Jumpstart just provided us a clean way to boot these systems and apply the scripts we wanted to them.

Solaris 10 10/08, ZFS, Jumpstart and Flash

Now, we have a bit of a similar situation. Solaris 10 10/08 introduces ZFS boot to Solaris, something that many of my customers have been anxiously awaiting for some time. A system can be deployed using Jumpstart and the ZFS boot environment created as a part of the Jumpstart process.

But. There's always a but, isn't there?

But, at present, Flash archives are not supported (and in fact do not work) as a way to install into a ZFS boot environment, either via Jumpstart or via Live Upgrade. Turns out, they use the same mechanism under the covers for this. This is CR 6690473.

So, how can I continue to use Jumpstart to deploy systems, and continue to use something akin to Flash archives to speed and simplify the process?

Turns out the lessons we learned years ago can be used, more or less. Combine the idea of the ufsdump with some of the ideas that Bob Netherton recently blogged about (Solaris and OpenSolaris coexistence in the same root zpool), and you can get to a workaround that might be useful enough to get you through until Flash really is supported with ZFS root.

Build a "Golden Image" System

The first step, as with Flash, is to construct a system that you want to replicate. The caveat here is that you use ZFS for the root of this system. For this example, I have left /var as part of the root filesystem rather than a separate dataset, though this process could certainly be tweaked to accommodate a separate /var.

Once the system to be cloned has been built, you save an image of the system. Rather than using flarcreate, you will create a ZFS send stream and capture this in a file. Then move that file to the jumpstart server, just as you would with a flash archive.

In this example, the ZFS bootfs has the default name - rpool/ROOT/s10s_u6wos_07.


golden# zfs snapshot rpool/ROOT/s10s_u6wos_07@flar
golden# zfs send -v rpool/ROOT/s10s_u6wos_07@flar > s10s_u6wos_07_flar.zfs
golden# scp s10s_u6wos_07_flar.zfs js-server:/flashdirectory
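If your bootfs has a different name, the same three commands can be generated from it. This is a minimal sketch, assuming a POSIX shell. capture_plan is a hypothetical helper that only prints the commands, and the naming rule for the stream file (last path component plus the snapshot name) is just the convention used in this example.

```shell
# Sketch: print the commands that capture a boot environment as a send stream.
# capture_plan is a hypothetical helper; nothing is executed here.
capture_plan() {
    bootfs=$1 snap=$2 dest=$3
    # Name the stream file after the last component of the dataset name.
    stream="$(basename "$bootfs")_${snap}.zfs"
    echo "zfs snapshot $bootfs@$snap"
    echo "zfs send -v $bootfs@$snap > $stream"
    echo "scp $stream $dest"
}

capture_plan rpool/ROOT/s10s_u6wos_07 flar js-server:/flashdirectory
```

Review the printed lines and, once they look right, pipe them to sh on the golden system.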

How do I get this on my new server?

Now, we have to figure out how to have this ZFS send stream restored on the new clone systems. We would like to take advantage of the fact that Jumpstart will create the root pool for us, along with the dump and swap volumes, and will set up all of the needed bits for booting from ZFS. So, let's install the minimum set of Solaris packages just to get these side effects.

Then, we will use Jumpstart finish scripts to create a fresh ZFS dataset and restore our saved image into it. Since this new dataset will contain the old identity of the original system, we have to reset our system identity. But once we do that, we are good to go.

So, set up the cloned system as you would for a hands-free jumpstart. Be sure to specify the sysid_config and install_config bits in the /etc/bootparams. The manual Solaris 10 10/08 Installation Guide: Custom JumpStart and Advanced Installations covers how to do this. We add to the rules file a finish script (I called mine loadzfs in this case) that will do the heavy lifting. Once Jumpstart installs Solaris according to the profile provided, it then runs the finish script to finish up the installation.

Here is the Jumpstart profile I used. This is a basic profile that installs the base, required Solaris packages into a ZFS pool mirrored across two drives.


install_type    initial_install
cluster         SUNWCreq
system_type     standalone
pool            rpool auto auto auto mirror c0t0d0s0 c0t1d0s0
bootenv         installbe bename s10u6_req

The finish script is a little more interesting since it has to create the new ZFS dataset, set the right properties, fill it up, reset the identity, etc. Below is the finish script that I used.


#!/bin/sh -x

# TBOOTFS is a temporary dataset used to receive the stream
TBOOTFS=rpool/ROOT/s10u6_rcv

# NBOOTFS is the final name for the new ZFS dataset
NBOOTFS=rpool/ROOT/s10u6f

MNT=/tmp/mntz
FLAR=s10s_u6wos_07_flar.zfs
NFS=serverIP:/export/solaris/Solaris10/flash

# Mount directory where archive (send stream) exists
mkdir ${MNT}
mount -o ro -F nfs ${NFS} ${MNT}

# Create file system to receive ZFS send stream &
# receive it.  This creates a new ZFS snapshot that
# needs to be promoted into a new filesystem
zfs create ${TBOOTFS}
zfs set canmount=noauto ${TBOOTFS}
zfs set compression=on ${TBOOTFS}
zfs receive -vF ${TBOOTFS} < ${MNT}/${FLAR}

# Create a writeable filesystem from the received snapshot
zfs clone ${TBOOTFS}@flar ${NBOOTFS}

# Make the new filesystem the top of the stack so it is not dependent
# on other filesystems or snapshots
zfs promote ${NBOOTFS}

# Don't automatically mount this new dataset, but allow it to be mounted
# so we can finalize our changes.
zfs set canmount=noauto ${NBOOTFS}
zfs set mountpoint=${MNT} ${NBOOTFS}

# Mount newly created replica filesystem and set up for
# sysidtool.  Remove old identity and provide new identity
umount ${MNT}
zfs mount ${NBOOTFS}

# This section essentially forces sysidtool to reset system identity at
# the next boot.
touch /a/${MNT}/reconfigure
touch /a/${MNT}/etc/.UNCONFIGURED
rm /a/${MNT}/etc/nodename
rm /a/${MNT}/etc/.sysIDtool.state
cp ${SI_CONFIG_DIR}/sysidcfg /a/${MNT}/etc/sysidcfg

# Now that we have finished tweaking things, unmount the new filesystem
# and make it ready to become the new root.
zfs umount ${NBOOTFS}
zfs set mountpoint=/ ${NBOOTFS}
zpool set bootfs=${NBOOTFS} rpool

# Get rid of the leftovers
zfs destroy ${TBOOTFS}
zfs destroy ${NBOOTFS}@flar
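The finish script above does not check for errors, so if the NFS mount fails or the archive is missing, it carries on and points bootfs at a dataset that was never filled. One possible guard is sketched below. check_stream is a hypothetical helper, not part of Jumpstart, and it only confirms that the file is readable and non-empty, not that it is a valid send stream.

```shell
# Sketch: refuse to continue if the send stream is missing or empty.
# check_stream is a hypothetical helper name.
check_stream() {
    if [ ! -r "$1" ]; then
        echo "cannot read send stream: $1" >&2
        return 1
    fi
    if [ ! -s "$1" ]; then
        echo "send stream is empty: $1" >&2
        return 1
    fi
    return 0
}

# In the finish script this would sit right after the NFS mount:
#   check_stream ${MNT}/${FLAR} || exit 1
```

Placing the check before the first zfs create means a bad archive leaves the freshly installed SUNWCreq environment untouched and still bootable.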

When we jumpstart the system, Solaris is installed, but it really isn't used. Then, we load from the send stream a whole new OS dataset, make it bootable, set our identity in it, and use it. When the system is booted, Jumpstart still takes care of updating the boot archives in the new bootfs.

On the whole, this is a lot more work than Flash, and is really not as flexible or as complete. But hopefully, until Flash is supported with a ZFS root and Jumpstart, this might at least give you an idea of how you can replicate systems and do installations that do not have to revert back to package-based installation.

Many people use Flash as a form of disaster recovery. I think that this same approach might be used there as well. Still not as clean or complete as Flash, but it might work in a pinch.

So, what do you think? I would love to hear comments on this as a stop-gap approach.

Friday Dec 01, 2006

Continuing with some of the ideas around zvols, I wondered about UFS on a zvol.  On the surface, this appears to be sort of redundant and not really very sensible.  But thinking about it, there are some real advantages.

  • I can take advantage of the data integrity and self-healing features of ZFS since this is below the filesystem layer.
  • I can easily create new volumes for filesystems and grow existing ones
  • I can make snapshots of the volume, sharing the ZFS snapshot flexibility with UFS - very cool
  • In the future, I should be able to do things like have an encrypted UFS (sort-of) and secure deletion

Creating UFS filesystems on zvols

Creating a UFS filesystem on a zvol is pretty trivial.  In this example, we'll create a mirrored pool and then build a UFS filesystem in a zvol.

bash-3.00# zpool create p mirror c2t10d0 c2t11d0 mirror c2t12d0 c2t13d0
bash-3.00# zfs create -V 2g p/v1
bash-3.00# zfs list
NAME     USED  AVAIL  REFER  MOUNTPOINT
p       4.00G  29.0G  24.5K  /p
p/v1    22.5K  31.0G  22.5K  -
bash-3.00# newfs /dev/zvol/rdsk/p/v1
newfs: construct a new file system /dev/zvol/rdsk/p/v1: (y/n)? y
Warning: 2082 sector(s) in last cylinder unallocated
/dev/zvol/rdsk/p/v1:    4194270 sectors in 683 cylinders of 48 tracks, 128 sectors
        2048.0MB in 43 cyl groups (16 c/g, 48.00MB/g, 11648 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
3248288, 3346720, 3445152, 3543584, 3642016, 3740448, 3838880, 3937312,
4035744, 4134176
bash-3.00# mkdir /fs1
bash-3.00# mount /dev/zvol/dsk/p/v1 /fs1
bash-3.00# df -h /fs1
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/p/v1     1.9G   2.0M   1.9G     1%    /fs1

Nothing much to it. 
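The one thing to keep straight is which device node goes to which command: newfs wants the raw device under /dev/zvol/rdsk, while mount wants the block device under /dev/zvol/dsk. Here's a tiny sketch of that mapping, assuming a POSIX shell. zvol_raw and zvol_blk are hypothetical helper names, and the commands are only printed.

```shell
# Sketch: map a zvol dataset name to its two device nodes.
# zvol_raw and zvol_blk are hypothetical helper names.
zvol_raw() { echo "/dev/zvol/rdsk/$1"; }   # character device: newfs, growfs
zvol_blk() { echo "/dev/zvol/dsk/$1"; }    # block device: mount, swap

vol=p/v1
echo "newfs $(zvol_raw "$vol")"
echo "mount $(zvol_blk "$vol") /fs1"
```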

Growing UFS filesystems on zvols

But, what if I run out of space?  Well, just as you can add disks to a volume and grow the size of the volume, you can grow the size of a zvol.  Now, since the UFS filesystem is a data structure inside the zvol container, you have to grow it as well.  Were I using just ZFS, the size of the file system would grow and shrink dynamically with the size of the data in the file system.  But a UFS has a fixed size, so it has to be expanded manually to accommodate the enlarged volume.  Now, this seems to have quit working between b45 and b53, so I just filed a bug on this one.

bash-3.00# uname -a
SunOS atl-sewr-158-154 5.11 snv_45 sun4u sparc SUNW,Sun-Fire-480R
bash-3.00# zfs create -V 1g bsd/v1
bash-3.00# newfs /dev/zvol/rdsk/bsd/v1
...
bash-3.00# zfs set volsize=2g bsd/v1
bash-3.00# growfs /dev/zvol/rdsk/bsd/v1
Warning: 2048 sector(s) in last cylinder unallocated
/dev/zvol/rdsk/bsd/v1:  4194304 sectors in 683 cylinders of 48 tracks, 128 sectors
        2048.0MB in 49 cyl groups (14 c/g, 42.00MB/g, 20160 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 86176, 172320, 258464, 344608, 430752, 516896, 603040, 689184, 775328,
3359648, 3445792, 3531936, 3618080, 3704224, 3790368, 3876512, 3962656,
4048800, 4134944
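The two steps always go together, volume first and filesystem second, so they can be wrapped up. This is a minimal sketch, assuming a POSIX shell. grow_plan is a hypothetical helper that only prints the commands. The transcript above grows an unmounted filesystem; the -M form is the growfs option for a mounted one, which I have not tried on a zvol, so treat that branch as an assumption.

```shell
# Sketch: print the commands to grow a zvol and the UFS filesystem inside it.
# grow_plan is a hypothetical helper; the third argument is optional.
grow_plan() {
    vol=$1 size=$2 mnt=$3
    echo "zfs set volsize=$size $vol"                 # grow the container first
    if [ -n "$mnt" ]; then
        echo "growfs -M $mnt /dev/zvol/rdsk/$vol"     # filesystem is mounted
    else
        echo "growfs /dev/zvol/rdsk/$vol"             # filesystem is unmounted
    fi
}

grow_plan bsd/v1 2g
```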

What about compression? 

Along the same lines as growing the file system, I suppose you could turn compression on for the zvol.  But since the UFS is of fixed size, it won't help especially, as far as fitting more data in the file system.  You can't put more into the filesystem than the filesystem thinks that it can hold.  Even if it isn't using that much on the disk.  Here's a little demonstration of that.

First, we will loop through, creating 200MB files in a 1GB file system with no compression.  We will use blocks of zeros, since these will compress quite a bit the second time round. 

bash-3.00# zfs create -V 1g p/v1
bash-3.00# zfs get used,volsize,compressratio p/v1
NAME  PROPERTY       VALUE    SOURCE
p/v1  used           22.5K    -
p/v1  volsize        1G       -
p/v1  compressratio  1.00x    -
bash-3.00# newfs /dev/zvol/rdsk/p/v1
...
bash-3.00# mount /dev/zvol/dsk/p/v1 /fs1
bash-3.00#
bash-3.00# for f in f1 f2 f3 f4 f5 f6 f7 ; do
> dd if=/dev/zero bs=1024k count=200 of=/fs1/$f
> df -h /fs1
> zfs get used,volsize,compressratio p/v1
> done

200+0 records in
200+0 records out
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/p/v1     962M   201M   703M    23%    /fs1
NAME  PROPERTY       VALUE    SOURCE
p/v1  used           62.5M    -
p/v1  volsize        1G       -
p/v1  compressratio  1.00x    -
200+0 records in
200+0 records out
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/p/v1     962M   401M   503M    45%    /fs1
NAME  PROPERTY       VALUE    SOURCE
p/v1  used           149M     -
p/v1  volsize        1G       -
p/v1  compressratio  1.00x    -
200+0 records in
200+0 records out
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/p/v1     962M   601M   303M    67%    /fs1
NAME  PROPERTY       VALUE    SOURCE
p/v1  used           377M     -
p/v1  volsize        1G       -
p/v1  compressratio  1.00x    -
200+0 records in
200+0 records out
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/p/v1     962M   801M   103M    89%    /fs1
NAME  PROPERTY       VALUE    SOURCE
p/v1  used           497M     -
p/v1  volsize        1G       -
p/v1  compressratio  1.00x    -
dd: unexpected short write, wrote 507904 bytes, expected 1048576
161+0 records in
161+0 records out
Dec  1 14:53:04 atl-sewr-158-122 ufs: NOTICE: alloc: /fs1: file system full

bash-3.00# zfs get used,volsize,compressratio p/v1
NAME  PROPERTY       VALUE    SOURCE
p/v1  used           1.00G    -
p/v1  volsize        1G       -
p/v1  compressratio  1.00x    -
bash-3.00#

So, you see that it fails as it writes the 5th 200MB chunk, which is what you would expect.  Now, let's do the same thing with compression turned on for the volume.

bash-3.00# zfs create -V 1g p/v2
bash-3.00# zfs set compression=on p/v2
bash-3.00# newfs /dev/zvol/rdsk/p/v2
...
bash-3.00#
bash-3.00# mount /dev/zvol/dsk/p/v2 /fs2
bash-3.00# for f in f1 f2 f3 f4 f5 f6 f7 ; do
> dd if=/dev/zero bs=1024k count=200 of=/fs2/$f
> df -h /fs2
> zfs get used,volsize,compressratio p/v2
> done
200+0 records in
200+0 records out
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/p/v2     962M   201M   703M    23%    /fs2
NAME  PROPERTY       VALUE    SOURCE
p/v2  used           8.58M    -
p/v2  volsize        1G       -
p/v2  compressratio  7.65x    -
200+0 records in
200+0 records out
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/p/v2     962M   401M   503M    45%    /fs2
NAME  PROPERTY       VALUE    SOURCE
p/v2  used           8.58M    -
p/v2  volsize        1G       -
p/v2  compressratio  7.65x    -
200+0 records in
200+0 records out
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/p/v2     962M   601M   303M    67%    /fs2
NAME  PROPERTY       VALUE    SOURCE
p/v2  used           8.83M    -
p/v2  volsize        1G       -
p/v2  compressratio  7.50x    -
200+0 records in
200+0 records out
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/p/v2     962M   801M   103M    89%    /fs2
NAME  PROPERTY       VALUE    SOURCE
p/v2  used           8.83M    -
p/v2  volsize        1G       -
p/v2  compressratio  7.50x    -
dd: unexpected short write, wrote 507904 bytes, expected 1048576
161+0 records in
161+0 records out
Dec  1 15:16:42 atl-sewr-158-122 ufs: NOTICE: alloc: /fs2: file system full

bash-3.00# zfs get used,volsize,compressratio p/v2
NAME  PROPERTY       VALUE    SOURCE
p/v2  used           9.54M    -
p/v2  volsize        1G       -
p/v2  compressratio  7.07x    -
bash-3.00# df -h /fs2
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/p/v2     962M   962M     0K   100%    /fs2
bash-3.00#

This time, even though the volume was not using much space at all, the file system was full.  So compression in this case is not especially valuable from a space management standpoint.  Depending on the contents of the filesystem, though, compression may still help performance by converting multiple I/Os into single or fewer I/Os.
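To put a number on the mismatch, here is the arithmetic on the final figures from the compressed run: zfs reports 9.54M used against a 1G (1024M) volsize, while df reports the UFS as 100% full. A minimal sketch, assuming a POSIX shell with awk; pool_pct is a hypothetical helper name.

```shell
# Sketch: percentage of the volume size actually allocated in the pool.
# pool_pct is a hypothetical helper; both arguments are in megabytes.
pool_pct() {
    awk -v used="$1" -v size="$2" 'BEGIN { printf "%.2f\n", used / size * 100 }'
}

# 9.54M used (zfs get used) against a 1024M volsize (zfs get volsize)
echo "zvol holds $(pool_pct 9.54 1024)% of its volsize while UFS reports 100% full"
```

So the pool has allocated under one percent of the volume, yet UFS cannot accept another block.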

The Cool Stuff - Snapshots and Clones with UFS on Zvols

One of the things that is not available in UFS is the ability to create multiple snapshots quickly and easily.  The fssnap(1M) command allows me to create a single, read-only snapshot of a UFS file system.  In addition, it requires an additional location to maintain backing store for files changed or deleted in the master image during the lifetime of  the snapshot.

ZFS offers the ability to create many snapshots of a ZFS filesystem quickly and easily.  This ability extends to zvols, as it turns out.

For this example, we will create a volume, fill it up with some data and then play around with taking some snapshots of it.  We will just tar over the Java JDK so there are some files in the file system. 

bash-3.00# zfs create -V 2g p/v1
bash-3.00# newfs /dev/zvol/rdsk/p/v1
...
bash-3.00# mount /dev/zvol/dsk/p/v1 /fs1
bash-3.00# tar cf -  ./jdk/ | (cd /fs1 ; tar xf - )
bash-3.00# df -h /fs1
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/p/v1     1.9G   431M   1.5G    23%    /fs1
bash-3.00# zfs list
NAME     USED  AVAIL  REFER  MOUNTPOINT
p       4.00G  29.0G  24.5K  /p
p/swap  22.5K  31.0G  22.5K  -
p/v1     531M  30.5G   531M  -

Now, we will create a snapshot of the volume, just like for any other ZFS file system.  As it turns out, this creates new device nodes in /dev/zvol for the block and character devices.  We can mount them as UFS file systems same as always.

bash-3.00# zfs snapshot p/v1@s1  # Make the snapshot
bash-3.00# zfs list # See that it's really there
NAME      USED  AVAIL  REFER  MOUNTPOINT
p        4.00G  29.0G  24.5K  /p
p/swap   22.5K  31.0G  22.5K  -
p/v1      531M  30.5G   531M  -
p/v1@s1      0      -   531M  -
bash-3.00# mkdir /fs1-s1
bash-3.00# mount  /dev/zvol/dsk/p/v1@s1 /fs1-s1 # Mount it
mount: /dev/zvol/dsk/p/v1@s1 write-protected # Snapshots are read-only, so this fails
bash-3.00# mount -o ro  /dev/zvol/dsk/p/v1@s1 /fs1-s1 # Mount again read-only
bash-3.00# df -h /fs1-s1 /fs1
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/p/v1@s1
                       1.9G   431M   1.5G    23%    /fs1-s1
/dev/zvol/dsk/p/v1     1.9G   431M   1.5G    23%    /fs1
bash-3.00#

At this point /fs1-s1 is a read-only snapshot of /fs1.  If I delete files, create files, or change files in /fs1, that change will not be reflected in /fs1-s1.
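Since a snapshot device can only be mounted read-only, a mount wrapper can pick the option from the device name. Here's a minimal sketch, assuming a POSIX shell. ufs_mount_cmd is a hypothetical helper that only prints the mount command; it treats any device path containing an @ as a snapshot.

```shell
# Sketch: choose mount options based on whether the zvol device is a snapshot.
# ufs_mount_cmd is a hypothetical helper; it prints the command only.
ufs_mount_cmd() {
    dev=$1 mnt=$2
    case "$dev" in
        *@*) echo "mount -F ufs -o ro $dev $mnt" ;;   # snapshot: read-only
        *)   echo "mount -F ufs $dev $mnt" ;;         # live volume: read-write
    esac
}

ufs_mount_cmd /dev/zvol/dsk/p/v1@s1 /fs1-s1
ufs_mount_cmd /dev/zvol/dsk/p/v1 /fs1
```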

bash-3.00# ls /fs1/jdk
instances    jdk1.5.0_08  jdk1.6.0     latest       packages
bash-3.00# rm -rf /fs1/jdk/instances
bash-3.00# df -h /fs1 /fs1-s1
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/p/v1     1.9G    61M   1.8G     4%    /fs1
/dev/zvol/dsk/p/v1@s1
                       1.9G   431M   1.5G    23%    /fs1-s1
bash-3.00#

Just as with any other ZFS dataset, you can create multiple snapshots.  And as with any other ZFS file system, you can roll back to a snapshot and make it the master again.  You have to unmount the filesystem in order to do this, since the rollback is at the volume level.  Changing the volume underneath the UFS filesystem would leave UFS confused about the state of things.  But, ZFS catches this, too.

 

bash-3.00# ls /fs1/jdk/
jdk1.5.0_08  jdk1.6.0     latest       packages
bash-3.00# rm /fs1/jdk/jdk1.6.0
bash-3.00# ls /fs1/jdk/
jdk1.5.0_08  latest       packages
bash-3.00# zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
p        4.00G  29.0G  24.5K  /p
p/swap   22.5K  31.0G  22.5K  -
p/v1      535M  30.5G   531M  -
p/v1@s1  4.33M      -   531M  -
bash-3.00# zfs rollback p/v1@s1 # /fs1 is still mounted.
cannot remove device links for 'p/v1': dataset is busy
bash-3.00# umount /fs1
bash-3.00# zfs rollback p/v1@s1
bash-3.00# mount /dev/zvol/dsk/p/v1 /fs1
bash-3.00# ls /fs1/jdk
jdk1.5.0_08  jdk1.6.0     latest       packages
bash-3.00#
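The order matters here: unmount, roll back, remount. A minimal sketch of that sequence, assuming a POSIX shell. rollback_plan is a hypothetical helper that only prints the commands, using the only snapshot that zfs list shows for this volume, s1.

```shell
# Sketch: print the unmount / rollback / remount sequence for UFS on a zvol.
# rollback_plan is a hypothetical helper; nothing is executed here.
rollback_plan() {
    vol=$1 snap=$2 mnt=$3
    echo "umount $mnt"                       # UFS must not be using the volume
    echo "zfs rollback $vol@$snap"           # revert the whole volume
    echo "mount /dev/zvol/dsk/$vol $mnt"     # bring UFS back on the old contents
}

rollback_plan p/v1 s1 /fs1
```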

I can create additional read-write instances of a volume by cloning the snapshot.  The clone and the master file system will share the same objects on-disk for data that remains unchanged, while new on-disk objects will be created for any files that are changed either in the master or in the clone.

 

bash-3.00# ls /fs1/jdk
jdk1.5.0_08  jdk1.6.0     latest       packages
bash-3.00# zfs snapshot p/v1@s1
bash-3.00# zfs clone p/v1@s1 p/c1
bash-3.00# zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
p        4.00G  29.0G  24.5K  /p
p/c1         0  29.0G   531M  -
p/swap   22.5K  31.0G  22.5K  -
p/v1      531M  30.5G   531M  -
p/v1@s1      0      -   531M  -
bash-3.00# mkdir /c1
bash-3.00# mount /dev/zvol/dsk/p/c1 /c1
bash-3.00# ls /c1/jdk
jdk1.5.0_08  jdk1.6.0     latest       packages
bash-3.00# df -h /fs1 /c1
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/p/v1     1.9G    61M   1.8G     4%    /fs1
/dev/zvol/dsk/p/c1     1.9G    61M   1.8G     4%    /c1
bash-3.00#
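Creating a writable copy is always the same four steps, so here they are as a minimal sketch, assuming a POSIX shell. clone_plan is a hypothetical helper that only prints the commands.

```shell
# Sketch: print the commands that turn a zvol snapshot into a mounted UFS clone.
# clone_plan is a hypothetical helper; nothing is executed here.
clone_plan() {
    vol=$1 snap=$2 clone=$3 mnt=$4
    echo "zfs snapshot $vol@$snap"
    echo "zfs clone $vol@$snap $clone"       # clone shares unchanged blocks
    echo "mkdir -p $mnt"
    echo "mount /dev/zvol/dsk/$clone $mnt"   # read-write, unlike the snapshot
}

clone_plan p/v1 s1 p/c1 /c1
```

If the clone later needs to become the master, zfs promote p/c1 reverses the parent and clone relationship.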

I am pretty sure that this isn't exactly what the ZFS guys had in mind when they set out to build all of this, but this is pretty cool.  Now, I can create UFS snapshots without having to specify a backing store.  I can create clones, promote the clones to the master, and the other things that I can do in ZFS.  I still have to manage the mounts myself, but I'm better off than before.

I have not tried any sort of performance testing on these.  Dominic Kay has just written a nice blog about using filebench to compare ZFS and VxFS.  Maybe I can use some of that work to see how things go with UFS on top of ZFS.

As always, comments, etc. are welcome!

I mentioned recently that I just spent a week in a ZFS internals TOI. Got a few ideas to play with there that I will share. Hopefully folks might have suggestions as to how to improve / test / validate some of these things.

ZVOLs as Swap

The first thing that I thought about was using a zvol as a swap device. Of course, this is right there in the zfs(1M) man page as an example, but it still deserves a mention here.  There has been some discussion of this on the zfs-discuss list at opensolaris.org (I just retyped that dot four times thinking it was a comma. Turns out there was crud on my laptop screen).  The dump device cannot be on a zvol (at least if you want to catch a crash dump) but this still gives a lot of flexibility.  With root on ZFS (coming before too long) ZFS swap makes a lot of sense and is the natural choice. We were talking in class that maybe it would be nice if there were a way to turn off ZFS' caching for the swap surface to improve performance, but that remains to be seen.

At any rate, setting up mirrored swap with ZFS is way simple! Much simpler even than with SVM, which in turn is simpler than VxVM. Here's all it takes:


bash-3.00# zpool create -f p mirror c2t10d0 c2t11d0
bash-3.00# zfs create -V 2g p/swap
bash-3.00# swap -a /dev/zvol/dsk/p/swap

Pretty darn simple, if you ask me. You can make it permanent by changing the lines for swap in your /etc/vfstab (below).  Notice that you use the path to the zvol in the /dev tree rather than the ZFS dataset name.


bash-3.00# cat /etc/vfstab
#device device mount FS fsck mount mount
#to mount to fsck point type pass at boot options
#
#/dev/dsk/c1t0d0s1 - - swap - no -
/dev/zvol/dsk/p/swap - - swap - no -
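The vfstab entry follows directly from the dataset name. A minimal sketch, assuming a POSIX shell; swap_vfstab_line is a hypothetical helper that prints the line rather than editing /etc/vfstab.

```shell
# Sketch: print the /etc/vfstab line for a swap zvol.
# swap_vfstab_line is a hypothetical helper name.
swap_vfstab_line() {
    # Fields: device to mount, device to fsck, mount point, FS type,
    # fsck pass, mount at boot, mount options.
    echo "/dev/zvol/dsk/$1 - - swap - no -"
}

swap_vfstab_line p/swap
```

Printing the line instead of appending it leaves the decision about commenting out the old swap entry to you.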

I would like to do some performance testing to see what kind of performance you can get with swap on a zvol.  I am curious about how this will affect kernel memory usage.  I am curious about the effect of things like compression on the swap volume.  Thinking about that one, it doesn't make a lot of sense.  I am also curious about the ability to dynamically change the size of the swap space.  At first glance, changing the size of the volume does not automatically change the amount of available swap space.  That makes sense for expanding swap space.  But if you reduce the size of the volume and the kernel doesn't notice, that sounds like it could be a problem.  Maybe I should file a bug.

Suggestions for things to try and ways to measure overhead and performance for this are welcomed.

Thursday Nov 30, 2006

I just spent the last four days in a ZFS Internals TOI, given by George Wilson from RPE.  This just reinforces my belief that the folks who build OpenSolaris (and most any complex software product, actually) have a special gift.  How one can conceive of all of the various parts and pieces to bring together something as cool as OpenSolaris or ZFS or DTrace, etc., is beyond me.

By way of full disclosure, I ought to admit that the main thing I learned in graduate school and while working as a developer in a CO-OP job at IBM was that I hate development.  I am not cut out for it and have no patience for it.

Anyway, though, spending a week in the ZFS source actually helps you figure out how to best use the tool at a user level.  You see how things fit together and this helps to figure out how to build solutions.  I got a ton of good ideas about some things that you might do with ZFS even without moving all of your data to ZFS.  Don't know whether they will pan out or not, but some ideas to play around with.  More about that later.

Same kind of thing applies for internals of the kernel.  Whether or not you are a kernel programmer, you can be a better developer and a better system administrator if you have a notion of how the pieces of the kernel fit together.  Sun Education is now offering a class called Solaris 10 Operating System, previously only offered internally at Sun.  Since Solaris has been open-sourced, the internal Internals is now an external Internals!  If you have a chance, take this class!  I take it every couple of Solaris releases and never regret it.

But, mostly I want to say a special thanks to George Wilson and the RPE team for putting together a fantastic training event and for allowing me, from the SE / non-developer side of the house, to sit in and bask in the glow of those who actually make things for a living.