Lori's Blog Lori Alt's Weblog

Friday May 25, 2007

ZFS introduced a new way to get file systems mounted: the mount point of a file system is a property of the file system, and the file system is mounted automatically at its defined mount point. Furthermore, mount points can be generated automatically from the dataset's name and its position in the dataset hierarchy. As the zfs(1m) man page says:

The mountpoint property can be inherited, so if pool/home has a mount point of /export/stuff, then pool/home/user automatically inherits a mount point of /export/stuff/user.

Thus, ZFS file systems do not need entries in /etc/vfstab.
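
To make the inheritance rule concrete, here is a minimal sketch in Python of how an effective mount point could be derived from a dataset name. It is a model of the behavior described above, not ZFS code: the function name effective_mountpoint and the dictionary of locally set values are made up for illustration, and special values such as "legacy" and "none" are deliberately ignored.

```python
def effective_mountpoint(dataset, explicit):
    """Return the mount point a dataset would get under simple inheritance.

    dataset  -- dataset name such as "pool/home/user"
    explicit -- dict mapping dataset names to locally set mountpoint values
                (hypothetical input; "legacy" and "none" are not handled)
    """
    parts = dataset.split("/")
    # Walk from the dataset itself up toward the pool, looking for the
    # nearest dataset that has its own mountpoint value.
    for depth in range(len(parts), 0, -1):
        ancestor = "/".join(parts[:depth])
        if ancestor in explicit:
            base = explicit[ancestor].rstrip("/")
            suffix = "/".join(parts[depth:])
            if suffix:
                return base + "/" + suffix
            return base or "/"
    # Nothing set anywhere: the default is /<pool>/<rest of the name>.
    return "/" + dataset


if __name__ == "__main__":
    props = {"pool/home": "/export/stuff"}
    print(effective_mountpoint("pool/home/user", props))
```

With pool/home set to /export/stuff, the sketch gives /export/stuff/user for pool/home/user, matching the man page example.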

So the question of the day is: should we use this mount mechanism for the root file system when the root file system type is ZFS?

There are some problems in getting non-legacy mounts of root file systems to work right now (detailed below for the truly interested). But I think they are solvable. My big concern is: should we make it work? Or should we continue to use legacy mounts? Here are some of the pros and cons:

Pros:

  1. No need to update /etc/vfstab when the root dataset name changes. Every time a new Boot Environment (BE, in LiveUpgrade terms) is lucreated from a ZFS-based BE, the new BE will have a new root dataset name (by definition, since the new root dataset can't have the same dataset name as the old one). If the root dataset name has to be explicitly named in the /etc/vfstab file, the /etc/vfstab file in the cloned dataset will start out wrong and have to be updated. This is not necessarily a terrible thing (we fix it already for clones of UFS-based BEs), but it's not ideal.

  2. ZFS-rooted BEs are more likely to be composed of multiple datasets than UFS-based BEs. (See my April 23 blog entry.) If we use legacy mounts for the mounts in a BE, every one of those datasets will need an entry in /etc/vfstab and every one will have to be updated when a new BE is generated from an old one. Even if LiveUpgrade or some other BE-management tool does this automatically, it still means clutter in /etc/vfstab.

  3. Non-legacy mounts automatically give you the right mount point if you've named the dataset correctly. So if you create a BE named roottank/BE1 with a mount point of “/”, datasets named “roottank/BE1/usr”, “roottank/BE1/opt”, and “roottank/BE1/var” will automatically get mountpoints of /usr, /opt, and /var, which is what you want. Furthermore, if you want to temporarily move the entire BE to a new point in the name space (say, because you're upgrading it while booted off another BE), all you have to do is change the mount point of the base BE dataset to, say, “/a”, and all of the subordinate datasets will automatically move to /a/usr, /a/opt, /a/var and so on.

  4. Legacy mounts are not the “ZFS way”. Over time, the more we adhere to the ZFS-standard ways of doing things, the more likely it is that ZFS as a root file system will “just work”.
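
To illustrate the bookkeeping that the first two points describe, here is a rough Python sketch of what legacy mounts would require: one vfstab-style line per dataset in the BE, and a rewrite of every one of those lines when a BE is cloned. The helper names vfstab_lines and retarget_vfstab are hypothetical, and the values in the fsck and mount-at-boot columns are placeholders rather than a statement of what a real tool would write.

```python
def vfstab_lines(be_root, mounts):
    """Build vfstab-style lines for the legacy-mounted datasets of one BE.

    be_root -- root dataset of the BE, e.g. "roottank/BE1"
    mounts  -- dict mapping a dataset suffix ("" for the root) to a mount point
    """
    lines = []
    for suffix, mount_point in mounts.items():
        dataset = be_root + ("/" + suffix if suffix else "")
        # Seven fields: device to mount, device to fsck, mount point,
        # FS type, fsck pass, mount at boot, mount options.
        # The "-" and "no" values here are illustrative placeholders.
        lines.append("\t".join([dataset, "-", mount_point, "zfs", "-", "no", "-"]))
    return lines


def retarget_vfstab(lines, old_root, new_root):
    """Rewrite every line that names a dataset under old_root to use new_root."""
    result = []
    for line in lines:
        fields = line.split("\t")
        device = fields[0]
        # Match the BE root itself or anything below it, but not a
        # different BE whose name merely starts with the same characters.
        if device == old_root or device.startswith(old_root + "/"):
            fields[0] = new_root + device[len(old_root):]
        result.append("\t".join(fields))
    return result


if __name__ == "__main__":
    be1 = vfstab_lines("roottank/BE1", {"": "/", "usr": "/usr", "var": "/var"})
    for line in retarget_vfstab(be1, "roottank/BE1", "roottank/BE2"):
        print(line)
```

With non-legacy mounts none of this is needed, because the mount points follow from the dataset names.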

Cons:

  1. No entry in /etc/vfstab for the root file system. This is a big one, because it means that code that parses /etc/vfstab to figure out where root is mounted, and whether /usr is a separate file system, and answer other questions, won't work. Note that many of these questions can be answered by parsing /etc/mnttab (or more appropriately, using the getmntent(3C) functions), because /etc/mnttab will contain all mounts, legacy or otherwise. But there is certain to be code out there that parses /etc/vfstab, or uses the getvfsent(3C) functions.
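
As a sketch of the mnttab-based alternative, the Python below answers the same two questions (what is mounted at root, and is /usr a separate file system) from mnttab-formatted text instead of /etc/vfstab. It assumes the usual tab-separated fields (special, mount point, file system type, options, time). The function names and the sample entries are invented for illustration, and C code would use the getmntent(3C) functions rather than parsing the file by hand.

```python
def parse_mnttab(text):
    """Parse mnttab-formatted text into a list of dicts, one per mount."""
    entries = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        special, mount_point, fstype, options, mount_time = fields[:5]
        entries.append({
            "special": special,
            "mount_point": mount_point,
            "fstype": fstype,
            "options": options.split(","),
            "time": mount_time,
        })
    return entries


def root_info(entries):
    """Report what is mounted at / and whether /usr is its own file system."""
    by_mount_point = {e["mount_point"]: e for e in entries}
    root = by_mount_point.get("/")
    return {
        "root_special": root["special"] if root else None,
        "root_fstype": root["fstype"] if root else None,
        "usr_is_separate": "/usr" in by_mount_point,
    }


if __name__ == "__main__":
    # Made-up sample data in mnttab layout.
    sample = (
        "roottank/BE1\t/\tzfs\trw\t1180000000\n"
        "roottank/BE1/usr\t/usr\tzfs\trw\t1180000001\n"
    )
    print(root_info(parse_mnttab(sample)))
```

Because the mount table lists every mount, legacy or otherwise, the same lookup works no matter how the root file system was mounted.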

I don't want to minimize the potential cost of not having an entry in /etc/vfstab for root. I suspect that it will break a fair number of things (not least in Solaris itself, such as the initialization scripts in /lib/svc/method), but this may be a “pay me now, or pay me later” situation, where the cost of adapting now will save us a lot of grief in the long run. I'd like to get a feeling from the community about this issue and whether there are problems with this that I haven't thought of yet.

So here are some of the implementation issues of moving to a non-legacy mount of root (all of which I think are fixable):

  1. Right now, when you assign a mount point to a dataset, ZFS tries to mount the dataset at that mount point immediately. Obviously, if you've just assigned a dataset a mountpoint of “/”, that mount is going to fail because there is already something mounted at “/”. We need some way to override the immediate mount when we're setting up a BE for later mounting.

  2. The logic in the SMF startup scripts related to file system mounting (/lib/svc/method/{fs-root,fs-usr}) will have to change to deal with the fact that those file systems don't have entries in /etc/vfstab.

  3. The mountroot code in the kernel should mount zfs roots as read-write, not read-only. There is no need for the temporary read-only mount because we don't fsck zfs file systems.



Tuesday Apr 24, 2007

The question came up in comments to yesterday's post about how swap will work when the system is booted from zfs.

Current plan is that swap will be a zvol within the pool. Eventually, that same zvol will be used for dump as well. (Right now, taking a crash dump into a zvol isn't supported, but it's considered a critical feature and is being developed at this time.) The current install code has an algorithm for setting the swap size (based on the amount of memory on the system), and we will need a way for users to override that. One nice thing about using a zvol for swap is that it will be easy to change its size.

I'm thinking right now that we shouldn't (and probably couldn't even if we wanted to) prevent users from setting up configurations in which they have a swap slice. Perhaps there are workloads for which a dedicated swap slice still makes sense. But in that case, we should still find a way to turn on write caching if the disk is composed of nothing but parts of zfs pools and a swap slice. It's on my list to get such a thing implemented. We don't just need it for swap. The bigger reason to support it is that some users may want to divide a disk into two pools: one which is a root pool and one which is a data pool (since we don't support RAID-Z on root pools at this time). In that case, even though the entire disk is not devoted to a single pool, the disk is still "all-ZFS", in which case write caching should be allowed.
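
The "all-ZFS" test described above could be expressed as a simple predicate. This Python sketch is only a statement of the policy, not of how it would be implemented in the disk or ZFS code. The function name write_cache_allowed and the "zfs"/"swap" labels are hypothetical, and requiring at least one ZFS slice is my own assumption.

```python
def write_cache_allowed(slice_uses):
    """Decide whether a disk qualifies as "all-ZFS" for write caching.

    slice_uses -- list with one label per in-use slice on the disk,
                  e.g. ["zfs", "zfs", "swap"] (labels are hypothetical)
    """
    allowed = {"zfs", "swap"}
    # Every in-use slice must belong to a ZFS pool or be a swap slice...
    only_allowed = all(use in allowed for use in slice_uses)
    # ...and (assumption) at least one slice must actually be ZFS.
    has_zfs = "zfs" in slice_uses
    return only_allowed and has_zfs


if __name__ == "__main__":
    print(write_cache_allowed(["zfs", "zfs"]))   # root pool plus data pool
    print(write_cache_allowed(["zfs", "swap"]))  # one pool plus a swap slice
    print(write_cache_allowed(["zfs", "ufs"]))   # a UFS slice disqualifies the disk
```

The first two cases cover the root-pool-plus-data-pool and pool-plus-swap-slice layouts mentioned above.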

Monday Apr 23, 2007

One of the cringe-inducing parts of this whole open-source approach to things is that we now have the opportunity, and thus perhaps the obligation, to expose just how confused we are sometimes about technical issues. I have this fear that the reaction to some of what I'm going to blog will be "You haven't figured that out YET?" Well, I'm going to take that risk. ZFS is such a different file system paradigm that we're all still figuring out the implications. And in particular, using ZFS as a root file system and booting from datasets in pools, not LUNs, really changes the way we have to think about managing bootable environments.

So here's an issue: What's the best way to split up the Solaris name space into datasets? Should it be split at all? I'm going to answer the second question right away with my strong opinion that YES, we should split up the name space into separate datasets. We don't need to do it any more for space reasons (back in the days of 400 MB disks, we couldn't fit all of what shipped with the Solaris operating system onto a single disk), but now that having multiple file systems is EASY (no need to preallocate a separate slice for each one), maybe we should do it for other reasons. Here are a few:

  • It might make sense to have different qualities of service for different parts of the name space. Maybe /opt could be compressed, for example.
  • When you clone a bootable environment (i.e., a root file system and its subordinate file systems), you might want to include some parts of the old boot environment by reference, not by cloning. For example, since /var/adm/log reflects the history of the overall system, not just a single boot environment, maybe you want to have just one copy which is shared among the various boot environments. If that directory were its own file system, it would be easier to share it among different bootable environments.
  • Eventually, we'd like to support booting from other kinds of pool configurations besides simple mirrors. But to do that, we need to have the files that are crucial for booting on each disk in the pool. That happens automatically with mirrors, but not for RAID-Z. So how to make sure that some files are available on each disk? Well, I had considered a file attribute of some kind. Some kind of "treat this file special" interface to zfs. But that would be a pain because every time we wrote out, say, the boot archive, we would have to give it the "special" treatment, whatever that is. ZFS's "quality of service" boundary is the dataset, not the file. So what if we created a new dataset property, which is the "make this dataset available to the booter" property? It's not clear how to implement that and I don't want to get into it (I'll leave it to the ZFS internals gurus to figure that one out), but if we had such a property, then we could assign it to the root file system and automatically get bootability. But however that property is implemented, it would probably involve replicating the entire dataset on each disk. And that's a good reason for keeping the root file system small. Which means splitting off /usr, /opt, and perhaps other parts of the name space into separate datasets.
  • Splitting the name space into separate file systems might have some advantages for zones.

My gut feeling is that the most controversial part of the name space as far as where to make the divisions will be /var. Should /var as a whole be a separate dataset? Is there value in splitting off some of the subdirectories of /var as separate datasets?

I'll be working on a plan for these and other issues. Watch this space for more questions and updates. I welcome your comments on this, which can be posted either here or at the zfs-discuss@opensolaris.org alias.

Friday Apr 20, 2007

I should have done this ages ago. If I had known how easy it is to set up a blog, I would have.

I'm Lori Alt, staff engineer at Sun Microsystems. I'm currently the project lead for the zfs boot project. The goal of the project is to enable zfs to be used as a root file system. I intend this blog to be a way to keep the OpenSolaris community informed about new developments in that area. I also hope for lots of comments about what direction the project should take.

Some background: I went to Washington University in St. Louis, graduating with bachelor's degrees in History and Computer Science, and a master's degree in CS. I've been at Sun since 1991. During my first six or so years at Sun, I was a member of the installation software group, mainly working on upgrade. I wish we had something like zfs back in those days. It would have made installation and upgrade so much easier! In more recent years, I've been a member of the file systems group, mainly working on ufs. I was the principal developer and project lead for the multi-terabyte UFS project. After that project shipped, I did some general UFS bug-fix work and then joined the zfs team with the specific assignment of making zfs bootable. Part of the reason I was chosen to lead that project was my background in install, since a big part of implementing a new root file system is the installation software to set it up and maintain it (through patching, upgrades, etc.).

I also wrote the "acr" part of bfu. Since I was the originator of the use of "class-action scripts" in Solaris packages for upgrading editable files, it was always very annoying to me to have to resolve conflicts manually after a bfu. So I wrote acr to do it automatically. (I must thank Bill Sommerfeld however for adding some nice enhancements to the acr code and actually getting it made part of the Solaris source. Before that, it was just an internal tool.)

So here's the latest on zfs boot:

The ON (Operating Environment and Networking) support for booting from zfs root file systems on x86 platforms was integrated into the Nevada source on March 29, thanks to the great work of my colleague Lin Ling. At that point (from build 62 on), it became possible to set up systems with a zfs root file system, either by a manual setup procedure or by using a kit to convert a standard install image into one that would support the profile-based installation of a system with zfs root.

Support is planned for sparc. More about that later.

So now that I've started this blog, I expect to be posting status regularly and presenting some of the issues that come up. I will welcome comments and input. You can also monitor the project at the zfs-discuss@opensolaris.org mailing list.