Lori's Blog Lori Alt's Weblog

Friday May 25, 2007

ZFS introduced a new way to get file systems mounted: the mount point of a file system is a property of the file system, and the file system is mounted automatically at its defined mount point. Furthermore, mount points can be generated automatically from the dataset's name and its position in the dataset hierarchy. As the zfs(1M) man page says:

The mountpoint property can be inherited, so if pool/home has a mount point of /export/stuff, then pool/home/user automatically inherits a mount point of /export/stuff/user.

Thus, ZFS file systems do not need entries in /etc/vfstab.
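To make the inheritance rule concrete, here is a minimal Python sketch of how a mount point can be derived from a dataset's name and the nearest ancestor with a locally set mountpoint. This is a model of the rule quoted above, not ZFS code: the function name effective_mountpoint and the "explicit" dictionary are hypothetical, and special values such as "legacy" and "none" are not handled.

```python
import posixpath


def effective_mountpoint(dataset, explicit):
    """Return the mount point a dataset would end up with.

    `explicit` is a hypothetical dict mapping dataset names to
    mountpoint values that were set locally on that dataset.
    """
    parts = dataset.split("/")
    # Walk from the dataset itself up toward the pool, looking for the
    # nearest dataset that has a locally set mountpoint.
    for depth in range(len(parts), 0, -1):
        ancestor = "/".join(parts[:depth])
        if ancestor in explicit:
            # Append the remaining name components to the ancestor's
            # mount point.
            return posixpath.join(explicit[ancestor], *parts[depth:])
    # Nothing set anywhere in the hierarchy: the default is /<dataset name>.
    return "/" + dataset


explicit = {"pool/home": "/export/stuff"}
print(effective_mountpoint("pool/home/user", explicit))  # /export/stuff/user
```

The same rule is what lets a whole tree of datasets follow its parent when only the parent's mountpoint changes.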

So the question of the day is: should we use this mount mechanism for the root file system when the root file system type is ZFS?

There are some problems in getting non-legacy mounts of root file systems to work right now (detailed below for the truly interested). But I think they are solvable. My big concern is: should we make it work? Or should we continue to use legacy mounts? Here are some of the pros and cons:

Pros:

  1. No need to update /etc/vfstab when the root dataset name changes. Every time a new Boot Environment (BE, in LiveUpgrade terms) is created with lucreate from a ZFS-based BE, the new BE will have a new root dataset name (by definition, since the new root dataset can't have the same dataset name as the old one). If the root dataset has to be named explicitly in /etc/vfstab, the copy of /etc/vfstab in the cloned dataset will start out wrong and will have to be updated. This is not necessarily a terrible thing (we fix it already for clones of UFS-based BEs), but it's not ideal.

  2. ZFS-rooted BEs are more likely to be composed of multiple datasets than UFS-based BEs. (See my April 23 blog entry.) If we use legacy mounts for the mounts in a BE, every one of those datasets will need an entry in /etc/vfstab and every one will have to be updated when a new BE is generated from an old one. Even if LiveUpgrade or some other BE-management tool does this automatically, it still means clutter in /etc/vfstab.

  3. Non-legacy mounts automatically give you the right mount point if you've named the dataset correctly. So if you create a BE named roottank/BE1 with a mount point of “/”, datasets named “roottank/BE1/usr”, “roottank/BE1/opt”, and “roottank/BE1/var” will automatically get mount points of /usr, /opt, and /var, which is what you want. Furthermore, if you want to temporarily move the entire BE to a new point in the name space (say, because you're upgrading it while booted off another BE), all you have to do is change the mount point of the base BE dataset to, say, “/a”, and all of the subordinate datasets will automatically move to /a/usr, /a/opt, /a/var and so on.

  4. Legacy mounts are not the “ZFS way”. Over time, the more we adhere to the ZFS-standard ways of doing things, the more likely it is that ZFS as a root file system will “just work”.
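To illustrate the bookkeeping that pros 1 and 2 are about, here is a rough Python sketch of what a BE-management tool would have to do to a cloned BE's /etc/vfstab if legacy mounts were used. The function name retarget_vfstab is hypothetical, and the sample entries (in particular the "mount at boot" column values) are placeholders for illustration, not output from a real system.

```python
def retarget_vfstab(vfstab_text, old_root, new_root):
    """Rewrite the device field of every entry under old_root.

    Only entries whose device is old_root itself, or a dataset below
    it, are touched; comments and unrelated entries pass through.
    """
    out = []
    for line in vfstab_text.splitlines():
        fields = line.split()
        if fields and not line.lstrip().startswith("#"):
            dev = fields[0]
            if dev == old_root or dev.startswith(old_root + "/"):
                fields[0] = new_root + dev[len(old_root):]
                line = "\t".join(fields)
        out.append(line)
    return "\n".join(out)


# Hypothetical legacy-mount entries for a BE made of four datasets.
# Fields: device, fsck device, mount point, FS type, fsck pass,
# mount at boot, mount options.
SAMPLE_VFSTAB = "\n".join(
    "\t".join(fields)
    for fields in [
        ["roottank/BE1", "-", "/", "zfs", "-", "no", "-"],
        ["roottank/BE1/usr", "-", "/usr", "zfs", "-", "no", "-"],
        ["roottank/BE1/opt", "-", "/opt", "zfs", "-", "yes", "-"],
        ["roottank/BE1/var", "-", "/var", "zfs", "-", "no", "-"],
    ]
)

updated = retarget_vfstab(SAMPLE_VFSTAB, "roottank/BE1", "roottank/BE2")
print(updated)
```

Every dataset in the BE costs one line here, and every line has to be rewritten for each clone. With non-legacy mounts none of these entries exist, so there is nothing to rewrite.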

Cons:

  1. No entry in /etc/vfstab for the root file system. This is a big one, because it means that code that parses /etc/vfstab to figure out where root is mounted, and whether /usr is a separate file system, and answer other questions, won't work. Note that many of these questions can be answered by parsing /etc/mnttab (or more appropriately, using the getmntent(3C) functions), because /etc/mnttab will contain all mounts, legacy or otherwise. But there is certain to be code out there that parses /etc/vfstab, or uses the getvfsent(3C) functions.
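As a sketch of the /etc/mnttab alternative mentioned above, the following Python reads mnttab-format text (tab-separated fields: special, mount point, file system type, options, time) and answers the two questions from the text: what is mounted at root, and is /usr a separate file system. Real code on Solaris would more likely use the getmntent(3C) functions from C; the function names and the sample data here are made up for illustration.

```python
def parse_mnttab(text):
    """Map each mount point to (special, fstype) from mnttab-format text."""
    mounts = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        special, mount_point, fstype = fields[:3]
        mounts[mount_point] = (special, fstype)
    return mounts


def describe_root(mounts):
    """Answer the questions that vfstab-parsing code typically asks."""
    special, fstype = mounts["/"]
    return {
        "root_special": special,
        "root_fstype": fstype,
        "separate_usr": "/usr" in mounts,
    }


# Hypothetical mnttab contents for a ZFS-rooted BE with a separate /usr.
SAMPLE_MNTTAB = "\n".join(
    "\t".join(fields)
    for fields in [
        ["roottank/BE1", "/", "zfs", "rw,devices,setuid", "1180000000"],
        ["roottank/BE1/usr", "/usr", "zfs", "rw,devices,setuid", "1180000001"],
        ["swap", "/tmp", "tmpfs", "xattr", "1180000002"],
    ]
)

info = describe_root(parse_mnttab(SAMPLE_MNTTAB))
print(info)
```

Because this works from the table of actual mounts, it gives the same answers whether the mounts were legacy or not.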

I don't want to minimize the potential cost of not having an entry in /etc/vfstab for root. I suspect that it will break a fair number of things (not least in Solaris itself, such as the initialization scripts in /lib/svc/method), but this may be a “pay me now, or pay me later” situation, where the cost of adapting now will save us a lot of grief in the long run. I'd like to get a feeling from the community about this issue and whether there are problems with this that I haven't thought of yet.

So here are some of the implementation issues of moving to a non-legacy mount of root (all of which I think are fixable):

  1. Right now, when you assign a mount point to a dataset, ZFS tries to mount the dataset at that mount point immediately. Obviously, if you've just assigned a dataset a mountpoint of “/”, that mount is going to fail because there is already something mounted at “/”. We need some way to override the immediate mount when we're setting up a BE for later mounting.

  2. The logic in the SMF startup scripts related to file system mounting (/lib/svc/method/{fs-root,fs-usr}) will have to change to deal with the fact that those file systems don't have entries in /etc/vfstab.

  3. The mountroot code in the kernel should mount ZFS roots as read-write, not read-only. There is no need for the temporary read-only mount because we don't fsck ZFS file systems.