Lori's Blog: Lori Alt's Weblog

Monday Apr 23, 2007

One of the cringe-inducing parts of this whole open-source approach to things is that we now have the opportunity, and thus perhaps the obligation, to expose just how confused we are sometimes about technical issues. I have this fear that the reaction to some of what I'm going to blog will be "You haven't figured that out YET?" Well, I'm going to take that risk. ZFS is such a different file system paradigm that we're all still figuring out the implications. And in particular, using ZFS as a root file system and booting from datasets in pools, not LUNs, really changes the way we have to think about managing bootable environments.

So here's an issue: What's the best way to split up the Solaris name space into datasets? Should it be split at all? I'm going to answer the second question right away with my strong opinion that YES, we should split up the name space into separate datasets. We don't need to do it any more for space reasons (back in the days of 400 MB disks, we couldn't fit all of what shipped with the Solaris operating system onto a single disk), but now that having multiple file systems is EASY (no need to preallocate a separate slice for each one), maybe we should do it for other reasons. Here are a few:

  • It might make sense to have different qualities of service for different parts of the name space. Maybe /opt could be compressed, for example.
  • When you clone a bootable environment (i.e., a root file system and its subordinate file systems), you might want to include some parts of the old boot environment by reference, not by cloning. For example, since /var/adm/log reflects the history of the overall system, not just a single boot environment, maybe you want to have just one copy which is shared among the various boot environments. If that directory were its own file system, it would be easier to share it among different bootable environments.
  • Eventually, we'd like to support booting from other kinds of pool configurations besides simple mirrors. But to do that, we need to have the files that are crucial for booting on each disk in the pool. That happens automatically with mirrors, but not with RAID-Z. So how do we make sure that some files are available on each disk? Well, I had considered a file attribute of some kind. Some kind of "treat this file special" interface to ZFS. But that would be a pain because every time we wrote out, say, the boot archive, we would have to give it the "special" treatment, whatever that is. ZFS's "quality of service" boundary is the dataset, not the file. So what if we created a new dataset property, which is the "make this dataset available to the booter" property? It's not clear how to implement that and I don't want to get into it (I'll leave it to the ZFS internals gurus to figure that one out), but if we had such a property, then we could assign it to the root file system and automatically get bootability. But however that property is implemented, it would probably involve replicating the entire dataset on each disk. And that's a good reason for keeping the root file system small, which means splitting off /usr, /opt, and perhaps other parts of the name space into separate datasets.
  • Splitting the name space into separate file systems might have some advantages for zones.
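To make the first two reasons a bit more concrete, here is a rough sketch in Python that only prints the zfs commands such a layout might need; it doesn't run them. Everything specific in it is my assumption, not a decided design: the pool name "tank", the boot environment names "be1" and "be2", the "shared" parent dataset, and the exact set of split-off directories are all hypothetical. The only real interfaces it relies on are zfs create (with -p and -o), zfs snapshot, and zfs clone. Mount point handling is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    mountpoint: str
    shared: bool = False  # True: one copy referenced by every boot environment
    properties: dict = field(default_factory=dict)


# Hypothetical split of the name space; the real division is still undecided.
LAYOUT = [
    Dataset("/"),
    Dataset("/usr"),
    Dataset("/opt", properties={"compression": "on"}),
    Dataset("/var/adm/log", shared=True),
]


def dataset_name(pool, be, ds):
    """Map a mount point to a dataset name (the naming scheme is an assumption)."""
    suffix = ds.mountpoint.strip("/")
    if ds.shared:
        return f"{pool}/shared/{suffix}"
    root = f"{pool}/{be}"
    return f"{root}/{suffix}" if suffix else root


def create_commands(pool, be):
    """Commands to create the datasets for a first boot environment."""
    cmds = []
    for ds in LAYOUT:
        opts = "".join(f"-o {k}={v} " for k, v in sorted(ds.properties.items()))
        cmds.append(f"zfs create -p {opts}{dataset_name(pool, be, ds)}")
    return cmds


def clone_commands(pool, old_be, new_be):
    """Commands to clone a boot environment, leaving shared datasets alone."""
    cmds = []
    for ds in LAYOUT:
        if ds.shared:
            continue  # included by reference, so nothing to snapshot or clone
        src = dataset_name(pool, old_be, ds)
        dst = dataset_name(pool, new_be, ds)
        cmds.append(f"zfs snapshot {src}@{new_be}")
        cmds.append(f"zfs clone {src}@{new_be} {dst}")
    return cmds


if __name__ == "__main__":
    for cmd in create_commands("tank", "be1") + clone_commands("tank", "be1", "be2"):
        print(cmd)
```

The point of the sketch is just that the per-dataset decisions (compress it? clone it or share it?) become one line of data each once the name space is split, which is much harder to express if everything lives in a single root file system.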

My gut feeling is that the most controversial part of the name space as far as where to make the divisions will be /var. Should /var as a whole be a separate dataset? Is there value in splitting off some of the subdirectories of /var as separate datasets?

I'll be working on a plan for these and other issues. Watch this space for more questions and updates. I welcome your comments on this, which can be posted either here or at the zfs-discuss@opensolaris.org alias.