Lori's Blog Lori Alt's Weblog

Monday Apr 23, 2007

One of the cringe-inducing parts of this whole open-source approach to things is that we now have the opportunity, and thus perhaps the obligation, to expose just how confused we are sometimes about technical issues. I have this fear that the reaction to some of what I'm going to blog will be "You haven't figured that out YET?" Well, I'm going to take that risk. ZFS is such a different file system paradigm that we're all still figuring out the implications. And in particular, using ZFS as a root file system and booting from datasets in pools, not LUNs, really changes the way we have to think about managing bootable environments.

So here's an issue: What's the best way to split up the Solaris name space into datasets? Should it be split at all? I'm going to answer the second question right away with my strong opinion that YES, we should split up the name space into separate datasets. We don't need to do it any more for space reasons (back in the days of 400 MB disks, we couldn't fit all of what shipped with the Solaris operating system onto a single disk), but now that having multiple file systems is EASY (no need to preallocate a separate slice for each one), maybe we should do it for other reasons. Here are a few:

  • It might make sense to have different qualities of service for different parts of the name space. Maybe /opt could be compressed, for example.
  • When you clone a bootable environment (i.e., a root file system and its subordinate file systems), you might want to include some parts of the old boot environment by reference, not by cloning. For example, since /var/adm/log reflects the history of the overall system, not just a single boot environment, maybe you want to have just one copy which is shared among the various boot environments. If that directory were its own file system, it would be easier to share it among different bootable environments.
  • Eventually, we'd like to support booting from other kinds of pool configurations besides simple mirrors. But to do that, we need to have the files that are crucial for booting on each disk in the pool. That happens automatically with mirrors, but not with RAID-Z. So how do we make sure that some files are available on each disk? Well, I had considered a file attribute of some kind, some sort of "treat this file specially" interface to ZFS. But that would be a pain because every time we wrote out, say, the boot archive, we would have to give it the "special" treatment, whatever that is. ZFS's "quality of service" boundary is the dataset, not the file. So what if we created a new dataset property, which is the "make this dataset available to the booter" property? It's not clear how to implement that and I don't want to get into it (I'll leave it to the ZFS internals gurus to figure that one out), but if we had such a property, then we could assign it to the root file system and automatically get bootability. However that property is implemented, though, it would probably involve replicating the entire dataset on each disk. And that's a good reason for keeping the root file system small, which means splitting off /usr, /opt, and perhaps other parts of the name space into separate datasets.
  • Splitting the name space into separate file systems might have some advantages for zones.
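
To make the first two points concrete, here is a minimal Python sketch of what a split layout could look like. It only builds command strings and never runs them. The `zfs create -o`, `zfs snapshot -r`, and `zfs clone` commands are real, but everything else is an assumption made for illustration: the pool name `rpool`, the `rpool/ROOT/<be>` and `rpool/shared` naming scheme, the boot environment names, and the trick of flattening nested paths into a single dataset name. It also assumes the parent datasets already exist, and it deliberately ignores how mount points and other properties of the cloned boot environment would be handled.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """One division point in the name space (hypothetical model)."""
    mountpoint: str
    properties: dict = field(default_factory=dict)
    shared: bool = False  # True: one copy referenced by every boot environment


def dataset_name(parent, mountpoint):
    """Map a mount point to a dataset name directly under `parent`.

    Nested paths are flattened with underscores so that no intermediate
    datasets are needed (an illustrative naming choice, nothing more).
    """
    leaf = mountpoint.strip("/").replace("/", "_")
    return f"{parent}/{leaf}" if leaf else parent


def be_root(pool, be):
    """Assumed naming scheme for the root dataset of a boot environment."""
    return f"{pool}/ROOT/{be}"


def create_commands(pool, be, layout):
    """Build one `zfs create` command per dataset in the layout."""
    cmds = []
    for ds in layout:
        parent = f"{pool}/shared" if ds.shared else be_root(pool, be)
        props = dict(ds.properties, mountpoint=ds.mountpoint)
        opts = " ".join(f"-o {k}={v}" for k, v in sorted(props.items()))
        cmds.append(f"zfs create {opts} {dataset_name(parent, ds.mountpoint)}")
    return cmds


def clone_commands(pool, old_be, new_be, layout):
    """Snapshot the old boot environment recursively, then clone each
    non-shared dataset one by one. Shared datasets are skipped, which is
    the "include by reference" case."""
    snap = new_be  # snapshot name; reusing the new BE name is an assumption
    cmds = [f"zfs snapshot -r {be_root(pool, old_be)}@{snap}"]
    for ds in layout:
        if ds.shared:
            continue
        src = dataset_name(be_root(pool, old_be), ds.mountpoint)
        dst = dataset_name(be_root(pool, new_be), ds.mountpoint)
        cmds.append(f"zfs clone {src}@{snap} {dst}")
    return cmds


# Example layout: small root, /usr and /opt split off, /opt compressed,
# and /var/adm/log shared among boot environments.
LAYOUT = [
    Dataset("/"),
    Dataset("/usr"),
    Dataset("/opt", {"compression": "on"}),
    Dataset("/var/adm/log", shared=True),
]

if __name__ == "__main__":
    for cmd in create_commands("rpool", "be1", LAYOUT):
        print(cmd)
    for cmd in clone_commands("rpool", "be1", "be2", LAYOUT):
        print(cmd)
```

The per-dataset loop in `clone_commands` is exactly the kind of bookkeeping that grows with every extra division point, so the number of datasets in the default layout is a real cost for whatever tool manages boot environments.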

My gut feeling is that the most controversial part of the name space as far as where to make the divisions will be /var. Should /var as a whole be a separate dataset? Is there value in splitting off some of the subdirectories of /var as separate datasets?

I'll be working on a plan for these and other issues. Watch this space for more questions and updates. I welcome your comments on this, which can be posted either here or at the zfs-discuss@opensolaris.org alias.

Comments:

Since you simply cannot satisfy everyone with whatever way you ultimately divide the root name space, why separate it at all?

All I suggest is that the default install is NOT sub-divided, but the install process prompts the end-user for their preferred layout, should they wish to complicate things further.

Problem solved. =)

By the way, will swap be available as a separate ZFS dataset [by default] or will there be a raw implementation on the same ZFS boot disk? If the raw option, we should still have the full ZFS disk cache enabled...right?

Posted by Wes Williams on April 23, 2007 at 01:05 PM MDT #

At the moment, it's still a pain to discover that /var should really have had its own filesystem (on ZFS or otherwise).
It'd be worth looking to windwards for the 'zfs split' functionality that's on the way.

I'm looking for an easy way to boot zfs clones, ideally with live upgrade support, that can transparently survive either disk in the mirror dying.
And a pony :)

As ever, really appreciate your work on this.

Posted by Dick Davies on April 23, 2007 at 02:49 PM MDT #

One of our security rules is that filesystems containing user-writable directories should not permit setuid execution or devices. These would be home directories and temporary directories such as /tmp, /var/tmp, and a few others that appear in /var. We need to set separate mount options for these filesystems.

Posted by Gary Mills on April 23, 2007 at 03:29 PM MDT #

Simpler is better. Think, if I have my install split into multiple datasets then luupgrade tools will have to be aware of them to snapshot/clone. Yes, zfs snapshot -r makes that easier, but there's no clone -r. For a complete Solaris install I'm not sure that there's any part where I want ZFS to compress some bits but not others, but maybe that's just me.

Posted by Nico on April 23, 2007 at 04:05 PM MDT #

Gary: yes, but those directories tend not to contain any part of a Solaris install (i.e., the only entries in the packaging system for "/var/tmp" and "tmp" are the ones for the directory itself -- there is nothing inside them that is installed by any package).

Posted by Nico on April 23, 2007 at 04:08 PM MDT #

Nico: that's true, but as users are generally allowed to write into /tmp and /var/tmp, if those are exec-enabled, then users can simply copy their executables there and run them.

Thus far, it's sounding like we want /, /tmp, /var, /var/tmp, /opt, and /usr. Any thoughts on /usr/local? /home? /export? /sbin? I know /sbin is part of / right now, but if we're talking about sharing filesystems among zones, then perhaps that could be split off?

Posted by Mark J Musante on April 24, 2007 at 07:02 AM MDT #

Some responses: I agree that having multiple datasets in a boot environment will make maintenance more complex because of having to separately clone all of the datasets, but I'm afraid that that complexity is inevitable, which is why I'd like to see it hidden within LiveUpgrade or whatever replaces it. (We could also perhaps arrange for a clone -r option.) As for the specific set of division points, I'm thinking that we probably need a default set, but should provide an option for administrators to override it and provide their own (definitely within Jumpstart and probably within the interactive install tools as well).

Posted by lalt on April 24, 2007 at 08:32 AM MDT #

The US DoD security guidelines for Unix currently require: (GEN003620: CAT III) The SA will configure separate filesystem partitions export/home, and /var unless justified and documented with the IAO. You can find this document at: http://iase.disa.mil/stigs/stig/unix-stig-v5r1.pdf They buy many millions of dollars of Solaris-based equipment each year.

Posted by Jim Laurent on April 24, 2007 at 09:22 AM MDT #

Whichever datasets are decided on please provide the SA with the option of changing the default layout at install time. I am already creating /home and /usr/local as separate datasets on a 2nd mirror disk pair so they survive new build installations.

Posted by Ron Halstead on April 30, 2007 at 09:25 AM MDT #

One of the things that I've had to fight with in Solaris for a long time is the concept of /export/home as the default home area.
I *always* end up disabling autofs for a few reasons, but primarily I don't like the kludge of /export/home. I like the neat and clean simplicity of /home without having to overmount it on login.

Next, I think the ability to migrate an existing directory into a new zfs set would be in order.

Let's say you do your install, and elect not to create /var/crash as a separate area, and decide later to make it separate, or /var/spool or something else that is in use quite a bit.

Now, instead of having to shut down to single-user mode, rename the old directory, create the new zfs mountpoint, copy data, and then bring the box back up to multiuser mode, give the ability to migrate a subdirectory to a separate subfs online (possibly through the use of snapshots or whatever mechanism is needed).

Posted by Larry B on July 28, 2008 at 03:24 PM MDT #

In regards to the DoD GEN003620: CAT III...

Sounds to me like someone at the DoD hasn't been awake for the last 10 years.

Solaris doesn't go down if root fills up, doesn't need to have separate filesystems/mount points for /home or /var.

They've done experiments where they kept an NFS server online for months while scripts kept the entire root filesystem (which included /var /opt /export/home) 100% full the entire time. They got bored with it and shut down the experiment after proving there was no issue with Solaris running on a full root filesystem.

Granted, logging goes to hell with a full root, but that happens regardless of which filesystem the logs are written to.

Posted by Larry B on July 28, 2008 at 03:28 PM MDT #

Hello

I was wondering if there is any method to create different datasets within rpool during initial installation, e.g. for /opt. I see there is an option for /var, but I can't figure it out for /opt. All I want is root, /var and /opt on different datasets.

Can you offer some input on this?

Thank you.
Birut Patel.
NYC

Posted by Birut Patel on March 05, 2009 at 11:32 AM MST #
