The View from the Moon

Tuesday June 14, 2005

7 second boot?
Today, the OpenSolaris "Hello, World" published on Slashdot had this to say:
...Some other highlights include the GRUB bootloader, SMF (Service Management Facility) which replaces init.d scripts, it starts up processes in parallel for faster boots (7 second boot on a dual opteron workstation I think that was the setup) as well as providing features for automatically restarting...
This link generated a massive 4000 hits to my 'blog today, putting me #3 in the blogs.sun.com ranking. I even got an attaboy from MaryMary! While I'm certainly flattered, I felt that these hits were perhaps undeserved-- the claim was so provocative that people just had to click it. We've never made such a claim, though everything else in the sentence is basically correct (see these experts to learn more about new boot and SMF).

So where did that figure of seven seconds come from? Well, we can boot a Solaris Zone in seven seconds-- and by "boot" I mean we go from nothing to having dtlogin up in 7 seconds on a uniprocessor Opteron system (a stock ShuttlePC under my desk). Here is the proof. That said, we do boot pretty quickly on an Opteron system, although you do wind up spending a lot of time waiting around for the BIOS to initialize before the OS ever gets started.

I hope that clears things up. Maybe someday we really will have a seven second boot!



(2005-06-14 23:39:01.0) Permalink Comments [3]
Trackback: http://blogs.sun.com/dp/entry/7_second_boot
 

Inside the Zones Console: a Tour of Comments, and Bugs

OpenSolaris has arrived. I'm really happy to be able to show off the OS. If you couldn't tell from our blogs over the past year, we've been itching-- aching-- to let people join in our fun. In a way, a lot of us have seen our 'blog entries as a walk on the shores of a very deep pool of knowledge, most of which is in the source code itself. But code alone can be terribly obscure. Cruise on over to tbl(1) if you doubt. To summarize, knowledge in code can be useless from a maintenance, diagnosability and reusability perspective. But well documented code can be enlightening and useful. I'll try to show you what I mean, by giving you a tour of some of the comments I've written in the source base, and a taste of the kinds of bugs which have cropped up thus far.

I thought I would get started by talking about a subsystem I developed for Solaris 10. By now you've heard of Solaris Zones-- and if not, Solaris Zones: Operating System Support for Consolidating Commercial Workloads is, I think, a good introduction (but I'm a coauthor, so I admit bias). One aspect of Zones I care a lot about is the Zones console. I'm particularly proud of this subsystem because I designed and implemented it entirely from scratch myself, and the design blends a range of techniques I've picked up over the years: dynamic device instances, a modular design, reuse of existing facilities and a familiar interaction model for users.

Since we believe in "big theory" comments explaining the overall design of code, usr/src/cmd/zoneadmd/zcons.c speaks for itself:

/*
 * Console support for zones requires a significant infrastructure.  The
 * core pieces are contained in this file, but other portions of note
 * are in the zlogin(1M) command, the zcons(7D) driver, and in the
 * devfsadm(1M) misc_link generator.
 *
 * Care is taken to make the console behave in an "intuitive" fashion for
 * administrators.  Essentially, we try as much as possible to mimic the
 * experience of using a system via a tip line and system controller.
 *
 * The zone console architecture looks like this:
 *
 *                                      Global Zone | Non-Global Zone
 *                        .--------------.          |
 *        .-----------.   | zoneadmd -z  |          | .--------. .---------.
 *        | zlogin -C |   |     myzone   |          | | ttymon | | syslogd |
 *        `-----------'   `--------------'          | `--------' `---------'
 *                  |       |       | |             |      |       |
 *  User            |       |       | |             |      V       V
 * - - - - - - - - -|- - - -|- - - -|-|- - - - - - -|- - /dev/zconsole - - -
 *  Kernel          V       V       | |                        |
 *               [AF_UNIX Socket]   | `--------. .-------------'
 *                                  |          | |
 *                                  |          V V
 *                                  |     +-----------+
 *                                  |     |  ldterm,  |
 *                                  |     |   etc.    |
 *                                  |     +-----------+
 *                                  |     +-[Anchor]--+
 *                                  |     |   ptem    |
 *                                  V     +-----------+
 *                           +---master---+---slave---+
 *                           |                        |
 *                           |      zcons driver      |
 *                           |    zonename="myzone"   |
 *                           +------------------------+
 *
 * There are basically three major tasks which the console subsystem in
 * zoneadmd accomplishes:
 *
 * - Setup and teardown of zcons driver instances.  One zcons instance
 *   is maintained per zone; we take advantage of the libdevice APIs
 *   to online new instances of zcons as needed.  Care is taken to
 *   prune and manage these appropriately; see init_console_dev() and
 *   destroy_console_dev().  The end result is the creation of the
 *   zcons(7D) instance and an open file descriptor to the master side.
 *   zcons instances are associated with zones via their zonename device
 *   property.  This allows the console instance to persist across reboots,
 *   and while the zone is halted.
 *
 * - Initialization of the slave side of the console.  This is
 *   accomplished by pushing various STREAMS modules onto the console.
 *   The ptem(7M) module gets special treatment, and is anchored into
 *   place using the I_ANCHOR facility.  This is so that the zcons driver
 *   always has terminal semantics, as would a real hardware terminal.
 *   This means that ttymon(1M) works unmodified;  at boot time, ttymon
 *   will do its own plumbing of the console stream, and will even
 *   I_POP modules off.  Hence the anchor, which assures that ptem will
 *   never be I_POP'd.
 *
 * - Acting as a server for 'zlogin -C' instances.  When zlogin -C is
 *   run, zlogin connects to zoneadmd via unix domain socket.  zoneadmd
 *   functions as a two-way proxy for console I/O, relaying user input
 *   to the master side of the console, and relaying output from the
 *   zone to the user.
 */
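
The third task in that comment, the two-way proxy, can be sketched in portable C. This is a minimal illustration and not the zoneadmd code: console_proxy_step() and relay() are hypothetical names, client_fd stands in for the 'zlogin -C' socket, and master_fd stands in for the master side of the console.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Copy whatever is currently readable on `from` to `to`.
 * Returns the number of bytes relayed, 0 on EOF, or -1 on error.
 */
static ssize_t
relay(int from, int to)
{
	char buf[1024];
	ssize_t n, off, w;

	do {
		n = read(from, buf, sizeof (buf));
	} while (n < 0 && errno == EINTR);
	if (n <= 0)
		return (n);

	/* write(2) may be partial, so loop until everything is out */
	for (off = 0; off < n; off += w) {
		w = write(to, buf + off, (size_t)(n - off));
		if (w < 0) {
			if (errno != EINTR)
				return (-1);
			w = 0;	/* interrupted: retry the same offset */
		}
	}
	return (n);
}

/*
 * Hypothetical proxy step (not from zoneadmd): wait until either side
 * has data, then relay it to the other side.
 * Returns 1 if data was relayed, 0 on timeout, -1 on EOF or error.
 */
int
console_proxy_step(int client_fd, int master_fd, int timeout_ms)
{
	struct pollfd pfd[2];
	int r, moved = 0;

	pfd[0].fd = client_fd;
	pfd[0].events = POLLIN;
	pfd[0].revents = 0;
	pfd[1].fd = master_fd;
	pfd[1].events = POLLIN;
	pfd[1].revents = 0;

	r = poll(pfd, 2, timeout_ms);
	if (r < 0)
		return (errno == EINTR ? 0 : -1);
	if (r == 0)
		return (0);

	/* user input travels toward the console ... */
	if (pfd[0].revents & (POLLIN | POLLHUP)) {
		if (relay(client_fd, master_fd) <= 0)
			return (-1);
		moved = 1;
	}
	/* ... and console output travels back to the user */
	if (pfd[1].revents & (POLLIN | POLLHUP)) {
		if (relay(master_fd, client_fd) <= 0)
			return (-1);
		moved = 1;
	}
	return (moved ? 1 : -1);
}
```

A server built along these lines would call console_proxy_step() in a loop until it returns -1, which here signals that one side has gone away.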

One of the (in my opinion) elegant attributes of this design is that it defers as much as possible to userland. The zcons(7D) driver, in usr/src/uts/common/io/zcons.c, is only 707 lines of code (of which 155, or 22%, are comments). And happily, this code has mostly "just worked" since I integrated it, and has needed little maintenance. One bug in the zcons driver (4983336) was the result of changes I made following code review feedback (always a danger), which caused messages in the stream to occasionally arrive out of order-- messages could "pass" each other.

The only other kernel changes I made for the console were to pseudonex itself-- I needed to bring it into compliance with the devctl interfaces (part of the interface family we use for hotplug). That was so that we could dynamically instantiate new console nodes when we need them. When we online new zcons nodes, we make them children of a new zconsnex node, like this:

$ prtconf -P
...
    pseudo, instance #0
        zconsnex, instance #1
            zcons, instance #0
            zcons, instance #1
            zcons, instance #2
            zcons, instance #3

$ prtconf -v /devices/pseudo/zconsnex@1/zcons@0
zcons, instance #0
    Hardware properties:
        name='ddi-no-autodetach' type=int items=1
            value=00000001
        name='auto-assign-instance' type=int items=1
            value=00000001
        name='zonename' type=string items=1
            value='xanadu'
    Device Minor Nodes:
        dev=(227,1)
            dev_path=/pseudo/zconsnex@1/zcons@0:zoneconsole
                spectype=chr type=minor
                dev_link=/dev/zcons/xanadu/zoneconsole
        dev=(227,0)
            dev_path=/pseudo/zconsnex@1/zcons@0:masterconsole
                spectype=chr type=minor
                dev_link=/dev/zcons/xanadu/masterconsole

In the above, you can see that you can easily relate a zone console to the zone using it, via the 'zonename' property on the device node. To my annoyance, when I started to work on pseudonex.c, there was no header comment at all about how it worked! This was a real nuisance, as the pseudonex has some subtle behavior in its device instance number assignment. I left behind an improved header comment, but it could probably still use more work:

/*
 * Pseudo devices are devices implemented entirely in software; pseudonex
 * (pseudo) is the traditional nexus for pseudodevices.  Instances are
 * typically specified via driver.conf files; e.g. a leaf device which
 * should be attached below pseudonex will have an entry like:
 *
 *	name="foo" parent="/pseudo" instance=0;
 *
 * pseudonex also supports the devctl (see ) interface via
 * its :devctl minor node.  This allows privileged userland applications to
 * online/offline children of pseudo as needed.
 *
 * In general, we discourage widespread use of this tactic, as it may lead to a
 * proliferation of nodes in /pseudo.  It is preferred that implementors update
 * pseudo.conf, adding another 'pseudo' nexus child of /pseudo, and then use
 * that for their collection of device nodes.  To do so, add a driver alias
 * for the name of the nexus child and a line in pseudo.conf such as:
 *
 * 	name="foo" parent="/pseudo" instance= valid-children="bar","baz";
 *
 * Setting 'valid-children' is important because we have an annoying problem;
 * we need to prevent pseudo devices with 'parent="pseudo"' set from binding
 * to our new pseudonex child node.  A better way might be to teach the
 * spec-node code to understand that parent="pseudo" really means
 * parent="/pseudo".
 *
 * At some point in the future, it would be desirable to extend the instance
 * database to include nexus children of pseudo.  Then we could use devctl
 * or devfs to online nexus children of pseudo, auto-selecting an instance #,
 * and the instance number selected would be preserved across reboot in
 * path_to_inst.
 */
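
Reading the instance number and child name off the prtconf output shown earlier, the pseudo.conf entry for the console nexus presumably looks something like the line below. This is an inference from that output, not a quote from the shipped file, and the driver alias the comment mentions would be registered separately.

```
name="zconsnex" parent="/pseudo" instance=1 valid-children="zcons";
```

Here valid-children="zcons" is what would keep unrelated pseudo devices from binding beneath the new nexus.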

This much should have given you sufficient context to understand the code at the top half of usr/src/cmd/zoneadmd/zcons.c, which takes care of managing the zcons pseudo children. I was particularly happy with this code. There's something cool about specifying a new zones console device, and zapping it into existence all on the fly. This leads to the more subtle bug I faced: 4981626, which was reported just before one of the S10 beta releases, and so rapidly put me in the hot seat to root cause and fix it. The symptom was vexing: infrequently, multiprocessor systems would see one of their several zones fail to start up at boot time. Worse yet, the problem could only be seen on non-DEBUG systems, making it potentially even harder to track down. We had only some messages on the console to work from:

Jan 21 14:33:01 xanadu devfsadmd[320]: driver failed to attach: zcons
failed to create devlinks: No such device or address
console setup: device initialization failed
zoneadm: zone 'xanadu-z2': could not start zoneadmd
zoneadm: zone 'xanadu-z2': call to zoneadmd failed

After some head scratching and basic investigation with DTrace (the scripts for which, sadly, I've lost), I arrived at a hypothesis: we had a race condition in which the zcons device node (such as /devices/pseudo/zconsnex@1/zcons@7) was getting automatically torn down before the system had a chance to make the device sufficiently "busy" that the system would leave it alone. My initial picture of the race, which looked promising, was this:

zoneadmd:     devctl() to create zone console node
rc3:          modunload -i 0, (which tears down the zcons node)
zoneadmd:     ask devfsadmd to make links for the zcons driver
devfsadmd:    no such device as 'zcons' attached, fail.
zoneadmd:     call to devfsadmd fail!  fail to start up the zone.

You can see that this interaction is pretty complex. A little more digging revealed that this hypothesis was wrong, and that things were much worse. The first step was to isolate the problem, and get boot out of the way. I wrote a simple program called zcons_test. In one window, I ran zcons_test in a loop. This simple C program basically does nothing more than

        if ((hdl = di_devlink_init("zcons", DI_MAKE_LINK)) == NULL)
                perror("di_devlink_init");

(I'll leave it as an exercise for the reader to track down what di_devlink_init actually does). In another window, I ran 'modunload -i 146' in a loop (n.b. using whatever module number corresponded to zcons on that machine). This was run in a rigged-up environment in which nothing was holding the zcons driver busy. Here is what I saw on occasion (every minute or so) while running this:

# while :; do ./zcons_test; done
di_devlink_init: No such device or address
di_devlink_init: No such device or address

At this point, it was easy to use DTrace to narrow down where the ENXIO was coming from. I found the following snippet in di_ioctl, which is where we actually do the device online:

	modunload_disable();
	(void) i_ddi_load_drvconf(i);
	(void) ndi_devi_config_driver(ddi_root_node(), ndi_flags, i);
	kmem_free(drv_name, MAXNAMELEN);
	ddi_rele_driver(i);
	rv = i_ddi_devs_attached(i);
	modunload_enable();

	return ((rv == DDI_SUCCESS)? 0 : ENXIO);

Progress: it is this ENXIO which is causing the "No such device or address" message. But what the hell? Module unloading is disabled during this sequence of events, right? So how could the modunload loop be affecting this? I used DTrace to track down the call chain which was triggering the unload. It looked like this:

modctl(2)->
  modctl_modunload()->
    mod_uninstall_all()
      mod_uninstall()
        ...

The problem here would seem to be that the call to modunload_disable() doesn't in fact disable the modunload! It does manage to block some automatic, periodic module unloads (the mod_uninstall_daemon). Even more insidious is that the code for modctl_modunload() makes it appear that the "at-bootup" incarnation of this bug can *only* appear on non-DEBUG systems! In the end, I worked around the problem by adding the ddi-no-autodetach property to the zcons device nodes. I also filed 4988141, "modunload(1M) can race with di_ioctl(DINFOLODRV)", which will hopefully soon be fixed.
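
The shape of this bug can be captured in a toy model. Everything below is hypothetical: the toy_* names only mimic the kernel's and none of it is Solaris code. It assumes, as the description above suggests, that the disable count is consulted only on the periodic path, and that a node marked no-autodetach is left alone by both paths.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these names mimic, but are not, the kernel's. */
struct toy_node {
	bool attached;		/* is the driver instance attached? */
	bool busy;		/* is anything holding the node open? */
	bool no_autodetach;	/* models the ddi-no-autodetach property */
};

static int unload_disabled;	/* models the modunload_disable() count */

void toy_modunload_disable(void) { unload_disabled++; }
void toy_modunload_enable(void) { unload_disabled--; }

/* Shared teardown: detach an idle node unless it has opted out. */
static void
toy_uninstall(struct toy_node *n)
{
	if (!n->busy && !n->no_autodetach)
		n->attached = false;
}

/* Periodic path (models mod_uninstall_daemon): honors the count. */
void
toy_daemon_unload(struct toy_node *n)
{
	if (unload_disabled == 0)
		toy_uninstall(n);
}

/* Explicit path (models modctl(2) via modunload -i): never checks it. */
void
toy_explicit_unload(struct toy_node *n)
{
	toy_uninstall(n);
}

/*
 * Models the di_ioctl sequence: attach the node, let a racing unload
 * run while unloading is supposedly disabled, then check the result.
 */
int
toy_load_driver(struct toy_node *n, void (*racer)(struct toy_node *))
{
	int rv;

	toy_modunload_disable();
	n->attached = true;
	if (racer != NULL)
		racer(n);	/* another thread sneaks in here */
	rv = n->attached ? 0 : ENXIO;
	toy_modunload_enable();
	return (rv);
}
```

In this model, toy_load_driver() succeeds when the racer is toy_daemon_unload, fails with ENXIO when it is toy_explicit_unload, and succeeds again once no_autodetach is set on the node, which mirrors the workaround.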

So that is, as they say, the nickel tour!



(2005-06-14 08:05:32.0) Permalink Comments [2]
Trackback: http://blogs.sun.com/dp/entry/inside_the_zones_console_a
 

Dan Price's Weblog