rantings of a code minimalist :: meem simplex ::

Tuesday Sep 02, 2008

Creating Shell-Friendly Parsable Output

Being able to easily write scripts from the command-line has long been regarded as one of UNIX's core strengths. However, over the years, surprisingly little attention has been paid to writing CLIs whose output lend themselves to scripting. Indeed, even modern CLIs often fail to consider parsable output as a distinct concept from human output, leading to overwrought and fragile scripts which inevitably break as the CLI is enhanced over time. Some recent CLIs have "solved" the parsable format problem by using popular formats such as XML and JSON. These are fine formats for sophisticated scripting languages, but a poor match for traditional UNIX line-oriented tools (e.g. grep, cut, head) that form the foundation of shell-scripting.

Even those CLIs that consider the shell when designing a parsable output format often fall short of the mark. For dladm(1M), it took us (Sun) three tries to create a format that can be easily parsed from the shell. So, while the final format we settled on may seem simple and obvious, as is often the case, making things simple can prove to be surprisingly hard. Further, there are a number of alternative output formats that seem compelling at first blush but ultimately prove to be unworkable.

So that others working on similar problems may benefit, below I've summarized our set of guidelines -- some obvious, some not -- that we arrived at while working on dladm. As each CLI has its own constraints, not all of them may prove applicable, but I'd urge anyone designing a CLI with parsable output to consider each one carefully.

To provide some specifics to hang our guidelines on, first, here's an example of the dladm human output format:

  # dladm show-link -o link,class,over
  LINK        CLASS    OVER
  eth0        phys     --
  eth1        phys     --
  eth2        phys     --
  default0    aggr     eth0 eth1 eth2
  cyan0       vlan     default0
... and here's the equivalent parsable output format:
  # dladm show-link -p -o link,class,over
  eth0:phys:
  eth1:phys:
  eth2:phys:
  default0:aggr:eth0 eth1 eth2
  cyan0:vlan:default0
Now, the guidelines:
  1. Design CLIs that output in a regular format -- even in human output mode.

    Once your human output mode ceases to be regular (ifconfig(1M) output is a prime example of an irregular format), later adding a parsable output mode becomes difficult if not impossible. (As an aside, I've often found that irregular output suggests deeper design flaws, either in the CLI itself or the objects it operates on.)

  2. Prefer tabular formats in parsable output mode.

    Because traditional shell scripting works best with lines of information, tabular formats where each line both identifies and describes a unique object are ideal. For example, above, the link field uniquely identifies the object, and the class and over fields describe that object. In some cases, multiple fields may be required to uniquely identify the object (e.g., with dladm show-linkprop, both the link and the property field are needed). As an aside: in the multiple-field case, the human output mode may choose to use visual nesting (e.g., by grouping all of a link's properties together on successive lines and omitting the link value entirely), but it's important this not be done in parsable output mode so that the shell script can remain simple.

  3. Require output fields to be specified.

    Unlike humans, scripts always invoke a CLI with a specific purpose in mind. Also unlike humans, scripts are not facile at adapting to change (e.g., the addition of new fields). Thus, it's imperative that scripts be forced to explicitly specify the fields they need (with dladm, attempting to use -p without -o yields an error message). With this approach, new fields can be added to a CLI without any risk of breaking existing consumers. Further, if a field used by a script is removed, the failure mode becomes hard (the CLI will produce an error), rather than soft (the consumer misparses the CLI's output and does something unpredictable). Note that for similar reasons, if your CLI provides a way to print field subsets that may change over time (e.g., -o all), those must also fail in parsable output mode.

  4. Leverage field specifiers to infer field names.

    Because field names must be specified in an order, it's natural to use that same order as the output order, and thus avoid having to explicitly identify the field names in the parsable output format. That is, as shown above, dladm can omit indicating which field name corresponds with which value because the order is inferred from the invocation of -olink,class,over. This may seem a minor point, but in practice it saves a lot of grotty work in the shell to otherwise correlate each field name with its value.

  5. Omit headers.

    Similarly, because the field order is known (and no human will be staring at the output) there is no utility in providing a header in parsable output mode, and indeed its presence would only complicate parsing. As shown above, dladm omits the header in parsable output mode.

  6. Do not adorn your field values.

    In human output mode, it can be useful to give visual indications for specific field values. For instance, as shown above, dladm shows empty values as "--" in human output mode so that the table does not look malformed. In parsable output mode, such embellishments only complicate and confuse consumers of the data (and may in fact make it ambiguous), and thus should be avoided. As above, in parsable output format, empty values are shown as actually being empty.

  7. Do not use whitespace as a field separator.

    Whitespace may seem like a natural field separator, but in practice it's problematic. Specifically, many shells treat whitespace separators specially by merging consecutive instances into a single instance. For example, consider representing three consecutive empty values. With a non-whitespace field separator such as ":", this would be output as "::" (empty value 1, : separator, empty value 2, :, empty value 3). With the shell's IFS variable set to ":", the shell will parse this as three separate empty values, as intended. With space as the field separator, this would be output as "   ", and with IFS set to " " the shell would misparse this as a single empty value.

  8. Do not restrict your allowed field values.

    While some fields may be controlled directly by the CLI (e.g., the class field above), others are either outside of your direct control (e.g., the link field above), or outside of even your system's control (e.g., the essid field output by dladm show-wifi). As such, aside from ensuring the field value is printable ASCII (where newline is considered as unprintable), no values should be filtered out or forbidden[1].

    Thus, any values that have special meaning should generally be escaped. For instance, with ":" as a field delimiter, IPv6 address "fe80::1" would become "fe80\:\:1" when displayed in parsable output mode. Thankfully, escaping does not complicate shell parsing because all popular scripting shells have read builtins that will automatically strip escapes. Thus, the common idiom of piping the output to a read/while loop works as expected without any special-purpose logic. For instance, even though the BSSID field will contain embedded colons, this will loop through each BSSID on each link, trying to connect to one until it succeeds:

          dladm scan-wifi -p -o link,bssid |
          while IFS=: read link bssid; do
                  dladm connect-wifi -i $bssid $link && break
          done
        
    That said, if only a single field has been requested, the field separator is not needed. Since no ambiguity exists in that case, there's no need to escape it -- and not doing so can make things more convenient for other shell idioms -- e.g., to collect all in-range SSIDs:
          ssids=`dladm scan-wifi -p -o bssid`
        
I'd welcome hearing back from others who have tackled this problem.

[1] If unprintable ASCII values can legitimately occur in a given field's output, you need to use another encoding format.

Wednesday Aug 20, 2008

GNOME Home

Yes, it's been a whole year since I last posted a blog entry. Between moving from Boston to San Francisco (metro, anyway), countless urgent matters (both professional and personal), and wrapping up Clearview IPMP development (more on that real soon), blogging hasn't exactly been top priority. That said, I have amassed a really nice list of topics for future blog entries over the coming weeks (OK, maybe months ;-).

Before we get to all that though, I have an urgent tip for those who are using GNOME's Nautilus on OpenSolaris build 94 or later. It seems that the GNOME development team (not inside Sun) decided to change the Open Terminal menu item (available by right clicking on the desktop) to Open in Terminal, and correspondingly changed things so that the GNOME terminal will open in your ~/Desktop directory, rather than ~. The unmitigated idiocy and arrogance of this change is beyond comprehension, and the pain associated with it only intensifies with each opened terminal. Nonetheless, thankfully, there is a simple way to restore the previous (correct) behavior:

  gconftool-2 -s /apps/nautilus-open-terminal/desktop_opens_home_dir
              -t bool true 
        
Hope this saves some other poor soul from spending half a day digging through the GNOME sources for a solution.

Technorati Tag:
Technorati Tag:
Technorati Tag:

Thursday Aug 16, 2007

Linked

Solaris draws clear boundaries between IP interfaces, data-links, devices, and physical hardware. However, these boundaries are a frequent source of confusion, especially for migrants from other operating systems that do not have such clear delineations. Further, with data-link abstractions becoming ever-richer (via link aggregations, VLANs, IP tunnels -- and soon VNICs, vswitches, and vbridges), people have become increasingly confused about how the abstractions within and across each layer relate. As such, the Clearview team has been working closely with Sun's documentation writers to provide a background chapter (including illustrations) that illuminate the core abstractions.

Needless to say, I was thrilled to see my original skrawls turned into wonderful images like this one (apologies for the blurriness -- I had to shrink it to fit):

Above, one can see the flexible and powerful networking topologies that can be created simply from two common Sun networking cards (in this case, ce and qfe). Above the hardware layer, we see five devices -- one for the ce card, and four for the qfe card (the "q" stands for "quad"; qfe has four network ports on one card, which appear to the operating system as four independent devices).

Above the device layer, we see four physical links (shown in blue) that have been instantiated using those devices (the qfe1 device is unused). These links (as with all links) have been named by the administrator using Clearview's upcoming vanity naming feature. As illustrated, VLANs can be created over the links -- as can aggregations. Further, any of the links can also be instantiated at the IP layer (with their link name) using the ifconfig plumb subcommand. We also see that some links can exist independently of any specific underlying hardware -- such as vpn1, which uses the IP routing table to determine the actual link to direct a given packet to.

Finally, at the IP layer, we see that while most IP interfaces have a one-to-one relationship with an underlying datalink, some (such as lo0) have no underlying datalink, and others (such as eml3) group IP interfaces on the same IP broadcast domain together using IPMP (at least, they will once Clearview IPMP is complete).

Technorati Tag:
Technorati Tag:
Technorati Tag:

Wednesday May 16, 2007

Disruptor

If I may indulge my personal side for a second, congratulations to my father for being named one of Fortune Magazine's top 24 "disruptive innovators"! Since childhood, I've looked up to the passion, precision, integrity and imagination he brought to solving problems of all sizes -- and I continue to be amazed at both the magnitude and relevance of the problems he chooses to tackle today (at age 63, no less!) Needless to say, his approach to engineering (among many other things) has influenced me enormously, and I'm both proud and inspired.

(I could go on at length, but I'll save him the embarrassment ;-)

Wednesday Apr 25, 2007

IPMP Development Follow-up

Several folks have again (understandably) asked for updates on the Next-Generation IPMP work. Significant progress has been made since my last update. Notably:

  • Probe-based failure detection is operational (in addition to the earlier support for link-based failure detection).
  • DR support of interfaces using IPMP through RCM works. Thanks to the new architecture, the code is almost 1000 lines more compact than Solaris's current implementation -- and more robust.
  • Boot support is now complete. That is any number (including all) interfaces can be missing at boot and then transparently repaired during operation.
  • At long last, ipmpstat. As discussed in the high-level design document, this is a new utility that allows the IPMP subsystem to be compactly examined.

Since ipmpstat allows other aspects of the architecture to be succinctly examined, let's take a quick look at a simple two-interface group on my test system:

  # ipmpstat -g
  GROUP       GROUPNAME   STATE     FDT       INTERFACES
  net57       a           ok        10000ms   ce1 ce0

As we can see, the "-g" (group) output mode tells us all the basics about the group: the group interface name and group name (these will usually be the same, but differ above for illustrative purposes), its current state ("ok", indicating that all of the interfaces are operational), the maximum time needed to detect a failure (10 seconds), and the interfaces that comprise the group.

We can get a more detailed look at the IPMP health and configuration of the interfaces under IPMP using the "-i" (interface) output mode:

  # ipmpstat -i
  INTERFACE   ACTIVE  GROUP       FLAGS   LINK      PROBE     STATE
  ce1         yes     net57       ------  up        ok        ok
  ce0         yes     net57       ------  up        disabled  ok

Here, we can see that ce0 has probe-based failure detection disabled. We can also see issues that prevent an interface from being used (aka being "active") -- e.g., if suppose we enable standby on ce0:

  # ifconfig ce0 standby

  # ipmpstat -i
  INTERFACE   ACTIVE  GROUP       FLAGS   LINK      PROBE     STATE
  ce1         yes     net57       ------  up        ok        ok
  ce0         no      net57       si----  up        disabled  ok

We can see that ce0 is now no longer active, because it's an inactive standby (indicated by the "i" and "s" flags). This means that all of the addresses in the group must be restricted to ce1 (unless ce1 becomes unusable), which we can see via the "-a" (address) output mode ("-n" turns off address-to-hostname resolution):

  # ipmpstat -an
  ADDRESS             GROUP       STATE   INBOUND     OUTBOUND
  10.8.57.210         net57       up      ce1         ce1
  10.8.57.34          net57       up      ce1         ce1

For fun, we can offline ce1 and observe the failover to ce0:

  # if_mpadm -d ce1

  # ipmpstat -i
  INTERFACE   ACTIVE  GROUP       FLAGS   LINK      PROBE     STATE
  ce1         no      net57       ----d-  disabled  disabled  offline
  ce0         yes     net57       s-----  up        disabled  ok
[ In addition to the "offline" state, the "d" flag also indicates that all of the addresses on ce0 are down, preventing it from receiving any traffic. ]
  # ipmpstat -an
  ADDRESS             GROUP       STATE   INBOUND     OUTBOUND
  10.8.57.210         net57       up      ce0         ce0
  10.8.57.34          net57       up      ce0         ce0
We can also convert ce0 back to a "normal" interface, online ce1 and observe the load spreading configurations:
  # ifconfig ce0 -standby
  # if_mpadm -r ce1

  # ipmpstat -i
  INTERFACE   ACTIVE  GROUP       FLAGS   LINK      PROBE     STATE
  ce1         yes     net57       ------  up        ok        ok
  ce0         yes     net57       ------  up        disabled  ok

  # ipmpstat -an
  ADDRESS             GROUP       STATE   INBOUND     OUTBOUND
  10.8.57.210         net57       up      ce0         ce1 ce0
  10.8.57.34          net57       up      ce1         ce1 ce0
In particular, this indicates that incoming traffic to 10.8.57.210 will go to ce0 and inbound traffic to 10.8.57.34 will go to ce1 (as per the ARP mappings). However, outbound traffic will potentially flow over either interface (though to sidestep packet ordering issues, a given connection will remain latched unless the interface becomes unusable).

This also highlights another aspect of the new IPMP design: the kernel is responsible for spreading the IP addresses across the interfaces (rather than the administrator). The current algorithm simply attempts to keep the number of IP addresses "evenly" distributed over the set of interfaces, but more sophisticated policies (e.g., based on load measurements) could be added in the future.

To round out the ipmpstat feature set, one can also monitor the targets and probes used during probe-based failure detection:

  # ipmpstat -tn
  INTERFACE   MODE      TESTADDR            TARGETS
  ce1         mcast     10.8.57.12          10.8.57.237 10.8.57.235 10.8.57.254 10.8.57.253 10.8.57.207
  ce0         disabled  --                  --
Above, we can see that ce1 is using "mcast" (multicast) mode to discover its probe targets, and we can see the targets it has decided to probe, in firing order. We can also look at the probes themselves, in real-time:
  # ipmpstat -pn
  TIME      INTERFACE   PROBE     TARGET              RTT       RTTAVG    RTTDEV
  1.15s     ce1         112       10.8.57.237         1.09ms    1.14ms    0.11ms
  2.33s     ce1         113       10.8.57.235         1.11ms    1.18ms    0.13ms
  3.94s     ce1         114       10.8.57.254         1.07ms    2.10ms    2.00ms
  5.38s     ce1         115       10.8.57.253         1.08ms    1.14ms    0.10ms
  6.19s     ce1         116       10.8.57.207         1.43ms    1.20ms    0.19ms
  7.73s     ce1         117       10.8.57.237         1.04ms    1.13ms    0.11ms
  9.47s     ce1         118       10.8.57.235         1.04ms    1.16ms    0.13ms
  10.67s    ce1         119       10.8.57.254         1.06ms    1.97ms    1.76ms
  ^C
Above, the inflated RTT average and standard deviation for 10.8.57.254 indicate that something went wrong with 10.8.57.254 in the not-too-distant past. (As an aside: "-p" also revealed a subtle longstanding bug in in.mpathd that was causing inflated jitter times for probe targets; see 6549950.)

Anyway, hopefully all this gives you not only a feel for ipmpstat, but a feel for how development is progressing. It should be noted that several key features are still missing, such as:

  • Broadcast and multicast support on IPMP interfaces.
  • IPv6 traffic on IPMP interfaces.
  • IP Filter support on IPMP interfaces.
  • MIB and kstat support on IPMP interfaces.
  • DHCP on IPMP interfaces.
  • Sun Cluster support.
All of these are currently being worked on. In the meantime, we will be making early-access BFU archives based on what we have so far to those who are interested in kicking the tires. (And a big thanks to those customers who have already volunteered!)

Technorati Tag:
Technorati Tag:
Technorati Tag:

Tuesday Feb 06, 2007

IPMP Development Update

A number of people have sent me emails asking for updates on the Next-Generation IPMP work. In short, there's a lot to do, but development is progressing smoothly and early-access bits are on the horizon[1]. At this point, one can:

  • Create, destroy, and reconfigure IPMP groups with arbitrary numbers of interfaces and IP addresses, using either the legacy or new administrative model.
  • Load-spread inbound and outbound traffic across the interfaces and addresses. As per the new model, all IP addresses are hosted on "IPMP" interfaces and the kernel handles the binding of IP addresses to interfaces in the group internally. There is no longer a visible concept of failover or failback.
  • Use in.mpathd to track the failure and repair of interfaces. It notifies the kernel of these changes so that the kernel can update its interface-to-address bindings.
  • Use if_mpadm to offline and undo-offline interfaces. Again, this causes the kernel to update its interface-to-address bindings.

To illustrate where I'm at, let me use last night's build to show the lay of the land. (What's been implemented is almost identical to what was proposed in the high-level design document -- so please consult that document for additional background.) For starters, one can use the old IPMP administrative commands as before -- e.g., to create a two-interface group with two IP data addresses:

# ifconfig ce0 plumb group ipmp0 10.8.57.34/24 up
# ifconfig ce1 plumb group ipmp0 10.8.57.210/24 up
But what you end up with looks a bit different:
# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
ce0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 0.0.0.0 netmask ff000000
        groupname ipmp0
        ether 0:3:ba:94:3b:74
ce1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
        inet 0.0.0.0 netmask ff000000
        groupname ipmp0
        ether 0:3:ba:94:3b:75
ipmp0: flags=8001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,IPMP> mtu 1500 index 3
        inet 10.8.57.34 netmask ffffff00 broadcast 10.8.57.255
        groupname ipmp0
ipmp0:1: flags=8001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,IPMP> mtu 1500 index 3
        inet 10.8.57.210 netmask ffffff00 broadcast 10.8.57.255
Above, we can see that ifconfig has created an ipmp0 interface for IPMP group a and placed the two data addresses that we configured onto it. The ce0 and ce1 interfaces have no actual addresses configured on them (though they would if we'd configured test addresses), but are marked UP so that they can be used to send and receive traffic. Note that ipmp0 is marked with a special IPMP flag to indicate that it is an IP interface that represents an IPMP group.

Though the legacy configuration works, we will recommend configuring IPMP through the new model, since it better expresses the intent. The same configuration as above would be achieved instead by doing:

# ifconfig ipmp0 ipmp 10.8.57.34/24 up addif 10.8.57.210/24 up
# ifconfig ce0 plumb group ipmp0 up
# ifconfig ce1 plumb group ipmp0 up
Note the presence of the ipmp keyword, which tells ifconfig that the interface represents an IPMP group. Because of this keyword, an IPMP interface can actually be given any valid unused IP interface name -- e.g., ifconfig xyzzy0 ipmp will create an IPMP interface named xyzzy0. This follows the Project Clearview tenet that IP interface names must not be tied to the interface type -- which in turn allows one to roll out new networking technologies without disturbing the system's higher-level network configuration.

In general, an IPMP interface can be used like any other IP interface -- e.g., to create a default route through ipmp0, we can do:

# route add default 10.8.57.248 -ifp ipmp0 
We can also examine the ARP table to see the current distribution of ipmp0's IP addresses to IP interfaces in the group (once development is complete, this will be able to be done more easily with ipmpstat):
# arp -an | grep ipmp0
ipmp0  10.8.57.34           255.255.255.255 SPLA     00:03:ba:94:3b:74
ipmp0  10.8.57.210          255.255.255.255 SPLA     00:03:ba:94:3b:75
Here, we see that 10.8.57.34 is using ce0's hardware address, and 10.8.57.210 is using ce1's hardware address. If we offline ce0, we can see the kernel will change the binding:
# if_mpadm -d ce0
# arp -an | grep ipmp0
ipmp0  10.8.57.34           255.255.255.255 SPLA     00:03:ba:94:3b:75
ipmp0  10.8.57.210          255.255.255.255 SPLA     00:03:ba:94:3b:75
One interesting consequence of the new design is that it's possible to remove all of the interfaces in a group and still preserve the IPMP group configuration. For instance:
# ifconfig ce0 unplumb
# ifconfig ce1 unplumb
# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
ipmp0: flags=8001000803<UP,BROADCAST,MULTICAST,IPv4,IPMP> mtu 1500 index 3
        inet 10.8.57.34 netmask ffffff00 broadcast 10.8.57.255
        groupname ipmp0
ipmp0:1: flags=8001000803<UP,BROADCAST,MULTICAST,IPv4,IPMP> mtu 1500 index 3
        inet 10.8.57.210 netmask ffffff00 broadcast 10.8.57.255
Since all of the network configuration (e.g., the routing table) is tied to ipmp0 rather than to the underlying interfaces, it's unaffected. However, of course, no network traffic can flow through ipmp0 until another interface is placed back into the group -- as evidenced by the fact that the RUNNING flag has been cleared on ipmp0.

Those familiar with the existing IPMP implementation may be asking yourself what's left to do. The answer is "quite a bit". Notable current omissions include:

  • Broadcast and multicast support on IPMP interfaces.
  • IPv6 traffic on IPMP interfaces.
  • Probe-based failure detection.
  • DR support of interfaces using IPMP through RCM.
  • MIB and kstat support on IPMP interfaces.
  • DHCP over IPMP interfaces.
  • ipmpstat

Of the above, the first four are supported by the existing IPMP implementation and are (with minor exceptions) requirements for any early-access candidate. That said, as I mentioned earlier, development is proceeding at a good clip -- especially now that the hairy IP configuration multithreading model as been tamed, and several lethal bugs in IP have been nailed[2]. So stay tuned.

Footnotes

[1] If you're a Sun customer interested in kicking the tires in a pre-production environment, please send an email to meem AT eng DOT sun DOT com.
[2] For instance, see http://mail.opensolaris.org/pipermail/clearview-discuss/2007-February/000685.html or http://mail.opensolaris.org/pipermail/clearview-discuss/2007-February/000661.html.

Technorati Tag:
Technorati Tag:

Monday Dec 04, 2006

WiFi/GLDv3

Now that the new Solaris WiFi architecture has integrated into Build 54 of Nevada, this seemed an appropriate time to interrupt the static with a fresh blog entry. In short, WiFi's integration represents both a new beginning for WiFi on Solaris and a milestone for the new architecture of Solaris networking. I realize such hyperbole may set off more than a few BS-meters, so let me back these statements up with some specifics.

First, with regard to WiFi:

  1. The kernel now has first-class support for the WiFi link-layer protocols. Previously, WiFi drivers in Solaris masqueraded as Ethernet drivers, and internally performed header translations to send and receive actual WiFi link-layer frames.

    Having first-class support for WiFi simplifies our codebase, reduces per-packet overhead, introduces WiFi-specific kstats, and opens the door for network sniffers like snoop and Ethereal to directly interpret WiFi frames on Solaris via the new DLIOCNATIVE ioctl.

  2. Similarly, the kernel's GLDv3 networking driver framework now natively supports WiFi, allowing WiFi to be seamlessly handled by the protocol stack. Accordingly, the bundled ath driver has been ported to GLDv3, and all unbundled drivers have either been ported to GLDv3 or in the process of being ported. Previously, WiFi drivers were relegated to the GLDv2 framework, which is no longer under active development.

  3. The kernel now has a dedicated net80211 kernel module which facilitates code sharing across WiFi drivers. This kernel module is based closely on the mature and robust FreeBSD 7 WLAN module, allowing us to easily incorporate enhancements as they become available. Previously, different versions of the WLAN framework had been directly linked into each driver, which was a significant maintenance and support hazard.

  4. As a result of (2), WiFi drivers can now be managed using our GLDv3 administration command, dladm. For instance, running dladm show-link on a laptop now shows any available WiFi links alongside the Ethernet links. In addition, new dladm subcommands have been added to allow WiFi administration -- e.g., to connect to the most optimal unencrypted WiFi link in-range, just run dladm connect-wifi. Or you can create keys with dladm create-secobj and then pass those keys to connect-wifi. Check out the EXAMPLES section of the latest dladm manpage for specifics.

Collectively -- and in combination with other smaller improvements -- I hope you agree that this adds up to a solid foundation for WiFi on Solaris. Moreover, this foundation greatly benefits the development of our two follow-on WiFi projects -- specifically, WPA/WPA2 support, and bundled support for many more WiFi chipsets.

Now, with regard to Solaris Networking, WiFi both builds on recent work by other projects and paves the way for ongoing and future projects:

  1. The new WiFi support in GLDv3 makes use of the MAC Type plugin architecture integrated by Project Clearview into Build 44 of Nevada. In fact, two plugin operations -- mtops_header_cook() and mtops_header_uncook() -- were designed around WiFi's requirements. The entire mac_wifi plugin source is just 415 lines of straightforward code, including comments. Without the MAC Type plugin architecture, native WiFi support would have been significantly more complex and less elegant.

  2. The userland WiFi architecture was designed to be easily used by future administrative tools -- especially Network Auto-Magic (NWAM). Specifically, the heavy lifting for the new WiFi dladm subcommands is actually done by the new libwladm library, rendering dladm mostly a trivial wrapper around the libwladm routines.

    This separation of WiFi mechanism from UI policy allows other tools such as NWAM to make use of the WiFi framework without having to resort to ugly calls to dladm or (gag) code duplication. Note that libwladm is currently Consolidation Private, and thus is only safe to call from code in the ON consolidation.

  3. Enhancing dladm to support WiFi required new facilities for administering WEP keys and WiFi properties (e.g., radio and powermode settings). However, to keep the dladm administrative model simple and extensible, we introduced two new generic dladm facilities: link properties and secure objects.

    While these facilities are only used by WiFi at present, future projects will extend them in new directions. For instance, both Project Clearview and IP Instances already have new link properties planned (autopush and zone respectively), and an upcoming project to take administration out of the ndd(1M) stone-age will likely build extensively on link properties. Similarly, future secure objects will exist for WPA/WPA2 keys and perhaps other secure data such as certificates.

Looking ahead to 2007, Project Clearview, NWAM, Crossbow, and IP Instances will all make use of features introduced by WiFi - and with features introduced by one another -- to collectively realize a "new world order" for Solaris networking. Stay tuned, it's going to rock.


Peter Memishian
Last modified: Mon Dec 4 16:21:48 EST 2006 Technorati Tag:
Technorati Tag:

Monday Jul 31, 2006

On Locking

Recently, I've been hip-deep in the 105,000 lines of heavily-multithreaded code that comprise Solaris's IPv4/IPv6 implementation, finishing the bring-up of our new IP Network Multipathing (IPMP) implementation for Clearview. Along the way, I've been reminded of some collected wisdom regarding locking that could use wider dissemination:

  1. An object cannot synchronize its own visibility.

    Most long-lived objects are put into a structure such as a hash table that allows them to be looked up at some point in the future. In order to track the number of threads that currently have an object "checked out", objects are usually reference counted, with the reference count itself being manipulated under the object's lock.

    Things get tricky when the object must go away. In particular, many make the mistake of trying to synchronize the object's removal from visibility under the object's lock when the reference count reaches one. This is not possible: any thread looking up the object -- by definition -- does not yet have the object and thus cannot hold the object's lock during the lookup operation (unless the object lock is not tied to the object itself -- see (2)). Thus, another thread can race and acquire another reference at the same time the object is being destroyed, leading to incorrect behavior. Thus, whatever higher-level synchronization is used to coordinate the threads looking up the object must also be used as part of removing the object from visibility.

    Once the object has been removed from visibility, an object can indeed synchronize its own destruction [1]. The simplest approach is to have the object's destruction done by whatever thread causes the object's reference count to reach zero [2] -- that is, "if you're the last one out, turn off the lights". Note that this logic can be part of the standard code that decrements the object's reference count, but will be guaranteed to be unreachable until the object is removed from visibility. This is because the object's reference count will be incremented when it is made visible to another structure (e.g., the hash table providing its visibility), and that reference will remain until the object is removed from visibility.

  2. An object should not synchronize itself without due cause.

    This is a generalization of (1). When building objects that are intended to be used in a multithreaded environment, it is tempting to build the locks into the objects themselves. For instance, a stack object might contain an internal lock to ensure that multiple threads issuing a push or pop operation simultaneously will operate without corrupting the underlying stack object or returning erroneous results. Languages like Java have promoted this sort of locking to a first-class concept through the "synchronized" keyword.

    While this "works" in the small, it misses the big picture. Specifically, each of those aforementioned threads working on the stack object was performing its pushes and pops as part of accomplishing some larger task. Those larger tasks are indeed what need to be synchronized with one another, so that they appear atomic to each other as a whole. However, only the callers (the threads using the stack objects) have insight into the granularity and semantics of those tasks and the objects that comprise them -- so only they can implement that locking. However, once that locking has been implemented, any internal object locks performing similar functions become superfluous, and only end up complicating the object's implementation (e.g., to avoid recursive mutex locking).

    Thus, many objects are better off leaving their synchronization to their callers, since those users will have to synchronize between each other anyway. Of course, objects will in turn use other objects -- so it's still quite likely that an object will have embedded locks. However, those locks will be used to synchronize access to the objects it's using, rather than attempting to synchronize its own use across multiple threads. In short, locks are best kept external to the data structures they manage, unless the structure itself must support high-performance concurrent access.

  3. A condition can be signaled without holding the condition's lock.

    There is a longstanding and deeply embedded superstition that signaling a condition without holding that condition's lock will compromise correctness. In fact, Solaris's own cond_signal(3C) manpage contains:

          Both functions should be called under the protection of the same
          mutex that is used with the condition variable being signaled.
          Otherwise, the condition variable may be signaled between the test
          of the associated condition and blocking in cond_wait().  This can
          cause an infinite wait.
    
    This is completely false: the thread heading into cond_wait() must have already tested the condition under the lock and concluded it was false in order to decide to cond_wait(). Since any thread changing state that would affect the condition must also be holding the lock, there is no way for the state (and thus the outcome of the test) to change beween the test and the cond_wait(), and thus any cond_signal() sent during that window would end up being spurious anyway. Accordingly, 6437070 has been filed.

    All that said, it may be true that signaling the condition while holding the lock may improve overall determinism since it eliminates a possible avenue for priority inversion. It also may waste cycles since the thread being signalled may not be able to grab the lock (if the signaling thread has not yet dropped it). This recent thread on mdb-discuss contains more on this issue.

Footnotes

[1] As David Powell mentioned to me while discussing this blog entry, "This problem is frequently misrepresented as an inability to synchronize against one's own destruction To be precise, the problem is that an object can't synchronize its own visibility."
[2] The other common option is to have the thread removing the object to wait for the reference counts to drop, but that forces the thread to block for a potentially unbounded period of time, and should only be used if required for correctness.

Technorati Tag:
Technorati Tag:

Friday May 12, 2006

Clearview Updates

Clearview development has been proceeding at a rapid pace -- here's a quick update on the milestones reached over the past month:

  • Thanks especially to Sebastien's hard work, the Nemo Binary Compatibility and Nemo Generalization components have been reviewed by the OpenSolaris community, approved by our internal architectural review board, and are nearing integration into OpenSolaris. With these, non-Ethernet Nemo drivers (such as the Clearview IP Tunneling driver) will be able to be written -- not to mention Nemo WiFi drivers (which are almost ready for intergation into OpenSolaris as well). This work also brings us a step closer to making the Nemo interfaces available for third-party use, and paves the way for TCP LSO support.
  • Cathy and Dan Groves have published a proposal for improving the observability of VLAN's, which will be submitted to the our architecture review board shortly. These changes are necessary for Clearview's Nemo Unification component, but are also quite useful in their own regard since they make it significantly easier to track down networking problems that occur on VLANs. Code that implements the proposal is already running internally, and is also destined for a nearby build of OpenSolaris.
  • Phil Kirk has published our proposed architecture for IP Observability. With this work comes the vital ability to debug intra-zone and inter-zone networking problems using traditional utilities such as ethereal and snoop -- along with opening the door for interesting possibilities such as inter-zone IDS's. Again, the code that implements this proposal is already running internally.
  • Sagun Shakya has published our proposal for a public library that can be used to communicate with link-layer devices via DLPI. This work is necessary for [Vanity Naming], but also allows us to centralize all application-level DLPI handling -- and to torch thousands of lines of tedious and obscure code. Expect a revised proposal -- based on our experiences of porting the Solaris DHCP client to use it -- to be posted to OpenSolaris shortly.
I'm also happy to report that we will shortly be making builds of the Clearview gate available to the OpenSolaris community. These early-access bits will include everything mentioned above.

And finally, as promised, here are my photos from Sichuan. Thanks again to Cathy Zhou and her family for taking me on this amazing trip.

Technorati Tag:
Technorati Tag:

Friday Mar 24, 2006

An Update (From China)

Greetings from Beijing! Actually, I feel silly saying that since I've been here for the past 2 months on rotation, and I am only a week away from departure. This is my second trip to Beijing, and I've enjoyed it even more than the first -- though the most exciting part of my trip is yet to come: a week-long trip to Sichuan province -- a place legendary for its beautiful land, beautiful ladies, and divinely spicy cuisine. I plan on assembling a photo album of my trip, which I will post here shortly.

As I must pack for the trip, I have limited time to write, but I would like to make folks aware of a few new items we've been busy on with regard to Solaris networking:

  • Clearview is now an "official" OpenSolaris Project. For those trying to wrap their heads around it, we also now have a set of slides which provide a high-level overview. I'm also happy to report that development is proceeding at a rapid pace, and that some components of Clearview will be available for experimentation shortly (via the OpenSolaris Project page).
  • In WiFi news, our long-term proposal for Solaris WiFi administration is under active discussion on the OpenSolaris Networking Forum. Now's the time to speak up if you have strong opinions on such matters.

Technorati Tag:
Technorati Tag:

Tuesday Jan 03, 2006

Private vs. Secret

Some personal responses I got to why lsof does not build in OpenSolaris build 27 made it clear that the distinction we make between private and secret has not been well-communicated.

Specifically, several were confused by my comment that <inet/udp_impl.h> "is not shipped". By this, I meant that it is not part of any OpenSolaris package, and thus that it will not be installed as /usr/include/inet/udp_impl.h on a machine running OpenSolaris. Thus, <inet/udp_impl.h> is what we consider to be a private header file: its contents represent private interfaces that we do not want external software to develop dependencies on[1]. This is the essence of good software engineering: making sure that each software layer depends only on well-defined interfaces from other layers. Moreover, well-defined interfaces allow stability levels to be specified (see attributes(5)) which allow the volatility of each interface to be known by its consumers.

Unfortunately, for a variety of (mostly historical) reasons, many header files containing only private interfaces have been shipped over the years. Moreover, there is not widespread consensus that shipping private header files is a bad idea -- many feel that there is a de facto understanding that undocumented header files are private, and thus that all headers, private or not, should be shipped with the system[2].

Regardless, while <inet/udp_impl.h> is private, it is not secret. That is, although <inet/udp_impl.h> is not shipped with OpenSolaris, it is part of the OpenSolaris source distribution, and available for anyone to examine. In contrast, a secret header file is one that we are legally prohibited from making available as part of OpenSolaris. For instance, the LLC2 driver was developed by a third party, and the terms of the agreement grant only Sun employees (and those under NDA) access to its source. Thus, <sys/llc2.h> is a secret header file, as we are legally prohibited from including <sys/llc2.h> with OpenSolaris. By contrast with open, we term all of our secret source files closed. These files reside in a parallel directory hierarchy of the Solaris ON gate rooted at usr/closed. The OpenSolaris source tree contains everything in the ON gate except for those files under usr/closed.

So, to summarize: private header files are not installed under /usr/include to keep the product more robust and flexible. However, they are part of the OpenSolaris source tree. In contrast, secret (or closed) header files are not installed under /usr/include because we are legally prohibited from doing so. Further, they are not part of the OpenSolaris source tree because we are legally prohibited from doing so.

Footnotes

[1] These dependencies usually happen accidentally, but sometimes they are on purpose. For instance, the aforementioned lsof utility uses many data structures that are contained in private header files, including <inet/ipclassifier.h>. The right long-term answer is to provide public, well-defined interfaces that these utilities can depend on.
[2] This is something I vehemently disagree with. As the lsof example makes clear, once an interface is shipped, external software is tempted to make use of it. These dependencies often remain unknown until an innocent developer needs to change the private interfaces and breaks a popular software package, affecting myriad end-users. Moreover, the "relief" provided by these private interfaces lowers the internal priority of developing proper well-defined interfaces.

Technorati Tag:
Technorati Tag:

Saturday Nov 26, 2005

Voodoo vs. Engineering

Even at my stately pace for blog entries, it's been a while. Truthfully, the last few months have been a blur, as I've been completely consumed by two projects: Clearview, and the future of WiFi on Solaris (more on that in a future blog entry). On the Clearview front, the IPMP design review on OpenSolaris has wrapped up, the IP Tunnel Rearchitecture and IP Observability Device design reviews are nearing completion, and Cathy's formidable "Nemo Unification and Vanity Naming" design is about to get underway. Thanks again to everyone who has contributed -- we really appreciate the feedback -- and keep it coming!

Each time I go through the design phase[1] of a project, I'm reminded of just how hard a process it is do with rigor and integrity. Performed as intended, the design phase forces us to confront and address the critical flaws at the start of a project, rather than right before integration -- or in the case of the technologies Clearview's rearchitecting, long after shipment. With Clearview, we've spent many months carefully working through the design process, constructing crisp, precise, and thorough design materials that collectively top 200 pages.

Along the way, I occasionally find it necessary to seek more real-world examples of why we are willing to endure such pain. The most entertaining method is to don the hazmat suit, wade into the toxic anarchy of subsystems from years gone by, and bear witness to the nightmare that results from inadequate design.

Though under-designed (or just plain never-designed) subsystems make themselves known in many capacities, one of their hallmarks is the presence of voodoo coding. Specifically, because the forces shaping the design space were never really understood, the developer must iterate until something finally "works". Code from previous iterations or stillborn ideas of how to solve unexpected problems are left behind as the logic is debugged into existence. Eventually, persistence prevails over common sense, and the code is checked in, complete with ritual sacrifices. As the years go by, subsequent developers become convinced that there is something tricky going on that they simply aren't capable of understanding, and the subsystem in question becomes steeped in mysticism.

As I've mentioned before, one subsystem that has more than its fair share of arcana is STREAMS -- which is undoubtedly why I find it so fascinating. About once a release, I like to go in and haul out a few thousand lines[2] of nonsense. While I started on the periphery, at this point I've plucked the low-hanging fruit and must instead target the wizened roots that comprise the core paths. Once such routine is str_mate(), which "twists" two streams together so that they can act as pipe or a FIFO. Since its introduction more than a decade ago, it has looked like this (line numbers added for subsequent discussion):

     1    /*
     2     *  No activity allowed on streams
     3     *  Input: 2 write driver end write queues
     4     *  Return 0 if error
     5     *  If wrq1 == wrq2 then we are doing a loop back,
     6     *  otherwise just mate two queues.
     7     *  If these queues are not the stream head they must have
     8     *  a service procedure
     9     *  It is up to caller to ensure that neither queue goes away.
    10     *  XXX str_mate() and str_unmate() should be moved to new file kstream.c
    11     *  XXX these routines need to be a little more general
    12     */
    13    int
    14    str_mate(queue_t *wrq1, queue_t *wrq2)
    15    {
    16            /*
    17             * Loop back?
    18             */
    19            if (wrq2 == 0 || wrq1 == wrq2) {
    20                    /*
    21                     * driver end of stream?
    22                     */
    23                    if (! (wrq1->q_flag & QEND))
    24                            return (EINVAL);
    25    
    26                    ASSERT(wrq1->q_next == 0);      /* sanity */
    27    
    28                    /*
    29                     * connect write queue to read queue
    30                     */
    31                    wrq1->q_next = _RD(wrq1);
    32    
    33                    /*
    34                     * If write queue does not have a service routine,
    35                     * assign the forward service procedure from the
    36                     * read queue
    37                     */
    38    
    39                    if (! wrq1->q_qinfo->qi_srvp)
    40                            wrq1->q_nfsrv = _RD(wrq1)->q_nfsrv;
    41                   /*
    42                     * set back service procedure..
    43                     * XXX - note back service procedure is not implemented
    44                     * this may cause a race condition, breaking it
    45                     * a bit more.
    46                     */
    47                    _RD(wrq1)->q_nbsrv = wrq1;
    48            } else {
    49                    /*
    50                     * driver end of stream?
    51                     */
    52                    if (! (wrq1->q_flag & QEND))
    53                            return (EINVAL);
    54                    if (! (wrq2->q_flag & QEND))
    55                            return (EINVAL);
    56    
    57                    ASSERT(wrq1->q_next == NULL);   /* sanity */
    58                    ASSERT(wrq2->q_next == NULL);   /* sanity */
    59    
    60    
    61                    /*
    62                     * if first queue is a stream head, second must
    63                     * must also be one
    64                     */
    65                    if (! (_RD(wrq1)->q_flag & QEND)) {
    66                            if (_RD(wrq2)->q_flag & QEND)
    67                                    return (EINVAL);
    68                    } else if (! (_RD(wrq2)->q_flag & QEND))
    69                            return (EINVAL);
    70                    /*
    71                     * Twist the stream head queues so that the write queue
    72                     * points to the other stream's read queue.
    73                     */
    74                    wrq1->q_next = _RD(wrq2);
    75                    wrq2->q_next = _RD(wrq1);
    76    
    77                    if (! wrq1->q_qinfo->qi_srvp)
    78                            wrq1->q_nfsrv = _RD(wrq2)->q_nfsrv;
    79                    if (! wrq2->q_qinfo->qi_srvp)
    80                            wrq2->q_nfsrv = _RD(wrq1)->q_nfsrv;
    81    
    82                    /*
    83                     * Nothing really uses the back service routines,
    84                     * but fill them in for completeness
    85                     */
    86    
    87                    _RD(wrq1)->q_nbsrv = wrq2;
    88                    _RD(wrq2)->q_nbsrv = wrq1;
    89    
    90                    SETMATED(STREAM(wrq1), STREAM(wrq2));
    91            }
    92            return (0);
    93    }

As a developer, this is the kind of ghetto you hope you never have to enhance or debug: gnarly logic, unsettling comments, and suspect codepaths abound. Further, as a code minimalist, it's a personal affront to everything I hold dear.

The function's biggest problem is that it is pointlessly general-purpose (despite the contrary claim on line 11), undoubtedly because no one ever bolted down its requirements. Specifically, the function signature and comments suggest that it can mate any two STREAMS driver queues together at any time. However, because of the limitations expressed on lines 2, 7, 8, and 9, this was never possible in practice -- instead, the function was only used to mate stream head queues before they were put into use. In fact, all of its callers go through extra work to accommodate this needless generality:

                if (dotwist && firstopen) {
                        queue_t *wq = strvp2wq(*vpp);
                        (void) str_mate(wq, wq);
                }

Adding insult to injury, str_mate() can never fail when passed stream head queues -- so, as shown above, all callers simply discard the return value. Indeed, by simply changing str_mate() to accept two vnode pointers and letting it derive the stream heads to mate, we can (a) simplify its callers, (b) simplify its interface, and (c) simplify its implementation. With this change, callers will just do:

                if (dotwist && firstopen)
                        str_mate(*vpp, *vpp);

Regarding (c):

  • Because the queues are guaranteed to be stream heads, lines 61-69 can be removed
  • Because stream heads always have QEND set, lines 20-23 and 49-55 can be removed
  • Because stream heads always have a service procedure, lines 33-40 and 77-81 can be removed. However, since one day this may not be true, an ASSERT() should be added to avoid debugging headaches.
After our first pass, things are starting to look more comprehensible -- and at no loss of functionality:

     1    /*
     2     * Mate the stream heads of two vnodes together. If the two vnodes are the
     3     * same, we just make the write-side point at the read-side -- otherwise,
     4     * we do a full mate.  Only works on vnodes associated with streams that
     5     * are still being built and thus have only a stream head.
     6     */
     7    void
     8    str_mate(vnode_t *vp1, vnode_t *vp2)
     9    {
    10            queue_t *wrq1 = strvp2wq(vp1);
    11            queue_t *wrq2 = strvp2wq(vp2);
    12  
    13            /* 
    14             * We rely on the stream head always having a service procedure
    15             * to avoid tweaking q_nfsrv.
    16             */
    17            ASSERT(wrq1->q_qinfo->qi_srvp != NULL);
    18            ASSERT(wrq2->q_qinfo->qi_srvp != NULL);
    19
    20            /*
    21             * Loop back?
    22             */
    23            if (wrq2 == 0 || wrq1 == wrq2) {
    24                    ASSERT(wrq1->q_next == 0);      /* sanity */
    25    
    26                    /*
    27                     * connect write queue to read queue
    28                     */
    29                    wrq1->q_next = _RD(wrq1);
    30  
    31                    /*
    32                     * set back service procedure..
    33                     * XXX - note back service procedure is not implemented
    34                     * this may cause a race condition, breaking it
    35                     * a bit more.
    36                     */
    37                    _RD(wrq1)->q_nbsrv = wrq1;
    38            } else {
    39                    ASSERT(wrq1->q_next == NULL);   /* sanity */
    40                    ASSERT(wrq2->q_next == NULL);   /* sanity */
    41    
    42                    /*
    43                     * Twist the stream head queues so that the write queue
    44                     * points to the other stream's read queue.
    45                     */
    46                    wrq1->q_next = _RD(wrq2);
    47                    wrq2->q_next = _RD(wrq1);
    48    
    49                    /*
    50                     * Nothing really uses the back service routines,
    51                     * but fill them in for completeness
    52                     */
    53    
    54                    _RD(wrq1)->q_nbsrv = wrq2;
    55                    _RD(wrq2)->q_nbsrv = wrq1;
    56    
    57                    SETMATED(STREAM(wrq1), STREAM(wrq2));
    58            }
    59    }

That hauled out a third of it, but there's plenty more to go. First, since wrq2 is no longer passed in, there is no risk that it can be NULL, so the check on 23 can be disposed of. Next, as boldly proclaimed in the comments on lines 32-35 and 50-51, q_nbsrv is indeed unused (a weed which I hauled out as part of 6267823) -- so it can be torched as well. Finally, the "sanity" checks on lines 24, 39, and 40 can be hoisted to neighbor the qi_nfsrv ASSERT()'s at minimal cost (one extra compare on DEBUG systems). Our result:

     1    /*
     2     * Mate the stream heads of two vnodes together. If the two vnodes are the
     3     * same, we just make the write-side point at the read-side -- otherwise,
     4     * we do a full mate.  Only works on vnodes associated with streams that
     5     * are still being built and thus have only a stream head.
     6     */
     7     void
     8     str_mate(vnode_t *vp1, vnode_t *vp2)
     9     {
    10             queue_t *wrq1 = strvp2wq(vp1);
    11             queue_t *wrq2 = strvp2wq(vp2);
    12          
    13             /*
    14              * Verify that there are no modules on the stream yet.  We also
    15              * rely on the stream head always having a service procedure to
    16              * avoid tweaking q_nfsrv.
    17              */
    18             ASSERT(wrq1->q_next == NULL && wrq2->q_next == NULL);
    19             ASSERT(wrq1->q_qinfo->qi_srvp != NULL);
    20             ASSERT(wrq2->q_qinfo->qi_srvp != NULL);
    21     
    22             /*
    23              * If the queues are the same, just twist; else do a full mate.
    24              */
    25             if (wrq1 == wrq2) {
    26                     wrq1->q_next = _RD(wrq1);
    27             } else {
    28                     wrq1->q_next = _RD(wrq2);
    29                     wrq2->q_next = _RD(wrq1);
    30                     SETMATED(STREAM(wrq1), STREAM(wrq2));
    31             }
    32     }

Behold, under all that cargo-cult programming: a scraggly, unintimidating function -- but functionally identical to the original 92-line thug. Now, we can trivially follow the logic:

  1. If the two vnodes are the same (pipe), then the single stream head has its write-side pointed to its read-side.
  2. If the two vnodes are different (FIFO), then each stream head's write-side is pointed to the other's read-side.
In the second case, the SETMATED() macro is also called in order to link the two stream heads (via sd_mate) and to set the STRMATE sd_flag on each stream head[3]. Since str_mate() is the only routine that establishes mates, it is also the only user of the SETMATED() macro. Since the macro neither adds semantic value nor reduces code duplication, our final version inlines the macro (whose definition can then be removed), preferring clarity to brevity. In addition, str_mate() has been renamed to strmate() for consistency with other STREAMS functions:
     1    /*
     2     * Mate the stream heads of two vnodes together. If the two vnodes are the
     3     * same, we just make the write-side point at the read-side -- otherwise,
     4     * we do a full mate.  Only works on vnodes associated with streams that
     5     * are still being built and thus have only a stream head.
     6     */
     7     void
     8     strmate(vnode_t *vp1, vnode_t *vp2)
     9     {
    10             queue_t *wrq1 = strvp2wq(vp1);
    11             queue_t *wrq2 = strvp2wq(vp2);
    12          
    13             /*
    14              * Verify that there are no modules on the stream yet.  We also
    15              * rely on the stream head always having a service procedure to
    16              * avoid tweaking q_nfsrv.
    17              */
    18             ASSERT(wrq1->q_next == NULL && wrq2->q_next == NULL);
    19             ASSERT(wrq1->q_qinfo->qi_srvp != NULL);
    20             ASSERT(wrq2->q_qinfo->qi_srvp != NULL);
    21     
    22             /*
    23              * If the queues are the same, just twist; else do a full mate.
    24              */
    25             if (wrq1 == wrq2) {
    26                     wrq1->q_next = _RD(wrq1);
    27             } else {
    28                     wrq1->q_next = _RD(wrq2);
    29                     wrq2->q_next = _RD(wrq1);
    30                     STREAM(wrq1)->sd_mate = STREAM(wrq2);
    31                     STREAM(wrq1)->sd_flag |= STRMATE;
    32                     STREAM(wrq2)->sd_mate = STREAM(wrq1);
    33                     STREAM(wrq2)->sd_flag |= STRMATE;
    34             }
    35     }

The above is identical to the version in OpenSolaris.

Footnotes

[1] Of course, no one who has actually done design believes in a pure waterfall model -- instead, designs must be revised over the lifetime of the project. Regardless, the more rigorous the initial design is, the fewer changes that will have to be made later in the project.
[2] While most developers pride themselves on the "lines of code" they've *produced*, I'm quite the opposite: I feel a profound sense of failure if the codebase is larger after I've integrated a bugfix or feature.
[3] STRMATE isn't strictly needed, but since sd_flag is often checked to accept or reject certain STREAMS operations, it's more convenient than constantly checking whether sd_mate != NULL.

Technorati Tag:
Technorati Tag:

Sunday Sep 18, 2005

Openly Excited The past few weeks have been some of the most exciting (and hectic!) of my professional career, as I acclimate to the power and responsibility of an increasingly open software development model.

To rewind for a second: Clearview is a project that Sebastien and I are spearheading to rationalize, unify, and enhance the way network interfaces are handled in Solaris. "What?,", you say? Well, by defining and implementing a set of core unifying attributes (or unalienable rights)[1] for all interfaces, a wide variety of administrative, programmatic, and diagnostic problems are addressed. For example, after this work, one will be able to:

  • Observe all IP layer network traffic, including loopback, IPMP group and IP tunnel traffic.
  • Observe all IP layer network traffic flowing to and from a zone.
  • Administrate all network interfaces using dladm(1M).
  • Use VLANs and form link aggregations on all Ethernet devices.
  • Use IPMP with technologies such as DHCP and routing protocols.

... though that list is far from comprehensive (nor does the above list constitute a commitment -- this is work-in-progress, mind you!)

Before I start hearing cries of "vaporware!", let's get back to the part that has me stoked: Clearview is one of the first Solaris projects to directly involve the OpenSolaris community in its design. To wit, our team is in the process of releasing design materials for each component of the Clearview project, and is actively soliciting input from the community at large.

In fact, the first document -- covering our massive overhaul of IPMP -- was sent out a week ago, and has already received some great feedback (keep it coming!) Further, just this morning, Sebastien unveiled our proposed Tunnelling Device Driver -- if you've spent much time with tunnels in Solaris, this will be a sight for sore eyes. Of course, that's not all: in the coming weeks, Phil Kirk will provide our design for IP-Level Observability Devices, and Cathy Zhou will detail our sinister plan to bring the power of dladm to all Solaris network interfaces (and more!)

When we say directly involve, we mean it. That is, this review is instead of, rather than in addition to our traditional Sun-internal design review. As such, other project teams inside Sun are seeing these materials for the first time as well -- and we have encouraged all of them to provide their feedback to the community forum. We hope that this underscores not only our commitment to open up our development process, but our desire to actively involve the community in each step, from design through integration, and beyond.

Speaking of design: we hope these hefty documents make it clear that we do not believe in leaving the design or architecture of Solaris to the whims of late-night hackery. Moreover, although technical documents are rarely known for their humanity, we hope that our materials provide a glimpse into the deep pride we each have in both Solaris and our craft.

Needless to say, your participation is most welcome -- and rest assured that we will continue to seek your involvement as the components proceed through various stages of our software development process.

Footnotes

[1] The rights themselves and their implications are too important to serve as a sidebar to this blog entry, and will surely get one of their own in the future.

Technorati Tag:
Technorati Tag:

Saturday Aug 13, 2005

On Locklessness

Unlike many other Unix kernels, the Solaris kernel is heavily multithreaded, including our networking stack. The conventional way to ensure consistency between multiple threads is to use synchronization primitives such as mutexes and readers-writer locks -- and indeed, these are widely used throughout the Solaris kernel. However, there are times when conventional locks are inappropriate, and more creative approaches to ensuring consistency must be devised. In this blog entry, I discuss a recent issue I encountered while working on the new Nemo[1] network driver framework, and describe an elegant but rarely-used approach.

The Problem

Among its other responsibilities, the Nemo framework provides the glue between the network protocol layer (e.g., IP) and the underlying network device driver. In particular, when a packet needs to be sent to an underlying device, the protocol stack will call the dls_tx() function. The dls_tx() function takes two arguments: the data-link to send the packet on, and the the packet to send. If any part of the packet could not be sent, the remainder is returned. In the original Nemo implementation, dls_tx() was implemented as:

	mblk_t *
	dls_tx(dls_channel_t dc, mblk_t *mp);
	{
	        dls_impl_t      *dip = (dls_impl_t *)dc;
	
	        return (dip->di_tx(dip->di_tx_arg, mp));
	}
Note that dc and dip refer to the same object (the data-link), but the first is opaque so that the caller cannot grovel around in the implementation details[2].

As is apparent, dls_tx() turns around and calls a previously-specified per-datalink transmit function pointer to send packets on that data-link. The transmit function is passed a previously-specified callback argument, and the packet to transmit. The transmit function and callback argument usually specify the underlying network driver's transmit function and a driver-instance data structure associated with that link.

However, when promiscuous-mode[3] is enabled on the underlying device, the Nemo framework interposes its own transmit function and callback argument between dls_tx() and the underlying driver's transmit routine. This interposed function loops back each transmitted packet so that a system will see its own transmitted packets through tools such as snoop(1M)[4].

Unfortunately, this presents a problem: as can be seen in dls_tx(), no locks are held as part of dereferencing di_tx and di_tx_arg. Thus, if one thread is calling dls_tx() while another is enabling or disabling promiscuous mode, it is possible for us to end up invoking the callback function with a mismatched callback argument.

As an example, suppose that the normal di_tx and di_tx_arg are tx and txarg, and the promiscuous di_tx and di_tx_arg are looptx and looptxarg. Then the following could happen:

          thread 1 (in dls_tx())      |  thread 2 (enabling promiscuous)
        ------------------------------|----------------------------------
  time    load di_tx = tx             |
   |                                  |  store di_tx = looptx
   |                                  |  store di_tx_arg = looptxarg
   V      load di_tx_arg = looptxarg  |
                                      |

Clearly, passing the wrong argument to di_tx could trigger panic or other misbehavior and is thus unacceptable. As such, we need to ensure that from the perspective of dls_tx(), di_tx and di_tx_arg are always either tx/txarg, or looptx/looptxarg.

So ... why not lock?

The simplest and most obvious way to address this problem would be to introduce locks, changing dls_tx() to be:

	mblk_t *
	dls_tx(dls_channel_t dc, mblk_t *mp)
        {
	        dls_impl_t      *dip = (dls_impl_t *)dc;
		mblk_t		*mp;

		mutex_enter(&dip->di_tx_lock);
	   	mp = dip->di_tx(dip->di_tx_arg, mp);
		mutex_exit(&dip->di_tx_lock);
		return (mp);
	}

Of course, the same lock would be acquired while assigning di_tx and di_tx_arg, making the assignment of both di_tx and di_tx_arg appear atomic from the perspective of dls_tx(). While this approach will work (and, moreover, is almost always the right way to solve this problem), in this particular case it has some drawbacks. To wit:

  1. All calls to the underlying di_tx function on the given channel are now serialized. That is, if multiple threads call dls_tx() for the same dc, only one thread will be able to call di_tx at a time.
  2. The dls_tx() function is called for every transmitted packet. Thus acquiring and releasing a lock -- even supposing that it is rarely contended -- introduces needless overhead [5].
  3. The dls_tx() function is no longer eligible for compiler tail-call optimization, impacting performance since the stack frame of dls_tx() can no longer be directly reused by di_tx()[6].

Lockless Atomicity

While the first problem could be fixed by switching to a readers-writer lock, this does nothing for the other two problems. Indeed, there appears to be no way to overpower this problem with complexity, so instead we return our requirement:

We need to ensure that from the perspective of dls_tx(), di_tx and di_tx_arg are always either tx/txarg, or looptx/looptxarg.
Now, if all we needed to do was guarantee that only di_tx or di_tx_arg assignments appeared atomic, we'd need no locks at all: all processors that Solaris supports guarantee that assignments to pointer-sized quantities will be atomic [7].

However, this of course means that assignment of structure pointers is also atomic. Thus, we can simply bundle these two fields into a structure:

	typedef struct mac_txinfo_s {
	        mac_tx_t                mt_fn;
	        void                    *mt_arg;
	} mac_txinfo_t;

... and fill in two such structures for each driver: one that contains the tx/txarg pair, and one that contains the looptx/looptxarg pair. Here's the actual code that does this in i_mac_create():

        /*
         * Set up the two possible transmit routines.
         */
        mip->mi_txinfo.mt_fn = mp->m_tx;
        mip->mi_txinfo.mt_arg = mp->m_driver;
        mip->mi_txloopinfo.mt_fn = mac_txloop;
        mip->mi_txloopinfo.mt_arg = mip;

We then replace di_tx and di_tx_arg with a single di_txinfo field:

        const mac_txinfo_t              *di_txinfo;

Again, since di_txinfo is a single pointer, it can be atomically set to point to either mi_txinfo or mi_txloopinfo. Coming back around to dls_tx(), we can now remove the need for locks, and also restore the compiler's ability to tail-call optimize dls_tx():

	mblk_t *
	dls_tx(dls_channel_t dc, mblk_t *mp)
	{
	        const mac_txinfo_t *mtp = ((dls_impl_t *)dc)->di_txinfo;
	
	        return (mtp->mt_fn(mtp->mt_arg, mp));
	}

Yes, that's it -- it's really that elegant. However, it's also a bit subtle: the code is careful to dereference di_txinfo only once. A more mechanical transformation of the original dls_tx() function would reintroduce a variant of the original bug, since di_txinfo might change between the two dereferences:

	mblk_t *
	dls_tx(dls_channel_t dc, mblk_t *mp)
	{
	        dls_impl_t      *dip = (dls_impl_t *)dc;

		/* WRONG */	
	        return (dip->di_txinfo->mt_fn(dip->di_txinfo->mt_arg, mp));
	}

It's also worth noting that even the correct lockless dls_tx() has subtlely different semantics from the version which used mutexes: since there is no explicit synchronization, it's possible a thread in dls_tx() may not see recent changes to the di_txinfo pointer. In this particular case, this is acceptable since:

  • The lifetime of either value is identical, and it is not possible for either value to become invalid while in dls_tx(), due to other invariants in the Nemo framework.
  • The effect of using the wrong function for a few packets is unobservable: either the kernel will attempt to loop back a few packets unnecessarily, or the first few packets after promiscuous-mode is enabled will not be looped back. The latter is unobservable because the application which enabled promiscuous-mode cannot observe the exact instant that it became possible to loop-back packets it transmits.

In many cases, these restrictions may limit the applicability of this approach -- and moreover, may suggest that a more conventional locking approach is needed. Nonetheless, as this discussion has hopefully made clear, there are times when locks are a poor match to the problem at hand, and "going lockless" both improves performance and code elegance.

Footnotes

[1] Nemo is also known as GLDv3, and is our brand new network driver framework that bge and xge currently use. All future Sun-provided network drivers on Solaris will use GLDv3. Once the interfaces stabilize, they will be published for third-party network driver developers.
[2] Throughout the kernel, we prefer this to void * since this enables the compiler to still type-check, thus maintaining type-safety. The actual definition of dls_channel_t is typedef struct dls_t *dls_channel_t, where dls_t is never defined.
[3] Enabling promiscuous-mode allows a network device to receive and send up packets on the network which are not destined for its hardware unicast, multicast, or broadcast addresses. In the old days, this would allow all traffic on the subnet to be observed -- but in these days of switched networks, enabling promiscuous-mode often yields only a few additional packets.
[4] Almost all network devices are deaf while transmitting, so transmitted packets which a host must also receive must be looped back in software.
[5] For instance, on SPARC, a mutex_exit() will perform a membar #LoadStore|#StoreStore, which forces the processor to ensure that all outstanding loads and stores complete before any future store complete.
[6] Those not familiar with the basics of tail-call optimization should read this wiki entry. On SPARC, this optimization can be important because it can avoid needless register spill/fill traps (among other things).
[7] To my knowledge, we have never officially stated this as a guarantee, but tons of kernel code relies on this invariant.

Technorati Tag:
Technorati Tag:

Wednesday Jul 06, 2005

On Bugs and Reproducibility

I love bugs -- specifically, I love putting an end to bugs[1]. I find myself a bit tongue-twisted when asked to articulate just what makes root-causing a bug so engrossing. In brief: it's the thrill of the hunt, the pull of a mystery, the search for a deeper understanding of how the system works, and -- of course -- the gripping fear that some customer somewhere will hit it again if we don't nail the bloody thing[2].

One way to subclass bugs is by reproducibility. That is, can the bug be reproduced on demand? However, as this blog entry will make clear, almost[3] any bug can be reproduced on demand once it is sufficiently understood. Thus, when someone states "this bug is not reproducible", they're likely really saying "this bug is not sufficiently understood". Sadly, one established way a bug can be "resolved" at Sun is for it to be closed out as "not reproducible" -- not just a shameful admission of utter defeat, but a near-guarantee to customers that they will be inconvenienced by the problem yet again. Thus, many of us simply refuse to close out bugs as "not reproducible"[4].

Nonetheless, sometimes bugs are indeed closed out as "not reproducible". In fact, if a bug takes up residence in a subsystem that is poorly understood but widely used, it can thump around for years -- or even decades -- before it is finally smoked out by a particularly determined engineer. Further, often the manifestations of the bug can be so varied that even after diagnosing, it can prove impossible to track down all of the duplicates that are either still open or have been closed out over the years as "not reproducible". Here I discuss my dealings with one such bug -- 4927647.

The Sewer of STREAMS

A classic example of a poorly-understood but widely-used area is the pseudo-terminal subsystem[5]. While the basics of how pseudo-terminals work is well-documented in several texts[6], the STREAMS code that implements it is a arcane and delicate piece of machinery from AT&T that has undergone little change since its initial impact with SVR3.2 some 18 years ago.

Again, that's not to say the code is unimportant. On the contrary, it is a core building block upon which many critical utilities, such as xterms, remote logins (ssh/telnet), and tools like expect(1) are built. Thus, when something goes wrong, someone has to be called in to diagnose it -- and since it's STREAMS code, these days that often means me[7].

Enter 4927647.

Specifically, I got called in because a high-profile customer filed the following bug:

    In expect 5.38.0 on Solaris 9, run the following script: 

        for { set i 1 } { 1 } { incr i } {
            spawn echo foobarbaz
            expect {
                "foobarbaz" {
                   puts "ok"
                   expect eof wait
                }
                eof {
                    puts "$i passes"
                    wait
                    exit 1
                }
            }
        }

    It should loop forever; instead, it exits after a few iterations on
    multiprocessor systems.  It appears to loop "forever" on uniprocessors.

Indeed, running this test I was able to confirm that it did fail on multiprocessor systems -- though usually after thousands of iterations (each iteration taking close to a second), rather than the "few" indicated in the bug report. Without a reproducible test case, this was surely going to be a painful battle.

The Hunt for Reproducibility

Using truss(1), it had already been confirmed that the problem was not with expect, but with the pseudo-terminal implementation itself. Specifically, something was causing the poll(2) to occasionally wake up with POLLERR, and the subsequent read(2) to return EINVAL rather than the number of bytes in the expected message:

    25258:  open("/dev/ptmx", O_RDWR)                       = 4
    [ ... lots happens ... ]
    25258:  poll(0xFFBFEBC0, 1, 10000)                      = 1
    25258:          fd=4  ev=POLLRDNORM|POLLRDBAND rev=POLLERR
    25258:  read(4, 0x00044FE8, 3999)                       Err#22 EINVAL

Of course, with this came the sobering realization that this bug was undoubtedly responsible for countless other sporadic "not reproducible" pseudo-terminal-related bugs, and an accompanying growing list of suspected duplicates. So, the key question was: could we intuit enough about this bug to make it reproducible and finally smoke this thing out?

At the time, DTrace had not yet integrated into Solaris, but was available for internal use and had an ever-growing buzz, so I was eager to see if it could shed some light on this bug. In particular, based on the truss output, my suspicion was that a toxic message was arriving on the /dev/ptmx stream that was causing STRDERR to be set, triggering POLLERR to be returned from strpoll():

        if (sd_flags & (STPLEX | STRDERR | STWRERR)) {
                if (sd_flags & STPLEX) {
                        *reventsp = POLLNVAL;
                        return (EINVAL);
                }
                if (((events & (POLLIN | POLLRDNORM | POLLRDBAND | POLLPRI)) &&
                    (sd_flags & STRDERR)) ||
                    ((events & (POLLOUT | POLLWRNORM | POLLWRBAND)) &&
                    (sd_flags & STWRERR))) {
                        if (!(events & POLLNOERR)) {
                                *reventsp = POLLERR;
                                return (0);
                        }
                }
        }

Further, my suspicion was that most of the time, the application called poll() before this toxic message arrived, thus sidestepping the problem. However, occasionally, the application would get delayed (e.g., by other system activity), causing the toxic message to set STRDERR before the application was able to call poll().

Thus, I used the following simple D script to "force" the application to become stalled:

      #!/usr/sbin/dtrace -ws
     
      syscall::poll:entry
      {
             chill(30000000);
      }

And just like that, the problem became reproducible on the first try! That is, by simply widening the failure window, the problem became immediately reproducible, simultaneously confirming my hypothesis and significantly easing further analysis. (It should be noted that the proper value to chill(), which is specified in nanoseconds, is highly dependent on the underlying hardware -- for my hardware, delaying by 30ms was sufficient. Of course, to prevent widespread system misbehavior, the amount of time one can chill() is bounded.)

The Analysis

With the problem reproducible, further analysis with DTrace and other tools proceeded quickly. In less than an hour, the specifics behind the bug were well-understood, though still extraordinarily twisted.

For those who are interested in such sordid details, here's a crude diagram which shows the data flow (indicated by arrows) and the relevant streams modules (shown in brackets; more about the the modules can be gleaned from their manpages -- e.g., pts(7D)). Note that pckt(7D), while critical to the problem, is not actually on the stream -- hence the dotted lines going through it:

                  expect                   echo
            U
            -------------------------------------------
            K   
                [ stream ]              [ stream ]
                [  head  ]              [  head  ]
                   |  ^                    |  ^
                   |  |                    v  |
                   |  |                 [ ldterm ]
                   |  |                    |  ^
                   :  :                    v  |
                [  pckt  ]              [  ptem  ]
                   :  :                    |  ^
                   v  |                    v  |
                [  ptm   ]   <------>   [  pts   ]

                master side             slave side
Using the above diagram, here are the steps that lead to the problem [8]:
  1. The message (e.g., "foobarbaz" or "something") is sent down the pseudo-terminal slave stream by the echo command.
  2. The pseudo-terminal slave is closed by echo (e.g., through exit(2)) and the dismantling of the stream begins; the message sent in (1) is still in-flight, perhaps even still on the slave side.
  3. Eventually, ldterm's close routine (ldtermclose()) is called, which causes a TCSBRK M_IOCTL to be sent downstream (see Jim Carlson's blog for lots more on that misery).
  4. The ptem module receives the M_IOCTL, acks it, and then passes a copy of of the M_IOCTL downstream for pckt(7M), as documented in ptem(7M).
  5. Shortly after ldtermrput() receives the M_IOCACK, ldtermclose() is allowed to complete, and the following processing elements (ptem and pts) quickly follow suit, being careful to drain (if necessary) the message and the M_IOCTL copy to the master side stream.
  6. Both the M_IOCTL and original data message reach the master side stream. Eventually, the data message makes it to the stream head and a pollwakeup() is issued to notify the application (expect) that data is available. However, as discussed before, for whatever reason the application is delayed.
  7. Since the pckt(7M) module is actually not on the master side, the M_IOCTL reaches the master side's stream head. Since it's not expecting to receive this message, strrput_nondata() sends an M_IOCNAK back downstream.
  8. The M_IOCNAK reaches ptm. However, since the slave-side has been closed, ptmwput() frees the message and sends an M_ERROR upstream to indicate the problem.
  9. The stream head receives the M_ERROR and strrput_nondata() sets STRDERR on the stream.
  10. The application thread finally receives the notification from step (6) and is awoken, but since STRDERR is set, the block of code shown earlier causes strpoll() to return POLLERR rather than POLLIN. Thus, the observed behavior.

All That ... For This?

Despite the tortuous nature of the bug, the fix itself proved to be trivial[9]. In particular, while there are several possible simple fixes, the least risky is to change the ptmwput() logic in step (8) to use the obscure 2-byte form of the M_ERROR message to only mark the write-side as having an error. That is, to change:

         if ((ptmp->pt_state & PTLOCK) || (ptmp->pts_rdq == NULL)) {
                 DBG(("got msg. but no slave\n"));
                 (void) putnextctl1(RD(qp), M_ERROR, EINVAL);
                 PT_EXIT_READ(ptmp);
                 freemsg(mp);
                 return;
         }

... to[10]:

         if ((ptmp->pt_state & PTLOCK) || (ptmp->pts_rdq == NULL)) {
                 DBG(("got msg. but no slave\n"));
                 mp = mexchange(NULL, mp, 2, M_ERROR, -1);
                 if (mp != NULL) {
                         mp->b_rptr[0] = NOERROR;
                         mp->b_rptr[1] = EINVAL;
                         qreply(qp, mp);
                 }
                 PT_EXIT_READ(ptmp);
                 return;
         }

         [ See the putnextctl(9F), qreply(9F), and mexchange(9F)
           manpages to get a better handle on this code. ]

This seems eminently reasonable since ptm is unable to forward a message through the write-side, but the read-side is still usable -- so why wasn't it done this way from day one? The reason is simple: the 2-byte form of the M_ERROR message post-dates the original code.

Again, thanks to the reproducible test case, it was quick and easy to verify that this fixed the problem.

Due Diligence

As I previously mentioned, most of the pseudo-terminal code dates back to AT&T and has simply been hitched to the nearest wrecker and gingerly towed to the latest release. Indeed, inspecting an internal gates, it's clear this bug goes all the way back to the introduction of the STREAMS-based pseudo-terminal subsystem in 1987! While this is certainly an extreme case, it is not rare for apparently-intermittent bugs to terrorize subsystems for years before they are finally root-caused.

In this case, scanning the bug database suggests the conditions needed to reproduce this bug did not become widespread until Solaris 9 -- at which point, numerous bizarre application failures began to be reported, such as:

    4530041 Unable to restore/save backup during DSR upgrade to S9/S10 due to internal error
    4654418 ripv2 test routed.perf.3 failed intermittently
    4695559 spurious failures of tcp.serverclose
    4813669 ripv2 test routed.perf.3 failed intermittently
    4897163 ppp tests often return with "FAIL: Unable to ping 10.0.0.2."
    4892148 unexpected POLLERR causing early EOF from spawned processes in "expect"
    4926624 ssh exits with -1 if stdin is not a terminal
    4985311 rdisc.3 testcase randomly fails
    5012511 ripv2: some test cases in rtquery subdir failed randomly and intermittently

Of course, there were surely many more anomalies due to this bug which were either never filed, "fixed" by other means (workaround), or closed out as "not reproducible" -- we will never know the full scope of the chaos caused by this persistent little bug.

Summary

Hopefully, this treatise has clarified three key points regarding hard-to-reproduce bugs:

  1. Almost any bug can be reproduced once it is properly understood. Resist the temptation to close a bug out as non-reproducible -- it will come back to bite someone!
  2. DTrace is a powerful tool for investigating hard-to-reproduce bugs, and the chill() action is particularly ideal for investigating bugs that you suspect are timing-related.
  3. Once the problem is understood, scanning the bug database for similar symptoms can often reveal years -- or even decades -- of other bits of inexplicable behavior. Like nabbing a serial killer, the search through the cold cases can prove tedious but ultimately provides a sense of closure to others who have been victimized. Moreover, the search provides a basis for understanding the true impact of the bug and thus how many releases the fix needs to be patched back to.

Happy hunting!

Footnotes

[1] I literally put an end to bugs during a recent trip to China -- though sadly I was not able to dine on anything quite this large. (Upon seeing that photo, Jim Carlson quipped: "I guess if you can put a stick through it, it's food." -- a disturbingly accurate observation.)
[2] Of course, as a developer, much of my time is spent on the other side of the fence: introducing new code, which will invariably have a new crop of challenging bugs. Thus, there is another element to bugfixing: paying my dues to society.
[3] There exists a very small subset of bugs that are tied to factors that are beyond software's ability to control, and thus cannot be reproduced on demand, even when they are sufficiently understood.
[4] Unfortunately, our internal "list" of reasons for closing out a bug is currently too limited, forcing "not reproducible" to be pressed into service for "fixed as a side effect of X", or "insufficient information available to evaluate", among others.
[5] Actually, we have two pseudo-terminal subsystems on Solaris: an System V one, and a BSD one. However, the BSD flavor is provided only for compatibility and for purposes of this discussion can be ignored.
[6] An excellent description of pseudo-terminals can be found in the recently-revised Advanced Programming in the Unix Environment.
[7] For the record, I've never formally worked on STREAMS -- but Sasha Kolbasov and myself have become the unofficial caretakers of the much-maligned STREAMS core.
[8] To keep the description from growing too bloated, I have left out the code snippets, but inspecting the linked functions will allow more detailed study of the problem. Of course, the bug has been fixed, so the broken code in ptmwsrv() no longer exists.
[9] It is surprising how often hard bugs have trivial fixes. Surely there is a reason for this, but it's not obvious to me.
[10] Note that this fix makes use of mexchange(9F), which has long been available internally, but which I added to the Solaris DDI (along with a collection of other STREAMS utility routines) for Solaris 10. Hopefully these routines will make writing new STREAMS code a little less painful. Perhaps worthy of a future blog entry.

Technorati Tag:
Technorati Tag:
Technorati Tag: