The two fundamental operations most people associate with clusters are takeover (assuming control of resources from a failed peer) and failback (relinquishing control of those resources to that peer after it has resumed operation). It's very important to understand what these operations mean in the context of the Sun Storage 7310 and 7410 NAS appliances. A number of important changes have been made in the recently-released 2009.Q3 software release that affect how these operations work - all of them, we obviously believe, for the better - but the administrative interfaces are unchanged. The product documentation (PDF not yet updated for Q3 as of this writing) has also been greatly enhanced to better describe the clustering model and administrative operations that apply to these products, and I would strongly recommend that you avail yourself of that resource; you'll want to be familiar with it to understand some of the terminology I'll use here. But many of our customers and partners have asked us questions around takeover performance, and in order to address those questions I will need to go into greater detail about the implementation of these operations. So let's take a look under the hood. It's important to understand that nearly everything discussed here is an implementation detail that may change without notice in future software releases, and applies very specifically to the new 2009.Q3 software.
A First-Order Look
Takeover and failback consist of operations performed on each resource eligible for the operation. The selection of eligible resources is described in the product documentation and depends on the resource type and, if applicable, which cluster peer has been assigned ownership of it. For the sake of simplicity, we will define a fairly typical cluster configuration in which there is a single pool consisting of the disks and log devices in 4 J4400 JBODs, a pair of network interfaces by which clients will access that storage via NFS, and a network interface private to each head for administration only. We will refer to the heads as simply A and B. The pool and service interfaces are assigned to head A.
In this configuration, the following resources of interest will exist on both heads:
- Eight resources of the form ak:/diskset/(uuid)
- ak:/zfs/pool-0
- ak:/nas/pool-0
- ak:/net/nge0
- ak:/net/nge1
- ak:/shadow/pool-0
- ak:/smb/nge0
- ak:/smb/nge1
In addition, head A will have ak:/net/nge2 and head B will have ak:/net/nge3, representing the private administrative network interfaces. There will also be a very large number of other resources representing components with upgradeable firmware, service configuration, users and roles, and other configuration state replicated between heads. Because these resources are replicas and do not normally have any activity associated with them at takeover or failback time, we will not consider them further. The resources of interest must be acted upon in dependency order; for example, we must open the ZFS pool and mount and share all of its shares before we can safely bring any network interfaces up that clients expect to use to access those shares. Otherwise clients could attempt to access the shares before they are available, receiving a response that would result in stale filehandle errors. The above listing of resources reflects the dependency order for import; as you would expect, the opposite order must be observed when exporting.
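The ordering rule can be sketched in a few lines of Python. This is only an illustration of the idea: the resource names come from the example configuration, the diskset UUIDs are placeholders, and the helper names are hypothetical rather than anything in the appliance software.

```python
# Resources of interest, listed in import (dependency) order.
# The eight diskset identifiers are placeholders for real UUIDs.
IMPORT_ORDER = (
    [f"ak:/diskset/uuid-{i}" for i in range(8)]
    + [
        "ak:/zfs/pool-0",
        "ak:/nas/pool-0",
        "ak:/net/nge0",
        "ak:/net/nge1",
        "ak:/shadow/pool-0",
        "ak:/smb/nge0",
        "ak:/smb/nge1",
    ]
)


def import_order(resources):
    """Return the resources in the order they must be imported."""
    return list(resources)


def export_order(resources):
    """Return the resources in the order they must be exported (the reverse)."""
    return list(reversed(resources))
```

With this listing, the pool is opened and its shares mounted before any service interface comes up, and on export the interfaces go away before the shares do.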
It is worth noting that most of these resources do not appear in the management UIs. This is deliberate: the nas, shadow, and smb resource classes are symbiotes; that is, they always have the same resource identifier and ownership as their respective masters but have distinct locations in the dependency chain and distinct actions required for import or export. This allows us finer-grained dependency control and makes the implementation simpler and more modular by separating subsystems from one another. This becomes very important when discussing takeover and failback performance: the symbiote resources must be exported and/or imported, and these operations take time.
Takeover
When head A fails (let us assume it has been powered off accidentally by our data center personnel), head B will initiate takeover once it detects that no heartbeats have arrived from head A. This timeout period is currently 500ms, and takeover is initiated as soon as the timeout elapses. At the beginning of takeover, an arbitration process is performed that protects user data in the case in which all cluster I/O has failed but both heads are still functioning. This arbitration process consists of attempting to take the zone lock (defined by SAS-2) on each SAS expander present in the storage fabric. These locks are acquired and dropped in a defined order by any head attempting to enter the OWNER state. The locks are held with a fixed timeout period, so that if the holder of the locks does not continually reacquire them they will eventually be dropped. A thread on the OWNER head does this at an interval significantly less than the timeout period. Therefore, if a head attempting to enter the OWNER state fails to acquire the locks, it will wait until the timeout interval expires and try again. If it again fails, it will reboot, allowing its peer (which must still be functioning) to take control of shared resources. This process prevents both heads from attempting to perform simultaneous write access to shared storage, which would destroy or corrupt data. The timeout interval is set to 5 seconds in current products, meaning that this step can take up to 5 seconds to complete plus the time to contact all expanders (typically around 1-2s even in the largest configurations). Since the zone locks are not held when in the CLUSTERED state, this will normally take less than 2 seconds overall. Only in the cases where arbitration is actually necessary - such as when all three cluster I/O links are disconnected between two functioning heads - or when taking over directly from the STRIPPED state can the additional 5 second penalty apply.
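The retry logic of the arbitration step can be sketched as follows. This is a simplified model, not the appliance's code: `try_acquire_locks` is a hypothetical callable standing in for the attempt to take the zone lock on every expander, and the 5 second default is the timeout quoted above.

```python
import time


def arbitrate(try_acquire_locks, lock_timeout_s=5.0, sleep=time.sleep):
    """Return "OWNER" if the zone locks were acquired, else "REBOOT".

    try_acquire_locks is a hypothetical callable that returns True when
    every expander's zone lock was taken and False otherwise.
    """
    if try_acquire_locks():
        return "OWNER"
    # The previous holder may be dead with its locks not yet expired.
    # Wait one full timeout interval so unrefreshed locks are dropped.
    sleep(lock_timeout_s)
    if try_acquire_locks():
        return "OWNER"
    # The locks are still being refreshed, so the peer is alive: yield.
    return "REBOOT"
```

Passing `sleep` as a parameter keeps the sketch testable without actually waiting; in the common case the first attempt succeeds and no delay is incurred at all.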
After acquiring the zone locks, the surviving head will evaluate each resource in the resource map in dependency order. If the resource is not already imported, its class's import function will be invoked. Since head B does not own any of the singleton resources listed above, neither they nor any of their symbiotes will already be imported there. We will therefore invoke the diskset class's import function for each of the disksets, then the zfs class's import function for pool-0, then the nas class's import function for pool-0, then the net class's import function for nge0, and so on until we have attempted to import all of these resources. Note that if a failure occurs we will simply mark the resource faulted and proceed: our peer is down so we must make a best effort. When head A resumes functioning, it will rejoin the cluster, meaning that the current list and state of all resources will be transferred from head B over the intracluster I/O subsystem. Head A will not, however, take control of any of the singleton resources or their symbiotes; it will import only its own private resource ak:/net/nge2 as it transitions into the STRIPPED state following rejoin. This behaviour prevents ping-ponging and allows the administrator to verify that the restored head has had any hardware issues addressed before returning it to service.
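A minimal sketch of the best-effort import walk performed during takeover, assuming a hypothetical `import_fn` that raises an exception when a resource's import fails. The function and parameter names are illustrative only.

```python
def takeover(resources, imported, import_fn):
    """Import every not-yet-imported resource, in dependency order.

    resources: resource names in dependency order.
    imported:  set of names already imported on this head (updated in place).
    import_fn: hypothetical per-resource import callable; raises on failure.
    Returns the list of resources that were marked faulted.
    """
    faulted = []
    for name in resources:
        if name in imported:
            continue  # e.g. this head's own private interface
        try:
            import_fn(name)
            imported.add(name)
        except Exception:
            # The peer is down, so record the fault and keep going.
            faulted.append(name)
    return faulted
```

The important property is that a failure never stops the walk: takeover is a best effort, in contrast to failback, where a failure forces a reboot.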
Failback
Now that head A has rejoined, a failback can be initiated from head B. During failback, head B will walk the list of resources in reverse dependency order, invoking the resource's class's export function for each resource that is not owned by head B. If any of these functions fails, head B will generate an alert and reboot itself. This is done to ensure that the cluster is in a consistent, well-defined state: it would not be safe for head A to import a resource that is still under the control of head B, nor would it be possible for head A to enter a defined cluster state without importing all of the resources assigned to it. Likewise, if head B attempted to re-import the resource that could not be exported, that operation or some other re-import required by it could (and likely would) fail as well, making matters worse. Therefore B's reboot will trigger a takeover by A and consistency is maintained. Assuming a successful export, head B will now perform an intracluster RPC to head A instructing it to begin importing its resources. In response, head A will walk the list of resources in dependency order, invoking each resource's class's import function for each resource assigned to it (but not any assigned to head B). If any of these functions fails, head A will generate an alert and reboot itself, again triggering takeover from head B and maintaining consistency.
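The export half of failback can be sketched like this. It is a simplified model under stated assumptions: `owner_of` is a hypothetical mapping from resource name to owning head, `export_fn` is a hypothetical callable that raises on failure, and the return values merely name the two outcomes described above.

```python
def failback_export(resources, owner_of, local_head, export_fn):
    """Export, in reverse dependency order, every resource not owned locally.

    Returns "HANDOFF" when all exports succeed (the peer may now be told
    to import its resources), or "REBOOT" as soon as any export fails.
    """
    for name in reversed(resources):
        if owner_of[name] == local_head:
            continue  # keep resources assigned to this head
        try:
            export_fn(name)
        except Exception:
            # An unexportable resource leaves no safe state to return to.
            return "REBOOT"
    return "HANDOFF"
```

The import half on the peer has the same shape, walking the list in forward order and importing only the resources assigned to that head.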
A Closer Look
Since failure detection and zone lock acquisition together take at most a few seconds, it is clear that we will need to understand the performance characteristics of each import and export function in order to understand overall takeover and failback performance. What exactly does each of them do?
Disksets
A diskset is exactly what its name implies: a collection of disks managed together. A few simple rules govern disksets: disks in a diskset are always part of the same storage pool, or no storage pool; disks in a diskset are always located in the same physical enclosure such as a J4400; and disks in a diskset are always imported or exported together. The mapping of disksets onto the slots or bays in a storage enclosure is defined by metadata delivered as part of the appliance software and may vary by product or enclosure type. Administrators do not manage disksets directly; they are handled automatically by the software when storage pools are created and destroyed.
In the abstract, disksets would be merely an engineering convenience, containers used to track the allocation of disks. Unfortunately, the need to support ATA disks necessitates a far more complex implementation. Because the ATA protocol does not support communication between multiple initiators and a single target, the SAS standard defines the notion of an affiliation, a mapping between a single initiator and an ATA target. Only the initiator that owns the affiliation can communicate with the target; any attempt by another initiator to do so will fail. Ownership of the affiliation is tracked by the SAS expander in which the STP bridge port associated with the ATA target is located, and an affiliation is claimed automatically the first time an initiator performs an I/O operation on a given target. Note that I/O operations in this context are not limited to those that affect the media: it is not possible to obtain even basic information about the device without claiming the affiliation. The process of obtaining that information and using it to create a device node used by ZFS and other software to interface with the disk is known as enumeration, a process that is normally performed by default by Solaris and other operating systems on every disk visible to the system's HBAs. However, as we can see, attaching two initiators to the same expander and performing automatic enumeration will result in chaos if there are ATA disks behind that expander: each system will claim some subset of the affiliations during enumeration but hang for an extended time attempting to enumerate those disks whose affiliations were claimed by its peer. The net result would be extremely long boot times and some random subset of disks visible to each system. Clearly this is untenable.
Disksets are a solution to this problem. By disabling automatic enumeration of ATA disks, we can control when the enumeration process is performed, limiting it to circumstances in which it is known to be safe: storage configuration and those times when we know our peer is not attempting to access the disks; e.g., during takeover and failback. Therefore the diskset import function must, for each disk, cause the operating system to enumerate that disk via each possible path. The export routine, likewise, must cause the operating system to "forget about" the disk and relinquish its affiliations for each initiator.
In previous software releases, diskset import time usually dominated the takeover and failback processes. While the 12 disks in each diskset were enumerated in parallel, fundamental problems in the kernel and an inability to process disks from multiple disksets in parallel limited the parallelism that could be exploited. Each diskset typically took 15 to 30 seconds to import, and, worse, could take much longer in certain error paths, especially if, during takeover, the expander had not yet torn down the defunct initiator's affiliations. The current software release improves the situation considerably: all disksets can be imported in a single invocation of the diskset import function (known as "vector import"), and up to 96 disks can be enumerated in parallel, up from 12. In addition, improvements in error handling and timeouts have greatly reduced the worst-case import time when disk or affiliation errors occur. Overall, configurations such as our example above will typically see reductions in diskset import time on the order of 4-6x, with an accompanying large decrease in variance. That is, we might reasonably expect all 8 disksets in our example configuration to be imported in 30s. Because most of the overall benefits come from increased parallelism, smaller configurations will see somewhat smaller improvements. Diskset export is not, and has never been, a significant contributor to failback times; undoing the enumeration process is typically measured in milliseconds for each disk. This means that the relationship between takeover and failback times depends mainly on which contributing factors dominate each activity; i.e., the configuration and uses of the system.
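As a rough check on those figures, the arithmetic for the example configuration can be written out. This is a back-of-the-envelope sketch, assuming the older releases imported disksets strictly one after another and that the quoted 4-6x improvement applies to the total; the function names are illustrative.

```python
def old_import_range_s(disksets, per_set_s=(15, 30)):
    """Serial import: each diskset took roughly 15 to 30 seconds."""
    lo, hi = per_set_s
    return disksets * lo, disksets * hi


def new_import_range_s(old_range_s, speedup=(4, 6)):
    """Apply the quoted 4-6x improvement to an old (low, high) range."""
    lo, hi = old_range_s
    slow, fast = speedup
    return lo / fast, hi / slow


old = old_import_range_s(8)    # (120, 240) seconds for 8 disksets
new = new_import_range_s(old)  # (20.0, 60.0) seconds
```

The 30 second figure for all 8 disksets falls inside the resulting 20 to 60 second range. Note also that 8 disksets of 12 disks is exactly the 96 disks that can now be enumerated in parallel.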
ZFS
Importing a ZFS pool resource simply means opening the pool, reading the labels from each disk, and creating the attachment points for any zvols (used to provide block storage) contained in the pool. Reading of labels takes constant time as it is performed in parallel, but the second portion of this activity requires walking all datasets in the pool, which is done sequentially. The time taken here is therefore proportional to the sum of the number of projects, filesystem shares, and LU shares the pool contains. It is, however, usually much less than the mounting and sharing activities, which we will investigate next.
NAS
The NAS symbiote of the ZFS pool is responsible for mounting and sharing all of the ZFS datasets, including zvols used as backing stores for block devices. This activity therefore takes time proportional to the number of shared filesystems (NFS, CIFS, FTP, HTTP, SFTP, FTPS) and block devices (iSCSI). Tests have shown that NFS shares contribute between 5ms and 15ms each to this process but, because the meaning of "sharing" depends on the protocol, it is difficult to provide an overall estimate of the constants associated with this activity. Likewise, export requires unsharing and then unmounting the filesystems and LUs, which is also linear and requires variable time that is protocol-dependent. Tests have shown that each NFS share contributes a similar time increment to the export process as it does to the import process.
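For NFS, the linear relationship gives a quick estimate. The sketch below simply multiplies the share count by the 5ms to 15ms per-share range measured above; it assumes every share is an NFS share and ignores all other protocols, whose constants are not known.

```python
def nfs_share_time_range_s(shares, per_share_ms=(5, 15)):
    """Estimate (low, high) seconds spent sharing NFS filesystems."""
    lo_ms, hi_ms = per_share_ms
    return shares * lo_ms / 1000, shares * hi_ms / 1000


# A pool with 3000 NFS shares: between 15 and 45 seconds.
estimate = nfs_share_time_range_s(3000)
```

Since export costs a similar increment per share, the same estimate applies roughly to the unshare and unmount side of failback.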
Net
The net class's resources represent the state of a network interface, which will already have been plumbed and configured on both heads. When this resource is imported, the subsystem informs the kernel that the addresses on the interface should be brought up. This activity is performed sequentially for each address and the time taken is therefore linear in the number of addresses configured for the interface. However, the time taken for each is minuscule and it is unusual to assign more than a few addresses to an interface. The entire operation normally completes in a second or two. Exporting is directly analogous and takes a comparable length of time.
Shadow
This resource class, new in the 2009.Q3 software, manages shadow migration destinations associated with the pool. Because we cannot necessarily mount the shadow migration sources until the network interfaces are imported, this symbiote of the pool resource is imported after all net resources. It is responsible for activating each shadow mount, which will cause the source filesystems to be mounted. This occurs sequentially, and is therefore linear in the number of shadow sources. Of course, shadow sources that are local will take very little time to mount while NFS client mounts can take a significant amount of time. Exporting is the complement, and is normally very fast in all cases.
SMB
Each net resource has an smb symbiote, responsible for notifying the CIFS subsystem that an additional network interface should be used to provide service to CIFS clients. The time taken by this operation is effectively irrelevant, as it usually completes in less than a second.
Putting It Together
As we've seen, there are many moving pieces involved in takeover and failback. Each resource class has its own set of operations for import and export; some take effectively constant time while others depend on the number of shares and projects or the number of disks. Even where a clear dependency in a particular variable can be characterised, the actual time taken to perform each individual suboperation may not be known or even constant; for example, sharing a filesystem can take a different amount of time depending on the protocol used and even the parameters associated with that particular share. For all these reasons, I strongly encourage anyone who is especially sensitive to takeover or failback time to perform some tests based on their own real-world configurations. This will become even more important as overall performance improves: for example, the recent improvements to diskset import time make the number and type of shares much more relevant to total takeover time. Many configurations may achieve 4x or better overall takeover time improvement as a result of that work, but a configuration with, for example, 3000 shares on a pool consisting of a single diskset, may see little or no change. As with any benchmarking activity, there is no substitute for testing your own configuration, but I hope the above description of the process and rough guidelines will be helpful in establishing expectations going into that testing process so that anomalous behaviour can be identified and tracked down.
In a future post I'll talk about a few of the remaining opportunities for improvement. Until then, ALL HAIL CLUSTRON!
Greetings, puny humans! I am Sun part number 371-3024, a Sun Fishworks Cluster Controller 100, but the world knows me as CLUSTRON. Today you'll be giving me all your gold in tribute as I tell you about the clustering strategy implemented in Fishworks appliances and my integral place in the Sun Storage 7410C.
All clustering software comes with a devastating intrinsic drawback: its own existence.
As anyone who has worked in the industry can tell you, the only bug-free software is the software that isn't written. So when we talk about using two servers - or appliances - to provide higher availability through redundancy, one ought to be immediately suspicious. Managing multiple system images and coordinating their actions is a notoriously difficult problem. And when the state shared between them consists of the business-critical data you're using the appliances to store, you ought to be downright skeptical. After all, while simple logic dictates that two systems ought to offer better availability than one, there's the small matter of the software required to take that from a simplistic statement of the obvious to a working implementation fulfilling at least some of that promise. It's not just software in the usual sense, either; hardware - like me - is also in play, and most modern hardware contains software of its own, usually called firmware. Firmware is really just software for which the system designer has no source code, no observability tools, and no hope. Generally speaking, more software - wherever it runs - means more bugs, more time and energy devoted to management, and more opportunity for operator error; all of these factors act to reduce availability, eating away at the gains offered by the second head. Anyone who tells you otherwise is lying. Liars make CLUSTRON angry.
The typical clustered unified storage server consists of a pair of underpowered servers, each populated with some HBAs, some NICs, a small, expensive DRAM buffer with a giant battery, and an Infiniband (IB) HCA. Oh, and some software. Lots of software, as it turns out, because the way these implementations provide synchronous write semantics to clients is by mirroring the contents of their battery-backed DRAM buffers to one another in real time across those IB links. When a server fails, its partner has access both to the disk storage (usually via FC) and the in-flight transactions stored in its own copy of NVRAM, so it can pick up where its dead partner left off. The onus is often on the administrator, however, to keep configuration state in sync; while it changes infrequently, it usually needs to be identical in order for clients to observe correct behaviour when one of the two servers has failed. And all this comes at a hefty price in cost - NVRAM and IB HCAs take up precious I/O slots (reducing total capacity and performance) and are not particularly cheap. But there is also a complexity cost: a quick glance at the Solaris IB stack turns up about 65,000 lines of source code, and of course that doesn't include an NVRAM driver or the code needed to coordinate mirroring NVRAM over IB. None of the software in such an implementation is reused elsewhere in the storage stack, so it has to be developed and tested independently, and the IB HCA is likely to contain a fat chunk of that nasty undebuggable firmware of which you'd like as little as possible in your core systems. Worst of all, because that interconnect link is in the data path and doubles as the cluster "heartbeat" channel, under extreme load it may be possible to lose heartbeats and incorrectly conclude that your partner is dead. That can lead to a takeover at the worst possible time: under extreme load (most general-purpose clustering software suffers from this deficiency as well).
Overall, it's almost as if the engineers who designed these systems kept adding complexity, cost, and opportunity for error until they finally ran out of ideas.
The Fishworks approach to clustering is somewhat different. At the bottom of the stack lies the most important difference: me, your CLUSTRON overlord. Instead of IB in the data path, I offer three redundant inter-head communication links for use only by management software. We'll come back to this in a bit. The data that would otherwise be written to NVRAM and mirrored over IB is instead written once to each intent log device as if it were an ordinary storage device. These devices combine flash for persistence with supercapacitor-backed DRAM for performance. Since they live next to the disks in your JBODs, they can - just like NVRAM contents - be accessed by an appliance when it takes over for a failed partner. But this entire path is much simpler; notice that we are reusing the basic I/O path that is already used - and tested - for writing to ordinary disks. And since there's nothing to mirror, we don't need any software on the appliances to drive IB devices or coordinate NVRAM mirroring. Each appliance simply writes its intent log records to the device(s) associated with a given storage pool and replays them when later taking control of that pool, either on boot or during a cluster takeover or failback activity.
But what is my role in this? I provide basic connectivity for two purposes:
- Configuration sync - if you make a change to a service property (say, you add a DNS server) on one appliance, this change is transparently propagated to its partner. If that partner is down, it will pick up the change when it next boots and rejoins the cluster.
- Heartbeats - this is how a clustered Fishworks appliance decides to take control of cluster resources. No heartbeats? It must be dead. It wouldn't become a soulless machine to mourn its passing so I'd better just poke the userland management software to initiate a takeover.
On the face of it, that seems unremarkable. One could presumably multiplex these functions onto a traditional IB-based implementation. But recall that a key goal in any clustering implementation must be reducing the complexity of the software and thereby limiting the number of bugs that can affect core functionality. I designed myself to do exactly that. Instead of a complex, featureful, high-performance I/O path, I provide some seriously old-school technology, namely 2 plain old serial links - the kind to which you might once have attached a modem to dial into the WOPR. My third link offers somewhat better performance but again uses only existing software drivers; it is an Intel gigabit Ethernet device. All three links provide redundant heartbeat paths (at all times) and all three can be used to carry management traffic, though management traffic is preferentially routed over the fastest available link to provide a better interactive management experience.
The advantages of this design are several:
- Serial devices typically take interrupts at high priority. By noting the receipt of heartbeat messages in high-level interrupt context, I can ensure that I remain aware of my partner's health no matter how much load my appliance is under.
- Likewise, I can employ a high-level cyclic on the transmit side to ensure that outgoing heartbeat messages keep flowing to my partner no matter how heavily loaded my appliance.
- Serial communication is dead-simple, time-tested, and battle-proven. Fewer than 3400 lines of code are required to provide all my serial functionality, including controlling my LEDs. That's around 5% of what we might expect an IB-based solution to require. And while the Ethernet driver is considerably larger, it once again does double-duty: it's the same driver used with the NICs that attach your appliance to the data centre networks.
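The way three redundant links combine into a single verdict on the partner's health can be sketched as follows. This is a toy model, not the driver: the class, the link names, the relative speeds, and the timeout value are all assumptions made for illustration, and the real work happens in interrupt context rather than in Python.

```python
class HeartbeatMonitor:
    """Track the last heartbeat seen on each inter-head link."""

    def __init__(self, links, timeout_s):
        # links: mapping of link name -> relative speed (higher is faster).
        self.speed = dict(links)
        self.timeout_s = timeout_s
        self.last_seen = {name: None for name in links}

    def heartbeat(self, link, now):
        """Record a heartbeat received on the given link at time now."""
        self.last_seen[link] = now

    def live_links(self, now):
        """Return the links that carried a heartbeat within the timeout."""
        return [
            name for name, seen in self.last_seen.items()
            if seen is not None and now - seen <= self.timeout_s
        ]

    def peer_alive(self, now):
        # Any one link carrying heartbeats is enough.
        return bool(self.live_links(now))

    def management_link(self, now):
        # Prefer the fastest link that is currently carrying heartbeats.
        live = self.live_links(now)
        return max(live, key=self.speed.get) if live else None
```

A monitor built with two serial links and one Ethernet link would report the partner alive as long as any one of them is still delivering heartbeats, and would route management traffic over Ethernet whenever that link is healthy.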
As you can see, the Fishworks team kept hammering away at a few key design objectives throughout; perhaps the most important of these was a desire to minimise the amount and complexity of new software to be written. This is not to say there is not complexity in the clustering subsystem; there certainly is, and I'll discuss some of those areas in a later edict. But the foundation of the clustering design is as simple as it can be. Clustering is not right for every application or every shop: even with these design principles firmly in place, clusters are much more complex to manage and monitor than standalone appliances, entail significantly higher hardware costs (though as always in the Fishworks universe, there is no added software licensing fee), and however little code may be specific to clustering it certainly is not zero. That means there will be failures that occur in clusters which would not have occurred in a standalone configuration - in other words, that clustering can always reduce availability as well as enhance it. The Fishworks clustering design makes a commendable effort to make this unhappy outcome less likely than in traditional shared-storage clusters. In my next edict I'll discuss the exact circumstances in which I can help provide greater availability than a standalone appliance, and some of the cases not yet covered that the engineers are looking to include.
Until then, ALL HAIL CLUSTRON!
A few people have asked how I'm voting in this year's elections. Here, then, are my endorsements:
OGB:
- John Sonnenschein
- Brian Gupta
- Joerg Schilling
- Ben Rockwood
- Stephen Lau
- Al Hopper
Questions:
- No
- No
Rich Lowe asked an excellent question about OpenSolaris government in response to one of Casper Dik's answers to the DTrace Community. Here's the question, and my answer. As always, the other candidates' responses are available in the above-referenced thread.
And what would be your (the general "you", not just Casper) plans to help make the ARC and especially the C-team more practically part of OpenSolaris process, rather than a part of Sun process we're exposed to from one side, but not, so far, fully involved with?
Of the two, the ARC is much more difficult to rationalise; I'll
explain why below. As for the C-teams and the more general problem of
consolidation management, I'll let this text from my position paper[1]
do the answering:
One of the OGB's most important tasks will be to rationalise the
Community Group structure into one which will allow meaningful
self-government. The centerpiece of my plan for doing this is
construction of Consolidation-Sponsoring Community Groups
(CSCGs). Each of these groups will be given control over an
existing consolidation. This structure is not unlike that which
exists today in the misnamed Nevada Community, representing
ON. But that Community does not govern openly, and other
consolidations are entirely missing structure under which they can
be governed legitimately. Since the Constitution provides for the
Community Group as the unit of independent government, each
consolidation requires one to oversee its progress. The CSCGs will
be responsible both for controlling the content of their codebases
and for providing guidance and leadership to project teams
desirous of integration. They will be required to adopt a set of
rules (harmonised but not necessarily identical across all CSCGs)
for integration and apply these rules fairly.
The challenge associated with the ARC (or ARCs) is that it maps poorly
onto the Community Group structure. It makes little sense to me that
an Architecture Community Group would sit alongside, say, an
Observability Community Group. Observability incorporates a number of
subsystems in the OS which in turn need to be properly integrated into
each project. So would Reliability, or Virtualization. Architecture
is not another such feature set but rather the way in which all those
features, along with the new ones offered by the project, fit together
and expose themselves to other consumers. That is, Architecture is
both a superset of and yet entirely disjoint from all other CGs' areas
of interest. The practical effect is substantial overlap: we would
expect each CG to offer project teams advice concerning how best to
integrate their work with existing features (and, for projects
directly related to the CG's area of expertise, what features it
should offer to others). In some ways, however, this directly
conflicts with the mandate of an Architecture CG, which is to provide
architecture guidance to all project teams. In the current system, an
observability expert cannot override the ARC's decisions with respect
to a proposed observability project. Yet under the Constitution, the
Observability CG is supposed to be self-governing. The defining
question is what exactly the latter CG is expected to govern, and by
what mechanism - the very question the Constitution so conspicuously
fails to answer.
It's easy enough in my CSCG model to simply require that all CSCGs
adopt rules requiring architectural review by a particular CG just as
they should require other CGs with expertise in relevant areas to
review and perhaps approve each project prior to integration. Indeed,
this is not unlike the system that exists today. The CSCGs do indeed
have complete control over their areas of responsibility, namely, the
existing consolidations. But this leaves all other CGs less equal,
their endorsements subject to veto and without any code of their own
to govern. A logical conclusion one could reach on this line of
thinking is that CSCGs and perhaps the ARC should be the *only* CGs.
The reality on the ground thus maps poorly to the Constitution we've
been given, suggesting that the Framers either did not consider this
matter in sufficient detail or intended much more radical changes in
either the structure of consolidations, global review processes, or
both. Mr. Fielding in fact hinted at just such an intent[2]:
    We don't need to enshrine one committee's view of how C-Teams
    operate in an organization-wide constitution because C-Teams
    simply aren't relevant to *every* activity at OpenSolaris, and the
    vast majority of comments we have received so far clearly indicate
    that the existing consolidation boundaries are arbitrary AND
    dysfunctional. Personally, I am hoping that the communities feel
    empowered to change the things that are obviously causing them
    harm right now, and let the consensus process ensure that the
    traditions are adequately promoted and maintained over time.
Presumably Mr. Fielding and perhaps others have some grand detailed
view of how all these things should be made to fit together in the
rather obvious presence of existing bodies of code with no associated
governing units and vice versa. Unfortunately, they've not seen fit
to share that view nor to stand for election themselves. If consensus
does not emerge within a few months as to an appropriate way to map
the (possibly modified) existing practical devices of government onto
the new constitutional structures, I'll probably favour amending the
constitution rather than spinning our wheels forever trying to
shoehorn OpenSolaris into a framework that may well be inappropriate
to our broader goals.
At some point I'd like to hear Mr. Plocher and others more intimately
involved with the operation of the ARC Community express their views
on how that Community could be made to fit into the new Constitutional
world of governing CGs. Their testimony will be needed before the OGB
as it considers how best to restructure the Communities into
meaningfully self-governing units.
[1] http://blogs.sun.com/wesolows/entry/ogb_election
[2] http://www.opensolaris.org/jive/message.jspa?messageID=99494#99494
Leaders of the DTrace Community had a number of questions for the OGB candidates. Here's a copy of the questions and my answers. You can also see the other candidates' responses in the DTrace mailing list archives.
- DTrace is one of only a small handful of OpenSolaris technologies that
has actually been incorporated into other operating systems. Thus,
your position on dual-licensing is very important to us; what is your
position on dual-licensing in general?
As I noted in my position paper:
(a) the OGB does not control licensing, and
(b) to the extent that the OGB would be consulted on the
matter, I'm opposed to dual-licensing.
The well-known opportunity it offers for license-based forks is a
significant drawback that would have to be more than balanced by
compelling benefits. No one has yet articulated such benefits, and I
have found no evidence myself that they exist. The advantages
presented by proponents of such a licensing scheme appear to be
predicated on the idea that the second license would be GPLv3 (which
is not complete), and that its use would dramatically increase the
size of our community by drawing in the FSF as a partner in our
technical work. Those are two large 'ifs' for a 'maybe' we're poorly
positioned to handle.
- Do you agree with the conclusions and decrees of CAB/OGB Position
Paper # 20070207?
Generally, yes. See above.
- The OGB is responsible for the representation of OpenSolaris to
third parties. If a third party were to inquire about incorporating
DTrace into a GPL'd Program, what would be your response or position?
I would note that, by my lay reading, the GPL would preclude that party
from distributing the resulting product without violating the terms of
that license. I would also advise that party to seek legal counsel,
as with any licensing concern. That's as far as I'd go, however; the
OGB does not hold the copyright to DTrace and is not in a position to
warn or litigate against infringers.
- DTrace is currently a Community Group, but some could argue that it would
make more sense for DTrace to be a Project in (say) the Observability
Community Group. In your mind, what is (or should be) the difference
between a Community Group and a Project -- and where should DTrace fall?
These two questions are not necessarily well-linked. The difference
between a Project and a Community Group is straightforward. A Project
owns one or more gates and does direct technical work with the intent
to add or improve a specific aspect of the software they contain. A
Community Group is the unit of independent government as defined by
the Constitution; it is responsible for directing and guiding Project
teams and others doing work that affects a broadly-defined set of
interests.
Others have suggested that a Project, by definition, has a limited life
span (presumably terminating upon integration into a consolidation).
I disagree with this definition - a project (like DTrace) which
provides a large and useful set of functionality will never be fully
complete unless and until it is replaced wholesale. So long as the
Project's work remains in use, it is important that some collaborative
unit exist to provide a home for those using and improving it.
DTrace is unquestionably a Project. Whether it deserves a Community
Group of its own[0] depends on the granularity at which we wish to
distinguish among Community Groups and the amount of overlap among
them. That is, if Observability is held to be a Community Group
distinguished from others at the correct granularity, DTrace should
not be a separate CG, as its function would be a strict subset of
another valid CG's. Instead, the DTrace leadership would be expected
to participate in the Observability Group's activities, offering
guidance and advocacy for consumers of its work. As part of that
transition, mutually acceptable agreements regarding contributorship
grants and leadership structure would need to be in place for
the merged community (much like any corporate merger). Alternately,
however, I could envision a finer-grained set of Community Groups with
some overlap; DTrace might fit alongside, for example, a Debuggers
Community Group in such a scheme. My personal preference is for a
smaller number of larger Community Groups, some of them controlling
the long-term maintenance of consolidations and others providing
guidance to project teams (and the consolidation owners) based on
their particular areas of technical expertise. I believe this would
promote a vision of our software as an integrated whole. Just as
importantly, even under such a system, any large and ambitious Project
would fall inside the scope of several Community Groups' areas of
interest. Expecting project teams to interact with dozens of Groups'
leaders would seem to introduce excessive and unnecessary complexity.
[0] The existence of DTrace as a Project ought not preclude the
existence of other Projects which seek to enhance it.
- The Draft Constitution says next-to-nothing about where the authority
lies to make or accept changes to OpenSolaris -- only that Projects
operate at the behest of Community Groups, and that Community Groups
can be "terminated" by the OGB. In your opinion, where does or should
this authority lie? And do you believe that the Constitution should
or should not make this explicit? Finally, under what grounds do you
believe that a Community Group should be "terminated"?
As I noted in my position paper, I believe the authority for code
acceptance should reside with Community Groups responsible for the
targeted consolidation. Those CGs would be expected to delegate some
or all of that authority in turn to specific individuals forming the
C-Team for a particular release. While some minor changes will be
needed to this strategy to accommodate open development, the basic
process has worked well for a long time, and I see little reason to
alter it radically.
As I've noted in several messages, I would prefer that the
Constitution had made at least some of this more explicit. The
absence of this specification leaves the OGB with a set of
illegitimate Sun entities exercising effective control over matters
the Charter clearly leaves to the OGB, and offers no transition plan,
timetable, or framework in which to take over these functions. This
will present an additional challenge to the first elected OGB.
Community Groups formed under a coherent and comprehensive strategy
such as the one I hope the OGB will provide should generally be
terminated only for inactivity or some other clearly self-induced act of
dissolution (such as a voluntary merger with another Community Group,
approved by the OGB). Unfortunately, we also have a large number of
existing Communities which do not fit well within any strategy one
could retroactively imagine, and the OGB will be obligated to
rationalise this situation. The process of doing so will likely
involve terminating a number of these Communities and/or merging them
with other Communities to form strategically valuable Community
Groups. In the process, it is not unrealistic to suppose that some
Communities may be terminated without the consent of their leaders.
The OGB should seek to offer reasonable accommodation to the leaders of
such Communities and work with them to find acceptable solutions that
fit the strategic plan. My hope and expectation is that events of
this type would occur very rarely after the initial realignment.
- The Draft Constitution says that Community Groups (and in particular,
the Community Groups' Facilitators) are responsible for "communicating
the Community Group's status to the OGB"; what does this mean to you?
My understanding was that the Working Group introduced the position of
Facilitator for the purpose of maintaining a single first-line point
of contact for each Community Group. The OGB should expect each
Community Group to provide its membership list as required by the
Constitution on a regular basis, and for proposing desired changes in
structure or termination (if any). Beyond that, I believe this
requirement has little meaning to the OpenSolaris community; it seems
to make more sense in the context of an Apache-like organisation in
which many completely disjoint software engineering efforts are
undertaken simultaneously by likewise disjoint groups overseen by the
Foundation. Since the OGB is not responsible for technical decisions,
it makes little sense to expect Community Groups to provide detailed
information about the work they oversee in the absence of a specific
conflict or other matter requiring the OGB's attention. In short, it
makes no sense to sample data which you cannot usefully consume.
- According to the Draft Constitution, "nominations to the office of
Facilitator shall be made by the Core Contributors of the Community
Group, but the OGB shall not be limited in their appointment to those
nominated." Under what conditions do you believe that the selection of
a Facilitator would or could fall outside of the nominations made by
a Community Group's Core Contributors?
The only example I can imagine is one in which the designated
Facilitator has a proven history of unreliability or deception. It
seems unlikely that such an individual would be nominated by a
responsible Community Group, so in practice I doubt this clause will
ever be exercised.
- According to the Draft Constitution, "non-public discussion related to
the Community Group, such as in-person meetings or private communication,
shall not be considered part of the Community Group activities unless or
until a record of such discussion is made available via the normal meeting
mechanism." In your opinion, in the context of a Community Group like
DTrace -- where a majority of the Core Contributors spend eight to ten
hours together every work day -- what does this mean? Specifically, what
does it mean to be (or not to be) "considered part of the Community
Group activities"? And in your opinion, what role does the OGB have in
auditing a Community Group's activities?
I choose to interpret this as a Blue Sky provision, requiring that
important decisions be undertaken in public with the opportunity to
participate for all those whose input might be considered useful.
Since the Constitution provides no definition of "Community Group
activities" other than voting, by implication this works in the same
way as similar provisions in Municipal charters.
In the context of the DTrace Community Group, I take it to mean that
matters which require a Community Group to vote must be presented on a
public list with reasonable opportunity for comment before such a vote
is taken.
Outside of bootstrapping activities around organising and
rationalising Community Groups, I see little proactive role for the
OGB in auditing CG activities. The OGB should generally handle only
conflicts which cannot be resolved within one or more CGs, and then
only when requested by a party to the conflict. The Constitution does
preclude the OGB from interfering with a CG's internal governance.
- Historically, binary compatibility has been very important to Solaris,
having been viewed as a constraint on the evolution of technology.
However, some believe that OpenSolaris should not have such constraints,
and should be free to disregard binary compatibility. What is your
opinion?
Those people are wrong. Binary compatibility is a great strength, one
which can in nearly all cases be preserved without retarding progress.
To the extent that binary compatibility requires deeper thought on the
part of engineers, it also directly enhances the quality of new work.
Solaris customers praise and appreciate this engineering philosophy
and the results it offers them; we should offer the same benefits to
customers of other distributions as well by maintaining compatibility
and architectural consistency within all recognised consolidations.
Naturally, consumers of OpenSolaris are free to incorporate the
technology into their own products in whatever manner they choose,
including the introduction of changes that violate these constraints.
Such activities are outside the scope of the OGB to regulate.
- If a third party were to use and modify DTrace in a non-CDDL'd system,
whose responsibility is it to assure that those modifications are
made public? To put it bluntly: is enforcing the CDDL an OGB issue?
The answer to the first question is "No one." Neither use nor
modification triggers the requirement that those modifications be
distributed in source form (and additions, in particular, need not be
distributed at all). Only distribution triggers this requirement, and
it is extended only to those to whom binaries are provided. If such a
party did distribute the binaries containing DTrace, it is that
party's responsibility to ensure its own compliance with the license
terms.
Enforcement of the CDDL is not an OGB consideration. The OGB does not
hold any copyrights and has not issued any licenses. If the OGB is
notified of a license violation, it should (as a group of good
citizens) pass the information along to the copyright holder, if
his/her/its identity is known. For much of the code in OpenSolaris
including DTrace, that copyright holder is Sun Microsystems, Inc.
Further action is at the discretion of the copyright holder.
It may well be within the scope of the OGB's activities to help
educate contributors about the terms of the CDDL, but such a campaign
would require the OGB to obtain legal counsel.
- Do you have an opinion on the patentability of software? In particular,
what is the role of the OGB -- if any -- if Sun were to initiate legal
proceedings to protect a part of its software patent portfolio that
is represented in OpenSolaris?
The OGB does not own software patents (or any other property), and I
have no position on the patentability of software in general. Sun has
the right to enforce its property rights under the laws of the
countries in which it operates, and the OGB has no authority to
interfere with that enforcement. Since community members who adhere
to the terms of the licenses offered for OpenSolaris have limited (but
adequate for all uses permitted under the CDDL) licenses to patents
represented within that body of code, there is no reason for the OGB
to worry about this. If such an event were to occur, the OGB might
profitably offer a simple statement to this effect, clarifying the
facts of the situation and denying incorrect rumours. Whether such an
action would be necessary or appropriate would depend on the specific
circumstances.
- When you give public presentations, do you run OpenSolaris on your laptop?
Have you ever given a public demonstration of OpenSolaris technology?
Yes, I use OpenSolaris exclusively with the exception of
interoperability testing. Yes, I have demonstrated new technology in
Solaris 10 (now in OpenSolaris as well) at OSCON in 2004 and 2005, and
the early OpenSolaris build system technology at SVOSUG in 2005.
- And an extra credit question: Have you ever used DTrace? When did you
most recently use it, and why? The answers "just now" and "to answer
this question" will be awarded no points. ;)
Yes, I've used DTrace. I most recently used it earlier this week
while diagnosing the behaviour of two machines in an HA cluster. I've
also written a (never-integrated) System V IPC provider for
OpenSolaris and introduced USDT probes to enhance the observability of
several aspects of daemon behaviour.
OGB Election
OpenSolaris Governing Board elections begin next week. In addition, a single question will be presented to the voters: Shall the proposed Constitution be ratified? Please take the time to read this important document and learn about the issues being debated by the candidates. As a candidate for an OGB seat, I can help you right here and now with the latter task. I'd appreciate an allowance of five minutes of your time to learn where I stand on some of these issues. I welcome questions; you can send mail to all candidates to ask your questions. I'll be posting here my answers to any questions I receive in this fashion.
- The Constitution
VOTE YES
I've pointed out a number of issues with the Constitution (see the 'constitutional limitations' thread) and continue to believe that the proposal as written positions us poorly to achieve independence from Sun, accomplish useful technical work, and provide leadership. Nevertheless, the alternative (last paragraph) is unlikely to be any better, thanks to some unfortunate decisions made by Sun. Therefore I support ratification and urge you to vote in favour.
- Community Structure
We need a FULL-SCALE OVERHAUL
One of the biggest gaps in the Constitution is how the existing codebases are to be managed, controlled, and led. Indeed, the document does not even acknowledge their existence, despite the fact that they are the primary purpose for and value in OpenSolaris's existence. One of the OGB's most important tasks will be to rationalise the Community Group structure into one which will allow meaningful self-government. The centerpiece of my plan for doing this is construction of Consolidation-Sponsoring Community Groups (CSCGs). Each of these groups will be given control over an existing consolidation. This structure is not unlike that which exists today in the misnamed Nevada Community, representing ON. But that Community does not govern openly, and other consolidations are entirely missing structure under which they can be governed legitimately. Since the Constitution provides for the Community Group as the unit of independent government, each consolidation requires one to oversee its progress. The CSCGs will be responsible both for controlling the content of their codebases and for providing guidance and leadership to project teams desirous of integration. They will be required to adopt a set of rules (harmonised but not necessarily identical across all CSCGs) for integration and to apply these rules fairly.
- Projects
MINOR CHANGES are needed here.
The bar for project creation is very low today: if two Members believe a Project ought to exist, it does. This benefits everyone by allowing virtually unrestricted exploration of new spaces and approaches, but it also encourages duplication of effort and expenditure of effort on projects which are not positioned to be successful. I would like to see this approach altered: instead of directing project creation requests to a giant unmoderated mailing list (see more on this below), I would prefer to see them directed to one or more Community Groups, including (when relevant) the CSCG to which the project is targeted for integration. During a one-week initial review period, members of those Community Groups would be expected to provide feedback on the proposed project, informing its backers of related or conflicting ongoing work, the need for inclusion of additional or alternate Community Groups in the review, and risks and opportunities the project would offer. Just as importantly, this is an opportunity for Community Groups to inform the project's backers of the actions and choices the project team would need to make in order to secure those Groups' endorsements. It is expected that, by the time a project seeks integration into a consolidation, it will have secured the endorsements of all relevant Community Groups; this process will give the project team a leg up on understanding what will be required to do so, and help them make contacts and forge working relationships within those Groups. At the end of the initial review period, the project team will be required to indicate to the OGB's project-creation delegate whether, in light of the feedback received, it wishes to proceed. This decision cannot be vetoed, but a project which fails to secure the endorsement of relevant Groups will have much more work to do later if integration is desired. 
It is worth noting that integration need not be a project team's goal: some projects may be worthwhile on their own, may eventually lead to the formation of new consolidations, or may be intended solely as exploratory efforts that may yield innovative work later used elsewhere. We must not discourage these teams nor should we send them elsewhere to do their work. At the same time, we should provide a framework in which project teams desiring integration can learn early what will be required and work continuously throughout the life of the project with the technical leaders of relevant Community Groups.
- Dual- or re-licensing
I am OPPOSED to either of these steps at this time.
It's important to note that the OGB does not control the offered licenses to OpenSolaris source because it does not hold the copyrights. Only Sun can offer additional or alternate licenses. Therefore, this position is relevant only to the extent that Sun seeks the OGB's guidance on the matter. The arguments for and against changes to the licensing regime have been discussed at length; I will not repeat them here. I have two main observations: First, licensing changes appear to be a solution in search of a problem. No proponent of such changes has articulated clearly the problem(s) which such a change would solve. Given the risks and costs, I would expect a clear and convincing case to be made that license changes are necessary; that threshold has not been met. Second, the main benefits posited by advocates of licensing change center around an increase in the size and stature of our community. Unfortunately, we are ill-positioned for growth; our institutions and infrastructure are in dismal shape. Any large influx in contributors would lead to more complaints and flames but little additional useful work. If we desire to grow, we must first position ourselves to leverage fully our existing contributor base. Until then, a focus on growth makes no sense. Similarly, I have little concern for our 'stature' in the broader Free Software community. If the FSF or a similar organisation would like us to change our licensing to better suit their interests, or to form a partnership to deliver interesting and useful products, we should remain open to such offers if they would benefit all parties. Since no such offer has been made, and made openly, there is little reason to consider hypothetical partnerships as a key benefit of a licensing change.
- Infrastructure
The OGB and the Tools Community must exert leadership; BLIND RELIANCE ON SUN IS NOT THE ANSWER
The OGB must formulate a plan with dates and milestones for opening defect tracking to community participation, establishing review, approval, and archival mechanisms for change submissions, and increasing the transparency and utility of the ARC process. The OGB must also establish rules that Community Groups will be expected to follow regarding acceptance and integration of opaquely-managed projects (namely, that non-grandfathered projects of this type must not be permitted to integrate until, at minimum, a sufficient period for public review has elapsed). Since Sun currently has a variety of tools for managing these processes, it would of course be nice if they would make those tools available to us. However, Sun's resources for doing so are limited, and in some cases the tools are poorly designed to be used outside a LAN. The most important such example is the Bugtraq2/Bugster defect tracking system. Lack of access to this system is a major roadblock to open development, and Sun has not offered a plan to address this problem. The OGB must seek a firm commitment from Sun to open access to this system in an acceptable way, and must hold Sun to agreed-upon milestones in that plan. If Sun declines to offer an acceptable plan for doing so, or fails to uphold its agreement, the OGB must assist the Community Groups, notably but not exclusively Tools, in designing and constructing suitable replacements. I would like the OGB to issue a Call for Proposals for solving the defect tracking problem with a deadline of May 20. Sun is especially invited to submit a proposal. The OGB would then evaluate the proposals, giving special weight to one which would allow access to the existing body of information in Bugtraq2, and establish and monitor progress toward a chosen proposal. Other infrastructure problems (code review and archival, ACLs and Wiki-like features, RTI handling, etc.) should be handled in a similar fashion. This general framework is proving itself effective in the SCM project and we should not hesitate to use it in the future rather than expecting Sun to "do something" "someday."
- Communication
The OGB MUST DO MORE to improve the signal-to-noise ratio, and to communicate its own activities more clearly
Several have complained about communication of information about the election, and with good reason. The OGB has at times communicated poorly with the other members. I would like to see the OGB use opensolaris-announce (a read-only list containing all members) more heavily to communicate information of universal interest. Correspondingly, opensolaris-discuss should never be used to convey any official information, nor to seek input or feedback from all members. Instead, the OGB should provide a set of mailing lists open to all in which topics related to governance can be discussed. When input or feedback are desired on a particular issue, the OGB should announce a Call For Discussion via opensolaris-announce, pointing interested members to the appropriate topical list. Naturally, the traffic on -announce should be kept low, but neither should we be afraid to use it when appropriate: it is a highly effective way to reach all members without requiring them to subscribe to a largely useless list with minimal signal and excessive noise. I will recommend that the OGB adopt a policy that its members not subscribe to -discuss, so as to force the board to communicate with all members on an equal footing. In short, subscription to such a high-volume, low-S/N list wastes time and resources that could be better spent working on real problems in more focused venues. The OGB should strongly encourage the use of appropriate topical, project-, or Community Group-sponsored lists for technical questions, proposals, and announcements. The general discussion list may well be reserved for flames, offtopic "water-cooler" conversation, and sophomoric hand-wringing over OpenSolaris's future. No one who does useful work should have to filter such tripe in order to keep up with important news.
- Culture and Leadership
A QUALITY-CENTRIC ENGINEERING CULTURE is one of our greatest assets; the OGB must encourage and strengthen that culture.
The OGB is not intended to make technical decisions; these are to be made by Community Groups. Nevertheless, the OGB must position these Groups to enforce sound engineering philosophy, and provide them with the tools and support needed to do so. There is far too often a perception that the "movers and shakers," those who want to "cut red tape" and "just solve problems" are the community's true leaders. At times, this is indeed true. But engineering also requires a sober, cautious approach to new problems, especially those which are poorly understood. The existence of process and review is neither an accident nor red tape. Instead, these tools help us make the right decisions - decisions that will remain with us for many years. The OGB should urge, and where appropriate, force, its Community Groups to keep this in mind as they evaluate proposals and requests. Expressions of enthusiasm and a can-do spirit are welcome, but should not be confused with commitment or full agreement. It can take weeks or months of work to validate or discredit a particular approach to a problem. The most successful Community Groups will be those which do not commit to a particular approach until that time has passed.
- The role of Sun
Sun's engineers are IMPORTANT CONTRIBUTORS but Sun Microsystems, Inc. is JUST ANOTHER DISTRIBUTOR of our technology and enjoys NO SPECIAL STANDING.
One of the largest challenges the OGB will face is encouraging the formation of decision-making bodies that operate openly and are independent of Sun, while still ensuring that the interests of Sun and other distributors are well-served. Far too much of our activity today takes place entirely within Sun in a largely opaque fashion. For example, the Solaris PAC, an entity mentioned nowhere in the Charter or Constitution, still believes it has the authority to set integration rules for each build. And, in part because no alternate framework exists for making these decisions, SPAC in fact does - improperly - exercise this authority. The OGB is responsible for taking over these functions with respect to OpenSolaris and providing a framework in which these actions can be taken openly. None of this should be taken to imply that the OGB exercises control over Solaris (Sun's distribution); like any other distributor, Sun remains free to ship whatever products it wishes without regard for the OGB or any other action of the OpenSolaris community. But to the extent that it wishes to undertake actions which conflict with openly-established policies, it must branch or fork in order to do so. If we make our decisions properly, with input from all stakeholders and with adequate transparency, then Sun's or another distributor's choice to do so will be both healthy and desirable. It may not always be possible to meet the needs of every possible member of our community, and sometimes Sun's corporate interests may be the ones we cannot serve. For now, however, our focus must be on building credible and authoritative institutions which are independent but not ignorant of Sun.
I should note that I work for Sun, although not for the business unit responsible for Solaris. However, I am running as an independent individual, not a representative of Sun or any other entity. I have in the past expressed skepticism and disagreement with Sun's (and Sun executives') positions on various issues of interest to our community, and I will continue to do so in the future when appropriate. The OGB is not beholden to Sun or anyone else, and its members are expected to act accordingly. Neither corporations nor corporate representatives are permitted to serve - by design. I expect your confidence in my ability and determination to act independently for the common good.
The Solaris reference implementation of the fault manager recently got a
boost in its ability to report faults with the introduction of a
two-part SNMP agent. This agent makes it easy to integrate the Solaris
fault manager into existing SNMP-based monitoring infrastructure.
Background
The fault manager has always been able to report faults to the system
log and console(s), and to provide a wealth of status information via
fmadm(1M)
and fmdump(1M).
But these reporting mechanisms leave much to be desired; syslog messages
must be parsed, and a busy central log host can easily lose important
messages in the noise. Worse still, a privileged user must log into the
affected system and run administrative commands to get information they
need that isn't contained in the message.
SNMP is a natural choice for extending the reach of the fault manager's
voice; it's widely used to facilitate centralised monitoring of events
throughout and even across administrative domains. The basic model is
simple and extensible; information can be pushed from any device to one
or more network management stations (NMSs), or pulled by an
administrator or automated utility from a particular device of interest.
Managed devices - in this case, a Solaris system - signify events using
traps (also called notifications in SNMPv2), which provide a limited
amount of information to designated NMSs. They also provide access to a
management information base (MIB) on demand. Generally, the MIB
provides access to a much greater breadth and depth of information than
is transmitted with a trap or notification. An NMS can be configured to
retrieve additional data from the MIB upon receipt of a trap if desired.
Availability
The technology described here is available in Solaris Nevada builds 33
and later. OpenSolaris
offers access to the sources. A prerequisite for building or using
these applications is the installation of the SMA packages provided by
the SFW consolidation; BFUing newer ON bits is not sufficient. If you
have SWAN access, you can run
/ws/onnv-gate/public/bin/update_sma to get the necessary
packages; otherwise see the OpenSolaris
download center for the packages.
A Note on NMS Configuration
If you use the Net-SNMP-based NMS software delivered in Solaris, as I do
below, you will want to tell the client utilities to use the fault
management MIB to encode and decode OIDs. The easiest way to do this is
to add MIBS=+ALL to your environment. You can also make
this permanent by creating (or adding to)
/etc/sma/snmp/snmp.conf the line:
mibs +ALL
See snmp.conf(4)
for more information on MIB searching and importing. If you use a
different NMS, consult your vendor's documentation to learn how to
import a new MIB.
snmp-trapgen: an SNMP plugin for fmd(1M)
The trap or notification generator component is snmp-trapgen. This is a
very simple fault manager plugin similar to that which logs fault
information to the system log and console. Instead of writing formatted
text to a log device, however, this plugin generates SNMPv1 traps and/or
SNMPv2 notifications, one for each destination configured in the
systemwide snmpd.conf(4).
No additional configuration is required; if you have already configured
a system to send traps to one or more NMSs, you don't need to do
anything else to be notified upon fault diagnosis. If not, you'll want
to add v1 or v2 trap destinations to
/etc/sma/snmp/snmpd.conf. The hostnames or addresses you
use will need to be configured to receive and act upon SNMP traps or
notifications. If you don't have an NMS on your network, you can use
the snmptrapd(1M)
server included with Solaris.
A fault diagnosis trap (sunFmProblemTrap) includes a limited subset of
the information contained in the syslog message associated with the
fault. Specifically, the diagnosis's UUID, diagnostic code, and
reference URL are included. The object identifiers (OIDs) for these
data are defined by the fault management MIB, SUN-FM-MIB, installed in
/etc/sma/snmp/mibs/. The same information is delivered to
both SNMPv1 and SNMPv2 trap sinks. At present, this is the only trap
defined by the fault management MIB, but others may be generated in the
future. Here's an example of an SNMPv2 notification as decoded by
snmptrapd(1M):
2006-02-07 16:36:34 stomper [192.xx.xx.xx]:
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (2266748911) 262 days, 8:31:29.11
SNMPv2-MIB::snmpTrapOID.0 = OID: SUN-FM-MIB::sunFmProblemTrap
SUN-FM-MIB::sunFmProblemUUID."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: "a58aa105-4fab-6e16-8557-ab7687113de7"
SUN-FM-MIB::sunFmProblemCode."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: SUN4U-8000-KA
SUN-FM-MIB::sunFmProblemURL."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: http://sun.com/msg/SUN4U-8000-KA
The diagnostic code and URL can be used to find knowledge base articles
describing the fault and suggested corrective action. The diagnosis
UUID can be used to get further detail from fmdump(1M),
or from the MIB, as seen in the next section.
libfmd_snmp: a MIB plugin for the System Management Agent (SMA)
Knowing that a fault has been diagnosed is important, but the amount of
information delivered with the trap or notification may not be enough to
provide an administrator with a complete understanding of the problem.
The fault management MIB defines a wealth of detail, and this detail is
made available via SMA by libfmd_snmp. In addition to fault diagnosis
detail, this MIB also offers information about faulty components and the
configuration of the fault manager itself, similar to that offered by
fmadm(1M).
Enabling the plugin requires configuring the master SNMP agent on each
server you wish to query. Adding the architecture-dependent line
dlmod sunFM /usr/lib/fm/sparcv9/libfmd_snmp.so.1
to /etc/sma/snmp/snmpd.conf will cause the MIB plugin to be
automatically loaded and initialised the next time the master agent is
started, such as via /etc/init.d/init.sma. In the future, SMA will be
managed via SMF; see 6349499[0].
No further configuration is necessary, although the usual snmpd.conf(4)
directives will allow you to restrict access to the MIB, which may be
important to you since some of the information it provides is ordinarily
restricted to privileged users.
The fault management MIB provides 4 tables and a single scalar, in
addition to the trap/notification described above. sunFmProblemTable
and sunFmFaultEventTable are logically two pieces of the same table;
they are separated only because MIBs do not support nested tables. The
problem table contains the scalar information about each diagnosis,
while the fault event table contains lists of the events associated with
each diagnosis. Both tables are indexed by diagnosis UUID; the fault
event table utilises a second scalar index to distinguish between
multiple events associated with a diagnosis. In response to the trap
above, you might want to know which Automated System Recovery Unit(s)
(ASRU(s)) the fault manager believes may have caused the fault. This is
just a fancy way of saying we want to know what broke to trigger the
diagnosis. Because each ASRU is associated with a fault event, we'll
first need to know how many fault events were associated with this
diagnosis so that we can then look up each one's ASRU in the fault event
table. To do this, we'll use snmpget(1M),
delivered by Solaris in /usr/sfw/bin. Of course, you can
use any NMS software.
nms$ snmpget -c public -v 2c stomper \
sunFmProblemSuspectCount.\"a58aa105-4fab-6e16-8557-ab7687113de7\"
SUN-FM-MIB::sunFmProblemSuspectCount."a58aa105-4fab-6e16-8557-ab7687113de7" = Gauge32: 1
This diagnosis has only one fault event associated with it. To look up
the ASRU, we'll look in the fault event table entry indexed by the UUID
and the fault index. Since fault events are indexed starting from 1,
we'll need to do:
nms$ snmpget -c public -v 2c stomper \
sunFmFaultEventASRU.\"a58aa105-4fab-6e16-8557-ab7687113de7\".1
SUN-FM-MIB::sunFmFaultEventASRU."a58aa105-4fab-6e16-8557-ab7687113de7".1
= STRING: cpu:///cpuid=4/serial=23EBEC1505
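When a diagnosis has more than one suspect, the same lookup is repeated for each fault index from 1 up to the suspect count. A minimal sketch of how a script or small utility might build those OIDs is shown here; the function names asru_oid() and print_asru_oids() are hypothetical, and only the OID text format is taken from the snmpget commands above.

```c
#include <stdio.h>

/*
 * Format the textual OID for the ASRU of one fault event, in the same
 * form passed to snmpget above.  Fault events are indexed from 1.
 * Returns 0 on success, -1 if index is 0 or the buffer is too small.
 */
static int
asru_oid(char *buf, size_t len, const char *uuid, unsigned int index)
{
    int n;

    if (index == 0)
        return (-1);
    n = snprintf(buf, len, "sunFmFaultEventASRU.\"%s\".%u", uuid, index);
    return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}

/*
 * Print one OID per suspect; count is the value previously read from
 * sunFmProblemSuspectCount for this UUID.
 */
static void
print_asru_oids(const char *uuid, unsigned int count)
{
    char oid[128];
    unsigned int i;

    for (i = 1; i <= count; i++) {
        if (asru_oid(oid, sizeof (oid), uuid, i) == 0)
            (void) puts(oid);
    }
}
```

Each line printed can be handed to snmpget or an equivalent NMS query; the sketch does not perform any SNMP requests itself.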
Most NMSs offer scripting facilities that allow you to perform actions
similar to these in response to a trap. Alternately, you could poll the
data on a regular basis. Many implementations do both, using polling to
offset the risk of losing traps, which like all SNMP datagrams do not
offer reliable transmission. SNMPv3 informs, also known as acknowledged
notifications, offer only a partial remedy to this problem, and are not
supported by snmp-trapgen at this time.
A polling NMS may wish to poll the systemwide faulty component count,
provided by the MIB as sunFmFaultCount. An increase in this gauge
without a corresponding problem trap is a good indication that the trap
has been lost. More detail about devices the fault manager believes to
be in degraded or faulted states is available via the
sunFmResourceTable; walking this table provides a ready - and remote -
answer to the common question "What's broken on that machine?" For
this, we use the snmpwalk(1M)
utility:
nms$ snmpwalk -c public -v 2c stomper sunFmResourceTable
SUN-FM-MIB::sunFmResourceFMRI.1 = STRING: cpu:///cpuid=4/serial=23EBEC1505
SUN-FM-MIB::sunFmResourceStatus.1 = INTEGER: degraded(3)
SUN-FM-MIB::sunFmResourceDiagnosisUUID.1 = STRING:
"a58aa105-4fab-6e16-8557-ab7687113de7"
Finally, the sunFmConfigTable offers remote access to the same
information provided by fmadm(1M)'s
config subcommand; like the other tables, it can be
accessed using snmpget(1M),
snmpwalk(1M),
or any other SNMP-compatible NMS implementation. You can find the
complete fault management MIB at the Fault Management
community site, and in build 33 and later at
/etc/sma/snmp/mibs/SUN-FM-MIB.mib.
[0] The bug should be visible,
but it isn't. This is itself a bug, which the SFW team is working to
fix.
For those who attended the SVOSUG meeting last night and are looking for boilerplate code similar to that Max presented, you can find it in the Device Driver Tutorial. This gentle introduction also includes a trivial but functional pseudo device implementation.
Long ago, I promised to write more about gcc inline assembly, in
particular a few cases that are tricky to get right. Here, somewhat
belatedly, are those cases. These examples are taken from libc, but the
concepts apply to any inline assembly fragments you write for gcc. As I
mentioned previously,
these concerns apply only to gcc-style inlines; the Studio-style inline
format doesn't require that you use this same level of caution. gcc
expects you to write assembly fragments (even in a "separate" inline
function) as if they are logically a part of the caller. That is, the
compiler will allocate registers or other appropriate storage locations
to each of the input and output C variables. This requires that you
instruct the compiler very carefully as to your use of each variable,
and the variables' relationships to one another. The advantage is much
better register allocation; the compiler is free to allocate whatever
registers it wishes to your input and output variables in a manner that
is transparent to you. By contrast, Studio requires that you code the
fragment as if it were a leaf function, so the compiler does not do any
register allocation for you. You are permitted to use the caller-saved
registers any way you wish, and even to use the caller's stack as if you
are in a leaf function. Arguments and return values are stored in their
ABI-defined locations. Depending on the optimization level you use,
this can be wasteful of registers (though the peephole optimizer can
often clean up some of this waste) and can also make writing the
fragment much more difficult. In exchange, however, you don't have to
be nearly as careful to express the fragment's operation to the
compiler.
Inputs, Outputs, and Clobbers (oh my!)
Each assembly fragment may have any or all of outputs, inputs, and
clobbers. Each input and output maps a C variable or literal to a
string suitable for use as an assembly operand. These operands can then
be referenced as %0, %1, %2, etc.
These are ordered beginning from 0 with the first output, followed by
the inputs. Alternately, newer versions of gcc allow the use of
symbolic names for each input and output. Clobbers are somewhat
different; they express the set of registers and/or memory whose values
are changed by the fragment but are not expressed in the outputs.
Inputs which are also changed must be listed as outputs, not clobbers.
Normally, the clobbers include explicit registers used by certain
instructions, but may also include "cc" to indicate that
the condition code registers are modified and/or "memory"
to indicate that arbitrary memory addresses have had their contents
altered.
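To make the pieces concrete, here is a small sketch with one output, one input, and one clobber, written with symbolic operand names. It assumes an x86 target and AT&T syntax rather than SPARC, purely so it can be tried on commonly available hardware; the function name sub32() is hypothetical, and other targets fall back to plain C.

```c
#include <stdint.h>

/*
 * Subtract b from a.  The single output (acc) would be %0 and the
 * single input (rhs) would be %1 if numbered instead of named.  "cc"
 * is listed as a clobber because the instruction sets the flags.
 */
static inline uint32_t
sub32(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__(
        "subl %[rhs], %[acc]"
        : [acc] "+r" (a)    /* outputs */
        : [rhs] "r" (b)     /* inputs */
        : "cc");            /* clobbers */
#else
    a -= b;                 /* portable fallback on other targets */
#endif
    return (a);
}
```

Because a is modified by the fragment, it appears among the outputs rather than the clobbers, as described above.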
Constraints
Outputs and inputs are expressed as constraints, in a language
specifying the type of operand that will contain the value of a
variable. Common constraints include "r", indicating that
a general register should be allocated, and "m" indicating
that some type of memory location should be used. The complete list of
constraints is found in the
gcc documentation. These constraints may contain modifiers, which
give gcc more information about how the operand will be used. The most
common modifiers are "=", "+", and
"&". The "=" modifier is used to indicate
that the operand is output-only; it may appear only in the constraint
for an output variable. Even if the constraint is applied to a variable
containing an existing value in your program, there is no guarantee that
it will contain that value when your assembly fragment is executed. If
you need that, you must use the "+" modifier instead of
"="; this tells the compiler that this operand is both an
input and an output. Nevertheless, the variable with this constraint is
provided only in the outputs section of the fragment's specification.
An alternate way to express the same thing is provided in the
documentation. Note that providing the same variable as both an input
and an output does not guarantee you that the same location (register,
address, etc.) will be used for both of them. Thus the following is
generally incorrect:
static inline int
add(int var1, int var2)
{
__asm__(
"add %2, %0"
: "=r" (var1)
: "r" (var1), "r" (var2));
return (var1);
}
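For comparison, here are two ways the fragment might be written correctly. This is a sketch assuming an x86 target with AT&T syntax (the two-operand add above has the same shape); add_matching() is a hypothetical name, and it uses the matching constraint "0", which is the alternate form the gcc documentation describes for tying an input to an output.

```c
/* Read-write operand: var1 is listed once, in the outputs, with "+". */
static inline int
add(int var1, int var2)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__(
        "addl %1, %0"
        : "+r" (var1)       /* %0: read and written by the fragment */
        : "r" (var2)        /* %1: read only */
        : "cc");
#else
    var1 += var2;           /* portable fallback on other targets */
#endif
    return (var1);
}

/* Matching constraint: input "0" must occupy the same location as %0. */
static inline int
add_matching(int var1, int var2)
{
    int sum;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__(
        "addl %2, %0"
        : "=r" (sum)
        : "0" (var1), "r" (var2)
        : "cc");
#else
    sum = var1 + var2;      /* portable fallback on other targets */
#endif
    return (sum);
}
```

In both versions gcc is told explicitly that the destination register holds a meaningful value on entry, which is the guarantee the incorrect version lacks.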
The "&" modifier is used on an output operand whose value
is overwritten before all the input operands are consumed. This
exists to prevent gcc from using the same register for both the input
and output operands. For example, for swap32()
(see also the
Studio inline function), we might think to write:
extern __inline__ uint32_t
swap32(volatile uint32_t *__memory, uint32_t __value)
{
...
uint32_t __tmp1, __tmp2;
__asm__ __volatile__(
"ld [%3], %1\n\t"
"1:\n\t"
"mov %0, %2\n\t"
"cas [%3], %1, %2\n\t"
"cmp %1, %2\n\t"
"bne,a,pn %%icc, 1b\n\t"
" mov %2, %1"
: "+r" (__value), "=r" (__tmp1), "=r" (__tmp2)
: "r" (__memory)
: "cc");
return (__tmp2);
}
But suppose gcc decided to allocate o0 to both
__tmp1 and __memory. This is allowable,
because the "=r" constraint implies that the corresponding
register is set only after all input-only operands are no longer needed
(input/output operands obviously don't have this problem). In the case
above, the first load would clobber o0 and the
cas would operate on an arbitrary location. Instead, we
must write "=&r" for both __tmp1 and
__tmp2; neither variable may safely be allocated the same
register as the input operand.
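The same hazard can be reproduced in a two-instruction fragment. The sketch below assumes an x86 target and AT&T syntax rather than SPARC, and sum32() is a hypothetical name; the point is only that the output is written before the last input is read, so it needs the earlyclobber.

```c
#include <stdint.h>

static inline uint32_t
sum32(uint32_t a, uint32_t b)
{
    uint32_t out;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__(
        "movl %1, %0\n\t"   /* out is written here ... */
        "addl %2, %0"       /* ... but b is not read until here */
        : "=&r" (out)       /* "&": never share a register with an input */
        : "r" (a), "r" (b)
        : "cc");
#else
    out = a + b;            /* portable fallback on other targets */
#endif
    return (out);
}
```

With plain "=r", gcc would be free to place out and b in the same register, in which case the first instruction would destroy b and the function would return a + a.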
Bugs caused by omitting the earlyclobber are painful to track down
because they often appear and disappear from one compilation to the
next as entirely unrelated code changes cause increases or decreases
in register pressure.
This is not an academic concern. Consider this example program:
#include <inttypes.h>
static __inline__ void
incr32(volatile uint32_t *__memory)
{
uint32_t __tmp1, __tmp2;
__asm__ __volatile__(
"ld [%2], %0\n\t"
"1:\n\t"
"add %0, 1, %1\n\t"
"cas [%2], %0, %1\n\t"
"cmp %0, %1\n\t"
"bne,a,pn %%icc, 1b\n\t"
" mov %1, %0"
: "=r" (__tmp1), "=r" (__tmp2)
: "r" (__memory)
: "cc");
}
uint32_t
func(uint32_t x)
{
uint32_t y = 4;
uint32_t z = x + y;
incr32(&y);
z = x + y;
return (z);
}
gcc compiles this (use -O2 -mcpu=v9 -mv8plus) into:
func()
func: 9c 03 bf 88 add %sp, -0x78, %sp
func+0x4: 9a 10 20 04 mov 0x4, %o5
func+0x8: 90 02 20 04 add %o0, 0x4, %o0
func+0xc: da 23 a0 64 st %o5, [%sp + 0x64]
func+0x10: 82 03 a0 64 add %sp, 0x64, %g1
func+0x14: c2 00 40 00 ld [%g1], %g1 <===
func+0x18: 9a 00 60 01 add %g1, 0x1, %o5
func+0x1c: db e0 50 01 cas [%g1] , %g1, %o5 <= SEGV
func+0x20: 80 a0 40 0d cmp %g1, %o5
func+0x24: 32 47 ff fd bne,a,pn %icc, func+0x18
func+0x28: 82 10 00 0d mov %o5, %g1
func+0x2c: 81 c3 e0 08 retl
func+0x30: 9c 23 bf 88 sub %sp, -0x78, %sp
In this case, gcc has allocated g1 to both
__tmp1 and __memory, and o5 to
__tmp2. Note the highlighted instructions: the initial
load destroys the value of g1, and the subsequent
cas will attempt to operate on whatever address was stored
at *__memory when the fragment began. In this example,
that value will be 4 (g1 is assigned sp+0x64,
which is simply the address of y). This program is
compiled incorrectly due to improper constraints, and will cause a
segmentation fault if the code in question is executed.
If instead we use "=&r" for both __tmp1
and __tmp2, gcc generates the following code:
func()
func: 9c 03 bf 88 add %sp, -0x78, %sp
func+0x4: 9a 10 20 04 mov 0x4, %o5
func+0x8: 90 02 20 04 add %o0, 0x4, %o0
func+0xc: da 23 a0 64 st %o5, [%sp + 0x64]
func+0x10: 82 03 a0 64 add %sp, 0x64, %g1
func+0x14: d8 00 40 00 ld [%g1], %o4 <===
func+0x18: 9a 03 20 01 add %o4, 0x1, %o5
func+0x1c: db e0 50 0c cas [%g1] , %o4, %o5 <= OK
func+0x20: 80 a3 00 0d cmp %o4, %o5
func+0x24: 32 47 ff fd bne,a,pn %icc, func+0x18
func+0x28: 98 10 00 0d mov %o5, %o4
func+0x2c: 81 c3 e0 08 retl
func+0x30: 9c 23 bf 88 sub %sp, -0x78, %sp
This code now assigns o4 to __tmp1, which
eliminates the problem described above. This function, however, still
does not do the right thing. Why not?
Reloading
Compilers keep track of where each live variable in the program can
be found; many variables can be found both at some memory location and
in a register. Sometimes, the compiler chooses to use a register for a
different variable, and stores the value back to its memory location (if
it has changed) before doing so. Later, if this value is needed, the
value must be loaded back into a register before being used. This is
known as reloading. Other reasons reloading may be required include a
variable's declaration as volatile and the case that
concerns us here, a variable's modification via side effects.
In the example above, incr32() is actually operating on
a memory address, not a register. So why did we assign
__memory the "r" constraint instead of more
correctly expressing the constraint as "+m" (*__memory)?
It turns out that the "m" constraint allows a variety of
possible addressing modes. On SPARC, this includes the register/offset
mode (such as [%sp+0x64]). This is fine for instructions
like ld and st, but the cas
instruction is special: it allows no offset. No constraint exists to
describe this condition; the "V" constraint is clearly
similar but is not correct; a bare register ([%g1]) is an
offsettable address, so "V" would actually exclude the case
we want. Conversely, "o", the inverse constraint of
"V", includes the register/offset addressing mode we
specifically wish to exclude. So, the only way to express this
constraint is "r". But this does nothing to capture the
fact that although the pointer itself is not modified, the value at
*__memory is altered by the assembly fragment. Is this a
problem? Let's look at the assembly generated for func() a
little more closely:
func()
func: 9c 03 bf 88 add %sp, -0x78, %sp
func+0x4: 9a 10 20 04 mov 0x4, %o5
func+0x8: 90 02 20 04 add %o0, 0x4, %o0 <===
func+0xc: da 23 a0 64 st %o5, [%sp + 0x64]
func+0x10: 82 03 a0 64 add %sp, 0x64, %g1
func+0x14: d8 00 40 00 ld [%g1], %o4
func+0x18: 9a 03 20 01 add %o4, 0x1, %o5
func+0x1c: db e0 50 0c cas [%g1] , %o4, %o5
func+0x20: 80 a3 00 0d cmp %o4, %o5
func+0x24: 32 47 ff fd bne,a,pn %icc, func+0x18
func+0x28: 98 10 00 0d mov %o5, %o4
func+0x2c: 81 c3 e0 08 retl <===
func+0x30: 9c 23 bf 88 sub %sp, -0x78, %sp
We see that gcc has assigned z the o0
register, which is not surprising given that it's the return value. But
after o0 is set to x + 4 at the beginning of
the function, it's never set again. The line z = x + y has
been discarded by the compiler! This is because it does not know that
our inline assembly modified the value of y, so it did not
reload the value and recalculate z.
There are two ways we can correct this problem: (a) add a
"+m" output operand for *__memory, or (b) add
"memory" to the list of clobbers. This is a special
clobber that tells gcc not to trust the values in any registers it would
otherwise believe to hold the current values of variables stored in
memory. In short, this clobber tells gcc that any value cached in a
register must be reloaded from memory before it is used again. This is
somewhat inefficient when we know which piece of memory has been
touched, so (a) is preferable for better performance.
Whichever solution we choose, gcc now compiles our code to:
func()
func: 9c 03 bf 88 add %sp, -0x78, %sp
func+0x4: 9a 10 20 04 mov 0x4, %o5
func+0x8: 98 10 00 08 mov %o0, %o4
func+0xc: da 23 a0 64 st %o5, [%sp + 0x64]
func+0x10: 82 03 a0 64 add %sp, 0x64, %g1
func+0x14: d6 00 40 00 ld [%g1], %o3
func+0x18: 9a 02 e0 01 add %o3, 0x1, %o5
func+0x1c: db e0 50 0b cas [%g1] , %o3, %o5
func+0x20: 80 a2 c0 0d cmp %o3, %o5
func+0x24: 32 47 ff fd bne,a,pn %icc, func+0x18
func+0x28: 96 10 00 0d mov %o5, %o3
func+0x2c: d0 03 a0 64 ld [%sp + 0x64], %o0 <===
func+0x30: 90 03 00 08 add %o4, %o0, %o0 <===
func+0x34: 81 c3 e0 08 retl
func+0x38: 9c 23 bf 88 sub %sp, -0x78, %sp
Note the reload, which will now return the correct result. There
are actually two other ways to correct this, although the use of
"+m" is the most correct. First, we could declare
y to be volatile in func(). This
would force gcc to reload its value from memory any time that value is
required. Use of the volatile keyword is mainly useful
when some external thread (or hardware) may change the value at any
time; using it as a substitute for correct constraints will cause
unnecessary reloading, degrading performance. Second, and perhaps best
of all, the compiler could be modified to accept a SPARC-specific
constraint for use with the cas instruction, one which
requires the address of the operand to be stored in a general register.
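Both of the main fixes can be sketched side by side. The versions below assume an x86 target and AT&T syntax, where a locked increment accepts an ordinary memory operand, so they are not the SPARC cas loop shown earlier; the names incr32_m() and incr32_clobber() are hypothetical, and other targets fall back to plain C.

```c
#include <stdint.h>

/* (a) Name the modified location as a "+m" operand. */
static inline void
incr32_m(volatile uint32_t *p)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__(
        "lock; incl %0"
        : "+m" (*p)
        :
        : "cc");
#else
    (*p)++;                 /* portable fallback on other targets */
#endif
}

/* (b) Pass only the pointer in a register and clobber "memory". */
static inline void
incr32_clobber(volatile uint32_t *p)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__(
        "lock; incl (%0)"
        :
        : "r" (p)
        : "cc", "memory");
#else
    (*p)++;                 /* portable fallback on other targets */
#endif
}

/* Same shape as func() above: y must be reloaded after each call. */
static uint32_t
func(uint32_t x)
{
    uint32_t y = 4;

    incr32_m(&y);
    incr32_clobber(&y);
    return (x + y);
}
```

In either form gcc knows that y may have changed, so it cannot fold x + y into a constant offset computed before the fragment runs.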
You can find more inline assembly examples in libc (math
functions), MD5
acceleration, and the
kernel illustrating these concepts. Be sure to read and understand
the documentation completely before writing your own
inline assembly for gcc, and always test your understanding by
constructing and compiling simple test programs like these.
Not so long ago I was looking through Solaris's shells for memory allocators - functions that perform tasks similar to malloc(3c). These functions often store the size of the allocated block at the beginning of each block; if that size is stored as a 4-byte value, the return value from the allocator may not be aligned on an 8-byte boundary. This is a major problem on SPARC, because it's not uncommon to allocate structs or unions containing types that require 8-byte alignment, especially long long. As it turns out, gcc correctly assumes that long long variables are aligned on 8-byte boundaries and uses the ldd and std instructions to access them. Our Studio compiler doesn't; it always issues two ld or st instructions. The result is that programs using this kind of allocator can crash when built with gcc but not with Studio, not a pleasant condition.
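A toy allocator makes the alignment problem easy to see. This is only a sketch under stated assumptions: the arena, toy_alloc(), and is_aligned8() are hypothetical names, there is no free() and no relation to the shells' actual code, and the header size is a parameter so the 4-byte and 8-byte cases can be compared.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Backing store for the toy allocator, itself 8-byte aligned. */
static _Alignas(8) unsigned char arena[256];
static size_t arena_used;

/*
 * Carve size bytes out of the arena, preceded by a header of hdrsize
 * bytes (at least 4) that records the block size as a 4-byte value.
 * Returns NULL when the arena is exhausted.
 */
static void *
toy_alloc(size_t hdrsize, size_t size)
{
    unsigned char *hdr = arena + arena_used;
    uint32_t sz32 = (uint32_t)size;

    if (hdrsize < sizeof (sz32) ||
        arena_used + hdrsize + size > sizeof (arena))
        return (NULL);
    (void) memcpy(hdr, &sz32, sizeof (sz32));   /* size at block start */
    arena_used += hdrsize + size;
    return (hdr + hdrsize);
}

/* Nonzero if p is suitable for types that need 8-byte alignment. */
static int
is_aligned8(const void *p)
{
    return (((uintptr_t)p & 7) == 0);
}
```

Starting from an empty arena, toy_alloc(4, 16) returns an address 4 bytes past an 8-byte boundary, which is exactly the case in which ldd and std fault on SPARC; padding the header to 8 bytes and rounding sizes up to a multiple of 8 keeps every returned pointer aligned.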
As part of my search, I found that, indeed, the Bourne and Korn shells have some alignment problems. Though these are bugs, we've decided that there's no reliable way to find all possible bugs of this type, so we worked around them in the compiler as well as fixed the ones we've found. This is, if nothing else, a good argument against compilers that "help" programmers by covering up this kind of error. But the best prize of all wasn't the kind of problem I was looking for, but rather this gem from the C shell:
showall(av);
printf("i=%d: Out of memory\n", i);
chdir("/usr/bill/cshcore");
abort();
This is the systems programming equivalent of finding a live wooly mammoth contentedly smoking a cigar in your recliner. Unfortunately, there's no way to trigger this behaviour, as it's protected by the "debug" preprocessor symbol, which we never set in a normal build. Nevertheless, thanks to OpenSolaris, you can see it for yourself.
We harp incessantly on the need to be able to debug production code, with no recompilation needed; there are a number of better ways to debug this particular condition. For example, you could use the DTrace pid provider to stop a csh process when nomem() is called, and even provide a backtrace. If that weren't enough, you could then use mdb(1) to debug the problem in greater detail, or gcore(1) to produce a core dump. But the best part, the real joy, if you'll pardon the pun, is the chdir call. Clearly the purpose was to drop core in a predictable location for later analysis by the author. I think you'll find that coreadm(1m), along with other corefile improvements, offers a far more flexible and powerful way to accomplish this - and it complements nicely the other debugging strategies I mentioned above.
Those of you in or near Portland, Oregon are encouraged to come and see us at OSCON this week. Most of the conference is at the Convention Center this year (use the helpfully-named Convention Center train stop). Sun will have a booth in the exhibit hall starting Wednesday, and we're giving a few talks as well. In particular, join Bryan and me for a free tutorial on building, installing, and developing with OpenSolaris using DTrace, mdb, and more. That will be held Tuesday at 1:30pm in room D140. Then on Wednesday, I'll be giving a short talk on the status of OpenSolaris at 2:35pm in Portland/255, and we'll have a BOF at 8:30pm. Thursday, don't miss Bryan's short talk on DTrace at 4:30pm.
Even if you can't make the conference, you're welcome to join me for a beer. Send me mail at wesolows at eng dot sun dot com if you're interested, or leave a message for me at the 5th Avenue Suites.
The First OpenSolaris Project: GCC Support
OpenSolaris is (finally)
available. I've been working on this every day I've been with Sun,
though others have spent years on the effort, and it's an amazing
milestone. Unlike most launches, though, this is the beginning of a new
effort rather than the end of one. As much as we've done already,
there's far more
left to be done before OpenSolaris can fulfil all our promises and
achieve all our goals.
One promise we have fulfilled today is our commitment to make
OpenSolaris accessible to people without the money or desire to buy
compilers. Since most of Solaris is normally built with the
Sun Studio compilers, this meant we'd need either to provide the
compilers on the same terms as Solaris (also
required to build OpenSolaris sources), or modify the sources to build
and work with the GNU C compiler,
available with source and free of charge under the terms of the GNU GPL. For reasons more illustrative of bureaucracy and
human nature than of technological difficulties, we were unsure almost
until the moment of launch whether we would be able to provide the
Studio compilers under acceptable terms; therefore, another engineer and
I have spent the last two and a half months porting OpenSolaris to gcc.
At this point I had a nice writeup on inline assembly differences
between the Studio and GNU compilers. But it relies on source code that
isn't available yet - namely, the gcc-specific inline assembly files.
So instead I'll talk about why it happened that way and why it's
actually a good thing. I'll also talk about some straight-up bugs we
found in the process of porting.
We received word that a final Studio license had been agreed upon on
June 3 - just 11 days ago! The license is free-as-in-beer
and although somewhat vague seems reasonable enough. Of course, I
prefer using only Free Software and promoting it whenever possible (as
we're doing with OpenSolaris), so I'd really rather use gcc. Our plan
of record was to make a merged workspace available as "official"
OpenSolaris. There were three sets of changes that needed to be merged
together in the last three days leading up to launch: the gcc changes,
which edit about 2500 files (mostly to fix compiler warnings), a large
wad of renames to support the separation of code we're releasing now
from that which we're hoping to release later (thousands of renames),
and the coup de grâce, the addition of the CDDL license block to over
24,000 files. In the end, this gigantic 3-way merge proved impractical:
there were over 1700 conflicts to resolve. Most are trivial and can
easily be automerged by TeamWare, our revision control system, but the
sheer volume and shortened schedule would have made adequate testing
impossible.
Instead of the three-way merge, then, we elected to take the minimum
amount of change we could: the addition of the CDDL blocks and the
separation of released from unreleasable source. That meant gcc
support would not ship in the "official" sources - but it could still be
made available to the developer community. This is important for
several reasons - first, it illustrates an important principle: FCS
quality all the time. That is, if it's not good enough for a customer,
it's not good enough to be putback. Since there was no doubt in
anyone's mind that the gcc work was not ready for either, that meant it
also wasn't good enough to call OpenSolaris. Second, it offers us an
opportunity to provide a glimpse into the way projects work. One of the
most common questions we get is "so, if the gate always has to be
golden, how does any major work ever get done?" Like most people, we do
major work in "branches" off the trunk. TeamWare supports children of
children and merging of independent workspaces with common ancestry, so
that no complicated branching apparatus is needed as for CVS. What will
be available on the gcc project
page will be that project gate. You're invited to participate -
there are over 300 mostly very small bugs to fix.
One of the most significant kinds of bug we found was programs writing
into string constants, confirming Osborne's Law.
These programs ordinarily work properly because the Studio compilers
place the string constants in the .data section or some
other writable data section. The flag -xstrconst changes
this behaviour, placing the strings in .rodata or a similar
read-only segment and thus also allowing them to be shared. This
reduces runtime memory usage but comes at a cost: buggy programs that
attempt to write to the constant strings will trigger a segmentation
violation and normally die. gcc acts as if this flag were always on,
and applies it to other const data types as well. The end
result is greater enforcement of correctness at the cost of crashes.
Fortunately fixing these is very easy. For example, I fixed bug number
6281909 (you're supposed to be able to see bugs, too, but it doesn't
seem to include the bugs of interest) by fixing the selector
function not to assume it can write '=' and
'\0' into its arguments. Note that the correct use of
'const' can help prevent this kind of problem.
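As an illustration of the general pattern, and not the actual change made for that bug, here is a sketch of a function that splits a "name=value" argument without writing into it. The name split_assignment() and its interface are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Split "name=value" without modifying arg.  Copies the name into the
 * caller's buffer (capacity len) and returns a pointer to the value
 * inside arg, or NULL if there is no '=' or the name does not fit.
 */
static const char *
split_assignment(const char *arg, char *name, size_t len)
{
    const char *eq = strchr(arg, '=');
    size_t n;

    if (eq == NULL)
        return (NULL);
    n = (size_t)(eq - arg);
    if (n >= len)
        return (NULL);
    (void) memcpy(name, arg, n);
    name[n] = '\0';         /* terminate the copy, not the argument */
    return (eq + 1);
}
```

Because arg is declared const, a caller may safely pass a string literal, and the compiler will complain about any attempt inside the function to store '\0' over the '='.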
The original article on inline assembly will appear when the source it
references appears - and you can help make that happen sooner: check out
the gcc project page.
Earlier this week, Mr. Vaughan-Nichols at eWeek wrote a largely inaccurate and needlessly hostile article about the CDDL, and our own Andy Tucker called him on a few points. Without bothering to correct that article or respond, he went back at it again on Wednesday, this time giving air time to SCO and their blessing of the OpenSolaris program. Why Mr. McBride of SCO felt the need to give this "blessing" is unclear; Sun obviously believes it has the rights needed to make the sources to nearly all of Solaris available under whatever license(s) we choose. Without those rights, no blessing would be sufficient; with them, none is necessary. I'll chalk this up to SCO taking whatever opportunity it can to appear relevant, especially as they continue to struggle in both the marketplace and the courtroom.
Enough of that. Instead, I'd like to focus on the most obvious and significant error in this article: the assertion that
"To date, though, the only released components of OpenSolaris are programs, such as DTrace, which aren't parts of the operating system."
We don't need to be too picky about what constitutes an operating system; even the most pedantic would surely agree that a component which spans the system from user applications to the heart of the kernel is part of the operating system. Under even an extremely narrow definition, DTrace is very much a part of the Solaris operating system - and therefore also of OpenSolaris technology. Our release of DTrace includes the sources for not just the standalone program dtrace(1M), but also all of the following:
- The userland library libdtrace(3LIB) which provides most of dtrace(1M)'s functionality
- Three other userland programs: lockstat(1M), plockstat(1M), and intrstat(1M), which are implemented using DTrace
- Several kernel modules: dtrace(7D), fasttrap(7D), fbt(7D), lockstat(7D), profile(7D), sdt(7D), and systrace(7D); these implement the kernel portions of DTrace
- Code added to the kernel itself to support dtrace, such as
usr/src/uts/common/os/dtrace_subr.c
- Two additional private user libraries which provide access to Compact C Type Format (CTF) data and the proc(4) filesystem
- Small programs demonstrating the D language and DTrace functionality
- A variety of headers and glue
It should be apparent that this is far more complex a subsystem than just one standalone user program. In fact, the source to dtrace(1M) is a single file out of 345 we released, and constitutes only 1431 of 102,163 lines of code (about 1.4%) in this initial release. If dtrace(1M) were simply an ordinary user program, it would not require over 100,000 lines of additional code - including over 32,000 in the kernel - to make it work.
As a final example, observe this comment block from usr/src/uts/common/os/dtrace_subr.c:
/*
* Making available adjustable high-resolution time in DTrace is regrettably
* more complicated than one might think it should be. The problem is that
* the variables related to adjusted high-resolution time (hrestime,
* hrestime_adj and friends) are adjusted under hres_lock -- and this lock may
* be held when we enter probe context. One might think that we could address
* this by having a single snapshot copy that is stored under a different lock
* from hres_tick(), using the snapshot iff hres_lock is locked in probe
* context. Unfortunately, this too won't work: because hres_lock is grabbed
* in more than just hres_tick() context, we could enter probe context
* concurrently on two different CPUs with both locks (hres_lock and the
* snapshot lock) held. As this implies, the fundamental problem is that we
* need to have access to a snapshot of these variables that we _know_ will
* not be locked in probe context. To effect this, we have two snapshots
* protected by two different locks, and we mandate that these snapshots are
* recorded in succession by a single thread calling dtrace_hres_tick(). (We
* assure this by calling it out of the same CY_HIGH_LEVEL cyclic that calls
* hres_tick().) A single thread can't be in two places at once: one of the
* snapshot locks is guaranteed to be unheld at all times. The
* dtrace_gethrestime() algorithm is thus to check first one snapshot and then
* the other to find the unlocked snapshot.
*/
This comment, while arcane, is clear by itself, so I will not attempt to add to it. I will only point out that if DTrace were not a part of the operating system, it would not need to concern itself with the locking rules for updates to the high-resolution system timers. Further examples of DTrace's intimate association with core features of the Solaris kernel and userland libraries can easily be found by examining the sources.
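For readers who prefer code to prose, the selection logic the comment describes can be sketched roughly as follows. This is not the DTrace source: the names snap_t, snap_tick, and snap_get are hypothetical, a plain flag stands in for each real lock, and the sketch deliberately ignores atomicity and memory ordering. It shows only why a single writer updating two copies in succession always leaves one copy readable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot; the 'held' flag stands in for a real lock. */
typedef struct snap {
	bool	held;		/* true while the writer is updating */
	int64_t	sec;
	int64_t	nsec;
} snap_t;

static snap_t snaps[2];

/* Writer: one thread updates the snapshots in succession, never both. */
static void
snap_tick(int64_t sec, int64_t nsec)
{
	for (int i = 0; i < 2; i++) {
		snaps[i].held = true;
		snaps[i].sec = sec;
		snaps[i].nsec = nsec;
		snaps[i].held = false;
	}
}

/* Reader: check first one snapshot, then the other, for an unheld copy. */
static bool
snap_get(int64_t *sec, int64_t *nsec)
{
	for (int i = 0; i < 2; i++) {
		if (!snaps[i].held) {
			*sec = snaps[i].sec;
			*nsec = snaps[i].nsec;
			return (true);
		}
	}
	return (false);	/* not reached while there is a single writer */
}
```

If the reader arrives while the writer is partway through the first snapshot, it skips that copy and returns the previous, fully written values from the second.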
Sun's DTrace experts have written extensively about their creation [more here and here to note just two] and provided a highly detailed reference manual. While much of this material may not be in a format which is accessible to the layman, even a cursory overview of the source we are offering and the breadth and depth of publications on the topic should be sufficient to satisfy one that DTrace is very much a part of the operating system. Perhaps Mr. Vaughan-Nichols was simply unfamiliar with the offering; in that case I would invite him to download the sources and inspect them himself, and to seek the opinions of expert engineers before making further claims of this sort. DTrace is very much a part of Solaris, and while we have much more to do, releasing it as open source was no trivial step.
Most people have probably read the recent Linus interview, in which he has a number of things to say about Linux, Solaris, and software development. Like any interview, it contains some interesting assertions, some obvious filler, and some real head-scratchers. Many in the Solaris community have expressed dismay or anger over some of his remarks, but rather than add to that, I'd like to examine some internal contradictions in Linus's statements and try better to understand why he's made them. As we ready OpenSolaris for public consumption and contribution, it's important to observe how similar development systems work and take steps to avoid difficulties encountered by other projects. Linus's comments indicate that, indeed, the structures and processes in place to serve Linux development are imperfect. We will be well-served to learn from this.
One of the head-scratchers is his assertion that he's not interested in Solaris because he feels it offers nothing of value that isn't already in Linux. This conclusion might be less baffling, though no less disappointing, if he'd actually examined the code, the feature set, and then made up his mind. But he admitted openly that he probably won't even look at the code, and instead will rely on others to tell him if it contains ideas worth considering. I really have to wonder about this approach, especially given his later comments concerning the reason for adding a feature to a system. We certainly agree with him that system design is about solving problems, not just doing something new and different for its own sake. Features don't get added to Solaris if they don't serve some useful purpose, fill some hole for developers, users, or both. It's difficult to believe that Solaris developers and users have problems to solve that differ greatly from those of Linux developers and users. In fact, as a long-time Linux developer myself, I can say with some confidence that the challenges are the same. So why does Solaris offer tools like kmdb, dtrace, and crash dumps, while Linus either refuses to integrate similar functionality or claims he hasn't heard of the problems these tools help to solve?
One possible reason is that distributions sometimes provide parts of these feature sets, so that users never even realize their absence in Linux proper. Linus talked about the distributors serving a valuable function, buffering developers from customers. But perhaps in that process, valuable information is not making its way back to Linus. The Linux development community would be well-served by talking to ordinary systems administrators now and then. Another possibility is that users and administrators can't, won't, or don't effectively communicate the problems they are trying to solve. But why don't Solaris users seem to have this problem? Do Linux distributors simply not listen? Or perhaps these decisions are really based on ideology, as so many Linux detractors claim. Regardless, a sober assessment of users' real-world needs might well reveal that Linus and others still have much work to do (as do Solaris developers), and that some of the changes they would do well to consider have already been made in other systems. The solutions Linus might choose may well be quite different from those chosen by Sun, but disregarding or remaining ignorant of the challenges is a lost opportunity to innovate and improve. What kind of engineer willingly passes up that opportunity?
If NIH is in fact "a disease" - a point which ought to elicit universal agreement - I'm left to wonder why Linus would pass up an opportunity to examine the works of other engineers. If he does in fact rely on others to tell him about valuable features in similar systems, something in that process is broken. If he wants to make sure Linux can solve all the problems Solaris can, I'd suggest he look closely at what's been done here. The code isn't even needed for this - a quick glance at public white papers would be sufficient to understand many of the problems Solaris engineers have been working to solve. If he doesn't believe these problems exist, a reality check is in order.
There are lessons here, of course. One of them is that systems developers must not lose touch with the problems they're supposed to solve. It pays to listen. Another lesson is that a process which prevents useful features from being implemented is broken, and someone has to be willing to recognize and correct such a process. If distributions take on the work of making a usable system and interacting with customers, engineers risk losing sight of appropriate goals. This is avoidable, but that it appears to be occurring implies that the relationship among Linux (the codebase), its distributors, and its developers (many of whom work for distributors) is defective in some way.
I'm cheered by the prospects for OpenSolaris to avoid these pitfalls, especially if we recognize them and take proper action. I hope we as a community will remain cognizant that they have hindered other large projects before ours, even those with leaders of Linus's stature.