Welcome to the Real World

20080312 Wednesday March 12, 2008

OGB endorsements

A few people have asked how I'm voting in this year's elections. Here, then, are my endorsements:

OGB:

Questions:

  1. No
  2. No
(2008-03-12 09:48:07.0) Permalink Comments [4]

20070311 Sunday March 11, 2007

C-Teams and the ARC as Community Groups

Rich Lowe asked an excellent question about OpenSolaris government in response to one of Casper Dik's answers to the DTrace Community. Here's the question, and my answer. As always, the other candidates' responses are available in the above-referenced thread.

And what would be your (the general "you", not just Casper) plans to help make the ARC and especially the C-team more practically part of OpenSolaris process, rather than a part of Sun process we're exposed to from one side, but not, so far, fully involved with?

Of the two, the ARC is much more difficult to rationalise; I'll explain why below. As for the C-teams and the more general problem of consolidation management, I'll let this text from my position paper[1] do the answering:

One of the OGB's most important tasks will be to rationalise the Community Group structure into one which will allow meaningful self-government. The centerpiece of my plan for doing this is construction of Consolidation-Sponsoring Community Groups (CSCGs). Each of these groups will be given control over an existing consolidation. This structure is not unlike that which exists today in the misnamed Nevada Community, representing ON. But that Community does not govern openly, and other consolidations are entirely missing structure under which they can be governed legitimately. Since the Constitution provides for the Community Group as the unit of independent government, each consolidation requires one to oversee its progress. The CSCGs will be responsible both for controlling the content of their codebases and for providing guidance and leadership to project teams desirous of integration. They will be required to adopt a set of rules (harmonised but not necessarily identical across all CSCGs) for integration and apply fairly these rules.

The challenge associated with the ARC (or ARCs) is that it maps poorly onto the Community Group structure. It makes little sense to me that an Architecture Community Group would sit alongside, say, an Observability Community Group. Observability incorporates a number of subsystems in the OS which in turn need to be properly integrated into each project. So would Reliability, or Virtualization. Architecture is not another such feature set but rather the way in which all those features, along with the new ones offered by the project, fit together and expose themselves to other consumers. That is, Architecture is both a superset of and yet entirely disjoint from all other CGs' areas of interest. The practical effect is substantial overlap: we would expect each CG to offer project teams advice concerning how best to integrate their work with existing features (and, for projects directly related to the CG's area of expertise, what features it should offer to others). In some ways, however, this directly conflicts with the mandate of an Architecture CG, which is to provide architecture guidance to all project teams. In the current system, an observability expert cannot override the ARC's decisions with respect to a proposed observability project. Yet under the Constitution, the Observability CG is supposed to be self-governing. The defining question is what exactly the latter CG is expected to govern, and by what mechanism - the very question the Constitution so conspicuously fails to answer.

It's easy enough in my CSCG model to simply require that all CSCGs adopt rules requiring architectural review by a particular CG just as they should require other CGs with expertise in relevant areas to review and perhaps approve each project prior to integration. Indeed, this is not unlike the system that exists today. The CSCGs do indeed have complete control over their areas of responsibility, namely, the existing consolidations. But this leaves all other CGs less equal, their endorsements subject to veto and without any code of their own to govern. A logical conclusion one could reach on this line of thinking is that CSCGs and perhaps the ARC should be the *only* CGs. The reality on the ground thus maps poorly to the Constitution we've been given, suggesting that the Framers either did not consider this matter in sufficient detail or intended much more radical changes in either the structure of consolidations, global review processes, or both. Mr. Fielding in fact hinted at just such an intent[2]:

We don't need to enshrine one committee's view of how C-Teams operate in an organization-wide constitution because C-Teams simply aren't relevant to *every* activity at OpenSolaris, and the vast majority of comments we have received so far clearly indicate that the existing consolidation boundaries are arbitrary AND dysfunctional. Personally, I am hoping that the communities feel empowered to change the things that are obviously causing them harm right now, and let the consensus process ensure that the traditions are adequately promoted and maintained over time.

Presumably Mr. Fielding and perhaps others have some grand detailed view of how all these things should be made to fit together in the rather obvious presence of existing bodies of code with no associated governing units and vice versa. Unfortunately, they've not seen fit to share that view nor to stand for election themselves. If consensus does not emerge within a few months as to an appropriate way to map the (possibly modified) existing practical devices of government onto the new constitutional structures, I'll probably favour amending the constitution rather than spinning our wheels forever trying to shoehorn OpenSolaris into a framework that may well be inappropriate to our broader goals.

At some point I'd like to hear Mr. Plocher and others more intimately involved with the operation of the ARC Community express their views on how that Community could be made to fit into the new Constitutional world of governing CGs. Their testimony will be needed before the OGB as it considers how best to restructure the Communities into meaningfully self-governing units.

  1. http://blogs.sun.com/wesolows/entry/ogb_election
  2. http://www.opensolaris.org/jive/message.jspa?messageID=99494#99494
(2007-03-11 10:13:30.0) Permalink

20070309 Friday March 09, 2007

DTrace Community OGB Questionnaire

Leaders of the DTrace Community had a number of questions for the OGB candidates. Here's a copy of the questions and my answers. You can also see the other candidates' responses in the DTrace mailing list archives.

(2007-03-09 08:39:32.0) Permalink

20070308 Thursday March 08, 2007

OGB Election

OpenSolaris Governing Board elections begin next week. In addition, a single question will be presented to the voters: Shall the proposed Constitution be ratified? Please take the time to read this important document and learn about the issues being debated by the candidates. As a candidate for an OGB seat, I can help you right here and now with the latter task. I'd appreciate an allowance of five minutes of your time to learn where I stand on some of these issues. I welcome questions; you can send mail to all candidates to ask your questions. I'll be posting here my answers to any questions I receive in this fashion.

(2007-03-08 10:04:01.0) Permalink

20060208 Wednesday February 08, 2006

A louder voice for the fault manager

The Solaris reference implementation of the fault manager recently got a boost in its ability to report faults with the introduction of a two-part SNMP agent. This agent makes it easy to integrate the Solaris fault manager into existing SNMP-based monitoring infrastructure.

Background

The fault manager has always been able to report faults to the system log and console(s), and to provide a wealth of status information via fmadm(1M) and fmdump(1M). But these reporting mechanisms leave much to be desired; syslog messages must be parsed, and a busy central log host can easily lose important messages in the noise. Worse still, a privileged user must log into the affected system and run administrative commands to get information they need that isn't contained in the message.

SNMP is a natural choice for extending the reach of the fault manager's voice; it's widely used to facilitate centralised monitoring of events throughout and even across administrative domains. The basic model is simple and extensible; information can be pushed from any device to one or more network management stations (NMSs), or pulled by an administrator or automated utility from a particular device of interest. Managed devices - in this case, a Solaris system - signify events using traps (also called notifications in SNMPv2), which provide a limited amount of information to designated NMSs. They also provide access to a management information base (MIB) on demand. Generally, the MIB provides access to a much greater breadth and depth of information than is transmitted with a trap or notification. An NMS can be configured to retrieve additional data from the MIB upon receipt of a trap if desired.

Availability

The technology described here is available in Solaris Nevada builds 33 and later. OpenSolaris offers access to the sources. A prerequisite for building or using these applications is the installation of the SMA packages provided by the SFW consolidation; BFUing newer ON bits is not sufficient. If you have SWAN access, you can run /ws/onnv-gate/public/bin/update_sma to get the necessary packages; otherwise see the OpenSolaris download center for the packages.

A Note on NMS Configuration

If you use the Net-SNMP-based NMS software delivered in Solaris, as I do below, you will want to tell the client utilities to use the fault management MIB to encode and decode OIDs. The easiest way to do this is to add MIBS=+ALL to your environment. You can also make this permanent by creating (or adding to) /etc/sma/snmp/snmp.conf the line:

    mibs +ALL
See snmp.conf(4) for more information on MIB searching and importing. If you use a different NMS, consult your vendor's documentation to learn how to import a new MIB.

snmp-trapgen: an SNMP plugin for fmd(1M)

The trap or notification generator component is snmp-trapgen. This is a very simple fault manager plugin similar to that which logs fault information to the system log and console. Instead of writing formatted text to a log device, however, this plugin generates SNMPv1 traps and/or SNMPv2 notifications, one for each destination configured in the systemwide snmpd.conf(4). No additional configuration is required; if you have already configured a system to send traps to one or more NMSs, you don't need to do anything else to be notified upon fault diagnosis. If not, you'll want to add v1 or v2 trap destinations to /etc/sma/snmp/snmpd.conf. The hostnames or addresses you use will need to be configured to receive and act upon SNMP traps or notifications. If you don't have an NMS on your network, you can use the snmptrapd(1M) server included with Solaris.

A fault diagnosis trap (sunFmProblemTrap) includes a limited subset of the information contained in the syslog message associated with the fault. Specifically, the diagnosis's UUID, diagnostic code, and reference URL are included. The object identifiers (OIDs) for these data are defined by the fault management MIB, SUN-FM-MIB, installed in /etc/sma/snmp/mibs/. The same information is delivered to both SNMPv1 and SNMPv2 trap sinks. At present, this is the only trap defined by the fault management MIB, but others may be generated in the future. Here's an example of an SNMPv2 notification as decoded by snmptrapd(1M):

2006-02-07 16:36:34 stomper [192.xx.xx.xx]:

        DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (2266748911) 262 days, 8:31:29.11
        SNMPv2-MIB::snmpTrapOID.0 = OID: SUN-FM-MIB::sunFmProblemTrap
        SUN-FM-MIB::sunFmProblemUUID."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: "a58aa105-4fab-6e16-8557-ab7687113de7"
        SUN-FM-MIB::sunFmProblemCode."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: SUN4U-8000-KA
        SUN-FM-MIB::sunFmProblemURL."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: http://sun.com/msg/SUN4U-8000-KA
The diagnostic code and URL can be used to find knowledge base articles describing the fault and suggested corrective action. The diagnosis UUID can be used to get further detail from fmdump(1M), or from the MIB, as seen in the next section.

libfmd_snmp: a MIB plugin for the System Management Agent (SMA)

Knowing that a fault has been diagnosed is important, but the amount of information delivered with the trap or notification may not be enough to provide an administrator with a complete understanding of the problem. The fault management MIB defines a wealth of detail, and this detail is made available via SMA by libfmd_snmp. In addition to fault diagnosis detail, this MIB also offers information about faulty components and the configuration of the fault manager itself, similar to that offered by fmadm(1M).

Enabling the plugin requires configuring the master SNMP agent on each server you wish to query. Adding the architecture-dependent line

    dlmod sunFM /usr/lib/fm/sparcv9/libfmd_snmp.so.1
to /etc/sma/snmp/snmpd.conf will cause the MIB plugin to be automatically loaded and initialised the next time the master agent is started, such as via /etc/init.d/init.sma. In the future, SMA will be managed via SMF; see 6349499[0].

No further configuration is necessary, although the usual snmpd.conf(4) directives will allow you to restrict access to the MIB, which may be important to you since some of the information it provides is ordinarily restricted to privileged users.

The fault management MIB provides 4 tables and a single scalar, in addition to the trap/notification described above. sunFmProblemTable and sunFmFaultEventTable are logically two pieces of the same table; they are separated only because MIBs do not support nested tables. The problem table contains the scalar information about each diagnosis, while the fault event table contains lists of the events associated with each diagnosis. Both tables are indexed by diagnosis UUID; the fault event table utilises a second scalar index to distinguish between multiple events associated with a diagnosis. In response to the trap above, you might want to know which Automated System Recovery Unit(s) (ASRU(s)) the fault manager believes may have caused the fault. This is just a fancy way of saying we want to know what broke to trigger the diagnosis. Because each ASRU is associated with a fault event, we'll first need to know how many fault events were associated with this diagnosis so that we can then look up each one's ASRU in the fault event table. To do this, we'll use snmpget(1M), delivered by Solaris in /usr/sfw/bin. Of course, you can use any NMS software.

    nms$ snmpget -c public -v 2c stomper \
        sunFmProblemSuspectCount.\"a58aa105-4fab-6e16-8557-ab7687113de7\"
    SUN-FM-MIB::sunFmProblemSuspectCount."a58aa105-4fab-6e16-8557-ab7687113de7" = Gauge32: 1
This diagnosis has only one fault event associated with it. To look up the ASRU, we'll look in the fault event table entry indexed by the UUID and the fault index. Since fault events are indexed starting from 1, we'll need to do:
    nms$ snmpget -c public -v 2c stomper \
        sunFmFaultEventASRU.\"a58aa105-4fab-6e16-8557-ab7687113de7\".1
    SUN-FM-MIB::sunFmFaultEventASRU."a58aa105-4fab-6e16-8557-ab7687113de7".1
    = STRING: cpu:///cpuid=4/serial=23EBEC1505
Most NMSs offer scripting facilities that allow you to perform actions similar to these in response to a trap. Alternately, you could poll the data on a regular basis. Many implementations do both, using polling to offset the risk of losing traps, which, like all SNMP datagrams, are not transmitted reliably. SNMPv3 informs, also known as acknowledged notifications, offer only a partial remedy to this problem, and are not supported by snmp-trapgen at this time.
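A script or small program that performs these lookups has to build the instance name from the column, the quoted diagnosis UUID, and the 1-based fault index. Here is a minimal sketch of that step in C; the helper name fm_event_instance() is hypothetical and is not part of Net-SNMP or the fault manager.

```c
#include <stdio.h>
#include <string.h>

/*
 * Format <column>."<uuid>".<index> into buf, matching the instance
 * names used with snmpget above.  Fault events are indexed from 1,
 * so an index of 0 is rejected.  Returns 0 on success, or -1 if the
 * index is invalid or buf is too small.
 */
static int
fm_event_instance(char *buf, size_t len, const char *column,
    const char *uuid, unsigned int index)
{
	int n;

	if (index == 0)
		return (-1);

	n = snprintf(buf, len, "%s.\"%s\".%u", column, uuid, index);
	if (n < 0 || (size_t)n >= len)
		return (-1);

	return (0);
}
```

The backslashes in the shell examples exist only to get the double quotes past the shell; a program that passes the string directly to an SNMP library or to exec needs only the quotes themselves.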

A polling NMS may wish to poll the systemwide faulty component count, provided by the MIB as sunFmFaultCount. An increase in this gauge without a corresponding problem trap is a good indication that the trap has been lost. More details about devices the fault manager believes to be in degraded or faulted states are available via the sunFmResourceTable; walking this table provides a ready - and remote - answer to the common question "What's broken on that machine?" For this, we use the snmpwalk(1M) utility:

    nms$ snmpwalk -c public -v 2c stomper sunFmResourceTable
    SUN-FM-MIB::sunFmResourceFMRI.1 = STRING: cpu:///cpuid=4/serial=23EBEC1505
    SUN-FM-MIB::sunFmResourceStatus.1 = INTEGER: degraded(3)
    SUN-FM-MIB::sunFmResourceDiagnosisUUID.1 = STRING:
        "a58aa105-4fab-6e16-8557-ab7687113de7"
Finally, the sunFmConfigTable offers remote access to the same information provided by fmadm(1M)'s config subcommand; like the other tables, it can be accessed using snmpget(1M), snmpwalk(1M), or any other SNMP-compatible NMS implementation. You can find the complete fault management MIB at the Fault Management community site, and in build 33 and later at /etc/sma/snmp/mibs/SUN-FM-MIB.mib.
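The lost-trap check described above for a polling NMS can be sketched in a few lines of C. This is only an illustration of the bookkeeping: the structure and function names are hypothetical, the value passed to fm_poll_update() is assumed to come from a poll of sunFmFaultCount, and the heuristic (the gauge rose while no problem trap arrived since the previous poll) is deliberately crude.

```c
/* Per-host state kept by a hypothetical polling NMS. */
struct fm_poll_state {
	unsigned int last_fault_count;	/* sunFmFaultCount at previous poll */
	unsigned int traps_since_poll;	/* problem traps seen since then */
};

/* Record the arrival of a sunFmProblemTrap from this host. */
static void
fm_note_trap(struct fm_poll_state *st)
{
	st->traps_since_poll++;
}

/*
 * Record a freshly polled fault count.  Returns 1 if the count rose
 * although no problem trap was received since the previous poll,
 * which suggests that a trap was lost; returns 0 otherwise.
 */
static int
fm_poll_update(struct fm_poll_state *st, unsigned int fault_count)
{
	int suspect = (fault_count > st->last_fault_count &&
	    st->traps_since_poll == 0);

	st->last_fault_count = fault_count;
	st->traps_since_poll = 0;

	return (suspect);
}
```

When fm_poll_update() returns 1, the NMS would typically walk sunFmResourceTable as shown above to find out what changed.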

[0] The bug should be visible, but it isn't. This is itself a bug, which the SFW team is working to fix.

(2006-02-08 21:04:12.0) Permalink

20060127 Friday January 27, 2006

More on Drivers

For those who attended the SVOSUG meeting last night and are looking for boilerplate code similar to that Max presented, you can find it in the Device Driver Tutorial. This gentle introduction also includes a trivial but functional pseudo device implementation.

(2006-01-27 16:08:14.0) Permalink

20051205 Monday December 05, 2005

GCC inline assembly, part 2

Long ago, I promised to write more about gcc inline assembly, in particular a few cases that are tricky to get right. Here, somewhat belatedly, are those cases. These examples are taken from libc, but the concepts apply to any inline assembly fragments you write for gcc. As I mentioned previously, these concerns apply only to gcc-style inlines; the Studio-style inline format doesn't require that you use this same level of caution.

gcc expects you to write assembly fragments (even in a "separate" inline function) as if they are logically a part of the caller. That is, the compiler will allocate registers or other appropriate storage locations to each of the input and output C variables. This requires that you instruct the compiler very carefully as to your use of each variable, and the variables' relationships to one another. The advantage is much better register allocation; the compiler is free to allocate whatever registers it wishes to your input and output variables in a manner that is transparent to you.

In contrast, Studio requires that you code the fragment as if it were a leaf function, so the compiler does not do any register allocation for you. You are permitted to use the caller-saved registers any way you wish, and even to use the caller's stack as if you are in a leaf function. Arguments and return values are stored in their ABI-defined locations. Depending on the optimization level you use, this can be wasteful of registers (though the peephole optimizer can often clean up some of this waste) and can also make writing the fragment much more difficult. In exchange, however, you don't have to be nearly as careful to express the fragment's operation to the compiler.

Inputs, Outputs, and Clobbers (oh my!)

Each assembly fragment may have any or all of outputs, inputs, and clobbers. Each input and output maps a C variable or literal to a string suitable for use as an assembly operand. These operands can then be referenced as %0, %1, %2, etc. These are ordered beginning from 0 with the first output, followed by the inputs. Alternately, newer versions of gcc allow the use of symbolic names for each input and output. Clobbers are somewhat different; they express the set of registers and/or memory whose values are changed by the fragment but are not expressed in the outputs. Inputs which are also changed must be listed as outputs, not clobbers. Normally, the clobbers include explicit registers used by certain instructions, but may also include "cc" to indicate that the condition code registers are modified and/or "memory" to indicate that arbitrary memory addresses have had their contents altered.
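As a minimal sketch of the operand numbering, the fragment below names its operands symbolically and notes the positional equivalent in a comment. It is not taken from libc: the function name sum_named() is hypothetical, and x86 (AT&T syntax) is assumed for the assembly only so that the sketch builds on commonly available machines; other targets fall back to plain C.

```c
static inline unsigned int
sum_named(unsigned int a, unsigned int b)
{
	unsigned int sum;

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	/*
	 * The output comes first, so positionally sum is %0, a is %1
	 * and b is %2; the template could equally be written as
	 * "leal (%1,%2), %0".  lea leaves the condition codes alone,
	 * so no clobbers are listed.
	 */
	__asm__(
		"leal (%[a],%[b]), %[sum]"
	: [sum] "=r" (sum)
	: [a] "r" (a), [b] "r" (b));
#else
	/* Plain C fallback for targets without the x86 fragment. */
	sum = a + b;
#endif

	return (sum);
}
```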

Constraints

Outputs and inputs are expressed as constraints, in a language specifying the type of operand that will contain the value of a variable. Common constraints include "r", indicating that a general register should be allocated, and "m" indicating that some type of memory location should be used. The complete list of constraints is found in the gcc documentation. These constraints may contain modifiers, which give gcc more information about how the operand will be used. The most common modifiers are "=", "+", and "&". The "=" modifier is used to indicate that the operand is output-only; it may appear only in the constraint for an output variable. Even if the constraint is applied to a variable containing an existing value in your program, there is no guarantee that it will contain that value when your assembly fragment is executed. If you need that, you must use the "+" modifier instead of "="; this tells the compiler that this operand is both an input and an output. Nevertheless, the variable with this constraint is provided only in the outputs section of the fragment's specification. An alternate way to express the same thing is provided in the documentation. Note that providing the same variable as both an input and an output does not guarantee you that the same location (register, address, etc.) will be used for both of them. Thus the following is generally incorrect:

static inline int
add(int var1, int var2)
{
	__asm__(
		"add	%2, %0"
	: "=r" (var1)
	: "r" (var1), "r" (var2));

	return (var1);
}
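A corrected version can be sketched with the "+" modifier, which makes var1 a single read-write operand so that the fragment is guaranteed to see its initial value. The function name add_fixed() is hypothetical; the sketch assumes x86 (AT&T syntax) for the assembly and falls back to plain C elsewhere.

```c
static inline int
add_fixed(int var1, int var2)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	/*
	 * var1 appears once, in the outputs, with "+r": %0 holds its
	 * value on entry and receives the result.  var2 is now %1.
	 */
	__asm__(
		"addl %1, %0"
	: "+r" (var1)
	: "r" (var2)
	: "cc");
#else
	/* Plain C fallback for targets without the x86 fragment. */
	var1 += var2;
#endif

	return (var1);
}
```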
The "&" modifier is used on an output operand whose value is overwritten before all the input operands are consumed. This exists to prevent gcc from using the same register for both the input and output operands. For example, for swap32() (see also the Studio inline function), we might think to write:
extern __inline__ uint32_t
swap32(volatile uint32_t *__memory, uint32_t __value)
{
	...
	uint32_t __tmp1, __tmp2;
	__asm__ __volatile__(
		"ld [%3], %1\n\t"
		"1:\n\t"
		"mov %0, %2\n\t"
		"cas [%3], %1, %2\n\t"
		"cmp %1, %2\n\t"
		"bne,a,pn %%icc, 1b\n\t"
		"  mov %2, %1"
		: "+r" (__value), "=r" (__tmp1), "=r" (__tmp2)
		: "r" (__memory)
		: "cc");
	return (__tmp2);
}

But suppose gcc decided to allocate o0 to both __tmp1 and __memory. This is allowable, because the "=r" constraint implies that the corresponding register is set only after all input-only operands are no longer needed (input/output operands obviously don't have this problem). In the case above, the first load would clobber o0 and the cas would operate on an arbitrary location. Instead, we must write "=&r" for both __tmp1 and __tmp2; neither variable may safely be allocated the same register as the input operand.

Bugs caused by omitting the earlyclobber are painful to track down because they often appear and disappear from one compilation to the next as entirely unrelated code changes cause increases or decreases in register pressure.
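The same hazard can be reproduced in a much smaller fragment. In this sketch the output is written by the first instruction while an input is still needed by the second, so the output needs "=&r"; with plain "=r", gcc would be free to give sum and b the same register. The function name add_via_copy() is hypothetical, and x86 (AT&T syntax) is assumed for the assembly, with a plain C fallback elsewhere.

```c
static inline unsigned int
add_via_copy(unsigned int a, unsigned int b)
{
	unsigned int sum;

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	__asm__(
		"movl %1, %0\n\t"	/* sum is overwritten here ... */
		"addl %2, %0"		/* ... but b is not read until here */
	: "=&r" (sum)			/* earlyclobber: never share with a or b */
	: "r" (a), "r" (b)
	: "cc");
#else
	/* Plain C fallback for targets without the x86 fragment. */
	sum = a + b;
#endif

	return (sum);
}
```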

This is not an academic concern. Consider this example program:

#include <stdint.h>

static __inline__ void
incr32(volatile uint32_t *__memory)
{
        uint32_t __tmp1, __tmp2;
        __asm__ __volatile__(
        "ld [%2], %0\n\t"
        "1:\n\t"
        "add %0, 1, %1\n\t"
        "cas [%2], %0, %1\n\t"
        "cmp %0, %1\n\t"
        "bne,a,pn %%icc, 1b\n\t"
        "  mov %1, %0"
        : "=r" (__tmp1), "=r" (__tmp2)
        : "r" (__memory)
        : "cc");
}

uint32_t
func(uint32_t x)
{
        uint32_t y = 4;
        uint32_t z = x + y;

        incr32(&y);

        z = x + y;

        return (z);
}
gcc compiles this (use -O2 -mcpu=v9 -mv8plus) into:
func()
    func:                   9c 03 bf 88  add          %sp, -0x78, %sp
    func+0x4:               9a 10 20 04  mov          0x4, %o5
    func+0x8:               90 02 20 04  add          %o0, 0x4, %o0
    func+0xc:               da 23 a0 64  st           %o5, [%sp + 0x64]
    func+0x10:              82 03 a0 64  add          %sp, 0x64, %g1
    func+0x14:              c2 00 40 00  ld           [%g1], %g1	<===
    func+0x18:              9a 00 60 01  add          %g1, 0x1, %o5
    func+0x1c:              db e0 50 01  cas          [%g1] , %g1, %o5	<= SEGV
    func+0x20:              80 a0 40 0d  cmp          %g1, %o5
    func+0x24:              32 47 ff fd  bne,a,pn     %icc, func+0x18
    func+0x28:              82 10 00 0d  mov          %o5, %g1
    func+0x2c:              81 c3 e0 08  retl         
    func+0x30:              9c 23 bf 88  sub          %sp, -0x78, %sp

In this case, gcc has allocated g1 to both __tmp1 and __memory, and o5 to __tmp2. Note the marked instructions: the initial load destroys the address held in g1, and the subsequent cas then treats whatever value was stored at *__memory when the fragment began as an address. In this example, that value is 4 (g1 was assigned sp+0x64, which is simply the address of y). This program is compiled incorrectly due to improper constraints, and will cause a segmentation fault if the code in question is executed.

If instead we use "=&r" for both __tmp1 and __tmp2, gcc generates the following code:

func()
    func:                   9c 03 bf 88  add          %sp, -0x78, %sp
    func+0x4:               9a 10 20 04  mov          0x4, %o5
    func+0x8:               90 02 20 04  add          %o0, 0x4, %o0
    func+0xc:               da 23 a0 64  st           %o5, [%sp + 0x64]
    func+0x10:              82 03 a0 64  add          %sp, 0x64, %g1
    func+0x14:              d8 00 40 00  ld           [%g1], %o4	<===
    func+0x18:              9a 03 20 01  add          %o4, 0x1, %o5
    func+0x1c:              db e0 50 0c  cas          [%g1] , %o4, %o5	<= OK
    func+0x20:              80 a3 00 0d  cmp          %o4, %o5
    func+0x24:              32 47 ff fd  bne,a,pn     %icc, func+0x18
    func+0x28:              98 10 00 0d  mov          %o5, %o4
    func+0x2c:              81 c3 e0 08  retl         
    func+0x30:              9c 23 bf 88  sub          %sp, -0x78, %sp

This code now assigns o4 to __tmp1, which eliminates the problem described above. This function, however, still does not do the right thing. Why not?

Reloading

Compilers keep track of where each live variable in the program can be found; many variables can be found both at some memory location and in a register. Sometimes, the compiler chooses to use a register for a different variable, and stores the value back to its memory location (if it has changed) before doing so. Later, if this value is needed, the value must be loaded back into a register before being used. This is known as reloading. Other reasons reloading may be required include a variable's declaration as volatile and the case that concerns us here, a variable's modification via side effects.

In the example above, incr32() is actually operating on a memory address, not a register. So why did we assign __memory the "r" constraint instead of more correctly expressing the constraint as "+m" (*__memory)? It turns out that the "m" constraint allows a variety of possible addressing modes. On SPARC, this includes the register/offset mode (such as [%sp+0x64]). This is fine for instructions like ld and st, but the cas instruction is special: it allows no offset. No constraint exists to describe this condition; the "V" constraint is clearly similar but is not correct; a bare register ([%g1]) is an offsettable address, so "V" would actually exclude the case we want. Conversely, "o", the inverse constraint of "V", includes the register/offset addressing mode we specifically wish to exclude. So, the only way to express this constraint is "r". But this does nothing to capture the fact that although the pointer itself is not modified, the value at *__memory is altered by the assembly fragment. Is this a problem? Let's look at the assembly generated for func() a little more closely:

func()
    func:                   9c 03 bf 88  add          %sp, -0x78, %sp
    func+0x4:               9a 10 20 04  mov          0x4, %o5
    func+0x8:               90 02 20 04  add          %o0, 0x4, %o0	<===
    func+0xc:               da 23 a0 64  st           %o5, [%sp + 0x64]
    func+0x10:              82 03 a0 64  add          %sp, 0x64, %g1
    func+0x14:              d8 00 40 00  ld           [%g1], %o4
    func+0x18:              9a 03 20 01  add          %o4, 0x1, %o5
    func+0x1c:              db e0 50 0c  cas          [%g1] , %o4, %o5
    func+0x20:              80 a3 00 0d  cmp          %o4, %o5
    func+0x24:              32 47 ff fd  bne,a,pn     %icc, func+0x18
    func+0x28:              98 10 00 0d  mov          %o5, %o4
    func+0x2c:              81 c3 e0 08  retl         			<===
    func+0x30:              9c 23 bf 88  sub          %sp, -0x78, %sp

We see that gcc has assigned z the o0 register, which is not surprising given that it's the return value. But after o0 is set to x + 4 at the beginning of the function, it's never set again. The line z = x + y has been discarded by the compiler! This is because it does not know that our inline assembly modified the value of y, so it did not reload the value and recalculate z.

There are two ways we can correct this problem: (a) add a "+m" output operand for *__memory, or (b) add "memory" to the list of clobbers. This is a special clobber that tells gcc not to trust the values in any registers it would otherwise believe to hold the current values of variables stored in memory. In short, this clobber tells gcc that all registers must be reloaded if the correct value of a variable is required. This is somewhat inefficient when we know which piece of memory has been touched, so (a) is preferable for better performance. Whichever solution we choose, gcc now compiles our code to:

func()
    func:                   9c 03 bf 88  add          %sp, -0x78, %sp
    func+0x4:               9a 10 20 04  mov          0x4, %o5
    func+0x8:               98 10 00 08  mov          %o0, %o4
    func+0xc:               da 23 a0 64  st           %o5, [%sp + 0x64]
    func+0x10:              82 03 a0 64  add          %sp, 0x64, %g1
    func+0x14:              d6 00 40 00  ld           [%g1], %o3
    func+0x18:              9a 02 e0 01  add          %o3, 0x1, %o5
    func+0x1c:              db e0 50 0b  cas          [%g1] , %o3, %o5
    func+0x20:              80 a2 c0 0d  cmp          %o3, %o5
    func+0x24:              32 47 ff fd  bne,a,pn     %icc, func+0x18
    func+0x28:              96 10 00 0d  mov          %o5, %o3
    func+0x2c:              d0 03 a0 64  ld           [%sp + 0x64], %o0	<===
    func+0x30:              90 03 00 08  add          %o4, %o0, %o0	<===
    func+0x34:              81 c3 e0 08  retl         
    func+0x38:              9c 23 bf 88  sub          %sp, -0x78, %sp

Note the reload, which will now return the correct result. There are actually two other ways to correct this, although the use of "+m" is the most correct. First, we could declare y to be volatile in func(). This would force gcc to reload its value from memory any time that value is required. Use of the volatile keyword is mainly useful when some external thread (or hardware) may change the value at any time; using it as a substitute for correct constraints will cause unnecessary reloading, degrading performance. Second, and perhaps best of all, the compiler could be modified to accept a SPARC-specific constraint for use with the cas instruction, one which requires the address of the operand to be stored in a general register.
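Option (a) can be sketched as follows. The SPARC branch is the incr32() fragment from above with earlyclobbers and a "+m" output for *__memory added; because that operand is placed last among the outputs, the pointer becomes %3. Treat that branch as untested: it assumes gcc defines __sparc_v9__ when targeting v9. The x86 branch and the plain C fallback (which is not atomic) are assumptions added only so that the sketch can be built on other machines.

```c
#include <stdint.h>

static inline void
incr32(volatile uint32_t *__memory)
{
#if defined(__GNUC__) && defined(__sparc_v9__)
	uint32_t __tmp1, __tmp2;

	__asm__ __volatile__(
		"ld [%3], %0\n\t"
		"1:\n\t"
		"add %0, 1, %1\n\t"
		"cas [%3], %0, %1\n\t"
		"cmp %0, %1\n\t"
		"bne,a,pn %%icc, 1b\n\t"
		"  mov %1, %0"
	/* %2 is never named in the template; it only tells gcc that */
	/* *__memory is both read and written by the fragment. */
	: "=&r" (__tmp1), "=&r" (__tmp2), "+m" (*__memory)
	: "r" (__memory)
	: "cc");
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	/* x86 can increment memory directly, so "+m" is the only operand. */
	__asm__ __volatile__(
		"lock; incl %0"
	: "+m" (*__memory)
	:
	: "cc");
#else
	/* Plain C fallback; note that this is not atomic. */
	*__memory = *__memory + 1;
#endif
}

static uint32_t
func(uint32_t x)
{
	uint32_t y = 4;

	incr32(&y);

	/* y must be reloaded here; the result is x + 5, not x + 4. */
	return (x + y);
}
```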

You can find more inline assembly examples in libc (math functions), MD5 acceleration, and the kernel illustrating these concepts. Be sure to read and understand the documentation completely before writing your own inline assembly for gcc, and always test your understanding by constructing and compiling simple test programs like these. (2005-12-05 17:32:50.0) Permalink Comments [3]

20050816 Tuesday August 16, 2005

Broken allocators and paleolithic debugging strategies

Not so long ago I was looking through Solaris's shells for memory allocators - functions that perform tasks similar to malloc(3c). These functions often store the size of the allocated block at the beginning of each block; if that size is stored as a 4-byte value, the return value from the allocator may not be aligned on an 8-byte boundary. This is a major problem on SPARC, because it's not uncommon to allocate structs or unions containing types that require 8-byte alignment, especially long long. As it turns out, gcc correctly assumes that long long variables are aligned on 8-byte boundaries and uses the ldd and std instructions to access them. Our Studio compiler doesn't; it always issues two ld or st instructions. The result is that programs using this kind of allocator can crash when built with gcc but not with Studio, not a pleasant condition.
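The pattern is easy to reproduce. Here is a minimal sketch, assuming the underlying malloc() returns storage aligned to at least 8 bytes (true on the common platforms I know of, but an assumption nonetheless). The names bad_alloc, good_alloc, and misalignment are hypothetical and are not taken from any of the shells.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Broken pattern: a 4-byte size header pushes the payload off alignment. */
static void *
bad_alloc(size_t n)
{
	uint32_t *hdr = malloc(sizeof (uint32_t) + n);

	if (hdr == NULL)
		return (NULL);
	*hdr = (uint32_t)n;
	return (hdr + 1);	/* 4 bytes past an 8-byte boundary */
}

/* Fix: pad the header so its size is a multiple of 8 bytes. */
typedef union hdr {
	uint32_t	size;
	uint64_t	pad;	/* forces sizeof (hdr_t) to be 8 */
} hdr_t;

static void *
good_alloc(size_t n)
{
	hdr_t *hdr = malloc(sizeof (hdr_t) + n);

	if (hdr == NULL)
		return (NULL);
	hdr->size = (uint32_t)n;
	return (hdr + 1);	/* still on an 8-byte boundary */
}

/* Distance of p past the previous 8-byte boundary. */
static unsigned
misalignment(const void *p)
{
	return ((unsigned)((uintptr_t)p % 8));
}
```

Storing a long long through the pointer returned by bad_alloc() is exactly the access that traps on SPARC when the compiler emits ldd or std; the sketch only inspects the address and never performs that access.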

As part of my search, I found that, indeed, the Bourne and Korn shells have some alignment problems. Though these are bugs, we've decided that there's no reliable way to find all possible bugs of this type, so we worked around them in the compiler as well as fixed the ones we've found. This is, if nothing else, a good argument against compilers that "help" programmers by covering up this kind of error. But the best prize of all wasn't the kind of problem I was looking for, but rather this gem from the C shell:

        showall(av);
        printf("i=%d: Out of memory\n", i);
        chdir("/usr/bill/cshcore");
        abort();

This is the systems programming equivalent of finding a live wooly mammoth contentedly smoking a cigar in your recliner. Unfortunately, there's no way to trigger this behaviour, as it's protected by the "debug" preprocessor symbol, which we never set in a normal build. Nevertheless, thanks to OpenSolaris, you can see it for yourself.

We harp incessantly on the need to be able to debug production code, with no recompilation needed; there are a number of better ways to debug this particular condition. For example, you could use the DTrace pid provider to stop a csh process when nomem() is called, and even provide a backtrace. If that weren't enough, you could then use mdb(1) to debug the problem in greater detail, or gcore(1) to produce a core dump. But the best part, the real joy, if you'll pardon the pun, is the chdir call. Clearly the purpose was to drop core in a predictable location for later analysis by the author. I think you'll find that coreadm(1m), along with other corefile improvements, offers a far more flexible and powerful way to accomplish this - and it complements nicely the other debugging strategies I mentioned above.

(2005-08-16 14:16:31.0) Permalink Comments [2]

20050802 Tuesday August 02, 2005

Premium Drinks Tuesday and Wednesday nights (after the extravaganza on Tuesday and the OpenSolaris BOF on Wednesday) we'll be convening for potent beverages, good food, and unique and amusing company. I'll be at the Lloyd Center DoubleTree in downtown Portland, OR, room 1560. Expect other OpenSolaris personalities to be present. Laura tells me that souvenir shot glasses are among the after-party swag collection, so don't miss out. (2005-08-02 10:45:01.0) Permalink Comments [0]

20050801 Monday August 01, 2005

OpenSolaris at OSCON Those of you in or near Portland, Oregon are encouraged to come and see us at OSCON this week. Most of the conference is at the Convention Center this year (use the helpfully-named Convention Center train stop). Sun will have a booth in the exhibit hall starting Wednesday, and we're giving a few talks as well. In particular, join Bryan and me for a free tutorial on building, installing, and developing with OpenSolaris using DTrace, mdb, and more. That will be held Tuesday at 1:30pm in room D140. Then on Wednesday, I'll be giving a short talk on the status of OpenSolaris at 2:35pm in Portland/255, and we'll have a BOF at 8:30pm. Thursday, don't miss Bryan's short talk on DTrace at 4:30pm.

Even if you can't make the conference, you're welcome to join me for a beer. Send me mail at wesolows at eng dot sun dot com if you're interested, or leave a message for me at the 5th Avenue Suites. (2005-08-01 12:26:19.0) Permalink Comments [0]

20050614 Tuesday June 14, 2005

The First OpenSolaris Project: GCC Support

OpenSolaris is (finally) available. I've been working on this every day I've been with Sun, though others have spent years on the effort, and it's an amazing milestone. Unlike most launches, though, this is the beginning of a new effort rather than the end of one. As much as we've done already, there's far more left to be done before OpenSolaris can fulfil all our promises and achieve all our goals.

One promise we have fulfilled today is our commitment to make OpenSolaris accessible to people without the money or desire to buy compilers. Since most of Solaris is normally built with the Sun Studio compilers, this meant we'd need either to provide the compilers on the same terms as Solaris (also required to build OpenSolaris sources), or modify the sources to build and work with the GNU C compiler, available with source and free of charge under the terms of the GNU GPL. For reasons more illustrative of bureaucracy and human nature than of technological difficulties, we were unsure almost until the moment of launch whether we would be able to provide the Studio compilers under acceptable terms; therefore, another engineer and I have spent the last two and a half months porting OpenSolaris to gcc.

At this point I had a nice writeup on inline assembly differences between the Studio and GNU compilers. But it relies on source code that isn't available yet - namely, the gcc-specific inline assembly files. So instead I'll talk about why it happened that way and why it's actually a good thing. I'll also talk about some straight-up bugs we found in the process of porting.

We received word that a final Studio license had been agreed upon on June 3 - just 11 days ago! The license is free-as-in-beer and although somewhat vague seems reasonable enough. Of course, I prefer using only Free Software and promoting it whenever possible (as we're doing with OpenSolaris), so I'd really rather use gcc. Our plan of record was to make a merged workspace available as "official" OpenSolaris. There were three sets of changes that needed to be merged together in the last three days leading up to launch: the gcc changes, which edit about 2500 files (mostly to fix compiler warnings), a large wad of renames to support the separation of code we're releasing now from that which we're hoping to release later (thousands of renames), and the coup de grâce, the addition of the CDDL license block to over 24,000 files. In the end, this gigantic 3-way merge proved impractical: there were over 1700 conflicts to resolve. Most are trivial and can easily be automerged by TeamWare, our revision control system, but the sheer volume and shortened schedule would have made adequate testing impossible.

Instead of the three-way merge, then, we elected to take the minimum amount of change we could: the addition of the CDDL blocks and the separation of released from unreleasable source. That meant gcc support would not ship in the "official" sources - but it could still be made available to the developer community. This is important for several reasons - first, it illustrates an important principle: FCS quality all the time. That is, if it's not good enough for a customer, it's not good enough to be putback. Since there was no doubt in anyone's mind that the gcc work was not ready for either, that meant it also wasn't good enough to call OpenSolaris. Second, it offers us an opportunity to provide a glimpse into the way projects work. One of the most common questions we get is "so, if the gate always has to be golden, how does any major work ever get done?" Like most people, we do major work in "branches" off the trunk. TeamWare supports children of children and merging of independent workspaces with common ancestry, so that no complicated branching apparatus is needed as for CVS. What will be available on the gcc project page will be that project gate. You're invited to participate - there are over 300 mostly very small bugs to fix.

One of the most significant kinds of bug we found was programs writing into string constants, confirming Osborne's Law. These programs ordinarily work properly because the Studio compilers place the string constants in the .data section or some other writable data section. The flag -xstrconst changes this behaviour, placing the strings in .rodata or a similar read-only segment and thus also allowing them to be shared. This reduces runtime memory usage but comes at a cost: buggy programs that attempt to write to the constant strings will trigger a segmentation violation and normally die. gcc acts as if this flag were always on, and applies it to other const data types as well. The end result is greater enforcement of correctness at the cost of crashes.

Fortunately fixing these is very easy. For example, I fixed bug number 6281909 (you're supposed to be able to see bugs, too, but it doesn't seem to include the bugs of interest) by fixing the selector function not to assume it can write '=' and '\0' into its arguments. Note that the correct use of 'const' can help prevent this kind of problem.
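As an illustration of the general shape of such a fix, here is a minimal sketch. It is not the actual change made for 6281909; the function name get_name and its interface are hypothetical. Instead of writing a '\0' over the '=' in its argument (which may be a string constant in read-only memory), the function copies the part it needs into a caller-supplied buffer, and the const qualifier lets the compiler reject any attempt to write through arg.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical example: copy the name part of a "name=value" string into
 * buf without modifying arg.  Returns 0 on success, or -1 if arg contains
 * no '=' or the name does not fit in buf.
 */
static int
get_name(const char *arg, char *buf, size_t buflen)
{
	const char *eq = strchr(arg, '=');
	size_t len;

	if (eq == NULL)
		return (-1);
	len = (size_t)(eq - arg);
	if (len >= buflen)
		return (-1);
	(void) memcpy(buf, arg, len);
	buf[len] = '\0';	/* terminate the copy, not the original */
	return (0);
}
```

Calling get_name("TERM=vt100", buf, sizeof (buf)) is then safe regardless of where the compiler places the string constant.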

The original article on inline assembly will appear when the source it references appears - and you can help make that happen sooner: check out the gcc project page.


(2005-06-14 08:09:00.0) Permalink Comments [0]

20050415 Friday April 15, 2005

DTrace is part of this complete operating system

Earlier this week, Mr. Vaughan-Nichols at eWeek wrote a largely inaccurate and needlessly hostile article about the CDDL, and our own Andy Tucker called him on a few points. Without bothering to correct that article or respond, he went back at it again on Wednesday, this time giving air time to SCO and their blessing of the OpenSolaris program. Why Mr. McBride of SCO felt the need to give this "blessing" is unclear; Sun obviously believes it has the rights needed to make the sources to nearly all of Solaris available under whatever license(s) we choose. Without those rights, no blessing would be sufficient; with them, none is necessary. I'll chalk this up to SCO taking whatever opportunity it can to appear relevant, especially as they continue to struggle in both the marketplace and the courtroom.

Enough of that. Instead, I'd like to focus on the most obvious and significant error in this article: the assertion that

"To date, though, the only released components of OpenSolaris are programs, such as DTrace, which aren't parts of the operating system."

We don't need to be too picky about what constitutes an operating system; even the most pedantic would surely agree that a component which spans the system from user applications to the heart of the kernel is part of the operating system. Under even an extremely narrow definition, DTrace is very much a part of the Solaris operating system - and therefore also of OpenSolaris technology. Our release of DTrace includes the sources for not just the standalone program dtrace(1M), but also all of the following:

It should be apparent that this is far more complex a subsystem than just one standalone user program. In fact, the source to dtrace(1M) is a single file out of 345 we released, and constitutes only 1431 of 102,163 lines of code (about 1.4%) in this initial release. If dtrace(1M) were simply an ordinary user program, it would not require over 100,000 lines of additional code - including over 32,000 in the kernel - to make it work.

As a final example, observe this comment block from usr/src/uts/common/os/dtrace_subr.c:

/*
 * Making available adjustable high-resolution time in DTrace is regrettably
 * more complicated than one might think it should be.  The problem is that
 * the variables related to adjusted high-resolution time (hrestime,
 * hrestime_adj and friends) are adjusted under hres_lock -- and this lock may
 * be held when we enter probe context.  One might think that we could address
 * this by having a single snapshot copy that is stored under a different lock
 * from hres_tick(), using the snapshot iff hres_lock is locked in probe
 * context.  Unfortunately, this too won't work:  because hres_lock is grabbed
 * in more than just hres_tick() context, we could enter probe context
 * concurrently on two different CPUs with both locks (hres_lock and the
 * snapshot lock) held.  As this implies, the fundamental problem is that we
 * need to have access to a snapshot of these variables that we _know_ will
 * not be locked in probe context.  To effect this, we have two snapshots
 * protected by two different locks, and we mandate that these snapshots are
 * recorded in succession by a single thread calling dtrace_hres_tick().  (We
 * assure this by calling it out of the same CY_HIGH_LEVEL cyclic that calls
 * hres_tick().)  A single thread can't be in two places at once:  one of the
 * snapshot locks is guaranteed to be unheld at all times.  The
 * dtrace_gethrestime() algorithm is thus to check first one snapshot and then
 * the other to find the unlocked snapshot.
 */

This comment, while arcane, is clear by itself, so I will not attempt to add to it. I will only point out that if DTrace were not a part of the operating system, it would not need to concern itself with the locking rules for updates to the high-resolution system timers. Further examples of DTrace's intimate association with core features of the Solaris kernel and userland libraries can easily be found by examining the sources.
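For readers who prefer code to prose, the invariant the comment describes can be sketched as a toy, single-threaded simulation. This is emphatically not the DTrace implementation: every name below is hypothetical, the "locks" are plain flags, and the memory ordering and re-checking that a real multiprocessor kernel needs are ignored. It shows only that a single writer updating two snapshots in succession always leaves one of them unlocked for a reader that cannot wait.

```c
#include <stdint.h>

typedef struct snapshot {
	int	locked;		/* nonzero while the writer is mid-update */
	int64_t	nsec;		/* saved time value */
} snapshot_t;

static snapshot_t snap[2];

/* Writer, first half: take lock i. */
static void
snap_enter(int i)
{
	snap[i].locked = 1;
}

/* Writer, second half: record the value and release lock i. */
static void
snap_exit(int i, int64_t now)
{
	snap[i].nsec = now;
	snap[i].locked = 0;
}

/* The single writer records the two snapshots in succession. */
static void
tick(int64_t now)
{
	int i;

	for (i = 0; i < 2; i++) {
		snap_enter(i);
		snap_exit(i, now);
	}
}

/* Reader: check first one snapshot and then the other; never waits. */
static int64_t
read_time(void)
{
	int i;

	for (i = 0; i < 2; i++) {
		if (!snap[i].locked)
			return (snap[i].nsec);
	}
	return (-1);	/* not reached while there is only one writer */
}
```

Calling read_time() between snap_enter(0) and snap_exit(0, ...) simulates a probe firing while the writer holds the first lock: the reader skips the locked snapshot and returns the value from the second one.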

Sun's DTrace experts have written extensively about their creation [more here and here to note just two] and provided a highly detailed reference manual. While much of this material may not be in a format which is accessible to the layman, even a cursory overview of the source we are offering and the breadth and depth of publications on the topic should be sufficient to satisfy one that DTrace is very much a part of the operating system. Perhaps Mr. Vaughan-Nichols was simply unfamiliar with the offering; in that case I would invite him to download the sources and inspect them himself, and to seek the opinions of expert engineers before making further claims of this sort. DTrace is very much a part of Solaris, and while we have much more to do, releasing it as open source was no trivial step. (2005-04-15 13:52:20.0) Permalink Comments [0]

20041223 Thursday December 23, 2004

Linus on Solaris Most people have probably read the recent Linus interview, in which he has a number of things to say about Linux, Solaris, and software development. Like any interview, it contains some interesting assertions, some obvious filler, and some real head-scratchers. Many in the Solaris community have expressed dismay or anger over some of his remarks, but rather than add to that, I'd like to examine some internal contradictions in Linus's statements and try better to understand why he's made them. As we ready OpenSolaris for public consumption and contribution, it's important to observe how similar development systems work and take steps to avoid difficulties encountered by other projects. Linus's comments indicate that, indeed, the structures and processes in place to serve Linux development are imperfect. We will be well-served to learn from this.

One of the head-scratchers is his assertion that he's not interested in Solaris because he feels it offers nothing of value that isn't already in Linux. This conclusion might be less baffling, though no less disappointing, if he'd actually examined the code, the feature set, and then made up his mind. But he admitted openly that he probably won't even look at the code, and instead will rely on others to tell him if it contains ideas worth considering. I really have to wonder about this approach, especially given his later comments concerning the reason for adding a feature to a system. We certainly agree with him that system design is about solving problems, not just doing something new and different for its own sake. Features don't get added to Solaris if they don't serve some useful purpose, fill some hole for developers, users, or both. It's difficult to believe that Solaris developers and users have problems to solve that differ greatly from those of Linux developers and users. In fact, as a long-time Linux developer myself, I can say with some confidence that the challenges are the same. So why does Solaris offer tools like kmdb, dtrace, and crash dumps, while Linus either refuses to integrate similar functionality or claims he hasn't heard of the problems these tools help to solve?

One possible reason is that distributions sometimes provide parts of these feature sets, so that users never even realize their absence in Linux proper. Linus talked about the distributors serving a valuable function, buffering developers from customers. But perhaps in that process, valuable information is not making its way back to Linus. The Linux development community would be well-served by talking to ordinary systems administrators now and then. Another possibility is that users and administrators can't, won't, or don't effectively communicate the problems they are trying to solve. But why don't Solaris users seem to have this problem? Do Linux distributors simply not listen? Or perhaps these decisions are really based on ideology, as so many Linux detractors claim. Regardless, a sober assessment of users' real-world needs might well reveal that Linus and others still have much work to do (as do Solaris developers), and that some of the changes they ought well to consider have already been made in other systems. The solutions Linus might choose may well be quite different from those chosen by Sun, but disregarding or remaining ignorant of the challenges is an opportunity lost to innovate and improve. What kind of engineer willingly passes up that opportunity?

If NIH is in fact "a disease" - a point which ought to elicit universal agreement - I'm left to wonder why Linus would pass up an opportunity to examine the works of other engineers. If he does in fact rely on others to tell him about valuable features in similar systems, something in that process is broken. If he wants to make sure Linux can solve all the problems Solaris can, I'd suggest he look closely at what's been done here. The code isn't even needed for this - a quick glance at public white papers would be sufficient to understand many of the problems Solaris engineers have been working to solve. If he doesn't believe these problems exist, a reality check is in order.

There are lessons here, of course. One of them is that systems developers must not lose touch with the problems they're supposed to solve. It pays to listen. Another lesson is that a process which prevents useful features from being implemented is broken, and someone has to be willing to recognize and correct such a process. If distributions take on the work of making a usable system and interacting with customers, engineers risk losing sight of appropriate goals. This is avoidable, but that it appears to be occurring implies that the relationship among Linux (the codebase), its distributors, and its developers (many of whom work for distributors) is defective in some way.

I'm cheered by the prospects for OpenSolaris to avoid these pitfalls, especially if we recognize them and take proper action. I hope we as a community will remain cognizant that they have hindered other large projects before ours, even those with leaders of Linus's stature. (2004-12-23 10:35:36.0) Permalink Comments [5]

20040813 Friday August 13, 2004

A Sense of Entitlement I've finally decided to write a bit about a topic that has bothered me for many years as a participant in the Free Software community (it applies equally well to Open Source if you prefer): User Entitlement.

Some of you out there know what I mean. You maintain an application in your spare time as a volunteer. You field trouble reports and RFEs and do your best to implement, at minimum, the suggestions that matter to you, all while holding down a job and meeting your personal and family obligations. But for a minority of users, that's not enough; they expect you to implement features that don't interest you and fix bugs you can't reproduce. In short, they expect you to provide support. While one tries never to be rude, at some point the urge to point out the obvious becomes overwhelming: you have the source, you obviously care a lot about this, and nobody else has the time or inclination to do anything about it! Instead of repeatedly asking when I'm going to implement your change, why not implement it yourself and send me a patch?

Of course, the inevitable response to this suggestion is that the user in question is not a programmer. This is a subtle but important fact that has changed the way the community functions over the years; in the beginning, we were all programmers. Now programmers are a minority of Free Software users, just as we are a minority of software users in general. The commons model breaks down under these conditions; many users have little to offer the community as a whole. Bug reports and testing are valuable services, true, but some users are just that - users. Not testers. Not contributors. Not developers. Just users; they use the software, expect (rightly) that it will work as advertised, and become unhappy and demanding if it does not. This looks a lot more like a customer than the fellow co-op shareholder the model would suggest.

I don't mean to suggest that this behaviour is representative, but it certainly has increased as the pool of users has expanded. How will Free Software projects in the future deal with the influx of Users? Much work has been done, mostly in economics, on the subject of managing cooperatives and commons; I believe this work is directly relevant to the Free Software community. I'll get more into some of that work in my next post. (2004-08-13 16:06:56.0) Permalink

20040803 Tuesday August 03, 2004

Solaris at OSCON Last week a contingent from Sun showed up at OSCON to talk about Solaris 10, meet with some community leaders to discuss building a community around open source Solaris, and of course learn from the other conference attendees.

Simon Phipps gave a talk on open source development and the meaning of freedom which was quite interesting. One of his points was that freedom for deployers is as important as freedom for developers, and that while licenses can help to build a community around a piece of software by giving the developers freedom, this is not sufficient as an end in itself. This is one of the things we're looking at as we get ready to release Solaris under an OSI-approved license of some as yet undetermined kind. Governance and community structure are at least as important as the license we eventually choose, and we have much more direct and immediate participation in these aspects. One point I've tried to emphasize is that successful communities form spontaneously and organically; they cannot be constructed from scratch, purchased, or willed into existence. Our challenge is to get people excited about Solaris and interested in being a part of that community, then to provide them with infrastructure and a reasonable way to make their voices heard as we proceed. Many people at the conference had some helpful suggestions for doing this.

Perhaps the most important thing to remember is that developers are fundamentally attracted by two things: exciting technology and a meaningful opportunity to work on it. The success of the BOF that Adam, Andy, Bart, and Eric put on showed many people just how exciting some of these technologies are. The feedback we received was overwhelmingly positive. Rarely does such a demo-friendly piece of technology as DTrace come along, and many of the attendees were clearly impressed. Still, it's only one of many major enhancements in S10; it's fairly obvious that there will be plenty of developers attracted to our technology provided we can generate enough awareness.

But as we've seen, compelling technology and an open license are not enough to make for a successful project. GCC and XFree86 had both, but neither project was successful in building and sustaining a community (GCC of course has been reborn since the quagmire of the 2.8 era). Many things make up a project's public perception as one which is easy and fun, or frustrating and counterproductive, in which to participate. Infrastructure, governance, and developer resources each play a role. My focus at the moment is on our efforts to help developers get started; this means providing good documentation and keeping the barriers to entry as low as possible. As one of the people who will be creating the developer documentation, tutorials, and examples, I'd be very interested in your feedback. When you're attracted to a project, what type of resource increases your desire to participate? What dissipates your interest and turns it into frustration? (2004-08-03 14:03:00.0) Permalink Comments [1]