Ali Bahrami

Tuesday Oct 21, 2008

GNU Hash ELF Sections

The ELF object format is used by several different operating systems, all sharing a common basic design, but each sporting their own extensions. One of the nice aspects of ELF's design is that it facilitates this, defining a common core, as well as reserving space for each implementation to define its own additions. Those of us in the ELF community try to stay current with each other, as a source of new ideas and inspiration, in order to avoid reinventing the wheel, and out of curiosity.

Recently, the GNU linker developers added a new style of hash section to their objects, and I've been learning about it in fine detail. Having done the work, it only makes sense to write it down and share it.

This posting describes the layout and interpretation of GNU ELF hash sections. The GNU hash section is a new style of hash section for ELF objects, with better performance than the original SYSV hash.

The information here was gathered from the following sources:

I did not look at the GNU binutils source code to gather this information.

If you spot an error, send me email (First.Last@Sun.COM, where First and Last are replaced with my name) and I'll fix it.

Hash Function

The GNU hash function uses the DJB (Daniel J Bernstein) hash, which Professor Bernstein originally posted to the comp.lang.c usenet newsgroup years ago:
uint32_t
dl_new_hash (const char *s)
{
        uint32_t h = 5381;

        for (unsigned char c = *s; c != '\0'; c = *++s)
                h = h * 33 + c;

        return h;
}
If you search for this algorithm online, you will find that the hash expression
h = h * 33 + c
is frequently coded as
h = ((h << 5) + h) + c
These are equivalent statements, replacing integer multiplication with a presumably cheaper shift and add operation. Whether this is actually cheaper depends on the CPU used. There used to be a significant difference with older machines, but integer multiplication on modern machines is very fast.
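
One way to convince yourself that the two forms agree is to run both over the same strings. The following is only an illustrative sketch: the names djb_hash_mul and djb_hash_shift are made up for this example and are not taken from any GNU source.

```c
#include <stdint.h>

/* Multiply form: h = h * 33 + c */
uint32_t
djb_hash_mul(const char *s)
{
        uint32_t h = 5381;

        for (unsigned char c = *s; c != '\0'; c = *++s)
                h = h * 33 + c;

        return h;
}

/* Shift-and-add form: h = ((h << 5) + h) + c */
uint32_t
djb_hash_shift(const char *s)
{
        uint32_t h = 5381;

        for (unsigned char c = *s; c != '\0'; c = *++s)
                h = ((h << 5) + h) + c;

        return h;
}
```

Both functions return 5381 for the empty string and 0x7c967e3f for "exit". Unsigned 32-bit arithmetic wraps modulo 2^32 in either form, so the results agree for every input.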

Another variation of this algorithm clips the returned value to 31-bits:

return h & 0x7fffffff;
However, GNU hash uses the full unsigned 32-bit result.

The GNU binutils implementation utilizes the uint_fast32_t type for computing the hash. This type is defined to be the fastest available integer machine type capable of representing at least 32-bits on the current system. As it might be implemented using a wider type, the result is explicitly clipped to a 32-bit unsigned value before being returned.

static uint_fast32_t
dl_new_hash (const char *s)
{
        uint_fast32_t h = 5381;

        for (unsigned char c = *s; c != '\0'; c = *++s)
                h = h * 33 + c;

        return h & 0xffffffff;
}

Dynamic Section Requirements

GNU hash sections place some additional sorting requirements on the contents of the dynamic symbol table. This is in contrast to standard SVR4 hash sections, which allow the symbols to be placed in any order allowed by the ELF standard.

A standard SVR4 hash table includes all of the symbols in the dynamic symbol table. However, some of these symbols will never be looked up via the hash table:

  • LOCAL symbols, unless referenced by a relocation (on some architectures)

  • FILE symbols

  • For sharable objects: All UNDEF symbols

  • For executables: Any UNDEF symbols that are not referenced by a PLT.

  • The special index 0 symbol (a special case of UNDEF)
Omitting these symbols from the hash table section has no impact on correctness, and will result in less hash table congestion, shorter hash chains, and correspondingly better hash performance.

With GNU hash, the dynamic symbol table is divided into two parts. The first part receives the symbols that can be omitted from the hash table. GNU hash does not impose any specific order for the symbols in this part of the dynamic symbol table.

The second part of the dynamic symbol table receives the symbols that are accessible from the hash table. These symbols are required to be sorted by increasing (hash % nbuckets) value, using the GNU hash function described above. The number of hash buckets (nbuckets) is recorded in the GNU hash section, described below. As a result, symbols which will be found in a single hash chain are adjacent in memory, leading to better cache performance.
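
The sort itself might be sketched as follows. This is an assumption-laden illustration, not the GNU linker's code: hsym_t, cmp_bucket, and sort_hashed_syms are hypothetical names, and each record is assumed to carry a precomputed hash. A real link-editor would also have to keep anything that refers to symbols by index, such as relocations, consistent with the new order.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record: one hashed symbol with its hash precomputed. */
typedef struct {
        const char *name;
        uint32_t    hash;       /* dl_new_hash(name) */
} hsym_t;

/* qsort() comparators take no context argument, so nbuckets lives here. */
static uint32_t sort_nbuckets;

static int
cmp_bucket(const void *a, const void *b)
{
        uint32_t ba = ((const hsym_t *)a)->hash % sort_nbuckets;
        uint32_t bb = ((const hsym_t *)b)->hash % sort_nbuckets;

        return (ba > bb) - (ba < bb);
}

/* Order the hashed part of the symbol table by increasing bucket. */
void
sort_hashed_syms(hsym_t *syms, size_t count, uint32_t nbuckets)
{
        sort_nbuckets = nbuckets;
        qsort(syms, count, sizeof (*syms), cmp_bucket);
}
```

After the call, every symbol belonging to a given bucket sits in one contiguous run. The order of symbols within a run is not specified by this sketch.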

GNU_HASH section

A GNU_HASH section consists of four separate parts, in order:
Header
An array of (4) 32-bit words providing section parameters:

nbuckets
The number of hash buckets

symndx
The dynamic symbol table has dynsymcount symbols. symndx is the index of the first symbol in the dynamic symbol table that is to be accessible via the hash table. This implies that there are (dynsymcount - symndx) symbols accessible via the hash table.

maskwords
The number of ELFCLASS sized words in the Bloom filter portion of the hash table section. This value must be non-zero, and must be a power of 2 as explained below.

Note that a value of 0 could be interpreted to mean that no Bloom filter is present in the hash section. However, the GNU linkers do not do this — the GNU hash section always includes at least 1 mask word.

shift2
A shift count used by the Bloom filter.

Bloom Filter
GNU_HASH sections contain a Bloom filter. This filter is used to rapidly reject attempts to look up symbols that do not exist in the object. The Bloom filter words are 32-bit for ELFCLASS32 objects, and 64-bit for ELFCLASS64.

Hash Buckets
An array of nbuckets 32-bit hash buckets

Hash Values
An array of (dynsymcount - symndx) 32-bit hash chain values, one per symbol from the second part of the dynamic symbol table.
The header, hash buckets, and hash chains are always 32-bit words, while the Bloom filter words can be 32 or 64-bit depending on the class of object. This means that ELFCLASS32 GNU_HASH sections consist of only 32-bit words, and therefore have their section header sh_entsize field set to 4. ELFCLASS64 GNU_HASH sections have mixed element size, and therefore set sh_entsize to 0.

Assuming that the hash section is aligned properly for accessing ELFCLASS sized words, the (4) 32-bit words directly before the Bloom filter ensure that the filter mask words are always aligned properly and can be accessed directly in memory.
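
To make the layout concrete, here is a minimal sketch that locates the four parts within an in-memory section image. The type gnu_hash_view_t and the function gnu_hash_view are hypothetical names invented for this example. It assumes the image is properly aligned and is in the byte order of the machine reading it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of a GNU_HASH section; not a type from a real header. */
typedef struct {
        uint32_t        nbuckets;
        uint32_t        symndx;
        uint32_t        maskwords;
        uint32_t        shift2;
        const void     *bloom;    /* maskwords words of 4 or 8 bytes */
        const uint32_t *buckets;  /* nbuckets words */
        const uint32_t *hashval;  /* (dynsymcount - symndx) words */
} gnu_hash_view_t;

/*
 * Locate the four parts of a section image. bloom_bytes is the size
 * of one Bloom filter word: 4 for ELFCLASS32, 8 for ELFCLASS64.
 */
void
gnu_hash_view(const void *sec, size_t bloom_bytes, gnu_hash_view_t *v)
{
        const uint32_t *hdr = sec;
        const unsigned char *p = (const unsigned char *)(hdr + 4);

        v->nbuckets  = hdr[0];
        v->symndx    = hdr[1];
        v->maskwords = hdr[2];
        v->shift2    = hdr[3];

        v->bloom = p;
        p += (size_t)v->maskwords * bloom_bytes;
        v->buckets = (const uint32_t *)p;
        v->hashval = v->buckets + v->nbuckets;
}
```

With a header of {3, 1, 2, 6} and 8-byte filter words, the Bloom filter starts at 32-bit word 4 of the section, the buckets at word 8, and the hash values at word 11.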

Bloom Filter
GNU hash sections include a Bloom filter. Bloom filters are probabilistic, meaning that false positives are possible, but false negatives are not. This filter is used to rapidly reject symbol names that will not be found in the object, avoiding the more expensive hash lookup operation. Normally, only one object in a process has the given symbol. Skipping the hash operation for all the other objects can greatly speed symbol lookup.

The filter consists of maskwords words, each of which is 32-bits (ELFCLASS32) or 64-bits (ELFCLASS64) depending on the class of object. In the following discussion, C will be used to stand for the size of one mask word in bits. Collectively, the mask words make up a logical bitmask of (C * maskwords) bits.

GNU hash uses a k=2 Bloom filter, which means that two independent hash functions are used for each symbol. The Bloom filter reference contains the following statement:

The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields.
The hash function used by the GNU hash has this property. This fact is leveraged to produce both hash functions required by the Bloom filter from the single hash function described above:
H1 = dl_new_hash(name);
H2 = H1 >> shift2;
As discussed above, the link editor determines how many mask words to use (maskwords) and the amount by which the first hash result is right shifted to produce the second (shift2). The more mask words used, the larger the hash section, but the lower the rate of false positives. I was told in private email that the GNU linker primarily derives shift2 from the base 2 log of the number of symbols entered into the hash table (dynsymcount - symndx), with a minimum value of 5 for ELFCLASS32, and 6 for ELFCLASS64. These values are explicitly recorded in the hash section in order to give the link editor the flexibility to change them in the future should better heuristics emerge.

The Bloom filter mask sets one bit for each of the two hash values. Based on the Bloom filter reference, the word containing each bit, and the bit to set would be calculated as:

N1 = ((H1 / C) % maskwords);
N2 = ((H2 / C) % maskwords);

B1 = H1 % C;
B2 = H2 % C;
To populate the bits when building the filter:
bloom[N1] |= (1 << B1);
bloom[N2] |= (1 << B2);
and to later test the filter:
(bloom[N1] & (1 << B1)) && (bloom[N2] & (1 << B2))
The GNU hash deviates from the above in a significant way. Rather than calculate N1 and N2 separately, a single mask word is used, corresponding to N1 above. This is a conscious decision by the GNU hash developers to optimize cache behavior:
This makes the 2 hash functions for the Bloom filter more dependent than when two different Ns were used, but in our measurements still has very good ratio of rejecting lookups that should be rejected, and is much more cache friendly. It is very important that we touch as few cache lines during lookup as possible.

Therefore, in the GNU hash, the single mask word is actually calculated as:

N = ((H1 / C) % maskwords);
The two bits set in the Bloom filter mask word N are:
BITMASK = (1 << (H1 % C)) | (1 << (H2 % C));
The link-editor sets these bits as
bloom[N] |= BITMASK;
And the test used by the runtime linker is:
(bloom[N] & BITMASK) == BITMASK;
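
Put together, the single-word scheme might look like the following sketch. It assumes ELFCLASS64 (64-bit mask words); bloom_word_t, BLOOM_C, bloom_bitmask, bloom_add, and bloom_test are names invented here for illustration, and the caller is assumed to have computed h1 with the hash function shown earlier.

```c
#include <stdint.h>

typedef uint64_t bloom_word_t;  /* ELFCLASS64; uint32_t for ELFCLASS32 */
#define BLOOM_C (8 * sizeof (bloom_word_t))     /* C: bits per mask word */

/* The two bits for a hash, both within the same mask word. */
static bloom_word_t
bloom_bitmask(uint32_t h1, uint32_t shift2)
{
        uint32_t h2 = h1 >> shift2;

        return ((bloom_word_t)1 << (h1 % BLOOM_C)) |
            ((bloom_word_t)1 << (h2 % BLOOM_C));
}

/* Link-editor side: record a symbol's hash in the filter. */
void
bloom_add(bloom_word_t *bloom, uint32_t maskwords, uint32_t shift2,
    uint32_t h1)
{
        bloom[(h1 / BLOOM_C) % maskwords] |= bloom_bitmask(h1, shift2);
}

/* Runtime side: returns 0 if the symbol is definitely absent. */
int
bloom_test(const bloom_word_t *bloom, uint32_t maskwords, uint32_t shift2,
    uint32_t h1)
{
        bloom_word_t bitmask = bloom_bitmask(h1, shift2);

        return (bloom[(h1 / BLOOM_C) % maskwords] & bitmask) == bitmask;
}
```

Note the cast on the constant 1: shifting a plain int by up to 63 bits would be undefined, so the shift has to be done in the width of the mask word.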
Bit Fiddling: Why maskwords Is Required To Be A Power Of Two
In general, a Bloom filter can be constructed using an arbitrary number of words. However, as noted above, the GNU hash calls for maskwords to be a power of 2. This requirement allows the modulo operation
N = ((H1 / C) % maskwords);
to instead be written as a simple mask operation:
N = ((H1 / C) & (maskwords - 1));
Note that (maskwords - 1) can be precomputed once
MASKWORDS_BITMASK = maskwords - 1;
and then used for every hash:
N = ((H1 / C) & MASKWORDS_BITMASK);
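
The equivalence only holds for powers of two, which a short sketch can demonstrate. The function names below are hypothetical, chosen for this example only.

```c
#include <stdint.h>

/* Nonzero when v is a power of two (exactly one bit set). */
int
is_pow2(uint32_t v)
{
        return v != 0 && (v & (v - 1)) == 0;
}

/* Mask word index via modulo; valid for any nonzero maskwords. */
uint32_t
word_index_mod(uint32_t h1, uint32_t c, uint32_t maskwords)
{
        return (h1 / c) % maskwords;
}

/* Mask word index via AND; only valid when maskwords is a power of two. */
uint32_t
word_index_and(uint32_t h1, uint32_t c, uint32_t maskwords)
{
        return (h1 / c) & (maskwords - 1);
}
```

With maskwords of 8 the two functions agree for any hash. With maskwords of 6 they do not: for h1 = 384 and C = 64, the modulo form gives 0 while the AND form gives 4.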
Bloom Filter Special Cases
Bloom filters have a pair of interesting special cases:
  • When a Bloom filter has all of its bits set, all tests result in a True (accept) value. The GNU linker takes advantage of this by issuing a single word Bloom filter with all bits set when it wants to "disable" the Bloom filter. The filter is still there, and is still used, at minimal overhead, but it lets everything through.

  • A Bloom filter with no bits set will return False in all cases. This case is relatively rare in ELF files, as an object that exports no symbols has limited application. However, sometimes objects are built this way, relying on init/fini sections to cause code from the object to run.
Hash Buckets
Following the Bloom filter are nbuckets 32-bit words. Each word N in the array contains the lowest index into the dynamic symbol table for which:
(dl_new_hash(symname) % nbuckets) == N
Since the dynamic symbol table is sorted by the same key (hash % nbuckets), dynsym[buckets[N]] is the first symbol in the hash chain that will contain the desired symbol if it exists.

A bucket element will contain the index 0 if there is no symbol in the hash table for the given value of N. As index 0 of the dynsym is a reserved value, this index cannot occur for a valid symbol, and is therefore non-ambiguous.
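
A link-editor could fill the bucket array along these lines. This is a sketch under stated assumptions: fill_buckets is a hypothetical name, and hashes[i] is assumed to hold the hash of dynsym[symndx + i], already sorted by increasing (hash % nbuckets).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * hashes[i] is the hash of dynsym[symndx + i]; the array must already
 * be ordered by increasing (hash % nbuckets).
 */
void
fill_buckets(const uint32_t *hashes, size_t count, uint32_t symndx,
    uint32_t nbuckets, uint32_t *buckets)
{
        for (uint32_t b = 0; b < nbuckets; b++)
                buckets[b] = 0;         /* 0 marks an empty chain */

        /* Walk backwards so the lowest index for each bucket is written last. */
        for (size_t i = count; i-- > 0; )
                buckets[hashes[i] % nbuckets] = symndx + (uint32_t)i;
}
```

For example, with nbuckets = 4, symndx = 2, and hashes {8, 4, 5, 7, 3}, the resulting buckets are {2, 4, 0, 5}: bucket 2 has no symbols, so it holds 0.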

Hash Values
The final part of a GNU hash section contains (dynsymcount - symndx) 32-bit words, one entry for each symbol in the second part of the dynamic symbol table. The top 31 bits of each word contain the top 31 bits of the corresponding symbol's hash value. The least significant bit is used as a stopper bit. It is set to 1 when a symbol is the last symbol in a given hash chain:
lsb = (N == dynsymcount - 1) ||
  ((dl_new_hash (name[N]) % nbuckets)
   != (dl_new_hash (name[N + 1]) % nbuckets))

hashval = (dl_new_hash(name) & ~1) | lsb;
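
Applied to a whole table, the rule above might be coded as follows. As before, this is only a sketch: fill_hashvals is a hypothetical name, and hashes[] is assumed to hold the hashes of the symbols in the second part of the dynamic symbol table, in table order.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * hashes[i] is the hash of the i-th hashed symbol, with the array
 * ordered by increasing (hash % nbuckets). hashval receives count words.
 */
void
fill_hashvals(const uint32_t *hashes, size_t count, uint32_t nbuckets,
    uint32_t *hashval)
{
        for (size_t i = 0; i < count; i++) {
                /* Last in chain: final symbol, or the next bucket differs */
                uint32_t last = (i == count - 1) ||
                    (hashes[i] % nbuckets) != (hashes[i + 1] % nbuckets);

                hashval[i] = (hashes[i] & ~(uint32_t)1) | last;
        }
}
```

With nbuckets = 4 and hashes {8, 4, 5, 7, 3}, the chains are {8, 4}, {5}, and {7, 3}, and the stored values are {8, 5, 5, 6, 3}.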

Symbol Lookup Using GNU Hash

The following shows how a symbol might be looked up in an object using the GNU hash section. We will assume the existence of an in memory record containing the information needed:
typedef struct {
        const char      *os_dynstr;      /* Dynamic string table */
        Sym             *os_dynsym;      /* Dynamic symbol table */
        Word            os_nbuckets;     /* # hash buckets */
        Word            os_symndx;       /* Index of 1st dynsym in hash */
        Word            os_maskwords_bm; /* # Bloom filter words, minus 1 */
        Word            os_shift2;       /* Bloom filter hash shift */
        const BloomWord *os_bloom;       /* Bloom filter words */
        const Word      *os_buckets;     /* Hash buckets */
        const Word      *os_hashval;     /* Hash value array */
} obj_state_t;
To simplify matters, we elide the details of handling different ELF classes. In the above, Word is a 32-bit unsigned value, BloomWord is either 32 or 64-bit depending on the ELFCLASS, and Sym is either Elf32_Sym or Elf64_Sym.

Given a variable containing the above information for an object, the following pseudo code returns a pointer to the desired symbol if it exists in the object, and NULL otherwise.

const Sym *
symhash(obj_state_t *os, const char *symname)
{
        Word            h1, h2;
        Word            n;
        Word            c;
        BloomWord       bitmask;
        const Sym       *sym;
        const Word      *hashval;

        /*
         * Hash the name, generate the "second" hash
         * from it for the Bloom filter.
         */
        h1 = dl_new_hash(symname);
        h2 = h1 >> os->os_shift2;

        /* Test against the Bloom filter; c is the mask word size in bits */
        c = 8 * sizeof (BloomWord);
        n = (h1 / c) & os->os_maskwords_bm;
        bitmask = ((BloomWord)1 << (h1 % c)) |
            ((BloomWord)1 << (h2 % c));
        if ((os->os_bloom[n] & bitmask) != bitmask)
                return (NULL);

        /* Locate the hash chain, and corresponding hash value element */
        n = os->os_buckets[h1 % os->os_nbuckets];
        if (n == 0)    /* Empty hash chain, symbol not present */
                return (NULL);
        sym = &os->os_dynsym[n];
        hashval = &os->os_hashval[n - os->os_symndx];

        /*
         * Walk the chain until the symbol is found or
         * the chain is exhausted.
         */
        for (h1 &= ~1; ; sym++) {
                h2 = *hashval++;

                /*
                 * Compare the strings to verify match. Note that
                 * a given hash chain can contain different hash
                 * values. We'd get the right result by comparing every
                 * string, but comparing the hash values first lets us
                 * screen obvious mismatches at very low cost and avoid
                 * the relatively expensive string compare.
                 *
                 * We are intentionally glossing over some things here:
                 *
                 *    - We could test sym->st_name for 0, which indicates
                 *      a NULL string, and avoid a strcmp() in that case.
                 *
                 *    - The real runtime linker must also take symbol
                 *      versioning into account. This is an orthogonal
                 *      issue to hashing, and is left out of this
                 *      example for simplicity.
                 *
                 * A real implementation might test (h1 == (h2 & ~1)), and
                 * then call a (possibly inline) function to validate the rest.
                 */
                if ((h1 == (h2 & ~1)) &&
                    !strcmp(symname, os->os_dynstr + sym->st_name))
                        return (sym);

                /* Done if at end of chain */
                if (h2 & 1)
                        break;
        }

        /* This object does not have the desired symbol */
        return (NULL);
}

The Cost Of ELF Symbol Hashing

Linux and Solaris are ELF cousins. Both systems use the same base ELF standard, though we've both made our own OS specific extensions over the years. Recently, the GNU linker folks made some changes to how Linux does symbol hashing in their ELF objects. I've been learning about what they've done, and that in turn caused me to consider the bigger picture of ELF hashing overhead.

History and Trends

In an ELF based system, the runtime linker looks up symbols within executables and sharable objects. The available symbols are found in the dynamic symbol table. The lookup is done using a pre-computed hash table section associated with that symbol table. The SVR4 ELF ABI defines the layout of these hash sections, and the hash function. These are all original ELF features, dating back to the late 1980's when ELF was designed. This aspect of ELF has been static since that time.
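
For comparison with the GNU hash, the original SVR4 hash function can be sketched as below. This follows the commonly published form of the algorithm; the name sysv_hash and the use of a fixed 32-bit type are choices made for this example.

```c
#include <stdint.h>

/* Classic SVR4 ELF hash, written with a 32-bit accumulator. */
uint32_t
sysv_hash(const char *name)
{
        uint32_t h = 0, g;

        for (const unsigned char *p = (const unsigned char *)name;
            *p != '\0'; p++) {
                h = (h << 4) + *p;
                g = h & 0xf0000000u;    /* top four bits, if any */
                if (g != 0)
                        h ^= g >> 24;   /* fold them back into the low bits */
                h &= ~g;                /* and clear them */
        }
        return h;
}
```

Because the top four bits are cleared on every step, the result never uses more than 28 bits. For instance, sysv_hash("exit") is 0x0006cf04.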

The runtime linker maintains a list of objects currently mapped into a program's address space. To find a desired symbol, it starts at the head of this list and searches each one in turn using a hash lookup operation, until the symbol is found (success) or the end of the list is hit (failure).

The per-symbol cost of symbol lookup hashing grows with:

  • The number of objects in a process.

  • The number of symbols in those objects.

  • The length of the symbol names. In particular, the C++ language encodes class names and argument types into symbol names, which makes the strings extremely long. Even worse from a string comparison point of view is that these strings tend to have long shared prefixes, differing only in the final characters.
These items all grow over time. In the days when ELF was originally defined, a process had one or two sharable objects at most, and a few hundred symbols, all of which had short names. More recently, it has become common for a program to have tens of sharable objects, and thousands of symbols, and names have grown. This trend shows no sign of abating. It is easy to imagine a near future with hundreds of objects and hundreds of thousands of symbols. In fact, we have already seen a program with almost 1000 objects.

In the past, when the list of objects in a program was very short, it was not necessary to search many objects before a given symbol was found. Most symbols were in the first, often only, object. Hence, most hash operations were successful, and hash overhead was not a significant concern. In modern programs however, failed hash operations dominate. It is usually necessary to perform one or more failing hash operations before getting to the object that has a desired symbol. The more objects, the larger the percentage of failing hashes.

Unless somehow mitigated, the per-symbol cost of hashing will continue to grow as programs grow larger, possibly to a level where the user can feel the effect. There are several ways in which this overhead can be reduced:

  1. Eliminate unnecessary symbols

  2. Eliminate the O(n) search of the object list to reduce the amount of hashing required.

  3. Make each hashing operation cost less.

Eliminate Unnecessary Symbols

Most objects contain global symbols that are for the use of the object, but not intended to be accessed by outside code. One common example would be that of a helper function called within multiple files that are compiled into a sharable object. Such a function needs to be global within the object so that it can be called from multiple files. However, it is not intended to be something the users of the library call directly.

ELF versioning allows symbols to be assigned to versions, thereby creating interfaces that can be maintained for backward compatibility as the object evolves. In a version definition, the scope reduction operator can be used to tell the link-editor that any global symbols not explicitly assigned to a version should not be visible outside the object. For example, the following exports the symbol foo from a version named VERSION_1.0, while reducing the scope of all other global symbols to local:

VERSION_1.0
{
	 global:
		 foo;
	 local:
		 *;
};
Some language compilers offer symbol visibility keywords that have similar effect.
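
As one example, GCC and compatible compilers accept a visibility attribute on declarations. The sketch below assumes such a compiler; other compilers spell this differently. The function names helper and foo are placeholders.

```c
/* Internal helper: global within the object, hidden from other objects. */
__attribute__((visibility("hidden")))
int
helper(int x)
{
        return x * 2;
}

/* Public interface: remains visible to users of the library. */
__attribute__((visibility("default")))
int
foo(int x)
{
        return helper(x) + 1;
}
```

Code in other source files of the same object can still call helper(), but when this is built into a sharable object, only foo is exported from it.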

Eliminating unnecessary symbols from the hash table reduces the average length of the hash chains, and speeds the lookup. In addition, hiding unnecessary symbols from an object's external interface prevents accidental interposition in which a given library exports a function intended to only be for its own use, and that symbol ends up interposing on a symbol of the same name in a different object.

Eliminate O(n) Object Searching

There are different strategies employed in modern operating systems to minimize the need for symbol hashing:
Prelink
The Linux operating system emphasizes their prelink system. Prelink analyzes all the executables on the system, and the libraries they depend on. Non-conflicting addresses are assigned for all the libraries, and then the executables and libraries are pre-bound to be loaded at those addresses. In effect, the work normally done by the runtime linker for each object at process startup is instead done once. The runtime linker, recognizing a prelinked object at process startup, will map it to its assigned location, and immediately return rather than do the usual relocation processing.

Prelinking is a complex per-system operation, though Linux does a good job of hiding and managing this complexity. Changes to the system can require the prelink operation to be redone.

Prelinking pares the cost of ELF runtime processing to the absolute minimum. As part of that, it completely eliminates symbol hashing at startup. A further advantage is that it does not require any changes to the objects in question. All of the complexity is kept in the prelink system itself.

Prelinking will not prevent hashing from occurring if:

  • The object is loaded via dlopen().
  • The object is not prelinked
  • The object is prelinked for one system, but is then used by another, either as a copy or via a network filesystem. Prelinking is a per-system concept. The system can use an object prelinked for a different system, but the benefits of prelinking may be lost.
  • The objects on the system have changed, altering the prelinking needed. In this case, prelinking can be recomputed.

Direct Binding
The Solaris operating system employs direct binding, preferably in conjunction with lazy loading and lazy binding. In non-direct binding, an object records the symbols it requires from other objects. In direct binding, each such symbol also records the object expected to provide that symbol.

Direct bindings were developed with several goals in mind:

  1. They harden large programs against the effects of unintended symbol interposition, when the wrong library provides a given symbol due to the order in which libraries get loaded. This makes it easier to reliably build larger programs.

  2. They allow multiple instances of a named interface to be referenced by different objects, further expanding the ability of complex programs to interface with multiple, perhaps incompatible, external interfaces. For instance, when a program depends on many objects that are developed independently by others, sometimes those objects have dependencies on different incompatible versions of some other object.

  3. Performance: At runtime, the runtime linker uses the direct binding information to skip the O(n) search of all the objects in the process, and instead to go directly to the right library for each symbol, carrying out a single successful hash operation to locate it within the object.

The first two items were the driving issues that led to the invention of direct bindings, reflecting real problems we were encountering in the field.

Lazy binding, which is the default, reduces the amount of hashing done even further, by delaying the binding operation until the first use of a given symbol by the program. If a given symbol is not needed during the lifetime of a process, it is never looked up, and the hashing that would otherwise occur is eliminated. Most programs do not use all of the symbols available to them in a given execution, so lazy binding amplifies the benefits of direct binding.

Direct bindings do not eliminate hashing, but they eliminate the unproductive failure cases, leaving a single successful hash lookup per symbol. In a sense, they bring us back to the performance of early ELF systems where everything was found in a single library and most hash operations were successful. Direct bindings have a relatively simple implementation. They work with dlopen(), and across network filesystems. However, many objects do not use direct bindings. Unlike prelink, direct bindings require explicit action to be taken when the object is built. Converting an existing non-direct object to use them can require some analysis to be carried out by the programmer, to enable desired interposition that direct bindings would otherwise prevent.

Prelinking and direct bindings are very different solutions that attack the problem of hashing overhead along different axes. Systems will see greatly reduced symbol hash overhead with either strategy, but symbol hashing will still occur. As such, the cost of the hash lookup operation is still of interest.

Making Symbol Hashing Cheaper

The existing SVR4 ELF hash function and hash table sections are fixed. Improving their performance requires introducing a new hash function and hash section. Recently, the developers of the GNU linkers have done this. These new hash sections can coexist in an object with the older SVR4 hash sections, allowing for backward compatibility with older systems. The GNU hash reduces hash overhead in the following ways:
  • An improved hash function is used, to better spread the hash keys and reduce hash chain length.

  • The dynamic symbol table is sorted into hash order, such that memory access tends to be adjacent and monotonically increasing, which can help cache behavior. (Note that the Solaris link-editor does a similar sort, although the specific details differ.)

  • The dynamic symbol table contains some symbols that are never looked up via the hash table. These symbols are left out of the hash table, reducing its size and hash chain lengths.

  • Perhaps most significantly, the GNU hash section includes a Bloom filter. This filter is used prior to hash lookup to determine if the symbol is found in the object or not.
Bloom filters are used to test whether a given item is part of a given set. They use a compact bitmask representation, and are fast to query. Bloom filters are probabilistic: False positives are possible, but false negatives are not. The size of the bitmask used to represent the filter, and the quality and number of hash functions determine the rate of false positives.

A Bloom filter is never wrong when it says an item does not exist in a set. Applied to ELF hash tables, the runtime linker can test a symbol name against a Bloom filter, and if rejected, immediately move on to the next object. Since most symbol lookups end in failure as discussed above, this has the potential to eliminate a large number of unnecessary hash operations. It is possible for a Bloom filter to incorrectly indicate that an item exists in the set when it doesn't. When this happens, it is caught by the following hash lookup, so correct linker operation is not affected. Since false positives are rare, this does not significantly affect performance.

It is interesting to note that the use of a Bloom filter makes a successful symbol lookup slightly more expensive than it would be otherwise. The hash table alone can be used to determine if a symbol exists or not, so the Bloom filter is pure overhead in the success case. Despite that, the Bloom filter is a winning optimization, because it is very cheap to compute compared to a hash table lookup, and because most hash operations are against libraries that end up not containing the desired symbol.

It is also worth noting that the runtime linker is free to skip the use of the Bloom filter and proceed directly to the hash lookup operation. This may be worthwhile in situations where the runtime linker has other reasons to believe the lookup will be successful. In particular, if the runtime linker is directed to a given object via a direct binding, the odds of a failed symbol lookup should be zero, so there is no need to screen before the hash lookup.

Conclusions

Tweaking the performance of an existing algorithm has its place, particularly within the inner loops of a busy program. However, the big wins are usually the result of using a better algorithm. In Solaris, direct bindings have been our algorithmic approach to reducing hash overhead. We've made a conscious effort to develop and deploy direct bindings in preference to making improvements to the traditional hash process. We've been pleased with the results: direct bindings are faster and, combined with their other attributes, allow programs to scale larger.

Nonetheless, the GNU hash embodies several worthwhile ideas:

  1. Better hash function
  2. Doesn't put symbols that the runtime linker doesn't care about into the hash table.
  3. Bloom filter cheaply eliminates most hash operations.
In particular, the use of a Bloom filter to cheaply filter away unproductive hash operations stands out as a very interesting idea.

It is clear that hash overhead can be measured and reduced. By and large however, we have not found ELF hash overhead to be a hot spot for real programs. It seems that the other things that go on in a program generally dominate the hit that comes from ELF symbol hashing. The core Solaris OS is now built using direct bindings, which has allowed us to harden and simplify aspects of its design. Interestingly enough, measurements do not reflect a significant resulting change in system performance. This does not prove that hash overhead should be ignored, but it does tend to suggest that one has to look towards extremely large non-direct bound programs in order to demonstrate the issue.

It's all food for thought, and perhaps time for some experimentation and measurement.

Wednesday Mar 19, 2008

ld Is Now A Cross Link-Editor

Until yesterday, ld, the Solaris link-editor, was a native linker. This means that it was only able to link objects for the same machine that the linker was running on. To link sparc objects required the use of a sparc machine, and x86 objects required an x86 system.

With the integration of

PSARC 2008/179 cross link-editor
6671255 link-editor should support cross linking
into Solaris Nevada build 87, the Solaris ld became a cross link-editor. Now, ld running on any type of hardware can link objects for any of the systems Solaris supports. This is currently sparc and x86, but there are people playing with OpenSolaris on other hardware too, so who knows what we might end up with?

The user interface to this new capability is a simple extension of ld's traditional behavior. Traditionally, ld establishes the class (whether the object is 32 or 64-bit) of the object being generated by examining the first ELF object processed from the command line. In the case where an object is being built solely from an archive library or a mapfile, the -64 command line option is available to explicitly override this default. We have extended this to also determine the machine type of the object to produce:

  • The class and machine type of the resulting object default to those of the first ELF object processed from the command line.

  • The new '-z target=' command line option can be used to explicitly specify the machine type.
It's simple: To link objects for a given machine, you supply the link-editor with objects of that type. The link-editor examines the first object, and then configures itself to process objects of that type.

Of course, it's a rare program that doesn't link against at least one system library. You're going to need libc, if nothing else. To do a successful cross-link, you'll need to have an image of the root filesystem for a system of the target type. There are many ways to do this. For testing purposes, I used a sparc and x86 system, using NFS to allow each system to see the root filesystem of the other.

Even though we now have a cross link-editor, we expect that the vast majority of links will be native, for the machine running the linker. We decided to pursue cross linking anyway, for two reasons:

  1. To lower the bar for OpenSolaris ports: People have had some success using the GNU ld to port OpenSolaris, but it is difficult to get very far that way. The code in Solaris depends on link-editor features that are specific to the Solaris ld, so using GNU ld involves a fair amount of hacking around these things to make progress, and the farther up the stack you go from the kernel to userland, the harder it gets. We hope that providing a better framework for adding targets to the Solaris ld will help such efforts. It should now be possible (though still not trivial!) to add support for a new target to the Solaris linker running on sparc or x86, and then use the resulting system to cross-build for the new platform.

  2. To allow the use of fast/cheap commodity desktop systems to build objects for other systems, be they large expensive systems, or small embedded devices.

A cross link-editor is a significant step, but is of little use unless you also have a cross-compiler and assembler. The GNU gcc compiler can be built as a cross compiler, and that should be very helpful for OpenSolaris ports. However, the Sun Studio compilers are native compilers, so it will still be a while before I can use my amd64 desktop to build a sparc version of Solaris as part of work at Sun. We've taken a first step. It will be interesting to see what follows.



Friday Nov 02, 2007

Avoiding LD_LIBRARY_PATH: The Options

With the introduction of the elfedit utility into Solaris, we have a new answer to the age-old question of how to avoid everyone's favorite way to get into trouble, the LD_LIBRARY_PATH environment variable. This seems like an appropriate time to revisit this topic.

LD_LIBRARY_PATH Seems Useful. What's the Problem?

The problem is that LD_LIBRARY_PATH is a crude tool, and cannot be easily targeted at a problem program without also hitting other innocent programs. Sometimes this overspray is harmless (it costs some time, but doesn't break anything). Other times, it causes a program to link to the wrong version of something, and that program dies in mysterious ways.

Historically, inappropriate use of LD_LIBRARY_PATH might be the #1 way to get yourself into trouble in an ELF environment. In particular, people who redistribute binaries with instructions for their users to set LD_LIBRARY_PATH in their shell startup scripts are unleashing forces beyond their control. Experience tells us that such use is destined to end badly.

This subject has been written about many times by many people. My colleague Rod Evans wrote about this (LD_LIBRARY_PATH - just say no) for one of his first blog entries.

If you need additional convincing on this point, here are some suggested Google searches you might want to try:

LD_LIBRARY_PATH problem
LD_LIBRARY_PATH bad
LD_LIBRARY_PATH evil
LD_LIBRARY_PATH darkest hell

If LD_LIBRARY_PATH is so bad, why does its use persist? Simply because it is the option of last resort, used when everything else has failed. We probably can't eliminate it, but we should strive to reduce its use to the bare minimum.

How to Use, and How To Avoid Using LD_LIBRARY_PATH

The best way to use LD_LIBRARY_PATH is interactively, as a short term aid for testing or development. A developer might use it to point his test program at an alternative version of a library. Beyond that, the less you use it, the better off you'll be. With that in mind, here is a list of ways to avoid LD_LIBRARY_PATH. The items are ordered from best to worst, with the best option right at the top:
  • Explicitly set the correct runpath for the objects you build. If you have the ability to relink the object, you can always do this, and no other workaround is needed. To set a runpath in an object, use the -R compiler/linker option.

    One common problem that people run into with a built in runpath is the use of an absolute path (e.g. /usr/local/lib). Absolute paths are no problem for the well known system libraries, because their location is fixed by convention as well as by standards. However, they can be trouble for libraries supplied by third parties and installed onto the system. Usually the user has a choice of where such applications are installed, with their home directory and /usr/local being two of the more popular places. An application that hard wires the location of user installed libraries cannot handle this. The solution in this case is to use the $ORIGIN token in those runpaths. The $ORIGIN token, which refers to the directory in which the using object resides, can be used to set a non-absolute runpath that will work in any location, as long as the desired libraries reside at a known location relative to the using program. Fortunately, this is often the case.

    For example, consider the case of a 32-bit application named myapp, which relies on a sharable library named mylib.so, as well as on the standard system libraries found in /lib and /usr/lib. The -R option to put the runpath into myapp that will look in these places would be:

    -R '$ORIGIN/../lib:/lib:/usr/lib'
    
    This allows myapp and mylib.so to be installed anywhere, as long as they are kept in the same positions relative to each other.

    Even for system libraries, the use of $ORIGIN can be useful. We use it for all of the linker components in the system. For instance:

    % elfdump -d /usr/bin/ld | grep RUNPATH
           [7]  RUNPATH           0x2e6               $ORIGIN/../../lib
    
    By setting the runpath using $ORIGIN instead of simply hardwiring the well known location /lib, we make it easier to test a tree of alternative linker components, such as the one that results from a full build of the Solaris ON consolidation. We know that when we run a test copy of ld, it will use the related libraries that were built with it, instead of binding to the installed system libraries.

    There is one exception to the advice to make heavy use of $ORIGIN. The runtime linker will not expand tokens like $ORIGIN for secure (setuid) applications. This should not be a problem in the vast majority of cases.

  • Many times, the problem comes in the form of open source software that explicitly sets the runpath to an incorrect value for Solaris. Can you fix the configuration script and contribute the change back to the package maintainer? You'll be doing lots of people a favor if you do.

  • If you have an object with a bad runpath (or no runpath) and the object cannot be rebuilt, it may be possible to alter its runpath using the elfedit command. Using the myapp example from the first item:
    elfedit -e 'dyn:runpath $ORIGIN/../lib:/lib:/usr/lib' myapp
    
    For this option to be possible, you need to be running a recent version of Solaris that has elfedit, and your object has to have been linked by a version of the Solaris link-editor that leaves the necessary extra room in the object. Quoting from the elfedit manpage:
    • The desired string must already exist in the dynamic string table, or there must be enough reserved space within this section for the new string to be added. If your object has a string table reservation area, the value of the .dynamic DT_SUNW_STRPAD element indicates the size of the area. The following elfedit command can be used to check this:

      % elfedit -r -e 'dyn:tag DT_SUNW_STRPAD' file
    • The dynamic section must already have a runpath element, or there must be an unused dynamic slot available where one can be inserted. To test for the presence of an existing runpath:

      % elfedit -r -e 'dyn:runpath' file

      A dynamic section uses an element of type DT_NULL to terminate the array found in that section. The final DT_NULL cannot be changed, but if there are more than one of these, elfedit can convert one of them into a runpath element. To test for extra dynamic slots:

      % elfedit -r -e 'dyn:tag DT_NULL' file

  • If your application was linked with the -c option to the linker, then you can use the crle command to alter the configuration file associated with the application and change the settings for LD_LIBRARY_PATH that are applied for that application. This is a pretty good solution, but is limited by its complexity, and by the fact that the person who linked the object needs to have thought ahead far enough to provide for this option. Odds are that they didn't. If they had, they might just as well have set the runpath correctly in the first place, eliminating the need for anything else.

    You can use crle with an application that was not linked with -c, either by setting the LD_CONFIG environment variable, or by modifying the global system configuration file. However, both of these options suffer from the same issues as the LD_LIBRARY_PATH environment variable: They are too coarse grained to be applied to a single application in a targeted way.

  • If none of the above are possible, then you are indeed stuck with LD_LIBRARY_PATH. In this case, the goal should be to minimize the number of applications that see this environment variable. You should never set it in your interactive shell environment (via whatever dot file your shell supports: .profile, .login, .cshrc, .bashrc, etc...). Instead, put it in a wrapper shell script that you use to run the specific program.

    The use of a wrapper script is a pretty safe way to use LD_LIBRARY_PATH, but you should be aware of one limitation of this approach: If the program being wrapped starts any other programs, then those programs will see the LD_LIBRARY_PATH environment variable. Since programs starting other programs is a common Unix technique, this form of leakage can be more common than you might realize.
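
Returning to the first option in the list, the effect of the $ORIGIN token can be illustrated with a short C sketch. This is only a model of the substitution, not the code in the runtime linker. The function name expand_origin is invented for this example, and the caller is assumed to already know the directory that holds the object.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy runpath into out, replacing every "$ORIGIN" with origin_dir.
 * Returns 0 on success, or -1 if out (of size outsz) is too small.
 */
int expand_origin(const char *runpath, const char *origin_dir,
                  char *out, size_t outsz)
{
    static const char token[] = "$ORIGIN";
    const size_t toklen = sizeof(token) - 1;
    size_t used = 0;

    if (outsz == 0)
        return -1;

    while (*runpath != '\0') {
        const char *piece;
        size_t n;

        if (strncmp(runpath, token, toklen) == 0) {
            /* Substitute the directory for the token. */
            piece = origin_dir;
            n = strlen(origin_dir);
            runpath += toklen;
        } else {
            /* Ordinary character: copy it through unchanged. */
            piece = runpath;
            n = 1;
            runpath += 1;
        }

        /* Always leave room for the terminating NUL. */
        if (used + n >= outsz)
            return -1;
        memcpy(out + used, piece, n);
        used += n;
    }

    out[used] = '\0';
    return 0;
}
```

With myapp installed in /opt/myapp/bin (a made-up location), the runpath '$ORIGIN/../lib:/lib:/usr/lib' expands to /opt/myapp/bin/../lib:/lib:/usr/lib. Move the whole tree somewhere else and the same runpath follows it.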



Introducing elfedit: A Tool For Modifying Existing ELF Objects

Back in June, I wrote about changes we've recently made to Solaris ELF objects that allow their runpaths to be modified without having to rebuild the object. In that posting, I alluded to work that I was then doing when I said "Eventually, Solaris will ship with a standard utility for modifying runpaths". I am happy to say that this has come to pass. I recently integrated /usr/bin/elfedit into build 75 of Solaris Nevada with:

PSARC 2007/509 elfedit
6234471 need a way to edit ELF objects
elfedit can indeed modify the runpath in an object, but it is considerably more general than that. elfedit is a tool for examining and modifying the ELF metadata that resides within ELF objects. It can be used as a batch mode tool from shell scripts, makefiles, etc, or as an interactive tool, for examining and exploring objects. elfedit has a modular design, and ships with a set of standard modules for performing common edits. This design makes it easy to add new functionality by adding additional modules.

Prior to elfedit, making these sorts of modifications required the user to write a program, usually in C using libelf. elfedit significantly raises the level of abstraction for this kind of work. Many operations can be done using existing elfedit commands. For those that cannot, it is far easier to write an elfedit module to add the ability than it is to write a standalone program.

We envision elfedit being used to solve the following sorts of problems:

[Small Fixups]
To correct minor issues in a built file that cannot be easily rebuilt, or for which sources are not available.

Probably the most notable such item is the ability to alter the runpath of objects built following the integration of

PSARC 2007/127 Reserved space for editing ELF dynamic sections
6516118 Reserved space needed in ELF dynamic section and string table
The ability to do this is a "Frequently Asked Question" for which there has previously been no good answer. This feature is expected to be used nearly as soon as it is available, to fix the runpaths of FOSS (free open source software) built for Solaris, which often has the wrong runpaths set.

Another common situation is when programmers forget to explicitly add the libraries they depend on to the link line, relying on indirect dependencies to make things work. elfedit can be used to add NEEDED dependencies to an existing object's dynamic section, making the dependencies explicit.
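
Both of these fixups need room in the dynamic section: either an existing element to overwrite, or a spare slot to convert. The following C sketch shows how such a check could look. It is not taken from elfedit. The struct dyn_entry is a simplified stand-in for the 32-bit dynamic entry, the function names are invented, and the tag values are the standard ones from the ELF specification. The space needed in the dynamic string table is a separate question that this sketch does not address.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard dynamic tag values from the ELF specification. */
enum { DT_NULL = 0, DT_RPATH = 15, DT_RUNPATH = 29 };

/* Simplified stand-in for a 32-bit dynamic section entry. */
struct dyn_entry {
    int32_t  d_tag;
    uint32_t d_val;
};

/*
 * Count the DT_NULL entries beyond the one that must remain as the
 * terminator. count is the total number of slots in the section
 * (section size divided by entry size), not the number of used entries.
 */
size_t spare_dyn_slots(const struct dyn_entry *dyn, size_t count)
{
    size_t nulls = 0;

    for (size_t i = 0; i < count; i++)
        if (dyn[i].d_tag == DT_NULL)
            nulls++;

    return nulls > 0 ? nulls - 1 : 0;
}

/* Return 1 if the section already has a runpath style element, else 0. */
int has_runpath(const struct dyn_entry *dyn, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (dyn[i].d_tag == DT_RUNPATH || dyn[i].d_tag == DT_RPATH)
            return 1;
    return 0;
}
```

A runpath can be set if has_runpath() is true or spare_dyn_slots() is non-zero. Adding a new NEEDED entry always consumes a spare slot.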

[Better Way To Support Specialized Rarely Used Features]
As an avenue for delivering small features to change some object attributes without the need to add additional complex and specialized features to ld and ld.so.1.

For example, we have had requests to add a mechanism to ld that would allow the user to override the hardware capability bits that are placed in the object by the compiler. Such a feature would be complex to document, and would burden already complex commands with features that are rarely used. Such features are a natural fit for elfedit. (See the elfedit(1) manpage for an example of modifying the hardware capabilities).

[Linker Development]
We sometimes work on linker features that require objects with new values or flag bits that the compilers do not yet generate. elfedit allows us to set arbitrary values for such items quickly, and without having to write a program.

[Linker Testing]
Many bugs involve an object that is broken in some way. Once the bug is fixed, we need an object broken in that particular way for our test suite. There are several problems that arise:

  • Cataloging and archiving broken objects is time consuming and error prone.

  • Producing similarly broken objects for different platforms is not always possible.

  • As new platforms appear, we end up with coverage gaps where some platforms can do a given test and others cannot.

elfedit gives us the ability to build a simple object, and then break it intentionally in a specific and controlled manner. Tests can then be self contained, requiring no external data, and applicable to all relevant platforms.

elfedit's ability to extract specific bits of data from an object is very useful for object and linker testing.

Every elfedit module contains documentation for the commands it provides. This information is displayed using the built in help command, in a format that is based on that of Solaris manpages. The help strings in the standard elfedit modules supplied with Solaris are internationalized using the same i18n mechanisms employed by the rest of the linker software found under usr/src/cmd/sgs. Hence, all elfedit modules supplied by Sun will have complete documentation, and will support the necessary language locales.

As with any program that changes the contents of an ELF file, changes to an object by elfedit will invalidate any pre-existing elfsign signature. Assuming the changes are understood and acceptable to the signing authority, such objects will need to be signed after the edits are done.

Modular Design And Extensibility

elfedit has a modular design, reflecting our own experience with dynamic linking, and influenced heavily by the design of mdb, the modular debugger.

The elfedit program contains the code that handles the details of reading objects, executing commands to modify them, and saving the results. Very little of the code that performs the actual edits is found in elfedit itself. Rather, the commands exist in modules, which are sharable objects with a well defined elfedit-specific interface. elfedit loads needed modules on demand when a command from the module is executed. These modules are self contained, and include their own documentation in a standard format that elfedit can display using its help command.

The module forms a namespace for the commands that it supplies. Each module delivers a set of commands, focused on related functionality. A command is specified by combining the module and command names with a colon (:) delimiter, with no intervening whitespace. For example, dyn:runpath refers to the runpath command provided by the dyn module.

Module names must be unique. The command names within a given module are unique within that module, but the same command names can be used in more than one module. For example, most modules contain a command named 'dump', which is used to provide an elfdump-style view of the data.
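
To make the naming rule concrete, here is a small C sketch that splits a name such as dyn:runpath into its module and command parts. The function split_command is hypothetical and is not part of elfedit. In particular, it rejects a name that has no colon, which may not match how elfedit itself treats such input.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Split "module:command" into its two parts. Whitespace anywhere in the
 * name is rejected, as is an empty module or command.
 * Returns 0 on success, -1 on a malformed name or a buffer that is too small.
 */
int split_command(const char *spec, char *module, size_t modsz,
                  char *cmd, size_t cmdsz)
{
    const char *colon = strchr(spec, ':');
    size_t modlen, cmdlen;

    if (colon == NULL)
        return -1;

    for (const char *p = spec; *p != '\0'; p++)
        if (isspace((unsigned char)*p))
            return -1;

    modlen = (size_t)(colon - spec);
    cmdlen = strlen(colon + 1);
    if (modlen == 0 || cmdlen == 0 || modlen >= modsz || cmdlen >= cmdsz)
        return -1;

    memcpy(module, spec, modlen);
    module[modlen] = '\0';
    memcpy(cmd, colon + 1, cmdlen + 1);   /* includes the NUL */
    return 0;
}
```

Because the module is part of the name, dyn:dump and ehdr:dump resolve to different commands even though both are called dump.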

We have adopted the following general rules of thumb for naming modules and commands:

  • The module name reflects the part of the ELF format the module addresses (ehdr, phdr, shdr, ...)

  • Commands that directly access a field in an ELF structure are given the name of the field (e.g. ehdr:e_flags).

  • Commands that are higher level have a simple descriptive name that reflects their purpose (e.g. dyn:runpath).

Give 'Em Enough Rope

elfedit is a tool for linker development and testing. As such, it follows the Unix tradition of doing what it's told, without a lot of noise. This is great if you are doing linker research & development, or testing. We commonly need to intentionally set ELF metadata to undefined or even "wrong" values. However, it follows that elfedit won't prevent you from making nonsensical or otherwise incorrect changes to your ELF objects.

For example, X86 objects have little endian byteorder (ELFDATA2LSB):

% file /usr/bin/ls
/usr/bin/ls:    ELF 32-bit LSB executable 80386 Version 1 [FPU],
    dynamically linked, not stripped, no debugging information available
We can change the e_ident[EI_DATA] field in the ELF header from its proper value to ELFDATA2MSB, which reverses the byte order advertised by the program and makes it appear to be big endian:
% elfedit -e 'ehdr:ei_data elfdata2MSB' /usr/bin/ls /tmp/badls
% file /tmp/badls
/tmp/badls:     ELF 32-bit MSB executable 80386 Version 1 [FPU],
    dynamically linked, not stripped, no debugging information available
The file command sees the change that we made. However, we haven't really created a big endian X86 binary by changing what it advertises. We now have a little endian binary that is lying about what it contains. And of course, there is no such thing as big endian X86 hardware, so if we had created such a binary, it wouldn't be runnable anywhere. It should come as no surprise that the system doesn't know what to do with our modified ls binary:
% /tmp/badls
/tmp/badls: cannot execute

This is really nothing to be worried about. If you are using elfedit's low level operations that allow arbitrary changes to individual ELF fields, then you need to know enough about the ELF format to make these changes properly. Most people will use elfedit for the high level operations such as changing runpaths. The high level operations are safe, and do not require expert knowledge to use.

If you are making those low level changes, the Solaris Linkers and Libraries Guide can be very helpful.

Learning More

elfedit is a standard part of the Solaris development branch, the code that will eventually ship from Sun as the next version of Solaris. It is also available as part of OpenSolaris. It is not part of Solaris 10 or earlier Solaris releases. If you are using a recent Solaris distribution, such as Solaris Express Developer Edition, then elfedit should already be present on your system.

The elfedit(1) manpage describes the utility in more detail, and gives three examples that should be of general interest:

  1. Changing runpaths
  2. Changing hardware/software capability bits
  3. Reading specific data, without having to grep the output of elfdump.



Thursday Nov 01, 2007

What Are Fake ELF Section Headers?

I'd like to take a moment to explain an unusual feature we added to elfdump last summer. The -P option tells elfdump to ignore the section headers in the file (if any) and to instead generate a "fake" set from the program headers. So, what are fake section headers, and why would you want them?

Earlier this year, there was an "incident" in which a previously unknown hole in the Solaris telnet daemon was used by a worm. As soon as we got a copy of this worm, we tried to examine it with elfdump to see what we might learn:

% elfdump zoneadmd

ELF Header
  ei_magic:   { 0x7f, E, L, F }
  ei_class:   ELFCLASS32          ei_data:      ELFDATA2LSB
  e_machine:  EM_386              e_version:    EV_CURRENT
  e_type:     ET_EXEC
  e_flags:                     0
  e_entry:             0x80512d4  e_ehsize:     52  e_shstrndx:  0
  e_shoff:                     0  e_shentsize:   0  e_shnum:     0
  e_phoff:                  0x34  e_phentsize:  32  e_phnum:     5

Program Header[0]:
    p_vaddr:      0x8050034   p_flags:    [ PF_X  PF_R ]
    p_paddr:      0           p_type:     [ PT_PHDR ]
    p_filesz:     0xa0        p_memsz:    0xa0
    p_offset:     0x34        p_align:    0

Program Header[1]:
    p_vaddr:      0           p_flags:    [ PF_R ]
    p_paddr:      0           p_type:     [ PT_INTERP ]
    p_filesz:     0x11        p_memsz:    0
    p_offset:     0xd4        p_align:    0

Program Header[2]:
    p_vaddr:      0x8050000   p_flags:    [ PF_X  PF_R ]
    p_paddr:      0           p_type:     [ PT_LOAD ]
    p_filesz:     0x6491      p_memsz:    0x6491
    p_offset:     0           p_align:    0x10000

Program Header[3]:
    p_vaddr:      0x8066494   p_flags:    [ PF_X  PF_W  PF_R ]
    p_paddr:      0           p_type:     [ PT_LOAD ]
    p_filesz:     0x3e0       p_memsz:    0xc10
    p_offset:     0x6494      p_align:    0x10000

Program Header[4]:
    p_vaddr:      0x80665c4   p_flags:    [ PF_X  PF_W  PF_R ]
    p_paddr:      0           p_type:     [ PT_DYNAMIC ]
    p_filesz:     0xd8        p_memsz:    0
    p_offset:     0x65c4      p_align:    0
That's it — everything that elfdump could tell us about this object. We sure didn't learn much from that!

If you look at the ELF header, you'll see that our bad guy has set the e_shnum, e_shoff, and e_shentsize fields to zero. These fields are used to locate the section headers for an ELF object. The section headers in turn contain the information needed to look deeper into an object. Section headers are not used to run a program, only to examine it. Zeroing them is a crude, but effective way to obscure what's inside. ELF objects are just files after all, and anyone with write access can modify them. It's not unheard of to modify an ELF object using a binary capable editor like emacs.
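
The test for this situation is simple. A minimal C sketch follows, using made-up names (shdr_state, classify_shdr_table) and the rules given in the ELF specification rather than anything taken from the elfdump source. A zero e_shoff means there is no section header table at all, while a non-zero e_shoff with a zero e_shnum means the real count is stored in shdr[0].sh_size.

```c
#include <stdint.h>

/* Hypothetical classification of what the ELF header says about sections. */
enum shdr_state {
    SHDR_NONE,      /* no section header table: e_shoff is zero */
    SHDR_NORMAL,    /* e_shnum holds the number of section headers */
    SHDR_EXTENDED   /* count is too large for e_shnum, see shdr[0].sh_size */
};

enum shdr_state classify_shdr_table(uint64_t e_shoff, uint16_t e_shnum)
{
    if (e_shoff == 0)
        return SHDR_NONE;
    if (e_shnum == 0)
        return SHDR_EXTENDED;
    return SHDR_NORMAL;
}
```

The object above has e_shoff and e_shnum both zero, so it falls into the SHDR_NONE case: a tool that relies on section headers has nothing to work with.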

Fortunately, the design of ELF makes it difficult to actually hide what an object calls from other sharable objects. And since the system call stubs are all located in libc, you can't hide the system calls your code makes. Here is one way to look inside:

% ldd -r -e LD_DEBUG=bindings zoneadmd 2>&1 | fgrep "binding file=zoneadmd"
04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `__deregister_frame_info_bases'
04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `__register_frame_info_bases'
04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `_Jv_RegisterClasses'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_environ'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `__iob'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_cleanup'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `atexit'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `__fpstart'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `exit'
04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `__deregister_frame_info_bases'
04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `_Jv_RegisterClasses'
04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `__register_frame_info_bases'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `getenv'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `setsid'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `printf'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fflush'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `signal'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `pthread_create'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `pthread_join'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `malloc'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `pthread_cancel'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `free'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `close'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `snprintf'
04992: binding file=zoneadmd to file=/lib/libnsl.so.1: symbol `inet_addr'
04992: binding file=zoneadmd to file=/lib/libnsl.so.1: symbol `gethostbyname'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `bcopy'
04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `ntohl'
04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `socket'
04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `htons'
04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `htonl'
04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `connect'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `___errno'
04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `getsockopt'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fcntl'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `select'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `write'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `read'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strcpy'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `gettimeofday'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strstr'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `mkstemp'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fdopen'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fopen'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `unlink'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fputs'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fputc'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `lseek'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fclose'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fprintf'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fread'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `putc'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_xstat'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_lxstat'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_fxstat'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_xmknod'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `open'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `mmap'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strrchr'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `nanosleep'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fork'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `dup2'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `pipe'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `execve'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `kill'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `waitpid'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `localtime_r'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `utimes'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strchr'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `sscanf'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strtoul'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `rename'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `chmod'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `execl'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `lockf' 
Unlike some other object systems, ELF inter-object references are always looked up by name at runtime. The runtime linker hashes the name and looks it up on the first reference. So if you want to actually call something outside of your own object, you have to call it by its real name. This information is located via the object's program headers, and unlike the section headers, they need to be reasonably accurate for the object to work.
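
For an object that uses the original SYSV hash section, the hash applied to each name is the one published in the ELF specification. A sketch in C follows. The names sysv_elf_hash and bucket_for are chosen for this example, and the function is written with fixed 32-bit arithmetic rather than the unsigned long of the reference version.

```c
#include <stdint.h>

/* The SYSV ELF hash from the ELF specification, using 32-bit arithmetic. */
uint32_t sysv_elf_hash(const char *name)
{
    uint32_t h = 0;

    for (const unsigned char *p = (const unsigned char *)name; *p != '\0'; p++) {
        uint32_t g;

        h = (h << 4) + *p;
        g = h & 0xf0000000u;
        if (g != 0)
            h ^= g >> 24;   /* fold the top nibble back into the low bits */
        h &= ~g;            /* and clear it */
    }
    return h;
}

/* The hash selects a bucket, which starts a chain of candidate symbols. */
uint32_t bucket_for(uint32_t hash, uint32_t nbucket)
{
    return hash % nbucket;
}
```

For example, sysv_elf_hash("printf") yields 0x077905a6. The lookup then compares the names of the symbols on the selected chain against the requested name, which is why the real name has to be present in the object.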

The above experience led us to consider a new feature for elfdump. What if we started with the program headers, and generated a set of "fake" section headers based on the information they contain? Obviously the information available would be reduced in comparison to the real section headers, because the program headers only contain the information needed to run the object. Nonetheless, it would certainly be better than nothing in the case where the section headers are gone. And what about the case where they are present, but we fear that they have been maliciously modified? The information from the "fake" section headers could be compared to that from the actual section headers.

As a result of this worm episode and the aftermath, I added the -P option to elfdump last July:

PSARC 2007/395 Add -P option to elfdump
6530249 elfdump should handle ELF files with no section header table
With an object that has section headers, fake section headers will not be used unless you explicitly use the -P option. If an object doesn't have any section headers, then elfdump automatically turns on the -P option for you.
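
One building block that any such reconstruction needs is a way to turn a virtual address into a file offset, since the information reachable from the program headers is largely expressed as addresses. The following C sketch shows that step only. It is a guess at the general technique, not the elfdump code, and struct phdr32 is a trimmed, hypothetical version of the 32-bit program header holding just the fields used here.

```c
#include <stddef.h>
#include <stdint.h>

#define PT_LOAD 1   /* standard program header type for loadable segments */

/* Trimmed stand-in for a 32-bit program header. */
struct phdr32 {
    uint32_t p_type;
    uint32_t p_offset;
    uint32_t p_vaddr;
    uint32_t p_filesz;
};

/*
 * Find the PT_LOAD segment whose file-backed range contains vaddr and
 * convert the address to a file offset. Returns 0 on success, -1 if no
 * loadable segment covers the address.
 */
int vaddr_to_offset(const struct phdr32 *ph, size_t n,
                    uint32_t vaddr, uint32_t *off)
{
    for (size_t i = 0; i < n; i++) {
        if (ph[i].p_type != PT_LOAD)
            continue;
        if (vaddr >= ph[i].p_vaddr &&
            vaddr - ph[i].p_vaddr < ph[i].p_filesz) {
            *off = ph[i].p_offset + (vaddr - ph[i].p_vaddr);
            return 0;
        }
    }
    return -1;
}
```

Using the two PT_LOAD headers from the worm shown earlier, the address 0x80665c4 falls in the second segment and maps to offset 0x65c4, which agrees with the p_offset recorded in its PT_DYNAMIC header.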

Let's use the new elfdump with fake section headers to examine the telnet worm. I apologize for the length of this output, but the length underscores the point — there is a lot of information that we can recover from this damaged object:

% elfdump zoneadmd

ELF Header
  ei_magic:   { 0x7f, E, L, F }
  ei_class:   ELFCLASS32          ei_data:      ELFDATA2LSB
  e_machine:  EM_386              e_version:    EV_CURRENT
  e_type:     ET_EXEC
  e_flags:                     0
  e_entry:             0x80512d4  e_ehsize:     52  e_shstrndx:  0
  e_shoff:                     0  e_shentsize:   0  e_shnum:     0  (see shdr[0].sh_size)
  e_phoff:                  0x34  e_phentsize:  32  e_phnum:     5

Program Header[0]:
    p_vaddr:      0x8050034   p_flags:    [ PF_X PF_R ]
    p_paddr:      0           p_type:     [ PT_PHDR ]
    p_filesz:     0xa0        p_memsz:    0xa0
    p_offset:     0x34        p_align:    0

Program Header[1]:
    p_vaddr:      0           p_flags:    [ PF_R ]
    p_paddr:      0           p_type:     [ PT_INTERP ]
    p_filesz:     0x11        p_memsz:    0
    p_offset:     0xd4        p_align:    0

Program Header[2]:
    p_vaddr:      0x8050000   p_flags:    [ PF_X PF_R ]
    p_paddr:      0           p_type:     [ PT_LOAD ]
    p_filesz:     0x6491      p_memsz:    0x6491
    p_offset:     0           p_align:    0x10000

Program Header[3]:
    p_vaddr:      0x8066494   p_flags:    [ PF_X PF_W PF_R ]
    p_paddr:      0           p_type:     [ PT_LOAD ]
    p_filesz:     0x3e0       p_memsz:    0xc10
    p_offset:     0x6494      p_align:    0x10000

Program Header[4]:
    p_vaddr:      0x80665c4   p_flags:    [ PF_X PF_W PF_R ]
    p_paddr:      0           p_type:     [ PT_DYNAMIC ]
    p_filesz:     0xd8        p_memsz:    0
    p_offset:     0x65c4      p_align:    0

Section Header[1]:  sh_name: .dynamic(phdr)
    sh_addr:      0x80665c4       sh_flags:   [ SHF_WRITE SHF_ALLOC ]
    sh_size:      0xd8            sh_type:    [ SHT_DYNAMIC ]
    sh_offset:    0x65c4          sh_entsize: 0x8 (27 entries)
    sh_link:      2               sh_info:    0
    sh_addralign: 0x4       

Section Header[2]:  sh_name: .dynstr(phdr)
    sh_addr:      0x8050890       sh_flags:   [ SHF_ALLOC ]
    sh_size:      0x2da           sh_type:    [ SHT_STRTAB ]
    sh_offset:    0x890           sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x1       

Section Header[3]:  sh_name: .dynsym(phdr)
    sh_addr:      0x8050380       sh_flags:   [ SHF_ALLOC ]
    sh_size:      0x510           sh_type:    [ SHT_DYNSYM ]
    sh_offset:    0x380           sh_entsize: 0x10 (81 entries)
    sh_link:      2               sh_info:    1
    sh_addralign: 0x4       

Section Header[4]:  sh_name: .hash(phdr)
    sh_addr:      0x80500e8       sh_flags:   [ SHF_ALLOC ]
    sh_size:      0x298           sh_type:    [ SHT_HASH ]
    sh_offset:    0xe8            sh_entsize: 0x4 (166 entries)
    sh_link:      3               sh_info:    0
    sh_addralign: 0x4       

Section Header[5]:  sh_name: .SUNW_version(phdr)
    sh_addr:      0x8050b6c       sh_flags:   [ SHF_ALLOC ]
    sh_size:      0xa0            sh_type:    [ SHT_SUNW_verneed ]
    sh_offset:    0xb6c           sh_entsize: 0x1 (160 entries)
    sh_link:      2               sh_info:    5
    sh_addralign: 0x4       

Section Header[6]:  sh_name: .interp(phdr)
    sh_addr:      0x80500d4       sh_flags:   [ SHF_ALLOC ]
    sh_size:      0x11            sh_type:    [ SHT_PROGBITS ]
    sh_offset:    0xd4            sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x1       

Section Header[7]:  sh_name: .rel(phdr)
    sh_addr:      0x8050c0c       sh_flags:   [ SHF_ALLOC ]
    sh_size:      0x258           sh_type:    [ SHT_REL ]
    sh_offset:    0xc0c           sh_entsize: 0x8 (75 entries)
    sh_link:      3               sh_info:    0
    sh_addralign: 0x4       

Interpreter Section:  .interp(phdr)
	/usr/lib/ld.so.1

Version Needed Section:  .SUNW_version(phdr)
            file                        version
            libnsl.so.1                 SUNW_0.7             
            libsocket.so.1              SUNW_0.7             
            librt.so.1                  SUNW_1.2             
            libpthread.so.1             SUNW_1.2             
            libc.so.1                   SUNW_1.1             

Symbol Table Section:  .dynsym(phdr)
     index    value      size      type bind oth ver shndx          name
       [0]  0x00000000 0x00000000  NOTY LOCL  D    0 UNDEF          
       [1]  0x08067088 0x00000004  OBJT WEAK  D    0 22             environ
       [2]  0x080511f4 0x00000000  FUNC GLOB  D    0 UNDEF          dup2
       [3]  0x00000000 0x00000000  NOTY WEAK  D    0 UNDEF          _Jv_RegisterClasses
       [4]  0x080510a4 0x00000000  FUNC GLOB  D    0 UNDEF          strstr
       [5]  0x08050f64 0x00000000  FUNC GLOB  D    0 UNDEF          pthread_cancel
       [6]  0x080511a4 0x00000000  FUNC GLOB  D    0 UNDEF          open
       [7]  0x080511d4 0x00000000  FUNC GLOB  D    0 UNDEF          nanosleep
       [8]  0x08051244 0x00000000  FUNC GLOB  D    0 UNDEF          localtime_r
       [9]  0x080665c4 0x00000000  OBJT GLOB  D    0 15             _DYNAMIC
      [10]  0x08050ea4 0x00000000  FUNC GLOB  D    0 UNDEF          exit
      [11]  0x08051174 0x00000000  FUNC GLOB  D    0 UNDEF          _lxstat
      [12]  0x080670a4 0x00000000  OBJT GLOB  D    0 22             _end
      [13]  0x08050fd4 0x00000000  FUNC GLOB  D    0 UNDEF          ntohl
      [14]  0x080510e4 0x00000000  FUNC GLOB  D    0 UNDEF          unlink
      [15]  0x08051114 0x00000000  FUNC GLOB  D    0 UNDEF          lseek
      [16]  0x08050ee4 0x00000000  FUNC GLOB  D    0 UNDEF          getenv
      [17]  0x08051234 0x00000000  FUNC GLOB  D    0 UNDEF          waitpid
      [18]  0x080510d4 0x00000000  FUNC GLOB  D    0 UNDEF          fopen
      [19]  0x08051164 0x00000000  FUNC GLOB  D    0 UNDEF          _xstat
      [20]  0x08051074 0x00000000  FUNC GLOB  D    0 UNDEF          read
      [21]  0x080511e4 0x00000000  FUNC GLOB  D    0 UNDEF          fork
      [22]  0x08051094 0x00000000  FUNC GLOB  D    0 UNDEF          gettimeofday
      [23]  0x08050fe4 0x00000000  FUNC GLOB  D    0 UNDEF          socket
      [24]  0x00000000 0x00000000  NOTY WEAK  D    0 UNDEF          __deregister_frame_info_bases
      [25]  0x08051294 0x00000000  FUNC GLOB  D    0 UNDEF          rename
      [26]  0x08050f54 0x00000000  FUNC GLOB  D    0 UNDEF          malloc
      [27]  0x080511b4 0x00000000  FUNC GLOB  D    0 UNDEF          mmap
      [28]  0x08050f94 0x00000000  FUNC GLOB  D    0 UNDEF          snprintf
      [29]  0x08051284 0x00000000  FUNC GLOB  D    0 UNDEF          strtoul
      [30]  0x08051264 0x00000000  FUNC GLOB  D    0 UNDEF          strchr
      [31]  0x08051274 0x00000000  FUNC GLOB  D    0 UNDEF          sscanf
      [32]  0x08051224 0x00000000  FUNC GLOB  D    0 UNDEF          kill
      [33]  0x08051254 0x00000000  FUNC GLOB  D    0 UNDEF          utimes
      [34]  0x08051184 0x00000000  FUNC GLOB  D    0 UNDEF          _fxstat
      [35]  0x08050e94 0x00000000  FUNC GLOB  D    0 UNDEF          __fpstart
      [36]  0x08051124 0x00000000  FUNC GLOB  D    0 UNDEF          fclose
      [37]  0x080510b4 0x00000000  FUNC GLOB  D    0 UNDEF          mkstemp
      [38]  0x08066494 0x00000000  OBJT GLOB  D    0 14             _GLOBAL_OFFSET_TABLE_
      [39]  0x080512b4 0x00000000  FUNC GLOB  D    0 UNDEF          execl
      [40]  0x08051214 0x00000000  FUNC GLOB  D    0 UNDEF          execve
      [41]  0x08051144 0x00000000  FUNC GLOB  D    0 UNDEF          fread
      [42]  0x08050e74 0x00000000  FUNC WEAK  D    0 UNDEF          _cleanup
      [43]  0x08050f24 0x00000000  FUNC GLOB  D    0 UNDEF          signal
      [44]  0x08051064 0x00000000  FUNC GLOB  D    0 UNDEF          write
      [45]  0x080510c4 0x00000000  FUNC GLOB  D    0 UNDEF          fdopen
      [46]  0x08050e64 0x00000000  OBJT GLOB  D    0 9              _PROCEDURE_LINKAGE_TABLE_
      [47]  0x08051154 0x00000000  FUNC GLOB  D    0 UNDEF          putc
      [48]  0x08056491 0x00000000  OBJT GLOB  D    0 13             _etext
      [49]  0x08050ef4 0x00000000  FUNC GLOB  D    0 UNDEF          setsid
      [50]  0x080512a4 0x00000000  FUNC GLOB  D    0 UNDEF          chmod
      [51]  0x08051194 0x00000000  FUNC GLOB  D    0 UNDEF          _xmknod
      [52]  0x08066874 0x00000000  OBJT GLOB  D    0 21             _edata
      [53]  0x08051054 0x00000000  FUNC GLOB  D    0 UNDEF          select
      [54]  0x080668a0 0x000003c0  OBJT WEAK  D    0 22             _iob
      [55]  0x08051014 0x00000000  FUNC GLOB  D    0 UNDEF          connect
      [56]  0x08050fb4 0x00000000  FUNC GLOB  D    0 UNDEF          gethostbyname
      [57]  0x080511c4 0x00000000  FUNC GLOB  D    0 UNDEF          strrchr
      [58]  0x080668a0 0x000003c0  OBJT GLOB  D    0 22             __iob
      [59]  0x08050f14 0x00000000  FUNC GLOB  D    0 UNDEF          fflush
      [60]  0x08051034 0x00000000  FUNC GLOB  D    0 UNDEF          getsockopt
      [61]  0x08051044 0x00000000  FUNC GLOB  D    0 UNDEF          fcntl
      [62]  0x080512c4 0x00000000  FUNC GLOB  D    0 UNDEF          lockf
      [63]  0x08050fa4 0x00000000  FUNC GLOB  D    0 UNDEF          inet_addr
      [64]  0x08050f44 0x00000000  FUNC GLOB  D    0 UNDEF          pthread_join
      [65]  0x08051024 0x00000000  FUNC GLOB  D    0 UNDEF          ___errno
      [66]  0x08051104 0x00000000  FUNC GLOB  D    0 UNDEF          fputc
      [67]  0x08050e84 0x00000000  FUNC GLOB  D    0 UNDEF          atexit
      [68]  0x08050f04 0x00000000  FUNC GLOB  D    0 UNDEF          printf
      [69]  0x00000000 0x00000000  NOTY WEAK  D    0 UNDEF          __register_frame_info_bases
      [70]  0x08051204 0x00000000  FUNC GLOB  D    0 UNDEF          pipe
      [71]  0x08051004 0x00000000  FUNC GLOB  D    0 UNDEF          htonl
      [72]  0x08050fc4 0x00000000  FUNC GLOB  D    0 UNDEF          bcopy
      [73]  0x08050ff4 0x00000000  FUNC GLOB  D    0 UNDEF          htons
      [74]  0x08050f34 0x00000000  FUNC GLOB  D    0 UNDEF          pthread_create
      [75]  0x08067088 0x00000004  OBJT GLOB  D    0 22             _environ
      [76]  0x08050f84 0x00000000  FUNC GLOB  D    0 UNDEF          close
      [77]  0x08050f74 0x00000000  FUNC GLOB  D    0 UNDEF          free
      [78]  0x080510f4 0x00000000  FUNC GLOB  D    0 UNDEF          fputs
      [79]  0x08051134 0x00000000  FUNC GLOB  D    0 UNDEF          fprintf
      [80]  0x08051084 0x00000000  FUNC GLOB  D    0 UNDEF          strcpy

Hash Section:  .hash(phdr)
    bucket  symndx      name
         0  [1]         environ
            [2]         dup2
         1  [3]         _Jv_RegisterClasses
            [4]         strstr
         2  [5]         pthread_cancel
            [6]         open
            [7]         nanosleep
            [8]         localtime_r
         3  [9]         _DYNAMIC
         4  [10]        exit
         6  [11]        _lxstat
        10  [12]        _end
        15  [13]        ntohl
            [14]        unlink
        18  [15]        lseek
            [16]        getenv
        19  [17]        waitpid
        20  [18]        fopen
        21  [19]        _xstat
            [20]        read
        22  [21]        fork
        23  [22]        gettimeofday
            [23]        socket
            [24]        __deregister_frame_info_bases
        24  [25]        rename
            [26]        malloc
        27  [27]        mmap
        28  [28]        snprintf
        29  [29]        strtoul
            [30]        strchr
        30  [31]        sscanf
            [32]        kill
        31  [33]        utimes
        33  [34]        _fxstat
            [35]        __fpstart
            [36]        fclose
        34  [37]        mkstemp
        35  [38]        _GLOBAL_OFFSET_TABLE_
        38  [39]        execl
        39  [40]        execve
            [41]        fread
        41  [42]        _cleanup
        43  [43]        signal
            [44]        write
        44  [45]        fdopen
        46  [46]        _PROCEDURE_LINKAGE_TABLE_
            [47]        putc
            [48]        _etext
        47  [49]        setsid
        48  [50]        chmod
        49  [51]        _xmknod
        52  [52]        _edata
            [53]        select
            [54]        _iob
        53  [55]        connect
        55  [56]        gethostbyname
            [57]        strrchr
        59  [58]        __iob
            [59]        fflush
        62  [60]        getsockopt
            [61]        fcntl
            [62]        lockf
        63  [63]        inet_addr
            [64]        pthread_join
            [65]        ___errno
        64  [66]        fputc
            [67]        atexit
        65  [68]        printf
        66  [69]        __register_frame_info_bases
            [70]        pipe
            [71]        htonl
        71  [72]        bcopy
        73  [73]        htons
        76  [74]        pthread_create
        77  [75]        _environ
            [76]        close
        78  [77]        free
        80  [78]        fputs
            [79]        fprintf
        81  [80]        strcpy

        35  buckets contain        0 symbols
        25  buckets contain        1 symbols
        15  buckets contain        2 symbols
         7  buckets contain        3 symbols
         1  buckets contain        4 symbols
        83  buckets               80 symbols (globals)

Relocation Section:  .rel(phdr)
    type                       offset             section        symbol
  R_386_GLOB_DAT            0x80664b0             .rel(phdr)     __deregister_frame_info_bases
  R_386_GLOB_DAT            0x80664b8             .rel(phdr)     __register_frame_info_bases
  R_386_GLOB_DAT            0x80664bc             .rel(phdr)     _Jv_RegisterClasses
  R_386_COPY                0x8067088             .rel(phdr)     _environ
  R_386_COPY                0x80668a0             .rel(phdr)     __iob
  R_386_JMP_SLOT            0x80664a0             .rel(phdr)     _cleanup
  R_386_JMP_SLOT            0x80664a4             .rel(phdr)     atexit
  R_386_JMP_SLOT            0x80664a8             .rel(phdr)     __fpstart
  R_386_JMP_SLOT            0x80664ac             .rel(phdr)     exit
  R_386_JMP_SLOT            0x80664b4             .rel(phdr)     __deregister_frame_info_bases
  R_386_JMP_SLOT            0x80664c0             .rel(phdr)     _Jv_RegisterClasses
  R_386_JMP_SLOT            0x80664c4             .rel(phdr)     __register_frame_info_bases
  R_386_JMP_SLOT            0x80664c8             .rel(phdr)     getenv
  R_386_JMP_SLOT            0x80664cc             .rel(phdr)     setsid
  R_386_JMP_SLOT            0x80664d0             .rel(phdr)     printf
  R_386_JMP_SLOT            0x80664d4             .rel(phdr)     fflush
  R_386_JMP_SLOT            0x80664d8             .rel(phdr)     signal
  R_386_JMP_SLOT            0x80664dc             .rel(phdr)     pthread_create
  R_386_JMP_SLOT            0x80664e0             .rel(phdr)     pthread_join
  R_386_JMP_SLOT            0x80664e4             .rel(phdr)     malloc
  R_386_JMP_SLOT            0x80664e8             .rel(phdr)     pthread_cancel
  R_386_JMP_SLOT            0x80664ec             .rel(phdr)     free
  R_386_JMP_SLOT            0x80664f0             .rel(phdr)     close
  R_386_JMP_SLOT            0x80664f4             .rel(phdr)     snprintf
  R_386_JMP_SLOT            0x80664f8             .rel(phdr)     inet_addr
  R_386_JMP_SLOT            0x80664fc             .rel(phdr)     gethostbyname
  R_386_JMP_SLOT            0x8066500             .rel(phdr)     bcopy
  R_386_JMP_SLOT            0x8066504             .rel(phdr)     ntohl
  R_386_JMP_SLOT            0x8066508             .rel(phdr)     socket
  R_386_JMP_SLOT            0x806650c             .rel(phdr)     htons
  R_386_JMP_SLOT            0x8066510             .rel(phdr)     htonl
  R_386_JMP_SLOT            0x8066514             .rel(phdr)     connect
  R_386_JMP_SLOT            0x8066518             .rel(phdr)     ___errno
  R_386_JMP_SLOT            0x806651c             .rel(phdr)     getsockopt
  R_386_JMP_SLOT            0x8066520             .rel(phdr)     fcntl
  R_386_JMP_SLOT            0x8066524             .rel(phdr)     select
  R_386_JMP_SLOT            0x8066528             .rel(phdr)     write
  R_386_JMP_SLOT            0x806652c             .rel(phdr)     read
  R_386_JMP_SLOT            0x8066530             .rel(phdr)     strcpy
  R_386_JMP_SLOT            0x8066534             .rel(phdr)     gettimeofday
  R_386_JMP_SLOT            0x8066538             .rel(phdr)     strstr
  R_386_JMP_SLOT            0x806653c             .rel(phdr)     mkstemp
  R_386_JMP_SLOT            0x8066540             .rel(phdr)     fdopen
  R_386_JMP_SLOT            0x8066544             .rel(phdr)     fopen
  R_386_JMP_SLOT            0x8066548             .rel(phdr)     unlink
  R_386_JMP_SLOT            0x806654c             .rel(phdr)     fputs
  R_386_JMP_SLOT            0x8066550             .rel(phdr)     fputc
  R_386_JMP_SLOT            0x8066554             .rel(phdr)     lseek
  R_386_JMP_SLOT            0x8066558             .rel(phdr)     fclose
  R_386_JMP_SLOT            0x806655c             .rel(phdr)     fprintf
  R_386_JMP_SLOT            0x8066560             .rel(phdr)     fread
  R_386_JMP_SLOT            0x8066564             .rel(phdr)     putc
  R_386_JMP_SLOT            0x8066568             .rel(phdr)     _xstat
  R_386_JMP_SLOT            0x806656c             .rel(phdr)     _lxstat
  R_386_JMP_SLOT            0x8066570             .rel(phdr)     _fxstat
  R_386_JMP_SLOT            0x8066574             .rel(phdr)     _xmknod
  R_386_JMP_SLOT            0x8066578             .rel(phdr)     open
  R_386_JMP_SLOT            0x806657c             .rel(phdr)     mmap
  R_386_JMP_SLOT            0x8066580             .rel(phdr)     strrchr
  R_386_JMP_SLOT            0x8066584             .rel(phdr)     nanosleep
  R_386_JMP_SLOT            0x8066588             .rel(phdr)     fork
  R_386_JMP_SLOT            0x806658c             .rel(phdr)     dup2
  R_386_JMP_SLOT            0x8066590             .rel(phdr)     pipe
  R_386_JMP_SLOT            0x8066594             .rel(phdr)     execve
  R_386_JMP_SLOT            0x8066598             .rel(phdr)     kill
  R_386_JMP_SLOT            0x806659c             .rel(phdr)     waitpid
  R_386_JMP_SLOT            0x80665a0             .rel(phdr)     localtime_r
  R_386_JMP_SLOT            0x80665a4             .rel(phdr)     utimes
  R_386_JMP_SLOT            0x80665a8             .rel(phdr)     strchr
  R_386_JMP_SLOT            0x80665ac             .rel(phdr)     sscanf
  R_386_JMP_SLOT            0x80665b0             .rel(phdr)     strtoul
  R_386_JMP_SLOT            0x80665b4             .rel(phdr)     rename
  R_386_JMP_SLOT            0x80665b8             .rel(phdr)     chmod
  R_386_JMP_SLOT            0x80665bc             .rel(phdr)     execl
  R_386_JMP_SLOT            0x80665c0             .rel(phdr)     lockf

Dynamic Section:  .dynamic(phdr)
     index  tag                value
       [0]  NEEDED            0x27f               libnsl.so.1
       [1]  NEEDED            0x294               libsocket.so.1
       [2]  NEEDED            0x2a3               librt.so.1
       [3]  NEEDED            0x2b7               libpthread.so.1
       [4]  NEEDED            0x2c7               libc.so.1
       [5]  INIT              0x80538fc           
       [6]  FINI              0x8053909           
       [7]  HASH              0x80500e8           
       [8]  STRTAB            0x8050890           
       [9]  STRSZ             0x2da               
      [10]  SYMTAB            0x8050380           
      [11]  SYMENT            0x10                
      [12]  CHECKSUM          0x3126              
      [13]  VERNEED           0x8050b6c           
      [14]  VERNEEDNUM        0x5                 
      [15]  PLTRELSZ          0x230               
      [16]  PLTREL            0x11                
      [17]  JMPREL            0x8050c34           
      [18]  REL               0x8050c0c           
      [19]  RELSZ             0x258               
      [20]  RELENT            0x8                 
      [21]  DEBUG             0                   
      [22]  FEATURE_1         0x1                 [ PARINIT ]
      [23]  FLAGS             0                   0
      [24]  FLAGS_1           0                   0
      [25]  PLTGOT            0x8066494           
      [26]  NULL              0                   
I don't expect this feature to get much daily use, but it will be handy to have it in the forensic toolchest next time we're scrambling to understand a damaged or malicious object.



Tuesday Jun 12, 2007

Changing ELF Runpaths (Code Included)

A recent change to Solaris ELF files makes it possible to change the runpath of a dynamic executable or sharable object, something that has not been safely possible up until now. This change is currently found in Solaris Nevada (the current development version of Solaris) and in OpenSolaris. It is not yet available in Solaris 10, but in time will appear in the standard shipping Solaris as well.

This seems like a good time to talk about runpaths and the business of how the runtime linker finds dependencies. I also provide a small program named rpath that you can use to modify the runpaths in your file (assuming they were linked under Nevada or OpenSolaris).

The Runpath Problem

The runtime linker looks in the following places, in the order listed, to find the sharable objects it loads into a process at startup time:

  • If LD_LIBRARY_PATH (or the related LD_LIBRARY_PATH_32 and LD_LIBRARY_PATH_64) environment variables are defined, the directories they specify are searched to resolve any remaining dependencies.

  • If the executable, or any sharable objects that are loaded, contain a runpath, the directories it specifies are searched to resolve dependencies for those objects.

  • Finally, it searches two default locations for any remaining dependencies: /lib and /usr/lib (or /lib/64 and /usr/lib/64 for 64-bit code).
As if this weren't complicated enough, the crle command can be used to set values for LD_LIBRARY_PATH and for the default directories.

The above scheme offers a great deal of flexibility, and it usually works well. There is however one notable exception — the "Runpath Problem". The problem is that many objects are not built with a correct runpath, and once an object has been built, it has not been possible to change it. It is common to find objects where the runpath is correct on the system the object was built on, but not on the system where it is installed. Usually, we deal with this all too common situation by setting LD_LIBRARY_PATH, or by creating a linker configuration file with crle. Such solutions have serious downsides, as detailed in an earlier blog entry by Rod Evans entitled "LD_LIBRARY_PATH - just say no".

Both approaches will cause unrelated programs to look in unnecessary additional directories for their dependencies. At best, this imposes unnecessary overhead on their operation. At worst, they may end up binding to the wrong version of a given library, leading to mysterious and hard to debug failures. The environment variable approach is simply too broad.

One important technique that people sometimes use is to set the environment variables in a wrapper shell script that may look something like:

#!/bin/sh
#
# Run myapp, setting LD_LIBRARY_PATH so it will run

LD_LIBRARY_PATH="/this/that/theother:/someplace/else"
export LD_LIBRARY_PATH

exec /usr/local/myapp
This is a huge improvement over simply setting LD_CONFIG or LD_LIBRARY_PATH in your shell login config script (.profile, .cshrc, .bashrc, etc), for many reasons:
  • Reduces the scope of influence to only cover the application (and its children — see below)
  • Doesn't require each user to modify their login script(s)
  • Can be managed in a central location
It isn't perfect though. If the program in question should happen to run any child processes (and this is more common than many realize), those child processes will inherit the LD_CONFIG and LD_LIBRARY_PATH settings you've established for this one program. This leak may, or may not, cause problems depending on what programs are run.

It would be far better to modify the object in question and set a runpath that accurately reflects the actual location of its dependencies. The effect of a runpath is limited to the file that contains it, so this solution does not "bleed through" to unrelated files, and it imposes no unnecessary overhead on the general operation of the system. This would be a superior solution if it were possible. However it hasn't been an option until recently.

How Runpaths Are Implemented

Every dynamic executable contains a dynamic section. This is an array of items which convey the information required by the dynamic linker (ld.so.1) to do its work. If an object has a runpath, there will be a DT_RUNPATH and/or DT_RPATH item in the dynamic section (there is more than one of these for historical reasons). As an example, let's examine crle:
% elfdump -d /usr/bin/crle | grep 'R*PATH'
       [4]  RUNPATH           0x612               $ORIGIN/../lib
       [5]  RPATH             0x612               $ORIGIN/../lib
The string (in this case, "$ORIGIN/../lib") is not actually stored in the dynamic section. Rather, it is contained in the dynamic string table (.dynstr). The value 0x612 is the offset within the string table at which the desired string starts.
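The lookup described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: strtab_get is a hypothetical helper, not part of any Solaris interface, and it assumes the string table's bytes are already in memory. Real code would more likely go through libelf (for example elf_strptr()).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the string that starts at offset `off`
 * in a string table of `strtab_size` bytes. Returns NULL if the offset
 * is out of range, or if no NUL terminator is found before the end of
 * the table.
 */
const char *
strtab_get(const char *strtab, size_t strtab_size, size_t off)
{
        size_t i;

        assert(strtab != NULL);

        if (off >= strtab_size)
                return NULL;

        /* Make sure the string is terminated inside the table. */
        for (i = off; i < strtab_size; i++)
                if (strtab[i] == '\0')
                        return strtab + off;

        return NULL;
}
```

Given the value from a DT_RUNPATH entry (0x612 in the crle example), a call such as strtab_get(dynstr, dynstr_size, 0x612) would return a pointer to the runpath string, where dynstr and dynstr_size stand for the loaded .dynstr data and its size.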

A string table is a section that contains NULL terminated strings, one immediately following the other. To access a given string, you add the offset of the string within the section to the base of the section data area. Consider a string table that contains the names of two variables, "var1" and "var2", and a runpath, "$ORIGIN/../lib". By ELF convention, string tables always have a 0-length NULL terminated string in the first position. In C language notation, we might declare the contents of the resulting string table section containing these 4 strings as

"\0var1\0var2\0$ORIGIN/../lib"
The indexes of the 4 strings in our table are [0], [1], [6], and [11], and any item in the dynamic section or the dynamic symbol table that needs one of these strings will specify it using the appropriate index. An interesting result of the way that string tables are designed is that every single offset into a string table represents a usable string. Although our intent with the C string above was to represent 4 strings, it actually contains 23 potential strings (26 if you count the duplicate NULL strings), and not just the 4 we intentionally inserted. Listing them by offset, they are:
[0]  ""

[1]  "var1"
[2]  "ar1"
[3]  "r1"
[4]  "1"
[5]  ""

[6]  "var2"
[7]  "ar2"
[8]  "r2"
[9]  "2"
[10] ""

[11] "$ORIGIN/../lib"
[12] "ORIGIN/../lib"
[13] "RIGIN/../lib"
[14] "IGIN/../lib"
[15] "GIN/../lib"
[16] "IN/../lib"
[17] "N/../lib"
[18] "/../lib"
[19] "../lib"
[20] "./lib"
[21] "/lib"
[22] "lib"
[23] "ib"
[24] "b"
[25] ""
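The counts quoted above (26 offsets, 23 distinct strings) can be checked mechanically. The following is a small sketch, assuming the table is held in memory and ends with a NUL byte. The function name strtab_distinct is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: count the distinct strings reachable from a
 * string table of `size` bytes. Every offset starts a string, so the
 * total number of strings is simply `size`. A string is counted as
 * distinct if no lower offset yields the same text.
 */
size_t
strtab_distinct(const char *tab, size_t size)
{
        size_t i, j, distinct = 0;

        /* The table must be NUL terminated for strcmp() to be safe. */
        assert(tab != NULL && size > 0 && tab[size - 1] == '\0');

        for (i = 0; i < size; i++) {
                for (j = 0; j < i; j++)
                        if (strcmp(tab + i, tab + j) == 0)
                                break;
                if (j == i)             /* not seen at a lower offset */
                        distinct++;
        }
        return distinct;
}
```

For the example table, which occupies 26 bytes including the final terminator, this returns 23: the three extra empty strings at offsets 5, 10, and 25 duplicate the one at offset 0.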
This is a very efficient scheme, since each string can appear once in the string table, and multiple ELF items can refer to it. Also, it allows fixed size things, like ELF symbols or dynamic section entries, to efficiently reference variable length strings. There are two things to note, however:
  1. For a given string, there is no way to tell if it is referenced, where it is referenced from, or how many references there are.
  2. There is no room to add new strings to a string table.

The options for modifying a runpath in this situation are limited:

  • Any string already in the string table, as with the 23 options listed in our example above, can be safely set as a runpath, by simply changing the offset in the runpath DT entries. Note that most string table strings are variable and file (not directory) names that are not likely to make useful path strings. This option is unlikely to help.

  • You might overwrite the existing path string with a new string of equal or shorter length in the (usually true) belief that nothing else is accessing that particular string. It is a simple matter using a binary aware editor to locate and overwrite the existing string. This usually works, but if there is another part of the file accessing that string, this change will break it. We cannot recommend or stand behind this, even though we have done it ourselves for one-off experiments (never in a shipping product).

  • A better approach would be to add a new string to the end of the section, and then change the offset in the dynamic section to use it. Traditionally, ELF files have not had any extra room in the string table section to allow this.

As a result, it has not been possible to support the modification of the runpath in an existing object up until recently.

Making Room

I recently integrated a change to Solaris Nevada (and OpenSolaris) to add a little unused space to our ELF files, in order to facilitate a limited amount of post-link modification:
PSARC 2007/127 Reserved space for editing ELF dynamic sections
6516118 Reserved space needed in ELF dynamic section and
        string table
This change does three things:
  1. Adds some extra NULL bytes to the end of every dynamic string table. (The current value is 512 bytes, but this can change in the future).

  2. Adds a new dynamic section entry named DT_SUNW_STRPAD to keep track of the size of the unused space at the end of the dynamic string table.

  3. Adds some extra (currently 10) unused DT_NULL entries at the end of the dynamic section.
This additional space is small enough that it doesn't increase the size of real world objects by a significant amount. Though small, it gives us a lot of new flexibility. The room in the string table allows for the safe addition of a moderate number of new strings. The additional null DT entries allow us to add a DT_RUNPATH item if the file doesn't already have one to modify. Looking at crle again:
% elfdump -d /usr/bin/crle | egrep 'R*PATH|STRPAD'
       [4]  RUNPATH           0x612               $ORIGIN/../lib
       [5]  RPATH             0x612               $ORIGIN/../lib
      [32]  SUNW_STRPAD       0x200               
The SUNW_STRPAD entry tells us that the dynamic string table has 512 (0x200) bytes of unused space available at the end of its data area.

The way this works is very simple: If a file lacks a DT_SUNW_STRPAD dynamic entry, then we know that it is an older file, and that the dynamic string table does not have any extra space. If it does have a DT_SUNW_STRPAD, then its value tells us how much room is available. In this case, we can add the string, modify the DT_RUNPATH items, and reduce the DT_SUNW_STRPAD value by the number of bytes we used.

If the value in DT_SUNW_STRPAD is too small for our new string, then we are out of luck and cannot add it. This extra room should help in the vast majority of cases, but as with any such approach, there are limits. We recommend the use of the special $ORIGIN token, both because it is a great way to organize objects, and because it is short.
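The bookkeeping described above can be sketched as follows. This is not the code from rpath. The struct strtab type and the strtab_insert function are hypothetical names for this illustration, and the sketch works on an in-memory copy of the string table rather than on an ELF file. It reuses an existing copy of the string when one is present, and otherwise consumes reserved space and lowers the pad count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of a dynamic string table. */
struct strtab {
        char   *data;   /* section data, including the reserved bytes */
        size_t  size;   /* total size of data, reserved bytes included */
        size_t  pad;    /* unused bytes at the end (DT_SUNW_STRPAD value) */
};

/*
 * Return the offset of `s` in the table, adding it to the reserved
 * space if it is not already present. Returns (size_t)-1 if the
 * reserved space is too small.
 */
size_t
strtab_insert(struct strtab *st, const char *s)
{
        size_t len = strlen(s) + 1;             /* include the terminator */
        size_t used, off;

        assert(st != NULL && st->pad <= st->size);
        used = st->size - st->pad;

        /* Any offset in the used area is a valid string, so search it. */
        for (off = 0; off + len <= used; off++)
                if (memcmp(st->data + off, s, len) == 0)
                        return off;

        if (len > st->pad)
                return (size_t)-1;              /* not enough room */

        memcpy(st->data + used, s, len);
        st->pad -= len;                         /* new DT_SUNW_STRPAD value */
        return used;
}
```

A caller would store the returned offset in the DT_RUNPATH and DT_RPATH entries and write the reduced pad value back to DT_SUNW_STRPAD. With 512 bytes of reserved space, inserting a 17 character string leaves 512 - 18 = 494 (0x1ee) bytes.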

The rpath Utility

Eventually, Solaris will ship with a standard utility for modifying runpaths. However, there is no need to wait. I have written an unofficial test program I call 'rpath' that you can download and build. To build rpath, you will need a version of Solaris Nevada newer than build 61, or a recent version of OpenSolaris. To check your system, try:
% grep DT_SUNW_STRPAD /usr/include/sys/link.h
#define DT_SUNW_STRPAD  0x60000019 /* # of unused bytes at the */
If your grep doesn't find DT_SUNW_STRPAD, your system lacks the necessary support.

To build rpath, unpack the compressed tar file and type 'make'. If you are using gcc, first edit the Makefile and uncomment the CC line:

% gunzip < rpath.tgz | tar xvpf -
% cd rpath
% make
rpath is used as follows:
NAME
     rpath - set/get runpath of ELF dynamic objects

SYNOPSIS
     rpath [-dr] file [runpath]

DESCRIPTION
     rpath can display,  modify,  or  delete  the  runpath  of  a
     dynamic ELF object.

     If called without a runpath  argument  and  without  the  -r
     option,  the  current runpath, if any, is written to stdout.
     If -r is specified, the existing runpath is removed. If run-
     path  is  supplied,  the runpath of the object is set to the
     new value.


OPTIONS
     The following options are supported:

     -d  Cause detailed ELF information about the  ELF  file  and
         the changes being made to it to be written to stderr.

     -r  Instead of adding or modifying the file  runpath,  rpath
         removes  any  DT_RPATH  or  DT_RUNPATH  entries from the
         dynamic section of  the  file.  This  action  completely
         removes  any existing runpath from the file. When this option is
         used, rpath does not allow the runpath argument.

Using rpath

Let's use rpath to look at its own runpath. We will see that it doesn't have one, something that can be verified using elfdump:
% rpath rpath
% elfdump -d rpath | egrep 'R*PATH|STRPAD'
      [28]  SUNW_STRPAD       0x200               
Now, let's add a runpath to it:
% rpath rpath pointless:runpath
% rpath rpath
pointless:runpath
% elfdump -d rpath | egrep 'R*PATH|STRPAD'
      [28]  SUNW_STRPAD       0x1ee               
      [30]  RUNPATH           0x33f      pointless:runpath
Notice that the amount of unused space reported by SUNW_STRPAD has gone down from 512 (0x200) to 494 (0x1ee) bytes, a reduction of 18 bytes. This makes sense, since we added a 17 character string, and we must add a NULL termination.

We can observe the runtime linker looking in 'pointless' and 'runpath' as it loads rpath (note: output is edited for width):

% LD_DEBUG=libs ./rpath 
13707: 
13707: hardware capabilities - 0x25ff7  [ AHF SSE3 SSE2 
       SSE FXSR AMD_3DNowx AMD_3DNow AMD_MMX MMX CMOV
       AMD_SYSC CX8 TSC FPU ]
13707: 
13707: 
13707: configuration file=/var/ld/ld.config: unable to
       process file
13707: 
13707: 
13707: find object=libelf.so.1; searching
13707:  search path=pointless:runpath  (RUNPATH/RPATH
                                        from file rpath)
13707:  trying path=pointless/libelf.so.1
13707:  trying path=runpath/libelf.so.1
13707:  search path=/lib  (default)
13707:  search path=/usr/lib  (default)
13707:  trying path=/lib/libelf.so.1
13707: 
13707: find object=libc.so.1; searching
13707:  search path=pointless:runpath  (RUNPATH/RPATH from
                                        file rpath)
13707:  trying path=pointless/libc.so.1
13707:  trying path=runpath/libc.so.1
13707:  search path=/lib  (default)
13707:  search path=/usr/lib  (default)
13707:  trying path=/lib/libc.so.1
13707: 
13707: find object=libc.so.1; searching
13707:  search path=/lib  (default)
13707:  search path=/usr/lib  (default)
13707:  trying path=/lib/libc.so.1
13707: 
13707: 1: 
13707: 1: transferring control: rpath
13707: 1: 
usage: rpath [-dr] file [runpath]
13707: 1: 
Finally, we'll remove the runpath we just added:
% rpath -r rpath
% rpath rpath
% elfdump -d rpath | egrep 'R*PATH|STRPAD'
      [28]  SUNW_STRPAD       0x1ee               
Note that even though the runpath is gone, the amount of available extra space in the dynamic string section did not go back up from 494 (0x1ee) to 512 (0x200). Adding strings is a one-way operation. Once they are added, they are permanent. So even though you now have the ability to add strings of moderate length, you won't want to do it indiscriminately.

On the plus side, you can always re-add the same runpath back without using any more space:

% rpath rpath pointless:runpath
% rpath rpath
pointless:runpath
% elfdump -d rpath | egrep 'R*PATH|STRPAD'
      [28]  SUNW_STRPAD       0x1ee               
      [30]  RUNPATH           0x33f      pointless:runpath
rpath found that the string 'pointless:runpath' was already in the string table, so it used it without inserting another copy.
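
The reuse can be pictured as a search of the string table before anything is appended. The following is a minimal sketch under that assumption, not the code rpath actually uses; strtab_find() is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Look for the NUL-terminated string s inside a string table of
 * len bytes. Returns the offset of a usable copy, or -1 if there
 * is none. Because the terminator is part of the comparison, the
 * tail of a longer string is also accepted as a match.
 * (Hypothetical helper; not taken from the rpath source.)
 */
long
strtab_find(const char *tab, size_t len, const char *s)
{
	size_t need = strlen(s) + 1;	/* include the NUL */
	size_t off;

	if (need > len)
		return (-1);
	for (off = 0; off + need <= len; off++) {
		if (memcmp(tab + off, s, need) == 0)
			return ((long)off);
	}
	return (-1);
}
```

Only when such a search fails does a new string have to be written into the reserved space, which is when SUNW_STRPAD shrinks.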

Conclusions

Our best advice has always been that the LD_LIBRARY_PATH environment variable should not be used to work around objects with bad or missing runpaths. It is best to rebuild such objects and set the runpath correctly. This hasn't changed, and you should always do so if you can.

The problem with that advice is that there are times when all you have is the object, and no option to rebuild. In that case, LD_LIBRARY_PATH has been a necessary evil (and one that we've been glad to have). With the advent of objects that can have their runpaths modified, we now have a better answer, and the use of LD_LIBRARY_PATH for this purpose should be allowed to slowly fade away.



Friday Feb 09, 2007

Which Solaris Files Are Stripped?

In my previous blog entry about the new .SUNW_ldynsym sections, I made the following statement:

It used to be common practice for system binaries to be stripped in order to save space. However, observability is a central tenet of the Solaris philosophy. Solaris objects and executables are therefore shipped in unstripped form, and have been for many years, in order to support such symbol lookups.

It turns out that this is only partially true...

Brian Utterback posted a comment and pointed out that 490 of the 719 ELF binaries in /usr/bin on his Solaris 10 system are stripped. This shows that Solaris binaries have not been unstripped "for many years". I looked at /usr/bin on my desktop system, which is running a fairly recent Nevada build, and found that only 51 of the 815 files there are stripped. It appears that binaries are (mostly) unstripped now. What changed between Solaris 10 and today? And, why "mostly"?

As I usually do in such situations, I sent mail to my fellow Linker Alien Rod Evans. I asked him for his recollection of what policies were used for stripping Solaris files in the past. Here is a summary of what he told me:

  • For a very long time, the rule was that executables are stripped, and sharable libraries are not. The underlying idea was that people would not care to debug our executables, but certainly would debug their own programs that are linked to our libraries. We're not sure when this started, but are pretty sure that it covers most, if not all, of the Solaris 2.x era (we've no idea about the SunOS 4.x days).

  • In the early years, enforcement of this rule was rather incomplete, and exceptions occurred. Starting with Solaris 9, automated checks in the nightly builds tightened things up significantly.

  • The policy was changed in September 2005 (2 months before I joined Sun) to not strip any files. The change took effect with Nevada build 24, with
    5072038 binaries shouldn't be stripped
    I imagine that the introduction of DTrace made complete symbol information in binaries more important than before.

  • Solaris is built by combining various "consolidations". The above comments only apply to the core ON consolidation, which consists of the OS and Networking parts of Solaris. The other consolidations are built according to their own rules, which can and do differ. So, you should not be surprised to find some stripped files, even on a current development build of Solaris, like the 51 files I found in /usr/bin on my system.

The new .SUNW_ldynsym sections reduce the need for everything to be unstripped, so we may end up relaxing our ON rule if there is a reason to do so. And if the other consolidations continue to strip their files, .SUNW_ldynsym will provide better observability for them.

Brian is absolutely right: we have not been shipping "Solaris objects and executables" in unstripped form for many years, only sharable libraries! I knew that libraries were unstripped, and that current builds don't strip binaries either, and those two facts misled me.

On the plus side, I have a better understanding of the issue now... :-)



Wednesday Feb 07, 2007

What Is .SUNW_ldynsym?

Solaris ELF files have a new ELF symbol table. The section type is SHT_SUNW_LDYNSYM, and the section is named .SUNW_ldynsym. In the 20+ years in which the ELF standard has been in use, we have only needed two symbol tables (.symtab, and .dynsym) to support linking, so the addition of a third symbol table is a notable event for ELF cognoscenti. Even if you aren't one of those, you may encounter these sections, and wonder what they are for. I hope to explain that here.

Solaris has many tools that examine running processes or core files and generate stack traces. For example, consider the following call to pstack(1), made on an Xterm process currently running on my system:

% pstack 3094
3094:   xterm -ls -geometry 80x51+0+175
 fef4bea7 pollsys  (8046600, 2, 0, 0)
 fef0767e pselect  (5, 8400168, 84001e8, fef95260, 0, 0) + 19e
 fef0798e select   (5, 8400168, 84001e8, 0, 0) + 7e
 0805b250 in_put   (10, 8416720, 0, fedd561e, 8416720, 0) + 1b0
 08059b20 VTparse  (84166a8, 8057acc, fed387c5, 8416720, 84166a8, 804688c) + 90
 0805d1f1 VTRun    (8046a28, 8046870, feffa7c0, 8046808, 8046858, 804685c) + 205
 08057add main     (0, 80468b4, 80468c8) + 945
 08056eee _start   (4, 8046a90, 0, 8046a9a, 8046aa4, 0) + 7a
In order to show you those function names, pstack (really the libproc library used by pstack) needs to map the addresses of functions on the stack to the ELF symbols that correspond to them. Usually, these symbols come from the symbol table (.symtab). If this symbol table has been removed with the strip(1) program, then the dynamic symbol table (.dynsym) will be used instead. As described in a previous blog entry, the .dynsym contains the subset of global symbols from .symtab that are needed by the runtime linker ld.so.1(1). This fallback allows us to map global functions to their names, but local function symbols are not available. Observability tools like pstack(1) will display the hexadecimal address of such local functions when a name is not available. This is better than nothing, but is not particularly helpful.

It used to be common practice for system binaries to be stripped in order to save space. However, observability is a central tenet of the Solaris philosophy. Solaris objects and executables are therefore shipped in unstripped form, and have been for many years, in order to support such symbol lookups. For the most part, this has been a winning strategy, but there are still issues that come up from time to time:

  • strip(1) removes much more than the symbol table. Usually the size of this extra data is not a significant concern, but there are certain very large programs where the space savings might be worthwhile. It would be great to strip those particular things, but losing the local function symbols and the ability to make accurate stack traces is a bitter pill to swallow. This has led to a number of proposed features to "strip everything except local function symbols". These ideas are reasonable, but complicated. We like the fact that "strip" is a simple straightforward operation, and want to avoid complicating the concept.

  • We don't strip our files, but many Solaris users do. This becomes a problem when those applications misbehave, and they (or we, if you have a high end support contract) are trying to figure out why. Often, it is not possible to rebuild such applications in order to debug them. The ability to observe unmodified applications running in a production environment is another key Solaris virtue, as exemplified by DTrace.
Over the years, we have observed that these problems would be largely solved if we could add local function symbols to the .dynsym, and that in most programs, the additional space used would be minimal. Last fall, I embarked on a project to do this.

I tried hard to avoid adding a new symbol table type, and instead tried several experiments in which the additional local function symbols were placed in the dynsym. The reason for wanting this was to avoid having to modify ELF utilities and debuggers to know about a new symbol table. If the added symbols are in the existing .dynsym, those tools will automatically see them, without needing modification. As detailed in the ARC case that I filed for this work (PSARC/2006/526), I tried many different permutations. In every case, I discovered undesirable backward compatibility issues that kept me from using that solution. It turns out that the layout of .dynsym, and the other ELF sections that interact with it, are completely constrained, and there is no 100% backward compatible way to add local symbols to it.

ELF was designed from the very beginning to make it possible to introduce new section types with full backward/forward compatibility. You can always safely add a new section, with a moderate amount of care, and it will work. More than anything, this ability to extend ELF accounts for its long life. Given that the .dynsym cannot be extended with local symbols, I made the obvious (in hindsight) decision to introduce a new section type (SHT_SUNW_LDYNSYM), and add a new symbol table section named .SUNW_ldynsym to every Solaris file that has a .dynsym section. Once that decision was made, the implementation was straightforward, giving me confidence that it was the right way to go.

The .SUNW_ldynsym section can be thought of as the local initial part of the .dynsym that we wish to build, but can't. The Solaris linker (ld(1)) takes care to actually place them side by side, so that the end of the .SUNW_ldynsym section leads directly into the start of the .dynsym section. The runtime linker (ld.so.1(1)) takes advantage of this to treat them as a single table within the implementation of dladdr(3C). Note that this trick works for applications that mmap(2) the file and access it directly. If you are accessing an ELF file via libelf, as many utilities do, you can't make any assumptions about the relative positions of different sections.

As with .dynsym, .SUNW_ldynsym sections are allocable, meaning that they are part of the process text segment. This means that they are available at runtime for dladdr(3C). It also means that they cannot be stripped. Although you cannot strip .SUNW_ldynsym sections, you can prevent them from being generated by ld(1), by using the -znoldynsym linker option.

.SUNW_ldynsym sections consume a small amount of additional space. We found that for all of core Solaris (OS and Networking), the increase in size was on the order of 1.4%. This small increase pays off by letting our observability tools do a better job. Furthermore, the presence of .SUNW_ldynsym means that in many cases, you can strip programs that you might not have been willing to strip before.

Example

Let's use the following program to see how .SUNW_ldynsym sections improve Solaris observability of local functions:
/*
 * Program to demonstrate SHT_SUNW_LDYNSYM sections. The
 * global main program calls a local function named
 * static_func(). static_func() uses printstack() to exercise
 * the dladdr(3C) function provided by the runtime linker,
 * and then deliberately causes a segfault. The resulting core
 * file can be examined by pstack(1) or mdb(1).
 *
 * In all these cases, if a stripped binary of this program
 * contains a .SUNW_ldynsym section, the static_func() function
 * will be observable by name, and otherwise simply as an
 * address.
 */


#include <ucontext.h>

static void
static_func(void)
{
	/* Use dladdr(3C) to print a call stack */
	printstack(1);

	/*
	 * Write to address 0, killing the process and
	 * producing a core file.
	 */
	*((char *) 0) = 1;
}


int main(int argc, char *argv[])
{
	static_func();
	return (0);
}

Let's build two versions of this program, one containing the .SUNW_ldynsym section, and one without:
% cc -Wl,-znoldynsym test.c -o test_noldynsym
% cc test.c -o test_ldynsym
The elfdump(1) command can be used to let us examine the three symbol tables contained in test_ldynsym. There is no need to examine this (large) output too carefully, but there are some interesting facts worth noticing:
  • Every symbol in .SUNW_ldynsym or .dynsym is also found in .symtab, because .symtab is a superset of the other two tables. This is why it is always preferred to the other two, when available.

  • .symtab is much larger than the other two tables combined, which leads to the temptation to strip it, along with the other things strip(1) removes.

  • The symbols in .dynsym are strictly limited to those needed by the runtime linker.

  • If you consider the .SUNW_ldynsym and .dynsym symbol tables as a single logical entity, you can see that the result follows the rules for ELF symbol table layout.
% elfdump -s test_ldynsym

Symbol Table Section:  .SUNW_ldynsym
     index    value      size      type bind oth ver shndx          name
       [0]  0x00000000 0x00000000  NOTY LOCL  D    0 UNDEF          
       [1]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            test_ldynsym
       [2]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crti.s
       [3]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crt1.o
       [4]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crt1.s
       [5]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            fsr.s
       [6]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            values-Xa.c
       [7]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            test.c
       [8]  0x080507f0 0x00000019  FUNC LOCL  D    0 .text          static_func
       [9]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crtn.s

Symbol Table Section:  .dynsym
     index    value      size      type bind oth ver shndx          name
       [0]  0x00000000 0x00000000  NOTY LOCL  D    0 UNDEF          
       [1]  0x08050668 0x00000000  OBJT GLOB  D    0 .plt           _PROCEDURE_LINKAGE_TABLE_
       [2]  0x08060974 0x00000004  OBJT WEAK  D    0 .data          environ
       [3]  0x0806088c 0x00000000  OBJT GLOB  D    0 .dynamic       _DYNAMIC
       [4]  0x080609c0 0x00000000  OBJT GLOB  D    0 .bssf          _edata
       [5]  0x08060990 0x00000004  OBJT GLOB  D    0 .data          ___Argv
       [6]  0x08050868 0x00000000  OBJT GLOB  D    0 .rodata        _etext
       [7]  0x0805082c 0x0000001b  FUNC GLOB  D    0 .init          _init
       [8]  0x00000000 0x00000000  NOTY GLOB  D    0 ABS            __fsr_init_value
       [9]  0x08050810 0x00000019  FUNC GLOB  D    0 .text          main
      [10]  0x08060974 0x00000004  OBJT GLOB  D    0 .data          _environ
      [11]  0x08060868 0x00000000  OBJT GLOB  P    0 .got           _GLOBAL_OFFSET_TABLE_
      [12]  0x080506b8 0x00000000  FUNC GLOB  D    0 UNDEF          printstack
      [13]  0x080506a8 0x00000000  FUNC GLOB  D    0 UNDEF          _exit
      [14]  0x08050864 0x00000004  OBJT GLOB  D    0 .rodata        _lib_version
      [15]  0x08050698 0x00000000  FUNC GLOB  D    0 UNDEF          atexit
      [16]  0x08050678 0x00000000  FUNC GLOB  D    0 UNDEF          __fpstart
      [17]  0x0805076c 0x0000007b  FUNC GLOB  D    0 .text          __fsr
      [18]  0x08050688 0x00000000  FUNC GLOB  D    0 UNDEF          exit
      [19]  0x080506c8 0x00000000  FUNC WEAK  D    0 UNDEF          _get_exit_frame_monitor
      [20]  0x080609c0 0x00000000  OBJT GLOB  D    0 .bss           _end
      [21]  0x080506e0 0x0000008b  FUNC GLOB  D    0 .text          _start
      [22]  0x08050848 0x0000001b  FUNC GLOB  D    0 .fini          _fini
      [23]  0x08060978 0x00000018  OBJT GLOB  D    0 .data          __environ_lock
      [24]  0x0806099c 0x00000004  OBJT GLOB  D    0 .data          __longdouble_used
      [25]  0x00000000 0x00000000  NOTY WEAK  D    0 UNDEF          __1cG__CrunMdo_exit_code6F_v_

Symbol Table Section:  .symtab
     index    value      size      type bind oth ver shndx          name
       [0]  0x00000000 0x00000000  NOTY LOCL  D    0 UNDEF          
       [1]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            test_ldynsym
       [2]  0x080500f4 0x00000000  SECT LOCL  D    0 .interp        
       [3]  0x08050108 0x00000000  SECT LOCL  D    0 .SUNW_cap      
       [4]  0x08050118 0x00000000  SECT LOCL  D    0 .hash          
       [5]  0x080501fc 0x00000000  SECT LOCL  D    0 .SUNW_ldynsym  
       [6]  0x0805029c 0x00000000  SECT LOCL  D    0 .dynsym        
       [7]  0x0805043c 0x00000000  SECT LOCL  D    0 .dynstr        
       [8]  0x080505c4 0x00000000  SECT LOCL  D    0 .SUNW_version  
       [9]  0x080505f4 0x00000000  SECT LOCL  D    0 .SUNW_dynsymso 
      [10]  0x08050630 0x00000000  SECT LOCL  D    0 .rel.data      
      [11]  0x08050638 0x00000000  SECT LOCL  D    0 .rel.plt       
      [12]  0x08050668 0x00000000  SECT LOCL  D    0 .plt           
      [13]  0x080506e0 0x00000000  SECT LOCL  D    0 .text          
      [14]  0x0805082c 0x00000000  SECT LOCL  D    0 .init          
      [15]  0x08050848 0x00000000  SECT LOCL  D    0 .fini          
      [16]  0x08050864 0x00000000  SECT LOCL  D    0 .rodata        
      [17]  0x08060868 0x00000000  SECT LOCL  D    0 .got           
      [18]  0x0806088c 0x00000000  SECT LOCL  D    0 .dynamic       
      [19]  0x08060974 0x00000000  SECT LOCL  D    0 .data          
      [20]  0x080609c0 0x00000000  SECT LOCL  D    0 .bssf          
      [21]  0x080609c0 0x00000000  SECT LOCL  D    0 .bss           
      [22]  0x00000000 0x00000000  SECT LOCL  D    0 .symtab        
      [23]  0x00000000 0x00000000  SECT LOCL  D    0 .strtab        
      [24]  0x00000000 0x00000000  SECT LOCL  D    0 .comment       
      [25]  0x00000000 0x00000000  SECT LOCL  D    0 .debug_info    
      [26]  0x00000000 0x00000000  SECT LOCL  D    0 .debug_line    
      [27]  0x00000000 0x00000000  SECT LOCL  D    0 .debug_abbrev  
      [28]  0x00000000 0x00000000  SECT LOCL  D    0 .shstrtab      
      [29]  0x080609c0 0x00000000  OBJT LOCL  D    0 .bss           _END_
      [30]  0x08050000 0x00000000  OBJT LOCL  D    0 .interp        _START_
      [31]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crti.s
      [32]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crt1.o
      [33]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crt1.s
      [34]  0x08060994 0x00000004  OBJT LOCL  D    0 .data          __get_exit_frame_monitor_ptr
      [35]  0x08060998 0x00000004  OBJT LOCL  D    0 .data          __do_exit_code_ptr
      [36]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            fsr.s
      [37]  0x080609a0 0x00000020  OBJT LOCL  D    0 .data          trap_table
      [38]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            values-Xa.c
      [39]  0x08060974 0x00000000  NOTY LOCL  D    0 .data          Ddata.data
      [40]  0x080609c0 0x00000000  NOTY LOCL  D    0 .bss           Bbss.bss
      [41]  0x08050868 0x00000000  NOTY LOCL  D    0 .rodata        Drodata.rodata
      [42]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            test.c
      [43]  0x080507f0 0x00000019  FUNC LOCL  D    0 .text          static_func
      [44]  0x080609c0 0x00000000  OBJT LOCL  D    0 .bss           Bbss.bss
      [45]  0x08060974 0x00000000  OBJT LOCL  D    0 .data          Ddata.data
      [46]  0x08050864 0x00000000  OBJT LOCL  D    0 .rodata        Drodata.rodata
      [47]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crtn.s
      [48]  0x08050668 0x00000000  OBJT GLOB  D    0 .plt           _PROCEDURE_LINKAGE_TABLE_
      [49]  0x08060974 0x00000004  OBJT WEAK  D    0 .data          environ
      [50]  0x0806088c 0x00000000  OBJT GLOB  D    0 .dynamic       _DYNAMIC
      [51]  0x080609c0 0x00000000  OBJT GLOB  D    0 .bssf          _edata
      [52]  0x08060990 0x00000004  OBJT GLOB  D    0 .data          ___Argv
      [53]  0x08050868 0x00000000  OBJT GLOB  D    0 .rodata        _etext
      [54]  0x0805082c 0x0000001b  FUNC GLOB  D    0 .init          _init
      [55]  0x00000000 0x00000000  NOTY GLOB  D    0 ABS            __fsr_init_value
      [56]  0x08050810 0x00000019  FUNC GLOB  D    0 .text          main
      [57]  0x08060974 0x00000004  OBJT GLOB  D    0 .data          _environ
      [58]  0x08060868 0x00000000  OBJT GLOB  P    0 .got           _GLOBAL_OFFSET_TABLE_
      [59]  0x080506b8 0x00000000  FUNC GLOB  D    0 UNDEF          printstack
      [60]  0x080506a8 0x00000000  FUNC GLOB  D    0 UNDEF          _exit
      [61]  0x08050864 0x00000004  OBJT GLOB  D    0 .rodata        _lib_version
      [62]  0x08050698 0x00000000  FUNC GLOB  D    0 UNDEF          atexit
      [63]  0x08050678 0x00000000  FUNC GLOB  D    0 UNDEF          __fpstart
      [64]  0x0805076c 0x0000007b  FUNC GLOB  D    0 .text          __fsr
      [65]  0x08050688 0x00000000  FUNC GLOB  D    0 UNDEF          exit
      [66]  0x080506c8 0x00000000  FUNC WEAK  D    0 UNDEF          _get_exit_frame_monitor
      [67]  0x080609c0 0x00000000  OBJT GLOB  D    0 .bss           _end
      [68]  0x080506e0 0x0000008b  FUNC GLOB  D    0 .text          _start
      [69]  0x08050848 0x0000001b  FUNC GLOB  D    0 .fini          _fini
      [70]  0x08060978 0x00000018  OBJT GLOB  D    0 .data          __environ_lock
      [71]  0x0806099c 0x00000004  OBJT GLOB  D    0 .data          __longdouble_used
      [72]  0x00000000 0x00000000  NOTY WEAK  D    0 UNDEF          __1cG__CrunMdo_exit_code6F_v_
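
The section symbols in the .symtab listing also let us confirm that the linker placed the two dynamic tables side by side. .SUNW_ldynsym starts at 0x080501fc and holds 10 entries, .dynsym starts at 0x0805029c and holds 26 entries, and .dynstr follows at 0x0805043c. The sketch below does the arithmetic, assuming the standard 16-byte layout of a 32-bit ELF symbol; sym32_t, symtab_end() and tables_adjacent() are names invented for this illustration.

```c
#include <stdint.h>

/*
 * Same layout as the 32-bit ELF symbol entry (Elf32_Sym), spelled
 * out with fixed-width types so the sketch does not depend on a
 * system ELF header. Each entry is 16 bytes.
 */
typedef struct {
	uint32_t	st_name;
	uint32_t	st_value;
	uint32_t	st_size;
	unsigned char	st_info;
	unsigned char	st_other;
	uint16_t	st_shndx;
} sym32_t;

/* Address just past a table of nsyms entries that starts at addr */
uint32_t
symtab_end(uint32_t addr, uint32_t nsyms)
{
	return (addr + nsyms * (uint32_t)sizeof (sym32_t));
}

/* Nonzero if the first table runs directly into the second one */
int
tables_adjacent(uint32_t first_addr, uint32_t first_nsyms,
    uint32_t second_addr)
{
	return (symtab_end(first_addr, first_nsyms) == second_addr);
}
```

With the values above, 10 entries of 16 bytes end exactly at 0x0805029c, where .dynsym begins, and the 26 .dynsym entries end at 0x0805043c, where .dynstr begins.
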
Now, we strip the two versions of our program to remove the .symtab symbol table, and force the system to use the dynamic tables instead:
% strip test_ldynsym test_noldynsym 
% file test_ldynsym test_noldynsym 
test_ldynsym:   ELF 32-bit LSB executable 80386 Version 1, dynamically linked, stripped
test_noldynsym: ELF 32-bit LSB executable 80386 Version 1, dynamically linked, stripped
Running the version without a .SUNW_ldynsym section:
% ./test_noldynsym 
/home/ali/test/test_noldynsym:0x6ca
/home/ali/test/test_noldynsym:main+0xb
/home/ali/test/test_noldynsym:_start+0x7a
Segmentation Fault (core dumped)
% pstack core
core 'core' of 5041:    ./test_noldynsym
 080506d2 ???????? (804692c, 80467a4, 805062a, 1, 80467b0, 80467b8)
 080506eb main     (1, 80467b0, 80467b8) + b
 0805062a _start   (1, 8046994, 0, 80469a5, 80469bf, 8046a03) + 7a
Our program used the printstack(3C) function to display its own stack. Afterwards, we use the pstack command to view the same data from the core file. In both cases, the top line represents the call to the local function static_func(), a fact that we know from examining the source code, since the number and/or '????????' used to represent it are less than obvious to an external observer.

Running the version with a .SUNW_ldynsym section, the system is able to put a name to the local function:

% ./test_ldynsym 
/home/ali/test/test_ldynsym:static_func+0xa
/home/ali/test/test_ldynsym:main+0xb
/home/ali/test/test_ldynsym:_start+0x7a
Segmentation Fault (core dumped)
% pstack core
core 'core' of 5044:    ./test_ldynsym
 08050802 static_func (8046930, 80467a8, 805075a, 1, 80467b4, 80467bc) + 12
 0805081b main     (1, 80467b4, 80467bc) + b
 0805075a _start   (1, 8046998, 0, 80469a7, 80469c1, 8046a05) + 7a
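
The name-plus-offset form in this output follows directly from the value and size columns of the symbol tables shown earlier. Here is a minimal sketch of that lookup. It is not the libproc implementation, which is more elaborate, and func_sym_t and addr_to_sym() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for a function symbol: name, start address, size */
typedef struct {
	const char	*name;
	uint32_t	value;
	uint32_t	size;
} func_sym_t;

/*
 * Return the symbol whose [value, value + size) range contains
 * addr, or NULL if none does. On success *off receives the
 * distance from the start of the function.
 */
const func_sym_t *
addr_to_sym(const func_sym_t *syms, size_t nsyms, uint32_t addr,
    uint32_t *off)
{
	size_t i;

	for (i = 0; i < nsyms; i++) {
		if (addr >= syms[i].value &&
		    addr - syms[i].value < syms[i].size) {
			*off = addr - syms[i].value;
			return (&syms[i]);
		}
	}
	return (NULL);
}
```

Given static_func at 0x080507f0 with size 0x19, the address 0x08050802 resolves to static_func + 0x12, which is what pstack printed. Without a symbol covering the address, the lookup returns NULL and a tool can only show the raw number.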

Conclusions

Sometimes it is the little things that make a difference. I expect that the local dynamic symbol table will provide valuable information in difficult debugging situations where one is examining large stripped programs running in a production environment. The rest of the time, the additional data is small, and will have little or no impact on performance.

.SUNW_ldynsym sections have been part of the Solaris development (Nevada) builds since last fall, and are also available in OpenSolaris.



Saturday Sep 23, 2006

Inside ELF Symbol Tables

ELF files are full of things we need to keep track of for later access: Names, addresses, sizes, and intended purpose. Without this information, an ELF file would not be very useful. We would have no way to make sense of the impenetrable mass of octal or hexadecimal numbers.

Consider: When you write a program in any language above direct machine code, you give symbolic names to functions and data. The compiler turns these things into code. At the machine level, they are known only by their address (offset within the file) and their size. There are no names in this machine code. How then, can a linker combine multiple object files, or a symbolic debugger know what name to use for a given address? How do we make sense of these files?

Symbols are the way we manage this information. Compilers generate symbol information along with code. Linkers manipulate symbols, reading them in, matching them up, and writing them out. Almost everything a linker does is driven by symbols. Finally, debuggers use them to figure out what they are looking at and to provide you with a human readable view of that information.

It is therefore a rare ELF file that doesn't have a symbol table. However, most programmers have only an abstract knowledge that symbol tables exist, and that they loosely correspond to their functions and data, and some "other stuff". Protected by the abstractions of compiler, linker, and debugger, we don't usually need to know too much about the details of how a symbol table is organized. I've recently completed a project that required me to learn about symbol tables in great detail. Today, I'm going to write about the symbol tables used by the linker.

.symtab and .dynsym

Sharable objects and dynamic executables usually have 2 distinct symbol tables, one named ".symtab", and the other ".dynsym". (To make this easier to read, I am going to refer to these without the quotes or leading dot from here on.)

The dynsym is a smaller version of the symtab that only contains global symbols. The information found in the dynsym is therefore also found in the symtab, while the reverse is not necessarily true. You are almost certainly wondering why we complicate the world with two symbol tables. Won't one table do? Yes, it would, but at the cost of using more memory than necessary in the running process.

To understand how this works, we need to understand the difference between allocable and non-allocable ELF sections. ELF files contain some sections (e.g. code and data) needed at runtime by the process that uses them. These sections are marked as being allocable. There are many other sections that are needed by linkers, debuggers, and other such tools, but which are not needed by the running program. These are said to be non-allocable. When a linker builds an ELF file, it gathers all of the allocable sections together in one part of the file, and all of the non-allocable sections are placed elsewhere. When the operating system loads the resulting file, only the allocable part is mapped into memory. The non-allocable part remains in the file, but is not visible in memory. strip(1) can be used to remove certain non-allocable sections from a file. This reduces file size by throwing away information. The program is still runnable, but debuggers may be hampered in their ability to tell you what the program is doing.
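
In the file itself, "marked as being allocable" comes down to a single bit, SHF_ALLOC, in the sh_flags field of the section header. The sketch below tests that bit. The constant is defined locally, with the value the ELF specification gives it, so that the example does not depend on a system ELF header; the SKETCH_ prefix and the function name are mine.

```c
#include <stdint.h>

/* Section flag bit from the ELF specification (sh_flags) */
#define	SKETCH_SHF_ALLOC	0x2	/* same value as SHF_ALLOC */

/* Nonzero if a section with these sh_flags occupies memory at runtime */
int
section_is_allocable(uint32_t sh_flags)
{
	return ((sh_flags & SKETCH_SHF_ALLOC) != 0);
}
```

A code section typically carries the allocable and executable bits together (0x6), so it passes this test. A section with no flags set (0x0) is non-allocable, and is the kind of section strip(1) is free to remove.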

The full symbol table contains a large amount of data needed to link or debug our files, but not needed at runtime. In fact, in the days before sharable libraries and dynamic linking, none of it was needed at runtime. There was a single, non-allocable symbol table (reasonably named "symtab"). When dynamic linking was added to the system, the original designers faced a choice: Make the symtab allocable, or provide a second smaller allocable copy. The symbols needed at runtime are a small subset of the total, so a second symbol table saves virtual memory in the running process. This is an important consideration. Hence, a second symbol table was invented for dynamic linking, and consequently named "dynsym".

And so, we have two symbol tables. The symtab contains everything, but it is non-allocable, can be stripped, and has no runtime cost. The dynsym is allocable, and contains the symbols needed to support runtime operation. This division has served us well over the years.

Types Of Symbols

Given how long symbols have been around, there are surprisingly few types:
STT_NOTYPE
Used when we don't know what a symbol is, or to indicate the absence of a symbol.

STT_OBJECT / STT_COMMON
These are both used to represent data. (The word OBJECT in this context should not be interpreted as having anything to do with object orientation. STT_DATA might have been a better name.)

STT_OBJECT is used for normal variable definitions, while STT_COMMON is used for tentative definitions. See my earlier blog entry about tentative symbols for more information on the differences between them.

STT_FUNC
A function, or other executable code.

STT_SECTION
When I first started learning about ELF, and someone would say something about "section symbols", I thought they meant a symbol from some given section. That's not it though: A section symbol is a symbol that is used to refer to the section itself. They are used mainly when performing relocations, which are often specified in the form of "modify the value at offset XXX relative to the start of section YYY".

STT_FILE
The name of a file, either of an input file used to construct the ELF file, or of the ELF file itself.

STT_TLS
A third type of data symbol, used for thread local data. A thread local variable is a variable that is unique to each thread. For instance, if I declare the variable "foo" to be thread local, then every thread has a separate foo variable of its own, and does not see or share values from the other threads. Thread local variables are created for each thread when the thread is created. As such, their number (one per thread) and addresses (which depend on when the thread is created, and how many threads there are) are unknown until runtime. An ELF file cannot contain an address for them. Instead, an STT_TLS symbol is used. The value of an STT_TLS symbol is an offset, which is used to calculate a TLS offset relative to the thread pointer. You can read more about TLS in the Linker And Libraries Guide.

STT_REGISTER
The Sparc architecture has a concept known as a "register symbol". These symbols are used to validate symbol/register usage, and can also be used to initialize global registers. Other architectures don't use these.

In addition to symbol type, each symbol has other attributes:

  • Name (Optional: Not all symbols need a name, though most do)
  • Value
  • Size
  • Binding and Visibility
  • ELF Section it references
The exact meaning for some of these attributes depends on the type of symbol involved. For more details, consult the Solaris Linker and Libraries Guide, which is available in PDF form online.
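As a rough illustration of how the type and binding attributes are stored, here is a minimal sketch. The struct mirrors the field order of the standard 64-bit ELF symbol entry, and the numeric type codes are the standard ones, but every `demo_` name is a hypothetical stand-in chosen so that the example does not depend on a system `<elf.h>`.

```c
#include <stdint.h>

/* Local copies of the standard symbol type codes (names are hypothetical). */
enum demo_stt {
    DEMO_STT_NOTYPE   = 0,
    DEMO_STT_OBJECT   = 1,
    DEMO_STT_FUNC     = 2,
    DEMO_STT_SECTION  = 3,
    DEMO_STT_FILE     = 4,
    DEMO_STT_COMMON   = 5,
    DEMO_STT_TLS      = 6,
    DEMO_STT_REGISTER = 13   /* SPARC-specific register symbol */
};

/* Simplified stand-in for a 64-bit ELF symbol table entry. */
struct demo_sym {
    uint32_t name;    /* offset into the string table; 0 means "no name" */
    uint8_t  info;    /* binding in the high 4 bits, type in the low 4 bits */
    uint8_t  other;   /* visibility */
    uint16_t shndx;   /* index of the section the symbol references */
    uint64_t value;
    uint64_t size;
};

/* Split the packed info byte into its two halves. */
static inline unsigned demo_st_type(uint8_t info) { return info & 0xf; }
static inline unsigned demo_st_bind(uint8_t info) { return info >> 4; }

/* Map a type code to a printable name. */
const char *demo_type_name(unsigned type)
{
    switch (type) {
    case DEMO_STT_NOTYPE:   return "STT_NOTYPE";
    case DEMO_STT_OBJECT:   return "STT_OBJECT";
    case DEMO_STT_FUNC:     return "STT_FUNC";
    case DEMO_STT_SECTION:  return "STT_SECTION";
    case DEMO_STT_FILE:     return "STT_FILE";
    case DEMO_STT_COMMON:   return "STT_COMMON";
    case DEMO_STT_TLS:      return "STT_TLS";
    case DEMO_STT_REGISTER: return "STT_REGISTER";
    default:                return "(other)";
    }
}
```

For an info byte of 0x12, demo_st_bind() yields 1 (global binding) and demo_st_type() yields 2, which demo_type_name() reports as STT_FUNC.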

Symbols Table Layout And Conventions

The symbols in a symbol table are written in the following order:
  1. Index 0 in any symbol table is used to represent undefined symbols. As such, the first entry in a symbol table (index 0) is always completely zeroed (type STT_NOTYPE), and is not used.

  2. If the file contains any local symbols, the second entry (index 1) in the symbol table will be a STT_FILE symbol giving the name of the file.

  3. Section symbols.

  4. Register symbols.

  5. Global symbols that have been reduced to local scope via a mapfile.

  6. For each input file that supplies local symbols, a STT_FILE symbol giving the name of the input file is put in the symbol table, followed by the symbols in question.

  7. The global symbols immediately follow the local symbols in the symbol table. Local and global symbols are always kept separate in this manner, and cannot be mixed together.
What would happen if we ignored these rules and reordered things in some other way (e.g. sorted by address)? There is no way to answer this question with 100% certainty. It would probably confuse existing tools that manipulate ELF files. In particular, it seems clear that the local and global symbols must remain separate. For years and years, arbitrary software has been free to assume the above layout. We can't possibly know how much software has been written, or how dependent on layout it is. The only safe move is to maintain the well known layout described above.
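To make the local/global separation concrete, here is a minimal sketch of a checker for that part of the convention. It assumes the caller already has the info byte of every symbol in table order, plus the index of the first global symbol (in a real ELF file the symbol table's section header records this in its sh_info field). The function name and parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_STB_LOCAL 0   /* standard binding code for local symbols */

/*
 * Return 1 if the table obeys two of the rules above:
 *   - entry 0 is the zeroed null symbol
 *   - every symbol before first_global is local, and every symbol
 *     from first_global onward is not
 * Return 0 otherwise.
 */
int demo_layout_ok(const uint8_t *info, size_t nsyms, size_t first_global)
{
    if (nsyms == 0 || info[0] != 0)
        return 0;
    if (first_global < 1 || first_global > nsyms)
        return 0;

    for (size_t i = 0; i < nsyms; i++) {
        int is_local = (info[i] >> 4) == DEMO_STB_LOCAL;
        int should_be_local = i < first_global;

        if (is_local != should_be_local)
            return 0;
    }
    return 1;
}
```

A tool that relies on the layout can use a check like this to reject a table in which a global symbol has strayed into the local range.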

Next Time: Augmenting The Dynsym

One of the big advantages of Solaris relative to other operating systems is the extensive support for observability: The ability to easily look inside a running program and see what it is doing, in detail. To do that well requires symbols. The symbols in the dynsym may not be enough to do a really good job. For example, to produce a stack trace, we need to take each function address and match it up to its name. If we are looking at a stripped file, or referencing the file from within the process using it via dladdr(3C), we won't have any way to find names for the non-global functions, and will have to resort to displaying hex addresses. This is better than nothing, but not by much. The standard files in a Solaris distribution are not stripped for exactly this reason. However, many files found in production are stripped, and in-process inspection is still limited to the dynsym.
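The address-to-name matching described above can be sketched as a simple range search. This is an illustration only: the table type, the sample entries, and the function name are hypothetical, and a real implementation would read the values and sizes from the symtab or dynsym instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, pre-extracted view of the function symbols in an object. */
struct demo_func {
    const char *name;
    uint64_t    value;   /* start address of the function */
    uint64_t    size;    /* length of the function in bytes */
};

/* Sample data, invented for the example. */
const struct demo_func demo_table[] = {
    { "main",   0x1000, 0x40 },
    { "helper", 0x1040, 0x20 },
};

/*
 * Return the name of the function whose address range contains addr,
 * or NULL if no symbol covers it. In the NULL case a stack trace has
 * nothing better to print than the raw hex address.
 */
const char *demo_addr_to_name(const struct demo_func *tab, size_t n,
    uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= tab[i].value && addr - tab[i].value < tab[i].size)
            return tab[i].name;
    }
    return NULL;
}
```

If "helper" were a local function missing from the table, a lookup of 0x1044 would fall through to NULL, which is exactly the situation that leaves a stack trace full of hex addresses.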

Machines are much larger than they used to be. The memory saved by the symtab/dynsym division is still a good thing, but there are times when we wish that the dynsym contained a bit more data. This is harder than it sounds. The layout of dynsym interacts with the rest of an ELF file in ways that are set in stone by years of existing practice. Backward compatibility is a critical feature of Solaris. We try extremely hard to keep those old programs running. And yet, the needs of observability, spearheaded by important new features like DTrace, put pressure on us in the other direction.

This discussion is a prelude to work I recently did to augment the dynsym to contain local symbols, while preserving full backward compatibility with older versions of Solaris. I plan to cover that in a future blog entry. ELF is old, and much of how it works cannot be changed. Its original designers (our "Founding Fathers", as Rod calls them) anticipated that this would be the case, based no doubt on hard experience with earlier systems. The ELF design is therefore uniquely flexible, which explains why it has survived as long as it has. There is always a way to add something new. Sometimes, it takes several tries to find the best way.



Friday Sep 22, 2006

What Are "Tentative" Symbols?

In the Linker and Libraries Guide, you will encounter discussion of tentative symbols. Based on the name, we might expect that such a symbol is missing something, but what? And why does the linker have to treat them as a special case?

A tentative symbol is a symbol used to track a global variable when we don't know its size or initial value. In other words, a symbol for which we have not yet assigned a storage address. They are also known as "common block" symbols, because they have their origins in the implementation of Fortran COMMON blocks. They are historical baggage — something that needs to work for compatibility with the past, but also something to avoid in new code.

Consider the following two C declarations, made at outer file scope:

        int foo;
        int foo = 0;
Superficially, these both appear to declare a global variable named foo with an initial value of 0. However, the first definition is tentative — it will have a value of 0 only if some other file doesn't explicitly give it a different value. The outcome depends on what else we link this file against.

To get a better handle on this, let's create two separate C files (t1.c, and t2.c) and experiment:

t1.c
        #include <stdio.h>

        #ifdef TENTATIVE_FOO
        int foo;
        #else
        int foo = 0;
        #endif

        int
        main(int argc, char *argv[])
        {
                printf("FOO: %d\n", foo);
                return (0);
        }
t2.c
        int foo = 12;

First, we compile and link t1.c by itself, using both forms of declaration for variable foo:

        % cc -DTENTATIVE_FOO t1.c; ./a.out
        FOO: 0
        % cc t1.c; ./a.out
        FOO: 0

As expected, they give identical results. Now, let's add t2.c to the mix and see what happens:

        % cc -DTENTATIVE_FOO t1.c t2.c; ./a.out
        FOO: 12
        % cc t1.c t2.c; ./a.out
        ld: fatal: symbol `foo' is multiply-defined:
                (file t1.o type=OBJT; file t2.o type=OBJT);
        ld: fatal: File processing errors. No output written to a.out
        ./a.out: No such file or directory
As you can see, the two different ways of declaring foo are not 100% equivalent. The tentative declaration of foo in t1.c took on the value provided by the declaration in t2.c. In contrast, the linker was unwilling to merge the two non-tentative definitions of foo that had different values, and instead issued a fatal link error.

Normal C rules say that a variable at file scope without an explicit value is assigned an initial value of 0. However, the existence of other global variables with the same name can change this. The C compiler is only able to see the code in the single file it is compiling, and cannot know how to handle this case. So, it marks it as tentative by giving the symbol a type of STT_COMMON, and leaves it for the linker to figure out. The linker is in a position to match up all of these symbols and merge them into a single instance. The linker has no insight into programmer intent though, and it cannot protect you from doing this by accident. The result usually works, but is fragile.

The other declaration form (with a value) causes a non-tentative symbol to be created (STT_OBJECT). In this case, the linker ensures that all the declarations agree. This is the right behavior if you care about robust and scalable code.
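The same distinction exists inside a single source file, where the C language itself resolves it without any help from the linker. The sketch below is an illustration of that language rule only, not of the cross-file merging the linker performs; the variable names are hypothetical.

```c
/* Tentative definition: external linkage, no initializer. */
int demo_counter;

/*
 * A later definition of the same variable with an initializer is legal
 * in the same file. It absorbs the tentative one, so demo_counter
 * starts out as 12, not 0.
 */
int demo_counter = 12;

/*
 * This variable never receives an initializer anywhere in the file, so
 * at the end of the file the compiler treats it as if it had been
 * written "int demo_untouched = 0;".
 */
int demo_untouched;
```

Adding a second initialized definition, such as `int demo_counter = 0;`, would be rejected by the compiler, which parallels the fatal link error shown earlier for two non-tentative definitions.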

It is worth noting that you will never see a tentative symbol with local scope. It can only happen to global symbols, because global symbols in different files are the only way you can get this form of aliasing to occur.

History

Tentative symbols are bad software engineering. A declaration in one file should not be able to alter one in another file. The need for them dates from the early days of the Fortran language. In Fortran, you can declare a common block in more than one file, with each file independently specifying the number, types, and sizes of the variables. The linker then takes all of these blocks, allocates enough space to satisfy the largest one, and makes all of them point at that space. This is a very crude form of a union (variant), and is therefore a very useful (and dangerous) Fortran technique.

Sadly, it didn't stop there. We still sometimes find this practice in C code. Two files will both declare:

        int foo;
and then expect that they are both referring to a single global variable, with an initial value of 0. This is not necessary. The proper solution has existed for decades. The safe way to do the above is to have exactly one declaration for the global variable in a single file. The other files that need access to it use the "extern" keyword to let the compiler know what is going on. The statement
        extern int foo;
is a reference, not a declaration, and it has a single unambiguous interpretation.

Moral: Don't Do That!

Don't use common block binding in your code. It was a bad idea 40 years ago, and it hasn't improved with age. The necessity of backward compatibility is such that compilers and linkers must support common block binding. We are stuck with it, but we don't have to use it.

You should always try to minimize or eliminate global variables. However, when you do use them:

  • There should be exactly one declaration for each global variable, contained in a single file. When declaring that single instance, always give it an explicit value, even if that value is 0. The C language says that the value is 0 if you don't, but doing it explicitly ensures that you can't accidentally fall into the "tentative trap" if some other module should come along later and define it. Note that this only applies to global variables. Static variables declared at file scope can be safely assumed to have an initial value of 0.

  • The module that declares the variable should supply a header file containing an extern statement for the variable. Furthermore, the module must #include its own header file. The compiler allows you to have a declaration and an extern statement for a variable in the same compilation scope, and it will check to make sure they agree. This ensures that your module can't export a bad extern definition to other code.

  • Other modules that access the global variable must always #include the header file from the defining module, and must never supply their own explicit extern statement for the variable. This protects them from being stuck with an obsolete and incorrect definition if the variable should change later.
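The rules above can be sketched in a single compilation unit by writing out what the compiler sees after the defining module has included its own header. The file names foo.h and foo.c, and the demo_ identifiers, are hypothetical.

```c
/* ---- contents of a hypothetical foo.h ---- */
extern int demo_foo;    /* a reference only: allocates no storage */

/* ---- contents of the matching foo.c, after #include "foo.h" ---- */

/*
 * The single definition, with an explicit value so that the symbol is
 * never tentative. Because the extern statement from the header is in
 * scope here, the compiler checks that the two agree: changing this
 * line to "long demo_foo = 0;" is a compile-time error.
 */
int demo_foo = 0;

/* Other code in the module uses the variable in the ordinary way. */
int demo_read_foo(void)
{
    return demo_foo;
}
```

Other modules would include foo.h and nothing more, so they pick up any later change to the variable's type automatically.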



Wednesday Jun 14, 2006

Settling An Old Score (Linker Division)

For years, I worked on an interactive language used by scientists to do data analysis and visualization. That program makes heavy use of sharable libraries. Solaris was my primary development platform, due to its superior facilities for observing and debugging software, so my usual strategy was to write and debug my code under Solaris, and then move the results to the many other Unix (and VMS, Windows, and Macintosh) platforms that we supported.

I became very familiar with one quirk of the Solaris linker that bit me many times over the years. The issue has to do with how ld(1) handles the situation where it needs to replace an existing output file. Historically, this was handled by truncating the existing file and rewriting it in place. This preserves the existing inode, and any hard links that may happen to be pointing to it. However, it has a very bad effect on any running process that happens to be using that file. For example, if the output file is a sharable library, creating a new version while a program that uses it is running will inevitably cause that program to die in an unplanned and unexpected way.

If you're not a developer of software that runs for unbounded amounts of time, then you probably have not seen this behavior. If you do develop such code, however, then you've almost certainly hit it at some point. In my case, I'd hit it about once a year, usually while multitasking, flipping back and forth between several xterms. Usually, it was obvious what had happened, and I'd quickly recover. Sometimes though, if I was debugging the program for some other reason, the unexpected SIGSEGV or SIGBUS would send me off into the weeds, debugging a mysterious problem in a part of the program unrelated to where I expected the problem to lie. After a minute or so, I'd realize what had happened. Gack! Bitten again...

I cheerfully admit that this is not a big deal. However, I always wondered what reasons those people at Sun had for not changing it. I assumed that there was some subtlety that I was missing. Other platforms handle it in a different way, by having ld unlink the existing file first, and then create a new file under the same name. Any existing processes continue to see the old file, while the new file becomes available to new processes. When the last program with an open file descriptor exits, the Unix kernel removes the old file from the disk, in the standard Unix way.
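The unlink-first behavior is easy to observe with ordinary POSIX calls. The sketch below is not how any particular linker is implemented; it only demonstrates the kernel behavior described above, using hypothetical helper names and a scratch file path supplied by the caller.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Create (or truncate) path and write text into it. Returns 0 on success. */
static int demo_write_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(text);
    ssize_t n = write(fd, text, len);
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}

/* Replace path the "unlink first" way: drop the old name, then create a
 * brand new file (with a new inode) under the same name. */
static int demo_replace_by_unlink(const char *path, const char *text)
{
    if (unlink(path) != 0)
        return -1;
    return demo_write_file(path, text);
}

/*
 * Return 1 if a descriptor opened before the replacement still reads the
 * old contents afterwards, 0 if it does not, and -1 on setup failure.
 */
int demo_old_reader_survives(const char *path)
{
    char buf[16] = { 0 };

    if (demo_write_file(path, "old") != 0)
        return -1;

    int fd = open(path, O_RDONLY);      /* plays the running process */
    if (fd < 0)
        return -1;

    if (demo_replace_by_unlink(path, "new") != 0) {
        close(fd);
        return -1;
    }

    ssize_t n = read(fd, buf, sizeof (buf) - 1);
    close(fd);
    unlink(path);                       /* clean up the scratch file */

    return (n == 3 && strcmp(buf, "old") == 0) ? 1 : 0;
}
```

Had the replacement been done by truncating and rewriting the file in place, the same descriptor would have seen the new bytes, which is the kind of surprise that crashes a process with the old file mapped.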

I now work at Sun, on the Solaris linker and various related parts. One day the subject of this ld behavior came up, reminding me of my old questions. So, I started asking around. It turns out that no one at Sun is particularly fond of it either. The reasons for not having changed it boil down to the fact that it rarely causes problems, a desire to maximize compatibility with the past, and the fact that there are bigger things to worry about (almost everything).

The compatibility issue boils down to what happens if the output file has a link count that is greater than 1, as described earlier. It's rare for an output file to have multiple links, and even more rare for the makefile to not remove and relink all of the other names. Especially rare, since other operating systems (like Linux and the Macintosh) do break those links, and most Unix software is targeted at multiple operating systems (almost always including Linux).

As described in an earlier blog entry, I've been using Solaris Zones to improve our linker testing. The basic approach I want to use for this is to replace the linker files in the test zone with symbolic links that point at the corresponding files in my development workspace. This means that every time I use make(1) to rebuild the linker components, the results will immediately become available in the zone. There is one problem with this idea: The way ld handles existing output files means that any processes using the old linker components will suddenly crash when I type make. Depending on what is running in the zone, this could destabilize and/or cripple the zone.

Of course, we could modify our makefiles to unlink the existing files before running ld, but this is tedious and error prone. I realized that it would be better to change the linker. At last, time to settle an old score.

The first step in making a change like this is to write and submit an ARC (Architectural Review Committee) case describing the change. We take backward compatibility very seriously here, so a change like this requires formal consideration and approval. I submitted PSARC 2006/353 a couple of weeks ago. It generated some discussion pro and con (the change causes existing hard links to be broken), but in the end, the undesirability of causing programs to die in uncontrolled ways, and the bonus of Linux compatibility won the day and the proposal was approved. I put back the code change earlier today, and it will appear in build 43, and soon after in OpenSolaris. Appropriately enough, more typing was involved in proposing the ARC case than in changing the code. The actual code change is very small.

It was a good learning experience for me to go through the ARC process, and gratifying to finally take my revenge on "my old friend". And of course, it will allow us to do better testing of the linker subsystem.



Happy Birthday To OpenSolaris

OpenSolaris 1 Year Anniversary

OpenSolaris is 1 year old today! Congratulations, and thanks to everyone that made it possible!

I've used SunOS/Solaris for ~20 years now, for all of the old school reasons: Solaris has long set the gold standard for things like reliability, scalability, observability, debuggability, backward compatibility, and standards compliance. Over those years, I've written code that had to run on most versions of Unix as well as VMS, Windows, and the Macintosh. Sun was always the place where I did initial work, and any possible debugging as well. The reason is that SunOS has always had superior tools for observing and diagnosing what your code is doing. However, the user visible functionality of the OS (as with all Unix OS's) had not changed much in quite some time. Solaris still had the above advantages, but the various Unix variants were beginning to look and feel increasingly similar.

In contrast, the current Solaris is a real jump forward. While other operating systems have been working away at duplicating things that Solaris has had forever, Solaris itself has moved forward with next generation features that no one else will have for quite a while (DTrace, ZFS, Zones, FMA, SMF, etc). Others have some subset of these abilities, but no one has the complete package, and no one else has it in such a simple and fully integrated fashion. Once you experience these things, you'll find running systems without them to be limiting (evoking a quieter, simpler era). At the same time, Solaris has moved aggressively to the 64-bit X86 PC platform, making it possible for many people to run it without first having to buy new hardware, and has moved to adopt a modern desktop and make other needed improvements (many of which involve open source code from other projects).

The fact that Solaris has joined the community of open source Unix operating systems makes the above even more compelling. The energy level surrounding Solaris today is high. The quality of Solaris as a desktop OS is rapidly improving, thanks to its adoption of other open source software. At the same time, the OpenSolaris code is there for others to examine, modify, and even to port to other systems. This would be a great thing for Unix in general, and is something that we would love to see happen (and it is: See DTrace and ZFS). And, Solaris still leads the pack with those boring old school virtues.

So on this 1st anniversary of OpenSolaris: If you are a Unix fan who is not familiar with the modern Solaris OS, you should give it a try. You will find it interesting and eye opening. Rest easy: It has a real open source license, and what has been given cannot be taken away. You probably already have a PC lying around that you can use. And the price (free) is exceedingly reasonable.



Wednesday Apr 19, 2006

Testing Critical System Components Without Turning Your System Into A Brick


I work on the Solaris runtime linker. One thing you quickly learn in this work is that a small mistake can bring down your system. The runtime linker is an extreme example, but the same thing is true of other core system components. Modifying core parts of a running machine can be a risky game.

There is a time honored strategy for dealing with this:

  1. Be careful

  2. Minimize your exposure

  3. Deal with it

That's not much of a safety net. "Deal with it" can sometimes be a slow, painful process. There has been little improvement in this area for years. Now however, the advent of Solaris zones and ZFS gives us some powerful new options that can make recovery easy and instantaneous.

I'm going to talk about how to do that here. Much of this discussion is linker-specific detail and background motivation, followed by some general comments about zones and ZFS. This is followed by an actual example of how I built a test environment. Please feel free to skip right to the example.

The Linker Testing Problem

The runtime linker lies at the heart of nearly every process on a Solaris (or any modern) operating system. This makes modifying and testing it problematic: If you install a runtime linker that has an error, your entire running system will instantly break. Since everything on the system is dynamically linked, this isn't a casual breakage. Rather, your system is unable to execute anything. Recovery may require booting a memory resident copy of the OS from the installation CD, restoring working files, and rebooting. One moment, you were focusing on solving a problem. Now, your attention is yanked away and focused on system recovery. Once you get your system back, you have to go back and try to remember what you were thinking when it broke. Your productivity is shot.

If you work on something as central as the runtime linker, the odds of never breaking it are stacked against you. That it is going to happen is a simple mathematical fact. If you are careful and methodical, it will happen less. Unless you shy away from doing valuable work though, it is an ever present possibility. Since we can't eliminate the possibility, we have to accommodate it.

Our main strategy in this game is avoidance. To avoid this problem, most linker testing is done against a local copy of the linker components, without installing them in the standard locations (/bin, /usr, etc). We do this by manipulating the command PATH, and setting linker environment variables. We may install them later, if testing seems to show that they are OK, and if we believe that there may be system interactions we need to guard against. The good news is that this approach usually works, and can be managed with a reasonable amount of effort. It has some limits though:

  • It is complicated to get 100% right. Sometimes we end up using linker components from the standard places instead of the ones we think we're using.

  • It isn't a 100% accurate representation of how anyone else uses the linker. It is a close approximation, but not perfect.

  • It isn't efficient: Since it isn't a perfect test, we often have to do a real install and test again before we know for sure that things are OK.

An ideal approach would not require so much human judgement. It would reflect the user experience exactly.

Doing Better

How would the ideal testing environment for the runtime linker subsystem look? Here's my wish list:

  • Keeps your system in a completely stock and vanilla configuration, without altering system files.

  • Lets you modify any system file, from the point of view of the software you're testing, without violating the previous point. I want to use my test linker subsystem, installed in the standard places, without having it affect anything except my tests.

  • Quick and easy to set up.

  • Upgrades to the operating system should be quick and easy to do.

  • Lets you access and use your development environment from within the test environment exactly as you can outside, and with the same filesystem paths.

  • Testing mistakes can't take down the system.

  • Self healing: After I mess it up, I want to be able to reset to a working vanilla state with a simple command, and without having to remember what I changed.

  • Run on a standard system with only modest extra resources. More disk space is OK. Using another computer isn't.

In years past, you might have tried to construct something like this by constructing an image of the system in a test area, and then applying the chroot(2) system call (probably in the form of the /usr/sbin/chroot command) to make it appear like the real system. This can work, but it has some big drawbacks:

  • Requires a lot of work to set up.

  • Requires a lot of ongoing work to track system changes from release to release (which in the Solaris group, come every 2 weeks).

  • Requires a lot of work to keep stable and correct.

If you've ever set up an anonymous FTP server, you know how much manual work is involved. Imagine doing it for an entire OS and then having to keep up with daily changes. People have tried this, but it ends up being too much ongoing effort to manage and maintain. No one minds doing work up front, but afterwards, we really want a system that can take care of itself. The goal is to save time and effort, not to simply redirect it.

What we really need is a sort of super chroot: One that sets itself up and doesn't demand so much from us. Something that creates a virtual instance of the machine we're using, that is created automatically by the system, so we don't have to construct a Solaris root filesystem manually. Something easy to create, lightweight in operation, that is essentially identical to our installed system, and something that we can play with, wreck, and reset with little or no overhead.

Before Solaris 10, this would have been a tall order. As of Solaris 10, it is standard stuff: We can build it using Solaris Zones in conjunction with ZFS. Not only can we do it, but it's easy.

Zones

You can read more about Solaris Zones at the OpenSolaris website. Quoting from that page:

Zones are an operating system abstraction for partitioning systems, allowing multiple applications to run in isolation from each other on the same physical hardware. This isolation prevents processes running within a zone from monitoring or affecting processes running in other zones, seeing each other's data, or manipulating the underlying hardware. Zones also provide an abstraction layer that separates applications from physical attributes of the machine on which they are deployed, such as physical device paths and network interface names.

The main instance of Solaris running on your system is known as the global zone. A given system is allowed to have 1 or more non-global zones: These are virtualized copies of the main system that present the programs running within them with the illusion that they are running on separate and distinct systems. Zones come in two flavors: Sparse, and Whole Root. The difference is that a sparse zone uses loopback mounts to re-use key filesystems (/, /usr, /platform) from the main system in a readonly mode, whereas a whole root zone makes a complete copy of these filesystems. A whole root zone allows you to install different Solaris packages into its root filesystem, which is what we need for linker testing.

Zones are extremely easy to set up. They provide us with the ability to create an environment in which we can install and test the runtime linker without running the risk of taking down the machine. The worst that can happen is that we wreck the zone, but the damage will always be contained. A non-global zone cannot damage the global zone. If we do damage the non-global zone, it is easy to halt, destroy, and recreate it, all without any need to halt or reboot the main system.

This is a big leap forward, and by itself, would be worth using. However, setting up a whole root zone can take half an hour. To really make this approach win, we need to be able to reset a zone much faster than that. We can do this using ZFS.

ZFS

ZFS is a powerful new filesystem that is making its debut with Solaris 10, Update 2. ZFS makes it cheap and easy to create an arbitrary number of filesystems on any Solaris system, from small desktop machines to large servers.

ZFS has a snapshot facility that allows you to capture a readonly copy of a filesystem (even really large ones) in a matter of seconds. A snapshot requires almost no disk space initially, as all the file data blocks are shared. As the main filesystem is modified, the snapshot continues to reference the old data blocks. Once a snapshot has been made, ZFS allows you to roll back the main filesystem to the state captured by the snapshot. This operation is trivial to do, and essentially instantaneous.

Each Solaris whole root zone has a copy of the main system filesystems, kept at a location you specify when you create the zone. ZFS therefore presents us with a solution to the problem of how to rapidly and easily reset a linker test environment:

  • Create a ZFS filesystem to hold the zone data.

  • Create a zone in the ZFS filesystem, and do the initial login that finishes the Solaris "install" process for the zone.

  • Halt the zone.

  • Capture a ZFS snapshot of the filesystem.

  • Restart the zone.

Once this is done, you can use the zone for testing, as if it were an especially convenient second system that can see the same files your real system can see. When you need to reset it:

  • Halt the zone.

  • Revert to the zone snapshot.

  • Restart the zone.

I created such a zone using my Ultra 20 desktop system. Here are the commands that perform the reset:

% zoneadm -z test halt
% zfs rollback -r tank/test@baseline
% zoneadm -z test boot

These commands take 7 seconds from start to finish! Speed is not going to be a problem.

Building It

Let's walk through the construction of the linker test zone I have on my desktop system. The first step is to get a ZFS filesystem set up. My system has an extra disk (/dev/rdsk/c2d0) that I will use for this purpose. It doesn't have any pre-existing data on it that I care about saving, so I will dedicate the entire thing for ZFS to use.

I need to create a ZFS pool, and then create a filesystem within it. Following the ZFS examples I've seen, I'm going to name the pool "tank". I will mount the filesystem on /zone/test.

root# zpool create -f tank c2d0
root# zfs create tank/test
root# zfs set mountpoint=/zone/test tank/test 
root# df -k /zone/test
Filesystem            kbytes    used   avail capacity  Mounted on
tank/test            241369088      98 241368561     1%    /zone/test

That took 4 seconds.

The next step is to create the zone within the ZFS filesystem now mounted at /zone/test. In order to allow installing linker components into the root and usr filesystems, this needs to be a whole root zone. At Sun, all of our home directories are automounted via NFS, with NIS used to manage user authentication. So, I'll need to give my zone a network interface. This interface needs a unique IP address, different from the main system address. I do most of my development work in a local filesystem (/export/home), so I'll arrange for it to appear within my test zone as well. My host is named rtld, so I will name my test zone rtld-test. Summarizing these decisions:

Hostname: rtld
Zone Hostname: rtld-test
Zone IP: 172.20.25.173
Zone Type: Whole Root
Zone Path: /zone/test
Loopback Mounts: /export/home

Let's create a test zone:

root# chmod 700 /zone/test
root# zonecfg -z test
test: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:test> create -b
zonecfg:test> set autoboot=true
zonecfg:test> set zonepath=/zone/test
zonecfg:test> add net
zonecfg:test:net> set address=172.20.25.173
zonecfg:test:net> set physical=nge0
zonecfg:test:net> end
zonecfg:test> add fs
zonecfg:test:fs> set dir=/export/home
zonecfg:test:fs> set special=/export/home
zonecfg:test:fs> set type=lofs
zonecfg:test:fs> end
zonecfg:test> info
zonename: test
zonepath: /zone/test
autoboot: true
pool: 
fs:
        dir: /export/home
        special: /export/home
        raw not specified
        type: lofs
        options: []
net:
        address: 172.20.25.173
        physical: nge0
zonecfg:test> verify
zonecfg:test> commit
zonecfg:test> exit
root# zoneadm -z test verify
root# zoneadm -z test install
Preparing to install zone .
Creating list of files to copy from the global zone.
Copying <120628> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <974> packages on the zone.
Initialized <974> packages on zone.
Zone <test> is initialized.
Installation of these packages generated errors: 
Installation of these packages generated warnings: 
The file  contains a log of the zone installation.
root# zoneadm list -cv
  ID NAME             STATUS         PATH
   0 global           running        /
   - test             installed      /zone/test

This part of the process takes about 12 minutes on this system.
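
For repeat builds, the interactive zonecfg session above can be captured as a command file. What follows is a minimal sketch rather than part of the original session: it assumes zonecfg's -f option for reading subcommands from a file, and the helper name zone_cfg_commands and the path /tmp/test.cfg are made up for illustration. The values are the ones from the summary above.

```shell
#!/bin/sh
# Emit the zonecfg subcommands used in the interactive session above.
# zone_cfg_commands is a hypothetical helper name, not a Solaris utility.
zone_cfg_commands() {
    cat <<'EOF'
create -b
set autoboot=true
set zonepath=/zone/test
add net
set address=172.20.25.173
set physical=nge0
end
add fs
set dir=/export/home
set special=/export/home
set type=lofs
end
verify
commit
EOF
}

# Intended use, as root in the global zone (not run here):
#   zone_cfg_commands > /tmp/test.cfg
#   zonecfg -z test -f /tmp/test.cfg
```

Keeping the configuration in a file makes it easy to recreate the zone later, or to create a second zone by changing only the zone path and IP address.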

The output from "zoneadm list" shows us that the zone is installed, but not running. To get it running for the first time, we must boot it, and then log in to the console and finish the installation process. This is the same process a standard Solaris system goes through after the initial reboot: smf initializes, you are asked some questions about hostname, root password, and name service, and then the system is ready for use. Before using it though, we halt it and capture a snapshot, for later use.

root# zoneadm -z test boot
root# zlogin -C test
[Install Output omitted]
~.
[Connection to zone 'test' console closed]
root# zoneadm list -cv
  ID NAME             STATUS         PATH
   0 global           running        /
  12 test             running        /zone/test
root# zoneadm -z test halt
root# zfs snapshot tank/test@baseline
root# zoneadm -z test boot

This last part takes about 5 minutes. In total, we can go from no ZFS and no zone, to having a usable linker test zone in well under half an hour. This story is going to get even better: there are "zone cloning" features coming soon that will greatly reduce the time it takes to create new zones.

Using It

Now that we have a test zone, let's experiment with it. In this section, I will be using two separate terminal windows, one logged into the global zone, and one logged into the test zone. In the transcripts below, the shell prompt shows which is which: rtld is the global zone, and rtld-test is the test zone. In this example, I remove the runtime linker (/lib/ld.so.1) and demonstrate that (1) this does not take down the system, and (2) it is easily and quickly repaired.

The first step is to log into the test zone. The uname command is used as a trivial way to show that both zones are operating normally.

ali@rtld% uname
SunOS
ali@rtld% ssh rtld-test
Password: passwd
ali@rtld-test% uname
SunOS

Now, let's simulate the situation in which a bad runtime linker is installed, by simply removing it.

ali@rtld-test% su -
Password: passwd
root@rtld-test# rm /lib/ld.so.1

That is normally all it takes to wreck a working system. However, the global zone is unharmed, and my system continues to run.

ali@rtld% uname
SunOS
root@rtld-test# uname
uname: Cannot find /lib/ld.so.1
Killed
root@rtld-test# ls
ls: Cannot find /usr/lib/ld.so.1
Killed

Since my system is still running, I can quickly repair the broken test environment. In this simple case, I can repair the damage by copying /lib/ld.so.1 from my global zone into the test zone.

ali@rtld% su -
Password: passwd
root@rtld# cp /lib/ld.so.1 /zone/test/root/lib
root@rtld-test# uname
SunOS

That's fine if the damage is simple, but what if the situation is more complex? The ld.so.1 from the global zone may be incompatible with other changes made to the linker components in the test zone, in which case the above fix will not work. When that happens, we want the ability to quickly reset the test zone to a known good state. First, let's break it again:

root@rtld-test# rm /lib/ld.so.1
root@rtld-test# uname
uname: Cannot find /lib/ld.so.1
Killed

This time, we'll reset the test zone, from the global one:

root@rtld# zoneadm -z test halt
root@rtld# zfs rollback -r tank/test@baseline
root@rtld# zoneadm -z test boot
# Connection to rtld-test closed by remote host.
Connection to rtld-test closed.

The test zone is back, good as new and ready for use:

ali@rtld% ssh rtld-test 
Password: passwd
ali@rtld-test(501)% uname
SunOS
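
The reset sequence shown earlier (halt, rollback, boot) is a natural candidate for a small wrapper. The sketch below is one way to do it, under stated assumptions: reset_zone, run, and the DRY_RUN variable are hypothetical names introduced here, and the only real commands involved are the three zoneadm and zfs invocations already used in this post.

```shell
#!/bin/sh
# run: execute a command, or just print it when DRY_RUN=1.
# Both run and DRY_RUN are illustrative names, not part of Solaris.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reset_zone: halt the zone, roll its filesystem back to a snapshot,
# and boot it again. Stops at the first step that fails.
reset_zone() {
    zone=$1        # zone name, for example: test
    snapshot=$2    # ZFS snapshot, for example: tank/test@baseline
    run zoneadm -z "$zone" halt &&
    run zfs rollback -r "$snapshot" &&
    run zoneadm -z "$zone" boot
}

# Preview the commands without touching the system:
#   DRY_RUN=1 reset_zone test tank/test@baseline
```

Run as root in the global zone without DRY_RUN set, this performs the same three steps as the transcript. Chaining the steps with && means a failed halt or rollback prevents the zone from being booted in an unknown state.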

Conclusions: A Rising Tide Floats All Boats

I've started to regard the test zone the same way I view an Etch-A-Sketch®: I can play with it, mess it up, learn from the results, and then I give it a quick shake and it is ready to go again. This is cool stuff!

Before doing this experiment, I had never used zones or ZFS. I had heard about them, but nothing more. I sat down on Friday morning to see what I could do with them, and I had the solution described here working within 8 hours of effort. It's hard to beat that return on investment. The result is a real leap forward in terms of how easily and completely we can test our work.

Zones and ZFS provide new and powerful abilities not available elsewhere. They're included in the standard system for free, and not as expensive add-ons. They're simple and easy to use. Once you play with them, I am confident that you'll start seeing uses for them in your daily work. Happy hunting!
