Ali Bahrami

Tuesday Oct 21, 2008

GNU Hash ELF Sections

The ELF object format is used by several different operating systems, all sharing a common basic design, but each sporting their own extensions. One of the nice aspects of ELF's design is that it facilitates this, defining a common core, as well as reserving space for each implementation to define its own additions. Those of us in the ELF community try to stay current with each other, as a source of new ideas and inspiration, in order to avoid reinventing the wheel, and out of curiosity.

Recently, the GNU linker developers added a new style of hash section to their objects, and I've been learning about them in fine detail. Having done the work, it only makes sense to write it down and share it.

This posting describes the layout and interpretation of GNU ELF hash sections. The GNU hash section provides a new hash section for ELF objects, with better performance than the original SYSV hash.

The information here was gathered from the following sources:

I did not look at the GNU binutils source code to gather this information.

If you spot an error, send me email (First.Last@Sun.COM, where First and Last are replaced with my name) and I'll fix it.

Hash Function

The GNU hash function uses the DJB (Daniel J Bernstein) hash, which Professor Bernstein originally posted to the comp.lang.c usenet newsgroup years ago:
uint32_t
dl_new_hash (const char *s)
{
        uint32_t h = 5381;

        for (unsigned char c = *s; c != '\0'; c = *++s)
                h = h * 33 + c;

        return h;
}
If you search for this algorithm online, you will find that the hash expression
h = h * 33 + c
is frequently coded as
h = ((h << 5) + h) + c
These are equivalent statements, replacing integer multiplication with a presumably cheaper shift and add operation. Whether this is actually cheaper depends on the CPU used. There used to be a significant difference with older machines, but integer multiplication on modern machines is very fast.

Another variation of this algorithm clips the returned value to 31-bits:

return h & 0x7fffffff;
However, GNU hash uses the full unsigned 32-bit result.

The GNU binutils implementation utilizes the uint_fast32_t type for computing the hash. This type is defined to be the fastest available integer machine type capable of representing at least 32-bits on the current system. As it might be implemented using a wider type, the result is explicitly clipped to a 32-bit unsigned value before being returned.

static uint_fast32_t
dl_new_hash (const char *s)
{
        uint_fast32_t h = 5381;

        for (unsigned char c = *s; c != '\0'; c = *++s)
                h = h * 33 + c;

        return h & 0xffffffff;
}

Dynamic Section Requirements

GNU hash sections place some additional sorting requirements on the contents of the dynamic symbol table. This is in contrast to standard SVR4 hash sections, which allow the symbols to be placed in any order allowed by the ELF standard.

A standard SVR4 hash table includes all of the symbols in the dynamic symbol table. However, some of these symbols will never be looked up via the hash table:

  • LOCAL symbols, unless referenced by a relocation (on some architectures)

  • FILE symbols

  • For sharable objects: All UNDEF symbols

  • For executables: Any UNDEF symbol that are not referenced by a PLT.

  • The special index 0 symbol (a special case of UNDEF)
Omitting these symbols from the hash table section has no impact on correctness, and will result in less hash table congestion, shorter hash chains, and correspondingly better hash performance.

With GNU hash, the dynamic symbol table is divided into two parts. The first part receives the symbols that can be omitted from the hash table. GNU hash does not impose any specific order for the symbols in this part of the dynamic symbol table.

The second half of the dynamic symbol table receives the symbols that are accessible from the hash table. These symbols are required to be sorted by increasing (hash % nbuckets) value, using the GNU hash function described above. The number of hash buckets (nbuckets) is recorded in the GNU hash section, described below. As a result, symbols which will be found in a single hash chain are adjacent in memory, leading to better cache performance.

GNU_HASH section

A GNU_HASH section consists of four separate parts, in order:
Header
An array of (4) 32-bit words providing section parameters:

nbuckets
The number of hash buckets

symndx
The dynamic symbol table has dynsymcount symbols. symndx is the index of the first symbol in the dynamic symbol table that is to be accessible via the hash table. This implies that there are (dynsymcount - symndx) symbols accessible via the hash table.

maskwords
The number of ELFCLASS sized words in the Bloom filter portion of the hash table section. This value must be non-zero, and must be a power of 2 as explained below.

Note that a value of 0 could be interpreted to mean that no Bloom filter is present in the hash section. However, the GNU linkers do not do this — the GNU hash section always includes at least 1 mask word.

shift2
A shift count used by the Bloom filter.

Bloom Filter
GNU_HASH sections contain a Bloom filter. This filter is used to rapidly reject attempts to look up symbols that do not exist in the object. The Bloom filter words are 32-bit for ELFCLASS32 objects, and 64-bit for ELFCLASS64.

Hash Buckets
An array of nbuckets 32-bit hash buckets

Hash Values
An array of (dynsymcount - symndx) 32-bit hash chain values, one per symbol from the second part of the dynamic symbol table.
The header, hash buckets, and hash chains are always 32-bit words, while the Bloom filter words can be 32 or 64-bit depending on the class of object. This means that ELFCLASS32 GNU_HASH sections consist of only 32-bit words, and therefore have their section header sh_entsize field set to 4. ELFCLASS64 GNU_HASH sections have mixed element size, and therefore set sh_entsize to 0.

Assuming that the hash section is aligned properly for accessing ELFCLASS sized words, the (4) 32-bit words directly before the Bloom filter ensure that the filter mask words are always aligned properly and can be accessed directly in memory.

Bloom Filter
GNU hash sections include a Bloom filter. Bloom filters are probabilistic, meaning that false positives are possible, but false negatives are not. This filter is used to rapidly reject symbol names that will not be found in the object, avoiding the more expensive hash lookup operation. Normally, only one object in a process has the given symbol. Skipping the hash operation for all the other objects can greatly speed symbol lookup.

The filter consists of maskwords words, each of which is 32-bits (ELFCLASS32) or 64-bits (ELFCLASS64) depending on the class of object. In the following discussion, C will be used to stand for the size of one mask word in bits. Collectively, the mask words make up a logical bitmask of (C * maskwords) bits.

GNU hash uses a k=2 Bloom filter, which means that two independent hash functions are used for each symbol. The Bloom filter reference contains the following statement:

The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields.
The hash function used by the GNU hash has this property. This fact is leveraged to produce both hash functions required by the Bloom filter from the single hash function described above:
H1 = dl_new_hash(name);
H2 = H1 >> shift2;
As discussed above, the link editor determines how many mask words to use (maskwords) and the amount by which the first hash result is right shifted to produce the second (shift2). The more mask words used, the larger the hash section, but the lower the rate of false positives. I was told in private email that the GNU linker primarily derives shift2 from the base 2 log of the number of symbols entered into the hash table (dynsymcount - symndx), with a minimum value of 5 for ELFCLASS32, and 6 for ELFCLASS64. These values are explicitly recorded in the hash section in order to give the link editor the flexibility to change them in the future should better heuristics emerge.

The Bloom filter mask sets one bit for each of the two hash values. Based on the Bloom filter reference, the word containing each bit, and the bit to set would be calculated as:

N1 = ((H1 / C) % maskwords);
N2 = ((H2 / C) % maskwords);

B1 = H1 % C;
B2 = H2 % C;
To populate the bits when building the filter:
bloom[N1] |= (1 << B1);
bloom[N2] |= (1 << B2);
and to later test the filter:
(bloom[N1] & (1 << B1)) && (bloom[N2] & (1 << B2))
The GNU hash deviates from the above in a significant way. Rather than calculate N1 and N2 separately, a single mask word is used, corresponding to N1 above. This is a conscious decision by the GNU hash developers to optimize cache behavior:
This makes the 2 hash functions for the Bloom filter more dependent than when two different Ns were used, but in our measurements still has very good ratio of rejecting lookups that should be rejected, and is much more cache friendly. It is very important that we touch as few cache lines during lookup as possible.

Therefore, in the GNU hash, the single mask word is actually calculated as:

N = ((H1 / C) % maskwords);
The two bits set in the Bloom filter mask word N are:
BITMASK = (1 << (H1 % C)) | (1 << (H2 % C));
The link-editor sets these bits as
bloom[N] |= BITMASK;
And the test used by the runtime linker is:
(bloom[N1] & BITMASK) == BITMASK;
Bit Fiddling: Why maskwords Is Required To Be A Power Of Two
In general, a Bloom filter can be constructed using an arbitrary number of words. However, as noted above, the GNU hash calls for maskwords to be a power of 2. This requirement allows the modulo operation
N = ((H1 / C) % maskwords);
to instead be written as a simple mask operation:
N = ((H1 / C) & (maskwords - 1));
Note that (maskwords - 1) can be precomputed once
MASKWORDS_BITMASK = maskwords - 1;
and then used for every hash:
N = ((H1 / C) & MASKWORDS_BITMASK);
Bloom Filter Special Cases
Bloom filters have a pair of interesting special cases:
  • When a Bloom filter has all of its bits set, all tests result in a True (accept) value. The GNU linker takes advantage of this by issuing a single word Bloom filter with all bits set when it wants to "disable" the Bloom filter. The filter is still there, and is still used, at minimal overhead, but it lets everything through.

  • A Bloom filter with no bits set will return False in all cases. This case is relatively rare in ELF files, as an object that exports no symbols has limited application. However, sometimes objects are built this way, relying on init/fini sections to cause code from the object to run.
Hash Buckets
Following the Bloom filter are nbuckets 32-bit words. Each word N in the array contains the lowest index into the dynamic symbol table for which:
(dl_new_hash(symname) % nbuckets) == N
Since the dynamic symbol table is sorted by the same key (hash % nbuckets), dynsym[buckets[N]] is the first symbol in the hash chain that will contain the desired symbol if it exists.

A bucket element will contain the index 0 if there is no symbol in the hash table for the given value of N. As index 0 of the dynsym is a reserved value, this index cannot occur for a valid symbol, and is therefore non-ambiguous.

Hash Values
The final part of a GNU hash section contains (dynsymcount - symndx) 32-bit words, one entry for each symbol in the second part of the dynamic symbol table. The top 31 bits of each word contains the top 31 bits of the corresponding symbol's hash value. The least significant bit is used as a stopper bit. It is set to 1 when a symbol is the last symbol in a given hash chain:
lsb = (N == dynsymcount - 1) ||
  ((dl_new_hash (name[N]) % nbuckets)
   != (dl_new_hash (name[N + 1]) % nbuckets))

hashval = (dl_new_hash(name) & ~1) | lsb;

Symbol Lookup Using GNU Hash

The following shows how a symbol might be looked up in an object using the GNU hash section. We will assume the existence of an in memory record containing the information needed:
typedef struct {
        const char      *os_dynstr;      /* Dynamic string table */
        Sym             *os_dynsym;      /* Dynamic symbol table */
        Word            os_nbuckets;     /* # hash buckets */
        Word            os_symndx;       /* Index of 1st dynsym in hash */
        Word            os_maskwords_bm; /* # Bloom filter words, minus 1 */
        Word            os_shift2;       /* Bloom filter hash shift */
        const BloomWord *os_bloom;       /* Bloom filter words */
        const Word      *os_buckets;     /* Hash buckets */
        const Word      *os_hashval;     /* Hash value array */
} obj_state_t;
To simplify matters, we elide the details of handling different ELF classes. In the above, Word is a 32-bit unsigned value, BloomWord is either 32 or 64-bit depending in the ELFCLASS, and Sym is either Elf32_Sym or Elf64_Sym.

Given a variable containing the above information for an object, the following pseudo code returns a pointer to the desired symbol if it exists in the object, and NULL otherwise.

Sym *
symhash(obj_state_t *os, const char *symname)
{
        Word            h1, h2;
        Word            n;
        Word            bitmask; 
        const Sym       *sym;
        Word            *hashval;

        /*
         * Hash the name, generate the "second" hash
         * from it for the Bloom filter.
         */
        h1 = dl_new_hash(symname);
        h2 = h2 >> os->os_shift2;

        /* Test against the Bloom filter */
        n = (h1 / sizeof (BloomWord)) & os->os_maskwords_bm;
        bitmask = (1 << (h1 % sizeof (BloomWord))) |
            (1 << (h2 % sizeof (BloomWord)));
        if ((os->os_bloom[n] & bitmask) != bitmask)
                return (NULL);

        /* Locate the hash chain, and corresponding hash value element */
        n = os->os_buckets[h1 % os->os_nbuckets];
        if (n == 0)    /* Empty hash chain, symbol not present */
                return (NULL);
        sym = &os->os_dynsym[n];
        hashval = &os->os_hashval[n - os->os_symndx];

        /*
         * Walk the chain until the symbol is found or
         * the chain is exhausted.
         */
        for (h1 =& ~1; ; sym++) {
                h2 = *hashval++;

                /*
                 * Compare the strings to verify match. Note that
                 * a given hash chain can contain different hash
                 * values. We'd get the right result by comparing every
                 * string, but comparing the hash values first lets us
                 * screen obvious mismatches at very low cost and avoid
                 * the relatively expensive string compare.
                 *
		 * We are intentionally glossing over some things here:
	         *
		 *    -  We could test sym->st_name for 0, which indicates
		 *	 a NULL string, and avoid a strcmp() in that case.
		 *
                 *    - The real runtime linker must also take symbol
		 * 	versioning into account. This is an orthogonal
		 *	issue to hashing, and is left out of this
		 *	example for simplicity.
		 *
		 * A real implementation might test (h1 == (h2 & ~1), and then
		 * call a (possibly inline) function to validate the rest.
                 */
                if ((h1 == (h2 & ~1)) &&
                    !strcmp(symname, os->os_dynstr + sym->st_name))
                        return (sym);

                /* Done if at end of chain */
                if (h2 & 1)
                        break;
        }

        /* This object does not have the desired symbol */
        return (NULL);
}

The Cost Of ELF Symbol Hashing

Linux and Solaris are ELF cousins. Both systems use the same base ELF standard, though we've both made our own OS specific extensions over the years. Recently, the GNU linker folks made some changes to how Linux does symbol hashing in their ELF objects. I've been learning about what they've done, and that in turn caused me to consider the bigger picture of ELF hashing overhead.

History and Trends

In an ELF based system, the runtime linker looks up symbols within executables and sharable objects. The available symbols are found in the dynamic symbol table. The lookup is done using a pre-computed hash table section associated with that symbol table. The SVR4 ELF ABI defines the layout of these hash sections, and the hash function. These are all original ELF features, dating back to the late 1980's when ELF was designed. This aspect of ELF has been static since that time.

The runtime linker maintains a list of objects currently mapped into a programs address space. To find a desired symbol, it starts at the head of this list and searches each one in turn using a hash lookup operation, until the symbol is found (success) or the end of the list is hit (failure).

The per-symbol cost of symbol lookup hashing grows with:

  • The number of objects in a process.

  • The number of symbols in those objects.

  • The length of the symbol names. In particular, the C++ language encodes class names and argument types into symbol names, which makes the strings extremely long. Even worse from a string comparison point of view is that these strings tend to have long shared suffixes, differing only in the final characters.
These items all grow over time. In the days when ELF was originally defined, a process had one or two sharable objects at most, and a few hundred symbols, all of which had short names. More recently, it has become common for a program to have tens of sharable objects, and thousands of symbols, and names have grown. This trend shows no sign of abating. It is easy to imagine a near future with hundreds of objects and hundreds of thousands of symbols. In fact, we have already seen a program with almost 1000 objects.

In the past, when the list of objects in a program was very short, it was not necessary to search many objects before a given symbol was found. Most symbols were in the first, often only, object. Hence, most hash operations were successful, and hash overhead was not a significant concern. In modern programs however, failed hash operations dominate. It is usually necessary to perform one or more failing hash operations before getting to the object that has a desired symbol. The more objects, the larger the percentage of failing hashes.

Unless somehow mitigated, the per-symbol cost of hashing will continue to grow as programs grow larger, possibly to a level where the user can feel the effect. There are several ways in which this overhead can be reduced:

  1. Eliminate unnecessary symbols

  2. Eliminate the O(n) search of the object list to reduce the amount of hashing required.

  3. Make each hashing operation cost less.

Eliminate Unnecessary Symbols

Most objects contain global symbols that are for the use of the object, but not intended to be accessed by outside code. One common example would be that of a helper function called within multiple files that are compiled into a sharable object. Such a function needs to be global within the object so that it can be called from multiple files. However, it is not intended to be something the users of the library call directly.

ELF versioning allow symbols to be assigned to versions, thereby creating interfaces that can be maintained for backward compatibility as the object evolves. In a version definition, the scope reduction operator can be used to tell the link-editor that any global symbols not explicitly assigned to a version should not be visible outside the object. For example, the following exports the symbol foo from a version named VERSION_1.0, while reducing the scope of all other global symbols to local:

VERSION_1.0
{
	 global:
		 foo;
	 local:
		 *;
};
Some language compilers offer symbol visibility keywords that have similar effect.

Eliminating unnecessary symbols from the hash table reduces the average length of the hash chains, and speeds the lookup. In addition, hiding unnecessary symbols from object's external interface prevents accidental interposition in which a given library exports a function intended to only be for its own use, and that symbol ends up interposing on a symbol of the same name in a different object.

Eliminate O(n) Object Searching

There are different strategies employed in modern operating systems to minimize the need for symbol hashing:
Prelink
The Linux operating system emphasizes their prelink system. Prelink analyzes all the executables on the system, and the libraries they depend on. Non-conflicting addresses are assigned for all the libraries, and then the executables and libraries are pre-bound to be loaded at those addresses. In effect, the work normally done by the runtime linker for each object at process startup is instead done once. The runtime linker, recognizing a prelinked object at process startup, will map it to its assigned location, and immediately return rather than do the usual relocation processing.

Prelinking is a complex per-system operation, though Linux does a good job of hiding and managing this complexity. Changes to the system can require the prelink operation to be redone.

Prelinking pares the cost of ELF runtime processing to the absolute minimum. As part of that, it completely eliminates symbol hashing at startup. A further advantage is that it does not require any changes to the objects in question. All of the complexity is kept in the prelink system itself.

Prelinking will not prevent hashing from occurring if:

  • The object is loaded via dlopen().
  • The object is not prelinked
  • If the object is prelinked for one system, but is then used by another, either as a copy or via a network filesystem. Prelinking is a per-system concept. The system can use an object prelinked for a different system, but the benefits of prelinking may be lost.
  • The objects on the system have changed, altering the prelinking needed. In this case, prelinking can be recomputed.

Direct Binding
The Solaris operating system employs a combination of direct binding, preferably in conjunction with lazy loading and lazy binding. In non-direct binding, an object records the symbols it requires from other objects. In direct binding, each such symbol also records the object expected to provide that symbol.

Direct bindings were developed with several goals in mind:

  1. They harden large programs against the effects of unintended symbol interposition, when the wrong library provides a given symbol due to the order in which libraries get loaded. This makes it easier to reliably build larger programs.

  2. They allow multiple instances of a named interface to be referenced by different objects, further expanding the ability of complex programs to interface with multiple, perhaps incompatible, external interfaces. For instance, when a program depends on many objects that are developed independently by others, sometimes those objects have dependencies on different incompatible versions of some other object.

  3. Performance: At runtime, the runtime linker uses the direct binding information to skip the O(n) search of all the objects in the process, and instead to go directly to the right library for each symbol, carrying out a single successful hash operation to locate it within the object.

The the first two items were the driving issues that led to the invention of direct bindings, reflecting real problems we were encountering in the field.

Lazy binding, which is the default, reduces the amount of hashing done even further, by delaying the binding operation until the first use of a given symbol by the program. If a given symbol is not needed during the lifetime of a process, it is never looked up, and the hashing that would otherwise occur is eliminated. Most programs do not use all of the symbols available to them in a given execution, so lazy binding amplifies the benefits of direct binding.

Direct bindings do not eliminate hashing, but they eliminate the unproductive failure cases, leaving a single successful hash lookup per symbol. In a sense, they bring us back to the performance of early ELF systems where everything was found in a single library and most hash operations were successful. Direct bindings have a relatively simple implementation. They work with dlopen(), and across network filesystems. However, many objects do not use direct bindings. Unlike prelink, direct bindings require explicit action to be taken when the object is built. Converting an existing non-direct object to use them can require some analysis to be carried out by the programmer, to enable desired interposition that direct bindings would otherwise prevent.

Prelinking and direct bindings are very different solutions that attack the problem of hashing overhead along different axis. Systems will see greatly reduced symbol hash overhead with either strategy, but symbol hashing will still occur. As such, the cost of the hash lookup operation is still of interest.

Making Symbol Hashing Cheaper

The existing SVR4 ELF hash function and hash table sections are fixed. Improving their performance requires introducing a new hash function and hash section. Recently, the developers of the GNU linkers have done this. These new hash sections can coexist in an object with the older SVR4 hash sections, allowing for backward compatibility with older systems. The GNU hash reduces hash overhead in the following ways:
  • An improved hash function is used, to better spread the hash keys and reduce hash chain length.

  • The dynamic symbol table is sorted into hash order, such that memory access tends to be adjacent and monotonically increasing, which can help cache behavior. (Note that the Solaris link-editor does a similar sort, although the specific details differ.)

  • The dynamic symbol table contains some symbols that are never looked up by via the hash table. These symbols are left out of the hash table, reducing its size and hash chain lengths.

  • Perhaps most significantly, the GNU hash section includes a Bloom filter. This filter is used prior to hash lookup to determine if the symbol is found in the object or not.
Bloom filters are used to test whether a given item is part of a given set. They use a compact bitmask representation, and are fast to query. Bloom filters, are probabilistic: False positives are possible, but false negatives are not. The size of the bitmask used to represent the filter, and the quality and number of hash functions determine the rate of false positives.

A Bloom filter is never wrong when it says an item does not exist in a set. Applied to ELF hash tables, the runtime linker can test a symbol name against a Bloom filter, and if rejected, immediately move on to the next object. Since most symbol lookups end in failure as discussed above, this has the potential to eliminate a large number of unnecessary hash operations. It is possible for a Bloom filter to incorrectly indicate that an item exists in the set when it doesn't. When this happens, it is caught by the following hash lookup, so correct linker operation is not affected. Since false positives are rare, this does not significantly affect performance.

It is interesting to note that the use of a Bloom filter makes a successful symbol lookup slightly more expensive than it would be otherwise. The hash table alone can be used to determine if a symbol exists or not, so the Bloom filter is pure overhead in the success case. Despite that, the Bloom filter is a winning optimization, because it is very cheap to compute compared to a hash table lookup, and because most hash operations are against libraries that end up not containing the desired symbol.

It is also worth noting that the runtime linker is free to skip the use of the Bloom filter and proceed directly to the hash lookup operation. This may be worthwhile in situations where the runtime linker has other reasons to believe the lookup will be successful. In particular, if the runtime linker is directed to a given object via a direct binding, the odds of a failed symbol lookup should be zero, so there is no need to screen before the hash lookup.

Conclusions

Tweaking the performance of an existing algorithm has its place, particularly within the inner loops of a busy program. However, the big wins are usually the result of using a better algorithm. In Solaris, direct bindings have been our algorithmic approach to reducing hash overhead. We've made a conscious effort to develop and deploy direct bindings in preference to making improvements to the traditional hash process. We've been pleased with the results — direct bindings are faster and combined with their other attributes, allow programs to scale larger.

Nonetheless, the GNU hash embodies several worthwhile ideas:

  1. Better hash function
  2. Doesn't put symbols that the runtime linker doesn't care about into the hash table.
  3. Bloom filter cheaply eliminates most hash operations.
In particular, the use of a Bloom filter to cheaply filter away unproductive hash operations stands out as a very interesting idea.

It is clear that hash overhead can be measured and reduced. By and large however, we have not found ELF hash overhead to be a hot spot for real programs. It seems that the other things that go on in a program generally dominate the hit that comes from ELF symbol hashing. The core Solaris OS is now built using direct bindings, which has allowed us to harden and simplify aspects of its design. Interestingly enough, measurements do not reflect a significant resulting change in system performance. This does not prove that hash overhead should be ignored, but it does tend to suggest that one has to look towards extremely large non-direct bound programs in order to demonstrate the issue.

It's all food for thought, and perhaps time for some experimentation and measurement.


Archives
Links
Referrers