Ali Bahrami |
Tuesday Oct 21, 2008
GNU Hash ELF Sections The ELF object format is used by several different operating systems, all sharing a common basic design, but each sporting their own extensions. One of the nice aspects of ELF's design is that it facilitates this, defining a common core, as well as reserving space for each implementation to define its own additions. Those of us in the ELF community try to stay current with each other, as a source of new ideas and inspiration, in order to avoid reinventing the wheel, and out of curiosity. Recently, the GNU linker developers added a new style of hash section to their objects, and I've been learning about them in fine detail. Having done the work, it only makes sense to write it down and share it. This posting describes the layout and interpretation of GNU ELF hash sections. The GNU hash section provides a new hash section for ELF objects, with better performance than the original SYSV hash. The information here was gathered from the following sources:
I did not look at the GNU binutils source code to gather this information. If you spot an error, send me email (First.Last@Sun.COM, where First and Last are replaced with my name) and I'll fix it. Hash FunctionThe GNU hash function uses the DJB (Daniel J Bernstein) hash, which Professor Bernstein originally posted to the comp.lang.c usenet newsgroup years ago:
If you search for this algorithm online, you will find that the
hash expression
is frequently coded ash = h * 33 + c These are equivalent statements, replacing integer multiplication with a presumably cheaper shift and add operation. Whether this is actually cheaper depends on the CPU used. There used to be a significant difference with older machines, but integer multiplication on modern machines is very fast.h = ((h << 5) + h) + c Another variation of this algorithm clips the returned value to 31-bits: However, GNU hash uses the full unsigned 32-bit result.return h & 0x7fffffff; The GNU binutils implementation utilizes the uint_fast32_t type for computing the hash. This type is defined to be the fastest available integer machine type capable of representing at least 32-bits on the current system. As it might be implemented using a wider type, the result is explicitly clipped to a 32-bit unsigned value before being returned.
Dynamic Section RequirementsGNU hash sections place some additional sorting requirements on the contents of the dynamic symbol table. This is in contrast to standard SVR4 hash sections, which allow the symbols to be placed in any order allowed by the ELF standard.A standard SVR4 hash table includes all of the symbols in the dynamic symbol table. However, some of these symbols will never be looked up via the hash table:
With GNU hash, the dynamic symbol table is divided into two parts. The first part receives the symbols that can be omitted from the hash table. GNU hash does not impose any specific order for the symbols in this part of the dynamic symbol table. The second half of the dynamic symbol table receives the symbols that are accessible from the hash table. These symbols are required to be sorted by increasing (hash % nbuckets) value, using the GNU hash function described above. The number of hash buckets (nbuckets) is recorded in the GNU hash section, described below. As a result, symbols which will be found in a single hash chain are adjacent in memory, leading to better cache performance. GNU_HASH sectionA GNU_HASH section consists of four separate parts, in order:The header, hash buckets, and hash chains are always 32-bit words, while the Bloom filter words can be 32 or 64-bit depending on the class of object. This means that ELFCLASS32 GNU_HASH sections consist of only 32-bit words, and therefore have their section header sh_entsize field set to 4. ELFCLASS64 GNU_HASH sections have mixed element size, and therefore set sh_entsize to 0. Assuming that the hash section is aligned properly for accessing ELFCLASS sized words, the (4) 32-bit words directly before the Bloom filter ensure that the filter mask words are always aligned properly and can be accessed directly in memory. Bloom FilterGNU hash sections include a Bloom filter. Bloom filters are probabilistic, meaning that false positives are possible, but false negatives are not. This filter is used to rapidly reject symbol names that will not be found in the object, avoiding the more expensive hash lookup operation. Normally, only one object in a process has the given symbol. Skipping the hash operation for all the other objects can greatly speed symbol lookup.The filter consists of maskwords words, each of which is 32-bits (ELFCLASS32) or 64-bits (ELFCLASS64) depending on the class of object. In the following discussion, C will be used to stand for the size of one mask word in bits. Collectively, the mask words make up a logical bitmask of (C * maskwords) bits. GNU hash uses a k=2 Bloom filter, which means that two independent hash functions are used for each symbol. The Bloom filter reference contains the following statement: The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields.The hash function used by the GNU hash has this property. This fact is leveraged to produce both hash functions required by the Bloom filter from the single hash function described above: As discussed above, the link editor determines how many mask words to use (maskwords) and the amount by which the first hash result is right shifted to produce the second (shift2). The more mask words used, the larger the hash section, but the lower the rate of false positives. I was told in private email that the GNU linker primarily derives shift2 from the base 2 log of the number of symbols entered into the hash table (dynsymcount - symndx), with a minimum value of 5 for ELFCLASS32, and 6 for ELFCLASS64. These values are explicitly recorded in the hash section in order to give the link editor the flexibility to change them in the future should better heuristics emerge.H1 = dl_new_hash(name); H2 = H1 >> shift2; The Bloom filter mask sets one bit for each of the two hash values. Based on the Bloom filter reference, the word containing each bit, and the bit to set would be calculated as: To populate the bits when building the filter:N1 = ((H1 / C) % maskwords); N2 = ((H2 / C) % maskwords); B1 = H1 % C; B2 = H2 % C; and to later test the filter:bloom[N1] |= (1 << B1); bloom[N2] |= (1 << B2); The GNU hash deviates from the above in a significant way. Rather than calculate N1 and N2 separately, a single mask word is used, corresponding to N1 above. This is a conscious decision by the GNU hash developers to optimize cache behavior:(bloom[N1] & (1 << B1)) && (bloom[N2] & (1 << B2)) This makes the 2 hash functions for the Bloom filter more dependent than when two different Ns were used, but in our measurements still has very good ratio of rejecting lookups that should be rejected, and is much more cache friendly. It is very important that we touch as few cache lines during lookup as possible. Therefore, in the GNU hash, the single mask word is actually calculated as: The two bits set in the Bloom filter mask word N are:N = ((H1 / C) % maskwords); The link-editor sets these bits asBITMASK = (1 << (H1 % C)) | (1 << (H2 % C)); And the test used by the runtime linker is:bloom[N] |= BITMASK; (bloom[N1] & BITMASK) == BITMASK; Bit Fiddling: Why maskwords Is Required To Be A Power Of TwoIn general, a Bloom filter can be constructed using an arbitrary number of words. However, as noted above, the GNU hash calls for maskwords to be a power of 2. This requirement allows the modulo operationto instead be written as a simple mask operation:N = ((H1 / C) % maskwords); Note that (maskwords - 1) can be precomputed onceN = ((H1 / C) & (maskwords - 1)); and then used for every hash:MASKWORDS_BITMASK = maskwords - 1; N = ((H1 / C) & MASKWORDS_BITMASK); Bloom Filter Special CasesBloom filters have a pair of interesting special cases:
Hash BucketsFollowing the Bloom filter are nbuckets 32-bit words. Each word N in the array contains the lowest index into the dynamic symbol table for which:Since the dynamic symbol table is sorted by the same key (hash % nbuckets), dynsym[buckets[N]] is the first symbol in the hash chain that will contain the desired symbol if it exists.(dl_new_hash(symname) % nbuckets) == N A bucket element will contain the index 0 if there is no symbol in the hash table for the given value of N. As index 0 of the dynsym is a reserved value, this index cannot occur for a valid symbol, and is therefore non-ambiguous. Hash ValuesThe final part of a GNU hash section contains (dynsymcount - symndx) 32-bit words, one entry for each symbol in the second part of the dynamic symbol table. The top 31 bits of each word contains the top 31 bits of the corresponding symbol's hash value. The least significant bit is used as a stopper bit. It is set to 1 when a symbol is the last symbol in a given hash chain:lsb = (N == dynsymcount - 1) || ((dl_new_hash (name[N]) % nbuckets) != (dl_new_hash (name[N + 1]) % nbuckets)) hashval = (dl_new_hash(name) & ~1) | lsb; Symbol Lookup Using GNU HashThe following shows how a symbol might be looked up in an object using the GNU hash section. We will assume the existence of an in memory record containing the information needed:
To simplify matters, we elide the details of handling
different ELF classes.
In the above, Word is a 32-bit unsigned value, BloomWord is
either 32 or 64-bit depending in the ELFCLASS, and Sym is
either Elf32_Sym or Elf64_Sym.
Given a variable containing the above information for an object, the following pseudo code returns a pointer to the desired symbol if it exists in the object, and NULL otherwise.
Posted at 03:01PM Oct 21, 2008 by ali in Sun | Comments[1] The Cost Of ELF Symbol Hashing Linux and Solaris are ELF cousins. Both systems use the same base ELF standard, though we've both made our own OS specific extensions over the years. Recently, the GNU linker folks made some changes to how Linux does symbol hashing in their ELF objects. I've been learning about what they've done, and that in turn caused me to consider the bigger picture of ELF hashing overhead. History and TrendsIn an ELF based system, the runtime linker looks up symbols within executables and sharable objects. The available symbols are found in the dynamic symbol table. The lookup is done using a pre-computed hash table section associated with that symbol table. The SVR4 ELF ABI defines the layout of these hash sections, and the hash function. These are all original ELF features, dating back to the late 1980's when ELF was designed. This aspect of ELF has been static since that time.The runtime linker maintains a list of objects currently mapped into a programs address space. To find a desired symbol, it starts at the head of this list and searches each one in turn using a hash lookup operation, until the symbol is found (success) or the end of the list is hit (failure). The per-symbol cost of symbol lookup hashing grows with:
In the past, when the list of objects in a program was very short, it was not necessary to search many objects before a given symbol was found. Most symbols were in the first, often only, object. Hence, most hash operations were successful, and hash overhead was not a significant concern. In modern programs however, failed hash operations dominate. It is usually necessary to perform one or more failing hash operations before getting to the object that has a desired symbol. The more objects, the larger the percentage of failing hashes. Unless somehow mitigated, the per-symbol cost of hashing will continue to grow as programs grow larger, possibly to a level where the user can feel the effect. There are several ways in which this overhead can be reduced:
Eliminate Unnecessary SymbolsMost objects contain global symbols that are for the use of the object, but not intended to be accessed by outside code. One common example would be that of a helper function called within multiple files that are compiled into a sharable object. Such a function needs to be global within the object so that it can be called from multiple files. However, it is not intended to be something the users of the library call directly.ELF versioning allow symbols to be assigned to versions, thereby creating interfaces that can be maintained for backward compatibility as the object evolves. In a version definition, the scope reduction operator can be used to tell the link-editor that any global symbols not explicitly assigned to a version should not be visible outside the object. For example, the following exports the symbol foo from a version named VERSION_1.0, while reducing the scope of all other global symbols to local:
Some language compilers offer symbol visibility keywords that
have similar effect.
Eliminating unnecessary symbols from the hash table reduces the average length of the hash chains, and speeds the lookup. In addition, hiding unnecessary symbols from object's external interface prevents accidental interposition in which a given library exports a function intended to only be for its own use, and that symbol ends up interposing on a symbol of the same name in a different object. Eliminate O(n) Object SearchingThere are different strategies employed in modern operating systems to minimize the need for symbol hashing:Prelinking and direct bindings are very different solutions that attack the problem of hashing overhead along different axis. Systems will see greatly reduced symbol hash overhead with either strategy, but symbol hashing will still occur. As such, the cost of the hash lookup operation is still of interest. Making Symbol Hashing CheaperThe existing SVR4 ELF hash function and hash table sections are fixed. Improving their performance requires introducing a new hash function and hash section. Recently, the developers of the GNU linkers have done this. These new hash sections can coexist in an object with the older SVR4 hash sections, allowing for backward compatibility with older systems. The GNU hash reduces hash overhead in the following ways:
A Bloom filter is never wrong when it says an item does not exist in a set. Applied to ELF hash tables, the runtime linker can test a symbol name against a Bloom filter, and if rejected, immediately move on to the next object. Since most symbol lookups end in failure as discussed above, this has the potential to eliminate a large number of unnecessary hash operations. It is possible for a Bloom filter to incorrectly indicate that an item exists in the set when it doesn't. When this happens, it is caught by the following hash lookup, so correct linker operation is not affected. Since false positives are rare, this does not significantly affect performance. It is interesting to note that the use of a Bloom filter makes a successful symbol lookup slightly more expensive than it would be otherwise. The hash table alone can be used to determine if a symbol exists or not, so the Bloom filter is pure overhead in the success case. Despite that, the Bloom filter is a winning optimization, because it is very cheap to compute compared to a hash table lookup, and because most hash operations are against libraries that end up not containing the desired symbol. It is also worth noting that the runtime linker is free to skip the use of the Bloom filter and proceed directly to the hash lookup operation. This may be worthwhile in situations where the runtime linker has other reasons to believe the lookup will be successful. In particular, if the runtime linker is directed to a given object via a direct binding, the odds of a failed symbol lookup should be zero, so there is no need to screen before the hash lookup. ConclusionsTweaking the performance of an existing algorithm has its place, particularly within the inner loops of a busy program. However, the big wins are usually the result of using a better algorithm. In Solaris, direct bindings have been our algorithmic approach to reducing hash overhead. We've made a conscious effort to develop and deploy direct bindings in preference to making improvements to the traditional hash process. We've been pleased with the results direct bindings are faster and combined with their other attributes, allow programs to scale larger.Nonetheless, the GNU hash embodies several worthwhile ideas:
It is clear that hash overhead can be measured and reduced. By and large however, we have not found ELF hash overhead to be a hot spot for real programs. It seems that the other things that go on in a program generally dominate the hit that comes from ELF symbol hashing. The core Solaris OS is now built using direct bindings, which has allowed us to harden and simplify aspects of its design. Interestingly enough, measurements do not reflect a significant resulting change in system performance. This does not prove that hash overhead should be ignored, but it does tend to suggest that one has to look towards extremely large non-direct bound programs in order to demonstrate the issue. It's all food for thought, and perhaps time for some experimentation and measurement.
Posted at 02:49PM Oct 21, 2008 by ali in Sun |
Wednesday Mar 19, 2008
ld Is Now A Cross Link-Editor Until yesterday, ld, the Solaris link-editor, was a native linker. This means that it was only able to link objects for the same machine that the linker was running on. To link sparc objects required the use of of a sparc machine, and x86 objects required an x86 system. With the integration of PSARC 2008/179 cross link-editorinto Solaris Nevada build 87, the Solaris ld became a cross link-editor. Now, ld running on any type of hardware can link objects for any of the systems Solaris supports. This is currently sparc and x86, but there are people playing with OpenSolaris on other hardware too, so who knows what we might end up with? The user interface to this new capability is a simple extension of ld's traditional behavior. Traditionally, ld establishes the class (whether the object is 32 or 64-bit) of the object being generated by examining the first ELF object processed from the command line. In the case where an object is being built solely from an archive library or a mapfile, the -64 command line option is available to explicitly override this default. We have extended this to also determine the machine type of the object to produce:
Of course, it's a rare program that doesn't link against at least one system library. You're going to need libc, if nothing else. To do a successful cross-link, you'll need to have an image of the root filesystem for a system of the target type. There are many ways to do this. For testing purposes, I used a sparc and x86 system, using NFS to allow each system to see the root filesystem of the other. Even though we now have a cross link-editor, we expect that the vast majority of links will be native, for the machine running the linker. We decided to pursue cross linking anyway, for two reasons:
A cross link-editor is a significant step, but is of little use unless you also have a cross-compiler and assembler. The GNU gcc compiler can be built as a cross compiler, and that should be very helpful for OpenSolaris ports. However, the Sun Studio compilers are native compilers, so it will still be awhile before I can use my amd64 desktop to build a sparc version of Solaris as part of work at Sun. We've taken a first step. It will be interesting to see what follows.
Posted at 05:56PM Mar 19, 2008 by ali in Sun | Comments[2]
Friday Nov 02, 2007
Avoiding LD_LIBRARY_PATH: The Options With the introduction of the elfedit utility into Solaris, we have a new answer to the age old question of how to avoid everyones favorite way to get into trouble, the LD_LIBRARY_PATH environment variable. This seems like an appropriate time to revisit this topic. LD_LIBRARY_PATH Seems Useful. What's the Problem?The problem is that LD_LIBRARY_PATH is a crude tool, and cannot be easily targeted at a problem program without also hitting other innocent programs. Sometimes this overspray is harmless (it costs some time, but doesn't break anything). Other times, it causes a program to link to the wrong version of something, and that program dies in mysterious ways.Historically, inappropriate use of LD_LIBRARY_PATH might be the #1 one way to get yourself into trouble in an ELF environment. In particular, people who redistribute binaries with instructions for their users to set LD_LIBRARY_PATH in their shell startup scripts are unleashing forces beyond their control. Experience tells us that such use is destined to end badly. This subject has been written about many times by many people. My colleague Rod Evans wrote about this ( LD_LIBRARY_PATH - just say no) for one of his first blog entries. If you need additional convincing on this point, here are some suggested Google searches you might want to try: LD_LIBRARY_PATH problem LD_LIBRARY_PATH bad LD_LIBRARY_PATH evil LD_LIBRARY_PATH darkest hell If LD_LIBRARY_PATH is so bad, why does its use persist? Simply because it is the option of last resort, used when everything else has failed. We probably can't eliminate it, but we should strive to reduce its use to the bare minimum. How to Use, and How To Avoid Using LD_LIBRARY_PATHThe best way to use LD_LIBRARY_PATH is interactively, as a short term aid for testing or development. A developer might use it to point his test program at an alternative version of a library. Beyond that, the less you use it, the better off you'll be. With that in mind, here is a list of ways to avoid LD_LIBRARY_PATH. The items are ordered from best to worst, with the best option right at the top:
Posted at 05:10PM Nov 02, 2007 by ali in Sun | Comments[2] Introducing elfedit: A Tool For Modifying Existing ELF Objects Back in June, I wrote about changes we've recently made to Solaris ELF objects that allow their runpaths to be modified without having to rebuild the object. In that posting, I alluded to work that I was then doing when I said "Eventually, Solaris will ship with a standard utility for modifying runpaths". I am happy to say that this has come to pass. I recently integrated /usr/bin/elfedit into build 75 of Solaris Nevada with: elfedit can indeed modify the runpath in an object, but it is considerably more general than that. elfedit is a tool for examining and modifying the ELF metadata that resides within ELF objects. It can be used as a batch mode tool from shell scripts, makefiles, etc, or as an interactive tool, for examining and exploring objects. elfedit has a modular design, and ships with a set of standard modules for performing common edits. This design makes it easy to add new functionality by adding additional modules.PSARC 2007/509 elfedit 6234471 need a way to edit ELF objects Prior to elfedit, making these sorts of modifications required the user to write a program, usually in C using libelf. elfedit raises the programming level required to do this significantly. Many operations can be done using existing elfedit commands. For those that cannot, it is far easier to write an elfedit module to add the ability than it is to write a standalone program. We envision elfedit being used to solve the following sorts of problems: [Small Fixups]Every elfedit module contains documentation for the commands it provides. This information is displayed using the built in help command, in a format that is based on that of Solaris manpages. The help strings in the standard elfedit modules supplied with Solaris are internationalized using the same i18n mechanisms employed by the rest of the linker software found under usr/src/cmd/sgs. Hence, all elfedit modules supplied by Sun will have complete documentation, and will support the necessary language locales.To correct minor issues in a built file that cannot be easily rebuilt, or for which sources are not available.[Better Way To Support Specialized Rarely Used Features] As with any program that changes the contents of an ELF file, changes to an object by elfedit will invalidate any pre-existing elfsign signature. Assuming the changes are understood and acceptable to the signing authority, such objects will need to be signed after the edits are done. Modular Design And Extensibilityelfedit has a modular design, reflecting our own experience with dynamic linking, and influenced heavily by the design of mdb, the modular debugger.The elfedit program contains the code that handles the details of reading objects, executing commands to modify them, and saving the results. Very little of the code that performs the actual edits is found in elfedit itself. Rather, the commands exist in modules, which are sharable objects with a well defined elfedit-specific interface. elfedit loads needed modules on demand when a command from the module is executed. These modules are self contained, and include their own documentation in a standard format that elfedit can display using its help command. The module forms a namespace for the commands that it supplies. Each module delivers a set of commands, focused on related functionality. A command is specified by combining the module and command names with a colon (:) delimiter, with no intervening whitespace. For example, dyn:runpath refers to the runpath command provided by the dyn module. Module names must be unique. The command names within a given module are unique within that module, but the same command names can be used in more than one module. For example, most modules contain a command named 'dump', which is used to provide an elfedump-style view of the data. We have adopted the following general rules of thumb for naming modules and commands:
Give 'Em Enough Ropeelfedit is a tool for linker development and testing. As such, it follows the Unix tradition of doing what it's told, without a lot of noise. This is great if you are doing linker research & development, or testing. We commonly need to intentionally set ELF metadata to undefined or even "wrong" values. However, it follows that elfedit won't prevent you from making nonsensical or otherwise incorrect changes to your ELF objects.For example, X86 objects have little endian byteorder (ELFDATA2LSB):
We can change the e_ident[EI_DATA] field in the ELF header from
its proper value to ELFDATA2MSB, which
reverses the byte order advertised by the program and makes it appear
to be big endian:
The file command sees the change that we made. However, we haven't really
created a big endian X86 binary by changing what it advertises. We now
have a little endian binary that is lying about what it contains.
And of course, there is no such thing as a big endian X86 hardware, so
if we had created such a binary, it wouldn't be runnable anywhere.
It should come as
no surprise that the system doesn't know what to do with our modified
ls binary:
% /tmp/badls /tmp/badls: cannot execute This is really nothing to be worried about. If you are using elfedit's low level operations that allow arbitrary changes to individual ELF fields, then you need to know enough about the ELF format to make these changes properly. Most people will use elfedit for the high level operations such as changing runpaths. The high level operations are safe, and do not require expert knowledge to use. If you are making those low level changes, the Solaris Linkers and Libraries Guide can be very helpful. Learning Moreelfedit is a standard part of the Solaris development branch, the code that will eventually ship from Sun as the next version of Solaris. It is also available as part of OpenSolaris. It is not part of Solaris 10 or earlier Solaris releases. If you are using a recent Solaris distribution, such as Solaris Express Developer Edition then elfedit should be already present on your system.The elfedit(1) manpage describes the utility in more detail, and gives three examples that should be of general interest:
Posted at 10:34AM Nov 02, 2007 by ali in Sun |
Thursday Nov 01, 2007
What Are Fake ELF Section Headers? I'd like to take a moment to explain an unusual feature we added to elfdump last summer. The -P option tells elfdump to ignore the section headers in the file (if any) and to instead generate a "fake" set from the program headers. So, what are fake section headers, and why would you want them? Earlier this year, there was an "incident" in which a previously unknown hole in the Solaris telnet daemon was used by a worm. As soon as we got a copy of this worm, we tried to examine it with elfdump to see what we might learn:
That's it everything that elfdump could tell us about this
object. We sure didn't learn much from that!
If you look at the ELF header, you'll see that our bad guy has set the e_shnum, e_shoff, and e_shentsize fields to zero. These fields are used to locate the section headers for an ELF object. The section headers in turn contain the information needed to look deeper into an object. Section headers are not used to run a program, only to examine it. Zeroing them is a crude, but effective way to obscure what's inside. ELF objects are just files after all, and anyone with write access can modify them. It's not unheard of to modify an ELF object using a binary capable editor like emacs. Fortunately, the design of ELF makes it difficult to actually hide what an object calls from other sharable objects. And since the system call stubs are all located in libc, you can't hide the system calls your code makes. Here is one way to look inside: Unlike some other object systems, ELF inter-object references are always looked up by name at runtime. The runtime linker hashes the name and looks it up on the first reference. So if you want to actually call something outside of your own object, you have to call it by its real name. This information is located via the object's program headers, and unlike the section headers, they need to be reasonably accurate for the object to work.% ldd -r -e LD_DEBUG=bindings zoneadmd 2>&1 | fgrep "binding file=zoneadmd" 04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `__deregister_frame_info_bases' 04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `__register_frame_info_bases' 04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `_Jv_RegisterClasses' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_environ' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `__iob' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_cleanup' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `atexit' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `__fpstart' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `exit' 04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `__deregister_frame_info_bases' 04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `_Jv_RegisterClasses' 04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `__register_frame_info_bases' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `getenv' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `setsid' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `printf' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fflush' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `signal' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `pthread_create' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `pthread_join' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `malloc' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `pthread_cancel' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `free' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `close' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `snprintf' 04992: binding file=zoneadmd to file=/lib/libnsl.so.1: symbol `inet_addr' 04992: binding file=zoneadmd to file=/lib/libnsl.so.1: symbol `gethostbyname' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `bcopy' 04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `ntohl' 04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `socket' 04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `htons' 04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `htonl' 04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `connect' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `___errno' 04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `getsockopt' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fcntl' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `select' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `write' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `read' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strcpy' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `gettimeofday' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strstr' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `mkstemp' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fdopen' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fopen' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `unlink' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fputs' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fputc' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `lseek' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fclose' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fprintf' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fread' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `putc' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_xstat' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_lxstat' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_fxstat' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_xmknod' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `open' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `mmap' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strrchr' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `nanosleep' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fork' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `dup2' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `pipe' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `execve' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `kill' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `waitpid' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `localtime_r' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `utimes' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strchr' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `sscanf' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strtoul' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `rename' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `chmod' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `execl' 04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `lockf' The above experience led us to consider a new feature for elfdump. What if we started with the program headers, and generated a set of "fake" section headers based on the information they contain? Obviously the information available would be reduced in comparison to the real section headers, because the program headers only contain the information needed to run the object. Nonetheless, it would certainly be better than nothing in the case where the section headers are gone. And what about the case where they are present, but we fear that they have been maliciously modified? The information from the "fake" section headers could be compared to that from the actual section headers. As a result of this worm episode and the aftermath, I added the -P option to elfdump last July: With an object that has section headers, fake section headers will not be used unless you explicitly use the -P option. If an object doesn't have any section headers, then elfdump automatically turns on the -P option for you.PSARC 2007/395 Add -P option to elfdump 6530249 elfdump should handle ELF files with no section header table Let's use the new elfdump with fake section headers to examine the telnet worm. I apologize for the length of this output, but the length underscores the point there is a lot of information that we can recover from this damaged object:
I don't expect this feature to get much daily use, but it will
be handy to have it in the forensic toolchest next time we're
scrambling to understand a damaged or malicious object.
Posted at 09:56AM Nov 01, 2007 by ali in Sun |
Tuesday Jun 12, 2007
Changing ELF Runpaths (Code Included) A recent change to Solaris ELF files makes it possible to change the runpath of a dynamic executable or sharable object, something that has not been safely possible up until now. This change 80is currently found in Solaris Nevada (the current development version of Solaris) and in OpenSolaris. It is not yet available in Solaris 10, but in time will appear in the standard shipping Solaris as well. This seems like a good time to talk about runpaths and the business of how the runtime linker finds dependencies. I also provide a small program named rpath that you can use to modify the runpaths in your file (assuming they were linked under Nevada or OpenSolaris). The Runpath ProblemThe runtime linker looks in the following places, in the order listed, to find the sharable objects it loads into a process at startup time:
The above scheme offers a great deal of flexibility, and it usually works well. There is however one notable exception the "Runpath Problem". The problem is that many objects are not built with a correct runpath, and once an object has been built, it has not been possible to change it. It is common to find objects where the runpath is correct on the system the object was built on, but not on the system where it is installed. Usually, we deal with this all too common situation by setting LD_LIBRARY_PATH, or by creating a linker configuration file with crle. Such solutions have serious downsides, as detailed in an earlier blog entry by Rod Evans entitled "LD_LIBRARY_PATH - just say no". Both approaches will cause unrelated programs to look in unnecessary additional directories for their dependencies. At best, this imposes unnecessary overhead on their operation. At worst, they may end up binding to the wrong version of a given library, leading to mysterious and hard to debug failures. The environment variable approach is simply too broad. One important technique that people sometimes use, is to set the environment variables in a wrapper shell script, that may look something like: This is a huge improvement over simply setting LD_CONFIG or LD_LIBRARY_PATH in your shell login config script (.profile, .cshrc, .bashrc, etc), for many reasons:#!/bin/sh # # Run myapp, setting LD_LIBRARY_PATH so it will run LD_LIBRARY_PATH="/this/that/theother:/someplace/else" export LD_LIBRARY_PATH exec /usr/local/myapp
It would be far better to modify the object in question and set a runpath that accurately reflects the actual location of its dependencies. The effect of a runpath is limited to the file that contains it, so this solution does not "bleed through" to unrelated files, and it imposes no unnecessary overhead on the general operation of the system. This would be a superior solution if it were possible. However it hasn't been an option until recently. How Runpaths Are ImplementedEvery dynamic executable contains a dynamic section. This is an array of items which convey the information required by the dynamic linker (ld.so.1) to do its work. If an object has a runpath, there will be a DT_RUNPATH and/or DT_RPATH item in the dynamic section (there is more than one of these for historical reasons). As an example, lets examine crle:
The string (in this case, "$ORIGIN/../lib") is not actually
stored in the dynamic section. Rather, it is contained in the
dynamic string table (.dynstr). The value 0x612 is the offset within
string table at which the desired string starts.
A string table is a section that contains NULL terminated strings, one immediately following the other. To access a given string, you add the offset of the string within the section to the base of the section data area. Consider a string table that contains the names of two variables "var1", and "var2" and a runpath "$ORIGIN/../lib". By ELF convention, string tables always have a 0-length NULL terminated string in the first position. In C language notation, we might declare the contents of the resulting string table section containing these 4 strings as The indexes of the 4 strings in our table are [0], [1], [6], and [11], and any item in the dynamic section or the dynamic symbol table that needs one of these strings will specify it using the appropriate index. An interesting result of the way that string tables are designed is that that every single offset into a string table represents a usable string. Although our intent with the C string above was to represent 4 strings, it actually contains 23 potential strings (26 if you count the duplicate NULL strings), and not just the 4 we intentionally inserted. Listing them by offset, they are:"\0var1\0var2\0$ORIGIN/../lib" This is a very efficient scheme, since each string can appear once in the string table, and multiple ELF items can refer to it. Also, it allows fixed size things, like ELF symbols or dynamic section entries, to efficiently reference variable length strings. There are two things to note, however:[0] "" [1] "var1" [2] "ar1" [3] "r1" [4] "1" [5] "" [6] "var2" [7] "ar2" [8] "r2" [9] "2" [10] "" [11] "$ORIGIN/../lib" [12] "ORIGIN/../lib" [13] "RIGIN/../lib" [14] "IGIN/../lib" [15] "GIN/../lib" [16] "IN/../lib" [17] "N/../lib" [18] "/../lib" [19] "../lib" [20] "./lib" [21] "/lib" [22] "lib" [23] "ib" [24] "b" [25] ""
The options for modifying a runpath in this situation are limited:
As a result, it has not been possible to support the modification of the runpath in an existing object up until recently. Making RoomI recently integrated a change to Solaris Nevada (and OpenSolaris) to add a little unused space to our ELF files, in order to facilitate a limited amount of post-link modification:
This change does two things:
The SUNW_STRPAD entry tells us that the dynamic string table has 512 (0x200)
bytes of unused space available at the end of its data area.
The way this works is very simple: If a file lacks a DT_SUNW_STRPAD dynamic entry, then we know that it is an older file, and that the dynamic string table does not have any extra space. If it does have a DT_SUNW_STRPAD, then its value tells us how much room is available. In this case, we can add the string, modify the DT_RUNPATH items, and reduce the DT_SUNW_STRPAD value by the number of bytes we used. If the value in DT_SUNW_STRPAD is too small for our new string, then we are out of luck and cannot add it. This extra room should help in the vast majority of cases, but as with any such approach, there are limits. We recommend the use of the special $ORIGIN token, both because it is a great way to organize objects, and because it is short. The rpath UtilityEventually, Solaris will ship with a standard utility for modifying runpaths. However, there is no need to wait. I have written an unofficial test program I call 'rpath' that you can download and build. To build rpath, you will need a version of Solaris Nevada newer than build 61, or a recent version of OpenSolaris. To check your system, try:If your grep doesn't find DT_SUNW_STRPAD, your system lacks the necessary support.% grep DT_SUNW_STRPAD /usr/include/sys/link.h #define DT_SUNW_STRPAD 0x60000019 /* # of unused bytes at the */ To build rpath, unpack the compressed tar file and type 'make'. If you are using gcc, first edit the Makefile and uncomment the CC line: rpath is used as follows:% gunzip < rpath.tgz | tar xvpf - % cd rpath % make
Using rpathLet's use rpath to look at its own runpath. We will see that it doesn't have one, something that can be verified using elfdump:
Now, let's add a runpath to it:
Notice that the amount of unused space reported by SUNW_STRPAD has gone down
from 512 (0x200) to 494 (0x1ee) bytes, a reduction of 18 bytes. This
makes sense,
since we added a 17 character string, and we must add a NULL termination.
We can observe the runtime linker looking in 'pointless' and 'runpath' as it loads rpath (note: output is edited for width):
Finally, we'll remove the runpath we just added:
Note that even though the runpath is gone, the amount of available extra
space in the dynamic string section did not go back up from 494 (0x1ee)
to 512 (0x200). Adding
strings is a one way operation. Once they are added, they are permanent.
So even though you now have the ability
to add strings of moderate length, you won't want to do it indiscriminately.
On the plus side, you can always re-add the same runpath back without using any more space:
rpath found that the string 'pointless:runpath' was already in
the string table, so it used it without inserting another copy.
ConclusionsOur best advice has always been that the LD_LIBRARY_PATH environment variable should not be used to work around objects with bad or missing runpaths. It is best to rebuild such objects and set the runpath correctly. This hasn't changed, and you should always do so if you can.The problem with that advice is that there are times when all you have is the object, and no option to rebuild. In that case, LD_LIBRARY_PATH has been a necessary evil (and one that we've been glad to have). With the advent of objects that can have their runpaths modified, we now have a better answer, and the use of LD_LIBRARY_PATH for this purpose should be allowed to slowly fade away.
Posted at 03:49PM Jun 12, 2007 by ali in Sun | Comments[3]
Friday Feb 09, 2007
Which Solaris Files Are Stripped? In my previous blog entry about the new .SUNW_ldynsym sections, I made the following statement: It used to be common practice for system binaries to be stripped in order to save space. However, observability is a central tenet of the Solaris philosophy. Solaris objects and executables are therefore shipped in unstripped form, and have been for many years, in order to support such symbol lookups. It turns out that this is is only partially true... Brian Utterback posted a comment and pointed out that 490 of the 719 ELF binaries in /usr/bin on his Solaris 10 system are stripped. This shows that Solaris binaries have not been unstripped "for many years". I looked at /usr/bin on my desktop system, which is running a fairily recent Nevada build, and found that only 51 of the 815 files there are stripped. It appears that binaries are (mostly) stripped now. What changed between Solaris 10 and today? And, why "mostly"? As I usually do in such situations, I sent mail to my fellow Linker Alien Rod Evans. I asked him for his recollection of what policies were used for stripping Solaris files in the past. Here is a summary of what he told me:
The new .SUNW_ldynsym sections reduce the need for everything to be unstripped, so we may end up relaxing our ON rule if there is a reason to do so. And if the other consolidations continue to strip their files, .SUNW_ldynsym will provide better observability for them. Brian is absolutely right we have not been shipping "Solaris objects and executables" in unstripped form for many years, only sharable libraries! I knew that libraries were unstripped, and that current builds don't strip either binaries, and those two facts misled me. On the plus side, I have a better understanding of the issue now... :-)
Posted at 03:12PM Feb 09, 2007 by ali in Sun |
Wednesday Feb 07, 2007
What Is .SUNW_ldynsym? Solaris ELF files have a new ELF symbol table. The section type is SHT_SUNW_LDYNSYM, and the section is named .SUNW_ldynsym. In the 20+ years in which the ELF standard has been in use, we have only needed two symbol tables (.symtab, and .dynsym) to support linking, so the addition of a third symbol table is a notable event for ELF cognoscenti. Even if you aren't one of those, you may encounter these sections, and wonder what they are for. I hope to explain that here. Solaris has many tools that examine running processes or core files and generate stack traces. For example, consider the following call to pstack(1), made on an Xterm process currently running on my system: In order to show you those function names, pstack (really the libproc library used by pstack) needs to map the addresses of functions on the stack to the ELF symbols that correspond to them. Usually, these symbols come from the symbol table (.symtab). If this symbol table has been removed with the strip(1) program, then the dynamic symbol table (.dynsym) will be used instead. As described in a previous blog entry, the .dynsym contains the subset of global symbols from .symtab that are needed by the runtime linker ld.so.1(1). This fallback allows us to map global functions to their names, but local function symbols are not available. Observability tools like pstack(1) will display the hexidecimal address of such local functions when a name is not available. This is better than nothing, but is not particularly helpful.% pstack 3094 3094: xterm -ls -geometry 80x51+0+175 fef4bea7 pollsys (8046600, 2, 0, 0) fef0767e pselect (5, 8400168, 84001e8, fef95260, 0, 0) + 19e fef0798e select (5, 8400168, 84001e8, 0, 0) + 7e 0805b250 in_put (10, 8416720, 0, fedd561e, 8416720, 0) + 1b0 08059b20 VTparse (84166a8, 8057acc, fed387c5, 8416720, 84166a8, 804688c) + 90 0805d1f1 VTRun (8046a28, 8046870, feffa7c0, 8046808, 8046858, 804685c) + 205 08057add main (0, 80468b4, 80468c8) + 945 08056eee _start (4, 8046a90, 0, 8046a9a, 8046aa4, 0) + 7a It used to be common practice for system binaries to be stripped in order to save space. However, observability is a central tenet of the Solaris philosophy. Solaris objects and executables are therefore shipped in unstripped form, and have been for many years, in order to support such symbol lookups. For the most part, this has been a winning strategy, but there are still issues that come up from time to time:
I tried hard to avoid adding a new symbol table type, and instead tried several experiments in which the additional local function symbols were placed in the dynsym. The reason for wanting this was to avoid having to modify ELF utilities and debuggers to know about a new symbol table. If the added symbols are in the existing .dynsym, those tools will automatically see them, without needing modification. As detailed in the ARC case that I filed for this work (PSARC/2006/526), I tried many different permutations. In every case, I discovered undesirable backward compatibility issues that kept me from using that solution. It turns out that the layout of .dynsym, and the other ELF sections that interact with it, are completely constrained, and there is no 100% backward compatible way to add local symbols to it. ELF was designed from the very beginning to make it possible to introduce new section types with full backward/forward compatibility. You can always safely add a new section, with a moderate amount of care, and it will work. More than anything, this ability to extend ELF accounts for its long life. Given that the .dynsym cannot be extended with local symbols, I made the obvious (in hindsight) decision to to introduce a new section type (SHT_SUNW_LDYNSYM), and add a new symbol table section named .SUNW_ldynsym to every Solaris file that has a .dynsym section. Once that decision was made, the implementation was straightforward, giving me confidence that it was the right way to go. The .SUNW_ldynsym section can be thought of as the local initial part of the .dynsym that we wish to build, but can't. The Solaris linker ( ld(1)) takes care to actually place them side by side, so that the end of the .SUNW_ldynsym section leads directly into the start of the .dynsym section. The runtime linker ( ld.so.1(1)) takes advantage of this to treat them as a single table within the implementation of dladdr(3C). Note that this trick works for applications that mmap(2) the file and access it directly. If you are accessing an ELF file via libelf, as many utilities do, you can't make any assumptions about the relative positions of different sections. As with .dynsym, .SUNW_ldynsym sections are allocable, meaning that they are part of the process text segment. This means that they are available at runtime for dladdr(3C). It also means that they cannot be stripped. Although you cannot strip .SUNW_ldynsym sections, you can prevent them from being generated by ld(1), by using the -znoldynsym linker option. .SUNW_ldynsym sections consume a small amount of additional space. We found that for all of core Solaris (OS and Networking), the increase in size was on the order of 1.4%. This small increase pays off by letting our observability tools do a better job. Furthermore, the presence of .SUNW_ldynsym means that in many cases, you can strip programs that you might not have been willing to strip before. ExampleLet's use the following program to see how .SUNW_ldynsym sections improve Solaris observability of local functions:
Let's build two versions of this program, one containing the .SUNW_ldynsym
section, and one without:
The elfdump(1) command can be used to let us examine the three symbol tables contained in test_ldynsym. There is no need to examine this (large) output too carefully, but there are some interesting facts worth noticing:% cc -Wl,-znoldynsym test.c -o test_noldynsym % cc test.c -o test_ldynsym
Now, we strip the two versions of our program to remove the .symtab
symbol table, and force the system to use the dynamic tables instead:
Running the version without a .SUNW_ldynsym section:% strip test_ldynsym test_noldynsym % file test_ldynsym test_noldynsym test_ldynsym: ELF 32-bit LSB executable 80386 Version 1, dynamically linked, stripped test_noldynsym: ELF 32-bit LSB executable 80386 Version 1, dynamically linked, stripped Our program used the printstack(3C) function to display its own stack. Afterwards, we use the pstack command to view the same data from the core file. In both cases, the top line represents the call to the local function static_func(), a fact that we know from examining the source code, since the number and/or '????????' used to represent it are less than obvious to an external observer.% ./test_noldynsym /home/ali/test/test_noldynsym:0x6ca /home/ali/test/test_noldynsym:main+0xb /home/ali/test/test_noldynsym:_start+0x7a Segmentation Fault (core dumped) % pstack core core 'core' of 5041: ./test_noldynsym 080506d2 ???????? (804692c, 80467a4, 805062a, 1, 80467b0, 80467b8) 080506eb main (1, 80467b0, 80467b8) + b 0805062a _start (1, 8046994, 0, 80469a5, 80469bf, 8046a03) + 7a
Running the version with a .SUNW_ldynsym section, the system is able to put a name to the local function: % ./test_ldynsym /home/ali/test/test_ldynsym:static_func+0xa /home/ali/test/test_ldynsym:main+0xb /home/ali/test/test_ldynsym:_start+0x7a Segmentation Fault (core dumped) % pstack core core 'core' of 5044: ./test_ldynsym 08050802 static_func (8046930, 80467a8, 805075a, 1, 80467b4, 80467bc) + 12 0805081b main (1, 80467b4, 80467bc) + b 0805075a _start (1, 8046998, 0, 80469a7, 80469c1, 8046a05) + 7a ConclusionsSometimes it is the little things that make a difference. I expect that the local dynamic symbol table will provide valuable information in difficult debugging situations where one is examining large stripped programs running in a production environment. The rest of the time, the additional data is small, and will have little or no impact on performance..SUNW_ldynsym sections have been part of the Solaris development (Nevada) builds since last fall, and are also available in OpenSolaris.
Posted at 02:05PM Feb 07, 2007 by ali in Sun | Comments[7]
Saturday Sep 23, 2006
Inside ELF Symbol Tables ELF files are full of things we need to keep track of for later access: Names, addresses, sizes, and intended purpose. Without this information, an ELF file would not be very useful. We would have no way to make sense of the impenetrable mass of octal or hexidecimal numbers. Consider: When you write a program in any language above direct machine code, you give symbolic names to functions and data. The compiler turns these things into code. At the machine level, they are known only by their address (offset within the file) and their size. There are no names in this machine code. How then, can a linker combine multiple object files, or a symbolic debugger know what name to use for a given address? How do we make sense of these files? Symbols are the way we manage this information. Compilers generate symbol information along with code. Linkers manipulate symbols, reading them in, matching them up, and writing them out. Almost everything a linker does is driven by symbols. Finally, debuggers use them to figure out what they are looking at and to provide you with a human readable view of that information. It is therefore a rare ELF file that doesn't have a symbol table. However, most programmers have only an abstract knowledge that symbol tables exist, and that they loosely correspond to their functions and data, and some "other stuff". Protected by the abstractions of compiler, linker, and debugger, we don't usually need to know too much about the details of how a symbol table is organized. I've recently completed a project that required me to learn about symbol tables in great detail. Today, I'm going to write about the symbol tables used by the linker.
.symtab and .dynsymSharable objects and dynamic executables usually have 2 distinct symbol tables, one named ".symtab", and the other ".dynsym". (To make this easier to read, I am going to refer to these without the quotes or leading dot from here on.)The dynsym is a smaller version of the symtab that only contains global symbols. The information found in the dynsym is therefore also found in the symtab, while the reverse is not necessarily true. You are almost certainly wondering why we complicate the world with two symbol tables. Won't one table do? Yes, it would, but at the cost of using more memory than necessary in the running process. To understand how this works, we need to understand the difference between allocable and a non-allocable ELF sections. ELF files contain some sections (e.g. code and data) needed at runtime by the process that uses them. These sections are marked as being allocable. There are many other sections that are needed by linkers, debuggers, and other such tools, but which are not needed by the running program. These are said to be non-allocable. When a linker builds an ELF file, it gathers all of the allocable sections together in one part of the file, and all of the non-allocable sections are placed elsewhere. When the operating system loads the resulting file, only the allocable part is mapped into memory. The non-allocable part remains in the file, but is not visible in memory. strip(1) can be used to remove certain non-allocable sections from a file. This reduces file size by throwing away information. The program is still runnable, but debuggers may be hampered in their ability to tell you what the program is doing. The full symbol table contains a large amount of data needed to link or debug our files, but not needed at runtime. In fact, in the days before sharable libraries and dynamic linking, none of it was needed at runtime. There was a single, non-allocable symbol table (reasonably named "symtab"). When dynamic linking was added to the system, the original designers faced a choice: Make the symtab allocable, or provide a second smaller allocable copy. The symbols needed at runtime are a small subset of the total, so a second symbol table saves virtual memory in the running process. This is an important consideration. Hence, a second symbol table was invented for dynamic linking, and consequently named "dynsym". And so, we have two symbol tables. The symtab contains everything, but it is non-allocable, can be stripped, and has no runtime cost. The dynsym is allocable, and contains the symbols needed to support runtime operation. This division has served us well over the years.
Types Of SymbolsGiven how long symbols have been around, there are surprisingly few types:
In addition to symbol type, each symbols has other attributes:
Symbols Table Layout And ConventionsThe symbols in a symbol table are written in the following order:
Next Time: Augmenting The DynsymOne of the big advantages of Solaris relative to other operating systems is the extensive support for observability: The ability to easily look inside a running program and see what it is doing, in detail. To do that well requires symbols. The symbols in the dynsym may not be enough to do a really good job. For example, to produce a stack trace, we need to take each function address and match it up to its name. If we are looking at a stripped file, or referencing the file from within the process using it via dladdr(3C), we won't have any way to find names for the non-global functions, and will have to resort to displaying hex addresses. This is better than nothing, but not by much. The standard files in a Solaris distribution are not stripped for exactly this reason. However, many files found in production are stripped, and in-process inspection is still limited to the dynsym.Machines are much larger than they used to be. The memory saved by the symtab/dynsym division is still a good thing, but there are times when we wish that the dynsym contained a bit more data. This is harder than it sounds. The layout of dynsym interacts with the rest of an ELF file in ways that are set in stone by years of existing practice. Backward compatibility is a critical feature of Solaris. We try extremely hard to keep those old programs running. And yet, the needs of observability, spearheaded by important new features like DTrace, put pressure on us in the other direction. This discussion is prelude to work I recently did to augment the dynsym to contain local symbols, while preserving full backward compatibility with older versions Solaris. I plan to cover that in a future blog entry. ELF is old, and much of how it works cannot be changed. Its original designers (our "Founding Fathers", as Rod calls them) anticipated that this would be the case, based no doubt on hard experience with earlier systems. The ELF design is therefore uniquely flexible, which explains why it has survived as long as it has. There is always a way to add something new. Sometimes, it takes several tries to find the best way.
Posted at 02:22PM Sep 23, 2006 by ali in Sun |
Friday Sep 22, 2006
What Are "Tentative" Symbols? In the Linker and Libraries Guide, you will encounter discussion of tentative symbols. Based on the name, we might expect that such a symbol is missing something, but what? And why does the linker have to treat them as a special case? A tentative symbol is a symbol used to track a global variable when we don't know its size or initial value. In other words, a symbol for which we have not yet assigned a storage address. They are also known as "common block" symbols, because they have their origins in the implementation of Fortran COMMON blocks. They are historical baggage something that needs to work for compatibility with the past, but also something to avoid in new code. Consider the following two C declarations, made at outer file scope:
int foo;
int foo = 0;
Superficially, these both appear to declare a global variable named
foo with an initial value of 0. However, the first definition is
tentative it will have a value of 0 only if some other file
doesn't explicitly give it a different value. The outcome depends on
what else we link this file against.
To get a better handle on this, let's create two separate C files (t1.c, and t2.c) and experiment:
t1.c t2.c First, we compile and link t1.c by itself, using both forms of declaration for variable foo:
% cc -DTENTATIVE_FOO t1.c; ./a.out
FOO: 0
% cc t1.c; ./a.out
FOO: 0
As expected, they give identical results. Now, lets add t2.c to the mix and see what happens:
% cc -DTENTATIVE_FOO t1.c t2.c; ./a.out
FOO: 12
% cc t1.c t2.c; ./a.out
ld: fatal: symbol `foo' is multiply-defined:
(file t1.o type=OBJT; file t2.o type=OBJT);
ld: fatal: File processing errors. No output written to a.out
./a.out: No such file or directory
As you can see, the two different ways of declaring foo are not
100% equivalent. The tentative declaration of foo in t1.c
took on the value provided by the declaration in t2.c. In contrast,
the linker was unwilling to merge the two non-tentative definitions of
foo that had different values, and instead issued a fatal link error.
Normal C rules say that a variable at file scope without an explicit value is assigned an initial value of 0. However, the existence of other global variables with the same name can change this. The C compiler is only able to see the code in the single file it is compiling, and cannot know how to handle this case. So, it marks it as tentative by giving the symbol a type of STT_COMMON, and leaves it for the linker to figure out. The linker is in a position to match up all of these symbols and merge them into a single instance. The linker has no insight into programmer intent though, and it cannot protect you from doing this by accident. The result usually works, but is fragile. The other declaration form (with a value) causes a non-tentative symbol to be created (STT_OBJECT). In this case, the linker ensures that all the declarations agree. This is the right behavior if you care about robust and scalable code. It is worth noting that you will never see a tentative symbol with local scope. It can only happen to global symbols, because global symbols in different files are the only way you can get this form of aliasing to occur.
HistoryTentative symbols are bad software engineering. A declaration in one file should not be able to alter one in another file. The need for them dates from the early days of the Fortran language. In Fortran, you can declare a common block in more than one file, with each file independently specifying the number, types, and sizes of the variables. The linker then takes all of these blocks, allocates enough space to satisfy the largest one, and makes all them point at that space. This is a very crude form of a union (variant), and is therefore very useful (and dangerous) Fortran technique.Sadly, it didn't stop there. We still sometimes find this practice in C code. Two files will both declare:
int foo;
and then expect that they are both be referring to a single global
variable, with an initial value of 0. This is not necessary.
The proper solution has existed for decades. The safe way to do
the above is to have exactly one
declaration for the global variable in a single file. The other files
that need to access to it use the "extern" keyword
to let the compiler know what is going on. The statement
extern int foo;
is a reference, not a declaration, and it has a single unambiguous
interpretation.
Moral: Don't Do That!Don't use common block binding in your code. It was a bad idea 40 years ago, and it hasn't improved with age. The necessity of backward compatibility is such that compilers and linkers must support common block binding. We are stuck with it, but we don't have to use it.You should always try to minimize or eliminate global variables. However, when you do use them:
Posted at 10:16PM Sep 22, 2006 by ali in Sun |
Wednesday Jun 14, 2006
Settling An Old Score (Linker Division)
For years, I worked on an interactive language used by scientists to do data analysis and visualization. That program makes heavy use of sharable libraries. Solaris was my primary development platform, due to its superior facilities for observing and debugging software, so my usual strategy was to write and debug my code under Solaris, and then move the results to the many other Unix (and VMS, Windows, and Macintosh) platforms that we supported. I became very familiar with one quirk of the Solaris linker that bit me many times over the years. The issue has to do with how ld(1) handles the situation where it needs to replace an existing output file. Historically, this was handled by truncating the existing file and rewriting it in place. This preserves the existing inode, and any hard links that may happen to be pointing to it. However, it has a very bad effect on any running process that happens to be using that file. For example, if the output file is a sharable library, creating a new version while a program that uses it is running will inevitably cause that program to die in an unplanned and unexpected way. If you're not a developer of software that runs for unbounded amounts of time, then you probably have not seen this behavior. If you develop code that does however, then you've almost certainly hit it at some point. In my case, I'd hit it about once a year, usually while multitasking, flipping back and forth between several xterms. Usually, it was obvious what had happened, and I'd quickly recover. Sometimes though, if I was debugging the program for some other reason, the unexpected SIGSEGV or SIGBUS would send me off into the weeds, debugging a mysterious problem in a part of the program unrelated to where I expected the problem to lie. After a minute or so, I'd realize what had happened.. Gack! Bitten again... I cheerfully admit that this is not a big deal. However, I always wondered what reasons those people at Sun had for not changing it. I assumed that there was some subtlety that I was missing. Other platforms handle it in a different way, by having ld unlink the existing file first, and then create a new file under the same name. Any existing processes continue to see the old file, while the new file becomes available to new processes. When the last program with an open file descriptor exits, the Unix kernel removes the old file from the disk, in the standard Unix way. I now work at Sun, on the Solaris linker and various related parts. One day the subject of this ld behavior came up, reminding me of my old questions. So, I started asking around. It turns out that no one at Sun is particularly fond of it either. The reasons for not having changed it boil down to the fact that it rarely causes problems, a desire to maximize compatibility with the past, and the fact that there are bigger things to worry about (almost everything). The compatibility issue boils down to what happens if the output file has a link count that is greater than 1, as described earlier. It's rare for an output file to have multiple links, and even more rare for the makefile to not remove and relink all of the other names. Especially rare, since other operating systems (like Linux and the Macintosh) do break those links, and most Unix software is targetted at multiple operating systems (almost always including Linux). As described in an earler blog entry, I've been using Solaris Zones to improve our linker testing. The basic approach I want to use for this is to replace the linker files in the test zone with symbolic links that point at the corresponding files in my development workspace. This means that every time I use make(1) to rebuild the linker components, the results will immediately become available in the zone. There is one problem with this idea: The way ld handles existing output files means that any processes using the old linker components will suddenly crash when I type make. Depending on what is running in the zone, this could destabilize and/or cripple the zone. Of course, we could modify our makefiles to unlink the existing files before running ld, but this is tedious and error prone. I realized that it would be better to change the linker. At last, time to settle an old score. The first step in making a change like this is to write and submit an ARC (Architectural Review Committee) case describing the change. We take backward compatibility very seriously here, so a change like this requires formal consideration and approval. I submitted PSARC 2006/353 a couple of weeks ago. It generated some discussion pro and con (the change causes existing hard links to be broken), but in the end, the undesirability of causing programs to die in uncontrolled ways, and the bonus of Linux compatibility won the day and the proposal was approved. I put back the code change earlier today, and it will appear in build 43, and soon after in OpenSolaris. Appropriately enough, more typing was involved in proposing the ARC case than in changing the code. The actual code change is very small. It was a good learning experience for me to go through the ARC process, and gratifying to finally take my revenge on "my old friend". And of course, it will allow us to do better testing of the linker subsystem.
Posted at 04:35PM Jun 14, 2006 by ali in Sun | Happy Birthday To OpenSolaris OpenSolaris is 1 year old today! Congratulations, and thanks to everyone that made it possible! I've used SunOS/Solaris for ~20 years now, for all of the old school reasons: Solaris has long set the gold standard for things like reliability, scalability, observability, debuggability backward compatibility, and standards compliance. Over those years, I've written code that had to run on most versions of Unix as well as VMS, Windows, and the Macintosh. Sun was always the place where I did initial work, and any possible debugging as well. The reason is that SunOS has always had superior tools for observing and diagnosing what your code is doing. However, the user visible functionality of the OS (as with all Unix OS's) had not changed much in quite some time. Solaris still had the above advantages, but the various Unix variants were beginning to look and feel increasingly similar. In contrast, the current Solaris is a real jump forward. While other operating systems have been working away at duplicating things that Solaris has had forever, Solaris itself has moved forward with next generation features that no one else will have for quite awhile (Dtrace, ZFS, Zones, FMA, SMF, etc). Others have some subset of these abilities, but no one has the complete package, and no one else has it in such a simple and fully integrated fashion. Once you experience these things, you'll find running systems without them to be limiting (evoking a quieter, simpler era). At the same time, Solaris has moved aggressively to the 64-bit X86 PC platform, making it possible for many people to run it without first having to buy new hardware, and has moved to adopt a modern desktop and make other needed improvements (many of which involve open source code from other projects). The fact that Solaris has joined the community of open source Unix operating systems makes the above even more compelling. The energy level surrounding Solaris today is high. The quality of Solaris as a desktop OS is rapidly improving, thanks to its adoption of other open source software. At the same time, the OpenSolaris code is there for others to examine, modify, and even to port to other systems This would be a great thing for Unix in general, and is something that we would love to see happen (and it is: See Dtrace and ZFS). And, Solaris still leads the pack with those boring old school virtues. So on this 1st anniversary of OpenSolaris: If you are a Unix fan who is not familiar with the modern Solaris OS, you should give it a try. You will find it interesting and eye opening. Rest easy: It has a real open source license, and what has been given cannot be taken away. You probably already have a PC lying around that you can use. And the price (free) is exceedingly reasonable.
Posted at 03:23PM Jun 14, 2006 by ali in Sun |
Wednesday Apr 19, 2006
Testing Critical System Components Without Turning Your System Into A Brick
I work on the Solaris runtime linker. One thing you quickly learn in this work is that a small mistake can bring down your system. The runtime linker is an extreme example, but the same thing is true of other core system components. Modifying core parts of a running machine can be a risky game. There is a time honored strategy for dealing with this:
That's not much of a safety net. "Deal with it" can sometimes be a slow, painful process. There has been little improvement in this area for years. Now however, the advent of Solaris zones and ZFS gives us some powerful new options that can make recovery easy and instantaneous. I'm going to talk about how to do that here. Much of this discussion is linker-specific detail and background motivation, followed by some general comments about zones and ZFS. This is followed by an actual example of how I built a test environment. Please feel free to skip right to the example. The Linker Testing ProblemThe runtime linker lies at the heart of nearly every process on a Solaris (or any modern) operating system. This makes modifying and testing it problematic: If you install a runtime linker that has an error, your entire running system will instantly break. Since everything on the system is dynamically linked, this isn't a casual breakage. Rather, your system is unable to execute anything. Recovery may require booting a memory resident copy of the OS from the installation CD, restoring working files, and rebooting. One moment, you were focusing on solving a problem. Now, your attention is yanked away and focused on system recovery. Once you get your system back, you have to go back and try to remember what you were thinking when it broke. Your productivity is shot. If you work on something as central as the runtime linker, the odds of never breaking it are stacked against you. That it is going to happen is a simple mathematical fact. If you are careful and methodical, it will happen less. Unless you shy away from doing valuable work though, it is an ever present possibility. Since we can't eliminate the possibility, we have to accommodate it. Our main strategy in this game is avoidance. To avoid this problem, most linker testing is done against a local copy of the linker components, without installing them in the standard locations (/bin, /usr, etc). We do this by manipulating the command PATH, and setting linker environment variables. We may install them later, if testing seems to show that they are OK, and if we believe that there may be system interactions we need to guard against. The good news is that this approach usually works, and can be managed with a reasonable amount of effort. It has some limits though:
An ideal approach would not require so much human judgement. It would reflect the user experience exactly. Doing BetterHow would the ideal testing environment for the runtime linker subsystem look? Here's my wish list:
In years past, you might have tried to construct something like this by constructing an image of the system in a test area, and then applying the chroot(2) system call (probably in the form of the /usr/sbin/chroot command) to make it appear like the real system. This can work, but it has some big drawbacks:
If you've ever set up an anonymous FTP server, you know how much manual work is involved. Imagine doing it for an entire OS and then having to keep up with daily changes. People have tried this, but it ends up being too much ongoing effort to manage and maintain. No one minds doing work up front, but afterwards, we really want a system that can take care of itself. The goal is to save time and effort, not to simply redirect it. What we really need is a sort of super chroot: One that sets itself up and doesn't demand so much from us. Something that creates a virtual instance of the machine we're using, that is created automatically by the system, so we don't have to construct a Solaris root filesystem manually. Something easy to create, lightweight in operation, that is essentially identical to our installed system, and something that we can play with, wreck, and reset with little or no overhead. Before Solaris 10, this would have been a tall order. As of Solaris 10, it is standard stuff: We can build it using Solaris Zones in conjunction with ZFS. Not only can we do it, but it's easy. ZonesYou can read more about Solaris Zones at the OpenSolaris website. Quoting from that page:
The main instance of Solaris running on your system is known as the global zone. A given system is allowed to have 1 or more non-global zones: These are virtualized copies of the main system that present the programs running within them with the illusion that they are running on separate and distinct systems. Zones come in two flavors: Sparse, and Whole Root. The difference is that a sparse zone uses loopback mounts to re-use key filesystems (/, /usr, /platform) from the main system in a readonly mode, wheras a whole root zone makes a complete copy of these filesystems. A whole root zone allows you install different Solaris packages into its root filesystem this is what we need for linker testing. Zones are extremely easy to set up. They provide us with the ability to create an environment in which we can install and test the runtime linker without running the risk of taking down the machine. The worst that can happen is that we wreck the zone, but the damage will always be contained. A non-global zone cannot damage the global zone. If we do damage the non-global zone, it is easy to halt, destroy, and recreate it, all without any need to halt or reboot the main system. This is a big leap forward, and by itself, would be worth using. However, setting up a whole root zone can take half an hour. To really make this approach win, we need to be able to reset a zone much faster than that. We can do this using ZFS. ZFSZFS is a powerful new filesystem that is making its debut with Solaris 10, Update 2. ZFS makes it cheap and easy to create an arbitrary number of filesystems on any Solaris system, from small desktop machines to large servers. ZFS has a snapshot facility that allows you to capture a readonly copy of a filesystem (even really large ones) in a matter of seconds. A snapshot requires almost no disk space initially, as all the file data blocks are shared. As the main filesystem is modified, the snapshot continues to reference the old data blocks. Once a snapshot has been made, ZFS allows you to roll back the main filesystem to the state captured by the snapshot. This operation is trivial to do, and essentially instantaneous. Each Solaris whole root zone has a copy of the main system filesystems, kept at a location you specify when you create the zone. ZFS therefore presents us with a solution to the problem of how to rapidly and easily reset a linker test environment:
Once this is done, you can use the zone for testing, as if it were a especially convenient second system that can see the same files your real system can see. When you need to reset it:
I created such a zone using my Ultra 20 desktop system. Here are the commands to do the above: % zoneadm -z test halt % zfs rollback -r tank/test@baseline % zoneadm -z test boot These commands take 7 seconds from start to finish! Speed is not going to be a problem. Building ItLet's walk through the construction of the linker test zone I have on my desktop system. The first step is to get a ZFS filesystem set up. My system has an extra disk (/dev/rdsk/c2d0) that I will use for this purpose. It doesn't have any pre-existing data on it that I care about saving, so I will dedicate the entire thing for ZFS to use. I need to create a ZFS pool, and then create a filesystem within it. Following the ZFS examples I've seen, I'm going to name the pool "tank". I will mount the filesystem on /zone/test. root# zpool create -f tank c2d0 root# zfs create tank/test root# zfs set mountpoint=/zone/test tank/test root# df -k /zone/test Filesystem kbytes used avail capacity Mounted on tank/test 241369088 98 241368561 1% /zone/test That took 4 seconds. The next step is to create the zone within the ZFS filesystem now mounted at /zone/test. In order to allow installing linker components into the root and usr filesystems, this needs to be a whole root zone. At Sun, all of our home directories are automounted via NFS, with NIS used to manage user authentication. So, I'll need to give my zone a network interface. This interface needs a unique IP address, different from the main system address. I do most of my development work in a local filesystem (/export/home), so I'll arrange for it to appear within my test zone as well. My host is named rtld, so I will name my test zone rtld-test. Summarizing these decisions:
Let's create a test zone:
This part of the process takes about 12 minutes on this system. The output from "zoneadm list" shows us that the zone is installed, but not running. To get it running for the first time, we must boot it, and then login to the console and finish the installation process. This is the same process a standard Solaris goes through after the initial reboot smf initializes, you are asked some questions about hostname, root password, and name service, and then the system is ready for use. Before using it though, we halt it and capture a snapshot, for later use. root# zoneadm -z test boot root# zlogin -C test [Install Output omitted] ~. [Connection to zone 'test' console closed] root# zoneadm list -cv ID NAME STATUS PATH 0 global running / 12 test running /zone/test root# zoneadm -z test halt root# zfs snapshot tank/test@baseline root# zoneadm -z test boot This last part takes about 5 minutes. In total, we can go from no ZFS and no zone, to having a usable linker test zone in well under half an hour. This story is going to get even better soon: There are "zone cloning" features coming soon which will greatly lower the time it takes to create new zones. Using ItNow that we have a test zone, let's experiment with it. In this section, I will be using two separate terminal windows, one logged into the global zone, and one logged into the test zone. I will show interactions with the global zone on the left, and the test zone on the right. In this example, I remove the runtime linker (/lib/ld.so.1) and demonstrate that (1) This does not take down the system, and (2) It is easily and quickly repaired.
Conclusions: A Rising Tide Floats All BoatsI've started to regard the test zone the same way I view an Etch-A-Sketch®: I can play with it, mess it up, learn from the results, and then I give it a quick shake and it is ready to go again. This is cool stuff! Before doing this experiment, I had never used zones or ZFS. I had heard about them, but nothing more. I sat down on Friday morning to see what I could do with them, and I had the solution described here working within 8 hours of effort. It's hard to beat that return on investment. The result is a real leap forward in terms of how easily and completely we can test our work.
Zones and ZFS provide new
and powerful abilities not available elsewhere. They're included in
the standard system for free, and not as expensive add ons.
They're simple and easy to use. Once you play with them, I am confident
that you'll start seeing uses for them in your daily work.
Happy hunting!
Technorati Tag:
OpenSolaris
Posted at 05:52PM Apr 19, 2006 by ali in Sun | |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||