Surfing With a Linker Alien - Rod Evans's Weblog
Wednesday May 14, 2008
Direct Binding - now the default for OSNet components

Direct Binding refers to a symbol search and binding model that has been available in Solaris for quite some time. See Library Bindings. At runtime, a symbol reference from an object must be located by the runtime linker (ld.so.1(1)). Under direct bindings, symbol definitions are searched for directly in the dependency that provides the symbol definition. The provider of the symbol definition was determined by the link-editor (ld(1)) when the object was originally built.

This direct binding model differs from the traditional symbol search and binding model. In the traditional model, the symbol search starts with the application and advances through each object that is loaded within the process until a symbol definition is found.

Given that direct binding capabilities have been available for some time, and a number of other consolidations have been happily using them, why did it take so long to get this model employed to build the OSNet consolidation (that's the Solaris core OS and networking)? Basically, there were a number of corner cases to solve.

One advantage of direct bindings is that this model can protect against unintentional interposition. One disadvantage of direct bindings is that this model can circumvent intentional interposition. Determining whether interposition exists, and whether it is intentional or unintentional, is the fun part. The core Solaris libraries seem to be a frequent target of interposition.
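The difference between the two search models can be sketched with a toy lookup in C. This is only a model under stated assumptions: the table layout, the function names bind_traditional() and bind_direct(), and the object names used with them are all made up for illustration, and none of this is how ld.so.1 is actually implemented.

```c
#include <stddef.h>
#include <string.h>

/* A toy "object": a name plus the symbols it defines (NULL-terminated). */
struct object {
    const char *name;
    const char *defs[4];
};

/* Returns 1 if obj defines sym, 0 otherwise. */
static int defines(const struct object *obj, const char *sym)
{
    for (size_t i = 0; i < 4 && obj->defs[i] != NULL; i++)
        if (strcmp(obj->defs[i], sym) == 0)
            return 1;
    return 0;
}

/*
 * Traditional model: walk every loaded object in load order, starting
 * with the application, and bind to the first definition found.
 */
const char *bind_traditional(const struct object *loaded, size_t n,
                             const char *sym)
{
    for (size_t i = 0; i < n; i++)
        if (defines(&loaded[i], sym))
            return loaded[i].name;
    return NULL;
}

/*
 * Direct binding model: the provider was recorded at link-edit time,
 * so only that one object is searched. (Hypothetical simplification:
 * a missing definition simply yields NULL here.)
 */
const char *bind_direct(const struct object *loaded, size_t n,
                        const char *sym, const char *provider)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(loaded[i].name, provider) == 0)
            return defines(&loaded[i], sym) ? loaded[i].name : NULL;
    return NULL;
}
```

With two loaded objects that both define the same symbol, bind_traditional() always picks whichever was loaded first, while bind_direct() picks the object named as provider regardless of load order. That is the whole difference the rest of this posting revolves around.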
So first, what is interposition? Suppose a process is made up of several
shared objects, and two shared objects,
Now, suppose that two other shared objects within the process,
One avenue to observe this difference in binding is to employ lari(1), a utility that looks for interesting binding events. Not surprisingly, most interesting events revolve around multiple instances of a symbol. From our example, the traditional symbol search model will reveal:
% lari main
[2:2E]: xy(): ./libX.so
[2:0]: xy(): ./libY.so
Here, we see the two instances of
However, if
% lari main
[2:1ED]: xy(): ./libX.so
[2:1ED]: xy(): ./libY.so
Here, both
The question now is what did the developer of

It is this latter name-clash issue that was one of the main motivators in having the OSNet consolidation use direct bindings for all system libraries. There have been numerous instances of user applications breaking system functionality by unintentionally interposing on a symbol that exists within a system library. However, although we wished to protect our libraries from unintentional interposition, we still wished to provide for interposition where it was intended.
Although the direct bindings implementation prevents unintentional interposition,
the implementation does allow for interposition. However, if you want
interposition then you now need to be explicit. Explicit interposition can
be achieved with LD_PRELOAD (an old favorite), or by tagging the
associated object with
Alternatively, if you design a library with the intent that users be
allowed to interpose on symbols within the library, you can disable
direct binding to the library. Disabling can be achieved for the whole library
using
the link-editors

If you suspect an issue with direct bindings in effect, you can return to the traditional symbol search model by setting the environment variable LD_NODIRECT=yes. A suggestion for investigating the issue further would be:
% lari main > direct
% LD_NODIRECT=yes lari main > no-direct
% diff direct no-direct
Standard interposition dates from an era where applications had very few dependencies. Times have changed, and the number of dependencies has dramatically increased. Although interposition can be powerful, it can also be fragile and scale badly. Diagnosing the occurrence of interposition can be a challenge.

Given the ability to time travel, direct binding would probably have been the only model for symbol binding, and explicit interposition the only means of defining an interposer. Having to support direct bindings and the traditional model with the various flags and options is the cost of backward compatibility. However, the ability of ELF to stretch this far speaks to the overall quality of its initial design, warts and all.

The OSNet consolidation uses the various binding-control flags to both identify interposers, and prevent direct bindings to commonly interposed upon symbols. All the gory details of direct binding, the various flags that can be used, and examples of their use, can be found in the Direct Binding Appendix of the Linker and Libraries Guide.

(2008-05-14 11:48:34.0)
We've moved -

A recent update to Solaris Nevada (build 68 to be precise) has moved the

(2007-06-25 21:14:31.0)

'_init'/'_fini' not found - use the compiler drivers
A recently added error check within
ld: warning: symbol `_init' not found, but .init section exists - \
possible link-edit without using the compiler driver
ld: warning: symbol `_fini' not found, but .fini section exists - \
possible link-edit without using the compiler driver
The encapsulation, and execution of
Users typically create these sections using a
% cat foobar.c
static int foobar = 0;

/* foo() is collected into the .init section */
#pragma init (foo)
void foo()
{
    foobar = 1;
}

/* bar() is collected into the .fini section */
#pragma fini (bar)
void bar()
{
    foobar = 0;
}
The functions themselves are placed in a
This is where the compiler drivers come in. As part of creating a dynamic
object, the compiler drivers provide input files that encapsulate the
_init { # provided by
It is the symbols
Some folks are using
This leaves the developer wondering why their
It's best not to use

(2006-12-19 14:43:49.0)

Displacement Relocation Warnings - what do they mean?
There have been a couple postings recently regarding relocation
warnings that have been observed when using the link-editors
ld: warning: relocation warning: R_SPARC_DISP32: file shared.o: \
symbol <unknown>: displacement relocation applied to the \
symbol __RTTI__1nEBase_: at 0x8: displacement relocation will \
not be visible in output image
Then, if this shared object is referenced as a dependency when building a dynamic executable, another warning can be generated:
ld: warning: relocation warning: R_SPARC_COPY: file shared.so: \
symbol xxxx: may contain displacement relocation
These warnings stem from an old request from the compiler folks to help prevent problems with displacement relocations and copy relocations. You have to be a little relocation savvy to understand these scenarios - they make my head hurt. Investigations are underway to determine why these warnings are starting to surface.
(2006-10-13 14:24:56.0)

Dynamic Object Versioning - specifying a version binding
After reading a previous posting on versioning, a developer asked
how they could specify what version to bind to when they built their
application. For example, from the version definitions within
% pvs -d /lib/libelf.so.1
libelf.so.1;
SUNW_1.5;
SUNW_1.4;
....
SUNWprivate_1.1;
how could you restrict an application to only use the interfaces
defined by

The Linker and Libraries Guide covers this topic in the section Specifying a Version Binding. In a nutshell, you can specify a version control mapfile directive:
% cat mapfile
libelf.so - SUNW_1.4;
Notice that the shared object name is the compilation environment name.
This is the name that gets resolved when you specify
Note, if you build against

For example, this application is referencing gelf_getcap(3ELF).
% pvs -r prog
libelf.so.1 (SUNW_1.5);
....
% pvs -dos /lib/libelf.so.1 | fgrep SUNW_1.5
/lib/libelf.so.1 - SUNW_1.5: gelf_getcap;
/lib/libelf.so.1 - SUNW_1.5: gelf_update_cap;
Note also, binding to specific versions is not a panacea for building
software on release

Of course, building on the latest release can provide a richer debugging environment in which to develop your software. I often try building things on the latest environment, and then fall back to the oldest environment for final testing and for creating final deliveries.
(2006-10-13 10:48:22.0)

Changing Search Paths with crle(1) - they are a replacement
A developer who wished to add
# crle -l /usr/sfw/lib
# ls
ld.so.1: ls: fatal: libsec.so.1: open failed: No such file or directory
Killed
The problem was that crle(1), in this basic form, created a system wide configuration file.
This configuration file defined that the default runtime search path for
shared object dependencies is

You can determine the standard search path defaults using crle(1). For example, without any system wide configuration file, the following defaults might exist:
$ crle
Default configuration file (/var/ld/ld.config) not found
Platform: 32-bit LSB 80386
Default Library Path (ELF): /lib:/usr/lib (system default)
Trusted Directories (ELF): /lib/secure:/usr/lib/secure (system default)
This user had effectively removed the system default search paths, and hence the runtime linker, ld.so.1, had been unable to find the basic dependencies required by all applications. The new configuration file revealed:
$ crle
Configuration file [version 4]: /var/ld/ld.config
Platform: 32-bit LSB 80386
Default Library Path (ELF): /usr/sfw/lib
Trusted Directories (ELF): /lib/secure:/usr/lib/secure (system default)
Command line:
crle -c /var/ld/ld.config -l /usr/sfw/lib
The
-l dir
....
Use of this option replaces the default search path.
Therefore, a -l option is normally required to specify
the original system default in relation to any new paths
that are being applied. ....
Therefore, to prepend the new search path to the existing defaults you should specify each search path:
# crle -l /usr/sfw/lib -l /lib -l /usr/lib
# ls
devices/ lib/ proc/
....
An alternative is to use the
# crle -u -l /usr/sfw/lib
# crle
Configuration file [version 4]: /var/ld/ld.config
Platform: 32-bit LSB 80386
Default Library Path (ELF): /lib:/usr/lib:/usr/sfw/lib
Trusted Directories (ELF): /lib/secure:/usr/lib/secure (system default)
Command line:
crle -c /var/ld/ld.config -l /lib:/usr/lib:/usr/sfw/lib
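The replace-versus-update behavior shown in the two sessions above can be modelled in a few lines of C. This is a sketch only: crle_new_path() is a hypothetical helper written for this illustration, not part of crle(1), and it models nothing beyond how the resulting search path string is formed.

```c
#include <stdio.h>

/*
 * Model of how a new -l directory combines with the current search path.
 *   update == 0: plain "crle -l dir"  -> dir replaces the path.
 *   update != 0: "crle -u -l dir"     -> dir is appended to the path.
 * Returns 0 on success, -1 if out is too small to hold the result.
 */
int crle_new_path(const char *current, const char *dir, int update,
                  char *out, size_t outlen)
{
    int n;

    if (update && current != NULL && current[0] != '\0')
        n = snprintf(out, outlen, "%s:%s", current, dir);
    else
        n = snprintf(out, outlen, "%s", dir);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Called with the system default /lib:/usr/lib and the new directory /usr/sfw/lib, the non-update case leaves only /usr/sfw/lib, which is exactly the situation that left ls unable to find libsec.so.1.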
Note that the usage message from crle(1) is a little misleading, as it implies that the new search path is an addition:
# crle -X
crle: illegal option -- X
....
[-l dir] add default search directory
....
We'll get the usage message updated to be more precise.
Remember, should you ever get in trouble with crle(1)
configuration files, you can always instruct the runtime linker to
ignore processing the configuration file by setting the environment
variable LD_NOCONFIG=yes:
# crle -l /does/not/exist
# ls
ld.so.1: ls: fatal: libsec.so.1: open failed: No such file or directory
Killed
# LD_NOCONFIG=yes ls
devices/ lib/ proc/
....
# LD_NOCONFIG=yes rm /var/ld/ld.config
# ls
devices/ lib/ proc/
....
It is recommended that when creating a new configuration file, you first
create the file in a temporary location. The environment variable
Note. crle(1) should not be crippled by blowing away the system default search paths:
# crle -l /does/not/exist
# crle
Configuration file [version 4]: /var/ld/ld.config
Platform: 32-bit MSB SPARC
Default Library Path (ELF): /does/not/exist
Trusted Directories (ELF): /lib/secure:/usr/lib/secure (system default)
Command line:
crle -c /var/ld/ld.config -l /does/not/exist
# elfdump -d /usr/bin/crle | fgrep RPATH
ld.so.1: fgrep: fatal: libc.so.1: open failed: No such file or directory
ksh: 18184 Killed
# LD_NOCONFIG=yes; export LD_NOCONFIG
# elfdump -d /usr/bin/crle | fgrep RPATH
[6] RPATH 0x61b $ORIGIN/../lib
Using $ORIGIN within a runpath provides crle(1) with a level of protection against insufficient configuration file information.
(2006-10-04 14:48:41.0)

Wrong ELF Class - requires consistent compiler flags

Every now and then, someone encounters the following error.
% cc -G -o foo.so foo.o -lbar
ld: fatal: file foo.o: wrong ELF class: ELFCLASS64
ld: fatal: File processing errors. No output written to foo
Or perhaps the similar error.
% cc -G -xarch=amd64 -o foo.so foo.o -lbar
ld: fatal: file foo.o: wrong ELF class: ELFCLASS32
ld: fatal: File processing errors. No output written to foo
This issue stems from the compiler flags that have been used to compile the relocatable object foo.o, and the compiler flags that are finally used to invoke the link-edit of this object. The man page for ld(1) hints at the issue.

No command-line option is required to distinguish 32-bit objects or 64-bit objects. The link-editor uses the ELF class of the first relocatable object file that is found on the command line, to govern the mode in which to operate.

When the compiler drivers are used to generate an executable or shared object, the driver typically supplies a couple of their own files to the link-edit. One or more of these additional files will be read by the link-editor before the file foo.o. Expanding the compiler processing might reveal:
% cc -# -G -o foo.so foo.o
...
ld crti.o values-xa.o -o foo.so -G foo.o ... crtn.o
Here, the first input file read by the link-editor is crti.o (this is typically a full path to a compiler specific subdirectory). Expanding a 64-bit link-edit request might reveal:
% cc -# -xarch=64 -G -o foo.so foo.o
...
ld amd64/crti.o amd64/values-xa.o -o foo.so -G foo.o ... amd64/crtn.o
Armed with this information it should be easy to see how the ELFCLASS error messages can be produced.
If for example, you wish to create a 64-bit shared object from one or more relocatable objects, you might first create the 64-bit relocatable object like:
% cc -c -xarch=amd64 foo.c
% file foo.o
foo.o: ELF 64-bit LSB relocatable AMD64 Version 1
But, if you fail to inform the compiler driver that this object should
be linked into a 64-bit object, you'll produce the ELFCLASS64
error message. The first file read by the link-editor will be the 32-bit
version of crti.o. This puts

Similarly, a 32-bit relocatable object:
% cc -c foo.c
that is handed to a 64-bit link-edit will produce the ELFCLASS32 error message. Make sure that the architecture flag used to build a relocatable object is also passed to the compiler driver phase of linking the relocatable object into a final executable or shared object.
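The class check itself is easy to picture. An ELF file starts with the identification bytes 0x7f 'E' 'L' 'F', and the fifth byte (index 4) records the class: 1 for ELFCLASS32, 2 for ELFCLASS64. The following C sketch models the "first relocatable object governs the mode" rule on raw header bytes. The names elf_class_of() and same_elf_class() are made up for this illustration, and the real checks inside ld(1) are certainly more involved.

```c
#include <stddef.h>

/* Offsets and values from the ELF identification bytes (e_ident). */
#define EI_CLASS_OFFSET 4   /* index of the class byte */
#define CLASS_NONE 0
#define CLASS_32   1        /* ELFCLASS32 */
#define CLASS_64   2        /* ELFCLASS64 */

/*
 * Returns CLASS_32 or CLASS_64 for a buffer that starts with a valid
 * ELF identification, or CLASS_NONE if it is not recognisably ELF.
 */
int elf_class_of(const unsigned char *buf, size_t len)
{
    if (len < 5)
        return CLASS_NONE;
    if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return CLASS_NONE;
    if (buf[EI_CLASS_OFFSET] == CLASS_32 || buf[EI_CLASS_OFFSET] == CLASS_64)
        return buf[EI_CLASS_OFFSET];
    return CLASS_NONE;
}

/*
 * The first object read sets the mode; a later input of a different
 * class is rejected. Returns 1 if the two inputs agree, 0 otherwise.
 */
int same_elf_class(const unsigned char *first, size_t flen,
                   const unsigned char *next, size_t nlen)
{
    int c = elf_class_of(first, flen);

    return c != CLASS_NONE && c == elf_class_of(next, nlen);
}
```

Feeding it the header of a 32-bit crti.o followed by the header of a 64-bit foo.o gives a mismatch, which corresponds to the ELFCLASS64 error in the first example.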
(2006-04-26 08:10:14.0)

C++ Dynamic Linking - symbol visibility issues

Recently, a customer's use of C++ objects within a dlopen(3c) environment revealed a problem that took some time to evaluate and understand. Sadly, this seems to be a recurring issue where the expectations of the C++ implementation are compromised by dynamic linking capabilities. Of course, dynamic linking is the norm for Solaris, and C++ is commonly employed in dynamic linking environments. But there are subtleties with regard to symbol visibility that can cause problems.

This customer was using a Java application to System.loadLibrary a C++ shared object, built to use standard iostreams. The underlying dlopen() failed as part of calling _init, and the result was a core dump. By preloading libumem(3lib), the customer discovered the problem was a bad free().
>::umem_status
Status: ready and active
Concurrency: 4
Logs: (inactive)
Message buffer:
free(d352a040): invalid or corrupted buffer
There seemed to be an inconsistency in memory allocation underlying this failure. And, I felt I'd been here before. A similar (but slightly different as it turns out) problem had been uncovered a few months ago. So, I started poking through the symbol bindings for this process.

I do this for a living, but even I find analyzing the symbol bindings of a process to be a little daunting. There are just so many bindings to wade through. In Solaris 10 we invented lari(1) to help uncover interesting symbol bindings. I gave a quick introduction to this tool in a previous posting.

First it was necessary to obtain a trace of all process bindings, including those produced by the dlopen(). The following environment variables result in this trace being saved in the file dbg.pid.
% LD_DEBUG=files,detail LD_DEBUG_OUTPUT=dbg java-app
The interesting information that lari(1) unravels focuses on the existence of multiple instances of the same symbol name. But even this can be a lot of information to digest (although I still don't understand why so many objects export the same interfaces). For this application, I wanted to narrow things down to just those symbols that were involved in a runtime binding. And, as we're dealing with C++, a little bit of demangling might be useful too.
% lari -bC -D dbg.pid
[3:1EP]: __1cDstdMbasic_string4Ccn0ALchar_traits4Cc__n0AJallocator4Cc___J__nullref_[0x30] \
[std::basic_string <char,std::char_traits <char>,std::allocator <char>>::__nullref]: \
/local/ISV/libdlopened.so
[3:1SF]: __1cDstdMbasic_string4Ccn0ALchar_traits4Cc__n0AJallocator4Cc___J__nullref_[0x30] \
[std::basic_string <char,std::char_traits <char>,std::allocator <char>>::__nullref]: \
/usr/lib/cpu/sparcv8plus/libCstd_isa.so.1
.....
Now that's interesting. Here we have three occurrences of the same __nullref_ symbol, and two different instances have been bound to. The libdlopened version is also defined as protected, which means that there may be internal references to this symbol from within the same object. A quick inspection of the original process bindings for this symbol also uncovers their addresses.
09268: 1: binding file=/usr/lib/libCstd.so.1 (0xd1677b00:0x177b00) to \
file=/local/ISV/libdlopened.so (0xd352a040:0x192a040): \
symbol `__1cDstdMbasic_string4Ccn0ALchar_traits4Cc__n0AJallocator4Cc___J__nullref_'
There's that bad free() address, 0xd352a040. Now I'm not sure why the C++ implementation is trying to free a data item that exists within an object, but the core of the problem (I'm told) is that there are two instances of __nullref_ being used, and this has led to confusion. But why have we bound to two different instances?

The problem seems to stem from the search scope and visibility of the objects loaded with dlopen(). Refer to the section "Symbol Lookup" under "Runtime Linking Programming Interface" for a detailed explanation. By default, a dlopen() family is loaded with the RTLD_LOCAL attribute. In this customer's application, libdlopened.so is loaded by the dlopen(), and libCstd.so.1 is loaded as one of the dependencies. libCstd.so.1 is not a dependency of the Java application itself. Therefore libCstd.so.1 is maintained within the local scope of the family of dlopen objects. All objects within this family are able to bind to this dependency. Objects outside of this family cannot.

But, libCstd.so.1 also acts as a filter, and brings in the filtee libCstd_isa.so.1. This filtee is effectively brought in using another dlopen(), and thus libCstd_isa.so.1 exists within its own local scope. Hence, the __nullref_ reference from libCstd_isa.so.1 cannot be satisfied by the definition in libdlopened.so - the referring object, and the defining object, live in different local scopes. Hence we get two different symbol bindings.

Sadly, this seems to be a common failure point. The C++ implementation can deposit the same data item in multiple objects. However, the design expects all such objects to be of global scope, such that interposition occurs, and only one definition from the multiple symbols is bound to. This requirement can be undermined by a number of dynamic linking techniques.
The first is the local scope families produced by dlopen() and filters as shown by this customer's scenario - although both of these techniques have been around since the early days of Solaris. It is possible that scenarios like this are typically avoided because the application maintains its own dependency on the C++ libraries, or dlopen() is employed with the RTLD_GLOBAL flag. The scenario can also be avoided by preloading the C++ library. All these mechanisms force the C++ library to be of global scope, and hence allow interposition to bind to one instance of the problematic symbol. (Another hack for this scenario is to set LD_NOAUXFLTR=yes, which suppresses auxiliary filtering - hence libCstd_isa.so.1 wouldn't get loaded).

However, similar issues can result from using linker options such as -Bsymbolic, and direct bindings, or scoping dynamic object interfaces using mapfiles. The problem is that the dynamic linking technologies exist to carve out local namespaces within a process, and protect multiple dlopen() families from adversely interacting with one another. This requirement is becoming more and more relevant in today's large dynamic applications. C++ implementation requirements, and user dynamic linking requirements, seem to be at odds.

Perhaps it is time to invent a new symbol attribute. Attributes that allow symbols to be demoted to protected, or local scope already exist. A previous posting introduced some compiler techniques in this area. But we have no attribute that states that a symbol must remain global, and that it should have no internal or direct bindings established to it, and that it should be elevated above any local scope families created within a dynamically linked process. Perhaps with such a symbol attribute, assigned by the compilers for the symbols they know must be completely interposable, we'd establish a more robust environment. Now, I wonder what name we'd give this new super-global attribute?
(2006-03-16 15:14:55.0)

Runtime Token Expansion - some clarification

I recently came across a mail exchange where the following runtime linker error message was observed:
illegal mode: potential multiple path expansion requires RTLD_FIRST
The exchange, and a quick review of the documents, reveal that some explanation wouldn't go amiss.

The runtime linker, ld.so.1, provides a number of tokens that can be used within dynamic linking string definitions. These string definitions can provide filters, dependencies and runpath information, and are documented in the section Establishing Dependencies with Dynamic String Tokens of the Linker and Libraries Guide.

Presently, a dependency expressed within an object points to a single file. For example:
% elfdump -d main | fgrep NEEDED
[0] NEEDED 0x230 libc.so.1
Filtee definitions and runpaths however, are frequently defined as lists of colon separated items. For example:
% elfdump -d foo.so | egrep "FILTER|RPATH"
[4] FILTER 0xd8e libbar.so.1:libnuts.so.1
[13] RPATH 0xbf3 /usr/ISVI/lib:/usr/ISVII/lib
The tokens $OSNAME, $OSREL, $PLATFORM and $ORIGIN all expand into a single string. For example:
% elfdump -d libc_psr.so.1 | fgrep AUXILIARY
[2] AUXILIARY 0x56ea /platform/$PLATFORM/lib/libc_psr.so.1
can expand at runtime into:
/platform/SUNW,Sun-Blade-1000/lib/libc_psr.so.1
This single string expansion means that these tokens can be used in filter, dependency and runpath definitions. The tokens $HWCAP and $ISALIST however, typically expand into a list of elements. For example:
% elfdump -d bar.so | fgrep RPATH
[4] RPATH 0x1ad /usr/ISV/$ISALIST
can expand at runtime into:
search path=/usr/ISV/$ISALIST (RPATH from file bar.so)
trying path=/usr/ISV/sparcv9+vis2/libfoo.so.1
trying path=/usr/ISV/sparcv9+vis/libfoo.so.1
trying path=/usr/ISV/sparcv9/libfoo.so.1
trying path=/usr/ISV/sparcv8plus+vis2/libfoo.so.1
....
This list is well suited for filter and runpath definitions, where lists are already expected. But what about dependency definitions? As our present implementation of dependency strings expects a single object, allowing a token that can expand into multiple objects was questioned. Basically, the infrastructure to assign multiple head objects to a handle isn't yet available, and we really don't know of anyone who wanted this capability.

Because of these issues, we decided to restrict the use of $HWCAP and $ISALIST when used to define dependencies. If you use either of these tokens to establish dependencies, only the first object that is found from their expansion is used. Note that pathnames used for dependencies don't seem very common, but we've restricted their use for these tokens anyway.
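The single-string versus list expansion, and the "first object found" rule for dependencies, can be sketched in C. Everything here is an assumption made for illustration: expand_token() and first_match() are hypothetical helpers, the ISA names are supplied by the caller rather than discovered from the system, and the set of "present" files stands in for a real file system lookup.

```c
#include <stdio.h>
#include <string.h>

/*
 * Replace the first occurrence of token in path with value.
 * Returns 0 on success, -1 if the token is absent or out is too small.
 */
int expand_token(const char *path, const char *token, const char *value,
                 char *out, size_t outlen)
{
    const char *hit = strstr(path, token);
    int n;

    if (hit == NULL)
        return -1;
    n = snprintf(out, outlen, "%.*s%s%s", (int)(hit - path), path, value,
                 hit + strlen(token));
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/*
 * Expand a list-valued token once per ISA name, in order, and stop at
 * the first expansion that appears in present[]. Returns the index of
 * the ISA that matched (with the path left in out), or -1 if none did.
 */
int first_match(const char *path, const char *token,
                const char *const *isas, size_t nisas,
                const char *const *present, size_t npresent,
                char *out, size_t outlen)
{
    for (size_t i = 0; i < nisas; i++) {
        if (expand_token(path, token, isas[i], out, outlen) != 0)
            continue;
        for (size_t j = 0; j < npresent; j++)
            if (strcmp(out, present[j]) == 0)
                return (int)i;
    }
    return -1;
}
```

A token like $PLATFORM needs only expand_token(), since there is one value. A token like $ISALIST produces one candidate per ISA name, and for a dependency only the first candidate that exists is used, which is what first_match() models.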
Likewise, if you use these tokens in a
dlopen(3c), only the first object found is applicable.
But here we wanted the user to be explicit, and know what they are
getting. Hence, we ask for the RTLD_FIRST flag, which
happened to be lying around and seemed kind of appropriate.
Without this flag you'll get the illegal mode error message.
Of course, RTLD_FIRST is now a little overloaded, it
restricts symbol searches, and clarifies a
For

The illegal mode error message is an attempt to make users aware of a token processing restriction, that may be lifted in future.

(2005-12-14 11:11:01.0)

A Very Slow Link-Edit - get the latest patch

A customer recently posted to the Dynamic Linking forum in regards to an awfully slow link-edit. A shared library, built from Sun Studio 10 with debugging (-g), was taking 20 minutes to link on Solaris. By comparison, the same link took 4-5 minutes on Linux, and 15 seconds on Windows. I got a copy of the objects and found that the link-edit was considerably faster on my desktop - less than a minute. This is no small link-edit. There are a number of very large input files, and in total, ld(1) processes 65057 input symbols, and the killer, over 1.3 million input relocations.

It turns out we'd already uncovered a scalability issue from investigating a slow link-edit from another customer. Basically there are some tests within ld(1) that attempt to identify displacement relocation use within data items that have the potential for copy-relocations. Not something users typically come across, but an area where our compiler engineers had once fallen foul. Thus the checks were requested by our compiler developers to aid their experimentation.
A patch already existed that addressed this slow link-edit, which was fixed
under bugid 6262789. The patches are:
The customer now has the relevant patch. Their link-time is down to 35 seconds. Which is still not as fast as Windows, so we still have some work to do. Perhaps the compilers could generate a little less for us to do :-).

(2005-10-13 09:43:39.0)

Init and Fini Processing - who designed this?

Recently we had to make yet another modification to our runtime .init processing to compensate for an undesirable application interaction. I thought an overview of our torturous history of .init processing might be entertaining.

During the creation of a dynamic object, the link-editor ld(1), arranges for any .init and .fini sections to be collected into blocks of code that are executed by the runtime linker ld.so.1(1). These blocks of code are typically used to implement constructors and destructors, or code identified with #pragma init, or #pragma fini.
The original System V ABI was rather vague in regard to these sections, stating
simply:
The system has evolved since this was written, and today there is an expectation that any initialization functions are called before any code within the same object is referenced. This holds true for dependencies of the application and objects that are dynamically added to a process with dlopen(3c). The reverse is expected on process termination, and when objects are removed from the process with dlclose(3c).
Today's processes, language requirements (C++), lazy-loading, together
with

For the rest of this discussion, let's use .init sections for examples.
At First it was Simple

In the early days of Solaris, .init sections were run in reverse load order, sometimes referred to as reverse breadth first order. If an application had the following dependencies:
% ldd a.out
lib1.so.1 => /opt/ISV/lib/lib1.so.1
lib2.so.1 => /opt/ISV/lib/lib2.so.1
libc.so.1 => /usr/lib/libc.so.1
then the initialization sequence would be libc.so.1 followed by lib2.so.1 followed by lib1.so.1. This level of simplicity proved insufficient for calling .init sections in their expected order. All that was required was for a dependency to have its own dependency. For example, if lib1.so.1 had a dependency on lib3.so.1, the load order would reveal:
% ldd a.out
lib1.so.1 => /opt/ISV/lib/lib1.so.1
lib2.so.1 => /opt/ISV/lib/lib2.so.1
libc.so.1 => /usr/lib/libc.so.1
lib3.so.1 => /opt/ISV/lib/lib3.so.1
The result of this loading was that the .init for lib3.so.1 was called before the system library libc.so.1. In practice this wasn't a big issue back in our very early releases. Although libc.so.1 is typically the dependency of every application and library created, its .init used to contribute little that was required by the .init's of other objects.

Issues really started to arise as C++ and/or threads use started to expand. The libraries libC.so.1 and libthread.so.1, and the C++ objects themselves, had far more complex initialization requirements. It became essential that the .init's of libC.so.1 and libthread.so.1 ran before any other object's .init.
Topological Sorting Arrives

In Solaris 6, the runtime linker started constructing a dependency ordered list of .init sections to call. This list is built from the dependency relationships expressed by each object and any bindings that have been processed outside of the expressed dependencies.

Explicit dependencies are established at link-edit, i.e., lib1.so.1 needs lib3.so.1. However, explicit dependency relationships are often insufficient to establish complete dependency relationships. It is still very typical for a shared object to be generated that does not express its dependencies. Use of the link-editor's -z defs option would enforce that dependencies be expressed, but this option isn't always used. Therefore, the runtime linker also adds any dependencies established by relocations to the information used for topological sorting. ldd(1) can be used to display expected .init order:
% ldd -di a.out
lib1.so.1 => ./lib1.so.1
lib2.so.1 => ./lib2.so.1
libc.so.1 => /usr/lib/libc.so.1
lib3.so.1 => ./lib3.so.1
init object=/usr/lib/libc.so.1
init object=./lib3.so.1
init object=./lib1.so.1
init object=./lib2.so.1
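A dependency ordered list of this kind can be produced with a depth-first walk that lists each object only after everything it depends on. The C sketch below is a generic post-order topological sort, not the runtime linker's own algorithm. The graph layout, the name init_order(), and the assumption that every library also depends on libc are all mine.

```c
#include <stddef.h>

#define MAX_OBJS 16
#define MAX_DEPS 4

/*
 * deps[i] lists the indices of the objects that object i depends on.
 * A negative entry ends the list early.
 */
struct graph {
    size_t n;
    int deps[MAX_OBJS][MAX_DEPS];
};

/* Visit obj's dependencies first, then append obj itself to order[]. */
static void visit(const struct graph *g, int obj, int *seen,
                  int *order, size_t *count)
{
    if (seen[obj])
        return;
    seen[obj] = 1;      /* marking before recursing also stops cycles */
    for (int d = 0; d < MAX_DEPS && g->deps[obj][d] >= 0; d++)
        visit(g, g->deps[obj][d], seen, order, count);
    order[(*count)++] = obj;
}

/*
 * Fills order[] with an initialization order for everything reachable
 * from root, and returns the number of entries written.
 */
size_t init_order(const struct graph *g, int root, int *order)
{
    int seen[MAX_OBJS] = { 0 };
    size_t count = 0;

    visit(g, root, seen, order, &count);
    return count;
}
```

With a.out depending on lib1, lib2 and libc, lib1 depending on lib3, and each library depending on libc, this walk lists libc, lib3, lib1, lib2 and finally a.out, the same library order as the ldd -di output above. Note that a cycle is simply cut wherever the walk happens to enter it, which hints at why cyclic dependencies are troublesome.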
But there's still something missing. The above example shows ldd processing only immediate (data) relocations. This is normal when executing an application, i.e., without LD_BIND_NOW being set. Typically, functions are not resolved when objects are loaded, but are resolved lazily when the function is first called. The problem with the runtime linker's dependency analysis is that without resolving all relocations, including functions, the exact dependency relationship may not be known at the time .init firing starts.

Of course, we had one library that had to be treated differently, libthread.so.1. Even though this library could have dependencies on other system libraries, its .init had to fire first. This was ensured with the .dynamic flag DF_1_INITFIRST. But, this also excited others who claim they'd like to be first too! Better mechanisms have since evolved to ensure the new merged libthread and libc are initialized appropriately.

As a side note, when topological sorting was added, the environment variable LD_BREADTH was also provided. This variable suppressed topological sorting and reverted to the original breadth first sorting. This fall back was provided in case applications were found to be dependent upon breadth first sorting, or in case bugs existed in the topological sort mechanism. Sadly, the latter proved true, and LD_BREADTH found its way into scripts and user startup files. But, as systems became more complex, LD_BREADTH became increasingly inappropriate, and its existence caused more problems than it solved. This environment variable's processing was finally removed.

Note, the debugging capabilities that are available with the runtime linker in OpenSolaris have been significantly updated so that LD_DEBUG=init,detail provides detailed information on the topological sorting process.
Dynamic .init Calling
To complete the initialization model, each time the runtime linker resolves a function call (through a .plt), the defining object's .init is called if it hasn't already executed. With dynamic .init calling, a lazily bound family of objects can assume that an object's .init is called before code in that object is referenced. Suppose, in our a.out example, that the .init code executed for lib3.so.1 makes a call to lib2.so.1. It follows that the .init for lib2.so.1 should be called earlier than it had previously been scheduled. This dynamic initialization cannot be observed under ldd, but can be seen with the runtime linker's debugging:
% LD_DEBUG=init a.out
......
09086: calling .init (from sorted order): /usr/lib/libc.so.1
09086:
09086: calling .init (done): /usr/lib/libc.so.1
09086:
09086: calling .init (from sorted order): ./lib3.so.1
09086:
09086: calling .init (dynamically triggered): ./lib2.so.1
09086:
09086: calling .init (done): ./lib2.so.1
09086:
09086: calling .init (done): ./lib3.so.1
09086:
09086: calling .init (from sorted order): ./lib1.so.1
09086:
09086: calling .init (done): ./lib1.so.1
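The ordering in this trace can be reproduced with a small toy model. This is only a sketch: the `struct object`, `bind_call` and `run_sorted_inits` names are hypothetical and say nothing about how ld.so.1 is actually implemented. It assumes each object's .init calls into at most one other object.

```c
#include <stdio.h>
#include <string.h>

/* Toy model only: these types are hypothetical, not ld.so.1 internals. */
enum init_state { INIT_PENDING, INIT_RUNNING, INIT_DONE };

struct object {
    const char *name;
    enum init_state state;
    struct object *calls_from_init; /* object whose code this .init calls, or NULL */
};

/* Names of objects in the order their .init started, each followed by a space. */
static char init_trace[128];

static void fire_init(struct object *obj);

/* Models binding a call through a .plt: before the call proceeds, the
 * defining object's .init is fired if it has not started yet. */
static void bind_call(struct object *target)
{
    if (target->state == INIT_PENDING)
        fire_init(target);
}

static void fire_init(struct object *obj)
{
    obj->state = INIT_RUNNING;
    strcat(init_trace, obj->name);
    strcat(init_trace, " ");
    if (obj->calls_from_init != NULL)
        bind_call(obj->calls_from_init); /* dynamically triggered .init */
    obj->state = INIT_DONE;
}

/* Walks the sorted list, skipping any object already handled dynamically. */
static const char *run_sorted_inits(struct object **sorted, int count)
{
    init_trace[0] = '\0';
    for (int i = 0; i < count; i++)
        if (sorted[i]->state == INIT_PENDING)
            fire_init(sorted[i]);
    return init_trace;
}
```

Feeding this model the sorted order libc, lib3, lib1, lib2, with lib3's .init calling into lib2, starts lib2's .init while lib3's is still running, as in the debugging output above.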
But, there is still something missing. What if a family of objects is not lazily bound? And, can a user assume that an object's .init has completed before the code in that object is referenced?
Loss of Lazy Binding
It has been observed that users frequently call
Suppose a family of objects have been loaded under the default mode of
lazy binding. The runtime linker has sorted this family and is
in the process of firing the .init's. Now let's say that a
particular .init calls
This observation prompted an additional level of dynamic .init
calling. Whenever a family of objects are
No Lazy Binding
Objects can be loaded without lazy-binding, either under
control of LD_BIND_NOW, the
Cyclic Dependencies
Cyclic dependencies seem to be quite common. It has been observed that calling one object's .init can result in one or more other objects being called, which in turn reference code from the originating object. Problems start manifesting themselves when this return to the originating object exercises code whose initialization has not yet completed. When topological sorting detects cycles, the members of the cycles will have their .init's fired in reverse load order. With dynamic .init calling, this order may be a fine starting point, but without dynamic .init calling this order may not be sufficient to prepare for the execution path of code throughout the cyclic objects.
A similar issue has arisen between different threads of control. One thread may be in the process of calling an .init and get preempted for another thread that references the same object. In fact, the runtime linker is fully capable of using condition variables to synchronize .init use. However, this functionality is not enabled by default because of cyclic dependencies, and the possible deadlock conditions that can result.
Recursion with dlopen(0)
When we added the reevaluation and firing of .init's if any loaded objects were referenced by a dlopen(), dlopen(0) fell under the same model. And, although this model was integrated into Solaris a couple of years ago, it wasn't until recently that some existing applications were observed to fail because of the .init reevaluation. The problem was that unexpected recursion was occurring. .init's were being fired, and then the code within a .init called dlopen(0). This caused a reevaluation of the outstanding .init sections, some .init's to fire, and a thread of execution that led to running code within an object whose .init had not yet completed.
Yet another refinement was added. Now, if a dlopen() operation only references objects that are already loaded, and that dlopen() operation does not promote any object's relocation requirements, i.e., doesn't use RTLD_NOW, then the thread of initialization that must presently be in existence is left to finish the job. If, however, the dlopen() operation adds new objects, or promotes any existing object's relocations, then the family of objects referenced will have their .init's reevaluated, and a new thread of initialization is kicked off to process this family.
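The refinement described above boils down to a two-condition test. The following is a minimal sketch of that rule only; the `dlopen_request` structure and `needs_init_reevaluation` function are hypothetical names for illustration and are not taken from the runtime linker source.

```c
#include <stdbool.h>

/* Hypothetical summary of one dlopen() request; not a real ld.so.1 structure. */
struct dlopen_request {
    bool adds_new_objects;     /* at least one referenced object was not yet loaded */
    bool promotes_relocations; /* e.g. RTLD_NOW applied to a lazily bound object */
};

/* Returns true when the referenced family should have its .init's
 * reevaluated and a new thread of initialization started. Returns false
 * when the existing thread of initialization is left to finish the job. */
static bool needs_init_reevaluation(const struct dlopen_request *req)
{
    return req->adds_new_objects || req->promotes_relocations;
}
```

Under this rule, a plain dlopen(0) issued from within a .init adds nothing and promotes nothing, so it would no longer trigger the recursion described above.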
Conclusions
As you can now see, the dynamic initialization of objects within a process is quite complex, and no one in their right mind would have ever designed it this way. This complexity has evolved, from some incredibly vague starting point, to an implementation that has become necessary to satisfy existing dynamic objects. Whether a lot of this complexity is by accident or by design is open to debate. And whether developers have ever considered initialization requirements or designed to some goal seems doubtful. Given the inability to initialize groups of cyclic dependencies in a correct order, you have to wonder if users ever meant to create such an environment. Personally, I think most initialization "requirements" have worked more through luck than judgment.
I've tried to provide documentation to educate users of the issues facing object initialization. However, the Linker and Libraries Guide isn't the first place folks look. Documentation should probably start with the compiler documents, which are the first place developers usually go. Documenting how users should use .init's and .fini's might also be useful, although this really falls back to the languages, like C++, to document constructor and destructor use.
Better methods of analyzing initialization dependencies may also be useful. The runtime linker's debugging capabilities are a start. Objects can also be exercised under ldd using the -i and -d/r options. But this still seems a little too late in the development cycle. It would be nice if we could flag things like cyclic dependencies during object development, specifically the cyclic dependencies of .init's and .fini's. However, one problem with today's applications is that they are not all created in one build environment. Many components, asynchronously delivered from different ISVs, are brought together to produce a complete application.
The bottom line is keep it simple. Try and get rid of .init code.
It's amazing how much exists, and what it does (dlopen()/dlclose(), firing off new threads, I've even seen .init sections fork()/exec() processes). .init code often initializes an object for all eventualities, whereas much of the initialization is never used for a particular thread of execution. Make the initialization self-contained by eliminating or reducing the references to external objects from the initialization code. Reducing exported interfaces can help too. By keeping things simple you can avoid many of the initialization interactions that the runtime linker's implementation has evolved to handle. Your process will start up faster too. Folks often comment on how long it takes for a process to get to main. Have a look at how much initialization processing comes before this!
(2005-09-27 13:44:45.0)
Finding Symbols - reducing dlsym() overhead
In a previous post, I'd explained how lazy loading provides a fallback mechanism. If a symbol search exhausts all presently loaded objects, any pending lazy loaded objects are processed to determine whether the required symbol can be found. This fallback is required as many dynamic objects exist that do not define all their dependencies. These objects have (probably unknowingly) become reliant on other dynamic objects making available the dependencies they need. Dynamic object developers should define what they need, and nothing else.
dlsym(3c) can also trigger a lazy load fall back.
You can observe such an event by enabling the runtime linker's
diagnostics. Here, we're looking for a symbol in
% LD_DEBUG=symbols,files,bindings main
.....
19231: symbol=elf_errmsg; dlsym() called from file=main [ RTLD_DEFAULT ]
19231: symbol=elf_errmsg; lookup in file=main [ ELF ]
19231: symbol=elf_errmsg; lookup in file=/lib/libc.so.1 [ ELF ]
19231:
19231: rescanning for lazy dependencies for symbol: elf_errmsg
19231:
19231: file=libnsl.so.1; lazy loading from file=main: symbol=elf_errmsg
......
19231: file=libsocket.so.1; lazy loading from file=main: symbol=elf_errmsg
......
19231: file=libelf.so.1; lazy loading from file=main: symbol=elf_errmsg
......
19231: symbol=elf_errmsg; lookup in file=/lib/libelf.so.1 [ ELF ]
19231: binding file=main to file=/lib/libelf.so.1: symbol `elf_errmsg'
Exhaustively loading lazy dependencies to resolve a symbol isn't always what you want. This is especially true if the symbol may not exist. In Solaris 10 we added RTLD_PROBE. This handle results in the same lookup semantics as RTLD_DEFAULT, but does not fall back to an exhaustive loading of pending lazy objects. This handle can be thought of as the lightweight version of RTLD_DEFAULT.
Therefore, if we wanted to test for the existence of a symbol within
the objects that were presently loaded within a process, we could use
% LD_DEBUG=symbols,files,bindings main
.....
19251: symbol=doyouexist; dlsym() called from file=main [ RTLD_PROBE ]
19251: symbol=doyouexist; lookup in file=main [ ELF ]
19251: symbol=doyouexist; lookup in file=/lib/libc.so.1 [ ELF ]
......
19251: ld.so.1: main: fatal: doyouexist: can't find symbol
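A probe of this kind might be wrapped up as follows. This is a sketch, not code from the original post: `probe_handle` and `symbol_is_present` are hypothetical helper names, and the fallback to dlopen(NULL, RTLD_LAZY) on systems without RTLD_PROBE is an assumption made so the sketch compiles elsewhere. That fallback does not carry the "no lazy load fallback" semantics of RTLD_PROBE.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Returns the handle used for probing. RTLD_PROBE is a Solaris extension;
 * where it is not defined, this sketch falls back to a handle for the
 * main program, which is an assumption for portability only. */
static void *probe_handle(void)
{
#ifdef RTLD_PROBE
    return RTLD_PROBE;
#else
    return dlopen(NULL, RTLD_LAZY);
#endif
}

/* Returns non-zero if the named symbol can be found among the objects
 * searched through the probe handle. Note that dlsym() also returns NULL
 * for a symbol whose value is genuinely 0, which this helper ignores. */
static int symbol_is_present(const char *name)
{
    return dlsym(probe_handle(), name) != NULL;
}
```

A caller could then test `symbol_is_present("doyouexist")` before deciding whether an optional interface is available, without forcing every pending lazy dependency to be loaded on Solaris.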
When
    if ((handle = dlopen("foo.so", RTLD_LAZY)) != NULL) {
            fptr = dlsym(handle, "foo");
then intuitively the search for foo would be isolated to
In Solaris 9 8/03 we provided an extension to
Perhaps RTLD_PROBE and RTLD_FIRST can reduce your
(2005-09-09 10:32:29.0)
The Link-editors - a source tour
Welcome to OpenSolaris. I've been working with the link-editors for many years, and I thought that with the general availability of the source, now would be an opportune time to cover some history, and give a brief overview of the link-editors' source hierarchy.
The link-editor components reside under the usr/src/cmd/sgs directory. This Software Generation Subsystem hierarchy originated from the AT&T and Sun collaboration that produced Solaris 2.0. Under this directory exist the link-editors, and various tools that manipulate or display ELF file information. There are also some ancillary components that I've never modified. I believe at some point it may also have contained compilers; however, these have long since moved to their own separate source base.
The Link-Editor
When you mention the link-editor, most folks think of ld(1). You'll find this under sgs/ld. However, this binary is only a stub that provides argument processing and then dynamically loads the heart of the link-editor, libld.so. This library comes in two flavors, a 32-bit version and a 64-bit version, both capable of producing a 32-bit or 64-bit output file. The class of library that is loaded is chosen from the class of the first input relocatable object read from the command line. This model stems from a compiler requirement that the link-editor class remain consistent with various compiler subcomponents.
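The class of a relocatable object is recorded in its ELF identification bytes. As an illustration of the kind of check involved, here is a minimal sketch that classifies a buffer holding the start of a file. The `elf_class_bits` function is a hypothetical name, and this is not the code ld(1) uses; only the magic number and the EI_CLASS values come from the ELF format itself.

```c
#include <stddef.h>

/* Returns 32 or 64 for a buffer that starts with a valid ELF
 * identification, or 0 if the buffer is too short, is not ELF,
 * or carries an unknown class. */
static int elf_class_bits(const unsigned char *ident, size_t len)
{
    /* The first four bytes are the magic number: 0x7f 'E' 'L' 'F'. */
    if (len < 5 || ident[0] != 0x7f || ident[1] != 'E' ||
        ident[2] != 'L' || ident[3] != 'F')
        return 0;

    switch (ident[4]) {      /* e_ident[EI_CLASS] */
    case 1:  return 32;      /* ELFCLASS32 */
    case 2:  return 64;      /* ELFCLASS64 */
    default: return 0;       /* ELFCLASSNONE or invalid */
    }
}
```

A stub like ld could read the first few bytes of its first input file, pass them to a check of this sort, and pick the matching class of library to load.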
The Runtime Linker
However, there's another link-editor that is required to execute every
application on Solaris. This editor takes over where the standard
link-editor left off, and is referred to as the runtime-linker,
ld.so.1(1).
You can find this under
sgs/rtld.
The runtime linker takes an application from
exec(2),
loads any required dependencies, and binds the associated objects
together with the information left from
This very close association of
One historic area of the runtime linker is its AOUT support.
Objects from our SunOS4.x release were in AOUT format, and
to aid customer transition from this release to Solaris, support for
executing AOUT applications was provided by
Also, if you poke around the relocation files for
Support Libraries
There are various support libraries employed by the link-editors.
A debugging library,
liblddbg.so, is employed by
As you can see, there are a lot of interrelationships between the various components of the link-editors. The interfaces between these components are private and often change. When providing updates to the link-editors in patches and updates, this family of components is maintained and supplied as a whole unit.
Proto Build
As part of building the link-editor components, you might notice that
we first build a version of
Package Build
Under sgs/packages you'll see we have the capability of building our own package. This isn't the official package(s) that the link-editors are distributed under, but a sparse package, containing all our components. This package is how we install new link-editors quickly on a variety of test machines, or provide to other developers to test new capabilities or bug fixes, before we integrate into an official build. Note, there's no co-ordination between this package and the official package database, it's really no different than tar(1)'ing the bits onto your system, except you can back the changes out!
Patches
We make a lot of patches. Sure, there are bugs and escalations that need resolving, but we frequently have to make new capabilities available on older releases. The compilers are released asynchronously from the core OS, and new capabilities required by these compilers must be made available on every release the compilers are targeted to.
We have a unique way of generating patches. When asked to generate a patch we typically backport all the latest and greatest components. As I described earlier, there's a lot of interaction between the various components, and thus trying to evaluate whether an individual component can be delivered isn't always easy. So, we've cut this question out of the puzzle from the start, and always deliver all the link-editor components as a family.
Trying to isolate a particular bug fix can also be challenging. It may look like a two line code fix addresses a customer escalation, but these two lines are often dependent on some other fixes, in other files, that occurred many months before. Trying to remember, and test for, all these possible interactions can be a nightmare, so we've removed this question from the puzzle too.
When we address a bug, we address it in the latest source base. If the bug can't be duplicated, then it may have been fixed by some previous change, in which case we'll point to the associated patch. Otherwise, we'll use all the resources available on the latest systems to track down and fix the issue. Yep, that means we get to use the latest mdb(1) features, dtrace(1M), etc. There's nothing more frustrating than having to evaluate a bug on an old release where none of your favorite tools exist. Sometimes we have to fall back to an older release, but we try and avoid it if we can.
Having a fix for the issue, we'll integrate the changes in the latest Solaris release.
And, after some soak time, in which the fix has gone through various test cycles and been deployed on our desktops and build servers, we'll integrate the same changes in all the patch gates. Effectively, we're only maintaining one set of bits across all releases. This greatly reduces the maintenance of the various patch environments, and frees up more time for future development.
This model hasn't been without some vocal opponents - "I want a fix for xyz, and you're giving me WHAT!". But most have come around to the simplicity and efficiency of the whole process. Is it flawless? No. Occasionally, regressions have occurred, although these are always in some area that has been outside of the scenarios we're aware of, or test for. Customers always do interesting things. But it will be a customer who finds such an issue, either in a major release, update or patch. Our answer is to respond immediately to any such issues. Our package build comes in very handy here.
You can always find the bugs we've fixed and patches generated from our SUNWonld-README file.
Other Stuff
Some other support tools are elfdump(1), ldd(1), and pvs(1). And, there's crle(1), moe(1), and lari(1). I quite enjoyed coming up with these latter names, but you might need some familiarity with old American culture to appreciate this. Which is rather odd in itself, as I'm British.
Anyway, hopefully this blog has enlightened you on navigating the OpenSolaris hierarchy in regard to the link-editors. Have fun, and in respect for a popular, current media event - may the source be with you.
(2005-06-14 08:09:30.0)
Loading Multiple Files - same name, different directories
A recent customer observation reminded me of a subtlety of shared object dependency lookup, and a change that occurred between Solaris 8 and 9. The customer observed different dependencies being loaded on the two systems, although the application's file system hierarchy was the same on the two systems. On Solaris 8, the following was observed.
$ ldd ./app
libX.so.1 => /opt/ISV/weblib/libX.so.1
libY.so.1 => /opt/ISV/weblib/libY.so.1
libZ.so.1 => /opt/ISV/weblib/libZ.so.1
And, on Solaris 9, the following was observed.
$ ldd ./app
libX.so.1 => /opt/ISV/weblib/libX.so.1
libY.so.1 => /opt/ISV/weblib/libY.so.1
libZ.so.1 => /opt/ISV/weblib/libZ.so.1
libX.so.1 => /opt/ISV/lib/libX.so.1
libY.so.1 => /opt/ISV/lib/libY.so.1
Notice, with Solaris 9 we seem to have gained two new dependencies
from the directory
In a previous posting I'd discussed some warnings in regard to using LD_LIBRARY_PATH, and how using a runpath was a better alternative. This customer's application and dependencies are using runpaths; however, the runpaths are not consistent, and are revealing the different behavior between Solaris 8 and Solaris 9. In Solaris 8 and prior releases, dependencies were loaded by:
It was discovered that this dependency name pattern matching was becoming a significant bottleneck, especially as the number of application dependencies continued to increase. A second drawback to this model was that requirements started to materialize for processes to be able to open different dependencies with the same filename, where the files were located in different directories. These observations resulted in a change to the loading behavior.
With Solaris 9 we no longer carry out the filename dependency pattern match against previously loaded objects. We simply search for the file using any search paths relevant to the caller (which includes the RPATH of the caller). Should this search result in a file that has already been loaded, a quick dev/inode check catches this, and prevents a duplicate loading. The result is a much faster and more scalable search for dependencies, and the flexibility required to locate the same filename in different locations. Hence, starting with Solaris 9 we now see:
$ ldd -s ./app
....
find object=libX.so.1; required by /opt/ISV/weblib/libY.so.1
search path=/opt/ISV/lib:/opt/ISV/lib/../SS:......:/usr/lib/lwp \
(RPATH from file /opt/ISV/weblib/libY.so.1)
trying path=/opt/ISV/lib/libX.so.1
libX.so.1 => /opt/ISV/lib/libX.so.1
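The dev/inode check mentioned above can be illustrated with stat(2). This is a minimal sketch under the assumption that two pathnames refer to the same file exactly when their device and inode numbers match; `same_file` is a hypothetical helper, not a function from the runtime linker.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Returns 1 if both paths resolve to the same underlying file (same
 * device and inode), and 0 if they differ or either cannot be examined. */
static int same_file(const char *path_a, const char *path_b)
{
    struct stat a, b;

    if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
        return 0;

    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

With a check like this, /opt/ISV/weblib/libX.so.1 and /opt/ISV/lib/libX.so.1 would only be treated as one object if they were really the same file, for example through a symbolic or hard link. Two separate copies have different inodes and are both loaded, as the ldd output shows.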
This search path, initiated from
Whether two different versions of the same file are required in
this user's hierarchy is still unknown; perhaps they should be consolidated.
But, if you want to ensure the dependencies located by your components
are the same, the search paths (RPATHS - set using
(2005-05-08 22:36:01.0)
My Relocations Don't Fit - Position Independence
A couple of folks have come across the following relocation error when running their applications on AMD64:
$ prog
ld.so.1: prog: fatal: relocation error: R_AMD64_32: file \
libfoo.so.1: symbol (unknown): value 0xfffffd7fff0cd457 does not fit
The culprit,
Shared objects are typically built using position independent code, using
compiler options such as
If a shared object is built from position-dependent code, the text segment can require modification at runtime. This modification allows relocatable references to be assigned to the location at which the object has been loaded. The relocation of the text segment requires the segment to be remapped as writable. This modification requires a swap space reservation, and results in a private copy of the text segment for the process. The text segment is no longer sharable between multiple processes. Position-dependent code typically requires more runtime relocations than the corresponding position-independent code. Overall, the overhead of processing text relocations can cause serious performance degradation.
When a shared object is built from position-independent code, relocatable references are generated as indirections through data in the shared object's data segment. The code within the text segment requires no modification. All relocation updates are applied to corresponding entries within the data segment.
The runtime linker attempts to handle text relocations should these relocations exist. However, some relocations cannot be satisfied at runtime.
The AMD64 position-dependent code sequence typically generates code which can only be loaded into the lower 32-bits of memory. The upper 32-bits of any address must all be zeros. Since shared objects are typically loaded at the top of memory, the upper 32-bits of an address are required. Position-dependent code within an AMD64 shared object is therefore insufficient to cope with relocation requirements. Use of such code within a shared object can result in the runtime relocation errors cited above.
This situation differs from the default ABS64 mode that is used for 64-bit SPARCV9 code. This position-dependent code is typically compatible with the full 64-bit address range. Thus, position-dependent code sequences can exist within SPARCV9 shared objects.
Use of either the ABS32 mode, or ABS44 mode for 64-bit SPARCV9 code, can still result in relocations that cannot be resolved at runtime. However, each of these modes requires the runtime linker to relocate the text segment.
Build all your shared objects using position-independent code.
Update - Wednesday March 21, 2007
If you believe you have compiled all the components of your shared object
using
$ elfdump -d library | fgrep TEXTREL
If this flag is found, then the link-editor thinks this file contains non-pic code. One explanation might be that you have included an assembler file in the shared library. Any assembler must be written using position-independent instructions. Another explanation is that you might have included objects from an archive library as part of the link-edit. Typically, archives are built with non-pic objects.
You can track down the culprit using the link-editor's debugging capabilities. Build your shared object.
$ LD_OPTIONS=-Dreloc,detail cc -o library .... 2> dbg
The diagnostic output in dbg can be inspected to locate the non-pic relocation, and from this you can trace back to the input file that supplied the relocation as part of building your library.
(2005-04-26 09:40:09.0)