Surfing With a Linker Alien
Rod Evans's Weblog

Monday August 30, 2004

Relocations - careful with that debugging flag

I received an application from a customer the other day. It's quite a big sucker, consisting of the application and over 70 shared objects (that's besides the system objects that also get used).

    % size -x main *.so
    main:       2df35c + 2675a4 + 80918f8 = 0x85d81f8
    libxxx.so: 64d4d9c + 9af9f6 + 19604ba = 0x87e4c4c
    libyyy.so: 4db7aeb + 76aa4c + 32cc16c = 0x87ee6a3
    libzzz.so: 3f347ce + d8ebb1 + 4642a3b = 0x9305dba
    ....

The customer has complained that it takes a long time to load this application. In particular, it takes a long time to verify their objects using ldd(1) and the -r option. See the Solaris 10 man pages section 1: User Commands.

Using ldd(1) with the -d option emulates the cost of starting a process. The relocation processing is exactly the same as that applied by ld.so.1(1) at runtime. Using the -r option processes all relocations, as ld.so.1(1) would if the environment variable LD_BIND_NOW were in effect. Using ldd(1) with the -r option is a convenient way of testing that all symbol references can be found for a particular object.

The set of shared objects supplied by the customer does not specify any dependencies. In fact, the application seems responsible for establishing all dependencies: not only those that the application itself references, but also those needed to satisfy the references of its dependencies. If each shared object defined its own dependencies, then ldd(1) could be used on each object to validate its symbol requirements. However, with this set of objects, the only means of validating symbol requirements is to run ldd(1) against the application, and this is taking a long time.

A quick poke around with elfdump(1) reveals that there are over 3.3 million relocations to process for this application and its dependencies. Some profiling (DTrace is your friend here) revealed that relocation processing is taking nearly 99% of the startup cost. Locating and mapping all the objects is trivial.

Looking a little deeper I found that around 2.3 million relocations are RELATIVE relocations. These relocations simply need the base address at which the object is mapped to be added to the value recorded for the relocation. This is a simple operation, involving no symbol lookup, and is only accounting for a few percent of the cost.
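To show why RELATIVE relocations are so cheap, here is a minimal sketch of the kind of loop involved. It is not the runtime linker's code: the relative_reloc type and apply_relative() are names made up for this illustration, and the record is a simplified stand-in for the real ELF relocation structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified record; not the real Elf32_Rela/Elf64_Rela. */
typedef struct {
    size_t    offset;   /* where in the mapped image to store the result */
    uintptr_t addend;   /* link-time value that must be rebased */
} relative_reloc;

/*
 * Apply every RELATIVE relocation: the stored value is simply the
 * object's base address plus the recorded addend. No symbol lookup.
 */
static void
apply_relative(unsigned char *image, size_t image_size, uintptr_t base,
    const relative_reloc *relocs, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uintptr_t value = base + relocs[i].addend;

        /* Guard against a record that points outside the image. */
        assert(relocs[i].offset + sizeof (value) <= image_size);
        memcpy(image + relocs[i].offset, &value, sizeof (value));
    }
}
```

Each iteration is one add and one store, which fits with 2.3 million of these only accounting for a few percent of the startup cost.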

The rest of the startup cost stems from the symbolic relocations, of which there are some 740,000 that needed processing with ldd(1) and the -d option. Poking some more revealed that 680,000 of these symbols are of the form $XBGEQEsZEHHBGQS.... (it's the $X prefix that's the giveaway). These are local symbols that have been made unique and promoted to global symbols by the compilers when the -g (debugging) option is used. By all accounts, they allow the debuggers to provide fix-and-continue processing, or, if your compilers have this capability, inter-object optimization.
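If you want to gauge how much of your own symbolic relocation load comes from these promoted locals, a prefix test is all it takes. The sketch below assumes you have already pulled the relocation symbol names into an array (say, from saved elfdump(1) output); is_promoted_local() and count_promoted_locals() are hypothetical helpers, not part of any Solaris tool.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True when the name carries the "$X" prefix described above. */
static int
is_promoted_local(const char *name)
{
    assert(name != NULL);
    return strncmp(name, "$X", 2) == 0;
}

/* Count how many of the given symbol names are promoted locals. */
static size_t
count_promoted_locals(const char *const *names, size_t count)
{
    size_t hits = 0;

    for (size_t i = 0; i < count; i++)
        if (is_promoted_local(names[i]))
            hits++;
    return hits;
}
```

Comparing the count against the total number of symbolic relocations gives the same kind of ratio I quoted above.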

I don't have the source for this set of objects to experiment with not using -g. But I'm left concluding that the bulk of the startup cost of this process is due to these $X.... symbols.

If you don't want fix-and-continue, be careful how you use the compiler flags. This overhead in relocation processing probably isn't what you want in production software.

Note, you can also build objects with the -zcombreloc flag of ld(1). This option combines relocation sections into one table. The RELATIVE relocations are all concatenated and represented by dynamic entries that allow the runtime linker to process them through an optimized loop that is even faster than normal relative relocation processing.
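The idea behind the combined table can be modeled in a few lines. This is only an illustration of the grouping, not what ld(1) actually does internally: the reloc type and group_relative_first() are invented for the example, and the returned count stands in for the information the dynamic entries carry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified relocation record. */
typedef struct {
    size_t offset;
    bool   relative;   /* true for RELATIVE, false for a symbolic relocation */
} reloc;

/*
 * Stable partition: move the RELATIVE entries to the front of the table
 * and return how many there are. A consumer can then run a tight loop
 * over the first N entries without testing each relocation's type.
 */
static size_t
group_relative_first(reloc *relocs, size_t count)
{
    assert(relocs != NULL || count == 0);

    size_t nrel = 0;
    for (size_t i = 0; i < count; i++) {
        if (!relocs[i].relative)
            continue;

        reloc r = relocs[i];
        /* Shift the symbolic entries between nrel and i up by one. */
        for (size_t j = i; j > nrel; j--)
            relocs[j] = relocs[j - 1];
        relocs[nrel++] = r;
    }
    return nrel;
}
```

With the RELATIVE entries contiguous and counted, the per-relocation type check drops out of the hot loop.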

(2004-08-30 15:55:07.0)

Sunday August 22, 2004

Dynamic Object Versioning

For some time now, we've been versioning core system libraries. You can display version definitions and version requirements with pvs(1). For example, the latest version of libelf.so.1 from Solaris 10 provides the following versions:

    % pvs -d /lib/libelf.so.1
        libelf.so.1;
        SUNW_1.5;
        SUNW_1.4;
        ....
        SUNWprivate_1.1;

So, what do these versions provide? Shared object versioning has often been established with various conventions of renaming the file itself with different major or minor (or micro) version numbers. However, as applications have become more complex, specifically because they are constructed from objects that are asynchronously delivered from external partners, this file naming convention can be problematic.

In developing the core Solaris libraries, we've been rather obsessed with compatibility, and rather than expect customers to rebuild against different shared object file names (i.e., libfoo.so.1, and later libfoo.so.2), we've maintained compatibility by defining fixed interface sets within the same library file. And the only change we've made to the library is to add new interface sets. These interface sets are described by version names.

Now you could maintain compatibility by retaining all existing public interfaces, and only adding new interfaces, without the versioning scheme. However, the versioning scheme has several advantages:

  • consumers of the interface sets record their requirements on the version name they reference.

  • establishing interface sets removes unnecessary interfaces from the name-space.

  • the version sets provide a convenient means of policing interface evolution.

When a consumer references a versioned shared object, the version name representing the interfaces the consumer references is recorded. For example, an application that references the elf_getshnum(3elf) interface from libelf.so.1 will record a dependency on the SUNW_1.4 version:

    % cc -o main main.c -lelf
    % pvs -r main
        libelf.so.1 (SUNW_1.4);

This version name requirement is verified at runtime. Therefore, should this application be executed in an environment consisting of an older libelf.so.1, one that perhaps only offers version names up to SUNW_1.3, then a fatal error will result when libelf.so.1 is processed:

    % pvs -dn /lib/libelf.so.1
        SUNW_1.3;
        SUNWprivate_1.1;
    % main
    ld.so.1: ./main: fatal: libelf.so.1: version `SUNW_1.4' not found \
        (required by file ./main)

This verification might seem simplistic, and won't the application be terminated anyway if a required interface can't be located? Well yes, but function binding normally occurs at the time the function is first called. And this call can be some time after an application is started (think scientific applications that can run for days or weeks). It is far better to be informed that an interface can't be located when a library is first loaded than to be killed some time later when a specific interface can't be found.
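The load-time check boils down to a name match between what the consumer recorded and what the library defines. Here's a rough model of it, assuming both sets are available as plain string arrays; first_missing_version() is a hypothetical helper for illustration, not an interface of ld.so.1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the first required version name that the library does not
 * define, or NULL when every requirement is satisfied.
 */
static const char *
first_missing_version(const char *const *required, size_t nreq,
    const char *const *defined, size_t ndef)
{
    assert(required != NULL && defined != NULL);

    for (size_t i = 0; i < nreq; i++) {
        size_t j;

        for (j = 0; j < ndef; j++)
            if (strcmp(required[i], defined[j]) == 0)
                break;
        if (j == ndef)
            return required[i];   /* nothing in the library matches */
    }
    return NULL;
}
```

Fed the requirement SUNW_1.4 and the older library's SUNW_1.3 and SUNWprivate_1.1, this hands back SUNW_1.4, which is the name reported in the fatal error above. The point is that the check is cheap and happens once, when the library is loaded, rather than at some arbitrary first call.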

Defining a version typically results in the demotion of many other global symbols to local scope. This localization can prevent unintended symbol collisions. For example, most shared objects are built from many relocatable objects, each referencing one another. The interface that the developer wishes to export from the shared object is normally a subset of the global symbols that would otherwise remain visible.

Version definitions can be defined using a mapfile. For example, the following mapfile defines a version containing two interfaces. Any other global symbols that would normally be made available by the objects that contribute to the shared object are demoted, and hence hidden as locals:

    % cat mapfile
    ISV_1.1 {
        global:
            foo1;
            foo2;
        local:
            *;
    };
    % cc -o libfoo.so.1 -G -Kpic -Mmapfile foo.c bar.c ...
    % pvs -dos libfoo.so.1
    libfoo.so.1 -       ISV_1.1: foo1;
    libfoo.so.1 -       ISV_1.1: foo2;

The demotion of unnecessary global symbols to locals greatly reduces the relocation requirements of the object at runtime, and can significantly reduce the runtime startup cost of loading the object.

Of course, interface compatibility requires a disciplined approach to maintaining interfaces. In the previous example, should the signature of foo1() be changed, or foo2() be deleted, then the use of a version name is meaningless. Any application that had been built against the original interfaces will fail at runtime when the new library is delivered, even though the version name verification will have been satisfied.

With the core Solaris libraries we maintain compatibility as we evolve through new releases by maintaining existing public interfaces and only adding new version sets. Auditing of the version sets helps catch any mistaken interface deletions or additions. Yeah, we fall foul of cut-and-paste errors too :-)

For more information on versioning refer to the Versioning Quick Reference. Or for a detailed description refer to Application Binary Interfaces and Versioning.

(2004-08-22 21:47:13.0)

Sunday August 01, 2004

Lazy Loading - there's even a fall back

In my previous posting, I described the use of lazy loading. Of course, when we initially played with an implementation of this technology, a couple of applications immediately fell over. It turns out that a fall back was necessary.

Let's say an application developer creates an application with two dependencies. The developer wishes to employ lazy loading for both dependencies.

    % ldd main
        foo.so =>        ./foo.so
        bar.so =>        ./bar.so
        ...

The application developer has no control over the dependency bar.so, as this dependency is provided by an outside party. In addition, this shared object has its own dependency on foo.so, but it does not express the required dependency information. If we were to inspect this dependency, we would see that it is not ldd(1) clean.

    % ldd -r bar.so
        symbol not found: foo     (./bar.so)

The only reason this library has been successfully employed by any application is because the application, or some other shared object within the process, has made the dependency foo.so available. This is probably more by accident than design, but sadly it is an all too common occurrence.

Now, suppose the application main makes reference to a symbol that causes the lazy loading of bar.so before the application makes reference to a symbol that would cause the lazy loading of foo.so to occur.

    % LD_DEBUG=bindings,symbols,files main  
    .....
    07683: 1: transferring control: ./main
    .....
    07683: 1: file=bar.so;  lazy loading from file=./main: symbol=bar
    .....
    07683: 1: binding file=./main to file=./bar.so: symbol `bar'

When control is passed to bar(), the reference it makes to its implicit dependency foo() is not going to be found, because the shared object foo.so is not yet available. Because this scenario is so common, the runtime linker provides a fall back. If a symbol cannot be found, and lazy loadable dependencies are still pending, the runtime linker will process these pending dependencies in a final attempt to locate the symbol. This can be observed from the remaining debugging output.

    07683: 1: symbol=foo;  lookup in file=./main  [ ELF ]
    07683: 1: symbol=foo;  lookup in file=./bar.so  [ ELF ]
    07683: 1: 
    07683: 1: rescanning for lazy dependencies for symbol: foo
    07683: 1: 
    07683: 1: file=foo.so;  lazy loading from file=./main: symbol=foo
    .....
    07683: 1: binding file=./bar.so to file=./foo.so: symbol `foo'

Of course, there can be a down-side to this fall back. If main were to have many lazy loadable dependencies, each will be processed until foo() is found. Thus, several dependencies may get loaded that aren't necessary. The use of lazy loading is never going to be more expensive than non-lazy loading, but if this fall back mechanism has to kick in to find implicit dependencies, the advantage of lazy loading is going to be compromised.
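The fall back, and its cost, can be modeled with a toy lookup. This is a sketch under simplifying assumptions, not the runtime linker's implementation: the object type, defines() and lookup_with_fallback() are all made up here, and "loading" an object is reduced to flipping a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of an object and the symbols it exports. */
typedef struct {
    const char  *name;
    const char **symbols;   /* NULL-terminated list of exported names */
    bool         loaded;    /* false while a lazy dependency is pending */
} object;

static bool
defines(const object *obj, const char *sym)
{
    for (const char **s = obj->symbols; *s != NULL; s++)
        if (strcmp(*s, sym) == 0)
            return true;
    return false;
}

/*
 * Search the loaded objects first. If that fails, load the pending lazy
 * dependencies one at a time, in order, until one defines the symbol.
 * *forced reports how many objects the fall back had to load.
 */
static const object *
lookup_with_fallback(object *objs, size_t count, const char *sym,
    size_t *forced)
{
    assert(objs != NULL && sym != NULL && forced != NULL);

    *forced = 0;
    for (size_t i = 0; i < count; i++)
        if (objs[i].loaded && defines(&objs[i], sym))
            return &objs[i];

    for (size_t i = 0; i < count; i++) {
        if (objs[i].loaded)
            continue;
        objs[i].loaded = true;   /* stands in for mapping the object */
        (*forced)++;
        if (defines(&objs[i], sym))
            return &objs[i];
    }
    return NULL;
}
```

If a pending dependency that doesn't define foo() sits ahead of foo.so in the list, it gets loaded too, and that is exactly how the advantage of lazy loading gets eroded.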

To prevent lazy loading from being compromised, always record those dependencies you need (and nothing else).

(2004-08-01 20:05:07.0)

