Morgan Herrington's Blog Adventures in Porting and Tuning

Thursday Oct 19, 2006

I was assisting an engineer in debugging a large application (consisting of millions of lines of code in dozens of libraries) and we suspected that the problem we were working on might have been caused by (inadvertent) symbol interposition.

With enough analysis of the output of /usr/ccs/bin/nm (applied across all of the libraries), I knew I could find symbols which were duplicated in multiple libraries. And by poring over the diagnostics of the linker and/or runtime loader, I could even determine which symbols were incorrectly bound. So I started by dumping the runtime loader help message (to remind myself how to turn on the linker/loader diagnostics):

$ LD_DEBUG=help /bin/echo

[By the way, this is also supported by the loader on Linux.]
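For the nm side of the search, here is a minimal sketch of how the duplicate hunt could be automated. It assumes the symbol tables were dumped in the POSIX portable format (nm -P -g, one "name type value size" record per line); the function names and the sample data are hypothetical, and the exact nm flags needed for shared objects may differ between platforms.

```python
from collections import defaultdict


def defined_globals(nm_output):
    """Return the set of globally defined symbol names in `nm -P -g` style output."""
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        name, sym_type = fields[0], fields[1]
        # Uppercase type letters are global symbols; 'U' marks an undefined reference.
        if sym_type.isupper() and sym_type != "U":
            names.add(name)
    return names


def duplicated_symbols(nm_by_library):
    """Map each symbol defined in more than one library to the libraries defining it."""
    owners = defaultdict(list)
    for library, nm_output in sorted(nm_by_library.items()):
        for name in defined_globals(nm_output):
            owners[name].append(library)
    return {name: libs for name, libs in owners.items() if len(libs) > 1}


# Hypothetical nm output, made up for illustration only.
sample = {
    "libone.so": "lookup_symbol T 0000000000000a10 0000000000000040\n"
                 "helper_one T 0000000000000a50 0000000000000020\n",
    "libtwo.so": "lookup_symbol T 0000000000000b00 0000000000000040\n"
                 "printf U\n",
}

print(duplicated_symbols(sample))
# {'lookup_symbol': ['libone.so', 'libtwo.so']}
```

A duplicate found this way is only a candidate: whether it actually causes trouble depends on which definition the runtime loader binds to.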

Even with instructions, I usually have to experiment for a couple of minutes to discover if I want to look at the diagnostics from bindings or symbols and whether or not I need to add the detail specifier. Pretty quickly, though, I settled on:

$ LD_DEBUG=bindings LD_DEBUG_OUTPUT=/tmp/dump myapplication

This generated a lot of output, but with some filtering, I eventually saw something like the following:

08764: 1: binding file=./libone.so to file=./libone.so: symbol `lookup_symbol'
... many lines omitted ...
08764: 1: binding file=./libtwo.so to file=./libone.so: symbol `lookup_symbol'

Because libtwo.so expected to use its own version of lookup_symbol, this was the symbol problem we were looking for.
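The filtering step can be sketched roughly as follows. This is not the filter I actually used, just one way to do it, and it assumes every binding line has exactly the shape shown above; the regular expression and the function name are hypothetical, and real loader output may carry extra detail that needs a looser pattern.

```python
import re
from collections import defaultdict

# Matches lines shaped like:
#   08764: 1: binding file=./libtwo.so to file=./libone.so: symbol `lookup_symbol'
BINDING = re.compile(
    r"binding file=(?P<src>\S+) to file=(?P<dst>\S+): symbol `(?P<sym>[^']+)'"
)


def cross_object_bindings(debug_output):
    """Return {symbol: [(referencing file, defining file), ...]} for every
    binding whose reference and definition live in different objects."""
    found = defaultdict(set)
    for line in debug_output.splitlines():
        match = BINDING.search(line)
        if match and match["src"] != match["dst"]:
            found[match["sym"]].add((match["src"], match["dst"]))
    return {sym: sorted(pairs) for sym, pairs in found.items()}


sample = """\
08764: 1: binding file=./libone.so to file=./libone.so: symbol `lookup_symbol'
... many lines omitted ...
08764: 1: binding file=./libtwo.so to file=./libone.so: symbol `lookup_symbol'
"""

print(cross_object_bindings(sample))
# {'lookup_symbol': [('./libtwo.so', './libone.so')]}
```

Most cross-object bindings are perfectly normal (every call into libc is one), so the result is only interesting for symbols that are already known to be defined in more than one library.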

The real point of this story, however, is that I was reading a two-year-old entry from Rod Evans's blog and (re)discovered that Solaris 10 provides a tool, /usr/ccs/bin/lari, which would have uncovered this problem automatically. Going back and trying this on my original problem:

$ /usr/ccs/bin/lari myapplication | c++filt 
[2:0]: std::bad_cast::~bad_cast()(): /opt/SUNWspro/lib/libCrun.so.1
[2:1EP]: std::bad_cast::~bad_cast()(): myapplication
[2:0]: htonl(): /lib/libc.so.1
[2:3ES]: htonl(): /lib/libsocket.so.1
[2:2ES]: lookup_symbol(): ./libone.so
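[2:0]: lookup_symbol(): ./libtwo.so

The syntax of each line is "[symbol count:bindings]: symbol name: object", where the binding count may be followed by flag letters (E, S, and P in the output above) that describe the bindings.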

I still need to figure out what is causing the messages for bad_cast and htonl, but the last two lines immediately show the problem that we were looking for. There are two definitions of lookup_symbol, but only the version from libone.so is being used.
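To pick such cases out of a longer report, a small parser along these lines might help. It is a sketch that assumes the line syntax described above and treats the flag letters as an opaque string; the regular expression and function names are hypothetical, not part of lari.

```python
import re
from collections import defaultdict

# "[symbol count:bindings + optional flag letters]: symbol name: object"
LARI_LINE = re.compile(r"^\[(\d+):(\d+)([A-Z]*)\]: (.+): (\S+)$")


def parse_lari(output):
    """Group lari lines by symbol as {symbol: [(object, bindings, flags), ...]}."""
    entries = defaultdict(list)
    for line in output.splitlines():
        match = LARI_LINE.match(line.strip())
        if match:
            _count, bindings, flags, symbol, obj = match.groups()
            entries[symbol].append((obj, int(bindings), flags))
    return entries


def unused_definitions(output):
    """Return {symbol: [objects whose definition received no bindings]} for
    symbols where some other definition did receive bindings."""
    result = {}
    for symbol, definitions in parse_lari(output).items():
        unused = [obj for obj, count, _flags in definitions if count == 0]
        used = [obj for obj, count, _flags in definitions if count > 0]
        if unused and used:
            result[symbol] = unused
    return result


sample = """\
[2:0]: std::bad_cast::~bad_cast()(): /opt/SUNWspro/lib/libCrun.so.1
[2:1EP]: std::bad_cast::~bad_cast()(): myapplication
[2:0]: htonl(): /lib/libc.so.1
[2:3ES]: htonl(): /lib/libsocket.so.1
[2:2ES]: lookup_symbol(): ./libone.so
[2:0]: lookup_symbol(): ./libtwo.so
"""

for symbol, objects in unused_definitions(sample).items():
    print(symbol, "unused in", ", ".join(objects))
```

On the report above this flags all three symbols, including the libtwo.so copy of lookup_symbol; deciding which of the flagged entries are harmless is still a manual step.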

I don't have much experience with lari (yet), but just for good linker hygiene, I think I might run it proactively to check for lurking problems.
