Darryl Gove's blog

Friday May 22, 2009

Graph of libraries used by firefox and thunderbird

Just gathered library usage charts for firefox and thunderbird. The full charts look like:

Firefox

Thunderbird

Neither of which is particularly telling. The reduced charts look much better:

Firefox

Thunderbird

Wednesday May 20, 2009

Drawing libraries - neater eye-candy!

Chris Quenelle posted an interesting comment to my post which showed the dependencies for StarOffice. As you can see from the mass of lines below, adding more dependency information, using the latest version of ld_dot, into the StarOffice library map did not make the graphic any clearer!

It turns out that the reduction operation that Chris was alluding to is implemented by tred (the "transitive reduction filter", what great technobabble!). This filtering reduces the graph down to something which even looks ok when shrunk down to fit this page:

This clarifies the relationships between the libraries. More importantly it also looks pretty.

Libraries (5) - Runtime costs - TLBs

The next consideration when using libraries is that each library will get mapped in on a new virtual page of memory; as shown in this pmap output:

% pmap 60500
60500:  a.out
00010000       8K r-x--  /libraries/a.out
00020000       8K rwx--  /libraries/a.out
FEEC0000      24K rwx--    [ anon ]
FEED0000       8K r-x--  /libraries/lib1_26.so
FEEE0000       8K rwx--  /libraries/lib1_26.so
FEEF0000       8K r-x--  /libraries/lib1_25.so
FEF00000       8K rwx--  /libraries/lib1_25.so
FEF10000       8K r-x--  /libraries/lib1_24.so
FEF20000       8K rwx--  /libraries/lib1_24.so
FEF30000       8K r-x--  /libraries/lib1_23.so
FEF40000       8K rwx--  /libraries/lib1_23.so
FEF50000       8K rwx--    [ anon ]
FEF60000       8K r-x--  /libraries/lib1_22.so
FEF70000       8K rwx--  /libraries/lib1_22.so
FEF80000       8K r-x--  /libraries/lib1_21.so
FEF90000       8K rwx--  /libraries/lib1_21.so
FEFA0000       8K r-x--  /libraries/lib1_20.so
FEFB0000       8K rwx--  /libraries/lib1_20.so
FEFC0000       8K r-x--  /libraries/lib1_19.so
....

There are finite number of TLB entries on a chip. If each library takes an entry, and the code jumps around between libraries, then a single application can utilise quite a few TLB entries. Take a CMT system where there are multiple applications (or copies of the same application) running, and there becomes a lot of pressure on the TLB.

One of the enhancements in Solaris to support CMT processors is Shared Context. When multiple applications map the same library at the same address, then they can share a single context to map that library. This can lead to a significant reduction in the TLB pressure. Shared context only works for libraries that are loaded into the same memory locations in different contexts, so it can be defeated if the libraries are loaded in different orders or any other mechanisms that scramble the locations in memory.

If each library is mapped into a different TLB entry, then every call into a new library is a new ITLB entry, together with a jump through the PLT, together with the normal register spill/fill overhead. This can become quite a significant chunk of overhead.

To round this off, lets look at some figures from an artificial code run on an UltraSPARC T1 system that was hanging around here.

ExperimentRuntime
Application that jumps between 26 different routines a->b->c...->z. All the routines are included in the same executable. 3s
Application that jumps between 26 different routines a->...z. The routines are provided as a library, and calls are therefore routed through the PLT. 6s
Application that jumps between 26 different routines a->...z. The routines are provided as a library, but all are declared static except for the initial routine that is called by main. Therefore the calls within the library avoid the PLT. 3s
Application that jumps between 26 different routines a->...z. Each routine is defined in its own library, so calls to the routine have to go through the PLT, and also require a new ITLB entry to be used. 60s

Since the routines in this test code don't actually do anything, the overhead of calling through the PLT is clearly shown as a doubling of runtime. However, this is insignificant when compared with the costs of calling to separate libraries, which is about 10x slower than this.

Moving the experiment to look at the impact on CMT systems:

ExperimentRuntime
One copy of this executable per core of an UltraSPARC T1 processor 1 minute
Two copies of this executable per core 5 minutes
Four copies of this executable per core (fully loaded system) 8 minutes

Running multiple copies of the application has a significant impact on performance. The performance counters show very few instructions being executed, and much time being lost to ITLB misses. Now this performance is from a system without the shared context changes - so I would expect much better scaling on a system with these improvements (if I find one I'll rerun the experiment).

The conclusion is that care needs to be taken when deciding to split application code into libraries.

Libraries (4) - Runtime costs - Procedure Lookup Table (PLT)

Most applications spend the majority of their time running - rather than starting up. So it's useful to look at the costs of using libraries at runtime.

The most apparent cost of using libraries is that calls to routines now go indirectly to the target routine through the procedure look up table (PLT). Unless the developer explicitly limits the scope of a function, it is exported from the library as a global function, which means that even calls within the library will go through the PLT. Consider the following code snippet:

void func2()
{
 ...
}

void func1()
{
   func2();
}

If this is compiled into an executable the assembly code will look like:

func1()
        11104:  82 10 00 0f  mov        %o7, %g1
        11108:  7f ff ff f8  call       func2   ! 0x110e8
        1110c:  9e 10 00 01  mov        %g1, %o7

However, if this is compiled as part of a library then the code looks like:

func2()
         664:  82 10 00 0f  mov         %o7, %g1
         668:  40 00 40 b9  call        .plt+0x3c       ! 0x1094c
         66c:  9e 10 00 01  mov         %g1, %o7

This is a doubling of the cost of the call.

In C it's possible to limit the scope of the function using the static keyword. Declaring func1 as static will cause the compiler to generate a direct call to that routine. The downside is that the routine will only be visible within the source file that defines it. It is also possible to use other methods to limit the visibility of symbols.

Libraries (3) - Application startup costs

As can be seen from the previous graphs, even a simple application (like ssh) can pull in a fair number of libraries. Whenever a library is pulled in, the linker has to request memory, load the image from disk, and then link in all the routines. This effort takes time - it's basically a large chunk of the start up time of an application. If you profile the start up of an application, you'll probably not see much because much of this time is basically the OS/disk activity of mapping the libraries into memory.

Of course applications also have start up costs associated with initialising data structures etc. However, the biggest risk is that applications will pull in libraries that they don't need, or perhaps do need, but don't need yet. The best work-around for this is to lazy load the libraries. Of course it's fairly easy to write code that either breaks under lazy loading or breaks lazy loading. It's not hard to work around these issues with care, and doing so can have a substantial impact on start up time.

Libraries (2)

Just updated the ld_dot script to include filter libraries. Added a profile for ssh logging into a system, rather than just showing the help message (click the image for the full size version).

Libraries

I was talking to Rod Evans about the diagnostic capabilities available in the runtime linker. These are available through the environment setting LD_DEBUG. The setting LD_DEBUG=files gives diagnostic information about which libraries were loaded by which other libraries. This is rather hard to interpret, and would look better as a graph. It's relatively easy to parse the output from LD_DEBUG into dot format. This script does the parsing. The full stesp to do this for the date command are:

$ LD_DEBUG=files date >ld_date 2>&1
$ ld_dot ld_date
$ dot -Tpng -o date.png dot.dot

The lines in the graph represent which libraries use which other libraries. Solid lines indicate "needed" or hard links, the dotted lines represent lazy loading or dynamic loading (dlopen). The resulting graph looks like:

More complex commands like ssh pull in a larger set of libraries:

It is possible to use this on much larger applications. Unfortunately, the library dependencies tend to get very complex. This is the library map for staroffice.

Tuesday May 19, 2009

Developer's Edge safari rerelease

We've just pushed a new version of The Developer's Edge to safari. The original version didn't show any of the text from each section of the book unless you logged into the safari site. The new version shows the snippet from each section even if you're not a subscriber.

I was pleased to see that the book is featured on the Sun Studio developer portal.

I'm also scheduled to give a second life presentation during JavaOne at 9am PST on the 3rd June.

Thursday May 07, 2009

The perils of strlen

Just been looking at an interesting bit of code. Here's a suitably benign version of it:

#include <string.h>
#include <stdio.h>

void main()
{
  char string[50];
  string[49]='\0';
  int i;
  int j=0;
  for (i=0; i<strlen(string); i++)
  {
   if (string[i]=='1') {j=i;}
  }
  printf("%i\n",j);
}

Compiling this bit of code leads to a loop that looks like:

                        .L900000109:
/* 0x002c         12 */         cmp     %i5,49
/* 0x0030         10 */         add     %i3,1,%i3
/* 0x0034         12 */         move    %icc,%i4,%i2
/* 0x0038         10 */         call    strlen  ! params =  %o0 ! Result =  %o0
/* 0x003c            */         or      %g0,%i1,%o0
/* 0x0040            */         add     %i4,1,%i4
/* 0x0044            */         cmp     %i4,%o0
/* 0x0048            */         bcs,a,pt        %icc,.L900000109
/* 0x004c         12 */         ldsb    [%i3],%i5

The problem being that for each character tested there's also a call to strlen! The reason for this is that the compiler cannot be sure what the call to strlen actually returns. The return value might depend on some external variable that could change as the loop progresses.

There's a lot of functions defined in the libraries that the compiler could optimise, if it was certain that it recognised them. The compiler flag that enables recognition of the "builtin" functions is -xbuiltin (which is included in -fast. This enables the compiler to do things like recognise calls to memcpy or memset and in some instances produce more optimal code. However, it doesn't recognise the call the strlen.

In terms of solving the problem, there are two approaches. The most portable approach is to hold the length of the string in a temporary variable:

  int length=strlen(string);
  for (i=0; i<length; i++)

Another, less portable approach, is to use #pragma no_side_effect. This pragma means the return value of the function depends only on the parameters passed into the function. So the result of calling strlen only depends on the value of the constant string that is passed in. The modified code looks like:

#include <string.h>
#include <stdio.h>
#pragma no_side_effect(strlen)

void main()
{
  char string[50];
  string[49]='\0';
  int i;
  int j=0;
  for (i=0; i<strlen(string); i++)
  {
   if (string[i]=='1') {j=i;}
  }
  printf("%i\n",j);
}

And more importantly, the resulting disassembly looks like:

                        .L900000109:
/* 0x0028          0 */         sub     %i1,49,%o7
/* 0x002c         11 */         add     %i3,1,%i3
/* 0x0030          0 */         sra     %o7,0,%o5
/* 0x0034         13 */         movrz   %o5,%i4,%i2
/* 0x0038         11 */         add     %i4,1,%i4
/* 0x003c            */         cmp     %i4,%i5
/* 0x0040            */         bcs,a,pt        %icc,.L900000109
/* 0x0044         13 */         ldsb    [%i3],%i1

Calendar

Search this blog

About

Solaris Application Programming

Book resources

The Developer's Edge

Book resources

OpenSPARC Internals

Book resources

Recent entries

Custom search

Tag cloud

book cmt communityone compiler cooltools cpu2006 dtrace gcc libraries linker multithreading openmp opensolaris opensparc optimisation optimization parallelisation parallelization performance performanceanalyzer programming secondlife solaris solarisapplicationprogramming sparc spot sunstudio ultrasparc ultrasparct2 x86

Links

Webcasts

Articles

Presentations

Interesting docs

Navigation

Referers

Feeds