Darryl Gove's blog
Webcast: Improving the performance of parallel codes using the Performance Analyzer
Earlier in the summer I recorded a slidecast on using the Performance Analyzer on parallel codes, it's just come out on the HPC portal.
Posted at 11:47AM Oct 09, 2009 by Darryl Gove in Sun | Comments[0]
Updated compiler flags article
Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.
Posted at 12:55PM Sep 28, 2009 by Darryl Gove in Sun |
Stlport4 and multithreaded code
I finally resolved a problem that's been annoying me for about 3 years. Codes that use the Standard Template Library don't scale to multiple threads.
First off, it's probably good to take a look at a code that illustrates the problem:
#include <vector>
int main()
{
#pragma omp parallel for default (__auto)
for (int i=0; i<10000000; i++)
{
std::vector<int> v;
v.push_back(10);
}
return(0);
}
The first comparison is between the serial performance of the Solaris default STL and stlport4 which is provided with the compiler.
$ CC -O t1.cc $ timex a.out real 15.85 user 15.64 sys 0.01 $ CC -O -library=stlport4 t1.cc $ timex a.out real 7.87 user 7.78 sys 0.01
This doesn't tell me anything that I didn't already know. stlport4 is (as far as I know) always faster than the STL provided by Solaris. Hence if you use C++, then you should use stlport4 in preference to the Solaris default. The constraint is that each application (libraries and all) can only use one version of the STL. So if a library that is outside your control uses the Solaris default, then the entire app must use it.
The next thing to investigate is scaling when there are multiple threads:
$ CC -O -xopenmp -library=stlport4 t1.cc $ timex a.out real 7.00 user 6.96 sys 0.01 $ export OMP_NUM_THREADS=2 $ timex a.out real 7.18 user 14.28 sys 0.01
So compiling the code to use OpenMP caused no performance overhead, but running with two threads had the same runtime as a run with a single thread. We can profile the code to see what's happening:
Excl. Incl. Name User CPU User CPU sec. sec. 8.076 8.0761.571 2.272 mutex_lock_impl 1.501 1.971 mutex_unlock 1.051 4.573 std::vector >::_M_insert_overflow(int*,const int&,const std::__true_type&,unsigned,bool) 0.871 8.076 _$d1A5.main 0.871 3.272 std::__node_alloc<true,0>::_M_allocate(unsigned) 0.560 1.721 std::__node_alloc<true,0>::_M_deallocate(void*,unsigned) 0.480 0.480 sigon 0.440 0.440 mutex_trylock_adaptive 0.250 0.470 mutex_unlock_queue
So the lost time is due to mutex locks, if you dig through the source you'll find that node_alloc has a single mutex lock that only allows a single thread to allocate or deallocate memory. Which is why the code shows no scaling.
This test code is basically creating and destroying vector objects, so it hits the allocate and deallocate routines very hard. Which is why I picked it. Real codes are much less likely to have this problem at quite the same level. It is not unusual to want to create and destroy objects within a loop. One workaround is to hoist the objects out of the hot loops. This works for some instances, but is not a great solution, as even in the best case it makes the code more complex.
The solution I ended up using was to build the Apache STL. It turned out to be a relatively straightforward experience. The compile line is a bit cryptic, I wanted the optimised, multithreaded, 64-bit version and this translates to:
$ gmake BUILDTYPE=12D CONFIG=sunpro.config
Once I had it built, I could install it with:
$ gmake BUILDTYPE=12D CONFIG=sunpro.config install PREFIX=`pwd`/install
The steps necessary to use a different STL than the ones supplied with the compiler are documented here. The compile line for the test code was:
CC -m64 -O -xopenmp -library=no%Cstd \ -I ./stdcxx-4.2.1/install/include/ \ -L ./stdcxx-4.2.1/install/lib/ \ -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc
So we can build the test and look at the scaling between one and two threads:
$ export OMP_NUM_THREADS=1 $ timex a.out real 18.98 user 18.93 sys 0.01 $ export OMP_NUM_THREADS=2 $ timex a.out real 18.42 user 36.73 sys 0.01
Which is not, to be honest, a great start, the runtime is slower, and the code still fails to scale. However, the profile is different:
Excl. Incl. Name User CPU User CPU sec. sec. 21.145 21.1452.572 16.411 std::vector<int,std::allocator<int> >::_C_insert_n(int*const&,unsigned long,const int&) 2.402 4.293 mutex_unlock 2.342 3.613 mutex_lock_impl 1.961 10.697 std::vector<int,std::allocator<int> >::_C_realloc(unsigned long) 1.681 5.634 free 1.341 1.891 mutex_unlock_queue 1.271 1.271 _free_unlocked 0.991 0.991 sigon
So we still see a lot of mutex activity. Looking at where the mutex activity comes from provides an interesting insight:
(er_print) csingle mutex_lock Attr. Excl. Incl. Name User CPU User CPU User CPU sec. sec. sec. 0.170 1.681 5.634 free 0.020 0.690 4.623 malloc 0.190 0.190 0.190 *mutex_lock
So the mutex activity is coming from malloc and free. Which are parts of the default Solaris memory allocator. The default memory allocator is thread safe, but does not give good performance for MT codes. There are two usual alternatives, mtmalloc and libumem. I've usually found mtmalloc to be good enough for me:
CC -m64 -O -xopenmp -library=no%Cstd \ -I ./stdcxx-4.2.1/install/include/ \ -L ./stdcxx-4.2.1/install/lib/ \ -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc -lmtmalloc
Then we can try the timing tests again:
$ export OMP_NUM_THREADS=1 $ timex a.out real 18.02 user 17.98 sys 0.01 $ export OMP_NUM_THREADS=2 real 13.76 user 27.05 sys 0.01 $ export OMP_NUM_THREADS=4 $ timex a.out real 6.92 user 26.97 sys 0.02 $ export OMP_NUM_THREADS=8 $ timex a.out real 3.51 user 26.99 sys 0.02
So the code is now scaling to multiple threads, which was the original problem. We have lost some serial performance, which is perhaps a concern, but that performance loss may be only for a particular code path, and depending on the usage of the library, we might even see gains in some of the algorithms. So depending on the situation, this might be a good enough solution. [FWIW, I also tested with libumem and did not see a significant difference in performance between the two libraries.]
Posted at 04:05PM Jun 12, 2009 by Darryl Gove in Sun | Comments[5]
Code complete: burn this chapter
That's a somewhat inflammatory title for this post, but continuing from my previous post on the book Code Complete, I think that the chapters (25 & 26) on performance really do not contain good advice or good examples.
To make this more concrete, consider the example on pg 593 where Steve McConnell compares the performance of these two code fragments:
| Original | Unrolled |
|---|---|
for i = 1 to 10 a[ i ] = i end for |
a[ 1 ] = 1 a[ 2 ] = 2 a[ 3 ] = 3 a[ 4 ] = 4 a[ 5 ] = 5 a[ 6 ] = 6 a[ 7 ] = 7 a[ 8 ] = 8 a[ 9 ] = 9 a[ 10 ] = 10 |
Steve finds that Visual Basic and Java run the unrolled version of the loop faster.
There's a couple of examples that talk about incorrect access ordering for arrays. Here's some C code that illustrates the problem:
| Slow code | Fast code |
|---|---|
for (column=0; column < max_column; column++)
{
for (row=0; row < max_row; row++)
{
data[row][column]=stuff();
}
}
|
for (row=0; row < max_row; row++)
{
for (column=0; column < max_column; column++)
{
data[row][column]=stuff();
}
}
|
On page 599 it is suggested that the slow code is inefficient because it might cause paging to disk, on page 623 it is suggested that the higher trip count loop should be inside to amortise the initialisation overhead for each execution of the inner loop. Neither of these explanations is right. As I'm sure most of you recognise the code is slow because of cache misses incurred when accessing non-adjacent memory locations. There is a cost to initialisation of the inner loop, but nothing significant, and yes, you could get paging to disk - but only if you are running out of memory (and if you're running out of memory, you're hosed anyway!). You're more likely to get TLB misses (and perhaps that is what Mr McConnell intended to say.
I consider the above issues to be quite serious, but, unfortunately, I'm not terribly happy with the rest of the material. Hence my recommendation to ignore (or burn
these chapters. I'll go through my other reservations now.
Lack of details. The timing information is presented with no additional information (pg 623) "C++ Straight Time = 4.75 Code-Tuned Time = 3.19 Time Savings = 33%". What was the compiler? What compiler flags were given? What was the test harness?
The book presents it as somehow that "C++" runs this code slowly, but in reality it's more likely to be a test of the effectiveness of the compiler, and the ability of the user to use the compiler. I'd be surprised if any compiler with minimal optimisation enabled did not do the loop interchange operation necessary to get good performance. Which leads to my next observation:
Don't compilers do this? I think the book falls into one of the common "optimisation book" traps, where lots of ink is spent describing and naming the various optimisations. This gives the false impression that it is necessary for the expert programmer to be able to identify these optimisations and apply them to their program. Most compilers will apply all these optimisations - afterall that is what compilers are supposed to do - take the grudgery out of producing optimal code. It's great for page count to enumerate all the possible ways that code might be restructured for performance, but for most situations the restructuring will lead to code that has the same performance.
Profiling. It's not there! To me the most critical thing that a developer can do to optimise their program is to profile it. Understanding where the time is being spent is the necessary first step towards improving the performance of the application. This omission is alarming. The chapter already encourages users to do manual optimisations where there might be no gains (at the cost of time spent doing restructuring that could be better spent writing new code, and the risk that the resulting code is less maintainable), but without profiling the application, the users are basically encouraged to do this over the entire source code, not just the lines that actually matter.
Assembly language. Yes, I love assembly language, there's nothing I enjoy better than working with it (no comment), but I wouldn't encourage people to drop into it for performance reasons, unless they had utterly exhausted every other option. The book includes an example using Delphi where the assembly language version ran faster than the high-level version. My guess is that the compilers had some trouble with aliasing, and hence had more loads than were necessary - a check of the assembly code that the compilers generated would indicate that, and then it's pretty straight forward to write assembly-language-like high level code that the compiler can produce optimal code for. [Note, that I view reading and analysing the code at the assembly language level to be very useful, but I wouldn't recommend leaping into writing assembly language without a good reason.]
So what would I recommend:
- Profile. Always profile. This will indicate where the time is being spent, and what sort of gains you should expect from optimising parts of the application.
- Know the tools. Make sure that you know what compiler flags are available, and that you are requesting the right kind of things from the compiler. All too often there are stories about how A is faster than B, which are due to people not knowing how to use the tools.
- Identify those parts of the code where the time is spent, and examine them in detail to determine if it's a short coming of the compiler, the compiler flags, or an ambiguity in the source code, that causes time to be spent there. Many performance problems can be solved with by adding a new flag, or perhaps a minor tweak to the source code.
- Only when you have exhausted all other options, and you know that you can get a significant performance gain should you start wildly hacking at the source code, or recoding parts in assembly language.
The other thing to recommend is a read of Bart Smaalder's Performance Anti-patterns.
Posted at 08:00AM Jun 11, 2009 by Darryl Gove in Personal | Comments[6]
Libraries (5) - Runtime costs - TLBs
The next consideration when using libraries is that each library will get mapped in on a new virtual page of memory; as shown in this pmap output:
% pmap 60500 60500: a.out 00010000 8K r-x-- /libraries/a.out 00020000 8K rwx-- /libraries/a.out FEEC0000 24K rwx-- [ anon ] FEED0000 8K r-x-- /libraries/lib1_26.so FEEE0000 8K rwx-- /libraries/lib1_26.so FEEF0000 8K r-x-- /libraries/lib1_25.so FEF00000 8K rwx-- /libraries/lib1_25.so FEF10000 8K r-x-- /libraries/lib1_24.so FEF20000 8K rwx-- /libraries/lib1_24.so FEF30000 8K r-x-- /libraries/lib1_23.so FEF40000 8K rwx-- /libraries/lib1_23.so FEF50000 8K rwx-- [ anon ] FEF60000 8K r-x-- /libraries/lib1_22.so FEF70000 8K rwx-- /libraries/lib1_22.so FEF80000 8K r-x-- /libraries/lib1_21.so FEF90000 8K rwx-- /libraries/lib1_21.so FEFA0000 8K r-x-- /libraries/lib1_20.so FEFB0000 8K rwx-- /libraries/lib1_20.so FEFC0000 8K r-x-- /libraries/lib1_19.so ....
There are finite number of TLB entries on a chip. If each library takes an entry, and the code jumps around between libraries, then a single application can utilise quite a few TLB entries. Take a CMT system where there are multiple applications (or copies of the same application) running, and there becomes a lot of pressure on the TLB.
One of the enhancements in Solaris to support CMT processors is Shared Context. When multiple applications map the same library at the same address, then they can share a single context to map that library. This can lead to a significant reduction in the TLB pressure. Shared context only works for libraries that are loaded into the same memory locations in different contexts, so it can be defeated if the libraries are loaded in different orders or any other mechanisms that scramble the locations in memory.
If each library is mapped into a different TLB entry, then every call into a new library is a new ITLB entry, together with a jump through the PLT, together with the normal register spill/fill overhead. This can become quite a significant chunk of overhead.
To round this off, lets look at some figures from an artificial code run on an UltraSPARC T1 system that was hanging around here.
| Experiment | Runtime |
|---|---|
| Application that jumps between 26 different routines a->b->c...->z. All the routines are included in the same executable. | 3s |
| Application that jumps between 26 different routines a->...z. The routines are provided as a library, and calls are therefore routed through the PLT. | 6s |
Application that jumps between 26 different routines a->...z. The routines are
provided as a library, but all are declared static except for the
initial routine that is called by main. Therefore the calls within the library
avoid the PLT. |
3s |
| Application that jumps between 26 different routines a->...z. Each routine is defined in its own library, so calls to the routine have to go through the PLT, and also require a new ITLB entry to be used. | 60s |
Since the routines in this test code don't actually do anything, the overhead of calling through the PLT is clearly shown as a doubling of runtime. However, this is insignificant when compared with the costs of calling to separate libraries, which is about 10x slower than this.
Moving the experiment to look at the impact on CMT systems:
| Experiment | Runtime |
|---|---|
| One copy of this executable per core of an UltraSPARC T1 processor | 1 minute |
| Two copies of this executable per core | 5 minutes |
| Four copies of this executable per core (fully loaded system) | 8 minutes |
Running multiple copies of the application has a significant impact on performance. The performance counters show very few instructions being executed, and much time being lost to ITLB misses. Now this performance is from a system without the shared context changes - so I would expect much better scaling on a system with these improvements (if I find one I'll rerun the experiment).
The conclusion is that care needs to be taken when deciding to split application code into libraries.
Posted at 06:00PM May 20, 2009 by Darryl Gove in Sun |
Libraries (4) - Runtime costs - Procedure Lookup Table (PLT)
Most applications spend the majority of their time running - rather than starting up. So it's useful to look at the costs of using libraries at runtime.
The most apparent cost of using libraries is that calls to routines now go indirectly to the target routine through the procedure look up table (PLT). Unless the developer explicitly limits the scope of a function, it is exported from the library as a global function, which means that even calls within the library will go through the PLT. Consider the following code snippet:
void func2()
{
...
}
void func1()
{
func2();
}
If this is compiled into an executable the assembly code will look like:
func1()
11104: 82 10 00 0f mov %o7, %g1
11108: 7f ff ff f8 call func2 ! 0x110e8
1110c: 9e 10 00 01 mov %g1, %o7
However, if this is compiled as part of a library then the code looks like:
func2()
664: 82 10 00 0f mov %o7, %g1
668: 40 00 40 b9 call .plt+0x3c ! 0x1094c
66c: 9e 10 00 01 mov %g1, %o7
This is a doubling of the cost of the call.
In C it's possible to limit the scope of the function using the static keyword. Declaring func1 as static will cause the compiler to generate a direct call to that routine. The downside is that the routine will only be visible within the source file that defines it. It is also possible to use other methods to limit the visibility of symbols.
Posted at 03:00PM May 20, 2009 by Darryl Gove in Sun | Comments[2]
Libraries (3) - Application startup costs
As can be seen from the previous graphs, even a simple application (like ssh) can pull in a fair number of libraries. Whenever a library is pulled in, the linker has to request memory, load the image from disk, and then link in all the routines. This effort takes time - it's basically a large chunk of the start up time of an application. If you profile the start up of an application, you'll probably not see much because much of this time is basically the OS/disk activity of mapping the libraries into memory.
Of course applications also have start up costs associated with initialising data structures etc. However, the biggest risk is that applications will pull in libraries that they don't need, or perhaps do need, but don't need yet. The best work-around for this is to lazy load the libraries. Of course it's fairly easy to write code that either breaks under lazy loading or breaks lazy loading. It's not hard to work around these issues with care, and doing so can have a substantial impact on start up time.
Posted at 02:01PM May 20, 2009 by Darryl Gove in Sun |
Developer's Edge safari rerelease
We've just pushed a new version of The Developer's Edge to safari. The original version didn't show any of the text from each section of the book unless you logged into the safari site. The new version shows the snippet from each section even if you're not a subscriber.
I was pleased to see that the book is featured on the Sun Studio developer portal.
I'm also scheduled to give a second life presentation during JavaOne at 9am PST on the 3rd June.
Posted at 02:01PM May 19, 2009 by Darryl Gove in Sun |
The perils of strlen
Just been looking at an interesting bit of code. Here's a suitably benign version of it:
#include <string.h>
#include <stdio.h>
void main()
{
char string[50];
string[49]='\0';
int i;
int j=0;
for (i=0; i<strlen(string); i++)
{
if (string[i]=='1') {j=i;}
}
printf("%i\n",j);
}
Compiling this bit of code leads to a loop that looks like:
.L900000109:
/* 0x002c 12 */ cmp %i5,49
/* 0x0030 10 */ add %i3,1,%i3
/* 0x0034 12 */ move %icc,%i4,%i2
/* 0x0038 10 */ call strlen ! params = %o0 ! Result = %o0
/* 0x003c */ or %g0,%i1,%o0
/* 0x0040 */ add %i4,1,%i4
/* 0x0044 */ cmp %i4,%o0
/* 0x0048 */ bcs,a,pt %icc,.L900000109
/* 0x004c 12 */ ldsb [%i3],%i5
The problem being that for each character tested there's also a call to strlen! The reason for this is that the compiler cannot be sure what the call to strlen actually returns. The return value might depend on some external variable that could change as the loop progresses.
There's a lot of functions defined in the libraries that the compiler could optimise, if it was certain that it recognised them. The compiler flag that enables recognition of the "builtin" functions is -xbuiltin (which is included in -fast. This enables the compiler to do things like recognise calls to memcpy or memset and in some instances produce more optimal code. However, it doesn't recognise the call the strlen.
In terms of solving the problem, there are two approaches. The most portable approach is to hold the length of the string in a temporary variable:
int length=strlen(string); for (i=0; i<length; i++)
Another, less portable approach, is to use #pragma no_side_effect. This pragma means the return value of the function depends only on the parameters passed into the function. So the result of calling strlen only depends on the value of the constant string that is passed in. The modified code looks like:
#include <string.h>
#include <stdio.h>
#pragma no_side_effect(strlen)
void main()
{
char string[50];
string[49]='\0';
int i;
int j=0;
for (i=0; i<strlen(string); i++)
{
if (string[i]=='1') {j=i;}
}
printf("%i\n",j);
}
And more importantly, the resulting disassembly looks like:
.L900000109:
/* 0x0028 0 */ sub %i1,49,%o7
/* 0x002c 11 */ add %i3,1,%i3
/* 0x0030 0 */ sra %o7,0,%o5
/* 0x0034 13 */ movrz %o5,%i4,%i2
/* 0x0038 11 */ add %i4,1,%i4
/* 0x003c */ cmp %i4,%i5
/* 0x0040 */ bcs,a,pt %icc,.L900000109
/* 0x0044 13 */ ldsb [%i3],%i1
Posted at 02:15PM May 07, 2009 by Darryl Gove in Sun | Comments[2]
A look inside the Sun compiler team
At the end of last year I was asked to appear in a short video for AMD and give an elevator pitch for performance analysis. The video is now up on the site.
They also took videos of some of the Solaris folks, as well as a few others from the compiler team.
Yuan Lin has a short talk about parallelisation. My manager, Fu-Hwa Wang, talks about the work of our organisation.
Posted at 10:14AM Jan 28, 2009 by Darryl Gove in music |


