Darryl Gove's blog
Stlport4 and multithreaded code
I finally resolved a problem that's been annoying me for about 3 years. Codes that use the Standard Template Library don't scale to multiple threads.
First off, it's probably good to take a look at a code that illustrates the problem:
#include <vector>
int main()
{
#pragma omp parallel for default (__auto)
for (int i=0; i<10000000; i++)
{
std::vector<int> v;
v.push_back(10);
}
return(0);
}
The first comparison is between the serial performance of the Solaris default STL and stlport4 which is provided with the compiler.
$ CC -O t1.cc $ timex a.out real 15.85 user 15.64 sys 0.01 $ CC -O -library=stlport4 t1.cc $ timex a.out real 7.87 user 7.78 sys 0.01
This doesn't tell me anything that I didn't already know. stlport4 is (as far as I know) always faster than the STL provided by Solaris. Hence if you use C++, then you should use stlport4 in preference to the Solaris default. The constraint is that each application (libraries and all) can only use one version of the STL. So if a library that is outside your control uses the Solaris default, then the entire app must use it.
The next thing to investigate is scaling when there are multiple threads:
$ CC -O -xopenmp -library=stlport4 t1.cc $ timex a.out real 7.00 user 6.96 sys 0.01 $ export OMP_NUM_THREADS=2 $ timex a.out real 7.18 user 14.28 sys 0.01
So compiling the code to use OpenMP caused no performance overhead, but running with two threads had the same runtime as a run with a single thread. We can profile the code to see what's happening:
Excl. Incl. Name User CPU User CPU sec. sec. 8.076 8.0761.571 2.272 mutex_lock_impl 1.501 1.971 mutex_unlock 1.051 4.573 std::vector >::_M_insert_overflow(int*,const int&,const std::__true_type&,unsigned,bool) 0.871 8.076 _$d1A5.main 0.871 3.272 std::__node_alloc<true,0>::_M_allocate(unsigned) 0.560 1.721 std::__node_alloc<true,0>::_M_deallocate(void*,unsigned) 0.480 0.480 sigon 0.440 0.440 mutex_trylock_adaptive 0.250 0.470 mutex_unlock_queue
So the lost time is due to mutex locks, if you dig through the source you'll find that node_alloc has a single mutex lock that only allows a single thread to allocate or deallocate memory. Which is why the code shows no scaling.
This test code is basically creating and destroying vector objects, so it hits the allocate and deallocate routines very hard. Which is why I picked it. Real codes are much less likely to have this problem at quite the same level. It is not unusual to want to create and destroy objects within a loop. One workaround is to hoist the objects out of the hot loops. This works for some instances, but is not a great solution, as even in the best case it makes the code more complex.
The solution I ended up using was to build the Apache STL. It turned out to be a relatively straightforward experience. The compile line is a bit cryptic, I wanted the optimised, multithreaded, 64-bit version and this translates to:
$ gmake BUILDTYPE=12D CONFIG=sunpro.config
Once I had it built, I could install it with:
$ gmake BUILDTYPE=12D CONFIG=sunpro.config install PREFIX=`pwd`/install
The steps necessary to use a different STL than the ones supplied with the compiler are documented here. The compile line for the test code was:
CC -m64 -O -xopenmp -library=no%Cstd \ -I ./stdcxx-4.2.1/install/include/ \ -L ./stdcxx-4.2.1/install/lib/ \ -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc
So we can build the test and look at the scaling between one and two threads:
$ export OMP_NUM_THREADS=1 $ timex a.out real 18.98 user 18.93 sys 0.01 $ export OMP_NUM_THREADS=2 $ timex a.out real 18.42 user 36.73 sys 0.01
Which is not, to be honest, a great start, the runtime is slower, and the code still fails to scale. However, the profile is different:
Excl. Incl. Name User CPU User CPU sec. sec. 21.145 21.1452.572 16.411 std::vector<int,std::allocator<int> >::_C_insert_n(int*const&,unsigned long,const int&) 2.402 4.293 mutex_unlock 2.342 3.613 mutex_lock_impl 1.961 10.697 std::vector<int,std::allocator<int> >::_C_realloc(unsigned long) 1.681 5.634 free 1.341 1.891 mutex_unlock_queue 1.271 1.271 _free_unlocked 0.991 0.991 sigon
So we still see a lot of mutex activity. Looking at where the mutex activity comes from provides an interesting insight:
(er_print) csingle mutex_lock Attr. Excl. Incl. Name User CPU User CPU User CPU sec. sec. sec. 0.170 1.681 5.634 free 0.020 0.690 4.623 malloc 0.190 0.190 0.190 *mutex_lock
So the mutex activity is coming from malloc and free. Which are parts of the default Solaris memory allocator. The default memory allocator is thread safe, but does not give good performance for MT codes. There are two usual alternatives, mtmalloc and libumem. I've usually found mtmalloc to be good enough for me:
CC -m64 -O -xopenmp -library=no%Cstd \ -I ./stdcxx-4.2.1/install/include/ \ -L ./stdcxx-4.2.1/install/lib/ \ -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc -lmtmalloc
Then we can try the timing tests again:
$ export OMP_NUM_THREADS=1 $ timex a.out real 18.02 user 17.98 sys 0.01 $ export OMP_NUM_THREADS=2 real 13.76 user 27.05 sys 0.01 $ export OMP_NUM_THREADS=4 $ timex a.out real 6.92 user 26.97 sys 0.02 $ export OMP_NUM_THREADS=8 $ timex a.out real 3.51 user 26.99 sys 0.02
So the code is now scaling to multiple threads, which was the original problem. We have lost some serial performance, which is perhaps a concern, but that performance loss may be only for a particular code path, and depending on the usage of the library, we might even see gains in some of the algorithms. So depending on the situation, this might be a good enough solution. [FWIW, I also tested with libumem and did not see a significant difference in performance between the two libraries.]
Posted at 04:05PM Jun 12, 2009 by Darryl Gove in Sun | Comments[5]



Darryl, thank you very much for sharing this and especially for sharing your thought process that brought you to such conclusions!
Posted by Kristofer Spinka on June 12, 2009 at 05:54 PM PDT #
there used to be an arc case to make Apache's STL available in Solaris by default, what happened with that?
Posted by nacho on June 13, 2009 at 08:21 AM PDT #
As far as I know it's still going to happen. Which is why I was interested to know how apache handled multiple threads :)
http://arc.opensolaris.org/caselog/PSARC/2008/549/proposal_v2.txt
Posted by Darryl Gove on June 13, 2009 at 10:42 AM PDT #
Er, as far as I know, the example code you are showing is completely unsafe: for 2 threads to be able to add stuff to a vector in parallel requires some special code in the vector implementation of the standard library (typically a lock) than neither Cstd nor stlport4 has.
However, if each thread uses a different vector, the analysis is valid.
It would be interesting to know why the apache library is so much slower than stlport4, especially since it is going to replace it in solaris...
Posted by Marc on June 15, 2009 at 06:22 PM PDT #
Oooooooooops!
Please ignore my previous comment... I didn't read it properly.
Posted by Marc on June 15, 2009 at 06:26 PM PDT #