Darryl Gove's blog
Fishing with cputrack
I'm a great fan of the hardware performance counters that you find on most processors. Often you can look at the profile and instantly identify what the issue is. Sometimes though, it is not obvious, and that's where the performance counters can really help out.
I was looking at one such issue last week, the performance of the application was showing some variation, and it wasn't immediately obvious what the issue was. The usual suspects in these cases are:
- Excessive system time
- Process migration
- Memory placement
- Page size
- etc.
Unfortunately, none of these seemed to explain the issue. So I hacked together the following script cputrackall which ran the test code under cputrack for all the possible performance counters. Dumped the output into a spreadsheet, and compared the fast and slow runs of the app. This is something of a "fishing trip" script, just gathering as much data as possible in the hope that something leaps out, but sometimes that's exactly what's needed. I regularly get to sit in front of a new chip before the tools like ripc have been ported, and in those situations the easiest thing to do is to look for hardware counter events that might explain the runtime performance. In this particular instance, it helped me to confirm my suspicion that there was a difference in branch misprediction rates that was causing the issue.
Posted at 08:00AM Oct 19, 2009 by Darryl Gove in Sun | Comments[0]
Surprisingly slow compile time
I had an e-mail which told the sorry tale of a new system which tool longer to build a project than an older system, of theoretically similar performance. The system showed low utilisation when doing the build indicating that it was probably spending a lot of time waiting for something.
The first thing to look at was a profile of the build process using `collect -F on`, which produced the interesting result that the build was taking just over 2 minutes of user time, a few seconds of system time, and thousands of seconds of "Other Wait" time.
"Other wait" often means waiting for network, or disk, or just sleeping. The other thing to realise about profiling multiple processes is that all the times are cumulative, so all the processes that are waiting accumulate "other wait" time. Hence it will be a rather large number if multiple processes are doing it. So this confirmed and half explained the performance issue. The build was slow because it was waiting for something.
Sorting the profile by "other wait" indicated two places that the wait was coming from, one was waitpid - meaning that the time was due to a process waiting for another process, well we knew that! The other was a door call. Tracing up the call stack eventually lead into the C and C++ compiler, which were calling gethostbyname. The routine doing the calling was "generate_prefix" which is the routine responsible for generating a random prefix for function names - the IP address of the machine was used as one of the inputs for the generation of a prefix.
The performance problem was due to gethostbyname timing out, common reasons for this are missed configurations in the /etc/hosts and /etc/nsswitch.conf files. In this example adding the host name to the hosts file cured the problem.
Posted at 08:00AM Oct 13, 2009 by Darryl Gove in Sun | Comments[6]
An aliasing example
The compiler flag -xalias_level allows a user to assert the degree of aliasing that exists within the source code of an application. If the assertion is not true, then the behaviour of the application is undefined. It is definitely worth looking at the examples given in the user's guide, although they can be a bit "dry" to read. So here's an example which illustrates what can happen:
struct stuff{
int value1;
int value2;
};
void fill(struct stuff *x)
{
x->value1=0; // Clear value1
int * r=(int*)x; // Take the address of the structure
int var = *r; // Take the value from value1
x->value1=var; // And store it back into value1
}
The above code will clear value1 and then load and store this value back. So for correctly working code value1 should exit the function containing zero. However, if -xalias_level=basic is used to build the application, then this tells the compiler that no two pointers to variables of different types will alias. So pointer to an int will never alias with an int. So the read from *r does not alias with x.value1.
So with this knowledge the compiler is free to remove the original store to x.value1, because it has been told that nothing will alias with it, and there is a later store to the same address. The later store will overwrite the initial store.
Fortunately it the lint utility can pick up these issues:
$ lint -Xalias_level=basic alias.c (9) warning: cast of nonscalar pointer to scalar pointer is valid only at -xalias_level=any
For the example above the compiler does the correct thing and eliminates all the instructions but the store to value1. For more complex examples there is no guarantee that the code will be correct if it violates the -xalias_level setting.
Posted at 08:00AM Oct 12, 2009 by Darryl Gove in Sun | Comments[1]
Webcast: Improving the performance of parallel codes using the Performance Analyzer
Earlier in the summer I recorded a slidecast on using the Performance Analyzer on parallel codes, it's just come out on the HPC portal.
Posted at 11:47AM Oct 09, 2009 by Darryl Gove in Sun | Comments[0]
Querying locality groups
Locality groups are a mechanism that provides Solaris information about how the physical hardware is wired together. A locality group is a bunch of threads that share the same CPU or memory access characteristics. For example a locality group might be all the threads on a single chip.
The command to display the locality group information is lgrpinfo, but this is not on Solaris 10. Here's an example of the output from that command:
% lgrpinfo
lgroup 0 (root):
Children: 1 2
CPUs: 0-7
Memory: installed 16G, allocated 3.8G, free 12G
Lgroup resources: 1 2 (CPU); 1 2 (memory)
Latency: 90
lgroup 1 (leaf):
Children: none, Parent: 0
CPUs: 0-3
Memory: installed 8.0G, allocated 1.8G, free 6.2G
Lgroup resources: 1 (CPU); 1 (memory)
Load: 0.263
Latency: 54
lgroup 2 (leaf):
Children: none, Parent: 0
CPUs: 4-7
Memory: installed 8.0G, allocated 2.0G, free 6.0G
Lgroup resources: 2 (CPU); 2 (memory)
Load: 0
Latency: 54
It is possible to access this programmatically:
#include <sys/lgrp_user.h>
#include <stdio.h>
#include <stdlib.h>
void explore(lgrp_cookie_t cookie,lgrp_id_t node,int level)
{
printf("Lgroup level %i\n",level);
int ncpus=lgrp_cpus(cookie,node,0,0,LGRP_CONTENT_DIRECT);
processorid_t * cpus=(processorid_t*)calloc(ncpus,sizeof(processorid_t));
lgrp_cpus(cookie,node,cpus,ncpus,LGRP_CONTENT_DIRECT);
printf("CPUs: ");
for(int i=0; i<ncpus; i++)
{
printf("%i ",cpus[i]);
}
printf("\n");
int nchildren=lgrp_children(cookie, node, 0,0);
lgrp_id_t* children=(lgrp_id_t*)calloc(nchildren,sizeof(lgrp_id_t));
lgrp_children(cookie, node,children,nchildren);
for (int i=0; i<nchildren; i++)
{
explore(cookie,children[i],level+1);
}
free(children);
}
void main()
{
lgrp_cookie_t cookie =lgrp_init(LGRP_VIEW_CALLER);
lgrp_id_t node = lgrp_root(cookie);
explore(cookie,node,0);
lgrp_fini(cookie);
}
Which provides the following output:
% cc local.c -llgrp % ./a.out Lgroup level 0 CPUs: Lgroup level 1 CPUs: 0 1 2 3 Lgroup level 1 CPUs: 4 5 6 7
Posted at 12:09PM Sep 30, 2009 by Darryl Gove in Sun |
Updated compiler flags article
Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.
Posted at 12:55PM Sep 28, 2009 by Darryl Gove in Sun |
Haskell (GHC) on UltraSPARC T2
Ben Lippmeier gave an excellent presentation at the recent Haskell conference in Edinburgh on his work on porting the Glasgow Haskell Compiler (GHC) back to SPARC. A video of the talk is available.
Update:Link to slides
Posted at 09:37PM Sep 21, 2009 by Darryl Gove in Sun |
Profiling scripts
If you try to use the Sun Studio Performance Analyzer on something that is not an executable, you'll end up with an error message:
$ collect kstat Target `kstat' is not a valid ELF executable
The most reliable workaround for this that I've discovered is as follows. First of all make up shell script that executes the command passed into it:
$ more shell.sh #!/bin/sh $@
Then run the collect command as:
$ collect -F on /bin/sh shell.sh <script> <params>
The -F on is required so that collect follows forked processes, otherwise collect will just profile the top /bin/sh which will do minimal work before forking off the actual command.
When loading the resulting experiment into the Analyzer you have to load all the descendant processes. You can do this by going to the filter dialog box and selecting all the processes, or you can take the easier route of placing en_desc on into your .er.rc file in your home directory (this will tell the analyzer to always load the descendant processes, which might make loading experiments slower, but will guarantee that you actually load all the data, and not just the top-level code).
One other thing to note is that each new process can contribute wall and wait time, so the wall time shown in the analyzer can be a multiple of the actual wall time. To see this in action do:
$ collect -F on /bin/sh shell.sh shell.sh shell.sh shell.sh kstat
The wall time on this will be a multiple of the actual runtime because each shell script contributes wall time while it waits for the kstat command to complete.
Posted at 11:22AM Sep 21, 2009 by Darryl Gove in Sun |
Performance tuning webcast
I wrote one of the TechDays 2008-2009 sessions on application performance tuning. Unfortunately I never actually got to give it to alive audience, but I did get this version recorded. Thanks to the HPC Watercooler for pointing it out to me.
Posted at 12:33PM Sep 08, 2009 by Darryl Gove in Sun |
Profiling a rate
Sometimes it's the rate of doing something which is the target that needs to be improved through optimisation. ie increase the widgets per second of some application. I've just been looking at a code that estimated performance by counting the number of computations completed in a known constant length of time. The code was showing a performance regression, and I wanted to find out what changed. The analysis is kind of counter intuitive, so I thought I'd share an example with you.
Here's an example code that does a computation for a fixed length of time, in this case about 30 seconds:
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
double f1(int i)
{
double t=0;
while (i-->0) {t+=t;}
return t;
}
double f2(int i)
{
double t=0;
while (i-->0) {t+=t;}
return t;
}
void main(int argc,char**args)
{
struct timeval now;
long startsecs;
long count=0;
int vcount;
if (argc!=2){ printf("Needs a number to be passed in\n"); exit(0);}
vcount=atoi(args[1]);
gettimeofday(&now,0);
startsecs=now.tv_sec;
do
{
f1(100);
f2(vcount);
count++;
gettimeofday(&now,0);
} while (now.tv_sec<startsecs+30);
printf("Iterations %i duration %i rate %f\n",count, now.tv_sec-startsecs, 1.0*count/(now.tv_sec-startsecs));
}
The code takes a command line parameter to indicate the number of iterations to do in function f2, function f1 always does 100 iterations.
If I compile and run this code under the performance analyzer with 50 and 70 as the commandline parameters I get the following profile:
| Description | 50 Iterations | 70 Iterations |
| Total time | 26.6s | 25.87s |
| f1 | 11.89s | 10.66s |
| gettimeofday | 9.9s | 8.76s |
| f2 | 4.53s | 6.09s |
| Main | 0.28s | 0.37s |
| Total iterations | 942,684 | 841,921 |
We can make the following observation when we go from 70 down to 50 for parameter passed to f2, we see a 12% gain in the total rate. This is to be expected as we are reducing the total number of iterations of the pair of loops in f1 and f2 will reduce from 170 down to 150, which is the same ~12% gain.
Where it gets counter intuitive is that for the run which achieves the higher rate, the time spent in the routines f1 and gettimeofday increases - by the same 12%. This is counter intuitive because increased time in a routine normally indicates that the routine is the one to be investigated, but for a 'rate' situation the opposite is true. These routines are being well behaved. The way to think about it is that each unit of work needs a smidgeon of time in both of these routines, if the number of units of work increases, then the absolute amount of time in these two routines needs to increase linearly with the increase in rate.
However, the time in routine f2 decreases as the rate increases. This is the routine which has been "improved" to get the better rate. The other thing to note is that the time went from ~6s to ~4.5s, but the rate went from 841k to 941k, so the time per unit work dropped further than that - this makes comparing the profiles of the two runs more tricky.
Note that Amdahl's law would still tell us that the routines that need to be optimised are the ones where the time is spent - so in one sense nothing has changed. But my particular scenario today is figuring out what has changed in the executable when compiled in two different ways that leads to the performance gain. In this context, I now know the routine, and I can dig into the assembly code to figure out the why.
Posted at 04:37PM Sep 02, 2009 by Darryl Gove in Sun |
Maps in the STL
I was looking at some code with a colleague and we observed a bunch of time in some code which used the std::map to set up mappings between strings. The source code looked rather like the following:
#include <map>
#include <string>
using namespace std;
int func(map<string,string>&mymap, string &s1, string &s2)
{
mymap.insert(pair<string,string>(s1,s2));
return 0;
}
When compiled with CC -O -c -library=stlport4 map.cc this expands to a horrendous set of calls, here's the first few:
$ er_src -dis func map.o|grep call
[?] 188d: call std::basic_string...::basic_string
[?] 189f: call std::basic_string...::basic_string
[?] 18b2: call std::basic_string...::basic_string
[?] 18c2: call std::basic_string...::basic_string
[?] 18d8: call std::_Rb_tree...::insert_unique
[?] 18f8: call std::__node_alloc...::_M_deallocate
[?] 190c: call std::_STLP_alloc_proxy...::~_STLP_alloc_proxy
...
What's happening is that the act of making a pair object is causing copies to be made of the two strings that are passed into the pair constructor. Then the pair object is passed into the insert method of std::map and this results in two more copies of the strings being made. There's a bunch of other stuff going on, and the resulting code is a mess.
There's an alternative way of assigning the mapping:
#include <map>
#include <string>
using namespace std;
int func(map<string,string>&mymap, string &s1, string &s2)
{
mymap[s1]=s2;
return 0;
}
When compiled the resulting code looks a lot neater:
$ er_src -dis func map.o|grep call
[?] 28e6: call std::map...::operator[]
[?] 2903: call std::basic_string...::_M_assign_dispatch
Of course a neater chunk of code is nice, but the question is whether the code for ::operator[] contains the same ugly mess. Rather than disassembling to find out, it's simpler to time the two versions and see which does better. A simple test harness looks like:
int main()
{
map<string,string>mymap;
string s1,s2;
long long i;
s1="123456789";
s2="987654321";
for (i=0; i<100000000; i++)
{
func(mymap,s1,s2);
}
}
It's a less than ideal harness since it uses constant strings, and one version of the code might end up bailing early because of this. The performance of the two codes is quite surprising:
real 6.79 user 6.77 sys 0.00 real 1:03.53 user 1:03.26 sys 0.01
So the version that creates the pair object is about 10x slower!
Posted at 09:57AM Aug 28, 2009 by Darryl Gove in Sun | Comments[4]
Second life talk audio available.
The slides, transcript, and audio from my Second Life talk about The Developer's Edge are now available.
Posted at 09:30AM Aug 28, 2009 by Darryl Gove in Sun |
Sun Studio 12 Update 1 blog entry live on AMD site
Just had a blog entry about Sun Studio 12 Update 1 posted to the AMD forums site.
Posted at 02:05PM Jul 17, 2009 by Darryl Gove in Sun | Comments[5]
Lesser known Solaris features
Joerg Moellenkamp has put together a very nice downloadable pdf book on lesser known Solaris features. Well worth skimming through if you're interested in some of the features that Solaris has that have not captured the headlines. The book walks through the features and has examples of how to use them.
Posted at 12:00PM Jul 13, 2009 by Darryl Gove in Sun |
BBC computer history slidecast
Interesting slidecast with audio from the BBC on the history of computers in the valley. Runs about 5 minutes, so there's not much detail, but some interesting images.
Posted at 01:29PM Jul 02, 2009 by Darryl Gove in Sun |
Introduction to parallel programming
My colleague, Ruud van der Pas, has recorded a series of seven webcasts on parallel programming which will be released on the HPC portal. Ruud is an expert on parallel programming, and one of the authors of the book "Using OpenMP".
Posted at 09:17AM Jul 01, 2009 by Darryl Gove in Sun | Comments[1]
Cost of pointer chasing in libmtmalloc
I thought I'd quickly write up some comments about how to improve the issue with mtmalloc.
The "obvious" solution is to allocate larger chunks of memory. I was using 512 byte objects, so 72KB can allocate 144 of these. That's quite a few, I'm not sure that in general I'd want to allocate say 1024 objects of 512 bytes just in case I needed them. So increasing the chunk size might be a useful workaround, but I think it's rather a sledgehammer solution.
Of course, if I end up allocating 60KB objects, then I hope the code does the right thing and reserves space for several of them. If I need one, I'm quite likely to need a second, and I don't want to be doing pointer chasing on a linked list of single object blocks - that would be very painful. So hopefully there is some kind of scaling of the requestsize for larger objects.
However, the fundamental problem is not actually the number of objects in each allocated chunk, that's what reveals the problem. No, the real problem is the pointer chasing loop to locate a chunk with free space in it. There are several upfront problems with this. Fortunately the two data structures that are used in the pointer chasing (mt-next and mt-nfree) are on the same cache line - although given their offsets, I'm not convinced that this was a design decision. However, the pointer chasing from block to block guarantees that the next pair of values that need to be inspected are not in the cache. Given the fact that we're jumping at least 72KB, there's a good chance that the mapping isn't even in the TLB.
We could argue at this point that there should be a list of free memory so a single pointer step could get us to the next free block, but this approach almost certainly opens up a lot of issues around ensuring thread safety (ie you probably need mutexes) and getting good performance (ie mutexes cost quite a few cycles). So I don't think that's the solution.
I suspect the easiest thing that could be done would be to take the two key fields and place them into an array. So you would have an array of pointers to the chunks interleaved with the counts of the number of free elements in each of the chunks. The advantage of this is that we could identify a chunk with free space without having to stride through memory to do it, we can just scan the array. We'd rarely need multiple TLB entries, and we might even be able to fit the details of multiple chunks on the same cacheline - reducing cache misses substantially (there is an issue of false sharing here, so it may not be entirely feasible), and the other gain would be that we'd be streaming through memory so the hardware might be able to prefetch the next cacheline, or we just just add prefetches if that were necessary.
The programming challenge with this approach would be in the situations where we need to increase the size of the array to allocate more chunks. This should happen rarely, but could be tricky to do safely. But not impossible.
Posted at 12:23PM Jun 26, 2009 by Darryl Gove in Sun | Comments[3]
mtmalloc vs umem
A little while back I was looking at the performance of the STL with multithreaded code. I got the opportunity to try this on a particular code, and rather shockingly to me performance was absolutely terrible! I'd linked the code with mtmalloc, and the hottest function in the profile was malloc_internal. I've put together a fake code, and here's the profile from that:
Excl. Incl. Name User CPU User CPU sec. sec. 266.446 266.446258.301 263.084 malloc_internal 1.661 1.951 free 1.401 1.401 mutex_lock_impl 0.961 0.961 mutex_unlock
We can dig into the disassembly of malloc_internal to find out what's going on:
73.201 73.201 [?] 1724: cmp %o5, 0
1.981 1.981 [?] 1728: bne 0x1740
0.320 0.320 [?] 172c: nop
0.490 0.490 [?] 1730: ld [%i2 + 44], %i2
1.191 1.191 [?] 1734: cmp %i2, 0
0.901 0.901 [?] 1738: bne,a 0x1724
## 176.443 176.443 [?] 173c: ld [%i2 + 32], %o5
It's not hard to visualise what the original C code would look like:
while ((ptr->value==0) && (ptr->next!=0)) { ptr=ptr->next; }
Fortunately the source code is searchable and the above loop looks sufficiently similar to line 1032 of mtmalloc.c:
1032 while (thiscache != NULL && thiscache->mt_nfree == 0) 1033 thiscache = thiscache->mt_next;
So what's going on?
Reading through the source of malloc_internal, it appears that mtmalloc builds up a linked list of chunks of memory for each size of memory request. The size of the chunks of memory is 8KB*requestsize, and requestsize is 9. So basically each chunk of memory is 72KB in size. So when a memory request comes in, malloc_internal looks at the current chunk, and if memory can be allocated from there, then it returns memory from that chunk. If not it goes to the next chunk and so on. This works very well when memory is allocated at once, but as memory gets freed, these chunks of memory become like Swiss-cheese, with lots of holes in them. If a lot of memory of a particular size is requested, then freed, there can be a large number of these chunks in the linked list, and scanning through the chunks to find one with free space can be time consuming. And that is the condition that my test code exercises.
It's probably worth revealing the test code, at this point, so that you can see what it does:
#include <stdlib.h>
typedef struct s
{
struct s * next;
char padding[508];
} S;
void main()
{
struct s * head;
struct s * keep;
struct s * current;
head=0;
keep=0;
for (int j=0; j<100; j++)
{
for (int i=0; i<100000; i++)
{
current=(struct s*)malloc(sizeof(struct s));
if (i&1)
{
current->next=head;
head=current;
}
else
{
current->next=keep;
keep=current;
}
}
current = head;
while (current!=0)
{
struct s * tmp = current;
current = current -> next;
free(current);
}
head = 0;
}
}
The code maintains two lists, one that it places memory onto for a long duration, and another list that holds memory for only a short duration. The memory footprint of the code keeps increasing, so more chunks are added to the lists, and holding on to the memory for a long period of time ensures that the chunks end up with lots of gaps in them. The runtime of this code is as follows:
% cc -O mtm.c -lmtmalloc % timex a.out real 4:44.18 user 4:33.80 sys 8.70
However there is an API to libmtmalloc that allows us to adjust the size of the chunks. The following changes increase the requestsize from 9 to 20:
#include... mallocctl(MTCHUNKSIZE,20); ...
The performance reduces from nearly 5 minutes to about 1 minute:
% cc -O mtm.c -lmtmalloc % timex a.out real 1:09.10 user 1:01.09 sys 6.53
If we increase the requestsize to 30, performance improves still further:
% cc -O mtm.c -lmtmalloc % timex a.out real 38.36 user 31.41 sys 4.96
Of course, libmtmalloc is not the only memory allocator that is optimised for multi-threaded allocation. We also have libumem, compiling the original code to use this results in the following performance:
% cc -O mtm.c -lumem % timex a.out real 31.06 user 18.10 sys 10.95
So this is probably a good indication that you will get better performance from libumem if your application allocates and deallocates lots of memory. If you are using libmtmalloc in this role, then you may need to tune the requestsize to a greater number than the default - although this will increase the memory footprint of your application.
Posted at 11:15AM Jun 26, 2009 by Darryl Gove in Sun |
Sun Studio 12 Update 1
Sun Studio 12 Update 1 went live yesterday. It's still a free download, and it's got a raft of new features. Many people will have been using the express releases, so they will already be familiar with the improvements.
It's been about two years since Sun Studio 12 came out, and the most obvious change in that time is the prevalence of multicore processors. I figured the easiest way to discern this would be to look at the submissions of SPEC CPU2006 results in that time period. The following chart shows the cummulative number of SPEC CPU2006 Integer speed results over that time broken down by the number of threads that the chip was capable of supporting.
Ok, the first surprising thing about the chart is that there's very few single threaded chips. There were a few results when the suite was launched back in 2006, but nothing much since. What is more apparent is the number of dual-thread chips, that was where the majority of the market was. There were also a number of quad-thread chips at that point. If we fast-forward to the situation today, we can see that the number of dual-thread chips has pretty much leveled off, the bulk of the chips are capable of supporting four threads. But you can see the start of a ramp of chips that are capable of supporting 6 or 8 simultaneous threads.
The relevance of this chart to Sun Studio is that Sun Studio has always been a tool that supports the development of multi-threaded applications. Every release of the product improves on the support in the previous release. Sun Studio 12 Update 1 includes improvements in the compiler's ability to automatically parallelise codes - afterall the easiest way to develop parallel applications is if the compiler can do it for you; improvements to the support of parallelisation specifications like OpenMP, this release includes support for the latest OpenMP 3.0 specification; and improvements in the tools and their ability to provide the developer meaningful feedback about parallel code, for example the ability of the Performance Analyzer to profile MPI code.
Footnote SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Benchmark results stated above reflect results posted on www.spec.org as of 15 June 2009.
Posted at 10:01AM Jun 23, 2009 by Darryl Gove in Sun |
Glasgow Haskell Compiler successfully ported to OpenSPARC
Ben Lippmeier has been working on the native port of the Glasgow Haskell Compiler (GHC) to SPARC. The port was completed a few months back, and since then he's been using an UltraSPARC T2 system to look at thread activity and scaling as the number of threads is increased. The full details of the project are on his GHC on SPARC blog. The latest SPARC/Solaris binary can be downloaded here, although the full set of changes probably won't be available for a couple of months.
Posted at 04:46PM Jun 15, 2009 by Darryl Gove in Sun | Comments[1]
Audio for JavaOne interview available

A couple of weeks back I recorded an interview where I discussed The Developer's Edge. I've just found the audio up at BlogTalkRadio, it's about 15 minutes in duration.
Posted at 11:13PM Jun 14, 2009 by Darryl Gove in Sun |
Stlport4 and multithreaded code
I finally resolved a problem that's been annoying me for about 3 years. Codes that use the Standard Template Library don't scale to multiple threads.
First off, it's probably good to take a look at a code that illustrates the problem:
#include <vector>
int main()
{
#pragma omp parallel for default (__auto)
for (int i=0; i<10000000; i++)
{
std::vector<int> v;
v.push_back(10);
}
return(0);
}
The first comparison is between the serial performance of the Solaris default STL and stlport4 which is provided with the compiler.
$ CC -O t1.cc $ timex a.out real 15.85 user 15.64 sys 0.01 $ CC -O -library=stlport4 t1.cc $ timex a.out real 7.87 user 7.78 sys 0.01
This doesn't tell me anything that I didn't already know. stlport4 is (as far as I know) always faster than the STL provided by Solaris. Hence if you use C++, then you should use stlport4 in preference to the Solaris default. The constraint is that each application (libraries and all) can only use one version of the STL. So if a library that is outside your control uses the Solaris default, then the entire app must use it.
The next thing to investigate is scaling when there are multiple threads:
$ CC -O -xopenmp -library=stlport4 t1.cc $ timex a.out real 7.00 user 6.96 sys 0.01 $ export OMP_NUM_THREADS=2 $ timex a.out real 7.18 user 14.28 sys 0.01
So compiling the code to use OpenMP caused no performance overhead, but running with two threads had the same runtime as a run with a single thread. We can profile the code to see what's happening:
Excl. Incl. Name User CPU User CPU sec. sec. 8.076 8.0761.571 2.272 mutex_lock_impl 1.501 1.971 mutex_unlock 1.051 4.573 std::vector >::_M_insert_overflow(int*,const int&,const std::__true_type&,unsigned,bool) 0.871 8.076 _$d1A5.main 0.871 3.272 std::__node_alloc<true,0>::_M_allocate(unsigned) 0.560 1.721 std::__node_alloc<true,0>::_M_deallocate(void*,unsigned) 0.480 0.480 sigon 0.440 0.440 mutex_trylock_adaptive 0.250 0.470 mutex_unlock_queue
So the lost time is due to mutex locks, if you dig through the source you'll find that node_alloc has a single mutex lock that only allows a single thread to allocate or deallocate memory. Which is why the code shows no scaling.
This test code is basically creating and destroying vector objects, so it hits the allocate and deallocate routines very hard. Which is why I picked it. Real codes are much less likely to have this problem at quite the same level. It is not unusual to want to create and destroy objects within a loop. One workaround is to hoist the objects out of the hot loops. This works for some instances, but is not a great solution, as even in the best case it makes the code more complex.
The solution I ended up using was to build the Apache STL. It turned out to be a relatively straightforward experience. The compile line is a bit cryptic, I wanted the optimised, multithreaded, 64-bit version and this translates to:
$ gmake BUILDTYPE=12D CONFIG=sunpro.config
Once I had it built, I could install it with:
$ gmake BUILDTYPE=12D CONFIG=sunpro.config install PREFIX=`pwd`/install
The steps necessary to use a different STL than the ones supplied with the compiler are documented here. The compile line for the test code was:
CC -m64 -O -xopenmp -library=no%Cstd \ -I ./stdcxx-4.2.1/install/include/ \ -L ./stdcxx-4.2.1/install/lib/ \ -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc
So we can build the test and look at the scaling between one and two threads:
$ export OMP_NUM_THREADS=1 $ timex a.out real 18.98 user 18.93 sys 0.01 $ export OMP_NUM_THREADS=2 $ timex a.out real 18.42 user 36.73 sys 0.01
Which is not, to be honest, a great start, the runtime is slower, and the code still fails to scale. However, the profile is different:
Excl. Incl. Name User CPU User CPU sec. sec. 21.145 21.1452.572 16.411 std::vector<int,std::allocator<int> >::_C_insert_n(int*const&,unsigned long,const int&) 2.402 4.293 mutex_unlock 2.342 3.613 mutex_lock_impl 1.961 10.697 std::vector<int,std::allocator<int> >::_C_realloc(unsigned long) 1.681 5.634 free 1.341 1.891 mutex_unlock_queue 1.271 1.271 _free_unlocked 0.991 0.991 sigon
So we still see a lot of mutex activity. Looking at where the mutex activity comes from provides an interesting insight:
(er_print) csingle mutex_lock Attr. Excl. Incl. Name User CPU User CPU User CPU sec. sec. sec. 0.170 1.681 5.634 free 0.020 0.690 4.623 malloc 0.190 0.190 0.190 *mutex_lock
So the mutex activity is coming from malloc and free. Which are parts of the default Solaris memory allocator. The default memory allocator is thread safe, but does not give good performance for MT codes. There are two usual alternatives, mtmalloc and libumem. I've usually found mtmalloc to be good enough for me:
CC -m64 -O -xopenmp -library=no%Cstd \ -I ./stdcxx-4.2.1/install/include/ \ -L ./stdcxx-4.2.1/install/lib/ \ -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc -lmtmalloc
Then we can try the timing tests again:
$ export OMP_NUM_THREADS=1 $ timex a.out real 18.02 user 17.98 sys 0.01 $ export OMP_NUM_THREADS=2 real 13.76 user 27.05 sys 0.01 $ export OMP_NUM_THREADS=4 $ timex a.out real 6.92 user 26.97 sys 0.02 $ export OMP_NUM_THREADS=8 $ timex a.out real 3.51 user 26.99 sys 0.02
So the code is now scaling to multiple threads, which was the original problem. We have lost some serial performance, which is perhaps a concern, but that performance loss may be only for a particular code path, and depending on the usage of the library, we might even see gains in some of the algorithms. So depending on the situation, this might be a good enough solution. [FWIW, I also tested with libumem and did not see a significant difference in performance between the two libraries.]
Posted at 04:05PM Jun 12, 2009 by Darryl Gove in Sun | Comments[5]
Utilising a CMT system
I got asked about how to utilise a CMT system, it's probably not an uncommon question, so I'll post my (somewhat brief) answer here:
The CMT processor appears as a system with many CPUs. These virtual CPUs can be provisioned in the same way as you would with any multiprocessor system:
- The OS will naturally handle positioning multiple active threads so as to get the optimal performance.
- If you wish to manually tune this then you can use Solaris tools like processor binding (pbind, or processor_bind) to statically allocate a particular thread or process to a particular core. You can use processor sets (psrset) to restrict a set of processes to a particular set of processors (or to exclude particular processes from using these processors).
- The machine can be divided into multiple virtual machines either through Solaris Zones, where all zones run the same version of the Solaris operating system. Or through logical domains where multiple different operating systems can be installed onto the same machine.
The optimal configuration will depend on the problem to be solved.
I've actually heard someone argue that multicore processors require a redesign of applications. Um, no. Applications will work just fine. However, multicore processors do give you opportunities to throw more threads at a problem - which can be very useful.
Posted at 12:04AM Jun 11, 2009 by Darryl Gove in Sun |
Secure programming paper
Excellent paper from Joep Vesseur on secure programming.
Posted at 11:45AM Jun 03, 2009 by Darryl Gove in Sun |
Graph of libraries used by firefox and thunderbird
Just gathered library usage charts for firefox and thunderbird. The full charts look like:
Firefox
Thunderbird
Neither of which is particularly telling. The reduced charts look much better:
Firefox
Thunderbird
Posted at 03:00AM May 22, 2009 by Darryl Gove in Sun |
Drawing libraries - neater eye-candy!
Chris Quenelle posted an interesting comment to my post which showed the dependencies for StarOffice. As you can see from the mass of lines below, adding more dependency information, using the latest version of ld_dot, into the StarOffice library map did not make the graphic any clearer!
It turns out that the reduction operation that Chris was alluding to is implemented by tred (the "transitive reduction filter", what great technobabble!). This filtering reduces the graph down to something which even looks ok when shrunk down to fit this page:
This clarifies the relationships between the libraries. More importantly it also looks pretty.
Posted at 09:48PM May 20, 2009 by Darryl Gove in Sun | Comments[1]
Libraries (5) - Runtime costs - TLBs
The next consideration when using libraries is that each library will get mapped in on a new virtual page of memory; as shown in this pmap output:
% pmap 60500 60500: a.out 00010000 8K r-x-- /libraries/a.out 00020000 8K rwx-- /libraries/a.out FEEC0000 24K rwx-- [ anon ] FEED0000 8K r-x-- /libraries/lib1_26.so FEEE0000 8K rwx-- /libraries/lib1_26.so FEEF0000 8K r-x-- /libraries/lib1_25.so FEF00000 8K rwx-- /libraries/lib1_25.so FEF10000 8K r-x-- /libraries/lib1_24.so FEF20000 8K rwx-- /libraries/lib1_24.so FEF30000 8K r-x-- /libraries/lib1_23.so FEF40000 8K rwx-- /libraries/lib1_23.so FEF50000 8K rwx-- [ anon ] FEF60000 8K r-x-- /libraries/lib1_22.so FEF70000 8K rwx-- /libraries/lib1_22.so FEF80000 8K r-x-- /libraries/lib1_21.so FEF90000 8K rwx-- /libraries/lib1_21.so FEFA0000 8K r-x-- /libraries/lib1_20.so FEFB0000 8K rwx-- /libraries/lib1_20.so FEFC0000 8K r-x-- /libraries/lib1_19.so ....
There are finite number of TLB entries on a chip. If each library takes an entry, and the code jumps around between libraries, then a single application can utilise quite a few TLB entries. Take a CMT system where there are multiple applications (or copies of the same application) running, and there becomes a lot of pressure on the TLB.
One of the enhancements in Solaris to support CMT processors is Shared Context. When multiple applications map the same library at the same address, then they can share a single context to map that library. This can lead to a significant reduction in the TLB pressure. Shared context only works for libraries that are loaded into the same memory locations in different contexts, so it can be defeated if the libraries are loaded in different orders or any other mechanisms that scramble the locations in memory.
If each library is mapped into a different TLB entry, then every call into a new library is a new ITLB entry, together with a jump through the PLT, together with the normal register spill/fill overhead. This can become quite a significant chunk of overhead.
To round this off, lets look at some figures from an artificial code run on an UltraSPARC T1 system that was hanging around here.
| Experiment | Runtime |
|---|---|
| Application that jumps between 26 different routines a->b->c...->z. All the routines are included in the same executable. | 3s |
| Application that jumps between 26 different routines a->...z. The routines are provided as a library, and calls are therefore routed through the PLT. | 6s |
Application that jumps between 26 different routines a->...z. The routines are
provided as a library, but all are declared static except for the
initial routine that is called by main. Therefore the calls within the library
avoid the PLT. |
3s |
| Application that jumps between 26 different routines a->...z. Each routine is defined in its own library, so calls to the routine have to go through the PLT, and also require a new ITLB entry to be used. | 60s |
Since the routines in this test code don't actually do anything, the overhead of calling through the PLT is clearly shown as a doubling of runtime. However, this is insignificant when compared with the costs of calling to separate libraries, which is about 10x slower than this.
Moving the experiment to look at the impact on CMT systems:
| Experiment | Runtime |
|---|---|
| One copy of this executable per core of an UltraSPARC T1 processor | 1 minute |
| Two copies of this executable per core | 5 minutes |
| Four copies of this executable per core (fully loaded system) | 8 minutes |
Running multiple copies of the application has a significant impact on performance. The performance counters show very few instructions being executed, and much time being lost to ITLB misses. Now this performance is from a system without the shared context changes - so I would expect much better scaling on a system with these improvements (if I find one I'll rerun the experiment).
The conclusion is that care needs to be taken when deciding to split application code into libraries.
Posted at 06:00PM May 20, 2009 by Darryl Gove in Sun |
Libraries (4) - Runtime costs - Procedure Lookup Table (PLT)
Most applications spend the majority of their time running - rather than starting up. So it's useful to look at the costs of using libraries at runtime.
The most apparent cost of using libraries is that calls to routines now go indirectly to the target routine through the procedure look up table (PLT). Unless the developer explicitly limits the scope of a function, it is exported from the library as a global function, which means that even calls within the library will go through the PLT. Consider the following code snippet:
void func2()
{
...
}
void func1()
{
func2();
}
If this is compiled into an executable the assembly code will look like:
func1()
11104: 82 10 00 0f mov %o7, %g1
11108: 7f ff ff f8 call func2 ! 0x110e8
1110c: 9e 10 00 01 mov %g1, %o7
However, if this is compiled as part of a library then the code looks like:
func2()
664: 82 10 00 0f mov %o7, %g1
668: 40 00 40 b9 call .plt+0x3c ! 0x1094c
66c: 9e 10 00 01 mov %g1, %o7
This is a doubling of the cost of the call.
In C it's possible to limit the scope of the function using the static keyword. Declaring func1 as static will cause the compiler to generate a direct call to that routine. The downside is that the routine will only be visible within the source file that defines it. It is also possible to use other methods to limit the visibility of symbols.
Posted at 03:00PM May 20, 2009 by Darryl Gove in Sun | Comments[2]
Libraries (3) - Application startup costs
As can be seen from the previous graphs, even a simple application (like ssh) can pull in a fair number of libraries. Whenever a library is pulled in, the linker has to request memory, load the image from disk, and then link in all the routines. This effort takes time - it's basically a large chunk of the start up time of an application. If you profile the start up of an application, you'll probably not see much because much of this time is basically the OS/disk activity of mapping the libraries into memory.
Of course applications also have start up costs associated with initialising data structures etc. However, the biggest risk is that applications will pull in libraries that they don't need, or perhaps do need, but don't need yet. The best work-around for this is to lazy load the libraries. Of course it's fairly easy to write code that either breaks under lazy loading or breaks lazy loading. It's not hard to work around these issues with care, and doing so can have a substantial impact on start up time.
Posted at 02:01PM May 20, 2009 by Darryl Gove in Sun |
Libraries (2)
Just updated the ld_dot script to include filter libraries. Added a profile for ssh logging into a system, rather than just showing the help message (click the image for the full size version).
Posted at 11:52AM May 20, 2009 by Darryl Gove in Sun |


