Darryl Gove's blog
Viewing thread activity in the Performance Analyzer
The Sun Studio Performance Analyzer is one of the two tools that I use most frequently (the other is spot - which is now in SS12U1!). It's a very powerful tool, but a lot of that power is not immediately visible to users. I'm going to discuss a couple of ways I've used the analyzer to view parallel applications.
The most common first step for looking at the performance of parallel apps is to use the timeline. However, the timeline can look a bit cluttered with all of the call stack data. Often you are really just interested in the leaf node. Fortunately this can be configured from the data presentation dialog box. To get the view I want I'm only showing the top leaf in the call stack:

This results in a display of the samples in each routine, by default this can look very colourful. You can make it easier on the eye by selecting the colours used to display the graphic. In the following graphic I've picked green for one parallel routine that I'm interested in, and blue for another, then used a yellow to colour all the time waiting for more work to be assigned:

The graphic shows that the work is not evenly spread across all threads. The first few threads spend more time in the hot routines than the later threads. We can see this much more clearly using the 'threads' view of the data. To get this view you need to go back to the data presentation dialog and select the threads tab, it's also useful to select the 'cpus' tab at the same time.

The threads tab shows the activity of each thread for the currently displayed metrics. This is useful to see if one thread is doing more work than another. The cpus tab shows time that the app spends on each CPU in the machine - this can indicate whether a particular CPU is over subscribed. The thread activity looks like:

This confirms what we thought earlier that some of the threads are much more active than other threads. The top chart shows the user time, which indicates that all the threads spent the same amount of time running 'stuff', the middle chart shows the time that each thread spent running useful work, the lower chart shows the time spent in overhead. The exercise now is to try and improve the distribution of work across the threads......
Posted at 12:45PM Nov 25, 2009 by Darryl Gove in music | Comments[0]
When threads go bad
When a thread hits an error in a multithreaded application, that error will take out the entire app. Here's some example code:
#include <pthread.h>
#include <stdio.h>
void *work(void * param)
{
int*a;
a=(int*)(1024*1024);
(*a)++;
printf("Child thread exit\n");
}
void main()
{
pthread_t thread;
pthread_create(&thread,0,work,0);
pthread_join(thread,0);
printf("Main thread exit\n");
}
Compiling and running this produces:
% cc -O -mt pthread_error.c % ./a.out Segmentation Fault (core dumped)
Not entirely unexpected, that. The app died without the main thread having the chance to clear up resources etc. This is probably not ideal. However, it is possible to write a signal handler to capture the segmentation fault, and terminate the child thread without causing the main thread to terminate. It's important to realise that there's probably little chance of actually recovering from the unspecified error, but this at least might give the app the chance to report the symptoms of its demise.
#include <pthread.h>
#include <stdio.h>
#include <signal.h>
void *work(void * param)
{
int*a;
a=(int*)(1024*1024);
(*a)++;
printf("Child thread exit\n");
}
void hsignal(int i)
{
printf("Signal %i\n",i);
pthread_exit(0);
}
void main()
{
pthread_t thread;
sigset(SIGSEGV,hsignal);
pthread_create(&thread,0,work,0);
pthread_join(thread,0);
printf("Main thread exit\n");
}
Which produces the output:
% cc -O -mt pthread_error.c % ./a.out Signal 11 Main thread exit
Posted at 10:02AM Nov 23, 2009 by Darryl Gove in Sun | Comments[3]
Programming and electronics for kids
I've been continuing to look into programming and electronics for kids. I wrote some of the programming options up a while back. Scratch is still a firm favourite.
On the list of things to try we have brickcc to try out with the lego NXT. Here's an old comparison of the various approaches to programming the NXT brick.
The other on-going project is a microcontroller - the STM Primer. Includes a screen, tilt sensor, and a single button.
On the electronics side (which is what lead me to microcontrollers in the first place), this is a nice article on kits for kids, and a second earlier one. There's also a bunch of kits available at makershed (the most surprising one is an EX-150, which is a couple of kits up from what I had. I think I had the EX-60). Here's a list of some microcontroller starter kits, and a different list of microcontroller like options.
Posted at 07:00AM Oct 27, 2009 by Darryl Gove in Personal |
Things to see in Dorset
We spent a bit of time in Dorset on our trip home. Interesting things that we missed seeing giant sea monsters and unexploded bombs.
Posted at 11:51PM Oct 26, 2009 by Darryl Gove in Personal |
GMT
One of the things I didn't manage to do on our recent vacation in the UK was to visit Greenwich Observatory. This is the "home" of Greenwich Mean Time. The BBC has a nice article on the history of GMT. Perhaps next time....
Posted at 08:00AM Oct 21, 2009 by Darryl Gove in Personal |
Fishing with cputrack
I'm a great fan of the hardware performance counters that you find on most processors. Often you can look at the profile and instantly identify what the issue is. Sometimes though, it is not obvious, and that's where the performance counters can really help out.
I was looking at one such issue last week, the performance of the application was showing some variation, and it wasn't immediately obvious what the issue was. The usual suspects in these cases are:
- Excessive system time
- Process migration
- Memory placement
- Page size
- etc.
Unfortunately, none of these seemed to explain the issue. So I hacked together the following script cputrackall which ran the test code under cputrack for all the possible performance counters. Dumped the output into a spreadsheet, and compared the fast and slow runs of the app. This is something of a "fishing trip" script, just gathering as much data as possible in the hope that something leaps out, but sometimes that's exactly what's needed. I regularly get to sit in front of a new chip before the tools like ripc have been ported, and in those situations the easiest thing to do is to look for hardware counter events that might explain the runtime performance. In this particular instance, it helped me to confirm my suspicion that there was a difference in branch misprediction rates that was causing the issue.
Posted at 08:00AM Oct 19, 2009 by Darryl Gove in Sun |
Surprisingly slow compile time
I had an e-mail which told the sorry tale of a new system which tool longer to build a project than an older system, of theoretically similar performance. The system showed low utilisation when doing the build indicating that it was probably spending a lot of time waiting for something.
The first thing to look at was a profile of the build process using `collect -F on`, which produced the interesting result that the build was taking just over 2 minutes of user time, a few seconds of system time, and thousands of seconds of "Other Wait" time.
"Other wait" often means waiting for network, or disk, or just sleeping. The other thing to realise about profiling multiple processes is that all the times are cumulative, so all the processes that are waiting accumulate "other wait" time. Hence it will be a rather large number if multiple processes are doing it. So this confirmed and half explained the performance issue. The build was slow because it was waiting for something.
Sorting the profile by "other wait" indicated two places that the wait was coming from, one was waitpid - meaning that the time was due to a process waiting for another process, well we knew that! The other was a door call. Tracing up the call stack eventually lead into the C and C++ compiler, which were calling gethostbyname. The routine doing the calling was "generate_prefix" which is the routine responsible for generating a random prefix for function names - the IP address of the machine was used as one of the inputs for the generation of a prefix.
The performance problem was due to gethostbyname timing out, common reasons for this are missed configurations in the /etc/hosts and /etc/nsswitch.conf files. In this example adding the host name to the hosts file cured the problem.
Posted at 08:00AM Oct 13, 2009 by Darryl Gove in Sun | Comments[6]
An aliasing example
The compiler flag -xalias_level allows a user to assert the degree of aliasing that exists within the source code of an application. If the assertion is not true, then the behaviour of the application is undefined. It is definitely worth looking at the examples given in the user's guide, although they can be a bit "dry" to read. So here's an example which illustrates what can happen:
struct stuff{
int value1;
int value2;
};
void fill(struct stuff *x)
{
x->value1=0; // Clear value1
int * r=(int*)x; // Take the address of the structure
int var = *r; // Take the value from value1
x->value1=var; // And store it back into value1
}
The above code will clear value1 and then load and store this value back. So for correctly working code value1 should exit the function containing zero. However, if -xalias_level=basic is used to build the application, then this tells the compiler that no two pointers to variables of different types will alias. So pointer to an int will never alias with an int. So the read from *r does not alias with x.value1.
So with this knowledge the compiler is free to remove the original store to x.value1, because it has been told that nothing will alias with it, and there is a later store to the same address. The later store will overwrite the initial store.
Fortunately it the lint utility can pick up these issues:
$ lint -Xalias_level=basic alias.c (9) warning: cast of nonscalar pointer to scalar pointer is valid only at -xalias_level=any
For the example above the compiler does the correct thing and eliminates all the instructions but the store to value1. For more complex examples there is no guarantee that the code will be correct if it violates the -xalias_level setting.
Posted at 08:00AM Oct 12, 2009 by Darryl Gove in Sun | Comments[1]
Webcast: Improving the performance of parallel codes using the Performance Analyzer
Earlier in the summer I recorded a slidecast on using the Performance Analyzer on parallel codes, it's just come out on the HPC portal.
Posted at 11:47AM Oct 09, 2009 by Darryl Gove in Sun |
Querying locality groups
Locality groups are a mechanism that provides Solaris information about how the physical hardware is wired together. A locality group is a bunch of threads that share the same CPU or memory access characteristics. For example a locality group might be all the threads on a single chip.
The command to display the locality group information is lgrpinfo, but this is not on Solaris 10. Here's an example of the output from that command:
% lgrpinfo
lgroup 0 (root):
Children: 1 2
CPUs: 0-7
Memory: installed 16G, allocated 3.8G, free 12G
Lgroup resources: 1 2 (CPU); 1 2 (memory)
Latency: 90
lgroup 1 (leaf):
Children: none, Parent: 0
CPUs: 0-3
Memory: installed 8.0G, allocated 1.8G, free 6.2G
Lgroup resources: 1 (CPU); 1 (memory)
Load: 0.263
Latency: 54
lgroup 2 (leaf):
Children: none, Parent: 0
CPUs: 4-7
Memory: installed 8.0G, allocated 2.0G, free 6.0G
Lgroup resources: 2 (CPU); 2 (memory)
Load: 0
Latency: 54
It is possible to access this programmatically:
#include <sys/lgrp_user.h>
#include <stdio.h>
#include <stdlib.h>
void explore(lgrp_cookie_t cookie,lgrp_id_t node,int level)
{
printf("Lgroup level %i\n",level);
int ncpus=lgrp_cpus(cookie,node,0,0,LGRP_CONTENT_DIRECT);
processorid_t * cpus=(processorid_t*)calloc(ncpus,sizeof(processorid_t));
lgrp_cpus(cookie,node,cpus,ncpus,LGRP_CONTENT_DIRECT);
printf("CPUs: ");
for(int i=0; i<ncpus; i++)
{
printf("%i ",cpus[i]);
}
printf("\n");
int nchildren=lgrp_children(cookie, node, 0,0);
lgrp_id_t* children=(lgrp_id_t*)calloc(nchildren,sizeof(lgrp_id_t));
lgrp_children(cookie, node,children,nchildren);
for (int i=0; i<nchildren; i++)
{
explore(cookie,children[i],level+1);
}
free(children);
}
void main()
{
lgrp_cookie_t cookie =lgrp_init(LGRP_VIEW_CALLER);
lgrp_id_t node = lgrp_root(cookie);
explore(cookie,node,0);
lgrp_fini(cookie);
}
Which provides the following output:
% cc local.c -llgrp % ./a.out Lgroup level 0 CPUs: Lgroup level 1 CPUs: 0 1 2 3 Lgroup level 1 CPUs: 4 5 6 7
Posted at 12:09PM Sep 30, 2009 by Darryl Gove in Sun |
Updated compiler flags article
Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.
Posted at 12:55PM Sep 28, 2009 by Darryl Gove in Sun |
Haskell (GHC) on UltraSPARC T2
Ben Lippmeier gave an excellent presentation at the recent Haskell conference in Edinburgh on his work on porting the Glasgow Haskell Compiler (GHC) back to SPARC. A video of the talk is available.
Update:Link to slides
Posted at 09:37PM Sep 21, 2009 by Darryl Gove in Sun |
Profiling scripts
If you try to use the Sun Studio Performance Analyzer on something that is not an executable, you'll end up with an error message:
$ collect kstat Target `kstat' is not a valid ELF executable
The most reliable workaround for this that I've discovered is as follows. First of all make up shell script that executes the command passed into it:
$ more shell.sh #!/bin/sh $@
Then run the collect command as:
$ collect -F on /bin/sh shell.sh <script> <params>
The -F on is required so that collect follows forked processes, otherwise collect will just profile the top /bin/sh which will do minimal work before forking off the actual command.
When loading the resulting experiment into the Analyzer you have to load all the descendant processes. You can do this by going to the filter dialog box and selecting all the processes, or you can take the easier route of placing en_desc on into your .er.rc file in your home directory (this will tell the analyzer to always load the descendant processes, which might make loading experiments slower, but will guarantee that you actually load all the data, and not just the top-level code).
One other thing to note is that each new process can contribute wall and wait time, so the wall time shown in the analyzer can be a multiple of the actual wall time. To see this in action do:
$ collect -F on /bin/sh shell.sh shell.sh shell.sh shell.sh kstat
The wall time on this will be a multiple of the actual runtime because each shell script contributes wall time while it waits for the kstat command to complete.
Posted at 11:22AM Sep 21, 2009 by Darryl Gove in Sun |
Performance tuning webcast
I wrote one of the TechDays 2008-2009 sessions on application performance tuning. Unfortunately I never actually got to give it to alive audience, but I did get this version recorded. Thanks to the HPC Watercooler for pointing it out to me.
Posted at 12:33PM Sep 08, 2009 by Darryl Gove in Sun |
Shoelaces
I was chatting to one of the kids teachers this morning, apparently she ends up tying shoelaces for a bunch of kids in the class everyday. All of which reminded me of this alternative way of tying laces.
Posted at 09:10AM Sep 03, 2009 by Darryl Gove in Personal |
Profiling a rate
Sometimes it's the rate of doing something which is the target that needs to be improved through optimisation. ie increase the widgets per second of some application. I've just been looking at a code that estimated performance by counting the number of computations completed in a known constant length of time. The code was showing a performance regression, and I wanted to find out what changed. The analysis is kind of counter intuitive, so I thought I'd share an example with you.
Here's an example code that does a computation for a fixed length of time, in this case about 30 seconds:
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
double f1(int i)
{
double t=0;
while (i-->0) {t+=t;}
return t;
}
double f2(int i)
{
double t=0;
while (i-->0) {t+=t;}
return t;
}
void main(int argc,char**args)
{
struct timeval now;
long startsecs;
long count=0;
int vcount;
if (argc!=2){ printf("Needs a number to be passed in\n"); exit(0);}
vcount=atoi(args[1]);
gettimeofday(&now,0);
startsecs=now.tv_sec;
do
{
f1(100);
f2(vcount);
count++;
gettimeofday(&now,0);
} while (now.tv_sec<startsecs+30);
printf("Iterations %i duration %i rate %f\n",count, now.tv_sec-startsecs, 1.0*count/(now.tv_sec-startsecs));
}
The code takes a command line parameter to indicate the number of iterations to do in function f2, function f1 always does 100 iterations.
If I compile and run this code under the performance analyzer with 50 and 70 as the commandline parameters I get the following profile:
| Description | 50 Iterations | 70 Iterations |
| Total time | 26.6s | 25.87s |
| f1 | 11.89s | 10.66s |
| gettimeofday | 9.9s | 8.76s |
| f2 | 4.53s | 6.09s |
| Main | 0.28s | 0.37s |
| Total iterations | 942,684 | 841,921 |
We can make the following observation when we go from 70 down to 50 for parameter passed to f2, we see a 12% gain in the total rate. This is to be expected as we are reducing the total number of iterations of the pair of loops in f1 and f2 will reduce from 170 down to 150, which is the same ~12% gain.
Where it gets counter intuitive is that for the run which achieves the higher rate, the time spent in the routines f1 and gettimeofday increases - by the same 12%. This is counter intuitive because increased time in a routine normally indicates that the routine is the one to be investigated, but for a 'rate' situation the opposite is true. These routines are being well behaved. The way to think about it is that each unit of work needs a smidgeon of time in both of these routines, if the number of units of work increases, then the absolute amount of time in these two routines needs to increase linearly with the increase in rate.
However, the time in routine f2 decreases as the rate increases. This is the routine which has been "improved" to get the better rate. The other thing to note is that the time went from ~6s to ~4.5s, but the rate went from 841k to 941k, so the time per unit work dropped further than that - this makes comparing the profiles of the two runs more tricky.
Note that Amdahl's law would still tell us that the routines that need to be optimised are the ones where the time is spent - so in one sense nothing has changed. But my particular scenario today is figuring out what has changed in the executable when compiled in two different ways that leads to the performance gain. In this context, I now know the routine, and I can dig into the assembly code to figure out the why.
Posted at 04:37PM Sep 02, 2009 by Darryl Gove in Sun |
Doctor Who exhibition in Coventry
We've recently returned from a vacation in the UK. We had a fun time, the kids got dragged around all sorts of entertainments. One thing I did manage to squeeze in was a trip to the Doctor Who exhibition in Coventry. It took us about forty minutes to wander through, and look at the props from the show. It's a temporary show being held at the Coventry Transport Museum, so after lunch we wandered through that. The transport museum had some interesting vehicles - a De Lorean, Thrust II, and Thrust SSC. A grand day out.
Posted at 10:37PM Aug 29, 2009 by Darryl Gove in Personal |
Maps in the STL
I was looking at some code with a colleague and we observed a bunch of time in some code which used the std::map to set up mappings between strings. The source code looked rather like the following:
#include <map>
#include <string>
using namespace std;
int func(map<string,string>&mymap, string &s1, string &s2)
{
mymap.insert(pair<string,string>(s1,s2));
return 0;
}
When compiled with CC -O -c -library=stlport4 map.cc this expands to a horrendous set of calls, here's the first few:
$ er_src -dis func map.o|grep call
[?] 188d: call std::basic_string...::basic_string
[?] 189f: call std::basic_string...::basic_string
[?] 18b2: call std::basic_string...::basic_string
[?] 18c2: call std::basic_string...::basic_string
[?] 18d8: call std::_Rb_tree...::insert_unique
[?] 18f8: call std::__node_alloc...::_M_deallocate
[?] 190c: call std::_STLP_alloc_proxy...::~_STLP_alloc_proxy
...
What's happening is that the act of making a pair object is causing copies to be made of the two strings that are passed into the pair constructor. Then the pair object is passed into the insert method of std::map and this results in two more copies of the strings being made. There's a bunch of other stuff going on, and the resulting code is a mess.
There's an alternative way of assigning the mapping:
#include <map>
#include <string>
using namespace std;
int func(map<string,string>&mymap, string &s1, string &s2)
{
mymap[s1]=s2;
return 0;
}
When compiled the resulting code looks a lot neater:
$ er_src -dis func map.o|grep call
[?] 28e6: call std::map...::operator[]
[?] 2903: call std::basic_string...::_M_assign_dispatch
Of course a neater chunk of code is nice, but the question is whether the code for ::operator[] contains the same ugly mess. Rather than disassembling to find out, it's simpler to time the two versions and see which does better. A simple test harness looks like:
int main()
{
map<string,string>mymap;
string s1,s2;
long long i;
s1="123456789";
s2="987654321";
for (i=0; i<100000000; i++)
{
func(mymap,s1,s2);
}
}
It's a less than ideal harness since it uses constant strings, and one version of the code might end up bailing early because of this. The performance of the two codes is quite surprising:
real 6.79 user 6.77 sys 0.00 real 1:03.53 user 1:03.26 sys 0.01
So the version that creates the pair object is about 10x slower!
Posted at 09:57AM Aug 28, 2009 by Darryl Gove in Sun | Comments[4]
Second life talk audio available.
The slides, transcript, and audio from my Second Life talk about The Developer's Edge are now available.
Posted at 09:30AM Aug 28, 2009 by Darryl Gove in Sun |
Sun Studio 12 Update 1 blog entry live on AMD site
Just had a blog entry about Sun Studio 12 Update 1 posted to the AMD forums site.
Posted at 02:05PM Jul 17, 2009 by Darryl Gove in Sun | Comments[5]
Lesser known Solaris features
Joerg Moellenkamp has put together a very nice downloadable pdf book on lesser known Solaris features. Well worth skimming through if you're interested in some of the features that Solaris has that have not captured the headlines. The book walks through the features and has examples of how to use them.
Posted at 12:00PM Jul 13, 2009 by Darryl Gove in Sun |
BBC computer history slidecast
Interesting slidecast with audio from the BBC on the history of computers in the valley. Runs about 5 minutes, so there's not much detail, but some interesting images.
Posted at 01:29PM Jul 02, 2009 by Darryl Gove in Sun |
Introduction to parallel programming
My colleague, Ruud van der Pas, has recorded a series of seven webcasts on parallel programming which will be released on the HPC portal. Ruud is an expert on parallel programming, and one of the authors of the book "Using OpenMP".
Posted at 09:17AM Jul 01, 2009 by Darryl Gove in Sun | Comments[1]
Cost of pointer chasing in libmtmalloc
I thought I'd quickly write up some comments about how to improve the issue with mtmalloc.
The "obvious" solution is to allocate larger chunks of memory. I was using 512 byte objects, so 72KB can allocate 144 of these. That's quite a few, I'm not sure that in general I'd want to allocate say 1024 objects of 512 bytes just in case I needed them. So increasing the chunk size might be a useful workaround, but I think it's rather a sledgehammer solution.
Of course, if I end up allocating 60KB objects, then I hope the code does the right thing and reserves space for several of them. If I need one, I'm quite likely to need a second, and I don't want to be doing pointer chasing on a linked list of single object blocks - that would be very painful. So hopefully there is some kind of scaling of the requestsize for larger objects.
However, the fundamental problem is not actually the number of objects in each allocated chunk, that's what reveals the problem. No, the real problem is the pointer chasing loop to locate a chunk with free space in it. There are several upfront problems with this. Fortunately the two data structures that are used in the pointer chasing (mt-next and mt-nfree) are on the same cache line - although given their offsets, I'm not convinced that this was a design decision. However, the pointer chasing from block to block guarantees that the next pair of values that need to be inspected are not in the cache. Given the fact that we're jumping at least 72KB, there's a good chance that the mapping isn't even in the TLB.
We could argue at this point that there should be a list of free memory so a single pointer step could get us to the next free block, but this approach almost certainly opens up a lot of issues around ensuring thread safety (ie you probably need mutexes) and getting good performance (ie mutexes cost quite a few cycles). So I don't think that's the solution.
I suspect the easiest thing that could be done would be to take the two key fields and place them into an array. So you would have an array of pointers to the chunks interleaved with the counts of the number of free elements in each of the chunks. The advantage of this is that we could identify a chunk with free space without having to stride through memory to do it, we can just scan the array. We'd rarely need multiple TLB entries, and we might even be able to fit the details of multiple chunks on the same cacheline - reducing cache misses substantially (there is an issue of false sharing here, so it may not be entirely feasible), and the other gain would be that we'd be streaming through memory so the hardware might be able to prefetch the next cacheline, or we just just add prefetches if that were necessary.
The programming challenge with this approach would be in the situations where we need to increase the size of the array to allocate more chunks. This should happen rarely, but could be tricky to do safely. But not impossible.
Posted at 12:23PM Jun 26, 2009 by Darryl Gove in Sun | Comments[3]
mtmalloc vs umem
A little while back I was looking at the performance of the STL with multithreaded code. I got the opportunity to try this on a particular code, and rather shockingly to me performance was absolutely terrible! I'd linked the code with mtmalloc, and the hottest function in the profile was malloc_internal. I've put together a fake code, and here's the profile from that:
Excl. Incl. Name User CPU User CPU sec. sec. 266.446 266.446258.301 263.084 malloc_internal 1.661 1.951 free 1.401 1.401 mutex_lock_impl 0.961 0.961 mutex_unlock
We can dig into the disassembly of malloc_internal to find out what's going on:
73.201 73.201 [?] 1724: cmp %o5, 0
1.981 1.981 [?] 1728: bne 0x1740
0.320 0.320 [?] 172c: nop
0.490 0.490 [?] 1730: ld [%i2 + 44], %i2
1.191 1.191 [?] 1734: cmp %i2, 0
0.901 0.901 [?] 1738: bne,a 0x1724
## 176.443 176.443 [?] 173c: ld [%i2 + 32], %o5
It's not hard to visualise what the original C code would look like:
while ((ptr->value==0) && (ptr->next!=0)) { ptr=ptr->next; }
Fortunately the source code is searchable and the above loop looks sufficiently similar to line 1032 of mtmalloc.c:
1032 while (thiscache != NULL && thiscache->mt_nfree == 0) 1033 thiscache = thiscache->mt_next;
So what's going on?
Reading through the source of malloc_internal, it appears that mtmalloc builds up a linked list of chunks of memory for each size of memory request. The size of the chunks of memory is 8KB*requestsize, and requestsize is 9. So basically each chunk of memory is 72KB in size. So when a memory request comes in, malloc_internal looks at the current chunk, and if memory can be allocated from there, then it returns memory from that chunk. If not it goes to the next chunk and so on. This works very well when memory is allocated at once, but as memory gets freed, these chunks of memory become like Swiss-cheese, with lots of holes in them. If a lot of memory of a particular size is requested, then freed, there can be a large number of these chunks in the linked list, and scanning through the chunks to find one with free space can be time consuming. And that is the condition that my test code exercises.
It's probably worth revealing the test code, at this point, so that you can see what it does:
#include <stdlib.h>
typedef struct s
{
struct s * next;
char padding[508];
} S;
void main()
{
struct s * head;
struct s * keep;
struct s * current;
head=0;
keep=0;
for (int j=0; j<100; j++)
{
for (int i=0; i<100000; i++)
{
current=(struct s*)malloc(sizeof(struct s));
if (i&1)
{
current->next=head;
head=current;
}
else
{
current->next=keep;
keep=current;
}
}
current = head;
while (current!=0)
{
struct s * tmp = current;
current = current -> next;
free(current);
}
head = 0;
}
}
The code maintains two lists, one that it places memory onto for a long duration, and another list that holds memory for only a short duration. The memory footprint of the code keeps increasing, so more chunks are added to the lists, and holding on to the memory for a long period of time ensures that the chunks end up with lots of gaps in them. The runtime of this code is as follows:
% cc -O mtm.c -lmtmalloc % timex a.out real 4:44.18 user 4:33.80 sys 8.70
However there is an API to libmtmalloc that allows us to adjust the size of the chunks. The following changes increase the requestsize from 9 to 20:
#include... mallocctl(MTCHUNKSIZE,20); ...
The performance reduces from nearly 5 minutes to about 1 minute:
% cc -O mtm.c -lmtmalloc % timex a.out real 1:09.10 user 1:01.09 sys 6.53
If we increase the requestsize to 30, performance improves still further:
% cc -O mtm.c -lmtmalloc % timex a.out real 38.36 user 31.41 sys 4.96
Of course, libmtmalloc is not the only memory allocator that is optimised for multi-threaded allocation. We also have libumem, compiling the original code to use this results in the following performance:
% cc -O mtm.c -lumem % timex a.out real 31.06 user 18.10 sys 10.95
So this is probably a good indication that you will get better performance from libumem if your application allocates and deallocates lots of memory. If you are using libmtmalloc in this role, then you may need to tune the requestsize to a greater number than the default - although this will increase the memory footprint of your application.
Posted at 11:15AM Jun 26, 2009 by Darryl Gove in Sun |
Sun Studio 12 Update 1
Sun Studio 12 Update 1 went live yesterday. It's still a free download, and it's got a raft of new features. Many people will have been using the express releases, so they will already be familiar with the improvements.
It's been about two years since Sun Studio 12 came out, and the most obvious change in that time is the prevalence of multicore processors. I figured the easiest way to discern this would be to look at the submissions of SPEC CPU2006 results in that time period. The following chart shows the cummulative number of SPEC CPU2006 Integer speed results over that time broken down by the number of threads that the chip was capable of supporting.
Ok, the first surprising thing about the chart is that there's very few single threaded chips. There were a few results when the suite was launched back in 2006, but nothing much since. What is more apparent is the number of dual-thread chips, that was where the majority of the market was. There were also a number of quad-thread chips at that point. If we fast-forward to the situation today, we can see that the number of dual-thread chips has pretty much leveled off, the bulk of the chips are capable of supporting four threads. But you can see the start of a ramp of chips that are capable of supporting 6 or 8 simultaneous threads.
The relevance of this chart to Sun Studio is that Sun Studio has always been a tool that supports the development of multi-threaded applications. Every release of the product improves on the support in the previous release. Sun Studio 12 Update 1 includes improvements in the compiler's ability to automatically parallelise codes - afterall the easiest way to develop parallel applications is if the compiler can do it for you; improvements to the support of parallelisation specifications like OpenMP, this release includes support for the latest OpenMP 3.0 specification; and improvements in the tools and their ability to provide the developer meaningful feedback about parallel code, for example the ability of the Performance Analyzer to profile MPI code.
Footnote SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Benchmark results stated above reflect results posted on www.spec.org as of 15 June 2009.
Posted at 10:01AM Jun 23, 2009 by Darryl Gove in Sun |
Glasgow Haskell Compiler successfully ported to OpenSPARC
Ben Lippmeier has been working on the native port of the Glasgow Haskell Compiler (GHC) to SPARC. The port was completed a few months back, and since then he's been using an UltraSPARC T2 system to look at thread activity and scaling as the number of threads is increased. The full details of the project are on his GHC on SPARC blog. The latest SPARC/Solaris binary can be downloaded here, although the full set of changes probably won't be available for a couple of months.
Posted at 04:46PM Jun 15, 2009 by Darryl Gove in Sun | Comments[1]
Audio for JavaOne interview available

A couple of weeks back I recorded an interview where I discussed The Developer's Edge. I've just found the audio up at BlogTalkRadio, it's about 15 minutes in duration.
Posted at 11:13PM Jun 14, 2009 by Darryl Gove in Sun |
Stlport4 and multithreaded code
I finally resolved a problem that's been annoying me for about 3 years. Codes that use the Standard Template Library don't scale to multiple threads.
First off, it's probably good to take a look at a code that illustrates the problem:
#include <vector>
int main()
{
#pragma omp parallel for default (__auto)
for (int i=0; i<10000000; i++)
{
std::vector<int> v;
v.push_back(10);
}
return(0);
}
The first comparison is between the serial performance of the Solaris default STL and stlport4 which is provided with the compiler.
$ CC -O t1.cc $ timex a.out real 15.85 user 15.64 sys 0.01 $ CC -O -library=stlport4 t1.cc $ timex a.out real 7.87 user 7.78 sys 0.01
This doesn't tell me anything that I didn't already know. stlport4 is (as far as I know) always faster than the STL provided by Solaris. Hence if you use C++, then you should use stlport4 in preference to the Solaris default. The constraint is that each application (libraries and all) can only use one version of the STL. So if a library that is outside your control uses the Solaris default, then the entire app must use it.
The next thing to investigate is scaling when there are multiple threads:
$ CC -O -xopenmp -library=stlport4 t1.cc $ timex a.out real 7.00 user 6.96 sys 0.01 $ export OMP_NUM_THREADS=2 $ timex a.out real 7.18 user 14.28 sys 0.01
So compiling the code to use OpenMP caused no performance overhead, but running with two threads had the same runtime as a run with a single thread. We can profile the code to see what's happening:
Excl. Incl. Name User CPU User CPU sec. sec. 8.076 8.0761.571 2.272 mutex_lock_impl 1.501 1.971 mutex_unlock 1.051 4.573 std::vector >::_M_insert_overflow(int*,const int&,const std::__true_type&,unsigned,bool) 0.871 8.076 _$d1A5.main 0.871 3.272 std::__node_alloc<true,0>::_M_allocate(unsigned) 0.560 1.721 std::__node_alloc<true,0>::_M_deallocate(void*,unsigned) 0.480 0.480 sigon 0.440 0.440 mutex_trylock_adaptive 0.250 0.470 mutex_unlock_queue
So the lost time is due to mutex locks, if you dig through the source you'll find that node_alloc has a single mutex lock that only allows a single thread to allocate or deallocate memory. Which is why the code shows no scaling.
This test code is basically creating and destroying vector objects, so it hits the allocate and deallocate routines very hard. Which is why I picked it. Real codes are much less likely to have this problem at quite the same level. It is not unusual to want to create and destroy objects within a loop. One workaround is to hoist the objects out of the hot loops. This works for some instances, but is not a great solution, as even in the best case it makes the code more complex.
The solution I ended up using was to build the Apache STL. It turned out to be a relatively straightforward experience. The compile line is a bit cryptic, I wanted the optimised, multithreaded, 64-bit version and this translates to:
$ gmake BUILDTYPE=12D CONFIG=sunpro.config
Once I had it built, I could install it with:
$ gmake BUILDTYPE=12D CONFIG=sunpro.config install PREFIX=`pwd`/install
The steps necessary to use a different STL than the ones supplied with the compiler are documented here. The compile line for the test code was:
CC -m64 -O -xopenmp -library=no%Cstd \ -I ./stdcxx-4.2.1/install/include/ \ -L ./stdcxx-4.2.1/install/lib/ \ -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc
So we can build the test and look at the scaling between one and two threads:
$ export OMP_NUM_THREADS=1 $ timex a.out real 18.98 user 18.93 sys 0.01 $ export OMP_NUM_THREADS=2 $ timex a.out real 18.42 user 36.73 sys 0.01
Which is not, to be honest, a great start, the runtime is slower, and the code still fails to scale. However, the profile is different:
Excl. Incl. Name User CPU User CPU sec. sec. 21.145 21.1452.572 16.411 std::vector<int,std::allocator<int> >::_C_insert_n(int*const&,unsigned long,const int&) 2.402 4.293 mutex_unlock 2.342 3.613 mutex_lock_impl 1.961 10.697 std::vector<int,std::allocator<int> >::_C_realloc(unsigned long) 1.681 5.634 free 1.341 1.891 mutex_unlock_queue 1.271 1.271 _free_unlocked 0.991 0.991 sigon
So we still see a lot of mutex activity. Looking at where the mutex activity comes from provides an interesting insight:
(er_print) csingle mutex_lock Attr. Excl. Incl. Name User CPU User CPU User CPU sec. sec. sec. 0.170 1.681 5.634 free 0.020 0.690 4.623 malloc 0.190 0.190 0.190 *mutex_lock
So the mutex activity is coming from malloc and free. Which are parts of the default Solaris memory allocator. The default memory allocator is thread safe, but does not give good performance for MT codes. There are two usual alternatives, mtmalloc and libumem. I've usually found mtmalloc to be good enough for me:
CC -m64 -O -xopenmp -library=no%Cstd \ -I ./stdcxx-4.2.1/install/include/ \ -L ./stdcxx-4.2.1/install/lib/ \ -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc -lmtmalloc
Then we can try the timing tests again:
$ export OMP_NUM_THREADS=1 $ timex a.out real 18.02 user 17.98 sys 0.01 $ export OMP_NUM_THREADS=2 real 13.76 user 27.05 sys 0.01 $ export OMP_NUM_THREADS=4 $ timex a.out real 6.92 user 26.97 sys 0.02 $ export OMP_NUM_THREADS=8 $ timex a.out real 3.51 user 26.99 sys 0.02
So the code is now scaling to multiple threads, which was the original problem. We have lost some serial performance, which is perhaps a concern, but that performance loss may be only for a particular code path, and depending on the usage of the library, we might even see gains in some of the algorithms. So depending on the situation, this might be a good enough solution. [FWIW, I also tested with libumem and did not see a significant difference in performance between the two libraries.]
Posted at 04:05PM Jun 12, 2009 by Darryl Gove in Sun | Comments[5]
Code complete: burn this chapter
That's a somewhat inflammatory title for this post, but continuing from my previous post on the book Code Complete, I think that the chapters (25 & 26) on performance really do not contain good advice or good examples.
To make this more concrete, consider the example on pg 593 where Steve McConnell compares the performance of these two code fragments:
| Original | Unrolled |
|---|---|
for i = 1 to 10 a[ i ] = i end for |
a[ 1 ] = 1 a[ 2 ] = 2 a[ 3 ] = 3 a[ 4 ] = 4 a[ 5 ] = 5 a[ 6 ] = 6 a[ 7 ] = 7 a[ 8 ] = 8 a[ 9 ] = 9 a[ 10 ] = 10 |
Steve finds that Visual Basic and Java run the unrolled version of the loop faster.
There's a couple of examples that talk about incorrect access ordering for arrays. Here's some C code that illustrates the problem:
| Slow code | Fast code |
|---|---|
for (column=0; column < max_column; column++)
{
for (row=0; row < max_row; row++)
{
data[row][column]=stuff();
}
}
|
for (row=0; row < max_row; row++)
{
for (column=0; column < max_column; column++)
{
data[row][column]=stuff();
}
}
|
On page 599 it is suggested that the slow code is inefficient because it might cause paging to disk, on page 623 it is suggested that the higher trip count loop should be inside to amortise the initialisation overhead for each execution of the inner loop. Neither of these explanations is right. As I'm sure most of you recognise the code is slow because of cache misses incurred when accessing non-adjacent memory locations. There is a cost to initialisation of the inner loop, but nothing significant, and yes, you could get paging to disk - but only if you are running out of memory (and if you're running out of memory, you're hosed anyway!). You're more likely to get TLB misses (and perhaps that is what Mr McConnell intended to say.
I consider the above issues to be quite serious, but, unfortunately, I'm not terribly happy with the rest of the material. Hence my recommendation to ignore (or burn
these chapters. I'll go through my other reservations now.
Lack of details. The timing information is presented with no additional information (pg 623) "C++ Straight Time = 4.75 Code-Tuned Time = 3.19 Time Savings = 33%". What was the compiler? What compiler flags were given? What was the test harness?
The book presents it as somehow that "C++" runs this code slowly, but in reality it's more likely to be a test of the effectiveness of the compiler, and the ability of the user to use the compiler. I'd be surprised if any compiler with minimal optimisation enabled did not do the loop interchange operation necessary to get good performance. Which leads to my next observation:
Don't compilers do this? I think the book falls into one of the common "optimisation book" traps, where lots of ink is spent describing and naming the various optimisations. This gives the false impression that it is necessary for the expert programmer to be able to identify these optimisations and apply them to their program. Most compilers will apply all these optimisations - afterall that is what compilers are supposed to do - take the grudgery out of producing optimal code. It's great for page count to enumerate all the possible ways that code might be restructured for performance, but for most situations the restructuring will lead to code that has the same performance.
Profiling. It's not there! To me the most critical thing that a developer can do to optimise their program is to profile it. Understanding where the time is being spent is the necessary first step towards improving the performance of the application. This omission is alarming. The chapter already encourages users to do manual optimisations where there might be no gains (at the cost of time spent doing restructuring that could be better spent writing new code, and the risk that the resulting code is less maintainable), but without profiling the application, the users are basically encouraged to do this over the entire source code, not just the lines that actually matter.
Assembly language. Yes, I love assembly language, there's nothing I enjoy better than working with it (no comment), but I wouldn't encourage people to drop into it for performance reasons, unless they had utterly exhausted every other option. The book includes an example using Delphi where the assembly language version ran faster than the high-level version. My guess is that the compilers had some trouble with aliasing, and hence had more loads than were necessary - a check of the assembly code that the compilers generated would indicate that, and then it's pretty straight forward to write assembly-language-like high level code that the compiler can produce optimal code for. [Note, that I view reading and analysing the code at the assembly language level to be very useful, but I wouldn't recommend leaping into writing assembly language without a good reason.]
So what would I recommend:
- Profile. Always profile. This will indicate where the time is being spent, and what sort of gains you should expect from optimising parts of the application.
- Know the tools. Make sure that you know what compiler flags are available, and that you are requesting the right kind of things from the compiler. All too often there are stories about how A is faster than B, which are due to people not knowing how to use the tools.
- Identify those parts of the code where the time is spent, and examine them in detail to determine if it's a short coming of the compiler, the compiler flags, or an ambiguity in the source code, that causes time to be spent there. Many performance problems can be solved with by adding a new flag, or perhaps a minor tweak to the source code.
- Only when you have exhausted all other options, and you know that you can get a significant performance gain should you start wildly hacking at the source code, or recoding parts in assembly language.
The other thing to recommend is a read of Bart Smaalder's Performance Anti-patterns.
Posted at 08:00AM Jun 11, 2009 by Darryl Gove in Personal | Comments[6]


