Darryl Gove's blog
BBC computer history slidecast
Interesting slidecast with audio from the BBC on the history of computers in the valley. Runs about 5 minutes, so there's not much detail, but some interesting images.
Posted at 01:29PM Jul 02, 2009 by Darryl Gove in Sun | Comments[0]
Introduction to parallel programming
My colleague, Ruud van der Pas, has recorded a series of seven webcasts on parallel programming which will be released on the HPC portal. Ruud is an expert on parallel programming, and one of the authors of the book "Using OpenMP".
Posted at 09:17AM Jul 01, 2009 by Darryl Gove in Sun | Comments[0]
Cost of pointer chasing in libmtmalloc
I thought I'd quickly write up some comments about how to improve the issue with mtmalloc.
The "obvious" solution is to allocate larger chunks of memory. I was using 512 byte objects, so 72KB can allocate 144 of these. That's quite a few, I'm not sure that in general I'd want to allocate say 1024 objects of 512 bytes just in case I needed them. So increasing the chunk size might be a useful workaround, but I think it's rather a sledgehammer solution.
Of course, if I end up allocating 60KB objects, then I hope the code does the right thing and reserves space for several of them. If I need one, I'm quite likely to need a second, and I don't want to be doing pointer chasing on a linked list of single object blocks - that would be very painful. So hopefully there is some kind of scaling of the requestsize for larger objects.
However, the fundamental problem is not actually the number of objects in each allocated chunk, that's what reveals the problem. No, the real problem is the pointer chasing loop to locate a chunk with free space in it. There are several upfront problems with this. Fortunately the two data structures that are used in the pointer chasing (mt-next and mt-nfree) are on the same cache line - although given their offsets, I'm not convinced that this was a design decision. However, the pointer chasing from block to block guarantees that the next pair of values that need to be inspected are not in the cache. Given the fact that we're jumping at least 72KB, there's a good chance that the mapping isn't even in the TLB.
We could argue at this point that there should be a list of free memory so a single pointer step could get us to the next free block, but this approach almost certainly opens up a lot of issues around ensuring thread safety (ie you probably need mutexes) and getting good performance (ie mutexes cost quite a few cycles). So I don't think that's the solution.
I suspect the easiest thing that could be done would be to take the two key fields and place them into an array. So you would have an array of pointers to the chunks interleaved with the counts of the number of free elements in each of the chunks. The advantage of this is that we could identify a chunk with free space without having to stride through memory to do it, we can just scan the array. We'd rarely need multiple TLB entries, and we might even be able to fit the details of multiple chunks on the same cacheline - reducing cache misses substantially (there is an issue of false sharing here, so it may not be entirely feasible), and the other gain would be that we'd be streaming through memory so the hardware might be able to prefetch the next cacheline, or we just just add prefetches if that were necessary.
The programming challenge with this approach would be in the situations where we need to increase the size of the array to allocate more chunks. This should happen rarely, but could be tricky to do safely. But not impossible.
Posted at 12:23PM Jun 26, 2009 by Darryl Gove in Sun | Comments[3]
mtmalloc vs umem
A little while back I was looking at the performance of the STL with multithreaded code. I got the opportunity to try this on a particular code, and rather shockingly to me performance was absolutely terrible! I'd linked the code with mtmalloc, and the hottest function in the profile was malloc_internal. I've put together a fake code, and here's the profile from that:
Excl. Incl. Name User CPU User CPU sec. sec. 266.446 266.446258.301 263.084 malloc_internal 1.661 1.951 free 1.401 1.401 mutex_lock_impl 0.961 0.961 mutex_unlock
We can dig into the disassembly of malloc_internal to find out what's going on:
73.201 73.201 [?] 1724: cmp %o5, 0
1.981 1.981 [?] 1728: bne 0x1740
0.320 0.320 [?] 172c: nop
0.490 0.490 [?] 1730: ld [%i2 + 44], %i2
1.191 1.191 [?] 1734: cmp %i2, 0
0.901 0.901 [?] 1738: bne,a 0x1724
## 176.443 176.443 [?] 173c: ld [%i2 + 32], %o5
It's not hard to visualise what the original C code would look like:
while ((ptr->value==0) && (ptr->next!=0)) { ptr=ptr->next; }
Fortunately the source code is searchable and the above loop looks sufficiently similar to line 1032 of mtmalloc.c:
1032 while (thiscache != NULL && thiscache->mt_nfree == 0) 1033 thiscache = thiscache->mt_next;
So what's going on?
Reading through the source of malloc_internal, it appears that mtmalloc builds up a linked list of chunks of memory for each size of memory request. The size of the chunks of memory is 8KB*requestsize, and requestsize is 9. So basically each chunk of memory is 72KB in size. So when a memory request comes in, malloc_internal looks at the current chunk, and if memory can be allocated from there, then it returns memory from that chunk. If not it goes to the next chunk and so on. This works very well when memory is allocated at once, but as memory gets freed, these chunks of memory become like Swiss-cheese, with lots of holes in them. If a lot of memory of a particular size is requested, then freed, there can be a large number of these chunks in the linked list, and scanning through the chunks to find one with free space can be time consuming. And that is the condition that my test code exercises.
It's probably worth revealing the test code, at this point, so that you can see what it does:
#include <stdlib.h>
typedef struct s
{
struct s * next;
char padding[508];
} S;
void main()
{
struct s * head;
struct s * keep;
struct s * current;
head=0;
keep=0;
for (int j=0; j<100; j++)
{
for (int i=0; i<100000; i++)
{
current=(struct s*)malloc(sizeof(struct s));
if (i&1)
{
current->next=head;
head=current;
}
else
{
current->next=keep;
keep=current;
}
}
current = head;
while (current!=0)
{
struct s * tmp = current;
current = current -> next;
free(current);
}
head = 0;
}
}
The code maintains two lists, one that it places memory onto for a long duration, and another list that holds memory for only a short duration. The memory footprint of the code keeps increasing, so more chunks are added to the lists, and holding on to the memory for a long period of time ensures that the chunks end up with lots of gaps in them. The runtime of this code is as follows:
% cc -O mtm.c -lmtmalloc % timex a.out real 4:44.18 user 4:33.80 sys 8.70
However there is an API to libmtmalloc that allows us to adjust the size of the chunks. The following changes increase the requestsize from 9 to 20:
#include... mallocctl(MTCHUNKSIZE,20); ...
The performance reduces from nearly 5 minutes to about 1 minute:
% cc -O mtm.c -lmtmalloc % timex a.out real 1:09.10 user 1:01.09 sys 6.53
If we increase the requestsize to 30, performance improves still further:
% cc -O mtm.c -lmtmalloc % timex a.out real 38.36 user 31.41 sys 4.96
Of course, libmtmalloc is not the only memory allocator that is optimised for multi-threaded allocation. We also have libumem, compiling the original code to use this results in the following performance:
% cc -O mtm.c -lumem % timex a.out real 31.06 user 18.10 sys 10.95
So this is probably a good indication that you will get better performance from libumem if your application allocates and deallocates lots of memory. If you are using libmtmalloc in this role, then you may need to tune the requestsize to a greater number than the default - although this will increase the memory footprint of your application.
Posted at 11:15AM Jun 26, 2009 by Darryl Gove in Sun | Comments[0]
Sun Studio 12 Update 1
Sun Studio 12 Update 1 went live yesterday. It's still a free download, and it's got a raft of new features. Many people will have been using the express releases, so they will already be familiar with the improvements.
It's been about two years since Sun Studio 12 came out, and the most obvious change in that time is the prevalence of multicore processors. I figured the easiest way to discern this would be to look at the submissions of SPEC CPU2006 results in that time period. The following chart shows the cummulative number of SPEC CPU2006 Integer speed results over that time broken down by the number of threads that the chip was capable of supporting.
Ok, the first surprising thing about the chart is that there's very few single threaded chips. There were a few results when the suite was launched back in 2006, but nothing much since. What is more apparent is the number of dual-thread chips, that was where the majority of the market was. There were also a number of quad-thread chips at that point. If we fast-forward to the situation today, we can see that the number of dual-thread chips has pretty much leveled off, the bulk of the chips are capable of supporting four threads. But you can see the start of a ramp of chips that are capable of supporting 6 or 8 simultaneous threads.
The relevance of this chart to Sun Studio is that Sun Studio has always been a tool that supports the development of multi-threaded applications. Every release of the product improves on the support in the previous release. Sun Studio 12 Update 1 includes improvements in the compiler's ability to automatically parallelise codes - afterall the easiest way to develop parallel applications is if the compiler can do it for you; improvements to the support of parallelisation specifications like OpenMP, this release includes support for the latest OpenMP 3.0 specification; and improvements in the tools and their ability to provide the developer meaningful feedback about parallel code, for example the ability of the Performance Analyzer to profile MPI code.
Footnote SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Benchmark results stated above reflect results posted on www.spec.org as of 15 June 2009.
Posted at 10:01AM Jun 23, 2009 by Darryl Gove in Sun | Comments[0]
Glasgow Haskell Compiler successfully ported to OpenSPARC
Ben Lippmeier has been working on the native port of the Glasgow Haskell Compiler (GHC) to SPARC. The port was completed a few months back, and since then he's been using an UltraSPARC T2 system to look at thread activity and scaling as the number of threads is increased. The full details of the project are on his GHC on SPARC blog. The latest SPARC/Solaris binary can be downloaded here, although the full set of changes probably won't be available for a couple of months.
Posted at 04:46PM Jun 15, 2009 by Darryl Gove in Sun | Comments[1]
Audio for JavaOne interview available

A couple of weeks back I recorded an interview where I discussed The Developer's Edge. I've just found the audio up at BlogTalkRadio, it's about 15 minutes in duration.
Posted at 11:13PM Jun 14, 2009 by Darryl Gove in Sun | Comments[0]
Stlport4 and multithreaded code
I finally resolved a problem that's been annoying me for about 3 years. Codes that use the Standard Template Library don't scale to multiple threads.
First off, it's probably good to take a look at a code that illustrates the problem:
#include <vector>
int main()
{
#pragma omp parallel for default (__auto)
for (int i=0; i<10000000; i++)
{
std::vector<int> v;
v.push_back(10);
}
return(0);
}
The first comparison is between the serial performance of the Solaris default STL and stlport4 which is provided with the compiler.
$ CC -O t1.cc $ timex a.out real 15.85 user 15.64 sys 0.01 $ CC -O -library=stlport4 t1.cc $ timex a.out real 7.87 user 7.78 sys 0.01
This doesn't tell me anything that I didn't already know. stlport4 is (as far as I know) always faster than the STL provided by Solaris. Hence if you use C++, then you should use stlport4 in preference to the Solaris default. The constraint is that each application (libraries and all) can only use one version of the STL. So if a library that is outside your control uses the Solaris default, then the entire app must use it.
The next thing to investigate is scaling when there are multiple threads:
$ CC -O -xopenmp -library=stlport4 t1.cc $ timex a.out real 7.00 user 6.96 sys 0.01 $ export OMP_NUM_THREADS=2 $ timex a.out real 7.18 user 14.28 sys 0.01
So compiling the code to use OpenMP caused no performance overhead, but running with two threads had the same runtime as a run with a single thread. We can profile the code to see what's happening:
Excl. Incl. Name User CPU User CPU sec. sec. 8.076 8.0761.571 2.272 mutex_lock_impl 1.501 1.971 mutex_unlock 1.051 4.573 std::vector >::_M_insert_overflow(int*,const int&,const std::__true_type&,unsigned,bool) 0.871 8.076 _$d1A5.main 0.871 3.272 std::__node_alloc<true,0>::_M_allocate(unsigned) 0.560 1.721 std::__node_alloc<true,0>::_M_deallocate(void*,unsigned) 0.480 0.480 sigon 0.440 0.440 mutex_trylock_adaptive 0.250 0.470 mutex_unlock_queue
So the lost time is due to mutex locks, if you dig through the source you'll find that node_alloc has a single mutex lock that only allows a single thread to allocate or deallocate memory. Which is why the code shows no scaling.
This test code is basically creating and destroying vector objects, so it hits the allocate and deallocate routines very hard. Which is why I picked it. Real codes are much less likely to have this problem at quite the same level. It is not unusual to want to create and destroy objects within a loop. One workaround is to hoist the objects out of the hot loops. This works for some instances, but is not a great solution, as even in the best case it makes the code more complex.
The solution I ended up using was to build the Apache STL. It turned out to be a relatively straightforward experience. The compile line is a bit cryptic, I wanted the optimised, multithreaded, 64-bit version and this translates to:
$ gmake BUILDTYPE=12D CONFIG=sunpro.config
Once I had it built, I could install it with:
$ gmake BUILDTYPE=12D CONFIG=sunpro.config install PREFIX=`pwd`/install
The steps necessary to use a different STL than the ones supplied with the compiler are documented here. The compile line for the test code was:
CC -m64 -O -xopenmp -library=no%Cstd \ -I ./stdcxx-4.2.1/install/include/ \ -L ./stdcxx-4.2.1/install/lib/ \ -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc
So we can build the test and look at the scaling between one and two threads:
$ export OMP_NUM_THREADS=1 $ timex a.out real 18.98 user 18.93 sys 0.01 $ export OMP_NUM_THREADS=2 $ timex a.out real 18.42 user 36.73 sys 0.01
Which is not, to be honest, a great start, the runtime is slower, and the code still fails to scale. However, the profile is different:
Excl. Incl. Name User CPU User CPU sec. sec. 21.145 21.1452.572 16.411 std::vector<int,std::allocator<int> >::_C_insert_n(int*const&,unsigned long,const int&) 2.402 4.293 mutex_unlock 2.342 3.613 mutex_lock_impl 1.961 10.697 std::vector<int,std::allocator<int> >::_C_realloc(unsigned long) 1.681 5.634 free 1.341 1.891 mutex_unlock_queue 1.271 1.271 _free_unlocked 0.991 0.991 sigon
So we still see a lot of mutex activity. Looking at where the mutex activity comes from provides an interesting insight:
(er_print) csingle mutex_lock Attr. Excl. Incl. Name User CPU User CPU User CPU sec. sec. sec. 0.170 1.681 5.634 free 0.020 0.690 4.623 malloc 0.190 0.190 0.190 *mutex_lock
So the mutex activity is coming from malloc and free. Which are parts of the default Solaris memory allocator. The default memory allocator is thread safe, but does not give good performance for MT codes. There are two usual alternatives, mtmalloc and libumem. I've usually found mtmalloc to be good enough for me:
CC -m64 -O -xopenmp -library=no%Cstd \ -I ./stdcxx-4.2.1/install/include/ \ -L ./stdcxx-4.2.1/install/lib/ \ -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc -lmtmalloc
Then we can try the timing tests again:
$ export OMP_NUM_THREADS=1 $ timex a.out real 18.02 user 17.98 sys 0.01 $ export OMP_NUM_THREADS=2 real 13.76 user 27.05 sys 0.01 $ export OMP_NUM_THREADS=4 $ timex a.out real 6.92 user 26.97 sys 0.02 $ export OMP_NUM_THREADS=8 $ timex a.out real 3.51 user 26.99 sys 0.02
So the code is now scaling to multiple threads, which was the original problem. We have lost some serial performance, which is perhaps a concern, but that performance loss may be only for a particular code path, and depending on the usage of the library, we might even see gains in some of the algorithms. So depending on the situation, this might be a good enough solution. [FWIW, I also tested with libumem and did not see a significant difference in performance between the two libraries.]
Posted at 04:05PM Jun 12, 2009 by Darryl Gove in Sun | Comments[5]
Code complete: burn this chapter
That's a somewhat inflammatory title for this post, but continuing from my previous post on the book Code Complete, I think that the chapters (25 & 26) on performance really do not contain good advice or good examples.
To make this more concrete, consider the example on pg 593 where Steve McConnell compares the performance of these two code fragments:
| Original | Unrolled |
|---|---|
for i = 1 to 10 a[ i ] = i end for |
a[ 1 ] = 1 a[ 2 ] = 2 a[ 3 ] = 3 a[ 4 ] = 4 a[ 5 ] = 5 a[ 6 ] = 6 a[ 7 ] = 7 a[ 8 ] = 8 a[ 9 ] = 9 a[ 10 ] = 10 |
Steve finds that Visual Basic and Java run the unrolled version of the loop faster.
There's a couple of examples that talk about incorrect access ordering for arrays. Here's some C code that illustrates the problem:
| Slow code | Fast code |
|---|---|
for (column=0; column < max_column; column++)
{
for (row=0; row < max_row; row++)
{
data[row][column]=stuff();
}
}
|
for (row=0; row < max_row; row++)
{
for (column=0; column < max_column; column++)
{
data[row][column]=stuff();
}
}
|
On page 599 it is suggested that the slow code is inefficient because it might cause paging to disk, on page 623 it is suggested that the higher trip count loop should be inside to amortise the initialisation overhead for each execution of the inner loop. Neither of these explanations is right. As I'm sure most of you recognise the code is slow because of cache misses incurred when accessing non-adjacent memory locations. There is a cost to initialisation of the inner loop, but nothing significant, and yes, you could get paging to disk - but only if you are running out of memory (and if you're running out of memory, you're hosed anyway!). You're more likely to get TLB misses (and perhaps that is what Mr McConnell intended to say.
I consider the above issues to be quite serious, but, unfortunately, I'm not terribly happy with the rest of the material. Hence my recommendation to ignore (or burn
these chapters. I'll go through my other reservations now.
Lack of details. The timing information is presented with no additional information (pg 623) "C++ Straight Time = 4.75 Code-Tuned Time = 3.19 Time Savings = 33%". What was the compiler? What compiler flags were given? What was the test harness?
The book presents it as somehow that "C++" runs this code slowly, but in reality it's more likely to be a test of the effectiveness of the compiler, and the ability of the user to use the compiler. I'd be surprised if any compiler with minimal optimisation enabled did not do the loop interchange operation necessary to get good performance. Which leads to my next observation:
Don't compilers do this? I think the book falls into one of the common "optimisation book" traps, where lots of ink is spent describing and naming the various optimisations. This gives the false impression that it is necessary for the expert programmer to be able to identify these optimisations and apply them to their program. Most compilers will apply all these optimisations - afterall that is what compilers are supposed to do - take the grudgery out of producing optimal code. It's great for page count to enumerate all the possible ways that code might be restructured for performance, but for most situations the restructuring will lead to code that has the same performance.
Profiling. It's not there! To me the most critical thing that a developer can do to optimise their program is to profile it. Understanding where the time is being spent is the necessary first step towards improving the performance of the application. This omission is alarming. The chapter already encourages users to do manual optimisations where there might be no gains (at the cost of time spent doing restructuring that could be better spent writing new code, and the risk that the resulting code is less maintainable), but without profiling the application, the users are basically encouraged to do this over the entire source code, not just the lines that actually matter.
Assembly language. Yes, I love assembly language, there's nothing I enjoy better than working with it (no comment), but I wouldn't encourage people to drop into it for performance reasons, unless they had utterly exhausted every other option. The book includes an example using Delphi where the assembly language version ran faster than the high-level version. My guess is that the compilers had some trouble with aliasing, and hence had more loads than were necessary - a check of the assembly code that the compilers generated would indicate that, and then it's pretty straight forward to write assembly-language-like high level code that the compiler can produce optimal code for. [Note, that I view reading and analysing the code at the assembly language level to be very useful, but I wouldn't recommend leaping into writing assembly language without a good reason.]
So what would I recommend:
- Profile. Always profile. This will indicate where the time is being spent, and what sort of gains you should expect from optimising parts of the application.
- Know the tools. Make sure that you know what compiler flags are available, and that you are requesting the right kind of things from the compiler. All too often there are stories about how A is faster than B, which are due to people not knowing how to use the tools.
- Identify those parts of the code where the time is spent, and examine them in detail to determine if it's a short coming of the compiler, the compiler flags, or an ambiguity in the source code, that causes time to be spent there. Many performance problems can be solved with by adding a new flag, or perhaps a minor tweak to the source code.
- Only when you have exhausted all other options, and you know that you can get a significant performance gain should you start wildly hacking at the source code, or recoding parts in assembly language.
The other thing to recommend is a read of Bart Smaalder's Performance Anti-patterns.
Posted at 08:00AM Jun 11, 2009 by Darryl Gove in Personal | Comments[6]
Utilising a CMT system
I got asked about how to utilise a CMT system, it's probably not an uncommon question, so I'll post my (somewhat brief) answer here:
The CMT processor appears as a system with many CPUs. These virtual CPUs can be provisioned in the same way as you would with any multiprocessor system:
- The OS will naturally handle positioning multiple active threads so as to get the optimal performance.
- If you wish to manually tune this then you can use Solaris tools like processor binding (pbind, or processor_bind) to statically allocate a particular thread or process to a particular core. You can use processor sets (psrset) to restrict a set of processes to a particular set of processors (or to exclude particular processes from using these processors).
- The machine can be divided into multiple virtual machines either through Solaris Zones, where all zones run the same version of the Solaris operating system. Or through logical domains where multiple different operating systems can be installed onto the same machine.
The optimal configuration will depend on the problem to be solved.
I've actually heard someone argue that multicore processors require a redesign of applications. Um, no. Applications will work just fine. However, multicore processors do give you opportunities to throw more threads at a problem - which can be very useful.
Posted at 12:04AM Jun 11, 2009 by Darryl Gove in Sun |
Code complete: coding style
I read Code Complete a couple of years back. It's an interesting book to read, but there were two parts that annoyed me. I was giving a presentation the other week on "Coding for performance" and I happened to mention the book, and say that I had these two reservations. So I figure I should write them up more formally.
My first issue was, basically, me just been a stuck in the mud. Those of you who regularly read my blog will see that I favour the following style of indenting:
if (some condition)
{
do something;
}
If you've read Solaris Application programming, you'll see that I actually use quite a few styles. In writing that book, there were particular places where there was limited space on the page and I ended up juggling the style to make it fit the medium. So I have preferences, but I'm pragmatic.
Anyway, CC on page 746 says identifies my preferred style as "unindented begin-end pairs" and says the following "Although this approach looks fine, it violates the Fundamental Theorem of Formatting; it doesn't show the logical structure of the code.".
Bother.
So I wanted to read up more details on this Fundamental Theorem, perhaps I'm misreading the text, but this is how it appeared to me (pg 739) "A control construct in Visual Basic always has a beginning statement ... and it always has a corresponding End statement." (pg 740) "The controversy about formatting control blocks arises in part from the fact that some languages don't require block structures." (pg 740) "Uncoupling begin and end from the control structure - as languages like C++ and Java do - with { and } - leads to questions about where to put the begin and end. Consequently, many indentation problems are problems only because you have to compensate for poorly designed language structures." [Emphasis mine.] I read this as, basically, you need to make your untidy C/C++/Java code look more like VB. I guess that's why it's taken me a couple of years to calm down sufficiently to post this 
Actually, I'm not far from disagreeing. But let's return to this point in a moment. Let's start with the two approaches to indenting that are recommended in the book.
First of all, what is probably the most common style "pure block emulation":
if (something) {
do something;
}
The other recommended style is "begin and end as block boundaries":
if (something)
{
do something;
}
On page 745, a study by Hansen and Yim (1987) indicates that there's no difference in understandability between these two styles. Excellent - so it doesn't matter! I'm sure that if "unindented begin-end pairs" were also included in the survey then it too would provide indistinguishable understandability.
Anyway, the differences between the recommended "begin and end as block boundaries" and the shunned "unindented begin-end pairs" is basically four spaces, which I don't personally think is a lot.
Heading back to why I might actually agree with some of his comments. It is very easy to introduce a bug in a program where the begin and end braces have been omitted. For example:
| Before | After |
|---|---|
if ( a > max )
max = a;
|
if ( a > max )
printf("New max = %i\n",a);
max = a;
|
So, whilst I agree that the absence of brackets can be a problem, I don't necessarily think that rigid adherence to a particular style naturally follows as the only solution to that problem.
I do have some rules that I tend to obey:
- Indenting is a personal/project preference. There are tools out there that can render source code pretty much how you like it. The UI is a view of the source, and it doesn't really matter what the style of the source is. If I find the source hard to read, then I can process it to make it conform to what ever layout works best for me to solve the problem that I'm working on.
- Always use begin and end brackets. They add a single character and can avoid the problem demonstrated above.
- I tend to favour clarity over a rigid adherence to particular styles. I'm not above placing an entire statement on a single line when I feel that it is the best way to present the information. Taking the previous example:
Multi-line Single line if ( a > max ) { max = a; }if ( a > max ) { max = a; }
Posted at 10:19PM Jun 10, 2009 by Darryl Gove in Personal | Comments[4]
Secure programming paper
Excellent paper from Joep Vesseur on secure programming.
Posted at 11:45AM Jun 03, 2009 by Darryl Gove in Sun |
The Developer's Edge talk in Second Life
Just finished talking in Second Life. The slides from the talk are available from SLX. I've got into the habit of writing a transcript for my SL presentations - basically in case the audio fails for some reason.
The talk focuses a bit more on the way that people now get information (through blog posts, articles, indexed by search engines) and the Q&A after the talk was more about that than the technical content of the book. This is a domain that I've given a fair amount of thought to. When writing technical books there is a challenge to balance the information so that it includes the necessary details without writing material that will be out of date by the time that the book hits the press. Fortunately a large amount of the information that developers need is relatively long lived. The challenges come when describing a particular revision of the software, or a particular processor - details which can be very useful for people, but also details which may not age gracefully!
Posted at 10:06AM Jun 03, 2009 by Darryl Gove in music | Comments[3]
Graph of libraries used by firefox and thunderbird
Just gathered library usage charts for firefox and thunderbird. The full charts look like:
Firefox
Thunderbird
Neither of which is particularly telling. The reduced charts look much better:
Firefox
Thunderbird
Posted at 03:00AM May 22, 2009 by Darryl Gove in Sun |
Drawing libraries - neater eye-candy!
Chris Quenelle posted an interesting comment to my post which showed the dependencies for StarOffice. As you can see from the mass of lines below, adding more dependency information, using the latest version of ld_dot, into the StarOffice library map did not make the graphic any clearer!
It turns out that the reduction operation that Chris was alluding to is implemented by tred (the "transitive reduction filter", what great technobabble!). This filtering reduces the graph down to something which even looks ok when shrunk down to fit this page:
This clarifies the relationships between the libraries. More importantly it also looks pretty.
Posted at 09:48PM May 20, 2009 by Darryl Gove in Sun | Comments[1]
Libraries (5) - Runtime costs - TLBs
The next consideration when using libraries is that each library will get mapped in on a new virtual page of memory; as shown in this pmap output:
% pmap 60500 60500: a.out 00010000 8K r-x-- /libraries/a.out 00020000 8K rwx-- /libraries/a.out FEEC0000 24K rwx-- [ anon ] FEED0000 8K r-x-- /libraries/lib1_26.so FEEE0000 8K rwx-- /libraries/lib1_26.so FEEF0000 8K r-x-- /libraries/lib1_25.so FEF00000 8K rwx-- /libraries/lib1_25.so FEF10000 8K r-x-- /libraries/lib1_24.so FEF20000 8K rwx-- /libraries/lib1_24.so FEF30000 8K r-x-- /libraries/lib1_23.so FEF40000 8K rwx-- /libraries/lib1_23.so FEF50000 8K rwx-- [ anon ] FEF60000 8K r-x-- /libraries/lib1_22.so FEF70000 8K rwx-- /libraries/lib1_22.so FEF80000 8K r-x-- /libraries/lib1_21.so FEF90000 8K rwx-- /libraries/lib1_21.so FEFA0000 8K r-x-- /libraries/lib1_20.so FEFB0000 8K rwx-- /libraries/lib1_20.so FEFC0000 8K r-x-- /libraries/lib1_19.so ....
There are finite number of TLB entries on a chip. If each library takes an entry, and the code jumps around between libraries, then a single application can utilise quite a few TLB entries. Take a CMT system where there are multiple applications (or copies of the same application) running, and there becomes a lot of pressure on the TLB.
One of the enhancements in Solaris to support CMT processors is Shared Context. When multiple applications map the same library at the same address, then they can share a single context to map that library. This can lead to a significant reduction in the TLB pressure. Shared context only works for libraries that are loaded into the same memory locations in different contexts, so it can be defeated if the libraries are loaded in different orders or any other mechanisms that scramble the locations in memory.
If each library is mapped into a different TLB entry, then every call into a new library is a new ITLB entry, together with a jump through the PLT, together with the normal register spill/fill overhead. This can become quite a significant chunk of overhead.
To round this off, lets look at some figures from an artificial code run on an UltraSPARC T1 system that was hanging around here.
| Experiment | Runtime |
|---|---|
| Application that jumps between 26 different routines a->b->c...->z. All the routines are included in the same executable. | 3s |
| Application that jumps between 26 different routines a->...z. The routines are provided as a library, and calls are therefore routed through the PLT. | 6s |
Application that jumps between 26 different routines a->...z. The routines are
provided as a library, but all are declared static except for the
initial routine that is called by main. Therefore the calls within the library
avoid the PLT. |
3s |
| Application that jumps between 26 different routines a->...z. Each routine is defined in its own library, so calls to the routine have to go through the PLT, and also require a new ITLB entry to be used. | 60s |
Since the routines in this test code don't actually do anything, the overhead of calling through the PLT is clearly shown as a doubling of runtime. However, this is insignificant when compared with the costs of calling to separate libraries, which is about 10x slower than this.
Moving the experiment to look at the impact on CMT systems:
| Experiment | Runtime |
|---|---|
| One copy of this executable per core of an UltraSPARC T1 processor | 1 minute |
| Two copies of this executable per core | 5 minutes |
| Four copies of this executable per core (fully loaded system) | 8 minutes |
Running multiple copies of the application has a significant impact on performance. The performance counters show very few instructions being executed, and much time being lost to ITLB misses. Now this performance is from a system without the shared context changes - so I would expect much better scaling on a system with these improvements (if I find one I'll rerun the experiment).
The conclusion is that care needs to be taken when deciding to split application code into libraries.
Posted at 06:00PM May 20, 2009 by Darryl Gove in Sun |
Libraries (4) - Runtime costs - Procedure Lookup Table (PLT)
Most applications spend the majority of their time running - rather than starting up. So it's useful to look at the costs of using libraries at runtime.
The most apparent cost of using libraries is that calls to routines now go indirectly to the target routine through the procedure look up table (PLT). Unless the developer explicitly limits the scope of a function, it is exported from the library as a global function, which means that even calls within the library will go through the PLT. Consider the following code snippet:
void func2()
{
...
}
void func1()
{
func2();
}
If this is compiled into an executable the assembly code will look like:
func1()
11104: 82 10 00 0f mov %o7, %g1
11108: 7f ff ff f8 call func2 ! 0x110e8
1110c: 9e 10 00 01 mov %g1, %o7
However, if this is compiled as part of a library then the code looks like:
func2()
664: 82 10 00 0f mov %o7, %g1
668: 40 00 40 b9 call .plt+0x3c ! 0x1094c
66c: 9e 10 00 01 mov %g1, %o7
This is a doubling of the cost of the call.
In C it's possible to limit the scope of the function using the static keyword. Declaring func1 as static will cause the compiler to generate a direct call to that routine. The downside is that the routine will only be visible within the source file that defines it. It is also possible to use other methods to limit the visibility of symbols.
Posted at 03:00PM May 20, 2009 by Darryl Gove in Sun | Comments[2]
Libraries (3) - Application startup costs
As can be seen from the previous graphs, even a simple application (like ssh) can pull in a fair number of libraries. Whenever a library is pulled in, the linker has to request memory, load the image from disk, and then link in all the routines. This effort takes time - it's basically a large chunk of the start up time of an application. If you profile the start up of an application, you'll probably not see much because much of this time is basically the OS/disk activity of mapping the libraries into memory.
Of course applications also have start up costs associated with initialising data structures etc. However, the biggest risk is that applications will pull in libraries that they don't need, or perhaps do need, but don't need yet. The best work-around for this is to lazy load the libraries. Of course it's fairly easy to write code that either breaks under lazy loading or breaks lazy loading. It's not hard to work around these issues with care, and doing so can have a substantial impact on start up time.
Posted at 02:01PM May 20, 2009 by Darryl Gove in Sun |
Libraries (2)
Just updated the ld_dot script to include filter libraries. Added a profile for ssh logging into a system, rather than just showing the help message (click the image for the full size version).
Posted at 11:52AM May 20, 2009 by Darryl Gove in Sun |
Libraries
I was talking to Rod Evans about the diagnostic capabilities available in the runtime linker. These are available through the environment setting LD_DEBUG. The setting LD_DEBUG=files gives diagnostic information about which libraries were loaded by which other libraries. This is rather hard to interpret, and would look better as a graph. It's relatively easy to parse the output from LD_DEBUG into dot format. This script does the parsing. The full stesp to do this for the date command are:
$ LD_DEBUG=files date >ld_date 2>&1 $ ld_dot ld_date $ dot -Tpng -o date.png dot.dot
The lines in the graph represent which libraries use which other libraries. Solid lines indicate "needed" or hard links, the dotted lines represent lazy loading or dynamic loading (dlopen). The resulting graph looks like:
More complex commands like ssh pull in a larger set of libraries:
It is possible to use this on much larger applications. Unfortunately, the library dependencies tend to get very complex. This is the library map for staroffice.
Posted at 06:12AM May 20, 2009 by Darryl Gove in Sun | Comments[4]
Developer's Edge safari rerelease
We've just pushed a new version of The Developer's Edge to safari. The original version didn't show any of the text from each section of the book unless you logged into the safari site. The new version shows the snippet from each section even if you're not a subscriber.
I was pleased to see that the book is featured on the Sun Studio developer portal.
I'm also scheduled to give a second life presentation during JavaOne at 9am PST on the 3rd June.
Posted at 02:01PM May 19, 2009 by Darryl Gove in Sun |
The perils of strlen
Just been looking at an interesting bit of code. Here's a suitably benign version of it:
#include <string.h>
#include <stdio.h>
void main()
{
char string[50];
string[49]='\0';
int i;
int j=0;
for (i=0; i<strlen(string); i++)
{
if (string[i]=='1') {j=i;}
}
printf("%i\n",j);
}
Compiling this bit of code leads to a loop that looks like:
.L900000109:
/* 0x002c 12 */ cmp %i5,49
/* 0x0030 10 */ add %i3,1,%i3
/* 0x0034 12 */ move %icc,%i4,%i2
/* 0x0038 10 */ call strlen ! params = %o0 ! Result = %o0
/* 0x003c */ or %g0,%i1,%o0
/* 0x0040 */ add %i4,1,%i4
/* 0x0044 */ cmp %i4,%o0
/* 0x0048 */ bcs,a,pt %icc,.L900000109
/* 0x004c 12 */ ldsb [%i3],%i5
The problem being that for each character tested there's also a call to strlen! The reason for this is that the compiler cannot be sure what the call to strlen actually returns. The return value might depend on some external variable that could change as the loop progresses.
There's a lot of functions defined in the libraries that the compiler could optimise, if it was certain that it recognised them. The compiler flag that enables recognition of the "builtin" functions is -xbuiltin (which is included in -fast. This enables the compiler to do things like recognise calls to memcpy or memset and in some instances produce more optimal code. However, it doesn't recognise the call the strlen.
In terms of solving the problem, there are two approaches. The most portable approach is to hold the length of the string in a temporary variable:
int length=strlen(string); for (i=0; i<length; i++)
Another, less portable approach, is to use #pragma no_side_effect. This pragma means the return value of the function depends only on the parameters passed into the function. So the result of calling strlen only depends on the value of the constant string that is passed in. The modified code looks like:
#include <string.h>
#include <stdio.h>
#pragma no_side_effect(strlen)
void main()
{
char string[50];
string[49]='\0';
int i;
int j=0;
for (i=0; i<strlen(string); i++)
{
if (string[i]=='1') {j=i;}
}
printf("%i\n",j);
}
And more importantly, the resulting disassembly looks like:
.L900000109:
/* 0x0028 0 */ sub %i1,49,%o7
/* 0x002c 11 */ add %i3,1,%i3
/* 0x0030 0 */ sra %o7,0,%o5
/* 0x0034 13 */ movrz %o5,%i4,%i2
/* 0x0038 11 */ add %i4,1,%i4
/* 0x003c */ cmp %i4,%i5
/* 0x0040 */ bcs,a,pt %icc,.L900000109
/* 0x0044 13 */ ldsb [%i3],%i1
Posted at 02:15PM May 07, 2009 by Darryl Gove in Sun | Comments[2]
Computer organization and design
Just returned from Europe - customer visits and the OpenSPARC workshop in Brussels. Since I was doing a fair amount of air travel I took a number of books. With good timing, I'd just got a copy of Patterson & Hennessy's Computer Organisation and Design.
The book is certainly an interesting read. Although there are various ways you might read the book - or various lessons you might extract from it, I took it as basically a book that describes how to build a MIPS processor. Much of the work I do is at the hardware-software border, so I found it useful to read around the domain a bit more. The book is a relatively easy read, lots of detail (which I didn't need to memorise), and a nice pace that builds up from a simple core into a more complete implementation of a processor. The book has a chapter on multicore, but this was not treated to the same depth. There's also some material about GPUs, and that also didn't fit very well.
Posted at 01:28PM Apr 30, 2009 by Darryl Gove in Sun |
Compiling for Nehalem and other processors
First thing I'd suggest is to make sure that you're using a recent compiler. Practically that means Sun Studio 12 (which is quite old now), or the Sun Studio Express releases (aka Sun Studio 12 Update 1 Early Access). Obviously the latest features are only found in the latest compilers.
In terms of flags, the starting point should be -fast, you can later trim that if you need to remove floating point simplification, or are not happy with one or other of the flags that it includes.
The next flags to use are:
-xipo=2This flag enables crossfile optimisation and tracking of allocated memory. I find crossfile optimisation useful because it limits the impact of the source code structure on the final application.-xarch=sse4_2This flag is included in-fast(assuming that the build system supports it). However, if you later plan to fine tune the compiler flags, it's best to start off with it explicitly. This allows the compiler to use the SSE4 instruction set. Probably-xarch=sse2will be sufficient in most circumstances - it's a call depending on the system that the application will be deployed on.-xvector=simdThis flag tells the compiler to generate SIMD (single instruction multiple data) instructions - basically the combination of this and the architecture flag enables the compiler to generate applications that use SSE instructions. These instructions can lead to substantial performance gains in some floating point applications.-m64On x86 there's performance to be gained from using the 64-bit instruction set extensions. The code gets more registers and a better calling convention. These tend to outweigh the costs of the larger memory footprint of 64-bit applications.-xpagesize=2MThis tells the operating system to provide large pages to the application.- The other optimisation that I find very useful is profile feedback. This does complicate and lengthen the build process, but is often the most effective way of getting performance gains for codes dominated by branches and conditional code.
- The other flags to consider are the aliasing flags
-xalias_level=stdfor C,-xalias_level=compatiblefor C++, and-xrestrict. These flags do lead to performance gains, but require the developer to be comfortable that their code does conform to the requirements of the flags. (IMO, most code does.)
All this talk about flags should not be a replacement for what I consider to be the basic first step in optimising the performance of an application: to take a profile. Compiler flags tell the compiler how to do a good job of producing the code, but the compiler can't do much about the algorithms used. Profiling the application will often give a clue as to a way that the performance of the application can be improved by a change of algorithm - something that even the best compiler flags can't always do.
Posted at 04:41PM Apr 17, 2009 by Darryl Gove in Sun |
California special election
Today I got the "voter information pamphlet" for the California statewide special election - which is meant to solve the . The booklet is rather short on information. Well, very short on information. A bit of searching found this site that links to the actual texts of the bills. Even with this I'm confused as to how "Supplemental payments to local school districts ... to address budget cuts." can provide "state savings of up to several billion dollars in 2009-2010 and 2011-2012".
Note:The last time I wrote about voting, I was presenting at a customer site the next day, and one of the questions was "Is this really your blog?". Next week I'm off presenting again, and if you happen to come here as a result of one of those presentations, please scroll down for my usual material. 
Posted at 10:24PM Apr 16, 2009 by Darryl Gove in Personal | Comments[1]
The Developer's Edge - Hardcopy updated
The cover art for The Developer's Edge has been updated. You can see this on the Amazon picture, which now agrees with the one on the left of this blog.
Posted at 12:00AM Apr 03, 2009 by Darryl Gove in Sun |
Strong prefetch
Looked at an interesting problem yesterday. One of the hot routines in an application was showing a lot of system time. The overall impact on the application was minor, but any unexpected system time is worth investigating.
Profiling under the performance analyzer indicated a pair of instructions, a prefetch followed by a load. The load had all the time attributed to it, but it's usual for the performance analyzer to attribute time to the instruction following the one that is slow. Something like (this is synthetic data, measured data later):
User System 0.5 0.5 prefetch [%o0],23 10.0 20.0 ld [%i1],%i2
So there are three things that sprang to mind:
- Misaligned memory accesses. This would probably be the most common cause of system time on memory operations. However, it didn't look like a perfect file since the system time would usually be reported on the instruction afterwards.
- TSB misses. The TLB is the on-chip structure that holds virtual to physical mappings. If the mapping is not in the TLB, then the mapping gets fetched from a software managed structure called the TSB. This is a TLB miss. However, the TSB has a capacity limit too, and it is possible for there to be a TSB miss where the mapping is fetched from memory and placed into the TSB. TLB misses are fast traps and don't get recorded under system time. However, TSB misses do cause the switch. This is quite a strong candidate, however trapstat -t didn't show much activity.
- The third option was the prefetch instruction.
A prefetch instruction requests data from memory in advance of the data being used. For many codes this results in a large gain in performance. There's a number of variants of prefetch, and their exact operation depends on the processor. The SPARC architecture 2007 on pages 300-303 gives an outline of the variants. There's prefetch to read, prefetch to write, prefetch for single use, prefetch for multiple use. The processor has the option to do something different depending on which particular prefetch was used to get the data.
The system I was running on was an UltraSPARC IV+. This introduced the strong prefetch variant.
The idea of the strong prefetch variant is that it will cause a TLB miss to be taken if the mapping of the requested address is not present in the TLB. This is useful for situations where TLB misses are frequently encountered and the prefetch needs to be able to deal with them. To explain why this is important, consider this example. I have a loop which strides through a large range of memory, the loop optimally takes only a few cycles to execute - so long as the data is resident in cache. It takes a couple of hundred cycles to fetch data from memory, so I end up fetching for eight iterations ahead (eight is the maximum number of outstanding prefetches on that processor). If the data resides on a new page, and prefetches do not cause TLB misses, then eight iterations will complete before a load or store touches the new page of memory and brings the mapping into the TLB. The eight prefetches that were issued for those iterations will have been dropped and all the iterations will see the full memory latency (a total of 200*8=1600 cycles of cost). However, if the prefetches are strong prefetches, then the first strong prefetch will cause the mapping to be brought into the TLB, and all the prefetches will hit in the TLB.
However, there's a corner case. A strong prefetch of a page that is not mapped will still cause the TLB miss, and the corresponding system time while the kernel figures drops the access to a non-existant page.
This was my suspicion. The code had a loop, and each iteration of the loop would prefetch the next item in the linked list. If the code had reached the end of the list, then it would be issuing a prefetch for address 0, which (of course) is not mapped.
It's relatively easy to provide a test case for this, just loop around issuing prefetches for address zero and see how long the code takes to run. Prefetches are defined in the header file <sun_prefetch.h>. So the first code looks like:
#include <sun_prefetch.h>
void main()
{
for (int i=0; i<1000000; i++)
{
sparc_prefetch_write_many(0);
}
}
Compiling and running this gives:
$ cc pref.c $ timex a.out real 2.59 user 1.24 sys 1.34
So the code has substantial system time - which fits with the profile of the application. Next thing to look at is the location of that system time:
$ collect a.out
$ er_print -metrics e.user:e.system -dis main test.1.er
...
Excl. Excl.
User CPU Sys. CPU
sec. sec.
...
0. 0. [?] 10b8c: bset 576, %o1 ! 0xf4240
0. 0. [?] 10b90: prefetch [0], #n_writes_strong
## 1.571 1.181 [?] 10b94: inc %i5
0. 0. [?] 10b98: cmp %i5, %o1
...
So the system time is recorded on the instruction following the prefetch.
The next step is to investigate ways to solve it. The <sun_prefetch.h> header file does not define any variants of weak prefetch. So it's necessary to write an inline template to provide the support:
.inline weak,4 prefetch [%o0],2 .end .inline strong,4 prefetch [%o0],22 .end
The identifier for many writes strong is #22, for many writes weak it is #2. The initial code uses strong since I want to validate that I get the same behaviour with my new code:
void weak(int*);
void strong(int*);
void main()
{
for (int i=0; i<1000000; i++)
{
strong(0);
}
}
Running this gives:
$ cc pref.c pref.il
$ collect a.out
$ er_print -metrics e.user:e.system -dis main test.1.er
...
Excl. Excl.
User CPU Sys. CPU
sec. sec.
...
0. 0. [?] 10b90: clr %o0
0. 0. [?] 10b94: nop
0. 0. [?] 10b98: prefetch [%o0], #n_writes_strong
## 1.761 1.051 [?] 10b9c: inc %i5
...
So that looks good. Replacing the strong prefetch with a weak prefetch:
void weak(int*);
void strong(int*);
void main()
{
for (int i=0; i<1000000; i++)
{
weak(0);
}
}
And then repeat the profiling experiment:
Excl. Excl.
User CPU Sys. CPU
sec. sec.
...
0. 0. [?] 10b90: clr %o0
0. 0. [?] 10b94: nop
0. 0. [?] 10b98: prefetch [%o0], #n_writes
0. 0. [?] 10b9c: inc %i5
...
Running the code under timex:
$ timex a.out real 0.01 user 0.00 sys 0.00
So that got rid of the system time. Of course with this change the prefetches issued will now be dropped if there is a TLB miss, and it is possible that could cause a slowdown for the loads in the loop - which is a trade-off. However, examining the application using pmap -xs <pid> indicated that the data resided on 4MB pages, so the number of TLB misses should be pretty low (this was confirmed by the trapstat data I gathered earlier).
One final comment is that this behaviour is platform dependent. Running the same strong prefetch code on an UltraSPARC T2, for example, shows no system time.
Posted at 11:03AM Apr 02, 2009 by Darryl Gove in Sun |
NUMA, binding, and OpenMP
One of my colleagues did an excellent bit of analysis recently, it pulls together a fair number of related topics, so I hope you'll find it interesting.
We'll start with NUMA. Non-Uniform Memory Access. This is in contrast to UMA - Uniform Memory Access. This relates to memory latency - how long does it take to get data from memory to the processor. If you take a single CPU box, the memory latency is basically a measurement of the wires between the processor and the memory chips, it typically is about 90ns, can be as low as 60ns. For a 3GHz chip this is from around 200 to 300 cycles, which is a fair length of time.
Suppose we add a second chip into the system. The memory latency increases because there's now a bunch of communication that needs to happen between the two chips. The communication consists of things like checking that more recent data is not in the cache of the other chip, co-ordinating access to the same memory bank, accessing memory that is controlled by the other processor. The upshot of all this is that memory latency increases. However, that's not all.
If you have two chips together with a bunch of memory, you can have various configurations. The most likely one is that each chip gets half the memory. If one chip has to access memory that the other chip owns, this is going to take longer than if the memory is attached to that chip. Typically you might find that local memory takes 90ns to access, and remote memory 120ns.
One way of dealing with this disparity is to interleave the memory, so one cacheline will be local, the next remote. Doing this you'll see an average memory latency of 105ns. Although the memory latency is longer than the optimal, there's nothing a programmer (or an operating system) can do about it.
However, those of use who care about performance will jump through hoops of fire to get that lower memory latency. Plus as the disparity in memory latency grows larger, it makes less and less sense to average the cost. Imagine a situation on a large multi-board system where the on-board memory latency might be 150ns, but the cross-board latency would be closer to 300ns (I should point out that I'm using top-of-the-head numbers for all latencies, I'm not measuring them on any systems). The impact of doing this averaging could be a substantial slow-down in performance for any application that doesn't fit into cache (which is most apps). (There are other reasons for not doing this, such as limiting the amount of traffic that needs to go across the various busses.)
So most systems with more than one CPU will see some element of NUMA. Solaris has contained some memory placement optimisations MPO since Solaris 9. These optimisations attempt to allocate memory locally to the processor that is running the application. OpenSolaris has the lgrpinfo command that provides an interface to see the levels of memory locality in the system.
Solaris will attempt to schedule threads so that they remain in their locality group - taking advantage of the local memory. Another way of controlling performance is to use binding to keep processes, or threads on a particular processor. This can be done through the pbind command. Processor sets can performance a similar job (as can zones, or even logical domains), or directly through processor_bind.
Binding can be a tricky thing to get right. For example in a system where there are multiple active users, it is quite possible to end up in a situation where one virtual processor is oversubscribed with processes, whilst another is completely idle. However, in situations where this level of control enables better performance then binding can be hugely helpful.
One situation where binding is commonly used is for running OpenMP programs. In fact, it is so common that the OpenMP library has built in support for binding through the environment variable SUNW_MP_PROCBIND. This variable enables the user to specify which threads are bound to which logical processors.
It is worth pointing out that binding does not just help memory locality issues. Another situation where binding helps is thread migrations. This is the situation where an interrupt, or another thread requires attention and this causes the thread currently running on the processor to be descheduled. In some situations the descheduled thread will get scheduled onto another virtual processor. In some instances that may be the correct decision. In other instances it may result in lower than expected performance because the data that the thread needs is still in the cache on the old processor, and also because the migration of that thread may cause a cascade of migrations of other threads.
The particular situation we hit was that one code when bound showed bimodal distributions of runtimes. It had a 50% chance of running fast or slow. We were using OpenMP as well as the SUNW_MP_PROCBIND environment variable, so in theory we'd controlled for everything. However, the program didn't hit the parallel section until after a few minutes of running, and examining what was happening using both pbind and also the Performance Analyzer indicated what the problem was.
The environment variable SUNW_MP_PROCBIND currently binds threads once the code reaches the parallel region. Until that point the process is unbound. Since the process is unbound, Solaris can schedule it to any available virtual CPU. During the unbound time, the process allocated the memory that it needed, and the MPO feature of Solaris ensured that the memory was allocated locally. I'm sure you can see where this is heading.... Now, once the code hit the parallel region, the binding occurred, and the main thread was bound to a particular locality group, half the time this group was the same group where it had been running before, and half the time it was a different locality group. If the locality group was the same, then memory would be local, otherwise memory would be remote (and performance slower).
We put together a LD_PRELOAD library to prove it. The following code has a parallel section in it which gets called during initialisation. This ensures that binding has already taken place by the time the master thread starts.
#include <stdio.h>
#pragma init(s)
void s()
{
#pragma omp parallel sections
{
#pragma omp section
{
printf("Init");
}
}
}
The code is compiled and used with:
$ cc -O -xopenmp -G -Kpic -o par.so par.c $ LD_PRELOAD=./par.so ./a.out
Posted at 12:08AM Apr 01, 2009 by Darryl Gove in Sun |
The Developer's Edge available in hardcopy
The Developer's Edge is now available as hardcopy!
It is being made available as print-on-demand. You can either order through the publisher Vervante, or you can order through Amazon.
However, I suggest you wait until next week before ordering as the current cover art is not the final version (you can play spot the difference between the image on the left and the one on the Vervante website). I'll post when it gets fixed. Of course, you can order the "limited-edition" version if you want 
I introduced the book in a previous post. I'll reproduce a bit more of the details in this post. The brief summary is:
The Developer's Edge: Selected Blog Posts and Articles focuses on articles in the following areas:
- Native language issues
- Performance and improving performance
- Specific features of the x86 and SPARC processors
- Tools that are available in the Solaris OS
The articles provide a broad overview on a topic, or an in-depth discussion. The texts should provide insights into new tools or new methods, or perhaps the opportunity to review a known domain in a new light.
You can get more details than this from the Safari site, either reading the preface or skimming the table of contents
I would love to hear feedback on this book, feel free to e-mail me directly, leave comments on amazon, or leave comments on this blog, or on the blogs of the other contributors.
Posted at 08:00AM Mar 27, 2009 by Darryl Gove in Sun |
When to add a membar (an example)
I was recently having a discussion on one of the OpenSolaris lists on the topic of when to use the volatile keyword, and when it was necessary to use membars.
So volatile is a clue to the compiler to load the variable from memory and immediately store it back to memory. What it does not do is to tell the hardware anything. So the application can perform the store, but that store may not be immediately visible to the rest of the system. Most of the time this is not a problem - so long as the store is visible to the processor on which the thread is executing it's fine. Variability of when the store is visible to other processors may also be fine. There is one clear situation where the ordering of store operations could be a problem - and that's unlocking mutexes.
The problem here is best illustrated by the following scenario. I lock some data structure, then store new values into it, then unlock the structure. Immediately another thread comes along and uses the values in that structure. Not an uncommon situation. Unlocking a mutex is often just a case of storing a value (of zero) into the mutex structure. And here's the potential problem. In some weaker ordering architectures there is no guarantee that other processors see the stores in the same order that they are performed. So if you have Store A followed by Store B it may be possible for other processors to observe the change in the value of B before they see the change in the value of A. In the case of mutex unlock, the store of B would be the action that unlocked the mutex, enabling other threads to access the variable A... and there could be problems if they see the old value of A.
The solution to this is to put a membar in before the store that unlocks the mutex. You can see this happening in the OpenSolaris code:
41 /*
42 * lock_clear(lp)
43 * - clear lock.
44 */
45 ENTRY(_lock_clear)
46 membar #LoadStore|#StoreStore
47 retl
48 clrb [%o0]
49 SET_SIZE(_lock_clear)
The membar ensures that all the pending stores are visible to other processors before the store that releases the lock becomes visible to them.
Posted at 12:00AM Mar 27, 2009 by Darryl Gove in Sun | Comments[2]



