Darryl Gove's blog

Friday Jun 26, 2009

Cost of pointer chasing in libmtmalloc

I thought I'd quickly write up some comments about how to improve the issue with mtmalloc.

The "obvious" solution is to allocate larger chunks of memory. I was using 512 byte objects, so 72KB can allocate 144 of these. That's quite a few, I'm not sure that in general I'd want to allocate say 1024 objects of 512 bytes just in case I needed them. So increasing the chunk size might be a useful workaround, but I think it's rather a sledgehammer solution.

Of course, if I end up allocating 60KB objects, then I hope the code does the right thing and reserves space for several of them. If I need one, I'm quite likely to need a second, and I don't want to be doing pointer chasing on a linked list of single object blocks - that would be very painful. So hopefully there is some kind of scaling of the requestsize for larger objects.

However, the fundamental problem is not actually the number of objects in each allocated chunk, that's what reveals the problem. No, the real problem is the pointer chasing loop to locate a chunk with free space in it. There are several upfront problems with this. Fortunately the two data structures that are used in the pointer chasing (mt-next and mt-nfree) are on the same cache line - although given their offsets, I'm not convinced that this was a design decision. However, the pointer chasing from block to block guarantees that the next pair of values that need to be inspected are not in the cache. Given the fact that we're jumping at least 72KB, there's a good chance that the mapping isn't even in the TLB.

We could argue at this point that there should be a list of free memory so a single pointer step could get us to the next free block, but this approach almost certainly opens up a lot of issues around ensuring thread safety (ie you probably need mutexes) and getting good performance (ie mutexes cost quite a few cycles). So I don't think that's the solution.

I suspect the easiest thing that could be done would be to take the two key fields and place them into an array. So you would have an array of pointers to the chunks interleaved with the counts of the number of free elements in each of the chunks. The advantage of this is that we could identify a chunk with free space without having to stride through memory to do it, we can just scan the array. We'd rarely need multiple TLB entries, and we might even be able to fit the details of multiple chunks on the same cacheline - reducing cache misses substantially (there is an issue of false sharing here, so it may not be entirely feasible), and the other gain would be that we'd be streaming through memory so the hardware might be able to prefetch the next cacheline, or we just just add prefetches if that were necessary.

The programming challenge with this approach would be in the situations where we need to increase the size of the array to allocate more chunks. This should happen rarely, but could be tricky to do safely. But not impossible.

mtmalloc vs umem

A little while back I was looking at the performance of the STL with multithreaded code. I got the opportunity to try this on a particular code, and rather shockingly to me performance was absolutely terrible! I'd linked the code with mtmalloc, and the hottest function in the profile was malloc_internal. I've put together a fake code, and here's the profile from that:

Excl.     Incl.      Name
User CPU  User CPU
   sec.      sec.
266.446   266.446    
258.301   263.084    malloc_internal
  1.661     1.951    free
  1.401     1.401    mutex_lock_impl
  0.961     0.961    mutex_unlock

We can dig into the disassembly of malloc_internal to find out what's going on:

    73.201    73.201            [?]     1724:  cmp         %o5, 0
     1.981     1.981            [?]     1728:  bne         0x1740
     0.320     0.320            [?]     172c:  nop
     0.490     0.490            [?]     1730:  ld          [%i2 + 44], %i2
     1.191     1.191            [?]     1734:  cmp         %i2, 0
     0.901     0.901            [?]     1738:  bne,a       0x1724
## 176.443   176.443            [?]     173c:  ld          [%i2 + 32], %o5

It's not hard to visualise what the original C code would look like:

  while ((ptr->value==0) && (ptr->next!=0)) { ptr=ptr->next; }

Fortunately the source code is searchable and the above loop looks sufficiently similar to line 1032 of mtmalloc.c:

   1032 	while (thiscache != NULL && thiscache->mt_nfree == 0)
   1033 		thiscache = thiscache->mt_next;

So what's going on?

Reading through the source of malloc_internal, it appears that mtmalloc builds up a linked list of chunks of memory for each size of memory request. The size of the chunks of memory is 8KB*requestsize, and requestsize is 9. So basically each chunk of memory is 72KB in size. So when a memory request comes in, malloc_internal looks at the current chunk, and if memory can be allocated from there, then it returns memory from that chunk. If not it goes to the next chunk and so on. This works very well when memory is allocated at once, but as memory gets freed, these chunks of memory become like Swiss-cheese, with lots of holes in them. If a lot of memory of a particular size is requested, then freed, there can be a large number of these chunks in the linked list, and scanning through the chunks to find one with free space can be time consuming. And that is the condition that my test code exercises.

It's probably worth revealing the test code, at this point, so that you can see what it does:

#include <stdlib.h>
typedef struct s
{
  struct s * next;
  char padding[508];
} S;

void main()
{
  struct s * head;
  struct s * keep;
  struct s * current;
  head=0;
  keep=0;
  for (int j=0; j<100; j++)
  {
    for (int i=0; i<100000; i++)
    {
      current=(struct s*)malloc(sizeof(struct s));
      if (i&1)
      {
        current->next=head;
        head=current;
      }
      else
      {
        current->next=keep;
        keep=current;
      }
    }
    current = head;
    while (current!=0)
    {
      struct s * tmp = current;
      current = current -> next;
      free(current);
    }
    head = 0;
  }
}

The code maintains two lists, one that it places memory onto for a long duration, and another list that holds memory for only a short duration. The memory footprint of the code keeps increasing, so more chunks are added to the lists, and holding on to the memory for a long period of time ensures that the chunks end up with lots of gaps in them. The runtime of this code is as follows:

% cc -O mtm.c -lmtmalloc
% timex a.out
real        4:44.18
user        4:33.80
sys            8.70

However there is an API to libmtmalloc that allows us to adjust the size of the chunks. The following changes increase the requestsize from 9 to 20:

#include 
...
  mallocctl(MTCHUNKSIZE,20);
...

The performance reduces from nearly 5 minutes to about 1 minute:

% cc -O mtm.c -lmtmalloc
% timex a.out
real        1:09.10
user        1:01.09
sys            6.53

If we increase the requestsize to 30, performance improves still further:

% cc -O mtm.c -lmtmalloc
% timex a.out
real          38.36
user          31.41
sys            4.96

Of course, libmtmalloc is not the only memory allocator that is optimised for multi-threaded allocation. We also have libumem, compiling the original code to use this results in the following performance:

% cc -O mtm.c -lumem
% timex a.out
real          31.06
user          18.10
sys           10.95

So this is probably a good indication that you will get better performance from libumem if your application allocates and deallocates lots of memory. If you are using libmtmalloc in this role, then you may need to tune the requestsize to a greater number than the default - although this will increase the memory footprint of your application.

Tuesday Jun 23, 2009

Sun Studio 12 Update 1

Sun Studio 12 Update 1 went live yesterday. It's still a free download, and it's got a raft of new features. Many people will have been using the express releases, so they will already be familiar with the improvements.

It's been about two years since Sun Studio 12 came out, and the most obvious change in that time is the prevalence of multicore processors. I figured the easiest way to discern this would be to look at the submissions of SPEC CPU2006 results in that time period. The following chart shows the cummulative number of SPEC CPU2006 Integer speed results over that time broken down by the number of threads that the chip was capable of supporting.

Ok, the first surprising thing about the chart is that there's very few single threaded chips. There were a few results when the suite was launched back in 2006, but nothing much since. What is more apparent is the number of dual-thread chips, that was where the majority of the market was. There were also a number of quad-thread chips at that point. If we fast-forward to the situation today, we can see that the number of dual-thread chips has pretty much leveled off, the bulk of the chips are capable of supporting four threads. But you can see the start of a ramp of chips that are capable of supporting 6 or 8 simultaneous threads.

The relevance of this chart to Sun Studio is that Sun Studio has always been a tool that supports the development of multi-threaded applications. Every release of the product improves on the support in the previous release. Sun Studio 12 Update 1 includes improvements in the compiler's ability to automatically parallelise codes - afterall the easiest way to develop parallel applications is if the compiler can do it for you; improvements to the support of parallelisation specifications like OpenMP, this release includes support for the latest OpenMP 3.0 specification; and improvements in the tools and their ability to provide the developer meaningful feedback about parallel code, for example the ability of the Performance Analyzer to profile MPI code.

Footnote SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Benchmark results stated above reflect results posted on www.spec.org as of 15 June 2009.

Monday Jun 15, 2009

Glasgow Haskell Compiler successfully ported to OpenSPARC

Ben Lippmeier has been working on the native port of the Glasgow Haskell Compiler (GHC) to SPARC. The port was completed a few months back, and since then he's been using an UltraSPARC T2 system to look at thread activity and scaling as the number of threads is increased. The full details of the project are on his GHC on SPARC blog. The latest SPARC/Solaris binary can be downloaded here, although the full set of changes probably won't be available for a couple of months.

Sunday Jun 14, 2009

Audio for JavaOne interview available

A couple of weeks back I recorded an interview where I discussed The Developer's Edge. I've just found the audio up at BlogTalkRadio, it's about 15 minutes in duration.

Friday Jun 12, 2009

Stlport4 and multithreaded code

I finally resolved a problem that's been annoying me for about 3 years. Codes that use the Standard Template Library don't scale to multiple threads.

First off, it's probably good to take a look at a code that illustrates the problem:

#include <vector>

int main()
{
  #pragma omp parallel for default (__auto)
  for (int i=0; i<10000000; i++)
  {
    std::vector<int> v;
    v.push_back(10);
  }
  return(0);
}

The first comparison is between the serial performance of the Solaris default STL and stlport4 which is provided with the compiler.

$ CC -O t1.cc
$ timex a.out
real          15.85
user          15.64
sys            0.01
$ CC -O -library=stlport4 t1.cc
$ timex a.out
real           7.87
user           7.78
sys            0.01

This doesn't tell me anything that I didn't already know. stlport4 is (as far as I know) always faster than the STL provided by Solaris. Hence if you use C++, then you should use stlport4 in preference to the Solaris default. The constraint is that each application (libraries and all) can only use one version of the STL. So if a library that is outside your control uses the Solaris default, then the entire app must use it.

The next thing to investigate is scaling when there are multiple threads:

$ CC -O -xopenmp -library=stlport4 t1.cc
$ timex a.out
real           7.00
user           6.96
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real           7.18
user          14.28
sys            0.01

So compiling the code to use OpenMP caused no performance overhead, but running with two threads had the same runtime as a run with a single thread. We can profile the code to see what's happening:

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
8.076     8.076      
1.571     2.272      mutex_lock_impl
1.501     1.971      mutex_unlock
1.051     4.573      std::vector >::_M_insert_overflow(int*,const int&,const std::__true_type&,unsigned,bool)
0.871     8.076      _$d1A5.main
0.871     3.272      std::__node_alloc<true,0>::_M_allocate(unsigned)
0.560     1.721      std::__node_alloc<true,0>::_M_deallocate(void*,unsigned)
0.480     0.480      sigon
0.440     0.440      mutex_trylock_adaptive
0.250     0.470      mutex_unlock_queue

So the lost time is due to mutex locks, if you dig through the source you'll find that node_alloc has a single mutex lock that only allows a single thread to allocate or deallocate memory. Which is why the code shows no scaling.

This test code is basically creating and destroying vector objects, so it hits the allocate and deallocate routines very hard. Which is why I picked it. Real codes are much less likely to have this problem at quite the same level. It is not unusual to want to create and destroy objects within a loop. One workaround is to hoist the objects out of the hot loops. This works for some instances, but is not a great solution, as even in the best case it makes the code more complex.

The solution I ended up using was to build the Apache STL. It turned out to be a relatively straightforward experience. The compile line is a bit cryptic, I wanted the optimised, multithreaded, 64-bit version and this translates to:

$ gmake BUILDTYPE=12D CONFIG=sunpro.config 

Once I had it built, I could install it with:

$ gmake BUILDTYPE=12D CONFIG=sunpro.config install PREFIX=`pwd`/install

The steps necessary to use a different STL than the ones supplied with the compiler are documented here. The compile line for the test code was:

CC -m64  -O -xopenmp -library=no%Cstd \
   -I ./stdcxx-4.2.1/install/include/ \
   -L ./stdcxx-4.2.1/install/lib/     \
   -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc 

So we can build the test and look at the scaling between one and two threads:

$ export OMP_NUM_THREADS=1
$ timex a.out
real          18.98
user          18.93
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real          18.42
user          36.73
sys            0.01

Which is not, to be honest, a great start, the runtime is slower, and the code still fails to scale. However, the profile is different:

Excl.     Incl.      Name  
User CPU  User CPU         
  sec.      sec.      
21.145    21.145     
 2.572    16.411     std::vector<int,std::allocator<int> >::_C_insert_n(int*const&,unsigned long,const int&)
 2.402     4.293     mutex_unlock
 2.342     3.613     mutex_lock_impl
 1.961    10.697     std::vector<int,std::allocator<int> >::_C_realloc(unsigned long)
 1.681     5.634     free
 1.341     1.891     mutex_unlock_queue
 1.271     1.271     _free_unlocked
 0.991     0.991     sigon

So we still see a lot of mutex activity. Looking at where the mutex activity comes from provides an interesting insight:

(er_print) csingle mutex_lock
Attr.    Excl.     Incl.      Name  
User CPU  User CPU  User CPU         
 sec.      sec.      sec.       
0.170     1.681     5.634      free
0.020     0.690     4.623      malloc
0.190     0.190     0.190     *mutex_lock

So the mutex activity is coming from malloc and free. Which are parts of the default Solaris memory allocator. The default memory allocator is thread safe, but does not give good performance for MT codes. There are two usual alternatives, mtmalloc and libumem. I've usually found mtmalloc to be good enough for me:

CC -m64  -O -xopenmp -library=no%Cstd \
   -I ./stdcxx-4.2.1/install/include/ \
   -L ./stdcxx-4.2.1/install/lib/     \
   -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc -lmtmalloc

Then we can try the timing tests again:

$ export OMP_NUM_THREADS=1
$ timex a.out
real          18.02
user          17.98
sys            0.01
$ export OMP_NUM_THREADS=2
real          13.76
user          27.05
sys            0.01
$ export OMP_NUM_THREADS=4
$ timex a.out
real           6.92
user          26.97
sys            0.02
$ export OMP_NUM_THREADS=8
$ timex a.out
real           3.51
user          26.99
sys            0.02

So the code is now scaling to multiple threads, which was the original problem. We have lost some serial performance, which is perhaps a concern, but that performance loss may be only for a particular code path, and depending on the usage of the library, we might even see gains in some of the algorithms. So depending on the situation, this might be a good enough solution. [FWIW, I also tested with libumem and did not see a significant difference in performance between the two libraries.]

Thursday Jun 11, 2009

Code complete: burn this chapter

That's a somewhat inflammatory title for this post, but continuing from my previous post on the book Code Complete, I think that the chapters (25 & 26) on performance really do not contain good advice or good examples.

To make this more concrete, consider the example on pg 593 where Steve McConnell compares the performance of these two code fragments:

OriginalUnrolled
for i = 1 to 10
  a[ i ] = i
end for







a[ 1 ] = 1
a[ 2 ] = 2
a[ 3 ] = 3
a[ 4 ] = 4
a[ 5 ] = 5
a[ 6 ] = 6
a[ 7 ] = 7
a[ 8 ] = 8
a[ 9 ] = 9
a[ 10 ] = 10

Steve finds that Visual Basic and Java run the unrolled version of the loop faster.

There's a couple of examples that talk about incorrect access ordering for arrays. Here's some C code that illustrates the problem:

Slow codeFast code
for (column=0; column < max_column; column++) 
{
  for (row=0; row < max_row; row++) 
  {
    data[row][column]=stuff();
  }
}
for (row=0; row < max_row; row++) 
{
  for (column=0; column < max_column; column++)
  {
    data[row][column]=stuff();
  }
}

On page 599 it is suggested that the slow code is inefficient because it might cause paging to disk, on page 623 it is suggested that the higher trip count loop should be inside to amortise the initialisation overhead for each execution of the inner loop. Neither of these explanations is right. As I'm sure most of you recognise the code is slow because of cache misses incurred when accessing non-adjacent memory locations. There is a cost to initialisation of the inner loop, but nothing significant, and yes, you could get paging to disk - but only if you are running out of memory (and if you're running out of memory, you're hosed anyway!). You're more likely to get TLB misses (and perhaps that is what Mr McConnell intended to say.

I consider the above issues to be quite serious, but, unfortunately, I'm not terribly happy with the rest of the material. Hence my recommendation to ignore (or burn ;) these chapters. I'll go through my other reservations now.

Lack of details. The timing information is presented with no additional information (pg 623) "C++ Straight Time = 4.75 Code-Tuned Time = 3.19 Time Savings = 33%". What was the compiler? What compiler flags were given? What was the test harness?

The book presents it as somehow that "C++" runs this code slowly, but in reality it's more likely to be a test of the effectiveness of the compiler, and the ability of the user to use the compiler. I'd be surprised if any compiler with minimal optimisation enabled did not do the loop interchange operation necessary to get good performance. Which leads to my next observation:

Don't compilers do this? I think the book falls into one of the common "optimisation book" traps, where lots of ink is spent describing and naming the various optimisations. This gives the false impression that it is necessary for the expert programmer to be able to identify these optimisations and apply them to their program. Most compilers will apply all these optimisations - afterall that is what compilers are supposed to do - take the grudgery out of producing optimal code. It's great for page count to enumerate all the possible ways that code might be restructured for performance, but for most situations the restructuring will lead to code that has the same performance.

Profiling. It's not there! To me the most critical thing that a developer can do to optimise their program is to profile it. Understanding where the time is being spent is the necessary first step towards improving the performance of the application. This omission is alarming. The chapter already encourages users to do manual optimisations where there might be no gains (at the cost of time spent doing restructuring that could be better spent writing new code, and the risk that the resulting code is less maintainable), but without profiling the application, the users are basically encouraged to do this over the entire source code, not just the lines that actually matter.

Assembly language. Yes, I love assembly language, there's nothing I enjoy better than working with it (no comment), but I wouldn't encourage people to drop into it for performance reasons, unless they had utterly exhausted every other option. The book includes an example using Delphi where the assembly language version ran faster than the high-level version. My guess is that the compilers had some trouble with aliasing, and hence had more loads than were necessary - a check of the assembly code that the compilers generated would indicate that, and then it's pretty straight forward to write assembly-language-like high level code that the compiler can produce optimal code for. [Note, that I view reading and analysing the code at the assembly language level to be very useful, but I wouldn't recommend leaping into writing assembly language without a good reason.]

So what would I recommend:

  • Profile. Always profile. This will indicate where the time is being spent, and what sort of gains you should expect from optimising parts of the application.
  • Know the tools. Make sure that you know what compiler flags are available, and that you are requesting the right kind of things from the compiler. All too often there are stories about how A is faster than B, which are due to people not knowing how to use the tools.
  • Identify those parts of the code where the time is spent, and examine them in detail to determine if it's a short coming of the compiler, the compiler flags, or an ambiguity in the source code, that causes time to be spent there. Many performance problems can be solved with by adding a new flag, or perhaps a minor tweak to the source code.
  • Only when you have exhausted all other options, and you know that you can get a significant performance gain should you start wildly hacking at the source code, or recoding parts in assembly language.

The other thing to recommend is a read of Bart Smaalder's Performance Anti-patterns.

Utilising a CMT system

I got asked about how to utilise a CMT system, it's probably not an uncommon question, so I'll post my (somewhat brief) answer here:

The CMT processor appears as a system with many CPUs. These virtual CPUs can be provisioned in the same way as you would with any multiprocessor system:

  • The OS will naturally handle positioning multiple active threads so as to get the optimal performance.
  • If you wish to manually tune this then you can use Solaris tools like processor binding (pbind, or processor_bind) to statically allocate a particular thread or process to a particular core. You can use processor sets (psrset) to restrict a set of processes to a particular set of processors (or to exclude particular processes from using these processors).
  • The machine can be divided into multiple virtual machines either through Solaris Zones, where all zones run the same version of the Solaris operating system. Or through logical domains where multiple different operating systems can be installed onto the same machine.
    • The optimal configuration will depend on the problem to be solved.

      I've actually heard someone argue that multicore processors require a redesign of applications. Um, no. Applications will work just fine. However, multicore processors do give you opportunities to throw more threads at a problem - which can be very useful.

Wednesday Jun 10, 2009

Code complete: coding style

I read Code Complete a couple of years back. It's an interesting book to read, but there were two parts that annoyed me. I was giving a presentation the other week on "Coding for performance" and I happened to mention the book, and say that I had these two reservations. So I figure I should write them up more formally.

My first issue was, basically, me just been a stuck in the mud. Those of you who regularly read my blog will see that I favour the following style of indenting:

  if (some condition)
  {
     do something;
  }

If you've read Solaris Application programming, you'll see that I actually use quite a few styles. In writing that book, there were particular places where there was limited space on the page and I ended up juggling the style to make it fit the medium. So I have preferences, but I'm pragmatic.

Anyway, CC on page 746 says identifies my preferred style as "unindented begin-end pairs" and says the following "Although this approach looks fine, it violates the Fundamental Theorem of Formatting; it doesn't show the logical structure of the code.".

Bother.

So I wanted to read up more details on this Fundamental Theorem, perhaps I'm misreading the text, but this is how it appeared to me (pg 739) "A control construct in Visual Basic always has a beginning statement ... and it always has a corresponding End statement." (pg 740) "The controversy about formatting control blocks arises in part from the fact that some languages don't require block structures." (pg 740) "Uncoupling begin and end from the control structure - as languages like C++ and Java do - with { and } - leads to questions about where to put the begin and end. Consequently, many indentation problems are problems only because you have to compensate for poorly designed language structures." [Emphasis mine.] I read this as, basically, you need to make your untidy C/C++/Java code look more like VB. I guess that's why it's taken me a couple of years to calm down sufficiently to post this ;)

Actually, I'm not far from disagreeing. But let's return to this point in a moment. Let's start with the two approaches to indenting that are recommended in the book.

First of all, what is probably the most common style "pure block emulation":

  if (something) {
    do something;
  }

The other recommended style is "begin and end as block boundaries":

  if (something)
    {
    do something;
    }

On page 745, a study by Hansen and Yim (1987) indicates that there's no difference in understandability between these two styles. Excellent - so it doesn't matter! I'm sure that if "unindented begin-end pairs" were also included in the survey then it too would provide indistinguishable understandability.

Anyway, the differences between the recommended "begin and end as block boundaries" and the shunned "unindented begin-end pairs" is basically four spaces, which I don't personally think is a lot.

Heading back to why I might actually agree with some of his comments. It is very easy to introduce a bug in a program where the begin and end braces have been omitted. For example:

BeforeAfter
  if ( a > max )
    max = a;

  if ( a > max )
    printf("New max = %i\n",a);
    max = a;

So, whilst I agree that the absence of brackets can be a problem, I don't necessarily think that rigid adherence to a particular style naturally follows as the only solution to that problem.

I do have some rules that I tend to obey:

  • Indenting is a personal/project preference. There are tools out there that can render source code pretty much how you like it. The UI is a view of the source, and it doesn't really matter what the style of the source is. If I find the source hard to read, then I can process it to make it conform to what ever layout works best for me to solve the problem that I'm working on.
  • Always use begin and end brackets. They add a single character and can avoid the problem demonstrated above.
  • I tend to favour clarity over a rigid adherence to particular styles. I'm not above placing an entire statement on a single line when I feel that it is the best way to present the information. Taking the previous example:
    Multi-lineSingle line
      if ( a > max ) {
        max = a;
      }
    
    
      if ( a > max ) { max = a; }
    
    

Wednesday Jun 03, 2009

Secure programming paper

Excellent paper from Joep Vesseur on secure programming.

The Developer's Edge talk in Second Life

Just finished talking in Second Life. The slides from the talk are available from SLX. I've got into the habit of writing a transcript for my SL presentations - basically in case the audio fails for some reason.

The talk focuses a bit more on the way that people now get information (through blog posts, articles, indexed by search engines) and the Q&A after the talk was more about that than the technical content of the book. This is a domain that I've given a fair amount of thought to. When writing technical books there is a challenge to balance the information so that it includes the necessary details without writing material that will be out of date by the time that the book hits the press. Fortunately a large amount of the information that developers need is relatively long lived. The challenges come when describing a particular revision of the software, or a particular processor - details which can be very useful for people, but also details which may not age gracefully!

Calendar

Search this blog

About

Solaris Application Programming

Book resources

The Developer's Edge

Book resources

OpenSPARC Internals

Book resources

Recent entries

Custom search

Tag cloud

book cmt communityone compiler cooltools cpu2006 dtrace gcc libraries linker multithreading openmp opensolaris opensparc optimisation optimization parallelisation parallelization performance performanceanalyzer programming secondlife solaris solarisapplicationprogramming sparc spot sunstudio ultrasparc ultrasparct2 x86

Links

Webcasts

Articles

Presentations

Interesting docs

Navigation

Referers

Feeds