Darryl Gove's blog

Monday Oct 19, 2009

Fishing with cputrack

I'm a great fan of the hardware performance counters that you find on most processors. Often you can look at the profile and instantly identify what the issue is. Sometimes though, it is not obvious, and that's where the performance counters can really help out.

I was looking at one such issue last week, the performance of the application was showing some variation, and it wasn't immediately obvious what the issue was. The usual suspects in these cases are:

  • Excessive system time
  • Process migration
  • Memory placement
  • Page size
  • etc.

Unfortunately, none of these seemed to explain the issue. So I hacked together the following script cputrackall which ran the test code under cputrack for all the possible performance counters. Dumped the output into a spreadsheet, and compared the fast and slow runs of the app. This is something of a "fishing trip" script, just gathering as much data as possible in the hope that something leaps out, but sometimes that's exactly what's needed. I regularly get to sit in front of a new chip before the tools like ripc have been ported, and in those situations the easiest thing to do is to look for hardware counter events that might explain the runtime performance. In this particular instance, it helped me to confirm my suspicion that there was a difference in branch misprediction rates that was causing the issue.

Friday Oct 09, 2009

Webcast: Improving the performance of parallel codes using the Performance Analyzer

Earlier in the summer I recorded a slidecast on using the Performance Analyzer on parallel codes, it's just come out on the HPC portal.

Monday Sep 28, 2009

Updated compiler flags article

Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.

Monday Sep 21, 2009

Haskell (GHC) on UltraSPARC T2

Ben Lippmeier gave an excellent presentation at the recent Haskell conference in Edinburgh on his work on porting the Glasgow Haskell Compiler (GHC) back to SPARC. A video of the talk is available.

Update:Link to slides

Wednesday Jul 01, 2009

Introduction to parallel programming

My colleague, Ruud van der Pas, has recorded a series of seven webcasts on parallel programming which will be released on the HPC portal. Ruud is an expert on parallel programming, and one of the authors of the book "Using OpenMP".

Friday Jun 26, 2009

mtmalloc vs umem

A little while back I was looking at the performance of the STL with multithreaded code. I got the opportunity to try this on a particular code, and rather shockingly to me performance was absolutely terrible! I'd linked the code with mtmalloc, and the hottest function in the profile was malloc_internal. I've put together a fake code, and here's the profile from that:

Excl.     Incl.      Name
User CPU  User CPU
   sec.      sec.
266.446   266.446    
258.301   263.084    malloc_internal
  1.661     1.951    free
  1.401     1.401    mutex_lock_impl
  0.961     0.961    mutex_unlock

We can dig into the disassembly of malloc_internal to find out what's going on:

    73.201    73.201            [?]     1724:  cmp         %o5, 0
     1.981     1.981            [?]     1728:  bne         0x1740
     0.320     0.320            [?]     172c:  nop
     0.490     0.490            [?]     1730:  ld          [%i2 + 44], %i2
     1.191     1.191            [?]     1734:  cmp         %i2, 0
     0.901     0.901            [?]     1738:  bne,a       0x1724
## 176.443   176.443            [?]     173c:  ld          [%i2 + 32], %o5

It's not hard to visualise what the original C code would look like:

  while ((ptr->value==0) && (ptr->next!=0)) { ptr=ptr->next; }

Fortunately the source code is searchable and the above loop looks sufficiently similar to line 1032 of mtmalloc.c:

   1032 	while (thiscache != NULL && thiscache->mt_nfree == 0)
   1033 		thiscache = thiscache->mt_next;

So what's going on?

Reading through the source of malloc_internal, it appears that mtmalloc builds up a linked list of chunks of memory for each size of memory request. The size of the chunks of memory is 8KB*requestsize, and requestsize is 9. So basically each chunk of memory is 72KB in size. So when a memory request comes in, malloc_internal looks at the current chunk, and if memory can be allocated from there, then it returns memory from that chunk. If not it goes to the next chunk and so on. This works very well when memory is allocated at once, but as memory gets freed, these chunks of memory become like Swiss-cheese, with lots of holes in them. If a lot of memory of a particular size is requested, then freed, there can be a large number of these chunks in the linked list, and scanning through the chunks to find one with free space can be time consuming. And that is the condition that my test code exercises.

It's probably worth revealing the test code, at this point, so that you can see what it does:

#include <stdlib.h>
typedef struct s
{
  struct s * next;
  char padding[508];
} S;

void main()
{
  struct s * head;
  struct s * keep;
  struct s * current;
  head=0;
  keep=0;
  for (int j=0; j<100; j++)
  {
    for (int i=0; i<100000; i++)
    {
      current=(struct s*)malloc(sizeof(struct s));
      if (i&1)
      {
        current->next=head;
        head=current;
      }
      else
      {
        current->next=keep;
        keep=current;
      }
    }
    current = head;
    while (current!=0)
    {
      struct s * tmp = current;
      current = current -> next;
      free(current);
    }
    head = 0;
  }
}

The code maintains two lists, one that it places memory onto for a long duration, and another list that holds memory for only a short duration. The memory footprint of the code keeps increasing, so more chunks are added to the lists, and holding on to the memory for a long period of time ensures that the chunks end up with lots of gaps in them. The runtime of this code is as follows:

% cc -O mtm.c -lmtmalloc
% timex a.out
real        4:44.18
user        4:33.80
sys            8.70

However there is an API to libmtmalloc that allows us to adjust the size of the chunks. The following changes increase the requestsize from 9 to 20:

#include 
...
  mallocctl(MTCHUNKSIZE,20);
...

The performance reduces from nearly 5 minutes to about 1 minute:

% cc -O mtm.c -lmtmalloc
% timex a.out
real        1:09.10
user        1:01.09
sys            6.53

If we increase the requestsize to 30, performance improves still further:

% cc -O mtm.c -lmtmalloc
% timex a.out
real          38.36
user          31.41
sys            4.96

Of course, libmtmalloc is not the only memory allocator that is optimised for multi-threaded allocation. We also have libumem, compiling the original code to use this results in the following performance:

% cc -O mtm.c -lumem
% timex a.out
real          31.06
user          18.10
sys           10.95

So this is probably a good indication that you will get better performance from libumem if your application allocates and deallocates lots of memory. If you are using libmtmalloc in this role, then you may need to tune the requestsize to a greater number than the default - although this will increase the memory footprint of your application.

Tuesday Jun 23, 2009

Sun Studio 12 Update 1

Sun Studio 12 Update 1 went live yesterday. It's still a free download, and it's got a raft of new features. Many people will have been using the express releases, so they will already be familiar with the improvements.

It's been about two years since Sun Studio 12 came out, and the most obvious change in that time is the prevalence of multicore processors. I figured the easiest way to discern this would be to look at the submissions of SPEC CPU2006 results in that time period. The following chart shows the cummulative number of SPEC CPU2006 Integer speed results over that time broken down by the number of threads that the chip was capable of supporting.

Ok, the first surprising thing about the chart is that there's very few single threaded chips. There were a few results when the suite was launched back in 2006, but nothing much since. What is more apparent is the number of dual-thread chips, that was where the majority of the market was. There were also a number of quad-thread chips at that point. If we fast-forward to the situation today, we can see that the number of dual-thread chips has pretty much leveled off, the bulk of the chips are capable of supporting four threads. But you can see the start of a ramp of chips that are capable of supporting 6 or 8 simultaneous threads.

The relevance of this chart to Sun Studio is that Sun Studio has always been a tool that supports the development of multi-threaded applications. Every release of the product improves on the support in the previous release. Sun Studio 12 Update 1 includes improvements in the compiler's ability to automatically parallelise codes - afterall the easiest way to develop parallel applications is if the compiler can do it for you; improvements to the support of parallelisation specifications like OpenMP, this release includes support for the latest OpenMP 3.0 specification; and improvements in the tools and their ability to provide the developer meaningful feedback about parallel code, for example the ability of the Performance Analyzer to profile MPI code.

Footnote SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Benchmark results stated above reflect results posted on www.spec.org as of 15 June 2009.

Monday Jun 15, 2009

Glasgow Haskell Compiler successfully ported to OpenSPARC

Ben Lippmeier has been working on the native port of the Glasgow Haskell Compiler (GHC) to SPARC. The port was completed a few months back, and since then he's been using an UltraSPARC T2 system to look at thread activity and scaling as the number of threads is increased. The full details of the project are on his GHC on SPARC blog. The latest SPARC/Solaris binary can be downloaded here, although the full set of changes probably won't be available for a couple of months.

Sunday Jun 14, 2009

Audio for JavaOne interview available

A couple of weeks back I recorded an interview where I discussed The Developer's Edge. I've just found the audio up at BlogTalkRadio, it's about 15 minutes in duration.

Friday Jun 12, 2009

Stlport4 and multithreaded code

I finally resolved a problem that's been annoying me for about 3 years. Codes that use the Standard Template Library don't scale to multiple threads.

First off, it's probably good to take a look at a code that illustrates the problem:

#include <vector>

int main()
{
  #pragma omp parallel for default (__auto)
  for (int i=0; i<10000000; i++)
  {
    std::vector<int> v;
    v.push_back(10);
  }
  return(0);
}

The first comparison is between the serial performance of the Solaris default STL and stlport4 which is provided with the compiler.

$ CC -O t1.cc
$ timex a.out
real          15.85
user          15.64
sys            0.01
$ CC -O -library=stlport4 t1.cc
$ timex a.out
real           7.87
user           7.78
sys            0.01

This doesn't tell me anything that I didn't already know. stlport4 is (as far as I know) always faster than the STL provided by Solaris. Hence if you use C++, then you should use stlport4 in preference to the Solaris default. The constraint is that each application (libraries and all) can only use one version of the STL. So if a library that is outside your control uses the Solaris default, then the entire app must use it.

The next thing to investigate is scaling when there are multiple threads:

$ CC -O -xopenmp -library=stlport4 t1.cc
$ timex a.out
real           7.00
user           6.96
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real           7.18
user          14.28
sys            0.01

So compiling the code to use OpenMP caused no performance overhead, but running with two threads had the same runtime as a run with a single thread. We can profile the code to see what's happening:

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
8.076     8.076      
1.571     2.272      mutex_lock_impl
1.501     1.971      mutex_unlock
1.051     4.573      std::vector >::_M_insert_overflow(int*,const int&,const std::__true_type&,unsigned,bool)
0.871     8.076      _$d1A5.main
0.871     3.272      std::__node_alloc<true,0>::_M_allocate(unsigned)
0.560     1.721      std::__node_alloc<true,0>::_M_deallocate(void*,unsigned)
0.480     0.480      sigon
0.440     0.440      mutex_trylock_adaptive
0.250     0.470      mutex_unlock_queue

So the lost time is due to mutex locks, if you dig through the source you'll find that node_alloc has a single mutex lock that only allows a single thread to allocate or deallocate memory. Which is why the code shows no scaling.

This test code is basically creating and destroying vector objects, so it hits the allocate and deallocate routines very hard. Which is why I picked it. Real codes are much less likely to have this problem at quite the same level. It is not unusual to want to create and destroy objects within a loop. One workaround is to hoist the objects out of the hot loops. This works for some instances, but is not a great solution, as even in the best case it makes the code more complex.

The solution I ended up using was to build the Apache STL. It turned out to be a relatively straightforward experience. The compile line is a bit cryptic, I wanted the optimised, multithreaded, 64-bit version and this translates to:

$ gmake BUILDTYPE=12D CONFIG=sunpro.config 

Once I had it built, I could install it with:

$ gmake BUILDTYPE=12D CONFIG=sunpro.config install PREFIX=`pwd`/install

The steps necessary to use a different STL than the ones supplied with the compiler are documented here. The compile line for the test code was:

CC -m64  -O -xopenmp -library=no%Cstd \
   -I ./stdcxx-4.2.1/install/include/ \
   -L ./stdcxx-4.2.1/install/lib/     \
   -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc 

So we can build the test and look at the scaling between one and two threads:

$ export OMP_NUM_THREADS=1
$ timex a.out
real          18.98
user          18.93
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real          18.42
user          36.73
sys            0.01

Which is not, to be honest, a great start, the runtime is slower, and the code still fails to scale. However, the profile is different:

Excl.     Incl.      Name  
User CPU  User CPU         
  sec.      sec.      
21.145    21.145     
 2.572    16.411     std::vector<int,std::allocator<int> >::_C_insert_n(int*const&,unsigned long,const int&)
 2.402     4.293     mutex_unlock
 2.342     3.613     mutex_lock_impl
 1.961    10.697     std::vector<int,std::allocator<int> >::_C_realloc(unsigned long)
 1.681     5.634     free
 1.341     1.891     mutex_unlock_queue
 1.271     1.271     _free_unlocked
 0.991     0.991     sigon

So we still see a lot of mutex activity. Looking at where the mutex activity comes from provides an interesting insight:

(er_print) csingle mutex_lock
Attr.    Excl.     Incl.      Name  
User CPU  User CPU  User CPU         
 sec.      sec.      sec.       
0.170     1.681     5.634      free
0.020     0.690     4.623      malloc
0.190     0.190     0.190     *mutex_lock

So the mutex activity is coming from malloc and free. Which are parts of the default Solaris memory allocator. The default memory allocator is thread safe, but does not give good performance for MT codes. There are two usual alternatives, mtmalloc and libumem. I've usually found mtmalloc to be good enough for me:

CC -m64  -O -xopenmp -library=no%Cstd \
   -I ./stdcxx-4.2.1/install/include/ \
   -L ./stdcxx-4.2.1/install/lib/     \
   -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc -lmtmalloc

Then we can try the timing tests again:

$ export OMP_NUM_THREADS=1
$ timex a.out
real          18.02
user          17.98
sys            0.01
$ export OMP_NUM_THREADS=2
real          13.76
user          27.05
sys            0.01
$ export OMP_NUM_THREADS=4
$ timex a.out
real           6.92
user          26.97
sys            0.02
$ export OMP_NUM_THREADS=8
$ timex a.out
real           3.51
user          26.99
sys            0.02

So the code is now scaling to multiple threads, which was the original problem. We have lost some serial performance, which is perhaps a concern, but that performance loss may be only for a particular code path, and depending on the usage of the library, we might even see gains in some of the algorithms. So depending on the situation, this might be a good enough solution. [FWIW, I also tested with libumem and did not see a significant difference in performance between the two libraries.]

Thursday Jun 11, 2009

Utilising a CMT system

I got asked about how to utilise a CMT system, it's probably not an uncommon question, so I'll post my (somewhat brief) answer here:

The CMT processor appears as a system with many CPUs. These virtual CPUs can be provisioned in the same way as you would with any multiprocessor system:

  • The OS will naturally handle positioning multiple active threads so as to get the optimal performance.
  • If you wish to manually tune this then you can use Solaris tools like processor binding (pbind, or processor_bind) to statically allocate a particular thread or process to a particular core. You can use processor sets (psrset) to restrict a set of processes to a particular set of processors (or to exclude particular processes from using these processors).
  • The machine can be divided into multiple virtual machines either through Solaris Zones, where all zones run the same version of the Solaris operating system. Or through logical domains where multiple different operating systems can be installed onto the same machine.
    • The optimal configuration will depend on the problem to be solved.

      I've actually heard someone argue that multicore processors require a redesign of applications. Um, no. Applications will work just fine. However, multicore processors do give you opportunities to throw more threads at a problem - which can be very useful.

Wednesday May 20, 2009

Libraries (4) - Runtime costs - Procedure Lookup Table (PLT)

Most applications spend the majority of their time running - rather than starting up. So it's useful to look at the costs of using libraries at runtime.

The most apparent cost of using libraries is that calls to routines now go indirectly to the target routine through the procedure look up table (PLT). Unless the developer explicitly limits the scope of a function, it is exported from the library as a global function, which means that even calls within the library will go through the PLT. Consider the following code snippet:

void func2()
{
 ...
}

void func1()
{
   func2();
}

If this is compiled into an executable the assembly code will look like:

func1()
        11104:  82 10 00 0f  mov        %o7, %g1
        11108:  7f ff ff f8  call       func2   ! 0x110e8
        1110c:  9e 10 00 01  mov        %g1, %o7

However, if this is compiled as part of a library then the code looks like:

func2()
         664:  82 10 00 0f  mov         %o7, %g1
         668:  40 00 40 b9  call        .plt+0x3c       ! 0x1094c
         66c:  9e 10 00 01  mov         %g1, %o7

This is a doubling of the cost of the call.

In C it's possible to limit the scope of the function using the static keyword. Declaring func1 as static will cause the compiler to generate a direct call to that routine. The downside is that the routine will only be visible within the source file that defines it. It is also possible to use other methods to limit the visibility of symbols.

Tuesday May 19, 2009

Developer's Edge safari rerelease

We've just pushed a new version of The Developer's Edge to safari. The original version didn't show any of the text from each section of the book unless you logged into the safari site. The new version shows the snippet from each section even if you're not a subscriber.

I was pleased to see that the book is featured on the Sun Studio developer portal.

I'm also scheduled to give a second life presentation during JavaOne at 9am PST on the 3rd June.

Thursday Apr 30, 2009

Computer organization and design

Just returned from Europe - customer visits and the OpenSPARC workshop in Brussels. Since I was doing a fair amount of air travel I took a number of books. With good timing, I'd just got a copy of Patterson & Hennessy's Computer Organisation and Design.

The book is certainly an interesting read. Although there are various ways you might read the book - or various lessons you might extract from it, I took it as basically a book that describes how to build a MIPS processor. Much of the work I do is at the hardware-software border, so I found it useful to read around the domain a bit more. The book is a relatively easy read, lots of detail (which I didn't need to memorise), and a nice pace that builds up from a simple core into a more complete implementation of a processor. The book has a chapter on multicore, but this was not treated to the same depth. There's also some material about GPUs, and that also didn't fit very well.

Friday Mar 27, 2009

The Developer's Edge available in hardcopy

The Developer's Edge is now available as hardcopy!

It is being made available as print-on-demand. You can either order through the publisher Vervante, or you can order through Amazon.

However, I suggest you wait until next week before ordering as the current cover art is not the final version (you can play spot the difference between the image on the left and the one on the Vervante website). I'll post when it gets fixed. Of course, you can order the "limited-edition" version if you want :)

I introduced the book in a previous post. I'll reproduce a bit more of the details in this post. The brief summary is:


The Developer's Edge: Selected Blog Posts and Articles focuses on articles in the following areas:

  • Native language issues
  • Performance and improving performance
  • Specific features of the x86 and SPARC processors
  • Tools that are available in the Solaris OS

The articles provide a broad overview on a topic, or an in-depth discussion. The texts should provide insights into new tools or new methods, or perhaps the opportunity to review a known domain in a new light.


You can get more details than this from the Safari site, either reading the preface or skimming the table of contents

I would love to hear feedback on this book, feel free to e-mail me directly, leave comments on amazon, or leave comments on this blog, or on the blogs of the other contributors.

Thursday Mar 26, 2009

OpenSPARC workshop - Brussels

April 24th and 25th I'm going to be in Brussels rerunning the OpenSPARC workshop. The workshop leverages the material in the OpenSPARC Internals book, together with the OpenSPARC presentations. There's a poster with all the details (for those with acute eyesight, the poster on the left is from the December workshop!).

Friday Mar 20, 2009

University of Washington Presentation

I was presenting at the University of Washington, Seattle, on Wednesday on Solaris and Sun Studio. The talk covers the tools that are available in Solaris and Sun Studio. This is my "Grand Unified" presentation, it covers tools, compilers, optimisation, parallelisation, and debug.

Thursday Mar 05, 2009

Peak rate of window spill and fill traps

I've been looking at the performance of a code recently. Written in C++ using many threads. One of the things with C++ is that as a language it encourages developers to have lots of small routines. Small routines lead to many calls and returns; and in particular they lead to register window spills and fills. Read more about register windows. Anyway, I wondered what the peak rate of issuing register window spills and fills was?

I'm going to use some old code that I used to examine the cost of library calls a while back. The code uses recursion to reach a deep call depth. First off, I define a couple of library routines:

extern int jump2(int count);

int jump1(int count)
{
  count--;
  if (count==0)
  {
    return 1;
  }
  else
  {
    return 1+jump2(count);
  }
}
and
int jump1(int count);

int jump2(int count)
{
  count--;
  if (count==0)
  {
    return 1;
  }
  else
  {
    return 1+jump1(count);
  }
}

I can then turn these into libraries:

 
$ cc -O -G -o libjump1.so jump1.c
$ cc -O -G -o libjump2.so jump2.c

Done in this way, both libraries have hanging dependencies:

$ ldd -d libjump1.so
        symbol not found: jump2         (./libjump1.so)

So the main executable will have to resolve these. The main executable looks like:

#include <stdio.h>

#define RPT 100
#define SIZE 600

extern int jump1(int count);

int main()
{
  int index,count,links,tmp;
  tmp=jump1(100);
  #pragma omp parallel for default(none) private(count,index,tmp,links)
  for(links=1; links<10000; links++)
  {
   for (count=0; count<RPT; count++)
   {
     for (index=0;index<SIZE;index++) 
     {
       tmp=jump1(links);
       if (tmp!=links) {printf("mismatch\n");}
     }
   }
  }
}

This needs to be compiled in such a way as to resolve the unresolved dependencies

$ cc -xopenmp=noopt -o par par.c -L. -R. -ljump1 -ljump2

Note that I'm breaking rules again by making the runtime linker look for the dependent libraries in the current directory rather than use $ORIGIN to locate them relative to the executable. Oops.

I'm also using OpenMP in the code. The directive tells the compiler to make the outer loop run in parallel over multiple processors. I picked 10,000 for the trip count so that the code would run for a bit of time, so I could look at the activity on the system. Also note that the outer loop defines the depth of the call stack, so this code will probably cause stack overflows at some point, if not before. Err...

I'm compiling with -xopenmp=noopt since I want the OpenMP directive to be recognised, but I don't want the compiler to use optimisation, since if the compiler saw the code it would probably eliminate most of it, and that would leave me with nothing much to test.

The first thing to test is whether this generates spill fill traps at all. So we run the application and use trapstat to look at trap activity:

vct name           |   cpu13  
----------------------------------------
...
 84 spill-user-32       |  3005899  
...
 c4 fill-user-32        |  3005779 
...

So on this 1.2GHz UltraSPARC T1 system, we're getting 3,000,000 traps/second. The generated code is pretty plain except for the save and restore instructions:

jump1()
         21c:  9d e3 bf a0  save        %sp, -96, %sp
         220:  90 86 3f ff  addcc       %i0, -1, %o0
         224:  12 40 00 04  bne,pn      %icc,jump1+0x18 ! 0x234
         228:  01 00 00 00  nop       

         22c:  81 c7 e0 08  ret       
         230:  91 e8 20 01  restore     %g0, 1, %o0

         234:  40 00 00 00  call        jump2
         238:  01 00 00 00  nop       
         23c:  81 c7 e0 08  ret       
         240:  91 ea 20 01  restore     %o0, 1, %o0

So you can come up with an estimate of 300 ns/trap.

The reason for using OpenMP is to enable us to scale the number of active threads. Rerunning with 32 threads, by setting the environment variable OMP_NUM_THREADS to be 32, we get the following output from trapstat:

vct name                |    cpu21    cpu22    cpu23    cpu24    cpu25    cpu26 
------------------------+-------------------------------------------------------
...
 84 spill-user-32       |  1024589  1028081  1027596  1174373  1029954  1028695
...
 c4 fill-user-32        |   996739   989598   955669  1169058  1020349  1021877

So we're getting 1M traps per thread, with 32 threads running. Let's take a look at system activity using vmstat.

 vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s1 s2 s3 s4   in   sy   cs us sy id
...
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3022  427  812 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3020  428  797 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 2945  457  760 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3147  429 1025 99  1  0
 0 0 0 64800040 504168 0 15  0  0  0  0  0  0  0  0  0 3049  666  820 99  1  0
 0 0 0 64800040 504168 0  1  0  0  0  0  0  0  0  0  0 3044  543  866 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3021  422  798 100 0  0

So there's no system time being recorded - the register spill and fill traps are fast traps, so that's not a surprise.

One final thing to look at is the instruction issue rate. We can use cpustat to do this:

  2.009   6  tick  63117611 
  2.009  12  tick  69622769 
  2.009   7  tick  62118451 
  2.009   5  tick  64784126 
  2.009   0  tick  67341237 
  2.019  17  tick  62836527 

As might be expected from the cost of each trap, and the sparse number of instructions between traps, the issue rate of the instructions is quite low. Each of the four threads on a core is issuing about 65M instructions per second. So the core is issuing about 260M instructions per second - that's about 20% of the peak issue rate for the core.

If this were a real application, what could be done? Well, obviously the trick would be to reduce the number of calls and returns. At a compiler level, that would using flags that enable inlining - so an optimisation level of at least -xO4; adding -xipo to get cross-file inlining; using -g0 in C++ rather than -g (which disables front-end inlining). At a more structural level, perhaps the way the application is broken into libraries might be changed so that routines that are frequently called could be inlined.

The other thing to bear in mind, is that this code was designed to max out the register window spill/fill traps. Most codes will get nowhere near this level of window spills and fills. Most codes will probably max out at about a tenth of this level, so the impact from register window spill fill traps at that point will be substantially reduced.

Friday Jan 23, 2009

OSUM presentation on multi-threaded coding

I'll be giving a presentation titled "Multi-threaded coding for CMT processors" to OSUM members next friday (8am PST). If you are an OSUM member you can read the details here. OSUM stands for Open Source University Meetup - the definition is

"OSUM (pronounced "awesome") is a global community of students that are passionate about Free and Open Source Software (FOSS) and how it is Changing (Y)Our World. We call it a "Meetup" to encourage collaboration between student groups to create an even stronger open source community.".

Tuesday Dec 16, 2008

OpenSPARC Internals available on Amazon

OpenSPARC Internals is now available from Amazon. As well as print-on-demand from lulu, and as a free (after registration) download.

Thursday Nov 13, 2008

How to learn SPARC assembly language

Got a question this morning about how to learn SPARC assembly language. It's a topic that I cover briefly in my book, however, the coverage in the book was never meant to be complete. The text in my book is meant as a quick guide to reading SPARC (and x86) assembly, so that the later examples make some kind of sense. The basics are the instruction format:

[instruction]  [source register 1], [source register 2], [destination register]

For example:

faddd    %f0, %f2, %f4

Means:

%f4 = %f0 + %f2

The other thing to learn that's different about SPARC is the branch delay slot. Where the instruction placed after the branch is actually executed as part of the branch. This is different from x86 where a branch instruction is the delimiter of the block of code.

With those basics out the way, the next thing to do would be to take a look at the SPARC Architecture manual. Which is a very detailed reference to all the software visible implementation details.

Finally, I'd suggest just writing some simple codes, and profiling them using the Sun Studio Performance Analyzer. Use the disassembly view tab and the architecture manual to see how the instructions are used in practice.

Monday Nov 10, 2008

Poster for London workshop

Just got the poster for the OpenSPARC Workshop in London in December.

Wednesday Nov 05, 2008

OpenSPARC workshop in London

I'm thrilled to have been asked to present at the OpenSPARC workshop to be run in London, on December 4 and 5th. I'll be covering the 'software topics'. There's no charge for attending the workshop.

OpenSPARC presentations

As part of the OpenSPARC book, we were asked to provide slideware and to present that slideware. The details of what's available are listed on the OpenSPARC site, and are available for free from the site on wikis.sun.com.

I contributed two sections. I produced the slides and did the voice over for the material on developing for CMT, the accompanying slides are also available. I also did a voice over for someone else's slides on Operating Systems for CMT (again slides available).

The recording sessions were ok, but a bit strange since it was just myself and the sound engineer working in a meeting room in Santa Clara. I get a lot of energy from live presentations, particularly the interactions with people, and I found the setup rather too quiet for my liking.

The Sun Studio presentation was relatively easy. It runs for nearly an hour, and there's a couple of places where I felt that additional slides would have helped the flow. The Operating Systems presentation was much harder as it was trying to weave a story around someone else's slide deck.

OpenSPARC Internals book

The OpenSPARC Internals book has been released. This is available as a free (after registration) pdf or as a print-on-demand book. The book contains a lot of very detailed information about the OpenSPARC processors, and my contribution was a chapter about Sun Studio, tools, and developing for CMT.

Friday Oct 31, 2008

The limits of parallelism

We all know Amdahl's law. The way I tend to think of it is if you reduce the time spent in the hot region of code, the most benefit you can get is the total time that you initially spent there. However, the original setting for the 'law' was parallelisation - the runtime improvement depends on the proportion of the code that can be made to run in parallel.

Aside: When I'm looking at profiles of applications, and I see a hot region of code, I typically consider what the improvement in runtime would be if I entirely eliminated the time spent there, or if I halved it. Then use this as a guide as to whether it's worth the effort of changing the code.

The issue with Amdahl is that it's completely unrealistic to consider parallelisation without considering issues of the synchronisation overhead introduced when you use multiple threads. So let's do that and see what happens. Assume that:

P = parallel runtime
S = Serial runtime
N = Number of threads
Z = Synchronisation cost

Amdahl would give you:

P = S / N

The flaw is that I can keep adding processors, and the parallel runtime keeps getting smaller and smaller - why would I ever stop? A more accurate equation would be something like:

P = S / N + Z*log(N)

This is probably a fair approximation to the cost of synchronisation, some kind of binary tree object that synchronises all the threads. So we can differentiate that:

dP/DN = -S / N^2 + Z / N 

And then solve:

0 = -S / N^2 + Z / N
S / N = Z
N = S / Z

Ok, it's a bit disappointing, you start off with some nasty looking equation, and end up with a ratio. But let's take a look at what the ratio actually means. Let's suppose I reduce the synchronisation cost (Z). If I keep the work constant, then I can scale to a greater number of threads on the system with the lower synchronisation cost. Or, if I keep the number of threads constant, I can make a smaller chunk of work run in parallel.

Let's take a practical example. If I do synchronisation between threads on traditional SMP system, then communication between cores occurs at memory latency. Let's say that's ~200ns. Now compare that with a CMT system, where the synchronisation between threads can occur through the second level cache, with a latency of ~20ns. That's a 10x reduction in latency, which means that I can either use 10x the threads on the same chunk of work, or I can run in parallel a chunk of work that is 10x smaller.

The logical conclusion is that CMT is a perfect enabler of Microparallelism. You have both a system with huge numbers of threads, and the synchronisation costs between threads are potentially very low.

Now, that's exciting!

Thursday Oct 30, 2008

The multi-core is complex meme (but it's not)

Hidden amoungst the interesting stories (Gaming museum opening, Tennant stepping down) on the BBC was this little gem from Andrew Herbert, the head of Microsoft research in the UK.

The article describes how multi-core computing is hard. Here's a snippet of it:

"For exciting, also read 'complicated'; this presents huge programming challenges as we have to address the immensely complex interplay between multiple processors (think of a juggler riding a unicycle on a high wire, and you're starting to get the idea)."

Now, I just happened to see this particular article, but there's plenty of other places where the same meme appears. And yes, writing a multi-threaded application can be very complex, but probably only if you do it badly :) I mean just how complex is:

#pragma omp parallel for

Ok, so it's not fair comparing using OpenMP to parallelise a loop with writing some convoluted juggler riding unicycle application, but let's take a look at the example he uses:

"Handwriting recognition systems, for example, work by either identifying pen movement, or by recognising the written image itself. Currently, each approach works, but each has certain drawbacks that hinder adoption as a serious interface.

Now, with a different processor focusing on each recognition approach, learning our handwriting style and combining results, multi-core PCs will dramatically increase the accuracy of handwriting recognition."

This sounds like a good example (better than the old examples of using a virus scanner on one core whilst you worked on the other), but to me it implies two independent tasks. Or two independent threads. Yes, having a multicore chip means that the two tasks can execute in parallel, but assuming a sufficiently fast single core processor we could use the same approach.

So yes, to get the best from multi-core, you need to use multi-threaded programming (or multi-process, or virtualisation, or consolidation, but that's not the current discussion). But multi-threaded programming, whilst it can be tricky, is pretty well understood, and more importantly, for quite a large range of codes easy to do using OpenMP.

So I'm going to put it the other way around. It's easy to find parallelism in today's environment, from the desktop, through gaming, or numeric codes. There's abundant examples of where many threads are simultaneously active. Where it gets really exciting (for exciting read "fun") is if you start looking at using CMT processors to parallelise things that previously were not practical to run with multiple threads.

Tuesday Oct 28, 2008

Second life - Utilising CMT slides and transcript

Just finished presenting in second life. This time the experience was not so good, my audio cut out unexpected during the presentation, so I ended up having to use chat to present the material. I was very glad that I'd gone to the effort of writing a script before the presentation, however, reading off the screen is not as effective as presenting the material.

Anyway, I found 'statistics' panel in the environment and looking at this indicated that I was down to <1 FPS, with massive lag. Interestingly, after the presentation and once everyone had left the presentation area, the FPS went up to 10-12. The SL program was still maxing out the CPU (as you might expect, I guess there's no reason to back off until the program hits the frame rate of the screen), but much more responsive - things actually happened when I clicked the on-screen controls.

So, I'm sorry for anyone who found the experience frustrating. I did too. And thank you to those people who turned up, and persevered, and particularly to the bunch of people who participated in the discussion at the end.

Anyway, for those who are interesting the slides and transcript for the presentation are available.

Sunday Oct 26, 2008

Second life presentation

I'll be presenting in Second Life on Tuesday 28th at 9am PST. The title of the talk is "Utilising CMT systems".

Thursday Jul 24, 2008

Haskell porting opportunity

There's an opportunity to work on improving the performance of Haskell on SPARC. The project is also going to explore using multiple threads to improve performance. Much more information is contained in the announcement.

Calendar

Search this blog

About

Solaris Application Programming

Book resources

The Developer's Edge

Book resources

OpenSPARC Internals

Book resources

Recent entries

Custom search

Tag cloud

book cmt communityone compiler cooltools cpu2006 dtrace gcc libraries linker openmp opensolaris opensparc optimisation optimization parallelisation parallelization performance performanceanalyzer programming solaris solarisapplicationprogramming sparc spec spot sunstudio t2 ultrasparc ultrasparct2 x86

Links

Webcasts

Articles

Presentations

Interesting docs

Navigation

Referers

Feeds