Darryl Gove's blog

Tuesday Apr 29, 2008

Multicore expo available - Microparallelisation

My presentation "Strategies for improving the performance of single threaded codes on a CMT system" has been made available on the OpenSPARC site.

The presentation discusses "microparallelisation" in the context of parallelising an example loop. Microparallelisation aims to obtain parallelism by assigning small chunks of work to discrete processors. Taking a step back...

With traditional parallelisation the idea is to identify large chunks of work that can be split between multiple processors. The chunks of work need to be large to amortise the synchronisation costs. This usually means that the loops have a huge trip count.

The synchronisation costs are derived from the time it takes to signal that a core has completed its work. The lower the synchronisation costs, the smaller the amount of work needed to make parallelisation profitable.
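
The trade-off can be made concrete with a simple cost model. This is only a sketch under stated assumptions, not something taken from the presentation: the work is assumed to divide evenly between threads, each parallel region is assumed to pay one fixed synchronisation cost, and the names parallel_time and min_profitable_work are hypothetical.

```c
#include <math.h>

/* Assumed model: serial time = work; parallel time = work/threads + sync_cost.
   Units are arbitrary, for example nanoseconds. */
double parallel_time(double work, int threads, double sync_cost)
{
    return work / threads + sync_cost;
}

/* Smallest amount of work for which the parallel version breaks even:
   work > work/threads + sync_cost  <=>  work > sync_cost * threads / (threads - 1). */
double min_profitable_work(int threads, double sync_cost)
{
    if (threads < 2) {
        return HUGE_VAL;  /* with one thread there is nothing to gain */
    }
    return sync_cost * threads / (threads - 1);
}
```

Under this model, two threads with a synchronisation cost of 1000 ns need more than 2000 ns of work in the region before parallelisation pays off; halve the synchronisation cost and the threshold halves with it.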

Now, a CMT processor has two big advantages here. First of all it has many threads. Secondly these threads have low latency access to a shared level of cache. The result of this is that the cost of synchronisation between threads is greatly reduced, and therefore each thread is free to do a smaller chunk of work in a parallel region.

All that's great in theory. The presentation uses some example code to try this out and discovers, rather fortunately, that the idea also works in practice!
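
As a rough illustration of splitting a loop into small chunks, here is a minimal Pthreads sketch. It is not the code from the presentation: the names chunk_t, sum_chunk and chunked_sum are hypothetical, and it creates threads on every call, which is exactly the kind of overhead that makes small chunks unprofitable. An implementation aiming at genuinely small regions would presumably keep worker threads alive and signal them through shared memory.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64

/* One contiguous slice of the loop, plus space for its partial result. */
typedef struct {
    const double *data;
    size_t begin, end;
    double partial;
    int spawned;      /* non-zero if a thread was created for this slice */
} chunk_t;

static void *sum_chunk(void *arg)
{
    chunk_t *c = arg;
    double s = 0.0;
    for (size_t i = c->begin; i < c->end; i++) {
        s += c->data[i];
    }
    c->partial = s;
    return NULL;
}

/* Sum data[0..n) using up to nthreads threads, one slice per thread. */
double chunked_sum(const double *data, size_t n, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    chunk_t chunk[MAX_THREADS];
    double total = 0.0;

    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

    for (int t = 0; t < nthreads; t++) {
        chunk[t].data = data;
        chunk[t].begin = n * (size_t)t / nthreads;
        chunk[t].end = n * ((size_t)t + 1) / nthreads;
        chunk[t].partial = 0.0;
        chunk[t].spawned =
            (pthread_create(&tid[t], NULL, sum_chunk, &chunk[t]) == 0);
        if (!chunk[t].spawned) {
            sum_chunk(&chunk[t]);  /* fall back to doing the slice inline */
        }
    }
    for (int t = 0; t < nthreads; t++) {
        if (chunk[t].spawned) {
            pthread_join(tid[t], NULL);  /* the synchronisation point */
        }
        total += chunk[t].partial;
    }
    return total;
}
```

The join at the end is where the synchronisation cost is paid; the cheaper that step becomes, the smaller each slice can be while still being worth handing to another thread. Depending on the platform, the code may need to be built with the compiler's threading flag.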

The presentation also covers using atomic operations rather than microparallelisation.

In summary, the presentation is more research than solid science, but I hoped that presenting it would get some people thinking about non-traditional ways to extract parallelism from applications. I'm not alone in this area of work; Lawrence Spracklen is also working on it. We're both presenting at CommunityOne next week.

Thursday Apr 24, 2008

Second life slides and script

Just completed the Second Life presentation. It appeared well attended, and I got a bundle of great questions at the end. If you were there, thank you! I've uploaded a screen shot that I managed to get before the presentation started. Unfortunately, I didn't get a picture of the stage setup with the life-size books, a very nice touch.

Thursday Mar 27, 2008

Performance visualization discussion group

Saw this on Neil Gunther's blog, a discussion group for performance visualization. I've no idea whether it will be an interesting group or not, but there are certainly many possibilities for making performance data more readily understandable. So far there look to be about 12 folks signed up to the group.

One visualization tool that I think shows promise is chime, which is built upon dtrace.

Thursday Mar 20, 2008

The much maligned -fast

The compiler flag -fast gets an unfair rap. Even the compiler reports:

cc: Warning: -xarch=native has been explicitly specified, or 
implicitly specified by a macro option, -xarch=native on this 
architecture implies -xarch=sparcvis2 which generates code that 
does not run on pre UltraSPARC III processors

which is hardly fair given that the UltraSPARC III line came out about 8 years ago! So I want to quickly discuss what's good about the option, and what reasons there are to be cautious.

The first thing to talk about is the warning message. -xtarget=native is a good option to use when the target platform is also the deployment platform. For me, this is the common case, but for people producing applications that are more generally deployed, it's not the common case. The best thing to do to avoid the warning and produce binaries that work with the widest range of hardware is to add the flag -xtarget=generic after -fast (compiler flags are parsed from left to right, so the rightmost flag is the one that gets obeyed). The generic target represents a mix of all the important processors, so the resulting code should work well on all of them.

The next option which is in -fast for C that might cause some concern is -xalias_level=basic. This tells the compiler to assume that pointers of different basic types (e.g. integers, floats etc.) don't alias. Most people code to this, and the C standard actually has higher demands on the level of aliasing the compiler can assume. So code that conforms to the C standard will work correctly with this option. Of course, it's still worth being aware that the compiler is making the assumption.
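
A minimal sketch of what the assumption means in practice. The function names are hypothetical, and the second routine assumes a 32-bit float; it is an illustration rather than anything from the compiler documentation.

```c
#include <stdint.h>
#include <string.h>

/* Because int and float are different basic types, the compiler may assume
   that count and value never point at the same memory. It is therefore free
   to keep *count in a register across the store to *value. */
int bump_and_store(int *count, float *value)
{
    *count += 1;
    *value = 0.0f;
    return *count;
}

/* Code that really does need to view a float as an integer can copy the
   bytes instead of casting the pointer; this stays within the C standard. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes 32-bit float");

uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Code that casts an int pointer to a float pointer and passes both to a routine like bump_and_store is the kind of code that could break under this option; code written like float_bits should not.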

The final area is floating point simplification. That's the flags -fsimple=2, which allows the compiler to reorder floating point expressions, -fns, which allows the processor to flush subnormal numbers to zero, and some other flags that use faster floating point libraries or inline templates. I've previously written about my rather odd views on floating point maths. Basically it comes down to this: if these options make a difference to the performance of your code, then you should investigate why they make a difference.
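
To see why reordering matters, here is a small sketch showing that floating point addition is not associative. The two functions (hypothetical names) compute the same sum in the two orders a reassociating compiler might choose between; the sketch assumes IEEE 754 doubles and a build without any such reordering enabled.

```c
/* (a + b) + c: the order written in the source. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* a + (b + c): an order the compiler could pick once it is allowed to
   reorder floating point expressions. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20 and c = 1.0, sum_left gives 1.0, while sum_right gives 0.0 because the 1.0 is lost when it is added to -1e20. Code whose results depend on one particular order is the code that needs care under these flags.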

Since -fast contains a number of flags which impact performance, it's probably a good plan to identify exactly those flags that do make a difference, and use only those. A tool like ats can really help here.

Performance tuning recipe

Dan Berger posted a comment about the compiler flags we'd used for Ruby. Basically, we've not done compiler flag tuning yet, so I'll write a quick outline of the first steps in tuning an application.

  • First of all profile the application with whatever flags it usually builds with. This is partly to get some idea of where the hot code is, but it's also useful to have some idea of what you're starting with. The other benefit is that it's tricky to debug build issues if you've already changed the defaults.
  • It should be pretty easy at this point to identify the build flags. Probably they will flash past on the screen, or in the worst case, they can be extracted (from non-stripped executables) using dumpstabs or dwarfdump. It can be necessary to check that the flags you want to use are actually the ones being passed into the compiler.
  • Of course, I'd certainly use spot to get all the profiles. One of the features spot has that is really useful is to archive the results, so that after multiple runs of the application with different builds it's still possible to look at the old code, and identify the compiler flags used.
  • I'd probably try -fast, which is a macro flag, meaning it enables a number of optimisations that typically improve application performance. I'll probably post later about this flag, since there's quite a bit to say about it. Performance under the flag should give an idea of what to expect with aggressive optimisations. If the flag does improve the application's performance, then you probably want to identify the exact options that provide the benefit and use those explicitly.
  • In the profile, I'd be looking for routines which seem to be taking too long, and I'd be trying to figure out what was causing the time to be spent. For SPARC, the execution count data from bit that's shown in the spot report is vital in being able to distinguish between code that runs slowly and code that runs fast but is executed many times. I'd probably try various other options based on what I saw. Memory stalls might make me try prefetch options; TLB misses would make me tweak the -xpagesize options.
  • I'd also want to look at using crossfile optimisation, and profile feedback, both of which tend to benefit all codes, but particularly codes which have a lot of branches. The compiler can make pretty good guesses on loops ;)

These directions are more a list of possible experiments than necessarily an item-by-item checklist, but they form a good basis. And they are not an exhaustive list...

Monday Mar 17, 2008

auto_ilp32, unsafe for production?

A while back I was reading up on the Intel compiler's -auto_ilp32 flag. This flag produces a binary that enjoys the benefits of the EM64T instruction set extensions, but uses a 32-bit memory model. The idea of the flag is to get the performance from the instruction set while avoiding the increase in memory footprint that comes from 64-bit pointers. I'm totally in favour of the idea; after all, that's the idea behind the v8plus/sparcvis architecture.

However, I was a bit distressed at the details of the implementation. The flag tells the compiler to make two assumptions: firstly that longs can be constrained to 4 bytes (rather than 8), and secondly that pointers can also be held in 4 bytes (rather than 8).

The assumption for longs can be defended on the grounds that if the application works when compiled for IA32, then any longs in the program do fit into 32 bits, so using only 32 bits for longs is OK.

The docs place this restriction on the use of the flag for pointers:

Specifies that the application cannot exceed a 32-bit address space, which allows the compiler to use 32-bit pointers whenever possible.

The idea being that if the application only uses a 32-bit memory range, then the upper bits of the 64-bit pointer are going to be zero anyway - so why store them.

The problem with both of these is that the code ends up being run as if it were a 64-bit application. So the OS thinks the app is 64-bit, so will quite happily pass 8 byte longs to the app, or pass memory that is not in the low 32-bit addressable memory.

To go into a couple of situations where this could be 'bad':

  • Imagine a program which relies on a zero return value from malloc to tell it to stop allocating memory (perhaps it caches data, or maybe the algorithm it chooses depends on how much memory is available). Under -auto_ilp32, the OS keeps returning new memory, so the application thinks it has not run out of memory, but in fact it is getting memory that is no longer addressable using just 32-bits.
  • Consider an application which allocates a chunk of memory to handle a problem. The larger the problem, the more memory is required. At some point the size of the problem will require a chunk of memory that is not entirely 32-bit addressable. Because malloc does not return zero, the application has no way of knowing that the problem is too large.
  • The application takes the pointers to memory that is not 32-bit addressable, and drops the upper 32-bits. Now it has a pointer into 32-bit addressable memory that has already been used for other data, and it starts storing new data on top of the old data.
  • Imagine another situation where the application is profiled, and it's realised that too much time is spent in the memory management library. So the developer produces a new library and gets the application to use this in preference to the system provided library. The only side-effect of this library is that it starts allocating memory at a higher address, which doesn't matter if you have a 64-bit address, but it eats up a sizable amount of the 32-bit addressable space.
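
The pointer truncation described in these scenarios can be sketched as follows. This models addresses as plain integers and uses hypothetical function names; it is not a description of the code the Intel compiler actually generates.

```c
#include <stdint.h>

/* Keep only the low 32 bits of a 64-bit address, which is what storing it
   in a 4-byte pointer amounts to. */
uint32_t truncate_address(uint64_t address)
{
    return (uint32_t)(address & 0xffffffffu);
}

/* Non-zero if the address survives the truncation unchanged. */
int fits_in_32_bits(uint64_t address)
{
    return address <= UINT32_MAX;
}
```

An address such as 0x100001000, just above the 4GB line, truncates to 0x1000. If the application already holds data at 0x1000, stores through the truncated pointer overwrite it, and nothing in the application reports an error.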

In many of these cases the memory corruption problem would just leap out, and the user would know to remove the flag, but there will be some cases where the flag could cause silent data corruption.

But hang on a second ... didn't I just say that SPARC has the same hybrid? Well, not quite. v8plus is, as its name suggests, built on SPARC V8 - so the OS thinks that it is a 32-bit application. Hence longs are 32-bits in size, as are pointers. There's some code necessary to store the additional state of a V9 processor. But basically, the application is a 32-bit app which happens to use a few more instructions.

The thing that's frustrating is that it's quite possible to make the OS aware of the 32-bit/64-bit hybrid, or to engineer a layer to be able to safely run a 32-bit hybrid app with the OS believing it to be a 64-bit app, but that was not the approach taken. A bit of a missed opportunity, IMO.

Friday Mar 07, 2008

Ruby performance gains on SPARC

The programming language Ruby is run on a VM. So the VM is responsible for context switches as well as garbage collection. Consequently, the code contains calls to flush register windows. A colleague of mine, Miriam Blatt, has been examining the code and we think we've found some places where the calls to flush register windows are unnecessary. The code appears in versions 1.8/1.9 of Ruby, but I'll focus on 1.8.* in this discussion.

As outlined in my blog entry on register windows, the act of flushing them is both high cost and rarely needed. The key points at which it is necessary to flush the register windows to memory are on context switches and before garbage collection.

Ruby defines a macro called FLUSH_REGISTER_WINDOWS in defines.h. The macro only does something on IA64 and SPARC, so the changes I'll discuss here are defined so that they leave the behaviour on IA64 unchanged. My suspicion is that the changes are equally valid for IA64, but I lack an IA64 system to check them on.

The FLUSH_REGISTER_WINDOWS macro gets used in eval.c in the EXEC_TAG macro, THREAD_SAVE_CONTEXT macro, rb_thread_save_context routine, and rb_thread_restore_context routine. (There's also a call in gc.c for the garbage collection.)

The first thing to notice is that the THREAD_SAVE_CONTEXT macro calls rb_thread_save_context, so the FLUSH_REGISTER_WINDOWS call in the THREAD_SAVE_CONTEXT macro is unnecessary (the register windows have already been flushed). However, we've not seen this particular flush cause any performance issues in our tests (although it's possible that the tests didn't stress multithreading).

The more important call is the one in EXEC_TAG. This is executed very frequently in Ruby codes, but this flush does not appear to be at all necessary. It is neither a context switch nor the start of garbage collection. Removing this call to flush register windows leads to significant performance gains (upwards of 10% when measured on an older v880 box; some of the benchmarks nearly doubled in performance).

The source code modifications for 1.8.6 are as follows:

$ diff defines.h.orig defines.h.mod
228a229,230
> #  define EXEC_FLUSH_REGISTER_WINDOWS ((void)0)
> #  define SWITCH_FLUSH_REGISTER_WINDOWS ((void)0)
232a235,236
> #  define EXEC_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
> #  define SWITCH_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
234a239,240
> #  define EXEC_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
> #  define SWITCH_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS

The change to defines.h adds new variants of the FLUSH_REGISTER_WINDOWS macro to be used for the EXEC_TAG and THREAD_SAVE_CONTEXT macros. On SPARC they are defined as ((void)0); on every other architecture they are defined as FLUSH_REGISTER_WINDOWS. Since FLUSH_REGISTER_WINDOWS is itself ((void)0) everywhere except SPARC and IA64, the net effect is that the new macros only do something on IA64, which preserves the current behaviour there.

$ diff eval.c.orig eval.c.mod
1025c1025
< #define EXEC_TAG()    (FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
---
> #define EXEC_TAG()    (EXEC_FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
10290c10290
<     (rb_thread_switch((FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))
---
>     (rb_thread_switch((SWITCH_FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))

The changes to eval.c just use the new macros instead of the old FLUSH_REGISTER_WINDOWS call.

These code changes have worked on all the tests we've used (including `gmake test-all`). However, I can't be certain that there is not a workload which requires these flushes. This appears to be the putback that added the flush call to EXEC_TAG, and the comment suggests that the change may not be necessary. I'd love to hear comments either agreeing with the analysis, or pointing out why the flushes are necessary.

Update: to add diff -u output
$ diff -u defines.h.orig defines.h.mod
--- defines.h.orig      Tue Mar  4 16:32:05 2008
+++ defines.h.mod       Wed Mar  5 14:22:06 2008
@@ -226,12 +226,18 @@
        ;
 }
 #  define FLUSH_REGISTER_WINDOWS flush_register_windows()
+#  define EXEC_FLUSH_REGISTER_WINDOWS ((void)0)
+#  define SWITCH_FLUSH_REGISTER_WINDOWS ((void)0)
 #elif defined(__ia64)
 void *rb_ia64_bsp(void);
 void rb_ia64_flushrs(void);
 #  define FLUSH_REGISTER_WINDOWS rb_ia64_flushrs()
+#  define EXEC_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
+#  define SWITCH_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
 #else
 #  define FLUSH_REGISTER_WINDOWS ((void)0)
+#  define EXEC_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
+#  define SWITCH_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
 #endif

 #if defined(DOSISH)
$ diff -u eval.c.orig eval.c.mod
--- eval.c.orig Tue Mar  4 16:32:00 2008
+++ eval.c.mod  Wed Mar  5 14:22:13 2008
@@ -1022,7 +1022,7 @@
 #define PROT_LAMBDA INT2FIX(2) /* 5 */
 #define PROT_YIELD  INT2FIX(3) /* 7 */

-#define EXEC_TAG()    (FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
+#define EXEC_TAG()    (EXEC_FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))

 #define JUMP_TAG(st) do {              \
     ruby_frame = prot_tag->frame;      \
@@ -10287,7 +10287,7 @@
 }

 #define THREAD_SAVE_CONTEXT(th) \
-    (rb_thread_switch((FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))
+    (rb_thread_switch((SWITCH_FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))

 NORETURN(static void rb_thread_restore_context _((rb_thread_t,int)));
 NORETURN(NOINLINE(static void rb_thread_restore_context_0(rb_thread_t,int,void*)));

Flush register windows

The SPARC architecture has an interesting feature called Register Windows. The idea is that the processor should contain multiple sets of registers on chip. When a new routine is called, the processor can give a fresh set of registers to the new routine, preserving the value of the old registers. When the new routine completes and control returns to the calling routine, the register values for the old routine are also restored. The idea is for the chip not to have to save and load the values held in registers whenever a routine is called; this reduces memory traffic and should improve performance.

The trouble with register windows is that each chip can only hold a finite number of them. Once all the register windows are full, the processor has to spill a complete set of registers to memory. This is in contrast with the situation where the program is responsible for spilling and filling registers - the program only need spill a single register if that is all that the routine requires.

Most SPARC processors have about seven sets of register windows, so if the program remains in a call stack depth of about seven, there is no register spill/fill cost associated with calls of other routines. Beyond this stack depth, there is a cost for the spills and fills of the register windows.

The SPARC architecture book contains a more detailed description of register windows in section 5.2.2.

Most of the time software is completely unaware of this architectural decision, in fact user code should never have to be aware of it. There are two situations where software does need to know about register windows, these really only impact virtual machines or operating systems:

  • Context switches. In a context switch the processor changes to executing another software thread, so all the state from that thread needs to be saved for the thread to later resume execution. Note that setjmp and longjmp which are sometimes used as part of code to implement context switching already have the appropriate flushes in them.
  • Garbage collection. Garbage collection involves inspecting the state of the objects held in memory and determining whether each object is live or dead. Live objects are identified by having other live objects point to them. So all the registers need to be stored in memory so that they can be inspected to check whether they point to any objects that should be considered live.

The SPARC V9 instruction flushw will cause the processor to store all the register windows in a thread to memory. For SPARC V8, the same effect is attained through trap x03. Either way, the cost can be quite high since the processor needs to store up to about 7 sets of register windows to memory. Each set is 16 8-byte registers, which results in potentially a lot of memory traffic and cycles.
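
The arithmetic behind that cost can be sketched in a few lines. The figures of 16 registers per window and 8 bytes per register come from the text above; the function names are hypothetical, and spills_for_depth is a deliberately simplified model that assumes one window is spilled for every call level beyond the number held on chip.

```c
#include <stddef.h>

enum { REGS_PER_WINDOW = 16, BYTES_PER_REG = 8 };

/* Bytes written to memory when the given number of windows is flushed. */
size_t flush_bytes(unsigned windows)
{
    return (size_t)windows * REGS_PER_WINDOW * BYTES_PER_REG;
}

/* Simplified model: number of window spills incurred by a call chain that
   descends to the given depth on a chip holding windows_on_chip windows. */
unsigned spills_for_depth(unsigned depth, unsigned windows_on_chip)
{
    return depth > windows_on_chip ? depth - windows_on_chip : 0;
}
```

Flushing seven windows therefore writes 7 x 16 x 8 = 896 bytes, and under this model a call chain ten deep on a seven-window chip pays for three spills on the way down and three fills on the way back.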

Tuesday Feb 26, 2008

Presentation at communityone

Had my presentation on parallelisation accepted for CommunityOne on 5th May in San Francisco.

Thursday Feb 07, 2008

Page size and memory layout

Support for large pages has been available since Solaris 9, and I've previously talked about the various ways that an application can be coaxed into using large pages. However, I wanted to quickly write up how the large pages are laid out in memory. Take the following code that allocates a large chunk of memory, and then iterates over it for enough time to run pmap -xs on it:

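#include <stdlib.h>

/* The outer loop bound is missing from the original listing. REPEATS is an
   assumed value, chosen only to keep the process alive long enough to run
   pmap -xs against it. */
#define REPEATS 100

int main()
{
  int x,y;
  char *c;
  c=(char*)malloc(sizeof(char)*300000000);
  if (c==NULL) { return 1; }
  for (y=0; y<REPEATS; y++)
  for (x=0; x<300000000; x++) { c[x]=c[x]+y;}
  return 0;
}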

Compiling this code to use 4MB pages and then running the resulting executable produces a pmap output like:

% cc -xpagesize=4M t.c
% a.out&
[1] 15501
% pmap -xs 15501
15501:  a.out
 Address  Kbytes     RSS    Anon  Locked Pgsz Mode   Mapped File
00010000       8       8       -       -   8K r-x--  a.out
00020000       8       8       8       -   8K rwx--  a.out
00022000    3960    3960    3960       -   8K rwx--    [ heap ]
00400000  290816  290816  290816       -   4M rwx--    [ heap ]
...

Notice that the heap starts on 8KB pages, and uses these up until the memory reaches a 4MB boundary and then starts using 4MB pages. In this case it means that nearly 4MB of the memory is not using 4MB pages - if this happens to be where the majority of the program's active data resides, then there will still be plenty of TLB misses.
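
The "nearly 4MB" figure can be checked against the pmap output with a little arithmetic. The helper names below are hypothetical; the addresses are the ones shown above.

```c
#include <stdint.h>

/* Round addr up to the next multiple of alignment (a power of two). */
uint64_t align_up(uint64_t addr, uint64_t alignment)
{
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Kilobytes between addr and the next alignment boundary, i.e. the part of
   the heap that cannot be mapped with the large page size. */
uint64_t kbytes_before_boundary(uint64_t addr, uint64_t alignment)
{
    return (align_up(addr, alignment) - addr) / 1024;
}
```

With the heap starting at 0x22000 and 4MB (0x400000) pages, the next boundary is 0x400000, and kbytes_before_boundary gives 3960, matching the 3960 Kbytes of heap that pmap reports on 8K pages.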

Fortunately, it is possible to tell the linker where to start the heap. There are some mapfiles provided in /usr/lib/ld/ for various scenarios; the one that we need is map.bssalign. Recompiling with this produces the following memory layout:

% cc -M /usr/lib/ld/map.bssalign -xpagesize=4M t.c
% a.out&
[1] 19077
% pmap -xs 19077
19077:  a.out
 Address  Kbytes     RSS    Anon  Locked Pgsz Mode   Mapped File
00010000       8       8       -       -   8K r-x--  a.out
00020000       8       8       8       -   8K rwx--  a.out
00400000  294912  294912  294912       -   4M rwx--    [ heap ]

With this change the heap now starts on a 4MB boundary and is entirely mapped with 4MB pages.

Friday Feb 01, 2008

The meaning of -xmemalign

I made some comments on a thread on the forums about memory alignment on SPARC and the -xmemalign flag. I've talked about memory alignment before, but this time the discussion was more about how the flag works. In brief:

  • The flag has two parts -xmemalign=[1|2|4|8][i|s]
  • The number specifies the alignment that the compiler should assume when compiling an object file. So if the compiler is not certain that the current variable is correctly aligned (say it's accessed through a pointer) then the compiler will assume the alignment given by the flag. Take a single precision floating point value that takes four bytes. Under -xmemalign=1[i|s] the compiler will assume that it is unaligned, so will issue four single byte loads to load the value. If the alignment is specified as -xmemalign=2[i|s] the compiler will assume two byte alignment, so will issue two loads to get the four byte value.
  • The suffix [i|s] tells the compiler how to behave if there is a misaligned access. For 32-bit codes the default is i which fixes the misaligned access and continues. For 64-bit codes the default is s which causes the app to die with a SIGBUS error. This is the part of the flag that has to be specified at link time because it causes different code to be linked into the binary depending on the desired behaviour. The C documentation captures this correctly, but the C++ and Fortran docs will be updated.
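
To make the effect of the alignment number concrete, here is a sketch in portable C of what "four single byte loads" amounts to, together with a count of the loads needed for a given assumed alignment. It mimics the idea rather than reproducing compiler output; the function names are hypothetical, and the byte order is big-endian, as on SPARC.

```c
#include <stdint.h>

/* Assemble a 4-byte big-endian value from four single-byte loads. This is
   safe whatever the alignment of p, at the cost of four loads plus shifts. */
uint32_t load_u32_bytewise(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Number of loads needed to fetch a value of the given size when only
   assumed_alignment bytes of alignment can be relied upon. */
unsigned loads_needed(unsigned size, unsigned assumed_alignment)
{
    return assumed_alignment >= size ? 1 : size / assumed_alignment;
}
```

For a four-byte value, loads_needed gives 4 under an assumed alignment of 1, 2 under an alignment of 2, and a single load once the assumed alignment is 4 or more.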

Thursday Jan 31, 2008

Win $20,000!

Sun has announced a Community Innovation Awards Programme - basically $1M of prize money available for various Sun-sponsored open source projects. There is an OpenSPARC programme, and the one that catches my eye is $20k for:

vi. Best Adaptation of a single-thread application to a multi-thread CMT (Chip Multi Threaded) environment

My guess is that they will expect more than the use of -xautopar -xreduction or a few OpenMP directives :) If I were allowed to enter (unfortunately Sun Employees are not) I'd be looking to exploit the features of the T1 or T2:

  • The threads can synchronise at the L2 cache level - so synchronisation costs are low
  • Memory latency is low

The upshot of this should be that it is possible to parallelise applications which traditionally have not been parallelisable because of synchronisation costs.

Funnily enough this is an area that I'm currently working in, and I do hope to have a paper accepted for the MultiExpo.

Bart Smaalders' performance anti-patterns

Bart Smaalders has an excellent article on performance anti-patterns or things not to do. It's a great (and quick) read.

Monday Jan 07, 2008

putc in a multithreaded context

Just answering a question from a colleague. The application was running significantly slower when compiled as a multithreaded app compared to the original serial app. The profile showed mutex_unlock as being hot, but going up the callstack the routine that called mutex_unlock was putc.

This is the OpenSolaris source for putc, which shows a call to FLOCKFILE, which is defined in this file for MT programs. So for MT programs, a lock needs to be acquired before the character can be output.

Fortunately it is possible to avoid the locking using putc_unlocked. This call should not be used as a drop-in replacement for putc, but used after the appropriate mutex has been acquired. The details are in the Solaris Multi-threaded programming guide.
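
A minimal sketch of that pattern, using the POSIX calls flockfile, funlockfile and putc_unlocked: the stream lock is taken once around the whole loop instead of once per character. The function name write_repeated is hypothetical, and this is not the code from the programming guide.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdio.h>

/* Write count copies of ch to stream, holding the stream lock for the
   duration of the loop. Returns the number of characters written. */
long write_repeated(FILE *stream, int ch, long count)
{
    long written = 0;

    flockfile(stream);               /* acquire the lock once */
    for (long i = 0; i < count; i++) {
        if (putc_unlocked(ch, stream) == EOF) {
            break;                   /* stop on the first write error */
        }
        written++;
    }
    funlockfile(stream);             /* release it once */
    return written;
}
```

Other threads writing to the same stream are held off for the whole loop, so the loop should be kept short enough that this is acceptable.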

A test program that demonstrates this problem is:

#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>

/* Timestamp in nanoseconds, as returned by the Solaris gethrtime() call */
static double s_time;

void starttime()
{
  s_time=1.0*gethrtime();
}

/* Report the average time per iteration since the last starttime() */
void endtime(long its)
{
  double e_time=1.0*gethrtime();
  printf("Time per iteration %5.2f ns\n", (e_time-s_time)/(1.0*its));
  s_time=1.0*gethrtime();

}

/* Same loop as in main, but run after a second thread has been created */
void *dowork(void *params)
{
  starttime();
  FILE* s=fopen("/tmp/dldldldldld","w");
  for (int i=0; i<100000000; i++)
  {
    putc(65,s);
  }
  fclose(s);
  endtime(100000000);
  return NULL;
}

int main()
{
  /* Time the loop while the process is still single threaded */
  starttime();
  FILE* s=fopen("/tmp/dldldldldld","w");
  for (int i=0; i<100000000; i++)
  {
    putc(65,s);
  }
  fclose(s);
  
  endtime(100000000);
  pthread_t threads[1];
  pthread_create(&threads[0],NULL,dowork,NULL);
  pthread_join(threads[0],NULL);
  return 0;
}

Here's the results of running the code on Solaris 10:

$ cc -mt putc.c
Time per iteration 30.55 ns
Time per iteration 165.76 ns

The situation on Solaris 10 is better than Solaris 9, since on Solaris 9 the cost of the mutex was incurred by the -mt compiler flag rather than whether there were actually multiple threads active.

Monday Dec 17, 2007

Open source application tuning

My group has started a page on the Sun wiki detailing the steps necessary to compile and build a number of open source applications. The page also contains links to useful destinations in the compiler documentation. Feel free to suggest ideas for applications that we should cover there - I can't guarantee that we'll manage to look at them, but I'd love to know what's important to you!

Wednesday Nov 28, 2007

CMT Developer Tools webcast

I've just had another webcast posted. This time it's discussing the CMT Developer Tools. The tools are add-ons for both Sun Studio 11 and Sun Studio 12. Of particular excitement was getting two of the tools ported to x64. There's a bit more information on the individual tools on my post back in July when we released them.

The CMT Developer Tools webcast covers installing and using the tools. The presentation is part slideware and part demo.

Monday Nov 26, 2007

Multi-threading webcast

A long while back I was asked to contribute a video that talked about parallelising applications. The final format is a webcast (audio and slides) rather than the expected video. This choice ended up being made to provide the clearest visuals of the slides, plus the smallest download.

I did get the opportunity to do the entire presentation on video - which was an interesting experience. I found it surprisingly hard to present to just a camera - I think the contrast with presenting to an audience is that you can look around the room and get feedback as to the appropriate level of energy to project. A video camera gives you no such feedback, and worse, there's no other place to look. Still I was quite pleased with the final video. The change to a webcast was made after this, so the audio from the video was carried over, and you still get to see about 3 seconds of the original film, but the rest has gone. I also ended up reworking quite a few of the slides - adding animation to clarify some of the topics.

The topics, covered at a break-neck pace, are: parallelising using Pthreads and OpenMP; autoparallelisation by the compiler; profiling parallel applications; and, finally, detecting data races using the thread analyzer.

Monday Oct 29, 2007

Solaris Application Programming - available as rough cut

My book, "Solaris Application Programming", is now available as a Safari Rough-Cut.

For those who are unfamiliar with the rough-cut programme, the idea is to get early access to drafts of new books. The draft of my book that is available is the one that I actually handed over about two months back. This is before the copyeditor went through smoothing out the grammar, and also before I did another review of the text. The layout of the book is also different.

From the link you can either get access to the full text (for example, if you have a subscription to safari), or you can view snippets from various sections of the book to get a feel for the content.

Part of the idea of rough-cuts is that they provide an opportunity to influence/improve the final product. So please use the mechanism they provide to comment.

Tuesday Oct 09, 2007

CMT Developer Tools on the UltraSPARC T2 systems

The CMT Developer Tools are included on the new UltraSPARC T2 based systems, together with Sun Studio 12, and GCC for SPARC Systems.

The CMT Developer Tools are installed in:

/opt/SUNWspro/extra/bin/<tool>

and are (unfortunately) not on the default path.

Friday Oct 05, 2007

Filebench - benchmarking file systems

Interesting benchmarking application for filesystems available from opensolaris.org.

Friday Sep 28, 2007

Perflib and parallel regions

If perflib is called from within a parallel region, you get the serial version rather than a parallel version. This is probably better than ending up with N^2 threads when you set OMP_NUM_THREADS to be N.

Book on OpenMP

Interesting new book on OpenMP available. I've worked with both Ruud and Gabriele. I regularly see Ruud when he makes his stateside trips, and Gabriele used to work in the same group as I do before she moved from Sun. I've recently had a number of entertaining conversations with Ruud comparing the writing and publishing processes that we've been working through.

Multi-threading resources on the developer portal

Found a page of links to useful resources about multi-threading on the developer portal.

Solaris Application Programming Table of Contents

A couple of folks requested that I post the table of contents for my book. This is the draft TOC, not the finished product. I assume that there will be a good correspondence, but the final version should definitely look neater.

Wednesday Sep 26, 2007

Interpreting the performance counters on the UltraSPARC T1 and UltraSPARC T2

I've previously written up a short entry on using the UltraSPARC T1 performance counters to determine what the processor is doing and where effort might be spent in improving performance. I've just completed a follow up article for the developer portal which discusses this concept in more depth, and covers both the UltraSPARC T1 and the UltraSPARC T2.

A quick refresher: it's simple to calculate the utilisation of the processor. Each processor has a fixed maximum number of instructions that it can issue per second, and cpustat can easily determine what proportion of that instruction budget is being used. Where it gets interesting is looking at the bottlenecks on the system, such as memory stalls. On a traditional system all memory stall time is potential performance gain, but on a CMT system one thread's stall is another thread's instruction issue opportunity. Basically, stalls increase the latency of a thread, but reducing stalls will not necessarily improve throughput.

This comes down to a few interesting observations:

  • A processor can tolerate a lot of stall cycles before the stall cycles start reducing the throughput of the application.
  • Traditional optimisations, where the developer, as an example, eliminates memory stall time, are not necessarily going to be the most productive use of developer time for CMT systems.
  • The factor that limits processor throughput is often instruction count, not stalls. Fortunately we have tools like BIT for getting instruction count data.
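The utilisation calculation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method from the article: the function name, the clock rate, the core count and the peak issue rate are all hypothetical parameters, and in practice the instruction count would come from cpustat.

```python
def utilisation(instructions, seconds, clock_hz, cores, issue_per_cycle_per_core):
    """Fraction of the peak instruction budget that was actually used.

    All hardware parameters are supplied by the caller; nothing here is
    specific to a particular processor.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Peak budget: every core issuing at its maximum rate for the whole interval.
    peak = seconds * clock_hz * cores * issue_per_cycle_per_core
    return instructions / peak


# Hypothetical numbers: 2.4e9 instructions counted over one second on an
# 8-core, 1.2 GHz processor that issues one instruction per core per cycle.
used = utilisation(
    instructions=2.4e9,
    seconds=1.0,
    clock_hz=1.2e9,
    cores=8,
    issue_per_cycle_per_core=1,
)
print(f"{used:.0%} of the instruction budget used")
```

A low figure here says that the processor has spare issue slots, so adding threads may raise throughput; a figure near 100% suggests that reducing instruction count is the more useful optimisation.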

Tuesday Sep 25, 2007

Solaris Application Programming book

I'm thrilled to see that my book is being listed for pre-order on Amazon in the US. It seems to take about a month for it to travel the Atlantic to Amazon UK.

Wednesday Sep 12, 2007

Sun Studio 12 Performance Analyzer docs available

The documentation for the Sun Studio 12 version of the Performance Analyzer has gone live. The Sun Studio 12 docs collection is also available.

Tuesday Sep 04, 2007

Recording Analyzer experiments over long latency links (-S off)

I was recently collecting a profile from an app running on a machine in Europe, but writing the data back to a machine here in CA. The application normally ran in 5 minutes, so I was surprised that it had made no progress after 3 hours when run under collect.

The Analyzer experiment looked like:

Dir : archives           08/28/07  23:47:10
File: dyntext            08/28/07  23:47:12
File: log.xml    598 KB  08/29/07  03:02:52
File: map.xml      3 KB  08/28/07  23:47:22
File: overview  4060 KB  08/29/07  03:02:51
File: profile    256 KB  08/28/07  23:47:56

Two of the files (log.xml and overview) had accumulated data since the start of the application; the other files had not. The truss output showed plenty of writes to these files:

 0.0001 open("/net/remotefs/.../test.1.er/log.xml", O_WRONLY|O_APPEND) = 4
 0.0000 write(4, " < e v e n t   k i n d =".., 74)      = 74
 0.0004 close(4)                                        = 0

In fact it looked rather like opening and closing these remote files was taking all the time away from running the application. One of the Analyzer team suggested passing -S off to collect to switch off periodic sampling, which collects application state at one-second intervals. With this flag, the application terminated in the usual 5 minutes and produced a valid profile.
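For anyone scripting their profiling runs, the workaround can be wrapped in a small helper. This is a minimal sketch: the helper name and the ./app target are hypothetical, and the only collect option it uses is the -S off flag mentioned above.

```python
def collect_command(target, *target_args, periodic_sampling=False):
    """Build the argument list for running a target under collect.

    Periodic sampling is switched off by default, which suits experiments
    that are written over a slow or long latency link.
    """
    cmd = ["collect"]
    if not periodic_sampling:
        cmd += ["-S", "off"]  # the flag suggested by the Analyzer team
    return cmd + [target, *target_args]


# "./app" is a placeholder for the application being profiled.
print(" ".join(collect_command("./app")))
```

On a machine with the tools installed, the returned list could be passed straight to subprocess.run.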

Thursday Aug 16, 2007

Presenting at Stanford HPC conference

I'll be presenting at Stanford next week as part of their HPC conference (Register here). I plan to cover:

Wednesday Aug 15, 2007

Comparing analyzer experiments

When performance tuning an application it is really helpful to be able to compare the performance of the current version of the code with an older version. This was one of the motivators for the performance reporting tool spot. spot takes great care to capture information about the system that the code was run on, the flags used to build the application, together with the obvious things like the profile of the application. The tool spot_diff, which is included in the latest release of spot, pulls out the data from multiple experiments and produces a very detailed comparison between them - indicating if, for example, one version had more TLB misses than another version.

However, there are situations where it's necessary to compare two analyzer experiments, and er_xcmp is a tool which does just that.

er_xcmp extracts the time spent in each routine for the input data that is passed to it, and presents this as a list of functions together with the time spent in each function from each data set. er_xcmp handles an arbitrary number of input files, so it's just as happy comparing three profiles as it is two. It's also able to handle data from bit, so comparisons of instruction count as well as user time are possible.

The input formats can be Analyzer experiments, fsummary output from er_print, or output directories from er_html - all three formats get boiled down to the same thing and handled in the same way by the script.

Here's some example output:

% er_xcmp test.1.er test.2.er
    2.8     8.8 <Total>
    1.8     2.6 foo
    1.0     6.2 main
    N/A     N/A _start
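Output in this shape is easy to post-process. The sketch below is not part of er_xcmp: the function names are made up for illustration, and it assumes the layout shown above (one numeric column per experiment, then a function name with no spaces in it, with N/A where there is no data).

```python
def parse_xcmp(text):
    """Return {function: [value per experiment]}, using None for N/A."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] == "%":
            continue  # skip blank lines and the shell prompt line
        *values, name = fields
        rows[name] = [None if v == "N/A" else float(v) for v in values]
    return rows


def deltas(rows):
    """Change from the first to the last experiment, where both have data."""
    return {
        name: vals[-1] - vals[0]
        for name, vals in rows.items()
        if vals[0] is not None and vals[-1] is not None
    }


sample = """\
% er_xcmp test.1.er test.2.er
    2.8     8.8 <Total>
    1.8     2.6 foo
    1.0     6.2 main
    N/A     N/A _start
"""

rows = parse_xcmp(sample)
# List the functions with the largest increase first.
for name, change in sorted(deltas(rows).items(), key=lambda kv: -kv[1]):
    print(f"{change:+6.1f}  {name}")
```

For the sample data this makes it obvious that main accounts for most of the increase in total time between the two experiments.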
