Darryl Gove's blog

Thursday May 22, 2008

Tonight's OpenSolaris User Group presentations

Slides for tonight's presentations are now available:

Wednesday May 21, 2008

Presenting at OpenSolaris Users group tomorrow

Tomorrow I'll be presenting at the Silicon Valley OpenSolaris Users group. Alan DuBoff has asked me to try and avoid the monolithic presentation, so I'll be aiming to have a couple of short presentations. The idea is to push the balance towards communications rather than presentations.

Tuesday May 13, 2008

OpenMP 3.0 specification released

The specification for OpenMP 3.0 has been put up on the OpenMP.org website. Using the previous OpenMP 2.5 standard, there's basically two supported modes of parallelisation:

  • Splitting a loop over multiple threads - each thread is responsible for a range of the iterations.
  • Splitting a serial code into sections - each thread executes a section of code.

The large change with OpenMP 3.0 is the introduction of tasks, where a thread can spawn a task to be completed by another thread at an unspecified point in the future. This should make OpenMP amenable to many more situations. An example of using tasks looks like:

  node * p = head;
  while (p)
  {
    #pragma omp task
    {
      process(p);
    }
    p = p->next;
  }

The master thread iterates the linked list generating tasks for processing each element in the list. The brackets around the call to process(p) are unnecessary, but hopefully clarify what's happening.

Monday May 12, 2008

Slides for CommunityOne

All the slides for last week's CommunityOne conference are available for download. I was presenting in the CMT stream, you can find my slides here. Note that to download the slides, you'll need to use the username and password shown on the page.

My talk was on parallelisation. What's supported by the compiler, the steps to do it, and the tools that support that. I ended with an overview of microparallelisation.

Wednesday May 07, 2008

When to use membars

membar instructions are SPARC assembly language instructions that enforce memory ordering. They tell the processor to ensure that memory operations are completed before it continues execution. However, the basic rule is that the instructions are usually only necessary in "unusual" circumstances - which fortunately will mean that most people don't encounter them.

The UltraSPARC Architecture manual documents the situation very well in section 9.5. It gives these rules which cover the default behaviour:

  • Each load instruction behaves as if it were followed by a MEMBAR #LoadLoad and #LoadStore.
  • Each store instruction behaves as if it were followed by a MEMBAR #StoreStore.
  • Each atomic load-store behaves as if it were followed by a MEMBAR #LoadLoad, #LoadStore, and #StoreStore.

There's a table in section 9.5.3 which covers when membars are necessary. Basically, membars are necessary for ordering of block loads and stores, and for ordering non-cacheable loads and stores. There is an interesting note where it indicates that a membar is necessary to order a store followed by a load to a different addresses; if the address is the same the load will get the correct data. This at first glance seems odd - why worry about whether the store is complete if the load is of independent data. However, I can imagine this being useful in situations where the same physical memory is mapped using different virtual address ranges - not something that happens often, but could happen in the kernel.

As a footnote, the equivalent x86 instruction is the mfence. There's a good discussion of memory ordering in section 7.2 of the Intel Systems Programming Guide.

There's some more discussion of this topic on Dave Dice's weblog.

Saturday May 03, 2008

Official reschedule notice for CommunityOne

Session ID: S297077
Session Title: Techniques for Utilizing CMT
Track: Chip Multithreading (CMT): OpenSPARCâ„¢
Room: Esplanade 302
Date: 2008-05-05
Start Time: 13:30 

The official timetable has also been updated

Friday May 02, 2008

Embedded Systems Conference Presentation

I got the opportunity to present at the embedded systems conference in San Jose a couple of weeks back. My presentation covered parallelising a serial application, a quick tour of what to do, together with an overview of the tools that Sun Studio provides to help out. The presentation is now available on the OpenSPARC website.

Thursday May 01, 2008

CommunityOne Panel and reschedule

I've heard that my session at CommunityOne is now scheduled from 1:30. The panel session that was scheduled for that time has been shifted to the 11:00am slot. However, the timetable currently up on the site does not reflect that change. I've been invited to appear on the panel - which I'm looking forward to. See you there!

Tuesday Apr 29, 2008

Multicore expo available - Microparallelisation

My presentation "Strategies for improving the performance of single threaded codes on a CMT system" has been made available on the OpenSPARC site.

The presentation discusses "microparallelisation", in the the context of parallelising an example loop. Microparallelisation is the aim of obtaining parallelism through assigning small chunks of work to discrete processors. Taking a step back...

With traditional parallelisation the idea is to identify large chunks of work that can be split between multiple processors. The chunks of work need to be large to amortise the synchronisation costs. This usually means that the loops have a huge trip count.

The synchronisation costs are derived from the time it takes to signal that a core has completed its work. The lower the synchronisation costs, the smaller amount of work is needed to make parallelisation profitable.

Now, a CMT processor has two big advantages here. First of all it has many threads. Secondly these threads have low latency access to a shared level of cache. The result of this is that the cost of synchronisation between threads is greatly reduced, and therefore each thread is free to do a smaller chunk of work in a parallel region.

All that's great in theory, the presentation uses some example code to try this out, and discovers, rather fortunately, that the idea also works in practice!

The presentation also covers using atomic operations rather that microparallelisation.

In summary the presentation is more research than solid science, but I hoped that presenting it would get some people thinking about non-traditional ways to extract parallelism from applications. I'm not alone in this area of work, Lawrence Spracklen is also working on it. We're both at presenting CommunityOne next week.

Tuesday Mar 25, 2008

Conference schedule

The next two months are likely to be a bit hectic for me. I'm presenting at three different conferences, as well as a chat session in Second Life. So I figured I'd put the information up in case anyone reading this is also going to one or other of the events. So in date order:

I'll be talking about parallelisation at the various conferences, the talks will be different. The multi-core expo talks focuses on microparallelisation. The ESC talk will probably be higher level, and the CommunityOne talk will probably be wider ranging, and I hope more interactive.

In the Second Life event I'll be talking about the book, although the whole idea of appearing is to do Q&A, so I hope that will be more of a discussion.

Tuesday Feb 26, 2008

Presentation at communityone

Had my presentation on parallelisation accepted for CommunityOne on 5th May in San Francisco.

Tuesday Feb 12, 2008

CMT related books

OpenSPARC.net has a side bar featuring books that are relevant to CMT. Mine is featured as one of the links. The other two books are Computer Architecture - a quantitative approach which features the UltraSPARC T1, and Chip-Multiprocessor Architecture which is by members of the team responsible for the UltraSPARC T1 processor.

Thursday Jan 31, 2008

Win $20,000!

Sun has announced a Community Innovation Awards Programme - basically a $1M of prize money available for various Sun-sponsored open source projects. There is an OpenSPARC programme, and the one that catches my eye is $20k for:

vi. Best Adaptation of a single-thread application to a multi-thread CMT (Chip Multi Threaded) environment

My guess is that they will expect more than the use of -xautopar -xreduction or a few OpenMP directives :) If I were allowed to enter (unfortunately Sun Employees are not) I'd be looking to exploit the features of the T1 or T2:

  • The threads can synchronise at the L2 cache level - so synchronisation costs are low
  • Memory latency is low

The upshot of this should be that it is possible to parallelise applications which traditionally have not been parallelisable because of synchronisation costs.

Funnily enough this is an area that I'm currently working in, and I do hope to have a paper accepted for the MultiExpo.

Monday Nov 26, 2007

Multi-threading webcast

A long while back I was asked to contribute a video that talked about parallelising applications. The final format is a webcast (audio and slides) rather than the expected video. This choice ended up being made to provide the clearest visuals of the slides, plus the smallest download.

I did get the opportunity to do the entire presentation on video - which was an interesting experience. I found it surprisingly hard to present to just a camera - I think the contrast with presenting to an audience is that you can look around the room and get feedback as to the appropriate level of energy to project. A video camera gives you no such feedback, and worse, there's no other place to look. Still I was quite pleased with the final video. The change to a webcast was made after this, so the audio from the video was carried over, and you still get to see about 3 seconds of the original film, but the rest has gone. I also ended up reworking quite a few of the slides - adding animation to clarify some of the topics.

The topics covered at a break-neck pace are, parallelising using Pthreads and OpenMP. Autoparallelisation by the compiler. Profiling parallel applications. Finally, detecting data races using the thread analyzer.

Tuesday Oct 09, 2007

CMT Developer Tools on the UltraSPARC T2 systems

The CMT Developer Tools are included on the new UltraSPARC T2 based systems, together with Sun Studio 12, and GCC for SPARC Systems.

The CMT Developer Tools are installed in:

/opt/SUNWspro/extra/bin/<tool>
and are (unfortunately) not on the default path.

Threads and cores

The UltraSPARC T2 has eight cores, each of these cores is capable of executing 8 threads; making a total of 64 virtual processors. The threads are grouped into two groups of four threads. Each core can execute at most 2 instructions per cycle, the load/store unit and the floating point unit are shared between the two groups of threads, but each thread has its own integer pipeline.

Usually, threads 0-7 are assigned to core 0, threads 8-15 to core 1 etc. The exact mapping is reported by the service processor (and prtdiag), but the mapping is only likely to be different if the system is configured with LDoMs. Taking core 0 as an example; threads 0-3 are assigned to one group and threads 4-7 are assigned to the other group.

To disable a core using psradm it is necessary to disable all the threads on that core. On the other hand if the objective is just to reduce the total number of active threads, keeping the core enabled, then best performance will be attained if the threads are disabled across all groups rather than just disabling threads within a single group. The reason for this is that each group gets to execute a single instruction per cycle, so disabling all the threads within a group will reduce the maximum number of instructions that can be executed in a cycle.

Compiling for the UltraSPARC T2

Today, Sun launched systems based on the UltraSPARC T2. A question that is bound to come up is what compiler flags should be used for the processor?

Sun Studio 12 has the flag -xtarget=ultraT2 to specifically target the UltraSPARC T2. But before jumping off and using this flag, let's take the flag apart and see what it actually means. There are three components that are set by the -xtarget flag :

  • -xcache flag. This flag tells the compiler to target a particular cache configuration. The flag will have an impact on floating point code where the loops can be tiled to fit into cache. Obviously not all codes are amenable to this optimisation, so the -xcache setting is usually unimportant.
  • -xchip flag. This sets the instruction latencies and instruction selection preferences. The UltraSPARC T2 (in common with the UltraSPARC T1) has a simple pipeline so there is nothing much to gain from accurately modelling the instruction latencies. There are also no real situations where it will do better with one instruction sequence in preference to another (unless one is longer than the other). So for the UltraSPARC T2 this flag has little impact on the generated code.
  • -xarch flag. The -xarch flag controls the target architecture. This is traditionally used principally to control whether 32-bit or 64-bit binaries are generated. However, Sun Studio 12 introduced the flags -m32 and -m64 to separate the address-size of the binary from the instruction set selection. There are no UltraSPARC T2 specific instructions which the compiler currently generates, so the default of the SPARC V9 ISA is fine.
  • To summarise, there is an UltraSPARC T2 specific compiler flag, but for most situations the best target to use would be -xtarget=generic which should give good performance over a wide range of processors.

Friday Sep 28, 2007

Multi-threading resources on the developer portal

Found a page of links to useful resources about multi-threading on the developer portal.

Solaris Application Programming Table of Contents

A couple of folks requested that I post the table of contents for my book. This is the draft TOC, not the finished product. I assume that there will be a good correspondence, but the final version should definitely look neater.

Wednesday Sep 26, 2007

Interpreting the performance counters on the UltraSPARC T1 and UltraSPARC T2

I've previously written up a short entry on using the UltraSPARC T1 performance counters to determine what the processor is doing and where effort might be spent in improving performance. I've just completed a follow up article for the developer portal which discusses this concept in more depth, and covers both the UltraSPARC T1 and the UltraSPARC T2.

A quick refresher here is that it's simple to calculate the utilisation of the processor. They have a fixed maximum number of instructions per second and cpustat can easily determine what proportion of that instruction budget is being utilised. Where it gets interesting is looking at the bottlenecks on the system - such as the memory stalls. On a traditional system memory stall time is all potential performance gain; but on a CMT system one threads's stall is another thread's instruction issue opportunity. Basically, stall will increase the latency of a thread, but reducing stall may not necessarily improve throughput.

This comes down to a few interesting observations:

  • A processor can tolerate a lot of stall cycles before the stall cycles start reducing the throughput of the application.
  • Traditional optimisations, where the developer, as an example, eliminates memory stall time, are not necessarily going to be the most productive use of developer time for CMT systems.
  • The factor that limits processor throughput is often instruction count, not stalls. Fortunately we have tools like BIT for getting instruction count data.

Friday Aug 31, 2007

Total Store Ordering (TSO)

Total Store Ordering is the name for the ordering implemented on multi-processor SPARC systems. The SPARC architecture also defines a number of weaker orderings. TSO basically guarantees that loads are visible between processors in the order in which they occurred, the same for stores, and that loads will see the values of earlier stores. There's a very helpful description of this in the SPARC architecture documentation on pages 418-427 (section 9.5). Particularly useful is table 9-2 on page 422, which gives a useful summary of where membar operations are required.

Tuesday Jul 24, 2007

CMT Developer Tools

We've just released the CMT Developer Tools for Sun Studio 12. The tools are available for both SPARC and x64 systems, although not all tools are on the x64 platform. The list of tools is as follows:

  • ATS - Automatic Tuning and Troubleshooting System. Automatically finds the best compiler flags or locates optimisation bugs in an application without access to the source code. (SPARC and x64)
  • SPOT - Simple Performance Optimisation Tool. Generates an html report detailing where the application is spending time, and information about why it might be spending time there.(SPARC and x64)
  • BIT - Binary Improvement Tool. Reports information about application coverage also produces instruction execution counts, branch taken data etc. (SPARC only)
  • Discover - Sun Memory Error Discovery Tool. Detects memory access issues such in an application. Examples are accesses to uninitialised memory, accesses beyond array bounds, memory leaks, etc. (SPARC only)

Monday Mar 19, 2007

Workshop on software tools for Multi-Core systems

I was very pleased to get a paper accepted to the Second Workshop on Software Tools for Multi-Core Systems. My co-author, Raj Prakash, did the presentation. The other presentations from the workshop are also available.

The paper covers the cool tools, but the novel part of the presentation is interpreting the UltraSPARC-T1 performance counters. This material is a more complete version of the coverage in my previous blog entry.

Calendar

Search this blog

About

Solaris Application Programming

Book resources

The Developer's Edge

Book resources

OpenSPARC Internals

Book resources

Recent entries

Custom search

Tag cloud

book cmt communityone compiler cooltools cpu2006 dtrace gcc libraries linker multithreading openmp opensolaris opensparc optimisation optimization parallelisation parallelization performance performanceanalyzer programming secondlife solaris solarisapplicationprogramming sparc spot sunstudio ultrasparc ultrasparct2 x86

Links

Webcasts

Articles

Presentations

Interesting docs

Navigation

Referers

Feeds