Darryl Gove's blog

Wednesday Apr 01, 2009

NUMA, binding, and OpenMP

One of my colleagues did an excellent bit of analysis recently, it pulls together a fair number of related topics, so I hope you'll find it interesting.

We'll start with NUMA. Non-Uniform Memory Access. This is in contrast to UMA - Uniform Memory Access. This relates to memory latency - how long does it take to get data from memory to the processor. If you take a single CPU box, the memory latency is basically a measurement of the wires between the processor and the memory chips, it typically is about 90ns, can be as low as 60ns. For a 3GHz chip this is from around 200 to 300 cycles, which is a fair length of time.

Suppose we add a second chip into the system. The memory latency increases because there's now a bunch of communication that needs to happen between the two chips. The communication consists of things like checking that more recent data is not in the cache of the other chip, co-ordinating access to the same memory bank, accessing memory that is controlled by the other processor. The upshot of all this is that memory latency increases. However, that's not all.

If you have two chips together with a bunch of memory, you can have various configurations. The most likely one is that each chip gets half the memory. If one chip has to access memory that the other chip owns, this is going to take longer than if the memory is attached to that chip. Typically you might find that local memory takes 90ns to access, and remote memory 120ns.

One way of dealing with this disparity is to interleave the memory, so one cacheline will be local, the next remote. Doing this you'll see an average memory latency of 105ns. Although the memory latency is longer than the optimal, there's nothing a programmer (or an operating system) can do about it.

However, those of use who care about performance will jump through hoops of fire to get that lower memory latency. Plus as the disparity in memory latency grows larger, it makes less and less sense to average the cost. Imagine a situation on a large multi-board system where the on-board memory latency might be 150ns, but the cross-board latency would be closer to 300ns (I should point out that I'm using top-of-the-head numbers for all latencies, I'm not measuring them on any systems). The impact of doing this averaging could be a substantial slow-down in performance for any application that doesn't fit into cache (which is most apps). (There are other reasons for not doing this, such as limiting the amount of traffic that needs to go across the various busses.)

So most systems with more than one CPU will see some element of NUMA. Solaris has contained some memory placement optimisations MPO since Solaris 9. These optimisations attempt to allocate memory locally to the processor that is running the application. OpenSolaris has the lgrpinfo command that provides an interface to see the levels of memory locality in the system.

Solaris will attempt to schedule threads so that they remain in their locality group - taking advantage of the local memory. Another way of controlling performance is to use binding to keep processes, or threads on a particular processor. This can be done through the pbind command. Processor sets can performance a similar job (as can zones, or even logical domains), or directly through processor_bind.

Binding can be a tricky thing to get right. For example in a system where there are multiple active users, it is quite possible to end up in a situation where one virtual processor is oversubscribed with processes, whilst another is completely idle. However, in situations where this level of control enables better performance then binding can be hugely helpful.

One situation where binding is commonly used is for running OpenMP programs. In fact, it is so common that the OpenMP library has built in support for binding through the environment variable SUNW_MP_PROCBIND. This variable enables the user to specify which threads are bound to which logical processors.

It is worth pointing out that binding does not just help memory locality issues. Another situation where binding helps is thread migrations. This is the situation where an interrupt, or another thread requires attention and this causes the thread currently running on the processor to be descheduled. In some situations the descheduled thread will get scheduled onto another virtual processor. In some instances that may be the correct decision. In other instances it may result in lower than expected performance because the data that the thread needs is still in the cache on the old processor, and also because the migration of that thread may cause a cascade of migrations of other threads.

The particular situation we hit was that one code when bound showed bimodal distributions of runtimes. It had a 50% chance of running fast or slow. We were using OpenMP as well as the SUNW_MP_PROCBIND environment variable, so in theory we'd controlled for everything. However, the program didn't hit the parallel section until after a few minutes of running, and examining what was happening using both pbind and also the Performance Analyzer indicated what the problem was.

The environment variable SUNW_MP_PROCBIND currently binds threads once the code reaches the parallel region. Until that point the process is unbound. Since the process is unbound, Solaris can schedule it to any available virtual CPU. During the unbound time, the process allocated the memory that it needed, and the MPO feature of Solaris ensured that the memory was allocated locally. I'm sure you can see where this is heading.... Now, once the code hit the parallel region, the binding occurred, and the main thread was bound to a particular locality group, half the time this group was the same group where it had been running before, and half the time it was a different locality group. If the locality group was the same, then memory would be local, otherwise memory would be remote (and performance slower).

We put together a LD_PRELOAD library to prove it. The following code has a parallel section in it which gets called during initialisation. This ensures that binding has already taken place by the time the master thread starts.

#include <stdio.h>
#pragma init(s)
void s()
{
  #pragma omp parallel sections
  {
  #pragma omp section
  {
   printf("Init");
  }
  }
}

The code is compiled and used with:

$ cc -O -xopenmp -G -Kpic -o par.so par.c
$ LD_PRELOAD=./par.so ./a.out

Friday Mar 20, 2009

University of Washington Presentation

I was presenting at the University of Washington, Seattle, on Wednesday on Solaris and Sun Studio. The talk covers the tools that are available in Solaris and Sun Studio. This is my "Grand Unified" presentation, it covers tools, compilers, optimisation, parallelisation, and debug.

Sunday Oct 26, 2008

Second life presentation

I'll be presenting in Second Life on Tuesday 28th at 9am PST. The title of the talk is "Utilising CMT systems".

Thursday Sep 11, 2008

Innovation insider information

Had my briefing for Innovation Insider this morning. I'll be on the show tomorrow (12th September) from 1-2pm PST. I expect to be talking about the book, Sun Studio, and parallelisation.

The format of the show is Q&A, plus phone-ins. So the discussion could go anywhere. Basically you can phone-in to the show to listen and ask questions. It's also streamed live over the net - although that apparently cuts off at 2pm sharp. Then it gets archived for on-demand replay. Should be an interesting experience.

Monday Jul 28, 2008

OpenSolaris presentation in Japan

I'm back from the trip to Japan. I got to visit a number of customers and talk with them about the compilers and tools. However, the highpoint for me was the OpenSolaris event on the Friday evening. Jim Grisanzio has put up a set of photos from the event and the meal afterwards. (Yes, I was wearing a shirt and tie - the outside temperature and humidity was far to great to also wear a jacket.)

You can probably see in the pictures that the room was full - about 70 people turned up, listened, and asked some excellent questions. Keiichi Oono translated for me, and did a superb job, I think I managed to talk in short chunks, but there were a couple of occasions where I probably talked way too much. There's a couple of pictures of me using the whiteboard, and this turned out to be quite a burden to translate - I plan to do a proper write up in the next day or so.

Hisayoshi Kato did a nice talk (with live demos on a V490) of various performance tools, including some dtrace. I must admit that since Hisayoshi's talk was in Japanese I didn't actually attend all of it, and instead chatted to Jim and Takanobu Masuzuki.

Thursday May 22, 2008

Tonight's OpenSolaris User Group presentations

Slides for tonight's presentations are now available:

Wednesday May 21, 2008

OpenSolaris Users Group presentation topics

As I wrote earlier, I'm planning on a number of short presentations rather than a single long one. I don't know whether I'll manage all four of the sets of slides that I've prepared - I rather hope that there will be more discussion and I'll end up only doing one or two sets. Anyway the topics I've prepared are:

  • A deck of slides on my book.
  • A quick run through what I consider to be the important compiler flags, and the associated gotcha's.
  • Compiler support for parallelisation.
  • An overview of OpenSPARC.

Tuesday May 13, 2008

OpenMP 3.0 specification released

The specification for OpenMP 3.0 has been put up on the OpenMP.org website. Using the previous OpenMP 2.5 standard, there's basically two supported modes of parallelisation:

  • Splitting a loop over multiple threads - each thread is responsible for a range of the iterations.
  • Splitting a serial code into sections - each thread executes a section of code.

The large change with OpenMP 3.0 is the introduction of tasks, where a thread can spawn a task to be completed by another thread at an unspecified point in the future. This should make OpenMP amenable to many more situations. An example of using tasks looks like:

  node * p = head;
  while (p)
  {
    #pragma omp task
    {
      process(p);
    }
    p = p->next;
  }

The master thread iterates the linked list generating tasks for processing each element in the list. The brackets around the call to process(p) are unnecessary, but hopefully clarify what's happening.

Monday May 12, 2008

Slides for CommunityOne

All the slides for last week's CommunityOne conference are available for download. I was presenting in the CMT stream, you can find my slides here. Note that to download the slides, you'll need to use the username and password shown on the page.

My talk was on parallelisation. What's supported by the compiler, the steps to do it, and the tools that support that. I ended with an overview of microparallelisation.

Monday Feb 18, 2008

Multi-core Expo

My paper "Strategies for improving the performance of single threaded codes on a CMT system" has been accepted for the Multi-core Expo in Santa Clara. I'm not sure when I'll be presenting; the agenda should be available soon.

Thursday Jan 31, 2008

Win $20,000!

Sun has announced a Community Innovation Awards Programme - basically a $1M of prize money available for various Sun-sponsored open source projects. There is an OpenSPARC programme, and the one that catches my eye is $20k for:

vi. Best Adaptation of a single-thread application to a multi-thread CMT (Chip Multi Threaded) environment

My guess is that they will expect more than the use of -xautopar -xreduction or a few OpenMP directives :) If I were allowed to enter (unfortunately Sun Employees are not) I'd be looking to exploit the features of the T1 or T2:

  • The threads can synchronise at the L2 cache level - so synchronisation costs are low
  • Memory latency is low

The upshot of this should be that it is possible to parallelise applications which traditionally have not been parallelisable because of synchronisation costs.

Funnily enough this is an area that I'm currently working in, and I do hope to have a paper accepted for the MultiExpo.

Monday Nov 26, 2007

Multi-threading webcast

A long while back I was asked to contribute a video that talked about parallelising applications. The final format is a webcast (audio and slides) rather than the expected video. This choice ended up being made to provide the clearest visuals of the slides, plus the smallest download.

I did get the opportunity to do the entire presentation on video - which was an interesting experience. I found it surprisingly hard to present to just a camera - I think the contrast with presenting to an audience is that you can look around the room and get feedback as to the appropriate level of energy to project. A video camera gives you no such feedback, and worse, there's no other place to look. Still I was quite pleased with the final video. The change to a webcast was made after this, so the audio from the video was carried over, and you still get to see about 3 seconds of the original film, but the rest has gone. I also ended up reworking quite a few of the slides - adding animation to clarify some of the topics.

The topics covered at a break-neck pace are, parallelising using Pthreads and OpenMP. Autoparallelisation by the compiler. Profiling parallel applications. Finally, detecting data races using the thread analyzer.

Friday Sep 28, 2007

Book on OpenMP

Interesting new book on OpenMP available. I've worked with both Ruud and Gabriele. I regularly see Ruud when he makes his stateside trips, and Gabriele used to work in the same group as I do before she moved from Sun. I've recently had a number of entertaining conversations with Ruud comparing the writing and publishing processes that we've been working through.

Calendar

Search this blog

About

Solaris Application Programming

Book resources

The Developer's Edge

Book resources

OpenSPARC Internals

Book resources

Recent entries

Custom search

Tag cloud

book cmt communityone compiler cooltools cpu2006 dtrace gcc libraries linker openmp opensolaris opensparc optimisation optimization parallelisation parallelization performance performanceanalyzer programming solaris solarisapplicationprogramming sparc spec spot sunstudio t2 ultrasparc ultrasparct2 x86

Links

Webcasts

Articles

Presentations

Interesting docs

Navigation

Referers

Feeds