Darryl Gove's blog

Tuesday May 13, 2008

OpenMP 3.0 specification released

The specification for OpenMP 3.0 has been put up on the OpenMP.org website. Using the previous OpenMP 2.5 standard, there are basically two supported modes of parallelisation:

  • Splitting a loop over multiple threads - each thread is responsible for a range of the iterations.
  • Splitting a serial code into sections - each thread executes a section of code.

The large change with OpenMP 3.0 is the introduction of tasks, where a thread can spawn a task to be completed by another thread at an unspecified point in the future. This should make OpenMP amenable to many more situations. An example of using tasks looks like:

  node * p = head;
  while (p)
  {
    #pragma omp task
    {
      process(p);
    }
    p = p->next;
  }

The master thread iterates over the linked list, generating a task to process each element in the list. The braces around the call to process(p) are unnecessary, but hopefully clarify what's happening.

Monday May 12, 2008

Slides for CommunityOne

All the slides for last week's CommunityOne conference are available for download. I was presenting in the CMT stream; you can find my slides here. Note that to download the slides, you'll need to use the username and password shown on the page.

My talk was on parallelisation: what's supported by the compiler, the steps to do it, and the tools that support it. I ended with an overview of microparallelisation.

Wednesday May 07, 2008

When to use membars

membar instructions are SPARC assembly language instructions that enforce memory ordering. They tell the processor to ensure that memory operations are completed before it continues execution. However, the basic rule is that the instructions are usually only necessary in "unusual" circumstances - which fortunately will mean that most people don't encounter them.

The UltraSPARC Architecture manual documents the situation very well in section 9.5. It gives these rules which cover the default behaviour:

  • Each load instruction behaves as if it were followed by a MEMBAR #LoadLoad and #LoadStore.
  • Each store instruction behaves as if it were followed by a MEMBAR #StoreStore.
  • Each atomic load-store behaves as if it were followed by a MEMBAR #LoadLoad, #LoadStore, and #StoreStore.

There's a table in section 9.5.3 which covers when membars are necessary. Basically, membars are necessary for ordering of block loads and stores, and for ordering non-cacheable loads and stores. There is an interesting note where it indicates that a membar is necessary to order a store followed by a load to a different address; if the address is the same the load will get the correct data. This at first glance seems odd - why worry about whether the store is complete if the load is of independent data? However, I can imagine this being useful in situations where the same physical memory is mapped using different virtual address ranges - not something that happens often, but could happen in the kernel.

As a footnote, the equivalent x86 instruction is the mfence. There's a good discussion of memory ordering in section 7.2 of the Intel Systems Programming Guide.

There's some more discussion of this topic on Dave Dice's weblog.

Saturday May 03, 2008

Official reschedule notice for CommunityOne

Session ID: S297077
Session Title: Techniques for Utilizing CMT
Track: Chip Multithreading (CMT): OpenSPARC™
Room: Esplanade 302
Date: 2008-05-05
Start Time: 13:30 

The official timetable has also been updated.

Friday May 02, 2008

OpenSolaris Summit

I'm stepping in to present at the OpenSolaris Summit on Sunday. The presentation is titled "Optimizing for OpenSolaris", and I believe someone else has already prepared the slideset - we'll see. Anyway, I'm looking forward to going; the attendee list contains many familiar names.

Embedded Systems Conference Presentation

I got the opportunity to present at the Embedded Systems Conference in San Jose a couple of weeks back. My presentation covered parallelising a serial application: a quick tour of what to do, together with an overview of the tools that Sun Studio provides to help out. The presentation is now available on the OpenSPARC website.

Thursday May 01, 2008

Good overview of tools

I've just noticed this blog which is doing a very nice series of entries about the tools available on the tools cd and from the dtrace toolkit as well as the tools that ship with the system.

CommunityOne Panel and reschedule

I've heard that my session at CommunityOne is now scheduled from 1:30. The panel session that was scheduled for that time has been shifted to the 11:00am slot. However, the timetable currently up on the site does not reflect that change. I've been invited to appear on the panel - which I'm looking forward to. See you there!

Tuesday Apr 29, 2008

Multicore expo available - Microparallelisation

My presentation "Strategies for improving the performance of single threaded codes on a CMT system" has been made available on the OpenSPARC site.

The presentation discusses "microparallelisation" in the context of parallelising an example loop. Microparallelisation aims to obtain parallelism by assigning small chunks of work to discrete processors. Taking a step back...

With traditional parallelisation the idea is to identify large chunks of work that can be split between multiple processors. The chunks of work need to be large to amortise the synchronisation costs. This usually means that the loops have a huge trip count.

The synchronisation costs are derived from the time it takes to signal that a core has completed its work. The lower the synchronisation costs, the smaller the amount of work needed to make parallelisation profitable.

Now, a CMT processor has two big advantages here. First of all it has many threads. Secondly these threads have low latency access to a shared level of cache. The result of this is that the cost of synchronisation between threads is greatly reduced, and therefore each thread is free to do a smaller chunk of work in a parallel region.

All that's great in theory. The presentation uses some example code to try this out, and discovers, rather fortunately, that the idea also works in practice!

The presentation also covers using atomic operations rather than microparallelisation.

In summary the presentation is more research than solid science, but I hoped that presenting it would get some people thinking about non-traditional ways to extract parallelism from applications. I'm not alone in this area of work; Lawrence Spracklen is also working on it. We're both presenting at CommunityOne next week.

Monday Apr 28, 2008

static and inline functions

Hit a problem when compiling a library. The problem is with mixing static and inline functions: an extern inline function references a static function, which is not allowed by the standard, but is allowed by gcc. Example code looks like:

char * c;

static void foo(char *);

inline void work()
{
  foo(c);
}

void foo(char* c)
{
} 

When this code is compiled it generates the following error:

% cc s.c
"s.c", line 7: reference to static identifier "foo" in extern inline function
cc: acomp failed for s.c

It turns out that there is a workaround for this problem, which is the flag -features=no%extinl. Douglas Walls describes the issue in much more detail.

Thursday Apr 24, 2008

Second life slides and script

Just completed the Second Life presentation. It appeared well attended, and I got a bundle of great questions at the end. If you were there, thank you! I've uploaded a screen shot that I managed to get before the presentation started. Unfortunately, I didn't get a picture of the stage setup with the life-size books, a very nice touch.

Wednesday Apr 23, 2008

Time and location for CommunityOne

I've just had a mail giving time and location for my presentation at CommunityOne.

Session ID: S297077
Session Title: Techniques for Utilizing CMT
Track: Chip Multithreading (CMT): OpenSPARC™
Room: Esplanade 302
Date: 2008-05-05
Start Time: 11:00

See you there!

Presentation and Q&A in second life tomorrow

I'll be in Second Life tomorrow talking about the book. The session is at 9am PST in the Andromeda Theatre. I've got a small set of slides describing the contents of the book (Here's the full ToC for the book). After the slides I'll be sticking around to answer questions.

Thursday Apr 10, 2008

Solaris Grid Compiler

I've only just heard that the Solaris Grid Compiler has been released as a cooltool! The idea of this is to set up a batch of machines as a compiler grid - so large compilations can be dispatched to the grid to better utilise the machines, and reduce compilation time.

Monday Apr 07, 2008

Congratulations Will, and thank you!

Will Zhang just received an SDN award for leading the effort to translate Solaris Application Programming into Chinese. Thanks, Will!

Friday Apr 04, 2008

Atomic operations and memory ordering

Nice article on the developer portal that discusses memory ordering and atomic operations. It's worth noting that Solaris 10 has a number of atomic operations implemented in libc.

Sunday Mar 30, 2008

Search terms

A while back, AOL released a bundle of search data, which turned out to be rather controversial. The data has been put up as a searchable archive. The top keywords are interesting - coming in at number 9 is "http", number 23 is "m"! Most of the terms seem to be just looking for a company website (e.g. the top search term is "google").

Thursday Mar 27, 2008

SPEC CPU2006 discussions

I recently read a couple of posts about SPEC CPU2006. As you can tell from the papers linked on this blog, I was quite busy helping prepare the suite - which was considerable fun. The first post is by Tom Yager, where he praises the suite for raising awareness of the components of a system that actually contribute to performance: "I added a practical angle to my scientific understanding of compiler optimizations, processor scheduling, CPU cache utilization".

On the other hand Neil Gunther (second time I've mentioned him) condemns "bogus SPECxx_rate benchmarks which simply run multiple instances of a single-threaded benchmark". I hope he's joking, but taking his comments at face value...

Interestingly he suggests SPEC SDM as a good choice. I'd not heard of this suite, but reading up on it it looks like it tests the impact of multiple users typing and executing commands on the system at the same time, and it's not been touched in over 10 years. I guess SDM would be a good match for the SunRay that I use daily, but I'm certain that the suite doesn't include the 22 copies of firefox that I currently see running on the server I'm using. On only slightly less rocky ground he talks about TPC-C, which appeared in 1992!

CPU2006 represents the CPU intensive portion very well, but deliberately tries to avoid hitting the disk or network. Since disk and network do play significant roles in system performance, I probably wouldn't recommend getting a machine purely on its SPECcpu2006 or SPECcpu2006_rate scores. However, the mix of apps in the benchmark suite is representative of most of the codes that are out there (and I believe some are codes that appeared less than 10 years ago;). So whatever app is being run on a system, there is probably a code in CPU2006 which is not that dissimilar.

Tackling his core beef with the suite, that the rate metric "simply" runs multiple copies of the same code: this is actually harder work for the system than running a heterogeneous mix of applications. So I'd suggest that it is a better test of system capability than running some codes that stress memory bandwidth together with some other codes that are resident in the L1 cache. So IMO far from being "bogus", specrate is a very good indicator of potential system throughput.

Performance visualization discussion group

Saw this on Neil Gunther's blog, a discussion group for performance visualization. I've no idea whether it will be an interesting group or not, but there are certainly many possibilities for making performance data more readily understandable. So far there look to be about 12 folks signed up to the group.

One visualization tool that I think shows promise is chime, which is built upon dtrace.

Ruby information on developers.sun.com

There's now a centre specially for Ruby at developers.sun.com

Wednesday Mar 26, 2008

Debugging C++ with dbx

Here's the page on how to debug C++ exceptions using dbx. It omits my often repeated mantra about the -g flag. For C++ -g currently turns off 'front-end' inlining. This usually results in a significant loss in performance, but a benefit in the code being easier to read without the inlining. If you are interested in debugging the same code that you get without -g, or the faster code that does have inlining, then use the flag -g0 when building the application.

Balancing California's budget

A mail about Next 10 appeared on one of the aliases that I lurk on. The organisation aims to raise awareness of the state budget, and part of this is an on-line 'game' to balance the state's budget. The game presents various scenarios together with their projected costs. It's interesting to see some of the options that are available, and to learn a bit more about the history of the current budget and the discussion around the future options.

Tuesday Mar 25, 2008

Adding dtrace probes to user code (part 3)

I've previously discussed how to add dtrace USDT probes into user code. The critical step is to run the object files through dtrace, for dtrace to record the instrumentation points and to modify the object files prior to linking. The output of this step is an object file that also needs to be linked into the executable. Here's an example:

$ cc -O -c app.c
$ cc -O -c app1.c
$ dtrace -G -32 -s probes.d app.o app1.o
$ cc -O probes.o app.o app1.o

The results from running the example code under a suitable dtrace script are:

$ sudo dtrace -s script.d -c a.out
dtrace: script 'script.d' matched 10 probes
a=1, b=2
a=1, b=2
a=1, b=2
a=2, b=3
dtrace: pid 20655 has exited

                2                3                1
                1                2                3

One question that has come up is whether it's necessary to run a single call to dtrace which instruments all the object files, or whether it's possible to use multiple calls.

The object file that dtrace produces, probes.o, is going to be overwritten with each call to dtrace, so it's no surprise that the naive approach of multiple calls to dtrace, each generating the same object file, does not work:

$ dtrace -G -32 -s probes.d app.o
$ dtrace -G -32 -s probes.d app1.o
$ cc -O app.o app1.o probes.o
$ sudo dtrace -s script.d -c a.out
dtrace: script 'script.d' matched 9 probes
a=1, b=2
a=1, b=2
a=1, b=2
a=2, b=3
dtrace: pid 20725 has exited

                2                3                1
                1                2                2

The next thing to try is whether changing the generated object file works:

$ dtrace -G -32 -s probes.d -o probe0.o app.o
$ dtrace -G -32 -s probes.d -o probe1.o app1.o
$ cc -O probes.o app.o app1.o probe1.o
$ sudo dtrace -s script.d -c a.out
dtrace: script 'script.d' matched 9 probes
a=1, b=2
a=1, b=2
a=1, b=2
a=2, b=3
dtrace: pid 20673 has exited

                2                3                1
                1                2                2

And if we wanted more proof, swapping the order of the object files generates the following:

$ cc -O app.o app1.o probe1.o probe0.o
$ sudo dtrace -s script.d -c a.out
dtrace: script 'script.d' matched 1 probe
a=1, b=2
a=1, b=2
a=1, b=2
a=2, b=3
dtrace: pid 20683 has exited

                1                2                1

So the conclusion is that the only way it will work is by putting all the object files onto the command line of a single call to dtrace.

Conference schedule

The next two months are likely to be a bit hectic for me. I'm presenting at three different conferences, as well as a chat session in Second Life. So I figured I'd put the information up in case anyone reading this is also going to one or other of the events. So in date order:

I'll be talking about parallelisation at the various conferences, but the talks will be different. The multi-core expo talk focuses on microparallelisation. The ESC talk will probably be higher level, and the CommunityOne talk will probably be wider ranging, and I hope more interactive.

In the Second Life event I'll be talking about the book, although the whole idea of appearing is to do Q&A, so I hope that will be more of a discussion.

Sunday Mar 23, 2008

Old computers

My father surprised me by mailing a couple of links to 'old home computer' sites. The Obsolete Technology site, which fails to mention the Dragon 32, which I had for many years. And World of Spectrum - although I never had one. I think it means that he's clearing the garage. So anyone in the UK want a ZX-81, Dragon 32, or, I think, a Commodore Pet? ;)

I guess he missed old computers, vintage computers, our local computer history museum (which I think also lacks a Dragon), and the computer exhibit at Bletchley Park (I last visited there about 10 years ago, so can't say whether they also neglect the Dragon! ;).

Saturday Mar 22, 2008

Multiplication

Learning the times tables is a pain. We found this software, Timez Attack, which combines a 3D video game with multiplication practice. Certainly the game appeals to the kids, although I'm not certain that they 'learn' the multiplications.

Friday Mar 21, 2008

Programming for kids

A while ago I started looking for ways to get my oldest coding. My first machine was a ZX-81 with 1k of memory, and since most of this was used by the screen, there was a big incentive to learn assembler. I'm not out to force him into assembler programming, but...

I evaluated a number of possibilities, one was the Kid's Programming Language (or Phrogram) which can do some impressive things in few lines of code. A sample 3D space 'game' takes about 30 lines most of which look like:

	If IsKeyDown( Up ) Then
		Ship.TiltUP( moveAmount )
	End If

I also looked at squeak, but it didn't grab me as being easy to use.

An interesting alternative to real coding is c-jump, which is a programming board game. I'm not quite convinced by the syntax, or the jumping around the board.

The first thing I tried with him was Java. Which was pretty successful, but I couldn't just leave him to get on with it. There's quite a bit of syntax to have to handle. So while it was a success, it relied on me finding the time to work with him.

We then tried scratch. This has been quite successful for the following reasons:

  • It's all drag-and-drop, and the programming constructs are coloured/shaped so it's easy to put them together correctly.
  • It's all graphical, and the interface is very intuitive. You can see the object that you're programming.
  • It has an integrated graphics editor so he can draw his own sprites. Changing the look of a sprite is a step towards looking at the programming of the sprite and from there modifying the programming.
  • The biggest thing has been that he can work on this autonomously, I just have to see the end results.

The downside of scratch is that it seems a bit limited in what it can do. He really wants to do 3D games - so perhaps Phrogram is the next stop.

Any other recommendations for kids' programming?

Thursday Mar 20, 2008

The much maligned -fast

The compiler flag -fast gets an unfair rap. Even the compiler reports:

cc: Warning: -xarch=native has been explicitly specified, or 
implicitly specified by a macro option, -xarch=native on this 
architecture implies -xarch=sparcvis2 which generates code that 
does not run on pre UltraSPARC III processors

which is hardly fair given that the UltraSPARC III line came out about 8 years ago! So I want to quickly discuss what's good about the option, and what reasons there are to be cautious.

The first thing to talk about is the warning message. -xtarget=native is a good option to use when the target platform is also the deployment platform. For me, this is the common case, but for people producing applications that are more generally deployed, it's not the common case. The best thing to do to avoid the warning and produce binaries that work with the widest range of hardware is to add the flag -xtarget=generic after -fast (compiler flags are parsed from left to right, so the rightmost flag is the one that gets obeyed). The generic target represents a mix of all the important processors, and produces code that should work well on all of them.

The next option which is in -fast for C that might cause some concern is -xalias_level=basic. This tells the compiler to assume that pointers of different basic types (e.g. integers, floats etc.) don't alias. Most people code to this, and the C standard actually permits the compiler to make stronger assumptions about aliasing than this. So code that conforms to the C standard will work correctly with this option. Of course, it's still worth being aware that the compiler is making the assumption.

The final area is floating point simplification. That's the flags -fsimple=2 which allows the compiler to reorder floating point expressions, -fns which allows the processor to flush subnormal numbers to zero, and some other flags that use faster floating point libraries or inline templates. I've previously written about my rather odd views on floating point maths. Basically it comes down to: "If these options make a difference to the performance of your code, then you should investigate why they make a difference."

Since -fast contains a number of flags which impact performance, it's probably a good plan to identify exactly those flags that do make a difference, and use only those. A tool like ats can really help here.

Performance tuning recipe

Dan Berger posted a comment about the compiler flags we'd used for Ruby. Basically, we've not done compiler flag tuning yet, so I'll write a quick outline of the first steps in tuning an application.

  • First of all profile the application with whatever flags it usually builds with. This is partly to get some idea of where the hot code is, but it's also useful to have some idea of what you're starting with. The other benefit is that it's tricky to debug build issues if you've already changed the defaults.
  • It should be pretty easy at this point to identify the build flags. Probably they will flash past on the screen, or in the worst case, they can be extracted (from non-stripped executables) using dumpstabs or dwarfdump. It can be necessary to check that the flags you want to use are actually the ones being passed into the compiler.
  • Of course, I'd certainly use spot to get all the profiles. One of the features spot has that is really useful is to archive the results, so that after multiple runs of the application with different builds it's still possible to look at the old code, and identify the compiler flags used.
  • I'd probably try -fast, which is a macro flag, meaning it enables a number of optimisations that typically improve application performance. I'll probably post later about this flag, since there's quite a bit to say about it. Performance under the flag should give an idea of what to expect with aggressive optimisations. If the flag does improve the application's performance, then you probably want to identify the exact options that provide the benefit and use those explicitly.
  • In the profile, I'd be looking for routines which seem to be taking too long, and I'd be trying to figure out what was causing the time to be spent. For SPARC, the execution count data from bit that's shown in the spot report is vital in being able to distinguish between code that runs slowly and code that runs fast but is executed many times. I'd probably try various other options based on what I saw. Memory stalls might make me try prefetch options, TLB misses would make me tweak the -xpagesize options.
  • I'd also want to look at using crossfile optimisation, and profile feedback, both of which tend to benefit all codes, but particularly codes which have a lot of branches. The compiler can make pretty good guesses on loops ;)

These directions are more a list of possible experiments than necessarily an item-by-item checklist, but they form a good basis. And they are not an exhaustive list...

Cross-linking support in Nevada

Ali Bahrami has just written an interesting post about cross-linking support going into Nevada. This is the facility to enable the linking of SPARC object files to produce SPARC executables on an x86 box (or the other way around).
