Darryl Gove's blog
OpenMP 3.0 specification released
The specification for OpenMP 3.0 has been put up on the OpenMP.org website. Using the previous OpenMP 2.5 standard, there are basically two supported modes of parallelisation:
- Splitting a loop over multiple threads - each thread is responsible for a range of the iterations.
- Splitting a serial code into sections - each thread executes a section of code.
The large change with OpenMP 3.0 is the introduction of tasks, where a thread can spawn a task to be completed by another thread at an unspecified point in the future. This should make OpenMP amenable to many more situations. An example of using tasks looks like:
node * p = head;
while (p)
{
#pragma omp task
{
process(p);
}
p = p->next;
}
The master thread iterates over the linked list, generating a task to process each element in the list. The braces around the call to process(p) are unnecessary, but hopefully clarify what's happening.
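For anyone wanting to try this, here is a minimal sketch of how the fragment might sit inside a complete routine. The node layout, the body of process(), and the demo_sum() driver are my own assumptions for illustration, not part of the specification's example. The tasks are generated inside a parallel region by a single thread, and p is marked firstprivate so that each task captures the pointer value it was created with.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; the layout is an assumption for this sketch. */
typedef struct node {
    int value;
    struct node *next;
} node;

/* Hypothetical per-element work: double the stored value. */
static void process(node *p)
{
    assert(p != NULL);
    p->value *= 2;
}

/* Walk the list on one thread, generating one task per element. */
void process_list(node *head)
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            node *p = head;
            while (p) {
                /* firstprivate: the task keeps its own copy of p */
                #pragma omp task firstprivate(p)
                process(p);
                p = p->next;
            }
        }
    }   /* barrier at the end of the region: all tasks have completed */
}

/* Small driver: builds 1 -> 2 -> 3, processes it, returns the sum. */
int demo_sum(void)
{
    node c = {3, NULL};
    node b = {2, &c};
    node a = {1, &b};
    process_list(&a);
    return a.value + b.value + c.value;
}
```

If the compiler is run without OpenMP support the pragmas are ignored and the list is simply processed serially, so the result is the same either way.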
Posted at 11:10AM May 13, 2008 by Darryl Gove in Sun | Comments[0]
Slides for CommunityOne
All the slides for last week's CommunityOne conference are available for download. I was presenting in the CMT stream; you can find my slides here. Note that to download the slides, you'll need to use the username and password shown on the page.
My talk was on parallelisation: what's supported by the compiler, the steps to do it, and the tools that support it. I ended with an overview of microparallelisation.
Posted at 09:56PM May 12, 2008 by Darryl Gove in Sun | Comments[2]
When to use membars
membar instructions are SPARC assembly language instructions that enforce memory ordering. They tell the processor to ensure that memory operations are completed before it continues execution. However, the basic rule is that the instructions are usually only necessary in "unusual" circumstances - which fortunately will mean that most people don't encounter them.
The UltraSPARC Architecture manual documents the situation very well in section 9.5. It gives these rules which cover the default behaviour:
- Each load instruction behaves as if it were followed by a MEMBAR #LoadLoad and #LoadStore.
- Each store instruction behaves as if it were followed by a MEMBAR #StoreStore.
- Each atomic load-store behaves as if it were followed by a MEMBAR #LoadLoad, #LoadStore, and #StoreStore.
There's a table in section 9.5.3 which covers when membars are necessary. Basically, membars are necessary for ordering of block loads and stores, and for ordering non-cacheable loads and stores. There is an interesting note where it indicates that a membar is necessary to order a store followed by a load to a different address; if the address is the same the load will get the correct data. This at first glance seems odd - why worry about whether the store is complete if the load is of independent data? However, I can imagine this being useful in situations where the same physical memory is mapped using different virtual address ranges - not something that happens often, but could happen in the kernel.
As a footnote, the equivalent x86 instruction is the mfence. There's a good discussion of memory ordering in section 7.2 of the Intel Systems Programming Guide.
There's some more discussion of this topic on Dave Dice's weblog.
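As a rough illustration of the store-followed-by-load case, here is a sketch in portable C11 rather than SPARC assembly. It is an assumption on my part that this is a fair stand-in: a sequentially consistent fence is typically implemented with a membar #StoreLoad on SPARC and with an mfence (or a locked instruction) on x86. The function and flag names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flags for a two-thread handshake (names are illustrative). */
static atomic_int flag_a;
static atomic_int flag_b;

/*
 * Announce our intent by storing to one flag, then check the other flag.
 * Without the fence the load may be satisfied before the earlier store
 * is visible to the other thread, because the two addresses differ.
 */
int announce_and_check(atomic_int *mine, atomic_int *theirs)
{
    assert(mine != theirs);   /* the interesting case is different addresses */
    atomic_store_explicit(mine, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* store -> load ordering */
    return atomic_load_explicit(theirs, memory_order_relaxed);
}
```

With two threads each calling this with the flags swapped, the fence is what prevents both of them from reading zero.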
Posted at 11:08AM May 07, 2008 by Darryl Gove in Sun | Comments[0]
Official reschedule notice for CommunityOne
Session ID: S297077 Session Title: Techniques for Utilizing CMT Track: Chip Multithreading (CMT): OpenSPARC™ Room: Esplanade 302 Date: 2008-05-05 Start Time: 13:30
The official timetable has also been updated.
Posted at 12:13AM May 03, 2008 by Darryl Gove in Sun | Comments[0]
OpenSolaris Summit
I'm stepping in to present at the OpenSolaris Summit on Sunday. The presentation is titled "Optimizing for OpenSolaris", and I believe someone else has already prepared the slideset - we'll see. Anyway, I'm looking forward to going; the attendees list contains many familiar names.
Posted at 11:57AM May 02, 2008 by Darryl Gove in Sun | Comments[0]
Embedded Systems Conference Presentation
I got the opportunity to present at the Embedded Systems Conference in San Jose a couple of weeks back. My presentation covered parallelising a serial application: a quick tour of what to do, together with an overview of the tools that Sun Studio provides to help out. The presentation is now available on the OpenSPARC website.
Posted at 08:18AM May 02, 2008 by Darryl Gove in Sun | Comments[0]
Good overview of tools
I've just noticed this blog which is doing a very nice series of entries about the tools available on the tools cd and from the dtrace toolkit as well as the tools that ship with the system.
Posted at 09:35AM May 01, 2008 by Darryl Gove in Sun | Comments[0]
CommunityOne Panel and reschedule
I've heard that my session at CommunityOne is now scheduled from 1:30. The panel session that was scheduled for that time has been shifted to the 11:00am slot. However, the timetable currently up on the site does not reflect that change. I've been invited to appear on the panel - which I'm looking forward to. See you there!
Posted at 12:05AM May 01, 2008 by Darryl Gove in Sun | Comments[0]
Multicore expo available - Microparallelisation
My presentation "Strategies for improving the performance of single threaded codes on a CMT system" has been made available on the OpenSPARC site.
The presentation discusses "microparallelisation", in the context of parallelising an example loop. Microparallelisation is the aim of obtaining parallelism through assigning small chunks of work to discrete processors. Taking a step back...
With traditional parallelisation the idea is to identify large chunks of work that can be split between multiple processors. The chunks of work need to be large to amortise the synchronisation costs. This usually means that the loops have a huge trip count.
The synchronisation costs are derived from the time it takes to signal that a core has completed its work. The lower the synchronisation costs, the smaller the amount of work needed to make parallelisation profitable.
Now, a CMT processor has two big advantages here. First of all it has many threads. Secondly these threads have low latency access to a shared level of cache. The result of this is that the cost of synchronisation between threads is greatly reduced, and therefore each thread is free to do a smaller chunk of work in a parallel region.
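To make the trade-off concrete, here is a small sketch using an OpenMP loop with an if clause, so the loop only runs in parallel when the trip count is large enough to amortise the synchronisation cost. This is not the code from the presentation; the routine and the PARALLEL_THRESHOLD value are hypothetical, and the right cut-off would have to be measured on the target machine. On a CMT system the argument above suggests the threshold could be set much lower.

```c
#include <assert.h>

/* Hypothetical cut-off below which the loop is left serial. */
#define PARALLEL_THRESHOLD 10000

/* Sum of i*i for i = 1..n, parallelised only when n is large enough. */
long long sum_of_squares(long long n)
{
    long long total = 0;
    long long i;

    assert(n >= 0);

    #pragma omp parallel for reduction(+:total) if(n >= PARALLEL_THRESHOLD)
    for (i = 1; i <= n; i++) {
        total += i * i;
    }
    return total;
}
```

Lowering PARALLEL_THRESHOLD and timing the result is a quick way to see where the synchronisation cost stops dominating.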
All that's great in theory. The presentation uses some example code to try this out, and discovers, rather fortunately, that the idea also works in practice!
The presentation also covers using atomic operations rather than microparallelisation.
In summary the presentation is more research than solid science, but I hoped that presenting it would get some people thinking about non-traditional ways to extract parallelism from applications. I'm not alone in this area of work; Lawrence Spracklen is also working on it. We're both presenting at CommunityOne next week.
Posted at 09:00PM Apr 29, 2008 by Darryl Gove in Sun | Comments[0]
static and inline functions
Hit a problem when compiling a library. The problem is with mixing static and inline functions, which is not allowed by the standard, but is allowed by gcc. Example code looks like:
char * c;
static void foo(char *);
inline void work()
{
foo(c);
}
void foo(char* c)
{
}
When this code is compiled it generates the following error:
% cc s.c
"s.c", line 7: reference to static identifier "foo" in extern inline function
cc: acomp failed for s.c
It turns out that there is a workaround for this problem, which is the flag -features=no%extinl. Douglas Walls describes the issue in much more detail.
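If changing the source is an option, another approach is to give the inline function internal linkage too, since the restriction in the C standard only applies to inline functions with external linkage. Here's a sketch of that, assuming work() is only needed within the one file. The buffer, the call counter, and work_twice() are hypothetical additions so that the example does something observable.

```c
#include <assert.h>
#include <stddef.h>

static char buffer[] = "data";   /* hypothetical data for the sketch */
char *c = buffer;

static int calls;                /* counts calls to foo(), for illustration */

static void foo(char *s);

/* static inline: internal linkage, so referring to static foo() is allowed */
static inline void work(void)
{
    foo(c);
}

static void foo(char *s)
{
    assert(s != NULL);
    calls++;
}

/* Hypothetical driver: calls work() twice and reports the call count. */
int work_twice(void)
{
    work();
    work();
    return calls;
}
```

The catch is that work() is no longer visible to other files, so this only helps when nothing outside the file needs to call it.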
Posted at 10:03AM Apr 28, 2008 by Darryl Gove in Sun | Comments[0]
Second life slides and script
Just completed the Second Life presentation. It appeared well attended, and I got a bundle of great questions at the end. If you were there, thank you! I've uploaded a screen shot that I managed to get before the presentation started. Unfortunately, I didn't get a picture of the stage setup with the life-size books, a very nice touch.
Posted at 10:59AM Apr 24, 2008 by Darryl Gove in Sun |
Time and location for CommunityOne
I've just had a mail giving time and location for my presentation at CommunityOne.
Session ID: S297077 Session Title: Techniques for Utilizing CMT Track: Chip Multithreading (CMT): OpenSPARC™ Room: Esplanade 302 Date: 2008-05-05 Start Time: 11:00
See you there!
Posted at 03:03PM Apr 23, 2008 by Darryl Gove in Sun |
Presentation and Q&A in second life tomorrow
I'll be in Second Life tomorrow talking about the book. The session is at 9am PST in the Andromeda Theatre. I've got a small set of slides describing the contents of the book (Here's the full ToC for the book). After the slides I'll be sticking around to answer questions.
Posted at 01:33PM Apr 23, 2008 by Darryl Gove in Sun |
Solaris Grid Compiler
I've only just heard that the Solaris Grid Compiler has been released as a cooltool! The idea of this is to set up a batch of machines as a compiler grid - so large compilations can be dispatched to the grid to better utilise the machines, and reduce compilation time.
Posted at 08:58PM Apr 10, 2008 by Darryl Gove in Sun |
Congratulations Will, and thank you!
Will Zhang just received an SDN award for leading the effort to translate Solaris Application Programming into Chinese. Thanks, Will!
Posted at 12:04PM Apr 07, 2008 by Darryl Gove in Sun |
Atomic operations and memory ordering
Nice article on the developer portal that discusses memory ordering and atomic operations. It's worth noting that Solaris 10 has a number of atomic operations implemented in libc.
Posted at 10:10AM Apr 04, 2008 by Darryl Gove in Sun | Comments[4]
Search terms
A while back, AOL released a bundle of search data, which turned out to be rather controversial. The data has been put up as a searchable archive. The top keywords are interesting - coming in at number 9 is "http", number 23 is "m"! Most of the terms seem to be just looking for a company website (e.g. the top search term is "google").
Posted at 12:10AM Mar 30, 2008 by Darryl Gove in Personal |
SPEC CPU2006 discussions
I recently read a couple of posts about SPEC CPU2006. As you can tell from the papers linked on this blog, I was quite busy helping prepare the suite - which was considerable fun. The first post is by Tom Yager, where he praises the suite for raising awareness of the components of a system that actually contribute to performance: "I added a practical angle to my scientific understanding of compiler optimizations, processor scheduling, CPU cache utilization".
On the other hand Neil Gunther (second time I've mentioned him) condemns "bogus SPECxx_rate benchmarks which simply run multiple instances of a single-threaded benchmark". I hope he's joking, but taking his comments at face value...
Interestingly he suggests SPEC SDM as a good choice. I'd not heard of this suite, but reading up on it it looks like it tests the impact of multiple users typing and executing commands on the system at the same time, and it's not been touched in over 10 years. I guess SDM would be a good match for the SunRay that I use daily, but I'm certain that the suite doesn't include the 22 copies of firefox that I currently see running on the server I'm using. On only slightly less rocky ground he talks about TPC-C, which appeared in 1992!
CPU2006 represents the CPU intensive portion very well, but deliberately tries to avoid hitting the disk or network. Since disk and network do play significant roles in system performance, I probably wouldn't recommend getting a machine purely on its SPECcpu2006 or SPECcpu2006_rate scores. However, the mix of apps in the benchmark suite is representative of most of the codes that are out there (and I believe some are codes that appeared less than 10 years ago). So whatever app is being run on a system, there is probably a code in CPU2006 which is not that dissimilar.
Tackling his core beef with the suite, that the rate metric "simply" runs multiple copies of the same code: this is actually harder work for the system than running a heterogeneous mix of applications. So I'd suggest that it is a better test of system capability than running some codes that stress memory bandwidth together with some other codes that are resident in the L1 cache. So IMO far from being "bogus", specrate is a very good indicator of potential system throughput.
Posted at 02:17PM Mar 27, 2008 by Darryl Gove in Sun |
Performance visualization discussion group
Saw this on Neil Gunther's blog, a discussion group for performance visualization. I've no idea whether it will be an interesting group or not, but there are certainly many possibilities for making performance data more readily understandable. So far there look to be about 12 folks signed up to the group.
One visualization tool that I think shows promise is chime, which is built upon dtrace.
Posted at 12:08PM Mar 27, 2008 by Darryl Gove in Sun |
Ruby information on developers.sun.com
There's now a centre specially for Ruby at developers.sun.com
Posted at 10:53AM Mar 27, 2008 by Darryl Gove in Sun |
Debugging C++ with dbx
Here's the page on how to debug C++ exceptions using dbx. It omits my often repeated mantra about the -g flag. For C++, -g currently turns off 'front-end' inlining. This usually results in a significant loss in performance, but with the benefit that the code is easier to read without the inlining. If you are interested in debugging the same code that you get without -g, that is, the faster code that does have inlining, then use the flag -g0 when building the application.
Posted at 08:51PM Mar 26, 2008 by Darryl Gove in Sun |
Balancing California's budget
A mail about Next 10 appeared on one of the aliases that I lurk on. The organisation aims to raise awareness of the state budget, and part of this is an on-line 'game' to balance the state's budget. The game presents various scenarios together with their projected costs. It's interesting to see some of the options that are available, and to learn a bit more about the history of the current budget and the discussion around the future options.
Posted at 08:15AM Mar 26, 2008 by Darryl Gove in Personal |
Adding dtrace probes to user code (part 3)
I've previously discussed how to add dtrace USDT probes into user code. The critical step is to run the object files through dtrace, for dtrace to record the instrumentation points and to modify the object files prior to linking. The output of this step is an object file that also needs to be linked into the executable. Here's an example:
$ cc -O -c app.c
$ cc -O -c app1.c
$ dtrace -G -32 -s probes.d app.o app1.o
$ cc -O probes.o app.o app1.o
The results from running the example code under a suitable dtrace script are:
$ sudo dtrace -s script.d -c a.out
dtrace: script 'script.d' matched 10 probes
a=1, b=2
a=1, b=2
a=1, b=2
a=2, b=3
dtrace: pid 20655 has exited
2 3 1
1 2 3
One question that has come up is whether it's necessary to run a single call to dtrace which instruments all the object files, or whether it's possible to use multiple calls.
The object file that dtrace produces, probes.o, is going to be overwritten with each call to dtrace, so it's no surprise that the naive approach of multiple calls to dtrace, each call generating the same object file, does not work:
$ dtrace -G -32 -s probes.d app.o
$ dtrace -G -32 -s probes.d app1.o
$ cc -O app.o app1.o probes.o
$ sudo dtrace -s script.d -c a.out
dtrace: script 'script.d' matched 9 probes
a=1, b=2
a=1, b=2
a=1, b=2
a=2, b=3
dtrace: pid 20725 has exited
2 3 1
1 2 2
The next thing to try is whether changing the generated object file works:
$ dtrace -G -32 -s probes.d -o probe0.o app.o
$ dtrace -G -32 -s probes.d -o probe1.o app1.o
$ cc -O probes.o app.o app1.o probe1.o
$ sudo dtrace -s script.d -c a.out
dtrace: script 'script.d' matched 9 probes
a=1, b=2
a=1, b=2
a=1, b=2
a=2, b=3
dtrace: pid 20673 has exited
2 3 1
1 2 2
And if we wanted more proof, swapping the order of the object files generates the following:
$ cc -O app.o app1.o probe1.o probe0.o
$ sudo dtrace -s script.d -c a.out
dtrace: script 'script.d' matched 1 probe
a=1, b=2
a=1, b=2
a=1, b=2
a=2, b=3
dtrace: pid 20683 has exited
1 2 1
So the conclusion is that the only way it will work is by putting all the object files onto the command line of a single call to dtrace.
Posted at 04:49PM Mar 25, 2008 by Darryl Gove in Sun |
Conference schedule
The next two months are likely to be a bit hectic for me. I'm presenting at three different conferences, as well as a chat session in Second Life. So I figured I'd put the information up in case anyone reading this is also going to one or other of the events. So in date order:
- Multi-core expo. 2nd April 2:35pm
- Embedded Systems Conference; 16th April 8:30am.
- Second life. 24th April 9am PST.
- CommunityOne. 5th May Time:TBD
I'll be talking about parallelisation at the various conferences, but the talks will be different. The multi-core expo talk focuses on microparallelisation. The ESC talk will probably be higher level, and the CommunityOne talk will probably be wider ranging, and I hope more interactive.
In the Second Life event I'll be talking about the book, although the whole idea of appearing is to do Q&A, so I hope that will be more of a discussion.
Posted at 02:13PM Mar 25, 2008 by Darryl Gove in Sun |
Old computers
My father surprised me by mailing a couple of links to 'old home computer' sites. The Obsolete Technology site, which fails to mention the Dragon 32, which I had for many years. And World of Spectrum - although I never had one. I think it means that he's clearing the garage. So anyone in the UK want a ZX-81, Dragon 32, or, I think, a Commodore Pet? 
I guess he missed old computers, vintage computers, our local computer history museum (which I think also lacks a Dragon), and the computer exhibit at Bletchley Park (I last visited there about 10 years ago, so can't say whether they also neglect the Dragon!).
Posted at 02:13PM Mar 23, 2008 by Darryl Gove in Personal |
Multiplication
Learning the times tables is a pain. We found this software, Timez Attack, which combines a 3D video game with multiplication practice. Certainly the game appeals to the kids, although I'm not certain that they 'learn' the multiplications.
Posted at 10:08AM Mar 22, 2008 by Darryl Gove in Personal |
Programming for kids
A while ago I started looking for ways to get my oldest coding. My first machine was a ZX-81 with 1k of memory, and since most of this was used by the screen, there was a big incentive to learn assembler. I'm not out to force him into assembler programming, but...
I evaluated a number of possibilities, one was the Kid's Programming Language (or Phrogram) which can do some impressive things in few lines of code. A sample 3D space 'game' takes about 30 lines most of which look like:
If IsKeyDown( Up ) Then Ship.TiltUP( moveAmount ) End If
I also looked at squeak, but it didn't grab me as being easy to use.
An interesting alternative to real coding is c-jump, which is a programming board game. I'm not quite convinced by the syntax, or the jumping around the board.
The first thing I tried with him was Java, which was pretty successful, but I couldn't just leave him to get on with it. There's quite a bit of syntax to have to handle. So while it was a success, it relied on me finding the time to work with him.
We then tried scratch. This has been quite successful for the following reasons:
- It's all drag-and-drop, and the programming constructs are coloured/shaped so it's easy to put them together correctly.
- It's all graphical, and the interface is very intuitive. You can see the object that you're programming.
- It has an integrated graphics editor so he can draw his own sprites. Changing the look of a sprite is a step towards looking at the programming of the sprite and from there modifying the programming.
- The biggest thing has been that he can work on this autonomously, I just have to see the end results.
The downside of scratch is that it seems a bit limited in what it can do. He really wants to do 3D games - so perhaps Phrogram is the next stop.
Any other recommendations for kids programming?
Posted at 12:11PM Mar 21, 2008 by Darryl Gove in Personal | Comments[2]
The much maligned -fast
The compiler flag -fast gets an unfair rap. Even the compiler reports:
cc: Warning: -xarch=native has been explicitly specified, or implicitly specified by a macro option, -xarch=native on this architecture implies -xarch=sparcvis2 which generates code that does not run on pre UltraSPARC III processors
which is hardly fair given the UltraSPARC III line came out about 8 years ago! So I want to quickly discuss what's good about the option, and what reasons there are to be cautious.
The first thing to talk about is the warning message. -xtarget=native is a good option to use when the build platform is also the deployment platform. For me, this is the common case, but for people producing applications that are more generally deployed, it's not. The best thing to do to avoid the warning and produce binaries that work with the widest range of hardware is to add the flag -xtarget=generic after -fast (compiler flags are parsed from left to right, so the rightmost flag is the one that gets obeyed). The generic target represents a mix of all the important processors, and produces code that should work well on all of them.
The next option which is in -fast for C that might cause some concern is -xalias_level=basic. This tells the compiler to assume that pointers of different basic types (e.g. integers, floats etc.) don't alias. Most people code to this, and the C standard actually has higher demands on the level of aliasing the compiler can assume. So code that conforms to the C standard will work correctly with this option. Of course, it's still worth being aware that the compiler is making the assumption.
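A small sketch of what the assumption means in practice. The function names are hypothetical, and the bit pattern mentioned in the note below assumes a 32-bit IEEE 754 float. In store_both() the compiler is entitled to assume that the int and the float do not overlap, so it can return 1 without reloading. The float_bits() routine shows the safe way to look at an object through a different type, which is to copy the bytes rather than cast the pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * With type-based aliasing the compiler may assume ip and fp point to
 * different objects, so the store through fp cannot change *ip.
 */
int store_both(int *ip, float *fp)
{
    *ip = 1;
    *fp = 2.0f;
    return *ip;
}

/*
 * Code that breaks the assumption typically looks like
 *     uint32_t bits = *(uint32_t *)&f;
 * Copying the bytes gives the same answer without the aliasing problem.
 */
uint32_t float_bits(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

On a machine with IEEE 754 single precision, float_bits(1.0f) gives 0x3f800000.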
The final area is floating point simplification. That's the flags -fsimple=2, which allows the compiler to reorder floating point expressions, -fns, which allows the processor to flush subnormal numbers to zero, and some other flags that use faster floating point libraries or inline templates. I've previously written about my rather odd views on floating point maths. Basically it comes down to this: if these options make a difference to the performance of your code, then you should investigate why they make a difference.
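To show why reordering floating point expressions matters, here's a sketch with values I picked deliberately to exaggerate the effect. Floating point addition is not associative, so the two routines below give different answers for the same three inputs when the expressions are evaluated exactly as written. The function names are hypothetical.

```c
#include <assert.h>

/* Evaluate (a + b) + c exactly as written. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Evaluate a + (b + c) exactly as written. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}

/*
 * These hold when the expressions are evaluated in the order written.
 * A compiler that has been allowed to reassociate may break them.
 */
void check_strict_evaluation(void)
{
    /* The large values cancel first, so the 1.0 survives. */
    assert(sum_left(1e30, -1e30, 1.0) == 1.0);
    /* The 1.0 is absorbed by -1e30 before the cancellation. */
    assert(sum_right(1e30, -1e30, 1.0) == 0.0);
}
```

Real codes rarely show differences this dramatic, but it's the same effect on a smaller scale.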
Since -fast contains a number of flags which impact performance, it's probably a good plan to identify exactly those flags that do make a difference, and use only those. A tool like ats can really help here.
Posted at 03:00PM Mar 20, 2008 by Darryl Gove in Sun |
Performance tuning recipe
Dan Berger posted a comment about the compiler flags we'd used for Ruby. Basically, we've not done compiler flag tuning yet, so I'll write a quick outline of the first steps in tuning an application.
- First of all profile the application with whatever flags it usually builds with. This is partly to get some idea of where the hot code is, but it's also useful to have some idea of what you're starting with. The other benefit is that it's tricky to debug build issues if you've already changed the defaults.
- It should be pretty easy at this point to identify the build flags. Probably they will flash past on the screen, or in the worst case, they can be extracted (from non-stripped executables) using dumpstabs or dwarfdump. It can be necessary to check that the flags you want to use are actually the ones being passed into the compiler.
- Of course, I'd certainly use spot to get all the profiles. One of the features spot has that is really useful is to archive the results, so that after multiple runs of the application with different builds it's still possible to look at the old code, and identify the compiler flags used.
- I'd probably try -fast, which is a macro flag, meaning it enables a number of optimisations that typically improve application performance. I'll probably post later about this flag, since there's quite a bit to say about it. Performance under the flag should give an idea of what to expect with aggressive optimisations. If the flag does improve the application's performance, then you probably want to identify the exact options that provide the benefit and use those explicitly.
- In the profile, I'd be looking for routines which seem to be taking too long, and I'd be trying to figure out what was causing the time to be spent. For SPARC, the execution count data from bit that's shown in the spot report is vital in being able to distinguish between code that runs slowly and code that runs fast but is executed many times. I'd probably try various other options based on what I saw. Memory stalls might make me try prefetch options, TLB misses would make me tweak the -xpagesize options.
- I'd also want to look at using crossfile optimisation, and profile feedback, both of which tend to benefit all codes, but particularly codes which have a lot of branches. The compiler can make pretty good guesses on loops

These directions are more a list of possible experiments than necessarily an item-by-item checklist, but they form a good basis. And they are not an exhaustive list...
Posted at 10:10AM Mar 20, 2008 by Darryl Gove in Sun |
Cross-linking support in Nevada
Ali Bahrami has just written an interesting post about cross-linking support going into Nevada. This is the facility to enable the linking of SPARC object files to produce SPARC executables on an x86 box (or the other way around).
Posted at 09:47AM Mar 20, 2008 by Darryl Gove in Sun |
