Thursday Apr 02, 2009

It is with great pleasure that I announce the availability of 41 OpenMP example programs. All these examples are introduced and discussed in the book "Using OpenMP".

 Cover page of the book "Using OpenMP"

The sources are available as a free download under the BSD license. Each source comes with a copy of the license. Please do not remove this.

The zip file that contains everything needed (sources, example make files for various compilers, and a brief user's guide) can be downloaded here

Although it was tempting at times to deviate, the core part of each example has not been changed compared to what is shown in the book.

You are encouraged to try out these examples and perhaps use them as a starting point to better understand and possibly further explore OpenMP. 

Simplicity was an important goal for this project. For example, each source file constitutes a full working program. Other than a compiler and run time environment that support OpenMP, nothing else is needed to compile and run the program(s).

With the exception of one example, there are no source code comments. Not only are these examples very straightforward, they are also discussed in the above mentioned book. 

As a courtesy, each source directory has a straightforward make file called "Makefile". This file can be used to build and run the examples in the specific directory. Before doing so, you need to activate the appropriate include line in file Makefile. There are include files for several compilers on Unix based Operating Systems (Linux, Solaris and Mac OS to be precise). These files have been put together on a best effort basis.

The User's Guide that is bundled with the examples explains the compiler specific include file for "make" in more detail.

I would like to encourage you to submit your feedback, questions, suggestions, etc to the forum "Using OpenMP - The Book and Examples" on http://www.openmp.org.

Enjoy!

Tuesday Mar 31, 2009

The RWTH Aachen University in Aachen, Germany, organized and hosted the first "Parallel Programming in Computational Science and Engineering" (PPCES) HPC tutorials series. It was held March 23-27, 2009. I participated as well as presented several times throughout the week.

This tutorial was a natural follow on to the "SunHPC" workshops held from 2001-2007 and the combined SunHPC 2008 and VI-HPS event in 2008. 

This first PPCES tutorial week was very well attended and the group also actively participated. Many of the talks have been recorded and will appear online. I'll add a link when they are available.

Below I include some pictures of Aachen, with its beautiful historic old city.

Parts of the outer wall are still present. This is the Pontwall, near the Pontstraße, where one enters the city from the A4 (Aachen-Laurensberg exit). The second picture was made on the backside, facing the city.

Pontwall, Aachen

Pontwall, Aachen

The market square is the most prominent place in the old city. There are several restaurants and shops, but literally the most visible building is the huge and beautiful town hall. Below is a picture of this building as seen from the square, plus a shot taken from the back side.

Aachen Town Hall

Aachen Town Hall, view from the back side

The huge cathedral is a true landmark and on the UNESCO world heritage list. It is also the burial site of Charlemagne. On the first picture below it can be seen on the left. The second picture has a more up close view. The third picture was taken from the other side. There is a small square between the town hall and cathedral. That's where this picture was made.

Cathedral is seen on the left

Aachen cathedral up close

Aachen cathedral as seen from the other side

The picture below was taken in one of the small streets near the big market square. It was taken on the only day the weather was relatively good while I was there. It was somewhat chilly and windy, but shielded from the wind one could sit outside, as shown by the people at the end of this street.

Small street near the market square

This fountain is very funny to see and a great attraction for children in particular. It is very fascinating to them that you can turn the hands around. This fountain is in another fairly narrow street, connecting the market square and cathedral. On busy days it can be really crowded here.

Fountain near the cathedral

The picture below was made while I stood on the small balcony in front of the main entrance to the town hall. On the left you can see one of my favorite places there. It is a fixed stop each day I walk to the RWTH.

View on the market square, standing on the stairs of the town hall

One of the nice other things about Aachen is the choice of restaurants.  One of my favorites is the "Best Friends" restaurant in the Pontstraße. It offers a variety of Asian dishes and I really enjoyed the Bento Box. The picture below was made when I went there with a couple of friends. No comments necessary I think.

Best Friends restaurant in the Pontstrasse

Dieter an Mey and his team at the Computer Centre of the RWTH always do a great job in general, but they also select really good places for the social dinner. We've been at the Kazan restaurant a couple of times before and have never been disappointed regarding the food and the service. Below is a picture of the restaurant, followed by a live action picture, shot by Agnes Mendes from the RWTH.

The Kazan restaurant

The social dinner

Tuesday Mar 17, 2009

I was in Houston, Texas, the week of March 7-14, 2009. The purpose of this trip was twofold. I was going to visit Barbara Chapman's Computer Science group at the University of Houston, as well as give an OpenMP class at Texas Instruments. In this blog I would like to share my impressions and pictures made during my stay there.

I've been in Houston several times now, but I continue to be amazed to see the indoor ice skating rink in the middle of the huge Galleria shopping mall. Given how hot and humid it typically is outside, it is fascinating to see people skating inside. In good US tradition there is a food court around the rink. It actually makes for an entertaining view while eating.

Ice skating rink at the Galleria Houston

I stayed in a nice hotel in Sugar Land on the southwest side of Houston. The hotel is along the Southwest Freeway. The name of this part of Houston is kind of charming. It dates back to the days of sugarcane plantations. Below are two pictures of the hotel, as well as two of the town hall.

Lobby of my hotel

Lobby of my hotel

Sugar Land Town Hall

Sugar Land Town Hall

It is always a pleasure to visit the University of Houston campus. I like the way it has been organized, as well as the lawns, the trees and the walkways. Below are some impressions. The second picture shows the Philip Guthrie Hoffman Hall, the home of the Computer Science department. The third picture shows the library.

Main entrance UH campus

PGH Building

UH Library

Main entrance to the UH campus

Barbara and I met several times. I also had an interesting discussion session with her students and staff members. The social side of this visit was not overlooked either. Below are pictures of the joint dinners we had. The first one was made at a dinner we had in a local pub called "Cafe Adobe" in Sugar Land. The second one was made in the "Mo Mong" Vietnamese restaurant on Westheimer.

Dinner at a pub in Sugar Land

Dinner at a Vietnamese restaurant

Wednesday evening we went to the Houston Livestock Show and Rodeo. This is a very big and lively event that was held March 3-22, 2009, in the Reliant Stadium. Prior to going there we went for dinner at a Texas Barbeque place on Kirby Street. Below is a picture of the restaurant, as well as a group picture.

Goode Company Texas Barbeque on Kirby Street

Group picture at Texas BBQ

Next to the BBQ restaurant was a place I recognized from a previous visit to Houston. It is called "Goode's Armadillo Palace". There is a huge armadillo in front of it and I could not resist taking a picture of it.

Goode's Armadillo Palace in Houston

This was my first rodeo experience. Quite entertaining and very efficiently organized. There were short sessions with various contests. With some of those you had to be really quick to watch. It could be over in a few seconds, but that is after all the whole purpose of these contests.

Below is a picture of the stadium, plus an early evening view of Houston. Our seats were very high up. The upside was that the outside view from there was really nice. The boots shown in the third picture attracted a lot of attention from people who wanted to have their picture taken with them.

Reliant Stadium in Houston

View on Houston from the stadium

Texas boots

Below are four pictures made at the show. In the first one, you can clearly see how the bull is mastered by the cowboy. The cowboy in the second one is not so fortunate, as he is about to fall off his horse. Luckily he did not seem to have any serious injuries. The third picture shows the wagon races. Quite spectacular. It reminded me of what the definition of "horsepower" stands for. In the fourth picture the podium for the concert is wheeled in.

Cowboy wins

Cowboy loses

Wagon races

Preparations for the concert

The last picture shows the Texas Instruments building where the OpenMP training was held. The turnout was really impressive and I very much enjoyed the discussions, as well as conversations with the attendees. They had really good and detailed questions.

TI building 

 

Saturday Feb 28, 2009

Unfortunately, the September 5, 2008 blog titled "The OpenMP Concurrency Platform" written by Charles Leiserson from Cilk Arts repeats some of the persistent myths regarding OpenMP.

Certain comments made may also give rise to a somewhat distorted view of OpenMP for those readers who are less familiar with parallel programming. One example is the statement that OpenMP is suitable for loops only. This has never been the case, and the introduction of the flexible and powerful tasking concept in OpenMP 3.0 (released May 2008) is certainly a big step forward.

In this article I would like to respond to this blog and share my view on the claims made. The format chosen is to give a quote, followed by my comment(s).


"OpenMP does not attempt to determine whether there are dependencies between loop iterations." 

This is correct, but there are two additional and important comments to be made.

OpenMP is a high level, but explicit programming model. The developer specifies the parallelism. The compiler and run time system translate this into a corresponding parallel execution model. The task of the programmer is to correctly identify the parallelism. As far as I can tell, this is the case for all parallel programming models. In that sense, OpenMP is not any different from other models. It is therefore not clear what the intention of this comment is. The one exception is automatic parallelization, where it is the responsibility of the compiler to identify those portions of the program that can be executed in parallel, as well as to generate the correct underlying infrastructure.

Another aspect not mentioned is that one of the strengths of OpenMP is that the directive based model allows compilers to check for possible semantic errors made by the developer. For example, several compilers perform a static dependence analysis to warn against possible data races. Such errors are much harder to detect if function calls are used to specify the parallelism (e.g. POSIX threads).


"Unfortunately, some of the OpenMP directives for managing memory consistency and local copies of variables affect the semantics of the serial code, compromising this desirable property unless the code avoids these pragmas." 

I don't think OpenMP directives affect the semantics of the serial code, so how can this be the case? An OpenMP program can always be compiled in such a way that the directives are ignored, effectively compiling the serial code.

I suspect the author refers to the "#pragma omp flush" directive and the "private" clause. These affect the semantics of the parallel version, not the serial code, but either or both could be required to ensure correct parallel execution. The need for this depends on the specific situation.

We can only further guess what is meant here, but it is worth doing so. 

Part of the design of a shared memory computer system is to define the memory consistency model. Several choices are possible and have indeed been implemented in commercially available parallel systems, as well as in more research oriented architectures.

As suggested by the name, memory consistency defines what memory state the various threads of execution observe. They need not have the same view at a specific point in time.

The problem with this is that at certain points in a shared memory parallel program, the developer may want to enforce a consistent view to ensure that modifications made to shared data are visible to all threads, not only the thread(s) that modified this data.

This has however nothing to do with OpenMP. It is something that comes with shared memory programming and needs to be dealt with.

Ensuring a consistent view of memory is exactly what the "#pragma omp flush" directive does. This is guaranteed by the OpenMP implementation. Therefore, the developer has a powerful yet portable mechanism to achieve this. In other words, it is a strength, not a weakness. Also, for ease of development, many OpenMP constructs already imply a flush. This dramatically reduces the need to use the flush directive explicitly, but if it is still needed, it is a nice feature to have.

Given what it achieves, this directive does not impact correct execution of the serial or single threaded version of an OpenMP program. Therefore this can also not explain the claim made in this blog.

The second item mentioned ("local copies of variables") is also not applicable to the serial version of the program, nor to single threaded execution of the parallel version. The "private" clause allows the programmer to specify which variables are local to the thread. There are also default rules for this, by the way. As a result of this clause, each thread has its own instance of the variable(s) specified. This is not only a very natural feature to wish for, it also has no impact on the serial code.

Perhaps the author refers to the "firstprivate" and "lastprivate" clauses, but these can be used to preserve the sequential semantics in the parallel program, not the other way round. Their use is rare, but if needed, very convenient to have.

"Since work-sharing induces communication and synchronization overhead whenever a parallel loop is encountered, the loop must contain many iterations in order to be worth parallelizing."

Again some important details are left out. OpenMP provides several strategies to assign loop iterations to threads. If none is specified explicitly, the compiler provides one.

There is a good reason to provide multiple choices to the developer. The most efficient strategy is the static distribution of iterations over the threads. In contrast with the claim made above, the overhead for this is close to zero. It is also the reason why many compilers use this as the default in absence of an explicit specification by the user.

This choice may however not be optimal if the workload is less balanced. For this, OpenMP provides several alternatives like the "dynamic" and "guided" workload distribution schemes. It is true that more synchronization is needed then, but this is understandable. The run time system needs to decide how to distribute the work while the loop executes. This is not needed with static scheduling.
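To make the load balance argument concrete, here is a toy model written in Python. It is not OpenMP code and not taken from any OpenMP implementation: it only compares when each thread would finish if iterations are split into fixed blocks up front ("static") versus handed out one at a time to the first free thread ("dynamic"). The function names, the block split and the iteration costs are all assumptions made for illustration, and the model deliberately ignores the cost of handing out iterations, which is exactly the extra synchronization a dynamic scheme pays for.

```python
import heapq


def static_finish_times(costs, threads):
    """Each thread gets one contiguous block of iterations, fixed up front.

    The split used here (the first n % threads threads get one extra
    iteration) is an assumption; the exact split is up to the implementation.
    """
    base, extra = divmod(len(costs), threads)
    finish, start = [], 0
    for t in range(threads):
        size = base + (1 if t < extra else 0)
        finish.append(sum(costs[start:start + size]))
        start += size
    return finish


def dynamic_finish_times(costs, threads):
    """Iterations are handed out one at a time to the thread that is free first."""
    free_at = [(0, t) for t in range(threads)]  # (time thread is free, thread id)
    heapq.heapify(free_at)
    finish = [0] * threads
    for cost in costs:
        now, t = heapq.heappop(free_at)
        finish[t] = now + cost
        heapq.heappush(free_at, (finish[t], t))
    return finish


# Hypothetical unbalanced loop: iteration i costs i time units.
costs = [1, 2, 3, 4, 5, 6, 7, 8]
print("static :", static_finish_times(costs, 2))
print("dynamic:", dynamic_finish_times(costs, 2))
```

With these made-up costs and two threads, the static split leaves one thread busy for 26 time units while the other is done after 10, whereas the dynamic hand-out finishes after 20. For a perfectly balanced loop both give the same result, and then the near zero overhead of the static scheme wins.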

Of course the developer can always implement a specific scheme manually, but these two choices go a long way toward accommodating many real world situations. Moreover, an implementor will try very hard to provide the most efficient implementation of these constructs, relieving the developer of this task.

"Although OpenMP was designed primarily to support a single level of loop parallelization"

I'm not sure what this comment is based on, because nested parallelism has been supported since the very first 1.0 release of OpenMP that came out in 1997. Admittedly, it has taken some time for compilers and run time systems to support this, but it is a widely available feature these days.


"The work-sharing scheduling strategy is often not up to the task, however, because it is sometimes difficult to determine at the start of a nested loop how to allocate thread resources."

OpenMP has a fairly high level of abstraction. I fail to see what is meant by "allocate thread resources". Actually, there is no such concept available to the user, other than various types of data-sharing attributes like "private" or "shared". Nested parallelism works: each thread becomes the master thread of a new pool of threads, and resources are available whenever needed.

The next line gives us somewhat more of a clue as to what is really meant here.


"In particular, when nested parallelism is turned on, it is common for OpenMP applications to "blow out" memory at runtime because of the inordinate space demands."

In my experience this has not been an issue, but of course one cannot exclude that an (initial) implementation of nested parallelism for a specific platform suffered from certain deficiencies. Even if so, that is a Quality of Implementation (QoI) issue and has nothing to do with the programming model. Shared data is obviously not copied, so it requires no additional memory, and by design (and desire) each (additional) thread gets a copy of its private data.

The fact this is really a QoI issue seems to be confirmed by the next statement.


"The latest generation OpenMP compilers have started to address this problem, however."

In other words, if there was a problem at all, it is being addressed.


 "In summary, if your code looks like a sequence of parallelizable Fortran-style loops, OpenMP will likely give good speedups."

This is one of those persistent myths. OpenMP has always been more flexible than for "just" parallelizing loops. As for example shown in the book "Using OpenMP" (by Chapman, Jost and van der Pas), the sections concept can be used to overlap input, processing and output in a pipelined manner. 


"If your control structures are more involved, in particular, involving nested parallelism, you may find that OpenMP isn't quite up to the job."

This is a surprisingly bold and general claim, and some more specific information would be helpful. As already mentioned above, it is not at all clear why nested parallelism should not be suitable and performant. It actually is, and it is successfully used for certain kinds of algorithms.

Regrettably the author of this blog also does not seem to be aware of the huge leap forward made with OpenMP 3.0. The specifications were released in May 2008 and are supported by several major compilers already.

The main new functionality added is the concept of tasking. A task can be any block of code. The developer has the responsibility to ensure that different tasks can be executed concurrently. The run time system generates and executes the tasks, either at implicit synchronization points in the program, or under explicit control of the programmer.

This adds tremendous flexibility to OpenMP. It also raises the level of abstraction. Although never true in the past either, a claim that OpenMP is only suitable for a loop level style of parallelization is certainly way too restrictive.

In addition to tasking, numerous other features have been added, including enhanced support for nested parallelism and C++.

Last, but certainly not least, I can strongly recommend anybody interested in OpenMP to visit the main web site.

Saturday Feb 21, 2009

February 14-21 I was in Canada to give two classes on parallel programming at HPCVL. The first session was held at the HPCVL facilities in Kingston, ON. February 18 I travelled to Toronto, ON. The second class was held February 19-20 at Ryerson University. I'd been at HPCVL (Kingston and Ottawa) in mid-August 2008. Back then it was still summer. Quite warm and sunny with a lot of outdoor activities. This time it was really winter still, as demonstrated by the pictures below. In Kingston I stayed in the same hotel as before, the Holiday Inn Waterfront. I had a very nice corner room on the highest floor with a great view of Lake Ontario.

The 2 pictures below are taken from my balcony. The water is frozen, but there is a pathway for the ferry. On the first picture you can see the Royal Military College on the other side of the bay. The second picture has a close up of the ferry.

View from balcony 

Ferry as seen from my balcony

The next two pictures were taken from the little park facing the harbor on Ontario Street. On the second picture my hotel can be seen. It is the yellow building on the left. My room was on the fifth floor, in the top right hand corner.

Harbour view 

Downtown Kingston

The following two pictures were taken in the downtown part of Kingston. The first one was made with my back facing the harbor. On the second picture you can see a small ice skating rink. A very nice idea. When I made this picture the ice had just been cleaned and there were no people back on the ice yet, but this changed a few minutes later. The rink was pretty full with people then.

Harbour and hotel view

Downtown Kingston

The last picture of Kingston is a beautiful sunrise. I was up early and saw this view while working on my laptop in my room. I went out on the balcony to shoot this picture. It was pretty cold, but worth doing. Note that the moon is still visible.

Sunrise in Kingston

The 4 pictures below were taken in Toronto. The first picture shows the building where the class was held. The second picture shows some more detail of this street. The university is very close to the Toronto Eaton Center with a big mall and office buildings. This center is on the left on the third picture. I thought the fourth picture is funny. This squirrel was very curious and I could not resist making a picture of this little animal.

Ryerson University - Building where class was held 

Street impression

Yonge Street

A very curious squirrel

Saturday February 21 I travelled to the San Francisco Bay Area for a week of meetings. I had a connecting flight through Chicago and the weather got pretty bad. Our plane had to be de-iced twice. When we were on the runway to take off, we had to return to the gate for the second de-icing. Given the weather conditions it was pretty impressive we were able to take off after all. The 4 pictures below give an impression of what it was like.

View from my seat

De-icing the plane

Getting ready to take off

In the air!

 

Tuesday Feb 17, 2009

This seminar was held February 1-6, 2009, at Schloß Dagstuhl in Wadern, Germany. The Dagstuhl seminars are small scale events. Attendance is by invitation only. The goal is to not only exchange information, but also to encourage discussions and to get to know the attendees better. This approach worked out really well. Below are my impressions of the scientific aspects of this event.

All the seminar information can be found at the workshop web site. This is also where you can find all the presentation material. I've posted my slides, but will also write an extended abstract for the proceedings. This will be more like a short paper.

The two major new things I learned at this event were in how many areas combinatorial analysis is used, and that many of the algorithms are characterized by random memory access on large data sets.

Regarding the former, I was for example surprised to hear that the analysis of social networks boils down to a combinatorial problem. When you think about it though, there is a natural link between these two. A new aspect is however that these networks, like LinkedIn or Hives, are so huge. Nobody really knows what they look like, and a deeper analysis of their structure can be revealing. 

The computational aspects are quite interesting and challenging. In particular, traditional cache based architectures do not perform very well at all, due to the irregular memory access patterns, combined with the ever growing size of the data set. For the same reason, it is also a challenge for a cc-NUMA architecture to perform well.

Instead, heavily threaded architectures using latency hiding techniques shine on these kinds of applications. Even an old system like the Tera MTA performs relatively well, despite its low clock rate. Several presenters reported excellent results on Niagara 2 and Victoria Falls based systems. For more details I can highly recommend the talk given by Prof. David Bader from the Georgia Institute of Technology. The slides can be found here

 

Sunday Feb 15, 2009

I was very fortunate to be part of a seminar on "Combinatorial Science and Engineering" held February 1-6, 2009, at Schloß Dagstuhl in Wadern, Germany. In this personal section of my blog I share my experiences of the location and the event. In the work part I'll comment on the scientific aspects of this trip.

It was definitely a very interesting experience to be there. Schloß Dagstuhl is not only beautiful, as the pictures below hopefully demonstrate, it is also relatively remote. Especially if you don't have your own transportation. 

Attendance is by invitation only and the groups are always small. The remoteness of the location is on purpose. The idea is to encourage attendees to get to know each other better and also to stimulate discussions. The agenda also allows for enough hallway conversations, or to just go to the coffee room and have a chat with whomever might be there. All meals are on site and everybody is supposed to attend. Table seating is arranged and rotates among the attendees.

This approach certainly worked for me. I've met quite a number of people and had numerous fruitful and pleasant conversations. 

These are two pictures of the main building. This is where some of the guest rooms are (mine was on the newer, modern side to the left on these pictures). The coffee room and restaurant are in the old part of the building. 

 

One late afternoon I walked up a nearby hill and took an overview picture. You can clearly see the old and new part on the right. The library as well as the conference room we used are in the yellow buildings. The guest rooms are in the square surrounding it.

The reason I walked up the hill was that I'd seen the ruins of a castle and wanted to take a closer look. This place has an interesting history and goes back to the year 1270. More information can be found on this web site.


Tuesday Oct 09, 2007

Some time ago I had the pleasure to have access to an early engineering system with the UltraSPARC T2 processor. I used this opportunity to run my private PEAS (Performance Evaluation Application Suite) test suite.

It turned out that my findings on throughput benchmarks conducted with this suite revealed some interesting aspects of the architecture. In this blog I'll write about that, but a word of warning is also in place. This is rather early work. I'd like to see it as a start and plan to gather and analyze more results in the near future.

PEAS currently consists of 20 technical-scientific applications written in Fortran and C. These are all single threaded user programs, or computational kernels derived from the real application. In this sense it very well reflects what our users are typically running.

PEAS has been used in the past to evaluate the performance of upcoming processors. Access to an early system provided a good opportunity to take a closer first look at the performance of the UltraSPARC T2 processor.

PEAS is about throughput. I typically run all 20 jobs simultaneously. I call this a "STREAM", which by the way has no relationship with the Streams benchmark. It is just a name I picked quite some time ago.

I can of course run multiple STREAMs simultaneously, cranking up the load. Due to the limitations in the system I had access to, I could only run one and two STREAMs, but this still means I was running 20 and 40 jobs simultaneously. The maximum memory usage on the system was around 6 GB per STREAM.

The question is how to interpret the performance results. To this end, I use the single core performances as a reference. Prior to running my throughput tests, I run all jobs sequentially. For each job this gives me a reference timing. With these timings I can then estimate what the elapsed time of a throughput experiment will be.

Let me give a simple example to illustrate this. Let's say I have two programs, A and B. The single core elapsed times for A and B are 100 and 200 seconds respectively. If I run A and B simultaneously on one core, the respective elapsed times are then 200 and 300 seconds. This is because for 200 seconds, both A and B execute on a single core simultaneously and therefore get 50% of the core on average. After these 200 seconds, B still has 100 more seconds to go. Because A has finished, B has 100% of the core to itself and therefore needs an additional 100 seconds to finish.

This idea can easily be extended to multiple cores.
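A minimal sketch of this estimate in Python, covering both the single core example and one possible extension to multiple cores. The function name is made up, and the multi-core rule is an assumption: with n jobs still running on c cores, each job is taken to receive min(1, c/n) of a core, since a single threaded job can never use more than one core. Jobs are assumed to start at the same time and to share the cores ideally.

```python
def estimate_elapsed(ref_times, cores=1):
    """Estimated elapsed time per job when all jobs start together.

    ref_times: single core reference timings, one per job.
    cores:     number of cores the jobs share (assumed ideal sharing).
    """
    # Jobs finish in order of their reference time, shortest first.
    order = sorted(range(len(ref_times)), key=lambda j: ref_times[j])
    elapsed = [0.0] * len(ref_times)
    now = 0.0     # wall clock time so far
    served = 0.0  # core-seconds every still running job has received
    for pos, j in enumerate(order):
        active = len(order) - pos
        rate = min(1.0, cores / active)  # fraction of a core per job
        now += (ref_times[j] - served) / rate
        served = ref_times[j]
        elapsed[j] = now
    return elapsed


# The example from the text: A needs 100 seconds, B needs 200 seconds.
print(estimate_elapsed([100, 200], cores=1))
print(estimate_elapsed([100, 200], cores=2))
```

On one core this reproduces the 200 and 300 seconds worked out above. On two cores each job has a core to itself, so the estimates fall back to the reference timings of 100 and 200 seconds.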

In the past I've used this approach to evaluate the performance of more conventional processor designs. Given the estimates assume ideal hardware and software behavior, the measured values were typically higher than what I estimated. This is actually what one would expect.

When I applied the same methodology to my results on the UltraSPARC T2 processor however, a big difference showed up. The measured times were actually better than what I estimated! This is exactly the opposite of what one might expect.

The explanation is that the threading within one core increases the capacity of that core.

The question is how to attach a number to that. In other words, how much computational capacity does an UltraSPARC T2 processor represent?

To answer this question I introduce what I call a "Core Equivalent", or "CE" for short. A CE is a core that has the capacity of the core used, but without the additional threading.

On multicore designs without additional hardware threading, a CE then corresponds to the core; there is a one to one mapping between the two.

On threaded designs, like UltraSPARC T2, a CE might be more than one core. The question is how much more.

This leads me to introduce the following metric. Define the Average(CE) metric as follows:

Average(CE) := sum{j = 1 to 20}(measured(j)-estimated(j,CE))/measured(j)

It is easy to see that this function increases monotonically as a function of the CE. I then define the "best" CE as the CE for which Average(CE) has the smallest positive value.

The motivation for this metric is that I compute the average over the relative differences of the measured versus the estimated elapsed times. Of course there could be cancellations, but as long as this average is negative, I underestimate the total throughput capacity of the system.
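The selection of the "best" CE can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tooling actually used for PEAS. The estimate function uses ideal sharing (n running jobs on CE cores each get min(1, CE/n) of a core), only whole numbers of CEs are tried, and the metric is divided by the number of jobs so that it is a true average, following the description above. The formula as written is the plain sum, which has the same sign. All function names and all timings are made up.

```python
def estimate_elapsed(ref_times, ce):
    """Ideal sharing estimate of per job elapsed times on `ce` core equivalents."""
    order = sorted(range(len(ref_times)), key=lambda j: ref_times[j])
    elapsed = [0.0] * len(ref_times)
    now = served = 0.0
    for pos, j in enumerate(order):
        rate = min(1.0, ce / (len(order) - pos))  # fraction of a core per job
        now += (ref_times[j] - served) / rate
        served = ref_times[j]
        elapsed[j] = now
    return elapsed


def average_metric(measured, estimated):
    """Mean relative difference between measured and estimated elapsed times."""
    diffs = [(m - e) / m for m, e in zip(measured, estimated)]
    return sum(diffs) / len(diffs)


def best_ce(ref_times, measured, max_ce):
    """First CE with a positive metric; because the metric grows with the
    CE, this is also the CE with the smallest positive value."""
    for ce in range(1, max_ce + 1):
        avg = average_metric(measured, estimate_elapsed(ref_times, ce))
        if avg > 0:
            return ce, avg
    return None  # even max_ce underestimates the capacity of the system


# Made-up numbers for illustration only, not PEAS results.
ref_times = [100, 200, 300, 400]  # single core reference timings
measured = [210, 360, 470, 560]   # timings from the throughput run
print(best_ce(ref_times, measured, max_ce=4))
```

For these made-up timings the metric is negative for one CE and becomes positive at two CEs, with a value of about 3.4%, so two CEs would be reported as the best match.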

I applied this metric to my PEAS suite and found one STREAM (20 jobs executed simultaneously) to deliver 15 CEs, with Average(15) = 0.56%. For two STREAMs (40 simultaneous jobs), the throughput corresponds to 25 CEs with Average(25) = 1.21%.

In a subsequent blog I'll give more details so you can see for yourself, but these percentages indicate there is a really good match between the measured and estimated timings.

These numbers clearly reflect the observation that UltraSPARC T2 delivers more performance than 8 non-threaded cores. It is interesting to note that the number goes up when increasing the load. This suggests a higher workload might even give a higher CE value.

As mentioned in the introduction, this is early work. In a future blog I'll go into much more detail regarding the above. I also plan to gather more results. In particular I would like to see how the number of CEs changes if I increase the load.

So, stay tuned!

Monday Sep 17, 2007

Unfortunately it has taken me an embarrassingly long time to pick this up again. I'll try to behave better in the future and certainly have lots of ideas and plans for topics to blog about.

One of the main reasons for this long period of silence is that we have been working on a comprehensive book on OpenMP. I don't want to use this blog for ads, so I'll just mention the title, publisher, authors and ISBN numbers.
There we go. The book is called "Using OpenMP". It is going to be published by MIT Press. The work was done together with Barbara Chapman and Gabriele Jost. The ISBN numbers are ISBN-10: 0-262-53302-2 and ISBN-13: 978-0-262-53302-7.
It can be ordered already and should come out in October (as far as we know now).

Stay tuned!

Tuesday Dec 06, 2005

Sun just announced the UltraSPARC T1 processor based T1000 and T2000 servers with massive amounts of parallelism on the chip. Of course, UltraSPARC IV and IV+ also have parallelism at the chip level. This is generally referred to as "Chip Multi-Threading", or CMT for short. Another example of this kind of on-chip hardware parallelism is the AMD dual-core Opteron processor.

And we are only seeing the beginning of this. In the not too distant future, the majority of processors will have some implementation of CMT. The naming may be different, but what matters to the developer is that a single chip is like a mini parallel system.

This brings up the question "How can I exploit this hardware parallelism in my application?"

There are various ways to do this. One can use Posix Threads ("Pthreads" for short) for example, or in Java the native threading model. These are fairly low-level interfaces though, and the learning curve is fairly steep.

An alternative is the Message Passing Interface (MPI). It is really designed to run on a cluster of systems, but an application parallelized with MPI can run equally well, if not better, on a single parallel system. However, MPI is not always easy to use when one is new to the concepts of parallelization. MPI may also be fairly heavy to run on a single CMT processor, although, to be honest, I have not done any performance measurements to back this up with data. I could be wrong here!

My favourite programming model is OpenMP. The language specifications and much more information can be found at the OpenMP website.

Today I want to give a quick overview of what OpenMP is about and to what extent it fits an underlying CMT architecture.

Just like MPI, OpenMP is a de-facto standard. The OpenMP Architecture Review Board (ARB) defines the specifications and an implementation is supposed to stick to that. So far, this has worked very well and there are many compilers available that support OpenMP.

Sun Studio has supported OpenMP in C, C++ and Fortran for a long time now. We continue to improve the performance of the implementation and have added some features to assist the developer. I strongly advise using Sun Studio 11 if you plan on giving OpenMP a try.

Not least because our compilers also support automatic parallelization in C, C++ and Fortran. This feature is activated through the -xautopar option and it is certainly worth trying on an application. We support mixing and matching this with OpenMP, so you can always augment what the compiler did. If you use this feature, please also use the -xloopinfo option. This causes the compiler to print diagnostic messages on the screen, informing you what was parallelized and what wasn't. In case of the latter, the compiler tells you why the parallelization did not succeed.

OpenMP uses a directive based approach to parallelize an application. The one limitation of OpenMP is that an application can only run within a single address space. In other words, you can not run an OpenMP application on a cluster. This is a difference with MPI.

On the other hand, OpenMP is more lightweight and may therefore be more suitable for a CMT processor.

One question I often get is how OpenMP compares to Posix Threads and whether developers should rewrite their Pthreads application and use OpenMP instead. To start with the latter, my answer is always "why?". If it works fine, don't modify it.

The first question takes a little more time to answer. Pthreads are rather low level, but also efficient. OpenMP is built on top of a native threading model and therefore adds overhead, but the additional cost is fairly low, unless you use OpenMP in the "wrong" way. One golden rule is to create large portions of parallel work to amortize the cost of the so-called parallel region in OpenMP.

The clear advantage OpenMP has over Pthreads is ease of use. OpenMP provides a fairly extensive set of features, relieving the developer from having to roll their own functionality. You can still do this with OpenMP, but many basic needs are covered already.

Time to look at an example. The code fragment below uses OpenMP directives to parallelize this simple loop:


#pragma omp parallel for shared(n,a,b,c) private(i)
for (i=0; i < n; i++)
    a[i] = b[i] + c[i];

An OpenMP compliant compiler translates this pragma into the appropriate parallel infrastructure. As a side note, this loop will be automatically parallelized by the Sun Studio C compiler.

At run-time, the application will go parallel at this point. The threads are created and the work is distributed over the threads. In this case "work" means the various loop iterations. Each thread will get assigned a chunk out of the total number of iterations that need to be executed.
At the end of the loop, the threads synchronize and one thread (the so-called "master thread") resumes execution.

One of the things one has to do is to specify the so-called data-sharing attributes of the variables. OpenMP has some defaults for that, but I prefer, and recommend, not to rely on these and to think about it yourself. Initially this may take some more time, but the reward is substantial.

At run-time, the OpenMP system distributes the iterations over the threads. For example if n=10 and I use 2 threads, each thread gets 5 iterations to work on. This is why the loop iteration variable "i" needs to be made "private". This implies that each thread gets a local copy of "i" and can modify it independently of the other threads.
All threads need to be able to access (a portion of) "a", "b" and "c". This is why we need to declare those to be "shared". The same is true for "n".
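For reference, this is how the loop could look when wrapped in a complete function. It is a sketch only: the function name and the "double" element type are my choices for illustration, and the default(none) clause is an addition that is not part of the fragment shown earlier. That clause asks the compiler to reject any variable whose data-sharing attribute has not been specified explicitly, which fits the advice not to rely on the defaults.

```c
/* Element-wise vector addition, a[i] = b[i] + c[i], parallelized with OpenMP.
 * The function name and the element type are illustrative choices. */
void vector_add(int n, double *a, const double *b, const double *c)
{
    int i;

    /* default(none) is an addition to the original pragma: every variable
     * used in the parallel region must now be listed explicitly. */
    #pragma omp parallel for default(none) shared(n, a, b, c) private(i)
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

If the file is compiled without OpenMP support, the pragma is ignored and the loop runs sequentially, giving the same result.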

OpenMP supports any data type. Although many OpenMP applications use floating-point, OpenMP works equally well for other data types, such as those used in integer codes. Take the loop above. The vectors can be of any type, including "int", or "char" for that matter.

The example above is correct, but my explanation is a little simplified. Still, I hope you get the gist of how this works in OpenMP. For example, one thing I can control explicitly is how the various loop iterations are distributed over the threads. If I do not specify this, I get a compiler dependent default.
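The distribution of the iterations is controlled through the schedule clause. The sketch below uses a hypothetical loop in which the amount of work grows with the iteration number, so giving each thread an equally sized block of iterations would leave some threads with much more to do than others. With schedule(dynamic, 4), chunks of 4 iterations are handed out to threads as they become available. The function name and the chunk size are arbitrary choices for illustration, not a tuned recommendation.

```c
/* row_sum[i] = 0 + 1 + ... + i, so the work per iteration grows with i.
 * Hypothetical example to illustrate the schedule clause. */
void triangular_sums(int n, long *row_sum)
{
    int i, j;

    /* Chunks of 4 iterations are assigned to threads on demand. */
    #pragma omp parallel for default(none) shared(n, row_sum) private(i, j) \
            schedule(dynamic, 4)
    for (i = 0; i < n; i++) {
        long s = 0;   /* declared inside the loop body, so private per thread */

        for (j = 0; j <= i; j++)
            s += j;
        row_sum[i] = s;
    }
}
```

For a loop with uniform work per iteration, such as the vector addition, a static schedule is usually the natural choice since it has the lowest overhead.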

Another feature of OpenMP is that I can, and should, create bigger chunks of work than just one parallelized for-loop. Otherwise I may get killed in the overhead. OpenMP provides several features to address this.
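One way to create a bigger chunk of parallel work is to put several loops inside a single parallel region. The sketch below, with function and variable names made up for the occasion, sets up the parallel region once and then shares out two consecutive loops with the "for" work-sharing directive.

```c
/* Two dependent loops executed inside one parallel region.
 * All names are illustrative. */
void add_then_scale(int n, double *a, const double *b, const double *c,
                    double *d, double factor)
{
    /* The cost of the parallel region is paid once for both loops. */
    #pragma omp parallel default(none) shared(n, a, b, c, d, factor)
    {
        int i;   /* declared inside the region, hence private to each thread */

        #pragma omp for
        for (i = 0; i < n; i++)
            a[i] = b[i] + c[i];
        /* Implicit barrier at the end of the loop: a[] is complete here. */

        #pragma omp for
        for (i = 0; i < n; i++)
            d[i] = factor * a[i];
    }
}
```

The barrier at the end of the first loop is what makes it safe for the second loop to read any element of "a", regardless of which thread computed it.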

Activating the OpenMP directives is very easy. For example, on the Sun Studio compilers one needs to add the -xopenmp option to the rest of the options and voila, the compiler will recognize the OpenMP constructs and generate the infrastructure for you. If you do so, please also use the -xloopinfo option. Just as with -xautopar, this will display diagnostic messages on your screen.

OpenMP comes with a fairly compact, yet powerful, set of directives to implement and control the parallelization. On top of that, a run-time library is supplied to query and control the parallel environment (e.g. adjusting the number of threads). A set of environment variables is also provided. For example, OMP_NUM_THREADS to specify the number of threads to be used by the application.
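As a small illustration of the run-time library, the sketch below requests a number of threads with omp_set_num_threads() and then asks, from inside a parallel region, how many threads are actually in the team via omp_get_num_threads(). Both functions are part of the OpenMP run-time library. The wrapper function and its name are hypothetical, and the _OPENMP guard is there so that the file still builds when OpenMP is not enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Requests "requested" threads and returns the size of the team that the
 * next parallel region actually gets. Returns 1 without OpenMP.
 * The function itself is a hypothetical helper. */
int team_size(int requested)
{
    int size = 1;

#ifdef _OPENMP
    omp_set_num_threads(requested);

    #pragma omp parallel default(none) shared(size)
    {
        /* One thread records the team size on behalf of all threads. */
        #pragma omp single
        size = omp_get_num_threads();
    }
#else
    (void) requested;   /* unused in the sequential build */
#endif
    return size;
}
```

The value returned can be lower than the number requested, since the implementation is allowed to give the program fewer threads than it asked for.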

So, what are the main advantages of OpenMP? This is what I think makes it attractive:


  • Portable - Every major hardware vendor and several ISVs provide an OpenMP compiler

  • Modest programming effort - Implementing OpenMP is usually fairly easy compared to other programming models

  • Incremental parallelization - One can implement the parallelization step by step

  • Local impact - Often, the OpenMP specific part in an application is relatively small

  • Easier to test - By not compiling for OpenMP you "de-activate" it

  • Natural mapping onto CMT architectures - OpenMP maps elegantly onto a CMT processor

  • Assistance from tools - You get a compiler to help you

  • Preserves sequential program - If done correctly, you can have your sequential program still built in

In particular the latter is a powerful feature I think. You can always go back to your original application if you want.
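To show what this means in practice, here is a minimal sketch with made-up function names. The same source file builds both with and without OpenMP. The _OPENMP macro is defined by the compiler only when OpenMP is enabled, so the program can find out which of the two it is.

```c
/* Returns 1 when this file was compiled with OpenMP enabled, 0 otherwise. */
int openmp_enabled(void)
{
#ifdef _OPENMP
    return 1;
#else
    return 0;
#endif
}

/* Without OpenMP the pragma is ignored and this is a plain sequential loop. */
void fill_squares(int n, long *a)
{
    int i;

    #pragma omp parallel for default(none) shared(n, a) private(i)
    for (i = 0; i < n; i++)
        a[i] = (long) i * i;
}
```

Either way, fill_squares() produces the same values, which is what makes it easy to compare the parallel version against the sequential one when testing.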

Of course there are also drawbacks to using OpenMP. One is restricted to a single address space and in that respect OpenMP is not "Grid Ready". With all chip level parallelism this may be less of an issue in the future though.
In addition, one can combine MPI and OpenMP. MPI can be used to distribute the work at the system level. Within one SMP system one can then use OpenMP for the finer grained parallelization. This kind of "hybrid" parallelization is gaining ground.

The developer also needs to think about data-sharing attributes. Without prior experience in parallelization this is a hurdle to be taken, but it is usually easier than it might seem at first sight.

You also need a compiler, but hey, we have Sun Studio for that!

Interested? One place to start could be to take a look at the OpenMP tutorial that I presented at the previous IWOMP workshop (International Workshop on OpenMP, June 2005). You can find it through the OpenMP User Group website (under "Trainings"). This is the direct link to the pdf file. As always, with such things, I would do it slightly differently today, but hopefully this will get you going.

By the way, the next IWOMP will take place in Reims, France (June 12-15, 2006). The first two days a conference will be held. The next two days are reserved for the OMPlab and a tutorial. People can bring their own application to the OMPlab. OpenMP experts will be available to assist you with the parallelization. The tutorial will most likely be held the third day, in parallel with the OMPlab. After that, attendees can join the OMPlab part and work on whatever they want. Preparations have just started. Through this web site you can sign up for the conference newsletter.

Okay, back to CMT. The beauty of OpenMP is that it provides an abstract model. You can develop your OpenMP program on any piece of hardware with an OpenMP compliant compiler and then run it on any parallel system. Possibly you need to recompile if you change architecture, but that is about it.

For example, I travel a lot and routinely work on OpenMP programs on my single processor Toshiba laptop, using the Sun Studio 11 compilers under Solaris. I can run multi-threaded, even using the nested parallelism feature that OpenMP supports. Of course I will not get any performance gain doing so, but it makes for a great development environment!

I hope this article got you excited about OpenMP. If so, I recommend signing up for the OpenMP community mail alias. This is a very good forum to ask questions, discuss issues, etc.

I hope to see you soon on the OpenMP alias!



Tuesday Nov 08, 2005

This is a big topic and I plan to cover it in bits and pieces as time goes by.

With Sun Studio on Solaris we ship compilers for Fortran, C and C++. These are Sun developed products and have come a long way. These compilers support SPARC processors, certain Intel processors and the AMD Opteron processor. In the remainder I will focus on SPARC and AMD.

When it comes to performance, the -fast option is key. This option is really a macro that expands to a series of options. What it expands to depends on the platform, language and compiler release. The expansion also typically changes from release to release, for example as we get a better grip on the heuristics behind some of the options implied by -fast, or when we decide to pull an additional option in under -fast.

Decisions like this are based on the outcome of extensive performance tests covering a wide range of applications.

In general, -fast is meant to give good performance across a diverse range of applications, but there are always exceptions. Exploring some other or additional options is always recommended.

I also recommend linking with -fast. This ensures the optimal libraries are linked in.

In the future I will write more about what goes on under the hood of -fast. Before I do that however, I want to draw attention to two other options.

The -xarch option is used to select the instruction set. Various processors support various instruction sets.

On SPARC for example we have -xarch=v8plusb for 32-bit and -xarch=v9b for 64-bit. These are the most recent instruction sets available. They are supported on all UltraSPARC III based systems and follow-on processors. For example, UltraSPARC III Cu, UltraSPARC IIIi, UltraSPARC IV and the recently announced UltraSPARC IV+ processor all support these instruction sets.
But, several older SPARC instruction sets are also accessible under the -xarch option. Check the documentation for a list.

I often get asked about the performance of 64-bit. There are (too) many misconceptions floating around about this. Basically, 64-bit means the address range is expanded from 32 to 64 bits. This gives a dramatic increase in the number of addresses that can be covered and therefore one can use more memory. Nothing more, nothing less. The increase in address space is the main reason to go to 64-bit.

In some cases, 64-bit may give rise to a slight performance degradation. A pointer is an address. Instead of occupying 32 bits in the cache, it now takes up twice as much space. For pointer data, the effective cache capacity is therefore cut in half, and pointer intensive applications will be affected by that.
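The effect is easy to see for yourself. The sketch below uses a hypothetical linked-list node and two helper functions of my own naming to report the pointer width and the size of the node for whatever mode the file was compiled in.

```c
#include <limits.h>
#include <stddef.h>

/* A typical pointer-carrying data structure; purely illustrative. */
struct node {
    struct node *next;
    int value;
};

/* Width of a pointer, in bits, for the current compilation mode. */
int pointer_bits(void)
{
    return (int) (sizeof(void *) * CHAR_BIT);
}

/* Number of bytes one list node occupies, padding included. */
size_t node_bytes(void)
{
    return sizeof(struct node);
}
```

On typical systems the node takes 8 bytes in a 32-bit build and 16 bytes in a 64-bit build, so only half as many nodes fit in the same amount of cache. The exact sizes depend on the platform's data model and alignment rules.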

On the other hand, on some processors, 64-bit means more than increasing the address space and AMD's Opteron processor is a good example of that.

Just like SPARC, the AMD Opteron is a 64-bit processor. This is why the Sun compilers provide the -xarch=amd64 option to use the 64-bit instructions and extensions from AMD. Several 32-bit instruction sets are supported as well. For example -xarch=sse2 for those processors that support the SSE2 instruction set.
In general however it is recommended to use -xarch=amd64 on AMD Opteron. This will not only give a much bigger address range, but also exploits the architectural enhancements AMD has put in.

The choice for -xarch also controls compatibility. Typically, a more recent instruction set is not supported on older processors. For example, -xarch=v8plusb is not supported on UltraSPARC II, but the SPARC processors are backward compatible. If one compiles and links with -xarch=v8plusa for example, the resulting binary will run on both UltraSPARC II as well as on more recent SPARC processors.

Of course a more recent instruction set tends to have richer features and therefore a "wrong" choice may affect performance as well.

The last option covered is called -xchip. This instructs the compiler to optimize the instruction schedule for the specific processor. For example -xchip=ultra4plus to request the compiler to optimize for the new UltraSPARC IV+ processor, or -xchip=opteron for AMD Opteron. The choice for -xchip also affects some of the higher level optimizations performed by the compiler.
In contrast with -xarch, a "wrong" choice for -xchip does not affect compatibility, but may impact performance.

In the absence of specific settings for -xarch and/or -xchip, the compiler will select a default.

In summary:

Use -fast to get good performance with one single option. The -xarch option is used to specify the instruction set (32 or 64-bit). The -xchip option controls which processor the compiler should optimize for. In the absence of -xarch and/or -xchip the compiler will select a default.

Tuesday Nov 01, 2005

My name is Ruud van der Pas. I'm in engineering and have been with Sun for a little over 7 years now. My main interests are in the area of application performance and interval arithmetic/analysis.

The reason for me to start this weblog is because I meet a lot of our customers. It is always very inspiring to talk with them. I'm interested to find out what they're doing, how they use our products and what sort of problems they're struggling with. In talking with them, I learn a lot and hopefully they sometimes also pick up something from me.

I realized a weblog could be a convenient and easy way to share some of that information with a larger group of people all around the world. We will see how this works out, but for now I plan to go for it.

Regarding application performance I focus on technical-scientific programs. I'm not only interested in cranking up single processor performance, but also in applying shared memory parallelization, either by asking the Sun compiler to automatically parallelize your application, or by using the OpenMP programming model (http://www.openmp.org), or both.
I expect both to get increasingly popular, given the CMT technologies that are out there today and what looms on the horizon. Think about it. If you have a chip with multiple cores, isn't it great if you can take advantage of that and speed up a single application? For a while, most people will go for the additional throughput and run several applications (e.g. Mozilla and StarOffice) side by side. But how far can you push that? Eventually you will want one single application to go faster, especially as the number of cores on a chip is going to increase, and then OpenMP provides a very nice solution. I plan to write a lot about that in the future.

My second passion is about interval arithmetic and interval analysis. You can expect me to write a lot about that as well. So what is it? Conceptually it is easy. Instead of using a single variable to store some value, you use an interval [a,b] say to store a range of values.
In many cases, that is a very natural approach, as data is usually not known precisely and/or may fluctuate (think of the wind speeds around a building for example).
Once you do this, a whole new world opens up. And it is a fascinating world. A description of many problems in physics, chemistry and math is often more natural when using intervals, because the parameters in the model are for example not known with 100 percent accuracy. So, intervals are not only more natural, one can also solve problems that can not be solved otherwise.
I'm the first one to admit though that this is not easy. To me, it is the way to go though. The fact it is hard is a challenge that should encourage people to figure things out and make progress.
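To give a first impression of the concept, here is a minimal sketch of an interval type in plain C, with addition and multiplication. All names are mine and purely illustrative. Please note the important simplification: a rigorous implementation rounds the lower bound down and the upper bound up after every operation, so that the true result is always enclosed. This sketch does not do that and only shows the idea.

```c
/* A closed interval [lo, hi]. Illustrative only: no outward rounding. */
struct interval {
    double lo;
    double hi;
};

/* [a,b] + [c,d] = [a+c, b+d] */
struct interval interval_add(struct interval x, struct interval y)
{
    struct interval r = { x.lo + y.lo, x.hi + y.hi };
    return r;
}

/* [a,b] * [c,d] = [min, max] over the four products of the end points. */
struct interval interval_mul(struct interval x, struct interval y)
{
    double p[4] = { x.lo * y.lo, x.lo * y.hi, x.hi * y.lo, x.hi * y.hi };
    struct interval r = { p[0], p[0] };

    for (int k = 1; k < 4; k++) {
        if (p[k] < r.lo) r.lo = p[k];
        if (p[k] > r.hi) r.hi = p[k];
    }
    return r;
}

/* Returns 1 if the value v lies within the interval x, 0 otherwise. */
int interval_contains(struct interval x, double v)
{
    return x.lo <= v && v <= x.hi;
}
```

For example, adding [-1,2] and [3,4] gives [2,6], and multiplying them gives [-4,8]. Every possible sum or product of one value from each input interval lies inside the result.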

Well, that is it for the first time.

This blog copyright 2009 by ruud