星期五 一月 16, 2009

Back to the happy time of single-thread programming, at the end of the function, the only thing you need to worry about is that the local variables on stack will be lost. So advices like "never return pointer to local variables" almost appears on every C/C++ books. However, things get more complicated in multi-threading world. Take OpenMP as an example, the story of parallel construct end is more funny.

1. Let's look at a very simple example:

#pragma omp parallel private(a)
{
  TYPE b;
  // blah blah blah
}

a and b are both of type TYPE. Suppose TYPE is C++ non-POD type, is there any difference between a and b? The answer is yes. The lifespan of b is the parallel construct, so its destructor should be called in the construct. The lifespan of a is the parallel region, and the destructor of a is part of OpenMP implementation code for the parallel region. The difference of construct and region is subtle, you can refer to OpenMP spec 3.0 (Actually, the construct I mentioned here is not exactly the same in spec). To make it more clearly, see the example below:

#pragma omp parallel for private(a) num_threads(4)
for(i=0;i<10;i++)
{
  TYPE b;
}

Through this parallel loop, you will get 4 private copies of a, but 10 copies of b.

Besides destructor of private variables, there are some other things at the end of parallel region.It includes, but not limit to:

1. implicit barrier
2. task scheduling (in implicit barrier)
3. last private copy
4. reduction

The interaction of these operations may trigger some really interesting questions.

2. Let's see a little more complicated example, suppose k is C++ non-POD type.

#pragma omp parallel private(k)
{
#pragma omp task shared(k)
  {
    ... = k;
  }
}

OpenMP spec 3.0 does not say a word about the order of task scheduling and destructor of k (Even no "implentation defined"). What it has say is:  [14:23-26]:
"A private variable in a task region can be shared by an explicit task region generated during its execution. However, it is the programmer's responsibility to ensure through synchronization that the lifetime of the variable does not end before completion of the explicit task region sharing it." But through these words, user may still confused that whether he need a taskwait to make sure k is still alive during task scheduling.

I'm not trying to give you any assurance here. What I want to say is that you have to undertand very clearly about what you are doing. If you want to return the value to the encounting implicit task, then there must be some instructions that use k after the task. Under this circumstances, please add a taskwait after the explicit task. If you don't want to modify k in the task, then use private/firstprivate instead. Anyway, don't lead yourself into this unspecified hole until specification fullfill it.

3. The lifespan of private variable is not the worst thing of this issue, let's look at this example:

#pragma omp parallel sections lastprivate(k)
{
#pragma omp section
 {
    // blah blah blah
  }
#pragma omp section
  {
#pragma omp task shared(k)
    {
      k = ...
    }
  }
}

This example shows that the unspecified behavior is not only caused by the lifespan of private variable, but also caused by datarace. Since spec does not talk about the order of implicit barrier and lastprivate copy, the thread of lexically last section may copying its private value, meanwhile some other thread in the team may already reach the implicit barrier and steal the task, then the datarace happens. Again, you should clearly know what exactly you are doing when you write this kind of code.

There are some other similar problems, such as reduction/task, single/task, single/copyprivate, etc. In my opinion, the essence of these problem is that

a) OpenMP gives programmer a slim chance to let one thread access other threads local variable, and only make slimmer behavior specified. See [14: 21-27].
b) There are too many things at the end of parallel region, but we don't know the exactly magic there.

PS: digging the difference between very similar concept and whether the code is specified by the specification is pretty much like the life of a lawyer, and it's really mixed of funny and boring.

星期二 十二月 30, 2008

Recently I was doing some experiments on EPCC OpenMP microbenchmark, especially on parallel for with dynamic schedule type. During these experiments, I found some factors that really impact the performance.

 1. The atomic instruction

Say if you want to atomically add delta to a shared variable value, one probably way is by compare-and-swap(CAS) primitive. Many architectures support this kind of atomic primitive, such as cmpxchg on IA32/amd64 or cas on SPARC V9.

do {
  oldvalue = value;
  newvalue = oldvalue + delta;
} while (CAS(&value, oldvalue, newvalue) != oldvalue);

This scheme works well, but when the number of threads is high, the performance is poor because of the heavy contention, and the work of loop body is waste if CAS failed (which happens a lot).

On IA32/amd64, there's a more efficient way of implementing this atomic add:

lock xadd &value, delta

The lock prefix turns the following instruction into atomic, so the access of shared memory &value is exclusive.

 2. False sharing

Let's say this scenario: two adjacent fields are in the same cache line, which accessed by two threads separately and one of the access is write. Then the writer may invalidate the other's local cache line, so the reader will see a cache miss even the value it's going to load is not modified by the writer. This is called false sharing.

The impact of false sharing is huge if the load and store operation is in a loop of each threads. The solution is simple: add enough paddings for the fields which will be updated by threads so that it can stand alone in one cache line.

 3. Thread binding

There are 2 ways that this factor may hurt the performance:

1) If a server has 4 processors, each processor has 4 cores and each core has 2 hardware threads (SPARC VII). Binding threads on different processors may introduce false sharing problem described above. i.e.

export SUNW_MP_PROCBIND="1 9 17 25" 

Conversely, Binding threads on different cores of same processor can reduce the chance of false sharing, because the cores have shared cache on most multi-core architecture. i.e.

export SUNW_MP_PROCBIND="0 2 4 6"

2) The same chip of 1). If we bind the threads on different hardware threads of same core, i.e.

export SUNW_MP_PROCBIND="0 1" 

it may reduce the overhead of compare-and-swap synchronization (it is more clear on UltraSPARC T2, which has 8 threads per core). The EPCC dynamic-parallel-for is an example: the loop body is pretty small (100 cycles on certain processor) so the majority work comes from the cas synchronization. Assigning threads onto the hardware SMT of one core can reduce the chance of the failure of cas. But this is not helped on such kind of applications which perform a lot of computation in the loop body, because the SMT threads shares the same computing unit of the core.

The three factors mentioned above may affect each other, for example, on SPARC, sometimes we can not observe the impact of false sharing if we bind user threads on different processors because the majority of overhead came from the failure of cas. But on IA32/amd64, the use of lock prefix remove the overhead of cas, the false sharing overhead start to emerge.

 4. load unbalance

This is easy to understand. One probably solution of this problem is let thread WAIT for work other than SPIN to avoid over-subscribing.

This blog copyright 2009 by Bin Fan