Touch Dreams

Gulf of Execution

Tuesday Jan 08, 2008

Gulf of Execution is a term used to describe the the difference between the steps one actually needs to take to achieve a goal and the steps that one perceives.

After learning this term, the example that quickly jumps into my mind is setting up those wifi-enabled devices, like Wii, PSP, NDS, Wireless gateway, etc. In my experience, the one with the narrowest gap is iPhone. The worst one is, well, some operating system.

Michael G Schwern had a blog entry about this on Perl.

Who wide is the Gulf in your favorite parallel programming language/model/scheme/library?

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Think in Parallel or Not

Tuesday Nov 20, 2007

Dr. John Shalf posted his view on Prof. Wen-mei Hwu's IEEE MICRO-39 paper. Dr. Shalf states the importance and advance of parallel algorithms and his view on the programming model for parallel algorithms.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Non-concurrency Analysis Used in PTP

Saturday Nov 17, 2007

When visiting the IBM booth at SC07, I was a little surprised to learn that the non-concurrency analysis technology for OpenMP programs had also been adopted and implemented in the Parallel Tools Platform.

Beth Tibbitts from IBM has kindly sent me the reference details: STMCS'07 program, paper, and presentation.

The technology is used by Sun Studio Compilers to do static error check for OpenMP programs.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Maximum Automation for Mundane Tasks

Friday Nov 09, 2007

Adam Kolawa (Parasoft) said in his recent article on DDJ,

"Many people ... want tools to find these bugs automatically. After 20 years of examining how and why errors occur, I believe this is the wrong response. Only a small class of errors can be found automatically; most bugs are related to functionality and requirements, and cannot be identified with just the click of a button."

and

"Our current mission is to address this problem by inventing technologies and strategies to support the brain as it performs this evaluation. We are building automated infrastructures that provide maximum automation for mundane tasks (compiling code, building/running regression test suites, checking adherence to policies, supporting code reviews, and so on) in such a way that each day the brain is presented with the minimal information needed to determine if yesterday's code modifications negatively impacted the application."

There is probably no magic button one can push to turn a piece of legacy code that is not thread-safe into a thread-safe code. A tool should offload the mundane tasks from human brain which can be set free to finish the magic touch.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Some History and Background Information about Threading on Unix-like Systems

Saturday Sep 22, 2007

The C10K problem refers to the problem of serving ten thousand clients simultaneously on a web server. This article written by Daniel Kegel contains some history and background information (and lots links) about threading on Linux, Solaris, BSD, MAC OS X, etc.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Common Mistakes in Using OpenMP 5: Assuming Non-existing Synchronization Before Entering Worksharing Construct

Thursday Sep 20, 2007

There is no synchronization between the threads in a team when they enter a worksharing construct. Many people assume there is a barrier before the threads enter a worksharing construct, especially when there is a FIRSTPRIVATE used in the worksharing construct. This is a common mistake.

For example, in the following code, assume two threads - thread 1 and thread 2 are in the team, and Read1 is executed by thread 1 and Read2 is executed by thread 2.

  #pragma omp parallel
  {
     if (omp_get_thread_num()==0)
        z = 1;
     else
        z = 2;
     #pragma omp sections firstprivate(z)
     {
       #pragma omp section
       {
          ... = z;      // Read1
       }
       #pragma omp section
       {
          ... = z;      // Read2
       }
     }
  }

What are the values of z at Read1 and Read2? All the following three combinations are possible,

  1. Read1:1 Read2:1
  2. Read1:1 Read2:2
  3. Read1:2 Read2:2

If there were a synchronization before the worksharing construct, then the above (Read1:1, Read2:2) is not possible.

Now, look at the following example which has both FIRSTPRIVATE and LASTPRIVATE,

  #pragma omp parallel
  {
     z = 1;
     #pragma omp for firstprivate(z) lastprivate(z) nowait
     for (i=0; i<n; i++) {
          ... = z;      // Read1
          z = 2;        // Write1
     }
  }

What could be the value of z at Read1? Would it be 2? OpenMP 3.0 Draft has clarified this situation. It says

If a list item appears in both firstprivate and lastprivate clauses, the update required for lastprivate occurs after all initializations for firstprivate.

So, the value of z at Read1 cannot be 2.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

A Presentation of DRDT at SVOSUG 9/28/2006

Friday Oct 13, 2006

Here is a short presentation of DRDT at the Silicon Valley OpenSolaris Users Group (SVOSUG) meeting on Setp 28, 2006.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Understanding Data Races 3: Several currently available tools (2/3)

Friday Aug 04, 2006

Runtime Checking (simulation based) Tool

Tool 3: Helgrind from Valgrind

In this blog entry, I will describe my experiment of the test cases with Helgrind. Helgrind is a data race detection module of Valgrind, which is pretty successful framework and tool suite for debugging and profiling Linux programs.

Unlike other runtime checking tools I will describe later (e.g. Intel's Thread Checker and Sun's DRDT), Valgrind is simulation based. One advantage of simulation based approach is the two active entities - the target application and the detection module are in different processes. They have different address spaces and name spaces. Therefore this approach can avoid many conflicts between the two entities. For example, the detection module can call any library routines that it monitors without worrying about re-entry problems. [Update: Valgrind actually runs in the same namespace as the target application. And the target application and the detection module are part of the same process. The detection module (core and tools) are designed carefully to avoid dependence on glibc.so.] One challenge of simulation based approach is dealing with system calls. The simulation based approach simulates only the execution of the user process, and it is NOT simulating the OS. A even more bigger challenge is to deal with threading calls. Valgrind is not multi-threaded itself, and all threading executions are serialized. I have not got a chance to study how it works. It must be very interesting. Valgrind's manual claims it works with NPTL or LinuxThreads "well enough for significant threaded applications".

Helgrind is based on the famous Eraser method enhanced with detection of thread creation and thread join. The method is very similar to that used in Compaq/HP's Visual Threads (as described in Harrow's paper). Lockset based methods (such as Eraser) tend to have a lot of false positives.

Currently Valgrind is at release 3.2.0. But the latest version that Helgrind works is 2.2.0. When I ran Helgrind in 3.2.0, I got

Helgrind is currently not working, because:
 (a) it is not yet ready to handle the Vex IR and the use with 64-bit
     platforms introduced in Valgrind 3.0.0
 (b) we need to get thread operation tracking working again after
     the changes added in Valgrind 2.4.0
 If you want to use Helgrind, you'll have to use Valgrind 2.2.0, which is
 the most recent Valgrind release that contains a working Helgrind.

Sorry for the inconvenience.  Let us know if this is a problem for you.

Then I swithced to 2.2.0. First I tried with pthr_prime.c.

$ cc -g pthr_prime.c -lm -lpthread -o pthr_prime

$ valgrind --tool=helgrind ./pthr_prime

==32368== Helgrind, a data race detector for x86-linux.
==32368== Copyright (C) 2002-2004, and GNU GPL'd, by Nicholas Nethercote et al.
==32368== Using valgrind-2.2.0, a program supervision framework for x86-linux.
==32368== Copyright (C) 2000-2004, and GNU GPL'd, by Julian Seward et al.
==32368== For more details, rerun with: -v
==32368== 
==32368== Thread 2:
==32368== Possible data race writing variable at 0x80498B0 (total)
==32368==    at 0x80486BD: work (pthr_prime.c:51)
==32368==    by 0x1D4AFE79: thread_wrapper (vg_libpthread.c:867)
==32368==    by 0xB0010EF3: (within /home/yl140942/vg2/lib/valgrind/stage2)
==32368==  Address 0x80498B0 is in data section of /home/yl140942/tmp/vg/a.out
==32368==  Previous state: shared RO, no locks
==32368== 
==32368== Possible data race writing variable at 0x57EFE95C 
==32368==    at 0x804877D: main (pthr_prime.c:75)
==32368==  Address 0x57EFE95C == &(i) at pthr_prime.c:75
==32368==  Previous state: shared RO, no locks
==32368== 
==32368== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 2 from 2)
==32368== 4 possible data races found; 0 lock order problems
Helgrind finds the race access of total at line 51 and race access of i at line 75. Note that a data race is caused by a pair of accesses. Helgrind reports only one access of a pair. For the first one, the report is ok because line 51 reads and updates total, therefore it is fairly easy to guess what are the racing access pairs. For the second one (i at line 75), I would imagine it would take a fair large of amount of time for one to figure out the other race access of the pair is in line 46. Helgrind also misses several data races (e.g. write-write race at line 50, write-read race between line 50 and 76) due to the heuristic it uses.

Next, I tried with pthr_prime_fixed.c.

$ cc -g pthr_prime_fixed.c -lm -lpthread -o pthr_prime_fixed
$ valgrind --tool=helgrind ./pthr_prime_fixed

==21596== Helgrind, a data race detector for x86-linux.
==21596== Copyright (C) 2002-2004, and GNU GPL'd, by Nicholas Nethercote et al.
==21596== Using valgrind-2.2.0, a program supervision framework for x86-linux.
==21596== Copyright (C) 2000-2004, and GNU GPL'd, by Julian Seward et al.
==21596== For more details, rerun with: -v
==21596==
==21596== Thread 2:
==21596== Possible data race writing variable at 0x804CA10 (pflag+16)
==21596==    at 0x80486DC: is_prime (pthr_prime_fixed.c:34)
==21596==    by 0x8048756: work (pthr_prime_fixed.c:50)
==21596==    by 0x1D4EAE79: thread_wrapper (vg_libpthread.c:867)
==21596==    by 0xB0010EF3: (within /home/yl140942/vg2/lib/valgrind/stage2)
==21596==  Address 0x804CA10 is in BSS section of /home/yl140942/tmp/vg/pthr_prime_fixed
==21596==  Previous state: exclusively owned by thread 1
==21596==
==21596== Thread 2:
==21596== Possible data race writing variable at 0x804CA18 (pflag+24)
==21596==    at 0x80486DC: is_prime (pthr_prime_fixed.c:34)
==21596==    by 0x8048756: work (pthr_prime_fixed.c:50)
==21596==    by 0x1D4EAE79: thread_wrapper (vg_libpthread.c:867)
==21596==    by 0xB0010EF3: (within /home/yl140942/vg2/lib/valgrind/stage2)
==21596==  Address 0x804CA18 is in BSS section of /home/yl140942/tmp/vg/pthr_prime_fixed
==21596==  Previous state: exclusively owned by thread 1

<similar messages repeated for various pflag+offset>

==21596== Thread 2:
==21596== Possible data race writing variable at 0x804CAC0 (pflag+192)
==21596==    at 0x80486DC: is_prime (pthr_prime_fixed.c:34)
==21596==    by 0x8048756: work (pthr_prime_fixed.c:50)
==21596==    by 0x1D4EAE79: thread_wrapper (vg_libpthread.c:867)
==21596==    by 0xB0010EF3: (within /home/yl140942/vg2/lib/valgrind/stage2)
==21596==  Address 0x804CAC0 is in BSS section of /home/yl140942/tmp/vg/pthr_prime_fixed
==21596==  Previous state: exclusively owned by thread 1
==21596==
==21596==
==21596== Possible data race reading variable at 0x80499B0 (total)
==21596==    at 0x8048873: main (pthr_prime_fixed.c:80)
==21596==  Address 0x80499B0 is in data section of /home/yl140942/tmp/vg/pthr_prime_fixed
==21596==  Previous state: shared RW, locked by:0x80499B4(mutex)
==21596==
==21596== ERROR SUMMARY: 33 errors from 33 contexts (suppressed: 2 from 2)
==21596== 35 possible data races found; 0 lock order problems

This time Helgrind reports 32 races accesses of pflag[] at line 34. As explained in DRDT tutorial, these are benign data races. Helgrind also reports a false positive race that has an access of total at line 80.

Helgrind does a good job of reporting the name of the variable involved in the data races (e.g. total, pflag[] and i) and the lock variables (e.g. mutex). The Previous state gives a hint why Helgrind thinks an access might cause data race. For example, in the above experiment with pthr_prime_fixed.c, for the access of total at line 80, it says "Previous state: shared RW, locked by:0x80499B4(mutex)". The accesses of total at lines 52-53 are protected by mutex locks. When Helgrind finds the read access of total is not protected by the same lock (or any lock in this case), it reports a possbile data race. The detection of the thread_join sometime did not work to get rid of the false positive though.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Understanding Data Races 2: Several currently available tools (1/3)

Thursday Jul 06, 2006

Introduction

This blog entry begins to describe a couple of currently available tools that detect data races in multi-threaded C/C++/Fortran programs. These tools and the categories they can be roughly put into are

  1. Static Checking
    • LockLint from Sun
    • vpara compile time check for OpenMP programs from Sun
  2. Runtime Checking - simulation based
    • Helgrind from Valgrind
  3. Runtime Checking - execution based
    • Visual Threads from HP
    • Thread Checker from Intel
    • Data Race Detection Tool from Sun

What not covered here are the tools from some research work. Some of them use combined static and runtime methods, and some use post-mortem based approaches.

Code Examples

I will reuse the following four code examples from the Tutorial of Using Sun Data Race Detection Tool. If you have downloaded and installed the Sun Studio Express June 2006, you should be able to find the example codes under

{installed-directory}/opt/SUNWspro/examples/rdt/prime.

All four codes find the prime numbers between 2 and 3000 using 4 threads. An OpenMP version and a Pthread version are provided,

  1. omp_prime.c: OpenMP version, contains data races
  2. omp_prime_fixed.c: OpenMP version, bugs fixed
  3. pthr_prime.c: Pthread version, contains data races and bugs
  4. pthr_prime_fixed.c: Pthread version, bugs fixed

Read the Tutorial to find out what the data races are and how the bugs are fixed.

omp_prime.c
    ...
    12	#include <stdio.h>
    13	#include <math.h>
    14	#include <omp.h>
    15	
    16	#define THREADS 4
    17	#define N 3000
    18	
    19	int primes[N];
    20	int pflag[N];
    21	
    22	int is_prime(int v)
    23	{
    24	    int i;
    25	    int bound = floor(sqrt(v)) + 1;
    26	    
    27	    for (i = 2; i < bound; i++) {
    28	        /* no need to check against known composites */ 
    29	        if (!pflag[i]) 
    30	            continue;
    31	        if (v % i == 0) { 
    32	            pflag[v] = 0;
    33	            return 0;
    34	        }
    35	    }
    36	    return (v > 1); 
    37	}
    38	
    39	int main(int argn, char **argv)
    40	{
    41	    int i;
    42	    int total = 0;
    43	
    44	#ifdef _OPENMP
    45	    omp_set_num_threads(THREADS);
    46	    omp_set_dynamic(0);
    47	#endif
    48	
    49	    for (i = 0; i < N; i++) {
    50	        pflag[i] = 1; 
    51	    }
    52	    
    53	    #pragma omp parallel for
    54	    for (i = 2; i < N; i++) {
    55	        if ( is_prime(i) ) {
    56	            primes[total] = i;
    57	            total++;
    58	        }
    59	    }
    60	    printf("Number of prime numbers between 2 and %d: %d\n",
    61	           N, total);
    62	    for (i = 0; i < total; i++) {
    63	        printf("%d\n", primes[i]);
    64	    }
    65	}
pthr_prime.c
    ...
    12	#include <stdio.h>
    13	#include <math.h>
    14	#include <pthread.h>
    15	
    16	#define THREADS 4
    17	#define N 3000
    18	
    19	int primes[N];
    20	int pflag[N];
    21	int total = 0;
    22	
    23	int is_prime(int v)
    24	{
    25	    int i;
    26	    int bound = floor(sqrt(v)) + 1;
    27	
    28	    for (i = 2; i < bound; i++) {
    29	        /* no need to check against known composites */ 
    30	        if (!pflag[i])
    31	            continue;
    32	        if (v % i == 0) {
    33	            pflag[v] = 0;
    34	            return 0;
    35	        }
    36	    }
    37	    return (v > 1); 
    38	}
    39	
    40	void *work(void *arg)
    41	{
    42	    int start;
    43	    int end;
    44	    int i;
    45	
    46	    start = (N/THREADS) * (*(int *)arg) ;
    47	    end = start + N/THREADS;
    48	    for (i = start; i < end; i++) {
    49	        if ( is_prime(i) ) {
    50	            primes[total] = i;
    51	            total++;        
    52	        }
    53	    }
    54	    return NULL;
    55	}
    56	
    57	int main(int argn, char **argv)
    58	{
    59	    int i;
    60	    pthread_t tids[THREADS-1];
    61	
    62	    for (i = 0; i < N; i++) {
    63	        pflag[i] = 1; 
    64	    }
    65	
    66	    for (i = 0; i < THREADS-1; i++) {
    67	        pthread_create(&tids[i], NULL, work, (void *)&i);
    68	    }
    69	
    70	    i = THREADS-1;
    71	    work((void *)&i);
    72	    
    73	    printf("Number of prime numbers between 2 and %d: %d\n",
    74	           N, total);
    75	    for (i = 0; i < total; i++) {
    76	        printf("%d\n", primes[i]);
    77	    }
    78	}
omp_prime_fixed.c
    ...
    12	#include <ststdio.h>
    13	#include <math.h>
    14	#include <pthread.h>
    15	
    16	#define THREADS 4
    17	#define N 3000
    18	
    19	int primes[N];
    20	int pflag[N];
    21	int total = 0;
    22	pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    23	
    24	int is_prime(int v)
    25	{
    26	    int i;
    27	    int bound = floor(sqrt(v)) + 1;
    28	
    29	    for (i = 2; i < bound; i++) {
    30	        /* no need to check against known composites */ 
    31	        if (!pflag[i])
    32	            continue;
    33	        if (v % i == 0) {
    34	            pflag[v] = 0;
    35	            return 0;
    36	        }
    37	    }
    38	    return (v > 1); 
    39	}
    40	
    41	void *work(void *arg)
    42	{
    43	    int start;
    44	    int end;
    45	    int i;
    46	    
    47	    start = (N/THREADS) * ((int)arg) ;
    48	    end = start + N/THREADS;
    49	    for (i = start; i < end; i++) {
    50	        if ( is_prime(i) ) {
    51	            pthread_mutex_lock(&mutex);
    52	            primes[total] = i;
    53	            total++;        
    54	            pthread_mutex_unlock(&mutex);
    55	        }
    56	    }
    57	    return NULL;
    58	}
    59	
    60	int main(int argn, char **argv)
    61	{
    62	    int i;
    63	    pthread_t tids[THREADS-1];
    64	
    65	    for (i = 0; i < N; i++) {
    66	        pflag[i] = 1; 
    67	    }
    68	
    69	    for (i = 0; i < THREADS-1; i++) {
    70	        pthread_create(&tids[i], NULL, work, (void *)i);
    71	    }
    72	
    73	    i = THREADS-1;
    74	    work((void *)i);
    75	    
    76	    for (i = 0; i < THREADS-1; i++) {
    77	        pthread_join(tids[i], NULL);
    78	    }
    79	
    80	    printf("Number of prime numbers between 2 and %d: %d\n",
    81	           N, total);
    82	    for (i = 0; i < total; i++) {
    83	        printf("%d\n", primes[i]);
    84	    }
    85	}
pthr_prime_fixed.c
    ...
    12	#include <stdio.h>
    13	#include <math.h>
    14	#include <pthread.h>
    15	
    16	#define THREADS 4
    17	#define N 3000
    18	
    19	int primes[N];
    20	int pflag[N];
    21	int total = 0;
    22	pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    23	
    24	int is_prime(int v)
    25	{
    26	    int i;
    27	    int bound = floor(sqrt(v)) + 1;
    28	
    29	    for (i = 2; i < bound; i++) {
    30	        /* no need to check against known composites */ 
    31	        if (!pflag[i])
    32	            continue;
    33	        if (v % i == 0) {
    34	            pflag[v] = 0;
    35	            return 0;
    36	        }
    37	    }
    38	    return (v > 1); 
    39	}
    40	
    41	void *work(void *arg)
    42	{
    43	    int start;
    44	    int end;
    45	    int i;
    46	    
    47	    start = (N/THREADS) * ((int)arg) ;
    48	    end = start + N/THREADS;
    49	    for (i = start; i < end; i++) {
    50	        if ( is_prime(i) ) {
    51	            pthread_mutex_lock(&mutex);
    52	            primes[total] = i;
    53	            total++;        
    54	            pthread_mutex_unlock(&mutex);
    55	        }
    56	    }
    57	    return NULL;
    58	}
    59	
    60	int main(int argn, char **argv)
    61	{
    62	    int i;
    63	    pthread_t tids[THREADS-1];
    64	
    65	    for (i = 0; i < N; i++) {
    66	        pflag[i] = 1; 
    67	    }
    68	
    69	    for (i = 0; i < THREADS-1; i++) {
    70	        pthread_create(&tids[i], NULL, work, (void *)i);
    71	    }
    72	
    73	    i = THREADS-1;
    74	    work((void *)i);
    75	    
    76	    for (i = 0; i < THREADS-1; i++) {
    77	        pthread_join(tids[i], NULL);
    78	    }
    79	
    80	    printf("Number of prime numbers between 2 and %d: %d\n",
    81	           N, total);
    82	    for (i = 0; i < total; i++) {
    83	        printf("%d\n", primes[i]);
    84	    }
    85	}

Static Checking Tools

Static checking tools find data races in a program without actually executing the program.

The static checking approach has three advantages, as compared with runtime based approachs.

  1. It can be very fast and consume little memory.
  2. The analysis does not affect the behavior of program because it is performed offline.
  3. It can detect potential data races that do not happen in a particular run with a particular input data set.

Because of the above advantages, static checking can be used in situations where it is very difficult or impossible to get a runtime experiment or where it is very difficult or impossible to get a precise runtime experiment without altering the runtime result, such as OS kernels and device drivers.

The biggest disadvantage of static checking is the large amount of false positives it may generate. Static checking is always puzzled by imprecise information due pointer aliasing and vague execution paths.

Tool 1: LockLint from Sun

Sun Studio provides a utility called LockLint, which analyzes the use of mutex and reader/writer locks, and reports data races and deadlocks due to inconsistent use of locking techniques.

LockLint reports a data race when accesses to a variable are not consistently protected by at least one lock, or accesses violate assertions about which locks protect the variable.

LockLint originates from WARLOCK, which was designed to detect data races and deadlocks in Solaris kernels and device drivers. Search for warlock in opensolaris.org, and you can still find the use of it there.

The following shows the result of using LockLint on pthr_prime.c. Notice the false positive at line 63, and false negative with respect to variable i.

$ cc -mt -Zll pthr_prime.c
$ lock_lint start
$ lock_lint load pthr_prime.ll
$ lock_lint analyze -v

* Warning: A main function was loaded with no annotations to indicate the
      presence or absence of concurrency. Lock_lint will assume concurrency.
  Please annotate source with:
      NOTE(COMPETING_THREADS_NOW) or NOTE(NO_COMPETING_THREADS_NOW)

* Writable variable read while no locks held!
  variable = :pflag
     where = :is_prime [pthr_prime.c,30]

* Variable written while no locks held!
  variable = :pflag
     where = :is_prime [pthr_prime.c,33]

* Variable written while no locks held!
  variable = :pflag
     where = :main [pthr_prime.c,63]

* Writable variable read while no locks held!
  variable = :total
     where = :main [pthr_prime.c,74]

* Writable variable read while no locks held!
  variable = :total
     where = :main [pthr_prime.c,75]

* Writable variable read while no locks held!
  variable = :primes
     where = :main [pthr_prime.c,76]

* Writable variable read while no locks held!
  variable = :total
     where = :main [pthr_prime.c,77]

* Writable variable read while no locks held!
  variable = :total
     where = :work [pthr_prime.c,50]

* Variable written while no locks held!
  variable = :primes
     where = :work [pthr_prime.c,50]

* Variable written while no locks held!
  variable = :total
     where = :work [pthr_prime.c,51]

The following shows the result of using LockLint on pthr_prime_fixed.c. Notice that the data races in routine work() are now gone, but the false positives and the false negatives in the previous experiment with pthr_prime.c are still there.

$ cc -mt -Zll pthr_prime_fixed.c
$ lock_lint start
$ lock_lint load pthr_prime_fixed.ll
$ lock_lint analyze -v

* Warning: A main function was loaded with no annotations to indicate the
      presence or absence of concurrency. Lock_lint will assume concurrency.
  Please annotate source with:
      NOTE(COMPETING_THREADS_NOW) or NOTE(NO_COMPETING_THREADS_NOW)

* Writable variable read while no locks held!
  variable = :pflag
     where = :is_prime [pthr_prime_fixed.c,31]

* Variable written while no locks held!
  variable = :pflag
     where = :is_prime [pthr_prime_fixed.c,34]

* Variable written while no locks held!
  variable = :pflag
     where = :main [pthr_prime_fixed.c,66]

* Writable variable read while no locks held!
  variable = :total
     where = :main [pthr_prime_fixed.c,81]

* Writable variable read while no locks held!
  variable = :total
     where = :main [pthr_prime_fixed.c,82]

* Writable variable read while no locks held!
  variable = :primes
     where = :main [pthr_prime_fixed.c,83]

* Writable variable read while no locks held!
  variable = :total
     where = :main [pthr_prime_fixed.c,84]

LockLint provides a rich set of source code notations and interactive subcommands that can be used to provide more precise information to LockLint so to improve the analysis.

Tool 2: vpara option in Sun Studio Fortran/C compilers

Strickly, this is not a tool. It is a compile-time check option provided in Sun Studio Fortran and C compilers. The following is from the man page of the cc command.

     -xvpara
          Show parallelization warning messages

          Issues warnings about potential parallel programming
          related problems that may cause incorrect results when
          using OpenMP or Sun/Cray parallel directives and prag-
          mas.

          Use with -xopenmp and OpenMP API directives, or with
          -explictpar and MP parallelization directives.

          Warnings are issued when the compiler detects the fol-
          lowing situations:

          o Loops that are parallelized using MP directives when
          there are data dependencies between different loop
          iterations

          o Problematic use of OpenMP data sharing attributes
          clauses, such as declaring a variable "shared" whose
          accesses in an OpenMP parallel region may cause data
          race, or declaring a variable "private" whose value in
          a parallel region is used after the parallel region.

In short, when -xvpara is used as an option to compile an OpenMP program, the compiler is able to report problems in the source code caused by incorrect use of data sharing attribute clause. One typical problem is data race introduced by incorrectly declaring a variable "shared".

When using vpara checking on the omp_prime.c, the compiler finds the data race between the write accesses to variable total at line 57 by different threads, as illustrated below. The checking analyzes the code enclosed lexically inside an OpenMP parallel region only, therefore it does not find data races in routine is_prime(). The checking also misses the data race on array primes[] due to a technique to reduce false positives. Unfortunately, the technique introduces a false negative here.

$ cc -xopenmp -xO3 -xvpara omp_prime.c -lm
"omp_prime.c", line 53: Warning: inappropriate scoping
        variable 'total' may be scoped inappropriately as 'shared'
        . write at line 57 and write at line 57 may cause data race

$ cc -xopenmp -xO3 -xvpara omp_prime_fixed.c -lm
$

The vpara compile-time checking is based on the static non-concurrency analysis techniques for OpenMP programs, which is also used by the OpenMP autoscoping feature provided in Sun Studio compilers.

[5] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Understanding Data Races 1: the Role of Data Race Detection Tools

Friday Jun 30, 2006

This is the first of a series of blogs on understanding data races I am going to post.


With the release of Sun Studio Express (June 2006 Build), we are offering a run-time data race detection tool (DRDT) for developers on Sun's platforms for FREE. It compliments other data race detection tools Sun already offers now.

If you have been bugged by data race problems in the past, you should give it a try. Go here (scroll to 'How to get started') to download it. And here is the page dedicated to the DRDT project.


I would like to start the series with understanding the role data race detection tools first.

Many mt programs have race conditions, the existence of which makes debugging mt programs very hard. One class of race conditions is data race condition or data race. (The difference between general race condtion and data race condition will be explained in another blog.)

Data race is a condition that happens in a program. People often think a data race is always a bug. This is not true. A data race could be the root cause of a bug; it could be caused by a bug; or it could be there because the programmer wants it there.

If a data race is the root cause of a bug, we want to find it. If a data race is caused by a bug, showing where the data race is can help the programmer locate the real bug. If a data race is there by design, we want to make sure it is there and we also want to make sure there is no unexpected data race.

The role of a data race detection tool is to check whether a program contains data races and pin-point the locations of them if there is any.

There are many ways of using a data race detection tool. Some use it as debugging tool: run it when there is a bug in the program. Someone use it as a sanity checking tool: run it as part of regression tests. And some use it as a programming assistance tool in parallelizing sequential programs: find thread unsafe routines and global variables that should be private to threads.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

The idea behind environment variable "SUNW_MP_MAX_POOL_THREADS"

Thursday Jun 29, 2006

Sun's OpenMP implementation supports true nested parallel regions - when nested parallelism is enabled, the inner parallel region can be executed by multiple threads concurrently.

We provide an environment variable called SUNW_MP_MAX_POOL_THREADS for users to control the total number of OpenMP slave threads in a process.

For example, if you have want a maximum of 16 threads to be used for a nest of parallel regions in your program, you can set SUNW_MP_MAX_POOL_THREADS to 15. That's 15 slave threads (some of them may become masters in inner parallel regions) plus one user thread which is the master thread for the out-most parallel region.

Why did we design an environment variable like SUNW_MP_MAX_NUM_THREADS so that a user can set it to 16 in the above example? Intel's implementation has KMP_ALL_THREADS and KMP_MAX_THREADS which do that.

Well, we were trying to have a scheme that works on more general cases, not just pure OpenMP codes. In particular, we think our scheme works better than others for mixed pthread and OpenMP thread code. The pool defines a set of threads that can be used as OpenMP slave threads. If the program has two pthreads and both will create a team, then both will try to grab slave threads from the same pool. The env var SUNW_MP_MAX_POOL_THREADS was NOT designed for users to control the total number of threads in a process. We cannot control that because of the use of pthreads. The env var is designed for users to control the total number of OpenMP slave threads.

The env var SUNW_MP_MAX_NUM_THREADS is documented here. We also have a short article "How Many Threads Does It Take?" if you want to understand it better.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Common Mistakes in Using OpenMP 4: Orphaned Worksharing Constructs

Sunday Jun 11, 2006

More precisely, this mistake should be classified as a common mis-understanding of OpenMP.

When a worksharing construct, such omp for or omp sections, is encountered outside any explicit parallel region, the arising worksharing region is called orphaned worksharing region. A common mis-understanding is that in this case the worksharing construct is simply being ignored and the region is executed sequentially.

Orphaned worksharing constructs are not ignored. All the data sharing attribute clauses are honored. The worksharing regin is executed as if a team of only one thread is executing the region.

For example, in the following C++ code,

     main() 
     {
         class_type_1  a;
         #pragma omp for private(a) schedule(dynamic)
         for (i=1; i<100; i++) {
             printf("%d\n", i);
         } 
     } 

the default constructor for class_type_1 will be called, and a comforming implementation is not forced to execute the loop in the order of 1, 2, 3, ..., 99.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Concurrency vs Parallelism, Concurrent Programming vs Parallel Programming

Sunday Jun 11, 2006

In the danger of hairsplitting, ...

Concurrency and parallelism are NOT the same thing. Two tasks T1 and T2 are concurrent if the order in which the two tasks are executed in time is not predetermined,

  • T1 may be executed and finished before T2,
  • T2 may be executed and finished before T1,
  • T1 and T2 may be executed simultaneously at the same instance of time (parallelism),
  • T1 and T2 may be executed alternatively,
  • ...

If two concurrent threads are scheduled by the OS to run on one single-core non-SMT non-CMP processor, you may get concurrency but not parallelism. Parallelism is possible on multi-core, multi-processor or distributed systems.

Concurrency is often referred to as a property of a program, and is a concept more general than parallelism.

Interestingly, we cannot say the same thing for concurrent programming and parallel programming. They are overlapped, but neither is the superset of the other. The difference comes from the sets of topics the two areas cover. For example, concurrent programming includes topic like signal handling, while parallel programming includes topic like memory consistency model. The difference reflects the different orignal hardware and software background of the two programming practices.

[1] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Common Mistakes in Using OpenMP 3: Fifteen Cases from a IWOMP 2006 paper by Michael Süß and Claudia Leopold

Wednesday Jun 07, 2006

The coming International Workshop on OpenMP (IWOMP 2006) has a paper titled "Common Mistakes in OpenMP and How to Avoid Them" written by Michael Süß and Claudia Leopold (University of Kassel, Germany).

The result is based on a survey of two undergraduate courses. The authors of the paper kindly allow me to list the 15 common mistakes presented in their paper here,

  1. (Correctness) Access to shared variables not protected
  2. (Correctness) Use of locks without flush
  3. (Correctness) Read of shared variable without flush
  4. (Correctness) Forget to mark private variables as such
  5. (Correctness) Use of ordered clause without ordered construct
  6. (Correctness) Declare loop variable in #pragma omp parallel for as shared
  7. (Correctness) Forget to put down for in #pragma omp parallel for
  8. (Correctness) Try to change num. of thr. in parallel reg. after start of reg.
  9. (Correctness) omp_unset_lock() called from non-owner thread
  10. (Correctness) Attempt to change loop variable while in #pragma omp for
  11. (Performance) Use of critical when atomic would be sufficient
  12. (Performance) Put too much work inside critical region
  13. (Performance) Use of orphaned construct outside parallel region
  14. (Performance) Use of unnecessary flush
  15. (Performance) Use of unnecessary critical

For detail, please read the full paper.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Read: "The Rise and Fall of CORBA"

Sunday Jun 04, 2006

The June 2006 issue (Vol 4, No 5) of ACM Queue features an aritcle by Michi Henning of ZeroC on the rise and fall of CORBA.

Technical issues and procedural issues contribute to the fall of CORBA. And the procedural problems are the root cause of the procedural problems. Many of the issues the article points out are alarming familiar!

The following is a list of lessons learnt in how to have a better standards process,

  • Standards consortia need iron-clad rules to ensure that they standardize existing best practice.
  • No standard should be approved without a reference implementation.
  • No standard should be approved without having been used to implement a few projects of realistic complexity.
  • Open source innovation usually is subject to a Darwinian selection proecess.
  • To create quality software, the ability to say "no" is usually far more important than the ability to say "yes".

Read the whole article.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Static Code Analysis Tools

Sunday May 28, 2006

New.com recently has an article on companies making comercial static code analysis tools for checking security flaws.

Companies and products to watch: 

Most of them use context sensitive, interprocedural, cross module, and mixed language analysis. A major difference between the analysis used in static error detection and the one used in compiler optimization is that the former can be incomplete and unsound.


Here is a link to a site that lists a collection of static analysis tools for C code.


[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Common Mistakes in Using OpenMP 2: Atomic

Monday Feb 20, 2006

The following code finds good members in array member[] and stores the indices of the good members in array good_members[].

#define N 1000

struct data member[N];

int good_members[N];

int pos = 0;

void find_good_members()
{
for (i=0; i < N; i++) {
if (is_good(member[i])) {
good_members[pos] = i;
pos ++;
}
}
}

The following is a navie way of parallelizing the above code,


#define N 1000

struct data member[N];

int good_members[N];

int pos = 0;

void find_good_members()
{
#pragma omp parallel for
for (i=0; i < N; i++) {
if (is_good(member[i])) {
good_members[pos] = i; // line a
#pragma omp atomic
pos ++; // line b
}
}
}

In order to avoid data races between different updates of global variable pos, the code puts the increment (at line b) in a atomic construct. However, the code does not work, because there is a data race between the read of pos at line a and write of pos at line b.

Changing the body of the if statement to the following gives the correct result.

      int mypos;
#pragma omp critical
{
mypos = pos;
pos ++;
}
good_members[mypos] = i;

In OpenMP 2.5 (the latest Specification), inside a parallel region, the only place where you can safely get the value of a variable that is updated in an atomic region is another atomic region.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Common Mistakes in Using OpenMP 1: Incorrect Directive Format

Friday Dec 30, 2005

In C/C++, OpenMP directives are specified by using the #pragma mechanism; and in Fortran, they are specified by using special comments that are identified by unique sentinels.

This design allows users to write OpenMP programs that can be compiled with compilers that do not support OpenMP or compiled with OpenMP compiles with OpenMP support disabled.

However, if you do not follow the directive format, you might get a program that compiles and runs but gives unexpected results, because the compiler does not recognize your OpenMP directives and thinks they are non-OpenMP related pragmas (C/C++) or regular comments (Fortran).

Quiz:

How many "me"s does the following code print? Assume a team of 4 threads are executing the parallel region.

foo() 
{
    #pragma omp parallel
    {
        #pragma single
        {
            printf("me\n");
        }
    }
}

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg