Darryl Gove's blog

Tuesday Jul 29, 2008

The cost of mutexes

One of the questions that came up last week was about the performance of mutex locks compared to the performance of atomic operations. The basic cost is that to acquire and free a mutex takes two operations whereas an atomic operation is a single call. There's also a bit more code required when using mutexes. The following code demonstrates the cost of calling mutexes, atomic operations, and also the inline template.

#include <pthread.h>
#include <atomic.h>
#include "timing.h"

#define SIZE 10000000

pthread_mutex_t mutex;
pthread_t thread;
volatile unsigned int counter;

void atomic_add(volatile unsigned int *,int);

void * count(void* value)
{
  counter=0;
  starttime();
  while (counter<SIZE)
  {
    pthread_mutex_lock(&mutex);
    counter++;
    pthread_mutex_unlock(&mutex);
  }
  endtime(SIZE);
  counter=0;
  starttime();
  while (counter<SIZE)
  {
    atomic_add_int(&counter,1);
  }
  endtime(SIZE);
  counter=0;
  starttime();
  while (counter<SIZE)
  {
    atomic_add(&counter,1);
  }
  endtime(SIZE);
}

void main()
{
  pthread_mutex_init(&mutex,0);
  counter=0;
  pthread_create(&thread,0,count,0);
  
  pthread_join(thread,0);
  pthread_mutex_destroy(&mutex);
}

Compiling and running an UltraSPARC T1 gives results like:

% cc test.c add.il
% a.out
Time per iteration 250.61 ns
Time per iteration 75.85 ns
Time per iteration 65.21 ns

So the mutex calls are about 3x slower than atomic operations. Calling libc is about 10ns slower than using an inline template (not a bad difference in return for not having to write the inline template code).

It's interesting to see where the time goes. So here's the profile of the application:

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
3.973     3.973      
1.341     3.973      count
1.331     1.331      mutex_unlock
0.781     0.781      mutex_lock_impl
0.490     0.490      atomic_add_32
0.030     0.030      mutex_lock

The routine mutex_lock tail calls mutex_lock_impl, which does the work of locking the mutex. The heart of mutex_unlock looks like:

   0.        0.                 [?]    beff8:  mov         %o1, %o3
   0.020     0.020              [?]    beffc:  cas         [%o0] , %o2, %o3
## 0.560     0.560              [?]    bf000:  cmp         %o2, %o3
   0.        0.                 [?]    bf004:  bne,a       0xbeff8
   0.        0.                 [?]    bf008:  mov         %o3, %o2

The core of mutex_lock_impl is not too dissimilar, so basically the mutex lock code contains two atomic operation loops, plus a bundle of other instructions that make up the rest of the cost of the calls.

Looking at where the time is spent for the atomic_add_32 call:

                                
   0.010     0.010              [?]    2ecb8:  ld          [%o0], %o2
   0.040     0.040              [?]    2ecbc:  add         %o2, %o1, %o3
   0.010     0.010              [?]    2ecc0:  cas         [%o0] , %o2, %o3
## 0.370     0.370              [?]    2ecc4:  cmp         %o2, %o3
   0.        0.                 [?]    2ecc8:  bne,a,pn    %icc,0x2ecbc
   0.        0.                 [?]    2eccc:  mov         %o3, %o2
   0.050     0.050              [?]    2ecd0:  retl        
   0.010     0.010              [?]    2ecd4:  add         %o2, %o1, %o0

Which is again, a very similar loop, but with little overhead around it. And it pretty much matches the code that came from the inline template:

   0.040     0.040              [?]    110ec:  add         %o2, %o1, %o3
   0.010     0.010              [?]    110f0:  cas         [%o0] , %o2, %o3
## 0.360     0.360              [?]    110f4:  cmp         %o2, %o3
   0.        0.                 [?]    110f8:  bne         0x110ec
   0.040     0.040              [?]    110fc:  mov         %o3, %o2

Comments:

Post a Comment:
Comments are closed for this entry.

Calendar

Search this blog

About

Solaris Application Programming

Book resources

The Developer's Edge

Book resources

OpenSPARC Internals

Book resources

Recent entries

Custom search

Tag cloud

book cmt communityone compiler cooltools cpu2006 dtrace gcc libraries linker multithreading openmp opensolaris opensparc optimisation optimization parallelisation parallelization performance performanceanalyzer programming secondlife solaris solarisapplicationprogramming sparc spot sunstudio ultrasparc ultrasparct2 x86

Links

Webcasts

Articles

Presentations

Interesting docs

Navigation

Referers

Feeds