pthread_{get,set}specific vs thread local variable
When you write multithreaded code,
you sometimes want per-thread data that can be easily accessed from anywhere.
This need is especially pressing when you port an existing single-threaded application
to be multithreaded. Two of the possible solutions are 1) using the pthread_{get,set}specific() API
and 2) using the compiler's thread local variable support. In this post, I'll compare those two approaches and show an example.
pthread_{get,set}specific() are part of the POSIX thread API
and provide access to per-thread data through a key. This contrasts with the thread local variable,
which is a language extension provided by the Sun Studio compiler.
Both rely on the runtime linker's support for thread local storage (TLS),
but they are quite different.
First of all, they look completely different in the source code.
The pthread APIs are just a bunch of function calls,
while a thread local variable is declared with a specifier on the variable itself.
So the thread local variable is more seamlessly integrated into the source but is less portable.
(Well, gcc supports __thread on many platforms, and the Microsoft compiler supports __declspec( thread )
to do the same, so portability isn't that big of a deal in practice. Still, it is a non-standard extension,
and once you use it your code won't be 100% standard C.)
The difference in interface also affects compiler optimization, because the compiler recognizes a thread local variable and
knows its semantics, but it has no idea what the pthread APIs are about (and even if we taught compilers to
recognize them, there's not much a compiler could do because those APIs live in libc.so).
pthread_{get,set}specific() can store only a void *. This means that if you want data
that's bigger than a void *, you need to allocate it on the heap yourself and store a pointer to it.
This contrasts with __thread, which can be used for data of any size. And for such large data, __thread
doesn't need the extra indirection that pthread_{get,set}specific() incurs when a pointer is stored instead of the actual variable.
Here's a single .c file that can be built to do equivalent stuff in pthread_{get,set}specific() and the compiler language extension.
# cat -n thread.c
     1  #include <stdio.h>
     2  #include <stdlib.h>
     3  #include <unistd.h>
     4  #include <alloca.h>
     5  #include <pthread.h>
     6
     7  void *test(void *); /* thread routine */
     8  void nop(int);
     9
    10  #ifdef TLS
    11  __thread int i;
    12  #else
    13  pthread_key_t key_i;
    14  #endif
    15  pthread_t *tid;
    16
    17  int
    18  main(int argc, char *argv[])
    19  {
    20      int i;
    21      int iter;
    22      int nthread;
    23      if (argc <= 2) {
    24          printf("Usage: %s #threads #iteration\n", argv[0]);
    25          return (1);
    26      }
    27
    28  #ifndef TLS
    29      pthread_key_create(&key_i, NULL);
    30  #endif
    31
    32      nthread = atoi(argv[1]);
    33      iter = atoi(argv[2]);
    34      tid = alloca( sizeof(pthread_t) * nthread );
    35
    36      printf("main() %d threads %d iterations\n", nthread, iter);
    37
    38      for ( i = 0; i < nthread; i++)
    39          pthread_create(&tid[i], NULL, test,
    40              (void *)iter);
    41      for ( i = 0; i < nthread; i++)
    42          pthread_join(tid[i], NULL);
    43
    44      printf("main() reporting that all %d threads have terminated\n", i);
    45      return (0);
    46  } /* main */
    47
    48  void *
    49  test(void *arg)
    50  {
    51      int count = (int)arg;
    52      int v;
    53  #ifdef TLS
    54      i = count;
    55  #else
    56      if ( pthread_setspecific( key_i, (void*)count ) != 0 ) {
    57          printf("pthread_setspecific failed\n");
    58      }
    59  #endif
    60      printf("thread %d test count %d\n", pthread_self(), count);
    61  #ifdef TLS
    62      while( i > 10 ) {
    63          i = i - 1;
    64          nop(i);
    65      }
    66  #else
    67      while( (v = (int)(pthread_getspecific(key_i))) > 10 ) {
    68          pthread_setspecific(key_i, (void*)(v -1));
    69          nop(v);
    70      }
    71  #endif
    72      printf("thread %d finished. count %d\n", pthread_self(), count);
    73
    74      return (NULL);
    75  }
#
This is a contrived example, but it is good enough to illustrate the difference between the compiler support and the pthread API.
You might have noticed __thread at line 11. It is a specifier that makes the variable per-thread,
meaning that each thread sees its own copy of i. Without it, all threads would share the same i.
I built the above source code with:
# cc -fast -xarch=v8plus thread.c -DTLS -o tls.out nop.il -Wc,-Qinline-l
# cc -fast -xarch=v8plus thread.c -o specific.out nop.il -Wc,-Qinline-l
You may wonder what those extra compiler arguments (nop.il and -Wc,-Qinline-l) are about. I'll explain later. Anyway, with these builds, let's compare the generated code for the while loops above (lines 62 and 67).
# dis -F test specific.out
...
test+0x58: 40 00 40 73 call pthread_setspecific
test+0x5c: d0 06 a0 84 ld [%i2 + 0x84], %o0
test+0x60: 90 10 00 1c mov %i4, %o0
test+0x64: 01 00 00 00 nop
test+0x68: 40 00 40 75 call pthread_getspecific
test+0x6c: d0 06 a0 84 ld [%i2 + 0x84], %o0
test+0x70: b8 10 00 08 mov %o0, %i4
test+0x74: 80 a2 20 0a cmp %o0, 0xa
test+0x78: 14 4f ff f8 bg,pt %icc, test+0x58
test+0x7c: 92 22 20 01 sub %o0, 0x1, %o1
...
The pthread API case is straightforward: there are two calls, each with a load of key_i in its delay slot.
But to really see what's going on, you need to look at pthread_getspecific() itself:
# dis -F pthread_getspecific /usr/lib/libc.so.1
**** DISASSEMBLER ****
disassembly for /usr/lib/libc.so.1
section .text
pthread_getspecific()
pthread_getspecific: 80 a2 20 09 cmp %o0, 0x9
pthread_getspecific+0x4: 1a 80 00 05 bgeu pthread_getspecific+0x18
pthread_getspecific+0x8: 9a 01 e1 00 add %g7, 0x100, %o5
pthread_getspecific+0xc: 97 2a 20 02 sll %o0, 0x2, %o3
pthread_getspecific+0x10: 81 c3 e0 08 retl
pthread_getspecific+0x14: d0 03 40 0b ld [%o5 + %o3], %o0
pthread_getspecific+0x18: c2 01 e0 fc ld [%g7 + 0xfc], %g1
pthread_getspecific+0x1c: 80 a0 60 00 cmp %g1, 0x0
pthread_getspecific+0x20: 02 80 00 08 be pthread_getspecific+0x40
pthread_getspecific+0x24: 01 00 00 00 nop
pthread_getspecific+0x28: d4 00 60 00 ld [%g1], %o2
pthread_getspecific+0x2c: 80 a2 00 0a cmp %o0, %o2
pthread_getspecific+0x30: 1a 80 00 04 bgeu pthread_getspecific+0x40
pthread_getspecific+0x34: 9b 2a 20 02 sll %o0, 0x2, %o5
pthread_getspecific+0x38: 81 c3 e0 08 retl
pthread_getspecific+0x3c: d0 00 40 0d ld [%g1 + %o5], %o0
pthread_getspecific+0x40: 81 c3 e0 08 retl
pthread_getspecific+0x44: 90 10 20 00 clr %o0
#
If you follow the dependency chain, you'll see that
pthread_getspecific+0x18: c2 01 e0 fc ld [%g7 + 0xfc], %g1
pthread_getspecific+0x34: 9b 2a 20 02 sll %o0, 0x2, %o5
pthread_getspecific+0x3c: d0 00 40 0d ld [%g1 + %o5], %o0
are what it takes to get the return value. So, for each access, pthread_getspecific() needs three loads:
one for loading key_i in test(),
one for loading a pointer to the thread local storage area from the thread pointer %g7, and a final load to get the actual value.
(See _pthread_getspecific
for the libc code for pthread_getspecific.)
Although I didn't show pthread_setspecific(), it's essentially similar in the common case: two loads to form the address and
one store to do the actual write.
Now, let's look at __thread case:
# dis -F test tls.out
...
test+0x54: d0 25 a0 00 st %o0, [%l6]
test+0x58: 01 00 00 00 nop
test+0x5c: ea 05 a0 00 ld [%l6], %l5
test+0x60: 80 a5 60 0a cmp %l5, 0xa
test+0x64: 14 4f ff fc bg,pt %icc, test+0x54
test+0x68: 90 25 60 01 sub %l5, 0x1, %o0
...
#
The __thread code looks straightforward on the surface: one load reads the value, one store writes it back.
The question is how %l6 is formed.
Here's the assembly snippet for the code just before the above:
# dis -F test tls.out
**** DISASSEMBLER ****
disassembly for tls.out
section .text
test()
test: 9d e3 bf a0 save %sp, -0x60, %sp
test+0x4: 40 00 00 02 call test+0xc
test+0x8: 9e 10 00 0f mov %o7, %o7
test+0xc: 1b 00 00 40 sethi %hi(0x10000), %o5
test+0x10: 3b 00 00 00 sethi %hi(0x0), %i5
test+0x14: 9a 03 61 5c add %o5, 0x15c, %o5
test+0x18: b8 1f 7f f8 xor %i5, -0x8, %i4
test+0x1c: b6 03 40 0f add %o5, %o7, %i3
test+0x20: 33 00 00 42 sethi %hi(0x10800), %i1
test+0x24: b4 10 00 1c mov %i4, %i2
test+0x28: ac 01 c0 1a add %g7, %i2, %l6
...
#
I'll explain what's going on in this code sequence in more detail in another blog entry,
but let me just say that it shows the dance between the code and the runtime linker that
forms a pointer to the thread local storage. This sequence doesn't need any loads,
and it usually happens only once within a routine (or at least outside the loop).
So, __thread requires only one load (or store) to access the variable.
With all of the above, let's compare the performance. This run was on a 750 MHz UltraSPARC-III system:
# ptime ./specific.out 1 100000000
main() 1 threads 100000000 iterations
thread 2 test count 100000000
thread 2 finished. count 100000000
main() reporting that all 1 threads have terminated

real 5.974
user 5.903
sys 0.015
# ptime ./tls.out 1 100000000
main() 1 threads 100000000 iterations
thread 2 test count 100000000
thread 2 finished. count 100000000
main() reporting that all 1 threads have terminated

real 2.024
user 1.971
sys 0.011
# ptime ./specific.out 2 100000000
main() 2 threads 100000000 iterations
thread 2 test count 100000000
thread 3 test count 100000000
thread 2 finished. count 100000000
thread 3 finished. count 100000000
main() reporting that all 2 threads have terminated

real 6.813
user 11.802
sys 0.028
# ptime ./tls.out 2 100000000
main() 2 threads 100000000 iterations
thread 2 test count 100000000
thread 3 test count 100000000
thread 3 finished. count 100000000
thread 2 finished. count 100000000
main() reporting that all 2 threads have terminated

real 2.150
user 3.931
sys 0.010
#
Well, as expected, it's three loads vs one load, and the performance difference is close to 3x. I tried this on a Niagara box (T2000, 1 GHz, 8 cores):
# ptime ./specific.out 1 100000000
main() 1 threads 100000000 iterations
thread 2 test count 100000000
thread 2 finished. count 100000000
main() reporting that all 1 threads have terminated

real 6.932
user 6.904
sys 0.009
# ptime ./tls.out 1 100000000
main() 1 threads 100000000 iterations
thread 2 test count 100000000
thread 2 finished. count 100000000
main() reporting that all 1 threads have terminated

real 2.637
user 2.602
sys 0.010
# ptime ./specific.out 8 100000000
main() 8 threads 100000000 iterations
thread 2 test count 100000000
thread 3 test count 100000000
thread 4 test count 100000000
thread 5 test count 100000000
thread 6 test count 100000000
thread 7 test count 100000000
thread 8 test count 100000000
thread 9 test count 100000000
thread 2 finished. count 100000000
thread 3 finished. count 100000000
thread 4 finished. count 100000000
thread 6 finished. count 100000000
thread 8 finished. count 100000000
thread 9 finished. count 100000000
thread 7 finished. count 100000000
thread 5 finished. count 100000000
main() reporting that all 8 threads have terminated

real 6.966
user 55.246
sys 0.013
# ptime ./tls.out 8 100000000
main() 8 threads 100000000 iterations
thread 2 test count 100000000
thread 3 test count 100000000
thread 4 test count 100000000
thread 5 test count 100000000
thread 6 test count 100000000
thread 7 test count 100000000
thread 8 test count 100000000
thread 9 test count 100000000
thread 3 finished. count 100000000
thread 9 finished. count 100000000
thread 2 finished. count 100000000
thread 4 finished. count 100000000
thread 5 finished. count 100000000
thread 6 finished. count 100000000
thread 7 finished. count 100000000
thread 8 finished. count 100000000
main() reporting that all 8 threads have terminated

real 2.734
user 21.604
sys 0.011
#
Not so surprisingly, the two show a similar performance difference here. Niagara performs quite well and scales linearly up to 8 threads. (Why no results for 32 threads? That's for the next blog entry.)
BTW, I haven't yet explained what those extra compiler arguments (nop.il -Wc,-Qinline-l) are about.
A variable declared with __thread is still a variable, and hence is treated the same way as any global variable
except for how its address is formed. So, without the nop() call at line 64, the compiler simply optimized the loop away.
Once I inserted the nop() call at line 64, the compiler could no longer optimize the loop away
(since it doesn't know what nop() does), but the call adds overhead that can be quite significant
because the loop itself is so small. So I wrote a little inline template for nop() so that the compiler could inline the call away.
Well, once I did that, the compiler once again optimized away the loop, since it saw that the call doesn't do anything.
Argh. So, to prevent the compiler from understanding the body of nop() while still allowing it to be inlined, I added -Wc,-Qinline-l,
which is a code generator internal option that keeps inline templates from being inlined early in the optimization,
and that in turn prevents the loop optimization.
Anyway, in summary, the thread local variable and the pthread_{get,set}specific() APIs do similar things
but differ in interface, performance and portability. If you can afford to use __thread,
it is usually the better choice in terms of both the amount of code change and the performance.
( Dec 16 2005, 01:04:53 PM PST )

