Darryl Gove's blog
Peak rate of window spill and fill traps
I've been looking at the performance of a code recently. Written in C++ using many threads. One of the things with C++ is that as a language it encourages developers to have lots of small routines. Small routines lead to many calls and returns; and in particular they lead to register window spills and fills. Read more about register windows. Anyway, I wondered what the peak rate of issuing register window spills and fills was?
I'm going to use some old code that I used to examine the cost of library calls a while back. The code uses recursion to reach a deep call depth. First off, I define a couple of library routines:
extern int jump2(int count);
int jump1(int count)
{
count--;
if (count==0)
{
return 1;
}
else
{
return 1+jump2(count);
}
}
and
int jump1(int count);
int jump2(int count)
{
count--;
if (count==0)
{
return 1;
}
else
{
return 1+jump1(count);
}
}
I can then turn these into libraries:
$ cc -O -G -o libjump1.so jump1.c $ cc -O -G -o libjump2.so jump2.c
Done in this way, both libraries have hanging dependencies:
$ ldd -d libjump1.so
symbol not found: jump2 (./libjump1.so)
So the main executable will have to resolve these. The main executable looks like:
#include <stdio.h>
#define RPT 100
#define SIZE 600
extern int jump1(int count);
int main()
{
int index,count,links,tmp;
tmp=jump1(100);
#pragma omp parallel for default(none) private(count,index,tmp,links)
for(links=1; links<10000; links++)
{
for (count=0; count<RPT; count++)
{
for (index=0;index<SIZE;index++)
{
tmp=jump1(links);
if (tmp!=links) {printf("mismatch\n");}
}
}
}
}
This needs to be compiled in such a way as to resolve the unresolved dependencies
$ cc -xopenmp=noopt -o par par.c -L. -R. -ljump1 -ljump2
Note that I'm breaking rules again by making the runtime linker look for the dependent libraries in the current directory rather than use $ORIGIN to locate them relative to the executable. Oops.
I'm also using OpenMP in the code. The directive tells the compiler to make the outer loop run in parallel over multiple processors. I picked 10,000 for the trip count so that the code would run for a bit of time, so I could look at the activity on the system. Also note that the outer loop defines the depth of the call stack, so this code will probably cause stack overflows at some point, if not before. Err...
I'm compiling with -xopenmp=noopt since I want the OpenMP directive to be recognised, but I don't want the compiler to use optimisation, since if the compiler saw the code it would probably eliminate most of it, and that would leave me with nothing much to test.
The first thing to test is whether this generates spill fill traps at all. So we run the application and use trapstat to look at trap activity:
vct name | cpu13 ---------------------------------------- ... 84 spill-user-32 | 3005899 ... c4 fill-user-32 | 3005779 ...
So on this 1.2GHz UltraSPARC T1 system, we're getting 3,000,000 traps/second. The generated code is pretty plain except for the save and restore instructions:
jump1()
21c: 9d e3 bf a0 save %sp, -96, %sp
220: 90 86 3f ff addcc %i0, -1, %o0
224: 12 40 00 04 bne,pn %icc,jump1+0x18 ! 0x234
228: 01 00 00 00 nop
22c: 81 c7 e0 08 ret
230: 91 e8 20 01 restore %g0, 1, %o0
234: 40 00 00 00 call jump2
238: 01 00 00 00 nop
23c: 81 c7 e0 08 ret
240: 91 ea 20 01 restore %o0, 1, %o0
So you can come up with an estimate of 300 ns/trap.
The reason for using OpenMP is to enable us to scale the number of active threads. Rerunning with 32 threads, by setting the environment variable OMP_NUM_THREADS to be 32, we get the following output from trapstat:
vct name | cpu21 cpu22 cpu23 cpu24 cpu25 cpu26 ------------------------+------------------------------------------------------- ... 84 spill-user-32 | 1024589 1028081 1027596 1174373 1029954 1028695 ... c4 fill-user-32 | 996739 989598 955669 1169058 1020349 1021877
So we're getting 1M traps per thread, with 32 threads running. Let's take a look at system activity using vmstat.
vmstat 1 kthr memory page disk faults cpu r b w swap free re mf pi po fr de sr s1 s2 s3 s4 in sy cs us sy id ... 0 0 0 64800040 504168 0 0 0 0 0 0 0 0 0 0 0 3022 427 812 100 0 0 0 0 0 64800040 504168 0 0 0 0 0 0 0 0 0 0 0 3020 428 797 100 0 0 0 0 0 64800040 504168 0 0 0 0 0 0 0 0 0 0 0 2945 457 760 100 0 0 0 0 0 64800040 504168 0 0 0 0 0 0 0 0 0 0 0 3147 429 1025 99 1 0 0 0 0 64800040 504168 0 15 0 0 0 0 0 0 0 0 0 3049 666 820 99 1 0 0 0 0 64800040 504168 0 1 0 0 0 0 0 0 0 0 0 3044 543 866 100 0 0 0 0 0 64800040 504168 0 0 0 0 0 0 0 0 0 0 0 3021 422 798 100 0 0
So there's no system time being recorded - the register spill and fill traps are fast traps, so that's not a surprise.
One final thing to look at is the instruction issue rate. We can use cpustat to do this:
2.009 6 tick 63117611 2.009 12 tick 69622769 2.009 7 tick 62118451 2.009 5 tick 64784126 2.009 0 tick 67341237 2.019 17 tick 62836527
As might be expected from the cost of each trap, and the sparse number of instructions between traps, the issue rate of the instructions is quite low. Each of the four threads on a core is issuing about 65M instructions per second. So the core is issuing about 260M instructions per second - that's about 20% of the peak issue rate for the core.
If this were a real application, what could be done? Well, obviously the trick would be to reduce the number of calls and returns. At a compiler level, that would using flags that enable inlining - so an optimisation level of at least -xO4; adding -xipo to get cross-file inlining; using -g0 in C++ rather than -g (which disables front-end inlining). At a more structural level, perhaps the way the application is broken into libraries might be changed so that routines that are frequently called could be inlined.
The other thing to bear in mind, is that this code was designed to max out the register window spill/fill traps. Most codes will get nowhere near this level of window spills and fills. Most codes will probably max out at about a tenth of this level, so the impact from register window spill fill traps at that point will be substantially reduced.
Posted at 03:31PM Mar 05, 2009 by Darryl Gove in Sun | Comments[2]



Looks like you need to protect the < whatever stuff in the code when you turn it to html.
Posted by Marc on March 05, 2009 at 04:46 PM PST #
Yup. Got most of them first try, but not those in the source code. Thanks for pointing it out.
Darryl.
Posted by Darryl Gove on March 05, 2009 at 09:26 PM PST #