Wednesday Oct 21, 2009

The Sun Studio Performance Analyzer reference manual, updated for Sun Studio 12 update 1, is now available on docs.sun.com:

Developing high performance applications requires a combination of compiler features, libraries of optimized functions, and tools for performance analysis. The Performance Analyzer manual describes the tools that are available to help you assess the performance of your code, identify potential performance problems, and locate the part of the code where the problems occur.

http://docs.sun.com/app/docs/doc/821-0304

Monday Feb 16, 2009

Another useful optimization option available with Sun Studio compilers is profile feedback.

This option can be especially helpful with codes that contain a lot of branching. The compiler is unable to determine from the source code alone which branches in an IF or CASE statement are the most likely to be taken. Using the profile feedback feature, you can run an instrumented version of the code using typical data to collect statistics on code coverage and branching, and then recompile the code using this collected data.

Darryl Gove has a great description of profile feedback in his book Solaris Application Programming.

With profile feedback, the compiler is better able to do certain optimizations that it cannot do by just analyzing the source code:

  • Lay out the compiled code so that branches are rarely taken. The most frequent paths "fall through" to the next memory location, avoiding a fetch and branch to a distant location.
  • Inline routines called many times. This avoids costly function calls.
  • Move infrequently executed code out of the "hot" parts of the code. This improves utilization of the instruction cache.
  • Many more optimizations based on how variables are and are not used along the most likely paths the program will take.

Of course, all these optimizations depend on how typical the data used to collect the profile is. In some cases it might be useful to identify several sets of "typical data", collect a profile for each set, and compile multiple versions of the program, one per profile. Whether that is worthwhile depends on the application.

To use profile feedback, the compilation is in three phases:

  1. Compile with -xprofile=collect to produce an instrumented executable.
  2. Run the instrumented executable with a typical data set to create a performance profile.
  3. Recompile with -xprofile=use and -xO5 to produce the optimized executable.

 % cc -xO3 -xprofile=collect:/tmp/profile myapp.c
 % a.out
 % cc -xO5 -xprofile=use:/tmp/profile -o myapp myapp.c


Read about profile feedback in the compiler man pages: C++, C, Fortran

Wednesday Jan 28, 2009

If you've ever wondered what the compiler is doing when it optimizes your code, you can use the command-line tool, er_src, which is part of the Sun Studio Performance Analyzer, to view the "compiler commentary".

Just compile with some optimization level and -g and then pass the object code to er_src.

>f95 -O3 -g -c fall.f95 ; er_src fall.o
Source file: fall.f95
Object file: fall.o
Load Object: fall.o

     1.         parameter (n=100)
        <Function: MAIN>
     2.         real psi(n,n)
     3.         a = 1E6
     4.         tpi = 2*3.14159265
     5.         di = tpi/float(n)
     6.         dj = di

    Source loop below has tag L1
    Source loop below has tag L2
    L1 could not be pipelined because it contains calls
     7.     forall (j=1:n, i=1:n) psi(i,j)= a*sin((float(i)-.5) * di) * sin((float(j)-.5)*dj)
     8.         print*, psi(50,50)
     9.         end

This is a little test example using a Fortran 95 FORALL loop, compiled at optimization level O3.

Let's try it again, but this time with -fast for full optimization:

>f95 -fast -g -c fall.f95 ; er_src fall.o
Source file: fall.f95
Object file: fall.o
Load Object: fall.o

     1.         parameter (n=100)
        <Function: MAIN>
     2.         real psi(n,n)
     3.         a = 1E6
     4.         tpi = 2*3.14159265
     5.         di = tpi/float(n)
     6.         dj = di

    Source loop below has tag L1
    Source loop below has tag L2
    L1 fissioned into 2 loops, generating: L3, L4
    L1 transformed to use calls to vector intrinsics: __vsinf_
    L4 scheduled with steady-state cycle count = 2
    L4 unrolled 3 times
    L4 has 1 loads, 1 stores, 0 prefetches, 0 FPadds, 1 FPmuls, and 0 FPdivs per iteration
    L4 has 0 int-loads, 0 int-stores, 4 alu-ops, 0 muls, 0 int-divs and 0 shifts per iteration
    L3 scheduled with steady-state cycle count = 4
    L3 unrolled 2 times
    L3 has 0 loads, 1 stores, 0 prefetches, 3 FPadds, 1 FPmuls, and 0 FPdivs per iteration
    L3 has 0 int-loads, 0 int-stores, 3 alu-ops, 0 muls, 0 int-divs and 0 shifts per iteration
     7.     forall (j=1:n, i=1:n) psi(i,j)= a*sin((float(i)-.5) * di) * sin((float(j)-.5)*dj)
     8.         print*, psi(50,50)
     9.         end

A lot more is going on here. Note that the compiler splits ("fissions") the FORALL loop into two loops and then unrolls them. It also uses a vector version of the sin() function to process a batch of arguments in a single call.

While the compiler commentary can get somewhat cryptic, you can get a feel for the kinds of optimizations the compiler is performing on your code.

It's also useful when using the auto parallelization options. We'll have more to say about that. But it's worth using er_src to get an idea about what the compiler can and cannot do. And don't forget to also compile with -g.

Saturday Jan 10, 2009

Let's think about this a little bit more.

If I add an optimization option, like -xO3 or -fast, to my compile command-line, what does that actually mean?

Well, it means that everything in that compilation unit (source files) will be compiled with a certain set of optimization strategies. The compiler will try to produce the best code it can at that level. But ambiguities in the source code might inhibit some optimizations because the compiler has to make sure that the machine code it generates will always do the right thing .. that is, do what the programmer expects it to do.

Note that all the routines, functions, modules, procedures, classes, compiled in that compilation unit will be compiled with the same options. In some cases the extra time spent by the compiler might be wasted on some routines because they are rarely called and do not really participate in the compute-intensive parts of the program.

For short programs, this hardly matters .. compile time is short, and you might only compile infrequently.

But this can become an issue with "industrial-strength" codes consisting of thousands of lines and hundreds of program units (routines, functions, etc.). Compile time might become a major concern, so we probably would want to apply the highest optimization levels only to those routines that factor into the overall performance of the complete program.

That means you really need to know where your program is spending most of its CPU time, and focus your performance optimization efforts primarily on those program units. This goes for any kind of performance optimization .. you do need to know and understand the flow of the program -- its footprint.

The Sun Studio Performance Analyzer is the tool to do that. While it does provide extensive features for gathering every piece of information about your program's execution, it also has a simple command-line interface that you can use immediately to find out where the program is spending most of its time.

Compile your code with the -g option (to produce a symbol table) and run the executable under the collect command.

>f95 -g -fixed -o shal shalow.f90

>collect shal

Creating experiment database test.1.er ...

1NUMBER OF POINTS IN THE X DIRECTION     256

 NUMBER OF POINTS IN THE Y DIRECTION     256

....

Running under the collect command generates runtime execution data in test.1.er/ that can be used by the er_print command of the Performance Analyzer:

>er_print -functions test.1.er
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name  
User CPU  User CPU         
  sec.      sec.      
18.113    18.113     <Total>
 6.805     6.805     calc1_
 6.384     6.384     calc2_
 4.893     4.893     calc3_
 0.020     0.020     inital_
 0.010     0.010     calc3z_
 0.        0.        cosf
 0.        0.        cputim_
 0.        0.        etime_
 0.        0.        getrusage
 0.       18.113     main
 0.       18.113     MAIN
 0.        0.        __rusagesys
 0.        0.        sinf
 0.       18.113     _start


The er_print -functions command gives us a quick way of seeing timings for all routines (this was a Fortran 95 program), including library routines. Right away I know that calc1, calc2, and calc3 do all the work, as expected. But we also see that calc3 is not as significant as calc1. ("Inclusive Time" includes time spent in the routines called by that routine, while "Exclusive Time" only counts time spent in the routine, exclusive of any calls to other routines.)

Well, this is a start. Note that no optimization was specified here. Let's see what happens with -fast.

>f95 -o shalfast -fast -fixed -g shalow.f90
>collect shalfast
Creating experiment database test.3.er ...
....
>er_print -functions test.3.er
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
7.695     7.695      <Total>
7.675     7.695      MAIN
0.020     0.020      __rusagesys
0.        0.020      etime_
0.        0.020      getrusage



Yikes! What happened?

Clearly, with -fast the compiler compressed the program as much as it could, replacing the calls to the calc routines by compiling them inline into one hunk of code. Note also the better than 2x improvement in performance: total CPU time dropped from 18.113 to 7.695 seconds.

Of course, this was a little toy test program. Things would look a lot more complicated with a large "industrial" program.

But you get the idea.

More information on er_print and collect.

This blog copyright 2009 by rchrd