Let's think about this a little bit more.
If I add an optimization option, like -xO3 or -fast, to my compile command-line, what does that actually mean?
Well, it means that everything in that compilation unit (the source files) will be compiled with a certain set of optimization strategies. The compiler will try to produce the best code it can at that level. But ambiguities in the source code might inhibit some optimizations, because the compiler has to make sure that the machine code it generates will always do the right thing, that is, do what the programmer expects it to do.
Note that all the routines, functions, modules, procedures, and classes in that compilation unit will be compiled with the same options. In some cases the extra time the compiler spends optimizing is wasted on routines that are rarely called and do not really participate in the compute-intensive parts of the program.
For short programs, this hardly matters: compile time is short, and you might only compile infrequently.
But this can become an issue with "industrial-strength" codes consisting of many thousands of lines and hundreds of program units (routines, functions, etc.). Compile time might become a major concern, so we would probably want to apply the expensive optimizations only to those routines that factor into the overall performance of the complete program.
That means you really need to know where your program is spending most of its CPU time, and focus your performance optimization efforts primarily on those program units. This goes for any kind of performance optimization: you do need to know and understand the flow of the program, its footprint.
The Sun Studio Performance Analyzer is the tool to do that. While it does provide extensive features for gathering every piece of information about your program's execution, it also has a simple command-line interface that you can use immediately to find out where the program is spending most of its time.
Compile your code with the -g option (to produce a symbol table) and run the executable under the collect command.
>f95 -g -fixed -o shal shalow.f90
>collect shal
Creating experiment database test.1.er ...
1NUMBER OF POINTS IN THE X DIRECTION 256
NUMBER OF POINTS IN THE Y DIRECTION 256
....
Running under the collect command generates runtime execution data in test.1.er/ that can be used by the er_print command of the Performance Analyzer:
>er_print -functions test.1.er
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
18.113    18.113     <Total>
 6.805     6.805     calc1_
 6.384     6.384     calc2_
 4.893     4.893     calc3_
 0.020     0.020     inital_
 0.010     0.010     calc3z_
 0.        0.        cosf
 0.        0.        cputim_
 0.        0.        etime_
 0.        0.        getrusage
 0.       18.113     main
 0.       18.113     MAIN
 0.        0.        __rusagesys
 0.        0.        sinf
 0.       18.113     _start
The er_print -functions command gives us a quick way of seeing timings for all routines (this was a Fortran 95 program), including library routines. Right away I know that calc1, calc2, and calc3 do all the work, as expected. But we also see that calc3 is not as significant as calc1. ("Inclusive Time" includes time spent in the routines called by that routine, while "Exclusive Time" only counts time spent in the routine, exclusive of any calls to other routines.)
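To make "where the time goes" concrete, here is a minimal sketch that turns a few rows of the listing above into percentage shares of exclusive CPU time. The numbers are copied from the er_print output; the helper names (parse_rows, hot_routines) and the 5% cutoff are my own assumptions for illustration, not part of the Performance Analyzer.

```python
# Sketch: rank routines by their share of exclusive user CPU time.
# Rows are copied from the er_print listing above ("excl incl name").
# parse_rows, hot_routines and the 5% cutoff are illustrative assumptions.

LISTING = """\
18.113 18.113 <Total>
6.805 6.805 calc1_
6.384 6.384 calc2_
4.893 4.893 calc3_
0.020 0.020 inital_
0.010 0.010 calc3z_
0. 18.113 main
0. 18.113 MAIN
"""


def parse_rows(text):
    """Return (exclusive, inclusive, name) tuples, one per 3-column line."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not "excl incl name"
        rows.append((float(parts[0]), float(parts[1]), parts[2]))
    return rows


def hot_routines(rows, cutoff=0.05):
    """Return (name, share) pairs whose exclusive share is at least cutoff."""
    total = next(excl for excl, _incl, name in rows if name == "<Total>")
    shares = [(name, excl / total)
              for excl, _incl, name in rows if name != "<Total>"]
    shares.sort(key=lambda item: item[1], reverse=True)
    return [(name, share) for name, share in shares if share >= cutoff]


if __name__ == "__main__":
    for name, share in hot_routines(parse_rows(LISTING)):
        print(f"{name:10s} {share:6.1%}")
```

With these rows, only calc1_, calc2_, and calc3_ pass the cutoff, and together they account for more than 99% of the exclusive time. Note that main and MAIN drop out even though their inclusive time is the full 18.113 seconds, because their exclusive time is zero.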
Well, this is a start. Note that no optimization was specified here. Let's see what happens with -fast.
>f95 -o shalfast -fast -fixed -g shalow.f90
>collect shalfast
Creating experiment database test.3.er ...
....
>er_print -functions test.3.er
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
7.695     7.695      <Total>
7.675     7.695      MAIN
0.020     0.020      __rusagesys
0.        0.020      etime_
0.        0.020      getrusage
Yikes! What happened?
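Clearly, with -fast the compiler compressed the program as much as it could, inlining the calc routines into the main program so that they became one hunk of code. That is why calc1, calc2, and calc3 no longer show up as separate entries and nearly all the time is charged to MAIN. Note also the better than 2x improvement in performance: total user CPU time dropped from about 18.1 seconds to about 7.7 seconds.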
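The size of the improvement can be checked directly from the two <Total> rows in the listings above. This is just the arithmetic as a small sketch; the variable names are mine.

```python
# Sketch: speedup from the two <Total> rows reported by er_print above.
unopt_total = 18.113  # user CPU sec., compiled with -g only (test.1.er)
fast_total = 7.695    # user CPU sec., compiled with -fast (test.3.er)

speedup = unopt_total / fast_total        # how many times faster
reduction = 1.0 - fast_total / unopt_total  # fraction of CPU time saved

if __name__ == "__main__":
    print(f"speedup:   {speedup:.2f}x")
    print(f"reduction: {reduction:.1%}")
```

So the -fast build runs about 2.35 times faster than the unoptimized one on this test, at least as measured by user CPU time in this single pair of runs.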
Of course, this was a little toy test program. Things would look a lot more complicated with a large "industrial" program.
But you get the idea.
More information on er_print and collect.