Here's a list of the current SDN articles dealing with tuning and optimization of applications on Solaris using Sun Studio compilers:
Here are examples of using a compiler flag or inline assembly language with Sun
Studio compilers to increase the performance of C, C++, and Fortran programs.
(June 4, 2007)
This article describes how to profile an IBM WebSphere Application Server
(WAS) runtime environment with the Sun Studio Performance Analysis Tools,
Collector and Analyzer.
(January 30, 2007)
The SHADE library is an emulator for SPARC hardware. The particular
advantage of using SHADE is that it is possible to write an analysis tool which
gathers information from the application being emulated. The SHADE library comes
with some example analysis tools that track things like the number of
instructions executed or how often each type of instruction is
executed. A more advanced analysis tool might look at cache misses that the
application encounters for a given cache structure.
(September 29, 2006)
Profile feedback is an optimisation technique that uses a short training run of the
application to provide the compiler with more detailed information about the runtime
behaviour of the program. This information enables the compiler to make better
optimisation decisions, for example which routines are appropriate to inline, or which
branches are frequently taken. This paper presents two ways of viewing the
correspondence between the behaviour of the training and reference workloads. The
methods presented here are necessary conditions for the training workload to be
representative of the reference workload.
(September 29, 2006)
How to use the Sun Studio Performance Analyzer to profile Java applications.
(August 25, 2006)
This article describes how to use the Sun Studio Performance Tools to
profile servers being run under BEA's WebLogic system. A server running
under BEA's WebLogic is a Java application that you launch by running a
script to invoke the JVM. To profile a server, prepend the JVM command
with a collect command to invoke the Sun Studio Collector. The article details how this is done.
(August 25, 2006)
Performance is a function of both hardware and software. To extract the
maximum performance from the new AMD64-based systems on your critical
C/C++ and Fortran applications, choose the best compilers. Then use
compiler options to take advantage of the Opteron system features to
maximize performance. This article will show you how.
(May 23, 2006)
Large, CPU-intensive applications may perform better when built
with profile feedback. Profile feedback optimization requires the
application to be built twice, once to collect the profile data, and
again to make use of the profile to generate optimal code. This
requirement may keep some software vendors from building their
applications with profile feedback. However, it is possible to use old
profiles to minimize the overhead of profile feedback builds in a
development environment. This article introduces all the stages of
profile feedback with examples, and offers some tips for making profile
feedback builds.
(April 11, 2006)
An application's performance depends on a combination of hardware
and software factors. For example, what events must the hardware deal
with, and what degree of optimisation was applied when compiling the
application? There are a number of tools that can be used to
collect this kind of information, but knowing which
tools to pick for a given application can be tricky.
This paper introduces a new tool that aims to simplify the process of
performance analysis. We call it the Simple Performance Optimisation
Tool, or 'SPOT'. SPOT is an add-on package to Sun Studio 11, and it is
only available for UltraSPARC-based systems. SPOT has been released as
part of the Cool Tools project.
(March 7, 2006)
The VIS instruction set includes a number of instructions that can
be used to handle several items of data at the same time. These are
called SIMD (Single Instruction Multiple Data) instructions. The VIS
instructions work on data held in floating point registers. The
advantage of using VIS instructions is that an operation can be applied
to different items of data in parallel, meaning that it takes the same
time to compute eight 1-byte results as it does to compute one 8-byte
result. In theory this means that code that uses VIS instructions can
be many times faster than code without them.
(January 5, 2006)
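To make the "eight 1-byte results in the time of one 8-byte result" idea from the VIS entry above concrete, here is a minimal sketch in portable C. It does not use VIS instructions or any Sun Studio intrinsic. It only imitates the SIMD idea with ordinary 64-bit integer arithmetic, and the function names add_bytes_x8 and add_bytes_scalar are made up for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper: add eight unsigned bytes packed in a and b,
 * lane by lane, each lane wrapping modulo 256, in one pass. */
static uint64_t add_bytes_x8(uint64_t a, uint64_t b)
{
    const uint64_t H = 0x8080808080808080ULL; /* top bit of every byte lane */

    /* Add the low 7 bits of each lane; the sum fits in 8 bits, so no
     * carry can leak into the neighbouring lane. */
    uint64_t low = (a & ~H) + (b & ~H);

    /* Fold the top bits back in with XOR (a carry-free add of bit 7). */
    return low ^ ((a ^ b) & H);
}

/* Reference version: the same result computed one byte at a time. */
static uint64_t add_bytes_scalar(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t x = (uint8_t)(a >> (8 * i));
        uint8_t y = (uint8_t)(b >> (8 * i));
        r |= (uint64_t)(uint8_t)(x + y) << (8 * i);
    }
    return r;
}
```

Both functions return the same value for any pair of inputs; the first does the work in a handful of whole-register operations, while the second loops eight times. Real VIS code would apply a single instruction to data held in a floating-point register instead of the masking trick used here.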
The Binary Optimizer is a static SPARC optimizer that accepts as
input a binary and creates an optimized binary as the output. We define
a binary as either an executable or a shared object. The availability
of the original source code is not a prerequisite for using this tool.
It can optimize binaries irrespective of the source language used (C,
C++ or Fortran). It can also optimize mixed source language binaries.
(December 1, 2005)
The Sun Studio performance tools are designed to help answer
questions about application performance. This article discusses the
kinds of performance questions that users typically ask. It describes
the model for using the tools, and for building the target executable,
as well as the data collection process, and the data that can be
collected. The Analyzer and its displays are also described, along with
a number of examples of what it can do.
(November 10, 2005)
Users wanting the best performance from CPU-intensive codes may
wish to explore the use of additional libraries and advanced compiler
options that control individual compiler components.
(Revised March 23, 2006)
Profile feedback is a useful mechanism for providing the compiler
with information about how a code behaves at runtime. Having this
information can lead to significant improvements in the performance of
the application. As with all optimisations, it is only worth using
profile feedback if it does produce a gain in performance.
(September 7, 2005)
Large applications have a particular problem: they have a lot of
instructions, and the processor does not have the capacity to hold the
entire application on-chip at any one time. As a consequence, larger
applications spend some of their run time stalled with the processor
waiting to fetch new instructions from memory. This paper discusses
several techniques that help the processor to hold more useful
instructions on-chip, consequently reducing the time wasted fetching
them from memory.
(July 12, 2005)
How do you get the best performance from an UltraSPARC or x86/AMD64
(x64) processor running on the latest Solaris systems? Compile with
the best set of compiler options and the latest compilers. Here are
suggestions of things you should try, but before you release the final
version of your program, you should understand exactly what you have
asked the compiler to do.
(June 24, 2005)
Inline templates are a mechanism for directly inserting assembly
code into an executable. Typically, this approach is used to obtain the
best performance for a given function, or to implement an algorithm in
a specific way.
(July 23, 2003)
This article introduces you to the UltraSPARC-IIICu performance
counters, and demonstrates how you might use the Sun ONE Studio
Performance Tools to identify where in your application the counted events
are happening and how you can use this information to improve the
performance of your application.
(July 23, 2003)
A discussion of dataflow parallelism with the Fast Application Scalability Tool.
(February 27, 2003)
A case study in program optimization.
(January 13, 2003)
Compiling for the UltraSPARC(R) IIICu Processor
Techniques for improving both compile-time and run-time performance of your C++ programs.
(Revised March 14, 2006)
Posted by Marc on June 25, 2007 at 06:59 AM PDT #
Yes, profile feedback and OpenMP can live together. And yes, the granularity of the parallelization and number of threads could have an effect on the results collected by -xprofile=collect.
But this requires a more definitive answer, such as what kind of data -xprofile=collect actually collects when there are multiple threads running over multiple processors.
Stay tuned. I'll get that info and have a more definitive answer for you.
Posted by rchrd on June 26, 2007 at 11:07 PM PDT #
-xprofile=collect doesn't collect timing data. It only counts the number of times each block of code is executed, and for each conditional or indirect branch instruction, the number of times each outcome of the branch was taken.
The information collected under -xprofile=collect should be the same whether the instrumented code is executed by a single thread or by multiple threads. If -xopenmp is specified with -xprofile=collect, the compiler instruments the code using a private array of execution counters for each thread. The counters are accumulated when the thread exits or the program is unloaded, whichever comes first.
So a single run should be sufficient. Changes in the number of threads should have little or no effect.
Sounds like we need to make this clearer in the docs.
Posted by rchrd on June 27, 2007 at 09:41 AM PDT #
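The reply above describes private per-thread execution counters that are accumulated when each thread exits. The following is only a sequential model of that idea in C, not the compiler's actual instrumentation: the "workers" run one after another rather than as real threads, and every name here (NBLOCKS, private_counters, run_range, profile) is hypothetical. It shows why the accumulated block counts do not depend on how many workers share the loop.

```c
#define NBLOCKS 2 /* hypothetical: block 0 = even path, block 1 = odd path */

/* One private counter array per worker, modelling one per thread. */
typedef struct {
    unsigned long count[NBLOCKS];
} private_counters;

/* "Instrumented" loop body: bump the private counter of the block that runs. */
static void run_range(private_counters *pc, int begin, int end)
{
    for (int i = begin; i < end; i++) {
        if (i % 2 == 0)
            pc->count[0]++;
        else
            pc->count[1]++;
    }
}

/* Split iterations [0, n) across nworkers, then accumulate each worker's
 * private counters into total[] when that worker finishes. */
static void profile(int n, int nworkers, unsigned long total[NBLOCKS])
{
    for (int b = 0; b < NBLOCKS; b++)
        total[b] = 0;

    for (int w = 0; w < nworkers; w++) {
        private_counters pc = {{0}};
        int begin = (int)((long)n * w / nworkers);
        int end = (int)((long)n * (w + 1) / nworkers);

        run_range(&pc, begin, end);

        for (int b = 0; b < NBLOCKS; b++)
            total[b] += pc.count[b];
    }
}
```

Calling profile(1000, 1, t1) and profile(1000, 4, t4) fills both arrays with the same counts, because only execution counts are recorded and no timing is involved.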
I keep forgetting that collect does not collect any timings. I guess better optimizations are only for programs run in a virtual machine. It must be hard to optimize OpenMP loops that use library functions (not available through ipo) without timing.
Anyway, if a single mono-threaded run is good enough for collect, this is very easy to use.
Posted by Marc on June 29, 2007 at 08:19 AM PDT #