Monday September 08, 2008
Interesting Read: Parallel Programming Made Easy?
Michael Wolfe, a senior compiler engineer/architect at The Portland Group (PGI Compilers), recently wrote an article featured in HPCWire questioning how "easy" one can claim to make parallel programming.
Some interesting thoughts and views from one of the leading compiler architects, whose compiler has focused on high performance computing for decades(?) now.
Whether you agree or not depends on your own view of how this important style of programming should be shaped, but he does represent a very important part of the discussion underway today.
Definitely worth a read!
Posted by tatkar
( Sep 08 2008, 09:11:54 AM PDT )
Sun Studio wins Infoworld award in Application Performance category
Sun Studio came out on top in the application performance category of a new IDE survey by Infoworld (here).
NetBeans (on which the Sun Studio IDE is based) is rated lower than Sun Studio, which is a surprise. Also surprising is the absence of Eclipse from the list, which the article explains in a very unsatisfactory way, IMO.
I'm just glad it is being noticed that we, the Sun Studio compilers, produce good overall application performance. Someone saying IT'S BEST really helps.
FYI, IBM's Rational IDE came out on top overall. I'm sure the survey is not without bias, as the comments it has attracted seem to indicate.
Posted by tatkar
( May 22 2008, 10:28:54 AM PDT )
Technical articles on performance tuning
I often get asked which compiler options work best for x86 or SPARC, or even how they differ between Intel and AMD or among the various SPARC architectures.
Here is a reference list of articles on compiler optimization and performance tuning options that is nice to have handy. Getting good performance out of applications is both important and often a little tricky, so this should help.
Of course, the generic advice is that -xtarget=generic and the baseline options are constantly being tuned to be the best average-case options. They should suffice in most cases to get the most juice across the broadest set of machines. But there will always be those who need to go the extra mile and need to know how they can get the most.
Selecting the Best Compiler Options
How do you get the best performance from an UltraSPARC or x86/AMD64 (x64) processor running on the latest Solaris systems by compiling with the best set of compiler options and the latest compilers? Here are suggestions of things you should try, but before you release the final version of your program, you should understand exactly what you have asked the compiler to do.
Advanced Compiler Options for Performance
Users wanting the best performance from CPU-intensive codes may wish to explore the use of additional libraries and advanced compiler options that control individual compiler components.
Getting the Best AMD64 Performance With Sun Studio Compilers
Performance is a factor of both hardware and software. To extract the maximum performance from the new AMD-64 based systems on your critical C/C++ and Fortran applications, choose the best compilers. Then use compiler options to take advantage of the Opteron system features to maximize performance.
How I Got 15x Improvement Without Really Trying
A case study in program optimization.
Using Inline Templates to Improve Application Performance
Inline templates are a mechanism for directly inserting assembly code into an executable. Typically, this approach is used to obtain the best performance for a given function, or to implement an algorithm in a specific way.
Performance Tuning With Sun Studio Compilers and Inline Assembly Language
Here are examples of using a compiler flag or inline assembly language with Sun Studio compilers to increase the performance of C, C++, and Fortran programs.
Prefetching Pragmas and Intrinsics
Explicit data prefetching pragmas and intrinsics for the x86 platform and additional pragmas and intrinsics for the SPARC platform are now available in Sun Studio 12 compilers. Prefetch instructions can increase the speed of an application substantially by bringing data into cache so that it is available when the processor needs it. This benefits performance because today's processors are so fast that it is difficult to bring data into them quickly enough to keep them busy, even with hardware prefetching and multiple levels of data cache.
Using F95 Interfaces to Customize Access to the Sun Performance Library
When porting Fortran source, the Fortran 95 generic interface can be used to allow the source code to remain virtually unchanged and yet facilitate the use of the ILP-32, LP-64, and ILP-64 programming models.
Using VIS Instructions to Speed Up Key Routines
The VIS instruction set includes a number of instructions that can be used to handle several items of data at the same time. These are called SIMD (Single Instruction Multiple Data) instructions. The VIS instructions work on data held in floating point registers. The advantage of using VIS instructions is that an operation can be applied to different items of data in parallel, meaning that it takes the same time to compute eight 1-byte results as it does to calculate one 8-byte result. In theory this means that code that uses VIS instructions can be many times faster than code without them.
The Sun Studio Binary Optimizer
The Binary Optimizer is a static SPARC optimizer that accepts as input a binary and creates an optimized binary as the output. We define a binary as either an executable or a shared object. The availability of the original source code is not a pre-requisite for using this tool. It can optimize binaries irrespective of the source language used (C, C++ or FORTRAN). It can also optimize mixed source language binaries.
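To make the prefetching entry in the list above a little more concrete, here is a minimal sketch of explicit software prefetching in C. It is not taken from the article: it uses the GCC/Clang builtin `__builtin_prefetch` as a stand-in (the Sun Studio pragmas and intrinsics have their own names, which the article documents), and the function name `sum_with_prefetch` and the distance `PREFETCH_DISTANCE` are assumptions chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 16

/* Sum an array while hinting that a later element will be read soon.
   __builtin_prefetch is a GCC/Clang builtin used here as a stand-in;
   on other compilers the hint is simply compiled out. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(a != NULL || n == 0);
    for (i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* Only form addresses that lie inside the array. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
#endif
        sum += a[i];
    }
    return sum;
}
```

A plain linear walk like this is usually handled well by hardware prefetchers, so an explicit hint is more likely to pay off on irregular access patterns. Measure before keeping it.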
Posted by tatkar
( Jan 21 2008, 05:47:42 PM PST )
New OpenMP book published
It's nice to see a new book on OpenMP published by some of the experts in this area.
See here for details.
One of the authors, Ruud van der Pas, works very closely with my compiler teams and with customers, so this book is sure to have practical and down-to-earth suggestions. I haven't read it yet myself.
PS. For those not quite in the know, here's a simple description of OpenMP:
OpenMP is a set of APIs for SMP programming in C, C++ and Fortran. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.
Jointly defined by a group of major computer vendors, OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.
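For a feel of those three pieces (directives, library routines, environment variables), here is a small sketch in C. The function names `max_worker_threads` and `dot_product` are made up for this illustration; `omp_get_max_threads`, the `parallel for` directive with a `reduction` clause, and the `_OPENMP` macro are standard OpenMP.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>   /* OpenMP library routines */
#endif

/* Library routine: how many threads a parallel region may use.
   The OMP_NUM_THREADS environment variable influences this value. */
int max_worker_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;   /* built without OpenMP: serial execution */
#endif
}

/* Compiler directive: split the loop across threads and combine the
   per-thread partial sums with a reduction. */
double dot_product(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    long i;

    assert((a != NULL && b != NULL) || n == 0);
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Built without OpenMP support the directive is ignored and the code runs serially with the same result, which is a large part of the model's appeal. With the Sun Studio compilers the directives are enabled with -xopenmp.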
Posted by tatkar
( Oct 01 2007, 02:49:34 AM PDT )
Sun Studio 12 patch performance improvements quantified for Core2Duo
My previous blog pointed at the first available Sun Studio patch. It mentioned that there were performance improvements for the Core2Duo architecture beyond what was in the released product.
This one quantifies the improvements we're seeing on SPECfp2000 (52%) and SPECfp2006 (34%). There are small improvements on SPECint2006 (0-3%) and SPECint2000 (1-4%) as well, but they are not as noteworthy. The system used here is a whitebox, 2 CPU, 4core 2.66GHz x5355 based system with 4GB memory. This run was purely a comparative run and not done for a SPEC submission, so these aren't official SPEC numbers (they are not numbers, anyway) but SPEC estimates, in that regard.
| Benchmark: SPECfp 2000 | %Change (over Studio 12 FCS) |
| wupwise | 30.3 |
| swim | 83.23 |
| mgrid | 39.10 |
| applu | 80.92 |
| mesa | 3.52 |
| galgel | 172.86 |
| art | 130.83 |
| equake | 14.05 |
| facerec | 2.44 |
| ammp | 65.74 |
| lucas | 13.86 |
| fma3d | 36.98 |
| sixtrack | 96.94 |
| apsi | 50.50 |
| Overall Geo Mean | 51.99 |
| Benchmark: SPECfp 2006 | %Change (over Studio 12 FCS) |
| 410.bwaves | 10.05 |
| 416.gamess | 50.26 |
| 433.milc | 12.99 |
| 434.zeusmp | 50.34 |
| 435.gromacs | 21.24 |
| 436.cactusADM | 57.82 |
| 437.leslie3d | 32.10 |
| 444.namd | 62.20 |
| 447.dealII | 9.00 |
| 450.soplex | 4.58 |
| 453.povray | 13.38 |
| 454.calculix | 166.38 |
| 459.GemsFDTD | 24.79 |
| 465.tonto | 41.75 |
| 470.lbm | 12.27 |
| 481.wrf | 3.00 |
| 482.sphinx3 | 18.75 |
| Overall Geo Mean | 34.55 |
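A note on the "Overall Geo Mean" rows: they are presumably geometric means of the per-benchmark speedup ratios, converted back to a percentage. That reading is an assumption, not something stated with the results. Under it, with p_i the percentage change of benchmark i and n the number of benchmarks:

```latex
G = \left( \prod_{i=1}^{n} \left( 1 + \frac{p_i}{100} \right) \right)^{1/n},
\qquad
\text{overall change} = 100\,(G - 1)\ \%
```

For instance, two benchmarks improving by 0% and 300% give G = (1 x 4)^(1/2) = 2, an overall change of 100% rather than the arithmetic average of 150%. This is why one very large gain (galgel or 454.calculix here) moves the overall figure less than it would move a plain average.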
Now you know why I recommended that, if you're using Sun Studio 12 on Woodcrest or Clovertown (Core2Duo) systems, you MUST get the new patch.
Posted by tatkar
( Aug 21 2007, 01:49:23 PM PDT )
SunStudio 12 Compiler establishes World Record on Woodcrest chip!
Imagine That!
However, here it is. The latest submitted results for our Constellation Blade Server, now called the Sun Blade Server 6000 system, make it official.
The Dual-Core Intel Xeon 5160 Intel Blade Module of this server delivers World Record Performance on SPECint2006 of 21.0, which is higher than any announced benchmark for either the Core2Duo or Opteron chips. These results beat even the Intel Compiler results for an Intel chip!
New Sun systems with SunStudio compilers and Solaris 10 on Intel chips are leading the way with performance. If ever vindication was needed of what Intel chips are capable of bringing to Sun systems, and of what Solaris is bringing to Intel to help expand the x86 marketplace, this is it!
Sun also announced World Record Performance on the SPEC OMPM2001 benchmark for the dual-socket, dual-core AMD Opteron Model 2222SE based Opteron Blade Module of this server, with a score of 13847 for 4 threads.
Required Disclosure:
SPEC, SPECint and SPEComp are registered trademarks of Standard Performance Evaluation Corporation. Results from www.spec.org as of 05/25/07. Sun's results were submitted for review.
Sun Blade 6250 (2xDual-core , 4 cores, 2 chips, 2 cores/chip, Solaris 10): SunStudio12 SPECint2006 - 21.0
Sun Blade 6200 (2xDual-Core AMD Opteron Model 2222SE processors, 4 cores, 2 chips, 2 cores/chip, Solaris 10): SunStudio11 SPECompM2001 - 13847
Posted by tatkar
( Jun 11 2007, 02:58:27 PM PDT )
Sun announces new Blade Servers with World Record Performance
Sun today announced the fastest, and the industry's only, 4-socket dual-core 2.8GHz AMD Opteron 8000 series processor based blade server (announcement is here).
And it packs an enormous performance punch! Here are the latest World Record Benchmarks for SPEC CPU2006 rate and SPEC OMP2001, using Sun Studio 11 and Solaris 10 with these blades:
Required Disclosure Statements:
SPEC, SPEComp, SPECfp and SPECfp Rate are Registered Trademarks of Standard Performance Evaluation Corporation. Sun's results were submitted for review. For SPEC comparisons, socket equates to chip.
Competitive results from www.spec.org as of Jan 05, 2007.
Sun Studio powers new Opteron Workstation to record SPEC INTrate and SPEC FP
Sun recently introduced the so-called AM2 variant (the next generation of AMD processors) of the Sun Ultra 40 Workstation. For details, see here.
With this introduction, Sun also announced new World Records with this machine. I am particularly happy with this one (words taken directly from the product page):
The Sun Ultra 40 M2 workstation, with two Dual-Core AMD Opteron model
2220SE processors, has reached a new milestone on the SPECint_rate2006
suite of the SPEC CPU2006 benchmark, by utilizing the most advanced
features of Sun Studio 11 software and Solaris 10 OS.
Leading the x86 segment and surpassing competing workstations, the
next-generation Sun Ultra 40 M2 workstation produced a SPECint_rate2006
result of 48.8.
For a while now, Woodcrest had retaken the SPEC INT lead held previously by AMD's Opteron chips. The performance of Woodcrest on SPEC CPU2000 has been particularly spectacular.
So it is particularly pleasing to see that with CPU2006 INTrate, Sun has been able to reclaim the World Record here for the dual-core AMD Opteron model 2220SE processors. The rate measure is particularly important as we move into the dual- and quad-core world for the x86/x64 architecture machines.
In addition, the Sun Ultra 40 Workstation continues to claim World Record performance for SPEC CPU 2000 FP, with best numbers (Peak) of 3545 and a 4-core FPrate of 121. This handily beats the Woodcrest-based numbers, by more than 40%.
The following tables show these comparisons:
SPEC CPU2006 INTrate (ratios, higher is better)
| System | Description | #Threads | INTrate |
| Sun Ultra 40 M2 | AMD 2220SE (dual-core 2.8GHz), 2 CPU | 4 | 48.4 |
| SuperMicro Woodcrest | Intel 5160 (dual-core 3.0GHz), 2 CPU | 4 | 45.2 |
| Dell Precision 380 | Intel 3.73 GHz, Pentium Extreme Edition 965 | 4 | 23.1 |
SPEC CPU2000 FP (ratios, higher is better)
| System | Description | #Threads | FPrate |
| Sun Ultra 40 M2 | AMD 2220SE (dual-core 2.8GHz), 2 CPU | 2 | 121 |
| Dell Precision 690 | Xeon 5160, 4 cores, 2 chips, RHEL 4AS U3 | 2 | 81.3 |
Required Disclosure Statements:
SPEC, SPECfp and SPECfp Rate are Registered Trademarks of Standard Performance Evaluation Corporation. Sun's results were submitted for review. For SPEC comparisons, socket equates to chip.
Competitive results from www.spec.org as of Nov 17, 2006.
Performance Comparison: Sun Studio vs GCC on STREAM Benchmark
I have previously described the STREAM Benchmark and the results we were seeing with its OpenMP version and what we got by turning on Automatic Parallelization in the compiler.
Here I'd like to put out comparative results with the GCC compiler:
| Function | Sun Studio 11 (MB/s) | GCC 4.1 (MB/s) |
| Copy | 4658 | 2766 |
| Scale | 4614 | 2745 |
| Add | 4628 | 2970 |
| Triad | 4627 | 2969 |

| Function | Sun Studio 11 (MB/s) | Sun Studio 11 (MB/s), 4proc Autopar |
| Copy | 4658 | 18120 |
| Scale | 4614 | 18108 |
| Add | 4628 | 17758 |
| Triad | 4627 | 17626 |
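One way to read the serial and 4proc Autopar columns together is as a speedup and a parallel efficiency. The small C sketch below shows the arithmetic; the helper names `speedup` and `efficiency` are hypothetical and not part of STREAM or of any Sun Studio tool.

```c
#include <assert.h>

/* Speedup of a parallel run relative to the serial run (same units). */
double speedup(double serial_rate, double parallel_rate)
{
    assert(serial_rate > 0.0);
    return parallel_rate / serial_rate;
}

/* Parallel efficiency: the fraction of ideal linear scaling achieved. */
double efficiency(double serial_rate, double parallel_rate, int workers)
{
    assert(workers > 0);
    return speedup(serial_rate, parallel_rate) / (double)workers;
}
```

For the Copy row, 4658 MB/s on one processor and 18120 MB/s on four gives a speedup of about 3.89, or roughly 97% of linear scaling.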
Performance Comparison: SunStudio vs. GCC on BYTE benchmark (Nbench)
I am often asked about performance differences between Sun Studio Compilers and GCC. With performance, a single answer never works across the board, so I am attempting to put out as much comparative information as I can, to show some of the differences (and hopefully advantages over GCC).
I have announced previous Sun Studio based SPEC World Record numbers in postings here (like this World Record SPECfp number and this mention of SPEC CPU2006 World Records), and written about STREAM (as in here), but these were not comparative numbers (vs GCC), so this is an attempt to fill that gap.
The first attempt is to take BYTE magazine's BYTEmark benchmark programs that are freely available at this location. The benchmarks are designed to expose the capabilities of a system's CPU, FPU and memory system and were derived directly, without algorithmic change, from the BYTE web site.
The tests used here were ported to Linux and are actually run on a Solaris 10 system. The hardware used here was a SunFire X4100 box with a dual-core 2.4GHz Opteron chip in a standard configuration.
In the following table, the numbers are all ratios indexed against a baseline system: an AMD K6/233 with 512KB L2-cache, gcc2.7.2.3 and libc-5.4.38.
Being ratios, a higher number is better; likewise, in the Ratio column, a value > 1 means SunStudio is better than GCC.
| Test | GCC4.1 | SunStudio11 | Ratio (SS11/GCC) |
| Numeric Sort | 12.68 | 10.63 | 0.84 |
| String Sort | 16.07 | 20.93 | 1.30 |
| Bitfield | 15.15 | 16.15 | 1.06 |
| FP Emulator | 14.46 | 31.77 | 2.19 |
| Fourier | 11.69 | 25.55 | 2.19 |
| Assignment | 37.59 | 25.11 | 0.67 |
| Idea | 18.52 | 32.87 | 1.77 |
| Huffman | 15.10 | 17.02 | 1.13 |
| Neural Net | 24.87 | 33.98 | 1.37 |
| Lu Decomposition | 45.55 | 50.20 | 1.10 |
| Memory Index | 20.78 | 20.397 | 0.98 |
| Integer Index | 15.09 | 20.844 | 1.39 |
| FP Index | 23.772 | 35.190 | 1.48 |
SunStudio Portal Article: Performance Analyzer
Two new articles describe how to use the Sun Studio Performance Tools to profile Java applications and WebLogic servers.
Profiling Java Applications with Sun Studio Performance Tools describes the challenge of profiling Java applications, either pure Java or mixed Java/C/C++, which need to run as a process instantiating the Java Virtual Machine (JVM), which is itself a C++ program.
Profiling WebLogic Servers with Sun Studio Performance Tools describes how to profile servers being run under BEA's WebLogic® system. A server run under BEA's WebLogic is a Java application that you launch by running a script to invoke the JVM.
This is the only profiling tool I know of that handles C, C++, Fortran, Java, WebLogic servers, OpenMP, MPI, Pthreads and auto-parallelized code well, on both Solaris and Linux(es), in both the GUI and on the command line. A true SunStudio Gem!
Posted by tatkar
( Aug 30 2006, 10:05:55 AM PDT )
Thumper (SunFire X4500) sets World Record in SPECfp
Not just another pretty face, Thumper, aka Sun Fire X4500, has scored its first World Record SPEC performance win!
Thumper, or as it's known by its marketing name, Sun Fire X4500, is a terrific combination of storage (24TB) and a high performance 2-socket dual-core (and quad-core ready, BTW) Opteron server. The kind that prompts you to say "I want one of those" the moment you set your eyes upon it.
World Record SPECfp_rate2000 for all 2-socket x86 systems.
The message is clear: (in the words of our Performance Lead)
The result puts Thumper clearly at the top of floating point CPU horsepower for all 2-socket x86 systems. Thumper is not just a lot of storage, but the fastest server in its class all at the same time.
Clearly it's the rackmountable version of All This (beautiful) Storage and Brains Too!
You can find much more information about Thumper, aka Sun Fire X4500, here.
Required Disclosure Statements:
Sun Fire X4500 103 SPECfp 2000 Rate (4 cores, 2 chips, Solaris 10)
SPEC, SPECfp Rate are Registered Trademarks of Standard Performance Evaluation Corporation. Results from www.spec.org as of August 24, 2006.
Posted by tatkar ( Aug 25 2006, 08:12:42 AM PDT )
SunStudio on Linux Tip #1: CMT support
Multi-Threading is here! Multi-core is here! It's time to get ready! There is no more free lunch; it's over!
OK, now that I'm done sloganizing :-), there is really a serious need that I think SunStudio Compilers and Tools on Linux are well-positioned to satisfy: developing high-performance applications for the emerging multi-core systems!
SunStudio on Linux offers the following distinct benefits:
OpenMP vs. Autopar for STREAM on SunStudio/Linux
The Sun Studio Compilers on Linux (click here to download the latest Technology Preview) are now showing good signs of maturity, stability and equivalent performance characteristics to the Solaris version. In particular, we now have OpenMP and Automatic Parallelization implemented in the compiler.
OpenMP is based on parallelization directives that are inserted manually, as defined by the de facto OpenMP standard. You can get the OpenMP API Users Guide here and this portal page has a bunch of related HPTC articles.
Automatic Parallelization is inferred directly by the compiler based on opportunities found during the optimizer's analysis phase.
This article on the portal that introduces OpenMP and parallelization concepts is actually a nice way of learning about the issues encountered in parallelizing programs.
It seemed logical that I should give it a spin with the STREAM benchmark.
The STREAM Benchmark is the de facto industry standard benchmark for the measurement of computer memory bandwidth. To be fair, it's not a measure of compiler performance in the strict computational sense that the SPEC CPU benchmarks are, but it is also a popular enough measure of how well the compiler optimizes for memory.
The STREAM benchmark results pages are maintained here and you can download the sources here.
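As a rough illustration of what STREAM measures, here is a sketch of its Triad kernel together with the bandwidth arithmetic. It is written from the benchmark's published description rather than copied from its sources, and the function names `stream_triad` and `triad_mb_per_sec` are hypothetical. The OpenMP directive marks the loop for the OpenMP build; for the automatic parallelization build the compiler is left to find the same loop on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Triad kernel: a[j] = b[j] + scalar * c[j].
   The directive is honored when OpenMP is enabled and ignored otherwise. */
void stream_triad(double *a, const double *b, const double *c,
                  double scalar, size_t n)
{
    long j;

    assert((a != NULL && b != NULL && c != NULL) || n == 0);
    #pragma omp parallel for
    for (j = 0; j < (long)n; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Bandwidth in MB/s, taking 1 MB as 10^6 bytes: Triad touches three
   arrays per iteration (two reads and one write). */
double triad_mb_per_sec(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return 1.0e-6 * 3.0 * (double)sizeof(double) * (double)n / seconds;
}
```

Copy and Scale move two arrays per iteration and Add moves three, so the reported rates are not directly comparable byte for byte unless that factor is taken into account.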
So, I thought I would post results for a SunFire V40z machine (4 CPU x 2.6GHz, DDR1-400 16GB (2GB DIMMs x 8) Memory).
I compiled it this way on a Linux box running SuSE 9:
For OpenMP:
cc -fast -xarch=amd64a -xvector=simd -xprefetch -xprefetch_level=3 -xopenmp stream_d_omp.c second.c
For Automatic parallelization:
cc -fast -xarch=amd64a -xvector=simd -xprefetch -xprefetch_level=3 -xautopar stream_d.c second.c
Here's what I got for various levels of OMP and Parallelization scaling:
setenv OMP_NUM_THREADS 4 vs. setenv PARALLEL 4
| Function | OpenMP Rate (MB/s) | Parallel Rate (MB/s) |
| Copy | 17660.2274 | 18120.3922 |
| Scale | 17467.1692 | 18108.1662 |
| Add | 17750.5371 | 17758.3657 |
| Triad | 17731.7766 | 17626.2119 |

| Function | OpenMP Rate (MB/s) | Parallel Rate (MB/s) |
| Copy | 9029.7180 | 9211.2915 |
| Scale | 8789.0595 | 9169.1302 |
| Add | 9082.2661 | 9090.8784 |
| Triad | 9072.0346 | 9066.7234 |

| Function | OpenMP Rate (MB/s) | Parallel Rate (MB/s) |
| Copy | 4559.0261 | 4657.2653 |
| Scale | 4425.3925 | 4614.3545 |
| Add | 4621.6104 | 4628.7296 |
| Triad | 4617.5824 | 4627.3465 |
Highest SPECfp on the Planet!
Please note the added required disclosure statement at the end.
With the introduction of SunFire X4600 servers and new compiler patches, Sun Studio Compilers, Solaris and SunFire Opterons (4P) combine for the highest SPEC CPU2000 FP (peak value = 3538) on the planet. There is some secret sauce in the newly introduced patch, plus the autopar scaling of some benchmarks made possible by this new configuration, that helps overwhelm Intel's Woodcrest and even the respected IBM Power5+ number!
The interesting nugget for Intel's Woodcrest here is that its own FP number (2783), announced with the much-respected Intel Compilers, is trumped by PathScale Compilers (3056) on what appears to be a very similar box. Begone, Intel FUD!
Here's the tally of all the leading architectures and their SPEC CPU2000 FP (peak values only):
Required Disclosure Statements:
Sun Fire X4600 3538 SPECfp2000 (4 cores, 4 chips, Solaris 10),
IBM p5 575 3513 SPECfp2000 (2.2GHz POWER5+) (1 core, 1 chip, AIX),
HP BL480c 3.0GHz Woodcrest 5160 3049 SPECfp2000 (1 core, 1 chip,
RHEL4u3).
Bull Novascale 3045 3017 SPECfp2000 (1600MHz Itanium2) (8 cores,
4 chips, Bull Advanced Server 4 (linux kernel 2.6.12, 64K pages))
Intel(R) D975XBX 2236 SPECfp2000 ( 3.73 GHz Intel Pentium(R) processor
Extreme Edition 965 , 1066 MHz bus) ( 2 cores, 1 chip, Windows XP
Professional SP2)
Fujitsu Limited PRIMEPOWER650 2236 SPECfp2000 (2164MHz SPARC64 V)
(1 core, 1 chip, Solaris 10)
HP DL380 G5 2150 SPECfp2000 (3.73GHz, Intel Xeon Processor 5080) (1
core, 1 chip, Hyperthreading disabled, RHEL4)
HP AlphaServer GS1280 7/1300 1684 SPECfp2000 (1300MHz Alpha
21364) (1 core, 1 chip, Tru64 UNIX V5.1B-1 + PK4)
SPEC, SPECfp are Registered Trademarks of Standard Performance Evaluation Corporation. Results from www.spec.org as of Jul 11, 2006.
Posted by tatkar ( Jul 19 2006, 10:48:07 AM PDT )