Vijay Tatkar's Blog

20091111 Wednesday November 11, 2009

Sun Studio OpenMP gets 12x improvement on Seismic benchmark on SLES10
This story is hard to pass up: Sun's BestPerf blog (read the details here) recently reported a 12x performance improvement over a single-threaded version of an important Seismic (Reverse Time Migration) benchmark, using Sun Studio's OpenMP feature on SLES10. It's a great story of how Sun can deliver performance through a combination of Sun Studio and new hardware (via the Sun Storage F5100 Flash Array). Yes, this is the same Flash Array that has been the talk of the town and has notched up several World Record wins.
Several points come to mind:


Posted by tatkar ( Nov 11 2009, 09:37:38 AM PST ) Permalink Comments [0]

20080908 Monday September 08, 2008

Interesting Read: Parallel Programming Made Easy?
Michael Wolfe, senior compiler engineer/architect at The Portland Group (PGI Compilers), recently wrote an article featured in HPCWire questioning how "easy" one could claim to make parallel programming.
It offers some interesting thoughts and views from one of the leading compiler architects, at a compiler vendor that has focused on high performance computing for decades(?) now.
Whether you agree or not is your own personal view of how this important style of programming should be shaped, but he does represent a very important part of the discussion underway today.
Definitely worth a read!
Posted by tatkar ( Sep 08 2008, 09:11:54 AM PDT ) Permalink Comments [2]


20080522 Thursday May 22, 2008

Sun Studio wins Infoworld award in Application Performance category
Sun Studio came out on top in the application performance category in a new IDE survey by Infoworld (here).
NetBeans (on which the Sun Studio IDE is based) is rated lower than Sun Studio, which is a surprise. Also surprising is the absence of Eclipse from the list, which the article explains in a very unsatisfactory way, IMO.
I'm just glad it is being noticed that we, the Sun Studio compilers, produce good overall application performance. Someone saying IT'S BEST really helps.
FYI, IBM's Rational IDE came up on top overall. I'm sure the survey is not without bias, as the comments it has attracted seem to indicate.
Posted by tatkar ( May 22 2008, 10:28:54 AM PDT ) Permalink Comments [2]


20080121 Monday January 21, 2008

Technical articles on performance tuning
I often get asked which compiler options work best for x86 or SPARC, or even for Intel versus AMD or the various SPARC architectures.
Here is a handy reference for compiler optimization and performance tuning options. Getting good performance out of applications is both important and often a little tricky, so this should help.
Of course, the generic advice is that -xtarget=generic and the baseline options are constantly being tuned to be the best average-case options. They should suffice in most cases to get the most juice across the broadest set of machines. But there will always be those who need to go the extra mile and want to know how to get the most.


Selecting the Best Compiler Options
How to get the best performance from an UltraSPARC or x86/AMD64 (x64) processor running on the latest Solaris systems by compiling with the best set of compiler options and the latest compilers? Here are suggestions of things you should try, but before you release the final version of your program, you should understand exactly what you have asked the compiler to do.
Advanced Compiler Options for Performance
Users wanting the best performance from CPU-intensive codes may wish to explore the use of additional libraries and advanced compiler options that control individual compiler components.
Getting the Best AMD64 Performance With Sun Studio Compilers
Performance is a factor of both hardware and software. To extract the maximum performance from the new AMD-64 based systems on your critical C/C++ and Fortran applications, choose the best compilers. Then use compiler options to take advantage of the Opteron system features to maximize performance.
How I Got 15x Improvement Without Really Trying
A case study in program optimization.
Using Inline Templates to Improve Application Performance
Inline templates are a mechanism for directly inserting assembly code into an executable. Typically, this approach is used to obtain the best performance for a given function, or to implement an algorithm in a specific way.
Performance Tuning With Sun Studio Compilers and Inline Assembly Language
Here are examples of using a compiler flag or inline assembly language with Sun Studio compilers to increase the performance of C, C++, and Fortran programs.
Prefetching Pragmas and Intrinsics
Explicit data prefetching pragmas and intrinsics for the x86 platform and additional pragmas and intrinsics for the SPARC platform are now available in Sun Studio 12 compilers. Prefetch instructions can increase the speed of an application substantially by bringing data into cache so that it is available when the processor needs it. This benefits performance because today's processors are so fast that it is difficult to bring data into them quickly enough to keep them busy, even with hardware prefetching and multiple levels of data cache.
Using F95 Interfaces to Customize Access to the Sun Performance Library
When porting Fortran source, the Fortran 95 generic interface can be used to allow the source code to remain virtually unchanged and yet facilitate the use of the ILP-32, LP-64, and ILP-64 programming models.
Using VIS Instructions to Speed Up Key Routines
The VIS instruction set includes a number of instructions that can be used to handle several items of data at the same time. These are called SIMD (Single Instruction Multiple Data) instructions. The VIS instructions work on data held in floating point registers. The advantage of using VIS instructions is that an operation can be applied to different items of data in parallel, meaning that it takes the same time to compute eight 1-byte results as it does to calculate one 8-byte result. In theory this means that code that uses VIS instructions can be many times faster than code without them.
The Sun Studio Binary Optimizer
The Binary Optimizer is a static SPARC optimizer that accepts as input a binary and creates an optimized binary as the output. We define a binary as either an executable or a shared object. The availability of the original source code is not a pre-requisite for using this tool. It can optimize binaries irrespective of the source language used (C, C++ or FORTRAN). It can also optimize mixed source language binaries.

Posted by tatkar ( Jan 21 2008, 05:47:42 PM PST ) Permalink Comments [2]


20071001 Monday October 01, 2007

New OpenMP book published
It's nice to see a new book on OpenMP published by some of the experts in this area.
See here for details.
One of the authors, Ruud van der Pas, works very closely with my compiler teams and with customers, so this book is sure to have practical and down-to-earth suggestions. I haven't read it yet, myself.
PS. For those not quite in the know, here's a simple description of OpenMP:
OpenMP is a set of APIs for SMP programming in C, C++ and Fortran. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.
Jointly defined by a group of major computer vendors, OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.
Posted by tatkar ( Oct 01 2007, 02:49:34 AM PDT ) Permalink Comments [0]


20070821 Tuesday August 21, 2007

Sun Studio 12 patch performance improvements quantified for Core2Duo
My previous blog pointed at the first available Sun Studio patch. It mentioned that there were performance improvements for the Core2Duo architecture beyond what was in the released product.
This one quantifies the improvements we're seeing on SPECfp2000 (52%) and SPECfp2006 (34%). There are small improvements on SPECint2006 (0-3%) and SPECint2000 (1-4%) as well, but they are not as noteworthy. The system used here is a whitebox, 2 CPU, 4-core 2.66GHz X5355-based system with 4GB memory. This was purely a comparative run and not done for a SPEC submission, so these aren't official SPEC numbers (they are percentage changes rather than scores, anyway) but SPEC estimates.

Benchmark: SPECfp 2000    %Change (over Studio12 FCS)
wupwise                   30.3
swim                      83.23
mgrid                     39.10
applu                     80.92
mesa                      3.52
galgel                    172.86
art                       130.83
equake                    14.05
facerec                   2.44
ammp                      65.74
lucas                     13.86
fma3d                     36.98
sixtrack                  96.94
apsi                      50.50
Overall Geo Mean          51.99
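The post does not say how the "Overall Geo Mean" row is derived. A minimal sketch, assuming it is the geometric mean of the per-benchmark speedup ratios (1 + change/100) converted back to a percentage, which is the usual SPEC-style aggregation; the function and variable names are illustrative only:

```python
import math

# Percent improvements over Studio 12 FCS, copied from the SPECfp 2000 table above.
FP2000_PCT_CHANGE = {
    "wupwise": 30.3, "swim": 83.23, "mgrid": 39.10, "applu": 80.92,
    "mesa": 3.52, "galgel": 172.86, "art": 130.83, "equake": 14.05,
    "facerec": 2.44, "ammp": 65.74, "lucas": 13.86, "fma3d": 36.98,
    "sixtrack": 96.94, "apsi": 50.50,
}


def geo_mean_pct_change(pct_changes):
    """Geometric mean of speedup ratios, expressed again as a percent change.

    Assumption (not stated in the post): each entry is a percent change,
    so the underlying speedup ratio is 1 + pct / 100.
    """
    ratios = [1.0 + pct / 100.0 for pct in pct_changes]
    # Averaging the logarithms is the standard way to form a geometric mean.
    mean_log = sum(math.log(r) for r in ratios) / len(ratios)
    return (math.exp(mean_log) - 1.0) * 100.0


if __name__ == "__main__":
    print(f"{geo_mean_pct_change(FP2000_PCT_CHANGE.values()):.2f}%")
```

Under that assumption the result lands at roughly 52%, in line with the table's overall figure. A geometric mean keeps a single outlier such as galgel from dominating the summary the way a plain average would.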

Benchmark: SPECfp 2006    %Change over Studio 12 FCS
410.bwaves                10.05
416.gamess                50.26
433.milc                  12.99
434.zeusmp                50.34
435.gromacs               21.24
436.cactusADM             57.82
437.leslie3d              32.10
444.namd                  62.20
447.dealII                9.00
450.soplex                4.58
453.povray                13.38
454.calculix              166.38
459.GemsFDTD              24.79
465.tonto                 41.75
470.lbm                   12.27
481.wrf                   3.00
482.sphinx3               18.75
Overall Geo Mean          34.55

Now you know why I recommended that if you're using Sun Studio 12 on Woodcrest or Clovertown (Core2Duo) systems, you MUST get the new patch.

Posted by tatkar ( Aug 21 2007, 01:49:23 PM PDT ) Permalink Comments [2]


20070611 Monday June 11, 2007

SunStudio 12 Compiler establishes World Record on Woodcrest chip!
Imagine That!
However, here it is. The latest submitted results for our Constellation Blade Server, now called the Sun Blade Server 6000 system, make it official.
The Dual-Core Intel Xeon 5160 Intel Blade Module of this server delivers World Record performance on SPECint2006 of 21.0, which is higher than any announced benchmark for either the Core2Duo or Opteron chips. These results beat even the Intel Compiler results for an Intel chip!
New Sun systems with SunStudio compilers and Solaris 10 on Intel chips are leading the way with performance. If ever vindication was needed of what Intel chips are capable of bringing to Sun systems, and of what Solaris is bringing to Intel to help expand the x86 marketplace, this is it!

Sun also announced World Record performance on the SPEC OMPM2001 benchmark for the dual-socket, dual-core AMD Opteron Model 2222SE based Opteron Blade Module of this server, with a score of 13847 for 4 threads.

Required Disclosure:
SPEC, SPECint and SPEComp are registered trademarks of Standard Performance Evaluation Corporation. Results from www.spec.org as of 05/25/07. Sun's results were submitted for review.
Sun Blade 6250 (2xDual-core , 4 cores, 2 chips, 2 cores/chip, Solaris 10):  SunStudio12 SPECint2006 - 21.0
Sun Blade 6200 (2xDual-Core AMD Opteron Model 2222SE processors 4 cores, 2 chips, 2 cores/chip, Solaris 10):  SunStudio11. SPECompM2001 - 13847

Posted by tatkar ( Jun 11 2007, 02:58:27 PM PDT ) Permalink Comments [1]


20070110 Wednesday January 10, 2007

Sun announces new Blade Servers with World Record Performance
Sun today announced the fastest and the industry's only 4 socket dual-core 2.8GHz AMD Opteron 8000 series processor based blade server (announcement is here).
And it packs an enormous performance punch! Here are the latest World Record Benchmarks for SPEC CPU2006 rate and SPEC OMP2001, using Sun Studio 11 and Solaris 10 with these blades:

These numbers clearly set this blade server module apart from the competition at this stage, delivering compelling performance differentiation. Look at the disclosure below for competitive numbers. What makes these boxes even more interesting is the Sun Refresh Service (read here), which will keep your blade servers up to date in this fast-changing marketplace (especially with the exciting upcoming speedbumps and quad-core processors from AMD and Intel).

Required Disclosure Statements:

SPEC, SPEComp, SPECCfp and SPECfp Rate are Registered Trademarks of Standard Performance Evaluation Corporation. Sun's results were submitted for review. For SPEC comparisons, socket equates to chip.
Competitive results from www.spec.org as of Jan 05, 2007.

Sun Blade X8420 (4xAMD Opteron model 8220, 8 cores, 4 chips, 2 cores/chip, 8 threads, Solaris10): SPECompM2001 - 23224;
IBM System p5 550 (POWER5+, 4 cores, 2 chips, 2 cores/chip, 8 threads, AIX5L V5.3): SPECompM2001 - 19,983;
HP ProLiant DL585 (4xAMD Opteron model 880, 8 cores, 4 chips, 2 cores/chip, 8 threads, RedHat EL AS): SPECompM2001 - 17948

Sun Blade X8420 (4xAMD Opteron model 8220, 8 cores, 4 chips, 2 cores/chip, 8 threads, SLES 9): SPECint_rate2006 - 93.1.
Sun Blade X8420 (4xAMD Opteron model 8220, 8 cores, 4 chips, 2 cores/chip, 8 threads, Solaris10): SPECfp_rate2006 - 87.3.


Posted by tatkar ( Jan 10 2007, 07:42:55 AM PST ) Permalink Comments [0]

20061117 Friday November 17, 2006

Sun Studio powers new Opteron Workstation to record SPEC INTrate and SPEC FP
Sun recently introduced the so-called AM2 variant (the next generation of AMD processors) of the Sun Ultra 40 Workstation. For details, see here. With this introduction, Sun also announced new World Records with this machine. I am particularly happy with this one (words from the product page, directly).
The Sun Ultra 40 M2 workstation, with two Dual-Core AMD Opteron model 2220SE processors, has reached a new milestone on the SPECint_rate2006 suite of the SPEC CPU2006 benchmark, by utilizing the most advanced features of Sun Studio 11 software and Solaris 10 OS.
Leading the x86 segment and surpassing competing workstations, the next-generation Sun Ultra 40 M2 workstation produced a SPECint_rate2006 result of 48.8.
For a while now Woodcrest had retaken the SPEC INT lead held previously by AMD's Opteron chips. The performance of Woodcrest on SPEC CPU2000 has been particularly spectacular. So it is particularly pleasing to see that with CPU2006 INTrate, Sun has been able to reclaim the World Record here for the dual-core AMD Opteron model 2220SE processors. The rate measure is particularly important as we move into the dual- and quad-core world for the x86/x64 architecture machines.
In addition, the Sun Ultra 40 Workstation continues to claim World Record performance for SPEC CPU 2000 FP, with best numbers (Peak) of 3545 and a 4-core FPrate of 121. The FPrate result beats the Woodcrest-based number handily, by more than 40% (121 vs. 81.3); the Peak result leads by roughly 23% (3545 vs. 2872).
The following table shows these comparisons:
SPEC CPU2006 INTrate (ratios, higher is better)
System               Description                                       #Threads   INTrate
Sun Ultra 40 M2      AMD 2220SE (dual-core 2.8GHz), 2 CPU              4          48.4
SuperMicro           Woodcrest, Intel 5160 (dual-core 3.0GHz), 2 CPU   4          45.2
Dell Precision 380   Intel 3.73 GHz, Pentium Extreme Edition 965       4          23.1

SPEC CPU2000 FP (ratios, higher is better)
System               Description                                  #Threads   FPrate
Sun Ultra 40 M2      AMD 2220SE (dual-core 2.8GHz), 2 CPU         2          121
Dell Precision 690   (Xeon 5160, 4 cores, 2 chips, RHEL 4AS U3)   2          81.3

Required Disclosure Statements:

SPEC, SPECCfp and SPECfp Rate are Registered Trademarks of Standard Performance Evaluation Corporation. Sun's results were submitted for review. For SPEC comparisons, socket equates to chip.
Competitive results from www.spec.org as of Nov 17, 2006.

Sun Ultra 40 M2 (AMD Opteron model 2220SE, 4 cores, 2 chips): SPECint_rate2006: 48.4
Dell Precision 690 (Xeon 5160, 4 cores, 2 chips, RHEL 4AS U3): SPECint_rate2006 - 45.2

Sun Ultra 40 M2 (AMD Opteron model 2220SE, 4 cores, 2 chips): SPECfp_rate2000: 121
Dell Precision 690 (Xeon 5160, 4 cores, 2 chips, RHEL 4AS U3): SPECfp_rate2000 - 81.3


Sun Ultra 40 M2 (AMD Opteron model 2220SE, 4 cores, 2 chips): SPECfp2000 - 3545
Dell Precision 690 (Xeon 5160, 4 cores, 2 chips, RHEL 4AS U3): SPECfp2000 - 2872




Posted by tatkar ( Nov 17 2006, 04:31:45 PM PST ) Permalink Comments [1]

20061018 Wednesday October 18, 2006

Performance Comparison: Sun Studio vs GCC on STREAM Benchmark
I have previously described the STREAM Benchmark and the results we were seeing with its OpenMP version and what we got by turning on Automatic Parallelization in the compiler.
Here I'd like to put out comparative results with the GCC compiler:

Function   Sun Studio 11 (MB/s)   GCC4.1 (MB/s)
Copy       4658                   2766
Scale      4614                   2745
Add        4628                   2970
Triad      4627                   2969

This is roughly a 1.6x advantage with the Sun Studio compiler.
The comparisons were done on exactly the same box. The box was a SunFire V40z with 4 x 2.6GHz processors and PC3200 CL3 DDR SDRAM ECC Regd. memory.
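The "roughly 1.6x" figure can be checked kernel by kernel. A minimal sketch using the single-threaded MB/s values from the table above; the dictionary and function names are illustrative, not part of STREAM:

```python
# MB/s figures copied from the Sun Studio 11 vs. GCC 4.1 table above.
SUN_STUDIO_11 = {"Copy": 4658, "Scale": 4614, "Add": 4628, "Triad": 4627}
GCC_41 = {"Copy": 2766, "Scale": 2745, "Add": 2970, "Triad": 2969}


def bandwidth_ratios(numer, denom):
    """Per-kernel ratio of two sets of STREAM bandwidth results."""
    return {kernel: numer[kernel] / denom[kernel] for kernel in numer}


if __name__ == "__main__":
    ratios = bandwidth_ratios(SUN_STUDIO_11, GCC_41)
    for kernel, ratio in ratios.items():
        print(f"{kernel:6s} {ratio:.2f}x")
    print(f"mean   {sum(ratios.values()) / len(ratios):.2f}x")
```

Copy and Scale come out near 1.68x while Add and Triad sit near 1.56x, so "roughly 1.6x" is a fair summary of the four kernels.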

The Optimization options used in these cases were:
Sun Studio: -fast -xarch=amd64a -xvector=simd -xprefetch -xprefetch_level=3
GCC 4.1: -O3 -funroll-all-loops -ffast-math -fpeephole -m64 -mtune=k8 -fprefetch-loop-arrays

Function   Sun Studio 11 (MB/s)   Sun Studio 11 (MB/s), 4proc Autopar
Copy       4658                   18120
Scale      4614                   18108
Add        4628                   17758
Triad      4627                   17626

For a 4CPU machine, this is roughly a 3.9x scalability, which is incredible!
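Scalability here means parallel speedup (parallel rate divided by single-threaded rate); dividing that by the processor count gives parallel efficiency. A minimal sketch over the table's figures, with n_procs=4 taken from the "4proc" column heading; all names are illustrative:

```python
# MB/s figures copied from the single-threaded vs. 4-processor autopar table above.
SINGLE_THREAD = {"Copy": 4658, "Scale": 4614, "Add": 4628, "Triad": 4627}
AUTOPAR_4 = {"Copy": 18120, "Scale": 18108, "Add": 17758, "Triad": 17626}


def scaling(parallel, serial, n_procs):
    """Return {kernel: (speedup, efficiency)}, with efficiency = speedup / n_procs."""
    result = {}
    for kernel, rate in parallel.items():
        speedup = rate / serial[kernel]
        result[kernel] = (speedup, speedup / n_procs)
    return result


if __name__ == "__main__":
    for kernel, (speedup, eff) in scaling(AUTOPAR_4, SINGLE_THREAD, 4).items():
        print(f"{kernel:6s} speedup {speedup:.2f}x  efficiency {eff:.0%}")
```

On these numbers every kernel keeps better than 95% efficiency on four processors, which is what makes the near-linear 3.9x claim notable for a memory-bandwidth benchmark.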
Of course, the GCC compiler isn't able to exploit such scalability because it supports neither Automatic Parallelization nor OpenMP at this time. (It's working on at least OpenMP support, so this discrepancy will be addressed in a future release.)
Posted by tatkar ( Oct 18 2006, 02:08:06 PM PDT ) Permalink Comments [2]


20060926 Tuesday September 26, 2006

Performance Comparison: SunStudio vs. GCC on BYTE benchmark (Nbench)
I am often asked about performance differences between Sun Studio Compilers and GCC. With performance, a single answer never works across the board, so I am attempting to put out as much comparative information as I can, to show some of the differences (and hopefully advantages over GCC).
I have announced previous Sun Studio based SPEC World Record numbers in postings here (like this World Record SPECfp number and this mention of SPEC CPU2006 World Records ), and about STREAM (as in here), but these were not comparative numbers (vs GCC), so this is an attempt to fill that gap.
The first attempt is to take BYTE magazine's BYTEmark benchmark programs that are freely available at this location.
The benchmarks are designed to expose the capabilities of a system's CPU, FPU and memory system and were derived directly, without algorithmic change, from the BYTE web site.
The tests used here were ported to Linux but were actually run on a Solaris 10 system. The HW used here was a SunFire X4100 box with a dual-core 2.4GHz Opteron chip in a standard configuration.
In the following table, the numbers are all ratios indexed against a baseline system: an AMD K6/233 with 512KB L2 cache, gcc 2.7.2.3 and libc-5.4.38.
Being ratios, a higher number is better. Likewise, in the Ratio column, a ratio > 1 means SunStudio is better than GCC.

Test               GCC4.1    SunStudio11   Ratio(SS11/GCC)
Numeric Sort       12.68     10.63         0.84
String Sort        16.07     20.93         1.30
Bitfield           15.15     16.15         1.06
FP Emulator        14.46     31.77         2.19
Fourier            11.69     25.55         2.19
Assignment         37.59     25.11         0.67
Idea               18.52     32.87         1.77
Huffman            15.10     17.02         1.13
Neural Net         24.87     33.98         1.37
Lu Decomposition   45.55     50.20         1.10
Memory Index       20.78     20.397        0.98
Integer Index      15.09     20.844        1.39
FP Index           23.772    35.190        1.48

Numeric Sort, FP Emulator, Idea and Huffman are part of Integer Index
String Sort, Bitfield and Assignment make up the Memory Index
The other tests are part of FP Index
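The grouping above says which tests feed each index but not how they are combined. A minimal sketch, assuming each index is the geometric mean of its member tests' scores (an assumption, not something the post states), applied to the SunStudio11 column; all names are illustrative:

```python
import math

# Per-test SunStudio11 scores copied from the table above.
SS11_SCORES = {
    "Numeric Sort": 10.63, "String Sort": 20.93, "Bitfield": 16.15,
    "FP Emulator": 31.77, "Fourier": 25.55, "Assignment": 25.11,
    "Idea": 32.87, "Huffman": 17.02, "Neural Net": 33.98,
    "Lu Decomposition": 50.20,
}

# Index membership as listed in the post.
INDEX_MEMBERS = {
    "Integer Index": ["Numeric Sort", "FP Emulator", "Idea", "Huffman"],
    "Memory Index": ["String Sort", "Bitfield", "Assignment"],
    "FP Index": ["Fourier", "Neural Net", "Lu Decomposition"],
}


def index_value(scores, members):
    """Geometric mean of the member tests' scores (assumed aggregation)."""
    logs = [math.log(scores[name]) for name in members]
    return math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    for index_name, members in INDEX_MEMBERS.items():
        print(f"{index_name:14s} {index_value(SS11_SCORES, members):.3f}")
```

Under that assumption the SunStudio11 index rows are reproduced to within about 0.01 of the published values, which suggests the indices are geometric rather than arithmetic means.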

Flags used for each were:
Sun Studio 11: -fast -xarch=amd64
GCC 4.1: -O3 -s -Wall -fomit-frame-pointer -funroll-loops


Posted by tatkar ( Sep 26 2006, 02:36:44 PM PDT ) Permalink Comments [6]


20060830 Wednesday August 30, 2006

SunStudio Portal Article: Performance Analyzer
Two new articles describe how to use the Sun Studio Performance Tools to profile Java applications, and WebLogic servers.
Profiling Java Applications with Sun Studio Performance Tools describes the challenge of profiling Java applications, either pure Java or mixed Java/C/C++, which need to run as a process instantiating the Java Virtual Machine (JVM), which is itself a C++ program.
Profiling WebLogic Servers with Sun Studio Performance Tools describes how to profile servers being run under BEA's WebLogic® system. A server run under BEA's WebLogic is a Java application that you launch by running a script to invoke the JVM.
This is the only profiling tool I know of that does C, C++, Fortran, Java, Weblogic servers, OpenMP, MPI, Pthreads, Auto-parallelized code well both on Solaris as well as Linux(es). Both in GUI and command-line. A true SunStudio Gem!
Posted by tatkar ( Aug 30 2006, 10:05:55 AM PDT ) Permalink Comments [0]


20060825 Friday August 25, 2006

Thumper (SunFire X4500) sets World Record in SPECfp
Not just another pretty face, Thumper, aka Sun Fire X4500 has scored its first World Record SPEC performance win!
Thumper, or as it's known by its marketing name, Sun Fire X4500, is a terrific combination of storage (24TB) along with a high performance 2-socket dual-core (and quad-core ready, BTW) Opteron server. The kind that prompts you to say "I want one of those" the moment you set your eyes upon it.
World Record SPECfp_rate2000 for all 2-socket x86 systems.
The message is clear (in the words of our Performance Lead): "The result puts Thumper clearly at the top of floating point CPU horsepower for all 2-socket x86 systems. Thumper is not just a lot of storage, but the fastest server in its class all at the same time."
Clearly it's the rackmountable version of All This (beautiful) Storage and Brains Too!
You can find much more information about Thumper, aka Sun Fire X4500 here.

Required Disclosure Statements:

Sun Fire X4500 103 SPECfp 2000 Rate (4 cores, 2 chip, Solaris 10)

SPEC, SPECfp Rate are Registered Trademarks of Standard Performance Evaluation Corporation. Results from www.spec.org as of August 24, 2006.

Posted by tatkar ( Aug 25 2006, 08:12:42 AM PDT ) Permalink Comments [0]

20060804 Friday August 04, 2006

SunStudio on Linux Tip #1: CMT support
Multi-Threading is here! Multi-core is here! It's time to get ready! There is no more free lunch; it's over!

OK, now that I'm done sloganizing :-), there is a serious need that I think SunStudio Compilers and Tools on Linux are well-positioned to satisfy: developing high-performance applications for the emerging multi-core systems!
SunStudio on Linux offers the following distinct benefits:


Also, several articles published on the SunStudio Developer Portal talk about various chip-multithreading, multicore and parallel programming aspects. Here is an interesting sprinkling of papers that you could read to get started on the topic:

The Challenge of Developing Applications for Parallel Computing
This article discusses the important parallel application development issues now emerging from the parallel computing technology trend. It introduces and explains some of the most important industry standards such as OpenMP, MPI, and Grid Computing, and describes the current state of parallel application software development.
The Challenge of Race Conditions in Parallel Programming
This article discusses general and data race conditions that arise in parallel programming. While data race problems are common and easy to fix, general race problems, which are harder to avoid, can also occur. A race condition could be the symptom of deeper design problems. A simple parallel partitioning example illustrates these various race condition issues and how to avoid them.
DataRace Detection Tool
This article talks about using DRDT to detect data races that occur during the execution of a single, multi-threaded process.
Lock_Lint - Static Data Race and Deadlock Detection Tool for C
The command-line utility lock_lint analyzes the use of mutex and multiple readers/single writer locks, and reports on inconsistent use of these locking techniques that may lead to data races and deadlocks in multi-threaded applications.
What is Throughput Computing all about
This (PDF) paper discusses Sun's Throughput Computing initiative and gives background on the software, hardware and system issues Sun is endeavoring to solve. It introduces the term CoolThreads and explains what CMT means and how Sun is satisfying business needs through the new UltraSPARC design.

I'd recommend that you bookmark this  SunStudio/CMT site which has a collection of interesting papers on these topics

And finally, all of this works just as well on Solaris. In the case of multi-threaded application development, debugging and analysis, it works even better!
Posted by tatkar ( Aug 04 2006, 10:44:43 AM PDT ) Permalink Comments [0]

20060725 Tuesday July 25, 2006

OpenMP vs. Autopar for STREAM on SunStudio/Linux
The Sun Studio Compilers on Linux (click here to download the latest Technology Preview) are now showing good signs of maturity, stability and performance characteristics equivalent to the Solaris version. In particular, we now have OpenMP and Automatic Parallelization implemented in the compiler.
OpenMP is based on manually inserted parallelization directives, as defined by the de facto OpenMP standard. You can get the OpenMP API Users Guide here and this portal page has a bunch of related HPTC articles. Automatic Parallelization is inferred directly by the compiler based on opportunities found during the optimizer's analysis phase.
This article on the portal that introduces OpenMP and parallelization concepts is actually a nice way of learning about the issues encountered in parallelizing programs.
It seemed logical that I should give it a spin with the STREAM benchmark.
The STREAM Benchmark is the de facto industry standard benchmark for the measurement of computer memory bandwidth. To be fair, it's not a measure of compiler performance in the strict computational sense that SPEC CPU benchmarks are, but it is also a popular enough measure of how well the compiler optimizes for memory.
The STREAM benchmark results pages are maintained here and you can download the sources here.

So, I thought I would post results for a SunFire V40z machine (4 CPU x  2.6GHz, DDR1-400 16GB (2GB DIMMs x 8) Memory):
I compiled it this way on a Linux box running SuSE 9:
For OpenMP:
cc -fast -xarch=amd64a -xvector=simd -xprefetch -xprefetch_level=3 -xopenmp stream_d_omp.c second.c
For Automatic parallelization:
cc -fast -xarch=amd64a -xvector=simd -xprefetch -xprefetch_level=3 -xautopar stream_d.c second.c

Here's what I got for various levels of OMP and Parallelization scaling:

setenv OMP_NUM_THREADS 4  vs. setenv PARALLEL 4
Function   OpenMP Rate (MB/s)   Parallel Rate (MB/s)
Copy       17660.2274           18120.3922
Scale      17467.1692           18108.1662
Add        17750.5371           17758.3657
Triad      17731.7766           17626.2119

setenv OMP_NUM_THREADS 2  vs. setenv PARALLEL 2
Function   OpenMP Rate (MB/s)   Parallel Rate (MB/s)
Copy       9029.7180            9211.2915
Scale      8789.0595            9169.1302
Add        9082.2661            9090.8784
Triad      9072.0346            9066.7234


setenv OMP_NUM_THREADS 1  vs. setenv PARALLEL 1
Function   OpenMP Rate (MB/s)   Parallel Rate (MB/s)
Copy       4559.0261            4657.2653
Scale      4425.3925            4614.3545
Add        4621.6104            4628.7296
Triad      4617.5824            4627.3465

These numbers are virtually identical to the numbers on Solaris.
The conclusion: the compiler is finding parallelization opportunities on-par with hand-inserted OpenMP directives, which is indeed good news on Linux. The same is true on Solaris, BTW
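One way to put a number on "on-par" is the largest relative gap between the autopar and OpenMP rates across all kernels and thread counts. A minimal sketch over the figures reported above; the data layout and function name are illustrative:

```python
# (OpenMP MB/s, autopar MB/s) per kernel, keyed by thread count,
# copied from the three result tables above.
RATES = {
    4: {"Copy": (17660.2274, 18120.3922), "Scale": (17467.1692, 18108.1662),
        "Add": (17750.5371, 17758.3657), "Triad": (17731.7766, 17626.2119)},
    2: {"Copy": (9029.7180, 9211.2915), "Scale": (8789.0595, 9169.1302),
        "Add": (9082.2661, 9090.8784), "Triad": (9072.0346, 9066.7234)},
    1: {"Copy": (4559.0261, 4657.2653), "Scale": (4425.3925, 4614.3545),
        "Add": (4621.6104, 4628.7296), "Triad": (4617.5824, 4627.3465)},
}


def max_relative_gap(rates):
    """Largest |autopar - openmp| / openmp over all kernels and thread counts."""
    return max(
        abs(autopar - openmp) / openmp
        for kernels in rates.values()
        for openmp, autopar in kernels.values()
    )


if __name__ == "__main__":
    print(f"largest autopar vs. OpenMP gap: {max_relative_gap(RATES):.1%}")
```

On these figures the two approaches never differ by more than about 5%, with the Scale kernel showing the widest gap and Add and Triad nearly identical.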
I am going to measure these with competitive offerings as well (Intel, GCC, PGI and Pathscale) and will post those results separately.
Posted by tatkar ( Jul 25 2006, 04:31:31 PM PDT ) Permalink Comments [1]

