Paul Hinker's Weblog
Tuesday Aug 23, 2005
Recent versions of the Sun Performance Library
In response to a recent query regarding my initial blog (thanks Kristofer), I'd like to give some details about the current Performance Library efforts. Most of the new development has gone into creating versions for 32-bit x86 (vanilla P3 or P4 chips without SSE2 support), 32-bit SSE2 (minimum requirements: a P4 with SSE2 support and Solaris 9 update 6 or later), and finally 64-bit Opteron.
Kristofer asks, "Why does the x86 version of the library stink so bad compared to the SPARC offerings?" If the only library you have tested is the non-SSE2, 32-bit version which was released with the Sun Studio 9 compiler and tools, I'm sure you're disappointed with the results. It was our first effort after returning to the x86 architecture and we had no OpenMP support in the compiler. The schedule was tight and it was about all we could do to field a working, serial version of the library. However, we also shipped an SSE2-enabled version of the library with the Sun Studio 9 compiler and tools, which you might give a try. You do need a P4 or better SSE2-enabled chip and Solaris 9 update 6 or later. You'll find that our 32-bit performance is quite respectable. You can access that library using the -xarch=sse2 compiler option.
% f90 -fast -xarch=sse2 -o prog prog.f -xlic_lib=sunperf
% prog
3000 4831.38784
That's 4831 MFlops on a 2.8 GHz machine doing a double precision matrix multiply. Not world-record numbers, but certainly respectable (86+% of peak). The Studio 9 Performance Library doesn't have multi-threaded support because the compiler lacked OpenMP support.
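The source of prog.f isn't shown here, so the following is only a sketch of the usual convention for turning a matrix multiply timing into an MFlops figure: an n x n multiply is counted as 2*n^3 floating-point operations (one multiply and one add per inner-loop step). The function names `matmul` and `dgemm_mflops` are hypothetical, and the naive pure-Python multiply merely stands in for a tuned DGEMM call.

```python
import time

def matmul(a, b):
    """Naive product of two square matrices stored as lists of lists."""
    bt = list(zip(*b))  # transpose b so its columns can be walked as rows
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def dgemm_mflops(n, seconds):
    """MFlops rate, assuming the conventional 2*n**3 operation count."""
    return 2.0 * n**3 / seconds / 1.0e6

# Small size so the pure-Python stand-in finishes quickly.
n = 60
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]

start = time.perf_counter()
c = matmul(a, b)
elapsed = time.perf_counter() - start

# Same output layout as the runs in this post: matrix size, then MFlops.
print(n, dgemm_mflops(n, elapsed))
```

Under that convention, n = 3000 means 54 billion operations, so 4831 MFlops corresponds to a run of roughly 11 seconds.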
Now, let's move forward a release to Studio 10. We rolled out 64-bit support and all libraries are parallelized. Let's take our same 3000x3000 double precision matrix multiply and run it on a 2-way 2.2 GHz Opteron machine running Solaris 10.
% f90 -fast -xarch=amd64 -o prog prog.f -xlic_lib=sunperf
% prog
3000 3788.6042649
86% of peak ... not great but not too bad considering it was our first 64-bit attempt and we released the compilers and tools concurrently with the Solaris OS for 64-bit.
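For anyone wanting to check the percent-of-peak figures, here is a minimal sketch of the arithmetic. It assumes a theoretical peak of 2 double precision flops per clock cycle per processor for both the SSE2 P4 and the Opteron; that value is my assumption rather than something stated above, though it is consistent with the percentages quoted. The helper names are hypothetical.

```python
def peak_mflops(clock_ghz, flops_per_cycle=2, processors=1):
    """Theoretical peak in MFlops (flops_per_cycle=2 is an assumption)."""
    return clock_ghz * 1000.0 * flops_per_cycle * processors

def percent_of_peak(measured_mflops, clock_ghz, flops_per_cycle=2, processors=1):
    """Measured rate as a percentage of the theoretical peak."""
    peak = peak_mflops(clock_ghz, flops_per_cycle, processors)
    return 100.0 * measured_mflops / peak

# Serial results quoted in this post.
p4 = percent_of_peak(4831.38784, 2.8)         # 2.8 GHz P4, SSE2 library
opteron = percent_of_peak(3788.6042649, 2.2)  # 2.2 GHz Opteron, 64-bit library

print(f"P4 SSE2: {p4:.1f}% of peak")
print(f"Opteron: {opteron:.1f}% of peak")
```

With that assumption the peaks come out at 5600 and 4400 MFlops, and both runs land a little above 86%.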
Let's see what the parallel scaling looks like:
% setenv PARALLEL 2
% prog
3000 7237.128577
Not too bad considering this was on a two-processor machine, so we're fighting the last-processor problem a bit. Let's see how she goes on a 4-way box ...
% setenv PARALLEL 4
% prog
3000 13987.367615136576
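To put numbers on the scaling, here is a small sketch that computes speedup and parallel efficiency from the MFlops figures above. It uses the serial Opteron run as the baseline for both parallel runs, which assumes the 4-way box has the same per-processor speed as the 2-way machine. The function name `scaling` is hypothetical.

```python
def scaling(serial_mflops, parallel_mflops, threads):
    """Return (speedup, efficiency) relative to the serial rate."""
    speedup = parallel_mflops / serial_mflops
    efficiency = speedup / threads
    return speedup, efficiency

serial = 3788.6042649  # serial Opteron run, used as the baseline (assumption)

runs = {2: 7237.128577, 4: 13987.367615136576}
for threads, mflops in runs.items():
    speedup, efficiency = scaling(serial, mflops, threads)
    print(f"PARALLEL={threads}: speedup {speedup:.2f}, "
          f"efficiency {100.0 * efficiency:.1f}%")
```

On these figures the speedup is about 1.91 on two processors and about 3.69 on four, or roughly 96% and 92% efficiency.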
Wow, a 14 GFlop DGEMM on a commodity box. If Neil Lincoln and the guys at ETA could see me now!
Everyone knows that benchmarking and publishing of performance numbers is voodoo at best. I can always find a test case where my library beats Vendor X's library and vice versa. The direction we've been moving lately is application-based benchmarking. As time permits, I'll post some discussion of a unique way we've started collecting application data and how we're using it to better our product.
Thanks for reading this far.
Posted at 06:21PM Aug 23, 2005 by hinkthink in General

