Paul Hinker's Weblog

Tuesday Apr 28, 2009

Double complex matrix-matrix multiply (ZGEMM) on the cheap

Here's a trick that works surprisingly well if you've already got a good DGEMM (double precision matrix-matrix multiply) implementation but haven't had the time to tune the ZGEMM (double precision complex matrix-matrix multiply). Fortran 95 array notation makes this very clean and performance is surprisingly good.

      subroutine pp_zgemm_nn(m,n,k,alpha,a,lda,b,ldb,c,ldc)
      implicit none
      integer, intent(in)                              :: m,n,k
      integer, intent(in)                              :: lda,ldb,ldc
! The complex arrays are viewed as real*8: real parts sit in the
! even rows and imaginary parts in the odd rows, so each leading
! dimension is doubled.
      real*8, intent(in), dimension(0:2*lda-1,0:k-1)    :: a
      real*8, intent(in), dimension(0:2*ldb-1,0:n-1)    :: b
      real*8, intent(inout), dimension(0:2*ldc-1,0:n-1) :: c
      real*8, intent(in) :: alpha(2)

! Real work arrays: split operands and DGEMM results
      real*8, dimension(0:m-1,0:k-1) :: Ar, Ai
      real*8, dimension(0:k-1,0:n-1) :: Br, Bi
      real*8, dimension(0:m-1,0:n-1) :: Tr, Ti

! Split A into real and imaginary parts, folding in alpha
      Ar = A(0:2*m-1:2,0:k-1)*alpha(1)-A(1:2*m-1:2,0:k-1)*alpha(2)
      Ai = A(1:2*m-1:2,0:k-1)*alpha(1)+A(0:2*m-1:2,0:k-1)*alpha(2)
! Split B into real and imaginary parts
      Br = B(0:2*k-1:2,0:n-1)
      Bi = B(1:2*k-1:2,0:n-1)

! Real part of C: add Ar*Br - Ai*Bi
      call pp_dgemm('n','n',m,n,k,1.0d0,ar,m,br,k,0.0d0,Tr,m)
      call pp_dgemm('n','n',m,n,k,1.0d0,ai,m,bi,k,0.0d0,Ti,m)

      C(0:2*m-1:2,0:n-1) = C(0:2*m-1:2,0:n-1) + (Tr - Ti)

! Imaginary part of C: add Ar*Bi + Ai*Br
      call pp_dgemm('n','n',m,n,k,1.0d0,ar,m,bi,k,0.0d0,Tr,m)
      call pp_dgemm('n','n',m,n,k,1.0d0,ai,m,br,k,0.0d0,Ti,m)

      C(1:2*m-1:2,0:n-1) = C(1:2*m-1:2,0:n-1) + (Tr + Ti)

      return
      end
Of course, there are a few more details that need to be handled. This only does the NN case (transa='N' and transb='N') and the scaling of C by beta is handled in an upper level driver. You need versions for all the variations (NT,TN,TT,CN, etc).

There are some pros and cons with this approach. On the upside, you get a fast ZGEMM almost for free; if your DGEMM version is parallelized, you get that for free as well. Performance is good for matrix sizes of 1000x1000 and larger. On the downside, there's lots of temporary space used: the six real work arrays (Ar, Ai, Br, Bi, Tr, Ti) together hold 2(mk + kn + mn) doubles, which is as much storage as the three complex matrices themselves. If you block this for cache, you'll need to be careful that you parallelize at this level and not expect your DGEMM to do it.
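
The arithmetic behind the routine is easy to check away from Fortran. What follows is a minimal sketch in plain Python, not the library code: the matmul helper is a stand-in for DGEMM, and the function names and test matrices are made up for illustration. It splits the operands the same way, forms the four real products, and compares the result against direct complex arithmetic.

```python
def matmul(a, b):
    """Plain real matrix product on lists of rows; a stand-in for DGEMM."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def zgemm_nn(alpha, a, b, c):
    """Return c + alpha*a*b for complex matrices using four real products."""
    # Fold alpha into A while splitting it, as the Fortran routine does.
    ar = [[(alpha * z).real for z in row] for row in a]
    ai = [[(alpha * z).imag for z in row] for row in a]
    br = [[z.real for z in row] for row in b]
    bi = [[z.imag for z in row] for row in b]

    rr, ii = matmul(ar, br), matmul(ai, bi)  # real part is rr - ii
    ri, ir = matmul(ar, bi), matmul(ai, br)  # imaginary part is ri + ir

    rows, cols = len(c), len(c[0])
    return [[c[i][j] + complex(rr[i][j] - ii[i][j], ri[i][j] + ir[i][j])
             for j in range(cols)] for i in range(rows)]


def zgemm_reference(alpha, a, b, c):
    """Direct complex arithmetic, used only to check the split version."""
    inner = len(b)
    return [[c[i][j] + alpha * sum(a[i][p] * b[p][j] for p in range(inner))
             for j in range(len(c[0]))] for i in range(len(c))]


# Small made-up test case
A = [[1 + 2j, 3 - 1j], [0 + 1j, 2 + 2j]]
B = [[2 - 1j, 1 + 1j], [4 + 0j, -1 + 3j]]
C = [[1 + 1j, 0j], [0j, 1 - 1j]]
ALPHA = 0.5 - 2j

split = zgemm_nn(ALPHA, A, B, C)
direct = zgemm_reference(ALPHA, A, B, C)
max_err = max(abs(s - d)
              for row_s, row_d in zip(split, direct)
              for s, d in zip(row_s, row_d))
print("largest difference:", max_err)
```

The four matmul calls correspond one-to-one with the four pp_dgemm calls in the Fortran version, which is where a tuned (and possibly parallel) DGEMM does all the heavy lifting.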

Wednesday Feb 04, 2009

Careful when benchmarking various BLAS and LAPACK libraries

Something that I've noticed while benchmarking various BLAS and LAPACK libraries is the implementation decision by the various vendors as to the number of processors that are utilized by default. There are a number of different versions of these popular libraries but the most widely used ones are ACML (from AMD), ATLAS (from sourceforge), MKL (from Intel), and the Sun Performance Library (from Sun).

All these implementations save for MKL take the conservative approach of using a single processor by default. MKL checks the number of available processors in the system and uses that number by default. Unfortunately, this causes some confusion for users, since many are not familiar with the defaults of the various libraries.

I've received a number of e-mails from users who were concerned that it appeared that the Sun Performance Library was being outperformed by MKL by a factor of 2 or 4 or 8. Investigation usually results in finding out that the user is running some simple benchmark and leaving all the default settings causing MKL to use 2 or 4 or 8 processors in the system while Perflib is using only one. I received a similar message just this morning from a user asking about the same behavior whilst using ACML vs MKL on an X4100M2 (a server based on AMD chips).

Conversely, I've also run into cases where this decision has been detrimental to MKL performance since the opportunistic nature of the MKL implementation caused oversubscription of the machine(s) being benchmarked. One notable example came up when a user was running a distributed application which called the ScaLAPACK library. The nodes in the cluster all had 4 processors and there were 4 nodes in the cluster. The user (correctly, in my opinion) specified a processing grid of 4 x 4 processors. The MKL linked application appeared to run at half the speed of the application linked to either ACML or Sun Performance Library. It even underperformed the application linked to the ATLAS library.
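
The oversubscription in that example is simple arithmetic. Here is a small sketch of it in Python; it assumes the 16 processes of the 4 x 4 grid are spread evenly over the 4 nodes, which the story implies but I haven't verified, and the function name is made up.

```python
def threads_per_node(grid_rows, grid_cols, nodes, threads_per_process):
    """Threads competing for one node, assuming processes are spread evenly."""
    processes = grid_rows * grid_cols
    processes_per_node = processes // nodes
    return processes_per_node * threads_per_process


PROCESSORS_PER_NODE = 4

# A library that defaults to one thread per process
single = threads_per_node(4, 4, 4, 1)
# A library that starts one thread per processor it finds on the node
greedy = threads_per_node(4, 4, 4, PROCESSORS_PER_NODE)

print("threads per processor, single-threaded default:",
      single / PROCESSORS_PER_NODE)
print("threads per processor, all-processors default:",
      greedy / PROCESSORS_PER_NODE)
```

With the single-threaded default every processor gets exactly one thread. With the all-processors default each node ends up with four threads per processor, all fighting over the same hardware.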

This issue has been a source of discussion from time to time over the years in staff meetings. Sun comes from a 'big iron' super computing background and being a good system citizen in a shared environment is an important feature. I don't know what conversations happen at the MKL staff meetings but since Intel comes from more of a single-user background, where a dedicated machine is the norm, perhaps this explains the difference in implementation.

At any rate, it's important to know and understand these subtle implementation details when benchmarking various machines and the variety of linear algebra offerings so that you get a clear understanding of the performance you can expect in a production environment.

Since all the libraries mentioned are OMP based, the number of processors used can be controlled using the OMP_NUM_THREADS environment variable. If you don't want to keep track of the default behavior of each of these libraries, it's probably prudent to specifically control the number of threads used in all cases.

For example, in the C shell, to use a single thread:

% setenv OMP_NUM_THREADS 1

In the korn shell or bash :

% export OMP_NUM_THREADS=1
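
If the benchmark runs are driven from a script, the same setting can be made per run instead of per shell. This is a minimal sketch in Python, with a made-up helper name; the probe command is only a stand-in for a real benchmark binary and simply reports the value it inherited.

```python
import os
import subprocess
import sys


def run_with_threads(command, num_threads):
    """Run command with OMP_NUM_THREADS set explicitly and return its stdout."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(num_threads)
    finished = subprocess.run(command, env=env, capture_output=True,
                              text=True, check=True)
    return finished.stdout


# Stand-in for a benchmark executable: a child process that prints the setting
probe = [sys.executable, "-c",
         "import os; print(os.environ['OMP_NUM_THREADS'])"]

for count in (1, 2, 4):
    print(count, "->", run_with_threads(probe, count).strip())
```

Setting the variable in the child's environment keeps each run explicit about its thread count, whatever the library's default happens to be.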

Thursday Nov 20, 2008

More on Custom Libraries

I've been proposing the idea of custom libraries within the Sun Studio product group and the Sun Performance Library group for years. In a previous blog I wrote about some of the benefits this idea could bring to both developers and to people like me who need to make decisions concerning where to apply (ever scarcer) tuning resources. I've been using an in-house version of a tool that can create custom libraries from a variety of sources. I thought I would show a couple examples of its usage:

Custom Lib from an object file

One of the most straightforward examples is the case of an object file that has external references to Perflib.

% cat main.f
      program driver
      implicit none
  
      integer, parameter    :: n = 100, incx = 1, incy = 1
      real(8), dimension(N) :: x, y, result
      real(8)               :: ddot, alpha = 1.0

      x = 2.0
      y = 1.2
 
      call daxpy(n, alpha, x, incx, y, incy)
      print *, "Sum(daxpy) = ", sum(y)
      end
Here's a very simple main that calls the double precision AXPY from the Sun Performance Library. Looking at the nm output for the .o file we can see that there's a single undefined external reference to Perflib (of course there are additional references to the Fortran runtime).
% nm main.o | grep -i undef
[28]    |             0|           0|FUNC |GLOB |0    |UNDEF  |__f90_eslw
[19]    |             0|           0|FUNC |GLOB |0    |UNDEF  |__f90_init
[26]    |             0|           0|FUNC |GLOB |0    |UNDEF  |__f90_slw_ch
[27]    |             0|           0|FUNC |GLOB |0    |UNDEF  |__f90_slw_r8
[25]    |             0|           0|FUNC |GLOB |0    |UNDEF  |__f90_sslw
[24]    |             0|           0|FUNC |GLOB |0    |UNDEF  |daxpy_
[18]    |             0|           0|FUNC |GLOB |0    |UNDEF  |f90_init
The object file can now be used as a target to generate a custom library:
% createCustom main.o -lib /opt/SunStudioExpress/prod/lib/amd64/libsunperf.a
Created libCustom.a
Created libCustom.so
The tool creates a static and shared version of the library that contains only the routines either directly referenced by the .o file or routines that support those directly referenced routines. There are a number of benefits to this. One is that the resulting custom library is considerably smaller than the entire Performance Library :
% ls -l libCustom*
-rw-r--r--   1 hinker   staff      14812 Nov 20 13:15 libCustom.a
-rwxr-xr-x   1 hinker   staff      12288 Nov 20 13:15 libCustom.so
% ls -l /opt/SunStudioExpress/prod/lib/amd64/libsunperf.*
-rwxr-xr-x   1 root     sys      28289228 Oct 30 15:31 /opt/SunStudioExpress/prod/lib/libsunperf.a
-rwxr-xr-x   1 root     sys      20747884 Oct 30 15:31 /opt/SunStudioExpress/prod/lib/libsunperf.so.3
Of course, this is a simple example but it's not uncommon to see a custom library that's 10% of the size of the full library.
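
The selection rule described above (routines referenced directly by the target plus the routines that support them) amounts to a transitive closure over a dependency map. The sketch below is in Python and is not how createCustom is implemented; the dependency table and the kernel routine names in it are entirely hypothetical, and only the nm line format is taken from the listing above.

```python
def undefined_symbols(nm_output):
    """Collect symbol names from nm lines whose section column is UNDEF."""
    names = []
    for line in nm_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) == 8 and fields[6] == "UNDEF":
            names.append(fields[7])
    return names


def closure(roots, depends_on):
    """Roots found in the library plus everything they reach in depends_on."""
    keep, pending = set(), list(roots)
    while pending:
        name = pending.pop()
        # Skip symbols already kept and symbols the library does not provide
        # (for example, Fortran runtime entry points).
        if name in keep or name not in depends_on:
            continue
        keep.add(name)
        pending.extend(depends_on[name])
    return keep


# Three lines in the format shown by nm above
NM_OUTPUT = """\
[27]    |             0|           0|FUNC |GLOB |0    |UNDEF  |__f90_slw_r8
[24]    |             0|           0|FUNC |GLOB |0    |UNDEF  |daxpy_
[18]    |             0|           0|FUNC |GLOB |0    |UNDEF  |f90_init
"""

# Hypothetical library contents: routine -> routines it calls
DEPENDS_ON = {
    "daxpy_": ["daxpy_kernel_"],
    "daxpy_kernel_": [],
    "dgemm_": ["dgemm_kernel_", "xerbla_"],
    "dgemm_kernel_": [],
    "xerbla_": [],
}

wanted = undefined_symbols(NM_OUTPUT)
print(sorted(closure(wanted, DEPENDS_ON)))
```

Everything outside the closure (dgemm_ and its helpers in this made-up table) is left out of the custom library, which is where the size reduction comes from.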

Another nice feature is that a custom library can be generated using a variety of targets:

* Object files
* Dynamically linked executables
* Shared libraries
* Archive libraries

The custom shared library that's produced can be used as a drop-in replacement for the existing libsunperf.so without requiring the executable to be relinked.

We hope to roll out the tool with an upcoming release to see how it's received by our users.

Tuesday Nov 18, 2008

Building Cryptopp version 5.5.2 on openSolaris using Studio 11.08 Express

Now that I've gotten an openSolaris VM running and installed the latest Sun Studio compilers and tools, I'm back on the project of compiling some of the more popular open source packages using my setup.

The Cryptopp library is a free C++ class library of cryptographic schemes. It contains a variety of algorithms. The source can be downloaded from www.cryptopp.com for a number of platforms.

According to the Platforms matrix on the Cryptopp page, the library has been built on Solaris using Sun Studio 11 and Sun Studio 12. The only note is that you should use the command "gmake CXX=CC". Let's give that a try:

The first thing you notice is a raft of warnings where the compiler is complaining about the aligned attribute not being supported. I sent some e-mail to the C++ compiler guys and they informed me that there was an outstanding request for enhancement concerning the support of the aligned attribute. If you're like me and you don't like all the warnings, you can add the -w flag to the CXXFLAGS in the GNUmakefile. One other message that comes up during compilation is the report that, due to complexity, some of the modules are compiled with reduced optimization.

Another issue that might come up if you're running an openSolaris guest VM is that the C++ compiler is quite memory hungry. I configured my VM with 1.5 GB of memory but there's a lot of swapping going on during compiles.

After a fair amount of time, the compile finishes successfully. You can run the validation tests :
% cryptest v

This results in a successful test run with all the tests passing. You can run the benchmark suite:
% cryptest b 30 2 > cryptopp.html

It ran for a while and output an impressive array of data. Note that this is a straight build that was run in an openSolaris guest VM running on an older 2.0 GHz Opteron CPU.

What this showed me was that the openSolaris folks' efforts to ease porting and building open source apps on openSolaris are coming along nicely. This useful library builds and works right out of the box with no fuss.

Next up is Xerces-C, a C++ XML parsing library from the Apache project.

Thursday Sep 25, 2008

openSolaris and VirtualBox 2.0

The latest buzz is about xVM, virtualization, and VirtualBox so I thought I would walk through a simple installation and a couple of performance runs to see how virtualization and HPC work together. Since I'm in the Performance Library Group and one of the libsunperf developers, that's where my interest lies.

You can check my previous post concerning getting openSolaris installed along with getting the latest Sun Studio Compilers and Tools.

I started out by grabbing the binary distribution of the 2.0.2 VirtualBox packages from here. I'm running openSolaris on a Sun W2100 which sports a dual core 2.0 GHz AMD Opteron processor. It's an older chip so I can't run 64-bit guest OSes but I'll take it around the block using a couple of 32-bit guests.

Once downloaded, the packages install in a straightforward manner. The included ReadMe.txt gives instructions for checking for the vbi module and installing it if it's not there. The vbi module is included in the VB distribution. The sequence of commands I used is:

% pkgadd -G -d VirtualBoxKern-2.0.2-SunOS-r36488.pkg

% pkgadd -d VirtualBox-2.0.2-SunOS-amd64-r36488.pkg

The default installation puts the VirtualBox libraries in /opt/VirtualBox but links are created in /usr/bin. I started it up from user space with :

% VirtualBox &

The first time you run it you'll be given the user agreement and the opportunity to register. Since I'm already registered I skipped this step but you'll be asked for a name and e-mail address. You'll then see the xVM control panel.


Go ahead and select 'New' to create a new VM and you'll get the 'Create New Virtual Machine' wizard:


The first screen of the wizard lets you pick a name for your VM and an OS type. There are lots of different choices including DOS, Windows, OS/2, and all sorts of flavors of Linux (including some generic Linux kernels not associated with a particular distribution). I'm going to install Ubuntu so that's what I named and selected.


The next screen lets you select a base memory size that the VM will use when running. Depending on which OS variant you selected, the wizard will suggest something but you're free to disregard the recommendation. My machine has 2 GB of RAM so I changed the recommended 256 MB to 512 MB since I envision running a number of VMs simultaneously.


Next, you need to set aside some space for the hard disk image. If you already have VM images on your hard drive, these will show up in the drop down menu under Boot Hard Disk (Primary Master). This is my first run so there are none listed. If you have images on other disks, you can select them by choosing the 'Existing' button and navigating to them. I selected 'New' to create a new image. This starts up another wizard to help you create the disk image for the VM.


I selected 'Dynamically expanding image' since I don't know how big this VM is likely to become but you can also choose to create a fixed-size image. The fixed-size image takes longer to create but is supposed to improve performance in some cases. You're asked to specify an image size and again, depending on the OS type you picked earlier, a suggested image size will be the default. For the Ubuntu distribution, the suggested size is 8 GB but I increased it to 12 GB just for good measure.


You'll notice too that the image file name matches the name you chose for the VM earlier in the installation. It lets you verify the parameters you selected before it actually creates the image.


Now you're back to the original wizard and finally allowed to create the actual disk image. You see that the primary master is now set to the disk image we just worked to create.


Again we get a window with the parameters we've selected so far and a chance to go back and change things or 'Finish' the creation of the virtual machine.


Now the main window will change by having your new VM appear in the left panel. Also, you'll notice that the Settings, Delete, and Start buttons have become active. On the right, you'll see the various details of the currently selected VM (in the left panel).


I wanted to install my ubuntu VM from CD so I first selected 'Settings' from the main VirtualBox panel. You'll see the general settings for the selected VM including the Name, OS Type, Memory Size, and Video Memory Size.


Select the CD/DVD-ROM item from the left panel to get the control panel.


Here you can see that you can mount a CD/DVD Drive. You can select either the Host CD/DVD Drive (i.e. the one attached to the current machine), or you can select an ISO image. If you have a .iso of the distribution you want to install on a local hard disk, you can install into the VM from there. One minor detail when booting from a .iso file: you need to select the checkbox for Mount CD/DVD Drive and then select the ISO Image button. This will activate the folder icon to the right of the ISO Image File drop-down so that you can navigate to the .iso file. For some reason it took me a while to figure this out. From this screen you can add or remove .iso images.


The passthrough button has to do with whether or not you'll want to write to the host CD/DVD from the VM. The passthrough enables you to write ATAPI commands directly to the drive from the VM. I did not select that option.

Close that panel by selecting 'ok' and you'll see that the CD/DVD-ROM entry in the right panel of the main display now indicates that the host drive has been selected.

We're ready to power up our VM so select the 'Start' button from the main panel. A window will come up and you'll see a quick Sun splash screen followed by a window that explains that the VM is capturing the keyboard and how you can get it to stop capturing the keyboard. The first time you mouse click in the VM window, you'll get a similar message concerning the capturing of the mouse.


From this point, the guest OS works much like it would if you had installed it as the primary OS. There are lots of configuration issues concerning things like running the guest in a window or as the whole screen, configuring USB, and networking. My next entry will talk about using the Performance Library in a guest OS and some of the details involved.

Thursday Aug 21, 2008

Building open source apps with Studio Express (July 08) on OpenSolaris

I recently installed the latest openSolaris distribution on my home machine so that I could chase an issue a user was having. The first thing I was struck by was the ease of installation and configuration when compared with traditional Solaris. I downloaded the openSolaris LiveCD, burned it to a DVD (although it fits on a CD) and fired it up. The mini-kernel that comes up recognized the external USB drive I had attached and I was able to install onto it. The first thing I tried was to install the OS on a USB thumb drive. While the mini-root recognized a pre-formatted USB thumb drive, the installation seemed to hang during the 'Initializing Drive for installation' step.

The installation from the LiveCD is fairly bare bones so the first thing you'll want to do is install some of the development packages. Since I'm interested in the Sun Studio tools, I downloaded the Sun Studio Development cluster.

myhost% pkg install ss-dev

This will install the packages in the /opt/SUNWspro tree so you'll need to adjust your paths to find the compilers. Alternatively, you can use the pfexec command to install the software in alternate places.

This takes a bit of time since there's a little north of 500 MB in the package but it's all good stuff.

There's been some recent internal interest in the GLPK (GNU Linear Programming Kit) package so I grabbed that to see if it would compile out-of-the-box. Like so many of these open source packages, GLPK depends on something else. In this case, GMP (GNU MP Library) is required so I pulled that over as well.

The GMP source distribution includes the usual configure script which can be modified with the usual list of environment variables. Here are the ones I set before running configure :

myhost% CC="cc -m64" ; export CC
myhost% CXX="CC -m64" ; export CXX
myhost% F77="f90 -m64" ; export F77
myhost% CFLAGS="-xO3" ; export CFLAGS
myhost% CXXFLAGS="-fast" ; export CXXFLAGS
myhost% FFLAGS="-fast" ; export FFLAGS
myhost% LDFLAGS="-L/usr/lib -R/usr/lib" ; export LDFLAGS
myhost% configure --prefix /usr
myhost% make
myhost% make check
myhost% make install
If you follow the same steps you'll get warnings from gmp-impl.h about unknown attributes but the compile completes successfully and the tests run correctly. The above will build the GMP libraries for 64-bit addressing. I install the libraries in /usr/lib since other configure scripts usually look there for installed libraries.

Now on to the glpk directory. The same environment variables work fine for configuring GLPK so once the source is un-tarred :

myhost% configure --prefix /usr
myhost% make
myhost% make check
myhost% make install
During the configure you'll notice that the gmp library is discovered and paths are set up to link it in during the GLPK build.

I did find a C compiler issue during the building of GMP (which explains the -xO3 instead of more aggressive optimization like -fast) so this was a useful exercise. Later I'll go through the building of SuiteSparse (University of Florida's Tim Davis' sparse package), Xerces-C, Cryptopp, and Octave 3.0.1, some of which need a little more hand holding during the configure and build.

Thursday Aug 14, 2008

Call compatibility of Linux Fortran Compilers

I was recently asked by some of the guys doing benchmark work for a Linux version of the Sun Performance Library. We've been producing this library for a couple of releases now (since Sun Studio 10) and I pointed Michael Burke to our latest release (the most recent Express release that just went live and can be downloaded from here).

Michael started getting some very good performance improvement which was encouraging. The benchmarks in question have the capability of selecting a number of libraries from which to get some common routines. For example, an environment variable can be set to indicate which shared library the application will open to get its BLAS and LAPACK routines. What I wasn't told to begin with was that the application was compiled with the Intel compilers (ifort). Once I learned that, we had to back out the announcement we were going to make concerning the nice performance improvements. Why? Well, it turns out that Sun Studio (and gfortran) follow the AMD64 ABI in terms of dealing with functions which return complex values. Intel's Fortran compiler (ifort) does not.

We can see how the two methods are different using a simple example:

      program driver
      implicit none

      real a
      complex foo, ca

      a = 1.0
      ca = foo(a)

      print *, ca
      end

myhost% g77 -S driver.f
myhost% cat driver.s
[output edited for brevity]
.LCFI2:
 1      movl    $0x3f800000, %eax
 2      movl    %eax, -4(%rbp)
 3      leaq    -4(%rbp), %rsi
 4      leaq    -20(%rbp), %rdi
 5      movl    $0, %eax
 6      call    foo_
 7      movl    -20(%rbp), %eax
 8      movl    %eax, -12(%rbp)
 9      movl    -16(%rbp), %eax
10      movl    %eax, -8(%rbp)
So, lines 1-3 are straightforward, but notice that the parameter 'a' is being put in %rsi, which is the second parameter register. The address -20(%rbp) is being placed into %rdi, which is the first parameter register. This turns out to be the address of a complex 'structure' (or a 2-element real array, if you will). Notice that after the function 'foo' returns, the structure's real and imaginary parts, at -20(%rbp) and -16(%rbp) respectively, are read into local stack space.

Let's take the same example and run it through the Sun Studio compiler to see what we get:

myhost% f95 -S driver.f
myhost% cat driver.s
[output edited for brevity]
.L9:

 1      movl    .L_cseg0, %eax
 2      movl    %eax, .XBsSBMHnZFpIm2p.MAIN.a

 3      movq    $.XBsSBMHnZFpIm2p.MAIN.a+0, %rdi
 4      movl    $0, %eax
 5      call    foo_
 6      movsd   %xmm0, -40(%rbp)
 7      movss   -40(%rbp), %xmm0
 8      movss   -36(%rbp), %xmm1
 9      movss   %xmm0, .XBsSBMHnZFpIm2p.MAIN.ca
10      movss   %xmm1, .XBsSBMHnZFpIm2p.MAIN.ca+4

Here the parameter 'a' is placed in the first parameter register (%rdi) (line 3). The routine is called and we can see that on return, the complex value comes back in the %xmm0 register (line 6). Finally, let's look at gfortran :
myhost% gfortran -S driver.f
myhost% cat driver.s
[output edited for brevity]
 1      movl    $0x3f800000, %eax
 2      movl    %eax, -12(%rbp)
 3      leaq    -12(%rbp), %rdi
 4      call    foo_
 5      movd    %xmm0, %rax
 6      movq    %rax, -424(%rbp)
 7      movl    -424(%rbp), %eax
 8      movl    -420(%rbp), %edx

Again, the parameter 'a' is passed in the first parameter register (%rdi) [lines 2-3] and the function value is returned in the %xmm0 register [line 5]. So, there is call compatibility between the Sun Studio compiler and the gfortran compiler. This is very useful since it means that applications compiled with one compiler can use libraries created by the other. From my experiments, the handling of complex function return values is the only calling incompatibility I've found among these Linux compilers. I have successfully run the entire LAPACK 3.1.1 test suite compiled with gfortran and linked to the Sun Studio compiled Sun Performance Library.

In terms of benchmarking, this compatibility is handy since I'm able to use the Sun Studio compiler to compile my timer drivers and link to various vendors' libraries (ACML, MKL, Sun Performance Library, ATLAS, GotoBLAS, etc.) without needing to install and maintain all the various compilers. What's unfortunate is that there is the incompatibility for functions which return complex values.
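
Two details in the listings above can be checked without a compiler. This is a small sketch using Python's struct module, with made-up sample values: it confirms that the constant loaded for 'a' is the bit pattern of 1.0, and that a default COMPLEX is 8 bytes with the real part at offset 0 and the imaginary part at offset 4. That layout matches the single 8-byte store from %xmm0 followed by the two 4-byte loads in the Sun Studio listing.

```python
import struct

# 0x3f800000 is the IEEE-754 single precision bit pattern stored for a = 1.0
one = struct.unpack("<f", struct.pack("<I", 0x3F800000))[0]
print("0x3f800000 as a float:", one)

# A default COMPLEX is two 4-byte reals laid out back to back
packed = struct.pack("<ff", 1.5, -2.0)   # made-up real and imaginary parts
print("bytes in a single precision complex:", len(packed))

# Real part at offset 0, imaginary part 4 bytes later, as in -40(%rbp)
# and -36(%rbp) in the Sun Studio output
real_part = struct.unpack_from("<f", packed, 0)[0]
imag_part = struct.unpack_from("<f", packed, 4)[0]
print("real:", real_part, "imag:", imag_part)
```

Because the whole value fits in 8 bytes, a callee can hand it back in the low half of %xmm0. A caller built for the other convention never looks there; it expects the callee to have written through a hidden pointer argument, which is why mixing the two conventions gives wrong answers.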

Friday Apr 20, 2007

Habitat Workgroup sets record

A group of Sun employees from the Broomfield campus took the day off on Wednesday (April 18th) to help out with construction on a couple of houses being built by Habitat for Humanity in Longmont, Colorado. Jeff Cheeney was the one who coordinated the sign-up (and the all important passing out of t-shirts!). He and Brad Keiser were team supervisors and in charge of cracking the whip to keep everyone going. Sun showed up with 19 workers, and American Family had a good sized group, as did some smaller groups from other local businesses. The Wednesday 'irregulars' were there as well for a total of 47 people onsite which we were told (by Carey McClure, Project Director) was a Wednesday record for the St. Vrain Valley chapter.

The location was Carriage Drive which is just east of Highway 287 and just north of Pike Road in Longmont. There are currently 4 houses in various stages of completion (from a hole in the ground to touch up paint and trim being applied) so there were a wide variety of projects going on.

I saw some familiar faces including Lisa Week, Diana Wadding, and Brad Keiser so UBRM-05 was well represented. I got involved in setting a stairway, installing truss baffles, building a ramp, firring out window openings, running a pickaxe & wheelbarrow (something I'm licensed to do, btw), and wrapping one of the houses in Tyvek. I know some of the other folks were busy trimming doors, patching drywall, installing baseboard and various other odd jobs necessary when building a house.

I was impressed by Dan (the site foreman) and his ability to keep all the balls in the air and everyone busy. All in all it was a satisfying day and I plan on participating again because it gets me off my (ever widening) posterior and helps me appreciate my own blessings. I would encourage everyone to give it a try.

Monday Nov 13, 2006

Using F95 interfaces to customize access to the Sun Performance Library

Porting dusty deck Fortran source can be an exercise in patience and conditional compilation. An application which needs to run in the ILP-32, LP-64, and ILP-64 models faces the problem of interfacing with external libraries seamlessly.

Using the BLAS AXPY family of routines (caxpy, daxpy, saxpy, and zaxpy) as an example:

ILP-32 interface

SUBROUTINE saxpy(N, ALPHA, X, INCX, Y, INCY)
INTEGER*4 :: N, INCX, INCY
REAL*4    :: ALPHA, X(*), Y(*)

LP-64 interface

SUBROUTINE saxpy(N, ALPHA, X, INCX, Y, INCY)
INTEGER*4 :: N, INCX, INCY
REAL*4    :: ALPHA, X(*), Y(*)

ILP-64 interface 

SUBROUTINE saxpy(N, ALPHA, X, INCX, Y, INCY)
INTEGER*8 :: N, INCX, INCY
REAL*4    :: ALPHA, X(*), Y(*)

ILP-64 interface (strict Fortran type adherence)

SUBROUTINE saxpy(N, ALPHA, X, INCX, Y, INCY)
INTEGER*8 :: N, INCX, INCY
REAL*8    :: ALPHA, X(*), Y(*)

Strict Fortran adherence means that INTEGER and REAL data types have identical bit-width. This flies in the face of the strong implication in the BLAS documentation that single precision (i.e. the 's' prefixed routines) expect 32-bit floating point data.

The ILP-64 interfaces to the Sun Performance Library routines are suffixed by _64 to distinguish them from routines by the same name which expect 32-bit integers but run in an LP-64 model. Some older Cray source codes follow a strict adherence to the Fortran Language specification which requires INTEGER and REAL data types to have the same bit-width and so, expect the floating point data sent to the 's' prefixed routines to be 64-bit.

There are a variety of ways to handle this. First, a purely brute force approach of manually editing the source with conditional compilation code:


#ifdef ILP64
   call saxpy_64(n,alpha,x,incx,y,incy)
#else
   call saxpy(n,alpha,x,incx,y,incy)
#endif

Doing this to an application which consists of thousands of files and millions of lines of source is a waste of engineering time. The same can be done with an awk, sed, or perl script but there are 1700+ routines in the Performance Library and even scripting the process will be time consuming and error prone.
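
To give a feel for the scripted approach and where it gets fragile, here is a minimal sketch in Python rather than awk, sed, or perl. The two-routine list and the function name are illustrative only; a real script would need all 1700+ names.

```python
import re

# Illustrative subset; the real list would cover every library routine
ROUTINES = ("saxpy", "daxpy")

CALL = re.compile(
    r"(\bcall\s+)(" + "|".join(ROUTINES) + r")(\s*\()",
    re.IGNORECASE,
)


def add_64_suffix(source):
    """Append _64 to the routine name in matching CALL statements."""
    return CALL.sub(lambda m: m.group(1) + m.group(2) + "_64" + m.group(3),
                    source)


sample = """\
      call saxpy(n,alpha,x,incx,y,incy)
      CALL DAXPY (N, ALPHA, X, 1, Y, 1)
c     call saxpy(n,alpha,x,incx,y,incy)
"""
print(add_64_suffix(sample))
```

Note that the commented-out call on the last line of the sample gets rewritten too. A pattern like this also misses calls split across continuation lines and names passed as EXTERNAL arguments, and it only handles subroutines, not functions such as ddot. Those gaps are what I mean by error prone.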

Finally, the Fortran 95 generic interface functionality could be used to allow source to remain virtually unchanged and yet facilitate the use of each of the three programming models (ILP-32, LP-64, and ILP-64).

Let's take a very simple example which calls the BLAS saxpy routine:

% cat tst.f

      program tst saxpy
      implicit none

      integer, parameter :: N = 10, INCX = 1, INCY = 1
      real, parameter    :: ALPHA = 1.0
      real, dimension(N) :: X, Y

      X = 1.0 
      Y = 2.0
   
      call saxpy(N, ALPHA, X, INCX, Y, INCY)
      print *, SUM(Y)
      END

% f90 -o tst tst.f -xlic_lib=sunperf -xarch=[v8|v8plus|v8plusb|sse2]

When compiled with one of the ILP-32 architectures (v8, v8plus, v8plusb, sse2) the saxpy call resolves to the one expecting 32-bit integer and real parameters.

% f90 -o tst tst.f -xlic_lib=sunperf -xarch=[v9|v9b|amd64]

When compiled for one of the 64-bit architectures (v9, v9b, amd64), the saxpy call resolves to the LP-64 interface, which expects 32-bit integer and real parameters. This is for backward compatibility reasons. However, the 64-bit libraries provide additional entry points, so our source could look like the following:

      program tst saxpy
      implicit none

      integer, parameter   :: N = 10, INCX = 1, INCY = 1
      real, parameter      :: ALPHA = 1.0
      real, dimension(N)   :: X, Y

      X = 1.0 
      Y = 2.0
   
      call saxpy_64(N, ALPHA, X, INCX, Y, INCY)
      print *, SUM(Y)
      END

% f90 -o tst tst.f -xlic_lib=sunperf -xarch=[v9|v9b|amd64] -xtypemap=integer:64

Note: The default integer declarations have been changed to 8-byte integers using the -xtypemap compiler option.

Warning: The xtypemap option only applies to variables declared with a default kind. INTEGER :: N gets promoted to INTEGER*8 :: N, but INTEGER*4 :: N would not be promoted.

In theory, the xtypemap option is a great way to port to the ILP-64 model. In practice, it's not quite a silver bullet. As long as interfaces are clearly defined and strictly follow typing rules it works well.

An F95 interface can be used to describe the different programming models.


      module sunperf64
         interface saxpy
!
! ILP-32 and LP-64 interface
!
            subroutine saxpy(n,alpha,x,incx,y,incy)
               integer :: n, incx, incy
               real    :: alpha, x(*), y(*)
            end subroutine
!
! ILP-64 interface
!
            subroutine saxpy_64(n,alpha,x,incx,y,incy)
               integer * 8 :: n, incx, incy
               real        :: alpha, x(*), y(*)
            end subroutine
!
! ILP-64 interface w/strict Fortran typing
!
            subroutine daxpy_64(n,alpha,x,incx,y,incy)
               integer * 8 :: n, incx, incy
               real * 8    :: alpha, x(*), y(*)
            end subroutine
         end interface
      end module

This module file will allow a single call to the saxpy routine to be interpreted as any of the ILP-32, LP-64, ILP-64, ILP-64(strict) calling conventions.

      program tst saxpy
      use sunperf64
      implicit none

      integer, parameter   :: N = 10, INCX = 1, INCY = 1
      real, parameter      :: ALPHA = 1.0
      real, dimension(N)   :: X, Y

      X = 1.0 
      Y = 2.0
   
      call saxpy(N, ALPHA, X, INCX, Y, INCY)
      print *, SUM(Y)
      END

% f90 -o tst tst.f -xlic_lib=sunperf -xarch=[v8plus|v8plusb|sse2]
Would call the ILP-32 version 
% f90 -o tst tst.f -xlic_lib=sunperf -xarch=[v9|v9b|amd64] -xtypemap=integer:64
Would call the ILP-64 version
% f90 -o tst tst.f -xlic_lib=sunperf -xarch=[v9|v9b|amd64] -xtypemap=integer:64,real:64
Would call the ILP-64(strict) version
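
The way a generic interface picks a specific routine from the argument kinds can be tried without the Performance Library. The following is a minimal sketch: the module model_demo, the generic name model_of and its three specifics are hypothetical stand-ins for saxpy, saxpy_64 and daxpy_64, and each one only reports which programming model it represents.

```fortran
      module model_demo
      implicit none

      ! One generic name, three specifics that differ only in the
      ! kinds of their arguments (hypothetical stand-ins)
      interface model_of
         module procedure model_i4r4, model_i8r4, model_i8r8
      end interface

      contains

      function model_i4r4(n, alpha) result(name)
      integer*4, intent(in) :: n
      real*4, intent(in)    :: alpha
      character(len=16)     :: name
      name = 'ILP-32 / LP-64'
      end function model_i4r4

      function model_i8r4(n, alpha) result(name)
      integer*8, intent(in) :: n
      real*4, intent(in)    :: alpha
      character(len=16)     :: name
      name = 'ILP-64'
      end function model_i8r4

      function model_i8r8(n, alpha) result(name)
      integer*8, intent(in) :: n
      real*8, intent(in)    :: alpha
      character(len=16)     :: name
      name = 'ILP-64 (strict)'
      end function model_i8r8

      end module model_demo

      program tst_generic
      use model_demo
      implicit none

      integer, parameter :: N = 10
      real, parameter    :: ALPHA = 1.0

      ! The compiler picks the specific whose argument kinds match
      print *, model_of(N, ALPHA)
      end program tst_generic
```

The specifics in this sketch are declared with explicit sizes so that a type-mapping option cannot change them; only the default-kind N and ALPHA in the calling program are candidates for promotion, and that is what would steer the call to a different specific.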

Thursday Aug 31, 2006

More on making DTrace probes for Fortran

I expand a bit on writing DTrace probes for Fortran routines, using F95 interface blocks to clean up the calling sequence a little.

Monday Jun 19, 2006

Back in the saddle

An initial look at DTrace and Fortran.

Wednesday Jan 25, 2006

Creator 2 Now available

While normally I'm a Fortran/Assembly hack, I thought I would devote a little 'ink' to the release of Sun Java Studio Creator 2. This is a high quality development package that's free for the asking (and registration). You can register and get the software at:

http://developers.sun.com/jscreator

The website has had spotty availability due to demand but it's worth the effort to get this software.

From time to time I need to write a Java application to interface with a database or some web service, and Creator is a great way to quickly get up and running. In previous entries to this blog I've written about the Performance Library Interpose Library and Microtimers, which we use to benchmark Perflib against some of the common competitors. Creator is a great way to prototype and then write the database applications used to store, compare, and display the results of these timers.

If you are looking for a full featured, intuitive development suite, give Creator 2 a try. I think you'll like it.

Thursday Jan 19, 2006

Parallel Dual-Core AMD Performance

Last time I presented some serial performance for a dual-core AMD box (285 CPUs). Those numbers aren't especially interesting, so here are some parallel performance numbers.

Matrix Size   1 CPU     2 CPU     4 CPU      % of Peak   2 CPU Scaling   4 CPU Scaling
1000          4605      9076.68   17614.13   88.56%      98.55%          95.63%
1250          4661.55   9162.21   17764.15   89.65%      98.27%          95.27%
1500          4632.85   9128.44   17840.91   89.09%      98.52%          96.27%
1750          4647.37   9132.16   17990.68   89.37%      98.25%          96.78%
2000          4624.1    9149.64   18010.68   88.93%      98.93%          97.37%
2250          4642.03   9185.79   18015.37   89.27%      98.94%          97.02%
2500          4630.41   9148.34   18015.57   89.05%      98.79%          97.27%
2750          4641.52   9191.52   18006.34   89.26%      99.01%          96.99%
3000          4607.91   9120.19   17946.34   88.61%      98.96%          97.37%
3250          4646.25   9203.86   18112.48   89.35%      99.05%          97.46%
3500          4621.16   9169.04   17988.65   88.87%      99.21%          97.32%
3750          4628.75   9166.81   17995.61   89.01%      99.02%          97.19%
4000          4673.16   9277.18   18291.52   89.87%      99.26%          97.85%
4250          4628.35   9175.68   18077.82   89.01%      99.12%          97.65%
4500          4611.75   9135.38   18009.88   88.69%      99.04%          97.63%
4750          4634.89   9195.68   18090.37   89.13%      99.20%          97.58%
5000          4600.08   9103.6    18054.83   88.46%      98.95%          98.12%

Performance numbers are expressed in Mflops, and scaling is calculated as multi-core performance / (serial performance * number of CPUs used).
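
As a worked check of that formula, here is a small sketch that recomputes the percentages for the first row of the table (matrix size 1000). The module and program names are hypothetical. The peak of 5200 Mflops per core is an assumption (2.6 GHz times 2 floating-point operations per cycle); it is not stated in this entry, but it reproduces the table's 88.56%.

```fortran
      module scaling_mod
      implicit none
      contains

      ! Scaling = multi-core rate / (serial rate * CPUs used)
      pure function scaling(multi, serial, ncpus) result(s)
      real*8, intent(in)  :: multi, serial
      integer, intent(in) :: ncpus
      real*8              :: s
      s = multi / (serial * ncpus)
      end function scaling

      end module scaling_mod

      program tst_scaling
      use scaling_mod
      implicit none

      ! Mflops from the matrix size 1000 row of the table
      real*8, parameter :: cpu1 = 4605.0d0
      real*8, parameter :: cpu2 = 9076.68d0
      real*8, parameter :: cpu4 = 17614.13d0
      ! ASSUMPTION: 5200 Mflops peak per core (2.6 GHz * 2 flops/cycle)
      real*8, parameter :: peak = 5200.0d0

      print '(a,f6.2)', ' % of peak    : ', 100*cpu1/peak
      print '(a,f6.2)', ' 2 CPU scaling: ', 100*scaling(cpu2, cpu1, 2)
      print '(a,f6.2)', ' 4 CPU scaling: ', 100*scaling(cpu4, cpu1, 4)
      end program tst_scaling
```

The three values come out as 88.56, 98.55 and 95.63, matching the first row.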

Nice performance numbers, with nearly 90% of peak for the serial run and as much as 98% scaling to 4 CPUs. The above table concerns the double precision matrix multiply routine. As discussed previously in this blog, the DGEMM routine is probably one of the most heavily used routines in high performance computing, especially when solving dense systems. The 3 other 'flavors' of matrix multiply (single, complex, double complex) demonstrate similar performance and scaling.

Tuesday Jan 17, 2006

Dual-Core AMD performance of Perflib

I recently got some quality time on a Galaxy2 machine with 285 CPUs. This machine has two CPUs, each of which has two cores. The machine is clocked at 2.6 GHz and has 16 GB of RAM. I ran the Linpack benchmark using the Sun Studio 11 compiler. The Sun Performance Library comes packaged with the compiler collection and was used when building the Linpack executable. The results (in Mflops) are as follows:

Size     Sun Studio 11
1000     3343.33
2000     3815.24
3000     3942.67
4000     3983.08
5000     4024.29
6000     4029.99
7000     4060.43
8000     4065.5
9000     4052.7
10000    4027.53

These numbers are not great. After using the collect/analyzer tools (which, incidentally, also come packaged with the Sun Studio Compiler Collection), Brad Lewis was able to make some changes to the algorithm to increase performance.

Size     Sun Studio 11   Sun Studio 11u1
1000     3343.33         3714.81
2000     3815.24         4205.77
3000     3942.67         4320.86
4000     3983.08         4379.35
5000     4024.29         4423.52
6000     4029.99         4423.46
7000     4060.43         4467.19
8000     4065.5          4468.22
9000     4052.7          4463.48
10000    4027.53         4461.84

This shows a roughly 10% performance improvement, and the changes will be incorporated into an upcoming performance patch to the Sun Studio 11 compiler collection version of the Performance Library.
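
The size of the improvement can be read straight off the two columns above. As a quick sketch, with hypothetical module and program names, the following computes the percentage gain for three of the sizes (1000, 5000 and 10000) using the Mflops figures from the table.

```fortran
      module gain_mod
      implicit none
      contains

      ! Percentage improvement of the new rate over the old rate
      pure function gain_pct(old, new) result(g)
      real*8, intent(in) :: old, new
      real*8             :: g
      g = 100.0d0 * (new - old) / old
      end function gain_pct

      end module gain_mod

      program tst_gain
      use gain_mod
      implicit none

      ! Sizes and Mflops copied from the table (three rows only)
      integer :: sz(3)     = (/ 1000, 5000, 10000 /)
      real*8  :: before(3) = (/ 3343.33d0, 4024.29d0, 4027.53d0 /)
      real*8  :: after(3)  = (/ 3714.81d0, 4423.52d0, 4461.84d0 /)
      integer :: i

      do i = 1, 3
         print '(i6,f8.2,a)', sz(i), gain_pct(before(i), after(i)), '%'
      end do
      end program tst_gain
```

For these three rows the gains are about 11.1%, 9.9% and 10.8%.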

These are obviously serial performance numbers, which are interesting but do not address some of the questions one might have concerning how things scale on a machine that has multi-core processors. In upcoming entries, I'll present some of the multi-core numbers obtained on this machine and on a 4-CPU, dual-core AMD box.

Saturday Oct 29, 2005

The Sun Studio compiler collection and Linux

Sun has rolled out an early access version of the Sun Studio Compiler collection for Linux. Since GCC is the de facto standard compiler on Linux, I thought I would run some quick comparisons to see how the beta version of the Studio C compiler stacks up against GCC 3.4.3 on Solaris and Linux. I didn't choose these benchmarks; I was asked about them by another Sun engineer. I spent less than 5 minutes fiddling with compiler flags for the Studio compiler (using just -fast -xarch=amd64 whenever possible). The three benchmarks I looked at were SciMark2, uBench-0.32, and UnixBench-4.1.0. As I've blogged before, I'm not a fan of these 'do nothing' benchmarks.

The SciMark2 set of benchmarks is provided by the folks at NIST. It's apparently a Java benchmark, but there are C code versions of the tests which can be compiled and run natively. The Solaris/Sun Studio combination is the winner here by a small margin. The Linux/Sun Studio C compiler makes a good showing even though some of the runtime and include libraries aren't available yet (libmvec and libm.il).

The UnixBench application is also made up of a series of microbenchmarks which aim to test a variety of compiler and OS features. Here the beta Sun Studio C Compiler gives GCC a sound drubbing (~20% by composite score). I really know almost nothing about this benchmark but just by glancing through the output it appears to me that there's some scaling difference between the Linux and Solaris results so I didn't feel it an apt comparison to put all four variations on the same graph. Again, in both cases (Solaris and Linux), the Studio compiler produces the better results.

uBench (which for all I know is the precursor to UnixBench) claims to spawn 'about 2 concurrent processes for each CPU available on the system'. I'm a little unsure as to how one goes about spawning 'about' two processes, but I'll take the author (one Sergei Viznyuk according to the readme) at his word. The readme also states that the benchmark does floating-point and integer calculations at 'about' a 1:3 ratio. It also performs some 'rather senseless memory allocation and memory to memory copying operations'. The reader might now be starting to understand why I have such a low opinion of benchmarks of this ilk.

Going to the author's webpage I see that he says, "Please make sure you compile ubench using only -O2 or -O optimization flags. More aggressive optimizations tend to alter the semantics of the code and skew the results." Well, there you have it.