Paul Hinker's Weblog

Wednesday Feb 04, 2009

Careful when benchmarking various BLAS and LAPACK libraries

Something I've noticed while benchmarking various BLAS and LAPACK libraries is that the vendors have made different implementation decisions about how many processors are used by default. There are a number of different versions of these popular libraries, but the most widely used ones are ACML (from AMD), ATLAS (hosted on SourceForge), MKL (from Intel), and the Sun Performance Library (from Sun).

All these implementations save for MKL take the conservative approach of using a single processor by default. MKL checks the number of available processors in the system and uses that number by default. Unfortunately, this confuses users, since many are not familiar with how the various libraries behave.

I've received a number of e-mails from users who were concerned that the Sun Performance Library appeared to be outperformed by MKL by a factor of 2 or 4 or 8. Investigation usually reveals that the user is running some simple benchmark with all the default settings, so MKL uses 2 or 4 or 8 processors in the system while Perflib uses only one. I received a similar message just this morning from a user asking about the same behavior when comparing ACML with MKL on an X4100M2 (a server based on AMD chips).

Conversely, I've also run into cases where this decision has been detrimental to MKL performance, since the opportunistic nature of the MKL implementation caused oversubscription of the machine(s) being benchmarked. One notable example is a user who was running a distributed application which called the ScaLAPACK library. There were 4 nodes in the cluster and each node had 4 processors. The user (correctly, in my opinion) specified a processing grid of 4 x 4 processors. The MKL-linked application appeared to run at half the speed of the same application linked to either ACML or the Sun Performance Library. It even underperformed the application linked to the ATLAS library.
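The oversubscription in that ScaLAPACK case can be made concrete with a little arithmetic. The following is only a sketch: it assumes one process per processor (which is what a 4 x 4 grid on 4 four-processor nodes implies) and assumes each process's library calls start one thread per processor on the node. The function name threads_per_processor is hypothetical.

```shell
# Hypothetical helper: how many runnable threads compete for each
# processor on a single node.
# Arguments: processes per node, threads per process, processors per node.
threads_per_processor() {
    procs_on_node=$1
    threads_per_proc=$2
    cpus_on_node=$3
    echo $(( procs_on_node * threads_per_proc / cpus_on_node ))
}

# 4 processes per node, library limited to a single thread:
threads_per_processor 4 1 4   # prints 1

# Same grid, but every process starts 4 threads (assumed default):
threads_per_processor 4 4 4   # prints 4
```

Under those assumptions each node ends up with 16 threads contending for 4 processors, which is consistent with the kind of slowdown described above.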

This issue has come up for discussion from time to time over the years in staff meetings. Sun comes from a 'big iron' supercomputing background, and being a good system citizen in a shared environment is an important feature. I don't know what conversations happen at the MKL staff meetings, but since Intel comes from more of a single-user background, where a dedicated machine is the norm, perhaps this explains the difference in implementation.

At any rate, it's important to know and understand these subtle implementation details when benchmarking various machines and the variety of linear algebra offerings so that you get a clear understanding of the performance you can expect in a production environment.

Since all the libraries mentioned are OpenMP based, the number of processors used can be controlled with the OMP_NUM_THREADS environment variable. If you don't want to keep track of the default behavior of each of these libraries, it's probably prudent to set the number of threads explicitly in all cases.

For example, in the C shell, to use a single thread:

% setenv OMP_NUM_THREADS 1

In the Korn shell or bash:

% export OMP_NUM_THREADS=1
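When comparing libraries, it can help to set the thread count per run rather than once for the whole session, so that every library is measured at the same settings. Here is a minimal sketch for a Bourne-compatible shell (Korn shell or bash). The function name run_with_threads is hypothetical, and the sh -c command is only a stand-in for a real benchmark binary.

```shell
# run_with_threads N COMMAND...
# Run COMMAND with OMP_NUM_THREADS set to N for that command only,
# leaving the calling shell's environment untouched.
run_with_threads() {
    nthreads=$1
    shift
    env OMP_NUM_THREADS="$nthreads" "$@"
}

# Stand-in for a benchmark: report the thread count the child process sees.
for t in 1 2 4; do
    run_with_threads "$t" sh -c 'echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"'
done
```

Replacing the stand-in with the same benchmark linked against each library gives results at matching thread counts, whatever each library's default happens to be.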

Comments:

> Sun comes from a 'big iron' super computing background and being a good system citizen in a shared environment is an important feature.

That doesn't apply to all Sun products. For instance, Java, since 1.5, looks at the size of the system and by default starts with options adapted to it, which is an issue on systems shared with other users.

Posted by Marc on February 19, 2009 at 11:09 PM MST #

I didn't mean to imply that my statements were to apply to all Sun products. I was simply speculating on what may be a reason for the different implementation decision as it pertains to BLAS and LAPACK. Either decision is valid and each is 'right' under specific conditions. The post was merely pointing out that there is a difference and users will do well to keep that in mind.

Thanks for your comments.

Posted by Paul on February 19, 2009 at 11:30 PM MST #

And I didn't mean to sound like I was contradicting you, merely adding a not-so-important comment.

Interesting post by the way, there is too much broken benchmarking done nowadays.

Posted by Marc on February 20, 2009 at 01:53 PM MST #
