Friday Nov 20, 2009

Significance of Results

A Sun Blade 6048 chassis with 48 Sun Blade X6275 server modules ran benchmarks using the NAMD molecular dynamics applications software. NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD is driven by major trends in computing and structural biology and received a 2002 Gordon Bell Award.

  • The cluster of 32 Sun Blade X6275 server modules was 9.2x faster than the 512 processor configuration of the IBM BlueGene/L.

  • The cluster of 48 Sun Blade X6275 server modules exhibited excellent scalability for NAMD molecular dynamics simulation, up to 37.8x speedup for 48 blades relative to 1 blade.

  • For the largest molecule considered, the cluster of 48 Sun Blade X6275 server modules achieved a throughput of 0.028 seconds per simulation step.

Molecular dynamics simulation is important to biological and materials science research. Molecular dynamics is used to determine the low energy conformations or shapes of a molecule. These conformations are presumed to be the biologically active conformations.

Performance Landscape

The NAMD Performance Benchmarks web page plots the performance of NAMD when the ApoA1 benchmark is executed on a variety of clusters. The performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation, multiplied by the number of "processors" on which NAMD executes in parallel. The following table compares the performance of the Sun Blade X6275 cluster to several of the clusters for which performance is reported on the web page. In this table, the performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Cluster Name and Interconnect    Throughput (seconds per step)
                                 128 Cores    256 Cores    512 Cores
Sun Blade X6275 InfiniBand       0.014        0.0073       0.0048
Cambridge Xeon/3.0 InfiniPath    0.016        0.0088       0.0056
NCSA Xeon/2.33 InfiniBand        0.019        0.010        0.008
AMD Opteron/2.2 InfiniPath       0.025        0.015        0.008
IBM HPCx PWR4/1.7 Federation     0.039        0.021        0.013
SDSC IBM BlueGene/L MPI          0.108        0.061        0.044

The following tables report results for NAMD molecular dynamics using a cluster of Sun Blade X6275 server modules. The performance of the cluster is expressed in terms of the time in seconds that is required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Blades  Cores  STMV molecule (1)            f1 ATPase molecule (2)       ApoA1 molecule (3)
               Thruput      spdup  effi'cy  Thruput      spdup  effi'cy  Thruput      spdup  effi'cy
               (secs/step)                  (secs/step)                  (secs/step)
48      768    0.0277       37.8   79%      0.0075       35.2   73%      0.0039       22.2   46%
36      576    0.0324       32.3   90%      0.0096       27.4   76%      0.0045       19.3   54%
32      512    0.0368       28.4   89%      0.0104       25.3   79%      0.0048       18.1   57%
24      384    0.0481       21.8   91%      0.0136       19.3   80%      0.0066       13.2   55%
16      256    0.0715       14.6   91%      0.0204       12.9   81%      0.0073       11.9   74%
12      192    0.0875       12.0   100%     0.0271       9.7    81%      0.0096       9.1    76%
8       128    0.1292       8.1    101%     0.0337       7.8    98%      0.0139       6.3    79%
4       64     0.2726       3.8    95%      0.0666       4.0    100%     0.0224       3.9    98%
1       16     1.0466       1.0    100%     0.2631       1.0    100%     0.0872       1.0    100%

spdup - speedup versus 1 blade result
effi'cy - speedup efficiency versus 1 blade result

(1) Satellite Tobacco Mosaic Virus (STMV) molecule, 1,066,628 atoms, 12 Angstrom cutoff, Langevin dynamics, 500 time steps
(2) f1 ATPase molecule, 327,506 atoms, 11 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
(3) ApoA1 molecule, 92,224 atoms, 12 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
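
The spdup and effi'cy columns can be reproduced from the throughput column. The following is a minimal sketch, assuming that speedup is the 1-blade time per step divided by the N-blade time per step, and that efficiency is the speedup divided by the number of blades. The function names are illustrative only and are not part of NAMD.

```python
def speedup(secs_per_step_1_blade, secs_per_step_n_blades):
    """Speedup versus the 1-blade result (assumed definition)."""
    return secs_per_step_1_blade / secs_per_step_n_blades


def efficiency(spdup, blades):
    """Speedup efficiency: fraction of ideal linear scaling (assumed definition)."""
    return spdup / blades


# STMV throughput (secs/step) reported above: 1 blade vs. 48 blades.
stmv_spdup = speedup(1.0466, 0.0277)
stmv_effcy = efficiency(stmv_spdup, 48)
print(f"spdup {stmv_spdup:.1f}, effi'cy {stmv_effcy:.0%}")
```

With the STMV figures this gives a speedup of about 37.8 and an efficiency of about 79%, matching the 48-blade row.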

Results and Configuration Summary

Hardware Configuration

    48 x Sun Blade X6275, each with
      2 x (2 x 2.93 GHz Intel QC Xeon X5570 (Nehalem) processors)
      2 x (24 GB memory)
      Hyper-Threading (HT) off, Turbo Mode on

Software Configuration

    SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
    OpenMPI 1.3.2
    gcc 4.1.2 (1/15/2007), gfortran 4.1.2 (1/15/2007)

Benchmark Description

Molecular dynamics simulation is widely used in biological and materials science research. NAMD is a public-domain molecular dynamics software application for which a variety of molecular input directories are available. Three of these directories define:
  • the Satellite Tobacco Mosaic Virus (STMV) that comprises 1,066,628 atoms
  • the f1 ATPase enzyme that comprises 327,506 atoms
  • the ApoA1 enzyme that comprises 92,224 atoms
Each input directory also specifies the type of molecular dynamics simulation to be performed, for example, Langevin dynamics with a 12 Angstrom cutoff for 500 time steps, or particle mesh Ewald dynamics with an 11 Angstrom cutoff for 500 time steps.

Key Points and Best Practices

Models with large numbers of atoms scale better than models with small numbers of atoms.

The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking, which increases the processor frequency from 2.93GHz to 3.33GHz. This feature was enabled when generating the results reported here.

See Also

Disclosure Statement

NAMD, see http://www.ks.uiuc.edu/Research/namd/performance.html for more information, results as of 11/17/2009.

Tuesday Oct 13, 2009

The Sun Storage F5100 Flash Array can substantially improve performance over internal hard disk drives as shown by the I/O intensive ABAQUS MCAE application Standard benchmark tests on a Sun Fire X4270 server.

The I/O intensive ABAQUS "Standard" benchmark test cases were run on a single Sun Fire X4270 server. Data is presented for runs at both 8 and 16 thread counts.

The ABAQUS "Standard" module is an MCAE application based on the finite element method (FEA) of analysis. This computer based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "S4b" test case by 14%.

  • The Sun Fire X4270 server coupled with a Sun Storage F5100 Flash Array established the world record performance on a single node for the four test cases S2A, S4B, S4D and S6.

Performance Landscape

ABAQUS "Standard" Benchmark Test S4B: Advantage of Sun Storage F5100

Results are total elapsed run times in seconds

Threads   4x15K RPM 72 GB SAS HDD   Sun F5100 r/w buff 4096   Sun F5100
          striped HW RAID0          striped                   Performance Advantage
8         1504                      1318                      14%
16        1811                      1649                      10%

ABAQUS Standard Server Benchmark Subset: Single Node Record Performance

Results are total elapsed run times in seconds

Platform Cores S2a S4b S4d S6
X4270 w/F5100 8 302 1192 779 1237
HP BL460c G6 8 324 1309 843 1322
X4270 w/F5100 4 552 1970 1181 1706
HP BL460c G6 4 561 2062 1234 1812

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      Hyperthreading enabled
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: ABAQUS V6.9-1 Standard Module
    Benchmark: ABAQUS Standard Benchmark Test Suite

Benchmark Description

Abaqus/Standard Benchmark Problems

These problems provide an estimate of the performance that can be expected when running Abaqus/Standard or similar commercially available MCAE (FEA) codes like ANSYS and MSC/Nastran on different computers. The jobs are representative of those typically analyzed by Abaqus/Standard and other MCAE applications. These analyses include linear statics, nonlinear statics, and natural frequency extraction.

Please go here for a more complete description of the tests.

Key Points and Best Practices

  • The memory requirements for the test cases in the ABAQUS Standard benchmark test suite are rather substantial, with some of the test cases requiring slightly over 20GB of memory. There are two memory limits: a minimum, beyond which out of core "memory" is used at the cost of additional CPU time, and a maximum that minimizes I/O operations. These memory limits are given in the ABAQUS output and can be established before a full execution by making a preliminary run in diagnostic mode.
  • Based on the maximum physical memory on a platform the user can stipulate the maximum portion of this memory that can be allocated to the ABAQUS job. This is done in the "abaqus_v6.env" file that either resides in the subdirectory from where the job was launched or in the abaqus "site" subdirectory under the home installation directory.
  • Sometimes when running multiple cores on a single node, it is preferable from a performance standpoint to run in "smp" shared memory mode. This is specified using the "THREADS" option on the "mpi_mode" line in the abaqus_v6.env file, as opposed to the "MPI" option on this line. The test case considered here illustrates this point.
  • The test cases for the ABAQUS standard module all have a substantial I/O component, where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). Performance will be enhanced by using the fastest available drives and striping together more than one of them, or by using a high performance disk storage system with high performance interconnects. On Linux, excess memory can be used to cache and accelerate I/O.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Abaqus, Inc. or its subsidiaries in the United States and/or other countries: Abaqus, Abaqus/Standard, Abaqus/Explicit. All information on the ABAQUS website is Copyrighted 2004-2009 by Dassault Systemes. Results from http://www.simulia.com/support/v69/v69_performance.php as of October 12, 2009.

Monday Oct 12, 2009

Significance of Results

The Sun Storage F5100 Flash Array can greatly improve performance over internal hard disk drives as shown by the I/O intensive ANSYS MCAE application BMD benchmark tests on a Sun Fire X4270 server.

Select ANSYS 12 BMD benchmarks were run on a single Sun Fire X4270 server. These I/O intensive test cases were run to compare the performance of conventional high performance disk to Sun FlashFire technology.

The ANSYS 12.0 module is an MCAE application based on the finite element method (FEA) of analysis. This computer based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "BMD-4" test case by 67% in the 8-core/8-thread server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "BMD-7" test case by 18% in the 8-core/16-thread server configuration.

Performance Landscape

ANSYS 12 "BMD" Test Suite on Single X4270 (24GB mem.) - SMP Mode

Results are total elapsed run times in seconds

Test Case   SMP   4x15K RPM 72 GB SAS HDD   Sun F5100 r/w buff 4096   Sun F5100
                  striped HW RAID0          striped                   Performance Advantage
bmd-4       8     523                       314                       67%
bmd-7       16    357                       303                       18%

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      Hyperthreading enabled
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: ANSYS Multiphysics 12.0
    Benchmark: ANSYS 12 "BMD" Benchmark Test Suite

Benchmark Description

ANSYS is a general purpose engineering analysis MCAE application that is based on the Finite Element Method. It performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior or deformations are concerned. Ansys provides a number of benchmark tests which exercise the capabilities of the software.

Please go here for a more complete description of the tests.

Key Points and Best Practices

Performance Considerations

The performance of Ansys (IO-intensive MCAE application) can be increased by reducing the IO demands of the application by increasing server memory or by using SSDs to increase the bandwidth and reduce the latency. The most I/O intensive case in the ANSYS distributed "BMD" test suite is BMD-4 particularly at the (maximum) 8 core level for a single node.


  • Ansys now takes full advantage of inexpensive RAID0 disk arrays and delivers sustained I/O rates.

  • Large memory can cache file accesses but often the size of ANSYS files grows much larger than the available physical memory so that system file caching is not able to hide the I/O cost.
  • For fast ANSYS runs the recommended configuration is a RAID 0 setup using 4 or more disks and a fast RAID controller. These fast I/O configurations are inexpensive to put together for systems and can achieve I/O rates in excess of 200 MB/sec.
  • SSD drives have much lower seek times, use less power, and tend to be about 2X faster than the fastest rotating disks for sustained throughput. The observed speed of a RAID 0 configuration of SSD drives for ANSYS simulations has been nearly as fast as I/O that is cached by large memory systems. SSD drives then may be the most affordable way to extend the capacity of a system to jobs that are too large to run in-core without incurring the performance penalty usually associated with I/O demands.

More About The ANSYS BMD "Distributed" Benchmarks

In the most recent release of the ANSYS benchmarks there are now two test suites: The SMP "BM" suite designed to run on a single node with multi processors and the DMP "BMD" suite intended to run on multi node clusters but which can also run on a single node in SMP mode as in this study.

  • The test cases from both ANSYS test suites all have a substantial I/O component, where 15% to 20% of the total run times are associated with I/O activity (primarily scratch files). Performance will be enhanced by using the fastest available drives and striping together more than one of them, or by using a high performance disk storage system with high performance interconnects. When running with the SX64 build, employing a ZFS file system might be a good idea.
  • The ANSYS test cases don't scale very well (BMD better than BM); at best they scale up to 8 cores.
  • The memory requirements for the test cases in the ANSYS BMD suite are greater than for the standard benchmark test suite. The requirements for the standard suite are not great, requiring less than 3GB.

See Also

MCAE, SSD, HPC, ANSYS, Linux, SuSE, Performance, X64, Intel

Disclosure Statement

The following are trademarks or registered trademarks of ANSYS, Inc., ANSYS Multiphysics TM. All information on the ANSYS website is Copyrighted by ANSYS, Inc. Results from http://www.ansys.com/services/ss-intel-bench120.htm as of October 12, 2009.

Monday Oct 12, 2009

Significance of Results

The Sun Storage F5100 Flash Array can double performance over internal hard disk drives as shown by the I/O intensive MSC/Nastran MCAE application MDR3 benchmark tests on a Sun Fire X4270 server.

The MD Nastran MDR3 benchmarks were run on a single Sun Fire X4270 server. The I/O intensive test cases were run at different core levels from one up to the maximum of 8 available cores in SMP mode.

The MSC/Nastran MD 2008 R3 module is an MCAE application based on the finite element method (FEA) of analysis. This computer based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xx0cmd2" test case by 107% in the 8-core server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xl0tdf1" test case by 85% in the 8-core server configuration.

The MD Nastran MDR3 test suite was designed to include some very I/O intensive test cases, although some are not very scalable. These cases are called "xx0wmd0" and "xx0xst0". Both were run, and results are presented using a single-core server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xx0xst0" test case by 33% in the single-core server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xx0wmd0" test case by 20% in the single-core server configuration.

Performance Landscape

MD Nastran MDR3 Benchmark Tests

Results in seconds

Test Case   DMP   4x15K RPM 72 GB SAS HDD   Sun F5100 r/w buff 4096   Sun F5100
                  striped HW RAID0          striped                   Performance Advantage
xx0cmd2     8     959                       463                       107%
xl0tdf1     8     1104                      596                       85%
xx0xst0     1     1307                      980                       33%
xx0wmd0     1     20250                     16806                     20%
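
The Performance Advantage column is consistent with the ratio of the two elapsed times. The sketch below is an illustration only, assuming the advantage is defined as (HDD time / F5100 time - 1) x 100 and rounded to the nearest percent; the function name is hypothetical and not taken from any benchmark tool.

```python
def performance_advantage(hdd_seconds, flash_seconds):
    """Percent advantage of flash over HDD (assumed definition):
    (HDD elapsed time / flash elapsed time - 1) * 100."""
    return (hdd_seconds / flash_seconds - 1.0) * 100.0


# Elapsed run times in seconds reported above: (HDD RAID0, Sun F5100).
runs = {
    "xx0cmd2": (959, 463),
    "xl0tdf1": (1104, 596),
    "xx0xst0": (1307, 980),
    "xx0wmd0": (20250, 16806),
}

advantages = {
    case: round(performance_advantage(hdd, flash))
    for case, (hdd, flash) in runs.items()
}

for case, pct in advantages.items():
    print(f"{case}: {pct}%")
```

Under that assumption the four test cases come out at 107%, 85%, 33% and 20%, in agreement with the reported values.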

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: MSC/NASTRAN MD 2008 R3
    Benchmark: MDR3 Benchmark Test Suite
    HP MPI: 02.03.00.00 [7585] Linux x86-64

Benchmark Description

The benchmark tests are representative of typical MSC/Nastran applications including both SMP and DMP runs involving linear statics, nonlinear statics, and natural frequency extraction.

The MD (Multi Discipline) Nastran 2008 application performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior and/or deformations are concerned. The new release includes the MARC module for general purpose nonlinear analyses and the Dytran module that employs an explicit solver to analyze crash and high velocity impact conditions.

Please go here for a more complete description of the tests.

Key Points and Best Practices

  • Based on the maximum physical memory on a platform the user can stipulate the maximum portion of this memory that can be allocated to the Nastran job. This is done on the command line with the mem= option. On Linux based systems where the platform has a large amount of memory and where the model does not have large scratch I/O requirements the memory can be allocated to a tmpfs scratch space file system. On Solaris X64 systems advantage can be taken of ZFS for higher I/O performance.

  • The MD Nastran MDR3 test cases don't scale very well: a few do not scale at all, and the rest scale up to 8 cores at best.

  • The test cases for the MSC/Nastran module all have a substantial I/O component where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). The required scratch file size ranges from less than 1 GB on up to about 140 GB. Performance will be enhanced by using the fastest available drives and striping together more than one of them or using a high performance disk storage system, further enhanced as indicated here by implementing the Lustre based I/O system. High performance interconnects such as InfiniBand for inter node cluster message passing as well as I/O transfer from the storage system can also enhance performance substantially.

See Also

Disclosure Statement

MSC.Software is a registered trademark of MSC. All information on the MSC.Software website is copyrighted. MD Nastran MDR3 results from http://www.mscsoftware.com and this report as of October 12, 2009.

Friday Oct 09, 2009

Significance of Results

Results are presented for the Weather Research and Forecasting (WRF) code running on twelve Sun Blade X6275 server modules, housed in the Sun Blade 6048 chassis, using the 2.5 km CONUS benchmark dataset.

  • The Sun Blade X6275 cluster was able to achieve 373 GFLOP/s on the CONUS 2.5-KM Dataset.
  • The results demonstrate a 91% speedup efficiency, or 11x speedup, from 1 to 12 blades.
  • The current results were run with turbo on.

Performance Landscape

Performance is expressed in terms of "simulation speedup", which is the ratio of the simulated time step per iteration to the average wall clock time required to compute it. A larger number implies better performance.

The current results were run with turbo mode on.

WRF 3.0.1.1: Weather Research and Forecasting CONUS 2.5-KM Dataset
#      #     #     #     Performance            Computation Rate       Speedup/Efficiency        Turbo On
Blade  Node  Proc  Core  (Simulation Speedup)   GFLOP/sec              (vs. 1 blade)             Relative Perf
                         Turbo On   Turbo Off   Turbo On   Turbo Off   Turbo On     Turbo Off
12     24    48    192   13.58      12.93       373.0      355.1       11.0 / 91%   10.4 / 87%   +6%
8      16    32    128   9.27       -           254.6      -           7.5 / 93%    -            -
6      12    24    96    7.03       6.60        193.1      181.3       5.7 / 94%    5.3 / 89%    +7%
4      8     16    64    4.74       -           130.2      -           3.8 / 96%    -            -
2      4     8     32    2.44       -           67.0       -           2.0 / 98%    -            -
1      2     4     16    1.24       1.24        34.1       34.1        1.0 / 100%   1.0 / 100%   +0%

Results and Configuration Summary

Hardware Configuration:

    Sun Blade 6048 Modular System
      12 x Sun Blade X6275 Server Modules, each with
        4 x 2.93 GHz Intel QC X5570 processors
        24 GB (6 x 4GB)
        QDR InfiniBand
        HT disabled in BIOS
        Turbo mode enabled in BIOS

Software Configuration:

    OS: SUSE Linux Enterprise Server 10 SP 2
    Compiler: PGI 7.2-5
    MPI Library: Scali MPI v5.6.4
    Benchmark: WRF 3.0.1.1
    Support Library: netCDF 3.6.3

Benchmark Description

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. WRF is designed to be a flexible, state-of-the-art atmospheric simulation system that is portable and efficient on available parallel computing platforms. It features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility.

Dataset used:

    Single domain, large size 2.5KM Continental US (CONUS-2.5K)

    • 1501x1201x35 cell volume
    • 6hr, 2.5km resolution dataset from June 4, 2005
    • Benchmark is the final 3hr simulation for hrs 3-6 starting from a provided restart file; the benchmark may also be performed (but seldom reported) for the full 6hrs starting from a cold start
    • Iterations output at every 15 sec of simulation time, with the computation cost of each time step ~412 GFLOP

Key Points and Best Practices

  • Processes were bound to processors in round-robin fashion.
  • Model simulation time is 15 seconds per iteration as defined in input job deck. An achieved speedup of 2.67X means that each model iteration of 15s of simulation time was executed in 5.6s of real wallclock time (on average).
  • Computational requirements are 412 GFLOP per simulation time step as (measured empirically and) documented on the UCAR web site for this data model.
  • Model was run as single MPI job.
  • Benchmark was built and run as a pure-MPI variant. With larger process counts building and running WRF as a hybrid OpenMP/MPI variant may be more efficient.
  • Input and output (netCDF format) datasets can be very large for some WRF data models and run times will generally benefit by using a scalable filesystem. Performance with very large datasets (>5GB) can benefit by enabling WRF quilting of I/O across designated processors/servers. The master thread (or rank-0) performs most of the I/O (unless quilting specifies otherwise), with all processes potentially generating some I/O.
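
The relationship between simulation speedup, wall clock time per iteration, and computation rate can be sketched as follows. This is a minimal illustration that assumes the 15 seconds of simulated time per iteration and the 412 GFLOP per time step quoted above; the constant and function names are hypothetical and are not part of WRF.

```python
SIM_SECONDS_PER_ITERATION = 15.0  # simulated time per iteration (from the job deck)
GFLOP_PER_ITERATION = 412.0       # computational cost per time step for CONUS 2.5 km


def wallclock_per_iteration(simulation_speedup):
    """Average wall clock seconds needed to compute one iteration."""
    return SIM_SECONDS_PER_ITERATION / simulation_speedup


def gflops(simulation_speedup):
    """Computation rate in GFLOP/sec implied by a simulation speedup."""
    return GFLOP_PER_ITERATION / wallclock_per_iteration(simulation_speedup)


# A speedup of 2.67X corresponds to about 5.6 s of wall clock time per iteration.
print(f"{wallclock_per_iteration(2.67):.1f} s per iteration")

# The 12-blade turbo-on speedup of 13.58 corresponds to about 373 GFLOP/sec.
print(f"{gflops(13.58):.1f} GFLOP/sec")
```

Under these assumptions the reported 12-blade result (13.58 simulation speedup, 373.0 GFLOP/sec) and the 1-blade result (1.24, 34.1 GFLOP/sec) are mutually consistent.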

See Also

Disclosure Statement

WRF, CONUS-2.5K, see http://www.mmm.ucar.edu/wrf/WG2/bench/, results as of 9/21/2009.

Monday Jul 06, 2009

Significance of Results

The Sun Blade X6275 cluster, equipped with 2.93 GHz Intel QC X5570 processors and QDR InfiniBand interconnect, delivered the best performance at 32, 64 and 128 cores for the RADIOSS Neon_1M and Taurus_Frontal benchmarks.

  • Using half the nodes (16), the Sun Blade X6275 cluster was 3% faster than the 32-node SGI cluster running the Neon_1M test case.
  • In the 128-core configuration, the Sun Blade X6275 cluster was 49% faster than the SGI cluster running the Neon_1M test case.
  • In the 128-core configuration, the Sun Blade X6275 cluster was 16% faster than the top SGI cluster running the Taurus_Frontal test case.
  • At both the 32- and 64-core levels the Sun Blade X6275 cluster was 60% faster running the Neon_1M test case.
  • At both the 32- and 64-core levels the Sun Blade X6275 cluster was 4% faster running the Taurus_Frontal test case.

Performance Landscape


RADIOSS Public Benchmark Test Suite
  Results are Total Elapsed Run Times (sec.)

                                                           Benchmark Test Case
System                                              cores  TAURUS_FRONTAL  NEON_1M  NEON_300K
                                                           1.8M            1.06M    277K

SGI Altix ICE 8200 IP95 2.93GHz, 32 nodes, DDR      256    3559            1672     310

Sun Blade X6275 2.93GHz, 16 nodes, QDR              128    4397            1627     361
SGI Altix ICE 8200 IP95 2.93GHz, 16 nodes, DDR      128    5033            2422     360

Sun Blade X6275 2.93GHz, 8 nodes, QDR               64     5934            2526     587
SGI Altix ICE 8200 IP95 2.93GHz, 8 nodes, DDR       64     6181            4088     584

Sun Blade X6275 2.93GHz, 4 nodes, QDR               32     9764            4720     1035
SGI Altix ICE 8200 IP95 2.93GHz, 4 nodes, DDR       32     10120           7574     1017

Results and Configuration Summary

Hardware Configuration:
    8 x Sun Blade X6275
    2x2.93 GHz Intel QC X5570 processors, turbo enabled (per half blade)
    24 GB (6 x 4GB 1333 MHz DDR3 dimms)
    InfiniBand QDR interconnects

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Application: RADIOSS V9.0 SP 1
    Benchmark: RADIOSS Public Benchmark Test Suite

Benchmark Description

Altair has provided a suite of benchmarks to demonstrate the performance of RADIOSS. The initial set of benchmarks provides four automotive crash models. Future updates will add in marine and aerospace applications, as well as including automotive NVH applications. The benchmarks use real data, requiring double precision computations and the parith feature (Parallel arithmetic algorithm) to obtain exactly the same results whatever the number of processors used.

Please go here for a more complete description of the tests.

Key Points and Best Practices

The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking, which increases the processor frequency from 2.93GHz to 3.2GHz. This feature was enabled when generating the results reported here.

Node to Node MPI ping-pong tests show a bandwidth of 3000 MB/sec on the Sun Blade X6275 cluster using QDR. The same tests performed on a Sun Fire X2270 cluster and equipped with DDR interconnect produced a bandwidth of 1500 MB/sec. On another recent Intel based Sun Fire X2250 cluster (3.4 GHz DC E5272 processors) also equipped with DDR interconnects, the bandwidth was 1250 MB/sec. This same Sun Fire X2250 cluster equipped with SDR IB interconnect produced an MPI ping-pong bandwidth of 975 MB/sec.

See Also

Current RADIOSS Benchmark Results:
http://www.altairhyperworks.com/Benchmark.aspx

Disclosure Statement

All information on the Altair website is Copyright 2009 Altair Engineering, Inc. All Rights Reserved. Results from http://www.altairhyperworks.com/Benchmark.aspx

Tuesday Jun 30, 2009

Significance of Results

A Sun Blade 6048 chassis with 12 Sun Blade X6275 server modules ran benchmarks using the NAMD molecular dynamics applications software. NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD was developed by the Theoretical and Computational Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. NAMD is driven by major trends in computing and structural biology and received a 2002 Gordon Bell Award.
  • The cluster of 12 Sun Blade X6275 server modules was 6.2x faster than the 256-processor configuration of the IBM BlueGene/L.
  • The cluster of 12 Sun Blade X6275 server modules exhibited excellent scalability for NAMD molecular dynamics simulation, up to 10.6x speedup for 12 blades relative to 1 blade.
  • For the largest molecule considered, the cluster of 12 Sun Blade X6275 server modules achieved a throughput of 0.094 seconds per simulation step.

Molecular dynamics simulation is important to biological and materials science research. Molecular dynamics is used to determine the low energy conformations or shapes of a molecule. These conformations are presumed to be the biologically active conformations.

Performance Landscape

The NAMD Performance Benchmarks web page plots the performance of NAMD when the ApoA1 benchmark is executed on a variety of clusters. The performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation, multiplied by the number of "processors" on which NAMD executes in parallel. The following table compares the performance of NAMD version 2.6 when executed on the Sun Blade X6275 cluster to the performance of NAMD as reported for several of the clusters on the web page. In this table, the performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation, however, not multiplied by the number of "processors". A smaller number implies better performance.

Cluster Name and Interconnect    Throughput (seconds per step)
                                 128 Cores    192 Cores    256 Cores
Sun Blade X6275 InfiniBand       0.013        0.010        -
Cambridge Xeon/3.0 InfiniPath    0.016        -            0.0088
NCSA Xeon/2.33 InfiniBand        0.019        -            0.010
AMD Opteron/2.2 InfiniPath       0.025        -            0.015
IBM HPCx PWR4/1.7 Federation     0.039        -            0.021
SDSC IBM BlueGene/L MPI          0.108        -            0.062

The following tables report results for NAMD molecular dynamics using a cluster of Sun Blade X6275 server modules. The performance of the cluster is expressed in terms of the time in seconds that is required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Blades  Cores  STMV molecule (1)            f1 ATPase molecule (2)       ApoA1 molecule (3)
               Thruput      spdup  effi'cy  Thruput      spdup  effi'cy  Thruput      spdup  effi'cy
               (secs/step)                  (secs/step)                  (secs/step)
12      192    0.0941       10.6   88%      0.0270       9.1    76%      0.0102       8.1    68%
8       128    0.1322       7.5    94%      0.0317       7.7    97%      0.0131       6.3    79%
4       64     0.2656       3.7    94%      0.0610       4.0    101%     0.0204       4.1    102%
1       16     0.9952       1.0    100%     0.2454       1.0    100%     0.0829       1.0    100%

spdup - speedup versus 1 blade result
effi'cy - speedup efficiency versus 1 blade result

(1) Satellite Tobacco Mosaic Virus (STMV) molecule, 1,066,628 atoms, 12 Angstrom cutoff, Langevin dynamics, 500 time steps
(2) f1 ATPase molecule, 327,506 atoms, 11 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
(3) ApoA1 molecule, 92,224 atoms, 12 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps

Results and Configuration Summary

Hardware Configuration

  • Sun Blade[tm] 6048 Modular System with one shelf configured with
    • 12 x Sun Blade X6275, each with
      • 2 x (2 x 2.93 GHz Intel QC Xeon X5570 processors)
      • 2 x (24 GB memory)
      • Hyper-Threading (HT) off, Turbo Mode on

Software Configuration

  • SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
  • Scali MPI 5.6.6
  • gcc 4.1.2 (1/15/2007), gfortran 4.1.2 (1/15/2007)

Key Points and Best Practices

  • Models with large numbers of atoms scale better than models with small numbers of atoms.

About the Sun Blade X6275

The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking, which increases the processor frequency from 2.93GHz to 3.2GHz. This feature was enabled when generating the results reported here.

Benchmark Description

Molecular dynamics simulation is widely used in biological and materials science research. NAMD is a public-domain molecular dynamics software application for which a variety of molecular input directories are available. Three of these directories define:
  • the Synthetic Tobacco Mosaic Virus (STMV) that comprises 1,066,628 atoms
  • the f1 ATPase enzyme that comprises 327,506 atoms
  • the ApoA1 enzyme that comprises 92,224 atoms
Each input directory also specifies the type of molecular dynamics simulation to be performed, for example, Langevin dynamics with a 12 Angstrom cutoff for 500 time steps, or particle mesh Ewald dynamics with an 11 Angstrom cutoff for 500 time steps.

See Also

Disclosure Statement

NAMD, see http://www.ks.uiuc.edu/Research/namd/performance.html for more information, results as of 6/26/2009.

Tuesday Jun 16, 2009

Significance of Results

The I/O intensive MSC/Nastran Vendor_2008 benchmark test suite was used to compare the performance on a Sun Fire X2270 server when using SSDs internally instead of HDDs.

The effect on performance from increasing memory to augment I/O caching was also examined. The Sun Fire X2270 server was equipped with Intel QC Xeon X5570 processors (Nehalem). The positive effect of adding memory to increase I/O caching is offset to some degree by the reduction in memory frequency with additional DIMMs in the bays of each memory channel on each cpu socket for these Nehalem processors.

  • SSDs can significantly improve NASTRAN performance especially on runs with larger core counts.
  • Additional memory in the server can also increase performance; however, in some systems additional memory can decrease the memory frequency, which may offset the benefits of increased capacity.
  • If SSDs are not used, striped disks will often improve performance of I/O-bound MCAE applications.
  • To obtain the highest performance, it is recommended that SSDs be used and servers be configured with the largest memory possible without decreasing the memory frequency. One should always look at the workload characteristics and compare against this benchmark to correctly set expectations.

SSD vs. HDD Performance

The performance of two striped 30GB SSDs was compared to two striped 7200 rpm 500GB SATA drives on a Sun Fire X2270 server.

  • At the 8-core level (maximum cores for a single node) SSDs were 2.2x faster for the larger xxocmd2 and the smaller xlotdf1 cases.
  • For 1-core results SSDs are up to 3% faster.
  • On the smaller mdomdf1 test case there was no increase in performance on the 1-, 2-, and 4-cores configurations.

Performance Enhancement with I/O Memory Caching

Performance for Nastran can often be increased by additional memory to provide additional in-core space to cache I/O and thereby reduce the IO demands.

The main memory was doubled from 24GB to 48GB. At the 24GB level one 4GB DIMM was placed in the first bay of each of the 3 CPU memory channels on each of the two CPU sockets on the Sun Fire X2270 platform. This configuration allows a memory frequency of 1333MHz.

At the 48GB level a second 4GB DIMM was placed in the second bay of each of the 3 CPU memory channels on each socket. This reduces the memory frequency to 1066MHz.
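
The trade-off between capacity and memory frequency described above can be written out as simple arithmetic. This is only a sketch of the DIMM population stated in the text; the constant and function names are illustrative.

```python
# Memory population described above for the two-socket Sun Fire X2270.
SOCKETS = 2
CHANNELS_PER_SOCKET = 3
DIMM_GB = 4

def memory_config(dimms_per_channel, frequency_mhz):
    """Return (total capacity in GB, memory frequency in MHz) for a DIMM population."""
    capacity_gb = SOCKETS * CHANNELS_PER_SOCKET * dimms_per_channel * DIMM_GB
    return capacity_gb, frequency_mhz

small = memory_config(dimms_per_channel=1, frequency_mhz=1333)
large = memory_config(dimms_per_channel=2, frequency_mhz=1066)

capacity_gain = large[0] / small[0]
frequency_loss = 1 - large[1] / small[1]
print(f"{small[0]} GB @ {small[1]} MHz -> {large[0]} GB @ {large[1]} MHz")
print(f"capacity x{capacity_gain:.1f}, memory frequency down {frequency_loss:.0%}")
```

Doubling the capacity therefore costs about a fifth of the memory frequency, which is why the results below depend on whether a test case gains more from I/O caching than it loses from slower memory.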

Adding Memory With HDDs (SATA)

  • The additional server memory increased the performance when running with the slower SATA drives at the higher core levels (e.g. 4- & 8-cores on a single node)
  • The larger xxocmd2 case was 42% faster and the smaller xlotdf1 case was 32% faster at the maximum 8-core level on a single system.
  • The special I/O intensive getrag case was 8% faster at the 1-core level.

Adding Memory With SSDs

  • At the maximum 8-core level (for a single node) the larger xxocmd2 case was 47% faster in overall run time.
  • The effects were much smaller at lower core counts and in the tests at the 1-core level most test cases ran from 5% to 14% slower with the slower CPU memory frequency dominating over the added in-core space available for I/O caching vs. direct transfer to SSD.
  • Only the special I/O intensive getrag case was an exception running 6% faster at the 1-core level.

Increasing performance with Two Striped (SATA) Drives

The performance of multiple striped drives was also compared to that of a single drive. The study compared two striped internal 7200 rpm 500GB SATA drives to a single internal SATA drive.

  • On a single node with 8 cores, the largest test case xx0cmd2 was 40% faster, a smaller test case xl0tdf1 was 33% faster and even the smallest test case mdomdf1 was 12% faster.

  • On 1-core the added boost in performance with striped disks was from 4% to 13% on the various test cases.

  • On 1-core the special I/O-intensive test case getrag was 29% faster.

Performance Landscape

Times in table are elapsed time (sec).


MSC/Nastran Vendor_2008 Benchmark Test Suite

                             Sun Fire X2270, 2 x X5570 QC 2.93 GHz          Sun Fire X2270, 2 x X5570 QC 2.93 GHz
                             2 x 7200 RPM SATA HDDs                         2 x SSDs
Test                  Cores  48 GB    24 GB    24 GB    Ratio      Ratio:   48 GB    24 GB    Ratio:  Ratio (24GB):
                             1067MHz  2 SATA   1 SATA   (2xSATA):  2xSATA/  1067MHz  1333MHz  48GB/   2xSATA/
                                      1333MHz  1333MHz  48GB/24GB  1xSATA                     24GB    2xSSD

vlosst1               1      133      127      134      1.05       0.95     133      126      1.05    1.01

xxocmd2               1      946      895      978      1.06       0.87     947      884      1.07    1.01
                      2      622      614      703      1.01       0.87     600      583      1.03    1.05
                      4      466      631      991      0.74       0.64     426      404      1.05    1.56
                      8      1049     1554     2590     0.68       0.60     381      711      0.53    2.18

xlotdf1               1      2226     2000     2081     1.11       0.96     2214     1939     1.14    1.03
                      2      1307     1240     1308     1.05       0.95     1315     1189     1.10    1.04
                      4      858      833      1030     1.03       0.81     744      751      0.99    1.11
                      8      912      1562     2336     0.58       0.67     674      712      0.95    2.19

xloimf1               1      1216     1151     1236     1.06       0.93     1228     1290     0.95    0.89

mdomdf1               1      987      913      983      1.08       0.93     987      911      1.08    1.00
                      2      524      485      520      1.08       0.93     524      484      1.08    1.00
                      4      270      237      269      1.14       0.88     270      250      1.08    0.95

Sol400_1 (xl1fn40_1)  1      2555     2479     2674     1.03       0.93     2549     2402     1.06    1.03

Sol400_S (xl1fn40_S)  1      2450     2302     2481     1.06       0.93     2449     2262     1.08    1.02

getrag (xx0xst0)      1      778      843      1178     0.92       0.71     771      817      0.94    1.03
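
The ratio columns are quotients of the elapsed times. The sketch below recomputes them for the 8-core xxocmd2 row; the dictionary keys and function names are illustrative assumptions. The prose above appears to use two conventions: "N% faster" corresponds to one minus a ratio below 1.0 (the fraction of elapsed time saved), while "2.2x faster" is the ratio of elapsed times itself.

```python
# Elapsed times (seconds) for xxocmd2 on 8 cores, copied from the table above.
xxocmd2_8core = {
    "hdd2_48gb": 1049,  # 2 striped SATA drives, 48 GB
    "hdd2_24gb": 1554,  # 2 striped SATA drives, 24 GB
    "hdd1_24gb": 2590,  # 1 SATA drive, 24 GB
    "ssd2_48gb": 381,   # 2 striped SSDs, 48 GB
    "ssd2_24gb": 711,   # 2 striped SSDs, 24 GB
}

def ratio_columns(t):
    """Recompute the four ratio columns of the table from elapsed times."""
    return {
        "hdd_48gb_over_24gb": t["hdd2_48gb"] / t["hdd2_24gb"],
        "hdd_2x_over_1x": t["hdd2_24gb"] / t["hdd1_24gb"],
        "ssd_48gb_over_24gb": t["ssd2_48gb"] / t["ssd2_24gb"],
        "hdd2_over_ssd2_24gb": t["hdd2_24gb"] / t["ssd2_24gb"],
    }

def time_saved(ratio):
    """Fraction of elapsed time saved when ratio = new_time / old_time."""
    return 1 - ratio

ratios = ratio_columns(xxocmd2_8core)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
print(f"time saved by striping two SATA drives: {time_saved(ratios['hdd_2x_over_1x']):.0%}")
```

Recomputed values can differ from the table in the last digit, since the table entries may have been rounded or truncated differently.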

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X2270
      1 2-socket rack mounted server
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      2 x internal striped SSDs
      2 x internal striped 7200 rpm 500GB SATA drives

Software Configuration:

    O/S: Linux 64-bit SUSE SLES 10 SP 2
    Application: MSC/NASTRAN MD 2008
    Benchmark: MSC/NASTRAN Vendor_2008 Benchmark Test Suite
    HP MPI: 02.03.00.00 [7585] Linux x86-64
    Voltaire OFED-5.1.3.1_5 GridStack for SLES 10

Benchmark Description

The benchmark tests are representative of typical MSC/Nastran applications including both SMP and DMP runs involving linear statics, nonlinear statics, and natural frequency extraction.

The MD (Multi Discipline) Nastran 2008 application performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior and/or deformations are concerned. The new release includes the MARC module for general purpose nonlinear analyses and the Dytran module that employs an explicit solver to analyze crash and high velocity impact conditions.

  • As of Summer '08 there is an official Solaris X64 version of the MD Nastran 2008 system that is certified and maintained.
  • The memory requirements for the test cases in the new MSC/Nastran Vendor 2008 benchmark test suite range from a few hundred megabytes to no more than 5 GB.

Please go here for a more complete description of the tests.

Key Points and Best Practices

For more on Best Practices of SSD on HPC applications also see the Sun Blueprint:
http://wikis.sun.com/display/BluePrints/Solid+State+Drives+in+HPC+-+Reducing+the+IO+Bottleneck

Additional information on the MSC/Nastran Vendor 2008 benchmark test suite.

  • Based on the maximum physical memory on a platform the user can stipulate the maximum portion of this memory that can be allocated to the Nastran job. This is done on the command line with the mem= option. On Linux based systems where the platform has a large amount of memory and where the model does not have large scratch I/O requirements the memory can be allocated to a tmpfs scratch space file system. On Solaris X64 systems advantage can be taken of ZFS for higher I/O performance.

  • The MSC/Nastran Vendor 2008 test cases don't scale very well: a few do not scale at all, and the rest scale up to 8 cores at best.

  • The test cases for the MSC/Nastran module all have a substantial I/O component where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). The required scratch file size ranges from less than 1 GB up to about 140 GB. Performance will be enhanced by using the fastest available drives and striping together more than one of them, or by using a high performance disk storage system; it can be further enhanced, as indicated here, by implementing the Lustre based I/O system. High performance interconnects such as InfiniBand, used for inter-node cluster message passing as well as for I/O transfer from the storage system, can also enhance performance substantially.

See Also

Disclosure Statement

MSC.Software is a registered trademark of MSC. All information on the MSC.Software website is copyrighted. MSC/Nastran Vendor 2008 results from http://www.mscsoftware.com and this report as of June 9, 2009.

Tuesday Jun 09, 2009

Not the Free Compiler That You Thought, No, This Other One.

Nehalem performance measured with several software configurations

Contributed by: John Henning and Karsten Guthridge

Introduction


The GNU C Compiler, GCC, is popular, widely available, and an exemplary collaborative effort.

But how does it do for performance -- for example, on Intel's latest hot "Nehalem" processor family? How does it compare to the freely available Sun Studio compiler?

Using the SPEC CPU benchmarks, we take a look at this question. These benchmarks depend primarily on performance of the chip, the memory hierarchy, and the compiler. By holding the first two of these constant, it is possible to focus in on compiler contributions to performance.

Current Record Holder

The current SPEC CPU2006 floating point speed record holder is the Sun Blade X6270 server module. Using 2x Intel Xeon X5570 processor chips and 24 GB of DDR3-1333 memory, it delivers a result of 45.0 SPECfp_base2006 and 50.4 SPECfp2006. [1]

We used this same blade system to compare GCC vs. Studio. On separate, but same-model disks, this software was installed:

  • SuSE Linux Enterprise Server 11.0 (x86_64) and GCC V4.4.0 built with gmp-4.3.1 and mpfr-2.4.1
  • OpenSolaris2008.11 and Sun Studio 12 Update 1

Tie One Hand Behind Studio's Back

In order to make the comparison more fair to GCC, we took several steps.

  1. We simplified the tuning for the OpenSolaris/Sun Studio configuration. This was done in order to counter the criticism that one sometimes hears that SPEC benchmarks have overly aggressive tuning. Benchmarks were optimized with a reasonably short tuning string:

    For all:  -fast -xipo=2 -m64 -xvector=simd -xautopar
    For C++, add:  -library=stlport4
  2. Recall that SPEC CPU2006 allows two kinds of tuning: "base", and "peak". The base metrics require that all benchmarks of a given language use the same tuning. The peak metrics allow individual benchmarks to have differing tuning, and more aggressive optimizations, such as compiler feedback. The simplified Studio configuration used only the less aggressive base tuning.

Both of the above changes limited the performance of Sun Studio.  Several measures were used to increase the performance of GCC:

  1. We tested the latest released version of GCC, 4.4.0, which was announced on 21 April 2009. In our testing, GCC 4.4.0 provides about 10% better overall floating point performance than V4.3.2. Note that GCC 4.4.0 is more recent than the compiler that is included with recent Linux distributions such as SuSE 11, which includes 4.3.2; or Ubuntu 8.10, which updates to 4.3.2 when one does "apt-get install gcc". It was installed with the math libraries mpfr 2.4.1 and gmp 4.3.1, which are labeled as the latest releases as of 1 June 2009.

  2. A tuning effort was undertaken with GCC, including testing of -O2 -O3 -fprefetch-loop-arrays -funroll-all-loops -ffast-math -fno-strict-aliasing -ftree-loop-distribution -fwhole-program -combine and -fipa-struct-reorg

  3. Eventually, we settled on this tuning string for GCC base:

    For all:  -O3 -m64 -mtune=core2 -msse4.2 -march=core2
    -fprefetch-loop-arrays -funroll-all-loops
    -Wl,-z common-page-size=2M
    For C++, add:  -ffast-math

    The reason that only the C++ benchmarks used the fast math library was that 435.gromacs, which uses C and Fortran, fails validation with this flag. (Note: we verified that the benchmarks successfully obtained 2MB pages.)
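
The base rule, under which every benchmark of a given language gets the same flags, can be sketched as a small lookup. The dictionary layout and function name are illustrative assumptions and do not reproduce SPEC config file syntax; the flag strings are copied from the tuning strings quoted above.

```python
# Base tuning strings quoted in the text, split into individual flags.
# The data layout and names here are hypothetical, for illustration only.
BASE_FLAGS = {
    "studio": {
        "all": ["-fast", "-xipo=2", "-m64", "-xvector=simd", "-xautopar"],
        "c++": ["-library=stlport4"],
    },
    "gcc": {
        "all": ["-O3", "-m64", "-mtune=core2", "-msse4.2", "-march=core2",
                "-fprefetch-loop-arrays", "-funroll-all-loops",
                "-Wl,-z common-page-size=2M"],
        "c++": ["-ffast-math"],
    },
}

def base_flags(compiler, language):
    """Flags applied to every benchmark of one language under base rules."""
    per_compiler = BASE_FLAGS[compiler]
    # Languages without extra flags (C, Fortran) get only the common set.
    return per_compiler["all"] + per_compiler.get(language.lower(), [])

for language in ("C", "C++", "Fortran"):
    print("gcc", language, " ".join(base_flags("gcc", language)))
```

Only the C++ entry differs from the common set, which matches the statement that -ffast-math was limited to the C++ benchmarks in the GCC base configuration.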

Studio wins by 2x, even with one hand tied behind its back

At this point, a fair base-to-base comparison can be made, and Sun Studio/OpenSolaris finishes the race while GCC/Linux is still looking for its glasses: 44.8 vs. 21.1 (see Table 1). Notice that Sun Studio provides more than 2x the performance of GCC.

Table 1: Initial Comparisons, SPECfp2006,
Sun Studio/Solaris vs. GCC/Linux
                                                   Base   Peak
Industry FP Record
    Sun Studio 12 Update 1
    OpenSolaris 2008.11                            45.0   50.4
Studio/OpenSolaris: simplify above (less tuned)    44.8
GCC V4.4 / SuSE Linux 11                           21.1
Notes: All results reported are from rule-compliant, "reportable" runs of the SPEC CPU2006 floating point suite, CFP2006. "Base" indicates the metric SPECfp_base2006. "Peak" indicates SPECfp2006. Peak uses the same benchmarks and workloads as base, but allows more aggressive tuning. A base result may, optionally, be quoted as peak, but the converse is not allowed. For details, see SPEC's Readme1st.

Fair? Did you say "Fair"?

Wait, wait, the reader may protest - this is all very unfair to GCC, because the Studio result used all 8 cores on this 2-chip system, whereas GCC used only one core! You're using trickery!

To this plaintive whine, we respond that:

  • Compiler auto-parallelization technology is not a trick. Rather, it is an essential technology in order to get the best performance from today's multi-core systems. Nearly all contemporary CPU chips provide support for multiple cores. Compilers should do everything possible to make it easy to take advantage of these resources.

  • We tried to use more than one core for GCC, via the -ftree-parallelize-loops=n flag. GCC's autoparallelization appears to be in a much earlier development stage than Studio's, since we did not observe any improvement for any of the values of "n" that we tested. From the GCC wiki, it appears that a new autoparallelization effort is under development, which may improve its results at a later time.

  • But, all right, if you insist, we will make things even harder for Studio, and see how it does.

Tie Another Hand Behind Studio's Back

The earlier section mentioned various ways in which the performance comparison had been made easier for GCC. Continuing the paragraph numbering from above, we took these additional measures:

  1. Removed the autoparallelization from Studio, substituting instead a request for 2MB pagesizes (which the GCC tuning already had).

  2. Added "peak" tuning to GCC: for benchmarks that benefit, add -ffast-math, and compiler profile-driven feedback

At this point, Studio base beats GCC base by 38%, and Studio base beats GCC peak by more than 25% (see table 2).

Table 2: Additional Comparisons, SPECfp2006,
Sun Studio/Solaris vs. GCC/Linux
                                                Base   Peak
Sun Studio/OpenSolaris: base only, noautopar    29.1
GCC V4.4 / SuSE Linux 11                        21.1   23.1
The notes from Table 1 apply here as well.

Bottom line

The freely available Sun Studio 12 Update 1 compiler on OpenSolaris provides more than double the performance of GCC V4.4 on SuSE Linux, as measured by SPECfp_base2006.

If compilation is restricted to avoid using autoparallelization, Sun Studio still wins by 38% (base to base), or by more than 25% (Studio base vs. GCC peak).
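
The percentages in this summary follow directly from the scores in Tables 1 and 2. A minimal check, with illustrative dictionary keys and function name:

```python
# SPECfp_base2006 / SPECfp2006 scores quoted in Tables 1 and 2.
scores = {
    "studio_base_autopar": 44.8,    # Studio, simplified base tuning
    "studio_base_noautopar": 29.1,  # Studio, base only, no autoparallelization
    "gcc_base": 21.1,
    "gcc_peak": 23.1,
}

def advantage(a, b):
    """Ratio of score a to score b (higher scores are better)."""
    return a / b

print(f"Studio base vs. GCC base: {advantage(scores['studio_base_autopar'], scores['gcc_base']):.2f}x")
print(f"Studio noautopar base vs. GCC base: {advantage(scores['studio_base_noautopar'], scores['gcc_base']) - 1:.0%} higher")
print(f"Studio noautopar base vs. GCC peak: {advantage(scores['studio_base_noautopar'], scores['gcc_peak']) - 1:.0%} higher")
```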

YMMV

Your mileage may vary. It is certain that both GCC and Studio could be improved with additional tuning efforts. Both provide dozens of compiler flags, which can keep the tester delightfully engaged for an unbounded number of days. We feel that the tuning presented here is reasonable, and that additional tuning effort, if applied to both compilers, would not radically alter the conclusions.

Additional Information

The results disclosed in this article are from "reportable" runs of the SPECfp2006 benchmarks, which have been submitted to SPEC.

[1] SPEC and SPECfp are registered trademarks of the Standard Performance Evaluation Corporation. Competitive comparisons are based on data published at www.spec.org as of 1 June 2009. The X6270 result can be found at http://www.spec.org/cpu2006/results/res2009q2/cpu2006-20090413-07019.html.

This blog copyright 2009 by John Henning