Friday Nov 20, 2009

Significance of Results

A Sun Blade 6048 chassis with 48 Sun Blade X6275 server modules ran benchmarks using the NAMD molecular dynamics applications software. NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD is driven by major trends in computing and structural biology and received a 2002 Gordon Bell Award.

  • The cluster of 32 Sun Blade X6275 server modules was 9.2x faster than the 512 processor configuration of the IBM BlueGene/L.

  • The cluster of 48 Sun Blade X6275 server modules exhibited excellent scalability for NAMD molecular dynamics simulation, up to 37.8x speedup for 48 blades relative to 1 blade.

  • For largest molecule considered, the cluster of 48 Sun Blade X6275 server modules achieved a throughput of 0.028 seconds per simulation step.
Molecular dynamics simulation is important to biological and materials science research. Molecular dynamics is used to determine the low energy conformations or shapes of a molecule. These conformations are presumed to be the biologically active conformations.

Performance Landscape

The NAMD Performance Benchmarks web page plots the performance of NAMD when the ApoA1 benchmark is executed on a variety of clusters. The performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation, multiplied by the number of "processors" on which NAMD executes in parallel. The following table compares the performance of the Sun Blade X6275 cluster to several of the clusters for which performance is reported on the web page. In this table, the performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Cluster Name and Interconnect Throughput for 128 Cores
(seconds per step)
Throughput for 256 Cores
(seconds per step)
Throughput for 512 Cores
(seconds per step)
Sun Blade X6275 InfiniBand 0.014 0.0073 0.0048
Cambridge Xeon/3.0 InfiniPath 0.016 0.0088 0.0056
NCSA Xeon/2.33 InfiniBand 0.019 0.010 0.008
AMD Opteron/2.2 InfiniPath 0.025 0.015 0.008
IBM HPCx PWR4/1.7 Federation 0.039 0.021 0.013
SDSC IBM BlueGene/L MPI 0.108 0.061 0.044

The following tables report results for NAMD molecular dynamics using a cluster of Sun Blade X6275 server modules. The performance of the cluster is expressed in terms of the time in seconds that is required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Blades Cores STMV molecule (1) f1 ATPase molecule (2) ApoA1 molecule (3)
Thruput
(secs/ step)
spdup effi'cy Thruput
(secs/ step)
spdup effi'cy Thruput
(secs/ step)
spdup effi'cy
48 768 0.0277 37.8 79% 0.0075 35.2 73% 0.0039 22.2 46%
36 576 0.0324 32.3 90% 0.0096 27.4 76% 0.0045 19.3 54%
32 512 0.0368 28.4 89% 0.0104 25.3 79% 0.0048 18.1 57%
24 384 0.0481 21.8 91% 0.0136 19.3 80% 0.0066 13.2 55%
16 256 0.0715 14.6 91% 0.0204 12.9 81% 0.0073 11.9 74%
12 192 0.0875 12.0 100% 0.0271 9.7 81% 0.0096 9.1 76%
8 128 0.1292 8.1 101% 0.0337 7.8 98% 0.0139 6.3 79%
4 64 0.2726 3.8 95% 0.0666 4.0 100% 0.0224 3.9 98%
1 16 1.0466 1.0 100% 0.2631 1.0 100% 0.0872 1.0 100%

spdup - speedup versus 1 blade result
effi'cy - speedup efficiency versus 1 blade result

(1) Satellite Tobacco Mosaic Virus (STMV) molecule, 1,066,628 atoms, 12 Angstrom cutoff, Langevin dynamics, 500 time steps
(2) f1 ATPase molecule, 327,506 atoms, 11 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
(3) ApoA1 molecule, 92,224 atoms, 12 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps

Results and Configuration Summary

Hardware Configuration

    48 x Sun Blade X6275, each with
      2 x (2 x 2.93 GHz Intel QC Xeon X5570 (Nehalem) processors)
      2 x (24 GB memory)
      Hyper-Threading (HT) off, Turbo Mode on

Software Configuration

    SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
    OpenMPI 1.3.2
    gcc 4.1.2 (1/15/2007), gfortran 4.1.2 (1/15/2007)

Benchmark Description

Molecular dynamics simulation is widely used in biological and materials science research. NAMD is a public-domain molecular dynamics software application for which a variety of molecular input directories are available. Three of these directories define:
  • the Satellite Tobacco Mosaic Virus (STMV) that comprises 1,066,628 atoms
  • the f1 ATPase enzyme that comprises 327,506 atoms
  • the ApoA1 enzyme that comprises 92,224 atoms
Each input directory also specifies the type of molecular dynamics simulation to be performed, for example, Langevin dynamics with a 12 Angstrom cutoff for 500 time steps, or particle mesh Ewald dynamics with an 11 Angstrom cutoff for 500 time steps.

Key Points and Best Practices

Models with large numbers of atoms scale better than models with small numbers of atoms.

The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide cpu overclocking which increases the processor frequency from 2.93GHz to 3.33GHz. This feature was was enabled when generating the results reported here.

See Also

Disclosure Statement

NAMD, see http://www.ks.uiuc.edu/Research/namd/performance.html for more information, results as of 11/17/2009.

Friday Jul 10, 2009

Significance of Results

Sun and Microsoft combined to deliver World Record price performance for Windows based results on the TPC-H benchmark at the 300GB scale factor. Using Microsoft's SQL Server 2008 Enterprise database along with Microsoft Windows Server 2008 operating system on the Sun Fire X4600 M2 server, the result of 2.80 $/QphH@300GB (USD) was delivered.

  • The Sun Fire X4600 M2 provides World Record price-performance of 2.80 $/QphH@300GB (USD) among Windows based TPC-H results at the 300GB scale factor. This result is 14% better price performance than the HP DL785 result.
  • The Sun Fire X4600 M2 trails HP's World Record single system performance (HP: 57,684 QphH@300GB, Sun: 55,185 QphH@300GB) by less than 5%.
  • The Sun/SQL Server solution used fewer disks for the database (168) than the other top performance leaders @300GB.
  • IBM required 79% more disks (300 total) than Sun to get a result of 46,034 QphH@300GB which is 20% below Sun's QphH.
  • HP required 21% more disks (204 total) than Sun to achieve a result of 3.24 $/QphH@300GB (USD) which is 16% worse than Sun's price performance.

This is Sun's first published TPC-H SQL Server benchmark.

Performance Landscape

ch/co/th = chips, cores, threads
$/QphH = TPC-H Price/Performance metric (smaller is better)

System ch/co/th Processor Database QphH $/QphH Price Disks Available
Sun Fire X4600 M2 8/32/32 2.7 Opteron 8384 SQL Server 2008 55,158 2.80 $154,284 168 07/06/09
HP DL785 8/32/32 2.7 Opteron 8384 SQL Server 2008 57,684 3.24 $186,700 204 11/17/08
IBM x3950 M2 8/32/32 2.93 Intel X7350 SQL Server 2005 46,034 5.40 $248,635 300 03/07/08

Complete benchmark results may be found at the TPC benchmark website http://www.tpc.org.

Results and Configuration Summary

Server:

    Sun Fire X4600 M2 with:
      8 x AMD Opteron 8384, 2.7 GHz QC processors
      256 GB memory
      3 x 73GB (15K RPM) internal SAS disks

Storage:

    14 x Sun Storage J4200 each consisting of 12 x 146GB 15,000 RPM SAS disks

Software:

    Operating System: Microsoft Windows Server 2008 Enterprise x64 Edition SP1
    Database Manager: SQL Server 2008 Enterprise x64 Edition SP1

Audited Results:

    Database Size: 300GB (Scale Factor)
    TPC-H Composite: 55,157.5 QphH@300GB
    Price/performance: $2.80 / QphH@300GB (USD)
    Available: July 6, 2009
    Total 3 Year Cost: $154,284.19 (USD)
    TPC-H Power: 67,095.6
    TPC-H Throughput: 45,343.5
    Database Load Time: 17 hours 29 minutes
    Storage Ratio: 76.82

Benchmark Description

The TPC-H benchmark is a performance benchmark established by the Transaction Processing Council (TPC) to demonstrate Data Warehousing/Decision Support Systems (DSS). TPC-H measurements are produced for customers to evaluate the performance of various DSS systems. These queries and updates are executed against a standard database under controlled conditions. Performance projections and comparisons between different TPC-H Database sizes (100GB, 300GB, 1000GB, 3000GB and 10000GB) are not allowed by the TPC.

TPC-H is a data warehousing-oriented, non-industry-specific benchmark that consists of a large number of complex queries typical of decision support applications. It also includes some insert and delete activity that is intended to simulate loading and purging data from a warehouse. TPC-H measures the combined performance of a particular database manager on a specific computer system.

The main performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@SF, where SF is the number of GB of raw data, referred to as the scale factor). QphH@SF is intended to summarize the ability of the system to process queries in both single and multi user modes. The benchmark requires reporting of price/performance, which is the ratio of QphH to total HW/SW cost plus 3 years maintenance. A secondary metric is the storage efficiency, which is the ratio of total configured disk space in GB to the scale factor.

Key Points and Best Practices

SQL Server 2008 is able to take advantage of the lower latency local memory access provides on the Sun Fire 4600 M2 server. This was achieved by setting the NUMA initialization parameter to enable all NUMA optimizations.

Enabling the Windows large-page feature provided a significant performance improvement. Because SQL Server 2008 manages its own memory buffer, the use of large-pages resulted in significant performance increase. Note that to use large-pages, an application must be part of the large-page group of the OS (Windows).

The 64-bit Windows OS and 64-bit SQL Server software were able to utilize the 256GB of memory available on the Sun Fire 4600 M2 server.

See Also

Disclosure Statement

TPC-H@300GB: Sun Fire X4600 M2 55,158 QphH@300GB, $2.80/QphH@300GB, availability 7/6/09; HP DL785, 57,684 QphH@300GB, $3.24/QphH@300GB, availability 11/17/08; IBM x3950 M2, 46,034 QphH@300GB, $5.40/QphH@300GB, availability 03/07/08; TPC-H, QphH, $/QphH tm of Transaction Processing Performance Council (TPC). More info www.tpc.org.

Tuesday Jun 30, 2009

Significance of Results

A Sun Blade 6048 chassis with 12 Sun Blade X6275 server modules ran benchmarks using the NAMD molecular dynamics applications software. NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD was developed by the Theoretical and Computational Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. NAMD is driven by major trends in computing and structural biology and received a 2002 Gordon Bell Award.
  • The cluster of 12 Sun Blade X6275 server modules was 6.2x faster than 256 processor configuration of the IBM BlueGene/L.
  • The cluster of 12 Sun Blade X6275 server modules exhibited excellent scalability for NAMD molecular dynamics simulation, up to 10.4x speedup for 12 blades relative to 1 blade.
  • For largest molecule considered, the cluster of 12 Sun Blade X6275 server modules achieved a throughput of 0.094 seconds per simulation step.
Molecular dynamics simulation is important to biological and materials science research. Molecular dynamics is used to determine the low energy conformations or shapes of a molecule. These conformations are presumed to be the biologically active conformations.

Performance Landscape

The NAMD Performance Benchmarks web page plots the performance of NAMD when the ApoA1 benchmark is executed on a variety of clusters. The performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation, multiplied by the number of "processors" on which NAMD executes in parallel. The following table compares the performance of NAMD version 2.6 when executed on the Sun Blade X6275 cluster to the performance of NAMD as reported for several of the clusters on the web page. In this table, the performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation, however, not multiplied by the number of "processors". A smaller number implies better performance.
Cluster Name and Interconnect Throughput for 128 Cores
(seconds per step)
Throughput for 192 Cores
(seconds per step)
Throughput for 256 Cores
(seconds per step)
Sun Blade X6275 InfiniBand 0.013 0.010
Cambridge Xeon/3.0 InfiniPath 0.016
0.0088
NCSA Xeon/2.33 InfiniBand 0.019
0.010
AMD Opteron/2.2 InfiniPath 0.025
0.015
IBM HPCx PWR4/1.7 Federation 0.039
0.021
SDSC IBM BlueGene/L MPI 0.108
0.062

The following tables report results for NAMD molecular dynamics using a cluster of Sun Blade X6275 server modules. The performance of the cluster is expressed in terms of the time in seconds that is required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Blades Cores STMV molecule (1) f1 ATPase molecule (2) ApoA1 molecule (3)
Thruput
(secs/ step)
spdup effi'cy Thruput
(secs/ step)
spdup effi'cy Thruput
(secs/ step)
spdup effi'cy
12 192 0.0941 10.6 88% 0.0270 9.1 76% 0.0102 8.1 68%
8 128 0.1322 7.5 94% 0.0317 7.7 97% 0.0131 6.3 79%
4 64 0.2656 3.7 94% 0.0610 4.0 101% 0.0204 4.1 102%
1 16 0.9952 1.0 100% 0.2454 1.0 100% 0.0829 1.0 100%

spdup - speedup versus 1 blade result
effi'cy - speedup efficiency versus 1 blade result

(1) Synthetic Tobacco Mosaic Virus (STMV) molecule, 1,066,628 atoms, 12 Angstrom cutoff, Langevin dynamics, 500 time steps
(2) f1 ATPase molecule, 327,506 atoms, 11 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
(3) ApoA1 molecule, 92,224 atoms, 12 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps

Results and Configuration Summary

Hardware Configuration

  • Sun Blade[tm] 6048 Modular System with one shelf configured with
    • 12 x Sun Blade X6275, each with
      • 2 x (2 x 2.93 GHz Intel QC Xeon X5570 processors)
      • 2 x (24 GB memory)
      • Hyper-Threading (HT) off, Turbo Mode on

Software Configuration

  • SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
  • Scali MPI 5.6.6
  • gcc 4.1.2 (1/15/2007), gfortran 4.1.2 (1/15/2007)

Key Points and Best Practices

  • Models with large numbers of atoms scale better than models with small numbers of atoms.

About the Sun Blade X6275

The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide cpu overclocking which increases the processor frequency from 2.93GHz to 3.2GHz. This feature was was enabled when generating the results reported here.

Benchmark Description

Molecular dynamics simulation is widely used in biological and materials science research. NAMD is a public-domain molecular dynamics software application for which a variety of molecular input directories are available. Three of these directories define:
  • the Synthetic Tobacco Mosaic Virus (STMV) that comprises 1,066,628 atoms
  • the f1 ATPase enzyme that comprises 327,506 atoms
  • the ApoA1 enzyme that comprises 92,224 atoms
Each input directory also specifies the type of molecular dynamics simulation to be performed, for example, Langevin dynamics with a 12 Angstrom cutoff for 500 time steps, or particle mesh Ewald dynamics with an 11 Angstrom cutoff for 500 time steps.

See Also

Disclosure Statement

NAMD, see http://www.ks.uiuc.edu/Research/namd/performance.html for more information, results as of 6/26/2009.

Tuesday Jun 23, 2009

Significance of Results

A Sun Constellation system, composed of 48 Sun Blade X6440 server modules in a Sun Blade 6048 chassis, running OpenSolaris 2008.11 and using the Sun Studio 12 Update 1 compiler delivered World Record SPEC CPU2006 rate results.

On the SPECint_rate_base2006 benchmark, Sun delivered 4.7 times more performance than the IBM power 595 (5GHz POWER6); this IBM system requires a slightly larger cabinet than the Sun Blade 6048 chassis (details below). 

On the SPECfp_rate_base2006 benchmark Sun delivered 3.9 times more performance than the largest IBM power 595 (5GHz POWER6); this IBM system requires a slightly larger cabinet than the Sun Blade 6048 chassis (details below).

  • The Sun Constellation System equipped with AMD Opteron QC 8384 2.7 GHz processors, running OpenSolaris 2008.11 and using the Sun Studio 12 update 1 compiler, delivered the World Record SPECint_rate_base2006 score of 8840.
  • This SPECint_rate_base2006 score beat the previous record holding score by over three times.
  • The Sun Constellation System equipped with AMD Opteron QC 8384 2.7 GHz processors, running OpenSolaris 2008.11 and using the Sun Studio 12 update 1 compiler, delivered the fastest x86 SPECfp_rate_base2006 score of 6500.
  • This SPECfp_rate_base2006 score beat the previous x86 record holding score by nine times.

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results.

SPECint_rate2006

System Processors Performance Results Notes (1)
Type GHz Chips Cores Peak Base
Sun Blade 6048 Opteron 8384 2.7 192 768
8840 New Record
SGI Altix 4700 Density System Itanium 9150M 1.66 128 256 3354 2893 Previous Best
SGI Altix 4700 Bandwidth System Itanium2 9040 1.6 128 256 2971 2715
Fujitsu/Sun SPARC Enterprise M9000 SPARC64 VII 2.52 64 256 2290 2090
IBM Power 595 POWER6 5.0 32 64 2160 1870 Best POWER6

(1) Results as of 23 June 2009 from www.spec.org.

SPECfp_rate2006

System Processors Performance Results Notes (2)
Type GHz Chips Cores Peak Base
SGI Altix 4700 Density System Itanium 9140M 1.66 512 1024
10580
Sun Blade 6048 Opteron 8384 2.7 192 768
6500 New x86 Record
SGI Altix 4700 Bandwidth System Itanium2 9040 1.6 128 256 3507 3419
IBM Power 595 POWER 6 5.0 64 32 2184 1681 Best POWER6
Fujitsu/Sun SPARC Enterprise M9000 SPARC64 VII 2.52 64 256 2005 1861
SGI Altix 4700 Bandwidth System Itanium 9150M 1.66 128 256 1947 1832
SGI Altix ICE 8200EX Intel X5570 2.93 8 32 742 723

(2) Results as of 23 June 2009 from www.spec.org.

(2) Results as of 23 June 2009 from www.spec.org.

Results and Configuration Summary

Hardware Configuration:
    1 x Sun Blade 6048
      48 x Sun Blade X6440, each with
        4 x 2.7 GHz QC AMD Opteron 8384 processors
        32 GB, (8 x 4GB)

Software Configuration:

    O/S: OpenSolaris 2008.11
    Compiler: Sun Studio 12 Update 1
    Other SW: MicroQuill SmartHeap Library 9.01 x64
    Benchmark: SPEC CPU2006 V1.1

Key Points and Best Practices

The Sun Blade 6048 chassis is able to contain a variety of server modules. In this case, the Sun Blade X6440 was used to provide this capacity solution. This single rack delivered results which have not been seen in this form factor.

To run this many jobs, the benchmark requires a reasonably good file server where the benchmark is run. The Sun Fire X4540 server was used to provide the disk space required being accessed by NFS by the blades.

Sun has shown 4.7x greater SPECint_rate_base2006 and 3.9x greater SPECfp_rate_base2006 in a slightly smaller cabinet. IBM specifications are at: http://www-03.ibm.com/systems/power/hardware/595/specs.html. One frame (slimline doors): 79.3"H x 30.5"W x 58.5"D weight: 3,376 lb. One frame (acoustic doors): 79.3"H x 30.5"W x 71.1"D weight: 3,422 lb. The Sun Blade 6048 specifications are at: http://www.sun.com/servers/blades/6048chassis/specs.xml One Sun Blade 6048: 81.6"H x 23.9"W x 40.3"D weight: 2,300 lb (fully configured). 

Disclosure Statement:

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 6/22/2009 and this report. Sun Blade 6048 chassis with Sun Blade X6440 server modules (48 nodes with 4 chips, 16 cores, 16 threads each, OpenSolaris 2008.11, Studio 12 update 1) - 8840 SPECint_rate_base2006, 6500 SPECfp_rate_base2006; IBM p595, 1870 SPECint_rate_base2006, 1681 SPECfp_rate_base2006.

See Also

Tuesday Jun 23, 2009

A new and exceptional TPC-H result submitted today has been obtained on a cluster of 43 Sun Fire X4540 servers, each equipped  with two AMD Opteron 2356 2.3 GHz processors, running ParAccel Analytic Database on Sun OpenSolaris 2009.06. The Sun/ParAccel Cluster achieved a result of 1,050,556.20 QphH @30000GB with a price performance of $2.86/QphH @30000GB.

This is an incredible World Record for both performance and price-performance at the largest TPC-H Scale Factor (30TB) to date.

As of today, the only other 30TB result posted is on a single HP Superdome, powered by 64 x 1.6 GHz Itanium2 Dual Core processors running Oracle 10gR2. The HP result is 150,960 QphH @30000GB with a price-performance of $46.69/QphH @30000GB.

This result establishes the overall leadership of the Sun/ParAccel/OpenSolaris cluster solution in Decision Support Systems (DSS) and Data Warehousing.

  • The Sun Fire X4540 / ParAccel Cluster was over seven time (7x) faster than the HP Superdome and had sixteen times (16x) better price-performance. In addition, the total cost of the Sun/ParAccel configuration (H/W + S/W + 3 years maintenance) is less than half of the total cost of the HP/Oracle configuration.

  • The Sun Fire X4540 cluster storage consisted entirely of fully mirrored internal drives. There were almost 1000 fewer disk spindles than the HP Superdome solution (2064 vs. 3072 disks), resulting in an enormous reduction of hardware logistics, at a fraction of the floor space (172 RU vs 1120 RU).

  • This solution is one of the TPC-H new-generation DSS DBMS (column based, shared nothing, data compression, etc.) results. It is noteworthy that all of the other new generation TPC-H submissions (at 100GB up to 3000GB) ran queries entirely from memory. This new result is disk based and thus establishes the leadership and viability of the Sun/ParAccel/OpenSolaris solution on shared nothing clusters for very large disk based databases -- much larger than memory sizes realistically available even in extremely large database installations.

  • There are a number of new generation DBMS designed for Decision Support such as ParAccel, either currently for sale or still under development, all implemented on Linux. This result is the first public proof point of a new generation data warehousing product running on Solaris, more specifically OpenSolaris.

  • The load time of the 30TB database on the Sun/ParAccel cluster was 4 times faster than the HP Superdome solution. For large DSS databases, load time is a very important factor.

Performance Landscape

ch/co/th = chips, cores, threads
$/QphH = TPC-H Price/Performance metric (smaller is better)
QphH = TPC-H Composite Metric (bigger is better)


System
ch/co/th Database QphH $/QphH Price # Disks Available
  43 x Sun Fire X4275 86/344/344 PADB 1,050,566 2.86 $3,006,861 2064
06/21/09
  1 x HP Superdome 64/128/128 Oracle 150,960 46.69 $7,048,342 3072
06/18/07

Complete benchmark results may be found at the TPC benchmark website http://www.tpc.org.

Results and Configuration Summary

Servers:

    43 X Sun Fire X4540 each with:
      2 x AMD Opteron 2356, 2.3 GHz QC processors
      64 GB memory
      48 x 500GB (7,200 RPM) internal SATA disks
    86 total processors
    344 total processor cores
    344 total processor threads

Storage:

    No external storage

Switches:

    3 x 48 port Cisco 3750 + 4 x Cisco 3750 24 port 1Gb Ethernet Switches

Software:

    Operating System: OpenSolaris 2009.06
    Database Manager: ParAccel PADB

Audited Results:

    Database Size: 30,000 GB (Scale Factor)
    TPC-H Composite: 1,050,566.20 QphH@30000GB
    Price/performance: $2.86 / QphH@30000GB
    Available: June 21, 2009
    Total 3 Year Cost: $3,006,861
    TPC-H Power: 1,326,910.40
    TPC-H Throughput: 831,758.00
    Database Load Time: ~3 Hours 29 minutes
    Storage Ratio: 32.04

Benchmark Description

The TPC-H benchmark is a performance benchmark established by the Transaction Processing Council (TPC) to demonstrate Data Warehousing/Decision Support Systems (DSS). TPC-H measurements are produced for customers to evaluate the performance of various DSS systems. These queries and updates are executed against a standard database under controlled conditions. Performance projections and comparisons between different TPC-H Database sizes (100GB, 300GB, 1000GB, 3000GB and 10000GB) are not allowed by the TPC.

TPC-H is a data warehousing-oriented, non-industry-specific benchmark that consists of a large number of complex queries typical of decision support applications. It also includes some insert and delete activity that is intended to simulate loading and purging data from a warehouse. TPC-H measures the combined performance of a particular database manager on a specific computer system.

The main performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@SF, where SF is the number of GB of raw data, referred to as the scale factor). QphH@SF is intended to summarize the ability of the system to process queries in both single and multi user modes. The benchmark requires reporting of price/performance, which is the ratio of QphH to total HW/SW cost plus 3 years maintenance. A secondary metric is the storage efficiency, which is the ratio of total configured disk space in GB to the scale factor.

Key Technical Points

ParAccel PADB is one of a new generation of DBMS designed specifically for Decision Support and Data Warehousing applications.The Sun Fire X4540 and OpenSolaris2009.06 are a perfect match for the PADB solution. The Sun Fire X4540 with its large amount of internal storage in a compact form factor and OpenSolaris with ISM shared memory management, network performance and powerful Dtrace performance analysis tools.
Below are the main architectural features of the ParAccel product:

Shared Nothing Architecture

Shared nothing is the most optimal hardware architecture for highly parallel database operations in DSS environments. The inherent divide and conquer approach of distributing data over many nodes proportionally reduces the amount of work each node must do and thus has the potential for near linear scalability.

Column Based Physical Storage

Relational tables can be physically stored on disk in a row oriented fashion, or in a column oriented fashion. In the row oriented option, all columns of each row are stored contiguously on disk. By contrast, the column oriented option stores all the values of each column contiguously on disk. The choice of row store vs. column store may at first glance seem arbitrary, but in fact has profound consequences on the amount of I/O bandwidth, memory bandwidth and CPU requirements necessary for processing various types of queries.

Aggressive Data Compression

There are dozens of known techniques for storing data in a manner requiring fewer bytes than the original plain form of the data. The techniques are referred to as data compression algorithms. ParAccel uses several very effective data compression techniques. Compression is beneficial for query processing in that it reduces the amount of data that needs to be read from disk, and the amount of main memory space needed for processing the data. Both of these characteristics lead to query processing efficiencies and cost efficiencies.

Low Cost Servers and Interconnects

The ParAccel software does not require expensive proprietary hardware. Shared nothing clusters of small and low cost systems can provide adequate memory for aggressively compressed database engines to achieve performance levels far above the levels achievable by conventional database engines. In addition, the software does not require expensive special networking infrastructure but instead provides excellent performance just running on standard GbE equipment.

See Also

Disclosure Statement

TPC-H@30000GB Sun Fire X4540 1,050,566 QphH@30000GB, $2.86/QphH@30000GB, availability 6/21/09. TPC-H, HP Integrity Superdome, 150,960 QphH @30000GB, $46.69 / QphH @30000GB, availability 06/18/07, QphH, $/QphH tm of Transaction Processing Performance Council (TPC). More info www.tpc.org.

Monday Jun 15, 2009

Significance of Results

  • World Record performance result with 8 processors on the two-tier SAP ERP 6.0 enhancement pack 4 (unicode) standard sales and distribution (SD) benchmark as of June 10, 2009.
  • The Sun Fire X4600 M2 server with 8 AMD Opteron 8384 SE processors (32 cores, 32 threads) achieved 6,050 SAP SD Benchmark users running SAP ERP application release 6.0 enhancement pack 4 benchmark with unicode software, using MaxDB 7.8 database and Solaris 10 OS.
  • This benchmark result highlights the optimal performance of SAP ERP on Sun Fire servers running the Solaris OS and the seamless multilingual support available for systems running SAP applications.
  • ZFS is used in this benchmark for its database and log files.
  • The Sun Fire X4600 M2 server beats both the HP ProLiant DL785 G5 and the NEC Express5800 running Windows by 10% and 35% respectively even though all three systems use the same number of processors.
  • In January 2009, a new version, the Two-tier SAP ERP 6.0 Enhancement Pack 4 (Unicode) Standard Sales and Distribution (SD) Benchmark, was released. This new release has higher cpu requirements and so yields from 25-50% fewer users compared to the previous Two-tier SAP ERP 6.0 (non-unicode) Standard Sales and Distribution (SD) Benchmark. 10-30% of this is due to the extra overhead from the processing of the larger character strings due to Unicode encoding.  Refer to SAP Note for more details (https://service.sap.com/sap/support/notes/1139642 Note: User and password for SAP Service Marketplace required).

  • Unicode is a computing standard that allows for the representation and manipulation of text expressed in most of the world's writing systems. Before the Unicode requirement, this benchmark used ASCII characters meaning each was just 1 byte. The new version of the benchmark requires Unicode characters and the Application layer (where ~90% of the cycles in this benchmark are spent) uses a new encoding, UTF-16, which uses 2 bytes to encode most characters (including all ASCII characters) and 4 bytes for some others. This requires computers to do more computation and use more bandwidth and storage for most character strings. Refer to the above SAP Note for more details.

Performance Landscape

SAP-SD 2-Tier Performance Table (in decreasing performance order).

SAP ERP 6.0 Enhancement Pack 4 (Unicode) Results
(New version of the benchmark as of January 2009)

System OS
Database
Users SAP
ERP/ECC
Release
SAPS SAPS/
Proc
Date
Sun Fire X4600 M2
8xAMD Opteron 8384 SE @2.7GHz
256 GB
Solaris 10
MaxDB 7.8
6,050 2009
6.0 EP4
(Unicode)
33,230 4,154 10-Jun-09
HP ProLiant DL785 G5
8xAMD Opteron 8393 SE @3.1GHz
128 GB
Windows Server 2008
Enterprise Edition
SQL Server 2008
5,518 2009
6.0 EP4
(Unicode)
30,180 3,772 24-Apr-09
NEC Express 5800
8xIntel Xeon X7460 @2.66GHz
256 GB
Windows Server 2008
Datacenter Edition
SQL Server 2008
4,485 2009
6.0 EP4
(Unicode)
25,280 12,640 09-Feb-09
Sun Fire X4270
2xIntel Xeon X5570 @2.93GHz
48 GB
Solaris 10
Oracle 10g
3,700 2009
6.0 EP4
(Unicode)
20,300 10,150 30-Mar-09

SAP ERP 6.0 (non-unicode) Results
(Old version of the benchmark retired at the end of 2008)

System OS
Database
Users SAP
ERP/ECC
Release
SAPS SAPS/
Proc
Date
Sun Fire X4600 M2
8xAMD Opteron 8384 @2.7GHz
128 GB
Solaris 10
MaxDB 7.6
7,825 2005
6.0
39,270 4,909 09-Dec-08
IBM System x3650
2xIntel Xeon X5570 @2.93GHz
48 GB
Windows Server 2003 EE
DB2 9.5
5,100 2005
6.0
25,530 12,765 19-Dec-08
HP ProLiant DL380 G6
2xIntel Xeon X5570 @2.93GHz
48 GB
Windows Server 2003 EE
SQL Server 2005
4,995 2005
6.0
25,000 12,500 15-Dec-08

Complete benchmark results may be found at the SAP benchmark website http://www.sap.com/benchmark.

Results and Configuration Summary

Hardware Configuration:

    One, Sun Fire X4600 M2
      8 x 2.7 GHz AMD Opteron 8384 SE processors (8 processors / 32 cores / 32 threads)
      256 GB memory
      3 x STK2540, 3 x STK2501 each with 12 x 146GB/15KRPM disks

Software Configuration:

    Solaris 10
    SAP ECC Release: 6.0 Enhancement Pack 4 (Unicode)
    MaxDB 7.8

Certified Results

    Performance:
    6050 benchmark users
    SAP Certification:
    2009022

Key Points and Best Practices

  • This is the best 8 Processor SAP ERP 6.0 EP4 (Unicode) result as of June 10, 2009.
  • Two-tier SAP ERP 6.0 Enhancement Pack 4 (Unicode) Standard Sales and Distribution (SD) Benchmark on Sun Fire X4600 M2 (8 processors, 32 cores, 32 threads, 8x2.7 GHz AMD Opteron 8384 SE) was able to support 6,050 SAP SD Users on top of the Solaris 10 OS.
  • Since random writes are an important part of this benchmark, we used zfs to help coalesce those into sequential writes.

Benchmark Description

The SAP Standard Application SD (Sales and Distribution) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing, and demonstrates the ability to run both the application and database software on a single system. The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments.

SAP is one of the premier world-wide ERP application providers, and maintains a suite of benchmark tests to demonstrate the performance of competitive systems on the various SAP products.

Disclosure Statement

Two-tier SAP Sales and Distribution (SD) standard SAP ERP 6.0 2005/EP4 (Unicode) application benchmarks as of 06/10/09: Sun Fire X4600 M2 (8 processors, 32 cores, 32 threads) 6,050 SAP SD Users, 8x 2.7 GHz AMD Opteron 8384 SE, 256 GB memory, MaxDB 7.8, Solaris 10, Cert# 2009022. HP ProLiant DL785 G5 (8 processors, 32 cores, 32 threads) 5,518 SAP SD Users, 8x 3.1 GHz AMD Opteron 8393 SE, 128 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009009. NEC Express 5800 (8 processors, 48 cores, 48 threads) 4,485 SAP SD Users, 8x 2.66 GHz Intel Xeon X7460, 256 GB memory, SQL Server 2008, Windows Server 2008 Datacenter Edition, Cert# 2009001. Sun Fire X4270 (2 processors, 8 cores, 16 threads) 3,700 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, Oracle 10g, Solaris 10, Cert# 2009005. Sun Fire X4600M2 (8 processors, 32 cores, 32 threads) 7,825 SAP SD Users, 8x 2.7 GHz AMD Opteron 8384, 128 GB memory, MaxDB 7.6, Solaris 10, Cert# 2008070. IBM System x3650 M2 (2 Processors, 8 Cores, 16 Threads) 5,100 SAP SD users,2x 2.93 Ghz Intel Xeon X5570, DB2 9.5, Windows Server 2003 Enterprise Edition, Cert# 2008079. HP ProLiant DL380 G6 (2 processors, 8 cores, 16 threads) 4,995 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, SQL Server 2005, Windows Server 2003 Enterprise Edition, Cert# 2008071.

SAP, R/3, reg TM of SAP AG in Germany and other countries. More info www.sap.com/benchmark

Thursday Jun 11, 2009

Sun has demonstrated the first large scale grid validation for the SAS Grid Computing 9.2 benchmark. This workload showed both the strength of Solaris 10 utilizing containers for ease of deployment, as well as the Sun Storage 7410 Unified Storage Systems and Fishworks Analytics in analyzing, tuning and delivering performance in complex SAS grid data-intensive multi-node environments.

In order to model the real world, the Grid Endurance Test uses large data sizes and complex data processing.  Results demonstrate real customer scenarios and results.  These benchmark results represent significant engineering effort, collaboration and coordination between SAS and Sun. The results also illustrate the commitment of the two companies to provide the best solutions for the most demanding data integration requirements.

  • A combination of 7 Sun Fire X2200 M2 servers utilizing Solaris 10 and a Sun Storage 7410 Unified Storage System showed continued performance improvement as the node count increased from 2 to 7 nodes for the Grid Endurance Test.
  • SAS environments are often complex. Ease of deployment, configuration, use, and ability to observe application IO characteristics (hotspots, trouble areas) are critical for production environments. The power of Fishworks Analytics combined with the reliability of ZFS is a perfect fit for these types of applications.
  • Sun Storage 7410 Unified Storage System (exporting via NFS) satisfied performance needs, throughput peaking at over  900MB/s (near 10GbE line speed) in this multi-node environment.
  • Solaris 10 Containers were used to create agile and flexible deployment environments. Container deployments were trivially migrated (within minutes) as HW resources became available (Grid expanded).
  • This result is the only large scale grid validation for the SAS Grid Computing 9.2, and the first and most timely qualification of OpenStorage for SAS.
  • The test show a delivered throughput through client 1Gb connection of over 100MB/s.

Configuration

The test grid consisted of 8x Sun Fire x2200 M2 servers, 1 configured as the grid manager, 7 as the actual grid nodes.  Each node had a 1GbE connection through a Brocade FastIron 1GbE/10GbE switch.  The 7410 had a 10GbE connection to the switch and sat as the back end storage providing a common shared file system to all nodes which SAS Grid Computing requires.  A storage appliance like the 7410 serves as an easy to setup and maintain solution, satisfying the bandwidth required by the grid.  Our particular 7410 consisted of 46 700GB 7200RPM SATA drives, 36GB of write optimized SSD's and 300GB of Read optimized SSD's.


About the Test

The workload is a batch mixture.  CPU bound workloads are numerically intensive tests, some using tables varying in row count from  9,000 to almost 200,000.  The tables have up to 297 variables, and are processed with both stepwise linear regression and stepwise logistic regression.   Other computational tests use GLM (General Linear Model).  IO intensive jobs vary as well.  One particular test reads raw data from multiple files, then generates 2 SAS data sets, one containing over 5 million records, the 2nd over 12 million.  Another IO intensive job creates a 50 million record SAS data set, then subsequently does lookups against it and finally sorts it into a dimension table.   Finally, other jobs are both compute and IO intensive.

 The SAS IO pattern for all these jobs is almost always sequential, for read, write, and mixed access, as can be viewed via Fishworks Analytics further below.  The typical block size for IO is 32KB. 

Governing the batch is the SAS Grid Manager Scheduler,  Platform LSF.  It determines when to add a job to a node based on number of open job slots (user defined), and a point in time sample of how busy the node actually is.  From run to run, jobs end up scheduled randomly making runs less predictable.  Inevitably, multiple IO intensive jobs will get scheduled on the same node, throttling the 1Gb connection, creating a bottleneck while other nodes do little to no IO.  Often this is unavoidable due to the great variety in behavior a SAS program can go through during its lifecycle.  For example, a program can start out as CPU intensive and be scheduled on a node processing an IO intensive job.  This is the desired behavior and the correct decision based on that point in time.  However, the intially CPU intensive job can then turn IO intensive as it proceeds through its lifecycle.


Results of scaling up node count

Below is a chart of results scaling from 2 to 7 nodes.  The metric is total run time from when the 1st job is scheduled, until the last job is completed.

Scaling of 400 Analytics Batch Workload
Number of Nodes Time to Completion
2 6hr 19min
3 4hr 35min
4 3hr 30min
5 3hr 12min
6 2hr 54min
7 2hr 44min

One may note that time to completion is not linear as node count scales upwards.   To a large extent this is due to the nature of the workload as explained above regarding 1Gb connections getting saturated.  If this were a highly tuned benchmark with jobs placed with high precision, we certainly could have improved run time.  However, we did not do this in order to keep the batch as realistic as possible.  On the positive side, we do continue to see improved run times up to the 7th node.

The Fishworks Analytics displays below show several performance statistics with varying numbers of nodes, with more nodes on the left and fewer on the right.  The first two graphs show file operations per second, and the third shows network bytes per second.   The 7410 provides over 900 MB/sec in the seven-node test.  More information about the interpretation of the Fishworks data for these test will be provided in a later white paper.

An impressive part is in the Fishworks Analytics shot above, throughput of 763MB/s was achieved during the sample period.  That wasn't the top end of the what 7410 could provide.  For the tests summarized in the table above, the 7 node run peaked at over 900MB/s through a single 10GbE connection.  Clearly the 7410 can sustain a fairly high level of IO.

It is also important to note that while we did try to emulate a real world scenario with varying types of jobs and well over 1TB of data being manipulated during the batch, this is a benchmark.  The workload tries to encompass a large variety of job behavior.  Your scenario may vary quite differently from what was run here.  Along with scheduling issues, we were certainly seeing signs of pushing this 7410 configuration near its limits (with the SAS IO pattern and data set sizes), which also affected the ability to achieve linear scaling .  But many grid environments are running workloads that aren't very IO intensive and tend to be more CPU bound with minimal IO requirements.  In that scenario one could expect to see excellent node scaling well beyond what was demonstrated by this batch.  To demonstrate this, the batch was run sans the IO intensive jobs.  These jobs do require some IO, but tend to be restricted to 25MB/s or less per process and only for the purpose of initially reading a data set, or writing results.

  • 3 nodes ran in 120 minutes
  • 7 nodes ran in 58 minutes
Very nice scaling - near linear, especially with the lag time that can occur with scheduling batch jobs.  The point of this exercise being, know your workload.  In this case, the 7410 solution on the back end was more than up to the demands these 350+ jobs put on it and there was still room to grow and scale out more nodes further reducing overall run time.


Tuning(?)

The question mark is actually appropriate.  For the achieved results, after configuring a RAID1 share on the 7410, only 1 parameter made a significant difference.  During the IO intensive periods, single 1Gb client throughput was observed at 120MB/s simplex, and 180MB/s duplex - producing well over 100,000 interrupts a second.  Jumbo frames were enabled on the 7410 and clients, reducing interrupts by almost 75% and reducing IO intensive job run time by an average of 12%.  Many other NFS, Solaris, tcp/ip tunings were tried, with no meaningful reduction in microbenchmarks, or the actual batch.  Nice relatively simple (for a grid) setup.

Not a direct tuning but an application change worth mentioning was due to the visibility that Analytics provides.  Early on during the load phase of the benchmark, the IO rate was less than spectacular.  What should have taken about 4.5 hours was going to take almost a day.  Drilling down through analytics showed us that 100,000's of file open/closes were occurring that the development team had been unaware of.  Quickly that was fixed and the data loader ran at expected rates.


Okay - Really no other tuning?  How about 10GbE!

Alright, so there was something else we tried which was outside the test results achieved above.  The x2200 we were using is an 8 core box.  Even when maxing out the 1Gb testing with multiple IO bound jobs, there was still CPU resources left over.  Considering that a higher core count with more memory is becoming more the standard when referencing a "client", it makes sense to utilize all those resources.  In the case where a node would be scheduled with multiple IO jobs, we wanted to see if 10GbE could potentially push up client throughput.  Through our testing, two things helped improve performance.

The first was to turn off interrupt blanking.  With blanking disabled, packets are processed when they arrive as opposed to being processed when an interrupt is issued.  Doing this resulted in a ~15% increase in duplex throughput.  Caveat - there is a reason interrupt blanking exists and it isn't to slow down your network throughput.  Tune this only if you have a decent amount of idle cpu as disabling interrupt blanking will consume it.  The other piece that resulted in a significant increase in throughput through the 10GbE NIC was to use multiple NFS client processes.  We achieved this through zones.  By adding a second zone, throughput through the single 10GbE interface increased ~30%.  The final duplex numbers were (These are also peak throughput).

  • 288MB/s no tuning
  • 337MB/s interrupt blanking disabled
  • 430MB/s 2 NFS client processes + interrupt blanking disabled


Conclusion - what does this show?

  • SAS Grid Computing which requires a shared file system between all nodes, can fit in very nicely on the 7410 storage appliance.  The workload continues to scale while adding nodes.
  • The 7410 can provide very solid throughput peaking at over 900MB/s (near 10GbE linespeed) with the configuration tested. 
  • The 7410 is easy to set up, gives an incredible depth of knowledge about the IO your application does which can lead to optimization. 
  • Know your workload, in many cases the 7410 storage appliance can be a great fit at a relatively inexpensive price while providing the benefits described (and others not described) above.
  • 10GbE client networking can be a help if your 1GbE IO pipeline is a bottleneck and there is a reasonable amount of free CPU overhead.


Additional Reading

Sun.com on Sas Grid Computing and the Sun Storage 7410 Unified Storage Array

Description of Sas Grid Computing


    Wednesday Jun 03, 2009

    A sample of the various Sun and partner technologies to be discussed:
    OpenSolaris, Solaris, Linux, Windows, vmware, gcc, Java, Glassfish, MySQL, Java, Sun-Studio, ZFS, dtrace, perflib, Oracle, DB2, Sybase, OpenStorage, CMT, SPARC64, X64, X86, Intel, AMD

    This blog copyright 2009 by John Henning