Tuesday Nov 24, 2009

The Sun SPARC Enterprise M9000 server (64 processors, 256 cores, 512 threads) set a World Record on the SAP Enhancement Package 4 for SAP ERP 6.0 (Unicode) Standard Sales and Distribution (SD) Benchmark.
  • The Sun SPARC Enterprise M9000 server with 2.88 GHz SPARC64 VII processors achieved 32,000 users on the two-tier SAP Sales and Distribution (SD) standard SAP enhancement package 4 for SAP ERP 6.0 (Unicode) application benchmark.

  • The Sun SPARC Enterprise M9000 server result is 8.6x faster than the only IBM 5GHz POWER6 unicode result, which was published on the IBM p550 using the new SAP Enhancement Package 4 for SAP ERP 6.0 (Unicode) Standard Sales and Distribution (SD) Benchmark.

  • IBM has not submitted any IBM 595 results on the current SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Standard Sales and Distribution (SD) Benchmark. This benchmark has been current for almost a year. IBM p595 systems have only 8x more cores than the IBM System 550.

  • HP has not submitted any Itanium2 results on the new SAP Enhancement Package 4 for SAP ERP 6.0 (Unicode) Standard Sales and Distribution (SD) Benchmark.

  • This new result is 1.84x greater than the previous record result delivered on the Sun SPARC Enterprise M9000 server which used 32 processors.

  • In January 2009, a new version, the Two-tier SAP ERP 6.0 Enhancement Pack 4 (Unicode) Standard Sales and Distribution (SD) Benchmark, was released. This new release has higher CPU requirements and so yields 25-50% fewer users compared to the previous Two-tier SAP ERP 6.0 (non-Unicode) Standard Sales and Distribution (SD) Benchmark. 10-30% of this is due to the extra overhead from the processing of the larger character strings due to Unicode encoding. See SAP Note 1139642 for more details.

  • Unicode is a computing standard that allows for the representation and manipulation of text expressed in most of the world's writing systems. Before the Unicode requirement, this benchmark used ASCII characters meaning each was just 1 byte. The new version of the benchmark requires Unicode characters and the Application layer (where ~90% of the cycles in this benchmark are spent) uses a new encoding, UTF-16, which uses 2 bytes to encode most characters (including all ASCII characters) and 4 bytes for some others. This requires computers to do more computation and use more bandwidth and storage for most character strings. Refer to the above SAP Note for more details.
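
The size difference described above can be checked with a short sketch. It is illustrative only: Python and the sample string are assumptions made for demonstration and are not part of the SAP benchmark kit.

```python
# Illustrative only: compare the storage cost of the same text in a
# single-byte encoding (ASCII) and in UTF-16.

def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))

order_text = "SALES ORDER 4711"  # hypothetical sample string, ASCII characters only

ascii_bytes = encoded_size(order_text, "ascii")
# "utf-16-le" is used so that no byte-order mark is added to the count.
utf16_bytes = encoded_size(order_text, "utf-16-le")

# A character outside the Basic Multilingual Plane needs a surrogate pair.
supplementary_bytes = encoded_size("\U0001F600", "utf-16-le")

print(ascii_bytes, utf16_bytes, supplementary_bytes)
```

Every ASCII character doubles in size under UTF-16, which is where the extra memory bandwidth and storage mentioned above come from.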

Performance Landscape SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Results (in decreasing performance)

(ERP 6.0 EP4 is the current version of the benchmark as of January 2009)

System                                                             OS / Database            Users    SAP ERP/ECC Release      SAPS      Date
Sun SPARC Enterprise M9000, 64 x SPARC64 VII @ 2.88GHz, 1152 GB    Solaris 10 / Oracle10g   32,000   2009 6.0 EP4 (Unicode)   175,600   18-Nov-09
Sun SPARC Enterprise M9000, 32 x SPARC64 VII @ 2.88GHz, 1024 GB    Solaris 10 / Oracle10g   17,430   2009 6.0 EP4 (Unicode)   95,480    12-Oct-09
IBM System 550, 4 x Power6 @ 5GHz, 64 GB                           AIX 6.1 / DB2 9.5        3,752    2009 6.0 EP4 (Unicode)   20,520    16-Jun-09

Complete benchmark results may be found at the SAP benchmark website http://www.sap.com/benchmark.
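
The comparison ratios quoted in the highlights can be reproduced from the table. The sketch below assumes the ratios were computed from the SAPS column (the post does not say which column was used); the dictionary keys are labels chosen here for illustration.

```python
# SAPS values copied from the results table above.
saps = {
    "M9000, 64 processors": 175_600,
    "M9000, 32 processors": 95_480,
    "IBM System 550": 20_520,
}

def saps_ratio(system: str, baseline: str) -> float:
    """How many times more SAPS 'system' delivered than 'baseline'."""
    return saps[system] / saps[baseline]

vs_ibm = saps_ratio("M9000, 64 processors", "IBM System 550")
vs_previous = saps_ratio("M9000, 64 processors", "M9000, 32 processors")

print(f"vs IBM System 550: {vs_ibm:.1f}x")
print(f"vs 32-processor M9000: {vs_previous:.2f}x")
```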

Benchmark Description

The SAP Standard Application SD (Sales and Distribution) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing, and demonstrates the ability to run both the application and database software on a single system. The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments.

SAP is one of the premier world-wide ERP application providers, and maintains a suite of benchmark tests to demonstrate the performance of competitive systems on the various SAP products.

Results and Configuration Summary

Certified Result:

    Number of SAP SD benchmark users:
    32,000
    Average dialog response time:
    0.93 seconds
    Throughput:

    Fully processed order line items/hour:
    3,512,000

    Dialog steps/hour:
    10,536,000

    SAPS:
    175,600
    SAP Certification:
    2009046
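
The SAPS figure follows from the throughput numbers. The sketch below uses SAP's published definition of the unit (100 SAPS = 2,000 fully processed order line items per hour = 6,000 dialog steps per hour), which is not restated in this post; the function names are hypothetical.

```python
# SAP defines 100 SAPS as 2,000 fully processed order line items per hour,
# which is equivalent to 6,000 dialog steps per hour.
LINE_ITEMS_PER_HOUR_PER_SAPS = 2_000 / 100
DIALOG_STEPS_PER_HOUR_PER_SAPS = 6_000 / 100

def saps_from_line_items(line_items_per_hour: float) -> float:
    return line_items_per_hour / LINE_ITEMS_PER_HOUR_PER_SAPS

def saps_from_dialog_steps(dialog_steps_per_hour: float) -> float:
    return dialog_steps_per_hour / DIALOG_STEPS_PER_HOUR_PER_SAPS

# Throughput values from the certified result above.
print(saps_from_line_items(3_512_000))
print(saps_from_dialog_steps(10_536_000))
```

Both conversions give the certified 175,600 SAPS.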

Hardware Configuration:

    Sun SPARC Enterprise M9000
      64 x 2.88GHz SPARC64 VII, 1152 GB memory

Software Configuration:

    Solaris 10
    SAP enhancement package 4 for SAP ERP 6.0 (unicode)
    Oracle10g

Disclosure Statement

Two-tier SAP Sales and Distribution (SD) standard SAP enhancement package 4 for SAP ERP 6.0 (Unicode) application benchmarks as of 11/18/09: Sun SPARC Enterprise M9000 (64 processors, 256 cores, 512 threads) 32,000 SAP SD Users, 64 x 2.88 GHz SPARC64 VII, 1152 GB memory, Oracle10g, Solaris10, Cert# 2009046. Sun SPARC Enterprise M9000 (32 processors, 128 cores, 256 threads) 17,430 SAP SD Users, 32 x 2.88 GHz SPARC64 VII, 1024 GB memory, Oracle10g, Solaris10, Cert# 2009038. IBM System 550 (4 processors, 8 cores, 16 threads) 3,752 SAP SD Users, 4x 5 GHz Power6, 64 GB memory, DB2 9.5, AIX 6.1, Cert# 2009023. Sun SPARC Enterprise M9000 (64 processors, 256 cores, 512 threads) 64 x 2.52 GHz SPARC64 VII, 1024GB memory, 39,100 SD benchmark users, 1.93 sec. avg. response time, Cert#2008042, Oracle 10g, Solaris 10, SAP ECC Release 6.0.

SAP, R/3, reg TM of SAP AG in Germany and other countries. More info www.sap.com/benchmark

Friday Nov 20, 2009

A Sun Blade 6048 Modular System with 16 Sun Blade X6275 Server Modules configured with QDR InfiniBand cluster interconnect delivered outstanding performance running the FLUENT benchmark test suite truck_111m case.

  • A cluster of Sun Blade X6275 server modules with 2.93 GHz Intel X5570 processors achieved leading 32-node performance for the largest truck test case, truck_111m.
  • The Sun Blade X6275 cluster delivered the best performance for the 64-core/8-node, 128-core/16-node, and 256-core/32-node configurations, outperforming the SGI Altix result by as much as 8%.
  • NOTE: These results will not be published on the Fluent website as Fluent has stopped accepting results for this version.

Performance Landscape


FLUENT 12 Benchmark Test Suite - truck_111m
  Results are "Ratings" (bigger is better)
  Rating = No. of sequential runs of test case possible in 1 day = 86,400 sec/(Total Elapsed Run Time in seconds)

System (1)   Cores   Benchmark Test Case truck_111m

Sun Blade X6275, 32 nodes 256 240.0
SGI Altix ICE 8200 IP95, 32 nodes 256 238.9
Intel Whitebox, 32 nodes 256 219.8

Sun Blade X6275, 16 nodes 128 129.6
SGI Altix ICE 8200 IP95, 16 nodes 128 120.8
Intel Whitebox, 16 nodes 128 116.9

Sun Blade X6275, 8 nodes 64 64.6
SGI Altix ICE 8200 IP95, 8 nodes 64 59.8
Intel Whitebox, 8 nodes 64 57.4

(1) Sun Blade X6275, X5570 QC 2.93GHz, QDR
Intel Whitebox, X5560 QC 2.8GHz, DDR
SGI Altix ICE 8200, X5570 QC 2.93GHz, DDR
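
The Rating definition above can be turned around to recover elapsed run times from the table. This is a minimal sketch; the function names are chosen here for illustration and are not part of the FLUENT benchmark suite.

```python
SECONDS_PER_DAY = 86_400

def rating(elapsed_seconds: float) -> float:
    """Number of sequential runs of the test case that fit in one day."""
    return SECONDS_PER_DAY / elapsed_seconds

def elapsed_from_rating(r: float) -> float:
    """Invert the Rating to get total elapsed run time in seconds."""
    return SECONDS_PER_DAY / r

# Ratings taken from the table above.
sun_32_node_seconds = elapsed_from_rating(240.0)
sun_vs_sgi_8_node_pct = (64.6 / 59.8 - 1) * 100

print(f"Sun, 32 nodes: {sun_32_node_seconds:.0f} s per run")
print(f"Sun vs SGI, 8 nodes: {sun_vs_sgi_8_node_pct:.0f}% higher rating")
```

Because the Rating is inversely proportional to run time, a higher Rating at the same node count means a proportionally shorter run.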

Results and Configuration Summary

Hardware Configuration:

    16 x Sun Blade X6275 Server Module ( Dual-Node Blade, 32 nodes ) each node with
      2 x 2.93GHz Intel X5570 QC processors
      24 GB (6 x 4GB, 1333 MHz DDR3 dimms)
      On-board QDR InfiniBand Host Channel Adapters, QNEM

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Interconnect Software: OFED ver 1.4.1
    Shared File System: Lustre ver 1.8.1
    Application: FLUENT V12.0.16
    Benchmark: FLUENT 12 Benchmark Test Suite

Benchmark Description

The benchmark tests are representative of typical large user CFD models intended for execution in distributed memory processor (DMP) mode over a cluster of multi-processor platforms.

Key Points and Best Practices

Observations About the Results

The Sun Blade X6275 cluster delivered excellent performance on the largest Fluent benchmark problem, truck_111m.

The Intel X5570 processors include a turbo boost feature coupled with a speedstep option in the CPU section of the advanced BIOS settings. Under specific circumstances, this can provide CPU upclocking, temporarily increasing the processor frequency from 2.93GHz to 3.2GHz.

Memory placement is a very significant factor with Nehalem processors. Current Nehalem platforms have two sockets. Each socket has three memory channels and each channel has 3 bays for DIMMs. For example, if one DIMM is placed in the 1st bay of each of the 3 channels, the DIMM speed will be 1333 MHz with the X5570. Altering the DIMM arrangement to an unbalanced configuration, say by adding just one more DIMM into the 2nd bay of one channel, will cause the DIMM frequency to drop from 1333 MHz to 1067 MHz.

About the FLUENT 12 Benchmark Test Suite

The FLUENT application performs computational fluid dynamic analysis on a variety of different types of flow and allows for chemically reacting species. transient dynamic and can be linear or nonlinear as far

  • CFD models tend to be very large where grid refinement is required to capture with accuracy conditions in the boundary layer region adjacent to the body over which flow is occurring. Fine grids are also required to determine accurate turbulence conditions. As such, these models can run for many hours or even days and can use a large number of processors.
  • CFD models typically scale very well and are very suited for execution on clusters. The FLUENT 12 benchmark test cases scale well.
  • The memory requirements for the test cases in the FLUENT 12 benchmark test suite range from a few hundred megabytes to about 25 GB. As the job is distributed over multiple nodes the memory requirements per node correspondingly are reduced.
  • The benchmark test cases for the FLUENT module do not have a substantial I/O component. However, performance will be enhanced very substantially by using high performance interconnects such as InfiniBand for inter-node cluster message passing. This nodal message passing data can be stored locally on each node or on a shared file system.

See Also

Current FLUENT 12 Benchmark:
http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.4.x/

Disclosure Statement

All information on the Fluent website is Copyrighted 1995-2009 by Fluent Inc. Results from http://www.fluent.com/software/fluent/fl6bench/ as of November 12, 2009 and this presentation.

Friday Nov 20, 2009

Significance of Results

A Sun Blade 6048 chassis with 48 Sun Blade X6275 server modules ran benchmarks using the NAMD molecular dynamics applications software. NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD is driven by major trends in computing and structural biology and received a 2002 Gordon Bell Award.

  • The cluster of 32 Sun Blade X6275 server modules was 9.2x faster than the 512 processor configuration of the IBM BlueGene/L.

  • The cluster of 48 Sun Blade X6275 server modules exhibited excellent scalability for NAMD molecular dynamics simulation, up to 37.8x speedup for 48 blades relative to 1 blade.

  • For the largest molecule considered, the cluster of 48 Sun Blade X6275 server modules achieved a throughput of 0.028 seconds per simulation step.
Molecular dynamics simulation is important to biological and materials science research. Molecular dynamics is used to determine the low energy conformations or shapes of a molecule. These conformations are presumed to be the biologically active conformations.

Performance Landscape

The NAMD Performance Benchmarks web page plots the performance of NAMD when the ApoA1 benchmark is executed on a variety of clusters. The performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation, multiplied by the number of "processors" on which NAMD executes in parallel. The following table compares the performance of the Sun Blade X6275 cluster to several of the clusters for which performance is reported on the web page. In this table, the performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Cluster Name and Interconnect   Throughput for 128 Cores (seconds per step)   Throughput for 256 Cores (seconds per step)   Throughput for 512 Cores (seconds per step)
Sun Blade X6275 InfiniBand 0.014 0.0073 0.0048
Cambridge Xeon/3.0 InfiniPath 0.016 0.0088 0.0056
NCSA Xeon/2.33 InfiniBand 0.019 0.010 0.008
AMD Opteron/2.2 InfiniPath 0.025 0.015 0.008
IBM HPCx PWR4/1.7 Federation 0.039 0.021 0.013
SDSC IBM BlueGene/L MPI 0.108 0.061 0.044

The following tables report results for NAMD molecular dynamics using a cluster of Sun Blade X6275 server modules. The performance of the cluster is expressed in terms of the time in seconds that is required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

                STMV molecule (1)                     f1 ATPase molecule (2)                ApoA1 molecule (3)
Blades Cores    Thruput (secs/step) spdup effi'cy     Thruput (secs/step) spdup effi'cy     Thruput (secs/step) spdup effi'cy
48 768 0.0277 37.8 79% 0.0075 35.2 73% 0.0039 22.2 46%
36 576 0.0324 32.3 90% 0.0096 27.4 76% 0.0045 19.3 54%
32 512 0.0368 28.4 89% 0.0104 25.3 79% 0.0048 18.1 57%
24 384 0.0481 21.8 91% 0.0136 19.3 80% 0.0066 13.2 55%
16 256 0.0715 14.6 91% 0.0204 12.9 81% 0.0073 11.9 74%
12 192 0.0875 12.0 100% 0.0271 9.7 81% 0.0096 9.1 76%
8 128 0.1292 8.1 101% 0.0337 7.8 98% 0.0139 6.3 79%
4 64 0.2726 3.8 95% 0.0666 4.0 100% 0.0224 3.9 98%
1 16 1.0466 1.0 100% 0.2631 1.0 100% 0.0872 1.0 100%

spdup - speedup versus 1 blade result
effi'cy - speedup efficiency versus 1 blade result

(1) Satellite Tobacco Mosaic Virus (STMV) molecule, 1,066,628 atoms, 12 Angstrom cutoff, Langevin dynamics, 500 time steps
(2) f1 ATPase molecule, 327,506 atoms, 11 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
(3) ApoA1 molecule, 92,224 atoms, 12 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
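
The spdup and effi'cy columns can be recomputed from the throughput column. The sketch below takes speedup as the 1-blade time divided by the N-blade time and efficiency as speedup divided by the number of blades, which matches the definitions given under the table; the function names are chosen here for illustration.

```python
def speedup(one_blade_secs_per_step: float, n_blade_secs_per_step: float) -> float:
    """Speedup versus the 1 blade result."""
    return one_blade_secs_per_step / n_blade_secs_per_step

def efficiency(speedup_value: float, blades: int) -> float:
    """Fraction of ideal (linear) speedup achieved on the given number of blades."""
    return speedup_value / blades

# STMV molecule: 1 blade vs 48 blades, values from the table above.
stmv_speedup = speedup(1.0466, 0.0277)
stmv_efficiency = efficiency(stmv_speedup, 48)

print(f"speedup {stmv_speedup:.1f}, efficiency {stmv_efficiency:.0%}")
```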

Results and Configuration Summary

Hardware Configuration

    48 x Sun Blade X6275, each with
      2 x (2 x 2.93 GHz Intel QC Xeon X5570 (Nehalem) processors)
      2 x (24 GB memory)
      Hyper-Threading (HT) off, Turbo Mode on

Software Configuration

    SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
    OpenMPI 1.3.2
    gcc 4.1.2 (1/15/2007), gfortran 4.1.2 (1/15/2007)

Benchmark Description

Molecular dynamics simulation is widely used in biological and materials science research. NAMD is a public-domain molecular dynamics software application for which a variety of molecular input directories are available. Three of these directories define:
  • the Satellite Tobacco Mosaic Virus (STMV) that comprises 1,066,628 atoms
  • the f1 ATPase enzyme that comprises 327,506 atoms
  • the ApoA1 enzyme that comprises 92,224 atoms
Each input directory also specifies the type of molecular dynamics simulation to be performed, for example, Langevin dynamics with a 12 Angstrom cutoff for 500 time steps, or particle mesh Ewald dynamics with an 11 Angstrom cutoff for 500 time steps.

Key Points and Best Practices

Models with large numbers of atoms scale better than models with small numbers of atoms.

The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking which increases the processor frequency from 2.93GHz to 3.33GHz. This feature was enabled when generating the results reported here.

See Also

Disclosure Statement

NAMD, see http://www.ks.uiuc.edu/Research/namd/performance.html for more information, results as of 11/17/2009.

Thursday Nov 19, 2009

The Sun SPARC Enterprise T5240 server running the Sun Java Messaging server 7.2 achieved a World Record SPECmail2009 result using Sun Storage 7310 Unified Storage System and ZFS file system.  Sun's OpenStorage platforms enable another world record.

  • World record SPECmail2009 benchmark using the Sun SPARC Enterprise T5240 server (two 1.6GHz UltraSPARC T2 Plus), Sun Communications Suite 7, Solaris 10, and the Sun Storage 7310 Unified Storage System achieved 14,500 SPECmail_Ent2009 users at 69,857 Sessions/Hour.

  • This SPECmail2009 benchmark result clearly demonstrates that the Sun Messaging Server 7.2, Solaris 10 and ZFS solution can support a large, enterprise level IMAP mail server environment as a low cost 'Sun on Sun' solution, delivering the best performance and maximizing data integrity and availability of Sun Open Storage and ZFS.

  • The Sun SPARC Enterprise T5240 server supported 2.4 times more users with a 2.4 times better sessions/hour rate than the Apple Xserv3,1 solution on the SPECmail2009 benchmark.

  • There are no IBM Power6 results on this benchmark.

  • The configuration using Sun OpenStorage outperformed all previous results with traditional direct attached storage and significantly higher number of disk devices.

SPECmail2009 Performance Landscape (ordered by performance)

System                                                       Users    Sessions/hour   Disks    OS           Messaging Server
Sun SPARC Enterprise T5240, 2 x 1.6GHz UltraSPARC T2 Plus    14,500   69,857          58 NAS   Solaris 10   CommSuite 7.2, Sun JMS 7.2
Sun SPARC Enterprise T5240, 2 x 1.6GHz UltraSPARC T2 Plus    12,000   57,758          80 DAS   Solaris 10   CommSuite 5, Sun JMS 6.3
Sun Fire X4275, 2 x 2.93GHz Xeon X5570                       8,000    38,348          44 NAS   Solaris 10   Sun JMS 6.2
Apple Xserv3,1, 2 x 2.93GHz Xeon X5570                       6,000    28,887          82 DAS   MacOS 10.6   Dovecot 1.1.14 apple 0.5
Sun SPARC Enterprise T5220, 1 x 1.4GHz UltraSPARC T2         3,600    17,316          52 DAS   Solaris 10   Sun JMS 6.2

Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org

Users - SPECmail_Ent2009 Users
Sessions/hour - SPECmail2009 Sessions/hour
NAS - Network Attached Storage
DAS - Direct Attached Storage

Results and Configuration Summary

Hardware Configuration:

    Sun SPARC Enterprise T5240
      2 x 1.6 GHz UltraSPARC T2 Plus processors
      128 GB memory
      2 x 146GB, 10K RPM SAS disks, 4 x 32GB SSDs

External Storage:

    2 x Sun Storage 7310 Unified Storage System, each with
      32 GB of memory
      24 x 1 TB 7200 RPM SATA Drives

Software Configuration:

    Solaris 10
    ZFS
    Sun Java Communications Suite 7 Update 2
      Sun Java System Messaging Server 7.2
      Directory Server 6.3

Benchmark Description

The SPECmail2009 benchmark measures the ability of corporate e-mail systems to meet today's demanding e-mail users over fast corporate local area networks (LAN). The SPECmail2009 benchmark simulates corporate mail server workloads that range from 250 to 10,000 or more users, using industry standard SMTP and IMAP4 protocols. This e-mail server benchmark creates client workloads based on a 40,000 user corporation, and uses folder and message MIME structures that include both traditional office documents and a variety of rich media content. The benchmark also adds support for encrypted network connections using industry standard SSL v3.0 and TLS 1.0 technology. SPECmail2009 replaces all versions of SPECmail2008, first released in August 2008. The results from the two benchmarks are not comparable.

Software on one or more client machines generates a benchmark load for a System Under Test (SUT) and measures the SUT response times. A SUT can be a mail server running on a single system or a cluster of systems.

A SPECmail2009 'run' simulates a 100% load level associated with the specific number of users, as defined in the configuration file. The mail server must maintain a specific Quality of Service (QoS) at the 100% load level to produce a valid benchmark result. If the mail server does maintain the specified QoS at the 100% load level, the performance of the mail server is reported as SPECmail_Ent2009 SMTP and IMAP Users at SPECmail2009 Sessions per hour. The SPECmail_Ent2009 users at SPECmail2009 Sessions per Hour metric reflects the unique workload combination for a SPEC IMAP4 user.

Key Points and Best Practices

  • Each Sun Storage 7310 Unified Storage System was configured with one J4400 JBOD array with 22 x 1 TB SATA drives set up as a mirrored device, and 4 shared volumes were built on the mirrored device. In total, 8 mirrored volumes from the 2 x Sun Storage 7310 systems were mounted on the system under test (SUT) using the NFSv4 protocol as the file systems for messaging mail indexes and mail messages. Four SSDs were used as the SUT internal disks. Each SSD was configured as a ZFS file system. Four such ZFS directories are used for the messaging server queue, store metadata, LDAP and queue. SSDs substantially reduced the store metadata and queue latencies.

  • Each Sun Storage 7310 Unified Storage System was connected to the SUT via a dual 10-Gigabit Ethernet Fiber XFP card.

  • The Sun Storage 7310 Unified Storage System software version is 2009.08.11,1-0.

  • The clients used these Java options: java -d64 -Xms4096m -Xmx4096m -XX:+AggressiveHeap

  • Substantial performance and scalability improvements were observed with Sun Communications Suite 7 Update 2, Java Messaging Server 7.2 and Directory Server 6.2

  • See the SPEC Report for all OS, network and messaging server tunings.

See Also

Disclosure Statement

SPEC, SPECmail reg tm of Standard Performance Evaluation Corporation. Results as of 10/22/09 on www.spec.org. SPECmail2009: Sun SPARC Enterprise T5240, SPECmail_Ent2009 14,500 users at 69,857 SPECmail2009 Sessions/hour. Apple Xserv3,1, SPECmail_Ent2009 6,000 users at 28,887 SPECmail2009 Sessions/hour.

Wednesday Nov 18, 2009

Part of the Sun FlashFire family, the Sun Flash Accelerator F20 PCIe Card is a low-profile x8 PCIe card with 4 Solid State Disks-on-Modules (DOMs) delivering over 101K IOPS (4K IO) and 1.1 GB/sec throughput (1M reads).

The Sun F20 card is designed to accelerate IO-intensive applications, such as databases, at a fraction of the power, space, and cost of traditional hard disk drives. It is based on enterprise-class SLC flash technology, with advanced wear-leveling, integrated backup protection, solid state robustness, and 3M hours MTBF reliability.

  • The Sun Flash Accelerator F20 PCIe Card demonstrates breakthrough performance of 101K IOPS for 4K random read
  • The Sun Flash Accelerator F20 PCIe Card can also perform 88K IOPS for 4K random write
  • The Sun Flash Accelerator F20 PCIe Card has unprecedented throughput of 1.1 GB/sec.
  • The Sun Flash Accelerator F20 PCIe Card (low-profile x8 size) has the IOPS performance of over 550 SAS drives or 1,100 SATA drives.

Performance Landscape

Bandwidth and IOPS Measurements

Test   4 DOMs   2 DOMs   1 DOM
Random 4K Read 101K IOPS 68K IOPS 35K IOPS
Maximum Delivered Random 4K Write 88K IOPS 44K IOPS 22K IOPS
Maximum Delivered 50-50 4K Read/Write 54K IOPS 27K IOPS 13K IOPS
Sequential Read (1M) 1.1 GB/sec 547 MB/sec 273 MB/sec
Maximum Delivered Sequential Write (1M) 567 MB/sec 243 MB/sec 125 MB/sec

Sustained Random 4K Write* 37K IOPS 18K IOPS 10K IOPS
Sustained 50/50 4K Read/Write* 34K IOPS 17K IOPS 8.6K IOPS

(*) Maximum Delivered values measured over a 1 minute period. Sustained write performance differs from maximum delivered performance. Over time, wear-leveling and erase operations are required and impact write performance levels.
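
To relate the IOPS rows to the bandwidth rows, an IOPS figure can be converted into data moved per second. This is a minimal sketch that assumes "4K" means 4096 bytes and that MB means 10^6 bytes; neither convention is stated in the post.

```python
def iops_to_mb_per_sec(iops: float, io_size_bytes: int) -> float:
    """Data rate in MB/sec (1 MB = 1,000,000 bytes) for a given IOPS and IO size."""
    return iops * io_size_bytes / 1_000_000

# 4 DOMs, random 4K read, value from the table above.
random_read_mb = iops_to_mb_per_sec(101_000, 4096)

print(f"{random_read_mb:.0f} MB/sec")
```

Under these assumptions, 101K random 4K reads move roughly 414 MB/sec, well below the 1.1 GB/sec measured for 1M sequential reads, which is why the two tests are reported separately.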

Latency Measurements

The Sun Flash Accelerator F20 PCIe Card is tuned for 4 KB or larger IO sizes; the write service time for IOs smaller than 4 KB can be 10 times more than shown in the table below. It should also be noted that the service times shown below include both the latency and the time to transfer the data. The transfer time becomes the dominant portion of the service time for IOs over 64 KB in size.

Transfer Size   Read Service Time (ms)   Write Service Time (ms)
4 KB 0.32 0.22
8 KB 0.34 0.24
16 KB 0.37 0.27
32 KB 0.43 0.33
64 KB 0.54 0.46
128 KB 0.49 1.30
256 KB 1.31 2.15
512 KB 2.25 2.25

- Latencies are application latencies measured with the vdbench tool.
- Please note that the FlashFire F20 card is a 4KB sector device. Doing IOs of less than 4KB in size, or not aligned on 4KB boundaries, can result in significant performance degradation on write operations.

Results and Configuration Summary

Storage:

    Sun Flash Accelerator F20 PCIe Card
      4 x 24-GB Solid State Disks-on-Modules (DOMs)

Servers:

    1 x Sun Fire X4170

Software:

    OpenSolaris 2009.06 or Solaris 10 10/09 (MPT driver enhancements)
    Vdbench 5.0
    Required Flash Array Patches SPARC, ses/sgen patch 138128-01 or later & mpt patch 141736-05
    Required Flash Array Patches x86, ses/sgen patch 138129-01 or later & mpt patch 141737-05

Benchmark Description

Sun measured a wide variety of IO performance metrics on the Sun Flash Accelerator F20 PCIe Card using Vdbench 5.0 measuring 100% Random Read, 100% Random Write, 100% Sequential Read, 100% Sequential Write, and 50-50 read/write. This demonstrates the maximum performance and throughput of the storage system.

The vdbench profile f20-parmfile.txt was used for bandwidth and IOPS, and the vdbench profile f20-latency.txt was used for latency.

Vdbench is publicly available for download at: http://vdbench.org

Key Points and Best Practices

  • Drive each Flash Module with 32 outstanding IOs, as shown in the benchmark profile above.
  • SPARC platforms will align with the 4K boundary size set by the Flash Array. x86/Windows platforms don't necessarily have this alignment built in and can show lower performance.
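
The alignment requirement can be checked with simple arithmetic. The sketch below is illustrative only; the helper names are hypothetical, and the example offset (sector 63 with 512-byte sectors, a traditional starting point for the first partition on x86 disks) is an assumption used to show a misaligned case.

```python
SECTOR_BYTES = 4096  # the F20 card is a 4KB sector device

def is_aligned(offset_bytes: int, length_bytes: int, sector: int = SECTOR_BYTES) -> bool:
    """True when an IO starts on a sector boundary and covers whole sectors."""
    return offset_bytes % sector == 0 and length_bytes % sector == 0

def align_up(offset_bytes: int, sector: int = SECTOR_BYTES) -> int:
    """Round an offset up to the next sector boundary."""
    return -(-offset_bytes // sector) * sector

legacy_partition_start = 63 * 512  # hypothetical misaligned starting offset

print(is_aligned(legacy_partition_start, 8192))            # misaligned start
print(is_aligned(align_up(legacy_partition_start), 8192))  # aligned after rounding up
```

Starting a partition or data file at the rounded-up offset keeps every 4 KB multiple IO on sector boundaries.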

See Also

Disclosure Statement

Sun Flash Accelerator F20 PCIe Card delivered 100K 4K read IOPS and 1.1 GB/sec sequential read. Vdbench 5.0 (http://vdbench.org) was used for the test. Results as of September 14, 2009.

Thursday Nov 05, 2009

TPC-C Sun SPARC Enterprise T5440 with Oracle RAC World Record Database Result

Sun and Oracle demonstrate the World's fastest database performance. Sun Microsystems using 12 Sun SPARC Enterprise T5440 servers, 60 Sun Storage F5100 Flash arrays and Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning delivered a world-record TPC-C benchmark result.

  • The 12-node Sun SPARC Enterprise T5440 server cluster result delivered a world record TPC-C benchmark result of 7,646,486.7 tpmC and $2.36/tpmC (USD) using Oracle 11g R1 on a configuration available 12/14/09.

  • The 12-node Sun SPARC Enterprise T5440 server cluster beats the performance of the IBM Power 595 (5GHz) with IBM DB2 9.5 database by 26% and has 16% better price/performance on the TPC-C benchmark.

  • The complete Oracle/Sun solution delivered 10.7x better computational density than the IBM configuration (computational density = performance/rack).

  • The complete Oracle/Sun solution used 8 times fewer racks than the IBM configuration.

  • The complete Oracle/Sun solution has 5.9x better power/performance than the IBM configuration.

  • The 12-node Sun SPARC Enterprise T5440 server cluster beats the performance of the HP Superdome (1.6GHz Itanium2) by 87% and has 19% better price/performance on the TPC-C benchmark.

  • The Oracle/Sun solution utilized Sun FlashFire technology to deliver this result. The Sun Storage F5100 flash array was used for database storage.

  • Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning scales and effectively uses all of the nodes in this configuration to produce the world record performance.

  • This result showed Sun and Oracle's integrated hardware and software stacks provide industry-leading performance.

More information on this benchmark will be posted in the next several days.

Performance Landscape

TPC-C results (sorted by tpmC, bigger is better)


System   tpmC   Price/tpmC   Avail   Database   Cluster   Racks   w/KtpmC
12 x Sun SPARC Enterprise T5440 7,646,487 2.36 USD 12/14/09 Oracle 11g RAC Y 9 9.6
IBM Power 595 6,085,166 2.81 USD 12/10/08 IBM DB2 9.5 N 76 56.4
HP Integrity Superdome 4,092,799 2.93 USD 08/06/07 Oracle 10g R2 N 46 to be added

Avail - Availability date
w/KtpmC - Watts per 1000 tpmC
Racks - clients, servers, storage, infrastructure
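
The Sun versus IBM comparisons in the highlights can be recomputed from the table. This is a minimal sketch; the dictionary layout and variable names are chosen here for illustration, and the values are copied from the table above.

```python
sun = {"tpmC": 7_646_487, "price_per_tpmC": 2.36, "watts_per_ktpmC": 9.6}
ibm = {"tpmC": 6_085_166, "price_per_tpmC": 2.81, "watts_per_ktpmC": 56.4}

# Higher tpmC is better; lower price and lower watts per tpmC are better.
performance_gain_pct = (sun["tpmC"] / ibm["tpmC"] - 1) * 100
price_performance_gain_pct = (1 - sun["price_per_tpmC"] / ibm["price_per_tpmC"]) * 100
power_performance_ratio = ibm["watts_per_ktpmC"] / sun["watts_per_ktpmC"]

print(f"performance: {performance_gain_pct:.0f}% higher")
print(f"price/performance: {price_performance_gain_pct:.0f}% better")
print(f"power/performance: {power_performance_ratio:.1f}x better")
```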

Sun and IBM TPC-C Response times


System                             tpmC        New Order 90th% Response Time (sec)   New Order Average Response Time (sec)
12 x Sun SPARC Enterprise T5440    7,646,487   0.170                                 0.168
IBM Power 595                      6,085,166   1.69                                  1.22
Response Time Ratio - Sun Better               9.9x                                  7.3x

Sun uses the 7x comparison to highlight the difference in response times between Sun's solution and IBM's. Note, however, that Sun is nearly 10x faster at the 90th percentile of New Order transactions.

It is also interesting to note that none of Sun's response times, average or 90th percentile, for any transaction is over 0.25 seconds, while IBM does not have even one interactive transaction, not even the menu, below 0.50 seconds. Graphs of Sun's and IBM's response times for New-Order can be found in the full disclosure reports on the TPC-C Official Result Page of TPC's website.

Results and Configuration Summary

Hardware Configuration:

    9 racks used to hold

    Servers:
      12 x Sun SPARC Enterprise T5440
      4 x 1.6 GHz UltraSPARC T2 Plus
      512 GB memory
      10 GbE network for cluster
    Storage:
      60 x Sun Storage F5100 Flash Array
      61 x Sun Fire X4275, Comstar SAS target emulation
      24 x Sun StorageTek 6140 (16 x 300 GB SAS 15K RPM)
      6 x Sun Storage J4400
      3 x 80-port Brocade FC switches
    Clients:
      24 x Sun Fire X4170, each with
      2 x 2.53 GHz X5540
      48 GB memory

Software Configuration:

    Solaris 10 10/09
    OpenSolaris 6/09 (COMSTAR) for Sun Fire X4275
    Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning
    Tuxedo CFS-R Tier 1
    Sun Web Server 7.0 Update 5

Benchmark Description

TPC-C is an OLTP system benchmark. It simulates a complete environment where a population of terminal operators executes transactions against a database. The benchmark is centered around the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses.

See Also

Disclosure Statement

TPC Benchmark C, tpmC, and TPC-C are trademarks of the Transaction Processing Performance Council (TPC). 12-node Sun SPARC Enterprise T5440 Cluster (1.6GHz UltraSPARC T2 Plus, 4 processor) with Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning, 7,646,486.7 tpmC, $2.36/tpmC. Available 12/14/09. IBM Power 595 (5GHz Power6, 32 chips, 64 cores, 128 threads) with IBM DB2 9.5, 6,085,166 tpmC, $2.81/tpmC, available 12/10/08. HP Integrity Superdome (1.6GHz Itanium2, 64 processors, 128 cores, 256 threads) with Oracle 10g Enterprise Edition, 4,092,799 tpmC, $2.93/tpmC. Available 8/06/07. Source: www.tpc.org, results as of 11/5/09.

Monday Nov 02, 2009

A Sun Blade 6048 Modular System with 8 Sun Blade X6275 Server Modules configured with QDR InfiniBand cluster interconnect delivered outstanding performance running the FLUENT 12 benchmark test suite. Sun consistently delivered the best or near best results per node for the 6 benchmark tests considered up to the available nodes considered for these runs.

  • The Sun Blade X6275 cluster delivered the best results for the truck_poly_14M tests for all Rank counts tested.
  • For this large truck_poly_14m test case, the Sun Blade X6275 cluster beat the best results by SGI by as much as 19%.

  • Of the 54 test cases presented here, the Sun Blade X6275 cluster delivered the best results in 87% of the tests, 47 of the 54 cases.

Performance Landscape


FLUENT 12 Benchmark Test Suite
  Results are "Ratings" (bigger is better)
  Rating = No. of sequential runs of test case possible in 1 day = 86,400 sec/(Total Elapsed Run Time in seconds)

System   Nodes   Ranks   eddy_417k   turbo_500k   aircraft_2m   sedan_4m   truck_14m   truck_poly_14m

Sun Blade X6275 16 128 6496.2 19307.3 8408.8 6341.3 1060.1 984.1
Best Intel 16 128 5236.4 (3) 15638.0 (7) 7981.5 (1) 6582.9 (1) 1005.8 (1) 933.0 (1)
Best SGI 16 128 7578.9 (5) 14706.4 (6) 6789.8 (4) 6249.5 (5) 1044.7 (4) 926.0 (4)

Sun Blade X6275 8 64 5308.8 26790.7 5574.2 5074.9 547.2 525.2
Best Intel 8 64 5016.0 (1) 25226.3 (1) 5220.5 (1) 4614.2 (1) 513.4 (1) 490.9 (1)
Best SGI 8 64 5142.9 (4) 23834.5 (4) 4614.2 (4) 4352.6 (4) 529.4 (4) 479.2 (4)

Sun Blade X6275 4 32 3066.5 13768.9 3066.5 2602.4 289.0 270.3
Best Intel 4 32 2856.2 (1) 13041.5 (1) 2837.4 (1) 2465.0 (1) 266.4 (1) 251.2 (1)
Best SGI 4 32 3083.0 (4) 13190.8 (4) 2588.8 (5) 2445.9 (5) 266.6 (4) 246.5 (4)

Sun Blade X6275 2 16 1714.3 7545.9 1519.1 1345.8 144.4 141.8
Best Intel 2 16 1585.3 (1) 7125.8 (1) 1428.1 (1) 1278.6 (1) 134.7 (1) 132.5 (1)
Best SGI 2 16 1708.4 (4) 7384.6 (4) 1507.9 (4) 1264.1 (5) 128.8 (4) 133.5 (4)

Sun Blade X6275 1 8 931.8 4061.1 827.2 681.5 73.0 73.8
Best Intel 1 8 920.1 (2) 3900.7 (2) 784.9 (2) 644.9 (1) 70.2 (2) 70.9 (2)
Best SGI 1 8 953.1 (4) 4032.7 (4) 843.3 (4) 651.0 (4) 71.4 (4) 72.0 (4)

Sun Blade X6275 1 4 550.4 2425.3 533.6 423.0 41.6 41.6
Best Intel 1 4 515.7 (1) 2244.2 (1) 490.8 (1) 392.2 (1) 37.8 (1) 38.4 (1)
Best SGI 1 4 561.6 (4) 2416.8 (4) 526.9 (4) 412.6 (4) 40.9 (4) 40.8 (4)

Sun Blade X6275 1 2 299.6 1328.2 293.9 232.1 21.3 21.6
Best Intel 1 2 274.3 (1) 1201.7 (1) 266.1 (1) 214.2 (1) 18.9 (1) 19.6 (1)
Best SGI 1 2 294.2 (4) 1302.7 (4) 289.0 (4) 226.4 (4) 20.5 (4) 21.2 (4)

Sun Blade X6275 1 1 154.7 682.6 149.1 114.8 9.7 10.1
Best Intel 1 1 143.5 (1) 631.1 (1) 137.4 (1) 106.2 (1) 8.8 (1) 9.0 (1)
Best SGI 1 1 153.3 (4) 677.5 (4) 147.3 (4) 111.2 (4) 10.3 (4) 9.5 (4)

Sun Blade X6275 1 serial 155.6 676.6 156.9 110.0 9.4 10.3
Best Intel 1 serial 146.6 (2) 650.0 (2) 150.2 (2) 105.6 (2) 8.8 (2) 9.7 (2)

    Sun Blade X6275, X5570 QC 2.93 GHz, QDR SMT on / Turbo mode on

    (1) Intel Whitebox (X5560 QC 2.80 GHz, RHEL5, IB)
    (2) Intel Whitebox (X5570 QC 2.93 GHz, RHEL5)
    (3) Intel Whitebox (X5482 QC 3.20 GHz, RHEL5, IB)
    (4) SGI Altix ICE_8200IP95 (X5570 2.93 GHz +turbo, SLES10, IB)
    (5) SGI Altix_ICE_8200IP95 (X5570 2.93 GHz, SLES10, IB)
    (6) SGI Altix_ICE_8200EX (Intel64 QC 3.00 GHz, Linux, IB)
    (7) Qlogic Cluster (X5472 QC 3.00 GHz, RHEL5.2, IB Truescale)

Results and Configuration Summary

Hardware Configuration:

    8 x Sun Blade X6275 Server Module ( Dual-Node Blade, 16 nodes ) each node with
      2 x 2.93GHz Intel X5570 QC processors
      24 GB (6 x 4GB, 1333 MHz DDR3 dimms)
      On-board QDR InfiniBand Host Channel Adapters, QNEM

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Interconnect Software: OFED ver 1.4.1
    Shared File System: Lustre ver 1.8.0.1
    Application: FLUENT V12.0.16
    Benchmark: FLUENT 12 Benchmark Test Suite

Benchmark Description

The benchmark tests are representative of typical user large CFD models intended for execution in distributed memory processor (DMP) mode over a cluster of multi-processor platforms.

Key Points and Best Practices

Observations About the Results

The Sun Blade X6275 cluster delivered excellent performance, especially with the larger models.

These processors include a turbo boost feature, coupled with a SpeedStep option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can temporarily raise the processor frequency from 2.93GHz to 3.2GHz.

Memory placement is a very significant factor with Nehalem processors. Current Nehalem platforms have two sockets. Each socket has three memory channels and each channel has 3 bays for DIMMs. For example, if one DIMM is placed in the 1st bay of each of the 3 channels, the DIMM speed will be 1333 MHz with the X5570s. Altering the DIMM arrangement to an unbalanced configuration, say by adding just one more DIMM into the 2nd bay of one channel, will cause the DIMM frequency to drop from 1333 MHz to 1067 MHz.

About the FLUENT 12 Benchmark Test Suite

The FLUENT application performs computational fluid dynamic analysis on a variety of different types of flow and allows for chemically reacting species. transient dynamic and can be linear or nonlinear as far

  • CFD models tend to be very large where grid refinement is required to accurately capture conditions in the boundary layer region adjacent to the body over which flow is occurring. Fine grids are also required to determine accurate turbulence conditions. As such, these models can run for many hours or even days, even when using a large number of processors.
  • CFD models typically scale very well and are well suited to execution on clusters. The FLUENT 12 benchmark test cases scale well.
  • The memory requirements for the test cases in the FLUENT 12 benchmark test suite range from a few hundred megabytes to about 25 GB. As the job is distributed over multiple nodes, the memory requirement per node is correspondingly reduced.
  • The benchmark test cases for the FLUENT module do not have a substantial I/O component. However, performance will be enhanced very substantially by using high-performance interconnects such as InfiniBand for inter-node cluster message passing. This nodal message passing data can be stored locally on each node or on a shared file system.
  • As a result of the large amount of inter-node message passing, performance can be further enhanced by more than a factor of 3, as indicated here, by implementing the Lustre-based shared file I/O system.

See Also

FLUENT 12.0 Benchmark:
http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.4.x/

Disclosure Statement

All information on the Fluent website is Copyrighted 1995-2009 by Fluent Inc. Results from http://www.fluent.com/software/fluent/fl6bench/ as of October 20, 2009 and this presentation.

Monday Nov 02, 2009

This is an occasionally-generated index of previous entries in the BestPerf blog.

Nov 02, 2009 Sun Ultra 27 Delivers Leading Single Frame Buffer SPECviewperf 10 Results
Oct 28, 2009 SPC-2 Sun Storage 6780 Array RAID 5 & RAID 6 51% better $/performance than IBM DS5300
Oct 25, 2009 Sun C48 & Lustre fast for Seismic Reverse Time Migration using Sun X6275
Oct 25, 2009 Sun F5100 and Seismic Reverse Time Migration with faster Optimal Checkpointing
Oct 23, 2009 Wiki on performance best practices
Oct 20, 2009 Exadata V2 Information
Oct 15, 2009 Oracle Flash Cache - SGA Caching on Sun Storage F5100
Oct 13, 2009 Oracle Hyperion Sun M5000 and Sun Storage 7410
Oct 13, 2009 Sun T5440 Oracle BI EE Sun SPARC Enterprise T5440 World Record
Oct 13, 2009 SPECweb2005 on Sun SPARC Enterprise T5440 World Record using Solaris Containers and Sun Storage F5100 Flash
Oct 13, 2009 Oracle PeopleSoft Payroll (NA) Sun SPARC Enterprise M4000 and Sun Storage F5100 World Record Performance
Oct 13, 2009 SAP 2-tier SD Benchmark on Sun SPARC Enterprise M9000/32 SPARC64 VII
Oct 13, 2009 CP2K Life Sciences, Ab-initio Dynamics - Sun Blade 6048 Chassis with Sun Blade X6275 - Scalability and Throughput with Quad Data Rate InfiniBand
Oct 13, 2009 SAP 2-tier SD-Parallel on Sun Blade X6270 1-node, 2-node and 4-node
Oct 13, 2009 Halliburton ProMAX Oil & Gas Application Fast on Sun 6048/X6275 Cluster
Oct 13, 2009 SPECcpu2006 Results On MSeries Servers With Updated SPARC64 VII Processors
Oct 13, 2009 MCAE ABAQUS faster on Sun F5100 and Sun X4270 - Single Node World Record
Oct 12, 2009 MCAE ANSYS faster on Sun F5100 and Sun X4270
Oct 12, 2009 MCAE MCS/NASTRAN faster on Sun F5100 and Fire X4270
Oct 12, 2009 SPC-2 Sun Storage 6180 Array RAID 5 & RAID 6 Over 70% Better Price Performance than IBM
Oct 12, 2009 SPC-1 Sun Storage 6180 Array Over 70% Better Price Performance than IBM
Oct 12, 2009 Why Sun Storage F5100 is a good option for Peoplesoft NA Payroll Application
Oct 12, 2009 1.6 Million 4K IOPS in 1RU on Sun Storage F5100 Flash Array
Oct 11, 2009 TPC-C World Record Sun - Oracle
Oct 09, 2009 X6275 Cluster Demonstrates Performance and Scalability on WRF 2.5km CONUS Dataset
Oct 02, 2009 Sun X4270 VMware VMmark benchmark achieves excellent result
Sep 22, 2009 Sun X4270 Virtualized for Two-tier SAP ERP 6.0 Enhancement Pack 4 (Unicode) Standard Sales and Distribution (SD) Benchmark
Sep 01, 2009 String Searching - Sun T5240 & T5440 Outperform IBM Cell Broadband Engine
Aug 28, 2009 Sun X4270 World Record SAP-SD 2-Processor Two-tier SAP ERP 6.0 EP 4 (Unicode)
Aug 27, 2009 Sun SPARC Enterprise T5240 with 1.6GHz UltraSPARC T2 Plus Beats 4-Chip IBM Power 570 POWER6 System on SPECjbb2005
Aug 26, 2009 Sun SPARC Enterprise T5220 with 1.6GHz UltraSPARC T2 Sets Single Chip World Record on SPECjbb2005
Aug 12, 2009 SPECmail2009 on Sun SPARC Enterprise T5240 and Sun Java System Messaging Server 6.3
Jul 23, 2009 World Record Performance of Sun CMT Servers
Jul 22, 2009 Why does 1.6 beat 4.7?
Jul 21, 2009 Zeus ZXTM Traffic Manager World Record on Sun T5240
Jul 21, 2009 Sun T5440 Oracle BI EE World Record Performance
Jul 21, 2009 Sun T5440 World Record SAP-SD 4-Processor Two-tier SAP ERP 6.0 EP 4 (Unicode)
Jul 21, 2009 1.6 GHz SPEC CPU2006 - Rate Benchmarks
Jul 21, 2009 Sun Blade T6320 World Record SPECjbb2005 performance
Jul 21, 2009 New SPECjAppServer2004 Performance on the Sun SPARC Enterprise T5440
Jul 21, 2009 Sun T5440 SPECjbb2005 Beats IBM POWER6 Chip-to-Chip
Jul 21, 2009 New CMT results coming soon....
Jul 14, 2009 Vdbench: Sun StorageTek Vdbench, a storage I/O workload generator.
Jul 14, 2009 Storage performance and workload analysis using Swat.
Jul 10, 2009 World Record TPC-H@300GB Price-Performance for Windows on Sun Fire X4600 M2
Jul 06, 2009 Sun Blade 6048 Chassis with Sun Blade X6275: RADIOSS Benchmark Results
Jul 03, 2009 SPECmail2009 on Sun Fire X4275+Sun Storage 7110: Mail Server System Solution
Jun 30, 2009 Sun Blade 6048 and Sun Blade X6275 NAMD Molecular Dynamics Benchmark beats IBM BlueGene/L
Jun 26, 2009 Sun Fire X2270 Cluster Fluent Benchmark Results
Jun 25, 2009 Sun SSD Server Platform Bandwidth and IOPS (Speeds & Feeds)
Jun 24, 2009 I/O analysis using DTrace
Jun 23, 2009 New CPU2006 Records: 3x better integer throughput, 9x better fp throughput
Jun 23, 2009 Sun Blade X6275 results capture Top Places in CPU2006 SPEED Metrics
Jun 23, 2009 One Million Queries per Hour TPC-H at 30 Terabytes by Sun and ParAccel with OpenSolaris
Jun 19, 2009 Pointers to Java Performance Tuning resources
Jun 19, 2009 SSDs in HPC: Reducing the I/O Bottleneck BluePrint Best Practices
Jun 17, 2009 The Performance Technology group wiki is alive!
Jun 17, 2009 Performance of Sun 7410 and 7310 Unified Storage Array Line
Jun 16, 2009 Sun Fire X2270 MSC/Nastran Vendor_2008 Benchmarks
Jun 15, 2009 Sun Fire X4600 M2 Server Two-tier SAP ERP 6.0 (Unicode) Standard Sales and Distribution (SD) Benchmark
Jun 12, 2009 Correctly comparing SAP-SD Benchmark results
Jun 12, 2009 OpenSolaris Beats Linux on memcached Sun Fire X2270
Jun 11, 2009 SAS Grid Computing 9.2 utilizing the Sun Storage 7410 Unified Storage System
Jun 10, 2009 Using Solaris Resource Management Utilities to Improve Application Performance
Jun 09, 2009 Free Compiler Wins Nehalem Race by 2x
Jun 08, 2009 Variety of benchmark results to be posted on BestPerf
Jun 05, 2009 Interpreting Sun's SPECpower_ssj2008 Publications
Jun 03, 2009 Wide Variety of Topics to be discussed on BestPerf
Jun 03, 2009 Welcome to BestPerf group blog!

Monday Nov 02, 2009

A Sun Ultra 27 workstation configured with an nVidia FX5800 graphics card delivered outstanding performance running the SPECviewperf® 10 benchmark.

  • When compared with other workstations running a single graphics card (i.e. not running two or more cards in SLI mode), the Sun Ultra 27 workstation places first in 6 of 8 subtests and second in the remaining two subtests.

  • The calculated geometric mean shows that the Sun Ultra 27 workstation is 11% faster than competitors' workstations.

  • The optimum point for price/performance is the nVidia FX1800 graphics card.

Results have been published on the SPEC web site at http://www.spec.org/gwpg/gpc.data/vp10/summary.html.

Performance Landscape

Performance of the Sun Ultra 27 versus the competition. Bigger is better for each of the eight tests. The comparison is based upon the performance of the Sun Ultra 27 workstation. Performance is measured in frames per second.


System                        3DSMAX       CATIA        ENSIGHT      MAYA
                              Perf   %     Perf   %     Perf   %     Perf    %
Sun Ultra 27 FX5800           59.34        68.81        58.07        246.09
HP xw4600 ATI FireGL V7700    49.71  19    48.05  43    57.11  2     268.62  -8
HP xw4600 FX4800              52.26  14    63.26  12    53.79  8     226.82  7
Fujitsu Celsius M470 FX3800   53.67  11    65.25  7     52.19  10    227.37  7

System                        PROENGINEER  SOLIDWORKS   TEAMCENTER   UGS
                              Perf   %     Perf    %    Perf   %     Perf    %
Sun Ultra 27 FX5800           68.96        152.01       42.02        36.04
HP xw4600 ATI FireGL V7700    47.25  32    109.71  28   40.18  4     56.65   -57
HP xw4600 FX4800              61.15  11    131.31  14   28.42  32    33.43   7
Fujitsu Celsius M470 FX3800   64.39  7     139.2   8    29.02  31    33.27   8
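
The geometric-mean comparison mentioned in the summary can be approximated from the table above. The sketch below is one plausible way to do it and is not necessarily the method behind the published 11% figure, since the post does not say how results were aggregated across competitors. It takes the per-test ratio of Sun Ultra 27 frames per second to each competitor's and reports the geometric mean of those ratios. The variable and function names are illustrative.

```python
import math

# Frames per second from the comparison table, in the order:
# 3DSMAX, CATIA, ENSIGHT, MAYA, PROENGINEER, SOLIDWORKS, TEAMCENTER, UGS
sun_ultra27_fx5800 = [59.34, 68.81, 58.07, 246.09, 68.96, 152.01, 42.02, 36.04]
competitors = {
    "HP xw4600 ATI FireGL V7700": [49.71, 48.05, 57.11, 268.62, 47.25, 109.71, 40.18, 56.65],
    "HP xw4600 FX4800": [52.26, 63.26, 53.79, 226.82, 61.15, 131.31, 28.42, 33.43],
    "Fujitsu Celsius M470 FX3800": [53.67, 65.25, 52.19, 227.37, 64.39, 139.2, 29.02, 33.27],
}


def geometric_mean(values):
    """Geometric mean computed through logarithms for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def geomean_advantage(ours, theirs):
    """Geometric mean of per-test ratios; values above 1.0 favor 'ours'."""
    return geometric_mean([a / b for a, b in zip(ours, theirs)])


for name, scores in competitors.items():
    advantage = geomean_advantage(sun_ultra27_fx5800, scores)
    print(f"{name}: Sun Ultra 27 ahead by {100 * (advantage - 1):.1f}% (geometric mean)")
```

A geometric mean is the usual choice here because the eight subtests have very different absolute frame rates; averaging ratios this way keeps any single high-frame-rate test from dominating.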

Comparison of various frame buffers on the Sun Ultra 27 running SPECviewperf 10. Performance is reported for each test along with the difference in performance as compared to the FX5800 frame buffer. The runs in the table below were made with 3.2GHz W3570 processors.


Card     3DSMAX      CATIA       ENSIGHT     MAYA        PROENGR     SOLIDWRKS   TEAMCNTR    UGS
         Perf   %    Perf   %    Perf   %    Perf   %    Perf   %    Perf   %    Perf   %    Perf   %
FX5800   57.07       67.84       58.63       219.4       68.05       152.3       40.85       34.73
FX3800   57.17  0    66.57  2    54.91  7    206.4  6    66.48  2    146.3  4    38.48  6    33.12  5
FX1800   56.73  1    64.33  6    52.05  13   189.3  16   64.67  5    135.2  13   34.18  20   30.46  14
FX380    45.90  24   55.81  22   34.93  68   120.3  82   46.09  48   64.11  138  17.00  140  13.88  150
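
The % column in the frame buffer table appears to be the percentage by which the FX5800 outperforms each other card. That formula is inferred from the table values and is not stated in the post. A minimal sketch under that assumption, with an illustrative function name:

```python
def percent_faster(reference_fps: float, other_fps: float) -> float:
    """Percentage by which the reference card is faster than the other card."""
    if other_fps <= 0:
        raise ValueError("frame rate must be positive")
    return (reference_fps / other_fps - 1.0) * 100.0


# FX5800 versus FX380, values taken from the table
print(round(percent_faster(57.07, 45.90)))   # 3DSMAX
print(round(percent_faster(152.3, 64.11)))   # SOLIDWRKS
```

Under this reading, a value above 100 means the FX5800 delivers more than twice the frame rate of the other card.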

Results and Configuration Summary

Hardware Configuration:

    Sun Ultra 27 Workstation
    1 x 3.33 GHz Intel Xeon (tm) W3580
    2GB (1 x 2GB PC10600 1333MHz)
    1 x 500GB SATA
    nVidia Quadro FX380, FX1800, FX3800 & FX5800
    $7,529.00 (includes Microsoft Windows and monitor)

Software Configuration:

    OS: Microsoft Windows Vista Ultimate, 32-bit
    Benchmark: SPECviewperf 10

Benchmark Description

SPECviewperf measures 3D graphics rendering performance of systems running under OpenGL. SPECviewperf is a synthetic benchmark designed to be a predictor of application performance. It measures graphics subsystem performance (primarily graphics bus, driver and graphics hardware) and its impact on the system without the full overhead of an application. SPECviewperf reports performance in frames per second.

Please go here for a more complete description of the tests.

Key Points and Best Practices

SPECviewperf measures the 3D rendering performance of systems running under OpenGL.

The SPECopc(SM) project group's SPECviewperf 10 is totally new performance evaluation software. In addition to features found in previous versions, it now provides the ability to compare performance of systems running in higher-quality graphics modes that use full-scene anti-aliasing, and measures how effectively graphics subsystems scale when running multithreaded graphics content. Since the SPECviewperf source and binaries have been upgraded to support changes, no comparisons should be made between past results and current results for viewsets running under SPECviewperf 10.

SPECviewperf 10 requires OpenGL 1.5 and a minimum of 1GB of system memory. It currently supports Windows 32/64.

See Also

Disclosure Statement

SPEC® and the benchmark name SPECviewperf® are registered trademarks of the Standard Performance Evaluation Corporation. Competitive benchmark results stated above reflect results published on www.spec.org as of Oct 18, 2009. For the latest SPECviewperf benchmark results, visit www.spec.org/gwpg.

Wednesday Oct 28, 2009

Significance of Results

Results on the Sun Storage 6780 Array with 8Gb connectivity are presented for the SPC-2 benchmark using RAID 5 and RAID 6.
  • The Sun Storage 6780 array outperforms the IBM DS5300 by 51% in price performance for SPC-2 benchmark using RAID 5 data protection.

  • The Sun Storage 6780 array outperforms the IBM DS5300 by 51% in price performance for SPC-2 benchmark using RAID 6 data protection.

  • The Sun Storage 6780 Array has 62% better performance than the Fujitsu 800/1100 and delivers a price performance advantage of 5.6x as measured by the SPC-2 benchmark.

  • The Sun Storage 6780 array with 8Gb connectivity improved performance by 36% over the 4Gb connected solution as measured by the SPC-2 benchmark.

Performance Landscape

SPC-2 Performance Chart (in increasing price-performance order)

Sponsor System     SPC-2 MBPS  $/SPC-2 MBPS  ASU Capacity (GB)  TSC Price  Data Protection Level  Date      Results Identifier
Sun SS6780 (8Gb)   5,634.17    $44.88        16,383.186         $252,873   RAID 5                 10/27/09  B00047
IBM DS5300 (8Gb)   5,634.17    $67.75        16,383.186         $381,720   RAID 5                 10/21/09  B00045
Sun SS6780 (8Gb)   5,543.88    $45.61        14,042.731         $252,873   RAID 6                 10/27/09  B00048
IBM DS5300 (8Gb)   5,543.88    $68.85        14,042.731         $381,720   RAID 6                 10/21/09  B00046
Sun SS6780 (4Gb)   4,818.43    $53.61        16,383.186         $258,329   RAID 5                 02/03/09  B00039
IBM DS5300 (4Gb)   4,818.43    $93.80        16,383.186         $451,986   RAID 5                 09/25/08  B00037
Sun SS6780 (4Gb)   4,675.50    $55.25        14,042.731         $258,329   RAID 6                 02/03/09  B00040
IBM DS5300 (4Gb)   4,675.50    $96.67        14,042.731         $451,986   RAID 6                 09/25/08  B00038
Fujitsu 800/1100   3,480.68    $238.93       4,569.845          $831,649   Mirroring              03/08/07  B00019

SPC-2 MBPS = the Performance Metric
$/SPC-2 MBPS = the Price/Performance Metric
ASU Capacity = the Capacity Metric
Data Protection = Data Protection Metric
TSC Price = Total Cost of Ownership Metric
Results Identifier = a unique identifier for the result
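
The price/performance column follows directly from two other columns: TSC Price divided by SPC-2 MBPS. The short sketch below reproduces the RAID 5 comparison from the chart above; the function name is illustrative and not part of any SPC tool.

```python
def price_performance(tsc_price_usd: float, spc2_mbps: float) -> float:
    """Dollars per SPC-2 MBPS (lower is better)."""
    return tsc_price_usd / spc2_mbps


# RAID 5, 8Gb results from the chart
sun = price_performance(252_873, 5_634.17)
ibm = price_performance(381_720, 5_634.17)

# How much more the IBM configuration costs per MBPS, as a fraction
advantage = ibm / sun - 1.0

print(f"Sun ${sun:.2f}/MBPS, IBM ${ibm:.2f}/MBPS, Sun advantage {advantage:.0%}")
```

Since both systems report the same SPC-2 MBPS, the price/performance gap here comes entirely from the difference in TSC Price.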

Complete SPC-2 benchmark results may be found at http://www.storageperformance.org.

Results and Configuration Summary

Storage Configuration:

    8 x CM200 trays, each with 16 x 146GB 15K RPM drives
    8 x Qlogic 8Gb HBA

Server Configuration:

    4 x IBM x3650
      2 x 2.93 GHz Intel X5570
      5 GB memory

Software Configuration:

    Microsoft Windows Server 2003 Enterprise Edition (32-bit) with SP2
    SPC-2 benchmark kit

Benchmark Description

The SPC Benchmark-2™ (SPC-2) is a series of related benchmark performance tests that simulate the sequential component of demands placed upon on-line, non-volatile storage in server class computer systems. SPC-2 provides measurements in support of real world environments characterized by:
  • Large numbers of concurrent sequential transfers.
  • Demanding data rate requirements, including requirements for real time processing.
  • Diverse application techniques for sequential processing.
  • Substantial storage capacity requirements.
  • Data persistence requirements to ensure preservation of data without corruption or loss.

Key Points and Best Practices

  • This benchmark was performed using RAID 5 and RAID 6 protection.
  • The controller stripe size was set to 512k.
  • No volume manager was used.

See Also

Benchmark Tags

$/Perf, performance, bandwidth, OpenStorage, Storage

Disclosure Statement

SPC-2, SPC-2 MBPS, $/SPC-2 MBPS are registered trademarks of Storage Performance Council (SPC). More info www.storageperformance.org. Sun Storage 6780 Array 5,634.17 SPC-2 MBPS, $/SPC-2 MBPS $44.88, ASU Capacity 16,383.186 GB, Protect RAID 5, Cost $252,873.00, Ident. B00047. Sun Storage 6780 Array 5,543.88 SPC-2 MBPS, $/SPC-2 MBPS $45.61, ASU Capacity 14,042.731 GB, Protect RAID 6, Cost $252,873.00, Ident. B00048.

Publication Rules

See here for publication rules.

Sunday Oct 25, 2009

Significance of Results

A Sun Blade 6048 Modular System with 12 Sun Blade X6275 server modules was clustered with QDR InfiniBand, using a Lustre file system over QDR InfiniBand, to show performance improvements over an NFS file system for reading in Velocity, Epsilon, and Delta Slices and imaging 800 samples of various grid sizes using the Reverse Time Migration.

  • The Initialization Time for populating the processing grids demonstrates significant advantages of Lustre over NFS:
    • 2486x1151x1231 : 20x improvement
    • 1243x1151x1231 : 20x improvement
    • 125x1151x1231 : 11x improvement
  • The Total Application Performance shows the Interconnect and I/O advantages of using QDR InfiniBand Lustre for the large grid sizes:
    • 2486x1151x1231 : 2x improvement - processed in less than 19 minutes
    • 1243x1151x1231 : 2x improvement - processed in less than 10 minutes

  • The Computational Kernel Scalability Efficiency for the 3 grid sizes:
    • 125x1151x1231 : 97% (1-8 nodes)
    • 1243x1151x1231 : 102% (8-24 nodes)
    • 2486x1151x1231 : 100% (12-24 nodes)

  • The Total Application Scalability Efficiency for the large grid sizes:
    • 1243x1151x1231 : 72% (8-24 nodes)
    • 2486x1151x1231 : 71% (12-24 nodes)

  • On the X5570 Intel processor with HyperThreading enabled, running 16 OpenMP threads per node gives approximately a 10% performance improvement over running 8 threads per node.

Performance Landscape

This first table presents the initialization time, comparing different numbers of processors along with different problem sizes. The results are presented in seconds and show the advantage that the Lustre file system running over QDR InfiniBand provided when compared to a simple NFS file system.


Initialization Time Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode
800 Samples per grid, times in seconds

Nodes  Procs  125 x 1151 x 1231    1243 x 1151 x 1231   2486 x 1151 x 1231
              Lustre   NFS         Lustre   NFS         Lustre   NFS
24     48     1.59     18.90       8.90     181.78      15.63    362.48
20     40     1.60     18.90       8.93     181.49      16.91    358.81
16     32     1.58     18.59       8.97     181.58      17.39    353.72
12     24     1.54     18.61       9.35     182.31      22.50    364.25
8      16     1.40     18.60       10.02    183.79
4      8      1.57     18.80
2      4      2.54     19.31
1      2      4.54     20.34

This next table presents the total application run time, comparing different numbers of processors along with different problem sizes. It shows that for larger problems, using the Lustre file system running over QDR InfiniBand provided a big performance advantage when compared to a simple NFS file system.


Total Application Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode
800 Samples per grid, times in seconds

Nodes  Procs  125 x 1151 x 1231    1243 x 1151 x 1231   2486 x 1151 x 1231
              Lustre   NFS         Lustre   NFS         Lustre   NFS
24     48     251.48   273.79      553.75   1125.02     1107.66  2310.25
20     40     232.00   253.63      658.54   971.65      1143.47  2062.80
16     32     227.91   209.66      826.37   1003.81     1309.32  2348.60
12     24     217.77   234.61      884.27   1027.23     1579.95  3877.88
8      16     223.38   203.14      1200.71  1362.42
4      8      341.14   272.68
2      4      605.62   625.25
1      2      892.40   841.94

The following table presents the run time and speedup of just the computational kernel for different processor counts for the three different problem sizes considered. The scaling results are based upon the smallest number of nodes run and that number is used as the baseline reference point.


Computational Kernel Performance & Scalability
Reverse Time Migration - SMP Threads and MPI Mode
800 Samples per grid, X6275 time in seconds, speedup relative to 1 node

Nodes  Procs  125 x 1151 x 1231    1243 x 1151 x 1231   2486 x 1151 x 1231
              Time     Speedup     Time     Speedup     Time     Speedup
24     48     35.38    13.7        210.82   24.5        427.40   24.0
20     40     35.02    13.8        255.27   20.2        517.03   19.8
16     32     41.76    11.6        317.96   16.2        646.22   15.8
12     24     49.53    9.8         422.17   12.2        853.37   12.0*
8      16     62.34    7.8         645.27   8.0*
4      8      124.66   3.9
2      4      238.80   2.0
1      2      484.89   1.0

The last table presents the speedup of the total application for different processor counts for the three different problem sizes presented. The scaling results are based upon the smallest number of nodes run and that number is used as the baseline reference point.


Total Application Scalability Comparison
Reverse Time Migration - SMP Threads and MPI Mode
800 Samples per grid, Lustre speedup relative to 1 node

Nodes  Procs  125 x 1151 x 1231  1243 x 1151 x 1231  2486 x 1151 x 1231
24     48     3.6                17.3                17.1
20     40     3.8                14.6                16.6
16     32     4.0                11.6                14.5
12     24     4.1                10.9                12.0*
8      16     4.0                8.0*
4      8      2.6
2      4      1.5
1      2      1.0

Note: HyperThreading is enabled and running 16 threads per Node.
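
The scalability efficiencies quoted in the summary can be recomputed from the run times in the tables above. The sketch below assumes efficiency is defined as measured speedup divided by ideal speedup between the baseline node count and the largest node count; the post does not spell out the formula, and the function name is illustrative.

```python
def scaling_efficiency(base_nodes: int, base_time: float,
                       nodes: int, time: float) -> float:
    """Measured speedup divided by the ideal (linear) speedup."""
    measured_speedup = base_time / time
    ideal_speedup = nodes / base_nodes
    return measured_speedup / ideal_speedup


# Computational kernel times (seconds) from the kernel table
kernel = {
    "125x1151x1231": scaling_efficiency(1, 484.89, 8, 62.34),
    "1243x1151x1231": scaling_efficiency(8, 645.27, 24, 210.82),
    "2486x1151x1231": scaling_efficiency(12, 853.37, 24, 427.40),
}

# Total application times on Lustre (seconds) from the total application table
total = {
    "1243x1151x1231": scaling_efficiency(8, 1200.71, 24, 553.75),
    "2486x1151x1231": scaling_efficiency(12, 1579.95, 24, 1107.66),
}

for grid, eff in kernel.items():
    print(f"kernel {grid}: {eff:.0%}")
for grid, eff in total.items():
    print(f"total  {grid}: {eff:.0%}")
```

The gap between kernel efficiency and total application efficiency is the part of the run time that does not shrink with node count, which in this benchmark is mostly I/O.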

Results and Configuration Summary

Hardware Configuration:
    Sun Blade 6048 Modular System with
      12 x Sun Blade X6275 Server Modules, each with
        4 x 2.93 GHz Intel Xeon QC X5570 processors
        12 x 4 GB memory at 1333 MHz
        2 x 24 GB Internal Flash
    QDR InfiniBand Lustre 1.8.0.1 File System
    GBit NFS file system

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    MPI: Scali MPI Connect 5.6.6-59413
    Compiler: Sun Studio 12 C++, Fortran, OpenMP

Benchmark Description

The Reverse Time Migration (RTM) is currently the most popular seismic processing algorithm because of its ability to produce quality images of complex substructures. It can accurately image steep dips that can not be imaged correctly with traditional Kirchhoff 3D or frequency domain algorithms. The Wave Equation Migration (WEM) can image steep dips but does not produce the image quality that can be achieved by the RTM. However, the increased computational complexity of the RTM over the WEM introduces new areas for performance optimization. The current trend in seismic processing is to perform iterative migrations on wide azimuth marine data surveys using the Reverse Time Migration.

This Reverse Time Migration code reads in processing parameters that define the grid dimensions, number of threads, number of processors, imaging condition, and various other parameters. The master node calculates the memory requirements to determine if there is sufficient memory to process the migration "in-core". The domain decomposition across all the nodes is determined by dividing the first grid dimension by the number of nodes. Each node then reads in its section of the Velocity Slices, Delta Slices, and Epsilon Slices using MPI IO reads. The three source and receiver wavefield state vectors are created: previous, current, and next state. The processing steps through the input trace data, reading both the receiver and source data for each of the 800 time steps. It uses forward propagation for the source wavefield and backward propagation in time to cross correlate the receiver wavefield. The computational kernel consists of a 13 point stencil to process a subgrid within the memory of each node using OpenMP parallelism. Afterwards, conditioning and absorption are applied and boundary data is communicated to neighboring nodes as each time step is processed. The final image is written out using MPI IO.
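
The domain decomposition step described above, splitting the first grid dimension across the nodes, can be sketched as follows. This is an illustration only: the benchmark code itself is not shown in this post, the function name is hypothetical, and the way leftover planes are handed to the lowest-numbered nodes is an assumption.

```python
def slab_decomposition(n_first_dim: int, n_nodes: int):
    """Split the first grid dimension into one contiguous slab per node.

    Returns a list of (start, stop) index pairs, stop exclusive.
    Assumption: when the dimension does not divide evenly, the first
    'extra' nodes each take one additional plane.
    """
    if n_nodes <= 0 or n_first_dim < n_nodes:
        raise ValueError("need at least one plane per node")
    base, extra = divmod(n_first_dim, n_nodes)
    slabs = []
    start = 0
    for rank in range(n_nodes):
        size = base + (1 if rank < extra else 0)
        slabs.append((start, start + size))
        start += size
    return slabs


# The 1243x1151x1231 grid spread over 24 nodes
for rank, (start, stop) in enumerate(slab_decomposition(1243, 24)):
    print(f"node {rank:2d}: planes {start}..{stop - 1} ({stop - start} planes)")
```

With a slab layout like this, each node only exchanges boundary planes with its two neighbors along the first dimension, which matches the neighbor communication described in the paragraph above.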

Total memory requirements for each grid size:

    125x1151x1231: 7.5GB
    1243x1151x1231: 78GB
    2486x1151x1231: 156GB

For this phase of benchmarking, the focus was to optimize the data initialization. In the next phase of benchmarking, the trace data reading will be optimized so that each node reads in only its section of interest. In this benchmark the trace data reading skews the Total Application Performance as the number of nodes increases. This will be optimized in the next phase of benchmarking, along with further node optimization with OpenMP. The IO description for this benchmark phase on each grid size:

    125x1151x1231:
      Initialization MPI Read: 3 x 709MB = 2.1GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 576KB = 920MB * number of nodes
      Final Output Image MPI Write: 709MB / number of nodes
    1243x1151x1231: 78GB
      Initialization MPI Read: 3 x 7.1GB = 21.3GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 5.7MB = 9.2GB * number of nodes
      Final Output Image MPI Write: 7.1GB / number of nodes
    2486x1151x1231: 156GB
      Initialization MPI Read: 3 x 14.2GB = 42.6GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 11.4MB = 18.4GB * number of nodes
      Final Output Image MPI Write: 14.2GB / number of nodes
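
The I/O sizes listed above are close to what a simple size model predicts. The sketch below assumes 4-byte (single-precision) samples and a trace record of one nx-by-ny plane per time step; both are assumptions inferred from the listed numbers rather than stated in the post. The final image is taken to be one grid volume, following the pattern of the two smaller grids. All names are illustrative.

```python
BYTES_PER_SAMPLE = 4   # assumption: single-precision floats
TIME_STEPS = 800       # from the benchmark description


def io_volumes(nx: int, ny: int, nz: int, n_nodes: int) -> dict:
    """Approximate bytes moved per node in each I/O phase."""
    volume = nx * ny * nz * BYTES_PER_SAMPLE   # one full grid volume
    trace_plane = nx * ny * BYTES_PER_SAMPLE   # assumed size of one trace record
    return {
        # Velocity, Epsilon and Delta volumes, split across the nodes
        "init_read": 3 * volume / n_nodes,
        # Source and receiver traces, read in full by every node in this phase
        "trace_read": 2 * TIME_STEPS * trace_plane,
        # One image volume, split across the nodes
        "image_write": volume / n_nodes,
    }


for grid in [(125, 1151, 1231), (1243, 1151, 1231), (2486, 1151, 1231)]:
    sizes = io_volumes(*grid, n_nodes=1)
    summary = ", ".join(f"{name} {size / 1e9:.2f} GB" for name, size in sizes.items())
    print(f"{grid[0]}x{grid[1]}x{grid[2]}: {summary}")
```

The model makes the scaling problem visible: the initialization read and image write shrink per node as nodes are added, while the trace read stays constant per node, so its aggregate volume grows with the node count.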

Key Points and Best Practices

  • Additional evaluations were performed to compare GBit NFS, Infiniband NFS, and Infiniband Lustre for the Reverse Time Migration Initialization. Infiniband NFS was 6x faster than GBit NFS and Infiniband Lustre was 3x faster than Infiniband NFS using the same disk configurations. On 12 nodes for grid size 2486x1151x1231 the initialization time was 22.50 seconds for IB Lustre, 61.03 seconds for IB NFS, and 364.25 seconds for GBit NFS.
  • The Reverse Time Migration computational performance scales nicely as a function of the grid size being processed. This is consistent with the IBM published results for this application.
  • The Total Application performance results are not typically reported in benchmark studies for this application. The IBM report specifically states that the execution times do not include I/O times and non-recurring allocation or initialization delays. Examining the total application performance reveals that the workload is no longer dominated by the partial differential equation (PDE) solver, as IBM suggests, but is constrained by the I/O for grid initialization, reading in the traces, saving/restoring wave state data, and writing out the final image. Aggressive optimization of the PDE solver has little effect on the overall throughput of this application. It is clearly more important to optimize the I/O. The trend in seismic processing, as stated at the 2008 Society of Exploration Geophysicists (SEG) conference, is to run the reverse time migration iteratively on wide azimuth data. Thus, optimizing the I/O and application throughput is imperative to meet this trend. SSD and Flash technologies in conjunction with Sun's Lustre file system can reduce this I/O bottleneck and pave the path for the future in seismic processing.
  • Minimal tuning effort was applied to achieve the results presented. Sun's HPC software stack, which includes the Sun Studio compiler, was used to build the 70,000 lines of C++ and Fortran source into the application executable. The only compiler option used was "-fast". No assembly level optimizations, like those performed by IBM to use SIMD registers (SSE registers), were performed in this benchmark. Similarly, no explicit cache blocking, loop unrolling, or memory bandwidth optimizations were conducted. The idea was to demonstrate the performance that a customer can expect from their existing applications without extensive, platform-specific optimizations.

See Also

Disclosure Statement

Reverse Time Migration, Results as of 10/23/2009. For more info http://www.caam.rice.edu/tech_reports/2006/TR06-18.pdf

Sunday Oct 25, 2009

A prominent Seismic Processing algorithm, Reverse Time Migration with Optimal Checkpointing, in SMP "THREADS" Mode, was tested using a Sun Fire X4270 server configured with four high performance 15K SAS hard disk drives (HDDs) and a Sun Storage F5100 Flash Array. This benchmark compares I/O devices for checkpointing wave state information while processing a production seismic migration.

  • Sun Storage F5100 Flash Array is 2.2x faster than high-performance 15K RPM disks.

  • Multithreading the checkpointing using the Sun Studio C++ Compiler OpenMP implementation gives a 12.8x performance improvement over the original single threaded version.

These results show that the new trend in seismic processing, to run iterative Reverse Time Migrations and migration playback, is a reality. This is made possible through the use of Sun FlashFire technology to provide good checkpointing speeds without additional disk cache memory. The application can take advantage of all the memory within a node without regard to the checkpoint cache buffers that HDDs require for performance. Similarly, larger problem sizes can be solved without increasing the memory footprint of each computational node.

Performance Landscape


Reverse Time Migration Optimal Checkpointing - SMP Threads Mode
Grid Size - 800 x 1151 x 1231 with 800 Samples - 60GB of memory
Times in seconds

Number     HDD                         F5100                       F5100
Checkpts   Put      Get      Total     Put      Get      Total     Speedup
80         660.8    25.8     686.6     277.4    40.2     317.6     2.2x
400        1615.6   382.3    1997.9    989.5    269.7    1259.2    1.6x


Reverse Time Migration Optimal Checkpointing - SMP Threads Mode
Grid Size - 125 x 1151 x 1231 with 800 Samples - 9GB of memory
Times in seconds

Number     HDD                         F5100                       F5100
Checkpts   Put      Get      Total     Put      Get      Total     Speedup
80         10.2     0.2      10.4      8.0      0.2      8.2       1.3x
400        52.3     0.4      52.7      45.2     0.3      45.5      1.2x
800        102.6    0.7      103.3     91.8     0.6      92.4      1.1x


Reverse Time Migration Optimal Checkpointing
Single Thread vs Multithreaded I/O Performance
Grid Size - 125 x 1151 x 1231 with 800 Samples - 9GB of memory

Number Checkpts | Single Thread F5100 Total Time (secs) | Multithreaded F5100 Total Time (secs) | Multithread Speedup
80 | 105.3 | 8.2 | 12.8x
400 | 482.9 | 45.5 | 10.6x
800 | 963.5 | 92.4 | 10.4x

Note: Hyperthreading and Turbo Mode enabled while running 16 threads per node.

Results and Configuration Summary

Hardware Configuration:

    Sun Fire X4270 Server
      2 x 2.93 GHz Quad-core Intel Xeon X5570 processors
      72 GB memory
      4 x 73 GB 15K SAS drives
        File system striped across 4 15K RPM high-performance SAS HD RAID0
      Sun Storage F5100 Flash Array with local/internal r/w buff 4096
        20 x 24 GB flash modules

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Compiler: Sun Studio 12 C++, Fortran, OpenMP

Benchmark Description

Reverse Time Migration (RTM) is currently the most popular seismic processing algorithm because of its ability to produce quality images of complex substructures. It can accurately image steep dips that cannot be imaged correctly with traditional Kirchhoff 3D or frequency domain algorithms. Wave Equation Migration (WEM) can image steep dips but does not produce the image quality that RTM can achieve. However, the increased computational complexity of RTM over WEM introduces new areas for performance optimization. The current trend in seismic processing is to perform iterative migrations on wide azimuth marine data surveys using Reverse Time Migration.

Reverse Time Migration with Optimal Checkpointing was introduced so that large migrations could be performed within the minimal memory configurations of x86 cluster nodes. The idea is to hold only three wavestate vectors in memory for each of the source and receiver wavefields, instead of holding the entire wavefields in memory for the duration of processing. With the Sun Storage F5100 Flash Array, this can be done with little performance penalty to the full migration time. Another advantage of checkpointing is the ability to play back migrations and to facilitate iterative migrations.

  • The stored snapshot data can be reprocessed with different filtering, image conditioning, or a variety of other parameters.
  • Fine-grained snapshotting can help the processing of more complex subsurface data.
  • A Geoscientist can "playback" a migration from the saved snapshots to visually validate migration accuracy or pick areas of interest for additional processing.

The Reverse Time Migration with Optimal Checkpointing is an algorithm designed by Griewank (Griewank, 1992; Blanch et al., 1998; Griewank, 2000; Griewank and Walther, 2000; Akcelik et al., 2003).

  • The application takes snapshots of wavefield state data for some interval of the total number of samples.
  • This adjoint state method performs crosscorrelation of the source and receiver wavefields at each level.
  • Forward recursion is used for the source wavefield and backward recursion for the receiver wavefield.
  • For relatively small seismic migrations, all of the forward processed state information can be saved and restored with minimal impact on the total processing time.
  • Effectively, the computational complexity increases while the memory requirements decrease by a logarithmic factor of the number of snapshots.
  • Griewank's algorithm helps define the optimal tradeoff between computational performance and the number of memory buffers (memory requirements) needed to support this crosscorrelation.
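
The snapshot-and-recompute idea in the list above can be sketched as follows. This is a minimal illustration only, assuming a toy integer "wavefield" and uniformly spaced snapshots; it is not Griewank's optimal schedule and not the benchmark's code. The names `step`, `forward_with_snapshots`, `states_in_reverse`, and `interval` are hypothetical.

```python
def step(state):
    """Toy forward propagator (a stand-in for one wave-equation time step)."""
    return (3 * state + 1) % 1_000_003


def forward_with_snapshots(initial, nsteps, interval):
    """Run the forward recursion, saving a snapshot every `interval` steps."""
    snapshots = {}
    state = initial
    for t in range(nsteps):
        if t % interval == 0:
            snapshots[t] = state          # "put": save the wave state at time t
        state = step(state)
    return snapshots


def states_in_reverse(snapshots, nsteps, interval):
    """Yield (t, state) for t = nsteps-1 down to 0.

    Each snapshot is restored once and the states inside its interval are
    recomputed, so at most `interval` states are held in memory at a time.
    """
    for start in sorted(snapshots, reverse=True):
        state = snapshots[start]          # "get": restore a snapshot
        segment = [state]
        for _ in range(start + 1, min(start + interval, nsteps)):
            state = step(state)           # recompute forward inside the interval
            segment.append(state)
        for offset in range(len(segment) - 1, -1, -1):
            yield start + offset, segment[offset]
```

A backward pass (the receiver wavefield in the text) would consume `states_in_reverse(...)` one time level at a time and crosscorrelate with the matching source state. Fewer snapshots mean less storage but more recomputation, which is the tradeoff the list describes.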

For the purposes of this benchmark, this implementation of the Reverse Time Migration with Optimal Checkpointing does not fully implement the optimal memory buffer scheme proposed by Griewank. The intent is to compare various I/O alternatives for saving wave state data for each node in a compute cluster.

This benchmark measures the time to perform the wave state saves and restores while simultaneously processing the wave state data.

Key Points and Best Practices

  • Multithreading the checkpointing using Sun Studio OpenMP and running 16 I/O threads with hyperthreading enabled gives a performance advantage over single-threaded I/O to the Sun Storage F5100 flash array. The Sun Storage F5100 flash array can process concurrent I/O requests from multiple threads very efficiently.
  • Allocating the majority of a node's available memory to the Reverse Time Migration algorithm and leaving little memory for I/O caching favors the Sun Storage F5100 flash array over direct attached high-performance disk drives. This performance advantage decreases as the number of snapshots increases, because increasing the number of snapshots decreases the memory requirement for the application.
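
As a rough illustration of the first point, the sketch below splits one checkpoint into slices and writes them concurrently. The benchmark itself used OpenMP from Sun Studio C++; this Python thread-pool version is only an analogy, assuming a POSIX system (for `os.pwrite`). The function names and the file path passed in are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def write_chunk(path, offset, data):
    """Write one slice of the checkpoint at its own file offset."""
    fd = os.open(path, os.O_WRONLY)
    try:
        written = 0
        while written < len(data):        # pwrite may write fewer bytes than asked
            written += os.pwrite(fd, data[written:], offset + written)
    finally:
        os.close(fd)
    return written


def put_checkpoint(path, wave_state, nthreads=16):
    """Split `wave_state` (bytes) into slices and write them concurrently."""
    chunk = max(1, -(-len(wave_state) // nthreads))   # ceiling division
    # Pre-size the file so every thread can write at an independent offset.
    with open(path, "wb") as f:
        f.truncate(len(wave_state))
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        futures = [
            pool.submit(write_chunk, path, off, wave_state[off:off + chunk])
            for off in range(0, len(wave_state), chunk)
        ]
        return sum(f.result() for f in futures)
```

The design keeps each thread on a disjoint byte range so no locking is needed. Whether concurrent writers help depends on the device: the text reports that the flash array handles concurrent requests efficiently.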

See Also

Disclosure Statement

Reverse Time Migration with Optimal Checkpointing, Results as of 10/23/2009. For more info http://www.caam.rice.edu/tech_reports/2006/TR06-18.pdf

Friday Oct 23, 2009

A fantastic source of technical Best Practices is at
http://wikis.sun.com/display/Performance/Home

This wiki hosts the combined wisdom of many performance engineers from across Sun. It has information about Hardware, Software, ZFS, Oracle and various other performance topics. This wiki attempts to categorize and present information so it is easy to find and use. It is just getting started, so please let us know if there are any topics which would be useful.

Tuesday Oct 20, 2009

An engineer in our group wrote this blog posting:
"Exadata V2... Oracle grid consolidation in a box"

Link:
http://blogs.sun.com/glennf/entry/exadata_v2_oracle_grid_consolidation

Thursday Oct 15, 2009

Overview and Significance of Results

Oracle and Sun's Flash Cache technology combines new features in Oracle with the Sun Storage F5100 to improve database performance. In Oracle databases, the System Global Area (SGA) is a group of shared memory areas that are dedicated to an Oracle “instance” (Oracle processes in execution sharing a database). All Oracle processes use the SGA to hold information. The SGA is used to store incoming data (data and index buffers) and internal control information that is needed by the database. The size of the SGA is limited by the size of the available physical memory.

This benchmark tested and measured the performance of a new Oracle Database 11g (Release 2) feature, which allows the SGA size and caching to be extended beyond physical memory to a large flash memory storage device such as the Sun Storage F5100 flash array.

One particular benchmark test demonstrated a dramatic performance improvement (almost 5x) using the Oracle Extended SGA feature on flash storage by reaching SGA sizes in the hundreds of GB range, at a more reasonable cost than equivalently sized RAM and with much faster access times than disk I/O.

The workload consisted of a high volume of SQL select transactions accessing a very large table in a typical business-oriented OLTP database. To obtain a baseline, throughput and response times were measured by applying the workload against a traditional storage configuration constrained by disk I/O demand (DB working set of about 3x the size of the data cache in the SGA). The workload was then executed with an added Sun Storage F5100 Flash Array configured to contain an Extended SGA of incremental size.

The tests showed throughput scaling with increasing Flash Cache size.

Table of Results

F5100 Extended SGA Size (GB) | Query Txns / Min | Avg Response Time (Secs) | Speedup Ratio
No | 76338 | 0.118 | N/A
25 | 169396 | 0.053 | 2.2
50 | 224318 | 0.037 | 2.9
75 | 300568 | 0.031 | 3.9
100 | 357086 | 0.025 | 4.6
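
The Speedup Ratio column can be reproduced from the throughput figures in the table above. The short sketch below assumes the ratio is throughput relative to the run without an Extended SGA. Under that assumption the published values match the computed ratios truncated (not rounded) to one decimal place. The helper names are hypothetical.

```python
import math

# (Extended SGA size in GB, query transactions per minute) from the table above.
# Size 0 stands for the "No" row, the baseline without an Extended SGA.
results = [(0, 76338), (25, 169396), (50, 224318), (75, 300568), (100, 357086)]

baseline = results[0][1]


def speedup(txns_per_min):
    """Throughput relative to the baseline run."""
    return txns_per_min / baseline


def truncate1(x):
    """Truncate a positive number to one decimal place."""
    return math.floor(x * 10) / 10


for size_gb, tpm in results[1:]:
    print(size_gb, round(speedup(tpm), 3), truncate1(speedup(tpm)))
```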




Configuration Summary

Server Configuration:

    Sun SPARC Enterprise M5000 Server
    8 x SPARC64 VII 2.4GHz Quad Core
    96 GB memory

Storage Configuration:

    8 x Sun Storage J4200 Arrays, 12x 146 GB 15K RPM disks each (96 disks total)
    1 x Sun Storage F5100 Flash Array

Software Configuration:

    Oracle 11gR2
    Solaris 10

Benchmark Description

The workload consisted of a high volume of SQL select transactions accessing a very large table in a typical business-oriented OLTP database.

The database consisted of various tables: Products, Customers, Orders, Warehouse Inventory (Stock) data, etc. The Stock table alone was 3x the size of the db cache.

To obtain a baseline, throughput and response times were measured applying the workload against a traditional storage configuration and constrained by disk I/O demand. The workload was then executed with an added Sun Storage F5100 Flash Array configured to contain an Extended SGA of incremental size.

During all tests, the in-memory SGA data cache was limited to 25 GB.

The Extended SGA was allocated on a “raw” Solaris Volume created with the Solaris Volume Manager (SVM) on a set of devices (Flash Modules) residing on the Sun Storage F5100 flash array.

Key Points and Best Practices

In order to verify the performance improvement brought by extended SGA, the feature had to be tested with a large enough database size and with a workload requiring significant disk I/O activity to access the data. For that purpose, the size of the database needed to be a multiple of the physical memory size, avoiding the case in which the accessed data could be entirely or almost entirely cached in physical memory.

The above represents a typical “use case” in which the Flash Cache Extension is able to show remarkable performance advantages.

If the DB dataset is already entirely cached, the DB I/O demand is not significant, the application is already saturating the CPU with non-database processing, or large data caching is not productive (DSS-type queries), the Extended SGA may not improve performance.

It is also relevant to know that additional memory structures needed to manage the Extended SGA are allocated in the “in memory” SGA, therefore reducing its data caching capacity.

Increasing the Extended Cache beyond a specific threshold, dependent on various factors, may reduce the benefit of widening the Flash SGA and actually reduce the overall throughput.

This new cache is somewhat similar architecturally to the L2ARC on ZFS. Once written, flash cache buffers are read-only, and updates are only done into main memory SGA buffers. This feature is expected to primarily benefit read-only and read-mostly workloads.

A typical sizing of database flash cache is 2x to 10x the size of SGA memory buffers. Note that header information is stored in the SGA for each flash cache buffer (100 bytes per buffer in exclusive mode, 200 bytes per buffer in RAC mode), so the number of available SGA buffers is reduced as the flash cache size increases, and the SGA size should be increased accordingly.
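
To get a feel for the header overhead described above, the arithmetic can be sketched as below. The per-buffer header sizes come from the text. The 8 KiB block size and the reading of a 100 GB cache as 100 GiB are assumptions made for illustration, not values reported for this benchmark. The function name is hypothetical.

```python
GIB = 1024 ** 3

# Header bytes kept in the in-memory SGA per flash cache buffer (from the text).
HEADER_BYTES = {"exclusive": 100, "rac": 200}


def flash_header_overhead(flash_cache_bytes, block_size=8192, mode="exclusive"):
    """Bytes of in-memory SGA consumed by flash cache buffer headers.

    block_size is an ASSUMED database block size (8 KiB); substitute the
    block size of the actual database.
    """
    buffers = flash_cache_bytes // block_size
    return buffers * HEADER_BYTES[mode]


overhead = flash_header_overhead(100 * GIB)
print(f"{overhead / GIB:.2f} GiB of SGA used for headers")   # prints 1.22 GiB ...
```

Under these assumptions a 100 GiB flash cache takes a little over 1 GiB out of the in-memory SGA in exclusive mode and twice that in RAC mode, which is why the text suggests increasing the SGA size accordingly.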

Two new init.ora parameters have been introduced, illustrated below:

    db_flash_cache_file = /lfdata/lffile_raw
    db_flash_cache_size = 100G
The db_flash_cache_file parameter takes a single file name, which can be a file system file, a raw device, or an ASM volume. The db_flash_cache_size parameter specifies the size of the flash cache. Note that for raw devices, the partition being used should start at cylinder 1 rather than cylinder 0 (to avoid the disk's volume label).

See Also

Disclosure Statement

Results as of October 10, 2009 from Sun Microsystems.

This blog copyright 2009 by John Henning