Sunday Oct 25, 2009

A prominent Seismic Processing algorithm, Reverse Time Migration with Optimal Checkpointing, in SMP "THREADS" Mode, was testing using a Sun Fire X4270 server configured with four high performance 15K SAS hard disk drives (HDDs) and a Sun Storage F5100 Flash Array. This benchmark compares I/O devices for checkpointing wave state information while processing a production seismic migration.

  • Sun Storage F5100 Flash Array is 2.2x faster than high-performance 15K RPM disks.

  • Multithreading the checkpointing using the Sun Studio C++ Compiler OpenMP implementation gives a 12.8x performance improvement over the original single threaded version.

These results show the new trend in seismic processing to run iterative Reverse Time Migrations and migration playback is a reality. This is made possible through the use of Sun FlashFire technology to provide good checkpointing speeds without additional disk cache memory. The application can take advantage of all the memory within a node without regard to checkpoint cache buffers required for performance to HDDs. Similarly, larger problem sizes can be solved without increasing the memory footprint of each computational node.

Performance Landscape


Reverse Time Migration Optimal Checkpointing - SMP Threads Mode
Grid Size -800 x 1151 x 1231 with 800 Samples - 60GB of memory
Number
Checkpts
HDD F5100
Put Time
(secs)
Get Time
(secs)
Total Time
(secs)
Put Time
(secs)
Get Time
(secs)
Total Time
(secs)
F5100
Speedup
80 660.8 25.8 686.6 277.4 40.2 317.6 2.2x
400 1615.6 382.3 1997.9 989.5 269.7 1259.2 1.6x


Reverse Time Migration Optimal Checkpointing - SMP Threads Mode
Grid Size -125 x 1151 x 1231 with 800 Samples - 9GB of memory
Number
Checkpts
HDD F5100
Put Time
(secs)
Get Time
(secs)
Total Time
(secs)
Put Time
(secs)
Get Time
(secs)
Total Time
(secs)
F5100
Speedup
80 10.2 0.2 10.4 8.0 0.2 8.2 1.3x
400 52.3 0.4 52.7 45.2 0.3 45.5 1.2x
800 102.6 0.7 103.3 91.8 0.6 92.4 1.1x


Reverse Time Migration Optimal Checkpointing
Single Thread vs Multithreaded I/O Performance
Grid Size -125 x 1151 x 1231 with 800 Samples - 9GB of memory
Number
Checkpts
Single Thread F5100
Total Time (secs)
Multithreaded F5100
Total Time (secs)
Multithread
Speedup
80 105.3 8.2 12.8x
400 482.9 45.5 10.6x
800 963.5 92.4 10.4x

Note: Hyperthreading and Turbo Mode enabled while running 16 threads per node.

Results and Configuration Summary

Hardware Configuration:

    Sun Fire 4270 Server
      2 x 2.93 GHz Quad-core Intel Xeon X5570 processors
      72 GB memory
      4 x 73 GB 15K SAS drives
        File system striped across 4 15K RPM high-performance SAS HD RAID0
      Sun Storage F5100 Flash Array with local/internal r/w buff 4096
        20 x 24 GB flash modules

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Compiler: Sun Studio 12 C++, Fortran, OpenMP

Benchmark Description

The Reverse Time Migration (RTM) is currently the most popular seismic processing algorithm because of it's ability to produce quality images of complex substructures. It can accurately image steep dips that can not be imaged correctly with traditional Kirchhoff 3D or frequency domain algorithms. The Wave Equation Migration (WEM) can image steep dips but does not produce the image quality that can be achieved by the RTM. However, the increased computational complexity of the RTM over the WEM introduces new areas for performance optimization. The current trend in seismic processing is to perform iterative migrations on wide azimuth marine data surveys using the Reverse Time Migration.

The Reverse Time Migration with Optimal Checkpointing was introduced so large migrations could be performed within minimal memory configurations of x86 cluster nodes. The idea is to only have three wavestate vectors in memory for each of the source and receiver wavefields instead of holding the entire wavefields in memory for the duration of processing. With the Sun Flash F5100, this can be done with little performance penalty to the full migration time. Another advantage of checkpointing is to provide the ability to playback migrations and facilitate iterative migrations.

  • The stored snapshot data can be reprocessed with different filtering, image conditioning, or a variety of other parameters.
  • Fine grain snapshoting can help the processing of more complex subsurface data.
  • A Geoscientist can "playback" a migration from the saved snapshots to visually validate migration accuracy or pick areas of interest for additional processing.

The Reverse Time Migration with Optimal Checkpointing is an algorithm designed by Griewank (Griewank, 1992; Blanch et al., 1998; Griewank, 2000; Griewank and Walther, 2000; Akcelik et al., 2003).

  • The application takes snapshots of wavefield state data for some interval of the total number of samples.
  • This adjoint state method performs crosscorrelation of the source and receiver wavefields at the each level.
  • Forward recursion is used for the source wavefield and backward recursion for the receiver wavefield.
  • For relatively small seismic migrations, all of the forward processed state information can be saved and restored with minimal impact on the total processing time.
  • Effectively, the computational complexity increases while the memory requirements decrease by a logarithmic factor of the number of snapshots.
  • Griewank's algorithm helps define the most optimal tradeoff between computational performance and the number of memory buffers (memory requirements) to support this cross correlation.

For the purposes of this benchmark, this implementation of the Reverse Time Migration with Optimal Checkpointing does not fully implement the optimal memory buffer scheme proposed by Griewank. The intent is to compare various I/O alternatives for saving wave state data for each node in a compute cluster.

This benchmark measures the time to perform the wave state saves and restores while simultaneously processing the wave state data.

Key Points and Best Practices

  • Mulithreading the checkpointing using Sun Studio OpenMP and running 16 I/O threads with hyperthreading enabled gives a performance advantage over single threaded I/O to the Sun Storage F5100 flash array. The Sun Storage F5100 flash array can process concurrent I/O requests from multiple threads very efficiently.
  • Allocating the majority of a node's available memory to the Reverse Time Migration algorithm and leaving little memory for I/O caching favors the Sun Storage F5100 flash array over direct attached high performance disk drives. This performance advantage decreases as the number of snapshots increase. The reason for this is that increasing the number of snapshots decreases the memory requirement for the application.

See Also

Disclosure Statement

Reverse Time Migration with Optimal Checkpointing, Results as of 10/23/2009. For more info http://www.caam.rice.edu/tech_reports/2006/TR06-18.pdf

Tuesday Oct 13, 2009

The Sun SPARC Enterprise M4000 server combined with Sun FlashFire technology, the Sun Storage F5100 flash array, has produced World Record Performance on PeopleSoft Payroll (North America) 9.0 benchmark.

  • A Sun SPARC Enterprise M4000 server with four new 2.53GHz SPARC64 VII processors and a Sun Storage F5100 flash array is 33% faster than the HP rx6600 (4 x 1.6GHz Itanium2 processors) on the PeopleSoft Payroll (NA) 9.0 benchmark. The Sun solution used the Oracle 11g database running on Solaris 10.

  • The Sun SPARC Enterprise M4000 server with four 2.53GHz SPARC64 VII processors and the Sun Storage F5100 flash array is 35% faster than the 2027 MIPs IBM Z990 (6 Z990 Gen1 processors) on the PeopleSoft Payroll (NA) 9.0 benchmark with Oracle 11g database running on Solaris 10. The IBM result used IBM DB2 for Z/OS 8.1 for the database.

  • The Sun SPARC Enterprise M4000 server with four 2.53GHz SPARC64 VII processors and a Sun Storage F5100 flash array processed 250K employee payroll checks using PeopleSoft Payroll (NA) 9.0 and Oracle 11g running on Solaris 10. Four different execution strategies were run with an average improvement of 25% compared to HP's results run on the rx6600. Sun achieved these results with 8 concurrent jobs using only 25% CPU utilization while HP required 16 concurrent jobs with a 88% CPU utilization.

  • The Sun SPARC Enterprise M4000 server combined with Sun FlashFire technology processed 8 Sequential Jobs and single run control with a total time of 527.85 mins, an improvement of 20% compared to HPs time of 633.09 mins.

  • The Sun SPARC Enterprise M4000 server combined with Sun FlashFire technology demonstrated a speedup of 81% going from 1 to 8 streams on the PeopleSoft Payroll (NA) 9.0 benchmark using the Oracle 11g database.

  • The Sun FlashFire technology dramatically improves IO performance for the PeopleSoft Payroll benchmark with significant performance boost over best optimized FC disks (60+).

  • The Sun Storage F5100 Flash Array is a high performance high density solid state flash array which provides a read latency of only 0.5 msec which is about 10 times faster than the normal disk latencies 5 msec measured on this benchmark.

  • Sun estimates that the MIPS rating for a Sun SPARC Enterprise M4000 server is over 2742 MIPS.

Performance Landscape

250K Employees

System Processor OS/Database Time in Minutes Version
Run 1 Run 2 Run 3
Sun M4000 4x 2.53GHz SPARC64 VII Solaris/Oracle 11g 79.35 288.47 527.85 9.0
HP rx6600 4x 1.6GHz Itanium2 HP-UX/Oracle 11g 81.17 350.16 633.25 9.0
IBM Z990 6x Gen1 2027 MIPS Z/OS /DB2 107.34 328.66 544.80 9.0
HP rx6600 4x 1.6GHz Itanium2 HP-UX/Oracle 11g 105.70 369.59 633.09 9.0

Note: IBM benchmark documents show that 6 Gen1 procs is 2027 mips. 13 Gen1 processors were in this config but only 6 were available for testing.

500K Employees

System Processor OS/Database Time in Minutes Version
Run 1 Run 2 Run 3
HP rx7640 8x 1.6GHz Itanium2 HP-UX/Oracle 11g 133.63 712.72 1665.01 9.0

Results and Configuration Summary

Hardware Configuration:

    1 x Sun SPARC Enterprise M4000 (4 x 2.53 GHz/32GB)
    1 x Sun Storage F5100 Flash Array (40 x 24GB FMODs)
    1 x Sun Storage J4200 (12 x 450GB SAS 15K RPM)

Software Configuration:

    Solaris 10 5/09
    Oracle PeopleSoft HCM 9.0
    Oracle PeopleSoft Enterprise (PeopleTools) 8.49
    Micro Focus Server Express 4.0 SP4
    Oracle RDBMS 11.1.0.7 64-bit
    HP's Mercury Interactive QuickTest Professional 9.0

Benchmark Description

The PeopleSoft 9.0 Payroll (North America) benchmark is a performance benchmark established by PeopleSoft to demonstrate system performance for a range of processing volumes in a specific configuration. This information may be used to determine the software, hardware, and network configurations necessary to support processing volumes. This workload represents large batch runs typical of OLTP workloads during a mass update.

To measure five application business process run times for a database representing large organization. The five processes are:

  • Paysheet Creation: generates payroll data worksheet for employees, consisting of std payroll information for each employee for given pay cycle.

  • Payroll Calculation: Looks at Paysheets and calculates checks for those employees.

  • Payroll Confirmation: Takes information generated by Payroll Calculation and updates the employees' balances with the calculated amounts.

  • Print Advice forms: The process takes the information generated by payroll Calculations and Confirmation and produces an Advice for each employee to report Earnings, Taxes, Deduction, etc.

  • Create Direct Deposit File: The process takes information generated by above processes and produces an electronic transmittal file use to transfer payroll funds directly into an employee bank a/c.

For the benchmark, we collect at least four data points with different number of job streams (parallel jobs). This batch benchmark allows a maximum of eight job streams to be configured to run in parallel.

Key Points and Best Practices

See Also

Disclosure Statement

Oracle PeopleSoft Payroll (NA) 9.0 benchmark, Sun M4000 (4 2.53GHz SPARC64) 79.35 min, IBM Z990 (6 gen1) 107.34 min, HP rx6600 (4 1.6GHz Itanium2) 105.70 min, www.oracle.com/apps_benchmark/html/white-papers-peoplesoft.html Results 10/13/2009.

Monday Oct 12, 2009

Significance of Results

The Sun Storage F5100 Flash Array is a high performance high density solid state flash array delivering over 1.6M IOPS (4K IO) and 12.8GB/sec throughput (1M reads). The Flash Array is designed to accelerate IO-intensive applications, such as databases, at a fraction of the power, space, and cost of traditional hard disk drives. It is based on enterprise-class SLC flash technology, with advanced wear-leveling, integrated backup protection, solid state robustness, and 3M hours MTBF reliability.

  • The Sun Storage F5100 Flash Array demonstrates breakthrough performance of 1.6M IOPS for 4K random reads
  • The Sun Storage F5100 Flash Array can also perform 1.2M IOPS for 4K random writes
  • The Sun Storage F5100 Flash Array has unprecedented throughput of 12.8 GB/sec.

Performance Landscape

Results were obtained using four hosts.

Bandwidth and IOPS Measurements

Test Flash Modules
80 40 20 1
Random 4K Read 1,591K IOPS 796K IOPS 397K IOPS 21K IOPS
Maximum Delivered Random 4K Write 1,217K IOPS 610K IOPS 304K IOPS 15K IOPS
Maximum Delivered 50-50 4K Read/Write 850K IOPS 426K IOPS 213K IOPS 11K IOPS
Sequential Read (1M) 12.8 GB/sec 6.4 GB/sec 3.2 GB/sec 265 MB/sec
Maximum Delivered Sequential Write (1M) 9.7 GB/sec 4.8 GB/sec 2.4 GB/sec 118 MB/sec

Sustained Random 4K Write* 311K IOPS 169K IOPS 86K IOPS 4.4K IOPS

(*) Maximum Delivered values measured over a 1 minute period. Sustained write performance differs from maximum delivered performance. Over time, wear-leveling and erase operations are required and impact write performance levels.

Latency Measurements

The Sun Storage F5100 Flash Array is tuned for 4 KB or larger IO sizes, the write service for IOs smaller than 4 KB can be 10 times more than shown in the table below. It should also be noted that the service times shown below are both the latency and the time to transfer the data. This becomes the dominant portion the the service time for IOs over 64 KB in size.

Transfer Size Service Time (ms)
Read Write
4 KB 0.41 0.28
8 KB 0.42 0.35
16 KB 0.45 0.72
32 KB 0.51 0.77
64 KB 0.63 1.52
128 KB 0.87 2.99
256 KB 1.34 6.03
512 KB 2.29 12.14
1024 KB 4.19 23.79

- Latencies are measured application latencies via vdbench tool.
- Please note that the F5100 Flash Array is a 4KB sector device. Doing IOs of less than 4KB in size, or not aligned on 4KB boundaries, can result in a significant performance degradations on write operations.

Results and Configuration Summary

Storage:

    Sun Storage F5100 Flash Array
      80 Flash Modules
      16 ports
      4 domains (20 Flash Modules per)
      CAM zoning - 5 Flash Modules per port

Servers:

    4 x Sun SPARC Enterprise T5240
    4 x 4 HBAs each, firmware version 01.27.03.00-IT

Software:

    OpenSolaris 2009.06 or Solaris 10 10/09 (MPT driver enhancements)
    Vdbench 5.0
    Required Flash Array Patches SPARC, ses/sgen patch 138128-01 or later & mpt patch 141736-05
    Required Flash Array Patches x86, ses/sgen patch 138129-01 or later & mpt patch 141737-05

Benchmark Description

Sun measured a wide variety of IO performance metrics on the Sun Storage F5100 Flash Array using Vdbench 5.0 measuring 100% Random Read, 100% Random Write, 100% Sequential Read, 100% Sequential Write, and 50-50 read/write. This demonstrates the maximum performance and throughput of the storage system.

Vdbench profile parmfile.txt here

Vdbench is publicly available for download at: http://vdbench.org

Key Points and Best Practices

  • Drive each Flash Modules with 32 outstanding IO as shown in the benchmark profile above.
  • LSI HBA firmware level should be at Phase 15 maxq.
  • LSI HBAs either use single port HBAs or only 1 port per HBA.
  • SPARC platforms will align with the 4K boundary size set by the Flash Array. x86/windows platforms don't necessarily have this alignment built in and can show lower performance

See Also

Disclosure Statement

Sun Storage F5100 Flash Array delivered 1.6M 4K read IOPS and 12.8 GB/sec sequential read. Vdbench 5.0 (http://vdbench.org) was used for the test. Results as of September 12, 2009.

This blog copyright 2009 by John Henning