Sun C48 & Lustre fast for Seismic Reverse Time Migration using Sun X6275
Significance of Results
A Sun Blade 6048 Modular System with 12 Sun Blade X6275 server modules was clustered with QDR InfiniBand and used a Lustre file system, also over QDR InfiniBand, to show performance improvements over an NFS file system for reading in Velocity, Epsilon, and Delta Slices and imaging 800 samples of various grid sizes using Reverse Time Migration.
- The Initialization Time for populating the processing grids demonstrates significant advantages of Lustre over NFS:
  - 2486x1151x1231 : 20x improvement
  - 1243x1151x1231 : 20x improvement
  - 125x1151x1231 : 11x improvement
- The Total Application Performance shows the interconnect and I/O advantages of using QDR InfiniBand Lustre for the large grid sizes:
  - 2486x1151x1231 : 2x improvement - processed in less than 19 minutes
  - 1243x1151x1231 : 2x improvement - processed in less than 10 minutes
- The Computational Kernel Scalability Efficiency for the 3 grid sizes:
  - 125x1151x1231 : 97% (1-8 nodes)
  - 1243x1151x1231 : 102% (8-24 nodes)
  - 2486x1151x1231 : 100% (12-24 nodes)
- The Total Application Scalability Efficiency for the large grid sizes:
  - 1243x1151x1231 : 72% (8-24 nodes)
  - 2486x1151x1231 : 71% (12-24 nodes)
- On the Intel Xeon X5570 processor with HyperThreading enabled, running 16 OpenMP threads per node gives approximately a 10% performance improvement over running 8 threads per node.
Performance Landscape
This first table presents the initialization time for different numbers of processors and different problem sizes. The results are presented in seconds and show the advantage that the Lustre file system running over QDR InfiniBand provided when compared to a simple NFS file system.

Initialization Time Performance Comparison: Reverse Time Migration - SMP Threads and MPI Mode (800 samples, times in seconds)

| Nodes | Procs | 125 x 1151 x 1231 Lustre | 125 x 1151 x 1231 NFS | 1243 x 1151 x 1231 Lustre | 1243 x 1151 x 1231 NFS | 2486 x 1151 x 1231 Lustre | 2486 x 1151 x 1231 NFS |
|---|---|---|---|---|---|---|---|
| 24 | 48 | 1.59 | 18.90 | 8.90 | 181.78 | 15.63 | 362.48 |
| 20 | 40 | 1.60 | 18.90 | 8.93 | 181.49 | 16.91 | 358.81 |
| 16 | 32 | 1.58 | 18.59 | 8.97 | 181.58 | 17.39 | 353.72 |
| 12 | 24 | 1.54 | 18.61 | 9.35 | 182.31 | 22.50 | 364.25 |
| 8 | 16 | 1.40 | 18.60 | 10.02 | 183.79 | | |
| 4 | 8 | 1.57 | 18.80 | | | | |
| 2 | 4 | 2.54 | 19.31 | | | | |
| 1 | 2 | 4.54 | 20.34 | | | | |

This next table presents the total application run time for different numbers of processors and different problem sizes. It shows that for the larger problems, the Lustre file system running over QDR InfiniBand provided roughly a 2x performance advantage at 24 nodes when compared to a simple NFS file system.

Total Application Performance Comparison: Reverse Time Migration - SMP Threads and MPI Mode (800 samples, times in seconds)

| Nodes | Procs | 125 x 1151 x 1231 Lustre | 125 x 1151 x 1231 NFS | 1243 x 1151 x 1231 Lustre | 1243 x 1151 x 1231 NFS | 2486 x 1151 x 1231 Lustre | 2486 x 1151 x 1231 NFS |
|---|---|---|---|---|---|---|---|
| 24 | 48 | 251.48 | 273.79 | 553.75 | 1125.02 | 1107.66 | 2310.25 |
| 20 | 40 | 232.00 | 253.63 | 658.54 | 971.65 | 1143.47 | 2062.80 |
| 16 | 32 | 227.91 | 209.66 | 826.37 | 1003.81 | 1309.32 | 2348.60 |
| 12 | 24 | 217.77 | 234.61 | 884.27 | 1027.23 | 1579.95 | 3877.88 |
| 8 | 16 | 223.38 | 203.14 | 1200.71 | 1362.42 | | |
| 4 | 8 | 341.14 | 272.68 | | | | |
| 2 | 4 | 605.62 | 625.25 | | | | |
| 1 | 2 | 892.40 | 841.94 | | | | |

The following table presents the run time and speedup of just the computational kernel for different processor counts and the three problem sizes considered. Speedup is computed relative to the smallest number of nodes run for each problem size. Where that baseline is larger than one node, its speedup is set to its node count (entries marked with *).

Computational Kernel Performance & Scalability: Reverse Time Migration - SMP Threads and MPI Mode (800 samples, X6275 times in seconds)

| Nodes | Procs | 125 x 1151 x 1231 X6275 Time | 125 x 1151 x 1231 Speedup: 1-node | 1243 x 1151 x 1231 X6275 Time | 1243 x 1151 x 1231 Speedup: 1-node | 2486 x 1151 x 1231 X6275 Time | 2486 x 1151 x 1231 Speedup: 1-node |
|---|---|---|---|---|---|---|---|
| 24 | 48 | 35.38 | 13.7 | 210.82 | 24.5 | 427.40 | 24.0 |
| 20 | 40 | 35.02 | 13.8 | 255.27 | 20.2 | 517.03 | 19.8 |
| 16 | 32 | 41.76 | 11.6 | 317.96 | 16.2 | 646.22 | 15.8 |
| 12 | 24 | 49.53 | 9.8 | 422.17 | 12.2 | 853.37 | 12.0* |
| 8 | 16 | 62.34 | 7.8 | 645.27 | 8.0* | | |
| 4 | 8 | 124.66 | 3.9 | | | | |
| 2 | 4 | 238.80 | 2.0 | | | | |
| 1 | 2 | 484.89 | 1.0 | | | | |

The last table presents the speedup of the total application for different processor counts and the three problem sizes presented. Speedup is computed relative to the smallest number of nodes run for each problem size. Where that baseline is larger than one node, its speedup is set to its node count (entries marked with *).

Total Application Scalability Comparison: Reverse Time Migration - SMP Threads and MPI Mode (800 samples, Lustre Speedup: 1-node)

| Nodes | Procs | 125 x 1151 x 1231 | 1243 x 1151 x 1231 | 2486 x 1151 x 1231 |
|---|---|---|---|---|
| 24 | 48 | 3.6 | 17.3 | 17.1 |
| 20 | 40 | 3.8 | 14.6 | 16.6 |
| 16 | 32 | 4.0 | 11.6 | 14.5 |
| 12 | 24 | 4.1 | 10.9 | 12.0* |
| 8 | 16 | 4.0 | 8.0* | |
| 4 | 8 | 2.6 | | |
| 2 | 4 | 1.5 | | |
| 1 | 2 | 1.0 | | |

Note: HyperThreading is enabled and running 16 threads per Node.
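The speedup and efficiency figures can be reproduced from the run times in the tables above. The following is a minimal sketch, assuming that speedup is scaled so that the smallest run equals its own node count and that efficiency is speedup divided by node count. The function names are illustrative and not part of the benchmark code.

```python
def scaled_speedup(times, baseline_nodes):
    """Speedup relative to the smallest run, scaled so that the
    baseline run has a speedup equal to its own node count."""
    t0 = times[baseline_nodes]
    return {nodes: baseline_nodes * t0 / t for nodes, t in times.items()}


def efficiency(speedup, nodes):
    """Scalability efficiency: achieved speedup over ideal speedup."""
    return speedup / nodes


# Times in seconds for 2486 x 1151 x 1231, copied from the tables above.
total_2486 = {12: 1579.95, 16: 1309.32, 20: 1143.47, 24: 1107.66}
kernel_2486 = {12: 853.37, 16: 646.22, 20: 517.03, 24: 427.40}

total_speedup = scaled_speedup(total_2486, baseline_nodes=12)
kernel_speedup = scaled_speedup(kernel_2486, baseline_nodes=12)

for nodes in sorted(total_2486):
    print(
        nodes,
        round(kernel_speedup[nodes], 1),
        round(total_speedup[nodes], 1),
        f"{efficiency(total_speedup[nodes], nodes):.0%}",
    )
```

Under these assumptions the 24-node row gives a kernel speedup of about 24.0 and a total application speedup of about 17.1, which corresponds to the 100% and 71% efficiencies quoted for this grid size.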
Results and Configuration Summary
Hardware Configuration:
- Sun Blade 6048 Modular System with
  - 12 x Sun Blade X6275 Server Modules, each with
    - 4 x 2.93 GHz Intel Xeon QC X5570 processors
    - 12 x 4 GB memory at 1333 MHz
    - 2 x 24 GB Internal Flash
- GBit NFS file system
Software Configuration:
- OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
- MPI: Scali MPI Connect 5.6.6-59413
- Compiler: Sun Studio 12 C++, Fortran, OpenMP
Benchmark Description
The Reverse Time Migration (RTM) is currently the most popular seismic processing algorithm because of its ability to produce quality images of complex substructures. It can accurately image steep dips that cannot be imaged correctly with traditional Kirchhoff 3D or frequency domain algorithms. The Wave Equation Migration (WEM) can image steep dips but does not produce the image quality that can be achieved by the RTM. However, the increased computational complexity of the RTM over the WEM introduces new areas for performance optimization. The current trend in seismic processing is to perform iterative migrations on wide azimuth marine data surveys using the Reverse Time Migration.
This Reverse Time Migration code reads in processing parameters that define the grid dimensions, number of threads, number of processors, imaging condition, and various other parameters. The master node calculates the memory requirements to determine if there is sufficient memory to process the migration "in-core". The domain decomposition across all the nodes is determined by dividing the first grid dimension by the number of nodes. Each node then reads in its section of the Velocity Slices, Delta Slices, and Epsilon Slices using MPI IO reads. The three source and receiver wavefield state vectors are created: previous, current, and next state. The processing steps through the input trace data, reading both the receiver and source data for each of the 800 time steps. It uses forward propagation for the source wavefield and backward propagation in time to cross-correlate the receiver wavefield. The computational kernel consists of a 13-point stencil to process a subgrid within the memory of each node using OpenMP parallelism. Afterwards, conditioning and absorption are applied and boundary data is communicated to neighboring nodes as each time step is processed. The final image is written out using MPI IO.
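The decomposition and kernel described above can be illustrated with a short sketch. It is not the benchmark code: it assumes the remainder of the division is spread over the lowest ranks, reads the 13-point stencil as the usual fourth-order star stencil (the centre plus two neighbours each way along each axis), and uses a simplified isotropic update that ignores the Epsilon and Delta terms. All function names are hypothetical.

```python
def decompose(nx, num_nodes):
    """Split the first grid dimension into contiguous slabs, one per node.

    Returns a list of (start, stop) index pairs. Spreading the remainder
    over the lowest ranks is an assumption made for this sketch.
    """
    base, extra = divmod(nx, num_nodes)
    slabs, start = [], 0
    for rank in range(num_nodes):
        size = base + (1 if rank < extra else 0)
        slabs.append((start, start + size))
        start += size
    return slabs


# Standard fourth-order central-difference weights for a second derivative.
WEIGHTS = {0: -5.0 / 2.0, 1: 4.0 / 3.0, 2: -1.0 / 12.0}


def laplacian_13pt(u, i, j, k, h=1.0):
    """13-point star stencil: centre plus 2 neighbours each way per axis."""
    total = 3.0 * WEIGHTS[0] * u[i][j][k]
    for d in (1, 2):
        total += WEIGHTS[d] * (
            u[i - d][j][k] + u[i + d][j][k]
            + u[i][j - d][k] + u[i][j + d][k]
            + u[i][j][k - d] + u[i][j][k + d]
        )
    return total / (h * h)


def next_state(prev, cur, lap, velocity, dt):
    """Leapfrog update in time; it needs the previous and current states
    to produce the next one, hence three state vectors per wavefield."""
    return 2.0 * cur - prev + (velocity * dt) ** 2 * lap


# Slabs for the largest grid on 24 nodes.
slabs = decompose(2486, 24)

# Sanity check on u = x^2 + y^2 + z^2, whose Laplacian is 6 everywhere.
n = 5
u = [[[float(x * x + y * y + z * z) for z in range(n)]
      for y in range(n)] for x in range(n)]
lap_centre = laplacian_13pt(u, 2, 2, 2)
```

In this sketch each node would own one slab of the first dimension and would need two planes of boundary data from each neighbour per time step, since the stencil reaches two points in each direction.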
Total memory requirements for each grid size:
- 125x1151x1231: 7.5GB
- 1243x1151x1231: 78GB
- 2486x1151x1231: 156GB
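As a rough illustration of the "in-core" check, the totals above can be divided by the node count and compared with the memory available on a node. This is a sketch under stated assumptions: 24 GB per node is derived from the hardware configuration (12 x 4 GB per server module, two nodes per module), the memory is assumed to divide evenly across nodes, and the `headroom` fraction reserved for the OS and buffers is a hypothetical value.

```python
# Assumption: 12 x 4 GB per server module, shared by its two nodes.
NODE_MEMORY_GB = 24.0

# Total memory requirements quoted above, in GB.
TOTAL_MEMORY_GB = {
    "125x1151x1231": 7.5,
    "1243x1151x1231": 78.0,
    "2486x1151x1231": 156.0,
}


def per_node_gb(grid, nodes):
    """Memory each node must hold if the grid is divided evenly."""
    return TOTAL_MEMORY_GB[grid] / nodes


def fits_in_core(grid, nodes, headroom=0.75):
    """True if the per-node share fits in the usable part of node memory.

    `headroom` is a hypothetical fraction of node memory assumed to be
    available to the application.
    """
    return per_node_gb(grid, nodes) <= headroom * NODE_MEMORY_GB


for grid, nodes in [("2486x1151x1231", 12), ("2486x1151x1231", 8),
                    ("1243x1151x1231", 8), ("1243x1151x1231", 4)]:
    print(grid, nodes, per_node_gb(grid, nodes), fits_in_core(grid, nodes))
```

With these assumed values the largest grid needs 13 GB per node on 12 nodes and 19.5 GB per node on 8 nodes, so only the former passes the check.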
For this phase of benchmarking, the focus was on optimizing the data initialization. In this benchmark the trace data reading skews the Total Application Performance as the number of nodes increases, because every node reads the full trace data. In the next phase of benchmarking, the trace data reading will be optimized so that each node reads in only its section of interest, and further per-node optimization with OpenMP will be applied. The I/O description for this benchmark phase on each grid size:
- 125x1151x1231:
  - Initialization MPI Read: 3 x 709MB = 2.1GB / number of nodes
  - Trace Data Read per Node: 2 x 800 x 576KB = 920MB * number of nodes
  - Final Output Image MPI Write: 709MB / number of nodes
- 1243x1151x1231:
  - Initialization MPI Read: 3 x 7.1GB = 21.3GB / number of nodes
  - Trace Data Read per Node: 2 x 800 x 5.7MB = 9.2GB * number of nodes
  - Final Output Image MPI Write: 7.1GB / number of nodes
- 2486x1151x1231:
  - Initialization MPI Read: 3 x 14.2GB = 42.6GB / number of nodes
  - Trace Data Read per Node: 2 x 800 x 11.4MB = 18.4GB * number of nodes
  - Final Output Image MPI Write: 14.2GB / number of nodes
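The I/O volumes listed above can be approximated from the grid dimensions alone. The sketch below assumes 4-byte samples and assumes that one trace record per time step spans the first two grid dimensions. Neither is stated in the text; both are inferred because they land within about 1% of the quoted sizes. The function name and dictionary keys are illustrative.

```python
BYTES_PER_SAMPLE = 4   # assumption: single-precision samples
TIME_STEPS = 800       # samples imaged in this benchmark


def io_volumes(nx, ny, nz, nodes):
    """Approximate I/O volumes in bytes for one grid size and node count."""
    volume = nx * ny * nz * BYTES_PER_SAMPLE   # one full grid-sized slice set
    trace_plane = nx * ny * BYTES_PER_SAMPLE   # assumed trace record per step
    trace_per_node = 2 * TIME_STEPS * trace_plane  # source + receiver
    return {
        "volume": volume,
        "init_read_per_node": 3 * volume / nodes,  # Velocity, Epsilon, Delta
        "trace_read_per_node": trace_per_node,     # every node reads it all
        "trace_read_total": trace_per_node * nodes,
        "image_write_per_node": volume / nodes,
    }


small = io_volumes(125, 1151, 1231, nodes=1)
large = io_volumes(2486, 1151, 1231, nodes=24)

print(f"small volume: {small['volume'] / 1e6:.0f} MB")
print(f"small trace read per node: {small['trace_read_per_node'] / 1e6:.0f} MB")
print(f"large trace read, 24 nodes: {large['trace_read_total'] / 1e9:.0f} GB")
```

The initialization read and image write shrink per node as nodes are added, while the aggregate trace read grows with the node count, which is the effect that skews the Total Application Performance at higher node counts.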
Key Points and Best Practices
- Additional evaluations were performed to compare GBit NFS, InfiniBand NFS, and InfiniBand Lustre for the Reverse Time Migration Initialization. InfiniBand NFS was 6x faster than GBit NFS and InfiniBand Lustre was approximately 3x faster than InfiniBand NFS using the same disk configurations. On 12 nodes for grid size 2486x1151x1231 the initialization time was 22.50 seconds for IB Lustre, 61.03 seconds for IB NFS, and 364.25 seconds for GBit NFS.
- The Reverse Time Migration computational performance scales nicely as a function of the grid size being processed. This is consistent with the IBM published results for this application.
- The Total Application performance results are not typically reported in benchmark studies for this application. The IBM report specifically states that the execution times do not include I/O times and non-recurring allocation or initialization delays. Examining the total application performance reveals that the workload is no longer dominated by the partial differential equation (PDE) solver, as IBM suggests, but is constrained by the I/O for grid initialization, reading in the traces, saving/restoring wave state data, and writing out the final image. Aggressive optimization of the PDE solver has little effect on the overall throughput of this application. It is clearly more important to optimize the I/O. The trend in seismic processing, as stated at the 2008 Society of Exploration Geophysicists (SEG) conference, is to run the reverse time migration iteratively on wide azimuth data. Thus, optimizing the I/O and application throughput is imperative to meet this trend. SSD and Flash technologies in conjunction with Sun's Lustre file system can reduce this I/O bottleneck and pave the path for the future in seismic processing.
- Minimal tuning effort was applied to achieve the results presented. Sun's HPC software stack, which includes the Sun Studio compiler, was used to build the 70,000 lines of C++ and Fortran source into the application executable. The only compiler option used was "-fast". No assembly-level optimizations, like those performed by IBM to use SIMD registers (SSE registers), were performed in this benchmark. Similarly, no explicit cache blocking, loop unrolling, or memory bandwidth optimizations were conducted. The idea was to demonstrate the performance that a customer can expect from their existing applications without extensive, platform-specific optimizations.
