Wednesday Jan 23, 2008
Two benchmark test suites (standard and new) for the Fluent computational
fluid dynamics (CFD) code were run on a mini cluster of Sun Blade X6250
blades with the recently announced 3.33 GHz dual-core Intel 5260. The
Sun Blade X6250 mini cluster beats all posted results for both FLUENT
benchmark test suites at each core level, from one up to the maximum of
sixteen cores that were available on the X6250 cluster.
- In runs of the standard test suite the X6250
cluster was from 15% (lower core levels) to 42% (highest core level)
faster than the previous top set of results posted from a Harpertown
quad-core 3.0 GHz Xeon.
- The scaling efficiency of the X6250 cluster went from 100% at 1 core to
84% on average at 8 cores when
running the standard large test cases.
- In runs of the new larger and more representative test suite, the X6250
cluster was from 15% (lower core levels) to 65% (highest core level)
faster than the top results posted from a Harpertown
quad-core 3.0 GHz Xeon.
- The scaling efficiency of the X6250 cluster went from 100% at 1 core to
84% on average at 8 cores when
running all six of the test cases in the new test suite.
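One common way to compute a scaling efficiency like the one quoted above is measured speedup divided by core count. The sketch below assumes that definition (the post does not spell out its formula) and uses the Sun Blade X6250 ratings for the three large standard cases at 1 and 8 cores, taken from this entry's results table. The function and variable names are illustrative only.

```python
def scaling_efficiency(rating_1_core, rating_n_cores, n_cores):
    """Parallel efficiency: measured speedup divided by the ideal speedup."""
    speedup = rating_n_cores / rating_1_core
    return speedup / n_cores


# Ratings (runs per day) from the Sun X6250 rows of the standard-suite table.
ratings_1_core = {"FL5L1": 259.4, "FL5L2": 178.5, "FL5L3": 32.4}
ratings_8_core = {"FL5L1": 1811.3, "FL5L2": 1227.3, "FL5L3": 207.2}

efficiencies = {
    case: scaling_efficiency(ratings_1_core[case], ratings_8_core[case], 8)
    for case in ratings_1_core
}
mean_efficiency = sum(efficiencies.values()) / len(efficiencies)

for case, eff in efficiencies.items():
    print(f"{case}: {eff:.1%}")
print(f"mean: {mean_efficiency:.1%}")
```

Under this assumed definition the three large cases average roughly 84% at 8 cores.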
Four 2-socket Sun Blade X6250 blades with Infiniband interconnects were used.
Comparisons are presented against the results posted
at the FLUENT Performance website.
Two benchmark test suites were considered. The first is the
long-standing standard "FLUENT 6" benchmark test suite, referenced
above as the standard test suite, which consists of 9 test cases:
3 small, 3 medium, and 3 large.
The second is the "FLUENT 6.3" benchmark test suite
(referenced as the new test suite), which consists of larger models
(in both memory requirements and mesh/model size) that are better
suited to multi-core, multi-node distributed-memory parallel (DMP)
cluster environments and more representative of current engineering CFD work.
Fluent is one of the most prominent commercial CFD (Computational
Fluid Dynamics) codes. It is distributed worldwide to major
engineering organizations in a broad spectrum of disciplines
(aircraft, aerospace, automotive, marine, etc.) that are involved
with fluid flow.
CFD models tend to be extremely large (fluid flow over entire car,
aircraft and submarine bodies, and complex flow involving mixing of
species and chemical reaction). To keep run times for these analyses
reasonable, many processing units are necessary. Currently
the most effective way of achieving this is via an interconnected
cluster of multi-core rack-mounted servers or blades.
Standard FLUENT 6 Benchmark Test Suite, large workload results ("ratings" bigger is better)
See www.fluent.com/software/fluent/fl6bench/fl6bench_6.3/ for the
full table of results.
Rating = number of sequential runs of the test case possible in 1 day
= 86,400/(Total Elapsed Run Time in Seconds)
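The rating definition is simple enough to express directly. The following is a minimal sketch of the formula as stated in the post; the function names are illustrative, and the inverse helper is an addition for convenience when reading the tables.

```python
SECONDS_PER_DAY = 86_400


def rating(elapsed_seconds):
    """Number of sequential runs of a test case that fit in one day."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return SECONDS_PER_DAY / elapsed_seconds


def elapsed_from_rating(runs_per_day):
    """Invert the rating to recover the elapsed run time in seconds."""
    if runs_per_day <= 0:
        raise ValueError("rating must be positive")
    return SECONDS_PER_DAY / runs_per_day


# A run that takes 864 seconds could be repeated 100 times in a day.
print(rating(864))
# A rating of 259.4 corresponds to a run time of a bit over 333 seconds.
print(round(elapsed_from_rating(259.4), 1))
```

Because the rating is inversely proportional to elapsed time, a ratio of two ratings is the same as the inverse ratio of the two run times.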
| System | NCPUS | FL5L1 | FL5L2 | FL5L3 |
| Sun X6250 3.33GHz DC 5260 | 1 | 259.4 | 178.5 | 32.4 |
| Intel 3.0GHz QC Harpertown | 1 | 220.5 | 151.2 | 27.9 |
| SGI Altix XE210 3.0GHz Xeon 5160 | 1 | 210.7 | 153.5 | 29.6 |
| IBM X3550 3GHz DC 5160 | 1 | 188.0 | 134.7 | n/a |
| |
| Sun X6250 3.33GHz DC 5260 | 2 | 493.4 | 351.8 | 61.9 |
| Intel 3.0GHz QC Harpertown | 2 | 420.7 | 297.3 | 54.2 |
| SGI Altix XE210 3.0GHz Xeon 5160 | 2 | 396.1 | 298.0 | 56.2 |
| IBM X3550 3GHz DC 5160 | 2 | 342.6 | 236.8 | 55.0 |
| |
| Sun X6250 3.33GHz DC 5260 | 4 | 931.8 | 675.7 | 122.0 |
| SGI Altix XE210 3.0GHz Xeon 5160 | 4 | 679.2 | 449.7 | 80.7 |
| IBM X3550 3GHz DC 5160 | 4 | 623.5 | 411.4 | 94.9 |
| |
| Sun X6250 3.33GHz DC 5260 | 8 | 1811.3 | 1227.3 | 207.2 |
| Intel 3.0GHz QC Harpertown | 8 | 1279.1 | 710.5 | 120.0 |
| SGI Altix XE210 3.0GHz Xeon 5160 | 8 | 1343.7 | 899.5 | 161.0 |
| IBM X3550 3GHz DC 5160 | 8 | 1273.4 | 862.3 | 149.9 |
| |
| Sun X6250 3.33GHz DC 5260 | 16 | 2941.3 | 1577.4 | 246.0 |
| SGI Altix XE210 3.0GHz Xeon 5160 | 16 | 2584.9 | 1788.8 | 319.0 |
| IBM X3550 3GHz DC 5160 | 16 | 2479.2 | 1722.0 | 306.8 |
New FLUENT 6.3 Benchmark Test Suite, "Ratings" (bigger is better)
Rating = number of sequential runs of the test case possible in 1 day
= 86,400/(Total Elapsed Run Time in Seconds)
| System | NCPUS | eddy | turbo | aircraft | sedan | truck14m | truckpoly |
| Sun X6250 3.33GHz DC 5260 | 1 | 109.2 | 440.4 | 96.6 | 65.1 | 7.0 | 8.3 |
| Intel 3.0GHz QC Harpertown | 1 | 95.9 | n/a | 84.2 | 55.9 | 6.2 | 6.9 |
| |
| Sun X6250 3.33GHz DC 5260 | 2 | 208.9 | 823.1 | 178.8 | 121.3 | 14.6 | 16.1 |
| Intel 3.0GHz QC Harpertown | 2 | 183.1 | 741.3 | 162.9 | 109.6 | 12.4 | 13.4 |
| |
| Sun X6250 3.33GHz DC 5260 | 4 | 415.6 | 1590.4 | 353.8 | 246.2 | 29.9 | 31.9 |
| |
| Sun X6250 3.33GHz DC 5260 | 8 | 780.8 | 2805.2 | 577.1 | 384.4 | 55.0 | 57.3 |
| Intel 3.0GHz QC Harpertown | 8 | 491.4 | 1685.0 | 321.0 | 207.2 | 32.1 | 33.2 |
| |
| Sun X6250 3.33GHz DC 5260 | 16 | 1095.8 | 3744.3 | 682.9 | 429.9 | 73.7 | 74.6 |
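A "percent faster" figure can be derived from any pair of ratings in the table above. The sketch below does this per test case for the 1-core rows and skips the case with no posted Harpertown result. How the post aggregates across cases and core levels is not stated, so this is only one plausible reading; the helper name is illustrative.

```python
def percent_faster(rating_a, rating_b):
    """Percentage by which system A's rating exceeds system B's."""
    return (rating_a / rating_b - 1.0) * 100.0


CASES = ["eddy", "turbo", "aircraft", "sedan", "truck14m", "truckpoly"]
sun_1_core = [109.2, 440.4, 96.6, 65.1, 7.0, 8.3]
# None marks the "n/a" entry in the posted Harpertown 1-core row.
harpertown_1_core = [95.9, None, 84.2, 55.9, 6.2, 6.9]

advantage = {
    case: percent_faster(sun, hpt)
    for case, sun, hpt in zip(CASES, sun_1_core, harpertown_1_core)
    if hpt is not None
}

for case, pct in advantage.items():
    print(f"{case}: {pct:.1f}% faster")
```

The same helper can be pointed at the 2-core or 8-core rows by swapping in those two lists.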
Key Technical Points
- The "small" and even "medium" test cases in the standard suite are
not very large and are no longer very representative of typical usage.
- Real-world CFD engineering models are typically very large
and are best analyzed with many cores in order to achieve reasonable
turnaround on run times. Scalability when running these large models with
Fluent is very good, often linear or perfect up to 64 and even 128
cores.
- Performance when running Fluent in a multi-node configuration
is significantly enhanced by high-performance interconnects
such as Infiniband.
- Fluent supports a variety of interconnects from various
hardware vendors (e.g. Voltaire, Cisco/Topspin, QLogic [formerly
Silverstorm], Myrinet), MPIs (HP-MPI, MVAPICH(2), LAM, plus private
vendor versions), and communication protocols (e.g. ssh and rsh).
- There is still no officially certified Solaris build of Fluent
for X86-64 platform architectures. However,
a prototype build compiled a while ago with Sun Studio 11 compilers
outperformed, at that time, all other platforms under other operating systems
(64-bit Linux). These competitive results are currently posted at
the Fluent website from several hardware vendors, including current
competitive AMD and Intel based platforms running under 64-bit Linux
operating systems.
- Very recently, Fluent has developed a new benchmark test suite
with larger models specifically intended to be run either
on large multi-core servers or on large multi-node clusters of
multi-core platforms.
Benchmark Description
The Original Standard "Fluent 6" Benchmark Test Suite
Nine industrial CFD applications ranging in size from 32,000 to
10,000,000 cells have been selected to demonstrate the performance of
FLUENT on a variety of hardware platforms. The performance of a CFD
code will depend on several factors including size and topology of the
mesh, physical models, numerics and parallelization, compilers and
optimization, in addition to performance characteristics of the
hardware where the simulation is performed. The problems selected
represent a range of simulations typical of those which might be found
in industry. The principal objective of this benchmark suite is to
provide comprehensive and fair comparative information of the
performance of FLUENT on available hardware platforms.
Disclosure Statement:
All information on the Fluent website is Copyrighted 1995-2008 by
Fluent Inc. Results from
www.fluent.com as of January 7, 2008.
System Configuration
4 Sun Blade X6250's
3.33 GHz dual core Intel 5260
2 internal striped 15K SAS drives (cluster shared file system)
Infiniband (Voltaire) interconnects
SuSE Linux Enterprise Server SLES 10
Voltaire OFED gridstack
HP-MPI
Fluent V6.3.26
Fluent 6 Standard Benchmark Test Suite
Fluent 6.3 "New" Benchmark Test Suite
See Also:
New Fluent benchmark results posted
at:
http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.3
Standard Fluent benchmark results posted
at:
http://www.fluent.com/software/fluent/fl5bench/flbench_6.3/fullres.htm
Thursday Jul 12, 2007
The Sun Blade X6250 cluster was up to 27% faster, or 6% faster on geometric mean, than an SGI Altix XE 210 cluster (Xeon 3 GHz dual-core 5160 Woodcrest) with Infiniband interconnects.
A cluster of four Sun Blade X6250 blades (Xeon 3 GHz 5160) with Infiniband
interconnects was used to set this record. Each of these two-socket blades had dual-core Intel Xeon EM64T
5160 3 GHz (Woodcrest) processors, for 16 total cores.
The Sun Blade X6250 cluster (Xeon 3 GHz 5160), running the "Fluent 6" standard
benchmark for the Fluent computational fluid dynamics (CFD) program, established
a world record for runs of the test suite using from 1 to 16 cores.
Workload description
Fluent is one of the most prominent commercial CFD (Computational Fluid Dynamics) codes.
It is distributed worldwide to major engineering organizations in a broad spectrum of disciplines
(aircraft, aerospace, automotive, marine, etc.) that are involved with fluid flow in some manner.
Fluent, like many major ISVs, has developed a benchmark test suite to evaluate the performance
of platforms. For several years, results from hardware vendor platforms have been posted
at the Fluent website.
CFD models tend to be extremely large (fluid flow over entire car, aircraft and submarine bodies,
and complex flow involving mixing of species and chemical reaction).
To keep run times for these analyses reasonable, many processing units are
necessary. Currently the most effective way of achieving this is via an interconnected cluster
of multi-core rack-mounted servers or blades. The current set of entries posted at the Fluent
website reflects this fact.
FLUENT 6 Benchmark ("Ratings", bigger is better)
Rating = # of sequential runs in 1 day = 86,400/(Total Elapsed Run Time in Seconds)
| Machine | Sockets | NCPUS | FL5M1 | FL5M2 | FL5M3 | FL5L1 | FL5L2 | FL5L3 |
| Sun Blade X6250 3GHz WC 5160 | 2 | 8 | 4965.5 | 10504.6 | 2563.8 | 1399.2 | 1028.3 | 174.9 |
| SGI Altix XE210 3GHz WC 5160 | 2 | 8 | 4937.1 | 9626.7 | 2014.0 | 1343.7 | 899.5 | 161.0 |
| |
| Sun Blade X6250 3GHz WC 5160 | 2 | 4 | 2780.4 | 5358.1 | 1336.9 | 731.7 | 573.7 | 101.2 |
| SGI Altix XE210 3GHz WC 5160 | 2 | 4 | 2681.1 | 4657.7 | 998.0 | 679.2 | 449.7 | 80.7 |
| |
| Sun Blade X6250 3GHz WC 5160 | 2 | serial | 919.4 | 1465.6 | 352.9 | 207.2 | 142.6 | 27.6 |
| SGI Altix XE210 3GHz 5160 | 2 | serial | 910.9 | 1445.4 | 349.5 | 204.1 | 136.6 | 26.8 |
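The headline for this entry quotes both a best-case and a geometric-mean advantage. As a rough illustration of how such figures are formed, the sketch below computes both from the 8-core rows of the table above only. The post does not say which cases and core levels went into its geometric mean, so the numbers printed here cover a subset and need not match the headline; all names are illustrative.

```python
import math

CASES = ["FL5M1", "FL5M2", "FL5M3", "FL5L1", "FL5L2", "FL5L3"]
sun_8_core = [4965.5, 10504.6, 2563.8, 1399.2, 1028.3, 174.9]
sgi_8_core = [4937.1, 9626.7, 2014.0, 1343.7, 899.5, 161.0]


def geometric_mean(values):
    """Geometric mean via the mean of logarithms (values must be positive)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Per-case rating ratio, Sun over SGI, at 8 cores.
ratios = [sun / sgi for sun, sgi in zip(sun_8_core, sgi_8_core)]
best_case_ratio = max(ratios)
geomean_ratio = geometric_mean(ratios)

print(f"best case: {(best_case_ratio - 1) * 100:.1f}% faster")
print(f"geometric mean: {(geomean_ratio - 1) * 100:.1f}% faster")
```

A geometric mean is the usual choice for averaging ratios because it treats a 2x gain and a 2x loss symmetrically.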
Other interesting points:
- The "Fluent 6" standard benchmark test suite consists of "small", "medium", and
"large" test cases. However, both the small and medium sized test cases are
really on the small side and do not scale well beyond 16 cores.
- The largest test case in the suite, "fl5l3", requires 9 GB when running with only
one core on a single node. This per-node memory requirement is reduced when running in
distributed-memory parallel (DMP) cluster mode across multiple multi-core nodes.
- Fluent runs are CPU and sometimes memory intensive but do not require high-performance I/O file systems.
- Very recently, Fluent has developed a new benchmark test suite with extremely large
models specifically intended to be run either on large multi-core servers or
on large multi-node clusters of multi-core platforms.
Workload Details
Nine industrial CFD applications ranging in size from 32,000 to 10,000,000 cells have been selected to demonstrate the performance of FLUENT on a variety of hardware platforms. The performance of a CFD code will depend on several factors including size and topology of the mesh, physical models, etc. The test cases represent a range of typical industry simulations.
Descriptions
| Class | Benchmark | Cells | Mesh | Models | Solver | Description |
| small | FL5S1 | 32,000 | hexahedral | ke | segregated implicit | turbulent flow in a bend |
| small | FL5S2 | 32,000 | hexahedral | ke | coupled implicit | turbulent flow in a bend |
| small | FL5S3 | 89,856 | hexahedral | ke | coupled implicit | flow in a compressor, rotor 37 |
| medium | FL5M1 | 155,188 | tetrahedral | ke 6spe reac DPM P1 | segregated implicit | coal combustion in a boiler, with particle tracking |
| medium | FL5M2 | 242,782 | hybrid, hanging-node | ke | segregated implicit | turbulent flow in an engine valveport |
| medium | FL5M3 | 352,800 | hexahedral | ke 6spe react | segregated implicit | combustion in a high velocity burner |
| large | FL5L1 | 847,746 | hexahedral | ke | coupled explicit | transonic flow around a fighter |
| large | FL5L2 | 3,618,080 | hybrid | RNG ke | segregated implicit | external aerodynamics around a car body |
| large | FL5L3 | 9,792,512 | hexahedral | RSM | segregated implicit | turbulent flow in a transition duct |
Small Class Ratings
Small class problems contain less than 100,000 cells.
FL5S1 - Accelerating turbulent flow in an elbow duct using segregated implicit solver
Accelerating Turbulent Flow in an Elbow Duct using Segregated Implicit Solver
Flow is accelerated through a 90 degree elbow duct with a rectangular
cross section. The geometry and flow have a symmetry plane permitting
the modeling of only half the domain. Because of the curvature of the
duct, significant secondary flow occurs, with velocity components
normal to the principal flow direction. The segregated implicit solver
in FLUENT 5 is used to solve this flow.
Number of cells 32,000
Cell type hexahedral
Models k-epsilon turbulence
Solver segregated implicit
FL5S2 - Accelerating turbulent flow in an elbow duct using coupled implicit solver
Accelerating Turbulent Flow in an Elbow Duct using Coupled Implicit Solver
Flow is accelerated through a 90 degree elbow duct with a rectangular
cross section. The geometry and flow have a symmetry plane permitting
the modeling of only half the domain. Because of the curvature of the
duct, significant secondary flow occurs, with velocity components
normal to the principal flow direction. The coupled implicit solver in
FLUENT 5 is used to solve this flow.
Number of cells 32,000
Cell type hexahedral
Models k-epsilon turbulence
Solver coupled implicit
FL5S3 - Transonic flow in rotating fan
Transonic Flow through a Rotor
The flow through a transonic fan rotor (designated rotor 37 by NASA
Lewis) was computed. It has 36 blades. The calculation was performed at
a rotational speed of 17189 rpm. The domain boundaries consist of a
hub, blade and shroud surface, a pressure inlet and outlet surface, and
periodic surfaces.
Number of cells 89,856
Cell type hexahedral
Models k-epsilon turbulence
Solver coupled implicit
Medium Class
Medium class problems contain between 100,000 and 500,000 cells.
FL5M1 - Coal combustion in a boiler
Coal Combustion in a Boiler
This application couples a continuous gas phase calculation with a
discrete phase (particle) calculation. 500 coal particles are injected
into an industrial boiler where their trajectories are computed using a
Lagrangian formulation that includes dispersed phase inertia,
hydrodynamic drag and the force of gravity. Each particle injection is
subject to heating/cooling, vaporization, boiling and solid combustion.
During the injection calculations, momentum, heat and mass exchanges
are calculated and stored as source terms which are then used in the
subsequent gas phase calculation. Furthermore, stochastic modeling of
particle tracks, requiring a fixed number of "tries" per particle, is
used to account for local turbulent fluctuations. In this calculation,
10 stochastic tries per particle are used, resulting in a total of 5000
particle tracks per discrete phase update. There are 10 continuous
phase iterations per discrete phase update.
Number of cells 155,188
Cell type tetrahedral
Models k-epsilon turbulence, 6 species with reaction, dispersed phase, P1 radiation
Solver segregated implicit
FL5M2 - Turbulent flow in an engine valveport
Turbulent Flow in an Engine Valveport
Flow is computed in an automotive valve port modeled using a zonal
hybrid mesh. The region around the valve has been meshed with
tetrahedral cells, while the duct providing the inlet flow to the valve
has been meshed with hexahedra. Pyramid cells are used to transition
between the hexahedral and tetrahedral cells. A fourth cell type called
a prismatic (or wedge) cell is used for the cylinder downstream of the
valve. Furthermore, hanging-node adaption was used to improve the
accuracy of the predicted flow field.
Number of cells 242,782
Cell type hybrid hanging-node adaption
Models k-epsilon turbulence
Solver segregated implicit
FL5M3 - Combustion in a high velocity burner
Combustion in a High Velocity Burner
Fuel (CH4) is injected into ports of a high velocity gas burner located
near the centerline. Air is supplied through the outer ports, with
secondary air delivered into an outer annular region. Directly
downstream of the annulus is a wedge-shaped annular baffle. The mixing
of fuel and air occurs downstream of this baffle and recirculation
zones behind the baffle provide stability and an attachment point for
the flame in the main combustion chamber. Combustion is assumed to
proceed via a two-step reaction mechanism, with turbulent mixing as the
limiting rate, as described by the Magnussen model.
Reference: M. Cavelli, A. Milani, "Spark-ignited wide stability gas
burner for on/off and continuous duty," IFRF HT Meeting, Milan, October
1996.
Number of cells 352,800
Cell type hexahedral
Models k-epsilon turbulence, 6 species with reaction
Solver segregated implicit
Large Class
Large class problems contain more than 500,000 cells.
FL5L1 Transonic flow around a fighter aircraft
Transonic Flow Around a Fighter Aircraft
Flow around the AGARD M-151 combat aircraft research model is computed.
The simulation geometry contains canards and forward swept wings, but
no tail. The conditions modeled were Mach number 0.9 and 10.46 degrees
angle of attack.
Number of cells 847,764
Cell type hexahedral
Models k-epsilon turbulence
Solver segregated implicit
FL5L2 Exterior flow around a passenger sedan
Exterior Flow Around a Passenger Sedan
This benchmark represents the computation of the exterior flow field
around a simplified model of a passenger sedan. The simulation geometry
was used for the Japan External Aerodynamics competition. A
viscous-hybrid grid with prismatic cells is used to adequately model
the boundary layer regions.
Number of cells 3,618,080
Cell type hybrid
Models k-epsilon turbulence
Solver segregated implicit
FL5L3 Turbulent flow through a transition duct
Turbulent Flow Through a Transition Duct
Turbulent flow of air through a duct is computed for this benchmark.
The cross-sectional planes of the duct transition from a circle at the
inlet to a rectangle at the outflow boundary. The Reynolds-Stress Model
(7 equation) is used for computing turbulence.
Number of cells 9,792,512
Cell type hexahedral
Models RSM turbulence
Solver segregated implicit
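The small, medium, and large classes described above are defined purely by cell count, which can be sketched as a simple lookup. The thresholds come from the class descriptions in this entry; how a mesh with exactly 100,000 or 500,000 cells is classified is not stated, so the boundary handling below is an assumption, and the function name is illustrative.

```python
def size_class(cells):
    """Classify a benchmark case by its mesh cell count."""
    if cells < 0:
        raise ValueError("cell count cannot be negative")
    if cells < 100_000:
        return "small"
    # Assumption: both boundary values (100,000 and 500,000) count as medium.
    if cells <= 500_000:
        return "medium"
    return "large"


# Cell counts taken from the benchmark descriptions above.
examples = {"FL5S1": 32_000, "FL5M2": 242_782, "FL5L3": 9_792_512}
for name, cells in examples.items():
    print(f"{name}: {cells:,} cells -> {size_class(cells)}")
```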
The cluster of Sun Blade X6250 blades outperformed the following competitive
hardware vendor clusters at all core levels considered
(1-core SMP, and 1-, 2-, 4-, 8-, and 16-core parallel runs)
and for all 9 test cases in the benchmark test suite:
HP BL460C (EM64T_WOODCREST_2CORE,3000,WINCCS,IB_HPMPI)
HP DL140 (EM64T_WOODCREST_2CORE,3000,LINUX,IB)
HP DL145_G2 (OPTERON_2CORE,2200,WINCCS,IB_HPMPI)
SGI ALTIX4700 (IA64_MONTECITO_2CORE,1600,LINUX)
SGI ALTIXXE210 (EM64T_WOODCREST_2CORE,3000,LINUX,IB_VOLTAIRE)
TYAN TYPHOON_630 (EM64T_WOODCREST_2CORE,2300,SLES10,GIGE)
TYAN TYPHOON_630 (EM64T_WOODCREST_2CORE,2300,WINCCS,GIGE)
BULL NOVASCALE (EM64T_WOODCREST_2CORE,3000,RHEL4,IB)
APPRO XTREMESERVER (OPTERON_2CORE,2800,RHEL4,IB)
Disclosure Statement:
All information on the Fluent website is Copyrighted 1995-2007 by Fluent Inc. Results from http://www.fluent.com/software/fluent/fl5bench/flbench_6.3/fullres.htm as of July 2, 2007.
Sun Blade X6250
4 2-socket Sun Blade X6250's
2x3.0 GHz DC Intel Xeon EM64T 5160 (Woodcrest)
Infiniband (Voltaire) Interconnects (PCI-Express HCA's)
Software Configuration:
64-bit SUSE SLES 10
Fluent V6.3.26
Fluent 6 Standard Benchmark Test Suite
Voltaire GridStack 4.1.5-7 for SLES 10
Friday Mar 30, 2007
The Sun Fire X2200 M2 server beats Woodcrest on
large CFD models. The X2200 M2 Cluster beats all currently posted
Opteron cluster results (dual core HP XC4000 2.2GHz, HP DL145 G2 2.2GHz,
HP XW9300 2.4GHz, and HP DL585 2.6GHz) for all "cpu" levels and for all
test cases. All clusters had the high performance Infiniband interconnects.
The X2200 M2 beats the IBM X3650 2.66GHz quad core Clovertown across the board at
all cpu levels and for all test cases.
Tests were run on the official version of Fluent (lnxamd64 V6.3.26 build).
The Sun Opteron server numbers were generated under 64-bit SUSE SLES 9 SP 3.
Sun has many customers who use Solaris, Linux, and Windows, so we show
benchmarks on all of these.
The X2200 M2 cluster has the best performance on the largest
and most complex test, "FL5L3", which is the most closely representative of
actual customer benchmarks (it requires over 9GB of memory and is best run using
several CPUs). FL5L3 simulates turbulent flow through a transition duct.
Note that the X2200 M2 cluster results shown in following table are consistently
better than those obtained on the two Woodcrest cluster systems at the same
"cpu" levels and for all indicated "cpu" levels (4 to 32).
The efficiency of the Sun X2200 M2 cluster is superb at well above 90% up to 32 cores. This essentially perfect scalability is contrasted with the Woodcrest
clusters where scalability has dropped off and efficiency is below 70% at
and above 4 cores.
FL5L3 Scaling Performance : Results in "Ratings" (# runs/day, bigger is better)
| System | 4 Cores | 8 Cores | 16 Cores | 32 Cores |
| Sun X2200 M2 2.8GHz Opteron | 89.9 | 174.4 | 341.5 | 664.4 |
| HP BL460C 3.0GHz Woodcrest | 80.3 | 155.4 | 299.0 | 576.0 |
| HP DL140 3.0GHz Woodcrest | N/A | 160.7 | 320.5 | 620.1 |
| Bull NovaScale 3.0GHz Woodcrest | 78.9 | 157.8 | 313.2 | 619.0 |
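The scaling table starts at 4 cores, so an efficiency figure can only be computed relative to some baseline row. The sketch below assumes the 4-core result as that baseline, which the post does not confirm, and applies it to the Sun X2200 M2 row. The function name is illustrative.

```python
def relative_efficiency(base_cores, base_rating, cores, rating):
    """Speedup relative to a baseline run, divided by the increase in cores."""
    speedup = rating / base_rating
    ideal = cores / base_cores
    return speedup / ideal


# Sun X2200 M2 ratings (runs per day) by core count, from the table above.
sun_x2200 = {4: 89.9, 8: 174.4, 16: 341.5, 32: 664.4}

efficiency = {
    cores: relative_efficiency(4, sun_x2200[4], cores, rating)
    for cores, rating in sun_x2200.items()
}
for cores, eff in efficiency.items():
    print(f"{cores:>2} cores: {eff:.1%} of ideal scaling from 4 cores")
```

With this assumed baseline the Sun row stays above 90% through 32 cores. A serial or 1-core baseline, if available, would give different figures.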
Fluent Performance : Results in "Ratings" (# runs/day, bigger is better)
| System | Interconnect/MPI | cores | FL5L1 | FL5L2 | FL5L3 |
| X2200 2.8GHz DC 2220 SLES 9 SP 3 | IB(V)/HP-MPI | 8 | 1219.5 | 952.1 | 174.4 |
| X2100 3.0GHz SC 156 SLES 9 SP3 | IB(V)/MVAPICH | 8 | 1148.2 | 1063.4 | 184.6 |
| HPDL140 3.0GHz DC WC EM64T Linux | IB/HP-MPI | 8 | 1378.0 | 915.0 | 160.7 |
| Bull Nova 3.0 GHz DC WC EM64T RHEL4 | IB | 8 | 1323.6 | 884.1 | 157.8 |
| HP BL460C 3.0GHz WC EM64T WinCCS | IB(V) | 8 | 1289.6 | 881.6 | 155.4 |
| Intel White 3.0GHz WC EM64T DC RHAS4 | IB(Mellanox) | 8 | --- | 828.0 | 137.8 |
| Tyan Typh. 630 2.3GHz WC SLES 10 | GbE | 8 | 1011.7 | 692.4 | 122.7 |
| Tyan Typh. 630 2.3GHz WC WinCCS | GbE | 8 | 981.8 | 635.3 | --- |
| HPDL140 3.6GHz EM64T WINCCS | IB | 8 | 970.8 | 675.0 | 120.0 |
| HPDL585 2.6GHz DC 152 RHEL4 | IB(V)/HP-MPI | 8 | 966.2 | 723.2 | 119.2 |
| HPXC4000 2.2GHz DC 148 Linux | IB(V)/HP-MPI | 8 | 951.0 | 680.4 | 102.7 |
| HPDL145 G2 Opteron 2.2GHz DC WinCCS | IB(V) | 8 | 847.1 | 654.5 | 119.2 |
| IBMX3650 2.66GHz 4C Clovert. EM64T RHEL4 | ? | 8 | 953.6 | 551.2 | 93.3 |
Benchmark Description
Nine industrial CFD applications ranging in size from 32,000 to
10,000,000 cells have been selected to demonstrate the performance of
FLUENT on a variety of hardware platforms. The performance of a CFD
code will depend on several factors including size and topology of the
mesh, physical models, numerics and parallelization, compilers and
optimization, in addition to performance characteristics of the
hardware where the simulation is performed. The problems selected
represent a range of simulations typical of those which might be found
in industry. The principal objective of this benchmark suite is to
provide comprehensive and fair comparative information of the
performance of FLUENT on available hardware platforms.
System Configuration
Hardware Configuration:
Sun Fire X2200 M2
2-socket 2x2.8 GHz dual core Opteron 2220 processors
4x1GB + 4x2GB (12GB) DDR2 667 MHz dimms
IB(Voltaire)/PCI-Express (interconnect)
Software Configuration:
64-bit SuSE SLES 9 SP 3
Fluent V6.3.26
Voltaire Infiniband Software Stack: 3.5.5_16-S2sles9.k2.6.5_7.244_smp.x86_64
Message Passing Interface: HP-MPI V hpmpi-2.02.05.00-20061003r.x86_64
See Also
Current V6.2(.16) results at:
http://www.fluent.com/software/fluent/fl5bench/flbench_6.3/fullres.htm
Wednesday Nov 15, 2006
As mentioned in the posting earlier today, scalability is an important factor in system performance. Woodcrest's poor scaling may not bode well for
Clovertown. Sure, you can package four threads onto a module, but unless you design for them you'll just have more threads that burn more watts without delivering performance.
Wattage: I'll get detailed wattage results posted soon, but, as we
mentioned, it looks like Opteron performance is about 20% higher than
Woodcrest's. The wattage for both configurations looks the same. Therefore,
expect Sun's Opteron to have about a 20% perf/watt advantage.
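The perf/watt reasoning above is a ratio of ratios. The sketch below makes that explicit under the post's own rough figures (about 20% more performance at about equal wattage); it does not use measured power data, and the function name is illustrative.

```python
def perf_per_watt_advantage(perf_ratio, power_ratio=1.0):
    """Fractional perf/watt advantage of system A over system B.

    perf_ratio  = performance of A divided by performance of B
    power_ratio = power draw of A divided by power draw of B
    """
    if perf_ratio <= 0 or power_ratio <= 0:
        raise ValueError("ratios must be positive")
    return perf_ratio / power_ratio - 1.0


# About 20% more performance at the same wattage gives about 20% better perf/watt.
print(f"{perf_per_watt_advantage(1.20, 1.0):.0%}")
# If the faster system also drew 10% more power, the advantage would shrink.
print(f"{perf_per_watt_advantage(1.20, 1.10):.0%}")
```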
Sun's Fluent results will be posted shortly on the website; it is a busy
week with the Supercomputing conference and lots of busy people, so keep
checking back. A few of the smaller test cases gave Woodcrest a small percentage advantage, but most were significantly faster on Sun's Opteron.
...maybe Woodcrest will have better idle power, but why in the world would
you buy the latest server and leave it idle?
Wednesday Nov 15, 2006
Woodcrest scaling issues? Yes. Remember that scaling is critical for
system performance, so don't look too much at single-core or single-job performance, as it can lead to the wrong conclusions. In fact, Sun's Opteron scaling means that the Sun systems can outperform Woodcrest by 18% to 22%, as shown below.
On 4-core/2-chip Intel Woodcrest systems they are only seeing 2.8x to 2.9x
on 4 cores. This doesn't bode well for quad-core or larger systems made out of these. Sun sees 3.6x to 4.1x scaling in the table below. Couple this with the high wattage of these Woodcrest chips (31-Oct posting) and Woodcrest may have issues.
Opteron leads poor Woodcrest scaling & performance on Fluent 6 Benchmark (Both systems 2 sockets and using dual-core)
| System | GHz/Chip | #cores | FL5M3 (scaling) | FL5L2 (scaling) |
| INTEL S5000XAL | 3.0GHz Xeon Woodcrest 5160 | 4-core | 827.0 (2.8x) | 400.0 (2.9x) |
| INTEL S5000XAL | 3.0GHz Xeon Woodcrest 5160 | 2-core | 553.7 (1.9x) | 226.0 (1.6x) |
| INTEL S5000XAL | 3.0GHz Xeon Woodcrest 5160 | 1-core | 297.3 (1.0x) | 138.0 (1.0x) |
| Sun |
| Sun X4100 M2 | 2.8GHz Opteron DC 2200 | 4-core | 979.9 (3.6x) | 486.6 (4.1x) |
| Sun X4100 M2 | 2.8GHz Opteron DC 2200 | 2-core | 516.1 (1.9x) | 241.8 (2.1x) |
| Sun X4100 M2 | 2.8GHz Opteron DC 2200 | 1-core | 273.5 (1.0x) | 117.6 (1.0x) |
Rating = No. of sequential runs of test case possible in 1 day,
86,400/(Total Elapsed Run Time in Seconds)
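The scaling factors in parentheses in the table above are each rating divided by the same system's 1-core rating, and the 18% to 22% figure comes from comparing the two systems' 4-core ratings. The sketch below reproduces both from the posted numbers; the dictionary layout and helper names are illustrative.

```python
# Ratings (runs per day) keyed by system, test case, and core count.
ratings = {
    "INTEL S5000XAL": {
        "FL5M3": {1: 297.3, 2: 553.7, 4: 827.0},
        "FL5L2": {1: 138.0, 2: 226.0, 4: 400.0},
    },
    "Sun X4100 M2": {
        "FL5M3": {1: 273.5, 2: 516.1, 4: 979.9},
        "FL5L2": {1: 117.6, 2: 241.8, 4: 486.6},
    },
}


def scaling(by_cores, cores):
    """Speedup at a given core count relative to the 1-core rating."""
    return by_cores[cores] / by_cores[1]


def advantage_percent(case, cores):
    """Percent by which the Sun rating exceeds the Intel rating."""
    sun = ratings["Sun X4100 M2"][case][cores]
    intel = ratings["INTEL S5000XAL"][case][cores]
    return (sun / intel - 1.0) * 100.0


for system, cases in ratings.items():
    for case, by_cores in cases.items():
        print(f"{system} {case}: {scaling(by_cores, 4):.1f}x on 4 cores")

for case in ("FL5M3", "FL5L2"):
    print(f"{case}: Sun ahead by {advantage_percent(case, 4):.0f}% on 4 cores")
```

Note that the Sun system has the lower 1-core rating in both cases, so its 4-core lead comes entirely from better scaling.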
Fluent results at:
http://www.fluent.com/software/fluent/fl5bench/flbench_6.2/fullres.htm
...I suspect even better performance and scaling on Sun Fire X4100 M2
with Solaris...
Friday Nov 03, 2006
Does the Woodcrest have scaling issues now? It may be caused by the rush
to increase core count without really considering design. On 4-core/2-chip
Intel Woodcrest systems we are only seeing 3.0x to 3.3x on 4 cores. This doesn't bode well for quad-core or larger systems made out of these. Couple this with the high wattage of these chips (Tuesday's posting) and this chip may have issues.
Poor Woodcrest scaling & Performance on Fluent 6 Benchmark
| System | GHz/Chip | #cores | FL5L1 (scaling) | FL5L2 (scaling) |
| INTEL S5000XAL | 3.0GHz Xeon Woodcrest 5160 | 4-core 2-Socket | 631.8 (3.3x) | 400.0 (3.0x) |
| INTEL S5000XAL | 3.0GHz Xeon Woodcrest 5160 | 2-core 1-Socket | 372.8 (1.9x) | 226.0 (1.7x) |
| INTEL S5000XAL | 3.0GHz Xeon Woodcrest 5160 | 1-core | 194.0 (1.0x) | 133.0 (1.0x) |
Rating = No. of sequential runs of test case possible in 1 day,
86,400/(Total Elapsed Run Time in Seconds)
Fluent results at:
http://www.fluent.com/software/fluent/fl5bench/flbench_6.2/fullres.htm