jParallelogram

Tuesday Apr 14, 2009

LS-DYNA on Nehalem - An Eco-friendly way of computing

With the introduction of Intel's Xeon Processor 5500 series (aka Nehalem) CPU and Sun's release of Nehalem-based systems, I have had the privilege of testing LS-DYNA on the new systems, including clusters of them, running the Solaris x64 operating environment. The results are outstanding and confirm what Intel touts about the new chip: improved speed over previous chips such as Clovertown or Harpertown, plus new technologies such as HyperThreading (HT) and Turbo mode (TM), in a chip with lower power consumption and faster computation. The tests also revealed some interesting points to consider when running LS-DYNA on Solaris, which are described here. (For blogs of other Nehalem results, take a look at this and this.)

First, about the testing environment.

1. Hardware configuration

  1. Sun Fire X6270 blade server with Intel(r) Xeon(r) X5560 CPU (Nehalem) @ 2.80GHz, dual socket, quad core.
  2. Sun Fire X6270 blade server with Intel(r) Xeon(r) L5520 CPU (Nehalem) @ 2.27GHz, dual socket, quad core.
  3. Sun Fire X4150 server with Intel(r) Xeon(r) X5460 CPU (Harpertown) @ 3.16GHz, dual socket, quad core.
  4. On-board DDR InfiniBand NEM (Network Express Modules) on X6270's
  5. Topspin-270 96-port InfiniBand switch

2. Software configuration:

  1. Solaris 10 x64 2008/10 (Update 6)
  2. Solaris 10 InfiniBand update 3
  3. Sun HPC ClusterTools 7.1 & 8.1
  4. Sun Studio Express 2008/11
  5. LS-DYNA: mpp971_s_7600.2.1224_hpc7.1

Now, the results:

1. Nehalem speed-up

First, to see the overall speed-up of the new Sun Blade X6270 system, I compared it against the Harpertown-based Sun Fire X4150 system (1.3 above). For the X4150 I used the release binary of LS-DYNA, and for the new X6270 a newly compiled version built from the same source code base. The compiler was Sun Studio Express 2008/11, which implements a code generator for the Nehalem chip; the flags -xtarget=nehalem and -xvector=simd were used. The runs were made on a single node. The numbers are simulations per day, so the larger the number, the faster the system. The simulation data was a medium-size, typical car crash simulation.


        Harpertown (3.16GHz)      Nehalem (2.8GHz)          Speed-Up
 NCPU   Simulation/Day  Scaling   Simulation/Day  Scaling   Nehalem
 1       6.20           1.0       13.86           1.0       2.23x
 2      12.53           2.02      28.00           2.02      2.24x
 4      22.13           3.57      51.71           3.73      2.34x
 8      34.80           5.61      58.18           4.20      1.67x (HT on)
 8                                84.79           6.12      2.44x (HT off)

 Overall, the new system shows a speed-up greater than 2X, even with a lower-clocked CPU than the previous generation of Intel Xeon. With HT off, the new system scales slightly better than the older one, and the advantage is largest in the full-core-count (8-CPU) case. Of particular interest is the effect of HyperThreading in the 8-CPU, i.e. full-core-count, case; more on this in the next section. Of course, the speed-up combines the effects of both hardware and software. Breaking down the performance difference by contributing factor is possible, but will take time. So let's enjoy the lump-sum improvement for now, which befits the festivity of announcing new systems, and aim for further analysis in follow-on blogs.
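The Scaling and Speed-Up columns in the table above follow directly from the simulations-per-day figures. Here is a minimal Python sketch, assuming that scaling means throughput relative to the 1-CPU run on the same system and that speed-up means Nehalem throughput divided by Harpertown throughput at the same NCPU. The variable and function names are mine and have nothing to do with LS-DYNA itself.

```python
# Simulations per day, copied from the table above.
harpertown = {1: 6.20, 2: 12.53, 4: 22.13, 8: 34.80}
nehalem = {1: 13.86, 2: 28.00, 4: 51.71, 8: 84.79}  # 8-CPU value is with HT off
nehalem_8_ht_on = 58.18


def scaling(runs):
    """Throughput of each run relative to the 1-CPU run on the same system."""
    base = runs[1]
    return {ncpu: sims / base for ncpu, sims in runs.items()}


def speed_up(new, old):
    """Ratio of new to old throughput at each CPU count."""
    return {ncpu: new[ncpu] / old[ncpu] for ncpu in old}


harpertown_scaling = scaling(harpertown)
nehalem_scaling = scaling(nehalem)
nehalem_speed_up = speed_up(nehalem, harpertown)

for ncpu in sorted(harpertown):
    print(f"{ncpu:>2} CPU: Harpertown scaling {harpertown_scaling[ncpu]:.2f}, "
          f"Nehalem scaling {nehalem_scaling[ncpu]:.2f}, "
          f"speed-up {nehalem_speed_up[ncpu]:.2f}x")

# Gain from turning HT off for the full 8-core run on Nehalem.
ht_off_gain = nehalem[8] / nehalem_8_ht_on
print(f"HT off vs HT on at 8 CPUs: {ht_off_gain:.2f}x")
```

The last line isolates the HyperThreading effect on a single node, which is the subject of the next section.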

 2. Effect of HyperThreading and Turbo Mode

One of the most interesting observations from the testing is the HT effect (and, to a lesser extent, the TM effect). The gist is this: on Solaris (I haven't tested Linux yet), when running an MPI application (I haven't tested pthread or OpenMP apps yet) at full core count, you need to set HT off and TM on for the best performance. The effect of HT can be rather large at 30-70%, and it grows as the number of MPI processes increases. Here is an example comparing the performance with both HT and TM on (the default BIOS setting) to that with both HT and TM off. These tests were done on the cluster of X6270 @ 2.27GHz (system 1.2 above).

 NCPU  node x core  HT on TM on  HT off TM off
 8     1 x 8        1.0          1.33
 16    2 x 8        1.0          1.45
 32    4 x 8        1.0          1.71

Later I found that turning TM on gives an extra boost of 7% in one case, so comparing HT on TM on to HT off TM on will give a larger difference. Figuring out what causes this rather dramatic difference in performance, for example with DTrace, would make quite an interesting project in itself, which I plan to work on soon.
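As a rough illustration of that last point, the sketch below scales the measured HT off / TM off ratios by the 7% Turbo Mode gain. It assumes that the 7% boost, which was observed for one case only, applies equally at every node count and with HT off. That is an assumption made for illustration, not a measurement.

```python
# Relative performance of HT off / TM off versus the default (HT on / TM on),
# copied from the table above; keys are total MPI process counts.
ht_off_tm_off = {8: 1.33, 16: 1.45, 32: 1.71}

# Turbo Mode gain observed for one case; assumed here to apply everywhere.
TM_BOOST = 1.07

estimated_ht_off_tm_on = {
    ncpu: ratio * TM_BOOST for ncpu, ratio in ht_off_tm_off.items()
}

for ncpu, ratio in sorted(estimated_ht_off_tm_on.items()):
    print(f"{ncpu:>2} processes: estimated HT off / TM on = {ratio:.2f}x the default")
```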

In fact, this testing gave me the opportunity to learn about ILOM (Integrated Lights Out Manager). It is an exciting feature that lets you control Sun servers remotely through a web-based interface and the System Controller (aka Service Processor, SP) on the server. HT and TM are set as part of the BIOS settings, so remote system control capability is indispensable. For readers who want to try out the HT/TM effect, I have prepared a separate blog for ILOM here.

3. Conclusion

I have gathered the specs of two Intel Xeon processors from the web [1], [2]:

 Model              Speed (GHz)  L2$ (MB)  FSB (MHz)  TDP (W)
 X5460 Harpertown   3.16         12        1333       120 [1]
 X5560 Nehalem      2.80         8         1333       95 [2]

While the Nehalem processor's TDP is 25 W (about 21%) lower than that of the previous-generation Harpertown, which runs at a higher clock frequency, the Nehalem-based X6270 server shows more than 2X speed-up in running LS-DYNA. Since LS-DYNA is one of the most widely used applications in manufacturing, including automotive work such as car crash and occupant safety simulation, the new server makes a truly eco-friendly apparatus for safety-enhancing simulations. Solaris is ready for running LS-DYNA on clusters of the new servers, along with HPC ClusterTools, the Solaris IB update and the Sun Studio compiler.
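To put a rough number on the eco-friendly claim, the sketch below divides the 8-CPU throughput from the first results table by the combined TDP of the two sockets. TDP is only a proxy for the CPUs' power draw, and memory, disks and the rest of the chassis are ignored, so treat the result as an illustration rather than a measured power efficiency.

```python
SOCKETS = 2  # both servers are dual socket

# (simulations per day at 8 CPUs, TDP per socket in watts)
systems = {
    "X5460 Harpertown": (34.80, 120),
    "X5560 Nehalem (HT off)": (84.79, 95),
}


def sims_per_day_per_watt(sims_per_day, tdp_per_socket, sockets=SOCKETS):
    """Throughput divided by total CPU TDP; a proxy, not measured power."""
    return sims_per_day / (tdp_per_socket * sockets)


efficiency = {
    name: sims_per_day_per_watt(sims, tdp) for name, (sims, tdp) in systems.items()
}
ratio = efficiency["X5560 Nehalem (HT off)"] / efficiency["X5460 Harpertown"]

for name, value in efficiency.items():
    print(f"{name}: {value:.3f} simulations/day per watt of TDP")
print(f"Nehalem advantage: {ratio:.2f}x")
```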

