Tuesday Oct 13, 2009

The SPEC CPU2006 benchmarks were run on the new 2.88 GHz and 2.53 GHz SPARC64 VII processors for the Sun SPARC Enterprise M-series servers. The new processors were tested in the Sun SPARC Enterprise M4000, M5000, M8000, and M9000 servers.


  • The Sun SPARC Enterprise M9000 server running the new 2.88 GHz SPARC64 VII processors beats the IBM Power 595 server running 5.0 GHz POWER6 processors by 20% on the SPECint_rate2006 benchmark.

  • The Sun SPARC Enterprise M9000 server running the new 2.88 GHz SPARC64 VII processors beats the IBM Power 595 server running 5.0 GHz POWER6 processors by 29% on the SPECint_rate_base2006 benchmark.

  • The Sun SPARC Enterprise M9000 server with 64 SPARC64 VII 2.88GHz processors delivered results of 2590 SPECint_rate2006 and 2100 SPECfp_rate2006.

  • The Sun SPARC Enterprise M9000 server with 64 SPARC64 VII processors at 2.88GHz improves performance vs. 2.52 GHz by 13% for SPECint_rate2006 and 5% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M9000 server with 32 SPARC64 VII 2.88GHz processors delivered results of 1450 SPECint_rate2006 and 1250 SPECfp_rate2006.

  • The Sun SPARC Enterprise M9000 server with 32 SPARC64 VII processors at 2.88GHz improves performance vs. 2.52 GHz by 17% for SPECint_rate2006 and 13% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M8000 server with 16 SPARC64 VII 2.88GHz processors delivered results of 753 SPECint_rate2006 and 666 SPECfp_rate2006.

  • The Sun SPARC Enterprise M8000 server with 16 SPARC64 VII processors at 2.88GHz improves performance vs. 2.52 GHz by 18% for SPECint_rate2006 and 14% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M5000 server with 8 SPARC64 VII 2.53GHz processors delivered results of 296 SPECint_rate2006 and 234 SPECfp_rate2006.

  • The Sun SPARC Enterprise M5000 server with 8 SPARC64 VII processors at 2.53GHz improves performance vs. 2.40 GHz by 12% for SPECint_rate2006 and 5% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M4000 server with 4 SPARC64 VII 2.53GHz processors delivered results of 152 SPECint_rate2006 and 116 SPECfp_rate2006.

  • The Sun SPARC Enterprise M4000 server with 4 SPARC64 VII processors at 2.53GHz improves performance vs. 2.40 GHz by 13% for SPECint_rate2006 and 4% for SPECfp_rate2006.
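
The percentages in the bullets above follow directly from the published scores. As a minimal sketch (the helper name percent_gain and the choice of which pairs to include are illustrative, not part of any SPEC tool), the arithmetic can be reproduced like this:

```python
# Scores quoted in the bullets and tables of this post:
# (new score, comparison score).
COMPARISONS = {
    "M9000 2.88 GHz vs IBM Power 595, SPECint_rate2006": (2590, 2155),
    "M9000 2.88 GHz vs IBM Power 595, SPECint_rate_base2006": (2400, 1866),
    "M9000 (64 chips) 2.88 vs 2.52 GHz, SPECint_rate2006": (2590, 2288),
    "M8000 2.88 vs 2.52 GHz, SPECfp_rate2006": (666, 582),
}


def percent_gain(new_score, old_score):
    """Return how much larger new_score is than old_score, in percent."""
    return (new_score / old_score - 1.0) * 100.0


for label, (new_score, old_score) in COMPARISONS.items():
    print(f"{label}: {percent_gain(new_score, old_score):.0f}%")
```

With these inputs the first two comparisons round to 20% and 29%, which matches the claims made against the IBM Power 595.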

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results. All results as of 10/07/09.

In the tables below
"Base" = SPECint_rate_base2006 or SPECfp_rate_base2006
"Peak" = SPECint_rate2006 or SPECfp_rate2006

SPECint_rate2006 results - large systems

Columns: System, Cores/Chips, Processor Type, GHz, Base Copies, Base, Peak, Comments
SGI Altix 4700 Bandwidth 1024/512 Itanium 2 1.6 1020 9031 na
Sun Blade X6440 Cluster 768/192 Opteron 8384 2.7 705 8845 na
SGI Altix 4700 Density 256/128 Itanium 2 1.66 256 2893 3354
vSMP Foundation 128/32 Xeon X5570 2.93 255 3147 na
SGI Altix 4700 Bandwidth 256/128 Itanium 2 1.6 256 2715 2971
SPARC Enterprise M9000 256/64 SPARC64 VII 2.88 511 2400 2590 New
SPARC Enterprise M9000 256/64 SPARC64 VII 2.52 511 2088 2288
IBM Power 595 64/32 POWER6 5.0 128 1866 2155
HP Superdome 128/64 Itanium 2 1.6 128 1534 1648
SPARC Enterprise M9000 128/32 SPARC64 VII 2.88 255 1370 1450 New
SPARC Enterprise M9000 128/64 SPARC64 VI 2.4 255 1111 1294
SPARC Enterprise M9000 128/32 SPARC64 VII 2.52 255 1141 1240
Unisys ES7000 96/16 Xeon X7460 2.66 96 999 1049
SGI Altix ICE 8200EX 32/8 Xeon X5570 2.93 64 931 999
IBM Power 575 32/16 POWER6 4.7 64 812 934
IBM Power 570 32/16 POWER6+ 4.2 64 661 832
SPARC Enterprise M8000 64/16 SPARC64 VII 2.88 127 706 753 New
SPARC Enterprise M9000 64/32 SPARC64 VI 2.4 127 553 650
SPARC Enterprise M8000 64/16 SPARC64 VII 2.52 127 565 637

SPECint_rate2006 results - small systems

Columns: System, Cores/Chips, Processor Type, GHz, Base Copies, Base, Peak, Comments
Sun Fire X4440 24/4 Opteron 8435 SE 2.6 24 296 377
SPARC Enterprise M5000 32/8 SPARC64 VII 2.53 64 267 296 New
Sun Blade X6440 16/4 Opteron 8389 2.9 16 226 292
HP ProLiant BL680c G5 24/4 Xeon E7458 2.4 24 247 268
SPARC Enterprise M5000 32/8 SPARC64 VII 2.4 63 232 264
IBM Power 550 8/4 POWER6+ 5.0 16 215 263
Sun Fire X2270 8/2 Xeon X5570 2.93 16 223 260
SPARC Enterprise T5240 16/2 UltraSPARC T2 Plus 1.6 127 171 183
SPARC Enterprise M4000 16/4 SPARC64 VII 2.53 32 136 152 New
SPARC Enterprise M4000 16/4 SPARC64 VII 2.4 32 118 135

SPECfp_rate2006 results - large systems

Columns: System, Cores/Chips, Processor Type, GHz, Base Copies, Base, Peak, Comments
SGI Altix 4700 Bandwidth 1024/512 Itanium 2 1.6 1020 10583 na
SGI Altix 4700 Density 1024/512 Itanium 2 1.66 1020 10580 na
Sun Blade X6440 Cluster 768/192 Opteron 8384 2.7 705 6502 na
SGI Altix 4700 Bandwidth 256/128 Itanium 2 1.6 256 3419 3507
ScaleMP vSMP Foundation 128/32 Xeon X5570 2.93 255 2553 na
IBM Power 595 64/32 POWER6 5.0 128 1681 2184
IBM Power 595 64/32 POWER6 5.0 128 1822 2108
SPARC Enterprise M9000 256/64 SPARC64 VII 2.88 511 1930 2100 New
SPARC Enterprise M9000 256/64 SPARC64 VII 2.52 511 1861 2005
SGI Altix 4700 Bandwidth 128/64 Itanium 2 1.66 128 1832 1947
HP Superdome 128/64 Itanium 2 1.6 128 1422 1479
SPARC Enterprise M9000 128/32 SPARC64 VII 2.88 255 1190 1250 New
SPARC Enterprise M9000 128/64 SPARC64 VI 2.4 255 1160 1225
SPARC Enterprise M9000 128/32 SPARC64 VII 2.52 255 1059 1110
IBM Power 575 32/16 POWER6 4.7 64 730 839
SPARC Enterprise M8000 64/16 SPARC64 VII 2.88 127 616 666 New
SPARC Enterprise M9000 64/32 SPARC64 VI 2.52 127 588 636
IBM Power 570 32/16 POWER6+ 4.2 64 517 602
SPARC Enterprise M8000 64/16 SPARC64 VII 2.52 127 538 582

SPECfp_rate2006 results - small systems

Columns: System, Cores/Chips, Processor Type, GHz, Base Copies, Base, Peak, Comments
Supermicro H8QM8-2 24/4 Opteron 8435 SE 2.8 24 261 287
SPARC Enterprise T5440 32/4 UltraSPARC T2 Plus 1.6 255 254 270
IBM Power 560 16/8 POWER6+ 3.6 32 226 263
SPARC Enterprise M5000 32/8 SPARC64 VII 2.53 64 218 234 New
SPARC Enterprise M5000 32/8 SPARC64 VII 2.4 63 208 223
IBM Power 550 8/4 POWER6+ 5.0 16 188 222
ASUS Z8PE-D18 8/2 Xeon X5570 2.93 16 197 203
SPARC Enterprise T5240 16/2 UltraSPARC T2 Plus 1.6 127 124 133
SPARC Enterprise M4000 16/4 SPARC64 VII 2.53 32 111 116 New
SPARC Enterprise M4000 16/4 SPARC64 VII 2.4 32 107 112

Results and Configuration Summary

Test Configurations:

Sun SPARC Enterprise M9000
64 x 2.88 GHz SPARC64 VII
1152 GB (448 x 2GB + 64 x 4GB)
Solaris 10 5/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M9000
32 x 2.88 GHz SPARC64 VII
704 GB (160 x 2GB + 96 x 4GB)
Solaris 10 5/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M8000
16 x 2.88 GHz SPARC64 VII
512 GB (128 x 4GB)
Solaris 10 10/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M5000
8 x 2.53 GHz SPARC64 VII
128 GB (64 x 2GB)
Solaris 10 10/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M4000
4 x 2.53 GHz SPARC64 VII
32 GB (32 x 1GB)
Solaris 10 10/09
Sun Studio 12 Update 1

Results Summary:

M9000 (64 chips) M9000 (32 chips) M8000 M5000 M4000
SPECint_rate_base2006 2400 1370 706 267 136
SPECint_rate2006 2590 1450 753 296 152
SPECfp_rate_base2006 1930 1190 616 218 111
SPECfp_rate2006 2100 1250 666 234 116
SPECint_base2006 - - 12.4 - 12.1
SPECint2006 - - 13.6 - 12.9
SPECfp_base2006 - - 15.6 - 13.3
SPECfp2006 - - 16.5 - 13.9
SPECfp_base2006 - autopar - - 28.2 - -
SPECfp2006 - autopar - - 33.9 - -

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark, with over 8000 results published in the three years since it was introduced. It measures:

  • "Speed" - single copy performance of chip, memory, compiler
  • "Rate" - multiple copy (throughput)

The rate metrics are used for the throughput-oriented systems described on this page. These metrics include:

  • SPECint_rate2006: throughput for 12 integer benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • SPECfp_rate2006: throughput for 17 floating point benchmarks derived from real applications, including chemistry, physics, genetics, and weather.

There are "base" variants of both the above metrics that require more conservative compilation, such as using the same flags for all benchmarks.
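
To make the rate metrics more concrete: under SPEC's published run rules, each benchmark's rate ratio is the number of copies multiplied by the reference time and divided by the measured run time, and the overall metric is the geometric mean of those ratios. The sketch below illustrates that calculation only. The function names, the copy count, and the timing values are hypothetical, and it skips details such as SPEC's selection of the median of several runs.

```python
import math


def rate_ratio(copies, ref_seconds, run_seconds):
    """Rate ratio for one benchmark: copies * reference time / run time."""
    return copies * ref_seconds / run_seconds


def geometric_mean(values):
    """Geometric mean, computed via logarithms to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (reference seconds, measured seconds) pairs for three
# benchmarks; real suites have 12 integer or 17 floating point benchmarks.
timings = [(9000.0, 1200.0), (8000.0, 1500.0), (10000.0, 900.0)]
copies = 32  # hypothetical number of concurrent copies

ratios = [rate_ratio(copies, ref, run) for ref, run in timings]
print(f"Overall rate metric: {geometric_mean(ratios):.1f}")
```

Because the overall score is a geometric mean, a large gain on one benchmark cannot fully mask a poor result on another.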

Key Points and Best Practices

Results on this page for the Sun SPARC Enterprise M9000 server were measured on a Fujitsu SPARC Enterprise M9000. The Sun SPARC Enterprise M9000 and Fujitsu SPARC Enterprise M9000 are electronically equivalent. Results for the Sun SPARC Enterprise M8000, M4000 and M5000 were measured on those systems. The similarly named Fujitsu systems are electronically equivalent.

Use the latest compiler. The Sun Studio group is always working to improve the compiler. Sun Studio 12 Update 1, which is used in these submissions, provides updated code generation for a wide variety of SPARC and x86 implementations.

I/O still counts. Even in a CPU-intensive workload, some I/O remains. This point is explored in some detail at http://blogs.sun.com/jhenning/entry/losing_my_fear_of_zfs.

Disclosure Statement

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Competitive results from www.spec.org as of 7 October 2009. Sun's new results quoted on this page have been submitted to SPEC. Sun SPARC Enterprise M9000 2400 SPECint_rate_base2006, 2590 SPECint_rate2006, 1930 SPECfp_rate_base2006, 2100 SPECfp_rate2006; Sun SPARC Enterprise M9000 (32 chips) 1370 SPECint_rate_base2006, 1450 SPECint_rate2006, 1190 SPECfp_rate_base2006, 1250 SPECfp_rate2006; Sun SPARC Enterprise M8000 706 SPECint_rate_base2006, 753 SPECint_rate2006, 616 SPECfp_rate_base2006, 666 SPECfp_rate2006; Sun SPARC Enterprise M5000 267 SPECint_rate_base2006, 296 SPECint_rate2006, 218 SPECfp_rate_base2006, 234 SPECfp_rate2006; Sun SPARC Enterprise M4000 136 SPECint_rate_base2006, 152 SPECint_rate2006, 111 SPECfp_rate_base2006, 116 SPECfp_rate2006; Sun SPARC Enterprise M9000 (2.52GHz) 2088 SPECint_rate_base2006, 2288 SPECint_rate2006, 1860 SPECfp_rate_base2006, 2010 SPECfp_rate2006; Sun SPARC Enterprise M9000 (32 chips 2.52GHz) 1140 SPECint_rate_base2006, 1240 SPECint_rate2006, 1060 SPECfp_rate_base2006, 1110 SPECfp_rate2006; Sun SPARC Enterprise M8000 (2.52GHz) 565 SPECint_rate_base2006, 637 SPECint_rate2006, 538 SPECfp_rate_base2006, 582 SPECfp_rate2006; Sun SPARC Enterprise M5000 (2.4GHz) 232 SPECint_rate_base2006, 264 SPECint_rate2006, 208 SPECfp_rate_base2006, 223 SPECfp_rate2006; Sun SPARC Enterprise M4000 (2.4GHz) 118 SPECint_rate_base2006, 135 SPECint_rate2006, 107 SPECfp_rate_base2006, 112 SPECfp_rate2006; IBM Power 595 1866 SPECint_rate_base2006, 2155 SPECint_rate2006,

Wednesday Jul 22, 2009

Sun has upgraded the UltraSPARC T2 and UltraSPARC T2 Plus processors to 1.6 GHz. As described in some detail in yesterday's post, new results show SPEC CPU2006 performance improvements vs. previous systems that often exceed the clock speed improvement.  The scaling can be attributed to both memory system improvements and software improvements, such as the Sun Studio 12 Update 1 compiler.

A MHz improvement within a product line is often useful.  If yesterday's chip runs at speed n and today's at n*1.12 then, intuitively, sure, I'll take today's.

Comparing MHz across product lines is often counter-intuitive.  Consider that Sun's new systems provide:

  • up to 68% more throughput than the 4.7 GHz POWER6+ [1], and
  • up to 3x the throughput of the Itanium 9150N [2].

The comparisons are particularly striking when one takes into account the cache size advantage for both the POWER6+ and the Itanium 9150N, and the MHz advantage for the POWER6+:

Columns: Processor, GHz, Number of hw cache levels, Size of last cache (per chip), SPECint_rate_base2006
UltraSPARC T2 / UltraSPARC T2 Plus, 1.6, 2, 4 MB, 1 chip: 89; 2 chips: 171; 4 chips: 338
POWER6+, 4.7, 3, 32 MB, Best 2 chip result: 102. UltraSPARC T2 Plus delivers 68% more integer throughput [1]
Itanium 9150N, 1.6, 3, 24 MB, Best 4 chip result: 114. UltraSPARC T2 Plus delivers 3x the integer throughput. [2]

These are per-chip results, not per-core or per-thread. Sun's CMT processors are designed for overall system throughput: how much work can the overall system get done.  
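
Since the comparison is made per chip, it can help to see how the CMT scores scale with chip count. The following is a rough sketch using only the three SPECint_rate_base2006 scores from the table above. The helper scaling_efficiency is an illustrative name, and the 1-chip result comes from a T2 system while the 2- and 4-chip results come from T2 Plus systems, so the figures are indicative rather than a controlled scaling study.

```python
# SPECint_rate_base2006 scores for the 1.6 GHz UltraSPARC T2 / T2 Plus
# systems, keyed by chip count (taken from the table above).
SCORES_BY_CHIPS = {1: 89, 2: 171, 4: 338}


def scaling_efficiency(score, chips, single_chip_score):
    """Fraction of ideal linear scaling achieved at a given chip count."""
    return score / (chips * single_chip_score)


for chips, score in SCORES_BY_CHIPS.items():
    eff = scaling_efficiency(score, chips, SCORES_BY_CHIPS[1])
    print(f"{chips} chip(s): {score} total, {score / chips:.1f} per chip, "
          f"{eff:.0%} of linear scaling")
```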

A mystery: With comparatively smaller caches and modest clock rates, why do the Sun CMT processors win?

The performance hole: Memory latency. From the point of view of a CPU chip, the big performance problem is that memory latency is inordinately long compared to chip cycle times.

A hardware designer can attempt to cover up that latency with very large caches, as in the POWER6+ and Itanium, and this works well when running a small number of modest-sized applications. Large caches become less helpful, though, as workloads become more complex.

MHz isn't everything. In fact, MHz hardly counts at all when the problem is memory latency. Suppose the hot part of an application looks like this:

  loop:
       computational instruction
       computational instruction
       computational instruction
       memory access instruction
       branch to loop

For an application that looks like this, the computational instructions may complete in only a few cycles, while the memory access instruction may easily require on the order of 100ns - which, for a 1 GHz chip, is on the order of 100 cycles. If the processor speed is increased by a factor of 4, but memory speed is not, then memory is still 100ns away, and when measured in cycles, it is now 400 cycles distant. The overall loop hardly speeds up at all.
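
The arithmetic in the paragraph above can be checked with a small model. This is a sketch under stated assumptions: each computational instruction and the branch are assumed to take exactly one cycle (the text only says "a few cycles"), memory latency is fixed at 100ns, and nothing overlaps. The function names are illustrative.

```python
def memory_latency_cycles(clock_ghz, memory_ns=100.0):
    """Memory latency expressed in CPU cycles at the given clock rate."""
    return memory_ns * clock_ghz


def loop_time_ns(clock_ghz, compute_instructions=3, branch_instructions=1,
                 memory_ns=100.0):
    """Time for one loop iteration: single-cycle instructions plus one
    memory access whose latency does not depend on the CPU clock."""
    cycle_ns = 1.0 / clock_ghz
    return (compute_instructions + branch_instructions) * cycle_ns + memory_ns


for ghz in (1.0, 4.0):
    print(f"{ghz:.0f} GHz: memory is {memory_latency_cycles(ghz):.0f} cycles away, "
          f"loop iteration takes {loop_time_ns(ghz):.0f} ns")

speedup = loop_time_ns(1.0) / loop_time_ns(4.0)
print(f"Speedup from a 4x faster clock: {speedup:.2f}x")
```

Under these assumptions a 4x clock increase shortens the iteration from 104 ns to 101 ns, a speedup of about 3%.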

Lest the reader think I am making this up - consider page 8 of this IBM talk from April, 2008 regarding the POWER6:

[Image: latencies]

The IBM POWER systems have some impressive performance characteristics - if your application is tiny enough to fit in its first or second level cache. But memory latency is not impressive. If your workload requires multiple concurrent threads accessing a large memory space, Sun's CMT approach just might be a better fit.

Operating System Overhead: A context switch from one process to another is mediated by operating system services. The OS parks context from the process that is currently running - typically saving dozens of program registers and other context (such as virtual address space information); decides which process to run next (which may require access to several OS data structures); and loads the context for the new process (registers, virtual address context, etc.). If the system is running many processes, then caches are unlikely to be helpful during this context switch, and thousands of cycles may be spent on main memory accesses.

Design for throughput: Sun's CMT approach handles the complexity of real-world applications by allowing up to 64 processes to be simultaneously on-chip. When a long-latency stall occurs, such as an access to main memory, the chip switches to executing instructions on behalf of other, non-stalled threads, thus improving overall system throughput. No operating system intervention is required as resources are shared among the processes on the chip.
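
Why switching among threads helps can be seen in a simplified textbook model of hardware multithreading. This is not a description of Sun's actual pipeline. It assumes every thread alternates between a fixed burst of compute cycles and a fixed memory stall, that switching threads costs nothing, and that memory bandwidth is never the limit. The values of 4 compute cycles and 100 stall cycles are illustrative, borrowed from the loop example at 1 GHz.

```python
import math


def pipeline_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles in which some thread has work to issue."""
    busy = threads * compute_cycles
    period = compute_cycles + stall_cycles
    return min(1.0, busy / period)


def threads_to_saturate(compute_cycles, stall_cycles):
    """Smallest thread count that keeps the pipeline busy every cycle."""
    return math.ceil((compute_cycles + stall_cycles) / compute_cycles)


for threads in (1, 2, 4, 8, 64):
    util = pipeline_utilization(threads, compute_cycles=4, stall_cycles=100)
    print(f"{threads:2d} thread(s): {util:.0%} busy")

print("Threads needed to hide the stall:", threads_to_saturate(4, 100))
```

In this model a single thread leaves the pipeline idle almost all of the time, and utilization grows linearly with thread count until the stall is fully hidden.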

[1] http://www.spec.org/cpu2006/results/res2009q2/cpu2006-20090427-07263.html
[2] http://www.spec.org/cpu2006/results/res2009q2/cpu2006-20090522-07485.html

Competitive results retrieved from www.spec.org   20 July 2009.  Sun's CMT results have been submitted to SPEC.  SPEC, SPECfp, SPECint are registered trademarks of the Standard Performance Evaluation Corporation.

Tuesday Jul 21, 2009

UltraSPARC T2 and T2 Plus Systems

Improved Performance Over 1.4 GHz

Reported 07/21/09

Significance of Results

Results are presented for the SPEC CPU2006 rate benchmarks run on systems based on the new 1.6 GHz Sun UltraSPARC T2 and Sun UltraSPARC T2 Plus processors. The new processors were tested in the Sun CMT family of systems, including the Sun SPARC Enterprise T5120, T5220, T5240, and T5440 servers and the Sun Blade T6320 server module.

SPECint_rate2006

  • The Sun SPARC Enterprise T5440 server, equipped with four 1.6 GHz UltraSPARC T2 Plus processor chips, delivered 57% and 37% better results than the best 4-chip IBM POWER6+ based systems on the SPEC CPU2006 integer throughput metrics.

  • The Sun SPARC Enterprise T5240 server equipped with two 1.6 GHz UltraSPARC T2 Plus processor chips, produced 68% and 48% better results than the best 2-chip IBM POWER6+ based systems on the SPEC CPU2006 integer throughput metrics.

  • The single-chip 1.6 GHz UltraSPARC T2 processor-based Sun CMT servers produced 59% to 68% better results than the best single-chip IBM POWER6 based systems on the SPEC CPU2006 integer throughput metrics.

  • On the four-chip Sun SPARC Enterprise T5440 server, when compared versus the 1.4 GHz version of this server, the new 1.6 GHz UltraSPARC T2 Plus processor delivered performance improvements of 25% and 20% as measured by the SPEC CPU2006 integer throughput metrics.

  • The new 1.6 GHz UltraSPARC T2 Plus processor, when put into the 2-chip Sun SPARC Enterprise T5240 server, delivered improvements of 20% and 17% when compared to the 1.4 GHz UltraSPARC T2 Plus processor based server, as measured by the SPEC CPU2006 integer throughput metrics.

  • On the single-chip Sun Blade T6320 server module, Sun SPARC Enterprise T5120 and T5220 servers, the new 1.6 GHz UltraSPARC T2 processor delivered performance improvements of 13% to 17% over the 1.4 GHz version of these servers, as measured by the SPEC CPU2006 integer throughput metrics.

  • The Sun SPARC Enterprise T5440 server, equipped with four 1.6 GHz UltraSPARC T2 Plus processor chips, delivered a SPECint_rate_base2006 score 3X the best 4-chip Itanium based system.

  • The Sun SPARC Enterprise T5440 server, equipped with four 1.6 GHz UltraSPARC T2 Plus processors, delivered a SPECint_rate_base2006 score of 338, a World Record score for 4-chip systems running a single operating system instance (i.e. SMP, not clustered).

SPECfp_rate2006

  • The Sun SPARC Enterprise T5440 server, equipped with four 1.6 GHz UltraSPARC T2 Plus processor chips, delivered 35% and 22% better results than the best 4-chip IBM POWER6+ based systems on the SPEC CPU2006 floating-point throughput metrics.

  • The Sun SPARC Enterprise T5240 server, equipped with two 1.6 GHz UltraSPARC T2 Plus processor chips, produced 40% and 27% better results than the best 2-chip IBM POWER6+ based systems on the SPEC CPU2006 floating-point throughput metrics.

  • The single 1.6 GHz UltraSPARC T2 processor based Sun CMT servers produced between 24% and 18% better results than the best single-chip IBM POWER6 based systems on the SPEC CPU2006 floating-point throughput metrics.

  • On the four chip Sun SPARC Enterprise T5440 server, the new 1.6 GHz UltraSPARC T2 Plus processor delivered performance improvements of 20% and 17% when compared to 1.4 GHz processors in the same system, as measured by the SPEC CPU2006 floating-point throughput metrics.

  • The new 1.6 GHz UltraSPARC T2 Plus processor, when put into a Sun SPARC Enterprise T5240 server, delivered an improvement of 12% when compared to the 1.4 GHz UltraSPARC T2 Plus processor based server as measured by the SPEC CPU2006 floating-point throughput metrics.

  • On the single processor Sun Blade T6320 server module, Sun SPARC Enterprise T5120 and T5220 servers, the new 1.6 GHz UltraSPARC T2 processor delivered a performance improvement over the 1.4 GHz version of these servers of between 11% and 10% as measured by the SPEC CPU2006 floating-point throughput metrics.

  • The Sun SPARC Enterprise T5440 server, equipped with four 1.6 GHz UltraSPARC T2 Plus processor chips, delivered a peak score 3X the best 4-chip Itanium based system, and base 2.9X, on the SPEC CPU2006 floating-point throughput metrics.

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results. All results as of 7/17/09.

In the tables below
"Base" = SPECint_rate_base2006 or SPECfp_rate_base2006
"Peak" = SPECint_rate2006 or SPECfp_rate2006

SPECint_rate2006 results - 1 chip systems

Columns: System, Cores/Chips, Processor Type, MHz, Base Copies, Base, Peak, Comments
Supermicro X8DAI 4/1 Xeon W3570 3200 8 127 136 Best Nehalem result
HP ProLiant BL465c G6 6/1 Opteron 2435 2600 6 82.1 104 Best Istanbul result
Sun SPARC T5220 8/1 UltraSPARC T2 1582 63 89.1 97.0 New
Sun SPARC T5120 8/1 UltraSPARC T2 1582 63 89.1 97.0 New
Sun Blade T6320 8/1 UltraSPARC T2 1582 63 89.2 96.7 New
Sun Blade T6320 8/1 UltraSPARC T2 1417 63 76.4 85.5
Sun SPARC T5120 8/1 UltraSPARC T2 1417 63 76.2 83.9
IBM System p 570 2/1 POWER6 4700 4 53.2 60.9 Best POWER6 result

SPECint_rate2006 - 2 chip systems

Columns: System, Cores/Chips, Processor Type, MHz, Base Copies, Base, Peak, Comments
Fujitsu CELSIUS R670 8/2 Xeon W5580 3200 16 249 267 Best Nehalem result
Sun Blade X6270 8/2 Xeon X5570 2933 16 223 260
A+ Server 1021M-UR+B 12/2 Opteron 2439 SE 2800 12 168 215 Best Istanbul result
Sun SPARC T5240 16/2 UltraSPARC T2 Plus 1582 127 171 183 New
Sun SPARC T5240 16/2 UltraSPARC T2 Plus 1415 127 142 157
IBM Power 520 4/2 POWER6+ 4700 8 101 124 Best POWER6+ peak
IBM Power 520 4/2 POWER6+ 4700 8 102 122 Best POWER6+ base
HP Integrity rx2660 4/2 Itanium 9140M 1666 4 58.1 62.8 Best Itanium peak
HP Integrity BL860c 4/2 Itanium 9140M 1666 4 61.0 na Best Itanium base

SPECint_rate2006 - 4 chip systems

Columns: System, Cores/Chips, Processor Type, MHz, Base Copies, Base, Peak, Comments
SGI Altix ICE 8200EX 16/4 Xeon X5570 2933 32 466 499 Best Nehalem result. Note: clustered, not SMP
Tyan Thunder n4250QE 24/4 Opteron 8439 SE 2800 24 326 417 Best Istanbul result
Sun SPARC T5440 32/4 UltraSPARC T2 Plus 1596 255 338 360 New. World record for 4-chip SMP SPECint_rate_base2006
Sun SPARC T5440 32/4 UltraSPARC T2 Plus 1414 255 270 301
IBM Power 550 8/4 POWER6+ 5000 16 215 263 Best POWER6 result
HP Integrity BL870c 8/4 Itanium 9150N 1600 8 114 na Best Itanium result

SPECfp_rate2006 - 1 chip systems

Columns: System, Cores/Chips, Processor Type, MHz, Base Copies, Base, Peak, Comments
Supermicro X8DAI 4/1 Xeon W3570 3200 8 102 106 Best Nehalem result
HP ProLiant BL465c G6 6/1 Opteron 2435 2600 6 65.2 72.2 Best Istanbul result
Sun SPARC T5220 8/1 UltraSPARC T2 1582 63 64.1 68.5 New
Sun SPARC T5120 8/1 UltraSPARC T2 1582 63 64.1 68.5 New
Sun Blade T6320 8/1 UltraSPARC T2 1582 63 64.1 68.5 New
Sun Blade T6320 8/1 UltraSPARC T2 1417 63 58.1 62.3
Sun SPARC T5120 8/1 UltraSPARC T2 1417 63 57.9 62.3
Sun SPARC T5220 8/1 UltraSPARC T2 1417 63 57.9 62.3
IBM System p 570 2/1 POWER6 4700 4 51.5 58.0 Best POWER6 result

SPECfp_rate2006 - 2 chip systems

Columns: System, Cores/Chips, Processor Type, MHz, Base Copies, Base, Peak, Comments
ASUS TS700-E6 8/2 Xeon W5580 3200 16 201 207 Best Nehalem result
A+ Server 1021M-UR+B 12/2 Opteron 2439 SE 2800 12 133 147 Best Istanbul result
Sun SPARC T5240 16/2 UltraSPARC T2 Plus 1582 127 124 133 New
Sun SPARC T5240 16/2 UltraSPARC T2 Plus 1415 127 111 119
IBM Power 520 4/2 POWER6+ 4700 8 88.7 105 Best POWER6+ result
HP Integrity rx2660 4/2 Itanium 9140M 1666 4 54.5 55.8 Best Itanium result

SPECfp_rate2006 - 4 chip systems

Columns: System, Cores/Chips, Processor Type, MHz, Base Copies, Base, Peak, Comments
SGI Altix ICE 8200EX 16/4 Xeon X5570 2933 32 361 372 Best Nehalem result
Tyan Thunder n4250QE 24/4 Opteron 8439 SE 2800 24 259 285 Best Istanbul result
Sun SPARC T5440 32/4 UltraSPARC T2 Plus 1596 255 254 270 New
Sun SPARC T5440 32/4 UltraSPARC T2 Plus 1414 255 212 230
IBM Power 550 8/4 POWER6+ 5000 16 188 222 Best POWER6+ result
HP Integrity rx7640 8/4 Itanium 2 9040 1600 8 87.4 90.8 Best Itanium result

Results and Configuration Summary

Test Configurations:


Sun Blade T6320
1.6 GHz UltraSPARC T2
64 GB (16 x 4GB)
Solaris 10 10/08
Sun Studio 12, Sun Studio 12 Update 1, gccfss V4.2.1

Sun SPARC Enterprise T5120/T5220
1.6 GHz UltraSPARC T2
64 GB (16 x 4GB)
Solaris 10 10/08
Sun Studio 12, Sun Studio 12 Update 1, gccfss V4.2.1

Sun SPARC Enterprise T5240
2 x 1.6 GHz UltraSPARC T2 Plus
128 GB (32 x 4GB)
Solaris 10 5/09
Sun Studio 12, Sun Studio 12 Update 1, gccfss V4.2.1

Sun SPARC Enterprise T5440
4 x 1.6 GHz UltraSPARC T2 Plus
256 GB (64 x 4GB)
Solaris 10 5/09
Sun Studio 12 Update 1, gccfss V4.2.1

Results Summary:



T6320 T5120 T5220 T5240 T5440
SPECint_rate_base2006 89.2 89.1 89.1 171 338
SPECint_rate2006 96.7 97.0 97.0 183 360
SPECfp_rate_base2006 64.1 64.1 64.1 124 254
SPECfp_rate2006 68.5 68.5 68.5 133 270

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark, with over 7000 results published in the three years since it was introduced. It measures:

  • "Speed" - single copy performance of chip, memory, compiler
  • "Rate" - multiple copy (throughput)

The rate metrics are used for the throughput-oriented systems described on this page. These metrics include:

  • SPECint_rate2006: throughput for 12 integer benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • SPECfp_rate2006: throughput for 17 floating point benchmarks derived from real applications, including chemistry, physics, genetics, and weather.

There are "base" variants of both the above metrics that require more conservative compilation, such as using the same flags for all benchmarks.

See here for additional information.

Key Points and Best Practices

Results on this page for the Sun SPARC Enterprise T5120 server were measured on a Sun SPARC Enterprise T5220. The Sun SPARC Enterprise T5120 and Sun SPARC Enterprise T5220 are electronically equivalent. A SPARC Enterprise T5120 can hold up to 4 disks, and a T5220 can hold up to 8. This system was tested with 4 disks; therefore, results on this page apply to both the T5120 and the T5220.

Know when you need throughput vs. speed. The Sun CMT systems described on this page provide massive throughput, as demonstrated by the fact that up to 255 jobs are run on the 4-chip system, 127 on 2-chip, and 63 on 1-chip. Some of the competitive chips do have a speed advantage - e.g. Nehalem and Istanbul - but none of the competitive results undertake to run the large number of jobs tested on Sun's CMT systems.

Use the latest compiler. The Sun Studio group is always working to improve the compiler. Sun Studio 12, and Sun Studio 12 Update 1, which are used in these submissions, provide updated code generation for a wide variety of SPARC and x86 implementations.

I/O still counts. Even in a CPU-intensive workload, some I/O remains. This point is explored in some detail at http://blogs.sun.com/jhenning/entry/losing_my_fear_of_zfs.

Disclosure Statement

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Competitive results from www.spec.org as of 16 July 2009.  Sun's new results quoted on this page have been submitted to SPEC.
Sun Blade T6320 89.2 SPECint_rate_base2006, 96.7 SPECint_rate2006, 64.1 SPECfp_rate_base2006, 68.5 SPECfp_rate2006;
Sun SPARC Enterprise T5220/T5120 89.1 SPECint_rate_base2006, 97.0 SPECint_rate2006, 64.1 SPECfp_rate_base2006, 68.5 SPECfp_rate2006;
Sun SPARC Enterprise T5240 171 SPECint_rate_base2006, 183 SPECint_rate2006, 124 SPECfp_rate_base2006, 133 SPECfp_rate2006;
Sun SPARC Enterprise T5440 338 SPECint_rate_base2006, 360 SPECint_rate2006, 254 SPECfp_rate_base2006, 270 SPECfp_rate2006;
Sun Blade T6320 76.4 SPECint_rate_base2006, 85.5 SPECint_rate2006, 58.1 SPECfp_rate_base2006, 62.3 SPECfp_rate2006;
Sun SPARC Enterprise T5220/T5120 76.2 SPECint_rate_base2006, 83.9 SPECint_rate2006, 57.9 SPECfp_rate_base2006, 62.3 SPECfp_rate2006;
Sun SPARC Enterprise T5240 142 SPECint_rate_base2006, 157 SPECint_rate2006, 111 SPECfp_rate_base2006, 119 SPECfp_rate2006;
Sun SPARC Enterprise T5440 270 SPECint_rate_base2006, 301 SPECint_rate2006, 212 SPECfp_rate_base2006, 230 SPECfp_rate2006;
IBM p 570 53.2 SPECint_rate_base2006, 60.9 SPECint_rate2006, 51.5 SPECfp_rate_base2006, 58.0 SPECfp_rate2006;
IBM Power 520 102 SPECint_rate_base2006, 124 SPECint_rate2006, 88.7 SPECfp_rate_base2006, 105 SPECfp_rate2006;
IBM Power 550 215 SPECint_rate_base2006, 263 SPECint_rate2006, 188 SPECfp_rate_base2006, 222 SPECfp_rate2006;
HP Integrity BL870c 114 SPECint_rate_base2006;
HP Integrity rx7640 87.4 SPECfp_rate_base2006, 90.8 SPECfp_rate2006.

Tuesday Jun 23, 2009

Significance of Results

A Sun Constellation system, composed of 48 Sun Blade X6440 server modules in a Sun Blade 6048 chassis, running OpenSolaris 2008.11 and using the Sun Studio 12 Update 1 compiler delivered World Record SPEC CPU2006 rate results.

On the SPECint_rate_base2006 benchmark, Sun delivered 4.7 times the performance of the IBM Power 595 (5GHz POWER6); this IBM system requires a slightly larger cabinet than the Sun Blade 6048 chassis (details below).

On the SPECfp_rate_base2006 benchmark, Sun delivered 3.9 times the performance of the largest IBM Power 595 (5GHz POWER6); this IBM system requires a slightly larger cabinet than the Sun Blade 6048 chassis (details below).

  • The Sun Constellation System equipped with AMD Opteron QC 8384 2.7 GHz processors, running OpenSolaris 2008.11 and using the Sun Studio 12 Update 1 compiler, delivered the World Record SPECint_rate_base2006 score of 8840.
  • This SPECint_rate_base2006 score is more than three times the previous record.
  • The Sun Constellation System equipped with AMD Opteron QC 8384 2.7 GHz processors, running OpenSolaris 2008.11 and using the Sun Studio 12 Update 1 compiler, delivered the fastest x86 SPECfp_rate_base2006 score of 6500.
  • This SPECfp_rate_base2006 score is about nine times the previous x86 record.

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results.

SPECint_rate2006

Columns: System, Processor Type, GHz, Chips, Cores, Peak, Base, Notes (1)
Sun Blade 6048 Opteron 8384 2.7 192 768 na 8840 New Record
SGI Altix 4700 Density System Itanium 9150M 1.66 128 256 3354 2893 Previous Best
SGI Altix 4700 Bandwidth System Itanium2 9040 1.6 128 256 2971 2715
Fujitsu/Sun SPARC Enterprise M9000 SPARC64 VII 2.52 64 256 2290 2090
IBM Power 595 POWER6 5.0 32 64 2160 1870 Best POWER6

(1) Results as of 23 June 2009 from www.spec.org.

SPECfp_rate2006

Columns: System, Processor Type, GHz, Chips, Cores, Peak, Base, Notes (2)
SGI Altix 4700 Density System Itanium 9140M 1.66 512 1024 na 10580
Sun Blade 6048 Opteron 8384 2.7 192 768 na 6500 New x86 Record
SGI Altix 4700 Bandwidth System Itanium2 9040 1.6 128 256 3507 3419
IBM Power 595 POWER6 5.0 32 64 2184 1681 Best POWER6
Fujitsu/Sun SPARC Enterprise M9000 SPARC64 VII 2.52 64 256 2005 1861
SGI Altix 4700 Bandwidth System Itanium 9150M 1.66 128 256 1947 1832
SGI Altix ICE 8200EX Intel X5570 2.93 8 32 742 723

(2) Results as of 23 June 2009 from www.spec.org.

Results and Configuration Summary

Hardware Configuration:
    1 x Sun Blade 6048
      48 x Sun Blade X6440, each with
        4 x 2.7 GHz QC AMD Opteron 8384 processors
        32 GB, (8 x 4GB)

Software Configuration:

    O/S: OpenSolaris 2008.11
    Compiler: Sun Studio 12 Update 1
    Other SW: MicroQuill SmartHeap Library 9.01 x64
    Benchmark: SPEC CPU2006 V1.1

Key Points and Best Practices

The Sun Blade 6048 chassis is able to contain a variety of server modules. In this case, the Sun Blade X6440 was used to provide this capacity solution. This single rack delivered results which have not been seen in this form factor.

Running this many jobs requires a reasonably good file server to host the benchmark. A Sun Fire X4540 server provided the required disk space, which the blades accessed over NFS.

Sun has shown 4.7x greater SPECint_rate_base2006 and 3.9x greater SPECfp_rate_base2006 in a slightly smaller cabinet. IBM specifications are at: http://www-03.ibm.com/systems/power/hardware/595/specs.html. One frame (slimline doors): 79.3"H x 30.5"W x 58.5"D weight: 3,376 lb. One frame (acoustic doors): 79.3"H x 30.5"W x 71.1"D weight: 3,422 lb. The Sun Blade 6048 specifications are at: http://www.sun.com/servers/blades/6048chassis/specs.xml One Sun Blade 6048: 81.6"H x 23.9"W x 40.3"D weight: 2,300 lb (fully configured). 
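
The cabinet comparison can be worked through from the dimensions quoted above. This is only a sketch: it treats each cabinet as a simple box, uses the IBM slimline-door dimensions, and ignores anything outside the cabinet (such as the file server). The helper name volume_cubic_feet is illustrative.

```python
CUBIC_INCHES_PER_CUBIC_FOOT = 12 ** 3


def volume_cubic_feet(height_in, width_in, depth_in):
    """Volume of a box given its dimensions in inches."""
    return height_in * width_in * depth_in / CUBIC_INCHES_PER_CUBIC_FOOT


# Dimensions in inches, as quoted from the vendor specification pages.
ibm_595 = volume_cubic_feet(79.3, 30.5, 58.5)   # one frame, slimline doors
sun_6048 = volume_cubic_feet(81.6, 23.9, 40.3)  # one Sun Blade 6048

print(f"IBM Power 595 frame: {ibm_595:.1f} cubic feet")
print(f"Sun Blade 6048:      {sun_6048:.1f} cubic feet")
print(f"Sun volume as a fraction of IBM: {sun_6048 / ibm_595:.0%}")

# SPECint_rate_base2006 per cubic foot, using the scores from this post.
print(f"Sun: {8840 / sun_6048:.0f} per cubic foot")
print(f"IBM: {1870 / ibm_595:.0f} per cubic foot")
```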

Disclosure Statement:

SPEC, SPECint, and SPECfp are registered trademarks of the Standard Performance Evaluation Corporation. Results from www.spec.org as of 6/22/2009 and this report. Sun Blade 6048 chassis with Sun Blade X6440 server modules (48 nodes with 4 chips, 16 cores, 16 threads each, OpenSolaris 2008.11, Studio 12 update 1) - 8840 SPECint_rate_base2006, 6500 SPECfp_rate_base2006; IBM p595, 1870 SPECint_rate_base2006, 1681 SPECfp_rate_base2006.

See Also

Tuesday Jun 09, 2009

Not the Free Compiler That You Thought, No, This Other One.

Nehalem performance measured with several software configurations

Contributed by: John Henning and Karsten Guthridge

Introduction

The GNU C Compiler, GCC, is popular, widely available, and an exemplary collaborative effort.

But how does it do for performance -- for example, on Intel's latest hot "Nehalem" processor family? How does it compare to the freely available Sun Studio compiler?

Using the SPEC CPU benchmarks, we take a look at this question. These benchmarks depend primarily on performance of the chip, the memory hierarchy, and the compiler. By holding the first two of these constant, it is possible to focus in on compiler contributions to performance.

Current Record Holder

The current SPEC CPU2006 floating point speed record holder is the Sun Blade X6270 server module. Using 2x Intel Xeon X5570 processor chips and 24 GB of DDR3-1333 memory, it delivers a result of 45.0 SPECfp_base2006 and 50.4 SPECfp2006. [1]

We used this same blade system to compare GCC vs. Studio. On separate, but same-model disks, this software was installed:

  • SuSE Linux Enterprise Server 11.0 (x86_64) and GCC V4.4.0 built with gmp-4.3.1 and mpfr-2.4.1
  • OpenSolaris 2008.11 and Sun Studio 12 Update 1

Tie One Hand Behind Studio's Back

In order to make the comparison more fair to GCC, we took several steps.

  1. We simplified the tuning for the OpenSolaris/Sun Studio configuration. This was done in order to counter the criticism that one sometimes hears that SPEC benchmarks have overly aggressive tuning. Benchmarks were optimized with a reasonably short tuning string:

    For all:  -fast -xipo=2 -m64 -xvector=simd -xautopar
    For C++, add:  -library=stlport4
  2. Recall that SPEC CPU2006 allows two kinds of tuning: "base", and "peak". The base metrics require that all benchmarks of a given language use the same tuning. The peak metrics allow individual benchmarks to have differing tuning, and more aggressive optimizations, such as compiler feedback. The simplified Studio configuration used only the less aggressive base tuning.

Both of the above changes limited the performance of Sun Studio.  Several measures were used to increase the performance of GCC:

  1. We tested the latest released version of GCC, 4.4.0, which was announced on 21 April 2009. In our testing, GCC 4.4.0 provides about 10% better overall floating point performance than V4.3.2. Note that GCC 4.4.0 is more recent than the compiler that is included with recent Linux distributions such as SuSE 11, which includes 4.3.2; or Ubuntu 8.10, which updates to 4.3.2 when one does "apt-get install gcc". It was installed with the math libraries mpfr 2.4.1 and gmp 4.3.1, which are labeled as the latest releases as of 1 June 2009.

  2. A tuning effort was undertaken with GCC, including testing of -O2 -O3 -fprefetch-loop-arrays -funroll-all-loops -ffast-math -fno-strict-aliasing -ftree-loop-distribution -fwhole-program -combine and -fipa-struct-reorg

  3. Eventually, we settled on this tuning string for GCC base:

    For all:  -O3 -m64 -mtune=core2 -msse4.2 -march=core2
    -fprefetch-loop-arrays -funroll-all-loops
    -Wl,-z common-page-size=2M
    For C++, add:  -ffast-math

    Only the C++ benchmarks used -ffast-math because 435.gromacs, which uses C and Fortran, fails validation with this flag. (Note: we verified that the benchmarks successfully obtained 2MB pages.)

Studio wins by 2x, even with one hand tied behind its back

At this point, a fair base-to-base comparison can be made, and Sun Studio/OpenSolaris finishes the race while GCC/Linux is still looking for its glasses: 44.8 vs. 21.1 (see Table 1). Notice that Sun Studio provides more than 2x the performance of GCC.

Table 1: Initial Comparisons, SPECfp2006, Sun Studio/Solaris vs. GCC/Linux

Configuration                                                      Base   Peak
Industry FP Record (Sun Studio 12 Update 1, OpenSolaris 2008.11)   45.0   50.4
Studio/OpenSolaris: simplify above (less tuned)                    44.8
GCC V4.4 / SuSE Linux 11                                           21.1

Notes: All results reported are from rule-compliant, "reportable" runs of the SPEC CPU2006 floating point suite, CFP2006. "Base" indicates the metric SPECfp_base2006. "Peak" indicates SPECfp2006. Peak uses the same benchmarks and workloads as base, but allows more aggressive tuning. A base result may, optionally, be quoted as peak, but the converse is not allowed. For details, see SPEC's Readme1st.

Fair? Did you say "Fair"?

Wait, wait, the reader may protest - this is all very unfair to GCC, because the Studio result used all 8 cores on this 2-chip system, whereas GCC used only one core! You're using trickery!

To this plaintive whine, we respond that:

  • Compiler auto-parallelization technology is not a trick. Rather, it is an essential technology in order to get the best performance from today's multi-core systems. Nearly all contemporary CPU chips provide support for multiple cores. Compilers should do everything possible to make it easy to take advantage of these resources.

  • We tried to use more than one core for GCC, via the -ftree-parallelize-loops=n flag. GCC's autoparallelization appears to be in a much earlier development stage than Studio's, since we did not observe any improvement for any value of "n" that we tested. From the GCC wiki, it appears that a new autoparallelization effort is under development, which may improve its results at a later time.

  • But, all right, if you insist, we will make things even harder for Studio, and see how it does.

Tie Another Hand Behind Studio's Back

The earlier section described several ways in which the performance comparison had already been made easier for GCC. We then took these additional measures:

  1. Removed the autoparallelization from Studio, substituting instead a request for 2MB pagesizes (which the GCC tuning already had).

  2. Added "peak" tuning to GCC: for benchmarks that benefit, add -ffast-math, and compiler profile-driven feedback

At this point, Studio base beats GCC base by 38%, and Studio base beats GCC peak by more than 25% (see Table 2).

Table 2: Additional Comparisons, SPECfp2006, Sun Studio/Solaris vs. GCC/Linux

Configuration                                  Base   Peak
Sun Studio/OpenSolaris: base only, noautopar   29.1
GCC V4.4 / SuSE Linux 11                       21.1   23.1

The notes from Table 1 apply here as well.

Bottom line

The freely available Sun Studio 12 Update 1 compiler on OpenSolaris provides more than double the performance of GCC V4.4 on SuSE Linux, as measured by SPECfp_base2006.

If compilation is restricted to avoid using autoparallelization, Sun Studio still wins by 38% (base to base), or by more than 25% (Studio base vs. GCC peak).
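
For readers who want to check the percentages, the sketch below recomputes them from the scores in Tables 1 and 2. It is plain arithmetic on the published numbers. The dictionary keys and the helper name are illustrative labels of ours, not SPEC terminology.

```python
# SPECfp2006 scores taken from Tables 1 and 2. Keys are illustrative labels.
scores = {
    "studio_base_autopar": 44.8,    # Table 1, simplified Studio tuning
    "studio_base_noautopar": 29.1,  # Table 2, autoparallelization removed
    "gcc_base": 21.1,               # Tables 1 and 2
    "gcc_peak": 23.1,               # Table 2
}


def percent_faster(a, b):
    """Return how much higher score a is than score b, in percent."""
    return (a / b - 1.0) * 100.0


speedup_autopar = scores["studio_base_autopar"] / scores["gcc_base"]
base_vs_base = percent_faster(scores["studio_base_noautopar"], scores["gcc_base"])
base_vs_peak = percent_faster(scores["studio_base_noautopar"], scores["gcc_peak"])

print(f"Studio base (autopar) vs. GCC base: {speedup_autopar:.2f}x")
print(f"Studio base (no autopar) vs. GCC base: {base_vs_base:.0f}% faster")
print(f"Studio base (no autopar) vs. GCC peak: {base_vs_peak:.0f}% faster")
```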

YMMV

Your mileage may vary. It is certain that both GCC and Studio could be improved with additional tuning efforts. Both provide dozens of compiler flags, which can keep the tester delightfully engaged for an unbounded number of days. We feel that the tuning presented here is reasonable, and that additional tuning effort, if applied to both compilers, would not radically alter the conclusions.

Additional Information

The results disclosed in this article are from "reportable" runs of the SPECfp2006 benchmarks, which have been submitted to SPEC.

[1] SPEC and SPECfp are registered trademarks of the Standard Performance Evaluation Corporation. Competitive comparisons are based on data published at www.spec.org as of 1 June 2009. The X6270 result can be found at http://www.spec.org/cpu2006/results/res2009q2/cpu2006-20090413-07019.html.

Monday Jun 08, 2009

Last Friday, Chris posted SPECpower results on our new group blog, BestPerf, in an entry called "Interpreting Sun's SPECpower_ssj2008 Publications". It is well worth a read.

In the coming days you will see a wide variety of other results posted. These may include: speccpu, specfp, specint, specjappserver, specweb, specjbb, specomp, specpower, specjvm, specmail, igen, Ansys, Nastran, sap-sd, Siebel, peoplesoft, TPC-C, TPC-E, TPC-H, etc.

This blog copyright 2009 by John Henning