BM Seer Unofficial thoughts from an anonymous Sun employee

2-chip world record SPECjbb2005 Sun Blade T6340 & Sun SPARC Enterprise T5240

Thursday Oct 23, 2008

World Record multi-JVM 2-chip performance. Nice way to end my day presenting some more performance facts...

The Sun Blade T6340 server module and Sun SPARC Enterprise T5240 server, each with two 1.4 GHz UltraSPARC T2 Plus processors, obtained the best multi-JVM 2-chip results on the SPECjbb2005 server-side Java benchmark.

A Sun Blade T6340 server equipped with two 1.4 GHz UltraSPARC T2 Plus processors delivered a World Record 2-chip result of 388456 SPECjbb2005 bops, 24279 SPECjbb2005 bops/JVM.

A Sun SPARC Enterprise T5240 server equipped with two 1.4 GHz UltraSPARC T2 Plus processors delivered an outstanding 2-chip result of 384934 SPECjbb2005 bops, 24058 SPECjbb2005 bops/JVM.

The Sun Blade T6340 (two 1.4 GHz UltraSPARC T2 Plus chips) demonstrated 11% better performance than the IBM Power 550 (four 4.2 GHz POWER6 chips) result of 350642 SPECjbb2005 bops, 87661 SPECjbb2005 bops/JVM.

The Sun Blade T6340 (two 1.4 GHz UltraSPARC T2 Plus chips) demonstrated 17% better performance than the IBM x3650 (two 3.3 GHz Xeon chips) result of 330605 SPECjbb2005 bops, 82651 SPECjbb2005 bops/JVM.

The Sun Blade T6340 and Sun SPARC Enterprise T5240 each used Solaris 10 10/08 and Sun JDK 1.6.0_06 Performance Release to obtain these leading results.
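
The 11% and 17% figures quoted above can be cross-checked from the published bops scores. The short Python sketch below is only an illustration of that arithmetic; the function name `percent_advantage` and the dictionary keys are made up for this example and are not part of any SPEC tooling.

```python
# SPECjbb2005 bops scores as quoted in this post.
scores = {
    "Sun Blade T6340": 388456,
    "IBM Power 550": 350642,
    "IBM x3650 (3.3 GHz)": 330605,
}


def percent_advantage(ours: int, theirs: int) -> float:
    """Return how much higher `ours` is than `theirs`, in percent."""
    return (ours / theirs - 1.0) * 100.0


vs_power = percent_advantage(scores["Sun Blade T6340"], scores["IBM Power 550"])
vs_xeon = percent_advantage(scores["Sun Blade T6340"], scores["IBM x3650 (3.3 GHz)"])

print(f"{vs_power:.1f}% over the IBM Power 550")
print(f"{vs_xeon:.1f}% over the IBM x3650")
```

Rounded to whole percentages, the two values agree with the 11% and 17% stated above.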

Benchmark Description

SPECjbb2005 (Java Business Benchmark) measures the performance of a Java-implemented application tier (server-side Java). The benchmark is based on the order processing in a wholesale supplier application. The performance of the user tier and the database tier is not measured in this test. The metrics given are number of SPECjbb2005 bops (Business Operations per Second) and SPECjbb2005 bops/JVM (bops per JVM instance).

Competitive Landscape

SPECjbb2005 Performance Chart (ordered by performance; bops = SPECjbb2005 Business Operations per Second; bigger is better)

System                      Chips,Cores,Threads  GHz   Processor Type       SPECjbb2005 bops  SPECjbb2005 bops/JVM
Sun Blade T6340             2,16,128             1.4   UltraSPARC T2 Plus   388456            24279
Sun SPARC Enterprise T5240  2,16,128             1.4   UltraSPARC T2 Plus   384934            24058
IBM Power 550               4,8,16               4.2   Dual-Core POWER6     350642            87661
IBM x3650                   2,8,8                3.3   QC Xeon              330605            82651
IBM x3650                   2,8,8                3.1   QC Xeon              323172            80793
Fujitsu RX200               2,8,8                3.3   QC Xeon              316728            79182
Dell M600                   2,8,8                2.3   QC Xeon              314513            78628
IBM BC HS21XM               2,8,8                3.0   QC Xeon              310028            77507
Dell PE 2950                2,8,8                3.16  QC Xeon              305411            76353

Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org.

Disclosure Statement:

SPEC and SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation. Results as of 10/17/2008 on spec.org. Sun Blade T6340 (2 chips, 16 cores) 388456 SPECjbb2005 bops, 24279 SPECjbb2005 bops/JVM, submitted for review. Sun SE T5240 (2 chips, 16 cores) 384934 SPECjbb2005 bops, 24058 SPECjbb2005 bops/JVM, submitted for review. IBM p550 (4 chips, 8 cores) 350642 SPECjbb2005 bops, 87661 SPECjbb2005 bops/JVM. IBM x3650 (2 chips, 8 cores) 330605 SPECjbb2005 bops, 82651 SPECjbb2005 bops/JVM. IBM x3650 (2 chips, 8 cores) 323172 SPECjbb2005 bops, 80793 SPECjbb2005 bops/JVM. Fujitsu RX200 (2 chips, 8 cores) 316728 SPECjbb2005 bops, 79182 SPECjbb2005 bops/JVM. Dell M600 (2 chips, 8 cores) 314513 SPECjbb2005 bops, 78628 SPECjbb2005 bops/JVM. IBM BC HS21XM (2 chips, 8 cores) 310028 SPECjbb2005 bops, 77507 SPECjbb2005 bops/JVM. Dell PE 2950 (2 chips, 8 cores) 305411 SPECjbb2005 bops, 76353 SPECjbb2005 bops/JVM.

Results Summary

T6340 Results: 388456 SPECjbb2005 bops, 24279 SPECjbb2005 bops/JVM
T5240 Results: 384934 SPECjbb2005 bops, 24058 SPECjbb2005 bops/JVM
Reference Date: Oct 21, 2008
Systems: Sun Blade T6340, Sun SPARC Enterprise T5240
Total Number Processors: 2, 2
Processor/GHz of Server: UltraSPARC T2 Plus 1.4 GHz
Operating System: Solaris 10 10/08
JVM: Java HotSpot(TM) 32-Bit Server, Version 1.6.0_06 Performance Release


Extremely Fast Pattern Matching on Sun SPARC Enterprise T5220/T5240

Friday Aug 08, 2008

Sun SPARC Enterprise T5220 / T5240 beats IBM Cell Broadband Engine with significantly easier application code development!

Pattern matching and string searching are important to a variety of commercial, government and HPC applications. One of the core functions needed for text identification algorithms in data repositories is real-time string searching. For this benchmark, both IBM and Sun used the Aho-Corasick algorithm for string searching.

Note: Got this from an internal website on info that is going public.

The 2-chip Sun SPARC Enterprise T5240 performed string searching at a rate of 6.12 GB/s (49.0 Gbits/s) whereas the 2-chip IBM Cell Broadband Engine DD3 Blade performed string searching at a rate of 0.48 GB/s (3.8 Gbits/s).

The 1-chip Sun SPARC Enterprise T5220 performed string searching at a rate of 3.08 GB/s (24.6 Gbits/s).

The Sun SPARC Enterprise T5240 demonstrated a 2x speedup over the Sun SPARC Enterprise T5220.

The Aho-Corasick algorithm as deployed on the IBM Cell Broadband Engine DD3 Blade required substantial optimization and tuning to achieve the reported performance, whereas on the Sun SPARC Enterprise T5220 or T5240 only a basic implementation of the algorithm and a simple compilation were needed.

Performance Summary

System                               Throughput (GBits/sec)  Chips  Cores  GHz
Sun SPARC Enterprise T5240           49.0                    2      16     1.4
Sun SPARC Enterprise T5220           24.6                    1      8      1.4
IBM Cell Broadband Engine DD3 Blade  3.8                     2      16     3.2

IBM results are obtained from Figure 7(d) of IEEE Computer, Volume 41, Number 4, pp. 42-50, April 2008. Sun benchmark results as of 08/05/2008.

Benchmark Description

One of the core functions needed for text identification algorithms in data repositories is real-time string searching. This string searching benchmark demonstrates the usefulness of Sun's UltraSPARC T2 and T2 Plus processors for both ease of code creation and speed of code execution.

In IEEE Computer, Volume 41, Number 4, pp. 42-50, April 2008, IBM describes a variant of the Aho-Corasick string searching algorithm that uses deterministic finite automata. The algorithm first constructs a graph that represents a dictionary, then walks that graph using successive input characters from a text file. Each "state" in the graph includes a state transition table (STT) that is accessed using the next input character from the text file to determine the address of the next state in the graph. IBM defines an automaton as a two-step loop that: (1) obtains the address of the next state from the STT, and (2) fetches the next state in the graph.
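
To make the description above concrete, here is a minimal Python sketch of an Aho-Corasick search driven by a dense state transition table. It is not the code used by IBM or Sun (Sun's implementation was in ANSI C and is not shown in this post); the names `build_stt` and `search`, the 256-column table layout and the tiny four-word dictionary are assumptions chosen purely for illustration.

```python
from collections import deque

ALPHABET = 256  # assumption: one STT column per possible byte value


def build_stt(words):
    """Return (stt, out): stt[state][byte] is the next state,
    out[state] lists the dictionary words that end in that state."""
    trie = [{}]  # sparse trie edges, used only while building
    out = [[]]
    for word in words:
        state = 0
        for byte in word:
            if byte not in trie[state]:
                trie.append({})
                out.append([])
                trie[state][byte] = len(trie) - 1
            state = trie[state][byte]
        out[state].append(word)

    stt = [[0] * ALPHABET for _ in trie]
    fail = [0] * len(trie)  # longest proper suffix that is also a trie prefix
    queue = deque()
    for byte, child in trie[0].items():
        stt[0][byte] = child
        queue.append(child)
    # Breadth-first pass: fill every row so that lookup never has to backtrack.
    while queue:
        state = queue.popleft()
        for byte in range(ALPHABET):
            child = trie[state].get(byte)
            if child is None:
                stt[state][byte] = stt[fail[state]][byte]
            else:
                fail[child] = stt[fail[state]][byte]
                out[child] = out[child] + out[fail[child]]
                stt[state][byte] = child
                queue.append(child)
    return stt, out


def search(stt, out, text):
    """Walk the graph over `text`, yielding (start_offset, word) per match."""
    state = 0
    for offset, byte in enumerate(text):
        state = stt[state][byte]      # step 1: next state address from the STT
        for word in out[state]:       # step 2: visit that state, report hits
            yield offset + 1 - len(word), word


stt, out = build_stt([b"he", b"she", b"his", b"hers"])
matches = list(search(stt, out, b"ushers"))
print(matches)
```

Each input byte costs exactly one table lookup regardless of dictionary size, which is why the size of the table and the latency of reaching it dominate performance.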

IBM reports the performance of its Cell Broadband Engine (CBE) to execute this algorithm to search a 4.4 MB version of the King James Bible using a dictionary of the 20,000 most used words in the English language (average word length of 7.59 characters). Each of the 8 synergistic processing elements (SPEs) of each of the two CBEs executes 16 automata, for a total of 256 automata. All automata and hence all SPEs access a single, shared dictionary.

IBM describes elaborate optimizations of the Aho-Corasick algorithm, including state shuffling, state replication, alphabet shuffling and state caching. These optimizations were required to: (1) overcome "memory congestion", i.e., contention amongst the SPEs for access to the shared dictionary, and (2) compensate for the limited local storage that is associated with each SPE. These optimizations were necessary to achieve the performance reported for the CBE DD3 Blade. IBM does not provide references that indicate where to obtain the dictionary and Bible. IBM reports the algorithmic performance in Gbits/s but does not indicate whether an 8-bit byte is extended to 10 bits as required for network transmission.

In order to closely approximate the dictionary and Bible that were used by IBM, Sun used a dictionary of 25,144 English words (the OpenSolaris file cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/spell/list) for which the average word length is 8.22 characters, and a 4.6 MB version of the King James Bible (www.patriot.net/users/bmcgin/kjv12.zip). For reporting of results in Gbits/s, the length of a byte is assumed to be 8 bits.
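
Under that 8-bit assumption, the GB/s and Gbits/s figures in this post are related by a simple multiplication. The Python sketch below reproduces the conversion and the T5240-over-T5220 speedup; the function name `gbytes_to_gbits` is invented for this example.

```python
BITS_PER_BYTE = 8  # the assumption stated above; no 8b/10b line encoding


def gbytes_to_gbits(gbytes_per_s: float) -> float:
    """Convert a throughput in GB/s to Gbits/s."""
    return gbytes_per_s * BITS_PER_BYTE


t5240 = gbytes_to_gbits(6.12)  # 2-chip Sun SPARC Enterprise T5240
t5220 = gbytes_to_gbits(3.08)  # 1-chip Sun SPARC Enterprise T5220
cell = gbytes_to_gbits(0.48)   # 2-chip IBM Cell Broadband Engine DD3 Blade

speedup = 6.12 / 3.08          # T5240 relative to T5220

print(f"T5240: {t5240:.1f} Gbits/s")
print(f"T5220: {t5220:.1f} Gbits/s")
print(f"Cell:  {cell:.1f} Gbits/s")
print(f"T5240 / T5220 speedup: {speedup:.2f}x")
```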

In order to demonstrate the usefulness of Sun's UltraSPARC T2 and T2 Plus processors for both ease of code generation and speed of code execution, Sun implemented the Aho-Corasick algorithm using ANSI C. No optimizations of the algorithm were required to achieve the performance reported for the T5220 and T5240.

The source code was compiled using the -m64, -xO3 and -xopenmp options. The dictionary is represented using a graph that comprises 187 MB. Each core of the T5220 or T5240 executes 8 automata using one OpenMP thread per automaton. Thus, the T5220 executes 64 total automata and the T5240 executes 128 total automata. All automata and hence all cores access a single, shared dictionary. Access to this dictionary is accelerated by the large, shared L2 caches of the Sun SPARC Enterprise T5220 and T5240.
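
The post does not say how the input text is divided among the automata. One plausible arrangement, sketched below in Python rather than C with OpenMP, gives each automaton a contiguous slice of the text and lets it read slightly past the end of its slice so that words straddling a boundary are still found exactly once. Everything here is an assumption for illustration: the names `count_matches` and `parallel_count`, the slicing scheme, and the naive `bytes.find` matcher standing in for the shared Aho-Corasick graph. Python threads also will not show the speedup that hardware threads on the T5220/T5240 would; the sketch only shows the decomposition.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(text, words, start, end):
    """Count dictionary hits that begin in text[start:end].

    Naive stand-in for one automaton; `words` must not contain empty entries.
    """
    total = 0
    for word in words:
        limit = end + len(word) - 1  # allow a match to run past the slice end
        pos = text.find(word, start, limit)
        while pos != -1:
            total += 1
            pos = text.find(word, pos + 1, limit)
    return total


def parallel_count(text, words, automata):
    """Split `text` into `automata` slices and scan them concurrently."""
    chunk = -(-len(text) // automata)  # ceiling division
    ranges = [
        (min(i * chunk, len(text)), min((i + 1) * chunk, len(text)))
        for i in range(automata)
    ]
    # All workers read the same `words` object: a single, shared dictionary.
    with ThreadPoolExecutor(max_workers=automata) as pool:
        return sum(pool.map(lambda r: count_matches(text, words, *r), ranges))


words = [b"he", b"she", b"his", b"hers"]
text = b"ushers and she said he hears hers"

serial = count_matches(text, words, 0, len(text))
threaded = parallel_count(text, words, 8)  # 8 automata, as on one T2 core
print(serial, threaded)
```

Counting only the matches that begin inside a slice is what keeps the parallel total equal to the serial one, whatever the number of automata.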

Disclosure Statement:

Pattern Matching: Sun SPARC Enterprise T5240 (2 x 1.4 GHz UltraSPARC T2 Plus, 2 chips, 16 cores), Solaris 10, Sun C 5.9, 49.0 GBits/sec; Sun SPARC Enterprise T5220 (1 x 1.4 GHz UltraSPARC T2, 1 chip, 8 cores), Solaris 10, Sun C 5.9, 24.6 GBits/sec; IBM Cell Broadband Engine DD3 Blade (2 x 3.2 GHz Cell Broadband Engine, 2 chips, 16 cores), Linux kernel v2.6.16, IBM CBE Software Development Kit v2.1, 3.8 GBits/sec.

System Configuration

Throughput (GBits/sec): 24.6 (T5220), 49.0 (T5240)
Reference Date: August 5, 2008
Systems: Sun SPARC Enterprise T5220, T5240
Total Number Processors: 1, 2
Processor/GHz of Server: 1.4 GHz UltraSPARC T2, T2 Plus
Operating System: Solaris 10
