BM Seer Unofficial thoughts from an anonymous Sun employee

Extremely Fast Pattern Matching on Sun SPARC Enterprise T5220/T5240

Friday Aug 08, 2008

Sun SPARC Enterprise T5220 / T5240 beats IBM Cell Broadband Engine with significantly easier application code development!

Pattern matching or string searching are important to a variety of commercial, government and HPC applications. One of the core functions needed for text identification algorithms in data repositories is real-time string searching. For this benchmark, both IBM and Sun used the Aho-Corasick algorithm for string searching.

Note: Got this from an internal website on info that is going public.

The 2-chip Sun SPARC Enterprise T5240 performed string searching at a rate of 6.12 GB/s (49.0 Gbit/sec) whereas the 2-chip IBM Cell Broadband Engine DD3 Blade performed string searching at a rate of 0.48 GB/s (3.8 Gbit/sec).

The 1-chip Sun SPARC Enterprise T5220 performed string searching at a rate of 3.08 GB/s (24.6 Gbits/s).

The Sun SPARC Enterprise T5240 demonstrated a 2x speedup over the Sun SPARC Enterprise T5220.

The Aho-Corasick algorithm as deployed on the IBM Cell Broadband Engine DD3 Blade required substantial optimization and tuning to achieve the reported performance, whereas on the Sun SPARC Enterprise T5220 or T5240 only a basic implementation of the algorithm and a simple compilation were needed.

Performance Summary

System Throughput
(GBits/sec)
Chips Cores GHz
Sun SPARC Enterprise
T5240
49.0 2 16 1.4
Sun SPARC Enterprise
T5220
24.6 1 8 1.4
IBM Cell Broadband Engine
DD3 Blade
3.8 2 16 3.2

IBM results are obtained from Figure 7(d) of IEEE Computer, Volume 41, Number 4, pp. 42-50, April 2008. Sun benchmark results as of 08/05/2008.

Benchmark Description

One of the core functions needed for text identification algorithms in data repositories is real-time string searching. This string searching benchmark demonstrates the usefulness of Sun's UltraSPARC T2 and T2 Plus processors for both ease of code creation and speed of code execution.

In IEEE Computer, Volume 41, Number 4, pp. 42-50, April 2008, IBM describes a variant of the Aho-Corasick string searching algorithm that uses deterministic finite automata. The algorithm first constructs a graph that represents a dictionary, then walks that graph using successive input characters from a text file. Each "state" in the graph includes a state transition table (STT) that is accessed using the next input character from the text file to determine the address of the next state in the graph. IBM defines an automaton as a two-step loop that: (1) obtains the address of the next state from the STT, and (2) fetches the next state in the graph.

IBM reports the performance of its Cell Broadband Engine (CBE) to execute this algorithm to search a 4.4 MB version of the King James Bible using a dictionary of the 20,000 most used words in the English language (average word length of 7.59 characters). Each of the 8 synergistic processing elements (SPEs) of each of the two CBEs executes 16 automata, for a total of 256 automata. All automata and hence all SPEs access a single, shared dictionary.

IBM describes elaborate optimizations of the Aho-Corasick algorithm, including state shuffling, state replication, alphabet shuffling and state caching. These optimizations were required to: (1) overcome "memory congestion", i.e., contention amongst the SPEs for access to the shared dictionary, and (2) compensate for the limited local storage that is associated with each SPE. These optimizations were necessary to achieve the performance reported for the CBE DD3 Blade. IBM does not provide references that indicate where to obtain the dictionary and Bible. IBM reports the algorithmic performance in Gbits/s but does not indicate whether an 8-bit byte is extended to 10 bits as required for network transmission.

In order to closely approximate the dictionary and Bible that were used by IBM, Sun used a dictionary of 25,144 English words (the Open Solaris file cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/spell/list) for which the average word length is 8.22 characters, and a 4.6 MB version of the King James Bible (www.patriot.net/users/bmcgin/kjv12.zip). For reporting of results in Gbits/s, the length of a byte is assumed to be 8 bits.

In order to demonstrate the usefulness of Sun's UltraSPARC T2 and T2 Plus processors for both ease of code generation and speed of code execution, Sun implemented the Aho-Corasick algorithm using ANSI C. No optimizations of the algorithm were required to achieve the performance reported for the T5220 and TT5240.

The source code was compiled using the -m64 -xO3 and -xopenmp options. The dictionary is represented using a graph that comprises 187 MB. Each core of the T5220 or T5240 executes 8 automata using one OpenMP thread per automaton. Thus, the T5220 executes 64 total automata and the T5240 executes 128 total automata. All automata and hence all cores access a single, shared dictionary. Access to this dictionary is accelerated by the large, shared L2 caches of the Sun SPARC Enterprise T5220 and T5240.

Disclosure Statement:

Pattern Matching: Sun SPARC Enterprise T5240 (2 x 1.4 GHz UltraSPARC T2 Plus, 2 chips, 16 cores), Solaris 10, Sun C 5.9, 49.0 GBits/sec; Sun SPARC Enterprise T5220 (1 x 1.4 GHz UltraSPARC T2, 1 chip, 8 cores), Solaris 10, Sun C 5.9, 24.6 GBits/sec; IBM Cell Broadband Engine DD3 Blade (2 x 3.2 GHz Cell Broadband Engine, 2 chips, 16 cores), Linux kernel v2.6.16, IBM CBE Software Development Kit v2.1, 3.8 GBits/sec.

System Configuration

Throughput (GBits/sec) 24.6   T5220
  49.0   T5240
Reference Date: August 5, 2008
Systems: Sun SPARC Enterprise T5220, T5240
Total Number Processors: 1, 2
Processor/GHz of Server: 1.4 GHz UltraSPARC T2, T2 Plus
Operating System: Solaris 10

Like this post? del.icio.us | furl | slashdot | technorati | digg