Extremely Fast Pattern Matching on Sun SPARC Enterprise T5220/T5240
Friday Aug 08, 2008
Sun SPARC Enterprise T5220 / T5240 beats IBM Cell Broadband Engine with significantly easier application code development!
Pattern matching or string searching are important to a variety of commercial, government and HPC applications. One of the core functions needed for text identification algorithms in data repositories is real-time string searching. For this benchmark, both IBM and Sun used the Aho-Corasick algorithm for string searching.
Note: Got this from an internal website on info that is going public.
The 2-chip Sun SPARC Enterprise T5240 performed string searching at a rate of 6.12 GB/s (49.0 Gbit/sec) whereas the 2-chip IBM Cell Broadband Engine DD3 Blade performed string searching at a rate of 0.48 GB/s (3.8 Gbit/sec).
The 1-chip Sun SPARC Enterprise T5220 performed string searching at a rate of 3.08 GB/s (24.6 Gbits/s).
The Sun SPARC Enterprise T5240 demonstrated a 2x speedup over the Sun SPARC Enterprise T5220.
The Aho-Corasick algorithm as deployed on the IBM Cell Broadband Engine DD3 Blade required substantial optimization and tuning to achieve the reported performance, whereas on the Sun SPARC Enterprise T5220 or T5240 only a basic implementation of the algorithm and a simple compilation were needed.
Performance Summary
| System | Throughput (GBits/sec) |
Chips | Cores | GHz |
|---|---|---|---|---|
| Sun SPARC Enterprise T5240 |
49.0 | 2 | 16 | 1.4 |
| Sun SPARC Enterprise T5220 |
24.6 | 1 | 8 | 1.4 |
| IBM Cell Broadband Engine DD3 Blade |
3.8 | 2 | 16 | 3.2 |
IBM results are obtained from Figure 7(d) of IEEE Computer, Volume 41, Number 4, pp. 42-50, April 2008. Sun benchmark results as of 08/05/2008.
Benchmark Description
One of the core functions needed for text identification algorithms in data repositories is real-time string searching. This string searching benchmark demonstrates the usefulness of Sun's UltraSPARC T2 and T2 Plus processors for both ease of code creation and speed of code execution.
In IEEE Computer, Volume 41, Number 4, pp. 42-50, April 2008, IBM describes a variant of the Aho-Corasick string searching algorithm that uses deterministic finite automata. The algorithm first constructs a graph that represents a dictionary, then walks that graph using successive input characters from a text file. Each "state" in the graph includes a state transition table (STT) that is accessed using the next input character from the text file to determine the address of the next state in the graph. IBM defines an automaton as a two-step loop that: (1) obtains the address of the next state from the STT, and (2) fetches the next state in the graph.
IBM reports the performance of its Cell Broadband Engine (CBE) to execute this algorithm to search a 4.4 MB version of the King James Bible using a dictionary of the 20,000 most used words in the English language (average word length of 7.59 characters). Each of the 8 synergistic processing elements (SPEs) of each of the two CBEs executes 16 automata, for a total of 256 automata. All automata and hence all SPEs access a single, shared dictionary.
IBM describes elaborate optimizations of the Aho-Corasick algorithm, including
state shuffling, state replication, alphabet shuffling and state caching.
These optimizations were required to: (1) overcome "memory congestion", i.e.,
contention amongst the SPEs for access to the shared dictionary, and (2)
compensate for the limited local storage that is associated with each SPE.
These optimizations were necessary to achieve the performance reported for
the CBE DD3 Blade.
In order to closely approximate the dictionary and Bible that were used by IBM,
Sun used a dictionary of 25,144 English words (the Open Solaris
file cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/spell/list)
for which the average word length is 8.22 characters, and a 4.6 MB version
of the King James Bible (www.patriot.net/users/bmcgin/kjv12.zip). For
reporting of results in Gbits/s, the length of a byte is assumed to be 8 bits.
In order to demonstrate the usefulness of Sun's UltraSPARC T2 and T2 Plus
processors for both ease of code generation and speed of code execution, Sun
implemented the Aho-Corasick algorithm using ANSI C. No optimizations of the
algorithm were required to achieve the performance reported for the T5220 and TT5240.
The source code was compiled using the -m64 -xO3 and -xopenmp options. The
dictionary is represented using a graph that comprises 187 MB. Each core of
the T5220 or T5240 executes 8 automata using one OpenMP thread per automaton.
Thus, the T5220 executes 64 total automata and the T5240 executes 128 total
automata. All automata and hence all cores access a single, shared dictionary.
Access to this dictionary is accelerated by the large, shared L2 caches of the
Sun SPARC Enterprise T5220 and T5240.
Disclosure Statement:
Pattern Matching: Sun SPARC Enterprise T5240 (2 x 1.4 GHz UltraSPARC T2 Plus, 2 chips, 16 cores),
Solaris 10, Sun C 5.9, 49.0 GBits/sec;
Sun SPARC Enterprise T5220 (1 x 1.4 GHz UltraSPARC T2, 1 chip, 8 cores),
Solaris 10, Sun C 5.9, 24.6 GBits/sec;
IBM Cell Broadband Engine DD3 Blade (2 x 3.2 GHz Cell Broadband Engine,
2 chips, 16 cores), Linux kernel v2.6.16, IBM CBE Software Development Kit
v2.1, 3.8 GBits/sec.
System Configuration
Throughput (GBits/sec)
24.6 T5220
 
49.0 T5240
Reference Date:
August 5, 2008
Systems:
Sun SPARC Enterprise T5220, T5240
Total Number Processors:
1, 2
Processor/GHz of Server:
1.4 GHz UltraSPARC T2, T2 Plus
Operating System:
Solaris 10
Tags: cell ibm performance sparc sun t5220 t5240 ultrasparc










