String Searching Sun SPARC Enterprise T5440
Tuesday Oct 14, 2008
Outperforms IBM Cell Broadband Engine, and Significantly Simpler Applications Code Development!
Note: When I blogged that IBM was slowly getting more cores per chip (POWER6 is only dual-core; the new p560 uses a quad-core module but still dual-core chips), a commenter said that IBM had made huge advances in multi-core with the Cell processor. Quite simply, the IBM Cell is difficult to program. See the IBM paper discussed below, in which IBM researchers describe elaborate optimizations of the Aho-Corasick algorithm to:
(1) overcome "memory congestion", i.e., contention amongst the SPEs, and
(2) compensate for the limited local storage that is associated with each SPE.
...but what about performance? Read on...
Significance of Sun's Results
String searching and pattern matching are important to a variety of commercial, government and HPC applications. One of the core functions needed for text identification algorithms in data repositories is real-time string searching. Another example application is computer virus detection. For this benchmark, the IBM, HP and Sun systems all used the Aho-Corasick algorithm for string searching.
A Sun SPARC Enterprise T5440 could search a book as tall as Mt. Everest (29,208 feet, 861 GB book) in 68 seconds, which corresponds to a string search rate of 12.72 GB/s.
A Sun SPARC Enterprise T5440 can search at a rate of 12.72 GB/s, which corresponds to searching a book containing one terabyte of data (34,745 feet high) in only 79 seconds.
The 4-chip Sun SPARC Enterprise T5440 performed string searching at a rate of 12.7 GB/s, which is 26.8 times faster than the 2-chip IBM Cell Broadband Engine DD3 Blade that performed string searching at a rate of 0.475 GB/s.
The 4-chip Sun SPARC Enterprise T5440 performed string searching 3.3 times faster than the 4-chip HP DL-580 (2.93 GHz Xeon QC) that performed string searching at a rate of 3.87 GB/s. On other tests we have measured the power consumption of the HP DL-580 (2.93 GHz Xeon QC) to be 830 watts. Using this value for the power consumption of the HP DL-580, Sun estimates the Sun SPARC Enterprise T5440 to have a 1.8 times advantage in delivered power-performance.
The Aho-Corasick algorithm as deployed on the IBM Cell Broadband Engine DD3 Blade required substantial optimization and tuning to achieve the reported performance, whereas on the Sun SPARC Enterprise T5220, T5240 or T5440 only a basic implementation of the algorithm and a simple compilation were needed.
The Sun SPARC Enterprise T5440 demonstrated a 2x speedup over the Sun SPARC Enterprise T5240, which demonstrated a 2x speedup over the Sun SPARC Enterprise T5220.
The 2-chip Sun SPARC Enterprise T5240 performed string searching at a rate of 6.36 GB/s which is 13.4 times faster than the 2-chip IBM Cell Broadband Engine DD3 Blade that performed string searching at a rate of 0.475 GB/s.
The 2-chip Sun SPARC Enterprise T5240 performed string searching 1.6 times faster than the 4-chip HP DL-580 (2.93 GHz Xeon QC) that performed string searching at a rate of 3.87 GB/s.
The 1-chip Sun SPARC Enterprise T5220 performed string searching at a rate of 3.16 GB/s which is 6.7 times faster than the 2-chip IBM Cell Broadband Engine DD3 Blade that performed string searching at a rate of 0.475 GB/s.
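The ratios and search times quoted above follow directly from the measured rates. A minimal Python sketch that reproduces the arithmetic is shown here; the dictionary keys and function names are illustrative only, and the rates are the ones quoted in this post.

```python
# Search rates quoted in this post, in GB/s.
RATES = {
    "T5440": 12.72,
    "T5240": 6.36,
    "T5220": 3.16,
    "HP DL-580": 3.87,
    "IBM Cell DD3 Blade": 0.475,
}


def speedup(system, baseline):
    """How many times faster `system` is than `baseline`."""
    return RATES[system] / RATES[baseline]


def search_seconds(size_gb, system):
    """Seconds needed to scan `size_gb` gigabytes at the quoted rate."""
    return size_gb / RATES[system]


if __name__ == "__main__":
    for name in ("T5440", "T5240", "T5220"):
        ratio = speedup(name, "IBM Cell DD3 Blade")
        print(f"{name} vs IBM Cell blade: {ratio:.1f}x")
    for name in ("T5440", "T5240"):
        print(f"{name} vs HP DL-580: {speedup(name, 'HP DL-580'):.1f}x")
    print(f"861 GB on T5440: {search_seconds(861, 'T5440'):.0f} s")
    print(f"1 TB on T5440: {search_seconds(1000, 'T5440'):.0f} s")
```

The terabyte figure assumes 1 TB = 1000 GB, which is what the 79-second claim implies.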
Performance Landscape (in GBytes/sec)
| System | Throughput (GB/sec) | Chips | Cores | GHz |
|---|---|---|---|---|
| Sun SPARC Enterprise T5440 | 12.7 | 4 | 32 | 1.4 |
| Sun SPARC Enterprise T5240 | 6.4 | 2 | 16 | 1.4 |
| HP DL-580 | 3.9 | 4 | 16 | 2.9 |
| Sun SPARC Enterprise T5220 | 3.2 | 1 | 8 | 1.4 |
| IBM Cell Broadband Engine DD3 Blade | 0.48 | 2 | 16 | 3.2 |
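Because the systems differ in chip and core counts, it can help to normalize the table above. The sketch below divides each throughput by the listed chip and core counts. The per-chip and per-core numbers are derived here for illustration and are not separately measured results.

```python
# (system, GB/s, chips, cores) copied from the Performance Landscape table.
ROWS = [
    ("Sun SPARC Enterprise T5440", 12.7, 4, 32),
    ("Sun SPARC Enterprise T5240", 6.4, 2, 16),
    ("HP DL-580", 3.9, 4, 16),
    ("Sun SPARC Enterprise T5220", 3.2, 1, 8),
    ("IBM Cell Broadband Engine DD3 Blade", 0.48, 2, 16),
]


def normalized(rows):
    """Map each system to (GB/s per chip, GB/s per core)."""
    return {name: (rate / chips, rate / cores)
            for name, rate, chips, cores in rows}


if __name__ == "__main__":
    for name, (per_chip, per_core) in normalized(ROWS).items():
        print(f"{name}: {per_chip:.2f} GB/s per chip, "
              f"{per_core:.3f} GB/s per core")
```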
Benchmark Description
One of the core functions needed for text identification algorithms in data repositories is real-time string searching. This string searching benchmark demonstrates the usefulness of Sun's UltraSPARC T2 and T2 Plus processors for both ease of code creation and speed of code execution.
In IEEE Computer, Volume 41, Number 4, pp. 42-50, April 2008, IBM describes a variant of the Aho-Corasick string searching algorithm that uses deterministic finite automata. The algorithm first constructs a graph that represents a dictionary, then walks that graph using successive input characters from a text file. Each "state" in the graph includes a state transition table (STT) that is accessed using the next input character from the text file to determine the address of the next state in the graph. IBM defines an automaton as a two-step loop that: (1) obtains the address of the next state from the STT, and (2) fetches the next state in the graph.
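To make the description concrete, here is a minimal Python sketch of a textbook Aho-Corasick automaton in its deterministic (DFA) form. It is neither IBM's variant nor Sun's ANSI C code. The names `build_dfa` and `search` are illustrative, and the 256-entry table per state is an assumption that the input is treated as raw bytes.

```python
from collections import deque

ALPHABET = 256  # one STT entry per possible byte value (an assumption)


def build_dfa(words):
    """Build a full Aho-Corasick DFA from a list of non-empty byte strings.

    Returns (stt, outputs): stt[s][b] is the next state from state s on
    byte b, and outputs[s] lists the dictionary words that end at state s.
    """
    stt = [[0] * ALPHABET]  # state 0 is the root; 0 also means "no edge yet"
    outputs = [[]]
    # Step 1: insert every word into a trie.
    for word in words:
        state = 0
        for b in word:
            if stt[state][b] == 0:
                stt.append([0] * ALPHABET)
                outputs.append([])
                stt[state][b] = len(stt) - 1
            state = stt[state][b]
        outputs[state].append(word)
    # Step 2: breadth-first pass that computes failure links and fills in
    # every missing edge, so the search loop never has to backtrack.
    fail = [0] * len(stt)
    queue = deque(s for s in stt[0] if s)
    while queue:
        state = queue.popleft()
        outputs[state] = outputs[state] + outputs[fail[state]]
        for b in range(ALPHABET):
            nxt = stt[state][b]
            if nxt:
                fail[nxt] = stt[fail[state]][b]
                queue.append(nxt)
            else:
                stt[state][b] = stt[fail[state]][b]
    return stt, outputs


def search(stt, outputs, text):
    """Yield (end_offset, word) for every dictionary hit in `text`."""
    state = 0
    for i, b in enumerate(text):
        # The two-step loop: look up the next state in the STT, then
        # move to that state and report any words that end there.
        state = stt[state][b]
        for word in outputs[state]:
            yield i + 1, word


if __name__ == "__main__":
    stt, outputs = build_dfa([b"he", b"she", b"his", b"hers"])
    print(list(search(stt, outputs, b"ushers")))
```

Each input byte costs exactly one table lookup, regardless of dictionary size. That makes throughput depend mainly on how quickly the memory system can serve STT accesses.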
IBM reports the performance of its Cell Broadband Engine (CBE) to execute this algorithm to search a 4.4 MB version of the King James Bible using a dictionary of the 20,000 most used words in the English language (average word length of 7.59 characters). Each of the 8 synergistic processing elements (SPEs) of each of the two CBEs executes 16 automata, for a total of 256 automata. All automata and hence all SPEs access a single, shared dictionary.
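The automata count is simply 8 SPEs x 2 CBEs x 16 automata = 256. The paper summary above does not say how the input text is divided among automata, so the Python sketch below shows one hypothetical scheme. It splits the text into ranges and credits each match to the range where it starts, so matches that straddle a boundary are counted exactly once. A naive `bytes.find` scan stands in for the automaton, and all names are illustrative.

```python
SPES_PER_CBE = 8
CBES = 2
AUTOMATA_PER_SPE = 16
TOTAL_AUTOMATA = SPES_PER_CBE * CBES * AUTOMATA_PER_SPE  # 256


def chunk_bounds(text_len, n_chunks):
    """Split [0, text_len) into at most n_chunks contiguous ranges."""
    step = max(1, -(-text_len // n_chunks))  # ceiling division
    return [(s, min(s + step, text_len)) for s in range(0, text_len, step)]


def count_in_chunk(text, words, start, end):
    """Count dictionary hits that begin inside [start, end).

    A hit may extend past `end`; it still belongs to this chunk because
    it starts here. Naive stand-in for one automaton (an assumption).
    """
    hits = 0
    for word in words:
        pos = text.find(word, start)
        while pos != -1 and pos < end:
            hits += 1
            pos = text.find(word, pos + 1)
    return hits


if __name__ == "__main__":
    words = [b"he", b"she", b"his", b"hers"]
    text = b"he said she shares his hers " * 50
    # Each chunk is independent, so each could run on its own automaton.
    per_chunk = [count_in_chunk(text, words, s, e)
                 for s, e in chunk_bounds(len(text), TOTAL_AUTOMATA)]
    print(sum(per_chunk) == count_in_chunk(text, words, 0, len(text)))
```

Every chunk reads the same dictionary, which is the shared-access pattern behind the contention that IBM's optimizations target.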
IBM describes elaborate optimizations of the Aho-Corasick algorithm, including state shuffling, state replication, alphabet shuffling and state caching. These optimizations were required to: (1) overcome "memory congestion", i.e., contention amongst the SPEs for access to the shared dictionary, and (2) compensate for the limited local storage that is associated with each SPE. These optimizations were necessary to achieve the performance reported for the CBE DD3 Blade.
To closely approximate the dictionary and Bible that IBM used, Sun used a dictionary of 25,143 English words (the OpenSolaris file cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/spell/list), for which the average word length is 7.2 characters, and a 4.6 MB version of the King James Bible (www.patriot.net/users/bmcgin/kjv12.zip). For reporting of results in Gbits/s, the length of a byte is assumed to be 8 bits.
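The two dictionary statistics and the unit conversion mentioned above are straightforward to compute. A small Python sketch follows; the helper names are hypothetical and the sample word list is made up purely to exercise the function.

```python
def average_word_length(words):
    """Mean length in characters over a non-empty list of words."""
    return sum(len(w) for w in words) / len(words)


def gb_to_gbit(rate_gb_per_s, bits_per_byte=8):
    """Convert GB/s to Gbit/s, assuming 8 bits per byte as stated above."""
    return rate_gb_per_s * bits_per_byte


if __name__ == "__main__":
    sample = ["he", "she", "his", "hers"]  # toy list, not the real dictionary
    print(f"average word length: {average_word_length(sample):.2f}")
    for name, rate in (("T5440", 12.7), ("T5240", 6.36), ("T5220", 3.16)):
        print(f"{name}: {gb_to_gbit(rate):.1f} Gbit/s")
```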
In order to demonstrate the usefulness of Sun's UltraSPARC T2 and T2 Plus processors for both ease of code generation and speed of code execution, Sun implemented the Aho-Corasick algorithm using ANSI C. No optimizations of the algorithm were required to achieve the performance reported for the T5220, T5240 and T5440.
Disclosure Statement:
String Searching: Sun SPARC Enterprise T5440 (4 x 1.4 GHz UltraSPARC T2 Plus, 4 chips, 32 cores), Solaris 10, Sun C 5.9, 12.7 GB/sec; Sun SPARC Enterprise T5240 (2 x 1.4 GHz UltraSPARC T2 Plus, 2 chips, 16 cores), Solaris 10, Sun C 5.9, 6.36 GB/sec; Sun SPARC Enterprise T5220 (1 x 1.4 GHz UltraSPARC T2, 1 chip, 8 cores), Solaris 10, Sun C 5.9, 3.16 GB/sec; IBM Cell Broadband Engine DD3 Blade (2 x 3.2 GHz Cell Broadband Engine, 2 chips, 16 cores), Linux kernel v2.6.16, IBM CBE Software Development Kit v2.1, 0.475 GB/sec; HP DL-580 (4 x 2.9 GHz Intel Xeon X7350, 4 chips, 16 cores), SuSE Linux Enterprise Server v10 patch 1, Sun C 5.9, 3.87 GB/sec.
IBM results are obtained from Figure 7(d) of IEEE Computer, Volume 41, Number 4, pp. 42-50, April 2008. Sun benchmark results as of 10/13/2008.
Results Summary
| Item | Value |
|---|---|
| Throughput (GB/sec): | 12.7 (T5440), 6.36 (T5240), 3.16 (T5220) |
| Reference Date: | October 13, 2008 |
| Systems: | Sun SPARC Enterprise T5220, T5240, T5440 |
| Total Number Processors: | 1, 2, 4 |
| Processor/GHz of Server: | 1.4 GHz UltraSPARC T2, T2 Plus |
| Operating System: | Solaris 10 |










