Tuesday June 26, 2007 | The Navel of Narcissus: Josh Simons' Coordinates in the Blogosphere
PetaScale Unveiled: Photos From Dresden
A few photos from the unveiling of the ultra-dense components of Sun's new Constellation System architecture tonight in Dresden.
Andy Bechtolsheim, Marc Hamilton, and a magnum of champagne
Behold The Switch!
The ultra-dense switch with cable management
Overall booth shot: crowd, press, etc.
A sea of switch ports...
Sun Constellation System: Petascale Computing Done Right
Today, in a technology preview at the International Supercomputing Conference in Dresden, Sun is revealing the approach we will use to build TACC's 500+ TFLOPS Ranger system later this year, as well as other large machines that have not yet been announced. We call systems built with this approach Sun Constellation Systems. Such systems can scale from TeraFLOPS up into the PetaFLOPS range. To my mind, Sun's approach to Petascale starts with this:
Yes, it's a connector. Specifically, this connector allows three 4X InfiniBand links to be run across a single cable, rather than three separate cables. The cable is also both higher quality and significantly less bulky than the three separate cables taken together. The connector itself is also mechanically and electrically superior to standard InfiniBand connectors. Sun (by which I mean AndyB and others working closely with him) has put a lot of thought into this. To build a petascale system effectively, one needs to examine current approaches closely and assess whether just "doing more of the same" really gets you where you need to be, or whether new thinking is needed.
This dense cabling approach complements our ultra-dense blade server and ultra-dense InfiniBand switch components. The blade server supports 48 blades (768 cores) in a single rack/chassis (the rack is the chassis in this case). The switch sports over 3000 4X InfiniBand ports, with all of the complexity of a multi-stage network compressed into a single chassis, removing the need for a huge number of cables and a large number of intermediate, discrete InfiniBand switches. Compared to a conventional approach to building a large cluster, the Sun Constellation System uses 1/6 the cables, has a 20% smaller footprint, and needs one switch rather than 300 (not a typo). The switch component is a double-wide chassis and it looks like this:
The blade chassis looks like this:
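As a quick sanity check on the density figures quoted in this post, here is a minimal Python sketch of the arithmetic. The function names are made up for illustration, and the port count of 3000 is an assumption: the post only says "over 3000", so this is a round lower bound rather than the real number.

```python
# Figures quoted in the post.
BLADES_PER_RACK = 48
CORES_PER_RACK = 768
LINKS_PER_CABLE = 3  # three 4X InfiniBand links share one cable

# Assumption for illustration only: the post says "over 3000" ports,
# so 3000 is a round lower bound, not the actual port count.
ASSUMED_SWITCH_PORTS = 3000


def cores_per_blade(cores_per_rack: int, blades_per_rack: int) -> int:
    """Cores on each blade, assuming cores are spread evenly across blades."""
    if cores_per_rack % blades_per_rack != 0:
        raise ValueError("cores do not divide evenly across blades")
    return cores_per_rack // blades_per_rack


def cables_needed(links: int, links_per_cable: int = 1) -> int:
    """Cables required to carry `links` links, rounding up to whole cables."""
    return (links + links_per_cable - 1) // links_per_cable


if __name__ == "__main__":
    print("cores per blade:", cores_per_blade(CORES_PER_RACK, BLADES_PER_RACK))
    print("cables, one link each:", cables_needed(ASSUMED_SWITCH_PORTS))
    print("cables, three links each:",
          cables_needed(ASSUMED_SWITCH_PORTS, LINKS_PER_CABLE))
```

Bundling three links per cable accounts for a factor of three on the switch-facing links only. The 1/6 figure quoted above presumably also reflects the inter-switch cables that disappear when the multi-stage network lives inside one chassis, which this sketch does not model.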
I'm walking over to the show floor at ISC here in Dresden in an hour or so and will post some photos later tonight.
(2007-06-26 07:53:09.0)
HPC Consortium: Shared Memory Parallelization on Multi-Core Processors
While I enjoyed Barton and Ruud's talks about the Niagara 2 processor yesterday at Sun's HPC Consortium meeting in Dresden, I always get more of a kick from customer presentations. In this case, Dieter an Mey from RWTH Aachen University gave a nice talk about the pitfalls and benefits of multi-core processors for programmers. Although it was delivered at a High Performance Computing event, the observations and lessons apply to anyone interested in application performance on multi-core processors. Given the direction of our industry, that encompasses a lot of programmers.
Dieter and his colleague Christian Terboven examined performance on several systems based on a variety of processors: the UltraSPARC IV, UltraSPARC T2 (Niagara 2), Intel Woodcrest and Clovertown, and a quad-core AMD Opteron. For each of these systems, they measured achievable aggregate bandwidth across a range of active thread counts, processor bindings, and memory placements. I've included a few of his slides with Niagara 2 performance results removed (sorry).
As Dieter said, the results aren't surprising once you look carefully at the non-uniformities in the underlying system architectures. Programmers who do not understand these issues, however, will likely see application performance that falls well short of what the hardware can deliver. My own concern is that as multi-core and multi-threaded processors become the norm across the computer industry, programmers will not understand these issues in the way a seasoned HPC programmer might. This is one small part of the challenge the software industry faces in helping programmers achieve high performance on these new processors.
In addition to running the above bandwidth tests, Dieter and Christian ran two applications on each of these systems. The first was a very cache-friendly code used to compute contact interactions between bevel gears. The second was a Navier-Stokes code that put high stress on the memory sub-system due to its manipulations of sparse data structures. They also did additional throughput tests, running multiple copies of these applications on each system. The results for Niagara 2 demonstrated the value of the CMT approach in hiding latency for throughput workloads, and showed that it can now do so for floating-point intensive codes thanks to the increased FP capabilities of this new processor.
Oh, and one more thing. The graph below shows performance results for the bevel gear code run with different numbers of threads. Look at Columns 5 and 6. These results were generated on Solaris and Linux using the same compilers and the exact same hardware. Higher is better. :-)
(2007-06-26 00:46:52.0)