The Navel of Narcissus
Josh Simons' Coordinates in the Blogosphere

Monday June 16, 2008

HPC Consortium: Texas Advanced Computing Center

Karl Schulz, Associate Director of High Performance Computing at the Texas Advanced Computing Center (TACC), delivered the first customer presentation on Day Two of the Sun HPC Consortium meeting here in Dresden. Karl gave an update on the Ranger supercomputer and shared some experiences with the system since it went fully into production in February of this year.

Karl first reviewed Ranger's specs, a few of which caught my attention. First, the system is memory-rich, with 2 GB of memory per core rather than the perhaps more common 1 GB/core. With almost 63K cores, this translates to an impressive total of 123 TeraBytes of main memory. On the storage side, I was happy to hear that real applications have achieved close to 40 GB/sec of aggregate IO bandwidth using their Lustre cluster file system. Physically, Ranger's Lustre file systems are configured across 72 Thumper (Sun Fire X4500) storage nodes with a raw total capacity of 1.7 PetaBytes.
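For anyone who wants to check the memory figure, here is a quick back-of-the-envelope sketch in Python. The node count is the one Karl quoted for the cabling discussion; the 16 cores per node is my own assumption (four quad-core sockets per node), not a number from his slides, and I am counting 1 TB as 1024 GB.

```python
# Back-of-the-envelope check of Ranger's aggregate main memory.
NODES = 3936          # node count quoted in the cabling discussion
CORES_PER_NODE = 16   # ASSUMPTION: four quad-core sockets per node
GB_PER_CORE = 2       # memory per core, from Karl's talk


def total_memory_tb(nodes, cores_per_node, gb_per_core):
    """Return (total cores, total memory in TB), taking 1 TB = 1024 GB."""
    cores = nodes * cores_per_node
    return cores, cores * gb_per_core / 1024


cores, mem_tb = total_memory_tb(NODES, CORES_PER_NODE, GB_PER_CORE)
print(f"{cores} cores, {mem_tb:.1f} TB")
```

Under those assumptions the result lines up with both the "almost 63K cores" and the 123 TeraBytes mentioned above.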

When we think about supercomputers, compute nodes are what most commonly come to mind first, followed by storage nodes. But as Karl points out, there are a large number of other critical functions that are needed to actually run a supercomputer like Ranger. He indicated that Ranger includes 25 management servers that perform the following functions: login servers, ROCKS master node, Sun Grid Engine masters, Sun Connection Management servers (for hardware monitoring), InfiniBand subnet manager, Lustre Metadata Servers, archive data movers, and GridFTP servers.

I got the sense that TACC has a love/hate relationship with their InfiniBand cabling. On the one hand, we had some early manufacturing issues with our Sun Constellation System cables that caused problems during the Build phase at TACC and much work was required to get past these issues. On the other hand, Sun's cabling approach allowed for a massive reduction in the number of cables required. As Karl pointed out, Ranger uses 1312 12x-12x InfiniBand cables to wire its 3936 nodes to the central Constellation switches, which is considerably fewer cables than TACC needed to wire an earlier and smaller IB cluster with only 1400 nodes. Of course, with a total of 15.4 kilometers of cabling, there is still an inherent amount of logistical complexity that cannot be avoided.
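The cable numbers are easy to sanity-check with a few lines of Python. This is only a rough sketch: I am assuming the 15.4 kilometers refers to the same 1312 cables, which Karl did not state explicitly, and the "one cable per node" comparison is a hypothetical baseline rather than a description of any real system.

```python
# Rough arithmetic on the InfiniBand cabling figures from the talk.
CABLES = 1312            # 12x-12x cables
NODES = 3936             # nodes wired to the central switches
TOTAL_CABLE_KM = 15.4    # ASSUMPTION: total length covers these same cables


def cabling_summary(cables, nodes, total_km):
    """Return nodes served per cable, average cable length in meters,
    and cables saved versus a hypothetical one-cable-per-node layout."""
    nodes_per_cable = nodes / cables
    avg_length_m = total_km * 1000 / cables
    cables_saved = nodes - cables
    return nodes_per_cable, avg_length_m, cables_saved


per_cable, avg_m, saved = cabling_summary(CABLES, NODES, TOTAL_CABLE_KM)
print(f"{per_cable:.0f} nodes per cable, "
      f"about {avg_m:.1f} m per cable, {saved} fewer cables")
```

Each cable ends up serving three nodes, which is where the reduction in cable count comes from.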

From Karl's perspective, the primary challenges they've encountered so far with Ranger were the cabling issues, quad-core processor delays, and the logistics associated with the sheer number of components in the system, including BIOS and firmware updates for compute blades, switches, HCAs, etc.

On the software side, Karl mentioned scalability and performance issues with Sun Grid Engine, as well as problems with MPI collective performance and job startup. He showed some nice improvements in job startup time that had been implemented in MVAPICH by OSU. (Separately, I was happy to hear from our Sun MPI team that Open MPI is also making considerable strides in this area.)

Since going into full production in February, the system has been heavily used. Based on a graph Karl showed, it looked like over 10M wallclock hours have been used running parallel jobs that consume between 16K and 32K cores. Another popular job size seemed to be in the 2K - 4K core range with many much smaller jobs being run as well.

Karl showed some scaling results for a CFD code that implements a Direct Numerical Simulation (DNS) of turbulence. The code has very challenging requirements: about 75% of the runtime is spent in MPI, it uses very fine-grain meshes, and it depends heavily on MPI collectives like all-to-all. It's a difficult application to handle well. Despite that, Karl showed results indicating strong scaling efficiency of about 98% for this application. (With strong scaling, problem size remains constant while the number of compute resources is increased. This is typically considered more difficult to handle than weak scaling, in which problem size is increased as resources are increased.)
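To make the strong versus weak distinction concrete, here is a minimal Python sketch of how the two efficiencies are usually computed. The core counts and timings in the example are made up purely for illustration; they are not Karl's measurements, and the function names are my own.

```python
def strong_scaling_efficiency(base_cores, base_time, cores, time):
    """Fixed problem size: measured speedup divided by ideal speedup."""
    speedup = base_time / time
    ideal_speedup = cores / base_cores
    return speedup / ideal_speedup


def weak_scaling_efficiency(base_time, time):
    """Problem size grows with core count: ideally runtime stays flat."""
    return base_time / time


# HYPOTHETICAL timings, chosen only to illustrate the formula:
# doubling the cores takes a fixed-size run from 100 s to 51 s.
eff = strong_scaling_efficiency(base_cores=4096, base_time=100.0,
                                cores=8192, time=51.0)
print(f"strong scaling efficiency: {eff:.1%}")
```

An efficiency near 100% under strong scaling means the runtime is dropping almost in proportion to the cores added, even though each core has less work to do and communication takes up a larger share of the run.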

When asked in the Q&A session about the most common failures on Ranger, Karl indicated that DIMM and CPU issues were the most common, followed by file system interruptions. Luckily, or rather as a result of a lot of careful engineering on the part of the Lustre team, 95-97% of the underlying file system failures did not result in the loss of any running jobs: they hang, but do not die.

