The Navel of Narcissus
Josh Simons' Coordinates in the Blogosphere

Monday November 12, 2007

University of Warsaw: New HPC Perspectives and Prospects

Marek Niezgodka, Director of the Interdisciplinary Centre for Mathematical and Computational Modelling at the University of Warsaw, spoke this weekend at the HPC Consortium meeting in Reno.

ICM is a high-end computing center for research and applications in Poland, a national laboratory in computational and informational sciences, and a partner and leader on multiple grid projects.

ICM research focuses on several areas, including:

  • Distributed information systems and grids, including healthcare infrastructure and distributed high-end computing
  • Quantitative biology of systems, including bioinformatics, functional proteomics, protein engineering, etc.
  • Design and characterization of functional materials, which includes nanomaterials and bionanomaterials.
  • Biomedical modelling of blood circulation and physiology, tissue engineering, and imaging.
  • Non-linear process dynamics of complex networks

In addition to research activities, the Center is heavily involved in delivering wide-area services. For example, it provides numerical weather prediction for central Europe at a 4 km horizontal resolution, with additional prediction for the northern Atlantic and Asia. ICM also functions as a knowledge repository and a healthcare grid for cardiology, and it offers large-scale data processing and analysis for industry and the public sector.

ICM is currently undergoing a significant infrastructure expansion, including a doubling of staff to approximately 300 by 2010 and data expansion to between 5 and 10 Petabytes by 2009. Compute capabilities will be expanded to a total capacity of approximately 100 TFLOPs. This deployment is currently underway and will be completed in 2008. The core of this system is built with Sun Constellation components, including Thumper (X4500) storage.


(2007-11-12 14:28:05.0)

Multicore Performance Analysis Tools from Academia

Karl Fuerlinger, from the Innovative Computing Laboratory at the University of Tennessee at Knoxville, spoke about multicore performance analysis tools at the HPC Consortium meeting here in Reno yesterday. He focused on tools available from academia rather than vendor-supplied tools.

In Karl's view, the vendor tools are powerful and commercially supported but typically limited to the vendor's platform, while academic tools are generally cross-platform, often include advanced or experimental techniques such as automated performance analysis, and tend to focus more on high levels of scalability.

Popular academic tools include:

  • PAPI, which supports platform-independent access to hardware counters. PAPI has recently been expanded to support access to additional counter types beyond CPU counters. Temperature sensors, HW events on NICs, and instrumentation on memory interfaces are examples. It is possible to generate composite displays showing time-lines of FLOP rates, system temperature, etc.
  • TAU, which has extensive support for tracing and profiling and is considered by many to be the Swiss Army knife of profiling tools.
  • KOJAK/SCALASCA, which offers trace-based automatic performance analysis capabilities. It does this by automatically searching traces for patterns of inefficiencies, with demonstrated scalability to 22K processes.
  • Vampir, a tracefile visualization tool for MPI that has applicability for other programming models as well.
  • ompP, a profiling tool for OpenMP and the focus of Karl's work. ompP uses a source-based instrumentation approach to gain independence from specific compilers and runtimes. It is tested and supported on Linux, Solaris, and AIX, and with the Pathscale, PGI, gcc, IBM, and Sun compilers. Codes are instrumented to understand how much time is spent in imbalance, synchronization, limited parallelism, and thread management states. Incremental and continuous profiling are supported.

Karl pointed out that these academic tools generally interoperate with each other. For example, PAPI can be used by most of the above tools to access performance counter information. Profiles can be gathered by several of these tools and then visualized with TAU. Trace data collected with these tools can be fed into the KOJAK/SCALASCA automatic trace analysis capabilities, and traces generated from TAU or KOJAK/SCALASCA can be visualized with Vampir.


(2007-11-12 14:19:47.0)

100 TFLOPs Insufficient?

James Leylek, Executive Director of the Clemson University Computational Center for Mobility Systems, spoke at the HPC Consortium meeting about the computational requirements for simulation of vehicle-related phenomena.

A main point of Dr. Leylek's talk was that unsteady simulations are required to adequately model the physical behavior of mobility systems. There are many cases in which unsteady or turbulent mechanisms dominate in this class of problems. There are boundary layer issues, laminar to turbulent transitions, so-called Type II transient flows, etc. The key, though, is finding appropriate numeric techniques to perform these simulations.

Typical mobility application areas include Formula-1 race cars, airplane wing design, engine fan design, aircraft carriers, submarines, engine block cooling, and blood flow through artificial hearts.

As an example of the problem sizes in this space, Leylek described what is required to simulate the aerodynamics of a Formula-1 race car. It requires 300M finite element volumes, with eight equations per volume, for a total of about 2.4B equations to be solved. And because of the unsteady nature of flows around these bodies, the simulations must be run for tens of thousands of time steps. This essentially means that dedicating even 100 TFLOPs to one team would not be sufficient to allow the dozens of "what if" experiments needed during the vehicle design phase. When one realizes that aerodynamics is just one of a number of attributes that must be simulated for this one application area, the situation becomes even more daunting.
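The equation count quoted above is simple arithmetic, and a small Python sketch makes the scale easier to see. The variable names are illustrative rather than taken from the talk, and the step count of 20,000 is only an assumed stand-in for "tens of thousands"; no per-equation cost is estimated here because the talk did not give one.

```python
# Problem-size arithmetic for the Formula-1 aerodynamics example.
volumes = 300_000_000        # finite element volumes, as quoted in the talk
equations_per_volume = 8     # as quoted in the talk
equations = volumes * equations_per_volume

# ASSUMPTION: 20,000 is an illustrative stand-in for "tens of thousands".
time_steps = 20_000
equation_solves = equations * time_steps

print(f"equations per time step: {equations:,}")
print(f"equation solves per run: {equation_solves:.1e}")
```

Multiplying the last figure by the dozens of design variations mentioned above gives a sense of why a single team could saturate a very large machine.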

There are a number of numerical methods that can be used to perform these simulations. Full unsteady simulation is impractical for the time being until much larger computational facilities are available at a more affordable cost. In the meantime, what to do? The Computational Center for Mobility Systems at Clemson brings together a large amount of Sun HPC gear and the algorithmic expertise to team with companies and other organizations to perform these simulations using the unique capabilities of semi-deterministic stress model (SDSM) techniques to deliver value to their partners in the shorter term. The point is to be smarter about how these problems should be solved and not be intimidated by the computational requirements predicted by extrapolations based on brute-force methodologies.


(2007-11-12 09:09:24.0)

UltraSPARC T2 for HPC: A Customer Assessment

Dieter an Mey, HPC Team Lead at RWTH Aachen's Center for Computing and Communication, presented an evaluation of the suitability of Sun's UltraSPARC T2 processor for High Performance Computing at the HPC Consortium meeting in Reno.

The Aachen study compares systems with the T2 processor against systems with Sun's UltraSPARC IV processor, with AMD Opteron processors, and with Intel Woodcrest and Clovertown processors. The test cases used were representative of a range of applications and attributes that are important to users at Aachen.

I will briefly summarize the results here and recommend those interested in more detail visit this page for a full explanation of the methodology and to view the detailed results.

Aachen examined several performance kernels: memory bandwidth, LINPACK, and sparse matrix-vector multiplication. They also examined results for several applications, including TFS, which is used to model nasal flow for computer-aided surgery. This code can be run in several ways using OpenMP for parallelization. They also ran FLOWer and a code that does contact analysis of bevel gears. In addition to these application tests, Aachen ran multiple instances of applications simultaneously to assess the throughput capabilities of each system. A power and performance/power analysis was also done.

The results showed that a combination of T2-based systems and x64/x86 systems would be ideal for Aachen. Very cache-friendly codes did not benefit as much from the T2 architecture, and these performed better on the Intel and AMD based systems. The bevel gear code is an example of such a code. TFS, on the other hand, performed better in throughput mode on the T2 system. In both cases the best results were 2X better than the alternative. That is, the Intel/AMD systems generally did about 2X better than the T2 system on cache-friendly codes, while the T2 system was 2X better in cases where memory bandwidth was a limiting factor.


(2007-11-12 08:12:43.0)

Ranger Update: TACC's Path to Petascale

Jay Boisseau, Director of the Texas Advanced Computing Center, site of Sun's largest supercomputer installation to date, gave an update on the Ranger system to the HPC Consortium in Reno.

Ranger is the first in a series of annual Track 2 NSF procurements that have been motivated by the findings of the NSF Cyberinfrastructure Strategic Plan, which is available in PDF format here.

There are several institutions involved in this procurement. TACC / UT Austin provides project leadership, hosts and runs Ranger, provides user support, etc. ICES / UT Austin provides algorithmic expertise and applications collaborations. The Cornell Center for Advanced Computing (formerly the Cornell Theory Center) provides large-scale data management and analysis and training. Arizona State HPCI contributes user support and technology evaluation and insertion.

So, just how big is this Big Iron? Just over one-half PetaFLOPs (504 TFLOPs), built with 3936 Sun four-socket blades, each socket populated by a four-core 2.0 GHz AMD Barcelona processor for a total of almost 63,000 cores. Memory is big as well, with 2 GB per core (32 GB/node) for a total of 125 Terabytes in the Ranger system.
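For readers who want to check these totals, here is a rough Python sketch of the arithmetic. The blade, socket, core, clock, and memory figures come from the paragraph above; the value of 4 double-precision floating-point operations per core per cycle is my assumption, not something Jay stated, and the variable names are purely illustrative.

```python
# Ranger compute-side totals, derived from the figures quoted above.
blades = 3936
sockets_per_blade = 4
cores_per_socket = 4
clock_ghz = 2.0
mem_per_core_gb = 2

# ASSUMPTION: 4 double-precision floating-point operations per core per cycle.
flops_per_cycle = 4

cores_per_node = sockets_per_blade * cores_per_socket
cores = blades * cores_per_node
mem_per_node_gb = mem_per_core_gb * cores_per_node
total_mem_tb = blades * mem_per_node_gb / 1000      # decimal terabytes
peak_tflops = cores * clock_ghz * flops_per_cycle / 1000

print(f"cores:          {cores:,}")
print(f"memory/node:    {mem_per_node_gb} GB")
print(f"total memory:   {total_mem_tb:.1f} TB")
print(f"peak (assumed): {peak_tflops:.1f} TFLOPs")
```

Under that assumption the peak comes out at just under 504 TFLOPs, and the memory total lands near the quoted 125 Terabytes, with the small difference depending on whether decimal or binary units are used.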

This being Texas, the disk subsystem does not disappoint with 1.7 Petabytes of storage built from 72 Sun X4500 (Thumper) I/O servers, each with 24 Terabytes delivering a total aggregate bandwidth of 72 Gbytes/sec. The largest filesystem built on this storage offers one Petabyte of storage.
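The storage figures can be checked the same way. This is a minimal sketch using only the numbers in the paragraph above; the per-server bandwidth is simply the aggregate divided evenly across servers, which is my simplifying assumption rather than a measured value.

```python
# Ranger storage totals, derived from the figures quoted above.
servers = 72                 # Sun X4500 (Thumper) I/O servers
tb_per_server = 24
aggregate_gb_per_s = 72

total_pb = servers * tb_per_server / 1000           # decimal petabytes

# ASSUMPTION: bandwidth is spread evenly across all servers.
per_server_gb_per_s = aggregate_gb_per_s / servers

print(f"raw capacity:     {total_pb:.2f} PB")
print(f"per-server share: {per_server_gb_per_s:.1f} GB/s")
```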

The system interconnect is InfiniBand using Mellanox's latest ConnectX InfiniBand cards and two of Sun's 3456-port Magnum switches. Interconnect link bandwidth is approximately 10 Gb/sec and latency is approximately 2.3 µs.

Physically, the system fits in 96 racks (82 compute, 12 support, 2 switches) that sit in about 4500 square feet along with 116 APC InRow cooling units. Due to the density of the Sun solution, floor space has not really been an issue. Power requirements, on the other hand, are quite daunting for a system of this size: of the 3.4 MW required to run Ranger, 1 MW is needed for cooling.

I was impressed to hear that Jay expects a significant number of applications to sustain 50-100 TFLOPs on Ranger--that is some serious application scaling! He predicted there will be a double-digit number of codes using over 10,000 cores by the end of 2008 and expects a few of these to run later this year.

In terms of software environment, Ranger is a Linux cluster that uses the ROCKS provisioning software to handle OS and application deployments, Lustre as its scalable parallel filesystem, and the OpenFabrics stack to control the InfiniBand interconnect. In addition, at least two MPI implementations will be used on Ranger -- MVAPICH and Open MPI. There will be several compiler suites available, including Sun Studio, Pathscale, and the Portland Group compilers. Sun Grid Engine will be used for job scheduling.

The impact Ranger will have on the capabilities of the TeraGrid is considerable as it will make more CPU hours available to TeraGrid users than all other current TeraGrid systems combined. At 504 TFLOPs, Ranger is 5X larger than the current top TeraGrid system.

Jay ended with a brief summary of the status of the Ranger installation process, which is ongoing. He characterized most things as good: TACC is happy with Barcelona performance, with the Sun Constellation blades, the performance of the InfiniBand fabric (Sun switch and Mellanox card), Thumper performance, the Sun racks, the APC cooling solution, and Sun Grid Engine.

There have been some BIOS issues that Sun and AMD have been working through and there have been some expected component failures due to the very large number of components involved in this system.

The most vexing problem, which we hope has now been solved, involved manufacturing issues related to the special InfiniBand cables used in the Constellation system. Apparently, some step in the manufacturing process introduced a crimp which caused connectivity problems. Correctly manufactured cables are now being put in place.

As Jay said, it has been through the extremely hard work of Sun and TACC personnel that the delays introduced by these problems have been largely overcome. I know the Sun folks I've talked with are working incredibly hard to make TACC successful.

The system is expected to be online in early December.


(2007-11-11 21:45:41.0)


 