Monday November 17, 2008 | The Navel of Narcissus Josh Simons' Coordinates in the Blogosphere |
|
How to Observe Performance of OpenMP Codes
A great benefit of the OpenMP standard is that it allows a programmer to specify parallelization strategies, leaving the implementation details to the compiler and its runtime system. A downside of this is that the programmer loses some understanding and visibility into what is actually happening, making it difficult to find and fix performance problems. This is precisely the issue discussed by Professor Barbara Chapman from the University of Houston during her talk at the Sun HPC Consortium Meeting here in Austin today. Prof. Chapman briefly described the work she has been doing using the OpenUH compiler as a research base. The older POMP project had used source-level instrumentation and source-to-source translation to produce codes that allowed some access to performance information, but the approach wasn't very popular. Instead, instrumentation has now been directly implemented in the compiler and inserted much later in the compilation process. This allowed the instrumentation to be both improved and also reduced to a more selective set of probe points, greatly reducing the overhead of instrumentation. Professor Chapman touched on a few application examples in which this selective implementation approach has resulted in significant performance improvements with little work needed to pinpoint the problem areas within the code. In one example, application performance was easily increased by between 20 and 25% over a range of problem sizes. In another case involving an untuned OpenMP code, the instrumentation quickly pointed to incorrect usage of shared arrays and initialization problems related to first-touch memory allocation. A second thrust of this research work is to take advantage of the fact that the OpenMP runtime layer is basically in charge as the application executes. Because it controls execution, it can also be used to gather runtime performance information as part of a performance monitoring system. Both of these techniques contribute to giving the programmer tools to performance debug their codes at the semantic level at which it was initially written, which is critically important as more and more HPC (and other) users attempt to extract good parallel performance from existing and future multi-core chips. (2008-11-17 15:22:58.0) Permalink Comments [0] Project Thebes Update from Georgetown University
The big news from Arnie Miles, Senior Systems Architect at Georgetown University, is that the Thebes Middleware Consortium has moved from concept to code with a new prototype of a service provider based on DRMAA that mediates access to an unmodified Sun Grid Engine instance from a small Java-based client app. In addition, the Thebes Consortium has just released a first draft of an XML schema that attempts to create a language that harmonizes how jobs and resources are described in a resource-sharing environment and sits above the specific approaches taken by existing systems like Ganglia, Sun Grid Engine, PBS, Condor, LSF, etc.) The proposal will soon be submitted to OGF for consideration. The next nut to crack is the definition of a resource discovery network, which is under development now. The team hopes to be able to share their work on this at ISC in Hamburg in June of next year. (2008-11-17 10:29:52.0) Permalink Comments [0] Dealing with Data Proliferation at Clemson
Jim
Pepin, CTO for Clemson
University talked today at the HPC Consortium
Meeting
about the challenges and problems created by emerging technology
and socal trends as seen through the lens of a university environment.
As a preamble, Jim noted that between 1970 and now the increases in compute and storage capabilities have pretty much kept pace with each other. Networking bandwidth, however, has lagged by about two orders of magnitude. This has a variety of ramifications for local/centralized data storage decisions (or constraints.) In many ways, storage is moving closer to end-users. Examples include personal storage like iPods, phones, local NAS boxes, etc, as well as more research-oriented data collection efforts related to the proliferation of new sensors and instrumentation. There is data everywhere in vast quantities and widely distributed across a typical university environment. Particular issues of concern at Clemson include how to back up these distributed and rapidly-growing pools of storage, how to handle security, how to protect data while still being able to open networks, and how to deal with a wide diversity of systems and data-generating instruments. (2008-11-17 10:12:06.0) Permalink Comments [0] So, What About Java for HPC?
About ten years ago the HPC community attempted to embrace Java as a viable approach for high performance computing via a forum called Java Grande. That effort ultimately failed for various reasons, one of which was the difficulty of achieving acceptable performance for interesting HPC workloads. Today at the HPC Consortium Meeting here in Austin, Professor Denis Caromel from the University of Nice made the case that Java is ready now for serious HPC use. He described the primary features of ProActive Java, a joint project between INRIA and University of Nice CNRS, and provided some performance comparisons against Fortran/MPI benchmarks. As background, Denis explained that the goal of ProActive is to enable parallel, distributed, and multi-core solutions with Java using one unified framework. Specifically, the approach should scale from a single, multi-core node to a large, enterprise-wide grid environment. ProActive embraces three primary areas: Programming, Optimizing, and Scheduling. The programming approach is based on the use of active objects to create a dataflow-like asynchronous communication framework in which objects can be instantiated in either separate JVMs or within the same address space in the case of a multi-core node. Objects are instantiated asynchronously on the receiver side and then represented immediately on the sender side by "future objects" which will be populated asynchronously when the remote computation completes. Accessing future events whose contents have not yet arrived causes a "wait by necessity" which implements the dataflow synchronization mechanism. ProActive also supports a SPMD programming style with many of the same primitives found in MPI -- e.g., barriers, broadcast, reductions, scatter-gather, etc. Results for several NAS parallel benchmarks were presented, in particular CG, MG, and EP. On CG, the ProActive version performed at essentially the same speed as the Fortran/MPI version over a range of problem sizes from 1-32 processes. Fortran did better on MG and this seems to relate to issues around large memory footprints, which the ProActive team is looking at in more detail. With EP, Java was faster or significantly faster in virtually all cases. Work continues to lower messaging latency, to optimize in-node data transfers by sending pointers rather than data, and to reduce message-size overhead. When asked how ProActive compares to X10, Denis pointed out that while X10 does share some concepts with ProActive, X10 is a new language while ProActive is designed to run on standard Java JVMs and to enable to use of standard Java for HPC. A full technical paper about ProActive in PDF format is available here. (2008-11-17 09:30:48.0) Permalink Comments [0] A Customer's View of Sun's HPC Consortium Meeting One of our customers, Gregg TeHennepe from Jackson Laboratory, has been blogging about his attendance at Sun's HPC Consortium meeting here in Austin. For his perspectives and for some excellent photos of the Ranger supercomputer at TACC, check out his blog, Mental Burdocks. (2008-11-17 09:26:12.0) Permalink Comments [0]If You Doubt the Utility of GPUs for HPC, Read this
Professor Satoshi Matsuoka from the Tokyo Institute of Technology gave a really excellent talk this afternoon about using GPUs for HPC at the HPC Consortium Meeting here in Austin. As you may know, the Tokyo Institute of Technology is the home of TSUBAME, the largest supercomputer in Asia. It is an InfiniBand cluster of 648 Sun Fire x4600 compute nodes, many with installed Clearspeed accelerator cards. The desire is to continue to scale TSUBAME into a petascale computing resource over time. However, power is a huge problem at the site. The machine is responsible for roughly 10% of the overall power consumption of the Institute and therefore they cannot expect their power budget to grow over time. The primary question, then, is how to add significant compute capacity to the machine while working within a constant power budget. It was clear from their analysis that conventional CPUs would not allow them to reach their performance goals while also satisfying the no-growth power constraint. GPUs--graphical processing units like those made by nVidia--looked appealing in that they claim extremely high floating point capabilities and deliver this performance at a much better performance/watt ratio that conventional CPUs. The question, though, is whether GPUs can be used to significantly accelerate important classes of HPC computations or whether they are perhaps too specialized to be considered for inclusion in a general-purpose compute resource like TSUBAME. Professor Matsuoka's talk focused on this question. The talk approached the question by presenting performance speed-up results for a selection of important HPC applications or computations based on algorithmic work done by Prof. Matsuoka and other researchers at the Institute. These studies were done in part because GPU vendors do a very poor job of describing exactly what GPUs are good for and what problems are perhaps not handled well by GPUs. By assessing the capabilities over a range of problem areas, it was hoped that conclusions could be drawn about the general utility of the GPU approach for HPC. The first problem examined was a 3D protein docking analysis that performs an all-to-all analysis of 1K proteins to 1K proteins. Based on their estimates, a single protein-protein interaction analysis requires about 200 TeraOps while the full 1000x1000 problem requires about 200 ExaOps. In order to maximally exploit GPUs for this problem, a new 3D FFT algorithm was developed that in the end delivered excellent performance and a 4x better performance/watt over IBM's BG/L system, which itself is much more efficient than a more conventional cluster approach. In addition, other algorithmic work delivered speedups of 45X over single conventional CPUs for CFD, which is typically limited by available bandwidth. Likewise, a computation involving phase separation liquid delivered a speedup of 160X over a conventional processor. Having looked at single node performance and compared it to a single-node GPU approach and found that GPUs do appear to able to deliver interesting performance and performance/watt for an array of useful problem types so long as new algorithms can be created to exploit the specific capabilities of these GPUs, the next question was whether these results could be extended to multi-GPU and cluster environments. To test this, the team worked with the RIKEN Himeno CFD benchmark, which is considered the worst memory bandwidth-limited code one will ever see. It is actually worse than any real application one would ever encounter. If this could be parallelized and used with GPUs to advantage, then other less difficult codes should also benefit from the GPU approach. To test this, the code was parallelized to run using multiple GPUs per node and with MPI as the communication mechanism between nodes. Results showed about a 50X performance improvement over a conventional CPU cluster on a small-sized problem. A multi-GPU parallel sparse solver was also created which showed a 25X-35X improvement over conventional CPUs. This was accomplished using double precision implemented using mixed-precision techniques. While all of these results seemed promising, could such a GPU approach be deployed at scale in a very large cluster rather than just within a single node or across a modest-sized cluster? The Institute decided to find out by teaming with nVidia and Sun to enhance TSUBAME by adding Tesla GPUs to some (most) nodes. Installing the Tesla cards into the system went very smoothly and resulted in three classes of nodes: those with both Clearspeed and Tesla installed, those with only Tesla installed, and those Opteron nodes with neither kind of accelerator installed. Could this funky array of heterogeneous nodes be harnessed to deliver an interesting LINPACK number? It turns out that it could, with much work and in spite of the fact that there was limited bandwidth in the upper links of the InfiniBand fabric and that they had limited PCIx/PCIe bandwidth available in the nodes (I believe due to the number and types of slots available in the x4600 and the number of required devices in some of the TSUBAME compute nodes.) As a result of the LINPACK work (which could have used more time--it was deadline-limited) the addition of GPU capability in TSUBAME allowed its LINPACK number to be raised from 67.7 TFLOPs, which was reported in June, to a new high of 77.48 TFLOPs which shows an impressive increase. With the Tesla cards installed, TSUBAME can now be viewed as a 900 TFLOPs (single precision) or 170 TFLOPs (double precision) machine. A machine that has either 10K cores or 300K SIMD cores if one counts the components embedded within each installed GPU. The conclusion is pretty clearly that GPUs can be used to significant advantage on an interesting range of HPC problem types, though it is worth noting that it also appears that significantly clever, new algorithms may also need to be developed to map these problems efficiently onto GPU compute resources. (2008-11-16 21:28:20.0) Permalink Comments [0] |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||