The Navel of Narcissus
Josh Simons' Coordinates in the Blogosphere

20090320 Friday March 20, 2009

Amazon EC2: More Reserved Than Ever

Back in September, I expressed skepticism that a purely on-demand model for cloud computing would be sufficient for businesses to seriously commit to the cloud model as a way to run their businesses. Apparently, Amazon is an avid reader of the Navel (joke!), because they recently announced a new resource model -- Reserved Instances -- that in large part addresses the issue I raised. Specifically, it is now possible to pay up-front to reserve an instance for use over some period of time. In addition, when the instance is actually used, the rate is lower than for purely on-demand instances.

This hybrid model will appeal to those customers who worry about resource availability as demand for cloud computing resources continues to grow and certainly to those who have developed a business- or mission-critical reliance on access to these remote resources.
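
To make the trade-off concrete, here is a minimal sketch of the break-even arithmetic between the two pricing models. The prices are hypothetical (expressed in cents to keep the arithmetic exact) and are not Amazon's actual rates; the function names are mine.

```python
def break_even_hours(upfront_fee, on_demand_rate, reserved_rate):
    """Hours of use beyond which reserving is cheaper than pure on-demand."""
    if reserved_rate >= on_demand_rate:
        raise ValueError("reserved rate must be lower than the on-demand rate")
    return upfront_fee / (on_demand_rate - reserved_rate)


def total_cost(hours, hourly_rate, upfront_fee=0):
    """Total cost of running one instance for the given number of hours."""
    return upfront_fee + hours * hourly_rate


# Hypothetical prices in cents, chosen only to make the arithmetic easy.
UPFRONT_FEE = 30000      # one-time reservation fee
ON_DEMAND_RATE = 10      # per hour, no reservation
RESERVED_RATE = 4        # per hour, once reserved

hours = break_even_hours(UPFRONT_FEE, ON_DEMAND_RATE, RESERVED_RATE)
print(f"Reserving pays off after {hours:.0f} hours of use")
```

Under these made-up numbers, a customer who expects to run an instance for more than the break-even number of hours over the reservation period comes out ahead by reserving, in addition to gaining the availability guarantee.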


(2009-03-20 14:29:11.0) Permalink Comments [0]

20090318 Wednesday March 18, 2009

Australian Supercomputing: Who's Da BOM?

The Australian press has an article today about a new Sun supercomputer to be installed at the Australian Bureau of Meteorology (BOM). The new 1.5 TFLOP machine, which will be ten times more powerful than their current system, is said to be the largest in the southern hemisphere. The article is here.


(2009-03-18 20:01:44.0) Permalink Comments [0]

More Free HPC Developer Tools for Solaris and Linux

The Sun Studio team just released the latest version of our HPC developer tools with so many enhancements and additions it's hard to know where to start this blog entry. I suppose with the basics: As usual, all of the software is free. And available for both Solaris and Linux, specifically Solaris, OpenSolaris, RHEL, SuSE, and Ubuntu. Frankly, Sun would like to be your preferred provider for high-performance Fortran, C, and C++ compilers and tools. Given the performance and capabilities we deliver for HPC with Sun Studio, that seems a pretty reasonable goal to me. We think the price has been set correctly to achieve that as well. :-)

I have to admit to being confused by the naming convention for this release, but it goes something like this. The release is an EA (Early Access) version of Sun Studio 12 Update 1 -- the first major update to Sun Studio 12 since it was released in the summer of 2007. Since Sun Studio's latest and greatest bits are released every three months as part of the Express program, this release can also be called Sun Studio Express 3/09. Different names, same bits. Don't worry about it -- just focus on the fact that they make great compilers and tools. :-)

Regardless of what they call it, the release can be downloaded here. Take it for a spin and let the developers know what you think on the forum or file a request for enhancement (RFE) or a bug report here.

For the full list of new features, go here. For my personal list of favorite new features, read on.

  • Full OpenMP 3.0 compiler and tools support. For those not familiar, OpenMP is the industry standard for directives-based threaded application parallelization. Or, the answer to the question, "So how do I use all the cores and threads in my spiffy new multicore processor?"
  • ScaLAPACK 1.8 is now included in the Sun Performance Library! It works with Sun's MPI (Sun HPC ClusterTools), which is based on Open MPI 1.3. The Perflib team has also made significant performance enhancements to BLAS, LAPACK, and the FFT routines, including support for the latest Intel and AMD processors. Nice.
  • MPI performance analysis integrated into the Sun Performance Analyzer. Analyzer has for years been a kick-butt performance tool for single-process applications. It has now been extended to help MPI programmers deal with message-passing related performance problems.
  • Continued, aggressive attention paid to optimizing for the latest SPARC, Intel, and AMD processors. C, C++, and Fortran performance will all benefit from these changes.
  • A new standalone GUI debugger. Go ahead, graduate from printf() and try a real debugger. It won't bite.

As I mentioned above, full details on these new features and many, many more are all documented on this wiki page. And, again, the bits are here.

(2009-03-18 15:15:22.0) Permalink Comments [0]

20090109 Friday January 09, 2009

HPC and Virtualization: Oak Ridge Trip Report

Just before Sun's Winter Break, I attended a meeting at Oak Ridge National Laboratory in Tennessee with Stephen Scott, Geoffroy Vallee, Christian Engelmann, Thomas Naughton, and Anand Tikotekar, all of the Systems Research Team (SRT) at ORNL. Attending from Sun were Tim Marsland, Greg Lavender, Rebecca Arney, and myself. The topic was HPC and virtualization, an area the SRT has been exploring for some time and one I've been keen on as well as it has become clear v12n has much to offer the HPC community. This is my trip report.

I arrived at Logan Airport in Boston early enough on Monday to catch an earlier flight to Dulles, narrowly avoiding the five-hour delay that eventually afflicted my original flight. The flight from Boston to Knoxville via Dulles went smoothly and I arrived without difficulty to a rainy and chilly Tennessee evening. I was thrilled to have made it through Dulles without incident since more often than not I have some kind of travel difficulty when my trips pass through IAD (more on that later). The 25-mile drive to the Oak Ridge DoubleTree was uneventful.

Oak Ridge is still very much a Lab town from what I could see, much like Los Alamos, but certainly less isolated. Movie reviews in the Oak Ridge Observer are rated with atoms rather than stars. Stephen Scott, who leads the System Research Team (SRT) at ORNL, mentioned that the plot plan for his house is stamped "Top Secret -- Manhattan Project" because the plan shows the degree difference between "ORNL North" and "True North", an artifact of the time when period maps of the area deliberately skewed the position of Oak Ridge to lessen the chance that a map could be used to successfully bomb ORNL targets from the air during the war.

We spent all day Tuesday with Stephen and most of the System Research Team. Tim talked about what Sun is doing with xVM and our overall virtualization strategy and ended with a set of questions that we spent some time discussing. Greg then talked in detail about both Crossbow and InfiniBand, specifically with respect to aspects related to virtualization. We spent the rest of the day hearing about some of the work on resiliency and virtualization being done by the team. See the end of this blog entry for pointers to some of the SRT papers as well as other HPC/virtualization papers I have found to be interesting.

Resiliency isn't something the HPC community has traditionally cared much about. Nodes were thin and cheap. If a node crashed, restart the job, replace the node, use checkpoint-restart if you can. Move on; life on the edge is hard. But the world is changing. Nodes are getting fatter again--more cores, more memory, more IO. Big SMPs in tiny packages with totally different economics from traditional large SMPs. Suddenly there is enough persistent state on a node that people start to care how long their nodes stay up. Capabilities like Fault Management start to look really interesting, especially if you are a commercial HPC customer using HPC in production.

In addition, clusters are getting larger. Much larger, even with fatter nodes. Which means more frequent hardware failures. Bad news for MPI, the world's most brittle programming model. Certainly, some more modern programming models would be welcome, but in the meantime what can be done to keep these jobs running longer in the presence of continual hardware failures? This is one promise of virtualization. And one reason why a big lab like ORNL is looking seriously at virtualization technologies for HPC.

Live migration -- the ability to shift running OS instances from one node to another -- is particularly interesting from a resiliency perspective. Linking live migration to a capable fault management facility (see, for example, what Sun has been doing in this area) could allow jobs to avoid interruption due to an impending node failure. Research by the SRT (see the Proactive Fault Tolerance paper, below) and others has shown this is a viable approach for single-node jobs and also for increasing the survivability of MPI applications in the presence of node failures. Admittedly, the current prototype depends on Xen TCP tricks to handle MPI traffic interruption and continuation, but with sufficient work to virtualize the InfiniBand fabric, this technique could be extended to that realm as well. In addition, the use of an RDMA-enabled interconnect can itself greatly increase the speed of live migration as is demonstrated in the last paper listed in the reference section below.
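
As a rough illustration of the idea, the sketch below plans evacuations of guest OS instances from nodes whose health has degraded. Everything here is hypothetical: the `Node` class, the `health` score, and the threshold stand in for whatever a real fault management facility would report, and the returned plan stands in for actual live-migration requests. It is not how the SRT prototype works.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    health: float                 # hypothetical score: 1.0 healthy, 0.0 about to fail
    guests: list = field(default_factory=list)


def plan_migrations(nodes, failing_below=0.5):
    """Return (guest, source, target) moves that evacuate unhealthy nodes."""
    healthy = [n for n in nodes if n.health >= failing_below]
    if not healthy:
        return []                 # nowhere to go; fall back to checkpoint-restart
    # Track planned load so evacuated guests are spread across the targets.
    load = {n.name: len(n.guests) for n in healthy}
    moves = []
    for node in nodes:
        if node.health >= failing_below:
            continue
        for guest in node.guests:
            target = min(load, key=load.get)   # least-loaded healthy node
            load[target] += 1
            moves.append((guest, node.name, target))
    return moves


cluster = [
    Node("n1", 0.90, ["vm-a"]),
    Node("n2", 0.20, ["vm-b", "vm-c"]),   # fault manager predicts a failure here
    Node("n3", 0.95, []),
]
for guest, source, target in plan_migrations(cluster):
    print(f"migrate {guest}: {source} -> {target}")
```

The interesting design question is where the health signal comes from and how early it arrives: the migration has to complete before the node actually fails.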

We discussed other benefits of virtualization. Among them, the use of multiple virtual machines per physical node to simulate a much larger cluster for demonstrating an application's basic scaling capabilities in advance of being allowed access to a real, full-scale (and expensive) compute resource. Such pre-testing becomes very important in situations in which large user populations are vying for access to relatively scarce, large-scale, centralized research resources.

Geoffroy also spoke about "adapting systems to applications, not applications to systems" by which he meant that virtualization allows an application user to bundle their application into a virtual machine instance with any other required software, regardless of the "supported" software environment available on a site's compute resource. Being able to run applications on old versions of operating systems, or on operating systems with which a site's administrative staff has no experience, truly allows the application provider to adapt the system to their application without placing an additional administrative burden on a site's operational staff. Of course, this does push the burden of creating a correct configuration onto the application provider, but the freedom and flexibility should be welcomed by those who need it. Those who don't could presumably bundle their application into a "standard" guest OS instance. This is completely analogous to the use and customization of Amazon Machine Images (AMIs) on the Amazon Elastic Compute Cloud (EC2) infrastructure.

Observability was another simpatico area of discussion. DTrace has taken low-cost, fine-grained observability to new heights (new depths, actually). Similarly, SRT is looking at how one might add dynamic instrumentation at the hypervisor level to offer a clearer view of where overhead is occurring within a virtualized environment to promote user understanding and also offer a debugging capability for developers.

A few final tidbits to capture before closing. Several other research efforts are looking at HPC and virtualization. Among them are V3VEE (University of New Mexico and Northwestern University) and XtreemOS (a bit of a different approach to virtualization for HPC and Grids). SRT is also working on a virtualized version of OSCAR called OSCAR-V.

The Dulles Vortex of Bad Travel was more successful on my way home. My flight from Knoxville was delayed with an unexplained mechanical problem that could not be fixed in Knoxville, requiring a new plane to be flown from St. Louis. I arrived very late into Dulles, about 10 minutes before my connection to Boston was due to leave from the other end of the terminal. I ran to the gate, arriving two minutes before the flight was scheduled to depart, and it was already gone: no sign of the gate agents or the plane. I spent the night at an airport hotel and flew home first thing the next morning. Dulles had struck again--this was at least the third time I've had problems like this when passing through IAD. I have colleagues who refuse to travel with me through this airport. With good reason, apparently.

Reading list:

Proactive Fault Tolerance for HPC with Xen Virtualization, Nagarajan, Mueller, Engelmann, Scott

The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Computational Kernels and Software, Youseff, Seymour, You, Dongarra, Wolski

Performance Implications of Virtualizing Multicore Cluster Machines, Ranadive, Kesavan, Gavrilovska, Schwan

High Performance Virtual Machine Migration with RDMA over Modern Interconnects, Huang, Gao, Liu, Panda


(2009-01-09 09:47:13.0) Permalink Comments [4]

20081218 Thursday December 18, 2008

Beta Testers Wanted: Sun Grid Engine 6.2 Update 2

A busy day for fresh HPC bits, apparently...

The Sun Grid Engine team is looking for experienced SGE users interested in taking their latest Update release for a test drive. The Update includes bug fixes as well as some new features. Two features in particular caught my eye: a new GUI-based installer and optimizations to support very large Linux clusters (think TACC Ranger).

Full details are below in the official call for beta testers. The beta program will run until February 2nd, 2009. Look no further for something to do during the upcoming holiday season. :-)


Sun Grid Engine 6.2 Update 2 Beta (SGE 6.2u2beta) Program

This README contains important information about the targeted audience of this beta release, new functionality, the duration of this SGE beta program and your possibilities to get support and provide feedback.

  1. Audience of this beta program
  2. Duration of the beta program and release date
  3. New functionality delivered with this release
  4. Installing SGE 6.2u2beta in parallel to a production cluster
  5. Beta program feedback and evaluation support

  1. Audience of this beta program

    This Beta is intended for users who already have experience with the Sun Grid Engine software or DRM (Distributed Resource Management) systems of other vendors. This beta adds new features to the SGE 6.2 software. Users new to DRM systems or users who are seeking a production ready release should use the Sun Grid Engine 6.2 Update 1 (SGE 6.2u1) release which is available from here.

    For the shipping SGE 6.2u1 release we are offering free 30-day evaluation email support.

  2. Duration of the Beta program and release date

    This beta program lasts until Monday, February 2, 2009. The final release of Sun Grid Engine 6.2 Update 2 is planned for March 2009.

  3. New functionality delivered with this release

    Sun Grid Engine 6.2 Update 2 (SGE 6.2u2) is a feature update release for SGE 6.2 which adds the following new functionality to the product:

    • a GUI based installer helping new users to more easily install the software. It complements the existing CLI based installation routine.
    • new support for 32-bit and 64-bit editions of Microsoft Windows Vista (Enterprise and Ultimate Edition), Windows Server 2003R2 and Windows Server 2008.
    • a client and server side Job Submission Verifier (JSV) allows an administrator to control, enforce and adjust jobs requests, including job rejection. JSV scripts can be written in any scripting language, e.g. Unix shells, Perl or TCL.
    • consumable resource attributes can now be requested per job. This makes resource requests for parallel jobs much easier to define, especially when using slot ranges.
    • on Linux, the use of the 'jemalloc' malloc library improves performance and reduces memory requirements.
    • the use of the poll(2) system call instead of select(2) on Linux systems improves scalability of qmaster in extremely huge clusters.

  4. Installing SGE 6.2u2 in parallel to a production cluster

    As with every SGE release, it is safe to install multiple Grid Engine clusters running multiple versions in parallel if all of the following settings are different:

    • directory
    • ports (environment variables) for qmaster and execution daemons
    • unique "cluster name" - from SGE 6.2 the cluster name is appended to the name of the system wide startup scripts
    • group id range ("gid_range")

    Starting with SGE 6.2 the Accounting and Reporting Console (ARCo) accepts reporting data from multiple Sun Grid Engine clusters. Following the installation directions for ARCo and using a unique cluster name for this beta release there is no risk of losing or mixing reporting data from multiple SGE clusters.

  5. Beta Program Feedback and Evaluation Support

    We welcome your feedback and questions on this Beta. We ask you to restrict your questions to this Beta release only. If you need general evaluation support for the Sun Grid Engine software, please subscribe to the free evaluation support by downloading and using the shipping version of SGE 6.2 Update 1.

    The following email aliases are available:


    (2008-12-18 11:30:25.0) Permalink Comments [0]

    Fresh Bits: InfiniBand Updates for Solaris 10

    Fresh InfiniBand bits for Solaris 10 Update 6 have just been announced by the IB Engineering Team:

    The Sun InfiniBand Team is pleased to announce the availability of the Solaris InfiniBand Updates 2.1. This comprises updates to the previously available Solaris InfiniBand Updates 2. InfiniBand Updates 2 has been removed from the current download pages. (Previous versions of InfiniBand Updates need to be carefully matched to the OS Update versions that they apply to.)

    The primary deliverable of Solaris InfiniBand Updates 2.1 is a set of updates of the Solaris driver supporting HCAs based on Mellanox's 4th generation silicon, ConnectX. These updates include the fixes that have been added to the driver since its original delivery, and functionality in this driver is equivalent to what was delivered as part of OpenSolaris 2008.11. In addition, there continues to be a cxflash utility that allows Solaris users to update firmware on the ConnectX HCAs. This utility is only to be used for ConnectX HCAs.

    Other updates include:

    • uDAPL InfiniBand service provider library for Solaris (compatible with Sun HPC ClusterTools MPI)
    • Tavor and Arbel/memfree drivers that are compatible with new interfaces in the uDAPL library
    • Documentation (README and man pages)
    • A renamed flash utility for Tavor-, Arbel memfull-, Arbel memfree-, and Sinai-based HCAs. Instead of "fwflash", this utility is renamed "ihflash" to avoid possible namespace conflicts with a general firmware flashing utility in Solaris

    All are compatible with Solaris 10 10/08 (Solaris 10, Update 6), for both SPARC and X86.

    You can download the package from the "Sun Downloads" A-Z page by visiting http://www.sun.com/download/index.jsp?tab=2 and scrolling down or searching for the link for "Solaris InfiniBand (IB) Updates 2.1" or alternatively use this link.

    Please read the README before installing the updates. This contains both installation instructions and other information you will need to know before running this product.

    Please note again that this Update package is for use on Solaris 10 10/08 (Solaris 10, Update 6) only. A version of the Hermon driver has also been integrated into Update 7 and will be available with that Update's release.

    Congratulations to the Solaris IB Hermon project team and the extended IB team for their efforts in making this product available!


    (2008-12-18 08:06:27.0) Permalink Comments [0]

    20081120 Thursday November 20, 2008

    Random Notes from the IDC HPC Breakfast Briefing

    I went to the IDC HPC Breakfast briefing yesterday morning because they are usually pretty interesting. This one felt mostly like a rehash of earlier material and was somewhat disappointing as a result. I did hear a few things I thought were worth passing on and here they are.

    I made the above graph based on a table that was flashed quickly on the screen during the briefing. If N was specified, I didn't catch it. It is amazing (depressing?) to see how few ISV applications actually scale beyond 32 processors, even after all these years. I showed the graph to Dave Teszler, US Practice Manager for HPC, and he confirmed that he sees lots of commercial HPC customers who buy large clusters, but who really use them as throughput machines where the unit of throughput might be a 32-process job or smaller. In other words, just because a customer buys a 1024-node cluster and is known to use MPI, one cannot assume they are running 1024-process MPI jobs as one can with other kinds of customers like the National Labs or other large supercomputing centers.

    Other notes jotted during the meeting:

    • Over the last four years HPC has shown a yearly growth rate of 19%
    • Blades are making inroads into all segments, driven largely by concerns about power, cooling, and density
    • HPC is growing partly because "live engineering" and "live science" costs continue to escalate, making simulation much more effective for delivering faster "time to solution."
    • Global competitiveness continues to drive HPC growth by offering businesses ways to differentiate through better R&D and product design using HPC techniques
    • x86 was described as being a weak architecture for HPC due to the very wide range of application requirements seen in HPC. This and poor delivered performance on multicore are causing customers to buy more processors for technical computing than they would otherwise.
    • The power issue is not the same for enterprise and HPC. For enterprise, the challenge is how to reduce power consumption, whereas for HPC it is a constraint on growth.
    • Software is still seen as the #1 roadblock for HPC
    • Better management software is needed because HPC clusters are hard to set up and operate and because new buyers need "ease of everything."
    • Current economic uncertainty has delayed IDC forecasting, but they do see real weakness in CAE. By contrast, Oil/Gas, Climate/Weather, University, and DCC (Digital Content Creation) all still appear healthy. The outlook for Finance, Government, Bio/Life, and EDA is unknown at this point.

    (2008-11-20 07:12:15.0) Permalink Comments [0]

    Bjorn to be Wild! Fun at Supercomputing '08

    It's been crazy-busy here at the Sun booth at Supercomputing '08 in Austin, but we do get to have some fun as well. This is Bjorn Andersson, Director of HPC for Sun. He is Bjorn to be Wild.

    This photo reminded my friend Kai of Fjorg 2008. Worth a look.



    (2008-11-20 06:53:37.0) Permalink Comments [2]

    20081119 Wednesday November 19, 2008

    Sun Supercomputing: Red Sky at Night, Sandia's Delight

    Yesterday we officially announced that Sun will be supplying Sandia National Laboratories its next generation clustered supercomputer, named Red Sky. Douglas Doerfler from the Scalable Architectures Department at Sandia spoke at the Sun HPC Consortium Meeting here in Austin and gave an overview of the system to assembled customers and Sun employees. As Douglas noted, this was the world premiere Red Sky presentation.

    The system is slated to replace Thunderbird and other aging cluster resources at Sandia. It is a Sun Constellation system using the Sun Blade 6000 blade architecture, but with some differences. First, the system will use a new diskless two-node Intel blade to double the density of the overall system. The initial system will deliver 160 TFLOPs peak performance in a partially populated configuration with expansion available to 300 TFLOPs.

    Second, the interconnect topology is a 3D torus rather than a fat-tree. The torus will support Sandia's secure red/black switching requirement with a middle "swing" section that can be moved to either the red or black side of the machine as needed with the required air gap.
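
For readers less familiar with torus interconnects, the following sketch shows what distinguishes a 3D torus from a plain 3D mesh: every dimension wraps around, so each node has exactly six neighbors and the shortest path may go "around the back." The 4x4x4 dimensions are an assumption for illustration only; Red Sky's actual torus dimensions were not given in the talk, and the function names are mine.

```python
def torus_neighbors(coord, dims):
    """Return the six directly connected neighbors of a node in a 3D torus."""
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x - 1) % nx, y, z), ((x + 1) % nx, y, z),
        (x, (y - 1) % ny, z), (x, (y + 1) % ny, z),
        (x, y, (z - 1) % nz), (x, y, (z + 1) % nz),
    ]


def torus_hops(a, b, dims):
    """Minimum hop count between two nodes, taking wraparound links into account."""
    return sum(min(abs(p - q), n - abs(p - q)) for p, q, n in zip(a, b, dims))


DIMS = (4, 4, 4)   # hypothetical torus size
print(torus_neighbors((0, 0, 0), DIMS))
print(torus_hops((0, 0, 0), (3, 2, 1), DIMS))
```

Note that a corner node such as (0, 0, 0) still has six neighbors, which is exactly the wraparound property. Those same wraparound links create cycles in the network, which is why deadlock-free routing needs explicit attention on a torus.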

    Primary software components include CentOS, Open MPI, OpenSM, and Lash for deadlock-free routing across the torus. The filesystem will be based on Lustre. oneSIS will be used for diskless cluster management, including booting over InfiniBand.


    (2008-11-19 08:05:36.0) Permalink Comments [2]

    20081117 Monday November 17, 2008

    How to Observe Performance of OpenMP Codes

    A great benefit of the OpenMP standard is that it allows a programmer to specify parallelization strategies, leaving the implementation details to the compiler and its runtime system. A downside of this is that the programmer loses some understanding and visibility into what is actually happening, making it difficult to find and fix performance problems. This is precisely the issue discussed by Professor Barbara Chapman from the University of Houston during her talk at the Sun HPC Consortium Meeting here in Austin today.

    Prof. Chapman briefly described the work she has been doing using the OpenUH compiler as a research base. The older POMP project had used source-level instrumentation and source-to-source translation to produce codes that allowed some access to performance information, but the approach wasn't very popular. Instead, instrumentation has now been directly implemented in the compiler and inserted much later in the compilation process. This allowed the instrumentation to be both improved and also reduced to a more selective set of probe points, greatly reducing the overhead of instrumentation.

    Professor Chapman touched on a few application examples in which this selective implementation approach has resulted in significant performance improvements with little work needed to pinpoint the problem areas within the code. In one example, application performance was easily increased by between 20 and 25% over a range of problem sizes. In another case involving an untuned OpenMP code, the instrumentation quickly pointed to incorrect usage of shared arrays and initialization problems related to first-touch memory allocation.

    A second thrust of this research work is to take advantage of the fact that the OpenMP runtime layer is basically in charge as the application executes. Because it controls execution, it can also be used to gather runtime performance information as part of a performance monitoring system.

    Both of these techniques give programmers tools to debug the performance of their codes at the semantic level at which those codes were originally written, which is critically important as more and more HPC (and other) users attempt to extract good parallel performance from existing and future multi-core chips.


    (2008-11-17 15:22:58.0) Permalink Comments [0]

    Project Thebes Update from Georgetown University

    The big news from Arnie Miles, Senior Systems Architect at Georgetown University, is that the Thebes Middleware Consortium has moved from concept to code with a new prototype of a service provider based on DRMAA that mediates access to an unmodified Sun Grid Engine instance from a small Java-based client app.

    In addition, the Thebes Consortium has just released a first draft of an XML schema that attempts to create a language that harmonizes how jobs and resources are described in a resource-sharing environment and sits above the specific approaches taken by existing systems (Ganglia, Sun Grid Engine, PBS, Condor, LSF, etc.). The proposal will soon be submitted to OGF for consideration.

    The next nut to crack is the definition of a resource discovery network, which is under development now. The team hopes to be able to share their work on this at ISC in Hamburg in June of next year.


    (2008-11-17 10:29:52.0) Permalink Comments [0]

    Dealing with Data Proliferation at Clemson

    Jim Pepin, CTO for Clemson University, talked today at the HPC Consortium Meeting about the challenges and problems created by emerging technology and social trends as seen through the lens of a university environment.

    As a preamble, Jim noted that between 1970 and now the increases in compute and storage capabilities have pretty much kept pace with each other. Networking bandwidth, however, has lagged by about two orders of magnitude. This has a variety of ramifications for local/centralized data storage decisions (or constraints).

    In many ways, storage is moving closer to end-users. Examples include personal storage like iPods, phones, local NAS boxes, etc., as well as more research-oriented data collection efforts related to the proliferation of new sensors and instrumentation. There is data everywhere in vast quantities and widely distributed across a typical university environment.

    Particular issues of concern at Clemson include how to back up these distributed and rapidly-growing pools of storage, how to handle security, how to protect data while still being able to open networks, and how to deal with a wide diversity of systems and data-generating instruments.


    (2008-11-17 10:12:06.0) Permalink Comments [0]

    So, What About Java for HPC?

    About ten years ago the HPC community attempted to embrace Java as a viable approach for high performance computing via a forum called Java Grande. That effort ultimately failed for various reasons, one of which was the difficulty of achieving acceptable performance for interesting HPC workloads. Today at the HPC Consortium Meeting here in Austin, Professor Denis Caromel from the University of Nice made the case that Java is ready now for serious HPC use. He described the primary features of ProActive Java, a joint project between INRIA and University of Nice CNRS, and provided some performance comparisons against Fortran/MPI benchmarks.

    As background, Denis explained that the goal of ProActive is to enable parallel, distributed, and multi-core solutions with Java using one unified framework. Specifically, the approach should scale from a single, multi-core node to a large, enterprise-wide grid environment.

    ProActive embraces three primary areas: Programming, Optimizing, and Scheduling. The programming approach is based on the use of active objects to create a dataflow-like asynchronous communication framework in which objects can be instantiated in either separate JVMs or within the same address space in the case of a multi-core node. Objects are instantiated asynchronously on the receiver side and then represented immediately on the sender side by "future objects" which will be populated asynchronously when the remote computation completes. Accessing future objects whose contents have not yet arrived causes a "wait by necessity" which implements the dataflow synchronization mechanism.
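
The future-object idea is easy to see in miniature. The sketch below is an analogy using Python's standard `concurrent.futures` module, not the ProActive API: `remote_sum` is a made-up stand-in for work done by a remote active object, and a thread pool stands in for the separate JVMs.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def remote_sum(values):
    """Hypothetical stand-in for a computation performed by a remote object."""
    time.sleep(0.05)          # pretend this takes a while
    return sum(values)


with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns at once with a Future: the sender-side placeholder.
    left = pool.submit(remote_sum, [1, 2, 3])
    right = pool.submit(remote_sum, [10, 20])

    # The caller is free to do unrelated work here while both sums run.

    # result() blocks only if the value has not arrived yet, which is the
    # closest analogue to "wait by necessity" in this setting.
    total = left.result() + right.result()

print(total)
```

One difference worth noting: here the wait is explicit in the call to `result()`, whereas the description above suggests that in ProActive the wait happens implicitly when the future object is first accessed.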

    ProActive also supports a SPMD programming style with many of the same primitives found in MPI -- e.g., barriers, broadcast, reductions, scatter-gather, etc.
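
For those who have not used the SPMD style, here is a minimal sketch of two of those primitives, a barrier and a reduction, with threads standing in for processes. This uses only the Python standard library and is neither ProActive nor MPI code; `run_spmd` and its strided data distribution are my own illustrative choices.

```python
import threading


def run_spmd(num_ranks, data):
    """Run the same body on every rank and return each rank's reduced sum."""
    barrier = threading.Barrier(num_ranks)
    partial = [0] * num_ranks
    result = [None] * num_ranks

    def body(rank):
        # Scatter: each rank works on its own strided slice of the input.
        partial[rank] = sum(data[rank::num_ranks])
        # Barrier: nobody proceeds until every partial sum has been written.
        barrier.wait()
        # Reduction, computed redundantly on every rank (an "allreduce").
        result[rank] = sum(partial)

    threads = [threading.Thread(target=body, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


print(run_spmd(4, list(range(1, 101))))
```

Every rank runs the same code and differs only in its rank number, which is the essence of SPMD. Without the barrier, a fast rank could sum the partial results before a slow rank had written its contribution.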

    Results for several NAS parallel benchmarks were presented, in particular CG, MG, and EP. On CG, the ProActive version performed at essentially the same speed as the Fortran/MPI version over a range of problem sizes from 1-32 processes. Fortran did better on MG and this seems to relate to issues around large memory footprints, which the ProActive team is looking at in more detail. With EP, Java was faster or significantly faster in virtually all cases.

    Work continues to lower messaging latency, to optimize in-node data transfers by sending pointers rather than data, and to reduce message-size overhead.

    When asked how ProActive compares to X10, Denis pointed out that while X10 does share some concepts with ProActive, X10 is a new language while ProActive is designed to run on standard JVMs and to enable the use of standard Java for HPC.

    A full technical paper about ProActive in PDF format is available here.


    (2008-11-17 09:30:48.0) Permalink Comments [0]

    A Customer's View of Sun's HPC Consortium Meeting

    One of our customers, Gregg TeHennepe from Jackson Laboratory, has been blogging about his attendance at Sun's HPC Consortium meeting here in Austin. For his perspectives and for some excellent photos of the Ranger supercomputer at TACC, check out his blog, Mental Burdocks.

    (2008-11-17 09:26:12.0) Permalink Comments [0]

    If You Doubt the Utility of GPUs for HPC, Read this

    Professor Satoshi Matsuoka from the Tokyo Institute of Technology gave a really excellent talk this afternoon about using GPUs for HPC at the HPC Consortium Meeting here in Austin.

    As you may know, the Tokyo Institute of Technology is the home of TSUBAME, the largest supercomputer in Asia. It is an InfiniBand cluster of 648 Sun Fire x4600 compute nodes, many with installed Clearspeed accelerator cards.

    The desire is to continue to scale TSUBAME into a petascale computing resource over time. However, power is a huge problem at the site. The machine is responsible for roughly 10% of the overall power consumption of the Institute and therefore they cannot expect their power budget to grow over time. The primary question, then, is how to add significant compute capacity to the machine while working within a constant power budget.

    It was clear from their analysis that conventional CPUs would not allow them to reach their performance goals while also satisfying the no-growth power constraint. GPUs--graphics processing units like those made by nVidia--looked appealing in that they claim extremely high floating point capabilities and deliver this performance at a much better performance/watt ratio than conventional CPUs. The question, though, is whether GPUs can be used to significantly accelerate important classes of HPC computations or whether they are perhaps too specialized to be considered for inclusion in a general-purpose compute resource like TSUBAME. Professor Matsuoka's talk focused on this question.

    The talk approached the question by presenting performance speed-up results for a selection of important HPC applications or computations based on algorithmic work done by Prof. Matsuoka and other researchers at the Institute. These studies were done in part because GPU vendors do a very poor job of describing exactly what GPUs are good for and what problems are perhaps not handled well by GPUs. By assessing the capabilities over a range of problem areas, it was hoped that conclusions could be drawn about the general utility of the GPU approach for HPC.

    The first problem examined was a 3D protein docking analysis that performs an all-to-all analysis of 1K proteins to 1K proteins. Based on their estimates, a single protein-protein interaction analysis requires about 200 TeraOps, so the full 1000x1000 problem (one million pairings) requires about 200 ExaOps. In order to maximally exploit GPUs for this problem, a new 3D FFT algorithm was developed that in the end delivered excellent performance and a 4x better performance/watt than IBM's BG/L system, which itself is much more efficient than a more conventional cluster approach.

    In addition, other algorithmic work delivered speedups of 45X over a single conventional CPU for CFD, which is typically limited by available bandwidth. Likewise, a computation involving phase separation in a liquid delivered a speedup of 160X over a conventional processor.

    Single-node comparisons thus showed that GPUs do appear to be able to deliver interesting performance and performance/watt for an array of useful problem types, so long as new algorithms can be created to exploit the specific capabilities of these GPUs. The next question was whether these results could be extended to multi-GPU and cluster environments.

    To test this, the team worked with the RIKEN Himeno CFD benchmark, which is considered the worst memory bandwidth-limited code one will ever see. It is actually worse than any real application one would ever encounter. If this could be parallelized and used with GPUs to advantage, then other less difficult codes should also benefit from the GPU approach.

    For this test, the code was parallelized to run using multiple GPUs per node, with MPI as the communication mechanism between nodes. Results showed about a 50X performance improvement over a conventional CPU cluster on a small-sized problem.

    A multi-GPU parallel sparse solver was also created which showed a 25X-35X improvement over conventional CPUs. This was accomplished using double precision implemented using mixed-precision techniques.
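
    The talk did not say which mixed-precision technique was used, so take the following as an assumption on my part: one common approach is iterative refinement, where the expensive solve is done in single precision and the residual and the accumulated solution are kept in double precision. The sketch below shows the idea on a tiny dense system in plain Java on the CPU; it is not the Institute's sparse solver, and all names in it are invented for illustration.

```java
// Illustrative only: mixed-precision iterative refinement on a small dense system.
class MixedPrecisionSketch {

    // Solve A x = b entirely in single precision
    // (Gaussian elimination with partial pivoting).
    static float[] solveFloat(double[][] a, double[] b) {
        int n = b.length;
        float[][] m = new float[n][n + 1];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                m[i][j] = (float) a[i][j];
            }
            m[i][n] = (float) b[i];
        }
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(m[r][col]) > Math.abs(m[pivot][col])) {
                    pivot = r;
                }
            }
            float[] tmp = m[col];
            m[col] = m[pivot];
            m[pivot] = tmp;
            for (int r = col + 1; r < n; r++) {
                float factor = m[r][col] / m[col][col];
                for (int c = col; c <= n; c++) {
                    m[r][c] -= factor * m[col][c];
                }
            }
        }
        float[] x = new float[n];
        for (int i = n - 1; i >= 0; i--) {
            float s = m[i][n];
            for (int j = i + 1; j < n; j++) {
                s -= m[i][j] * x[j];
            }
            x[i] = s / m[i][i];
        }
        return x;
    }

    // Residual r = b - A x, computed in double precision.
    static double[] residual(double[][] a, double[] b, double[] x) {
        int n = b.length;
        double[] r = new double[n];
        for (int i = 0; i < n; i++) {
            double s = b[i];
            for (int j = 0; j < n; j++) {
                s -= a[i][j] * x[j];
            }
            r[i] = s;
        }
        return r;
    }

    // Start from x = 0; each pass solves for a correction in single
    // precision and accumulates it in double precision.
    static double[] refine(double[][] a, double[] b, int iterations) {
        double[] x = new double[b.length];
        for (int it = 0; it < iterations; it++) {
            double[] r = residual(a, b, x);
            float[] d = solveFloat(a, r);
            for (int i = 0; i < x.length; i++) {
                x[i] += d[i];
            }
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] a = {{4.0, 1.0}, {1.0, 3.0}};
        double[] b = {1.0, 2.0};
        double[] x = refine(a, b, 3);
        System.out.println(x[0] + " " + x[1]);
    }
}
```

    The appeal on a GPU is that the single-precision solve, which dominates the cost, runs on the hardware's much faster single-precision units, while the cheap residual computation restores double-precision accuracy.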

    While all of these results seemed promising, could such a GPU approach be deployed at scale in a very large cluster rather than just within a single node or across a modest-sized cluster? The Institute decided to find out by teaming with nVidia and Sun to enhance TSUBAME by adding Tesla GPUs to some (most) nodes.

    Installing the Tesla cards into the system went very smoothly and resulted in three classes of nodes: those with both Clearspeed and Tesla installed, those with only Tesla installed, and those Opteron nodes with neither kind of accelerator installed.

    Could this funky array of heterogeneous nodes be harnessed to deliver an interesting LINPACK number? It turns out that it could, with much work and in spite of limited bandwidth in the upper links of the InfiniBand fabric and limited PCI-X/PCIe bandwidth in the nodes (I believe due to the number and types of slots available in the x4600 and the number of required devices in some of the TSUBAME compute nodes).

    As a result of the LINPACK work (which could have used more time--it was deadline-limited), the addition of GPU capability allowed TSUBAME's LINPACK number to be raised from 67.7 TFLOPs, which was reported in June, to a new high of 77.48 TFLOPs, an increase of roughly 14%.

    With the Tesla cards installed, TSUBAME can now be viewed as a 900 TFLOPs (single precision) or 170 TFLOPs (double precision) machine, one with either 10K cores or, if one counts the components embedded within each installed GPU, 300K SIMD cores.

    The conclusion is pretty clearly that GPUs can be used to significant advantage on an interesting range of HPC problem types, though it is worth noting that clever new algorithms may need to be developed to map these problems efficiently onto GPU compute resources.


    (2008-11-16 21:28:20.0) Permalink Comments [0]


 