Wednesday July 01, 2009 | The Navel of Narcissus Josh Simons' Coordinates in the Blogosphere |
Run an HPC Cluster...On your Laptop
With one free download, you can now turn your laptop into a virtual three-node HPC cluster that can be used to develop and run HPC applications, including MPI apps. We've created a pre-configured virtual machine that includes all the components you need: Sun Studio C, C++, and Fortran compilers with performance analysis, debugging tools, and high-performance math library; Sun HPC ClusterTools -- MPI and runtime based on Open MPI; and Sun Grid Engine -- distributed resource management and cloud connectivity.
Inside the virtual machine, we use OpenSolaris 2009.06, the latest release of OpenSolaris, to create a virtual cluster using Solaris zones technology, and we have pre-configured Sun Grid Engine to manage it so you don't need to. MPI is ready to go as well -- we've configured everything in advance. If you haven't tried OpenSolaris before, this will also give you a chance to play with ZFS, with DTrace, with Time Slider (like Apple's Time Machine, but without the external disk) and a host of other cool new OpenSolaris capabilities.
For full details on Sun HPC Software, Developer Edition for OpenSolaris check out the wiki. To download the virtual image for VMware, go here. (VirtualBox image coming soon.) If you have comments or questions, send us a note at hpcdev-discuss@opensolaris.org.
(2009-07-01 11:49:59.0) Permalink Comments [0]
HPC in Hamburg: Sun Customers Speak at the HPC Consortium
It's crazy time again. I'm in Hamburg for two HPC events: Sun's HPC Consortium customer event, and ISC '09, the International Supercomputing Conference. The Consortium ran all day Sunday and Monday and then ISC started on Tuesday. It is now Wednesday and this is the first break I've had to post a summary of talks given at the Consortium. Due to the sheer number of presentations, including a wide range of Sun and partner talks, I summarize only those given by our customers. The full agenda is here.
Our first customer talk on Sunday was given by Dr. James Leylek, Executive Director of CU-CCMS, the Clemson University Computational Center for Mobility Systems, which focuses on problems in the automotive, aviation/aerospace, and energy industries. The mission of CU-CCMS is not unique -- there are numerous university-based centers that work closely with industry by bringing resources and expertise to bear in a variety of problem domains. What sets CU-CCMS apart is its focus on addressing the mismatch between typical university time-scales and those of their industrial partners. Businesses need results quickly; universities move more slowly. CU-CCMS has addressed this need in a few ways. They've staffed the center with full-time MS and PhD level engineers who have no teaching responsibilities. And they have provided a significant amount of computing gear to enable those engineers to work effectively with their industrial partners and generate results in a timely way. Heterogeneity is another key part of the CU-CCMS strategy. By offering a range of computing platforms from clusters to very large shared-memory machines (from Sun) they are able to map problems to appropriate resources to deliver the fast turnaround times required by their industrial partners. Dr. Leylek also briefly discussed the challenge of introducing HPC to industry as detailed in the Council on Competitiveness study, Reveal. As he noted, many companies are "sitting on the sidelines" of HPC and not engaging even though they could increase their competitiveness by using HPC techniques. He believes CU-CCMS offers a model for how such engagements can be run successfully: assemble a team of expert, dedicated technical resources with appropriate domain knowledge, algorithmic expertise, etc., and combine that with ample high performance computing infrastructure, and an understanding that turnaround time is critical for successful industrial engagements. And then generate valuable results.
Lather, rinse, repeat.
Thomas Nau from the University of Ulm gave the next talk, which was a quick tour through several OpenSolaris technologies. He talked about COMSTAR, gave a quick demo of the new OpenSolaris Time Slider, and spent most of his time talking about ZFS, specifically about the benefits of solid state disks for increasing ZFS performance. Thomas identified the ZIL -- the ZFS Intent Log -- as the component most often affecting performance. Experiments he has done that involved moving the ZIL from a standard hard disk to a ramdisk have shown significant ZFS performance improvements. In addition, only a small amount of solid state storage is needed to achieve good performance, e.g. perhaps 1-4 GB even for multi-TeraByte drives. Thomas noted that while one could theoretically increase ZFS performance by disabling the ZIL, DO NOT DO THIS. He then ended with the following statement, with which I can only agree: "Hardware RAID is dead, dead, dead. Just use ZFS." :-)
Our first customer talk on Monday was given by Prof. Dr. Thomas Lippert, head of the Jülich Supercomputing Center (JSC), site of Sun's largest European deployment to date of our Sun Constellation System architecture. He first gave a brief history of the Jülich Research Center, which is one of the largest civilian research centers in Europe with over 4000 researchers in nine departments, one of which is the new Institute for Advanced Simulation of which JSC is a part. The site has a very long history of computer acquisitions, starting in 1957. This year JSC purchased three systems: a Sun system (JuRoPa), a Bull system (HPC-FF), and an IBM system (Jugene.) These systems have, respectively, 200 TFLOPs, 100 TFLOPs, and 1 PFLOPs of peak performance. Since the Sun and Bull systems are interconnected at the highest level of their switch hierarchies, the two machines can be run as a single system. This combined system delivered 274.8 TFLOPs on LINPACK which earned it the #10 entry on the latest edition of the TOP500 list. Collectively, JSC serves about 250 projects across Europe, including 20-30 highly scalable projects that are chosen by international referees for their potential for producing breakthrough science. Dr. Lippert also spoke briefly about PRACE, the Partnership for Advanced Computing in Europe, which is radically changing the supercomputing landscape across Europe. Due to earlier studies, computing is now considered to be a crucial pillar of research infrastructure and, as such, it is now receiving considerable attention from funding agencies. In closing, Dr. Lippert presented specific details of the JuRoPa system (2208 nodes, 17664 cores, 207 TFLOPs, 48 GB/node, and Sun's new M9 QDR switch.) He also described some of the specific issues that will be explored with these systems, including control of jitter through the use of gang scheduling, daemon reduction, a SLERT kernel, etc. And some additional secret sauce from Sun perhaps. :-)
Prof. Satoshi Matsuoka from the Tokyo Institute of Technology spoke next. While he did mention Tsubame, Tokyo Tech's Sun-based supercomputer, he primarily spoke about the return of vector machines to HPC. They have, he believes, been reincarnated as GPGPU-based machines. Dinosaurs are once again walking the earth. :-) In particular, the GPGPU's high compute density, high memory bandwidth, and low memory latency echo some of the fundamental capabilities of vector machines that make them interesting for both tightly coupled codes like N-body as well as sparse codes like CFD. In his view, the GPGPU essentially becomes the main processor while the CPU becomes an ancillary processor. Computers, however, are not useful unless they can be used to solve problems. To support the fact that GPGPU-based clusters can be effective HPC platforms, Prof. Matsuoka presented results from several new algorithms that have been developed at Tokyo Tech to take advantage of GPGPU-based systems. He showed impressive results for 3D FFTs used for protein folding and results for CFD with speedups up to 70X over CPU-based algorithms.
Our next customer speaker was Henry Tufo of the University of Colorado at Boulder (UCB) and the National Center for Atmospheric Research (NCAR.) He gave an update on UCB's upcoming Constellation-based HPC system and also spoke about some of the challenges related to climate modeling. It seems clear at this point that accurate climate modeling is going to be critical for understanding our future and our planet's future. It was a bit daunting to hear that climate modelers would like to increase many dimensions of their simulations, including spatial resolution by 10^3 or 10^5, the completeness of their models by a factor of 100x, the length of their simulation runs by 100x, and the number of modeled parameters by 100x. All told, their desires would increase computational needs by 10^10 or 10^12 over current requirements. It was sobering to hear that current technology trajectories predict that a 10^7 improvement will take about 30 years. Not good. Their new Sun-based system will consist of 12 Constellation racks, Nehalem blades, QDR InfiniBand, about 500 TB of storage with about 10% of the cluster's nodes accelerated with GPUs. The system will be located next to an existing physics building in three containers -- one for the IT components, one for electrical, and one for cooling.
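To put the trajectory quoted in the climate discussion above in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes, as a simplification of mine and not something from Henry's talk, that performance keeps improving at a steady exponential rate, so the "10^7 in about 30 years" figure can be extrapolated on a logarithmic scale. The function name and its parameters are made up for illustration.

```python
import math


def years_to_reach(target_factor, known_factor=1e7, known_years=30.0):
    """Years needed for an improvement of target_factor.

    Assumption (not from the talk): growth is steadily exponential, so the
    time needed scales with the logarithm of the improvement factor.
    """
    return known_years * math.log10(target_factor) / math.log10(known_factor)


# The two overall targets mentioned in the talk summary.
for exponent in (10, 12):
    years = years_to_reach(10.0 ** exponent)
    print(f"10^{exponent} improvement: about {years:.0f} years")
```

Under that assumption the 10^10 and 10^12 targets land roughly 43 and 51 years out, which is why the numbers were so sobering.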
Stephane Thiell from the Commissariat à l’Énergie Atomique (CEA) gave an overview of CEA, talked a bit about CEA's TERA-100 project and then detailed CEA's planned use of Lustre for TERA-100. The CEA computing complex currently has two computing centers, one classified (TERA) and one open (CCRT.) TERA-100 will be a follow-on to TERA-10, which is a 60 TFLOPs, Linux-based system built by Bull in 2005. It includes an impressive 1 PetaByte Lustre filesystem and uses HPSS to archive to Sun StorageTek tape libraries with a 15 PetaByte capacity. TERA-100 aims to increase CEA's classified computing capacity by about 20x with a final size of one PFLOPs or perhaps a little larger. CEA plans to continue with their COTS-based, general-purpose approach rather than move off the main sequence to something more exotic. It will be x86-based with more than 500 GFLOPs per node using 4-socket nodes. There will be 2-4 GB per core and two Lustre file systems will be supported, one with a 300 GB/s transfer requirement and the other with a 200 GB/s requirement. The system will consume less than 5 MW. A 40 TFLOPs demonstrator system will be built first and it will include scaled-down versions of the Lustre file systems as well. In the final system the Lustre servers will be built with four-socket nodes and a four-node HA architecture will be used to guarantee against failure and to avoid long failover times. CEA is involved in some interesting Lustre-based development, including joint work with Sun on a binding between Lustre and external HSM systems with the goal of supporting Lustre levels of performance with transparent access to hierarchical storage management. CEA is also working on Shine, a management tool for Lustre.
Dieter an Mey gave some general information about computing at RWTH Aachen University and then gave an update on their latest acquisition, a Sun-based supercomputer. He ended with a discussion about the pleasures and perils of workload placement on current generation systems. Along the way he shared some feedback on Sun products -- one of those habits that makes customers like Dieter such valuable partners for Sun. Aachen provides both Linux and Windows-based HPC resources for their users. On Linux they record about 40,000 batch jobs per month and perhaps 150 interactive sessions per day. The Windows cluster is used primarily for interactive jobs. It was interesting to hear that Windows is gaining ground with respect to Linux at Aachen: a previous study at Aachen had shown that Windows lagged Linux in performance by about 24%, but a recent re-run of the study now shows the gap to be on the order of about 7%. Aachen's new system will support both Linux and Windows equally with a flexible dividing line between them. The facility is designed to be general purpose with a mix of thin and fat nodes and with the required high-speed interconnect for those who use MPI. A new building is being erected to house this machine which will come fully online over the course of 2009-2010. When complete, the system will have a peak floating point rate in excess of 200 TFLOPs and it will include a 1 PetaByte Lustre file system. Speaking of Lustre, Dieter rated its configuration as "complex", something Sun is working on. The system will also include two of Sun's latest InfiniBand switches, the new 648-port QDR M9 switch. Dieter's final topic was the correct placement of workload on non-uniform system architectures. In particular, he described the difference between compact and scatter placement on multicore NUMA systems. Compact placement uses threads on the same core first, then cores in the same socket --- a strategy that is used to minimize latency and to maximize cache sharing. 
Scatter placement uses threads on different sockets first, and then threads on different cores -- a strategy that maximizes memory bandwidth. Which strategy is best depends on the details of an application's underlying algorithms. (Dieter noted that currently Sun Grid Engine is not aware of these issues -- it treats nodes as flat arrays of threads or cores.) Placement decisions are further complicated when attempting to schedule more than one application onto a fat node. For example, different strategies would be used depending on whether single job turnaround is more important than overall throughput of jobs.
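The difference between the two placement strategies Dieter described is easy to see in a few lines of Python. This is only an illustrative sketch: the function names are my own, the node is modeled as a simple (socket, core, hardware thread) grid, and real schedulers and OpenMP runtimes have their own mechanisms for this.

```python
def _check(n_threads, sockets, cores_per_socket, threads_per_core):
    """Reject requests for more threads than the node has hardware threads."""
    if n_threads > sockets * cores_per_socket * threads_per_core:
        raise ValueError("more threads requested than hardware threads available")


def compact_placement(n_threads, sockets, cores_per_socket, threads_per_core):
    """Fill the hardware threads of one core, then the cores of one socket,
    and only then move on to the next socket."""
    _check(n_threads, sockets, cores_per_socket, threads_per_core)
    slots = [(s, c, t)
             for s in range(sockets)
             for c in range(cores_per_socket)
             for t in range(threads_per_core)]
    return slots[:n_threads]


def scatter_placement(n_threads, sockets, cores_per_socket, threads_per_core):
    """Spread across sockets first, then across cores, and use additional
    hardware threads on a core only as a last resort."""
    _check(n_threads, sockets, cores_per_socket, threads_per_core)
    slots = [(s, c, t)
             for t in range(threads_per_core)
             for c in range(cores_per_socket)
             for s in range(sockets)]
    return slots[:n_threads]


# Hypothetical node: 2 sockets, 4 cores per socket, 2 hardware threads per core.
# Each tuple is (socket, core, hardware thread).
print("compact:", compact_placement(4, 2, 4, 2))
print("scatter:", scatter_placement(4, 2, 4, 2))
```

With four threads on this made-up node, compact placement keeps everything on two cores of socket 0 (shared caches, low latency), while scatter placement uses two cores on each socket (both memory controllers in play).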
Our last customer talk at the Consortium was given by the tag team of Arnie Miles (left) from Georgetown University and Tim Bornholtz (right) of the Bornholtz Group. Their topic was the Thebes Consortium for which they presented current status, did a short demo, and announced that the source code would be available by the end of June on sunsource.net. The Thebes Consortium aims to help the widespread adoption of distributed computing technologies by creating an enabling infrastructure that focuses on scalability, security, and simplicity. Arnie described (and Tim demo'ed) the instantiation of the Thebes first use case which assumes 1) that users have usernames and passwords in their home domain, 2) that one or more local resources have a trust relationship with a local STS (secure token service), 3) that these resources are known to users, and 4) that all resources are able to consume SAML. The use case itself consists of the following actions: 1) users create job submission files using the client application or a command line, 2) users use institution usernames and passwords to acquire a signed SAML token, 3) users perform no other logins and do not have to go to a resource command line interface, 4) users manually choose their resources, 5) job scheduling is handled by resources. Note that in this instance "resource" refers to a DRM-managed cluster which will accept the incoming request and then schedule the job appropriately on its managed cluster. In the prototype as it currently exists, a service is a compute service though there is also some level of support for a file system service as well. (2009-06-24 11:13:17.0) Permalink Comments [0] FORTRAN: Calling All Dinosaurs! DO you PROGRAM FORTRAN? IF so, READ on. Please ASSIGN some time to RECORD your opinions about current and future FORTRAN needs in our non-COMPLEX online survey. It is in your INTRINSIC self-interest to PAUSE and DO so. 
It is IMPLICIT and LOGICAL that you also CALL on your colleagues (those CHARACTERs) to READ this, get REAL, and make an ENTRY as well. You can OPEN the survey IF you GOTO here. (Something we share in COMMON: I am a FORTRAN TYPE as well and am eligible to join the Dinosaur UNION.)
(2009-06-18 07:02:23.0) Permalink Comments [0]
Rur Valley Journal: What's Up at Jülich?
What's up at Jülich? The latest 200+ TeraFLOPs of Sun-supplied HPC compute power is now up and running! The JuRoPa (Jülich Research on PetaFLOP Architectures) system at the Jülich Research Center in Jülich, Germany has just come online this week. A substantial part of the system is built with the Sun Constellation System architecture, which marries highly dense blade systems with an efficient, high-performance QDR InfiniBand fabric in an HPC cluster configuration. We delivered 23 cabinets filled with a total of 1104 Sun Blade X6275 servers or 2208 nodes. Each of these nodes is a dual-socket Nehalem-EP system running at 2.93 GHz. The systems are connected with quad data-rate (QDR) 4X InfiniBand using a total of six of our latest 648-port QDR switches. As usual, we use 12X InfiniBand cables to route three 4X connections, thereby greatly reducing the number of cables and connectors, and increasing the reliability of the fabric. For more detail on the Nehalem-EP blades and other components used in this system, see this blog entry. I've annotated one of the official photos below. Marc Hamilton has many more photos on his blog, including some cool "underground shots" at Jülich.
(2009-05-29 14:02:54.0) Permalink Comments [0]
Tickless Clock for OpenSolaris
I've been talking a lot to people about the convergence we see happening between Enterprise and HPC IT requirements and how developments in each area can bring real benefits to the other. I should probably do an entire blog entry on specific aspects of this convergence, but for now I'd like to talk about the Tickless Clock OpenSolaris project. Tickless kernel architectures will be familiar to HPC experts as one method for reducing application jitter on large clusters. For those not familiar with the issue, "jitter" refers to variability in the running time of application code due to underlying kernel activity, daemons, and other stray workloads. Since MPI programs typically run in alternating compute and communication phases and develop a natural synchronization as they do so, applications can be slowed down significantly when some nodes arrive late at these synchronization points. The larger the MPI job, the more likely this type of noise will cause a problem. Measurements have shown surprisingly large slowdowns associated with jitter. Jitter can be lessened by reducing the number of daemons running on a system, by turning off all non-essential kernel services, etc. Even with these changes, however, there are other sources of jitter. One notable source is the clock interrupt used in virtually all current operating systems. This interrupt, which fires 100 times per second, is used to periodically perform housekeeping chores required by the OS. This interrupt is a known contributor to jitter. It is for this reason that IBM has implemented a tickless kernel on their Blue Gene systems to reduce application jitter. Sun is starting a Tickless Clock project in OpenSolaris to completely remove the clock interrupt and switch to an event-based architecture for OpenSolaris. While I expect this will be very useful for HPC users of OpenSolaris, HPC is not the primary motivator of this project.
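Why jitter hurts more as MPI jobs grow can be shown with a toy model in Python. This is a sketch under simplifying assumptions of my own: each node is independently delayed with some small probability during a compute phase, and every node waits at the synchronization point for the slowest one. The probability and timing values below are hypothetical, not measurements.

```python
def prob_phase_delayed(n_nodes, p_node_hit):
    """Probability that at least one of n_nodes is delayed during a phase,
    assuming each node is hit independently with probability p_node_hit."""
    return 1.0 - (1.0 - p_node_hit) ** n_nodes


def expected_phase_time(compute_ms, delay_ms, n_nodes, p_node_hit):
    """Expected phase length when everyone waits for the slowest node.

    Simplification: a delayed phase costs exactly one extra delay_ms,
    no matter how many nodes were hit.
    """
    return compute_ms + delay_ms * prob_phase_delayed(n_nodes, p_node_hit)


# Hypothetical numbers: 0.1% chance per node per phase, 10 ms of compute,
# 1 ms lost whenever any node is interrupted.
for n in (1, 100, 10_000):
    p = prob_phase_delayed(n, 0.001)
    t = expected_phase_time(10.0, 1.0, n, 0.001)
    print(f"{n:>6} nodes: P(delayed phase) = {p:.3f}, expected phase = {t:.2f} ms")
```

A disturbance that is almost invisible on a single node becomes close to a certainty on every phase once the job spans thousands of nodes, which is the effect a tickless kernel is meant to reduce.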
As you'll hear in the video interview with Eric Saxe, Senior Staff Engineer in Sun's Kernel Engineering group, the primary reasons he is looking at Tickless Clock are power management and virtualization. For power management, it is important that when the system is idle, it really IS idle and not waking up 100 times per second to do nothing since this wastes power and will prevent the system from entering deeper power saving states. For virtualization, since multiple OS instances may share the same physical server resources, it is important that guest OSes that are idle really do stay idle. Again, waking up 100 times per second to do nothing will steal cycles from active guest OS instances, thereby reducing performance in a virtualized environment. While I would argue that both power management and virtualization will become increasingly important to HPC users (more of that convergence thing), it is interesting to me to see that these traditional enterprise issues are stimulating new projects that will benefit both enterprise and HPC customers in the future. Interested in getting involved with implementing a tickless architecture for OpenSolaris? The project page is here.
(2009-04-15 11:14:44.0) Permalink Comments [0]
You Say Nehalem, I Say Nehali
Let me say this first: NEHALEM, NEHALEM, NEHALEM. You can call it the Intel Xeon Series 5500 processor if you like, but the HPC community has been whispering about Nehalem and rubbing its collective hands together in anticipation of Nehalem for what seems like years now. So, let's talk about Nehalem. Actually, let's not. I'm guessing that between the collective might of the Intel PR machine, the traditional press, and the blogosphere you are already well-steeped in the details of this new Intel processor and why it excites the HPC community.
Rather than talk about the processor per se, let's talk instead about the more typical HPC scenario: not about a single Nehalem, but rather piles of Nehali and how best to deploy them in an HPC cluster configuration. Our HPC clustering approach is based on the Sun Constellation architecture, which we've deployed at TACC and other large-scale HPC sites around the world. These existing systems house compute nodes in the Sun Blade 6000 System chassis which holds four blade shelves, each with twelve blade systems for a total of 48 blades per chassis. Constellation also includes matching InfiniBand infrastructure, including InfiniBand Network Express Modules (NEMs) and a range of InfiniBand switches that can be used to build petascale compute clusters. You can see several of the 82 TACC Ranger Constellation chassis in this photo, interspersed with (black) inline cooling units. As part of our continued focus on HPC customer requirements, we've done something interesting with our new Nehalem-based Vayu blade (officially, the Sun Blade X6275 Server Module): each blade houses two separate nodes. Here is a photo of the Vayu blade:
Each of the nodes is a diskless, two-socket Nehalem system with 12 DDR3 DIMM slots per node (up to 96 GB per node) and on-board QDR InfiniBand. It's actually not quite correct to call this node diskless because it does include two Sun Flash Module slots (one per node) that each provide up to 24 GB of FLASH storage through a SATA interface. I am sure our HPC customers will use what amounts to an ultra-fast disk for some interesting applications. Using Vayu blades, each Sun Constellation chassis can now support a total of 2 nodes/blade * 12 blades/shelf * 4 shelves/chassis = 96 nodes with a peak floating-point performance of about 9 TFLOPs. While a chassis can support up to 96 GB / node * 96 nodes = 9.2 TB of memory, there are some subtleties involved in optimizing and configuring memory for Nehalem systems so I recommend reading John Nerl's blog entry for a detailed discussion of this topic. For a quick visual tour of Vayu, see the annotated photo below. The major components are: A = Nehalem 4-core processor; B = Memory DIMMs; C = Mellanox Connect-X QDR InfiniBand ASIC; D = Tylersburg I/O chipset; E = Baseboard Management Controller (BMC) / Service Processor (one per node); F = Sun Flash Modules (SATA, 24GB per node.) The connector at the top right supports two PCIe2 interfaces, two InfiniBand interfaces, and 2 GbE interfaces. The GbE logic is hiding under the BMC daughter board.
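The chassis arithmetic above can be reproduced with a short Python sketch. Two values are assumptions on my part rather than figures from this entry: the 2.93 GHz clock (the speed quoted for the Jülich nodes elsewhere on this page) and 4 double-precision floating-point operations per core per cycle for Nehalem. The constant and function names are made up for illustration.

```python
NODES_PER_BLADE = 2
BLADES_PER_SHELF = 12
SHELVES_PER_CHASSIS = 4
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 4
MAX_GB_PER_NODE = 96

CLOCK_GHZ = 2.93       # assumption: clock quoted for the Jülich nodes
FLOPS_PER_CYCLE = 4    # assumption: double-precision flops per core per cycle


def nodes_per_chassis():
    """Number of Vayu nodes in one fully populated Constellation chassis."""
    return NODES_PER_BLADE * BLADES_PER_SHELF * SHELVES_PER_CHASSIS


def peak_tflops(nodes):
    """Theoretical peak in TFLOPs for the given number of nodes."""
    cores = nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET
    return cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000.0


def max_memory_tb(nodes):
    """Maximum memory in TB (using 1 TB = 1000 GB) for the given node count."""
    return nodes * MAX_GB_PER_NODE / 1000.0


n = nodes_per_chassis()
print(f"{n} nodes, {peak_tflops(n):.1f} TFLOPs peak, {max_memory_tb(n):.1f} TB max memory")
```

Under the same assumptions, calling peak_tflops(2208) gives about 207 TFLOPs, which is consistent with the peak figure quoted for JuRoPa.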
To accommodate this high-density blade approach we've developed a new QDR InfiniBand NEM (officially called the SunBlade 6048 QDR IB Switched NEM), which is shown below. This Network Express Module plugs into a blade shelf's midplane and forms the first level of InfiniBand fabric in a cluster configuration. Specifically, the two on-board 36-port Mellanox QDR switch chips act as leaf switches from which larger configurations can be built. Of the 72 total switch ports available, 24 are used for the Vayu blades and 9 from each switch chip are used to interconnect the two switches, leaving a total of 72 - 24 - 2*9 = 30 QDR links available for off-shelf connections. These links leave the NEM through 10 physical connectors, each of which carries three QDR X4 links over a single cable. As we discussed when the first version of Constellation was released, aggregating 4X links into 12X cables results in significant customer benefits related to reliability, density, and complexity. In any case, these cables can be connected to Constellation switches to form larger, tree-based fabrics. Or they can be used in switchless configurations to build torus-based topologies by using the ten cables to carry X, Y, and Z traffic between shelves and across chassis. As mentioned, for example, here. The NEM also provides GbE connectivity for each of the 24 nodes in the blade shelf.
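For readers who want to check the NEM port budget described above, here is the same bookkeeping as a small Python sketch. All of the numbers come from the paragraph above; the constant and function names are my own.

```python
SWITCH_CHIPS = 2
PORTS_PER_CHIP = 36
NODE_PORTS = 24                  # one QDR port per node in the shelf
INTERSWITCH_LINKS_PER_CHIP = 9   # ports on each chip used to reach the other chip
LINKS_PER_CABLE = 3              # three 4X links share one 12X cable


def uplink_budget():
    """Return (total ports, off-shelf links, 12X connectors, leftover links)."""
    total_ports = SWITCH_CHIPS * PORTS_PER_CHIP
    interswitch_ports = SWITCH_CHIPS * INTERSWITCH_LINKS_PER_CHIP
    uplinks = total_ports - NODE_PORTS - interswitch_ports
    connectors, leftover = divmod(uplinks, LINKS_PER_CABLE)
    return total_ports, uplinks, connectors, leftover


total, uplinks, connectors, leftover = uplink_budget()
print(f"{total} ports, {uplinks} off-shelf QDR links, "
      f"{connectors} 12X connectors, {leftover} links left over")
```

The 30 off-shelf links divide evenly into the 10 physical connectors, with nothing left over.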
Looking back at the TACC photo, we can now double the compute density shown there with our newest Constellation-based systems using the new Vayu blade. Oh, and by the way, we can also remove those inline cooling units and pack those chassis side-by-side for another significant increase in compute density. I'll leave how we accomplish that last bit for a future blog entry.
(2009-04-14 06:05:00.0) Permalink Comments [0]
HPC in Second Life (and Second Life in HPC)
We held an HPC panel session yesterday in Second Life for Sun employees interested in learning more about HPC. Our speakers were Cheryl Martin, Director of HPC Marketing; Peter Bojanic, Director for Lustre; Mike Vildibill, Director of Sun's Strategic Engagement Team (SET); and myself. We covered several aspects of HPC: what it is, why it is important, and how Sun views it from a business perspective. We also talked about some of the hardware and software technologies and products that are key enablers for HPC: Constellation, Lustre, MPI, etc. As we were all in-world at the time, I thought it would be interesting to ponder whether Second Life itself could be described as "HPC" and whether we were in fact holding the HPC meeting within an HPC application. Having viewed this excellent SL Architecture talk given by Ian (Wilkes) Linden, VP of Systems Engineering at Linden Lab, I conclude that SL is definitely an HPC application. Consider the following information taken from Ian's presentation.
As you can see, the geography of SL has been exploding in size over the last 5-6 years. As of Dec 2008 that geography is simulated using more than 15K instances of the SL simulator process that, in addition to computing the physics of SL, also run an average of 30 million simultaneous server-side scripts to create additional aspects of the SL user experience. And look at the size of their dataset: 100TB is very respectable from an HPC perspective. And a billion files! Many HPC sites are worrying about what will happen when they get to that level of scale while Linden Lab is already dealing with it. I was surprised they aren't using Lustre, since I assume their storage needs are exploding as well. But I digress.
The SL simulator described above would be familiar to any HPC programmer. It's a big C++ code. The problem space (the geography of SL) has been decomposed into 256m X 256m chunks that are each assigned to one instance of the simulator. Each simulator process runs on its own CPU core and "adjacent" simulator instances exchange edge data to ensure consistency across sub-domain boundaries. And it's a high-level physics simulation. Smells like HPC to me.
(2009-04-03 10:53:27.0) Permalink Comments [2]
Amazon EC2: More Reserved Than Ever
Back in September, I expressed skepticism that a purely on-demand model for cloud computing would be sufficient for businesses to seriously commit to the cloud model as a way to run their businesses. Apparently, Amazon is an avid reader of the Navel (joke!), because they recently announced a new resource model -- Reserved Instances -- that in large part addresses the issue I raised. Specifically, it is now possible to pay up-front to reserve an instance for use over some period of time. In addition, when the instance is actually used, the rate is lower than for purely on-demand instances. This hybrid model will appeal to those customers who worry about resource availability as demand for cloud computing resources continues to grow and certainly to those who have developed a business- or mission-critical reliance on access to these remote resources. (2009-03-20 14:29:11.0) Permalink Comments [0] Australian Supercomputing: Who's Da BOM?
The Australian press has an article today about a new Sun supercomputer to be installed at the Australian Bureau of Meteorology (BOM.) The new 1.5 TFLOP machine, which will be ten times more powerful than their current system, is said to be the largest in the southern hemisphere. The article is here. (2009-03-18 20:01:44.0) Permalink Comments [0] More Free HPC Developer Tools for Solaris and Linux The Sun Studio team just released the latest version of our HPC developer tools with so many enhancements and additions it's hard to know where to start this blog entry. I suppose with the basics: As usual, all of the software is free. And available for both Solaris and Linux, specifically Solaris, OpenSolaris, RHEL, SuSE, and Ubuntu. Frankly, Sun would like to be your preferred provider for high-performance Fortran, C, and C++ compilers and tools. Given the performance and capabilities we deliver for HPC with Sun Studio, that seems a pretty reasonable goal to me. We think the price has been set correctly to achieve that as well. :-) I have to admit to being confused by the naming convention for this release, but it goes something like this. The release is an EA (Early Access) version of Sun Studio 12 Update 1 -- the first major update to Sun Studio 12 since it was released in the summer of 2007. Since Sun Studio's latest and greatest bits are released every three months as part of the Express program, this release can also be called Sun Studio Express 3/09. Different names, same bits. Don't worry about it -- just focus on the fact that they make great compilers and tools. :-) Regardless of what they call it, the release can be downloaded here. Take it for a spin and let the developers know what you think on the forum or file a request for enhancement (RFE) or a bug report here. For the full list of new features, go here. For my personal list of favorite new features, read on.
As I mentioned above, full details on these new features and many, many more are all documented on this wiki page. And, again, the bits are here.
(2009-03-18 15:15:22.0) Permalink Comments [0]
HPC and Virtualization: Oak Ridge Trip Report
Just before Sun's Winter Break, I attended a meeting at Oak Ridge National Laboratory in Tennessee with Stephen Scott, Geoffroy Vallee, Christian Engelmann, Thomas Naughton, and Anand Tikotekar, all of the Systems Research Team (SRT) at ORNL. Attending from Sun were Tim Marsland, Greg Lavender, Rebecca Arney, and myself. The topic was HPC and virtualization, an area the SRT has been exploring for some time and one I've been keen on as well as it has become clear v12n has much to offer the HPC community. This is my trip report. I arrived at Logan Airport in Boston early enough on Monday to catch an earlier flight to Dulles, narrowly avoiding the five-hour delay that eventually afflicted my original flight. The flight from Boston to Knoxville via Dulles went smoothly and I arrived without difficulty to a rainy and chilly Tennessee evening. I was thrilled to have made it through Dulles without incident since more often than not I have some kind of travel difficulty when my trips pass through IAD (more on that later.) The 25 mile drive to the Oak Ridge DoubleTree was uneventful. Oak Ridge is still very much a Lab town from what I could see, much like Los Alamos, but certainly less isolated. Movie reviews in the Oak Ridge Observer are rated with atoms rather than stars. Stephen Scott, who leads the System Research Team (SRT) at ORNL, mentioned that the plot plan for his house is stamped "Top Secret -- Manhattan Project" because the plan shows the degree difference between "ORNL North" and "True North", an artifact of the time when period maps of the area deliberately skewed the position of Oak Ridge to lessen the chance that a map could be used to successfully bomb ORNL targets from the air during the war. We spent all day Tuesday with Stephen and most of the System Research Team. Tim talked about what Sun is doing with xVM and our overall virtualization strategy and ended with a set of questions that we spent some time discussing. 
Greg then talked in detail about both Crossbow and InfiniBand, specifically the aspects of each that relate to virtualization. We spent the rest of the day hearing about some of the work on resiliency and virtualization being done by the team. See the end of this blog entry for pointers to some of the SRT papers as well as other HPC/virtualization papers I have found to be interesting.
Resiliency isn't something the HPC community has traditionally cared much about. Nodes were thin and cheap. If a node crashed, restart the job, replace the node, use checkpoint-restart if you can. Move on; life on the edge is hard.
But the world is changing. Nodes are getting fatter again--more cores, more memory, more IO. Big SMPs in tiny packages with totally different economics from traditional large SMPs. Suddenly there is enough persistent state on a node that people start to care how long their nodes stay up. Capabilities like Fault Management start to look really interesting, especially if you are a commercial HPC customer using HPC in production.
In addition, clusters are getting larger. Much larger, even with fatter nodes. Which means more frequent hardware failures. Bad news for MPI, the world's most brittle programming model. Certainly, some more modern programming models would be welcome, but in the meantime what can be done to keep these jobs running longer in the presence of continual hardware failures? This is one promise of virtualization. And one reason why a big lab like ORNL is looking seriously at virtualization technologies for HPC.
Live migration -- the ability to shift running OS instances from one node to another -- is particularly interesting from a resiliency perspective. Linking live migration to a capable fault management facility (see, for example, what Sun has been doing in this area) could allow jobs to avoid interruption due to an impending node failure.
Research by the SRT (see the Proactive Fault Tolerance paper, below) and others has shown this is a viable approach for single-node jobs and also for increasing the survivability of MPI applications in the presence of node failures. Admittedly, the current prototype depends on Xen TCP tricks to handle MPI traffic interruption and continuation, but with sufficient work to virtualize the InfiniBand fabric, this technique could be extended to that realm as well. In addition, the use of an RDMA-enabled interconnect can itself greatly increase the speed of live migration, as is demonstrated in the last paper in the reading list below.
We discussed other benefits of virtualization. Among them is the use of multiple virtual machines per physical node to simulate a much larger cluster, so that an application's basic scaling capabilities can be demonstrated before it is allowed access to a real, full-scale (and expensive) compute resource. Such pre-testing becomes very important in situations in which large user populations are vying for access to relatively scarce, large-scale, centralized research resources.
Geoffroy also spoke about "adapting systems to applications, not applications to systems" by which he meant that virtualization allows an application user to bundle their application into a virtual machine instance with any other required software, regardless of the "supported" software environment available on a site's compute resource. Being able to run applications using either old versions of operating systems or perhaps operating systems with which a site's administrative staff has no experience does truly allow the application provider to adapt the system to their application without placing an additional administrative burden on a site's operational staff. Of course, this does push the burden of creating a correct configuration onto the application provider, but the freedom and flexibility should be welcomed by those who need it.
Those who don't could presumably bundle their application into a "standard" guest OS instance. This is completely analogous to the use and customization of Amazon Machine Images (AMIs) on the Amazon Elastic Compute Cloud (EC2) infrastructure.
Observability was another simpatico area of discussion. DTrace has taken low-cost, fine-grained observability to new heights (new depths, actually). Similarly, SRT is looking at how one might add dynamic instrumentation at the hypervisor level to offer a clearer view of where overhead is occurring within a virtualized environment, both to promote user understanding and to offer a debugging capability for developers.
A few final tidbits to capture before closing. Several other research efforts are looking at HPC and virtualization. Among them are V3VEE (University of New Mexico and Northwestern University) and XtreemOS (a bit of a different approach to virtualization for HPC and Grids). SRT is also working on a virtualized version of OSCAR called OSCAR-V.
The Dulles Vortex of Bad Travel was more successful on my way home. My flight from Knoxville was delayed with an unexplained mechanical problem that could not be fixed in Knoxville, requiring a new plane to be flown from St. Louis. I arrived very late into Dulles, about 10 minutes before my connection to Boston was due to leave from the other end of the terminal. I ran to the gate, arriving two minutes before the flight was scheduled to depart, and it was already gone--no sign of the gate agents or the plane. Spent the night at an airport hotel and flew home first thing the next morning. Dulles had struck again--this was at least the third time I've had problems like this when passing through IAD. I have colleagues who refuse to travel with me through this airport. With good reason, apparently.
Reading list:
Proactive Fault Tolerance for HPC with Xen Virtualization, Nagarajan, Mueller, Engelmann, Scott
The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Computational Kernels and Software, Youseff, Seymour, You, Dongarra, Wolski
Performance Implications of Virtualizing Multicore Cluster Machines, Ranadive, Kesavan, Gavrilovska, Schwan
High Performance Virtual Machine Migration with RDMA over Modern Interconnects, Huang, Gao, Liu, Panda
(2009-01-09 09:47:13.0) Permalink Comments [4]
Beta Testers Wanted: Sun Grid Engine 6.2 Update 2
A busy day for fresh HPC bits, apparently... The Sun Grid Engine team is looking for experienced SGE users interested in taking their latest Update release for a test drive. The Update includes bug fixes and some new features as well. Two features in particular caught my eye: a new GUI-based installer and optimizations to support very large Linux clusters (think TACC Ranger). Full details are below in the official call for beta testers. The beta program will run until February 2nd, 2009. Look no further for something to do during the upcoming holiday season. :-)
Sun Grid Engine 6.2 Update 2 Beta (SGE 6.2u2beta) Program
This README contains important information about the targeted audience of this beta release, new functionality, the duration of this SGE beta program and your possibilities to get support and provide feedback.
(2008-12-18 11:30:25.0) Permalink Comments [0]
Fresh Bits: InfiniBand Updates for Solaris 10
Fresh InfiniBand bits for Solaris 10 Update 6 have just been announced by the IB Engineering Team:
The Sun InfiniBand Team is pleased to announce the availability of the Solaris InfiniBand Updates 2.1. This comprises updates to the previously available Solaris InfiniBand Updates 2. InfiniBand Updates 2 has been removed from the current download pages. (Previous versions of InfiniBand Updates need to be carefully matched to the OS Update versions that they apply to.)
The primary deliverable of Solaris InfiniBand Updates 2.1 is a set of updates of the Solaris driver supporting HCAs based on Mellanox's 4th generation silicon, ConnectX. These updates include the fixes that have been added to the driver since its original delivery, and functionality in this driver is equivalent to what was delivered as part of OpenSolaris 2008.11. In addition, there continues to be a cxflash utility that allows Solaris users to update firmware on the ConnectX HCAs. This utility is only to be used for ConnectX HCAs.
Other updates include: All are compatible with Solaris 10 10/08 (Solaris 10, Update 6), for both SPARC and x86.
You can download the package from the "Sun Downloads" A-Z page by visiting http://www.sun.com/download/index.jsp?tab=2 and scrolling down or searching for the link for "Solaris InfiniBand (IB) Updates 2.1" or alternatively use this link.
Please read the README before installing the updates. This contains both installation instructions and other information you will need to know before running this product.
Please note again that this Update package is for use on Solaris 10 10/08 (Solaris 10, Update 6) only. A version of the Hermon driver has also been integrated into Update 7 and will be available with that Update's release.
Congratulations to the Solaris IB Hermon project team and the extended IB team for their efforts in making this product available!
(2008-12-18 08:06:27.0) Permalink Comments [0]
Random Notes from the IDC HPC Breakfast Briefing
I went to the IDC HPC Breakfast Briefing yesterday morning because these briefings are usually pretty interesting. This one felt mostly like a rehash of earlier material and was somewhat disappointing as a result. I did hear a few things I thought were worth passing on, and here they are.
I made the above graph based on a table that was flashed quickly on the screen during the briefing. If N was specified, I didn't catch it. It is amazing (depressing?) to see how few ISV applications actually scale beyond 32 processors, even after all these years. I showed the graph to Dave Teszler, US Practice Manager for HPC, and he confirmed that he sees lots of commercial HPC customers who buy large clusters, but who really use them as throughput machines where the unit of throughput might be a 32-process job or smaller. In other words, just because a customer buys a 1024-node cluster and is known to use MPI, one cannot assume they are running 1024-process MPI jobs as one can with other kinds of customers like the National Labs or other large supercomputing centers.
Other notes jotted during the meeting:
(2008-11-20 07:12:15.0) Permalink Comments [0]
Bjorn to be Wild! Fun at Supercomputing '08
It's been crazy-busy here at the Sun booth at Supercomputing '08 in Austin, but we do get to have some fun as well. This is Bjorn Andersson, Director of HPC for Sun. He is Bjorn to be Wild. This photo reminded my friend Kai of Fjorg 2008. Worth a look.
(2008-11-20 06:53:37.0) Permalink Comments [2] |