Tuesday June 17, 2008
The Navel of Narcissus: Josh Simons' Coordinates in the Blogosphere
Burton Smith: The Killer Micros II: The Software Strikes Back

My notes from Burton Smith's talk in the Cluster session at ISC 2008 in Dresden.

The Killer Micros II: The Software Strikes Back
Dr. Burton Smith, Microsoft Corporation, USA

He has spent 30 years in High Performance Computing, though HPC is not his day job at Microsoft; he does some of it as night work. His day job is parallel computing on clients -- general purpose parallel computing -- multicore.

Cluster Software is Primitive. Programming is at too low a level (C++, OpenMP, MPI). Tools are too few and too thinly supported -- the cluster market is not big enough. Applications are too cluster-specific because the infrastructure is different on various clients -- islands of cluster specificity.

Big Changes are Coming. First, the many-core inflection point -- parallel computing comes to the mainstream. Second, cloud and corporate computing -- better data searches and access, and service-based application software. SOAP, AJAX, XML? Yes, partly. But it is really about breaking applications into communicating pieces and distributing them.

At SC '89, Eugene Brooks predicted that mainstream hardware would dominate the HPC area: "None will survive the attack of the killer micros!" Now it is software's turn, meaning that HPC will soon be dominated by software from outside of HPC.

Client Many-Core Parallelism. Client computing will soon be parallel, even on mobile devices and phones. Why? More absolute performance, and more performance per watt. Parallel languages and tools are underway -- they are needed to use the new hardware well. And this does not mean adding parallel for-loops to C++.

Cluster versus mainstream software. SPMD parallelism with OpenMP versus mixed task/data parallelism; fixed processor counts versus variable processor counts; MPI versus Internet and MPI; C++ versus C#, F#, Excel, SQL, C++; file system versus databases, cloud, file system. Emerging mainstream software has richer capabilities.

Software as a Service.
Software services will run everywhere -- in clients, on servers, in the cloud -- which brings distributed computing to the forefront. Service-based apps are distributed, with data intensity where the data are and computational intensity where the processors are.

Conclusions. Cluster software is still primitive because the market is so small. Two major revolutions are now underway in computing that will change the landscape. Clusters will be affected by these changes.

(2008-06-17 09:13:39.0)

HPC Consortium: Sun Recognizes First Constellation System Customers

At the Sun HPC Consortium gala dinner last night here in Dresden, Marc Hamilton, Sun's Vice President of Systems Practice Americas, recognized Sun's first four Sun Constellation System customers and presented each with a plaque. The text of the plaques is shown below along with some photos of the awards.
(2008-06-17 06:58:05.0)

HPC Consortium: University of Oslo

The last customer talk at the Sun HPC Consortium here in Dresden was given by Jostein Sundet, who is Program Director for Research Computing Services and Adjunct Professor in Atmospheric Sciences, University of Oslo. He spoke about the University's HPC requirements and infrastructure and about how their new Sun Constellation System enables both capability and capacity computing for the University and its customers.

The University of Oslo is the largest university in Norway. And they now have the largest InfiniBand switch in the world -- the Sun Datacenter Switch 3456 (a.k.a. the Magnum switch), which is currently used to connect 96 Sun Blade 6048 systems to create a Sun Constellation System. In addition, they have seven Sun Fire X4600 8-way systems with up to 256 GB memory, 450 Sun Fire X2200 M2 nodes with two AMD quad-core processors each, 96 Dell 1425 systems, and about 1 petabyte of storage.

The user base includes about 250 active users, including national (advanced) users and local University of Oslo users who may not be highly skilled in computing and come from a wide range of disciplines, including politics. The site also acts as a backup site for the Norwegian MET-Office's operational weather forecast and functions as a Tier1 and Tier2 facility for CERN, while also participating in the National GRID.

Operational goals of the facility are to allow easy access for all users, both the elevated top (capability) users and the deepening base of capacity users. To do this, the university has acquired their Sun Constellation System, a homogeneous, InfiniBand fat-tree based HPC cluster system, and coupled it with advanced job scheduling capabilities that support backfilling, suspend/resume/migrate, and resource pooling.

Backfilling is critical for supporting increased system utilization in a mixed-use environment.
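
To make the backfilling idea concrete, here is a minimal Python sketch that compares a plain FIFO queue with a simple conservative backfill rule. It is an illustration only and not the scheduler used at Oslo: the `Job` fields, the `schedule` function, the 8-core machine, and the job mix are all hypothetical, and the rule assumes that the runtime estimates are exact.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int    # cores requested (hypothetical field)
    runtime: int  # runtime estimate in arbitrary time units, assumed exact


def schedule(jobs, total_cores, backfill=True):
    """Return {job name: start time} for a FIFO queue, optionally with backfilling."""
    if any(job.cores > total_cores for job in jobs):
        raise ValueError("a job requests more cores than the machine has")
    queue = list(jobs)
    running = []  # (end_time, cores) for each job currently running
    starts = {}
    now = 0
    while queue:
        # Release the cores of jobs that have finished by now.
        running = [(end, cores) for end, cores in running if end > now]
        free = total_cores - sum(cores for _, cores in running)
        # Start jobs from the head of the queue for as long as they fit.
        while queue and queue[0].cores <= free:
            job = queue.pop(0)
            starts[job.name] = now
            running.append((now + job.runtime, job.cores))
            free -= job.cores
        if not queue:
            break
        # The head job is blocked: find the earliest time it could start.
        head, available, reservation = queue[0], free, now
        for end, cores in sorted(running):
            available += cores
            reservation = end
            if available >= head.cores:
                break
        if backfill:
            # Let later jobs jump ahead only if they fit in the idle cores
            # and finish before the head job's reserved start time.
            for job in list(queue[1:]):
                if job.cores <= free and now + job.runtime <= reservation:
                    queue.remove(job)
                    starts[job.name] = now
                    running.append((now + job.runtime, job.cores))
                    free -= job.cores
        # Advance the clock to the next job completion.
        now = min(end for end, _ in running)
    return starts


# Hypothetical job mix on a hypothetical 8-core machine.
jobs = [Job("A", 6, 10), Job("B", 8, 5), Job("C", 2, 4), Job("D", 2, 10)]
fifo = schedule(jobs, total_cores=8, backfill=False)
backfilled = schedule(jobs, total_cores=8, backfill=True)
print("FIFO:    ", fifo)
print("backfill:", backfilled)
```

In this toy run the small job C starts at time 0 instead of time 15, using the two cores that would otherwise sit idle, while the large job B still starts at time 10. Job D is not backfilled because it would not finish before B's reserved start.
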
By introducing backfilling, utilization was raised from about 60% to over 80%, since smaller jobs can now be scheduled ahead of larger jobs while the larger jobs are waiting for resources. This has allowed users to temporarily exceed their job quota; in one case a user was able to run 1263 concurrent jobs, well above the usual limit of 384 jobs.

Titan, the Sun Constellation System to be used for both capacity and capability workloads, is a 4000-core cluster built with AMD quad-core processors using the Sun Datacenter 3456 centralized switch to provide InfiniBand connectivity for the system. The parallel file system is centralized and implemented with IBM's GPFS. Initial performance measurements demonstrated 965 MB/sec between nodes (point-to-point) and 780 MB/sec per node when performing a 96-node all-to-all MPI communication. Latency was measured at about 2 ms.

Primary challenges included some early firmware issues, which were solved by Sun, and connector and cable issues, a problem also mentioned by Karl Schulz in an earlier Consortium talk.

(2008-06-17 06:22:53.0)

Jack Dongarra: Four Important Concepts to Consider when Using Computing Clusters

My notes from Jack Dongarra's talk in the Cluster session at ISC 2008 in Dresden.

Four (maybe 6) Important Concepts to Consider when Using Computing Clusters
Jack Dongarra, University of Tennessee & Oak Ridge National Laboratory, USA

The concepts are:
In the "old days" processors became faster each year. Today the clock speed is fixed or slowing. But things are still doubling every 18-24 months. The number of cores will double roughly every two years while clock speed decreases (or at least does not increase). We will need to deal with millions of threads in a system.

TOP500 perspective. A regular, milestone increase in system peak performance every 11 years: 1 GFLOP 22 years ago (1 thread), 1 TFLOP 11 years ago (10^3 threads), 1 PFLOP now (10^6 threads). Extrapolate to an ExaFLOP: 10^9 threads in about 2019.

How to code for these multicore systems? Fine granularity to support a high level of parallelization. Asynchrony will be important as granularity becomes finer. We must rethink the design of our software. This is a bigger disruption than the move from shared memory to message passing. There was much resistance then and there will likely be more at this transition point.

Steps in a current LAPACK LU decomposition with partial pivoting. Most of the parallelization comes from the matrix-multiplication step. It is bulk synchronous: scalar step, parallel, scalar, parallel... in synchrony. Very inefficient use of resources. A more event-driven, multithreaded approach is needed, based on an analysis of the directed graph of the computation. Adaptive lookahead. Not a new idea, but he is reviving it to exploit these new architectures.

PLASMA is a redesign of LAPACK/ScaLAPACK. Asynchrony, dynamic scheduling, fine granularity, locality of reference. He shows a graph of LU performance of PLASMA against LAPACK and the AMD and Intel implementations -- it does better at all or most problem sizes. Same for QR. And Cholesky. Considerably better performance at all problem sizes.

Performance of single precision on conventional processors. Roughly a factor of two in performance over double precision, looking at SGEMM/DGEMM. Less compute, less data being moved. Exploit 32-bit as much as possible. Some algorithms can use mixed precision. Intriguing potential.
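
One well-known technique of this kind is mixed-precision iterative refinement: solve the linear system cheaply in single precision, then compute the residual in double precision and apply corrections until the answer reaches double-precision accuracy. The notes do not say which algorithms the talk had in mind, so the following is only a sketch under assumptions. It is pure Python, it emulates single precision by rounding through `struct`, all function names are hypothetical, and unlike a real implementation it repeats the elimination for every correction instead of reusing the LU factors.

```python
import struct


def f32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_sp(A, b):
    """Gaussian elimination with partial pivoting, rounding every result to single precision."""
    n = len(b)
    # Augmented matrix [A | b], stored as single-precision values.
    M = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[pivot] = M[pivot], M[k]
        for i in range(k + 1, n):
            factor = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(factor * M[k][j]))
    # Back substitution, also in emulated single precision.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x


def residual(A, x, b):
    """r = b - A x, computed in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, iterations=5):
    """Single-precision solve followed by double-precision residual corrections."""
    x = solve_sp(A, b)
    for _ in range(iterations):
        r = residual(A, x, b)        # double precision
        d = solve_sp(A, r)           # cheap single-precision correction
        x = [xi + di for xi, di in zip(x, d)]  # update kept in double precision
    return x


# Small, well-conditioned example system (made up for illustration).
A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
x_true = [0.1, -0.3, 0.7]
b = [sum(a * x for a, x in zip(row, x_true)) for row in A]


def max_error(x):
    return max(abs(xi - ti) for xi, ti in zip(x, x_true))


err_sp = max_error(solve_sp(A, b))
err_refined = max_error(refine(A, b))
print("single precision only:", err_sp)
print("after refinement:     ", err_refined)
```

The bulk of the arithmetic stays in single precision, where the notes report roughly a factor-of-two speed advantage, and only the residual and the running solution are kept in double precision. This approach relies on the matrix being reasonably well conditioned; for badly conditioned systems the corrections may not converge.
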
Automatically switch between SP and DP to match desired accuracy. Potential for GPU and FPGA -- use as little precision as you can get away with.

Performance Optimization. Embed self-tuning optimizations into algorithms so codes can adapt themselves to the underlying system architecture. Past successes: ATLAS, FFTW, Spiral, Open MPI (which will adapt itself to the characteristics of the global communication interconnect).

Fault Tolerance. MPI is not up to the task of dealing with faults. Mismatch between hardware and programming. Two kinds of faults are important: Erasures -- lose a resource/processor -- and Errors -- detect an error and recover. Much work has been done looking at these issues -- they will become more important.

Conclusions. For the last decade, research has tilted in favor of hardware. We need to rebalance this -- barriers to progress are increasingly in software. Hardware has a half-life of a few years; software's is decades. Parallelism is exploding. Performance will be a software problem. Locality will continue to be important. Massive parallelism is required, including pipelining and overlap.

(2008-06-17 04:25:58.0)

Thomas Sterling: The Idea of Clusters

My notes from Thomas Sterling's talk in the Cluster session at ISC 2008 in Dresden.

The Idea of Clusters -- from a personal Beowulf perspective
Thomas Sterling, Louisiana State University

Where we are and how we got here, the drivers pushing commodity clusters forward, and clusters in the sunset of Moore's Law: they will still be clusters, but they will look different.

Definition of a Commodity Cluster. A distributed/parallel computing system, constructed entirely from commodity subsystems, with two major subsystems (compute nodes and system area network).

Use of Commodity Clusters. Science and Engineering, Manufacturing, Financial, Commerce, and a large role in Search Engines. And clusters dominate the TOP500 list -- more than 70% of systems.

Early History of Cluster Highlights. SAGE for NORAD was essentially a cluster built in 1957.
Ethernet in 1976. First NOW workstation cluster at UC Berkeley in 1993. Myrinet introduced in 1993. Beowulf in 1993. MPI standard in 1994. Gordon Bell Prize for price-performance in 1997...

UC Berkeley NOW Project. 32-40 SPARCStation 10 and 20 nodes. ATM interconnect and then later Myrinet. First cluster in the TOP500 list.

On the East Coast, the NASA Beowulf Project. Three generations between 1994 and 1996: Wiglaf, Hrothgar, and Hyglac. 16 nodes each. Established the vision of low-cost HPC. Empowerment: users took control and were no longer at the mercy of vendors.

Standardization of interfaces was an important driver of clustering. The PCI standard replaced VESA and ISA. Fast and Gigabit Ethernet -- cost effective, multiple vendors, clustering able to directly leverage LAN technology and market. And then Myrinet appeared with low latency (11 usec), scalable to thousands of hosts, though at a higher price point than Ethernet.

Performance wasn't the best, but more scientists could get their hands on these systems. They could build it themselves and stick it in a closet. With considerable pain and effort, they could get the systems to work better. And the cost-performance was 10X better than vendor solutions.

Open Source Software, while not essential, became a motivator and driver of development of clusters. It allowed customers to build their own cluster software. PVM was the first message-passing standard. And then came MPI, through the community coming together to create a standard. It was a joining of the cluster and MPP communities at the software level -- important.

More middleware was needed as clusters became more shared resources. Maui, PBS, etc., were developed as workload management systems with support for MPI. Condor for throughput computing.

Basic Principles: Performance to Cost (low hanging fruit), Flexibility (inmates are in control), and Leverage of Technology Opportunities (scum sucking bottom feeders).

Key driver today is multi-core.
All cluster nodes are now parallel computers. How we manage this is a real issue. InfiniBand is taking hold as price comes down and performance goes up. Heterogeneous accelerators like ClearSpeed boards, NVIDIA Tesla, AMD FireStream, etc. There is also the potential of FPGAs. They run 10-100 times slower, but they can show exceptional speedups on certain applications.

New things that may be coming next. 3D packaging, lightweight cores, processors in or near memory (PNM), embedded heterogeneous architectures (combining PNM with streaming architectures), smarter memories (transactional memory).

Clusters are in a Phase Change. The next phase change may be driven by clusters as we deal with the model of computation, operating systems, and programming models.

Goals of a new model of parallel computation. Address the dominant challenges: latency, overhead, starvation, resource contention, and programmability. The ParalleX project was held out as an exemplar of an approach that attempts to address these issues.

Clusters at Nanoscale. Clusters are forever. It took 15 years to reach dominance. Technology pressures will drive dramatic change -- component types, usage models, software stack, and programming methods. And classes of applications are about to go through significant change -- knowledge economy, machine intelligence, dynamic directed graphs.

(2008-06-17 04:01:23.0)

HPC Consortium: University of Ulm's Solaris Geek

Yesterday afternoon, Thomas Nau, who is Head of the Infrastructure Department at the University of Ulm and a self-described Solaris geek, gave a talk titled "Storage the Solaris Way" at the Sun HPC Consortium meeting here in Dresden. The main points of his talk were an overview of the ZFS value proposition and a quick tour through cool things one can do with Solaris out of the box, for example using iSCSI and the various network attached storage solutions available as part of Solaris.

Thomas first reminded the audience of what he and most other people in the HPC community want in a storage solution: safety and reliability, fast error detection and correction, performance, expandability, and interoperability via open standards. All of these are offered by ZFS.

With respect to safety and reliability, Thomas mentioned the following ZFS attributes:
With respect to built-in Solaris storage options, Thomas took the audience through a whirlwind tour of Solaris network attached storage (NAS) capabilities as well as block-level access using iSCSI. He also managed to demo all of this using his laptop, which was running two virtual machines called Angelina and Brad.

As shipped, Solaris has built-in support for NFSv4, Samba, and (in OpenSolaris) CIFS. As Thomas pointed out, the Samba implementation has been modified to support ZFS as a virtual file system back end, and the CIFS server has been implemented in the kernel for maximum performance advantage.

To demonstrate iSCSI, Thomas set up several storage pools and then exported them via iSCSI as normal disks from Angelina to Brad, where he mounted the disks in a mirrored configuration, which was all quite easy to do. Your correspondent, however, was not fast enough to capture the details of the demo. I expect that slides will be made available on the HPC Consortium website at some point.

In terms of performance, a test done at Ulm showed that the iSCSI approach, which used several small storage arrays onto 2x2 redundant x4500 servers, delivered performance comparable to a previous FC-AL solution that had been used with several small storage arrays.

(2008-06-17 02:16:53.0)

HPC Consortium: Technical University of Denmark

Bernd Dammann, Associate Professor at the Technical University of Denmark, spoke yesterday afternoon at the Sun HPC Consortium meeting here in Dresden. His talk focused on the benefits of Sun's Studio compiler and tools suite for education and HPC.

Sun Studio is an important tool for teaching students about HPC programming techniques at DTU, including data flow through cache-based systems, loop-based optimization techniques, pipelining, and general application tuning techniques. The particular programming course that Bernd described focuses on helping students understand how real computers work, how memory and CPUs are glued together -- the details they don't learn in more theoretical courses.

However, once the students are exposed to these techniques, they are surprised to learn that many of them are applied automatically to their codes by modern compilers. Good news for ease of use, but frustrating for engineering students, who typically don't like black boxes and prefer to understand the internal details. Sun Studio solves this problem with its compiler commentary feature.

Compiler commentary allows the programmer to view their source code annotated with compiler-generated comments that describe in detail not just which optimizations were applied to the code, but also cases in which optimizations were not applied. Bernd showed several examples that illustrated how the feature works, including the use of the er_src utility to display source code with interleaved commentary.

The benefit, of course, of the compiler commentary is that a suitably educated programmer can use the information to find additional opportunities for performance improvement by making suitable changes to the code or by activating additional compiler features. For example, one of Bernd's examples showed the use of the -xrestrict compiler flag, which was not familiar to me.
It allows the programmer to tell the compiler that pointers in the code are known not to overlap, which can potentially allow the compiler to significantly increase performance with additional optimizations. Cool.

Bernd noted that Sun's compiler commentary is "in the right place," namely in the binaries rather than in separate log files or merely displayed on a screen. Because the commentary is stored internally, it can be extracted and looked at later -- potentially long after compilation, which can be very useful.

Bernd then gave a brief overview of Sun's primary performance analysis tool, Sun Studio Performance Analyzer. He noted that it can display compiler commentary as well as a variety of application performance metrics (CPU usage, timelines, etc.). He praised it as a tool that both he and his students find very intuitive and easy to use.

Sun Studio is also used in a parallel programming course, where it makes teaching OpenMP much easier. Being able to look at performance timelines for each thread to see OpenMP overheads and using the Thread Analyzer tool to detect data races were two examples. He also liked Sun's extension to OpenMP that allows the compiler to perform automatic scoping, which can be very useful in dealing with large, legacy codes with hundreds of variables.

Students at DTU have been using the Sun Studio tools for four years and like them a lot. They wished they could use the tools on Linux. Well, now they can.

Bernd ended his talk with some favorable comparisons of Sun Studio C against GCC and Intel C and showed an example of how easy it was to use the Sun tools to parallelize and debug a large, legacy Fortran code and get good parallel performance very quickly. He concluded with the observation that Sun Studio is a world-class product that is easy to learn and use.

Bernd is now on my official list of favorite customers. :-)

(2008-06-17 02:05:11.0)