Saturday June 30, 2007 | The Navel of Narcissus Josh Simons' Coordinates in the Blogosphere |
If you think High Performance Computing (HPC) is a Small Market.... Then consider the fact that at IDC's breakfast briefing at ISC in Dresden this week, we were told HPC currently accounts for 19% of ALL worldwide server sales and 25% of (server) processors. With "niches" like that... We also learned that while overall worldwide server sale growth rates are flattening at around 3%, HPC continues to grow at a healthy 9% rate. I'll post more details after I receive the briefing materials from IDC. (2007-06-30 08:11:54.0) Permalink Comments [0] HPC Consortium: A Brief History of Solaris
Phil Harman from Sun's Solaris group gave an informative and amusing talk at the HPC Consortium meeting in Dresden this week titled, "A Brief History of Solaris." I'm hoping the full talk will be posted on the Consortium site at some point. Phil began his history of Solaris by reminding us of some of the "prehistoric" innovations in SunOS. For example, who but Sun was doing open network computing back in the 1980s with innovations like NFS, NIS, the automounter, XDR, and RPC? How about the STREAMS abstraction? mmap? ld.so? He then moved to innovations done by Sun "within living memory." His list included loadable, configurable kernels; dynamic system domains; /proc; truss; the p-tools; and /etc/nsswitch.conf. Not to mention "audacious" SMP scalability, and a compatible 32/64 bit transition strategy that maintained binary investments through our transition to 64-bit computing. Oh yes, and there was that Java thing as well... Innovations done "just yesterday" included Hierarchical Lgroup Support (HLS), Multiple Page Size Support (MPSS), containers, Service Management Facility (SMF), zones, BrandZ, ZFS, and DTrace. He finished with some comments on ZFS, which he motivated with the graphic I've placed at the top of this blog post. It illustrates the problems of single-bit errors. In this case, a printer was fined by the King of England for what amounted to a life's wages for making this error in a 1631 edition of the King James bible (known as the Wicked Bible). "Got checksums?", asks Phil as he noted that ZFS protects the datapath all the way from the rotating rust (the disk) to memory. Does the "I" in RAID mean "Inexpensive" or "Independent"? The former is correct, so why do some in our industry prefer the "independent" interpretation? Phil explained why during his talk and also in this blog entry. (2007-06-30 04:42:43.0) Permalink Comments [1] HPC Consortium: Big SMPs in Education
Bernd Dammann, Associate Professor at the Technical University of Denmark, spoke this week at Sun's HPC Consortium meeting in Dresden. The title of his talk was "Using Large SMP machines for research and education -- some experiences from the Technical University of Denmark." As part of his introduction, Bernd mentioned that the University was founded in 1829 by H.C. Orsted. The school was relocated in the 1960s to the site of a former airport, which is evident if you look at the site layout. The University is strong in a number of areas, notably wind turbine design and materials optimization--e.g. how much material can be cut away from a jet to reduce weight while still maintaining safety and structural integrity. Work has also been done on magnetic earth imaging via satellite and we were told that students at the University are very involved in corporate-sponsored eco-vehicle design contests. DTU is a Sun Center of Excellence in interval arithmetic and dynamic systems. The HPC Center at DTU has a large amount of Sun "big iron"--large SMP machines--around which the Center's computational capabilities are centered. They also have several other Sun hardware models in their machine room. Through a series of acquisitions, DTU now has onsite two Sun Fire E25Ks with 96 and 72 cores, three Sun Fire E6900s with 48 cores each, 10 V440s and a Sun Fire T2000 system. All of their SPARC machines are kept at the same revision of Solaris (currently S10 11/06), which makes the complex easy to maintain and administer with two part-time system administrators. In addition to their central compute infrastructure, DTU has the largest deployment of Sunray thin clients in Scandinavia with over 600 in use. Of these, they use about 24 as part of a "mobile classroom" that can be deployed on short notice in locations for temporary use. Students love the thin clients and appreciate the ability to access their desktop sessions from any Sunray on campus using their smartcards. 
Bernd made several interesting points with respect to their Sun compute infrastructure. First, the variety of SPARC implementations and system architectures in their compute complex is used to advantage in their High Performance Computing course to expose students to a range of systems. They are also able to explore both OpenMP and MPI on their systems. In addition, because they use this single environment for both education and research, those students who move on to become researchers already have familiarity with the full range of scientific and productivity tools deployed on the HPC compute infrastructure. He summarized the value proposition as follows. They don't have wasted desktop cycles tied up in thick clients. They have lower administration costs and have consolidated their software licenses onto their central infrastructure. In addition, they can do centralized deployment of software, they can ensure that students do not tamper with configurations while still allowing them the freedom to install their own software in $HOME, and they do not have virus issues. As a drawback, Bernd mentioned that their thin client environment was unfortunately not suitable for supporting heavy OpenGL 3D graphics for their users. After the talk, I introduced Bernd to Linda Fellingham, the engineering manager in charge of Sun's shared and scalable visualization products. As it turns out, Sun has a solution for DTU that will allow them to install an existing, but noisy, high-end graphics workstation in their central machine room and then route 3D graphics output directly to Sunrays in a seamless way. It's a pretty slick software solution (read more here). When I asked Bernd later if he found the Consortium meeting useful, he cited this interaction as an example of how meeting with Sun's engineering and other employees at such meetings is very useful for him. (2007-06-30 04:10:49.0) Permalink Comments [3] Dresden at Night
(2007-06-28 13:15:07.0) Permalink Comments [2] HPC Consortium: Making Solaris Transparent with DTrace Thomas Nau, self-described Solaris Geek and head of the Infrastructure Department at the University of Ulm, gave a talk this week on DTrace at the HPC Consortium meeting in Dresden. In his view, DTrace is the tool of choice for understanding system and application performance issues. Thomas described briefly how DTrace works to support dynamic, lightweight instrumentation of both kernel and user code with some 40,000 probe points available for use within Solaris and the D scripting language to perform custom processing. He particularly likes DTrace's aggregation facilities that support gathering and condensing data for easier interpretation (for example, creating simple, text-based histograms of data values collected during a run). He also pointed out that, contrary to what some say, DTrace does not need to be run as root. Instead one can use the Solaris RBAC (role-based access control) facility to grant particular, DTrace-specific privileges to users. These privileges are dtrace_proc, dtrace_user, and dtrace_kernel. See here for more details. For those wanting to go beyond the simple ASCII text output and graphics created by DTrace, Thomas recommended a utility called the Chime Visualization Tool, which is available on the OpenSolaris community website.
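To give a feel for the kind of text-based histogram mentioned above, here is a minimal Python sketch of power-of-two bucketing, loosely modeled on what DTrace's quantize() aggregation produces. It is not D code and not real DTrace output: the function names quantize and render are hypothetical helpers, and the read sizes are made-up numbers for illustration only.

```python
from collections import Counter


def quantize(values):
    """Count integers in power-of-two buckets (bucket label = lower bound)."""
    buckets = Counter()
    for v in values:
        # Largest power of two <= v; zero and negative values share bucket 0.
        label = 0 if v <= 0 else 1 << (v.bit_length() - 1)
        buckets[label] += 1
    return dict(sorted(buckets.items()))


def render(buckets, width=40):
    """Return one text row per bucket: label, bar scaled to the total, count."""
    total = sum(buckets.values())
    if total == 0:
        return []
    rows = []
    for label, count in buckets.items():
        bar = "@" * round(width * count / total)
        rows.append(f"{label:>8} |{bar:<{width}} {count}")
    return rows


# Hypothetical read sizes in bytes (assumed data, not measured).
sizes = [1, 3, 4, 4, 7, 8, 64, 100, 512, 700, 1000, 4096]
hist = quantize(sizes)
print("\n".join(render(hist)))
```

The point of the condensed form is the one Thomas made: twelve raw values collapse into seven rows that show at a glance where the data clusters.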
Chime sample output. Thomas mentioned that /usr/demo/dtrace contains lots of D scripts that can help the new user learn DTrace. But the Solaris Dynamic Tracing (DTrace) Guide available on docs.sun.com remains his favorite reference. (2007-06-28 13:09:35.0) Permalink Comments [0] HPC System: Ranger System at TACC As they say, things are just bigger in Texas. On Monday, several members of the staff from the Texas Advanced Computing Center (TACC) joined the HPC Consortium meeting in Dresden by remote link. The audio was unfortunately not very good, but we managed to hear most of what was said. Speakers were Jay Boisseau (TACC Director) and (I believe) Tommy Minyard (Assistant Director). If there was a third speaker, I apologize--as I said, the audio was not good. Ranger will be a 504 TFLOPs system, built using the Sun Constellation System architecture with two ultra-dense switches and almost 4000 Sun four-socket, quad-core nodes using close to 16000 AMD Barcelona processors. With 2GB of memory per core, there will be over a hundred Terabytes of memory total in the system along with 72 Sun "Thumper" storage systems with a total of 1.7 PBytes of raw storage. The InfiniBand interconnect is a 7-stage, non-blocking Clos network with latencies and bandwidths of approximately 2.3 µs and 950 Mbytes/sec, respectively. Physically, the system will reside in about 90 racks in six rows. It will require about 3.4 MW of power. Ranger will run Linux and the OpenFabrics InfiniBand stack. It will use Lustre as its cluster file system and will run two MPI libraries: MVAPICH and Open MPI (the code base on which Sun's MPI for Solaris is based). TACC will use multiple compiler suites, including Sun Studio. Sun Grid Engine will be used as Ranger's distributed resource management system. (2007-06-28 12:50:14.0) Permalink Comments [3] HPC Consortium: TSUBAME
On Monday Professor Satoshi Matsuoka updated the HPC Consortium attendees on TSUBAME, Asia's largest supercomputer center and the largest HPC system to date built with Sun technology. His talk was titled, "TSUBAME Update -- The People's Supercomputer at Tokyo Institute of Technology." Since its installation, the TSUBAME user population has grown to 10,000 user registrations, which includes 1200 registered supercomputer-class users. The system has had a lot of media exposure in Japan, including several magazine cover articles and coverage on national television. As part of creating the vision of a people's supercomputer that supports a wide variety of users, TSUBAME implements multiple usage models from best-effort to higher-quality service that is billed per unit of use. Professor Matsuoka gave us a brief overview of some of the work they've done in this area and recommended a detailed Sun Blueprint titled Sun N1 Grid Engine Software and the Tokyo Institute of Technology Supercomputer Grid for further reading. Usage continues to increase. One of Professor Matsuoka's deputies has been known to launch 20,000 simultaneous Gaussian jobs on TSUBAME! Not surprising then that in under 12 months over 1.3M ISV application runs have been done on the machine. Work was done recently to improve TSUBAME's LINPACK performance number, which is how a machine's ranking in the TOP500 list is determined. TSUBAME moved from #9 to #14 on the list in spite of having submitted a new LINPACK run demonstrating an additional 1.5 TFLOPs, which indicates just how competitive this list is. TSUBAME remains the largest supercomputer in Asia. (2007-06-27 07:47:00.0) Permalink Comments [0] PetaScale Unveiled: Photos From Dresden A few photos from the unveiling of the ultra-dense components of Sun's new Constellation System architecture tonight in Dresden.
Andy Bechtolsheim, Marc Hamilton, and a magnum of champagne
Behold The Switch!
The ultra-dense switch with cable management
Overall booth shot: crowd, press, etc.
A sea of switch ports... Sun Constellation System: Petascale Computing Done Right Today Sun is revealing in a technology preview at the International Supercomputing Conference in Dresden the approach we will use to build TACC's 500+ TFLOP Ranger system later this year, along with other large machines that have not yet been announced. We call systems built with this approach Sun Constellation Systems. Such systems can scale from TeraFLOPs up into the PetaFLOPs range. To my mind, Sun's approach to Petascale starts with this:
Yes, it's a connector. Specifically, this connector allows three 4X InfiniBand links to be run across a single cable, rather than three separate cables. The cable is also both higher quality and significantly less bulky than the three separate cables taken together. The connector itself is also mechanically and electrically superior to standard InfiniBand connectors. Sun (by which I mean AndyB and others working closely with him) has put a lot of thought into this because to effectively build a petascale system one needs to closely examine current approaches and assess whether just "doing more of the same" really gets you where you need to be or whether new thinking is needed. The above dense cabling approach complements our ultra-dense blade server and ultra-dense InfiniBand switch components. The blade server supports 48 blades (768 cores) in a single rack/chassis (the rack is the chassis in this case) and the switch sports over 3000 4X InfiniBand ports with all of the complexity of a multi-stage network compressed into a single chassis, removing the need for a huge number of cables and a large number of intermediate, discrete InfiniBand switches. Compared to a conventional approach to building a large cluster, the Sun Constellation System uses 1/6 the cables, a 20% smaller footprint, and one switch rather than 300 (not a typo). The switch component is a double-wide chassis and it looks like this:
The blade chassis looks like this:
I'm walking over to the show floor at ISC here in Dresden in an hour or so and will post some photos later tonight. (2007-06-26 07:53:09.0) Permalink Comments [0] HPC Consortium: Shared Memory Parallelization on Multi-Core Processors
While I enjoyed Barton and Ruud's talks about the Niagara 2 processor yesterday at Sun's HPC Consortium meeting in Dresden, I always get more of a kick from customer presentations. In this case, Dieter an Mey from RWTH Aachen University gave a nice talk about the pitfalls and benefits of multi-core processors for programmers. While it was delivered here at a High Performance Computing event, the observations and lessons are applicable to anyone interested in application performance on multi-core processors. Given the direction of our industry, that encompasses a lot of programmers. Dieter and his colleague Christian Terboven examined performance on several systems based on a variety of processors: the UltraSPARC IV, UltraSPARC T2 (Niagara 2), Intel Woodcrest and Clovertown, and a quad-core AMD Opteron. For each of these systems, they measured achievable aggregate bandwidth over a variety of active thread counts, processor bindings, and memory placement. I've included a few of his slides with Niagara 2 performance results removed (sorry). As Dieter said, the results aren't surprising once you look carefully at the non-uniformities in the underlying system architectures. If, however, programmers do not understand these issues, they will likely achieve very sub-optimal application performance. My own concern is that as multi-core and multi-threaded processors become the norm across the computer industry, programmers will not understand these issues in the way a seasoned HPC programmer might. This is one small part of the challenge the software industry faces in helping programmers achieve high performance on these new processors.
In addition to running the above bandwidth tests, Dieter and Christian ran two applications on each of these systems. The first was a very cache-friendly code used to compute contact interactions between bevel gears. The second was a Navier-Stokes code that put high stress on the memory sub-system due to its manipulations of sparse data structures. They also did additional throughput tests, running multiple copies of these applications on each system. The results for Niagara 2 demonstrated the value of the CMT approach in hiding latency for throughput workloads and being able to do so with floating-point intensive codes with the increased FP capabilities of this new processor. Oh, and one more thing. The graph below shows performance results for the bevel gear code run with different numbers of threads. Look at Columns 5 and 6. These results were generated on Solaris and Linux using the same compilers and the exact same hardware. Higher is better. :-)
(2007-06-26 00:46:52.0) Permalink Comments [0] HPC Consortium: Niagara 2 Processor Barton Fiske, one of Sun's Technical Marketing Specialists, and Ruud van de Pas, Sun Senior Staff Engineer, gave back-to-back talks today at the HPC Consortium meeting in Dresden about Sun's upcoming Niagara 2 processor and its applicability to High Performance Computing. You may recall that the current Niagara processor (officially called the UltraSPARC T1) has but a single floating-point unit across all eight cores and 32 threads. Hardly a compute engine for the vast majority of HPC workloads, though of interest perhaps to some customers in the intelligence community and in certain segments of life sciences where integer operations are important. By contrast, the N2 processor has eight floating-point units--one per core. However, N2 inherits its overall philosophical approach from the UltraSPARC T1 in that its cores are relatively simple (no superscalar, no out-of-order execution, etc.), concentrating instead on delivering a significant degree of chip multithreading (CMT) with 8 cores and 64 threads per N2 die. Our HPC customers are wondering whether N2-based systems will be useful for HPC workloads. Barton led with a description of the N2 processor and a detailed look at the various systems in which Niagara 2 will be used from small systems to those using Victoria Falls. Ruud then followed with a description of some of the benchmarking work he has been doing on early N2 systems, looking at how a variety of HPC applications (primarily C and Fortran codes) perform. His results were very preliminary and since both his talk and Barton's were given under non-disclosure at the Consortium meeting, I cannot share detailed results here. However, I will say that the combination of increased floating point capabilities, a better floating point implementation, and high off-chip bandwidth makes our next generation Niagara-based systems worthy of consideration for some HPC workloads. 
Sorry to be vague for now, but we will have much more information to share once we start shipping N2-based systems. (2007-06-25 11:18:02.0) Permalink Comments [0] HPC Consortium: Sun Labs Perspectives Hans Eberle presented a Sun Labs perspective on High Performance Computing at the HPC Consortium meeting in Dresden today. He gave overviews of a few of the current projects running in Sun Labs that are relevant to HPC, including proximity interconnect and Fortress. For a complete list of Sun Labs projects, visit the Sun Labs project page. Both proximity interconnect and Fortress were explored by Sun as part of the DARPA HPCS project and we continue to invest in both efforts even though we have completed our work under Phase II of that project. Fortress has been designed specifically as a language for expressing scientific applications. Its design principles included support for a mathematical notation, growability, safety, and implicit parallelism. Here is an example of Fortress code formatted using a set of Emacs macros supplied with Fortress:
While the above may look strange to the average programmer, think how strange C or Fortran code looks to a scientist trying to solve a complex set of equations in their domain of expertise. Or actually, meditate on the fact that we as an industry have beaten these poor scientists into submission and forced them to express their problems in something so foreign to them as current programming languages. A bug perhaps? Unlike most languages that begin as serial languages and then add features or annotations to express parallelism, Fortress was designed specifically with parallelism built in as is illustrated on the slide below.
A Fortress interpreter has been released under open source and an open source community has been established to further develop the code. The web site is here. If you are interested, the Fortress language specification can be found here. (2007-06-25 10:25:32.0) Permalink Comments [0] HPC Consortium: Blackbox We have a Blackbox here in Dresden (shown above) and it has been a popular tour for the Sun customers and partners attending the Sun HPC Consortium meeting. We also had a talk about Project Blackbox, delivered by Robert Zwickenpflug of Sun Microsystems GmbH. For those not familiar, Blackbox is a datacenter in a box--specifically, a datacenter built into a standard 20' shipping container. Robert reviewed the Blackbox specs--that it supports up to 266 rack units of equipment in a total of eight 19" racks in about 160 square feet. Depending on the CPU used, one could fit 500 CPUs or 2000 cores or 8000 threads in a single Blackbox container or perhaps 1.5 petabytes of storage. A Blackbox system can handle about 200 kW of power and cooling due to its innovative water-cooled design. Both Sun and 3rd party components may be installed. We figure a Blackbox can be deployed in 1/10 the time of a traditional datacenter (conservatively), that it is about 20% more energy efficient than an AC-cooled datacenter, and that one could save perhaps $150K per year by locating a Blackbox close to low-cost power sources (possible due to its mobile nature and based on estimated energy costs of $0.25/kWh in an urban environment versus a $0.03/kWh rural rate.) (2007-06-25 09:57:54.0) Permalink Comments [0] HPC Consortium: Andy Bechtolsheim
Andy spoke today at the HPC Consortium meeting in Dresden about the five key challenges to building petascale (as in 10^15 floating-point operations per second) computer systems. They are: scaling application performance, keeping the bandwidth-to-FLOPs ratio balanced in the system, scaling the interconnect fabric, power efficiency and cooling, and reliability to support capability applications. He went on to describe in detail how Sun will build petascale computers using a concept that will be officially previewed at ISC in Dresden tomorrow. He talked about network topologies, mechanical issues, power, cooling, and compute density, and then showed how our technologies will be used to build the 500+ TFLOP Ranger system to be installed later this year at the Texas Advanced Computing Center (TACC) and other sites which have not yet been disclosed. More details and perhaps some photos tomorrow after the announcement. (2007-06-25 09:38:35.0) Permalink Comments [1] Arrived in Dresden...
Dresden from my hotel room The trip to Dresden took longer than expected due to a missed connection in Frankfurt. In the end, I took a six-hour train ride from Frankfurt through Leipzig and then on to Dresden. While in the end it was a longer trip, I judge it a better one as I quite enjoyed the scenery and the more leisurely pace.
Train scenery. Excellent clouds! Today (Sunday) we held a full day of training for a group of Sun's field technical people here at the Westin Bellevue hotel. We covered a variety of topics and ended with a lively discussion about Solaris and HPC. Tonight also marks the start of Sun's HPC Consortium meeting, a two-day event that we hold just prior to each of the two annual supercomputing conferences. These events bring together Sun customers, partners, and Sun employees from around the company for discussions and presentations about all aspects of HPC. I hope to blog at least the customer presentations during the conference tomorrow and Tuesday. (2007-06-24 13:17:36.0) Permalink Comments [0] |