The Navel of Narcissus
Josh Simons' Coordinates in the Blogosphere

20090715 Wednesday July 15, 2009

HPC Trends and Virtualization

Here are the slides (with associated commentary) that I used for a talk I gave recently in Hamburg at Sun's HPC Consortium meeting just prior to the International Supercomputing Conference. The topic was HPC trends with a focus on virtualization and the continued convergence of HPC, Enterprise, and Cloud IT. A PDF version of the slides is available here.

Challenges faced by the HPC community arise from several sources. Some are created by the arrival and coming ubiquity of multi-core processors. Others stem from the continuing increases in problem sizes at the high end of HPC and the commensurate need for ever-larger compute and storage clusters. And still others derive from the broadening of HPC to a wider array of users, primarily commercial/industrial users for whom HPC techniques may offer a significant future competitive advantage.

Perhaps chief among these challenges is the increasingly difficult issue of cluster management complexity, which has been identified by IDC as a primary area of concern. This is especially true at the high end due to the sheer scale involved, but problems exist in the low and midrange as well since commercial/industrial customers are generally much less interested in or tolerant of approaches with a high degree of operational complexity--they expect more than is generally available currently.

Application resilience is also becoming an issue in HPC circles. At the high end there is a recognition that large distributed applications must be able to make meaningful forward progress while running on clusters whose components may be experiencing failures on a near-continuous basis due to the extremely large sizes of these systems. At the midrange and low end, individual nodes will include enough memory and CPU cores that their continued operation in the presence of failures becomes a significant issue.

Gone are the days of macho bragging about the MegaWatt rating of one's HPC datacenter. Power and cooling must be minimized while still delivering good performance for a site's HPC workload. How this will be accomplished is an area of significant interest to the HPC community--and their funding agencies.

While the ramifications of multi-core processors to HPC are a critical issue as are issues related to future programming models and high productivity computing, these are not dealt with in this talk due to time constraints.

Most HPC practitioners are comfortable with the idea that innovations in HPC eventually become useful to the wider IT community. Extreme adherents to this view may point out that the World Wide Web itself is a byproduct of work done within the HPC community. Absent that claim, there are still plenty of examples illustrating the point that HPC technologies and techniques do eventually find broader applicability.

This system was recently announced by Oracle. Under the hood, its architecture should be familiar to any HPC practitioner: it is a DDR InfiniBand, x86-based cluster in a box, expandable in a scalable way to eight cabinets with both compute and storage included.

The value of a high-bandwidth, low-latency interconnect like InfiniBand is a good example of the leverage of HPC technologies in the Enterprise. We've also seen significant InfiniBand interest from the Financial Services community for whom extremely high messaging rates are important in real-time trading and other applications.

It is also important to realize that "benefit" flows in both directions in these cases. While the Enterprise may benefit from HPC innovations, the HPC community also benefits any time Enterprise uptake occurs since adoption by the much larger Enterprise market virtually assures that these technologies will continue to be developed and improved by vendors. Widespread adoption of InfiniBand outside of its core HPC constituency would be a very positive development for the HPC community.

A few months ago, several colleagues and I held a panel session in Second Life to introduce Sun employees to High Performance Computing as part of our "inside outreach" to the broader Sun community.

As I noted at the time, it was a strange experience to be talking about "HPC" while sitting inside what is essentially an HPC application -- that is, Second Life itself. SL is a great example of how HPC techniques can be repurposed to deliver business value in areas far from what we would typically label High Performance Computing. In this case, SL executes about 30M concurrent server-side scripts at any given time and uses the Havok physics engine to simulate a virtual world that has been block-decomposed across about 15K processor cores. Storage requirements are about 100 TB in over one billion files. It sure smells like HPC to me.

Summary: HPC advances benefit the Enterprise in numerous ways. Certainly with interconnect technology, as we've discussed. With respect to horizontal scaling, HPC has been the poster child for massive horizontal scalability for over a decade, ever since clusters first made their appearance in HPC (but more about this later).

Parallelization for performance has not been broadly addressed beyond HPC...at least not yet. With the free ride offered by rocketing clock speed increases coming to an end, parallelization for performance is going to rapidly become everyone's problem and not an HPC-specific issue. The question at hand is whether parallelization techniques developed for HPC can be repurposed for use by the much broader developer community. It is beyond the scope of this talk to discuss this in detail, but I believe the answer is that both the HPC community and the broader community have a shared interest in developing newer and easier to use parallelization techniques.

A word on storage. While Enterprise is the realm of Big Database, it is the HPC community that has been wrestling with both huge data storage requirements and equally huge data transfer requirements. Billions of files and PetaBytes of storage are not uncommon in HPC at this point with aggregate data transfer rates of hundreds of GigaBytes per second.

One can also look at HPC technologies that are of benefit to Cloud Computing. To do that, however, realize that before "Cloud Computing" there was "Grid Computing", which came directly from work by the HPC community. The idea of allowing remote access to large, scalable compute and storage resources is very familiar to HPC practitioners since that is the model used worldwide to allow a myriad of individual researchers access to HPC resources in an economically feasible way. Handling horizontal scale and the mapping of workload to available resources are core HPC competencies that translate directly to Cloud Computing requirements.

Of course, Clouds are not the same as Grids. Clouds offer advanced APIs and other mechanisms for accessing remote resources. And Clouds generally depend on virtualization as a core technology. But more on that later.

As a community, we tend to think of HPC as being leading and bleeding edge. But is that always the case? Have there been advances in either Enterprise or Cloud that can be used to advantage HPC? There is no question in my mind that the answer to this question is a very strong Yes.

Let's talk in more detail about how Enterprise and Cloud advances can help address application resilience, cluster management complexity, effective use of resources, and power efficiency for HPC. Specifically, I'd like to discuss how the virtualization technologies used in Enterprise and Cloud can be used to address these current and emerging HPC pain points.

I am going to use this diagram frequently on subsequent slides so it is important to define our terms. When I say "virtualization" I am referring to OS virtualization of the type done with, for example, Xen on x86 systems or LDOMs on SPARC systems. With such approaches, a thin layer of software or firmware (the hypervisor) works in conjunction with a control entity (called DOM0 or the Control Domain) to mediate access to a server's physical hardware and to allow multiple operating system instances to run concurrently on that hardware. These operating system instances are usually called "guest OS" or virtual machine instances.

This particular diagram illustrates server consolidation, the common Enterprise use-case for virtualization. With server consolidation, workload previously run on physically separate machines is aggregated onto a single system, usually to achieve savings on either capital or operational expense or both. This virtualization is essentially transparent to an application running within a guest OS instance, which is part of the power of the approach since applications need not be modified to run in a virtualized environment. Note that while there are cases in which consolidating multiple guest OSes onto a single node as above would be useful in an HPC context, the more common HPC scenario involves running a single guest OS instance per node.

While server consolidation is important in a Cloud context to reduce operational costs for the Cloud provider, the encapsulation of pre-integrated and pre-tested software in a portable virtual machine is perhaps the most important aspect of virtualization for Cloud Computing. Cloud users can create an entirely customized software environment that supports their application, create a virtual machine file that includes this software, and then upload and run this software on a Cloud's virtualized infrastructure. As we will see, this encapsulation can be used to advantage in certain HPC scenarios as well.

Before discussing specific HPC use-cases for virtualization, we must first address the issue of performance since any significant reduction in application performance would not be acceptable to HPC users, rendering virtualization uninteresting to the community.

Yes, I know. You can't read the graphics because they are too small. That was actually intentional even for the slides that were projected at the conference. The graphs show comparisons of performance in virtualized and non-virtualized (native) environments for several aspects of large, linear algebra computations. The full paper, The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Computational Kernels and Software by Youseff, Seymour, You, Dongarra, and Wolski is available here.

The tiny graphs show that there was essentially no performance difference found between native and virtualized environments. The curves in the two lower graphs are basically identical, showing essentially the same performance levels for virtual and native. The top-left histogram shows the same performance across all virtual and native test cases. In the top-right graph each of the "fat" histogram bars represents a different test with separate virtual and native test results shown within each fat bar. The fat bars are flat because there was little or no difference between virtual and native performance results.

These results, while comforting, should not be surprising. HPC codes are generally compute intensive so one would expect such straight-line code to execute at full speed in a virtualized environment. These tests also confirm, however, that aspects of memory performance are essentially unaffected by virtualization as well. These results are a tribute primarily to the maturity of virtualization support included in current processor architectures.

Note, however, that the explorations described in this paper focused solely on the performance of computational kernels running within a single node. For virtualization to be useful for HPC, it must also offer good performance for distributed, parallel applications as well. And there, dear reader, is the problem.

Current virtualization approaches typically use what is called a split driver model to handle IO operations, which involves using a device driver within the guest OS instance that communicates with the "bottom" half of the driver, which runs in DOM0. The DOM0 driver has direct control of the real hardware, which it accesses on behalf of any guest OS instances that make IO requests. While this correctly virtualizes the IO hardware, it does so with a significant performance penalty. That extra hop (in both directions) plays havoc with achievable IO bandwidths and latencies, clearly not appropriate for either high-performance MPI communications or high-speed access to storage.

Far preferable would be to allow each guest OS instance direct access to real hardware resources. This is precisely the purpose of PCI-IOV (IO Virtualization), a part of the PCI specification that specifies how PCI devices should behave in a virtualized environment. Simply put, PCI-IOV allows a single physical device to masquerade as several separate physical devices, each with its own hardware resources. Each of these pseudo-physical devices can then be assigned directly to a guest OS instance for its own use, avoiding the proxied IO situation shown on the previous slide. Such an approach should greatly improve the current IO performance situation.

PCI-IOV requires hardware support from the IO device and such support is beginning to appear. It also requires software support at the OS and hypervisor level and that is beginning to appear as well.

While not based on PCI-IOV, the work done by Liu, Huang, Abali, and Panda and reported in their paper, High Performance VMM-Bypass I/O in Virtual Machines, gives an idea as to what is possible when the proxied approach is bypassed and the high-performance aspects of the underlying IO hardware are made available to the guest OS instance. The full paper is available here.

The graph on the top-left shows MPI latency as a function of message size for the virtual and native cases while the top-right graph shows throughput as a function of message size in both cases. You will note that the curves on each graph are essentially identical. These tests used MVAPICH in polling mode to achieve these results.

By contrast, the bottom-left graph reports virtual and native Netperf results shown as transactions per second across a range of message sizes. In this case, the virtualized results are not as good, especially at the smaller message sizes. This is due to the fact that interrupt processing is still proxied through DOM0 and for small message sizes it has an appreciable effect on throughput, about 25% in the worst case.

The final graph compares the performance of NAS parallel benchmarks in the virtual and native cases and shows little performance impact in these cases for virtualization.

The work makes a plausible case for the feasibility of virtualization for HPC applications, including parallel distributed HPC applications. More exploration is needed, as is PCI-IOV capable InfiniBand hardware.

This and the next few slides outline several of the dozen or so use-cases for virtualization in HPC. The value of these use-cases to an HPC customer needs to be measured against any performance degradations introduced by virtualization to assess the utility of a virtualized HPC approach. I believe that several of these use-cases are compelling enough for virtualization to warrant serious attention from the HPC community as a future path for the entire community.

Heterogeneity. Many HPC sites support end users with a wide array of software requirements. While some may be able to use the default operating system version installed at a site, others may require a different version of the same OS due to application constraints. Yet others may require a completely different OS or the installation of other non-standard software on their compute nodes. With virtualization, the choice of operating system, application, and software stack is left to end users (or ISVs) who package their software into pre-built virtual machines for execution on an HPC hardware resource, much the way cloud computing offerings work today. In such an environment, site administrators manage the life cycle of virtual machines rather than configuring and providing a proliferation of different software stacks to meet their users' disparate needs. Of course, a site may still elect to make standard software environments available for those users who do not require custom compute environments.

Effective distributed resource management is an important aspect of HPC site administration. Scheduling a mixture of jobs of various sizes and priorities onto shared compute resources is a complex process that can often lead to less than optimal use of available resources. Using Live Migration, a distributed resource management system can make dynamic provisioning decisions to shift running jobs onto different compute nodes to free resources for an arriving high-priority job, or to consolidate running workloads onto fewer resources for power management purposes.
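To make the power-management case concrete, here is a minimal Python sketch of consolidating guest OS instances onto fewer nodes. It is only an illustration: the first-fit-decreasing heuristic, the function name consolidate, and the example loads and node capacity are assumptions made for this sketch, not the behavior of any particular distributed resource management system.

```python
def consolidate(vm_loads, node_capacity):
    """Pack VM loads (e.g. cores in use) onto nodes with first-fit decreasing.

    Returns a list of nodes, each a list of the loads placed on it.
    This is a simple heuristic chosen for illustration only.
    """
    nodes = []
    for load in sorted(vm_loads, reverse=True):
        if load > node_capacity:
            raise ValueError("a single VM does not fit on any node")
        for node in nodes:
            # Place the VM on the first node that still has room.
            if sum(node) + load <= node_capacity:
                node.append(load)
                break
        else:
            # No existing node has room, so power on another one.
            nodes.append([load])
    return nodes


if __name__ == "__main__":
    # Hypothetical example: seven guests, eight cores per node.
    placement = consolidate([1, 4, 2, 3, 1, 3, 2], node_capacity=8)
    print(len(placement), "nodes needed:", placement)
```

In a real system each move in the resulting placement would be carried out with Live Migration, and the scheduler would also have to weigh job priorities and interconnect locality, which this sketch ignores.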

For those not familiar, Live Migration allows a running guest OS instance to be shifted from one physical machine to another without shutting it down. More precisely, the OS instance continues to run as its memory pages are migrated and then at some point, it is actually stopped so the remaining pages can be migrated to the new machine, at which point its execution can be resumed. The technique is described in detail in this paper.
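The iterative copy described above can be sketched as a toy simulation. This is a simplified model under stated assumptions, not the algorithm of any specific hypervisor: the function name simulate_precopy, the constant page-dirtying rate, the stop threshold, and all of the numbers in the example are hypothetical.

```python
def simulate_precopy(total_pages, bandwidth, dirty_rate, stop_threshold,
                     max_rounds=30):
    """Toy model of pre-copy live migration.

    bandwidth and dirty_rate are in pages per second (integers).
    Returns (pages sent in each live round, pages sent while stopped,
    downtime in seconds).
    """
    to_send = total_pages
    rounds = []
    # Keep copying while the guest runs, until the dirty set is small enough.
    while to_send > stop_threshold and len(rounds) < max_rounds:
        rounds.append(to_send)
        # Pages dirtied by the still-running guest while this round was sent.
        newly_dirty = to_send * dirty_rate // bandwidth
        to_send = min(total_pages, newly_dirty)
    # The guest is stopped here; the remaining pages are copied, then it resumes.
    downtime = to_send / bandwidth
    return rounds, to_send, downtime


if __name__ == "__main__":
    rounds, final_pages, downtime = simulate_precopy(
        total_pages=1_000_000, bandwidth=100_000,
        dirty_rate=10_000, stop_threshold=1_000)
    print(rounds, final_pages, downtime)
```

In this model each round shrinks by the ratio of the dirtying rate to the transfer rate, so the guest is paused only for the short final copy. If the guest dirties pages as fast as they can be sent, the rounds never shrink and max_rounds forces the stop.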

Live migration can be significantly accelerated by bringing the speed and efficiency of InfiniBand RDMA to bear on the movement of data between systems as described in High Performance Virtual Machine Migration with RDMA over Modern Interconnects by Huang, Gao, Liu, and Panda. See the full paper for details on yet another example of how the Enterprise can benefit from advances made by the HPC community.

The ability to checkpoint long-running jobs has long been an HPC requirement. Checkpoint-restart (CPR) can be used to protect jobs from underlying system failures, to deal with scheduled maintenance events, and to allow jobs to be temporarily preempted by higher-priority jobs and later restarted. While CPR is an important requirement, HPC vendors have generally been unable to deliver adequate CPR functionality, requiring application writers to code customized checkpointing capabilities within their applications. Using virtualization to save the state of virtual machines, both single-VM and multi-VM (e.g. MPI) jobs can be checkpointed more easily and more completely than was achievable with traditional, process-based CPR schemes.

As HPC cluster sizes continue to grow and with them the sizes of distributed parallel applications, it becomes increasingly important to protect application state from underlying hardware failures. Checkpointing is one method for offering this protection, but it is expensive in time and resources since the state of the entire application must be written to disk at each checkpoint interval.

Using Live Migration, it will be possible to dynamically relocate individual ranks of a running MPI application from failing nodes to other healthy nodes. In such a scenario, applications will pause briefly and then continue as affected MPI ranks are migrated and required MPI connections are re-established to the new nodes. Coupled with advanced fault management capabilities, this becomes a fast and incremental method of maintaining application forward progress in the presence of underlying system failures. Both multi-node and single-node applications can be protected with this mechanism.
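A rough back-of-the-envelope comparison shows why migrating only the affected ranks is called incremental. The sketch below is an assumption-laden illustration: the function name compare_data_moved and the job size, per-rank memory, and failure count in the example are hypothetical, and the model ignores pages that a live migration may need to resend.

```python
GIB = 1024 ** 3


def compare_data_moved(total_ranks, failing_ranks, bytes_per_rank):
    """Compare bytes written by a full checkpoint with bytes moved by
    migrating only the ranks on failing nodes.

    Returns (checkpoint_bytes, migration_bytes, ratio).
    """
    if failing_ranks <= 0:
        raise ValueError("need at least one failing rank to compare")
    checkpoint_bytes = total_ranks * bytes_per_rank    # whole application state
    migration_bytes = failing_ranks * bytes_per_rank   # affected ranks only
    return checkpoint_bytes, migration_bytes, checkpoint_bytes / migration_bytes


if __name__ == "__main__":
    # Hypothetical job: 1024 ranks of 2 GiB each, one rank on a failing node.
    ckpt, mig, ratio = compare_data_moved(1024, 1, 2 * GIB)
    print(ckpt // GIB, "GiB per checkpoint vs", mig // GIB, "GiB migrated")
```

The ratio is simply the total number of ranks divided by the number of affected ranks, so the advantage grows with job size.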

In summary, virtualization holds much promise for HPC as a technology that can be used to mitigate a number of significant pain points for the HPC community. With appropriate hardware support, both compute and IO performance should be acceptable in a virtualized environment especially when judged against the benefits to be accrued from a virtualized approach.

Virtualization for HPC is mostly a research topic currently. While incremental steps are feasible now, full support for all high-impact use cases will require significant engineering work to achieve.

This talk focused on virtualization and the benefits it can bring to the HPC community. This is one example of a much larger trend towards convergence between HPC, Enterprise, and Cloud. As this convergence continues, we will see many additional opportunities to leverage advances in one sphere to the future benefit of all. I see it as fortuitous that this convergence is underway since it allows significant amounts of product development effort to be leveraged across multiple markets, an approach that is particularly compelling in a time of reduced resources and economic uncertainty.

(2009-07-15 12:06:31.0)

20090713 Monday July 13, 2009

Performance Facts, Performance Wisdom

I was genuinely excited to see that members of Sun's Strategic Applications Engineering team have started a group blog about performance called BestPerf. These folks are the real deal -- they are responsible for generating all of the official benchmark results published by Sun -- and they collectively have a deep background in all things related to performance. I like the blog because while they do cover specific benchmark results in detail, they also share best practices and include broader discussions about achieving high performance as well. There is a lot of useful material for anyone seeking a better understanding of performance.

Here are some recent entries that caught my eye.

Find out how a Sun Constellation system running SLES 10 beat IBM BlueGene/L on a NAMD Molecular Dynamics benchmark here.

See how the Solaris DTrace facility can be used to perform detailed IO analyses here.

Detailed Fluent cluster benchmark results using the Sun Fire x2270 and SLES 10? Go here.

How to use Solaris containers, processor sets, and scheduling classes to improve application performance? Go here.


(2009-07-13 11:16:16.0)

20090701 Wednesday July 01, 2009

An Excellent Optical Illusion

Assuming you aren't color blind, you see green and blue spirals in the above graphic. However, were you to download this image and sample it with GIMP, Photoshop or another image manipulation program, you would find that the "blue" and "green" are exactly the same color, RGB=(0,255,150). I kid you not.

You can also try zooming your browser to verify that the two colors are the same. As you increase the zoom, you will notice the colors looking more and more similar.

For more like this, go here.

Thanks to Monty for this one.

(2009-07-01 15:00:00.0)

Run an HPC Cluster...On your Laptop

With one free download, you can now turn your laptop into a virtual three-node HPC cluster that can be used to develop and run HPC applications, including MPI apps. We've created a pre-configured virtual machine that includes all the components you need:

Sun Studio C, C++, and Fortran compilers with performance analysis, debugging tools, and high-performance math library; Sun HPC ClusterTools -- MPI and runtime based on Open MPI; and Sun Grid Engine -- Distributed resource management and cloud connectivity

Inside the virtual machine, we use OpenSolaris 2009.06, the latest release of OpenSolaris, to create a virtual cluster using Solaris zones technology and have pre-configured Sun Grid Engine to manage it so you don't need to. MPI is ready to go as well -- we've configured everything in advance.

If you haven't tried OpenSolaris before, this will also give you a chance to play with ZFS, with DTrace, with Time Slider (like Apple's Time Machine, but without the external disk) and a host of other cool new OpenSolaris capabilities.

For full details on Sun HPC Software, Developer Edition for OpenSolaris check out the wiki.

To download the virtual image for VMware, go here. (VirtualBox image coming soon.)

If you have comments or questions, send us a note at hpcdev-discuss@opensolaris.org.


(2009-07-01 11:49:59.0)

20090624 Wednesday June 24, 2009

HPC in Hamburg: Sun Customers Speak at the HPC Consortium

It's crazy time again. I'm in Hamburg for two HPC events: Sun's HPC Consortium customer event, and ISC '09, the International Supercomputing Conference. The Consortium ran all day Sunday and Monday and then ISC started on Tuesday. It is now Wednesday and this is the first break I've had to post a summary of the talks given at the Consortium. Due to the sheer number of presentations, including a wide range of Sun and partner talks, I only summarize those given by our customers. The full agenda is here.

Our first customer talk on Sunday was given by Dr. James Leylek, Executive Director of CU-CCMS, the Clemson University Computational Center for Mobility Systems, which focuses on problems in the automotive, aviation/aerospace, and energy industries.

The mission of CU-CCMS is not unique -- there are numerous university-based centers that work closely with industry by bringing resources and expertise to bear in a variety of problem domains. What sets CU-CCMS apart is its focus on addressing the mismatch between typical university time-scales and those of their industrial partners. Businesses need results quickly; universities move more slowly.

CU-CCMS has addressed this need in a few ways. They've staffed the center with full-time MS and PhD level engineers who have no teaching responsibilities. And they have provided a significant amount of computing gear to enable those engineers to work effectively with their industrial partners and generate results in a timely way.

Heterogeneity is another key part of the CU-CCMS strategy. By offering a range of computing platforms from clusters to very large shared-memory machines (from Sun) they are able to map problems to appropriate resources to deliver the fast turnaround times required by their industrial partners.

Dr. Leylek also briefly discussed the challenge of introducing HPC to industry as detailed in the Council on Competitiveness study, Reveal. As he noted, many companies are "sitting on the sidelines" of HPC and not engaging even though they could increase their competitiveness by using HPC techniques. He believes CU-CCMS offers a model for how such engagements can be run successfully: assemble a team of expert, dedicated technical resources with appropriate domain knowledge, algorithmic expertise, etc, and combine that with ample high performance computing infrastructure, and an understanding that turnaround time is critical for successful industrial engagements. And then generate valuable results. Lather, rinse, repeat.

Thomas Nau from the University of Ulm gave the next talk, which was a quick tour through several OpenSolaris technologies. He talked about COMSTAR, gave a quick demo of the new OpenSolaris Time Slider, and spent most of his time talking about ZFS, specifically about the benefits of solid state disks for increasing ZFS performance. Thomas identified the ZIL -- the ZFS Intent Log -- as the component most often affecting performance. Experiments he has done that involved moving the ZIL from a standard hard disk to a ramdisk have shown significant ZFS performance improvements. In addition, only a small amount of solid state storage is needed to achieve good performance, e.g. perhaps 1-4 GB even for multi-TeraByte drives. Thomas noted that while one could theoretically increase ZFS performance by disabling the ZIL, DO NOT DO THIS. He then ended with the following statement, with which I can only agree: "Hardware RAID is dead, dead, dead. Just use ZFS." :-)

Our first customer talk on Monday was given by Prof. Dr. Thomas Lippert, head of the Jülich Supercomputing Center (JCS), site of Sun's largest European deployment to date of our Sun Constellation System architecture. He first gave a brief history of the Jülich Research Center, which is one of the largest civilian research centers in Europe with over 4000 researchers in nine departments, one of which is the new Institute for Advanced Simulation of which JCS is a part. The site has a very long history of computer acquisitions, starting in 1957. This year JCS purchased three systems: a Sun system (JuRoPa), a Bull system (HPC-FF), and an IBM system (Jugene). These systems have, respectively, 200 TFLOPs, 100 TFLOPs, and 1 PFLOPs of peak performance. Since the Sun and Bull systems are interconnected at the highest level of their switch hierarchies, the two machines can be run as a single system. This combined system delivered 274.8 TFLOPs on LINPACK, which earned it the #10 entry on the latest edition of the TOP500 list. Collectively, JCS serves about 250 projects across Europe, including 20-30 highly scalable projects that are chosen by international referees for their potential for producing breakthrough science.

Dr. Lippert also spoke briefly about PRACE, the Partnership for Advanced Computing in Europe, which is radically changing the supercomputing landscape across Europe. Due to earlier studies, computing is now considered to be a crucial pillar of research infrastructure and, as such, it is now receiving considerable attention from funding agencies.

In closing, Dr. Lippert presented specific details of the JuRoPa system (2208 nodes, 17664 cores, 207 TFLOPs, 48 GB/node, and Sun's new M9 QDR switch). He also described some of the specific issues that will be explored with these systems, including control of jitter through the use of gang scheduling, daemon reduction, a SLERT kernel, etc. And some additional secret sauce from Sun perhaps. :-)
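The quoted JuRoPa figures can be cross-checked with a little arithmetic. The Python below is just that arithmetic, using only numbers from the talk summary; the variable names are mine, and the LINPACK efficiency uses the rounded 200 and 100 TFLOPs peak figures quoted earlier, so treat it as approximate.

```python
# Figures quoted for JuRoPa in the talk summary.
NODES = 2208
CORES = 17664
PEAK_TFLOPS = 207
MEM_PER_NODE_GB = 48

cores_per_node = CORES / NODES                      # 8.0
peak_gflops_per_core = PEAK_TFLOPS * 1000 / CORES   # about 11.7
total_memory_gb = NODES * MEM_PER_NODE_GB           # 105984 GB, roughly 106 TB

# Combined JuRoPa + HPC-FF LINPACK result against the rounded peak figures.
linpack_efficiency = 274.8 / (200 + 100)            # about 0.916

if __name__ == "__main__":
    print(cores_per_node, peak_gflops_per_core, total_memory_gb,
          round(linpack_efficiency, 3))
```

So the numbers hang together: eight cores per node, a little under 12 GFLOPs of peak per core, and a combined system running LINPACK at roughly 92% of its quoted peak.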

Prof. Satoshi Matsuoka from the Tokyo Institute of Technology spoke next. While he did mention Tsubame, Tokyo Tech's Sun-based supercomputer, he primarily spoke about the return of vector machines to HPC. They have, he believes, been reincarnated as GPGPU-based machines. Dinosaurs are once again walking the earth. :-) In particular, the GPGPU's high compute density, high memory bandwidth, and low memory latency echo some of the fundamental capabilities of vector machines that make them interesting for both tightly coupled codes like N-body as well as sparse codes like CFD. In his view, the GPGPU essentially becomes the main processor while the CPU becomes an ancillary processor.

Computers, however, are not useful unless they can be used to solve problems. To support the claim that GPGPU-based clusters can be effective HPC platforms, Prof. Matsuoka presented results from several new algorithms that have been developed at Tokyo Tech to take advantage of GPGPU-based systems. He showed impressive results for 3D FFTs used for protein folding and results for CFD with speedups up to 70X over CPU-based algorithms.

Our next customer speaker was Henry Tufo of the University of Colorado at Boulder (UCB) and the National Center for Atmospheric Research (NCAR). He gave an update on UCB's upcoming Constellation-based HPC system and also spoke about some of the challenges related to climate modeling. It seems clear at this point that accurate climate modeling is going to be critical for understanding our future and our planet's future. It was a bit daunting to hear that climate modelers would like to increase many dimensions of their simulations, including spatial resolution by 10^3 or 10^5, the completeness of their models by a factor of 100x, the length of their simulation runs by 100x, and the number of modeled parameters by 100x. All told, their desires would increase computational needs by 10^10 or 10^12 over current requirements. It was sobering to hear that current technology trajectories predict that a 10^7 improvement will take about 30 years. Not good.
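To see how far off those goals are, one can extrapolate from the quoted trajectory. The sketch below assumes, and this is purely my assumption, that performance keeps growing exponentially at the same rate implied by "10^7 in about 30 years"; the function name years_to_reach is hypothetical.

```python
import math


def years_to_reach(target_exponent, known_exponent=7, known_years=30.0):
    """Years needed for a 10**target_exponent improvement, assuming the
    constant exponential rate implied by 10**known_exponent in known_years.
    """
    return known_years * target_exponent / known_exponent


# Doubling time implied by a 10^7 improvement over 30 years.
doubling_time_years = 30.0 / math.log2(10 ** 7)

if __name__ == "__main__":
    print(round(doubling_time_years, 2))   # about 1.29 years
    print(round(years_to_reach(10), 1))    # about 42.9 years
    print(round(years_to_reach(12), 1))    # about 51.4 years
```

Under that assumption the 10^10 goal is roughly 43 years away and the 10^12 goal roughly 51 years away, which is why "not good" seems like a fair summary.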

Their new Sun-based system will consist of 12 Constellation racks, Nehalem blades, QDR InfiniBand, and about 500 TB of storage, with about 10% of the cluster's nodes accelerated with GPUs. The system will be located next to an existing physics building in three containers -- one for the IT components, one for electrical, and one for cooling.

Stephane Thiell from the Commissariat à l’Énergie Atomique (CEA) gave an overview of CEA, talked a bit about CEA's TERA-100 project and then detailed CEA's planned use of Lustre for TERA-100. The CEA computing complex currently has two computing centers, one classified (TERA) and one open (CCRT). TERA-100 will be a follow-on to TERA-10, which is a 60 TFLOPs, Linux-based system built by Bull in 2005. It includes an impressive 1 PetaByte Lustre filesystem and uses HPSS to archive to Sun StorageTek tape libraries with a 15 PetaByte capacity.

TERA-100 aims to increase CEA's classified computing capacity by about 20x with a final size of one PFLOPs or perhaps a little larger. CEA plans to continue with their COTS-based, general-purpose approach rather than move off the main sequence to something more exotic. It will be x86-based with more than 500 GFLOPs per node using 4-socket nodes. There will be 2-4 GB per core and two Lustre file systems will be supported, one with a 300 GB/s transfer requirement and the other with a 200 GB/s requirement. The system will consume less than 5 MW. A 40 TFLOPs demonstrator system will be built first and it will include scaled-down versions of the Lustre file systems as well. In the final system the Lustre servers will be built with four-socket nodes and a four-node HA architecture will be used to guarantee against failure and to avoid long failover times.
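As a quick sanity check on how these figures hang together, here is a small arithmetic sketch. It uses only the numbers quoted above; the constant and function names are mine, and the node count and efficiency are bounds derived from "more than 500 GFLOPs per node" and "less than 5 MW", not figures CEA presented.

```python
TERA10_TFLOPS = 60        # TERA-10 size, from the talk
CAPACITY_FACTOR = 20      # "about 20x"
TARGET_PFLOPS = 1.0       # "one PFLOPs or perhaps a little larger"
MIN_NODE_GFLOPS = 500     # "more than 500 GFLOPs per node"
MAX_POWER_MW = 5          # "less than 5 MW"

def scaled_capacity_tflops(base_tflops, factor):
    """Capacity after scaling the existing system by the given factor."""
    return base_tflops * factor

def max_nodes_needed(target_pflops, min_node_gflops):
    """Upper bound on node count: faster nodes mean fewer are needed."""
    return target_pflops * 1e6 / min_node_gflops   # 1 PFLOPs = 1e6 GFLOPs

def min_mflops_per_watt(target_pflops, max_power_mw):
    """Lower bound on efficiency if the target is met within the power cap."""
    return (target_pflops * 1e9) / (max_power_mw * 1e6)   # MFLOPs / W

if __name__ == "__main__":
    print(scaled_capacity_tflops(TERA10_TFLOPS, CAPACITY_FACTOR), "TFLOPs")
    print("fewer than", max_nodes_needed(TARGET_PFLOPS, MIN_NODE_GFLOPS), "nodes")
    print("at least", min_mflops_per_watt(TARGET_PFLOPS, MAX_POWER_MW), "MFLOPs/W")
```

The 20x figure lands at 1.2 PFLOPs, which matches "one PFLOPs or perhaps a little larger."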

CEA is involved in some interesting Lustre-based development, including joint work with Sun on a binding between Lustre and external HSM systems with the goal of supporting Lustre levels of performance with transparent access to hierarchical storage management. CEA is also working on Shine, a management tool for Lustre.

Dieter an Mey gave some general information about computing at RWTH Aachen University and then gave an update on their latest acquisition, a Sun-based supercomputer. He ended with a discussion about the pleasures and perils of workload placement on current generation systems. Along the way he shared some feedback on Sun products -- one of those habits that makes customers like Dieter such valuable partners for Sun.

Aachen provides both Linux and Windows-based HPC resources for their users. On Linux they record about 40,000 batch jobs per month and perhaps 150 interactive sessions per day. The Windows cluster is used primarily for interactive jobs. It was interesting to hear that Windows is gaining ground with respect to Linux at Aachen: a previous study at Aachen had shown that Windows lagged Linux in performance by about 24%, but a recent re-run of the study now shows the gap to be on the order of about 7%.

Aachen's new system will support both Linux and Windows equally with a flexible dividing line between them. The facility is designed to be general purpose with a mix of thin and fat nodes and with the required high-speed interconnect for those who use MPI. A new building is being erected to house this machine which will come fully online over the course of 2009-2010. When complete, the system will have a peak floating point rate in excess of 200 TFLOPs and it will include a 1 PetaByte Lustre file system. Speaking of Lustre, Dieter rated its configuration as "complex", something Sun is working on. The system will also include two of Sun's latest InfiniBand switches, the new 648-port QDR M9 switch.

Dieter's final topic was the correct placement of workload on non-uniform system architectures. In particular, he described the difference between compact and scatter placement on multicore NUMA systems. Compact placement uses threads on the same core first, then cores in the same socket --- a strategy that is used to minimize latency and to maximize cache sharing. Scatter placement uses threads on different sockets first, and then threads on different cores -- a strategy that maximizes memory bandwidth. Which strategy is best depends on the details of an application's underlying algorithms. (Dieter noted that currently Sun Grid Engine is not aware of these issues -- it treats nodes as flat arrays of threads or cores.) Placement decisions are further complicated when attempting to schedule more than one application onto a fat node. For example, different strategies would be used depending on whether single job turnaround is more important than overall throughput of jobs.
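The difference between the two strategies is easy to see in a few lines of code. This is only a sketch of the ordering idea as described above, using a made-up (socket, core, thread) numbering; it is not how any particular scheduler or OpenMP runtime implements placement.

```python
from itertools import product

def hardware_threads(sockets, cores, threads):
    """All hardware threads as (socket, core, thread) tuples."""
    return list(product(range(sockets), range(cores), range(threads)))

def compact_order(sockets, cores, threads):
    """Fill threads on the same core first, then cores in the same socket,
    then move on to the next socket."""
    hw = hardware_threads(sockets, cores, threads)
    return sorted(hw, key=lambda t: (t[0], t[1], t[2]))

def scatter_order(sockets, cores, threads):
    """Spread across sockets first, then across cores, and only then
    use additional threads on an already-busy core."""
    hw = hardware_threads(sockets, cores, threads)
    return sorted(hw, key=lambda t: (t[2], t[1], t[0]))

if __name__ == "__main__":
    # A hypothetical node: 2 sockets, 2 cores per socket, 2 threads per core.
    print("compact:", compact_order(2, 2, 2)[:4])
    print("scatter:", scatter_order(2, 2, 2)[:4])
```

With four software threads on this hypothetical node, compact placement keeps everything on socket 0 (shared caches, low latency), while scatter placement puts two threads on each socket (both memory controllers in play).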

Our last customer talk at the Consortium was given by the tag team of Arnie Miles (left) from Georgetown University and Tim Bornholtz (right) of the Bornholtz Group. Their topic was the Thebes Consortium for which they presented current status, did a short demo, and announced that the source code would be available by the end of June on sunsource.net.

The Thebes Consortium aims to help the widespread adoption of distributed computing technologies by creating an enabling infrastructure that focuses on scalability, security, and simplicity.

Arnie described (and Tim demo'ed) the instantiation of the Thebes first use case which assumes 1) that users have usernames and passwords in their home domain, 2) that one or more local resources have a trust relationship with a local STS (security token service), 3) that these resources are known to users, and 4) that all resources are able to consume SAML.

The use case itself consists of the following actions: 1) users create job submission files using the client application or a command line, 2) users use institution usernames and passwords to acquire a signed SAML token, 3) users perform no other logins and do not have to go to a resource command line interface, 4) users manually choose their resources, 5) job scheduling is handled by resources. Note that in this instance "resource" refers to a DRM-managed cluster which will accept the incoming request and then schedule the job appropriately on its managed cluster. In the prototype as it currently exists, a service is a compute service though there is also some level of support for a file system service as well.


(2009-06-24 11:13:17.0) Permalink Comments [0]

20090618 Thursday June 18, 2009

FORTRAN: Calling All Dinosaurs!

DO you PROGRAM FORTRAN? IF so, READ on.

Please ASSIGN some time to RECORD your opinions about current and future FORTRAN needs in our non-COMPLEX online survey. It is in your INTRINSIC self-interest to PAUSE and DO so.

It is IMPLICIT and LOGICAL that you also CALL on your colleagues (those CHARACTERs) to READ this, get REAL, and make an ENTRY as well.

You can OPEN the survey IF you GOTO here.

(Something we share in COMMON: I am a FORTRAN TYPE as well and am eligible to join the Dinosaur UNION.)

(2009-06-18 07:02:23.0) Permalink Comments [0]

20090608 Monday June 08, 2009

unConference: The Future of Software & the Internet

The Massachusetts Technology Leadership Council held an unconference on Sun's Burlington campus last Friday, titled The Future of Software & the Internet. I attended because I was both interested in the topic and also curious about the logistics and effectiveness of unconferences.

I was surprised when the moderator asked everyone in the room to introduce themselves by stating their name and either their company or their location. C'mon! There were well over 200 people in the room and we were not sitting in neat rows. And yet it worked somehow. I of course didn't remember any names, but I got a good sense of the companies represented--the usual suspects (Sun, IBM, HP, Microsoft, Google, CISCO, etc) as well as many (MANY) small companies, venture capitalists, and several attendees with undisclosed affiliations. In addition, there was probably some benefit in having everyone actually make a vocalization at the outset -- something about participating rather than just observing. In any case, it didn't take long and it was a good ice breaker. And it perhaps helped everyone feel the next step was achievable as well: creating an agenda for the rest of the day, based on everyone's input. And doing so in a finite time. :-)

An unconference is an unconference at least in part because the agenda is not defined beforehand by a conference committee--it is created on the fly by participants at the start of the event with the help of a skilled moderator. At the start of the day, our agenda had four hour-long discussion sessions blocked out, but no content at all. Content was identified this way:

  1. Anyone who was interested in hosting a discussion wrote their name and discussion title on a sheet of paper. Proposers would be responsible for running their session, but not for having any answers necessarily.
  2. Proposers then lined up and each gave a short (SHORT) description of their discussion idea. We had two mikes and therefore two lines that alternated, which helped this part run a little faster.
  3. After announcing their discussion idea, each proposer placed their paper onto a matrix posted on the wall. Our matrix had four rows -- one for each of the day's four one-hour sessions. The matrix had 15 columns, one for each conference room or area designated for discussion, each of which was labeled with a letter from A to O. Proposers could place their discussion in any cell, though there was some encouragement from the moderator to ensure that the last session of the day had a good number of discussions scheduled. Each column was also labeled with the approximate size of each discussion area as input to the heuristic placement procedure. Cloud discussions went into big rooms while the "making parallel programming easier" discussion area had just a couch and a few chairs.

I'm guessing there is some rule of thumb that helps organizers decide how many concurrent sessions will be needed based on the number of attendees. However it was done, the number of proposed topics mapped nicely to the 4 x 15 = 60 available discussion slots.

Once the agenda was complete, the moderator helped everyone get to their first discussion by reading aloud the titles and locations of the first set of concurrent topics. The entire agenda matrix was then moved to a wall in a central location so attendees could easily visit it between sessions to pick their next discussion topic.

All of the above -- from opening, through introductions and agenda forming -- took less than an hour. The resulting agenda cast a wide net over the theme of the unconference. There were discussions on business models, specific technical issues, models of innovation, development and testing processes, open source, cloud computing, etc. I participated in the following four discussions:

  • Simplified Parallel Computing
  • How to Start Your Idea [with Almost No Money]
  • The Future of Software Testing
  • From Data to Answers

I learned something in each discussion, though in the parallel computing case it was merely that talk of SIMD, MIMD, OpenMP, parallel spreadsheets, M language processing, and streaming parallelism is a sure way to keep your discussion group small. They were dropping like flies. :-)

Kidding aside, I was interested to talk to testing practitioners about the second-class role played by QA in the engineering hierarchy and how Agile methods might perhaps mitigate that problem by making quality an explicitly shared goal of all team members.

I approached the "data to answers" session wondering if HPC techniques for turning large amounts of data into insights would be applicable in a broader business context and learned that many businesses have a sad lack of experience with even the simplest of analytical methods, including a lack of understanding of even relatively simple data displays.

The "ideas" discussion presented a model for thinking about "intention" as being different from "invention" in the innovation process and how confusing the two can lead to problems in start-up situations. Intention is a statement about who you want to help or what you want to improve, while invention is how one chooses to satisfy the intention. Bill Warner, who led the session, used the Wildfire voice system as an example to show how confusing these two concepts can lead to problems.

The Innovation unConference, MassTLC's next such event, will be held on the Sun Burlington campus on October 1, 2009. I plan to attend.


(2009-06-08 15:59:05.0) Permalink Comments [1]

20090602 Tuesday June 02, 2009

Building Packages for OpenSolaris: Easier than Ever

In a previous entry I documented in detail how I contributed an open-source package (Ploticus) to OpenSolaris using SourceJuicer, starting with how to write a spec file and ending with the inclusion of the package in the contrib repository. In truth, at the time I published the information I had not actually taken the last step to promote the package from the pending repository to the contrib repository due to a Ploticus bug I discovered during testing. Ploticus ran okay, but it was not configured as I had wanted. It took me some time to create appropriate patch files, rebuild the package, re-test it, etc.

In retrospect, I'm glad I was delayed because in the meantime OpenSolaris 2009.06 and SourceJuicer 1.2.0 were both released, which gave me a chance to see if any improvements had been made in the contribution process. I am happy to report that improvements were definitely made. Read on for details.

Most important, SourceJuicer documentation has been much improved. See, for example, How to Use OpenSolaris SourceJuicer for a good overview of the submission process. In addition, the short (9 min) video below, which walks through the mechanics of submitting files using SourceJuicer, is also an excellent resource:

SourceJuicer itself has also been improved significantly with this latest release. For example, it is now possible to delete a submitted file if it is no longer needed---I was able to use SourceJuicer 1.2.0 to remove an incorrect copyright file I had created when I first submitted Ploticus. While I appreciated that improvement, I found the following much more intriguing:

The screendump above shows the results of recent SourceJuicer builds, including Ploticus. I was happy to see Ploticus built successfully with the patches I had created on my first try. I was also curious about the implied promise of the new Install column. Since I next wanted to install and test this latest package on my 2009.06 system, I clicked on the Install link. And saw this:

Hey, cool. Firefox knows it should invoke the Package Manager to handle my request. How? With OpenSolaris 2009.06 we've enhanced the Package Manager to support a web installer mode and created a new mime type (application/vnd.pkg5.info) to pass package installation requests from a web page to Package Manager. This works from any web browser so long as the web server is configured to handle .p5i files correctly. See John Rice's blog entry on 2009.06 Package Manager enhancements for more details.
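The web-server side of this amounts to mapping the .p5i extension to the new MIME type. Here is a minimal sketch of that mapping using Python's standard mimetypes module, purely as an illustration; a production web server would use its own configuration mechanism, and the file name below is hypothetical.

```python
import mimetypes

# MIME type and extension as described above.
P5I_MIME_TYPE = "application/vnd.pkg5.info"
mimetypes.add_type(P5I_MIME_TYPE, ".p5i")

def content_type_for(filename):
    """Content-Type a simple server might send for the given file name."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"

if __name__ == "__main__":
    # "ploticus.p5i" is a made-up file name used only for this example.
    print(content_type_for("ploticus.p5i"))
```

If the server sent such a file as plain text or as a generic binary instead, the browser would have no way of knowing it should hand the request to Package Manager.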

I clicked OK and then saw:

Package Manager promises to not only install the requested package, but to automatically add the required repository to my configuration as well. Surely it can't be this simple. I clicked on Proceed:

Apparently, it can be that simple. :-)

I've now tested my patched version of Ploticus on 2009.06 and requested the package be promoted to contrib by sending a note to sw-porters-discuss@opensolaris.org. I'm hopeful Ploticus will soon be available to the entire OpenSolaris community.

(2009-06-02 14:58:38.0) Permalink Comments [0]

20090529 Friday May 29, 2009

Rur Valley Journal: What's Up at Jülich?

What's up at Jülich? The latest 200+ TeraFLOPs of Sun-supplied HPC compute power is now up and running!

The JuRoPa (Jülich Research on PetaFLOP Architectures) system at the Jülich Research Center in Jülich, Germany has just come online this week. A substantial part of the system is built with the Sun Constellation System architecture, which marries highly dense blade systems with an efficient, high-performance QDR InfiniBand fabric in an HPC cluster configuration.

We delivered 23 cabinets filled with a total of 1104 Sun Blade x6275 servers or 2208 nodes. Each of these nodes is a dual-socket Nehalem-EP system running at 2.93 GHz. The systems are connected with quad data-rate (QDR) 4X InfiniBand using a total of six of our latest 648-port QDR switches. As usual, we use 12X InfiniBand cables to route three 4X connections, thereby greatly reducing the number of cables and connectors, and increasing the reliability of the fabric. For more detail on the Nehalem-EP blades and other components used in this system, see this blog entry.
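For anyone who wants to check the "200+ TeraFLOPs" figure against the hardware counts, here is a rough sketch of the arithmetic. The blade, node, socket, and clock numbers come from the paragraph above; the cores per socket and floating-point operations per cycle are my assumptions for quad-core Nehalem-EP parts and are not stated in this entry.

```python
CABINETS = 23
BLADES = 1104
NODES_PER_BLADE = 2
SOCKETS_PER_NODE = 2
CLOCK_GHZ = 2.93
CORES_PER_SOCKET = 4   # assumption: quad-core Nehalem-EP
FLOPS_PER_CYCLE = 4    # assumption: double-precision flops per core per cycle

def node_count(blades=BLADES, nodes_per_blade=NODES_PER_BLADE):
    """Total compute nodes delivered."""
    return blades * nodes_per_blade

def peak_tflops(nodes):
    """Theoretical peak in TFLOPs under the assumptions above."""
    gflops = nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET * CLOCK_GHZ * FLOPS_PER_CYCLE
    return gflops / 1000.0

if __name__ == "__main__":
    nodes = node_count()
    print("blades per cabinet:", BLADES // CABINETS)
    print("nodes:", nodes)
    print(f"peak: about {peak_tflops(nodes):.0f} TFLOPs")
```

Under those assumptions the peak comes out a little over 200 TFLOPs, consistent with the headline number.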

I've annotated one of the official photos below. Marc Hamilton has many more photos on his blog, including some cool "underground shots" at Jülich.



(2009-05-29 14:02:54.0) Permalink Comments [0]

20090527 Wednesday May 27, 2009

CommunityOne 2009: Taking the Plunge with OpenSolaris Deep Dives

I was hoping to attend CommunityOne in San Francisco next week (June 1-3), but I'll be beavering away here in Boston instead. C1 is the big, blow-out community event that covers all things OpenSolaris for the technical crowd --- developers and users -- with piles of technical sessions, lightning talks, labs and a host of other activities.

There are several registration options, including one free option that gives you access to two Deep Dive technical tracks on Tuesday as well as some free sessions on Monday. The Tuesday tracks are Developing IN OpenSolaris and Deploying OpenSolaris in Your Datacenter. Topics covered:

If you are interested in dropping by the Moscone Center next Monday or Tuesday for these tech talks, complete the free registration here. For details on the entire C1 event, see the event website or the wiki.

(2009-05-27 09:13:21.0) Permalink Comments [0]

20090525 Monday May 25, 2009

SourceJuicer: How to contribute a package to OpenSolaris

[UPDATE: A few small errors fixed and some clarifications added. See Comments for details.]

I tried recently to add a package to the OpenSolaris contrib repository, but quickly learned I didn't have enough packaging experience to understand the directions provided at SourceJuicer. So I did some homework, asked some questions, and eventually did successfully contribute a package. I've documented in this entry everything I've learned, hoping it will be helpful to others who want to build and submit OpenSolaris packages. Specifically, I'll describe how I wrote the spec file for Ploticus (my favorite open source plotting/graphing utility) and how I submitted the package to OpenSolaris.

I used SourceJuicer to submit my package because it is the easiest way for a community member to contribute. Before getting into details, a few words about the overall submission process. Packages are first submitted to the pending repository, which is basically a holding area for packages on their way to the contrib repository, the primary repository for community-contributed packages. Once a package has been validated and successfully built, it can then be moved into /contrib. I'll cover all of this below.

On to the details.

To submit a package to SourceJuicer, you need to supply two files: a text file containing copyright information and a spec file. The spec file contains the information SourceJuicer needs to create a final binary package starting from source code. Ideally the OpenSolaris package will be buildable from the standard, community-released source code without changes, which may require asking the community to adopt changes necessary to build the code for OpenSolaris. In practice, this will often not be necessary since many packages are designed to build on several Unix versions. In cases where changes must be made and those changes have not been accepted by the community, it is possible to specify patches that should be applied to the community source code during the build process. Though not desirable, it is sometimes necessary to do this. I'll supply pointers to information on how to do this below.

Spec files are not an OpenSolaris invention--they have been used for a long time to build RPM packages. This is good news because there are several excellent web resources that document spec files in detail. I recommend Maximum RPM by Edward Bailey as a detailed reference. One complication: It seems that OpenSolaris spec files are not exactly the same as RPM spec files. However, for the purposes of this exercise, don't worry about this -- the Ploticus example below should give you enough information to create a valid OpenSolaris spec file in most cases. However, if you insist on worrying, you can read the information I found here and here. If anyone knows of a better explanation of the differences, let me know and I will include a pointer here.

Okay, let's get to it. I started with a spec file template and created the following file for Ploticus. My commentary includes all of the tips and other information I discovered during the process of writing the spec file for this particular open source package. While I've attempted to give pointers to additional information throughout, this is not meant to be the definitive guide to the full capabilities of spec files. There should, however, be enough information here to allow typical open source apps to be packaged and contributed to OpenSolaris. Consult Maximum RPM for additional details.

Spec file excerpts, each followed by commentary:
#
# spec file for package: ploticus
#
# This file and all modifications and additions to the pristine
# package are under the same license as the package itself.
#
# include module(s): ploticus
#
This is all boilerplate commentary. Insert the name of your package twice.

%include Solaris.inc

Required for all OpenSolaris packages. For the curious, the source is here.
Name: ploticus

Once you specify the name of your package, you can use the macro %{name} to refer to it later in the spec file. As you will see below, there are other predefined macros available that you will use to write your spec file. You can also define your own macros using the syntax:

%define macro_name macro definition
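To make the macro idea concrete, here is a toy expander written in Python. It is only an illustration of textual substitution of %{...} references; it is not how the OpenSolaris build tools or RPM actually process macros, and the tarball_dir macro is a hypothetical one I made up for the example.

```python
import re

_MACRO_REF = re.compile(r"%\{(\w+)\}")

def expand(text, macros, max_passes=10):
    """Replace each %{name} reference with its value.
    Unknown macros are left untouched; repeated passes let one macro
    refer to another, with a cap to avoid looping forever."""
    for _ in range(max_passes):
        expanded = _MACRO_REF.sub(lambda m: macros.get(m.group(1), m.group(0)), text)
        if expanded == text:
            break
        text = expanded
    return text

if __name__ == "__main__":
    macros = {"name": "ploticus", "version": "2.41"}
    # Equivalent in spirit to: %define tarball_dir %{name}-%{version}
    macros["tarball_dir"] = "%{name}-%{version}"
    print(expand("%{tarball_dir}", macros))
    print(expand("%{name}.copyright", macros))
```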

Summary: ploticus -- creates plots, charts, and graphics from data

Summary is a one-line description of the package that will be displayed by the OpenSolaris Package Manager.
Version: 2.41
The version number can be referenced as %{version} later in the spec file, which can often be used to generalize file and directory names. In the case of Ploticus the version number string (e.g. "2.41") happens not to be used as part of its filenames (e.g. pl241src) so I do not use %{version} in this example, except in one instance of boilerplate.
License: GPLv2
Free text field describing code's open source license. I've seen all of these used: GPL, GPLv2, GPLv3, BSD, LGPLv2.1, New BSD License. If GPL, be explicit if you can: GPLv2 or GPLv3. The "or later" licenses might be appropriate as well, e.g. GPLv2-or-later, GPLv3-or-later, etc. There is a nice discussion here about the pros and cons of "or later" licenses.
Source: http://voxel.dl.sourceforge.net/sourceforge/ploticus/pl241src.tar.gz

The source tag specifies the location of the source-code tarball (possibly gzip'ed) that should be downloaded to build the package. Because Ploticus is hosted on sourceforge I had to specify a manual download URL rather than that of the automated download site (downloads.sourceforge.net).

Note that the source location can also be specified as an ftp:// address.

URL: http://ploticus.sourceforge.net

The open-source community's web address.
Group:  Applications/Graphics and Imaging
The group tag describes the kind of software in the package and will be used by the OpenSolaris Package Manager to categorize the package hierarchically. I chose a group name based on the package classifications listed here.
Distribution:	OpenSolaris
Vendor: OpenSolaris Community

%include default-depend.inc

Boilerplate.
BuildRequires: SUNWxorg-headers, SUNWzlib, SUNWgcc

These are other OpenSolaris packages that must be available on the build system in order to correctly create the binary package. In this case, I am building Ploticus with X-Windows capabilities, so I need to ensure the X client header files are available. I am also enabling a Ploticus compression option so zlib is needed as well. And, to be safe, I've specified which compiler is required. I could have used Sun Studio, but I know for sure that Ploticus compiles with gcc so I've used that.

You can find these package names by searching in the Package Manager on your local OpenSolaris system.

Requires: SUNWzlib
This section lists packages that must be installed on the end-user system for the software to work correctly. In this case, Ploticus will be dynamically linked against zlib so I need to make sure the Package Manager knows about this dependency. When a user asks for Ploticus from the repository, the Package Manager will know it also needs to download and install the SUNWzlib package.
BuildRoot:      %{_tmppath}/%{name}-%{version}-build
SUNW_Basedir:   %{_basedir}

This is boilerplate. The intent of BuildRoot is to define a user- and application-specific path that can be used as the root of an area in which your package will be installed on the build server, allowing the build server to support simultaneous builds of multiple packages by multiple users without interference. Note, however, that I do not use BuildRoot in this spec file because this conversation indicates that $RPM_BUILD_ROOT is the officially supported way to refer to the top of a package install area. I don't know if this is true in the OpenSolaris world as well, but most spec files I've seen for OpenSolaris use $RPM_BUILD_ROOT so I have opted to use that as well.

Note that while $RPM_BUILD_ROOT (and BuildRoot) refers to the root of the installation area on the build server, the top of the build area itself -- the location where your package will actually be untar'ed and built -- is referred to as %{_builddir}.

I do not know how SUNW_Basedir is used.

SUNW_Copyright: %{name}.copyright
This is the name of the copyright file you will upload to SourceJuicer along with this spec file. It must be named as shown (ploticus.copyright in my case). You will typically find this copyright file on the community's website and/or included within the community's source tarball. In the case of Ploticus, the tarball contains a file in src called Copyright, which I have copied, renamed to ploticus.copyright and then edited to remove html markup. This is the file I will then upload to SourceJuicer. The original src/Copyright file is ignored by SourceJuicer. Update: The preceding was actually not sufficient for my package to be validated. I was asked to append the file GPL.txt, which was also in the tarball's src directory, to ploticus.copyright so that the actual text of the GPL v2 copyright was in the file. The original version of the copyright file (src/Copyright) only refers to the GPL copyleft; it does not include the copyright itself.
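I made these edits by hand, but the steps (strip the html, append the license text, name the result after the package) are mechanical enough to script. The following is a hedged sketch of that idea in Python; the function names and the sample strings are made up for illustration and are not part of SourceJuicer.

```python
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Collects only the text between tags, discarding the markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(markup):
    """Return the plain text contained in an html fragment."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)

def build_copyright(copyright_markup, license_text):
    """Plain-text copyright notice followed by the full license text."""
    return strip_html(copyright_markup).strip() + "\n\n" + license_text.strip() + "\n"

def copyright_filename(package_name):
    """File name expected for a package's copyright file: <name>.copyright."""
    return f"{package_name}.copyright"

if __name__ == "__main__":
    # Placeholder strings; the real inputs would be src/Copyright and src/GPL.txt.
    text = build_copyright("<p>Example <b>copyright</b> notice</p>", "Full license text here")
    print(copyright_filename("ploticus"))
    print(text)
```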
Meta(info.upstream): Steve Grubb <ploticus@yahoogroups.com>
Meta(info.maintainer):  Josh Simons <josh.simons@sun.com>

These fields are specific to OpenSolaris's packaging system. The upstream field contains the name and address of the individual or group that creates and supports the open-source software. The maintainer field contains the name and email address of the individual responsible for the OpenSolaris packaging of the open-source project. The preferred format is as shown in these examples.

Additional info fields that can be included are documented here.

%description
A free, GPL, non-interactive software package for producing plots, 
charts, and graphics from data. It was developed in a Unix/C 
environment and runs on various Unix, Linux, and win32 systems. 
ploticus is good for automated or just-in-time graph generation, 
handles date and time data nicely, and has basic statistical capabilities. 
It allows significant user control over colors, styles, options and details. 
Ploticus is a mature package, available since 1999, and version 2.40 has 
more than 12,000 downloads to date.

A more detailed description of the open source software. This description was taken from the Ploticus web page.
%prep
%setup -q -n pl241src

Now we begin specifying what actions are required to build the software. The %setup macro cd's into the build directory, removes any cruft left over from earlier builds, unzips the source tarball (which will have been downloaded at this point), and then untars the sources into the build directory. It then cd's into the package's top-level directory. All of this is done with %{_builddir} as the root directory as described earlier.

Note that %setup assumes the top-level directory specified in the tarball is named %{name}-%{version}. If this is not true for your package, use the -n option to specify the correct name. For Ploticus, all files in the tarball are in the pl241src directory, so I've used the -n option to specify this.

See this page for more details about the %setup macro. The %patch macro, which can also be used in the %prep phase, can be used to apply patches prior to building the binaries if the standard community source code needs to be modified in some way to build successfully on OpenSolaris. See the same page for %patch information. Note that you should try to have your OpenSolaris changes accepted by the community to avoid having to apply these patches.

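In standard RPM, the -q option tells %setup to run quietly, suppressing the list of files that would otherwise be printed as the tarball is unpacked.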

%build

cd src
make NOX11= XLIBS='-L/usr/openwin/lib -lX11' XOBJ='x11.o interact.o'  \
     XINCLUDEDIR=-I/usr/openwin/include WALL= ZLIB=-lz ZFLAG=-DWZ \
     PREFABS_DIR=/usr/lib/ploticus/prefabs pl

The %build section contains the commands needed to build the package binaries. At the end of the %prep phase we were left sitting in the top-level directory of the source tarball. Since the Ploticus makefile and sources are one level down from this (pl241src/src), I cd into src before invoking the correct make command for OpenSolaris.

Assuming the make ran correctly, we exit this phase with the binaries and other files all built on the build server in a sub-directory under %{_builddir}.

%install

mkdir -p $RPM_BUILD_ROOT%{_mandir}/man1
cp man/man1/pl.1 $RPM_BUILD_ROOT%{_mandir}/man1/pl.1
mkdir -p $RPM_BUILD_ROOT%{_bindir}
cp src/pl $RPM_BUILD_ROOT%{_bindir}
mkdir -p $RPM_BUILD_ROOT%{_libdir}/%{name}
cp -r prefabs $RPM_BUILD_ROOT%{_libdir}/%{name}


In the install phase, we execute a "make install" or equivalent, moving all files that will be included in the binary package to their final installed locations, but relative to $RPM_BUILD_ROOT rather than to "/" to avoid collisions on the build server. Because the Ploticus "make install" action doesn't do exactly what I need, I instead manually move each required file to its final location. For many projects, something similar to "make DESTDIR=$RPM_BUILD_ROOT install" would be appropriate in this phase.

If you are moving files manually, do not assume directories exist -- make them before you use them. And use the predefined directory macros (e.g. %{_mandir} ) to reference standard installation locations. Others are documented here.
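The staging idea (final locations, but rooted at $RPM_BUILD_ROOT, with directories created before use) can be sketched in a few lines of Python. This is an illustration only: the function name is mine, and the /usr/bin location in the example is an assumption about what %{_bindir} expands to.

```python
import os
import shutil

def stage(src, build_root, dest_dir):
    """Copy src into dest_dir as it will appear on the end-user system,
    but rooted under build_root instead of '/'."""
    staged_dir = os.path.join(build_root, dest_dir.lstrip("/"))
    os.makedirs(staged_dir, exist_ok=True)   # never assume the directory exists
    return shutil.copy(src, staged_dir)      # returns the path of the staged copy

# Example (paths are hypothetical):
#   stage("src/pl", "/var/tmp/ploticus-2.41-build", "/usr/bin")
# would place the binary at /var/tmp/ploticus-2.41-build/usr/bin/pl
```

Because nothing is ever written outside the build root, two users can stage different versions of the same package on one build server without stepping on each other.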

%clean
rm -rf $RPM_BUILD_ROOT

This is boilerplate clean-up code. Insert other commands as necessary.
%files

%defattr(-,root,bin)
%attr(0755, root, bin) %dir  %{_bindir}
%attr(0755, root, bin) %dir  %{_mandir}
%attr(0755, root, bin) %dir  %{_mandir}/man1
%attr(0755, root, bin) %dir  %{_libdir}
%attr(0755, root, bin) %dir  %{_libdir}/%{name}
%attr(0755, root, bin) %dir  %{_libdir}/%{name}/prefabs
%{_bindir}/*
%{_libdir}/%{name}/prefabs/*
%{_mandir}/*/*

This can be a complicated section so I suggest reading the Max RPM %files section.

The %files section specifies the locations and attributes of all files that will be placed onto the end-user's system when the binary package is installed. The %attr directive is used to specify permissions and ownership for files and directories. The %dir directive identifies directories. Multiple directives can be applied to objects by including them on the same line.

The first line specifies default mode, default user ID and default group ID for all files created during the build process. The dash ("-") means that a default is not set explicitly for that field. Note that failure to include this line in your spec file will cause an obscure error to be generated when an end-user tries to install your package. That would be very bad.

The next six lines specify the directories in which Ploticus-related files will reside. The last three ensure that the Ploticus binary, all of the Ploticus prefabs config files, and the man page will be included in the binary package. Note again the use of macros to specify standard installation directories.
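To show how the directives on a single %files line combine with the defaults from %defattr, here is a toy parser in Python. It handles only the forms used in my spec file above and is in no way the parser used by the real packaging tools; the default tuple mirrors my %defattr(-,root,bin) line.

```python
import re

_ATTR = re.compile(r"%attr\(\s*([^,]+?)\s*,\s*([^,]+?)\s*,\s*([^)]+?)\s*\)")

def parse_files_line(line, default=("-", "root", "bin")):
    """Split one %files entry into path, mode, user, group, and a directory flag.
    Fields not set by %attr fall back to the %defattr defaults."""
    mode, user, group = default
    match = _ATTR.search(line)
    if match:
        mode, user, group = match.groups()
        line = line[:match.start()] + line[match.end():]
    tokens = line.split()
    is_dir = "%dir" in tokens
    path = [tok for tok in tokens if tok != "%dir"][-1]
    return {"path": path, "mode": mode, "user": user, "group": group, "is_dir": is_dir}

if __name__ == "__main__":
    print(parse_files_line("%attr(0755, root, bin) %dir  %{_bindir}"))
    print(parse_files_line("%{_bindir}/*"))
```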

%changelog
* Tue Apr 28 2009 - Josh Simons <josh.simons@sun.com>
- initial version

Add any changelog information you desire here.

Once you've created your spec file, it is time to feed it to SourceJuicer for syntax and other checking and then iterate as necessary until your spec file is correct and has passed validation. The basic flow is shown in the diagram below.

The first step is to submit the spec file to SourceJuicer along with the project's copyright file. To do so, go to the SourceJuicer Submit page (login required). Assign a descriptive name to your upload (I used 'ploticus') and then specify your spec file. Use 'add another file' to add your copyright file. Add whatever other files you may need (see 'more help' on the Submit page). Click Submit and you will see a page like this:


The summary page includes an indication that my spec file successfully passed a syntax check. If an error occurs at this point, make the necessary corrections and use the ReSubmit tab (not shown) at the bottom of this page to upload new versions of your copyright and spec files.

Looking under Reviews, I can see my package has not yet been validated, which means my submission hasn't yet been checked by someone to ensure my copyright file is appropriate, that someone else has not already packaged this program for OpenSolaris, etc.

The next day I receive two email messages with comments from reviewers. When I log back into SourceJuicer and look at the Review tab, I see the two comments that were submitted. The fact that the package is still marked as not validated means I have issues to address:

Clicking on the "[review]" link takes me to the page with detailed information about the Ploticus review. I can also view this page by visiting the MyJuicer tab and then clicking on the appropriate link under My Submissions. This second method is better since it can be difficult to find your review on the main Review page. In any case, the page looks like this:

As you can see from Amanda and Christian's comments, I did not use the correct naming convention for the copyright file I uploaded to SourceJuicer. Rather than "Copyright", the file should have been named "ploticus.copyright" (more generally, %{name}.copyright). Also, Amanda hopes I can remove the HTML that is for some reason embedded in the standard Ploticus copyright file.

Using this same review page, I submit a clarifying question back to the reviewers to ensure I address their issues. I am not clear on the relationship between the copyright file that is submitted manually to SourceJuicer and the copyright file in the source tarball that is described with the "SUNW_Copyright" tag in the spec file.

Now that I understand the copyright issue and have adjusted my spec file and copyright file appropriately (and also updated the spec file and annotations in this blog entry--meaning you never saw that I had initially called my copyright file "Copyright"), I use the same Review page to Resubmit the spec file and copyright file. Use the tab at the bottom of the Review page to do this:

As of this writing, there is no way to remove a file that has been submitted to SourceJuicer, so all three files (Copyright, ploticus.copyright, and ploticus.spec) are associated with the project even though Copyright is now extraneous. Until removal is possible, just ignore the extra files. [UPDATE: As of SJ 1.2.0, files can be removed by visiting the MyJuicer review page for the appropriate package.]

I resubmitted the files, the package was subsequently validated, and then it was automatically scheduled to be built on the build server. I did not receive a notification when the build attempt occurred so you need to check status periodically (use the MyJuicer tab). When I checked, I saw my build had completed successfully on the first attempt:

Had the build not succeeded, I would have followed the Log link to view the build log, found the problem, fixed the spec file, and then Resubmitted. The package would then be rescheduled for another build automatically with no need for re-validation.

With the Ploticus build successfully completed, it is now very important to verify that the package installs correctly and that the software actually works. Though I don't cover it here, my first Ploticus package did not work correctly on my test system. I had to make changes to my spec file, rebuild the package, and reinstall it. Therefore, please do install and test your software!

To do the test installation, I first added the pending repository as a package authority on my 2008.11 system. Note carefully the location of this repository; I had expected it to be http://pkg.opensolaris.org/pending, but that is not correct:

% pfexec pkg set-authority -O http://jucr.opensolaris.org/pending pending

I then started the Package Manager, selected the Pending repository and did a search for Ploticus. Voila! The package is available:

After selecting the package and clicking on Install/Update, the installation proceeds smoothly. I then start a terminal window and verify that Ploticus does, in fact, work correctly:

Once you are sure your package installs and runs correctly, send an email to sw-porters-discuss@opensolaris.org requesting that the package be promoted from the pending repository to the contrib repository. Note that you'll need to subscribe to this mailing list before you can post to it. To subscribe, go here.

Once the package is available in contrib, users will be able to install your package on their systems.

FIN!

[See my later blog entry for additional information about SourceJuicer and OpenSolaris improvements that make package contributions even easier.]


(2009-05-25 08:46:52.0)

20090420 Monday April 20, 2009

Oracle is No IBM

I see some very exciting synergies in Oracle's acquisition of Sun, many of which are covered in this joint presentation and also this FAQ on the Oracle website.

Very unlike the rumored IBM acquisition of Sun, which I think would have gone like this.

(2009-04-20 13:58:39.0)

20090415 Wednesday April 15, 2009

Tickless Clock for OpenSolaris

I've been talking a lot to people about the convergence we see happening between Enterprise and HPC IT requirements and how developments in each area can bring real benefits to the other. I should probably do an entire blog entry on specific aspects of this convergence, but for now I'd like to talk about the Tickless Clock OpenSolaris project.

Tickless kernel architectures will be familiar to HPC experts as one method for reducing application jitter on large clusters. For those not familiar with the issue, "jitter" refers to variability in the running time of application code due to underlying kernel activity, daemons, and other stray workloads. Since MPI programs typically run in alternating compute and communication phases and develop a natural synchronization as they do so, applications can be slowed down significantly when some nodes arrive late at these synchronization points. The larger the MPI job, the more likely this type of noise will cause a problem. Measurements have shown surprisingly large slowdowns associated with jitter.
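To see why job size matters so much, here is a small back-of-the-envelope sketch in Python. It rests on a simplifying assumption that is mine, not a measurement: each node is independently delayed by noise in a given compute phase with some small probability (the 0.1% used below is purely illustrative). Because every node waits at the synchronization point, the whole step is late if any single node is late.

```python
def prob_step_delayed(nodes, p_node):
    """Probability that at least one of `nodes` is late in a step.

    Assumes each node is delayed independently with probability p_node,
    so the step is on time only if every node is on time.
    """
    return 1.0 - (1.0 - p_node) ** nodes


# Illustrative per-node delay probability (an assumption, not a measurement).
P_NODE = 0.001

for n in (1, 100, 10_000):
    print(f"{n:>6} nodes: {prob_step_delayed(n, P_NODE):.3f}")
```

Under these assumptions a delay that is rare on any one node becomes close to certain on a job with thousands of nodes, which is the scaling effect described above.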

Jitter can be lessened by reducing the number of daemons running on a system, by turning off all non-essential kernel services, etc. Even with these changes, however, there are other sources of jitter. One notable source is the clock interrupt used in virtually all current operating systems. This interrupt, which fires 100 times per second, is used to periodically perform housekeeping chores required by the OS. This interrupt is a known contributor to jitter. It is for this reason that IBM has implemented a tickless kernel on their Blue Gene systems to reduce application jitter.

Sun is starting a Tickless Clock project in OpenSolaris to completely remove the clock interrupt and switch to an event-based architecture for OpenSolaris. While I expect this will be very useful for HPC users of OpenSolaris, HPC is not the primary motivator of this project.

As you'll hear in the video interview with Eric Saxe, Senior Staff Engineer in Sun's Kernel Engineering group, the primary reasons he is looking at Tickless Clock are power management and virtualization. For power management, it is important that when the system is idle, it really IS idle and not waking up 100 times per second to do nothing since this wastes power and will prevent the system from entering deeper power saving states. For virtualization, since multiple OS instances may share the same physical server resources, it is important that guest OSes that are idle really do stay idle. Again, waking up 100 times per second to do nothing will steal cycles from active guest OS instances, thereby reducing performance in a virtualized environment.
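The scale of those needless wake-ups is easy to work out. The Python sketch below simply multiplies the 100-per-second tick rate quoted above by elapsed time and by the number of idle guests. The function name and the 20-guest host are hypothetical, chosen only for illustration.

```python
TICKS_PER_SECOND = 100      # clock interrupt rate quoted in the text
SECONDS_PER_DAY = 24 * 60 * 60


def idle_wakeups(seconds, idle_guests=1, ticks_per_second=TICKS_PER_SECOND):
    """Count clock-tick wake-ups for fully idle OS instances over a period."""
    return seconds * idle_guests * ticks_per_second


# One idle system for a day.
print(idle_wakeups(SECONDS_PER_DAY))
# A hypothetical host carrying 20 idle guest OS instances for a day.
print(idle_wakeups(SECONDS_PER_DAY, idle_guests=20))
```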

While I would argue that both power management and virtualization will become increasingly important to HPC users (more of that convergence thing), it is interesting to me to see that these traditional enterprise issues are stimulating new projects that will benefit both enterprise and HPC customers in the future.

Interested in getting involved with implementing a tickless architecture for OpenSolaris? The project page is here.


(2009-04-15 11:14:44.0)

MacBook Pro: Many Screws, All Tiny

This weekend I upgraded my 2.2 GHz Core 2 Duo MacBook Pro's internal hard disk from 160 GB to 320 GB following the excellent instructions at iFixit. The two dodgy steps are freeing the top-case assembly and carefully prying loose the ribbon cables that are attached to the top of the existing hard drive. For the latter you definitely need some sort of thin plastic tool to work gently underneath the ribbons to detach them. To keep track of the variety of tiny screws encountered (both Phillips and Torx), I organized them according to their iFixit disassembly step.

I chose the 7200RPM Hitachi Travelstar 320GB 16MB SATA drive (model HTS723232L9A360) as a replacement. Since my 160 GB disk was also a 7200 RPM drive, I didn't experience the noticeable performance improvements some people have reported when moving to a faster disk. If you do an upgrade, you should definitely use a 7K drive. I bought mine at Other World Computing.

Before replacing the drive, I did a full backup onto an external FireWire disk and then swapped my new Travelstar into the external drive enclosure and did another full, bootable backup onto the new disk. Both backups were done using Carbon Copy Cloner. After booting from the now externally-attached Travelstar to verify that the backup had worked correctly, I removed the Travelstar from the external disk enclosure and then inserted it into the MBP following the iFixit instructions. Once done, my machine booted with no problem. I now have lots of space for my growing collection of RAW photos, which eat disk space at an alarming clip.


(2009-04-15 07:18:57.0)

20090414 Tuesday April 14, 2009

You Say Nehalem, I Say Nehali

Let me say this first: NEHALEM, NEHALEM, NEHALEM. You can call it the Intel Xeon Series 5500 processor if you like, but the HPC community has been whispering about Nehalem and rubbing its collective hands together in anticipation of Nehalem for what seems like years now. So, let's talk about Nehalem.

Actually, let's not. I'm guessing that between the collective might of the Intel PR machine, the traditional press, and the blogosphere you are already well-steeped in the details of this new Intel processor and why it excites the HPC community. Rather than talk about the processor per se, let's talk instead about the more typical HPC scenario: not about a single Nehalem, but rather piles of Nehali and how best to deploy them in an HPC cluster configuration.

Our HPC clustering approach is based on the Sun Constellation architecture, which we've deployed at TACC and other large-scale HPC sites around the world. These existing systems house compute nodes in the Sun Blade 6048 chassis, which holds four blade shelves, each with twelve blade systems, for a total of 48 blades per chassis. Constellation also includes matching InfiniBand infrastructure, including InfiniBand Network Express Modules (NEMs) and a range of InfiniBand switches that can be used to build petascale compute clusters. You can see several of the 82 TACC Ranger Constellation chassis in this photo, interspersed with (black) inline cooling units.

As part of our continued focus on HPC customer requirements, we've done something interesting with our new Nehalem-based Vayu blade (officially, the Sun Blade X6275 Server Module): each blade houses two separate nodes. Here is a photo of the Vayu blade:

Each of the nodes is a diskless, two-socket Nehalem system with 12 DDR3 DIMM slots (up to 96 GB per node) and on-board QDR InfiniBand. It's actually not quite correct to call these nodes diskless because the blade includes two Sun Flash Module slots (one per node) that each provide up to 24 GB of flash storage through a SATA interface. I am sure our HPC customers will use what amounts to an ultra-fast disk for some interesting applications.

Using Vayu blades, each Sun Constellation chassis can now support a total of 2 nodes/blade * 12 blades/shelf * 4 shelves/chassis = 96 nodes with a peak floating-point performance of about 9 TFLOPs. While a chassis can support up to 96 GB / node * 96 nodes = 9.2 TB of memory, there are some subtleties involved in optimizing and configuring memory for Nehalem systems so I recommend reading John Nerl's blog entry for a detailed discussion of this topic.
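The chassis-level numbers above can be reproduced with a few lines of Python. The node and memory figures come straight from the text. The peak-performance estimate needs two inputs the text does not give, so treat them as my assumptions: 4 double-precision floating-point operations per core per cycle and a 2.93 GHz clock. With those values the result lands near the quoted "about 9 TFLOPs".

```python
# Figures taken from the text.
NODES_PER_BLADE = 2
BLADES_PER_SHELF = 12
SHELVES_PER_CHASSIS = 4
MAX_GB_PER_NODE = 96
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 4

# Assumptions for illustration only (not stated in the text).
FLOPS_PER_CORE_PER_CYCLE = 4
CLOCK_GHZ = 2.93

nodes = NODES_PER_BLADE * BLADES_PER_SHELF * SHELVES_PER_CHASSIS
memory_tb = nodes * MAX_GB_PER_NODE / 1000  # decimal terabytes

gflops_per_node = (SOCKETS_PER_NODE * CORES_PER_SOCKET
                   * FLOPS_PER_CORE_PER_CYCLE * CLOCK_GHZ)
peak_tflops = nodes * gflops_per_node / 1000

print(f"nodes per chassis: {nodes}")
print(f"max memory per chassis: {memory_tb:.1f} TB")
print(f"peak per chassis (under the assumptions above): {peak_tflops:.1f} TFLOPs")
```

A different clock speed or processor bin changes only the last figure; the node count and memory ceiling follow from the chassis geometry alone.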

For a quick visual tour of Vayu, see the annotated photo below. The major components are: A = Nehalem 4-core processor; B = Memory DIMMs; C = Mellanox ConnectX QDR InfiniBand ASIC; D = Tylersburg I/O chipset; E = Baseboard Management Controller (BMC) / Service Processor (one per node); F = Sun Flash Modules (SATA, 24 GB per node). The connector at the top right supports two PCIe2 interfaces, two InfiniBand interfaces, and two GbE interfaces. The GbE logic is hiding under the BMC daughter board.

To accommodate this high-density blade approach we've developed a new QDR InfiniBand NEM (officially called the SunBlade 6048 QDR IB Switched NEM), which is shown below. This Network Express Module plugs into a blade shelf's midplane and forms the first level of InfiniBand fabric in a cluster configuration. Specifically, the two on-board 36-port Mellanox QDR switch chips act as leaf switches from which larger configurations can be built. Of the 72 total switch ports available, 24 are used for the Vayu blades and 9 from each switch chip are used to interconnect the two switches, leaving a total of 72 - 24 - 2*9 = 30 QDR links available for off-shelf connections. These links leave the NEM through 10 physical connectors, each of which carries three QDR 4X links over a single cable. As we discussed when the first version of Constellation was released, aggregating 4X links into 12X cables results in significant customer benefits related to reliability, density, and complexity. In any case, these cables can be connected to Constellation switches to form larger, tree-based fabrics. Or they can be used in switchless configurations to build torus-based topologies by using the ten cables to carry X, Y, and Z traffic between shelves and across chassis, as mentioned, for example, here. The NEM also provides GbE connectivity for each of the 24 nodes in the blade shelf.
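The NEM port budget is simple enough to check in a few lines of Python. All of the numbers below come from the paragraph above; only the variable names are mine. The final check confirms that the links left over after node and inter-switch connections exactly fill the ten 12X cables.

```python
SWITCH_CHIPS = 2
PORTS_PER_CHIP = 36
NODE_PORTS = 24                  # one QDR port per node in the shelf
INTERSWITCH_PORTS_PER_CHIP = 9   # ports each chip spends linking to the other
CONNECTORS = 10
LINKS_PER_CONNECTOR = 3          # three 4X links aggregated into one 12X cable

total_ports = SWITCH_CHIPS * PORTS_PER_CHIP
off_shelf_links = (total_ports - NODE_PORTS
                   - SWITCH_CHIPS * INTERSWITCH_PORTS_PER_CHIP)

print(f"total switch ports: {total_ports}")
print(f"off-shelf QDR links: {off_shelf_links}")
print(f"links carried by connectors: {CONNECTORS * LINKS_PER_CONNECTOR}")

# The off-shelf links should exactly fill the physical connectors.
assert off_shelf_links == CONNECTORS * LINKS_PER_CONNECTOR
```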

Looking back at the TACC photo, we can now double the compute density shown there with our newest Constellation-based systems using the new Vayu blade. Oh, and by the way, we can also remove those inline cooling units and pack those chassis side-by-side for another significant increase in compute density. I'll leave how we accomplish that last bit for a future blog entry.

(2009-04-14 06:05:00.0)

