Sunday July 01, 2007 | The Navel of Narcissus: Josh Simons' Coordinates in the Blogosphere
HPC Consortium: Sun Visualization System at the LCN

Andrew Gormanly of the London Centre for Nanotechnology (LCN) gave a talk this week at Sun's HPC Consortium meeting in Dresden about LCN's experiences with Sun's visualization products. The LCN is a collaboration between University College London and Imperial College London, focusing on nanoCAD and nano-fabrication, among other areas, and exploring both biological and non-biological fabrication techniques for building nano-scale structures.

The Centre's visualization requirements include the ability to visualize very large data sets and to make these capabilities intuitively accessible to both business decision makers and lab researchers. Workstations were considered and rejected because they cannot offer the computing power needed to handle the large data sets used at the Centre.

LCN has been working with Sun to create a visualization system based on Sun's Scalable Visualization Software. Their system uses a large, active stereo screen, a Sun Fire X4600 as the application engine, a Sun Fire X4500 ("Thumper") node for storage, and two headless Sun Ultra 40 systems as rendering engines. Voltaire InfiniBand is used as the system interconnect.

From a user's perspective, one logs into the X4600 and runs an application, which then appears to generate locally the 3D stereo renderings seen on the large display. What is actually happening is a bit more complicated, though it does not interfere with the intuitive user experience. The application is indeed running on the X4600, taking advantage of the large memory and processing power of this SMP node. However, the graphical rendering is performed on the two Ultra 40 machines, each of which sends a pixel stream to the display and is responsible for computing the image on one half of the display surface.
This feat is accomplished through Sun's Scalable Visualization Software, which transparently interposes itself on the application's OpenGL calls and routes the graphics requests to the Ultra 40s over the InfiniBand link. While LCN currently uses two graphics workstations, there is no reason in principle that they could not extend the approach to four workstations, each responsible for one quarter of the display screen.

As the collaboration with Sun continues, LCN is interested in exploring the use of Sun Grid Engine to allow compute nodes beyond their single X4600 to be used as part of this system. In addition, they are interested in exporting graphics output to desktops and Sun Rays so as to share the visualization system's capabilities more broadly within LCN. To do this, they will also use Sun's Shared Visualization Software.
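The division of the display among rendering nodes can be illustrated with a small sketch. This is not Sun's Scalable Visualization Software or any part of its API; the `Tile` type, the `split_display` function, and the 2560x1024 resolution are all hypothetical names and numbers chosen for illustration. It only shows the arithmetic of assigning each node one rectangle of the screen, for two nodes (halves) or four (quarters).

```python
from typing import List, NamedTuple


class Tile(NamedTuple):
    """Rectangle of the display assigned to one rendering node (hypothetical type)."""
    node: int
    x: int
    y: int
    width: int
    height: int


def split_display(width: int, height: int, cols: int, rows: int = 1) -> List[Tile]:
    """Divide a width x height display into cols x rows tiles, one per node.

    Tile edges are computed from integer boundaries so that the tiles
    always cover the whole display exactly, even when the resolution
    is not evenly divisible by the grid size.
    """
    if cols < 1 or rows < 1:
        raise ValueError("cols and rows must be at least 1")
    tiles = []
    for r in range(rows):
        y0 = r * height // rows
        y1 = (r + 1) * height // rows
        for c in range(cols):
            x0 = c * width // cols
            x1 = (c + 1) * width // cols
            tiles.append(Tile(r * cols + c, x0, y0, x1 - x0, y1 - y0))
    return tiles


# Two rendering nodes, each drawing one half of an assumed 2560x1024 display.
halves = split_display(2560, 1024, cols=2)

# Four rendering nodes, each drawing one quarter of the same display.
quarters = split_display(2560, 1024, cols=2, rows=2)
```

In a real system each node would also need the matching view frustum for its tile and, for active stereo, synchronized left/right frames; those details are outside this sketch.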
(2007-07-01 07:21:42.0)

HPC Consortium: A Customer View of ZFS

Thomas Nau of the University of Ulm gave a customer view of ZFS at Sun's HPC Consortium meeting in Dresden this past week. His talk was titled "ZFS – How safe do you think your data is without?"

Thomas motivated his discussion by asking the audience whether it would be okay to miss the one event in a trillion that would lead to the discovery of a new particle, to lose all of the email on one's mail server, to lose access to all of one's mp3 or video files, or, perhaps worse, to be unaware of most of the errors that have occurred at one's site.

Currently, storing data in a file system is a matter of trust. We trust that the disk drives, the controllers, the multiple pieces of firmware, the battery backup, the cabling, the adapters, the operating system, and so on all perform well enough to protect our data. And we hope as well that the "human factor" does not cause data loss. But of course there are things that can and will go wrong: bit rot, phantom writes, DMA errors, driver and firmware bugs, accidental overwrites, misdirected reads and writes, etc.

In addition, because volume managers and file systems are commonly separate pieces of software, the volume manager does not know the importance of particular pieces of data, for example critical metadata whose loss would lead to the loss of an entire file or file system rather than "just" some data within an individual file. This "all data is equal" view of the file system further increases the vulnerability of stored data.

Thomas then went on to share a case study involving data corruption at the University of Ulm. It was one of those nightmare scenarios involving the loss of email services for the entire university. It wasn't as if they hadn't thought about data protection at Ulm.
Their mail service is supplied by a two-node cluster and two disk arrays fully connected through a SAN, with offsite mirrors, regular backups, etc. And yet, one day one of the email servers panicked with a "freeing free inode" error message. After fsck'ing the file system for 10+ hours, during which no user access to the file system was allowed, they felt they had fixed the problem, since fsck had found and fixed several issues. They rebooted and the system crashed within ten seconds. One more fsck and they saw the same crash again.

Because they had already been considering and planning a migration to ZFS, they took this opportunity to invest an additional 40 hours in copying all of their email data into a ZFS file system, after rigging up a temporary email server and recovering enough of people's important email to carry them to the following weekend, when they could do the swap-over. They have had no problems since moving to ZFS. The Infrastructure Department is still doing a root-cause analysis of this failure, but believes at this point that a power failure about four weeks before the outage may have somehow let their mirrors get out of sync.

As Thomas pointed out, ZFS cannot eliminate hardware problems or change the math concerning the number of failures that will result from the failure rates of underlying components. And it can't make human decisions smarter. But it can detect and inform you about errors. It will detect out-of-sync mirrors. And it will correct underlying problems if you let it.

Thomas then gave further details on several of ZFS's more prominent safety features, including the ubiquitous use of checksums to provide end-to-end data integrity, the built-in volume manager that allows ZFS to selectively double- or triple-replicate file system metadata depending on its importance, and the copy-on-write approach used by ZFS to avoid ever overwriting valid data that is in use.

For more information on ZFS, go here.
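To make the idea of checksum-based detection and repair concrete, here is a toy model of a mirrored store that verifies every read. It is only a sketch under stated assumptions: the `ChecksummedMirror` class and its methods are hypothetical, it uses SHA-256 from Python's standard library simply as a convenient checksum, and it says nothing about ZFS's actual on-disk format or interfaces. What it does show is the general principle: because the checksum is kept apart from the data it covers, a copy that silently differs can be detected on read and rewritten from a copy that still verifies.

```python
import hashlib


class ChecksummedMirror:
    """Toy mirrored block store that verifies data on every read (hypothetical)."""

    def __init__(self, copies: int = 2):
        # Each replica stands in for one side of a mirror.
        self.replicas = [dict() for _ in range(copies)]
        # Checksums are stored separately from the data blocks.
        self.checksums = {}
        # Count of replica blocks rewritten after failing verification.
        self.repaired = 0

    def write(self, block_id: int, data: bytes) -> None:
        self.checksums[block_id] = hashlib.sha256(data).digest()
        for replica in self.replicas:
            replica[block_id] = data

    def read(self, block_id: int) -> bytes:
        expected = self.checksums[block_id]
        good = None
        bad = []
        for replica in self.replicas:
            data = replica.get(block_id)
            if data is not None and hashlib.sha256(data).digest() == expected:
                if good is None:
                    good = data
            else:
                bad.append(replica)
        if good is None:
            # Every copy is damaged: report the error instead of returning bad data.
            raise OSError(f"block {block_id}: all copies failed checksum")
        # Self-heal: overwrite each damaged or stale copy with the verified one.
        for replica in bad:
            replica[block_id] = good
        self.repaired += len(bad)
        return good


# Simulate one side of the mirror silently diverging, then read the block back.
mirror = ChecksummedMirror()
mirror.write(7, b"mail spool")
mirror.replicas[0][7] = b"mail spoo1"  # silent corruption on one replica
recovered = mirror.read(7)
```

A plain mirror without checksums would have no way to tell which of the two differing copies is the correct one; here the stored checksum settles the question.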
(2007-07-01 03:58:10.0)