HPC ClusterTools 8.2 is based on Open MPI 1.3.3 (open-mpi.org) and includes the following:
- InfiniBand, GbE, 10GbE, and Myrinet interconnect support
- IB multi-rail, IB QDR, and Mellanox ConnectX support
- MPI profiling with Sun Studio Analyzer, plus support for VampirTrace
- Application suspend/resume support
- Automatic Path Migration support
- Performance and scalability, including shared memory optimizations
- DTrace providers on Solaris
- Sun Studio, PGI, Intel, Pathscale, and GNU/gcc compiler support
- Plug-ins for Sun Grid Engine (SGE) and Portable Batch System (PBS)
- TotalView and Allinea DDT parallel debugger support
- Full MPI-2 standard implementation, including MPI I/O and one-sided communication
- Support for Linux (RHEL, SLES, CentOS), OpenSolaris, and Solaris
- Full Service Support offerings available from Sun
More details can be found at http://www.sun.com/software/products/clustertools/index.xml
We are announcing the availability of the Sun Grid Engine 6.2 Update 3
release.
Sun Grid Engine 6.2u3 is a feature update release. It delivers cloud
connectivity, initial power saving support, a few much-demanded features,
and complete Microsoft Windows Vista support.
A. What's new
=============
Amazon Elastic Compute Cloud (EC2) Adapter
------------------------------------------
The Service Domain Manager (SDM) adds connectivity to the Amazon Elastic
Compute Cloud (EC2) and the ability to add execution hosts on demand.
Initial Power Saving Support
----------------------------
A new power saving scheme in SDM enables the creation of a special spare
pool of resources. Systems can be powered on or off as they are added to
or removed from this spare pool.
Service Domain Manager (SDM) Simple Install
-------------------------------------------
It is now possible to install and run an SDM system with only one JVM per
(managed or master) host. Previously, the system was using up to three
separate JVMs per host. This new feature simplifies installation,
configuration and maintenance.
SGE Inspect - Sun Grid Engine Inspect Module
--------------------------------------------
A new Java-based Sun Grid Engine Inspect module lets you monitor SGE
clusters and the Service Domain Manager (SDM).
Exclusive Host Scheduling
-------------------------
Exclusive host scheduling allows users to request that jobs and parallel
tasks run exclusively on a host if allowed by an administrator.
Microsoft Windows Vista Display Support
---------------------------------------
With the release of Sun Grid Engine 6.2u3, the display_win_gui feature is
now fully supported. display_win_gui can now be used to display a job GUI on
the visible Desktop, on both the 32- and 64-bit versions of Windows Vista
(Enterprise and Ultimate Edition), and on Windows Server 2008.
This feature allows a Sun Grid Engine job to request the "display_win_gui"
complex attribute, which launches a GUI on the currently visible Desktop on
the Windows host that displays job information. This works only if the job
is a native Windows application.
Changes to licensing
====================
This Sun Grid Engine version changes the terms under which the software can
be used. Without a valid Sun Grid Engine license, evaluation use is
permitted for only 90 days. The courtesy binaries that will be made
available will continue to allow unlimited use but will not include the
Amazon EC2 adapter or the SGE Inspect module.
Relevant links
==============
Download:
http://www.sun.com/software/sge
Documentation:
http://wikis.sun.com/display/gridengine62u3
Release Notes and detailed information on new features:
http://wikis.sun.com/display/gridengine62u3/Release+Notes
Patch Matrix:
http://wikis.sun.com/display/gridengine62u3/Patch+Matrix
Man Pages Online:
http://gridengine.sunsource.net/manpages.html
List of fixed bugs:
http://gridengine.sunsource.net/project/gridengine/62patches.txt
Announcing Sun HPC Software, Developer Edition 1.0 for OpenSolaris, which
provides a pre-configured, integrated development environment to enable
developers to quickly and efficiently create, debug and deploy parallel
applications. It takes advantage of virtualization for easy installation and
seamlessly integrates with a cloud environment for extreme scalability.
At a Glance
- Fully featured HPC development environment distributed as a virtual machine
(VM).
- Pre-configured as a grid enabled, virtual HPC cluster comprised of three
OpenSolaris zones.
- Turn-key parallel application development environment with distributed
resource management and cloud connectivity built in and ready to go.
- Sun Studio and Sun HPC ClusterTools come pre-installed and configured,
providing Fortran, C and C++ compilers, MPI libraries, performance analysis
and debugging tools, high performance scientific libraries and an intuitive IDE
for application development.
- Sample applications are included in the installation to get you up and
running quickly.
Learn more at:
http://www.sun.com/software/products/hpcsoftware/hpcdev
This software is particularly aimed at professors and students in parallel
programming classes, since it is very lightweight and provides a complete
development environment for HPC on a laptop.
Sun Studio 12 Update 1, the latest production release of Sun Studio Compilers and Tools, is now available for download:
http://developers.sun.com/sunstudio/
Sun Studio 12 Update 1 is supported on Solaris 10, OpenSolaris (2008.11 and 2009.06), and the leading Linux distributions (SuSE Linux Enterprise Server 10, Red Hat Enterprise Linux 5, CentOS 5). Feature highlights since Sun Studio 12 include:
* C, C++ and Fortran compiler optimizations for the latest UltraSPARC and SPARC64-based architectures
* C, C++ and Fortran compiler optimizations for the latest x86 architectures from Intel and AMD, including SSSE3, SSE4a, SSE4.1, and SSE4.2 compiler intrinsics support
* Compiler, debugger, and profiling support for OpenMP 3.0
* Profiling of distributed MPI-based applications
* DLight - New tool for unified application and system profiling using Dynamic Tracing (DTrace) technology on Solaris platforms
* dbxTool - New stand-alone graphical debugger
* Highly tuned and parallelized scientific libraries, including ScaLAPACK
* Updated IDE based on NetBeans 6.5.1 software
Sun Studio 12 Update 1 Features page: http://developers.sun.com/sunstudio/features/index.jsp
Sun Studio 12 Update 1 Press Release: http://www.sun.com/aboutsun/pr/2009-06/sunflash.20090623.1.xml
Sun Studio Blogging Contest: http://developers.sun.com/sunstudio/community/campaigns/blogcontest_062009/welcome.jsp
Having been involved in MD Nastran performance tuning and benchmarking on Sun hardware for many years, I always take notice when I see stand-out performance on new hardware. I saw exactly that during some recent MD Nastran benchmarking on the new Intel Xeon Processor 5500 Series (aka Nehalem) found in the Sun Fire X4270 server. This benchmarking was part of a larger effort to gather performance data across various Sun hardware configurations--my goal was to study the effects of different processor, disk, and memory configurations on MD Nastran performance (elapsed and CPU times). The Sun Fire X4270 delivered the lowest elapsed times of all the platforms and configurations I tested. As one example from my benchmark study, I chose the following X4150 machine configuration as the baseline for this blog:
Sun Fire X4150 Server
2x Xeon X5460 3.1 GHz processors, 24GB RAM
4x 146GB 10K RPM SAS drives
OS: Solaris 10
The new Sun Fire X4270 configuration:
Sun Fire X4270
2x Xeon X5570 2.9 GHz processors, 24GB RAM
4x 146GB 10K RPM SAS drives
OS: Solaris 10
On both servers I used one disk for the Solaris 10 OS, MD Nastran binary, and the standard MD Nastran *.f04,*.f06, and *.log output files. I configured the remaining disks with ZFS and used these for the MD Nastran database files (more on ZFS later).
The table below shows the "% reduction in elapsed time" for the MD Nastran MDR3 benchmarks on the Sun Fire X4270 compared to the Sun Fire X4150:
| getrag | gm20a_1_1 | md0mdf1_1 | xl1fn40 | vl0sst1 | xl0imf1 | xl0tdf1_1 | xx0cmd2_1 |
|---|---|---|---|---|---|---|---|
| 34% | 27% | 32% | 32% | 16% | 55% | 17% | 35% |
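For reference, a "% reduction in elapsed time" figure like those in the table is normally computed as (old - new) / old * 100. Here is a minimal sketch of that arithmetic in shell. The function name pct_reduction and the elapsed times (1000 and 450 seconds) are made up for illustration and are not measurements from this study; a POSIX awk (nawk on Solaris) is assumed.

```shell
#!/bin/sh
# pct_reduction OLD NEW
# Prints the percent reduction in elapsed time when going from OLD to NEW
# seconds, rounded to the nearest whole percent.
pct_reduction() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.0f\n", (old - new) / old * 100 }'
}

# Hypothetical elapsed times in seconds (not taken from the benchmark runs).
pct_reduction 1000 450
```

With these example numbers the result is 55, meaning the job finished in 45% of the original elapsed time.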
As I mentioned earlier, this performance data came out of a larger study I was involved in to look at various machine configurations and their effect on MD Nastran performance. I'll cover that study in more detail in a future blog; here are a few highlights:
1. Solid State Drive (SSD) performance vs SAS disk:
During my performance study with SSDs I saw some noteworthy performance improvements (elapsed time reductions) on some of the individual MD Nastran benchmarks above, and also when I ran a combination of these benchmarks concurrently on the same machine. In some cases I saw up to a 56% reduction in elapsed time when using the SSDs compared to the internal SAS disks, with the amount of reduction corresponding to the amount of I/O and memory used by the benchmark(s). For example, the relatively large MD Nastran DMP (Distributed Memory Parallel) benchmark "xx0cmd2_8, DMP=8" showed a 49% reduction in elapsed time using the SSDs compared to the internal SAS disks, while smaller benchmarks like "getrag" showed a 19% reduction. I also got a significant improvement from the SSDs when running a combination of Nastran benchmarks concurrently on one machine. For example, running 3 benchmarks concurrently (two getrag jobs and one xl1fn40 job), which together used 20GB of memory and generated 1TB of total I/O, I saw a 44-56% reduction in elapsed time with the SSDs compared to the internal disks. The reason for the 44-56% range is explained in the next section, "ZFS Intent Log (ZIL)".
Here's the configuration I used for this SSD benchmarking:
Sun Fire X4270
2x Xeon X5570 2.9 GHz processors, 24GB RAM
4x 146GB 10K RPM SAS drives (ZFS, RAID 0)
3x 32GB SSDs (ZFS, RAID 0)
OS: Solaris 10
2. ZFS Intent Log (ZIL):
I've blogged in the past on the benefits and "ease-of-use" of using "ZFS with MD Nastran".
Recently I discovered a ZFS configuration option worth noting, related to the ZFS Intent Log (ZIL). By simply turning off the ZIL I was able to turn the 44% reduction in elapsed time (mentioned above in my discussion of SSD performance) into a 56% reduction. The ZIL is the mechanism ZFS uses to guarantee that synchronous writes are not lost if the machine crashes. However, if you're running an MD Nastran "scratch (scr=yes)" job, you will probably be comfortable experimenting with the ZIL turned off to see whether your particular mix of jobs benefits. I'm currently running various Nastran benchmark combinations with the ZIL turned off and will post the results in a future blog.
To turn off the ZIL edit the /etc/system file and add the following:
set zfs:zil_disable=1
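If you script this change, it is worth making it idempotent so the line is not appended twice. The following is only a sketch: the function name add_zil_disable is invented here, and the demonstration runs against a scratch file rather than the live /etc/system. Note that /etc/system is read at boot, so the setting takes effect after a reboot.

```shell
#!/bin/sh
# add_zil_disable FILE
# Appends "set zfs:zil_disable=1" to FILE unless that exact line is already there.
# (The match is exact; a line with extra spaces would not be detected.)
add_zil_disable() {
    file=$1
    if grep '^set zfs:zil_disable=1$' "$file" >/dev/null 2>&1; then
        echo "already set in $file"
    else
        echo "set zfs:zil_disable=1" >> "$file"
        echo "added to $file"
    fi
}

# Try it on a scratch file first rather than the live /etc/system.
tmp=${TMPDIR:-/tmp}/system.$$
: > "$tmp"
add_zil_disable "$tmp"    # first call appends the line
add_zil_disable "$tmp"    # second call leaves the file unchanged
rm -f "$tmp"
```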
For more information:
2. MD Nastran
In my prior blogs in this series on integrating "Sun Grid Engine and MSC.Software's MD Nastran" I described the following:
1. "Sun Grid Engine and MD Nastran" [recommended SGE configurations/queues for MD Nastran users]
2. "Part 1--How to submit "MD Nastran" (serial) jobs with Sun Grid Engine"
3. "Part 2--How to submit "MD Nastran" DMP (Distributed Memory Parallel) jobs with Sun Grid Engine"
4. "Part 3--How to configure consumable resources (Disk, Memory, and License Tokens)"
In this final blog in the series I'll describe some SGE configuration details I mentioned in the earlier blogs.
1. How to create "runtime limiting queues"
2. How to create an "SGE parallel environment"
First, a quick comment on my use of SGE's CLI (command line interface) instead of the GUI tool (QMON).
I continue to give examples using SGE's CLI instead of QMON to configure SGE because I've found the CLI to be an extremely flexible and powerful way to change settings once you become familiar with SGE. [That said, I still go back to QMON when I'm not sure how something works--it's a really great way to see how a feature or setting relates to the other components within SGE.]
Sun Studio Express 3/09, the official build used for the Sun Studio 12
Update 1 Early Access Program, is now available for download:
http://developers.sun.com/sunstudio/downloads/express/index.jsp
The Sun Studio Early Access build is available on Solaris, OpenSolaris
and the latest Linux distributions (Red Hat Enterprise Linux, SuSE Linux
Enterprise, CentOS, Ubuntu). Feature highlights since the Sun Studio 12
release include:
* C/C++/Fortran compiler optimizations for the latest x86
architectures from Intel and AMD, including SSSE3, SSE4a, SSE4.1, and SSE4.2
compiler intrinsics support
* C/C++/Fortran compiler optimizations for the latest UltraSPARC
and SPARC64-based architectures
* DLight - New tool to utilize and visualize the power of Solaris
Dynamic Tracing (DTrace) technology
* dbxTool - New stand-alone GUI debugger
* Full OpenMP 3.0 compilers and tools support
* MPI performance analysis in the Performance Analyzer
* NetBeans IDE 6.5 including new remote development features
These new features are described in the readme and wiki pages:
Readme: http://developers.sun.com/sunstudio/downloads/ssx/express_March2009.html
Wiki pages: http://wikis.sun.com/display/SunStudio/Sun+Studio+Express+March+2009+Release
The Early Access program gives developers a chance to evaluate new
capabilities of Sun Studio software, provide their feedback to the
product team and influence future product releases. Please encourage
your customers' participation in this program so that we can gather
valuable feedback to assess the readiness of our release.
Sun Studio 12 Update 1 Early Access Program:
http://developers.sun.com/sunstudio/overview/earlyaccess/index.jsp
LUG09 - Seventh Annual Lustre User Group Meeting
April 16-17, 2009
Cavallo Point Lodge
Sausalito, California
Preliminary agenda:
https://www.regonline.com/custImages/241834/LUG/090312lug09agenda-P1.pdf
Registration is now open for the Lustre User Group, the premier event for learning new
technical information, acquiring best practices, and sharing knowledge about Lustre
technology. LUG09 is a once-a-year opportunity for users to get answers, advice, and
suggestions regarding their specific Lustre implementations.
Attendees will have access to experts and peers who will share their real-world experiences.
With updates on the community development project, Birds of a Feather sessions, demos,
and tutorials, LUG09 is the perfect opportunity to meet with the Lustre development team
and discuss upcoming enhancements and capabilities.
Hurry! To take advantage of the $350 Early Bird registration rate, you must register
by April 1, 2009.
http://www.regonline.com/LUG09
Special Course Available: Lustre Advanced Administration and Support
April 15, 2009
Cavallo Point Lodge
Sausalito, California
For the first time ever, a special course on Lustre Advanced Administration and Support
will be offered on April 15 before the User Group meeting. This course is designed for
people who already have a good understanding of, and experience with, the Lustre File System.
The class will cover a host of advanced architectural and support techniques. Space will be
limited for this course and tuition discounts will be offered for LUG attendees.
Register now for the Lustre Advanced Administration and Support Seminar. LUG attendees
will get a special discount on course registration.
See you at LUG09!
If you have any questions or interest about LUG09, please contact us at
LUG2009@SUN.COM
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters.
For this setup we will use the following software packages:
1. Ganglia - the core Ganglia package
2. Zlib - zlib compression libraries
3. Libgcc - low-level runtime library
4. Rrdtool - Round Robin Database (RRD) graphing tool
5. Apache web server with php support
You can get packages 1-3 from sunfreeware (choose x86 or SPARC, depending on your architecture).
Unzip and install the packages:
1. gzip -d ganglia-3.0.7-sol10-sparc-local.gz
pkgadd -d ./ganglia-3.0.7-sol10-sparc-local
2. gzip -d zlib-1.2.3-sol10-sparc-local.gz
pkgadd -d ./zlib-1.2.3-sol10-sparc-local
3. gzip -d libgcc-3.4.6-sol10-sparc-local.gz
pkgadd -d ./libgcc-3.4.6-sol10-sparc-local
4. You will need pkgutil from blastwave in order to install rrdtool software packages
/usr/sfw/bin/wget http://blastwave.network.com/csw/unstable/sparc/5.8/pkgutil-1.2.1,REV=2008.11.28-SunOS5.8-sparc-CSW.pkg.gz
gunzip pkgutil-1.2.1,REV=2008.11.28-SunOS5.8-sparc-CSW.pkg.gz
pkgadd -d pkgutil-1.2.1,REV=2008.11.28-SunOS5.8-sparc-CSW.pkg
Now you can install packages with all required dependencies with a single command:
/opt/csw/bin/pkgutil -i rrdtool
5. You will need to download Apache, PHP, and the core libraries from Cool Stack.
Core libraries used by other packages:
bzip2 -d CSKruntime_1.3.1_sparc.pkg.bz2
pkgadd -d ./CSKruntime_1.3.1_sparc.pkg
Apache 2.2.9, PHP 5.2.6
bzip2 -d CSKamp_1.3.1_sparc.pkg.bz2
pkgadd -d ./CSKamp_1.3.1_sparc.pkg
The following packages are available:
1 CSKapache2 Apache httpd
(sparc) 2.2.9
2 CSKmysql32 MySQL 5.1.25 32bit
(sparc) 5.1.25
3 CSKphp5 PHP 5
(sparc) 5.2.6
Select package(s) you wish to process (or 'all' to process
all packages). (default: all) [?,??,q]:1,3
Select options 1 and 3 (Apache httpd and PHP 5).
Enable the web server service
svcadm enable svc:/network/http:apache22-csk
Verify it is working
svcs svc:/network/http:apache22-csk
STATE STIME FMRI
online 17:02:13 svc:/network/http:apache22-csk
Locate the Web server DocumentRoot
grep DocumentRoot /opt/coolstack/apache2/conf/httpd.conf
DocumentRoot "/opt/coolstack/apache2/htdocs"
Copy the Ganglia directory tree
cp -rp /usr/local/doc/ganglia/web /opt/coolstack/apache2/htdocs/ganglia
Change the rrdtool path in /opt/coolstack/apache2/htdocs/ganglia/conf.php
from /usr/bin/rrdtool to /opt/csw/bin/rrdtool

Generate the default gmond configuration file:
/usr/local/sbin/gmond --default_config > /etc/gmond.conf
Edit /etc/gmond.conf and change name = "unspecified" to name = "grid1" (this is our grid name).
Start gmond and verify that it is running:
/usr/local/sbin/gmond
ps -ef | grep gmond
nobody 3774 1 0 16:57:41 ? 0:55 /usr/local/gmond
In order to debug any problem, try:
/usr/local/sbin/gmond --debug=9
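The gmond.conf edit described above (changing name = "unspecified" to the grid name) can also be scripted, which helps when the same change has to be made on many nodes. This is a rough sketch under a few assumptions: the function name set_cluster_name is invented here, the name line is indented with spaces only, and the new name contains no "/" or "&" characters (which would confuse sed). It writes to a temporary file because not every sed supports in-place editing.

```shell
#!/bin/sh
# set_cluster_name FILE NAME
# Rewrites the line  name = "unspecified"  in FILE to use NAME.
# Other "unspecified" settings (owner, location, ...) are left alone.
# Note: replacing the file with mv may reset its ownership and permissions.
set_cluster_name() {
    file=$1
    name=$2
    sed 's/^\( *name *= *\)"unspecified"/\1"'"$name"'"/' "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# Demonstration on a scratch file that mimics the relevant part of gmond.conf.
demo=${TMPDIR:-/tmp}/gmond.conf.$$
cat > "$demo" <<'EOF'
cluster {
  name = "unspecified"
  owner = "unspecified"
}
EOF
set_cluster_name "$demo" grid1
grep 'name' "$demo"
rm -f "$demo"
```

On a real node you would call it as set_cluster_name /etc/gmond.conf grid1 and then restart gmond.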
Build the directory for the rrd images
mkdir -p /var/lib/ganglia/rrds
chown -R nobody /var/lib/ganglia/rrds
Add the following line to /etc/gmetad.conf:
data_source "grid1" localhost
Start the gmetad daemon
/usr/local/sbin/gmetad
Verify that it is running:
ps -ef | grep gmetad
nobody 4350 1 0 17:10:30 ? 0:24 /usr/local/sbin/gmetad
To debug any problem:
/usr/local/sbin/gmetad --debug=9
Point your browser to: http://server-name/ganglia
We are announcing today the availability of the Sun Grid Engine 6.2 Update 2
release. We are also announcing a major reconstruction of the SGE wiki docs
which will lead to better usability and navigation.
SGE 6.2u2 is a "feature update" release. We are delivering a few much-demanded
features, scalability improvements and memory footprint reductions for very
large HPC clusters, and bug fixes.
There are no changes in licensing and pricing. Patches will be available
within the next 24 hours on Sunsolve. Open source courtesy binaries will be
made available next week.
What's new
==========
GUI Installer
-------------
Sun Grid Engine 6.2u2 comes with a new GUI installer to simplify the
installation process. The GUI installer enables you to easily install a
whole cluster interactively. To install a cluster, you need to set up the
environment in a similar way to an automatic installation.
Job Submission Verifiers (JSVs)
-------------------------------
JSVs allow users and administrators to define rules that determine which
jobs are allowed to enter into a cluster and which jobs should be rejected
immediately. A JSV is a script or binary that can verify, modify, or reject
a job, either on the submit host at submission time or on the master host.
Consumable Resources Per Job
----------------------------
Consumable complex attributes can now be configured as per job. Such
consumables are consumed as requested and are no longer multiplied by the
requested slots. This makes resource requests for parallel jobs much easier
to define, especially when using slot ranges.
jemalloc Library
----------------
Linux distributions (x64 platforms) come with a default memory allocator
that is not as efficient as the open source jemalloc memory allocator
library, which is also used by the Firefox browser. SGE 6.2 Update 2
replaces the native Linux malloc with jemalloc. This improves master host
performance in large, high-throughput Sun Grid Engine clusters on Linux
and reduces the memory footprint by up to 20%.
Relevant links
--------------
Download:
http://www.sun.com/software/sge/get_it.jsp
Documentation
http://wikis.sun.com/display/gridengine62u2
Release Notes:
http://wikis.sun.com/display/gridengine62u2/Release+Notes
Patch Matrix:
http://wikis.sun.com/display/gridengine62u2/Patch+Matrix
Man Pages Online:
http://gridengine.sunsource.net/manpages.html
Please mark your calendars for the upcoming Sun HPC Consortium event.
This year's European event will take place in the city of Hamburg, Germany.
We will meet before the ISC'09 event (http://www.supercomp.de/isc09/) starting on June 23rd.
As usual, we will have our customers and partners report on their latest research and systems.
This is the place to hear the latest technology updates from Sun and how partners and customers are using it.
*LUG09 - Seventh Annual Lustre User Group Meeting*
*April 16-17, 2009*
*The Lodge at Golden Gate*
*Sausalito, California*
Sun Microsystems welcomes you to the Lustre User Group, the premier
event for learning new technical information, acquiring best practices,
and sharing knowledge about Lustre technology. LUG09 is a once-a-year
opportunity for users to get answers, advice, and suggestions regarding
their specific Lustre implementations.
Attendees will have access to experts and peers who will share their
real-world experiences. With updates on the community development
project, Birds of a Feather sessions, demos, and tutorials, LUG09 is the
perfect opportunity to meet with the Lustre development team and discuss
upcoming enhancements and capabilities.
*Lustre Advanced User Seminar*
A Lustre Advanced User Seminar will be offered on *April 15* before the
User Group meeting. This seminar is designed for senior systems
administrators, engineers and integrators needing more comprehensive
knowledge of Lustre Administration and Troubleshooting techniques. Prior
completion of Lustre Administration and Support Level 1 (ES-288) and/or
prior experience administering Lustre is strongly recommended in order
to receive maximum value from this seminar. Space will be limited and
registration fee discounts will be offered for LUG attendees.
Links to the LUG09 Registration Site will be posted soon at Lustre.org
<http://lustre.org/>, so stay tuned for further announcements.
If you have any questions or interest about LUG09, please contact us at
LUG2009@SUN.COM <mailto:LUG2009@SUN.COM>
See you at LUG09!
In this blog I continue my discussion on how to integrate "Sun Grid Engine and MD Nastran".
I'll explain how to utilize SGE to manage disk space, memory, and MD Nastran license tokens.
In my first blog in this series I discussed 2 frequent questions that Nastran users have before submitting their jobs:
1.) "What machine(s) have enough disk space to satisfy the Nastran output database file requirements (scratch and scr300)?"
2.) "What machine(s) have enough physical memory for optimum Nastran performance?"
I also mentioned in the prior blog that I wanted to show how to solve the issue of managing Nastran jobs in an environment that has a
limited pool of license "tokens"--(utilizing a combination of MSC's "ESTIMATE" program together with SGE's ability to
track "consumable resources").
Before I begin here's some background on SGE terminology that I'll be referring to:
1.) SGE has what's called "resource attributes" that are stored in an entity called the Grid Engine "complex".
Disk space, machine memory, and license tokens are examples of resource attributes.
2.) A resource attribute can be classified/defined as a "consumable resource"--when this is the case SGE will
do the bookkeeping necessary to track the availability of the resource (increasing or decreasing the
availability based on usage). Note that this availability can be monitored/modified through two different
methods: (1) by telling SGE how much of the resource you'll be using on the SGE job submittal command
(e.g., -l disk_space=10G), and/or (2) by using a "load sensor" script that dynamically determines the availability
of the "consumable resource"--a script that runs in the background and periodically determines the availability
of a particular resource.
The rest of this blog will be separated into 3 sections:
Section 1: "How to manage disk space"
Section 2: "How to manage Memory"
Section 3: "How to manage license tokens"
Section 1: "How to manage disk space"
Step #1: Create a "load sensor" script that does the following: (1) dynamically determines disk space on the filesystem of interest to you,
and (2) sets the corresponding "resource attribute" to the available disk space (in this case I created a "consumable resource" called "export_size").
[In Step 2 below I show how I added this "export_size" consumable resource to the SGE environment.]
In this example, I decided to use "/export" as the filesystem of interest--to change this to your filesystem just change the line [FS="/export"].
Here's the script (disk_space.sh) that I used to determine free disk space on /export:
tm19-231:$PWD#cat disk_space.sh
#!/bin/sh
# SGE load sensor: reports free space on $FS as the "export_size" attribute.
FS="/export"
export FS
myhost=`uname -n`
ende=false
while [ "$ende" = false ]; do
    # ----------------------------------------
    # wait for an input
    #
    read input
    result=$?
    if [ $result != 0 ]; then
        ende=true
        break
    fi
    if [ "$input" = "quit" ]; then
        ende=true
        break
    fi
    # set export_size to the free space on $FS
    dfoutput="`df -kh $FS | tail -1`"
    diskfree=`echo $dfoutput | awk '{ print $4 }'`
    echo begin
    echo "$myhost:export_size:${diskfree}"
    echo end
done
# reached only after "quit" or end of input
exit 0
## end of disk_space.sh script
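Before registering a load sensor with SGE it can help to exercise the request/report loop by hand. The sketch below mirrors the protocol used by disk_space.sh (one begin/end report per input line, exit on "quit") but replaces the df call with a stub so it runs on any machine. The names sensor_loop and free_space, and the fixed 50G value, are invented for this illustration.

```shell
#!/bin/sh
# free_space: stand-in for the df/awk pipeline in disk_space.sh.
# It returns a fixed, made-up value so the loop can be tested anywhere.
free_space() {
    echo "50G"
}

# sensor_loop HOST ATTR
# For every line read on stdin, print one begin/.../end report.
# Stop on end-of-file or when the line is "quit".
sensor_loop() {
    host=$1
    attr=$2
    while read input; do
        if [ "$input" = "quit" ]; then
            break
        fi
        echo begin
        echo "$host:$attr:`free_space`"
        echo end
    done
}

# Simulate the execution daemon: an empty line requests a report, "quit" ends the loop.
printf '\nquit\n' | sensor_loop myhost export_size
```

The real script can be checked the same way, for example with printf '\nquit\n' | ./disk_space.sh, which should print a single report for /export and then exit.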
Step #2: Create a consumable resource called "export_size" using the "qconf -mc" edit command:
tm19-231:/dpl/sge.sc08#qconf -mc
"/var/tmp/2481-9OzK0o" 55 lines, 4571 characters
#name shortcut type relop requestable consumable default urgency
#----------------------------------------------------------------------------------------
...
export_size e_size MEMORY <= YES YES 0 0
...
# >#< starts a comment but comments are not saved across edits --------
Step #3: Register the disk_space.sh script from Step #1 (customized for your site's storage environment) as an SGE load sensor.
In my example I'm assuming all machines will use /export for their local Nastran scratch space--this
can be modified to include more than one filesystem by adding another complex attribute setting
in this script--you can also have one for each host.
Use "qconf -mconf global" to indicate that this load_sensor should apply to all hosts.
Note that you can list more than one script (comma-separated) for the load_sensor below:
# qconf -mconf global
"/var/tmp/9800-xONwHt" 50 lines, 1938 characters
execd_spool_dir /gridware/sge/default/spool
mailer /bin/mailx
xterm /usr/openwin/bin/xterm
load_sensor /gridware/sge/util/resources/loadsensors/tmpspace.sh, \
/dpl/sge.sc08/disk_space.sh
prolog none
epilog none
shell_start_mode posix_compliant
login_shells sh,ksh,csh,tcsh
min_uid 0
min_gid 0
user_lists none
xuser_lists none
projects none
xprojects none
enforce_project false
enforce_user auto
load_report_time 00:00:40
max_unheard 00:05:00
reschedule_unknown 00:00:00
loglevel log_warning
administrator_mail dale.layfield@sun.com
"/var/tmp/9800-xONwHt" 50 lines, 1938 characters
Step #3a: As an optional step, you can set the consumable resource export_size to an initial value for all hosts.
Use "qconf -me global" to edit the exec host (global) setting and add "export_size=500G".
To verify this setting use the following "show" command:
tm19-231:/dpl/sge.sc08#qconf -se global
hostname global
load_scaling NONE
complex_values nastran_tokens=500,export_size=500G
load_values NONE
processors 0
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
tm19-231:/dpl/sge.sc08#
Step #4: Now you can verify that the "export_size" is being dynamically set to the
actual /export free space. On my machine (hostname tm19-231) there is 50G available/free in /export:
tm19-231:/dpl/sge.sc08#qhost -F
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
gc:nastran_tokens=500.000000
gc:export_size=500.000G
tm19-231 sol-amd64 8 0.09 8.0G 1.7G 8.0G 0.0
gc:nastran_tokens=500.000000
hl:export_size=50.000G
...............
Step #5: You can now submit your Nastran jobs with a request for the amount of disk space your job needs
and SGE will only dispatch the job to hosts that meet your requirements.
For example, because the following qsub command requests 10GB of disk space "-l export_size=10G" the job will only be
dispatched to host(s) that have sufficient free space in /export.
#cat nastran_qsub.sh
qsub -pe nastran 2 -q large.q -l mem_free=6G -l nastran_tokens=`token_estimate.sh` -l export_size=10G -S /bin/ksh nastran_sge.sh
Section 2: "How to manage Memory"
Making sure that your job only runs on machine(s) with sufficient physical memory can be accomplished by using the built-in "mem_free" resource attribute. As shown in the above example you would specify how much memory you need using
"-l mem_free=nG"--your job would then be dispatched to only those machine(s) that meet this request.
Section 3: "How to manage license tokens"
As I discussed in a prior blog, I wanted to find a way to avoid the scenario where a Nastran job
stops/"times out" partway through the analysis because there are no more Nastran license tokens left in
the "token" pool. This scenario can occur if the customer site has a limited license
"token" pool and two or more Nastran jobs are running.
First, the jobs request an initial set of license tokens to get started.
Later in the analysis, these same job(s) may need more tokens because some additional
Nastran "feature" gets invoked. At this point there may not be enough "tokens" left in the license pool,
resulting in the job(s) stopping/"timing out" for lack of license tokens.
One solution is to use MSC Software's "ESTIMATE" program. The ESTIMATE program does a quick scan
and analysis of the Nastran input file to determine its resource requirements (disk, memory, and license tokens). This
information can then be used along with a "consumable resource" to ensure that sufficient license tokens are
available. You can read more about the ESTIMATE program at: http://www.mscsoftware.com/support/prod_support/mdnastran/cog.pdf
Here's what I did to configure SGE to work with MSC's ESTIMATE program:
Step #1: Create a script (estimate.sh) that will call MSC's "ESTIMATE" program.
tm19-231:/dpl/sge.sc08/dmp_jobs#cat estimate.sh
#
MD2008=/msc/nastran/bin/md2008
#
$MD2008 estimate dmp_101c.dat report=keyword out=estimate.out
The output from the above estimate.sh script (with this particular Nastran input file) is:
tm19-231:/dpl/sge.sc08/dmp_jobs#estimate.sh
ESTIMATE - (Version 2008.0 Jun 6 2008 12:16:44)
======================
Licensing Information:
======================
Features Required:
NASTRAN
Token counts for each feature may be adjusted by modifying "/msc/nastran/md2008/solaris/feature.lis"
Feature Tokens
NASTRAN 250
Total: 250
WORDSIZE=32
K=1024
BUFFSIZE=8193
SOL=101
SE=No
SOLVER=Direct
NGRID=3811
NDOF=10845
N0D=202
N1D=0
N2D=0
N3D=575
MEMORY=128.0MB
DISK=65.4MB
DBALL=33.6MB
SCRATCH=16.8MB
SCR300=11.8MB
SDBALL=7813MB
SSCR=7813MB
ERRORS=1
JID=./dmp_101c.dat
tm19-231:/dpl/sge.sc08/dmp_jobs#
Step #2: Now parse the above "ESTIMATE" output and return the number of required license tokens.
tm19-231:/dpl/sge.sc08/dmp_jobs#cat token_estimate.sh
#
estimate.sh > estimate.out 2> /dev/null
nawk '/^ *Total/ {print $2}' < estimate.out
#
... token_estimate.sh returns the total number of Nastran license tokens required for this job-->250:
tm19-231:/dpl/sge.sc08/dmp_jobs#token_estimate.sh
250
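A slightly more defensive variant of the parsing step might look like the sketch below. It reads an ESTIMATE report on standard input and prints the number on the "Total:" line, and it fails if no numeric total is found, so that an empty value is never handed to qsub. The function name parse_total_tokens is hypothetical, and plain awk is assumed in place of nawk.

```shell
#!/bin/sh
# parse_total_tokens: hypothetical helper, not part of MD Nastran or SGE.
# Reads an ESTIMATE report on stdin and prints the number on the "Total:" line.
# Exits non-zero if no such line with a numeric value is present.
parse_total_tokens() {
    awk '
        /^ *Total/ && $2 ~ /^[0-9]+$/ { print $2; found = 1; exit }
        END { if (!found) exit 1 }
    '
}
```

It could be used as "estimate.sh | parse_total_tokens" in place of the nawk line in token_estimate.sh.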
Step #3: Create a "consumable resource" for Nastran tokens.
tm19-231:/dpl/sge.sc08#qconf -mc
...
nastran_tokens nt INT <= FORCED YES 0 0
...
Step #4: Set the consumable resource "nastran_tokens" to the total amount available in the
token pool, and make it available to all hosts and queues (i.e., "global").
For example, I added "nastran_tokens=500" to the exec host (global) setting using the SGE edit
command "qconf -me global".
To verify this setting use the following "show" command:
tm19-231:/dpl/sge.sc08#qconf -se global
hostname global
load_scaling NONE
complex_values nastran_tokens=500,export_size=500G
load_values NONE
processors 0
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
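If you want to check the configured pool size from a script, a minimal sketch is shown below. It reads "qconf -se" output on standard input and prints the value of one entry from the complex_values line. The function name complex_value is hypothetical, and the sketch assumes the complex_values entry sits on a single line, as in the listing above.

```shell
#!/bin/sh
# complex_value: hypothetical helper. Reads "qconf -se <host>" output on stdin
# and prints the value of the entry named by $1 in the complex_values line.
# Assumes the complex_values entry is not wrapped across several lines.
complex_value() {
    awk -v name="$1" '
        $1 == "complex_values" {
            n = split($2, pairs, ",")
            for (i = 1; i <= n; i++) {
                split(pairs[i], kv, "=")
                if (kv[1] == name) { print kv[2]; found = 1 }
            }
        }
        END { if (!found) exit 1 }
    '
}
```

For example, "qconf -se global | complex_value nastran_tokens" would print 500 for the configuration shown above.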
Step #5: Now you can submit your Nastran job and specify the number of Nastran tokens your job needs
using "-l nastran_tokens=`token_estimate.sh`" on the SGE command line
(in this example, nastran_tokens=`token_estimate.sh` resolves to nastran_tokens=250).
So, for example, with 500 total tokens available you would be able to run 2 of these 250-token
jobs concurrently; any additional jobs would not be dispatched until one of these
jobs completed and released its license tokens.
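The arithmetic behind that limit can be sketched as follows, assuming every job requests the same number of tokens. The function name max_concurrent_jobs is hypothetical and only illustrates the calculation; SGE does this bookkeeping itself for a consumable resource.

```shell
#!/bin/sh
# max_concurrent_jobs: hypothetical helper illustrating the arithmetic only.
# $1 = tokens in the pool, $2 = tokens requested per job.
max_concurrent_jobs() {
    pool=$1
    per_job=$2
    if [ "$per_job" -le 0 ]; then
        echo "per-job token count must be positive" >&2
        return 1
    fi
    # Shell arithmetic truncates, which is the desired behavior here:
    # a job either fits in the remaining pool or it does not.
    echo $(( pool / per_job ))
}

max_concurrent_jobs 500 250   # prints 2
```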
Here's an example taken from my earlier blog showing the use of the nastran_tokens attribute for a distributed memory parallel (DMP) Nastran job:
#cat nastran_qsub.sh
qsub -pe nastran 2 -q large.q -l mem_free=6G -l nastran_tokens=`token_estimate.sh` -l export_size=10G -S /bin/ksh nastran_sge.sh
In my next blog in this series I'll show how I configured the various queues and parallel environment for MD Nastran.
Wondering what relevance Solaris and Sun's SPARC and CMT technology have in HPC? Do you care about security with HPC?
Read HPC Wire's latest news piece and learn how Canada's High Performance Computing Virtual Laboratory (HPCVL) is using Solaris, SPARC and CMT to provide computational services for a fairly traditional set of HPC applications, including everything from biomedical research to computational fluid dynamics, in a super secure environment.
HPC Wire reports, "The HPCVL has a cluster of 8 Sun SPARC Enterprise M9000 servers, each with 64 quad-core 2.52 GHz Sparc64 VII processors supporting two hardware threads per core, for compute intensive jobs with large memory requirements. There is also a cluster of 7 Sun Fire 25000 servers, each of which has 72 dual-core UltraSPARC-IV+ processors, aimed at a similar (but perhaps less demanding) workload. HPCVL's Victoria Falls cluster is built from 73 Sun SPARC Enterprise T5140 Servers, each of which has two UltraSparc T2 chips with 8 cores apiece, each supporting 8 hardware threads. At full capacity this cluster can support just over 9,300 threads, and the system provides a throughput compute platform for HPCVL's users. All of the systems use Sun's Grid Engine workload management tool, and run Solaris."
The HPCVL's Web page calls it "one of Canada's leading secure HPC environments".
Full article can be found here: http://www.hpcwire.com/features/Canadian-HPC-Lab-Maintains-Warm-Relationship-with-Sun-39811502.html
Continuing from my last blog in this series on integrating "Sun Grid Engine and MD Nastran" I will now describe how to submit MSC.Software's MD Nastran DMP (Distributed Memory Parallel) jobs to SGE. In subsequent blogs I'll explain how I configured the SGE queues and parallel execution environment "nastran" referenced below.
Here's a brief overview of how all this works for submitting DMP jobs:
First, some background. MD Nastran offers the ability to run certain solution sequences in parallel using the Message Passing Interface (MPI), an industry-wide standard library for C and Fortran message-passing programs. For the Solaris x86 version of MD Nastran, Sun HPC ClusterTools (based on Open MPI) is used as the MPI implementation.
There are 2 basic requirements for submitting MD Nastran (DMP) jobs:
(1) You must specify the number of parallel processes you want (MD Nastran command line keyword "dmp=n"),
(2) And, if you want to spread the parallel processes out to more than just the local machine you are submitting Nastran from, then you must specify the target "host" machine(s) using the MD Nastran command line keyword "hosts=" (e.g., hosts=node1,node2,...).
With these basic requirements in mind I show below how to use the SGE qsub command to control a Nastran DMP job submission, including the determination of the "host" machines to run these parallel jobs. I also show how to use MSC's job resources estimation tool ("estimate") to automate the determination of the required resources (memory, disk, and license tokens).
So, here's what you would do to run a DMP (Distributed Memory Parallel) MD Nastran job using SGE.
Step #1. Create a script containing the qsub command.
....for example,
#cat nastran_qsub.sh
qsub -pe nastran 2 -q large.q -l mem_free=6G -l nastran_tokens=`token_estimate.sh` -l export_size=10G -S /bin/ksh nastran_sge.sh
The above SGE command line defines the following:
1. "qsub" : qsub is the SGE command that I used to submit the Nastran job submittal script "nastran_sge.sh" to the parallel environment (nastran) I configured within SGE. The standard Nastran job submittal command "mdnast2008..." that Nastran users are familiar with is in the script "nastran_sge.sh" (see Step 2. below).
2. "-pe nastran 2" : This tells qsub to submit "nastran_sge.sh" to the SGE parallel execution environment (nastran). The "nastran_sge.sh" script will then start two Nastran jobs (running in parallel) on one or more of the host machines as defined in the queue (large.q).
[The queues that are suitable for this job, like the large.q, are queues that are associated with the parallel environment interface "nastran" by the parallel environment configuration. Suitable queues also must satisfy the resource requirement specification specified by the qsub -l command (see item 4. below).] I'll explain how I configured the parallel environment "nastran" and the associated queues in a subsequent blog.
3. "-q large.q" : I configured three SGE queues for my Nastran environment (small.q, medium.q, and large.q)--each one having a different limit on the amount of elapsed time allowed for jobs within the queue. In this example I chose large.q to handle a long (elapsed time) running job.
4. "-l mem_free=6G -l nastran_tokens=`token_estimate.sh` -l export_size=10G" defines the "complex resource attributes" that I configured within SGE. I'll explain how I created these in a later blog, but basically they allow the user to define the amount of available real memory (6GB in this case) required for this job, the number of Nastran license tokens that are required, and also the required amount of disk space (10GB in this case) for the Nastran database files---if any of these requirements (free memory, tokens, or disk space) is not met the job will not be dispatched.
[In this example, I only show MSC's "estimate" program being used to calculate the required license tokens. In a subsequent blog in this series I'll show
when/how you can use "estimate" to also automate the calculation of the memory and disk requirements.]
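As a rough idea of what reading the other values from the "estimate" report could look like, here is a minimal sketch that pulls a single KEY=VALUE entry (such as MEMORY or DISK) out of a "report=keyword" listing. The function name estimate_value is hypothetical. How the reported value should be rounded or converted into a mem_free or export_size request is not covered here and would be a site-specific choice.

```shell
#!/bin/sh
# estimate_value: hypothetical helper. Reads an ESTIMATE "report=keyword"
# listing on stdin and prints the value of the KEY=VALUE line named by $1.
# Exits non-zero if the key is not present.
estimate_value() {
    awk -F= -v key="$1" '
        $1 == key { print $2; found = 1; exit }
        END { if (!found) exit 1 }
    '
}
```

For the report shown in the earlier post, "estimate.sh | estimate_value MEMORY" would print 128.0MB.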
Step #2. Create a wrapper script (referenced in Step 1. above) containing the standard MD Nastran DMP job submittal command:
...for example,
#cat nastran_sge.sh
#! /bin/ksh
#
# sge_nast: Sun Grid Engine wrapper script to use with MSC.Nastran V2001.0.9 and greater.
#
# Usage: qsub -pe nastran $Nproc .... nastran_sge.sh
#
#
#Set nastran information for Head node:
mdnast2008=/msc/nastran/bin/mdnast2008
#
#
#Set up list of host(s) for use by mdnast2008 job submittal command (see below):
#
HOSTS=""
while read FILE
do
    # The first field of each $PE_HOSTFILE line is the host name
    NODE=`echo $FILE | awk '{ print $1}'`
    HOST0="$HOSTS"
    HOSTS="$HOST0:$NODE"
    echo $HOSTS
done < $PE_HOSTFILE
# Remove leading ':'
HOSTS=`echo $HOSTS | sed 's/://'`
echo $HOSTS
echo $PE_HOSTFILE
echo $NSLOTS
NSLOTS=`echo $NSLOTS`
#
# Got hosts names, now run Nastran with DMP (Distributed Memory Parallel) spreading the parallel Nastran jobs across the list of
# host computers ($HOSTS):
#
$mdnast2008 /dpl/sge.sc08/dmp_jobs/dmp_101c.dat out=out.dmp.sge.2008.r2.101c dmp=$NSLOTS hosts=$HOSTS scr=yes bat=no auth=1700@matsci.sfbay
#
# End
#
The above script uses the following SGE parallel execution environment variables:
NSLOTS – The number of queue slots in use by a parallel job.
PE_HOSTFILE – The path of a file that contains the definition of the virtual parallel machine that is assigned to a parallel job by the grid engine system. This variable is used for parallel jobs only. It contains the list of "host" machines that satisfy the resource and queue requirements as specified on the above qsub command in Step 1.
The above mdnast2008 command will result in the parallel Nastran jobs being started on different hosts (e.g., dmp=2 hosts=tm19-231:tm19-232).
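The host-list loop in the wrapper script can also be written as a single awk pass. The sketch below is an equivalent formulation, not the script I actually ran. It assumes each line of the PE host file starts with the host name, and the function name pe_hosts is hypothetical.

```shell
#!/bin/sh
# pe_hosts: hypothetical helper. Prints the first column of each non-empty
# line of the file named by $1 (normally $PE_HOSTFILE), joined with ':'.
pe_hosts() {
    awk '
        NF > 0 { hosts = (hosts == "" ? $1 : hosts ":" $1) }
        END { print hosts }
    ' "$1"
}
```

Inside nastran_sge.sh this would be used as HOSTS=`pe_hosts $PE_HOSTFILE`, which avoids having to strip a leading ':' afterwards.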
Below is an example of the output you'll see using the above scripts:
Note that the output from the SGE "qstat -f" command shown below shows an "r" to indicate the 2 nastran jobs are running:
200 0.55500 nastran_sg dpl r 02/08/2009 17:52:10 1
If one of the requested resources had not been available (memory, disk, or tokens) then instead of "r" you would see the following output from qstat -f, indicating that the job has been put into a "hold/wait" status, pending availability of the requested resource(s).
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
201 0.55500 nastran_sg dpl qw 02/08/2009 18:05:58 2
tm19-231#nastran_qsub.sh
Your job 200 ("nastran_sge.sh") has been submitted
tm19-231#
tm19-231#qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@tm19-232 BIP 0/0/8 0.20 sol-amd64
---------------------------------------------------------------------------------
large.q@tm19-231 BIP 0/1/1 0.10 sol-amd64
200 0.55500 nastran_sg dpl r 02/08/2009 17:52:10 1
---------------------------------------------------------------------------------
large.q@tm19-232 BIP 0/1/1 0.20 sol-amd64
200 0.55500 nastran_sg dpl r 02/08/2009 17:52:10 1
---------------------------------------------------------------------------------
medium.q@tm19-232 BIP 0/0/4 0.20 sol-amd64
---------------------------------------------------------------------------------
small.q@tm19-231 BIP 0/0/4 0.10 sol-amd64
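To check the state of a job from a script rather than by eye, something like the following sketch could be used. It reads "qstat -f" output on standard input and prints the distinct state codes reported for one job id. The function name job_states is hypothetical, and the sketch assumes the job lines have the eight-column layout shown above, with the state in the fifth column.

```shell
#!/bin/sh
# job_states: hypothetical helper. Reads "qstat -f" output on stdin and prints
# the distinct state codes (for example "r" or "qw") reported for job id $1.
# Assumes job lines look like: id priority name user state date time slots
job_states() {
    awk -v id="$1" '
        $1 == id && NF >= 8 && !seen[$5]++ { print $5 }
    '
}
```

With the output above, "qstat -f | job_states 200" would print r, and job 201 would show qw.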
In subsequent blogs in this series I'll describe in detail how I configured SGE for the above MD Nastran parallel job submittal, including how to create the "nastran" parallel environment, creating consumable resources (nastran_tokens), using MSC.Software's "estimate" program, and other MD Nastran specific configurations that I found useful in conjunction with SGE.
1. Three SGE queues configured for Nastran (small.q, medium.q, and large.q) {I'll show how I configured these Nastran queues in a later blog}
2. "qsub" is the SGE command that will be used to submit the Nastran jobs to SGE.
3. "-q %queue%" is the SGE keyword that specifies which SGE queue to use, where %queue% will be replaced by the "queue=" value specified by the user on the Nastran command line (see Nastran command line below: "queue=small.q")
4. "-l nastran_tokens=%options% -l export_size=10G" defines the "complex resource attributes" that I configured within SGE. I'll explain how I created these in a later blog, but basically they allow the user to define the number of Nastran license tokens that are required to run this job and also the required amount of disk space (10GB in this case) for the Nastran database files---if either of these requirements (tokens or disk space) is not met the jobs will not be dispatched.
1. "queue=" specifies which SGE queue to use--in this example it's small.q
2. "qoption=" is a special Nastran keyword that allows you to specify your own job submission parameters for SGE (in this case the RC file (sge_rc), defined above, will have the parameter %qoption% replaced by the value returned from the script `token_estimate.sh`). {I'll explain in a later blog how this particular script `token_estimate.sh` calculates the number of license tokens required for this job.}
Small: Time limit: 5 minutes elapsed
Medium: Time limit: 30 minutes elapsed
Large: Time limit: unlimited
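Assuming these three limits belong to small.q, medium.q, and large.q respectively, picking a queue from an expected run time could be sketched like this. The function name pick_queue and the idea of passing the expected elapsed time in whole minutes are assumptions for illustration only.

```shell
#!/bin/sh
# pick_queue: hypothetical helper. $1 = expected elapsed time in whole minutes.
# Prints the smallest queue whose time limit still covers the job.
pick_queue() {
    minutes=$1
    if [ "$minutes" -le 5 ]; then
        echo small.q
    elif [ "$minutes" -le 30 ]; then
        echo medium.q
    else
        echo large.q
    fi
}
```

The result could then be passed on the Nastran command line as queue=`pick_queue 20`.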
1. For DMP (distributed memory parallel) jobs, ensure that the distributed jobs run on machines with the same CPU (processor speed).
2. For DMP jobs, ensure that no more than one of the distributed jobs is run on any machine.
3. For DMP jobs, ensure the network interconnect is the same for all jobs.
For the past 3 years I've been involved in supporting the UCSD Rocks team in their port to Solaris x86_64. The Rocks development team at UCSD is comprised of a group of talented software engineers who are always looking for ways to make it easier to deploy, manage, and upgrade clusters.
Last year at SC08 I joined the Rocks team as they demonstrated their latest "Rocks and Solaris" cluster distribution in the Sun Booth. In their demo you had the chance to see that Rocks now brings the same ease of installation for Solaris cluster deployment that Rocks has had for Linux for several years.
You can download a pre-release (alpha) version of this "Rocks on Solaris" at:
https://wiki.rocksclusters.org/wiki/index.php/Rocks_on_Solaris
I'm sure the Rocks team would appreciate any feedback you can give them on this alpha version.
What you'll see in this version is:
1. A fully automated provisioning of Solaris compute nodes, and support for Thumper/Thor appliances from a Linux frontend.
2. Rolls support
3. MPI support using Sun HPC ClusterTools
4. Sun Grid Engine support
I think it's noteworthy because MD Nastran is a highly I/O and compute intensive application, and you will often be told to run your Nastran jobs on separate machines in order to get “acceptable” performance. The reason for this recommendation is that the combination of an MD Nastran analysis job's memory requirements (as specified by the “mem=” keyword on the Nastran command line) and the frequently large amount of I/O can have a significant impact on MD Nastran performance. This impact can be even greater if multiple Nastran jobs are run on the same machine and use the same file system.
This blog contains important information about the targeted audience of
this beta release, new functionality, the duration of this SGE beta program
and your possibilities to get support and provide feedback.
Content
-------
1. Audience of this beta program
2. Duration of the beta program and release date
3. New functionality delivered with this release
4. Installing SGE 6.2u2beta in parallel to a production cluster
5. Beta program feedback and evaluation support
1. Audience of this beta program
--------------------------------
This Beta is intended for users who already have experience with the Sun
Grid Engine software or DRM (Distributed Resource Management) systems of
other vendors. This beta adds new features to the SGE 6.2 software. Users
new to DRM systems or users who are seeking a production ready release
should use the Sun Grid Engine 6.2 Update 1 (SGE 6.2u1) release which is
available from
http://www.sun.com/software/gridware/get_it.jsp
For the shipping SGE 6.2u1 release we are offering free 30-day evaluation
email support.
2. Duration of the Beta program and release date
------------------------------------------------
This beta program lasts until Monday, February 2, 2009. The final release of
Sun Grid Engine 6.2 Update 2 is planned for March 2009.
3. New functionality delivered with this release
------------------------------------------------
Sun Grid Engine 6.2 Update 2 (SGE 6.2u2) is a feature update release for SGE
6.2 which adds the following new functionality to the product:
- a GUI based installer helping new users to more easily install the
software. It complements the existing CLI based installation routine
- new support for 32-bit and 64-bit editions of Microsoft Windows Vista
(Enterprise and Ultimate Edition), Windows Server 2003R2 and Windows
Server 2008.
- a client and server side Job Submission Verifier (JSV) allows an
administrator to control, enforce and adjust job requests, including
job rejection. JSV scripts can be written in any scripting language,
e.g. Unix shells, Perl or TCL.
- consumable resource attributes can now be requested per job. This makes
resource requests for parallel jobs much easier to define, especially
when using slot ranges.
- on Linux, the use of the 'jemalloc' malloc library improves performance
and reduces memory requirements
- the use of the poll(2) system call instead of select(2) on Linux
systems improves scalability of qmaster in extremely large clusters
4. Installing SGE 6.2u2 in parallel to a production cluster
-----------------------------------------------------------
As with every SGE release, it is safe to install multiple Grid Engine
clusters running multiple versions in parallel if all of the following
settings are different:
- <sge_root> directory
- ports (environment variables) for qmaster and execution
daemons
- unique "cluster name" - from SGE 6.2 the cluster name is
appended to the name of the system wide startup scripts
- group id range ("gid_range")
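A quick sanity check of the four settings listed above could be sketched as follows. The function name settings_differ is hypothetical, and the values in the usage line are made up for illustration. Note that the check is a plain string comparison, so it will not notice two gid ranges that are written differently but overlap.

```shell
#!/bin/sh
# settings_differ: hypothetical helper. Takes the four settings of cluster A
# followed by the four settings of cluster B, in the order
#   sge_root  ports  cluster_name  gid_range
# and succeeds only if every setting differs between the two clusters.
settings_differ() {
    status=0
    for label in sge_root ports cluster_name gid_range; do
        a=$1
        b=$5
        if [ "$a" = "$b" ]; then
            echo "conflict: $label is '$a' in both clusters" >&2
            status=1
        fi
        # Drop the first argument so $1 and $5 line up with the next setting
        shift
    done
    return $status
}
```

For example: settings_differ /opt/sge62u1 6444,6445 prod 20000-20100 /opt/sge62u2beta 6544,6545 beta 20200-20300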
Starting with SGE 6.2, the Accounting and Reporting Console (ARCo) can
accept reporting data from multiple Sun Grid Engine clusters. If you follow
the installation directions for ARCo and use a unique cluster name for
this beta release, there is no risk of losing or mixing reporting data from
multiple SGE clusters.
5. Beta Program Feedback and Evaluation Support
-----------------------------------------------
We welcome your feedback and questions on this Beta. We ask that you
restrict your questions to those specific to this Beta release. If you are
seeking general evaluation support for the Sun Grid Engine software,
please subscribe to the free evaluation support by downloading and using the
shipping version of SGE 6.2 Update 1.
The following email aliases are available:
Technical support alias: sge-beta@sun.com
Feedback on documentation: sge-beta-doc@sun.com
General feedback: sge-beta-feedback@sun.com