Wednesday March 22, 2006
Part I: SunGrid “The Beginning”

Albert Einstein said “The significant problems we have cannot be solved at the same level of thinking with which we created them.”

The Sun Grid projects started in earnest more than two years ago under rather different premises. Sun's Sales organization was struggling with the effort and residual expense that followed every time Scott McNealy visited a customer. He couldn't help but tell them about this Ultra-Thin Client device, which could save on power and operational expense, and even allow for hot-desking to reduce real-estate burdens; in fact, Sun saved more than $27M in the first year of the program alone. And so Scott would go on to promise each CxO-level executive a SunRay that Sun would come and install for them as a trial.

The Technical Sales Organization (TSO) first set up a blueprint and demo kits, but we weren't as fast as Scott's plane. Someone (Brian F.) had the fantastic idea of putting a Desktop “Service” on the Internet, so that we could just parachute a SunRay to each of these executives, who would plug it into their home broadband router and be up and running. Coined “CxOnet”, the plant has been running for about 2.5 years (thanks Brian); it grew up to become Project Alamo, and now “Sun Grid”.

I believe that this was a precipitating event in realizing Scott McNealy's long-term vision of the Big Friggin Webtone Switch, as well as Greg Papadopoulos' predictions around the demise of shrink-wrapped software and the emergence of next-generation utility data centers. Credit is also due, of course, to Jonathan Schwartz, who acted as a catalyst in moving towards subscription/utility software models. (Obviously a lot of credit has to be given to the Networking Infrastructure Providers who, during the great build-out precipitated by the DotCom boom, laid a tremendous amount of bandwidth around the country and brought substantially more reliable/redundant networks to every business and almost every home.)

This new value proposition, of not having to “run” your own data center, is obviously still in its adolescence, but I think that the very patterns for multi-tenant secured computing across a computing mesh, not unlike the power grid, offer scale, reliability, consistency, and affordability that will be difficult to match. Multi-tenant isolation has been the critical tenet of the security design, and many will herald this as a return to mainframe computing. Let's just remember that mainframes had to be multi-tenant because of the cost of the physical plant; they had multi-tenancy engineered into the very Operating System (as does OpenSolaris today, I might mention). In today's horizontal compute fabrics, the very networks that connect computers challenge this tenet, and the Sun Grid architecture is designed to enable a network operating system of similar control. And yet the economics of running these fabrics are rarely brought into question. The question is: what is the cost of low utilization (a lack of true multi-tenant virtualization) on your operating expense, and potentially on your needs around complex compliance legislation?

Sun Grid is not perfect, and it never will be, but I believe that this is a monumental step towards more affordable computing for end-users and corporations alike.

I would like to take this time to thank the whole Sun Grid Team, who have worked around the clock for months as we have been clearing the final hurdles towards release, and the many experts throughout Sun who have provided invaluable insight and assistance towards this shared goal! Thank You All!


Trackback: Technorati cosmos http://blogs.sun.com/dhushon/entry/part_i_sungrid_the_beginning
Sunday March 19, 2006
Running Jobs on Sun Grid that require “Service Containers”

Sun Grid's resource management semantics basically dictate that jobs be self-contained and terminate all of their processes in order to exit. Terminating processes in a grid context is not quite as simple as doing a PID trap on a single host; instead, you need to use the qsub, qstat and qdel commands to manage your distributed jobs.

The example pattern that I'd like to elaborate on is that of a “server/framework” which needs to run in order to support a client. Whether it is a simple RMID or a more complex instance of a web server, app server or JavaSpace, the pattern is very similar. The developer wants to:

  1. Start up one or more servers (in our case 2, the httpd and the GigaSpaces Enterprise Server)
  2. Make sure that the servers are running
  3. Submit the client and wait for the client to complete
  4. Shut down the servers so that the Sun Grid job can terminate and stop the meter
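
The four steps above can be sketched as a small shell skeleton. This is only an illustration of the control flow, not the actual Sun Grid script: `start_servers`, `servers_ready`, `run_client` and `stop_servers` are hypothetical placeholders that, on the grid, would wrap qsub, qstat, the client program and qdel. The point it shows is that the shutdown step runs whether or not the client succeeds, so the meter always stops.

```shell
#!/bin/bash
# Hypothetical placeholders: on Sun Grid these would wrap qsub, qstat,
# the client program, and qdel respectively.
start_servers() { echo "servers started"; }
servers_ready() { return 0; }
run_client()    { echo "client ran with: $*"; }
stop_servers()  { echo "servers stopped"; }

# run_with_servers [CLIENT_ARGS...]
# Steps 1-4 in order; the client's exit status is remembered so that the
# servers are stopped even when the client fails.
run_with_servers() {
    start_servers                             # step 1
    until servers_ready; do sleep 10; done    # step 2
    client_status=0
    run_client "$@" || client_status=$?       # step 3
    stop_servers                              # step 4, always reached
    return "$client_status"
}
```

Calling `run_with_servers 42` with these stubs prints the start message, the client message, and the stop message, and returns the client's exit status.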

First some basic syntax:

  • #$ = directives for SGE embedded in the script, which do things like populate environment variables (-V)
  • qsub = submit this task to the grid for scheduling; we use a couple of options:
      • "-sync n" fire and forget... don't wait for the job to complete
      • "-N <jobname>" not required, but could be used for parsing qstat... unfortunately qdel requires a jobid instead of a job name (to keep you from shutting down similarly named jobs)
      • "-t 1" or "-t 1-4:1" submit a job to one or multiple nodes (the range reads first-last:step)
  • qstat = get the status of the SGE queue, which in the case of Sun Grid will only return the jobs that you own, for privacy purposes
      • "-s r" only return the "running" jobs... jobs that are waiting (status="qw") are excluded
  • qdel = delete / stop the specified jobs
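
Because qdel is driven by job IDs here, the ID has to be recovered from whatever qsub prints on submission. The sketch below assumes the confirmation message looks like `Your job 1234 ("name") has been submitted` for a plain job and `Your job-array 5678.1-4:1 ("name") has been submitted` for a multi-task job; both sample strings and the function name `parse_job_id` are illustrative assumptions, so check them against the output of your own SGE version.

```shell
# parse_job_id MESSAGE
# Print the numeric job ID found in a qsub confirmation message, or nothing
# if the message does not have one of the two assumed shapes.
parse_job_id() {
    printf '%s\n' "$1" |
        sed -n 's/^Your job\(-array\)\{0,1\} \([0-9][0-9]*\).*/\2/p'
}

# Example with made-up messages (assumed format, not captured output):
parse_job_id 'Your job 1234 ("gsee-gsm") has been submitted'
parse_job_id 'Your job-array 5678.1-4:1 ("gsee-gsc") has been submitted'
```

Anchoring the pattern on the literal prefix avoids the greedy `.*` groups that a looser expression needs, at the cost of depending on the exact wording of the message.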

Now onto the listing:

#!/bin/bash
#$ -V

# If we are running against an older version of SGE, the "#$ -V" directive
# will not exist, so be sure that we set SGETOOLS ourselves (or at least try to).
if [[ ${SGETOOLS:-unset} = unset ]]
then
    echo "setting SGETOOLS"
    SGETOOLS=/home/sgeadmin/N1GE/bin/sol-amd64
    export SGETOOLS
    PATH=$SGETOOLS:$PATH
    export PATH
fi

echo "Starting the GigaSpaces Servers"
GSEE_HOME=GigaSpacesEE5.0
GRID_HOME=$GSEE_HOME/ServiceGrid
GSC=`qsub -sync n -N gsee-gsc -v GSEE_HOME=$GSEE_HOME -v GRID_HOME=$GRID_HOME -t 1-4:1 $GRID_HOME/bin/gsc`
GSM=`qsub -sync n -N gsee-gsm -v GSEE_HOME=$GSEE_HOME -v GRID_HOME=$GRID_HOME -t 1 $GRID_HOME/bin/gsm $GRID_HOME/config/overrides/gsm-override.xml`
echo "${GSC}"
echo "${GSM}"

# A multi-node SGE job is reported as XXXX.X-X:X (JobID.min-max:step),
# so trim out just the leading XXXX, which the regex captures as its 3rd group.
MATCH="\(.*\) \(.*\) \([0-9]*\)\.\([0-9]*\)-\([0-9]*\):\([0-9]*\)" # simple match for multi-node job
MATCH2="\(.*\) \(.*\) \([0-9]*\) \(.*\)" # simple match for simple 1 node job

# sed leaves the rest of the message after the job ID, so the result is stored
# as an array and $GSCparsed (element 0) is the ID on its own.
GSCparsed=( `echo $GSC | sed -n -e "s/${MATCH}/\3/p"` )
if [[ ${GSCparsed:-unset} = unset ]]; then
    GSCparsed=( `echo $GSC | sed -n -e "s/${MATCH2}/\3/p"` )
fi

GSMparsed=( `echo $GSM | sed -n -e "s/${MATCH}/\3/p"` )
if [[ ${GSMparsed:-unset} = unset ]]; then
    GSMparsed=( `echo $GSM | sed -n -e "s/${MATCH2}/\3/p"` )
fi
echo "Jobs $GSCparsed and $GSMparsed submitted"

# wait for these jobs to show up in qstat
GSMstatus=0
GSCstatus=0
until (( GSMstatus > 0 && GSCstatus > 0 ))
do
    # evaluate the qstat -s r response (running jobs) to make sure that the
    # requisite jobs are running; compare the job-ID column exactly so that
    # other IDs or timestamps containing the same digits are not counted
    GSCstatus=$(qstat -s r | nawk -v id="$GSCparsed" '$1 == id {n++} END {print n+0}')
    GSMstatus=$(qstat -s r | nawk -v id="$GSMparsed" '$1 == id {n++} END {print n+0}')
    echo "GSCstatus = $GSCstatus"
    echo "GSMstatus = $GSMstatus"
    echo "Server status is $(qstat -s r)"
    sleep 10
done

# run our application - in this case, use multiple nodes to help us calculate prime factors
echo "crunching"
~/prime-crunch.sh "$1"
echo "done"

# clean up: stop the server jobs using the job IDs parsed out of GSM and GSC
qdel $GSMparsed $GSCparsed
# go ahead and print out the queue status on the way out to verify cleanup (optional)
sleep 10
echo "Leaving..."
qstat
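
One thing the listing does not guard against is servers that never reach the running state: its polling loop would then wait forever while the meter keeps running. A minimal sketch of a bounded wait follows. The helper name `wait_until` and its argument order are my own invention rather than anything provided by SGE, and the readiness check passed to it is whatever command you choose.

```shell
# wait_until MAX_TRIES DELAY COMMAND [ARGS...]
# Run COMMAND repeatedly until it succeeds. Give up and return 1 after
# MAX_TRIES failed attempts, sleeping DELAY seconds between attempts.
wait_until() {
    max_tries=$1
    delay=$2
    shift 2
    tries=0
    until "$@"; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            return 1
        fi
        sleep "$delay"
    done
    return 0
}
```

In the job script this could be used as `wait_until 30 10 servers_running || { qdel $GSMparsed $GSCparsed; exit 1; }`, where `servers_running` is a hypothetical function wrapping the two qstat checks, so that a failed start-up still cleans up and exits.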

Hopefully, this example sheds some light on some of the mechanisms that a developer might enlist in order to launch more complex, server-dependent applications against the Sun Grid. Please let me know if I need to elaborate further. I want to take this opportunity to recognize GigaSpaces, and specifically Dennis Reedy, for his help in putting together a grid job which could flex a couple of nodes against their GigaSpaces Enterprise Server 5.0 environment. I'd also like to thank Bill Meine and Fay Salwen for their scripting assistance.


Trackback: Technorati cosmos http://blogs.sun.com/dhushon/entry/running_jobs_on_sun_grid
Disclaimer: These are the express views of Dan Hushon, and are in no way indicative of the views, strategies or plans of Sun Microsystems, Inc.
All content on this website (including text, photographs, audio files, and any other original works), unless otherwise noted, is licensed under a Creative Commons License.