Pascal's Weblog
The Grid...



Tuesday Sep 12, 2006
Monte Carlo Simulations
The problems we have examined so far were based on distributing pieces of a large problem across several nodes to speed up computations. Basically, a master distributed pieces of the job to be calculated by workers, then collected the results.

Amazingly, there is a class of applications even easier to distribute on a Grid. For Monte Carlo simulations, the same program is executed on each node and the master only averages the results!

The most famous example of a Monte Carlo simulation is used to approximate the value of π.

Monte Carlo simulations are widely used for simulating the behavior of various physical and mathematical systems. They are also used in finance.
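
As a rough illustration of the π example, here is a minimal sketch in Java. The class name MonteCarloPi, the seeds, and the number of workers are assumptions made for this example; the workers are simulated by a plain loop rather than by Grid jobs. Each worker runs the same estimator with its own seed, and the master only averages the estimates (which is valid here because every worker uses the same number of samples).

```java
import java.util.Random;

// Minimal sketch: every "worker" runs the same estimator with its own seed,
// and the "master" only averages the estimates.
class MonteCarloPi {

  // Estimate pi by sampling points uniformly in the unit square and
  // counting how many fall inside the quarter circle of radius 1.
  static double estimate(long samples, long seed) {
    Random random = new Random(seed);
    long inside = 0;
    for (long i = 0; i < samples; i++) {
      double x = random.nextDouble();
      double y = random.nextDouble();
      if (x * x + y * y <= 1.0) {
        inside++;
      }
    }
    return 4.0 * inside / samples;
  }

  // The master step: average the per-worker estimates.
  static double average(double[] estimates) {
    double sum = 0.0;
    for (double e : estimates) {
      sum += e;
    }
    return sum / estimates.length;
  }

  public static void main(String[] args) {
    int workers = 5;                   // hypothetical number of nodes
    long samplesPerWorker = 1000000L;  // hypothetical sample count per node
    double[] estimates = new double[workers];
    for (int w = 0; w < workers; w++) {
      // On a Grid, each of these calls would be a separate job
      estimates[w] = estimate(samplesPerWorker, 42L + w);
    }
    System.out.println("Estimate of pi: " + average(estimates));
  }
}
```

On a real Grid, each call to estimate() would run as its own task and write its result to a file for the master to read.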
Posted at 12:49PM Sep 12, 2006 by Pascal Ledru in Grid Computing  |  Comments[0]

Wednesday Sep 06, 2006
A large job
To try something a little bit larger, I ran the Mersenne Prime search program described below on 100 nodes, searching for all the Mersenne primes with an exponent up to 25,000; the last one in this interval is 2^23209-1, discovered in 1979 (less than 30 years ago!). The program ran for 12 hours and was simple to write (a simple Java program); it was just a matter of uploading it to the Grid, submitting it, and downloading the result. While it is of course difficult to compare with other searches, this page describes how the 32nd Mersenne Prime was found: a specialized program written for a Cray-2 supercomputer.
Posted at 08:55AM Sep 06, 2006 by Pascal Ledru in Grid Computing  |  Comments[0]

Friday Sep 01, 2006
No Multicasting
As an option to discover a job from another job, I tried to multicast a request. It did not seem to work, so I opened the Developer's Guide to read that... indeed, network multicasting is disabled on the Grid!
Posted at 04:07PM Sep 01, 2006 by Pascal Ledru in Grid Computing  |  Comments[0]

Thursday Aug 31, 2006
Job Monitoring
Running a long-duration job, I ran into the problem of wanting to know what was going on with my job. As the files to download are only available when the job terminates, it is not easy to get intermediate data directly.

Below is a suggestion from one of my colleagues here at Sun. Let's extend the application I presented previously to get intermediate data. If you recall the interface of my Server:

public interface MersenneServer extends Remote {
  public int[] getInterval() throws RemoteException;
  public void postResult(Result result) throws RemoteException;
  public String getStatus() throws RemoteException;
}


I added an administration interface. For this simple example, just a method getStatus() which allows us to access intermediate data.

Once this job is submitted, you just need to submit a second job that calls this new method and then terminates; the progress is then available to download from the Grid Web UI.
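
The post does not show what getStatus() returns, so the following is only a sketch of one possibility. The class MersenneStatus and its counters are hypothetical names introduced for this example; the idea is that the server updates the counters from getInterval() and postResult(), and getStatus() simply formats them.

```java
// Hypothetical helper (not part of the original server): it keeps a few
// counters that a getStatus() implementation could report.
class MersenneStatus {

  private int intervalsHandedOut = 0;
  private int intervalsCompleted = 0;
  private int primesFound = 0;

  // To be called each time getInterval() hands out a slice
  synchronized void intervalHandedOut() {
    intervalsHandedOut++;
  }

  // To be called each time postResult() receives the primes of one slice
  synchronized void resultPosted(int primesInInterval) {
    intervalsCompleted++;
    primesFound += primesInInterval;
  }

  // The text that a monitoring job would print
  synchronized String getStatus() {
    return "intervals handed out: " + intervalsHandedOut
        + ", completed: " + intervalsCompleted
        + ", primes found: " + primesFound;
  }

  public static void main(String[] args) {
    MersenneStatus status = new MersenneStatus();
    status.intervalHandedOut();
    status.intervalHandedOut();
    status.resultPosted(3);
    System.out.println("Status: " + status.getStatus());
  }
}
```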

The trick here is to find the location of the initial server. Recall that all jobs are submitted as different users on a server selected by the Grid Engine. To find the identity of the server, this solution relies on the qstat command.

Submit a job with a somewhat unique id:
svrResp=$(qsub -N mserv08 startServers.ksh)

Here the id is 'mserv08'. When the monitoring job is submitted, it first executes the qstat command:
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  58219 0.50500 clients    user0103     r     08/25/2006 16:51:22 all.q@nyc1r213cpn21.retail.nyc     1 8
  58642 0.60500 start      user0104     r     08/27/2006 00:50:38 all.q@nyc1r213cpn24.retail.nyc   102
  58219 0.50500 clients    user0103     r     08/25/2006 16:51:22 all.q@nyc1r213cpn32.retail.nyc     1 12
  59914 0.50500 clients    user0105     r     08/31/2006 00:04:37 all.q@nyc1r214cpn14.retail.nyc     1 5
  58219 0.50500 clients    user0103     r     08/25/2006 16:51:22 all.q@nyc1r214cpn15.retail.nyc     1 16
  58219 0.50500 clients    user0103     r     08/25/2006 16:51:22 all.q@nyc1r214cpn20.retail.nyc     1 1
  59917 0.50500 Admin      user0107     r     08/31/2006 00:05:37 all.q@nyc1r219cpn24.retail.nyc     1
  58219 0.50500 clients    user0103     r     08/25/2006 16:51:22 all.q@nyc1r219cpn28.retail.nyc     1 11
  58219 0.50500 clients    user0103     r     08/25/2006 16:51:22 all.q@nyc1r220cpn04.retail.nyc     1 13
  59913 0.50500 mserv08    user0105     r     08/31/2006 00:04:07 all.q@nyc1r220cpn17.retail.nyc     1
  58219 0.50500 clients    user0103     r     08/25/2006 16:51:22 all.q@nyc1r220cpn21.retail.nyc     1 14

The monitoring job then looks for the host the initial job is running on: nyc1r220cpn17.retail.nyc. After printing a hostname, I noticed that the full format of the host was nyc1r220cpn17.retail.nyc1.sungrid.net, so the job appends '1.sungrid.net' to get the fully qualified name (possibly nyc1r220cpn17 alone would have worked too). It then performs an RMI lookup on that server and calls the administration method:
Local host: nyc1r219cpn24.retail.nyc1.sungrid.net
Looking for host where process: mserv08 is running
Host is: nyc1r220cpn17.retail.nyc1.sungrid.net
Looking up rmi://nyc1r220cpn17.retail.nyc1.sungrid.net/MersenneSolver...
Status: ....
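
The host lookup described above could be sketched in Java as follows. The class QstatHostFinder and the method findHost are hypothetical names; the column positions and the '1.sungrid.net' suffix are assumptions taken from the qstat output and the hostnames shown in this post, and may differ on another Grid.

```java
// Hypothetical sketch: find the host of a named job in captured qstat output.
class QstatHostFinder {

  // qstat truncates the queue column; this suffix is the one observed in the post
  static final String DOMAIN_SUFFIX = "1.sungrid.net";

  // Returns the fully qualified host running the job with the given name,
  // or null if no line matches.
  static String findHost(String qstatOutput, String jobName) {
    for (String line : qstatOutput.split("\n")) {
      String[] fields = line.trim().split("\\s+");
      // Assumed columns: job-ID prior name user state date time queue slots [task]
      if (fields.length < 8 || !fields[2].equals(jobName)) {
        continue;
      }
      String queue = fields[7];   // e.g. all.q@nyc1r220cpn17.retail.nyc
      int at = queue.indexOf('@');
      if (at < 0) {
        continue;
      }
      return queue.substring(at + 1) + DOMAIN_SUFFIX;
    }
    return null;
  }

  public static void main(String[] args) {
    String sample =
        "  59917 0.50500 Admin      user0107     r     08/31/2006 00:05:37 all.q@nyc1r219cpn24.retail.nyc     1\n"
      + "  59913 0.50500 mserv08    user0105     r     08/31/2006 00:04:07 all.q@nyc1r220cpn17.retail.nyc     1\n";
    System.out.println("Host is: " + findHost(sample, "mserv08"));
  }
}
```

With the two sample lines above, this prints "Host is: nyc1r220cpn17.retail.nyc1.sungrid.net", which is the name used in the RMI lookup.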


The drawback of this solution is that you need to submit a second job to monitor the initial real job... and this will cost you one dollar (even if this job takes only a couple of seconds!).
Posted at 08:10AM Aug 31, 2006 by Pascal Ledru in Grid Computing  |  Comments[1]

Wednesday Aug 30, 2006
Solaris 10 / Open Solaris
Sun Grid is based on Solaris 10. Somehow, somewhere, you will need access to an x86 Solaris 10 based machine to prepare your jobs (except if the jobs are all Java programs and you feel very confident about your scripts!).

I do all my development on an Ultra 20 Solaris 10 based machine. While lots of people have heard how great Solaris was on the server, I am not too sure people realize that Solaris is also a GREAT development platform on the desktop. I use an Ultra 20 using an AMD Opteron processor with two cores running at 2.4 GHz and 2 Gigabytes of memory running Solaris 10. Did I mention these boxes were really inexpensive?

Solaris 10 comes with JDS (Java Desktop System), a customized version of GNOME.


Here are some of the Software tools I use. Most of them are available at Blastwave:


Give it a try! Did I mention no viruses?
Posted at 08:29AM Aug 30, 2006 by Pascal Ledru in Solaris  |  Comments[1]

Friday Aug 18, 2006
Distributed Mersenne Primes on Sun Grid
Just finished up running the distributed version of my test program. Only took 15 minutes using 5 processors of the Grid. An almost perfect acceleration!
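
As a rough check of that claim, assuming the single-CPU run of 1.189 hours reported in the earlier Mersenne Primes post is a fair baseline for the same range of exponents, the speedup S and the efficiency E on 5 processors come out as:

```latex
T_1 \approx 1.189\ \text{h} \approx 71.3\ \text{min}, \qquad T_5 \approx 15\ \text{min}
\\
S = \frac{T_1}{T_5} \approx \frac{71.3}{15} \approx 4.8, \qquad E = \frac{S}{5} \approx 0.95
```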


What I need to get used to when using the Grid is the handful of scripts that use the Grid Engine commands to start all the processes on the nodes of the Grid. They are basically the same as the ones used by the RMI examples listed on the community, but I will still list them here as I found them useful:

cat Mersenne.ksh
#!/bin/ksh
#$ -N Mersenne
#$ -cwd

# Set the environment variables
if [ -f $HOME/.profile ]; then
   . $HOME/.profile
fi

numClients=5

echo "Starting the server..."
svrResp=$(qsub -N server startServers.ksh)
echo svrResp is $svrResp
svrJobId=$(echo "$svrResp" | awk '{print $3}')
echo svrJobId is $svrJobId

# Wait until the server is started
status="not running"
until [ "$status" = "r" ]
do
  # Match the job ID against the first column only
  status=$( qstat | nawk -v id="$svrJobId" '$1 == id {print $5}' )
  echo Server job status is $status
  sleep 10
done

#Wait until the serverhost file is created
filename="$HOME/serverhost"
until test -f $filename
do
  sleep 10
done
# then pull the server node name from the file
servernode=$(cat $filename)
rm $filename

echo "Server is running on" $servernode "Submitting a set of clients to the grid for remote execution..."
qsub -N clients -t 1-$numClients startClient.ksh $servernode

echo "Submitting a cleanup job that will wait until the clients are complete"
qsub -hold_jid clients cleanup.ksh $svrJobId



 cat startServers.ksh
#!/bin/ksh
#$ -N startServers
#$ -cwd

echo "Starting the registry in the background..."
rmiregistry &

# Wait until the registry process shows up, then give it time to bind its port
until ps -ef | grep "[r]miregistry" > /dev/null
do
  sleep 1
done
echo "rmiregistry is running"
sleep 10

# Place the name of this host in a file, so that clients can read it
hostname > $HOME/serverhost
echo "The server's location is" $(hostname)

echo "Starting the server..."
java MersenneServerImpl



 cat startClient.ksh
#!/bin/ksh
#$ -N startClient
#$ -cwd

servernode=$1

echo "Starting a client on $(hostname) to talk to server running on" $servernode
java MersenneClient $servernode



cat cleanup.ksh
#!/bin/ksh
#$ -N cleanup
#$ -cwd

# Set the environment variables
if [ -f $HOME/.profile ]; then
   . $HOME/.profile
fi

echo Killing the server, job number $1 ...
qdel $1

Posted at 03:10PM Aug 18, 2006 by Pascal Ledru in Grid Computing  |  Comments[0]

Learning the hard way!
I learnt this one the hard way!

Do not hardcode the location of commands (e.g., Grid commands) in environment variables, as it could change. DO NOT do something like:

#!/bin/ksh
#$ -N Test_qstat
#$ -cwd

SGETOOLS=/home/sgeadmin/n1ge60/bin/sol-x86
export SGETOOLS

echo "Starting Test program..."
$SGETOOLS/qstat
echo "End Test program..."



Instead use:

#!/bin/ksh
#$ -N Test_qstat
#$ -cwd

# Set the environment variables
if [ -f $HOME/.profile ]; then
   . $HOME/.profile
fi

echo "Starting Test program..."
echo `which qstat`
qstat
echo "End Test program..."

Posted at 02:24PM Aug 18, 2006 by Pascal Ledru in Grid Computing  |  Comments[0]

Wednesday Aug 16, 2006
Grid Engine crash course
The three commands to get started with the Grid Engine are:
- qsub
- qdel
- qstat


qsub is used to submit a job to a grid, qstat is used to see which jobs are running on a grid, and qdel is used to delete a job from the grid. Here is a quick example:

cat startServer.ksh
#!/bin/ksh

SGETOOLS=/home/sgeadmin/N1GE/bin/sol-amd64
export SGETOOLS

numClients=10


$SGETOOLS/qsub -N client -t 1-$numClients startClient.ksh


This program starts 10 copies of the client program on the grid. The client has access to some environment variables such as: SGE_TASK_ID
> cat startClient.ksh
#!/bin/ksh
host=$( hostname )
echo "Starting client on" $host with ID  $SGE_TASK_ID >> /home/pascal/test1/file$host
# Endless loop that keeps the task alive so that it shows up in qstat
proc=0
while [ "$proc" = 0 ]
do
  echo $proc
done

The output of qstat looks like:

> qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   1221 0.55500 client     pascal       r     08/16/2006 17:39:50 all.q@node01a                      1 1
   1221 0.55500 client     pascal       r     08/16/2006 17:39:50 all.q@node01b                      1 2
   1221 0.55500 client     pascal       r     08/16/2006 17:39:50 all.q@node02a                      1 3
   1221 0.55500 client     pascal       r     08/16/2006 17:39:50 all.q@node02b                      1 10
   1221 0.55500 client     pascal       r     08/16/2006 17:39:50 all.q@node03a                      1 9
   1221 0.55500 client     pascal       r     08/16/2006 17:39:50 all.q@node04a                      1 7
   1221 0.55500 client     pascal       r     08/16/2006 17:39:50 all.q@node04b                      1 5
   1221 0.55500 client     pascal       r     08/16/2006 17:39:50 all.q@node05a                      1 8
   1221 0.55500 client     pascal       r     08/16/2006 17:39:50 all.q@node05b                      1 6
   1221 0.55500 client     pascal       r     08/16/2006 17:39:50 all.q@node06b                      1 4
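
The SGE_TASK_ID variable mentioned above can also be read from a Java client. The sketch below is only an illustration: the class TaskSlice, the method sliceFor, and the slice size of 100 are assumptions for this example, and the fallback to task 1 covers the case where the variable is unset or not a number (for instance when the program runs outside an array job).

```java
// Hypothetical sketch: use the array task ID to pick a slice of work.
class TaskSlice {

  // Maps a 1-based task ID to the inclusive range [start, end] of the given size.
  static int[] sliceFor(int taskId, int sliceSize) {
    int start = (taskId - 1) * sliceSize + 1;
    return new int[] {start, start + sliceSize - 1};
  }

  public static void main(String[] args) {
    int taskId;
    try {
      // Integer.parseInt throws NumberFormatException for null or non-numeric input
      taskId = Integer.parseInt(System.getenv("SGE_TASK_ID"));
    } catch (NumberFormatException ex) {
      taskId = 1; // fall back to the first slice
    }
    int[] slice = sliceFor(taskId, 100);
    System.out.println("Task " + taskId + " handles " + slice[0] + " to " + slice[1]);
  }
}
```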

Posted at 02:48PM Aug 16, 2006 by Pascal Ledru in Grid Computing  |  Comments[0]

Tuesday Aug 15, 2006
Distributed Mersenne Primes
Let's rewrite the previous program to take advantage of multiple CPUs! The idea is to have a master generating some pieces of work, some workers executing slices of the work and posting their results to the master. Fairly straightforward to write in Java using RMI! Starting with the interface defining the master:

import java.rmi.*;

public interface MersenneServer extends Remote {
  public int[] getInterval() throws RemoteException;
  public void postResult(int[] values) throws RemoteException;
}

Workers simply grab slices of the work and test whether the exponents in each slice yield Mersenne primes:

import java.math.*;
import java.rmi.*;
import java.rmi.server.*;
import java.util.ArrayList;

public class MersenneClient {

  static final BigInteger one  = new BigInteger("1");
  static final BigInteger two  = new BigInteger("2");
  static final BigInteger four = new BigInteger("4");

  public MersenneClient() throws RemoteException {
  }

  private static boolean LucasLehmerTest(int p) {
    BigInteger s = four;
    BigInteger n = one.shiftLeft(p).subtract(one);
    for (int i = 3; i <= p; i++) {
      s = s.multiply(s).subtract(two).mod(n);
    }
    // 2^p-1 is prime when the final residue is zero
    return s.signum() == 0;
  }


  public static void main(String[] args) {
    String host = args[0];
    String name = "rmi://" + host + "/MersenneSolver";
    System.out.println("Looking up " + name + "...");
    MersenneServer server = null;

    try {
      server = (MersenneServer)Naming.lookup(name);
    } catch (Exception ex) {
      System.out.println("Caught an exception looking up Solver.");
      ex.printStackTrace();
      System.exit(1);
    }

    while (true) {
      try {
        int[] interval = server.getInterval();
        if (interval == null) break; // no more intervals
        // A typed list is required so that list.get(i) can be unboxed to an int
        ArrayList<Integer> list = new ArrayList<Integer>();
        for (int i = interval[0]; i <= interval[1]; i++) {
          if (LucasLehmerTest(i)) {
            list.add(i);
          }
        }
        int[] values = new int[list.size()];
        for (int i = 0; i < list.size(); i++) values[i] = list.get(i);
        server.postResult(values);
      } catch (RemoteException ex) {
        System.out.println("Caught remote exception.");
        System.out.println("Probably server shutdown as all intervals are evaluated");
        System.exit(1);
      }
    }
  }

}


The implementation of the master generates slices and checks if all slices have been evaluated:

import java.rmi.*;
import java.rmi.server.UnicastRemoteObject;
public class MersenneServerImpl extends UnicastRemoteObject implements MersenneServer {

  private int i = 1;
  private int interval = 100;

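  // Number of intervals handed out by getInterval(): 20 intervals of 100
  // up to 2000, then 60 intervals of 50 up to 5000 (40 in total for a limit of 3000)
  //private int totalInterval = 40;
  private int totalInterval = 80;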

  public MersenneServerImpl() throws RemoteException {
  }

  // calculate the Mersenne primes up to 5000
  // break the range into small intervals
  // each client will test the primes within a given interval
  public synchronized int[] getInterval() throws RemoteException {
    if (i >= 5000) return null;
    //if (i >= 3000) return null;
    if (i >= 2000) interval = 50;
    int j = i;
    int k = i + interval-1;
    i = i + interval;
    return new int[] {j, k};
  }

  public synchronized void postResult(int[] values) throws RemoteException {
    for (int i = 0; i < values.length; i++) {
      System.out.println("2^" + values[i] + "-1 is prime");
    }
    // check if we should exit
    totalInterval--;
    if (totalInterval == 0) System.exit(0);
  }

  public static void main(String[] args) {
    try {
      String name = "MersenneSolver";
      System.out.println("Registering Mersenne Solver");
      MersenneServerImpl solver = new MersenneServerImpl();
      Naming.rebind(name, solver);
      System.out.println("Remote Solver ready...");
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }

}
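
The value of totalInterval has to match the number of slices that getInterval() hands out, otherwise the server exits too early or never exits. As a sanity check, the slicing logic can be replayed without RMI. The class IntervalCount below is a hypothetical helper written for this purpose; it mirrors the loop in getInterval() and only counts the slices.

```java
// Hypothetical helper: replays the slicing logic of getInterval() and
// counts how many intervals are handed out before null is returned.
class IntervalCount {

  static int countIntervals(int limit) {
    int i = 1;
    int interval = 100;
    int count = 0;
    while (i < limit) {            // getInterval() returns null once i >= limit
      if (i >= 2000) interval = 50;
      i = i + interval;
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println("Intervals for 5000: " + countIntervals(5000));
    System.out.println("Intervals for 3000: " + countIntervals(3000));
  }
}
```

This gives 80 intervals for a limit of 5000 and 40 for a limit of 3000, which matches the two values of totalInterval in the listing.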


I first test this program locally using this simple script:

#!/bin/ksh

echo "Starting registry"
rmiregistry &

# Wait until the registry process shows up, then give it time to bind its port
until ps -ef | grep "[r]miregistry" > /dev/null
do
  sleep 1
done
echo "rmiregistry is running"
sleep 10

echo "Starting server"
java MersenneServerImpl &
sleep 10
echo "Starting client"
java MersenneClient localhost &
sleep 2
echo "Starting client"
java MersenneClient localhost &


Tomorrow, I will go over the Grid Engine commands to actually run this application on a Grid.
Posted at 11:41AM Aug 15, 2006 by Pascal Ledru in Grid Computing  |  Comments[0]

Monday Aug 14, 2006
Mersenne Primes
Mersenne primes are primes of the form: 2^p-1
Mersenne primes are found using the Lucas-Lehmer Test.
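
For reference, the test can be stated as follows (a standard formulation, written here with the usual notation M_p for the candidate; the residues are reduced modulo M_p at each step, as the code does):

```latex
M_p = 2^p - 1, \qquad s_0 = 4, \qquad s_{k+1} = (s_k^2 - 2) \bmod M_p
\\
\text{for an odd prime } p: \quad M_p \text{ is prime} \iff s_{p-2} \equiv 0 \pmod{M_p}
```

For example, with p = 3 the candidate is M_3 = 7 and s_1 = (4^2 - 2) mod 7 = 0, so 7 is prime.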

A simple Java implementation of this test taking advantage of the BigInteger class is the following:

import java.math.*;
public class Mersenne {

  static final BigInteger one  = new BigInteger("1");
  static final BigInteger two  = new BigInteger("2");
  static final BigInteger four = new BigInteger("4");

  private static boolean LucasLehmerTest(int p) {
    BigInteger s = four;
    BigInteger n = one.shiftLeft(p).subtract(one);
    for (int i = 3; i <= p; i++) {
      s = s.multiply(s).subtract(two).mod(n);
    }
    // 2^p-1 is prime when the final residue is zero
    return s.signum() == 0;
  }

  public static void main(String[] args) {
    for (int i = 0; i < 5000; i++) {
      if (LucasLehmerTest(i)) {
        System.out.println("2^" + i + "-1 is prime");
      }
    }
  }

}

And the output will look like:
2^3-1 is prime
2^5-1 is prime
2^7-1 is prime
2^13-1 is prime
2^17-1 is prime
2^19-1 is prime
2^31-1 is prime
2^61-1 is prime
2^89-1 is prime
2^107-1 is prime
2^127-1 is prime
2^521-1 is prime
2^607-1 is prime
2^1279-1 is prime
2^2203-1 is prime
2^2281-1 is prime
2^3217-1 is prime
2^4253-1 is prime
2^4423-1 is prime

(Of course 2^2-1 is also prime!).
Well, this program does not exactly take advantage of the full power of the Grid (the run took 1.189 hours)! All the computation is done by just one CPU! Let's see if we can rewrite this program to take advantage of the Grid...
Posted at 01:43PM Aug 14, 2006 by Pascal Ledru in Grid Computing  |  Comments[0]

Grid server
Want to check which servers actually underlie the Grid? Create a small script such as:

cat /etc/release

/usr/sbin/prtconf

/usr/sbin/psrinfo -v

Submit it as a job to the grid. Extract the result from the output zip file. You should get something such as:
                          Solaris 10 3/05 s10_74L2a X86
           Copyright 2005 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                            Assembled 22 January 2005
System Configuration:  Sun Microsystems  i86pc
Memory size: 7808 Megabytes
System Peripherals (Software Nodes):

i86pc
    scsi_vhci, instance #0
    ib, instance #0
        rpcib, instance #0
        daplt, instance #0
    isa, instance #0
        motherboard (driver not attached)
        i8042, instance #0
            mouse, instance #0
            keyboard, instance #0
        asy, instance #0
        fdc, instance #0
            fd, instance #0
            fd, instance #1 (driver not attached)
    pci, instance #0
        pci1022,7460, instance #0
            pci17c2,10, instance #0
            pci17c2,10, instance #1
            display, instance #0
        pci17c2,10 (driver not attached)
        pci-ide, instance #0
            ide, instance #0 (driver not attached)
            ide, instance #1
                sd, instance #0
                st, instance #7 (driver not attached)
        pci1022,7450, instance #1
            pci17c2,10, instance #0
            pci17c2,10, instance #1
            pci17c2,10, instance #0
                sd, instance #1 (driver not attached)
                sd, instance #2
                sd, instance #3 (driver not attached)
                sd, instance #4 (driver not attached)
                sd, instance #5 (driver not attached)
                sd, instance #6 (driver not attached)
                sd, instance #7 (driver not attached)
                sd, instance #8 (driver not attached)
                sd, instance #9 (driver not attached)
                sd, instance #10 (driver not attached)
                sd, instance #11 (driver not attached)
                sd, instance #12 (driver not attached)
                sd, instance #13 (driver not attached)
                sd, instance #14 (driver not attached)
                sd, instance #15 (driver not attached)
                sd, instance #16 (driver not attached)
                st, instance #0 (driver not attached)
                st, instance #1 (driver not attached)
                st, instance #2 (driver not attached)
                st, instance #3 (driver not attached)
                st, instance #4 (driver not attached)
                st, instance #5 (driver not attached)
                st, instance #6 (driver not attached)
        pci17c2,10 (driver not attached)
        pci1022,7450, instance #2
            pci15b3,5a46, instance #3
                pci15b3,5a44, instance #0
        pci17c2,10 (driver not attached)
        pci1022,1100 (driver not attached)
        pci1022,1101 (driver not attached)
        pci1022,1102 (driver not attached)
        pci1022,1103 (driver not attached)
        pci1022,1100 (driver not attached)
        pci1022,1101 (driver not attached)
        pci1022,1102 (driver not attached)
        pci1022,1103 (driver not attached)
    pseudo, instance #0
    options, instance #0
    xsvc, instance #0
    objmgr, instance #0 (driver not attached)
    used-resources (driver not attached)
    cpus (driver not attached)
        cpu, instance #0 (driver not attached)
        cpu, instance #1 (driver not attached)
Status of virtual processor 0 as of: 08/14/2006 19:17:29
  on-line since 07/21/2006 17:15:45.
  The i386 processor operates at 2191 MHz,
        and has an i387 compatible floating point processor.
Status of virtual processor 1 as of: 08/14/2006 19:17:29
  on-line since 07/21/2006 17:15:51.
  The i386 processor operates at 2191 MHz,
        and has an i387 compatible floating point processor.

This indicates that the server has two processors (2191 MHz each) and 7808 Megabytes of memory, and runs Solaris 10.
Posted at 12:29PM Aug 14, 2006 by Pascal Ledru in Grid Computing  |  Comments[0]