Pascal's Weblog
The Grid...



Thursday Aug 31, 2006
Job Monitoring
While running a long job, I wanted to know how it was progressing. Because the output files can only be downloaded once the job terminates, intermediate data is not easy to get directly.

Below is a suggestion from one of my colleagues here at Sun. Let's extend the application I presented previously to get intermediate data. Recall the interface of my server:

public interface MersenneServer extends Remote {
  public int[] getInterval() throws RemoteException;
  public void postResult(Result result) throws RemoteException;
  public String getStatus() throws RemoteException;
}


I added an administration interface. For this simple example, it is just one method, getStatus(), which gives access to intermediate data.
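As a rough illustration of what could sit behind getStatus(), here is a minimal sketch. It assumes the server counts the intervals it hands out and the results it receives; the names StatusAdmin, Tracker, intervalHandedOut() and resultPosted() are hypothetical and not part of the original application. RMI may dispatch calls on several threads at once, so the counters are atomic.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.concurrent.atomic.AtomicInteger;

public class StatusTrackerSketch {

    /** Hypothetical stand-in for the administration part of MersenneServer. */
    public interface StatusAdmin extends Remote {
        String getStatus() throws RemoteException;
    }

    /** Counts work handed out and finished; safe to call from concurrent RMI threads. */
    public static class Tracker implements StatusAdmin {
        private final int totalIntervals;
        private final AtomicInteger handedOut = new AtomicInteger();
        private final AtomicInteger completed = new AtomicInteger();

        public Tracker(int totalIntervals) {
            this.totalIntervals = totalIntervals;
        }

        /** Assumed to be called from getInterval() each time a client takes an interval. */
        public void intervalHandedOut() {
            handedOut.incrementAndGet();
        }

        /** Assumed to be called from postResult() each time a client reports back. */
        public void resultPosted() {
            completed.incrementAndGet();
        }

        @Override
        public String getStatus() {
            // The two counters are read separately, so this is an approximate snapshot.
            int done = completed.get();
            int inProgress = handedOut.get() - done;
            return done + " of " + totalIntervals + " intervals completed, "
                    + inProgress + " in progress";
        }
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker(10);
        tracker.intervalHandedOut();
        tracker.intervalHandedOut();
        tracker.resultPosted();
        System.out.println("Status: " + tracker.getStatus());
    }
}
```

The status string is built on demand, so the monitoring job always sees the counters as they are at the time of the call.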

Once this job is submitted, you just need to submit a second job that calls this new method and terminates. The progress report is then available for download from the Grid Web UI.
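The second job can be very small. The sketch below shows the idea under some assumptions: StatusAdmin and FixedStatusServer are hypothetical stand-ins for the real server, and the port is arbitrary. To keep the example self-contained it starts a registry in the same JVM on the loopback address; on the grid the server and the monitoring job would of course run on different hosts.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

public class MonitorJobSketch {

    /** Hypothetical stand-in for the administration part of MersenneServer. */
    public interface StatusAdmin extends Remote {
        String getStatus() throws RemoteException;
    }

    /** Stand-in server that always reports the same made-up status. */
    public static class FixedStatusServer implements StatusAdmin {
        @Override
        public String getStatus() {
            return "3 of 10 intervals completed";
        }
    }

    /** What the monitoring job does: look up the server by URL and ask for its status. */
    public static String fetchStatus(String url) throws Exception {
        StatusAdmin admin = (StatusAdmin) Naming.lookup(url);
        return admin.getStatus();
    }

    /** Starts a local stand-in server, queries it once over RMI, then shuts it down. */
    public static String roundTrip(int port) throws Exception {
        // Make the exported stub point at the loopback address.
        System.setProperty("java.rmi.server.hostname", "127.0.0.1");
        FixedStatusServer server = new FixedStatusServer();
        Registry registry = LocateRegistry.createRegistry(port);
        Remote stub = UnicastRemoteObject.exportObject(server, 0);
        try {
            registry.rebind("MersenneSolver", stub);
            return fetchStatus("rmi://127.0.0.1:" + port + "/MersenneSolver");
        } finally {
            // Unexport both objects so the JVM can exit.
            UnicastRemoteObject.unexportObject(server, true);
            UnicastRemoteObject.unexportObject(registry, true);
        }
    }

    public static void main(String[] args) throws Exception {
        // 51099 is an arbitrary port chosen for this sketch.
        System.out.println("Status: " + roundTrip(51099));
    }
}
```

Whatever the job prints to standard output ends up in the job's output files, which is how the status becomes downloadable.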

The trick here is to find the location of the initial server. Recall that all jobs are submitted as different users on a server selected by the Grid Engine. To find the identity of the server, this solution relies on the qstat command.

Submit the server job with a reasonably unique name:
svrResp=$(qsub -N mserv08 startServers.ksh)

Here the name is 'mserv08'. When the monitoring job runs, it first executes the qstat command:
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  58219 0.50500 clients    user0103     r     08/25/2006 16:51:22 all.q@nyc1r213cpn21.retail.nyc     1 8
  58642 0.60500 start      user0104     r     08/27/2006 00:50:38 all.q@nyc1r213cpn24.retail.nyc   102
  58219 0.50500 clients    user0103     r     08/25/2006 16:51:22 all.q@nyc1r213cpn32.retail.nyc     1 12
  59914 0.50500 clients    user0105     r     08/31/2006 00:04:37 all.q@nyc1r214cpn14.retail.nyc     1 5
  58219 0.50500 clients    user0103     r     08/25/2006 16:51:22 all.q@nyc1r214cpn15.retail.nyc     1 16
  58219 0.50500 clients    user0103     r     08/25/2006 16:51:22 all.q@nyc1r214cpn20.retail.nyc     1 1
  59917 0.50500 Admin      user0107     r     08/31/2006 00:05:37 all.q@nyc1r219cpn24.retail.nyc     1
  58219 0.50500 clients    user0103     r     08/25/2006 16:51:22 all.q@nyc1r219cpn28.retail.nyc     1 11
  58219 0.50500 clients    user0103     r     08/25/2006 16:51:22 all.q@nyc1r220cpn04.retail.nyc     1 13
  59913 0.50500 mserv08    user0105     r     08/31/2006 00:04:07 all.q@nyc1r220cpn17.retail.nyc     1
  58219 0.50500 clients    user0103     r     08/25/2006 16:51:22 all.q@nyc1r220cpn21.retail.nyc     1 14

The monitoring job then looks in this output for the host the initial job is running on, here nyc1r220cpn17.retail.nyc. After printing a hostname, I noticed that the full host name has the form nyc1r220cpn17.retail.nyc1.sungrid.net, so the job appends '1.sungrid.net' to get the fully qualified name (nyc1r220cpn17 alone would possibly have worked too). It then performs an RMI lookup on that server and calls the administration method:
Local host: nyc1r219cpn24.retail.nyc1.sungrid.net
Looking for host where process: mserv08 is running
Host is: nyc1r220cpn17.retail.nyc1.sungrid.net
Looking up rmi://nyc1r220cpn17.retail.nyc1.sungrid.net/MersenneSolver...
Status: ....
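The host lookup described above could be sketched as follows. This is not the original code: the class and method names are hypothetical, and the column positions (name in the third column, state in the fifth, queue in the eighth) are assumptions taken from the sample qstat output shown in this post. The '1.sungrid.net' suffix is the one observed above.

```java
import java.util.Optional;

public class QstatHostFinder {

    // Suffix observed in this post; other grids will differ.
    static final String DOMAIN_SUFFIX = "1.sungrid.net";

    /** Returns the host part of the queue column for the first running job with this name. */
    public static Optional<String> findHost(String qstatOutput, String jobName) {
        for (String line : qstatOutput.split("\\R")) {
            // Assumed columns: job-ID prior name user state date time queue slots [ja-task-ID]
            String[] cols = line.trim().split("\\s+");
            if (cols.length >= 8 && cols[2].equals(jobName) && cols[4].equals("r")) {
                String queue = cols[7];            // e.g. all.q@nyc1r220cpn17.retail.nyc
                int at = queue.indexOf('@');
                if (at >= 0) {
                    return Optional.of(queue.substring(at + 1));
                }
            }
        }
        return Optional.empty();
    }

    /** Appends the domain suffix and builds the RMI URL for the given service name. */
    public static String toRmiUrl(String shortHost, String serviceName) {
        return "rmi://" + shortHost + DOMAIN_SUFFIX + "/" + serviceName;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
            "job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID",
            "  59917 0.50500 Admin      user0107     r     08/31/2006 00:05:37 all.q@nyc1r219cpn24.retail.nyc     1",
            "  59913 0.50500 mserv08    user0105     r     08/31/2006 00:04:07 all.q@nyc1r220cpn17.retail.nyc     1");

        findHost(sample, "mserv08")
            .map(host -> toRmiUrl(host, "MersenneSolver"))
            .ifPresent(url -> System.out.println("Looking up " + url + "..."));
    }
}
```

Matching on the state column as well as the name avoids picking up a job with the same name that is still waiting in the queue and has no host yet.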


The drawback of this solution is that you need to submit a second job to monitor the real job, and this will cost you one dollar (even if the job takes only a couple of seconds!).
Posted at 08:10AM Aug 31, 2006 by Pascal Ledru in Grid Computing  |  Comments[1]

Comments:

Glad to see that it worked. Yes, it would have worked without the FQDN.

Posted by Yakshaving on August 31, 2006 at 08:56 AM PDT #
