Shared Pool
Limiting Certain Concurrent Jobs on SGE Cluster
A customer asked how to limit the total number of a certain kind of job on an SGE cluster. The requirement is twofold: at most one such job may run on each execution host, and no more than a given number of them may run concurrently across the cluster. The cap on concurrent jobs is there to avoid resource contention.
The customer is running an old version of SGE that lacks the resource quota set feature, which has been available since the SGE 6.1 release. The same result can be achieved on pre-6.1 releases, but it takes considerably more work than with a resource quota set.
The following demonstrates how easy it is to set up such a customization with the SGE resource quota set feature.
The first step is to create a resource counter that tracks how many of these jobs are running. Define it as a consumable complex attribute (for example, with qconf -mc):
#name       shortcut  type  relop  requestable  consumable  default  urgency
#---------------------------------------------------------------------------
concurjob   ccj       INT   <=     FORCED       YES         0        0
...
Next, all of these special jobs should run in a dedicated queue named "archive". The archive queue is configured with the counter as a complex value; because the counter is defined as FORCED, a job can run in this queue only if it requests the counter at submission.
# qconf -sq archive
qname archive
...
complex_values concurjob=1
...
With concurjob=1, each archive queue instance, and therefore each machine, can run only one such job at a time.
Now it's time to control the total number of such jobs globally. This can be done very easily with a resource quota set (RQS). The following command can be used to create such an RQS rule.
# qconf -arqs
{
name limit_concur_jobs
description NONE
enabled TRUE
limit to concurjob=10
}
The rule name, the enabled flag, and the limit line are the entries changed from the template that qconf -arqs presents. This completes the customization that limits the total number of special jobs running concurrently on the entire SGE cluster.
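To see how the two limits interact, here is a toy model in Python. It is only a sketch of the counting logic, not how the SGE scheduler is implemented; the function name place_jobs and the host names are made up for illustration.

```python
def place_jobs(hosts, job_ids, per_host=1, global_limit=10):
    """Toy model: assign jobs to hosts under a per-host and a global cap.

    per_host     mirrors "complex_values concurjob=1" on each queue instance.
    global_limit mirrors "limit to concurjob=10" in the RQS rule.
    Each job is assumed to request exactly one unit of the counter.
    """
    used = {host: 0 for host in hosts}   # units consumed on each host
    running = {}                         # job id -> host it was placed on
    pending = []                         # job ids that could not be placed

    for job in job_ids:
        total_in_use = sum(used.values())
        # First host that still has a free unit, if any.
        free_host = next((h for h in hosts if used[h] < per_host), None)
        if total_in_use < global_limit and free_host is not None:
            used[free_host] += 1
            running[job] = free_host
        else:
            pending.append(job)
    return running, pending


if __name__ == "__main__":
    hosts = [f"node{i:02d}" for i in range(12)]   # hypothetical host names
    running, pending = place_jobs(hosts, range(1, 16))
    print(len(running), "running,", len(pending), "pending")
```

With twelve hosts and fifteen jobs, the global limit is what holds jobs back; with only four hosts, the per-host limit would bind first.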
When you submit a special job to the archive queue, you must include the "-l concurjob=1" resource request (or its shortcut, "-l ccj=1"). Each such request consumes one unit of the counter, which is how SGE tracks how many of these special jobs are running.
The following shows an example. For demonstration purposes, the archive queue is modified to accommodate two jobs per queue instance, and the total number of allowed concurrent jobs is set to 1.
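If these jobs are submitted from scripts, a small wrapper can make sure the resource request is never left off. The following is a minimal sketch; build_qsub_args is a hypothetical helper, and adding "-q archive" explicitly is an assumption made here for clarity rather than something the setup requires.

```python
import shlex


def build_qsub_args(command, counter="concurjob", amount=1, queue="archive"):
    """Return a qsub argument list that always carries the counter request.

    command is the program and its arguments, e.g. ["sleep", "3600"].
    The helper only builds the list; it does not contact the cluster.
    """
    return [
        "qsub",
        "-b", "y",                    # submit the command as a binary
        "-q", queue,                  # assumption: target the archive queue explicitly
        "-l", f"{counter}={amount}",  # consume one unit of the counter
        *command,
    ]


if __name__ == "__main__":
    args = build_qsub_args(["sleep", "3600"])
    print(shlex.join(args))
```

On a submit host, the returned list could be handed to subprocess.run to perform the actual submission.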
s4u-80a-bur02# qconf -sq archive |egrep 'host|archive|concur'
qname archive
hostlist @allhosts
complex_values concurjob=2
s4u-80a-bur02# qconf -srqs
{
name limit_concur_jobs
description NONE
enabled TRUE
limit to concurjob=1
}
s4u-80a-bur02# qsub -b y -o /dev/null -j y -l ccj=1 sleep 3600
Your job 53 ("sleep") has been submitted
s4u-80a-bur02# qsub -b y -o /dev/null -j y -l ccj=1 sleep 3600
Your job 54 ("sleep") has been submitted
s4u-80a-bur02# qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
archive@s4u-80a-bur02 BIP 0/2/10 0.02 sol-sparc64
53 0.55500 sleep root r 10/24/2008 15:05:59 1
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
54 0.00000 sleep root qw 10/24/2008 15:05:57 1
s4u-80a-bur02# qstat -j 54
...
scheduling info: cannot run because it exceeds limit "/////" in rule "limit_concur_jobs/1"
As observed here, job 54 stays pending until the resource becomes available.
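When many jobs are pending, it can help to pull the name of the blocking rule out of the "qstat -j" output automatically. The sketch below assumes the message has exactly the form shown above; other SGE versions may word it differently, and blocking_rules is a hypothetical helper name.

```python
import re

# Matches messages such as:
#   cannot run because it exceeds limit "/////" in rule "limit_concur_jobs/1"
RULE_RE = re.compile(
    r'exceeds limit "[^"]*" in rule "(?P<rule>[^"/]+)/(?P<index>\d+)"'
)


def blocking_rules(qstat_j_output):
    """Return (rule set name, limit index) pairs found in qstat -j text."""
    return [
        (m.group("rule"), int(m.group("index")))
        for m in RULE_RE.finditer(qstat_j_output)
    ]


if __name__ == "__main__":
    sample = (
        'scheduling info: cannot run because it exceeds limit "/////" '
        'in rule "limit_concur_jobs/1"'
    )
    print(blocking_rules(sample))
```

The number after the slash is reported separately so that a rule set with several limit lines can be told apart.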
Posted at 03:37PM Oct 24, 2008 by Chansup Byun in Grid