Stephan Grell's Weblog

Tuesday, August 08, 2006

N1GE 6 - Scheduler Hacks: Exclusive master host for the master task
There was some discussion on the open source mailing lists, and a lot of interest, in how to single out the master task of a parallel job onto a special host while all the slave tasks run on the compute nodes. There can be several reasons to do this; the one I heard most often was that the master task needs a lot of memory and a special host exists just for that purpose.
During the discussion, we came across three workarounds for the problem. I will start with the easiest setup and end with the most complicated. Since they are all workarounds, none of them is perfect. Nevertheless, they do achieve the goal, more or less. :-)

1) Using the host sorting mechanism:

Description:

Grid Engine can sort hosts / queues by sequence number. Assuming that we have only one cluster queue and the parallel environment is configured to use $fill_up, we can assign the compute queue instances a smaller sequence number than the master machines. The job requests the pe it should run in and the master machine as the masterq. This way, all slave tasks run on the compute nodes, which are filled up first, and the master task is singled out to the master machine due to its special request.
If the environment has more than one master host, wildcards in the masterq request can be used to select one of the master hosts.

Advantages:
Makes the best use of all resources and is easy to set up, understand, and debug. This setup also has the least performance impact.

Problems:

As soon as there are not enough free slots on the compute nodes, the scheduler will assign more than one task to the master machine.
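
To make the fill-up behaviour and its overflow problem concrete, here is a small toy model in Python. It is only a sketch under simplifying assumptions, not Grid Engine's actual scheduler: the queue instance names (all.q@node1, all.q@master1, ...) and slot counts are made up, the masterq wildcard is approximated with shell-style matching, and one slot of the job is assumed to be the master task.

```python
from fnmatch import fnmatchcase


def allocate(total_slots, queue_instances, masterq_pattern):
    """Toy placement: one master slot via masterq, slaves filled by seq_no.

    queue_instances is a list of (name, seq_no, free_slots) tuples.
    Returns (master_instance, {instance: slots_used}).
    """
    free = {name: slots for name, _, slots in queue_instances}
    # sorted() is stable, so instances with equal seq_no keep their input order.
    ordered = [name for name, _, _ in sorted(queue_instances, key=lambda q: q[1])]

    # The master task goes to the first free instance matching the masterq request.
    master = next(n for n in ordered
                  if fnmatchcase(n, masterq_pattern) and free[n] > 0)
    free[master] -= 1
    placement = {master: 1}

    # The remaining slave tasks fill instances in ascending seq_no order.
    remaining = total_slots - 1
    for name in ordered:
        take = min(free[name], remaining)
        if take:
            placement[name] = placement.get(name, 0) + take
            free[name] -= take
            remaining -= take
    if remaining:
        raise RuntimeError("not enough free slots")
    return master, placement


# Hypothetical cluster: two compute nodes (seq_no 0) and one master host (seq_no 1).
queues = [("all.q@node1", 0, 2), ("all.q@node2", 0, 2), ("all.q@master1", 1, 4)]
print(allocate(5, queues, "*@master*"))  # slaves fit on the compute nodes
print(allocate(6, queues, "*@master*"))  # one slave spills onto the master host
```

With four compute slots, the 5-slot job keeps the master host exclusive to the master task, while the 6-slot job ends up with two tasks on all.q@master1, which is exactly the problem described above.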

Configuration:

Change the queue sort order in the scheduler config:
qconf -msconf
   queue_sort_method                 seqno

The queue on the small hosts gets:
qconf -mq <queue>
   seq_no                0

The queue for the master hosts gets:
qconf -mq <queue>
   seq_no                1

A job submit would look like:
qsub -pe <PE> 6 -masterq "*@master*" ...

2) Making excessive use of pe objects and cluster queues:

Description:
Each slot on a master host needs its own cluster queue and its own pe. The compute nodes are combined under one cluster queue that lists all the pe objects used on the master hosts. Each master cluster queue has exactly one slot. The job submit then requests the master queue via wildcards and the pe it should run in via wildcards.

Advantages:
Achieves the goal.

Problems:
Many configuration objects. Slows down the scheduler quite a bit.

Configuration:
I will leave the configuration for this one open. It should not be complicated...
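
To get a feeling for how many objects this approach needs, the following Python sketch simply enumerates them. It does not talk to Grid Engine, and the naming scheme (master_<host>_<n>.q, pe_<host>_<n>, compute.q) is purely hypothetical; only the "one queue and one pe per master slot" rule comes from the description above.

```python
def objects_needed(master_hosts, slots_per_master):
    """List one single-slot cluster queue and one pe per master slot.

    All object names are hypothetical and only serve as illustration.
    """
    master_objects = []
    for host in master_hosts:
        for slot in range(1, slots_per_master + 1):
            tag = f"{host}_{slot}"
            master_objects.append(
                {"queue": f"master_{tag}.q", "pe": f"pe_{tag}", "host": host, "slots": 1}
            )
    # The single compute cluster queue has to list every pe used on a master host.
    compute_queue = {"queue": "compute.q", "pe_list": [o["pe"] for o in master_objects]}
    return master_objects, compute_queue


masters, compute = objects_needed(["big", "big1"], slots_per_master=4)
print(len(masters), "master queues and", len(compute["pe_list"]), "pe objects")
```

Two master hosts with four slots each already require eight extra cluster queues and eight pe objects, which is why the scheduler has noticeably more work to do.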

3) Using load adjustments:

Description:

The scheduler uses load adjustments to avoid overloading a host. The system can be configured in such a way that the scheduler starts no more than one task on a host, even though more slots are available. We will use this configuration to achieve the desired goal.

Advantages:
Achieves exactly what we are looking for without any additional configuration objects.

Problems:
Slows down scheduling. Only one job requesting the master host will be started per scheduling run. Supporting backup master hosts is not easy.

The master machine may have only one queue instance, or all queue instances of the master machine have to share the same load threshold. Otherwise, this approach will not work.

Configuration:
I have the following setup:

qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@big                      BIP   0/4       0.02     sol-sparc64
----------------------------------------------------------------------------
small.q@small1                 BIP   0/1       0.00     lx24-amd64
----------------------------------------------------------------------------
small.q@small2                 BIP   0/1       0.02     sol-sparc64

And a configured pe in all queue instances:

qconf -sp make
pe_name               make
slots                 999
user_lists            NONE
xuser_lists           NONE
start_proc_args       NONE
stop_proc_args        NONE
allocation_rule       $fill_up
control_slaves        TRUE
job_is_first_task     FALSE
urgency_slots         min

We now go ahead and change load_thresholds in the all.q@big queue instance to a load value that is not used in the other queue instances, such as:

qconf -sq all.q
qname                 all.q
hostlist              big
seq_no                0
load_thresholds       NONE,[big=load_avg=4]

The load threshold used has to be a real load value; it cannot be a fixed or consumable value.

The next step to make our environment work is to change the scheduler configuration to the following:

qconf -ssconf
algorithm                         default
schedule_interval                 0:2:0
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              load_avg=4.100000
load_adjustment_decay_time        0:0:1

With job_load_adjustments configured like this, the scheduler adds an artificial load to each host that will run a task. On the master machine, the added load_avg of 4.1 exceeds the load threshold of 4, so only one task can be started there in each scheduling run. Since the load_adjustment_decay_time is only 1 second, the scheduler has forgotten about the artificial load by the next scheduling run and can start a new task on the master host. This way, we achieve what we have been looking for.
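
The effect of the artificial load can be sketched with a few lines of Python. This is a toy model, not the real scheduler: it assumes that a host stops accepting tasks once its adjusted load reaches or exceeds the threshold, that the real load stays at the reported 0.02, and that the adjustment has fully decayed before the next run (1 second decay versus a 2 minute schedule interval).

```python
def scheduling_run(hosts, tasks_wanted, adjustment):
    """Simulate one scheduling run and return {host: tasks_started}.

    hosts maps a name to {"load": float, "threshold": float or None, "free": int}.
    Each started task adds `adjustment` to the host's load for this run only.
    """
    adjusted = {name: h["load"] for name, h in hosts.items()}
    started = {name: 0 for name in hosts}
    for _ in range(tasks_wanted):
        for name, h in hosts.items():
            over = h["threshold"] is not None and adjusted[name] >= h["threshold"]
            if h["free"] - started[name] > 0 and not over:
                started[name] += 1
                adjusted[name] += adjustment  # artificial load, forgotten after the run
                break
        else:
            break  # no host can take another task in this run
    return started


# Values taken from the setup above: 4 slots, load_avg 0.02, threshold load_avg=4.
big = {"big": {"load": 0.02, "threshold": 4.0, "free": 4}}
print(scheduling_run(big, tasks_wanted=3, adjustment=4.1))  # with load adjustment
print(scheduling_run(big, tasks_wanted=3, adjustment=0.0))  # without load adjustment
```

With the adjustment of 4.1 only one task is started on big per run although four slots are free; without it, all three requested tasks would land there in the same run.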

Extended Configuration:
If multiple master hosts are required, one needs to create one pe object per master host. The compute hosts are part of all pe objects. The same rule as above still applies: each master host may have only one queue instance. The configuration of the all.q queue would look as follows:

qconf -sq all.q
qname                 all.q
hostlist              big
seq_no                0
load_thresholds       NONE,[big=load_avg=4],[big1=load_avg=4],[big2=load_avg=4]
pe_list               big_pe big1_pe big2_pe,[big=big_pe],[big1=big1_pe],[big2=big2_pe]

The job submit would look like:
qsub -pe "big*" 5 -masterq "all.q@big*" ....
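
Why the two wildcard requests end up selecting a single master host can be illustrated with a short Python sketch. It approximates Grid Engine's wildcard matching with shell-style patterns and works on a hand-written table of which pe is offered by which queue instance; the table mirrors the pe_list above, with the small.q instances assumed to offer all three pe objects.

```python
from fnmatch import fnmatchcase

# Which pe objects each queue instance offers (assumed from the pe_list above).
PE_BY_INSTANCE = {
    "all.q@big": ["big_pe"],
    "all.q@big1": ["big1_pe"],
    "all.q@big2": ["big2_pe"],
    "small.q@small1": ["big_pe", "big1_pe", "big2_pe"],
    "small.q@small2": ["big_pe", "big1_pe", "big2_pe"],
}


def master_candidates(pe_pattern, masterq_pattern, pe_by_instance):
    """Return {pe: [queue instances that may host the master task]}."""
    candidates = {}
    for instance, pes in pe_by_instance.items():
        if not fnmatchcase(instance, masterq_pattern):
            continue  # instance is not allowed as masterq
        for pe in pes:
            if fnmatchcase(pe, pe_pattern):
                candidates.setdefault(pe, []).append(instance)
    return candidates


print(master_candidates("big*", "all.q@big*", PE_BY_INSTANCE))
```

Each pe matched by "big*" has exactly one queue instance that qualifies as masterq, so whichever pe is picked for the job, its master task is tied to that one master host while the slave tasks can use the compute hosts.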




( Aug 08 2006, 09:44:12 AM CEST )