
Thursday, May 12, 2005
N1GE 6 - Scheduler: Department-Based policy setup
In his latest howto, Chris Dagdigian writes about configuring the
different policies of the N1GE 6 system to enable department-based
scheduling. He gives a good summary of the available policies and of how
to set up the share tree policy. His howto can be found at: http://bioteam.net/dag/sge6-funct-share-dept.html.
( May 12 2005, 10:51:28 AM CEST )
Permalink

Monday, April 25, 2005
N1GE 6 - Scheduler Hacks: The ticket policy hierarchy
N1GE 6 supports fair-share scheduling through its ticket policy. The
ticket policy consists of three different parts:
- the share tree policy
- the functional policy
- the override policy
The share tree and the functional policy each have a fixed amount of
tickets, which is distributed over all jobs in the system. The
override policy is open-ended. If "share override tickets" or "share
functional tickets" is enabled, the ticket amount of a job depends on
its submission time. The ticket amount in the share tree always depends
on the submission time as well as on past usage. Of course, all
three of them also depend on the assignments from the configuration.
The statement that a job's ticket amount depends on its submission
time is a bit of a simplification. An additional parameter comes into
play through the ticket policy hierarchy.
It specifies the order in which the different ticket policies are
called. The default is: OFS, which means:
1. O = override ticket policy
2. F = functional ticket policy
3. S = share tree policy
The real dependencies for the final tickets are:
compute override tickets:
- assigned tickets by the configuration
- job submission time
compute functional tickets:
- assigned tickets by the configuration
- assigned override tickets
- job submission time
compute share tree tickets:
- assigned tickets by the configuration
- usage
- assigned override tickets
- assigned functional tickets
- job submission time
As you can see, the previously computed tickets have an effect on the
following ticket policy. The next example demonstrates this.
Setup:
qconf -msconf
ticket_policy_hierarchy OFS
weight_tickets_functional 100000
weight_tickets_share 0
weight_ticket 1.000000
weight_waiting_time 0.000000
weight_deadline 3600000.000000
weight_urgency 0.000000
weight_priority 0.000000
qconf -aprj PRJ1
name PRJ1
oticket 10
fshare 0
acl NONE
xacl NONE
qconf -muser <NAME>
fshare 100
Jobs:
7 jobs without a project:
qsub $SGE_ROOT/examples/jobs/sleeper.sh
2 PRJ1 jobs:
qsub -P PRJ1 $SGE_ROOT/examples/jobs/sleeper.sh
qstat output:
JobId   P        S   Project  Tot-Tkt  ovrts  otckt  ftckt  stckt  shr
----------------------------------------------------------------------
223690  1.00000  qw  PRJ1       25010      0     10  25000      0  0.35
223691  0.50000  qw  PRJ1       12505      0      5  12500      0  0.18
223683  0.33320  qw  NA          8333      0      0   8333      0  0.12
223684  0.24990  qw  NA          6250      0      0   6250      0  0.09
223685  0.19992  qw  NA          5000      0      0   5000      0  0.07
223686  0.16660  qw  NA          4166      0      0   4166      0  0.06
223687  0.14280  qw  NA          3571      0      0   3571      0  0.05
223688  0.12495  qw  NA          3125      0      0   3125      0  0.04
223689  0.11107  qw  NA          2777      0      0   2777      0  0.04
Setup change:
I now modify the policy hierarchy:
qconf -msconf
ticket_policy_hierarchy FSO
and the qstat output changes to:
JobId   P        S   Project  Tot-Tkt  ovrts  otckt  ftckt  stckt  shr
----------------------------------------------------------------------
223683  1.00000  qw  NA         25000      0      0  25000      0  0.35
223684  0.50000  qw  NA         12500      0      0  12500      0  0.18
223685  0.33333  qw  NA          8333      0      0   8333      0  0.12
223686  0.25000  qw  NA          6250      0      0   6250      0  0.09
223687  0.20000  qw  NA          5000      0      0   5000      0  0.07
223688  0.16667  qw  NA          4166      0      0   4166      0  0.06
223689  0.14286  qw  NA          3571      0      0   3571      0  0.05
223690  0.12540  qw  PRJ1        3135      0     10   3125      0  0.04
223691  0.11131  qw  PRJ1        2782      0      5   2777      0  0.04
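The numbers in both qstat outputs follow a simple pattern, which the following Python sketch reproduces. It is only a model read off the two outputs, not the scheduler's actual algorithm. It assumes that the job at position k in the order in which the functional policy handles the jobs receives 25000 / k functional tickets, that the override tickets are exactly the ones listed, and that the P column is the total ticket amount divided by the largest total. The function name and the base value of 25000 are chosen for this illustration only.

```python
def ticket_table(job_order, override_tickets, base_functional=25000.0):
    """Model the qstat columns (JobId, P, Tot-Tkt, otckt, ftckt).

    job_order        -- job ids in the order the functional policy sees them
    override_tickets -- override tickets per job id, as shown by qstat
    base_functional  -- functional tickets of the first job (an assumption
                        read off the output, not a documented constant)
    """
    rows = []
    for position, job_id in enumerate(job_order, start=1):
        ftckt = base_functional / position
        otckt = override_tickets.get(job_id, 0.0)
        rows.append((job_id, ftckt + otckt, otckt, ftckt))
    highest = max(total for _, total, _, _ in rows)
    # qstat shows truncated ticket values; P is computed from the exact ones.
    return [(job_id, round(total / highest, 5), int(total), int(otckt), int(ftckt))
            for job_id, total, otckt, ftckt in rows]


plain_jobs = list(range(223683, 223690))   # the 7 jobs without a project
override = {223690: 10.0, 223691: 5.0}     # the 2 PRJ1 jobs

# OFS (model assumption): the PRJ1 jobs already hold override tickets
# and are therefore handled first by the functional policy.
ofs = ticket_table([223690, 223691] + plain_jobs, override)
# FSO (model assumption): no tickets exist yet when the functional policy
# runs, so it follows the submission order.
fso = ticket_table(plain_jobs + [223690, 223691], override)

for row in ofs + fso:
    print(row)
```

Under OFS the 10 override tickets move the PRJ1 jobs to the front of the functional distribution, which is worth about 25000 tickets. Under FSO the same override tickets are only added at the end and hardly change the ranking.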
( Apr 25 2005, 12:27:04 PM CEST )
Permalink

Monday, April 18, 2005
N1GE 6 - Scheduler Hacks: job execution priority
The nice level of a job can be set in different ways. The simple way is
to turn the reprioritization feature off (it is the default setting)
and set the nice level via the queue configuration.
qconf -mq all.q
priority 0
All jobs running in the queue instance will run with the defined nice
level. One can now easily configure different cluster queues (such as
low, medium, and high priority) with different nice levels.
This is the easy way and allows the user to decide how important the
job is by submitting it to a specific cluster queue.
This approach is not always fine-grained enough. Sometimes it is
important to rank the jobs based on the scheduling priority. A high
priority job should not only be scheduled as fast as possible but also
run at a lower nice level than low priority jobs. The importance
ranking for the scheduling decision is done via the ticket policy and
others. But only the ticket policy has a direct impact on the job nice
level when "reprioritize" is enabled. There are two places to enable
and control job reprioritization:
qconf -mconf
reprioritize 0
qconf -msconf
reprioritize_interval 0:0:0
One could assume that one can also influence the reprioritization via:
qconf -mconf <host_name>
but, even though the setting is accepted, it does not have an effect.
The "reprioritize" flag enables/disables the feature. If it is set to
true, the execd will monitor the usage of each job that it is
running. It knows the amount of tickets for each job and will ensure
that the usage ratio between the jobs matches the ticket ratio
between the jobs. Every job gets initial start tickets. The scheduler
will most certainly change them while the job is running. Therefore we
have the reprioritize_interval, which controls how often the tickets of
the jobs are updated on the execd side, so that the execd can use the
nice level to make the usage ratio reflect the ticket ratio. Since it
takes some time to adjust the jobs' usage via the nice level, the
tickets should not be sent too often. The recommendation is 2 minutes
for the reprioritize_interval.
If the reprioritize_interval is set to 0:0:0, the reprioritize feature
is disabled (i.e. reprioritize is set to 0). It also works the other
way around: setting a non-zero reprioritize_interval enables the
feature by setting reprioritize to 1.
A sample setup with two projects:
PRJ10 100 functional shares
PRJ1 10 functional shares
qstat shows:
JobId   P        S   Project  Tot-Tkt  ovrts  otckt  ftckt  stckt
-----------------------------------------------------------------
223670  1.50000  qw  PRJ10      25000      0      0  25000      0
223671  0.59091  qw  PRJ1        2272      0      0   2272      0
223672  0.50000  r   NA             0      0      0      0      0
223673  0.50000  r   NA             0      0      0      0      0
223674  0.50000  r   NA             0      0      0      0      0
Top output (note the changes in the nice level):
1)
  PID USERNAME THR PRI NICE  SIZE   RES STATE TIME    CPU COMMAND
11137 sg144703   1  40  -10 4240K 3352K cpu/3 2:35 21.93% work
11139 sg144703   1  37   -9 4240K 3352K cpu/2 2:31 21.10% work
11749 sg144703   1   0   17 4240K 3352K cpu/0 1:07 17.66% work
11743 sg144703   1   0   12 4240K 3352K run   0:54 17.20% work
11751 sg144703   1   0   19 4240K 3352K run   1:04 14.00% work
2)
  PID USERNAME THR PRI NICE  SIZE   RES STATE TIME    CPU COMMAND
11137 sg144703   1  30  -10 4240K 3352K cpu/1 3:23 23.92% work
11139 sg144703   1  27   -9 4240K 3352K cpu/2 3:19 23.41% work
11743 sg144703   1   0   19 4240K 3352K run   1:21 16.28% work
11751 sg144703   1   0   18 4240K 3352K cpu/3 1:36 15.28% work
11749 sg144703   1   8   17 4240K 3352K run   1:37 12.66% work
3)
11137 sg144703   1  30  -10 4240K 3352K run   4:30 24.02% work
11139 sg144703   1  27   -9 4240K 3352K cpu/1 4:26 23.92% work
11751 sg144703   1   0   19 4240K 3352K cpu/3 2:25 16.32% work
11749 sg144703   1   0   15 4240K 3352K run   2:17 13.70% work
11743 sg144703   1   0   17 4240K 3352K run   1:56 11.83% work
And the qstat usage output:
job-id  project  department  state  cpu         mem      io       tckts  ovrts  otckt  ftckt  stckt
---------------------------------------------------------------------------------------------------
223670  PRJ10    defaultdep  r      0:00:04:41  1.13824  0.00000  90909      0      0  90909      0
223671  PRJ1     defaultdep  r      0:00:04:37  1.11933  0.00000   9090      0      0   9090      0
223672  NA       defaultdep  r      0:00:02:04  0.50110  0.00000      0      0      0      0      0
223673  NA       defaultdep  r      0:00:02:25  0.58774  0.00000      0      0      0      0      0
223674  NA       defaultdep  r      0:00:02:29  0.60243  0.00000      0      0      0      0      0
The machine used for this test had 4 processors, and there were
always enough CPUs for the PRJ10 and PRJ1 jobs. Therefore they have more
or less the same usage. The other jobs are way behind, because they have
to share the resources with the other tasks. The min / max
values for the nice level are defined in the source file: source/daemons/execd/ptf.h
A different job mix results in different nice levels:
qstat output:
JobId   P        S   Project  Tot-Tkt  ovrts  otckt  ftckt  stckt
-----------------------------------------------------------------
223675  1.50000  r   PRJ10      30303      0      0  30303      0
223676  1.50000  r   PRJ10      30303      0      0  30303      0
223677  1.50000  r   PRJ10      30303      0      0  30303      0
223678  0.80000  r   PRJ1        9090      0      0   9090      0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
top:
1)
21625 sg144703   1  40  -10 4240K 3352K cpu/1 1:32 20.61% work
21589 sg144703   1  47   -9 4240K 3352K run   1:34 20.50% work
21590 sg144703   1  37   -9 4240K 3352K cpu/0 1:34 20.30% work
21633 sg144703   1   0   16 4240K 3352K run   1:21 18.31% work
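The ticket amounts of the running jobs in the two job mixes are consistent with a plain proportional split, which the following Python sketch illustrates. It assumes a pool of 100000 functional tickets (the value used in the ticket policy hierarchy example; it is not stated for this test), that the pool is divided between the projects in proportion to their functional shares, and that the running jobs of one project share the project's amount equally. The function is hypothetical and only models the observed numbers. The values shown for pending jobs (25000 and 2272) are not covered by it.

```python
def functional_tickets(shares, running_jobs, pool=100000):
    """Return the truncated per-job functional tickets for each project.

    shares       -- functional shares per project
    running_jobs -- number of running jobs per project
    pool         -- total functional tickets (assumed value)
    """
    # Only projects that actually have running jobs take part in the split.
    active = {p: s for p, s in shares.items() if running_jobs.get(p, 0) > 0}
    total_shares = sum(active.values())
    return {p: int(pool * s / total_shares / running_jobs[p])
            for p, s in active.items()}


shares = {"PRJ10": 100, "PRJ1": 10}
first_mix = functional_tickets(shares, {"PRJ10": 1, "PRJ1": 1})
second_mix = functional_tickets(shares, {"PRJ10": 3, "PRJ1": 1})
print(first_mix)
print(second_mix)
```

In the second mix the three PRJ10 jobs split the project's amount, so each of them holds about a third of the tickets the single PRJ10 job had in the first mix, while the PRJ1 job keeps its amount.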
The nice range used might be a bit extreme. There are two switches to
specify the range: the settings PTF_MIN_PRIORITY and PTF_MAX_PRIORITY
control the nice range that is used. They can be set via:
qconf -mconf
execd_params PTF_MIN_PRIORITY=19, PTF_MAX_PRIORITY=0
( Apr 18 2005, 05:00:32 PM CEST )
Permalink

Wednesday, April 13, 2005
N1GE 6 - Scheduler Hacks: Comment on the qmaster <-> scheduler protocol
If one is using any of the ticket policies, one will most likely see
something similar to:
04/12/2005 09:16:28|qmaster|xxx|E| orders user/project version (16468) is not uptodate (16469) for user/project "PRJ147"
in the qmaster messages file ($SGE_CELL/spool/qmaster/messages). I
would like to explain how this can happen and why it is not necessarily
a bug when these messages are logged. The scheduler is implemented as
an event client. This means that it will receive an event whenever an
object in the qmaster is added, removed, or modified. These events are
usually delivered to the event clients right away or with a delay that
the event client can specify. In the case of the scheduler, it is every
schedule_interval (default 15s). The event delivery does not only
update the data in the scheduler but also triggers a scheduling run.
Depending on the number of jobs and their complexity, it can
take a while before a scheduling run has finished. With a few tens of
thousands of jobs in the system it might take longer than the
scheduling interval.
In this case, a second event client configuration setting comes into play.
It specifies what the event master should do when events are
not acknowledged or the client is busy. In the case of the scheduler, no
events are sent while the event client is marked as busy. This means
that the scheduling data will not be updated during a scheduling run.
It can happen that an administrator modifies an object during a
scheduling run. This will lead to the error message we saw at the
beginning. After every scheduling run, the scheduler sends a package of
orders to the qmaster. While the qmaster executes the orders, it
validates them and ensures that the affected objects did not change. If
such a change is detected, the order will be ignored and we see an error
message that the order failed.
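The validation step can be pictured as a version check on the affected object. The following Python sketch is only an illustration of that idea, not qmaster code: the classes, the field names, and the rule that every modification increases the version by one are assumptions made for this example. Only the wording of the message is taken from the log line above.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    version: int       # assumed to be bumped on every modification
    tickets: int = 0


@dataclass(frozen=True)
class Order:
    project: str
    seen_version: int  # object version the scheduler based its decision on
    tickets: int


def apply_order(projects, order):
    """Apply one order; return None on success or an error text if it is stale."""
    target = projects[order.project]
    if order.seen_version != target.version:
        # The object changed during the scheduling run: ignore the order.
        return (f'orders user/project version ({order.seen_version}) is not '
                f'uptodate ({target.version}) for user/project "{target.name}"')
    target.tickets = order.tickets
    return None


projects = {"PRJ147": Project("PRJ147", version=16468)}
# The scheduler creates the order during its run ...
order = Order("PRJ147", seen_version=16468, tickets=500)
# ... while an administrator modifies the project (e.g. via qconf -mprj).
projects["PRJ147"].version += 1
stale_result = apply_order(projects, order)
print(stale_result)

# An order from the next run is based on the current version and is accepted.
fresh_result = apply_order(projects, Order("PRJ147", seen_version=16469, tickets=500))
```

The rejected order does no harm in this model: the next scheduling run works on the updated data and produces a valid order.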
Commands which might lead to the error message:
- qconf -mq // modify a queue
- qmod // change a queue
- qconf -clearusage
- qconf -mprj // modify a project
and others.
Due to bugs in the event master, these error messages were logged quite
frequently in older versions (N1GE 6.0 FCS, u1, u2, and u3). However, if
nobody changed anything and these error messages are still logged, one
might have found a bug.
( Apr 13 2005, 02:13:36 PM CEST )
Permalink

Friday, April 08, 2005
Untidy? Overworked? Rent a German
Well, everyone knows that Germans like to travel, and everyone knows that they have other "good" characteristics. One thing they were not known for yet was making money. But with millions unemployed, surely someone will come up with new and "different" ideas. The Sunday Independent features an article about renting a German. If you want to have a look around yourself, take a look at the Rent a German web site.
( Apr 08 2005, 10:23:31 AM CEST )
Permalink

Tuesday, April 05, 2005
N1GE 6 - Scheduler Hacks: "least used" / "fill up" configuration
I also want to use this blog to talk about some "hacks" around the N1 Grid Engine 6 software and its scheduler. The scheduler in the Grid Engine project is in theory a very comfortable tool. It decides for the user where to run the jobs. The user only has to specify a couple of constraints which have to be met by the execution host. However, there are some configuration settings which are not very intuitive. The one I want to talk about today is the configuration of "least used host first" and "fill up host".
I will not go into the details of the Grid Engine terminology. I also assume that the reader has a basic understanding of the N1 Grid Engine 6 software.
The default setting for the scheduler is to distribute jobs based on load. It looks at every host and assigns a job to the host with the least load. This is not always desired. There are use cases where the scheduler should distribute the jobs equally over all available hosts and only assign multiple jobs to one host when all hosts are in use or, on the contrary, fill up one host first before assigning jobs to the next host.
I think that the equal distribution will be the more useful use case. For example:
It is useful in the case of over-subscribing the hosts in the grid. "Over-subscription" means that one host will execute more jobs than it has CPUs. The setting "use least used host first" ensures that all available CPUs are used first, before Grid Engine starts to over-subscribe a host.
I will set up a grid with two hosts ("host A", "host B") and one cluster queue "all.q" in my example. The hosts are referenced in the host group "@allhosts".
'qconf -sq all.q' will show (reduced to the important details):
qname all.q
hostlist @allhosts
slots 1,[host_A=4],[host_B=4]
We see that each host can run 4 jobs at the same time. To prepare the hosts for "least used host first" or "fill up" we have to configure:
'qconf -me host_A' and set complex_values slots=4:
hostname host_A
load_scaling NONE
complex_values slots=4
We do the same setting for host_B ('qconf -me host_B')
This setting is needed because the scheduler distributes jobs to the hosts based on load values. A load value can come from an external script, which reports values (such as load_avg, mem_use, ...), or from a consumable. We use a consumable for our configuration. To give the scheduler access to the value, we need to define it for each host, as we just did. The N1 Grid Engine 6 software will now count the running jobs not only on the queue level but also on the host level. If the sum of slots for all queue instances on a given host is bigger than the defined value for that host, the scheduler will limit the number of running jobs on that host to the value defined in the host configuration. If we have fewer slots in all queue instances on that host than defined for that host, the number of running jobs will not exceed the number of slots in the queue instances.
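The resulting limit is simply the smaller of the two values. A minimal Python sketch of this rule follows; the function name and the second queue in the usage lines are hypothetical and only serve as an illustration.

```python
def max_running_jobs(queue_instance_slots, host_slots):
    """Upper bound for running jobs on one host.

    queue_instance_slots -- slots of every queue instance on the host
    host_slots           -- "slots" value from the host's complex_values
    """
    return min(sum(queue_instance_slots), host_slots)


# host_A from the example: one queue instance (all.q) with 4 slots, host limit 4.
example_setup = max_running_jobs([4], 4)
# A hypothetical second queue with 4 more slots would not raise the limit.
two_queues = max_running_jobs([4, 4], 4)
# A higher host limit does not help either if the queue only offers 4 slots.
large_host_limit = max_running_jobs([4], 8)
print(example_setup, two_queues, large_host_limit)
```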
After we did the preparation, we need to tell the scheduler to use the least used host first or to fill it up.
To enable "use least used host first" we configure:
'qconf -msconf'
and set "queue_sort_method load" and "load_formula -slots".
algorithm default
schedule_interval 0:2:0
maxujobs 0
queue_sort_method load
job_load_adjustments NONE
load_adjustment_decay_time 0:0:0
load_formula -slots
schedd_job_info true
flush_submit_sec 1
flush_finish_sec 1
To enable "fill up host" we configure:
'qconf -msconf' and set "queue_sort_method load" and "load_formula slots".
algorithm default
schedule_interval 0:2:0
maxujobs 0
queue_sort_method load
job_load_adjustments NONE
load_adjustment_decay_time 0:0:0
load_formula slots
schedd_job_info true
flush_submit_sec 1
flush_finish_sec 1
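The effect of the sign in the load formula can be shown with a small Python simulation. It is a sketch under assumptions, not scheduler code: it assumes that the value of the slots consumable on a host is the number of slots that are still free, that the host with the lowest load formula result is chosen, and that ties go to the first host by name. The function name and the sign parameter are made up for this example.

```python
def place_jobs(free_slots, job_count, sign):
    """Return the host chosen for each job.

    free_slots -- free slots per host at the start
    sign       -- -1 models load_formula "-slots" (least used host first),
                  +1 models load_formula "slots" (fill up host)
    """
    free = dict(free_slots)
    placement = []
    for _ in range(job_count):
        candidates = [host for host in free if free[host] > 0]
        if not candidates:
            break  # every host is full, the remaining jobs stay pending
        # Lowest load value wins; ties are resolved by host name.
        host = min(candidates, key=lambda h: (sign * free[h], h))
        free[host] -= 1
        placement.append(host)
    return placement


hosts = {"host_A": 4, "host_B": 4}
least_used = place_jobs(hosts, 6, sign=-1)
fill_up = place_jobs(hosts, 6, sign=+1)
print(least_used)
print(fill_up)
```

With "-slots" the jobs alternate between the two hosts, while with "slots" host_A receives four jobs before host_B gets its first one.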
If performance is critical to you, I am not sure that I can recommend this configuration. If that is the case, please use profiling to validate the performance impact.
Well, having set up the scheduler this way, one might wonder how this setting works together with the parallel environment (pe) allocation rule. By default, whatever is specified in the pe overrides the scheduler configuration. Only if "pe_slots" is set as the allocation rule is the scheduler configuration used.
( Apr 05 2005, 09:50:24 PM CEST )
Permalink

Saturday, April 02, 2005
New blogger and a bit about climbing Fuji-san
Hello,
I am another blogger here in this space. I do not know yet what I am going to blog about. It will most likely be about our product, the N1 Grid Engine 6, and some other stuff, most likely some nonsense. About myself: I am part of the N1 Grid Engine engineering team in Regensburg, Germany. Daniel Templeton has already written a lot about his experiences as an American in a German engineering team far, far away from the Sun core development centres. Every once in a while, one gets the impression that we are a little town in the enemy’s domain, similar to Asterix and Obelix… ;-). I am a north German working and living in south Germany. It might be unbelievable, but I still do not understand the German spoken in Bavaria, and it does not make it simpler that every town speaks its own dialect…
Okay, why not start with a little nonsense…
I have done software development for some time now and a couple of years ago happened to work in Japan. Among the trips I took to get to know the country, I climbed Mount Fuji. On the way back to Tokyo, I came up with the idea that climbing Fuji-san is the perfect metaphor for software development, at least for most projects.
We started the climb at 11:00 p.m. in the dark. We followed the narrow tracks up the mountain with not more than a tiny flashlight. This is the same for software projects. One starts with nearly no knowledge of the project. One looks around, tries to follow the tiny path and hopes not to stumble and fall. The path is a very stony and sandy one. One has to watch one's steps.
After being up the mountain a bit, we saw a thunderstorm coming our way. We saw the clouds coming and the lightning, how it jumped between the clouds before the flash went down to earth. That was a very scary moment. We almost had to turn around and abandon the project. These are the usual worries during development. Is one's political situation stable and strong enough? Will the company have enough money to continue the funding, or will we work on a different project next week? We were lucky: the wind changed, and the thunderstorm did not get close to Fuji-san. We witnessed an unbelievable show.
Continuing from there, it got rougher. The sand was gone, and we had to climb. On some occasions, we had to go down on all fours and really climb. The path was about 1 meter wide and we came into a “traffic jam”. Too many people were climbing up the mountain. This was a very dangerous part. We could hardly see the rocks. The batteries of my flashlight were drained and I had no light. Other climbers, who used sticks to make the climb easier, did not know what to do with their sticks. More than once, I was almost hit by one, because that person did not care about the others. Any difference to software engineering in big companies with little budget? Groups are fighting for funding. They do not always care about the other groups, right? What about groups being a threat, because they want to work in the same space? One of our group members had bad luck and hurt her knee. She missed one of the rocks while she blocked a stick. She had to turn around and go down…
Having come that far, the light was not a problem anymore. The trail got easier, but the air got thin (we were over 3500 meters above sea level). We were very short of breath; we were exhausted. We almost turned around, but it was not possible. I think nobody is surprised when I tell you that we did not make it all the way to the top in time. We should have been on top during the sunrise. But due to the crowds and the exhaustion we did not make it. We were about 100 meters below the top when the sun rose. Again, so much different from software development? Most of the projects I know are never on time. There is always something that no one expected. Something surprising. The good news is that in most cases, one has the extra time. It makes the product better and it ensures that the programmers do not exhaust themselves too much. For us it was also better not to be on the top during the sunrise (because of the crowds up there). We paused and found a good place for resting. We were still sitting there when the sun rose. We had a very, very good view of the sunrise.
A moment later, we reached the top and finished our project. But it was not nice up there. Cold, windy. We did not stay there for long. The way down was also much more painful than we expected. However, we did it. Only our team-mate who got injured did not make it all the way, and it took her very, very long to get back to the base station. Sitting in the bus and reflecting on the climb, I can say we were not properly equipped and not well enough trained. We also had wrong information about the climb. It was much tougher than the descriptions we had read before we started said. We needed 11 hours for the climb.
Well, what a surprise. I usually find this to be very similar to developing projects…
Though, does this work as a metaphor for software development? I consider this a nice little analogy. My girlfriend said that this text sounds a bit depressing; that should not be the case. I like challenges. This was not the first mountain I climbed and will certainly not be the last one. The same is true for software development. I enjoy the challenges during the development and the pride of having managed something that seemed to be impossible at the beginning. It is fun to have the possibility to do something new every day. I like it a lot.
And now a couple of pictures from the climb:
The climb in the dark:
The last station before the top:
On the way down:
Being back at the base:
( Apr 02 2005, 08:01:10 PM CEST )
Permalink