Stephan Grell's Weblog

Thursday, May 12, 2005

N1GE 6 - Scheduler: Department-Based policy setup
In his latest howto, Chris Dagdigian writes about configuring the different policies of the N1GE 6 system to enable department-based scheduling. He gives a good summary of the available policies and explains how to set up the share tree policy. His howto can be found at: http://bioteam.net/dag/sge6-funct-share-dept.html.
( May 12 2005, 10:51:28 AM CEST ) Permalink Comments [0]

Monday, April 25, 2005

N1GE 6 - Scheduler Hacks: The ticket policy hierarchy
N1GE 6 supports fair-share scheduling through its ticket policy. The ticket policy consists of three parts:

- the share tree policy
- the functional policy
- the override policy

The share tree and the functional policy each have a fixed number of tickets, which is distributed over all jobs in the system. The override policy is open-ended. If "share override tickets" or "share functional tickets" is enabled, the number of tickets a job receives from that policy depends on its submission time. The number of share tree tickets always depends on the submission time as well as on past usage. Of course, all three policies also depend on the assignments from the configuration.

The statement that a job's ticket amount depends on its submission time is a simplification. An additional parameter comes into play through the ticket policy hierarchy, which specifies the order in which the different ticket policies are evaluated. The default is OFS, which means:

1. O = override ticket policy
2. F = functional ticket policy
3. S = share tree policy

The real dependencies for the final tickets are:

compute override tickets:
 - assigned tickets by the configuration
 - job submission time

compute functional tickets:
 - assigned tickets by the configuration
 - assigned override tickets
 - job submission time

compute share tree tickets:
 - assigned tickets by the configuration
 - usage
 - assigned override tickets
 - assigned functional tickets
 - job submission time

As you can see, the previously computed tickets affect the ticket policies that follow. The next example demonstrates this.

Setup:
qconf -msconf
   ticket_policy_hierarchy   OFS
   weight_tickets_functional         100000
   weight_tickets_share              0
   weight_ticket                     1.000000
   weight_waiting_time               0.000000
   weight_deadline                   3600000.000000
   weight_urgency                    0.000000
   weight_priority                   0.000000

qconf -aprj PRJ1
   name PRJ1
   oticket 10
   fshare 0
   acl NONE
   xacl NONE

qconf -muser <NAME>
   fshare 100

Jobs:
7 jobs without a project:
   qsub $SGE_ROOT/examples/jobs/sleeper.sh

2 PRJ1 jobs:
   qsub -P PRJ1 $SGE_ROOT/examples/jobs/sleeper.sh

qstat output:
JobId     P      S   Project  Tot-Tkt   ovrts   otckt  ftckt   stckt   shr
--------------------------------------------------------
223690 1.00000   qw     PRJ1   25010       0      10   25000       0    0.35
223691 0.50000   qw     PRJ1   12505       0       5   12500       0    0.18
223683 0.33320   qw       NA    8333       0       0    8333       0    0.12
223684 0.24990   qw       NA    6250       0       0    6250       0    0.09
223685 0.19992   qw       NA    5000       0       0    5000       0    0.07
223686 0.16660   qw       NA    4166       0       0    4166       0    0.06
223687 0.14280   qw       NA    3571       0       0    3571       0    0.05
223688 0.12495   qw       NA    3125       0       0    3125       0    0.04
223689 0.11107   qw       NA    2777       0       0    2777       0    0.04

Setup change:
I now modify the policy hierarchy:

qconf -msconf
   ticket_policy_hierarchy   FSO

and the qstat output changes to:
JobId     P      S   Project  Tot-Tkt   ovrts   otckt  ftckt   stckt   shr
--------------------------------------------------------
223683 1.00000   qw       NA   25000       0       0   25000       0    0.35
223684 0.50000   qw       NA   12500       0       0   12500       0    0.18
223685 0.33333   qw       NA    8333       0       0    8333       0    0.12
223686 0.25000   qw       NA    6250       0       0    6250       0    0.09
223687 0.20000   qw       NA    5000       0       0    5000       0    0.07
223688 0.16667   qw       NA    4166       0       0    4166       0    0.06
223689 0.14286   qw       NA    3571       0       0    3571       0    0.05
223690 0.12540   qw     PRJ1    3135       0      10    3125       0    0.04
223691 0.11131   qw     PRJ1    2782       0       5    2777       0    0.04
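
The ticket columns in both tables are consistent with a simple model, sketched here in Python. It is an inference from the qstat output above, not the scheduler's actual algorithm: the k-th job in a policy's ordering receives that policy's base amount divided by k (rounded down), and a policy that runs later orders the jobs by the tickets they already hold. The base of 25000 functional tickets per job, the function name ticket_table, and the tuple layout are assumptions chosen to match the tables.

```python
FUNCTIONAL_BASE = 25000  # assumed: largest value seen in the ftckt column
OVERRIDE_BASE = 10       # oticket of project PRJ1 from the setup


def ticket_table(hierarchy, jobs):
    """jobs: list of (job_id, project) in submission order.

    Returns {job_id: (override, functional, total)} under the assumed model.
    """
    override = {job_id: 0 for job_id, _ in jobs}
    functional = {job_id: 0 for job_id, _ in jobs}
    for policy in hierarchy:
        # Jobs that already hold more tickets come first; sorted() is
        # stable, so ties keep their submission order.
        ordered = sorted(jobs, key=lambda j: -(override[j[0]] + functional[j[0]]))
        if policy == "O":
            prj_jobs = [j for j in ordered if j[1] == "PRJ1"]
            for k, (job_id, _) in enumerate(prj_jobs, start=1):
                override[job_id] = OVERRIDE_BASE // k
        elif policy == "F":
            for k, (job_id, _) in enumerate(ordered, start=1):
                functional[job_id] = FUNCTIONAL_BASE // k
        # "S" is skipped because weight_tickets_share is 0 in this setup.
    return {j: (override[j], functional[j], override[j] + functional[j])
            for j, _ in jobs}


# Seven jobs without a project, then two PRJ1 jobs, as in the example.
jobs = [(job_id, None) for job_id in range(223683, 223690)]
jobs += [(223690, "PRJ1"), (223691, "PRJ1")]

for hierarchy in ("OFS", "FSO"):
    print(hierarchy)
    table = ticket_table(hierarchy, jobs)
    for job_id, (o, f, total) in sorted(table.items(), key=lambda kv: -kv[1][2]):
        print(f"  {job_id}  otckt={o:<3} ftckt={f:<6} Tot-Tkt={total}")
```

With OFS, the two PRJ1 jobs already hold override tickets when the functional policy runs, so they take the first two positions. With FSO, the functional policy runs first and sees only the submission order, which puts the PRJ1 jobs last.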

( Apr 25 2005, 12:27:04 PM CEST ) Permalink Comments [0]

Monday, April 18, 2005

N1GE 6 - Scheduler Hacks: job execution priority

The nice level of a job can be set in different ways. The simple way is to leave the reprioritization feature turned off (which is the default setting) and set the nice level via the queue configuration.

qconf -mq all.q
   priority  0

All jobs running in the queue instance will run with the defined nice level. One can now easily configure different cluster queues (such as low, medium, and high priority) with different nice levels.

This is the easy way and allows the user to decide how important the job is by submitting it to a specific cluster queue.

This approach is not always fine-grained enough. Sometimes it is important to rank the jobs based on their scheduling priority. A high priority job should not only be scheduled as fast as possible but also run at a lower nice level than low priority jobs. The importance ranking for the scheduling decision is done via the ticket policy and others, but only the ticket policy has a direct impact on the job nice level when "reprioritize" is enabled. There are two places to enable and control job reprioritization:

qconf -mconf
   reprioritize  0

qconf -msconf
   reprioritize_interval  0:0:0

One might assume that the reprioritization can also be influenced via:

qconf -mconf <host_name>

but, even though the setting is accepted, it does not have an effect. The "reprioritize" flag enables or disables the feature. If it is set to true, the execd monitors the usage of each job that it is running. It knows the number of tickets for each job and ensures that the usage ratio between the jobs matches their ticket ratio. Every job gets initial tickets when it starts. The scheduler will most certainly change them while the job is running. This is what the reprioritize_interval is for: it controls how often the updated tickets are sent to the execd, which then adjusts the nice levels so that the usage ratio again reflects the ticket ratio. Since it takes some time to adjust the jobs' usage via the nice level, the tickets should not be sent too often. The recommendation is 2 minutes for the reprioritize_interval.

If the reprioritize_interval is set to 0:0:0, the reprioritize feature is disabled (i.e. reprioritize is set to 0). It also works the other way around: setting reprioritize_interval to a non-zero value enables the feature by setting reprioritize to 1.

A sample setup with two projects:

PRJ10 100 functional shares
PRJ1   10 functional shares

qstat shows:

JobId     P      S   Project  Tot-Tkt   ovrts   otckt  ftckt   stckt   
--------------------------------------------------
223670 1.50000   qw    PRJ10   25000       0       0   25000       0
223671 0.59091   qw     PRJ1    2272       0       0    2272       0
223672 0.50000    r       NA       0       0       0       0       0
223673 0.50000    r       NA       0       0       0       0       0
223674 0.50000    r       NA       0       0       0       0       0

Top output (note the changes in the nice level):
1)
  PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
11137 sg144703   1  40  -10 4240K 3352K cpu/3    2:35 21.93% work
11139 sg144703   1  37   -9 4240K 3352K cpu/2    2:31 21.10% work
11749 sg144703   1   0   17 4240K 3352K cpu/0    1:07 17.66% work
11743 sg144703   1   0   12 4240K 3352K run      0:54 17.20% work
11751 sg144703   1   0   19 4240K 3352K run      1:04 14.00% work

2)
  PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
11137 sg144703   1  30  -10 4240K 3352K cpu/1    3:23 23.92% work
11139 sg144703   1  27   -9 4240K 3352K cpu/2    3:19 23.41% work
11743 sg144703   1   0   19 4240K 3352K run      1:21 16.28% work
11751 sg144703   1   0   18 4240K 3352K cpu/3    1:36 15.28% work
11749 sg144703   1   8   17 4240K 3352K run      1:37 12.66% work

3)
11137 sg144703   1  30  -10 4240K 3352K run      4:30 24.02% work
11139 sg144703   1  27   -9 4240K 3352K cpu/1    4:26 23.92% work
11751 sg144703   1   0   19 4240K 3352K cpu/3    2:25 16.32% work
11749 sg144703   1   0   15 4240K 3352K run      2:17 13.70% work
11743 sg144703   1   0   17 4240K 3352K run      1:56 11.83% work

And the qstat usage output:

 job-id project          department state cpu        mem       io    tckts ovrts otckt ftckt stckt
 ----------------------------------------------------------------------
 223670 PRJ10            defaultdep r     0:00:04:41 1.13824 0.00000 90909     0     0 90909     0
 223671 PRJ1             defaultdep r     0:00:04:37 1.11933 0.00000  9090     0     0  9090     0
 223672 NA               defaultdep r     0:00:02:04 0.50110 0.00000     0     0     0     0     0
 223673 NA               defaultdep r     0:00:02:25 0.58774 0.00000     0     0     0     0     0
 223674 NA               defaultdep r     0:00:02:29 0.60243 0.00000     0     0     0     0     0

The machine used for this test had 4 processors, and there were always enough CPUs for the PRJ10 and PRJ1 jobs. Therefore they have more or less the same usage. The other jobs have to share the remaining resources with other tasks and are way behind. The min / max values for the nice level are defined in the source file: source/daemons/execd/ptf.h
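
The ftckt values in the qstat usage output are consistent with splitting a pool of functional tickets in proportion to the project shares. The Python sketch below reproduces them under two assumptions that are mine, not taken from the scheduler source: the pool is 100000 tickets (roughly what the two values add up to), and a project's tickets are divided evenly among its running jobs. The function name is hypothetical.

```python
TOTAL_FUNCTIONAL_TICKETS = 100000  # assumed pool size


def split_functional_tickets(project_shares, running_jobs):
    """project_shares: {project: functional shares}
    running_jobs: {job_id: project name, or None for no project}

    Returns {job_id: tickets}, rounded down.
    """
    # Only projects that actually have running jobs compete for tickets.
    active = {p for p in running_jobs.values() if p in project_shares}
    total_shares = sum(project_shares[p] for p in active)
    tickets = {}
    for job_id, project in running_jobs.items():
        if project not in project_shares:
            tickets[job_id] = 0  # jobs without a project get no tickets here
            continue
        jobs_in_project = sum(1 for p in running_jobs.values() if p == project)
        tickets[job_id] = (TOTAL_FUNCTIONAL_TICKETS * project_shares[project]
                           // (total_shares * jobs_in_project))
    return tickets


shares = {"PRJ10": 100, "PRJ1": 10}
running = {223670: "PRJ10", 223671: "PRJ1",
           223672: None, 223673: None, 223674: None}
print(split_functional_tickets(shares, running))
```

For the five running jobs this gives 90909 tickets for the PRJ10 job, 9090 for the PRJ1 job, and 0 for the jobs without a project, matching the table.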

A different job mix results in different nice levels:

qstat output:

JobId     P      S   Project  Tot-Tkt   ovrts   otckt  ftckt   stckt  
----------------------------------------------------
223675 1.50000    r    PRJ10   30303       0       0   30303       0
223676 1.50000    r    PRJ10   30303       0       0   30303       0
223677 1.50000    r    PRJ10   30303       0       0   30303       0
223678 0.80000    r     PRJ1    9090       0       0    9090       0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

top:
1)
21625 sg144703   1  40  -10 4240K 3352K cpu/1    1:32 20.61% work
21589 sg144703   1  47   -9 4240K 3352K run      1:34 20.50% work
21590 sg144703   1  37   -9 4240K 3352K cpu/0    1:34 20.30% work
21633 sg144703   1   0   16 4240K 3352K run      1:21 18.31% work

The nice range used here might be a bit extreme. There are two switches to specify the range: the settings PTF_MIN_PRIORITY and PTF_MAX_PRIORITY control which nice range is used. They can be set via:

qconf -mconf
execd_params       PTF_MIN_PRIORITY=19, PTF_MAX_PRIORITY=0

( Apr 18 2005, 05:00:32 PM CEST ) Permalink Comments [2]

Wednesday, April 13, 2005

N1GE 6 - Scheduler Hacks: Comment on the qmaster <-> scheduler protocol

If one is using any of the ticket policies one will most likely see something similar to:

04/12/2005 09:16:28|qmaster|xxx|E| orders user/project version (16468) is not uptodate (16469) for user/project "PRJ147"

in the qmaster messages file ($SGE_CELL/spool/qmaster/messages). I would like to explain how this can happen and why it is not necessarily a bug when these messages are logged.

The scheduler is implemented as an event client. This means that it receives an event whenever an object in the qmaster is added, removed, or modified. These events are usually delivered to the event clients right away or with a delay that the event client can specify. In the case of the scheduler, they are delivered once per schedule_interval (default 15s). The event delivery not only updates the data in the scheduler but also triggers a scheduling run. Depending on the number of jobs and their complexity, it can take a while before a scheduling run has finished. With a few tens of thousands of jobs in the system, it might take longer than the scheduling interval. In this case, a second event client configuration setting comes into play. It specifies what the event master should do when events are not acknowledged or the client is busy. In the case of the scheduler, no events are sent while the event client is marked as busy. This means that the scheduling data is not updated during a scheduling run.

It can happen that an administrator modifies an object during a scheduling run. This leads to the error message we saw at the beginning. After every scheduling run, the scheduler sends a package of orders to the qmaster. While the qmaster executes the orders, it validates them and ensures that the affected objects did not change. If such a change is detected, the order is ignored and we see an error message saying that the order failed.
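
The validation step can be pictured as a version check on each order. The Python sketch below is only an illustration of that idea: the class Qmaster, its methods, and the version counter are hypothetical and do not mirror the real qmaster code. The log text merely imitates the message quoted above.

```python
class Qmaster:
    """Toy model: every object carries a version that grows on each change."""

    def __init__(self):
        self.versions = {}   # object name -> current version
        self.messages = []   # stand-in for the qmaster messages file

    def modify(self, name):
        # Any add/modify operation bumps the object's version.
        self.versions[name] = self.versions.get(name, 0) + 1

    def snapshot(self):
        # What the scheduler knows when its scheduling run starts.
        return dict(self.versions)

    def execute_orders(self, orders):
        # Each order names an object and the version the scheduler saw.
        applied = []
        for name, seen_version in orders:
            current = self.versions[name]
            if seen_version != current:
                # The object changed during the scheduling run: ignore the order.
                self.messages.append(
                    f"orders user/project version ({seen_version}) is not "
                    f'uptodate ({current}) for user/project "{name}"')
                continue
            applied.append(name)
        return applied


qmaster = Qmaster()
qmaster.modify("PRJ147")                # project exists, version 1
scheduler_view = qmaster.snapshot()     # scheduling run starts, no more events
qmaster.modify("PRJ147")                # administrator edits the project meanwhile
orders = [("PRJ147", scheduler_view["PRJ147"])]
applied = qmaster.execute_orders(orders)
print(applied, qmaster.messages)
```

The order is dropped because the scheduler computed it against version 1 while the qmaster already holds version 2. The next scheduling run starts from fresh data, so a single rejected order is harmless.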

Commands which might lead to the error message:

- qconf -mq   // modify a queue
- qmod          // change a queue
- qconf -clearusage
- qconf -mprj // modify a project

and others.

Due to bugs in the event master, these error messages were logged quite frequently in older versions (N1GE 6.0 FCS, u1, u2, and u3). However, if nobody changed anything and these error messages are still logged, one might have found a bug.

( Apr 13 2005, 02:13:36 PM CEST ) Permalink Comments [1]

Friday, April 08, 2005

Untidy? Overworked? Rent a German

Well, everyone knows that Germans like to travel, and everyone knows that they have other "good" characteristics. One thing they were not known for yet was making money. But with millions unemployed, someone was sure to come up with new and "different" ideas. The Sunday Independent features an article about renting a German. If you want to look around yourself, take a look at the Rent a German website.
( Apr 08 2005, 10:23:31 AM CEST ) Permalink Comments [0]

Tuesday, April 05, 2005

N1GE 6 - Scheduler Hacks: "least used" / "fill up" configuration
I also want to use this blog to talk about some "hacks" around the N1 Grid Engine 6 software and its scheduler. The scheduler in the Grid Engine project is, in theory, a very comfortable tool. It decides for the user where to run the jobs. The user only has to specify a couple of constraints that have to be met by the execution host. However, some configuration settings are not very intuitive. The one I want to talk about today is the configuration of "least used host first" and "fill up host".
I will not go into detail on the Grid Engine terminology. I also assume that the reader has a basic understanding of the N1 Grid Engine 6 software.
The default setting for the scheduler is to distribute jobs based on load. It looks at every host and assigns a job to the host with the least load. This is not always desired. There are use cases where the scheduler should distribute the jobs equally over all available hosts and only assign multiple jobs to one host when all hosts are in use, or, on the contrary, fill up a host first before assigning jobs to the next host.
I think that the equal distribution is the more useful case. For example, it is useful when over-subscribing the hosts in the grid. "Over-subscription" means that one host executes more jobs than it has CPUs. The setting "use least used host first" ensures that all available CPUs are used first, before Grid Engine starts to over-subscribe a host.
In my example, I will set up a grid with two hosts ("host A", "host B") and one cluster queue "all.q". The hosts are referenced in the host group "@allhosts".
'qconf -sq all.q' will show (reduced to the important details):

   qname     all.q
   hostlist  @allhosts
   slots     1,[host_A=4],[host_B=4]

We see that each host can run 4 jobs at the same time. To prepare the hosts for "least used host first" or "fill up" we have to configure:
'qconf -me host_A' and set complex_values slots=4:
   hostname        host_A
   load_scaling    NONE
   complex_values  slots=4

We make the same setting for host_B ('qconf -me host_B').
This setting is needed because the scheduler distributes jobs to the hosts based on load values. A load value can come from an external script that reports values (such as load_avg, mem_use, ...) or from a consumable. We use a consumable for our configuration. To give the scheduler access to the value, we need to define it for each host, as we just did. The N1 Grid Engine 6 software will now count the running jobs not only on the queue level but also on the host level. If the sum of slots over all queue instances on a given host is bigger than the value defined for that host, the scheduler limits the number of running jobs on that host to the value from the host configuration. If all queue instances on that host together have fewer slots than defined for the host, the number of running jobs will not exceed the number of slots in the queue instances.
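
Put differently, the number of jobs a host can run is the smaller of the two limits. A one-line Python sketch of that rule, with a hypothetical function name:

```python
def effective_slot_limit(queue_instance_slots, host_slots):
    """queue_instance_slots: slots of every queue instance on the host.
    host_slots: the value of "slots" in the host's complex_values."""
    return min(sum(queue_instance_slots), host_slots)


# host_A from the example: one queue instance (all.q) with 4 slots, host limit 4.
print(effective_slot_limit([4], 4))
# Two queue instances with 4 slots each are capped by the host limit.
print(effective_slot_limit([4, 4], 4))
# A single queue instance with 2 slots stays below the host limit.
print(effective_slot_limit([2], 4))
```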
After we did the preparation, we need to tell the scheduler to use the least used host first or to fill it up.
To enable "use least used host first" we configure: 'qconf -msconf' and set "queue_sort_method  load" and "load_formula  -slots".
   algorithm                   default
   schedule_interval           0:2:0
   maxujobs                    0
   queue_sort_method           load
   job_load_adjustments        NONE
   load_adjustment_decay_time  0:0:0
   load_formula                -slots
   schedd_job_info             true
   flush_submit_sec            1
   flush_finish_sec            1

To enable "fill up host" we configure: 'qconf -msconf' and set "queue_sort_method  load" and "load_formula  slots".
   algorithm                   default
   schedule_interval           0:2:0
   maxujobs                    0
   queue_sort_method           load
   job_load_adjustments        NONE
   load_adjustment_decay_time  0:0:0
   load_formula                slots
   schedd_job_info             true
   flush_submit_sec            1
   flush_finish_sec            1

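
The effect of the two load formulas can be seen in a small Python simulation. It is a sketch under assumptions of mine: "slots" in the formula is taken to be the number of free slots left on the host, the host with the lowest formula value is chosen, and ties are broken by host name. The function place_jobs is hypothetical and ignores everything else the real scheduler considers.

```python
def place_jobs(n_jobs, capacity, formula_sign):
    """capacity: {host: slots}. formula_sign: -1 models "load_formula -slots",
    +1 models "load_formula slots". Returns the host chosen for each job."""
    free = dict(capacity)
    placement = []
    for _ in range(n_jobs):
        # Only hosts with a free slot are candidates; sorting by name
        # makes tie-breaking deterministic.
        candidates = [h for h in sorted(free) if free[h] > 0]
        if not candidates:
            break  # the grid is full
        # The host with the lowest "load" wins.
        host = min(candidates, key=lambda h: formula_sign * free[h])
        free[host] -= 1
        placement.append(host)
    return placement


hosts = {"host_A": 4, "host_B": 4}
print("least used first:", place_jobs(5, hosts, -1))
print("fill up:         ", place_jobs(5, hosts, +1))
```

With -slots, the host with the most free slots has the lowest load, so the five jobs alternate between host_A and host_B. With slots, the host with the fewest free slots is preferred, so host_A receives four jobs before host_B gets its first one.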
If performance is critical to you, I am not sure that I can recommend this configuration. If that is the case, please use profiling to validate the performance impact.
Well, having set up the scheduler this way, one might wonder how this setting works together with the parallel environment (pe) allocation rule. By default, whatever is specified in the pe overrides the scheduler configuration. Only if "pe_slots" is set as the allocation rule is the scheduler configuration used.
( Apr 05 2005, 09:50:24 PM CEST ) Permalink Comments [3]

Saturday, April 02, 2005

New blogger and a bit about climbing Fuji-san
Hello
I am another blogger here in this space. I do not know yet what I am going to blog about. It will most likely be about our product, N1 Grid Engine 6, and some other stuff, most likely some nonsense. About me: I am part of the N1 Grid Engine engineering team in Regensburg, Germany. Daniel Templeton has already written a lot about his experiences as an American in a German engineering team far, far away from the Sun core development centres. Every once in a while, one gets the impression that we are a little village in the enemy’s domain, similar to Asterix and Obelix…. ;-). I am a north German working and living in south Germany. It might be unbelievable, but I still do not understand the German spoken in Bavaria, and it does not make things simpler that every town speaks its own dialect…
Okay, why not start with a little nonsense…
I have been doing software development for some time now, and a couple of years ago I happened to work in Japan. While travelling to get to know the country, I climbed Mount Fuji. On the way back to Tokyo, I came up with the idea that climbing Fuji-san is the perfect metaphor for software development, at least for most projects.
We started the climb at 11:00 p.m. in the dark. We followed the narrow tracks up the mountain with nothing more than a tiny flashlight. It is the same for software projects. One starts with nearly no knowledge of the project. One looks around, tries to follow the tiny path, and hopes not to stumble and fall. The path is very stony and sandy. One has to watch one's steps.
After we had been up the mountain a bit, we saw a thunderstorm coming our way. We saw the clouds coming and the lightning, how it jumped between the clouds before the flash went down to earth. That was a very scary moment. We almost had to turn around and abandon the project. These are the usual worries during development. Is one's political situation stable and strong enough? Will the company have enough money to continue the funding, or will we work on a different project next week? We were lucky: the wind changed, and the thunderstorm did not get close to Fuji-san. We witnessed an unbelievable show.
Continuing from there, it got rougher. The sand was gone, and we had to climb. On some occasions, we had to go down on all fours and really climb. The path was about 1 meter wide and we got into a “traffic jam”. Too many people were climbing up the mountain. This was a very dangerous part. We could hardly see the rocks. The batteries of my flashlight were drained and I had no light. Other climbers, who used sticks to make the climb easier, did not know what to do with their sticks. More than once, I was almost hit by one because that person did not care about the others. Any difference to software engineering in big companies with little budget? Groups are fighting for funding. They do not always care about the other groups, right? What about groups being a threat because they want to work in the same space? One of our group members had bad luck and hurt her knee. She missed one of the rocks while she blocked a stick. She had to turn around and go down…
Having come that far, the light was not a problem anymore. The trail got easier, but the air got thin (we were over 3500 meters above sea level). We were very short of breath and exhausted. We almost turned around, but that was not possible. I think nobody is surprised when I tell you that we did not make it all the way to the top in time. We should have been on top at sunrise, but due to the crowds and the exhaustion we did not make it. We were about 100 meters below the top when the sun rose. Again, is this so different from software development? Most of the projects I know are never on time. There is always something that no one expected, something surprising. The good news is that, in most cases, one has the extra time. It makes the product better and it ensures that the programmers do not exhaust themselves too much. For us it was also better not to be on the top during the sunrise (because of the crowds up there). We paused and found a good place to rest. We were still sitting there when the sun rose. We had a very, very good view of the sunrise.
A moment later, we reached the top and finished our project. But it was not nice up there: cold and windy. We did not stay there for long. The way down was also much more painful than we expected. However, we did it. Only our team-mate who got injured did not make it all the way, and it took her very, very long to get back to the base station. Sitting in the bus and reflecting on the climb, I can say that we were not properly equipped and not well enough trained. We also had wrong information about the climb. It was much tougher than the descriptions we had read before we started said. We needed 11 hours for the climb.
Well, what a surprise. I usually find this to be very similar to developing projects…
So, does this work as a metaphor for software development? I consider it a nice little analogy. My girlfriend said that this text sounds a bit depressing; that should not be the case. I like challenges. This was not the first mountain I climbed and will certainly not be the last one. The same is true for software development. I enjoy the challenges during the development and the pride of having managed something that seemed to be impossible at the beginning. It is fun to have the possibility to do something new every day. I like it a lot.
And now a couple pictures from the climb:
The climb in the dark: in the dark

The last station before the top: getting down

On the way down: last station below the top (3500 meters above sea level)

Being back at the base: being back at the base station
( Apr 02 2005, 08:01:10 PM CEST ) Permalink Comments [2]

