Friday, September 08, 2006 | Stephan Grell's Weblog
It was fun, it was interesting. I still move on.
A warm welcome to Praha
The Sun re-org worked very well for us. I would like to take the chance to send a warm welcome to our new colleagues in Praha. With this new branch in our team, we will become truly distributed. We now have the folks from the open source community helping us with the project, plus 3 locations within Sun from which we do our development. It will be interesting to see how fast we can do the integration and which new projects we can now take on. So, again, a warm welcome to our new team members. I hope we will continue as we started last week with the big get-together.
( Sep 05 2006, 09:55:12 AM CEST )
N1GE 6 - Profiling
N1GE 6 - Scheduler Hacks: Exclusive master host for the master task
Grid Engine allows hosts / queues to be sorted by sequence number. Assuming that we have only one cluster queue
and the parallel environment is configured to use fill-up, we can
assign the compute queue instances a smaller sequence number than the
master machines. The job would request the pe to run in and the master
machine as the masterq. This way, all slaves run on the compute
nodes, which are filled up first, and the master task is singled out to
the master machine due to its special request.
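The placement described above can be illustrated with a small toy model. This is only a sketch and not Grid Engine's scheduler: the `QueueInstance` class, the `place_job` function and the queue names are made up for the illustration, and it assumes that the slot count requested with -pe includes the master task.

```python
from dataclasses import dataclass


@dataclass
class QueueInstance:
    name: str
    seq_no: int
    free_slots: int
    is_master: bool = False  # hypothetical flag: matches the -masterq request


def place_job(queues, total_slots):
    """Toy model: return {queue instance name: number of tasks placed}."""
    free = {q.name: q.free_slots for q in queues}
    # The master task goes to a queue instance matching the masterq request.
    master = next((q for q in queues if q.is_master and q.free_slots > 0), None)
    if master is None:
        raise ValueError("no master queue instance with a free slot")
    placement = {master.name: 1}
    free[master.name] -= 1
    remaining = total_slots - 1
    # The slave tasks fill the queue instances in ascending seq_no order.
    for q in sorted(queues, key=lambda q: q.seq_no):
        take = min(free[q.name], remaining)
        if take > 0:
            placement[q.name] = placement.get(q.name, 0) + take
            remaining -= take
    if remaining > 0:
        raise ValueError("not enough free slots for the job")
    return placement


queues = [
    QueueInstance("all.q@master1", seq_no=1, free_slots=4, is_master=True),
    QueueInstance("all.q@node1", seq_no=0, free_slots=3),
    QueueInstance("all.q@node2", seq_no=0, free_slots=3),
]
# 6 slots: one master task on master1, five slaves on the compute nodes.
print(place_job(queues, 6))
# 8 slots: the compute nodes are full, so one slave spills onto master1.
print(place_job(queues, 8))
```

The second call shows the weakness of this approach: once the compute nodes run out of slots, the master machine receives more than one task.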
Advantages: If the environment has more than one master host, wildcards in the masterq request can be used to select one of the master hosts. This approach makes the best use of all resources and is
easy to set up, understand and debug. This setup also has the least
performance impact.
Problems: As soon as there are not enough compute
nodes available, the scheduler will assign more than one task to the
master machine.
Configuration: Change the queue sort order in the scheduler config:
qconf -msconf
queue_sort_method seqno
The queue on the small hosts gets:
qconf -mq <queue>
seq_no 0
The queue for the master hosts gets:
qconf -mq <queue>
seq_no 1
A job submit would look like:
qsub -pe <PE> 6 -masterq "*@master*" ...
2) Making excessive use of pe objects and cluster queues:
Description: Each slot on a master host needs its
own cluster queue and its own pe. The compute nodes are combined under
1 cluster queue with all pe objects that are used on the master hosts.
Each master cluster queue has exactly one slot. The job submit will now
request both the master queue and the pe it should run in
via wildcards.
Advantages: Achieves the goal.
Problems: Many configuration objects. Slows down
the scheduler quite a bit.
Configuration: I will leave the configuration for this
one open. Should not be complicated...
3) Using load adjustments:
Description: The scheduler uses the load adjustments to avoid overloading a host. The system can be configured in such a way that the scheduler starts no more than one task on a host even though more slots are available. We will use this configuration to achieve the desired goal.
Advantages: Achieves exactly what we are looking
for without any additional configuration objects.
Problems: Slows down scheduling. Only one job
requesting the master host will be started in one scheduling run.
Supporting backup master hosts is not easy.
The master machine is only allowed to have one queue instance, or all
queue instances of the master machine have to share the same load
threshold. If that is not the case, it will not work.
Configuration: I have the following setup:
qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@big                      BIP   0/4       0.02     sol-sparc64
----------------------------------------------------------------------------
small.q@small1                 BIP   0/1       0.00     lx24-amd64
----------------------------------------------------------------------------
small.q@small2                 BIP   0/1       0.02     sol-sparc64
And a configured pe in all queue instances:
qconf -sp make
pe_name           make
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   NONE
stop_proc_args    NONE
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
We now go ahead and change the load_thresholds setting of the all.q@big queue instance to a load value that is not used in the other queue instances, such as:
qconf -sq all.q
qname            all.q
hostlist         big
seq_no           0
load_thresholds  NONE,[big=load_avg=4]
The load threshold used has to be a real load value and cannot be a fixed or consumable value. The next step to make our environment work is to change the scheduler configuration to the following:
qconf -ssconf
algorithm                  default
schedule_interval          0:2:0
maxujobs                   0
queue_sort_method          load
job_load_adjustments       load_avg=4.100000
load_adjustment_decay_time 0:0:1
With job_load_adjustments configured like this, the scheduler adds an artificial load to each host that is going to run a task. Because this artificial load (4.1) already exceeds the load threshold (4), we can start only one task on the master machine in each scheduling run. Since the load_adjustment_decay_time is only 1 second, the scheduler has forgotten about the artificial load by the next scheduling run and can start a new task on the master host. This way, we achieve what we have been looking for.
Extended Configuration: If the usage of multiple master hosts
is required, one needs to create one pe object per master host. The
compute hosts are part of all pe objects. The same rule as above still
applies: each master host is only allowed to have one queue instance.
The configuration of the all.q queue would look as follows:
qconf -sq all.q
qname            all.q
hostlist         big
seq_no           0
load_thresholds  NONE,[big=load_avg=4],[big1=load_avg=4],[big2=load_avg=4]
pe_list          big_pe big1_pe big2_pe,[big=big_pe],[big1=big1_pe],[big2=big2_pe]
The job submit would look like:
qsub -pe "big*" 5 -masterq "all.q@big*" ....
( Aug 08 2006, 09:44:12 AM CEST )
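The effect of the load adjustment in approach 3 can be checked with a few lines of arithmetic. The following is a toy model only, not the scheduler's real algorithm: the function name is made up, and it assumes that a queue instance counts as overloaded as soon as its load value reaches the threshold.

```python
def tasks_started_in_one_run(free_slots, base_load, load_threshold, load_adjustment):
    """Toy model: count the tasks one host accepts within a single scheduling run."""
    started = 0
    effective_load = base_load
    # A task is only placed while the host is below its load threshold.
    while started < free_slots and effective_load < load_threshold:
        started += 1
        # Every placed task adds the artificial load for the rest of this run.
        effective_load += load_adjustment
    return started


# Values from the setup above: 4 slots, load_avg 0.02, threshold 4, adjustment 4.1.
print(tasks_started_in_one_run(4, 0.02, 4.0, 4.1))  # 1
# With a small adjustment of 0.5, all four slots could be used in one run.
print(tasks_started_in_one_run(4, 0.02, 4.0, 0.5))  # 4
```

Because the decay time (1 second) is much shorter than the schedule interval (2 minutes), each scheduling run starts again from the plain load value.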
N1GE 6 - Monitoring the qmaster
If we use the following settings:
qconf -mconf
qmaster_params Monitor_Time=0:0:20 LOG_Monitor_Message=0
we will need to use qping to gain access to the monitoring messages. This should be the preferred way because we get the statistics from the communication layer together with the statistics in the qmaster. Here is an example:
04/25/2006 19:09:53:
SIRM version: 0.1
SIRM message id: 3
start time: 04/25/2006 08:45:06 (1145947506)
run time [s]: 37487
messages in read buffer: 0
messages in write buffer: 0
nr. of connected clients: 3
status: 0
info: TET: R (1.99) | EDT: R (0.99) | SIGT: R (37486.73) | MT(1): R (3.99) | MT(2): R (0.99) | OK
Monitor:
04/25/2006 19:09:47 | TET: runs: 0.40r/s (pending: 9.00 executed: 0.40/s) out: 0.00m/s APT: 0.0001s/m idle: 100.00% wait: 0.00% time: 20.00s
04/25/2006 19:09:37 | EDT: runs: 1.00r/s (clients: 1.00 mod: 0.00/s ack: 0.00/s blocked: 0.00 busy: 0.00 | events: 0.00/s added: 0.00/s skipt: 0.00/s) out: 0.00m/s APT: 0.0001s/m idle: 99.99% wait: 0.00% time: 20.00s
04/25/2006 08:45:07 | SIGT: no monitoring data available
04/25/2006 19:09:36 | MT(1): runs: 0.15r/s (execd (l:0.04,j:0.04,c:0.04,p:0.04,a:0.00)/s GDI (a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s) out: 0.00m/s APT: 0.0002s/m idle: 100.00% wait: 0.00% time: 26.86s
04/25/2006 19:09:39 | MT(2): runs: 0.14r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s) out: 0.00m/s APT: 0.0000s/m idle: 100.00% wait: 0.00% time: 21.04s
( Apr 25 2006, 07:14:12 PM CEST )
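For automated checks it can be handy to pull the idle and wait percentages out of the per-thread monitor lines. The sketch below does that with a regular expression. It assumes that each thread is reported on one line laid out like the example above; the function name and the dictionary keys are invented for this illustration and are not part of qping.

```python
import re

# One monitor line: "<date> <time> | <THREAD>: ... idle: X% wait: Y% time: Zs"
MONITOR_LINE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) \| "
    r"(?P<thread>[A-Z]+(?:\(\d+\))?): .*"
    r"idle: (?P<idle>[\d.]+)% wait: (?P<wait>[\d.]+)% time: (?P<interval>[\d.]+)s$"
)


def parse_monitor_line(line):
    """Return the thread statistics of one monitor line, or None if it has none."""
    match = MONITOR_LINE.match(line.strip())
    if match is None:
        return None  # e.g. "SIGT: no monitoring data available"
    return {
        "thread": match["thread"],
        "idle_percent": float(match["idle"]),
        "wait_percent": float(match["wait"]),
        "interval_seconds": float(match["interval"]),
    }


sample = (
    "04/25/2006 19:09:37 | EDT: runs: 1.00r/s (clients: 1.00 mod: 0.00/s "
    "ack: 0.00/s blocked: 0.00 busy: 0.00 | events: 0.00/s added: 0.00/s "
    "skipt: 0.00/s) out: 0.00m/s APT: 0.0001s/m idle: 99.99% wait: 0.00% time: 20.00s"
)
print(parse_monitor_line(sample))
```

A thread whose idle percentage stays low over several intervals would be a candidate for closer inspection.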
N1GE 6 - Scheduler Hacks: Separated Master host for pe jobs
N1GE 6 - A couple lines on halftime and usage
N1GE 6 - We want you!
We changed some of the XML output and a
lot in the schema to make sure that they go hand in hand. The nice part
about reworking the schemas is that you can use JAXB
to generate Java classes from them. With this change it will be very
easy to write a Java program that works on the qstat output. In
combination with the DRMAA
Java interface, one now has an "API" at hand that allows writing
grid-enabled applications without too much effort.
Most of the changes only make sure that the XML output matches the schema files. One change might break already existing parsers: the dates are now printed in the XML dateTime format and no longer in a human-readable format. This was required to support JAXB parsers.
2) qmaster monitoring: To measure the performance enhancements
in the qmaster, we implemented a monitoring facility in the qmaster.
It collects information on the different requests and how long it
took to process them. The statistics are generated for each thread and
can be printed into the messages file or via qping. More information
on the generated statistics can be found here
and the man page explains how to enable it.
We are looking forward to getting your feedback on this snapshot.
Download: Grid Engine 6.0 Scalability Update 2 Snapshot 1
( Sep 20 2005, 10:08:37 AM CEST )
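For anyone maintaining a parser that is affected by the date change, the adjustment could look roughly like the sketch below. Both sample strings are assumptions: the old layout is modelled on the timestamps in the plain-text qstat output, and the new one is simply a value in XML Schema dateTime notation. Neither is copied from a real XML file, and the function names are made up.

```python
from datetime import datetime


def parse_legacy_date(text):
    """Parse a human-readable timestamp such as '07/21/2005 09:10:04' (assumed old layout)."""
    return datetime.strptime(text, "%m/%d/%Y %H:%M:%S")


def parse_xml_datetime(text):
    """Parse an XML Schema dateTime value such as '2005-07-21T09:10:04'."""
    return datetime.fromisoformat(text)


old_style = parse_legacy_date("07/21/2005 09:10:04")
new_style = parse_xml_datetime("2005-07-21T09:10:04")
print(old_style == new_style)  # True: both describe the same point in time
```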
Sun Grid Engine Team - flooding in Regensburg
N1GE 6 - Scheduler Hacks: consumables as load threshold
N1GE 6 u6 is available Grid Engine 6.0 Update 6 is now ready and courtesy binaries are
N1GE 6 - Scheduler Hacks: Sorting queues
queue_sort_method          load
job_load_adjustments       np_load_avg=0.50
load_adjustment_decay_time 0:7:30
load_formula               np_load_avg
This setting uses the load for sorting. It adds 0.5 to the load of a host for each job started there, and this added load decays over 7.5 minutes.
Hint: If a host has more than 1 slot, the load adjustment can lead to not all slots on that host being used, because the next job might overload that host. qstat -j <job_id> shows the reasons why a job was not dispatched, including the hosts that will not be used due to load adjustments. If np_load_avg is used for the load adjustments and the load formula, the number of processors in a machine is taken into account.
Example (using job_load_adjustments np_load_avg=1.5). As one can see, not all slots are used.
es-ergb01-01% qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@host1                    BIP   1/5       0.03     lx24-amd64
    103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 8
----------------------------------------------------------------------------
all.q@host2                    BIP   3/5       0.78     sol-sparc64
    103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 5
    103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 7
    103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 11
----------------------------------------------------------------------------
all.q@host3                    BIP   2/5       0.28     sol-sparc64
    103 0.55500 job        sg144703     t     07/21/2005 09:10:04     1 6
    103 0.55500 job        sg144703     t     07/21/2005 09:10:04     1 12
----------------------------------------------------------------------------
all.q@host4                    BIP   1/5       0.16     sol-x86
    103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 10
----------------------------------------------------------------------------
all.q@host5                    BIP   0/5       0.01     sol-x86
----------------------------------------------------------------------------
test.q@host1                   BIP   1/5       0.03     lx24-amd64
    103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 2
----------------------------------------------------------------------------
test.q@host2                   BIP   0/5       0.78     sol-sparc64   D
----------------------------------------------------------------------------
test.q@host3                   BIP   2/5       0.28     sol-sparc64
    103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 3
    103 0.55500 job        sg144703     t     07/21/2005 09:10:04     1 9
----------------------------------------------------------------------------
test.q@host4                   BIP   1/5       0.16     sol-x86
    103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 4
----------------------------------------------------------------------------
test.q@host5                   BIP   1/5       0.01     sol-x86
    103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 1
############################################################################
 PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    103 0.00000 job        sg144703     qw    07/21/2005 09:10:02     1 13-20:1
qstat -j 103
scheduling info:
queue instance "test.q@ori" dropped because it is overloaded: np_load_avg=2.511719 (= 0.011719 + 2.50 * 1.000000 with nproc=1) >= 1.75
queue instance "all.q@ori" dropped because it is overloaded: np_load_avg=2.511719 (= 0.011719 + 2.50 * 1.000000 with nproc=1) >= 2.05
queue instance "all.q@carc" dropped because it is overloaded: np_load_avg=2.515000 (= 0.015000 + 2.50 * 2.000000 with nproc=1) >= 2.05
queue instance "test.q@carc" dropped because it is overloaded: np_load_avg=2.515000 (= 0.015000 + 2.50 * 2.000000 with nproc=1) >= 1.75
queue instance "test.q@gimli" dropped because it is overloaded: np_load_avg=1.945312 (= 0.070312 + 2.50 * 3.000000 with nproc=1) >= 1.75
queue instance "all.q@nori" dropped because it is overloaded: np_load_avg=2.580078 (= 0.080078 + 2.50 * 2.000000 with nproc=1) >= 2.05
queue instance "test.q@nori" dropped because it is overloaded: np_load_avg=2.580078 (= 0.080078 + 2.50 * 2.000000 with nproc=1) >= 1.75
queue instance "all.q@es-ergb01-01" dropped because it is overloaded: np_load_avg=2.070312 (= 0.195312 + 2.50 * 3.000000 with nproc=1) >= 2.05
queue instance "all.q@gimli" dropped because it is overloaded: np_load_avg=2.570312 (= 0.070312 + 2.50 * 4.000000 with nproc=1) >= 2.05
As we can see, this configuration can be a very powerful tool to set up rather complicated environments. However, there are cases where one would like to ensure that a certain queue is used before another queue. (I am using "queue" here to refer to cluster queues and queue instances together.) In these cases, one can assign a sequence number to the queues via qconf -mq <cluster queue name>:
seq_no 0
This sequence number is used when the scheduler configuration is changed to:
queue_sort_method seqno
After this change, queue instances with a low seq_no will be chosen first. If there are multiple queue instances with the same sequence number, the configured load value is used to determine which queue instance to pick. This means that if all queue instances have the same seq_no and the scheduler is configured to sort by seq_no, it ultimately uses the load from the hosts.
Example:
"test.q" has a sequence number of 0
"all.q" has a sequence number of 2
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
test.q@host1                   BIP   2/5       0.26     lx24-amd64
    108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 4
    108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 8
----------------------------------------------------------------------------
test.q@host2                   BIP   0/5       0.58     sol-sparc64   D
----------------------------------------------------------------------------
test.q@host3                   BIP   4/5       0.44     sol-sparc64
    108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 3
    108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 5
    108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 7
    108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 9
----------------------------------------------------------------------------
test.q@host4                   BIP   2/5       0.08     sol-x86
    108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 2
    108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 6
----------------------------------------------------------------------------
test.q@host5                   BIP   2/5       0.01     sol-x86
    108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 1
    108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 10
----------------------------------------------------------------------------
all.q@host1                    BIP   0/5       0.26     lx24-amd64
----------------------------------------------------------------------------
all.q@host2                    BIP   0/5       0.58     sol-sparc64
----------------------------------------------------------------------------
all.q@host3                    BIP   0/5       0.44     sol-sparc64
----------------------------------------------------------------------------
all.q@host4                    BIP   0/5       0.08     sol-x86
----------------------------------------------------------------------------
all.q@host5                    BIP   0/5       0.01     sol-x86
As one can see, only test.q was used, and within test.q the load values had an effect.
( Jul 21 2005, 09:35:42 AM CEST )
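When many queue instances are dropped, the scheduling info of qstat -j is easier to digest once it is turned into a small table. The following sketch extracts the queue instance, the load value and the threshold from the "dropped because it is overloaded" messages. It assumes the message layout shown in the example above, with one message per line; the function name and the dictionary keys are made up for this illustration.

```python
import re

# Layout taken from the qstat -j example; assumed to be one message per line.
OVERLOADED = re.compile(
    r'queue instance "(?P<queue>[^"]+)" dropped because it is overloaded: '
    r"(?P<load_name>\w+)=(?P<load>[\d.]+) \(.*?\) >= (?P<threshold>[\d.]+)"
)


def parse_overloaded(text):
    """Return one dict per 'overloaded' message found in the scheduling info."""
    return [
        {
            "queue": m["queue"],
            "load_name": m["load_name"],
            "load": float(m["load"]),
            "threshold": float(m["threshold"]),
        }
        for m in OVERLOADED.finditer(text)
    ]


scheduling_info = """\
queue instance "test.q@ori" dropped because it is overloaded: np_load_avg=2.511719 (= 0.011719 + 2.50 * 1.000000 with nproc=1) >= 1.75
queue instance "all.q@gimli" dropped because it is overloaded: np_load_avg=2.570312 (= 0.070312 + 2.50 * 4.000000 with nproc=1) >= 2.05
"""
for entry in parse_overloaded(scheduling_info):
    print(entry["queue"], entry["load"], ">=", entry["threshold"])
```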
Software such as our Grid Engine can be a critical component in a
production environment. Its perfect functioning has the highest
priority. However, there are cases in which the grid goes down or one of
its components is not available. When this happens, the administrator or
the software has to react right away. N1GE 6 provides two ways to
monitor the correct functioning of its components:
- the heartbeat file at: <CELL>/common/heartbeat
- qping.
1) Heartbeat file: The heartbeat file contains a simple number
that is increased at a fixed interval. If that number does not change
for a couple of minutes, the qmaster has most likely stopped its
execution.
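A heartbeat check along these lines could be scripted as in the sketch below. It assumes that the file contains nothing but the counter as text; the function names are invented here, and the path and waiting time are up to the caller.

```python
import time
from pathlib import Path


def read_heartbeat(path):
    """Read the counter from the heartbeat file (assumed to hold a single integer)."""
    return int(Path(path).read_text().strip())


def heartbeat_advances(path, wait_seconds, sleep=time.sleep):
    """Return True if the counter changed while waiting for wait_seconds."""
    before = read_heartbeat(path)
    sleep(wait_seconds)
    return read_heartbeat(path) != before
```

A monitoring script could call heartbeat_advances("<CELL>/common/heartbeat", 120) with the real cell directory filled in and raise an alarm when it returns False. The sleep parameter only exists so that the waiting can be replaced in tests.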
2) qping: Qping gives a more comprehensive way of
monitoring the grid. It can be used to monitor the qmaster and the
execd daemon. Depending on the parameters it is invoked with, one gets a
heartbeat replacement or in-depth information about the status of the
daemon. I will give a short introduction to qping. For more
information, consult the qping(1) man page. The monitoring part of the
qping command can be executed from any machine by any user.
Heartbeat file replacement:
Command:
qping <MASTER_HOST> $SGE_QMASTER_PORT qmaster 1
qping <EXECD_HOST> $SGE_EXECD_PORT execd 1
output:
07/14/2005 14:38:19 endpoint scrabe.workgroup/qmaster/1 at port 7171 is up since 194 seconds
The output format is:
<DATE> <TIME> endpoint <MASTER_HOST/qmaster/1> at port <PORT_NUMBER> is up since <SECONDS> seconds
Extensive health information:
Command:
qping -f <MASTER_HOST> $SGE_QMASTER_PORT qmaster 1
qping -f <EXECD_HOST> $SGE_EXECD_PORT execd 1
output:
07/14/2005 14:38:10:
SIRM version: 0.1
SIRM message id: 2
start time: 07/14/2005 14:35:05 (1121344505)
run time [s]: 185
messages in read buffer: 0
messages in write buffer: 0
nr. of connected clients: 3
status: 0
info: TET: R (4.71) | EDT: R (0.71) | SIGT: R (184.61) | MT(1): R (6.17) | MT(2): R (4.62) | OK
The important information, which we did not get in the other output, is the per-thread monitoring and the number of messages in the read buffer. The per-thread information allows one to have a more fine-grained monitoring and to detect deadlocks in the master. The number of messages in the read buffer can be used as an indicator of an overloaded qmaster. The qping in update 4 and 5 shows only one MT thread even though 2 are used. This will be changed, as one can see in the output above.
The other functions of qping belong to the debug and analysis domain and are definitely worth playing with.
( Jul 14 2005, 03:31:55 PM CEST )
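The short qping output is easy to consume from a script. The sketch below parses one such line into its parts. It relies only on the output format quoted above; the function name and the dictionary keys are invented for this illustration.

```python
import re

# "<DATE> <TIME> endpoint <HOST>/<COMPONENT>/<ID> at port <PORT> is up since <SECONDS> seconds"
UP_LINE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"endpoint (?P<host>\S+)/(?P<component>\w+)/(?P<id>\d+) "
    r"at port (?P<port>\d+) is up since (?P<seconds>\d+) seconds$"
)


def parse_up_line(line):
    """Return host, component, port and uptime of a short qping line, or None."""
    match = UP_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "host": match["host"],
        "component": match["component"],
        "port": int(match["port"]),
        "uptime_seconds": int(match["seconds"]),
    }


line = "07/14/2005 14:38:19 endpoint scrabe.workgroup/qmaster/1 at port 7171 is up since 194 seconds"
print(parse_up_line(line))
```

An uptime that suddenly drops between two checks would hint at a restarted daemon, and a line that does not parse at all at an endpoint that is not answering as expected.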
N1GE 6 - Migration From LSF to Sun N1 GE Software
Kirk Patton (Transmeta) describes in his paper how he replaced LSF with
N1GE 6 and why
Index
N1GE 6:
N1GE 6 - Scheduler: Department-Based policy setup
N1GE 6 - A couple lines on halftime and usage
N1GE 6 - Scheduler Hacks: Separated Master host for pe jobs
N1GE 6 - Scheduler Hacks: The ticket policy hierarchy
N1GE 6 - Scheduler Hacks: job execution priority
N1GE 6 - Scheduler Hacks: Comment on the qmaster <-> scheduler protocol
N1GE 6 - Scheduler Hacks: "least used" / "fill up" configuration
N1GE 6 - Scheduler Hacks: sorting queues
N1GE 6 - Scheduler Hacks: consumables as load thresholds
N1GE 6 - Scheduler Hacks: Exclusive master host for the master task
N1GE 6 - health monitoring
N1GE 6 - Qmaster monitoring
N1GE 6 - profiling
N1GE 6 - LSF -> N1GE 6 migration
N1GE 6 - Announcing Grid Engine 6.0 Scalability Update 2 Snapshot 1
General:
Comments:
( May 12 2005, 11:12:26 AM CEST )