Shared Pool
An Example of Managing a Grid Engine Cluster with SDM
Once you have added the Grid Engine (GE) service to SDM (Service Domain Manager), you can use SDM to manage the Grid Engine cluster, adding or removing execution hosts as the Grid Engine workload changes.
Two services are currently available, the spare_pool service and the GE adapter service, and the following SLOs can be used with them:
o MinResourceSLO and FixedUsageSLO (can be used with any service)
o PermanentRequestSLO (only for the spare_pool service)
o MaxPendingJobsSLO (only for the GE service)
You can find more details about SLOs in this wiki page.
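The applicability rules in the list above can be captured in a small lookup table. You could use it, for example, to sanity-check a planned configuration before editing it with sdmadm. This is a minimal sketch. The service-type labels "spare_pool" and "ge" and the helper names are hypothetical and are not part of SDM.

```python
# Which SLO types may be attached to which service type, per the list above.
# The labels "spare_pool" and "ge" are hypothetical names for the two services.
SLO_APPLICABILITY = {
    "MinResourceSLO": {"spare_pool", "ge"},
    "FixedUsageSLO": {"spare_pool", "ge"},
    "PermanentRequestSLO": {"spare_pool"},
    "MaxPendingJobsSLO": {"ge"},
}


def slo_allowed(slo_type: str, service_type: str) -> bool:
    """Return True if slo_type may be configured on service_type."""
    return service_type in SLO_APPLICABILITY.get(slo_type, set())


def check_service(service_type: str, slo_types: list[str]) -> list[str]:
    """Return the SLO types from slo_types that are NOT valid for service_type."""
    return [slo for slo in slo_types if not slo_allowed(slo, service_type)]
```

For instance, check_service("spare_pool", ["MaxPendingJobsSLO"]) flags MaxPendingJobsSLO, because that SLO only applies to the GE service.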
To provide more execution hosts as the number of pending jobs increases, add a MaxPendingJobsSLO to the GE service configuration as shown below. Note that its urgency is set to 99, the highest value, so that this SLO has the highest priority. Also note that the execd section of the configuration has been extended with install and uninstall templates so that SDM can provision execution hosts automatically.
node1# sdmadm mc -c gesvc     [modify GE adapter configuration]
...
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<common:componentConfig xsi:type="ge_adapter:GEServiceConfig"
mapping="default"
xmlns:executor="http://hedeby.sunsource.net/hedeby-executor"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:reporter="http://hedeby.sunsource.net/hedeby-reporter"
xmlns:security="http://hedeby.sunsource.net/hedeby-security"
xmlns:resource_provider="http://hedeby.sunsource.net/hedeby-resource-provider"
xmlns:common="http://hedeby.sunsource.net/hedeby-common"
xmlns:ge_adapter="http://hedeby.sunsource.net/hedeby-gridengine-adapter">
<common:slos>
<common:slo xsi:type="common:FixedUsageSLOConfig"
urgency="50"
name="fixed_usage"/>
<common:slo name="maxPendingJobs"
xsi:type="ge_adapter:MaxPendingJobsSLOConfig"
urgency="99"
max="10">
</common:slo>
</common:slos>
<ge_adapter:connection
keystore="/var/sgeCA/port6236/default/userkeys/sdmadmin/keystore"
password=""
username="sdmadmin"
jmxPort="6238"
execdPort="6237"
masterPort="6236"
cell="default"
root="/var/opt/sge/6.2beta"
clusterName="p6236"/>
<ge_adapter:sloUpdateInterval unit="minutes"
value="5"/>
<ge_adapter:execd adminUsername="root"
defaultDomain=""
ignoreFQDN="true"
rcScript="false"
adminHost="true"
submitHost="false"
cleanupDefault="true">
<ge_adapter:localSpoolDir>/var/spool/sge/execd</ge_adapter:localSpoolDir>
<ge_adapter:installTemplate
executeOn="exec_host">
<ge_adapter:script>/opt/sdm/6.2beta/util/templates/ge-adapter/install_execd.sh</ge_adapter:script>
<ge_adapter:conf>/opt/sdm/6.2beta/util/templates/ge-adapter/install_execd.conf</ge_adapter:conf>
</ge_adapter:installTemplate>
<ge_adapter:uninstallTemplate
executeOn="exec_host">
<ge_adapter:script>/opt/sdm/6.2beta/util/templates/ge-adapter/uninstall_execd.sh</ge_adapter:script>
<ge_adapter:conf>/opt/sdm/6.2beta/util/templates/ge-adapter/uninstall_execd.conf</ge_adapter:conf>
</ge_adapter:uninstallTemplate>
</ge_adapter:execd>
</common:componentConfig>
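To double-check which SLOs a saved configuration contains, you can read the XML with Python's standard xml.etree.ElementTree module. The sketch below assumes that you have a copy of the configuration as a string, for example pasted from the editor session above. list_slos is a hypothetical helper. Only the namespace URIs are taken from the configuration itself.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the componentConfig element above.
NS = {
    "common": "http://hedeby.sunsource.net/hedeby-common",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}
XSI_TYPE = "{%s}type" % NS["xsi"]  # ElementTree's key for the xsi:type attribute


def list_slos(config_xml: str) -> list[dict]:
    """Return name, xsi:type, urgency and remaining attributes of each SLO."""
    root = ET.fromstring(config_xml)
    slos = []
    for slo in root.findall("common:slos/common:slo", NS):
        attrs = dict(slo.attrib)
        slos.append({
            "name": attrs.pop("name", None),
            "type": attrs.pop(XSI_TYPE, None),
            "urgency": int(attrs.pop("urgency", "0")),
            "extra": attrs,  # e.g. {"max": "10"} for MaxPendingJobsSLOConfig
        })
    return slos
```

Applied to the gesvc configuration above, this lists fixed_usage with urgency 50 and maxPendingJobs with urgency 99 and max 10.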
After saving the configuration changes, run sdmadm uc to make them take effect:
node1# sdmadm uc [-c gesvc]
comp host message
----------------------------------------
ca node0 reload triggered
executor node0 reload triggered
node1 reload triggered
node2 reload triggered
node3 reload triggered
gesvc node1 reload triggered
reporter node0 reload triggered
resource_provider node0 reload triggered
In the following example, SDM adds execution hosts to the GE cluster by moving resources from the spare_pool service whenever the number of pending jobs exceeds the limit defined in the MaxPendingJobsSLO. To prepare for this, we first add some available resources (node1, node2 and node3) to the spare_pool service:
node2# sdmadm add_resource -r node1 -t host -s spare_pool
resource message
-----------------------------------------------------------
node1 Resource was added to the system.
node2# sdmadm add_resource -r node2 -t host -s spare_pool
resource message
-----------------------------------------------------------
node2 Resource was added to the system.
node2# sdmadm add_resource -r node3 -t host -s spare_pool
resource message
-----------------------------------------------------------
node3 Resource was added to the system.
node2# sdmadm show_resource
service id state type flags usage annotation
-----------------------------------------------------
spare_pool node1 ASSIGNED host 1
node2 ASSIGNED host 1
node3 ASSIGNED host 1
node2# sdmadm show_resource -s gesvc
No resources has been found.
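When scripting around sdmadm, it can be convenient to turn the show_resource table into records. The sketch below parses the whitespace-separated layout shown in this post. In that layout, a row that starts with a resource ID rather than a service name belongs to the service on the row above. The parser relies only on assumptions drawn from these transcripts:

- states are upper-case words such as ASSIGNED;
- flags such as S come before the usage;
- usage is an integer or inf.

parse_show_resource is a hypothetical helper, not part of SDM.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    service: str
    resource_id: str
    state: str
    type: str
    flags: str
    usage: float  # "inf" is mapped to float("inf")
    annotation: str


def _is_state(token: str) -> bool:
    # Assumption: states look like ASSIGNED (upper-case letters only).
    return token.isalpha() and token.isupper()


def _is_usage(token: str) -> bool:
    return token.isdigit() or token == "inf"


def parse_show_resource(text: str) -> list[Resource]:
    """Parse 'sdmadm show_resource' text as it appears in this post."""
    resources = []
    service = ""
    for line in text.splitlines():
        tokens = line.split()
        # Skip blank lines, dashed rules, the header and short messages.
        if len(tokens) < 4 or tokens[0] == "service":
            continue
        if _is_state(tokens[1]):
            idx = 1                 # continuation row: "node2 ASSIGNED host 1"
        elif _is_state(tokens[2]):
            service, idx = tokens[0], 2  # row that names the service
        else:
            continue                # e.g. "No resources has been found."
        resource_id, state, rtype = tokens[idx - 1], tokens[idx], tokens[idx + 1]
        rest = tokens[idx + 2:]
        flags = ""
        if rest and not _is_usage(rest[0]):
            flags, rest = rest[0], rest[1:]
        if not rest:
            continue
        resources.append(Resource(service, resource_id, state, rtype, flags,
                                  float(rest[0]), " ".join(rest[1:])))
    return resources
```

Given the two show_resource outputs above, it returns three ASSIGNED spare_pool hosts and an empty list, respectively.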
One way to add an execution host is to run the sdmadm mvr command manually. This command moves a particular resource to the designated service:
node3# sdmadm mvr -r node1 -s gesvc
SDM then automatically invokes the GE execution host installation procedure. Note that because node1 is the qmaster host, it is flagged as a static resource, so node1 will never be removed from the GE service.
As we submit more and more jobs to the GE cluster, SDM adds more execution hosts to the existing GE cluster:
node1# qstat -f | more
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@node1 BIP 0/4/4 0.56 sol-sparc64
40 0.55500 sleep root r 06/24/2008 09:31:05 1
41 0.55500 sleep root t 06/24/2008 09:31:05 1
42 0.55500 sleep root t 06/24/2008 09:31:05 1
43 0.55500 sleep root t 06/24/2008 09:31:05 1
---------------------------------------------------------------------------------
all.q@node2 BIP 0/4/4 0.49 sol-sparc64
36 0.55500 sleep root r 06/24/2008 09:31:05 1
37 0.55500 sleep root t 06/24/2008 09:31:05 1
38 0.55500 sleep root t 06/24/2008 09:31:05 1
39 0.55500 sleep root t 06/24/2008 09:31:05 1
---------------------------------------------------------------------------------
all.q@node3 BIP 0/4/4 0.57 sol-sparc64
32 0.55500 sleep root r 06/24/2008 09:25:26 1
33 0.55500 sleep root r 06/24/2008 09:25:29 1
34 0.55500 sleep root r 06/24/2008 09:25:29 1
35 0.55500 sleep root r 06/24/2008 09:25:29 1
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
44 0.00000 sleep root qw 06/24/2008 09:22:47 1
45 0.00000 sleep root qw 06/24/2008 09:22:51 1
46 0.00000 sleep root qw 06/24/2008 09:22:51 1
47 0.00000 sleep root qw 06/24/2008 09:22:51 1
48 0.00000 sleep root qw 06/24/2008 09:22:51 1
49 0.00000 sleep root qw 06/24/2008 09:22:52 1
50 0.00000 sleep root qw 06/24/2008 09:22:52 1
51 0.00000 sleep root qw 06/24/2008 09:22:52 1
52 0.00000 sleep root qw 06/24/2008 09:22:52 1
53 0.00000 sleep root qw 06/24/2008 09:22:53 1
54 0.00000 sleep root qw 06/24/2008 09:22:53 1
55 0.00000 sleep root qw 06/24/2008 09:22:53 1
56 0.00000 sleep root qw 06/24/2008 09:22:53 1
57 0.00000 sleep root qw 06/24/2008 09:22:53 1
58 0.00000 sleep root qw 06/24/2008 09:22:54 1
59 0.00000 sleep root qw 06/24/2008 09:22:54 1
60 0.00000 sleep root qw 06/24/2008 09:22:54 1
61 0.00000 sleep root qw 06/24/2008 09:22:55 1
62 0.00000 sleep root qw 06/24/2008 09:22:55 1
63 0.00000 sleep root qw 06/24/2008 09:22:55 1
64 0.00000 sleep root qw 06/24/2008 09:22:55 1
65 0.00000 sleep root qw 06/24/2008 09:22:56 1
66 0.00000 sleep root qw 06/24/2008 09:22:56 1
67 0.00000 sleep root qw 06/24/2008 09:22:56 1
68 0.00000 sleep root qw 06/24/2008 09:22:56 1
69 0.00000 sleep root qw 06/24/2008 09:22:56 1
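The MaxPendingJobsSLO condition can be illustrated by counting the jobs listed under the PENDING JOBS banner of qstat -f and comparing the count with the configured max of 10. This is a simplified sketch. How SDM itself counts pending jobs may differ, and both helper names are hypothetical.

```python
def count_pending_jobs(qstat_output: str) -> int:
    """Count the job lines that follow the PENDING JOBS banner of 'qstat -f'."""
    pending = 0
    in_pending_section = False
    for line in qstat_output.splitlines():
        if "PENDING JOBS" in line:
            in_pending_section = True
            continue
        tokens = line.split()
        # Job lines start with a numeric job ID; banners and rules do not.
        if in_pending_section and tokens and tokens[0].isdigit():
            pending += 1
    return pending


def max_pending_jobs_exceeded(pending: int, max_pending: int = 10) -> bool:
    """Simplified SLO check: True when more jobs are pending than 'max' allows."""
    return pending > max_pending
```

For the qstat -f output above, which lists jobs 44 to 69 as pending, count_pending_jobs returns 26, well above the limit of 10.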
node1# sdmadm sr
service id state type flags usage annotation
--------------------------------------------------------------
gesvc node1 ASSIGNED host S inf Got execd update event
node2 ASSIGNED host inf Got execd update event
node3 ASSIGNED host 99 Got execd update event
node3# sdmadm sr
service id state type flags usage annotation
--------------------------------------------------------------
gesvc node1 ASSIGNED host S 50 Got execd update event
node2 ASSIGNED host 50 Got execd update event
node3 ASSIGNED host 50 Got execd update event
However, with the current configuration, the execution hosts remain in the GE service even after all the jobs have completed. Their usage drops back to 50, the urgency of the FixedUsageSLO, but they are not returned to the spare_pool service from which they were drawn.
To return these resources to the spare_pool service, raise the urgency of the PermanentRequestSLO used by the spare_pool service (originally 1) above the urgency of the FixedUsageSLO of the gesvc service (50). In this example, it has been changed to 51.
node1# sdmadm mc -c spare_pool
...
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<common:componentConfig xsi:type="spare_pool:SparePoolServiceConfig"
xmlns:executor="http://hedeby.sunsource.net/hedeby-executor"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:spare_pool="http://hedeby.sunsource.net/hedeby-sparepool"
xmlns:reporter="http://hedeby.sunsource.net/hedeby-reporter"
xmlns:security="http://hedeby.sunsource.net/hedeby-security"
xmlns:resource_provider="http://hedeby.sunsource.net/hedeby-resource-provider"
xmlns:common="http://hedeby.sunsource.net/hedeby-common"
xmlns:ge_adapter="http://hedeby.sunsource.net/hedeby-gridengine-adapter">
<common:slos>
<common:slo xsi:type="common:PermanentRequestSLOConfig"
quantity="10"
urgency="51"
name="PermanentRequestSLO">
<common:request>type = "host"</common:request>
</common:slo>
</common:slos>
</common:componentConfig>
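The effect of these urgency values can be illustrated with a deliberately simplified model. It is not SDM's actual resource-provider algorithm. It assumes the following:

- gesvc asks for every free host with urgency 99 while more than 10 jobs are pending, and otherwise holds its hosts with urgency 50;
- a non-static host moves only to a service with a strictly higher urgency;
- static hosts never move.

The names gesvc_urgency and arbitrate are hypothetical.

```python
# Values mirror the configurations shown in this post.
MAX_PENDING = 10          # MaxPendingJobsSLO max (gesvc)
MAX_PENDING_URGENCY = 99  # MaxPendingJobsSLO urgency (gesvc)
FIXED_USAGE_URGENCY = 50  # FixedUsageSLO urgency (gesvc)


def gesvc_urgency(pending_jobs: int) -> int:
    """Urgency with which gesvc wants (or keeps) hosts in this simplified model."""
    if pending_jobs > MAX_PENDING:
        return MAX_PENDING_URGENCY
    return FIXED_USAGE_URGENCY


def arbitrate(owners: dict[str, str], static: set[str],
              pending_jobs: int, spare_pool_urgency: int) -> dict[str, str]:
    """Return the host -> service assignment after one simplified SLO round."""
    urgency = {"gesvc": gesvc_urgency(pending_jobs),
               "spare_pool": spare_pool_urgency}
    result = {}
    for host, service in owners.items():
        if host in static:
            result[host] = service  # static resources never leave their service
            continue
        best = max(urgency, key=urgency.get)
        # Move only on a strictly higher urgency; ties keep the current owner.
        result[host] = best if urgency[best] > urgency[service] else service
    return result
```

With the original PermanentRequestSLO urgency of 1, an idle gesvc (50) keeps node2 and node3. With 51, they move back to spare_pool, and the next burst (99) pulls them into gesvc again. node1 stays in gesvc throughout because it is static.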
Once these changes take effect, the execution hosts that were added to the GE cluster are removed from it and returned to the spare_pool service. How often the SLOs are re-evaluated is controlled by the SLO update interval, which is currently set to 5 minutes in the GE service configuration:
<ge_adapter:sloUpdateInterval unit="minutes"
value="5"/>
As I mentioned before, node1 remains in the GE service because it is flagged as a static resource.
node1# qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@node1 BIP 0/0/4 0.26 sol-sparc64
node1# sdmadm sr
service id state type flags usage annotation
-----------------------------------------------------------------
gesvc node1 ASSIGNED host S 50 Got execd update event
spare_pool node2 ASSIGNED host 51
node3 ASSIGNED host 51
Another burst of jobs causes SDM to add execution hosts to the GE cluster again:
node1# qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@node1 BIP 0/4/4 0.22 sol-sparc64
213 0.55500 sleep root r 07/09/2008 14:34:57 1
218 0.55500 sleep root r 07/09/2008 14:35:00 1
219 0.55500 sleep root r 07/09/2008 14:35:00 1
220 0.55500 sleep root r 07/09/2008 14:35:00 1
---------------------------------------------------------------------------------
all.q@node2 BIP 0/4/4 0.21 sol-sparc64
214 0.55500 sleep root r 07/09/2008 14:34:57 1
215 0.55500 sleep root r 07/09/2008 14:34:57 1
216 0.55500 sleep root r 07/09/2008 14:34:57 1
217 0.55500 sleep root r 07/09/2008 14:34:57 1
---------------------------------------------------------------------------------
all.q@node3 BIP 0/4/4 0.21 sol-sparc64
221 0.55500 sleep root r 07/09/2008 14:39:47 1
222 0.55500 sleep root r 07/09/2008 14:39:47 1
223 0.55500 sleep root r 07/09/2008 14:39:47 1
224 0.55500 sleep root r 07/09/2008 14:39:47 1
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
225 0.00000 sleep root qw 07/09/2008 14:25:08 1
226 0.00000 sleep root qw 07/09/2008 14:29:41 1
227 0.00000 sleep root qw 07/09/2008 14:29:52 1
228 0.00000 sleep root qw 07/09/2008 14:29:52 1
229 0.00000 sleep root qw 07/09/2008 14:29:53 1
node1# sdmadm sr
service id state type flags usage annotation
--------------------------------------------------------------
gesvc node1 ASSIGNED host S 99 Got execd update event
node2 ASSIGNED host 99 Got execd update event
node3 ASSIGNED host 99 Got execd update event
When all the jobs have finished, the execution hosts without the static flag are returned to the spare_pool service.
node1# sdmadm sr
service id state type flags usage annotation
-----------------------------------------------------------------
gesvc node1 ASSIGNED host S 50 Got execd update event
spare_pool node2 ASSIGNED host 51
node3 ASSIGNED host 51
Posted at 03:07PM Jul 09, 2008 by Chansup Byun in Grid | Comments[0]
Wednesday Jul 09, 2008