Richard Hierlmeier's Weblog
Friday Jul 17, 2009
Tips & Tricks for SDM Cloud Adapter with Zones
The zones scripts for the SDM Cloud Adapter are not really feature complete. A reboot of the master host can lead to the following problems:
- The SDM system does not start up automatically. You have to start it with
sdmadm startup_jvm -force
The -force option is needed because the SDM system does not clean up the pid file at shutdown.
- The sge service might go into ERROR state after startup.
# sdmadm ss
host service cstate sstate
-------------------------------------
lappy sge STOPPED ERROR
spare_pool STARTED RUNNING
zones STARTED RUNNING
The reason can be that the qmaster has not been started automatically. Start the qmaster and start up the sge service again:
# /gridware/sge/default/common/sgemaster
# sdmadm suc -c sge -h localhost
comp host message
---------------------------------
sge lappy startup triggered
# sdmadm ss
host service cstate sstate
-------------------------------------
lappy sge STARTED RUNNING
spare_pool STARTED RUNNING
zones STARTED RUNNING
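To spot this situation after a reboot without reading the table by hand, the output of sdmadm ss can be filtered for services in ERROR state. The following is only a minimal sketch: the function name services_in_error is made up for illustration, and it assumes the whitespace-separated column layout shown above, with the service name in the third column from the right.

```shell
#!/bin/sh
# Hypothetical helper (not part of SDM): reads "sdmadm ss" output on stdin
# and prints the name of every service whose sstate column is ERROR.
# Assumes the last three columns are: service, cstate, sstate.
services_in_error() {
    awk 'NF >= 3 && $NF == "ERROR" { print $(NF-2) }'
}
```

It would be used as `sdmadm ss | services_in_error`; an empty result means no service reports ERROR.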
- The zones service goes into error recovery mode. You will find the following error messages in the log file:
# tail /var/sdm/sdm1/log/cs_vm-0.log
07/01/2009 11:27:18|21|...|I|Service zones: Started up 1 cloud hosts: [[hostname: z1, instanceId: i-z1, launchTime: 2009-07-01T11:27:18.000Z] ].
07/01/2009 11:27:20|21|...|W|Service zones:The registered set of cloud host does not match the reported set! Registered mismatches [[hostname: z1, instanceId: i-z1, launchTime: 2009-07-01T11:27:18.000Z] ]. Reported mismatches []
07/01/2009 11:27:20|21|...|W|Service zones:Problem: VPN server z1 is corrupted! The cloud host is not reported anymore!
The problem is that the zones (and the SDM components on the zones) are not started automatically. Normally the zones script should detect this problem and recover from it. However, the recovery is not implemented.
To solve the problem, shut down the SDM system:
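To check whether a system is in this state, the warnings of the zones service can be filtered out of the log. This is just a sketch: the function name zones_warnings is an assumption, and it relies only on the `|W|Service zones:` marker visible in the log lines above.

```shell
#!/bin/sh
# Hypothetical helper (not part of SDM): prints all warning lines of the
# zones service from the given log file(s). Matches the literal marker
# "|W|Service zones:" as it appears in the log excerpt above.
zones_warnings() {
    grep -F '|W|Service zones:' "$@"
}
```

For example, `zones_warnings /var/sdm/sdm1/log/cs_vm-0.log` lists the mismatch warnings; grep returns a non-zero exit status when there are none.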
# sdmadm sdj -all -h localhost
Then clean up the spool directory on the master host:
# rm `find /var/spool/sdm/sdm1/spool/*.srf`
# rm /var/spool/sdm/sdm1/spool/cloud_hosts.spool
Finally you can restart the system:
# sdmadm suj
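The three recovery steps above can be wrapped into one shell function. Treat this as a sketch under assumptions, not as part of the zones scripts: the function name recover_zones and the variables SDMADM and SPOOL_DIR are invented for illustration, and the plain `*.srf` glob is assumed to cover the same files as the find call above.

```shell
#!/bin/sh
# Hypothetical wrapper around the manual recovery steps.
# SDMADM and SPOOL_DIR are illustrative variables, not defined by SDM.
SDMADM="${SDMADM:-sdmadm}"
SPOOL_DIR="${SPOOL_DIR:-/var/spool/sdm/sdm1/spool}"

recover_zones() {
    # 1. Shut down the SDM system on this host.
    $SDMADM sdj -all -h localhost || return 1
    # 2. Remove the *.srf files and the cloud host spool file.
    rm -f "$SPOOL_DIR"/*.srf "$SPOOL_DIR/cloud_hosts.spool" || return 1
    # 3. Start the system again.
    $SDMADM suj
}
```

Calling `recover_zones` as root on the master host runs the steps in order and stops at the first one that fails.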
To avoid this problem, you have to remove all zone resources from the SDM system before you shut down the host. This can easily be done by setting the min and max attributes of the resource amount optimizer to 0. The zones service will then shut down all zones:
# sdmadm mc -c zones
<common:componentConfig xsi:type="cloud_adapter:CloudAdapterConfig"
...
<cloud_adapter:optimizer xsi:type="cloud_adapter:MinMaxResourceAmountOptimizerConfig"
max="0"
min="0">
...
</common:componentConfig>
# sdmadm uc -c zones
After a short time all resources from the zones service will disappear. You can then shut down the host.
Posted at 12:20PM Jul 17, 2009 by rhierlmeier in Service Domain Manager