Sunday Jan 13, 2008
start/stop of Common Agent Container hangs
One question that comes up more and more often is:
"Why can't I stop/start the Common Agent Container?"
or
"Why does the /usr/sbin/cacaoadm stop/start command never return?"
Users facing this are on Solaris 10/11. It is caused by user modules that execute sub processes and
do not clean up properly when they are asked to stop. Another cause is a wrong implementation of a user module's start/stop methods.
There are at least two rules that a module must follow:
- The start and stop methods of a module must be relatively quick and must always return.
The Common Agent Container starts the registered modules sequentially. Its overall
start sequence is only complete when all registered modules have started.
The same holds, in reverse, for the container shutdown.
If a module never returns from its start/stop method, it blocks the whole sequence
and the container never finishes its start/stop sequence. A module which has a complex initialisation/termination phase (like connecting to a database, ...) should
delegate every action that may hang to a separate thread.
- The stop method of a module must ensure that all resources created by the module are cleaned up.
A module can be locked or unlocked (started/stopped) dynamically at runtime. Its life-cycle may not be aligned with the container life-cycle. This implies that the
module must clean up whatever it creates. I.e. if a module leaves things behind, then
depending on its logic, a future (re)start of it may fail. This is particularly true for modules executing sub processes on Solaris 10 and later.
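To make the first rule concrete, here is a minimal sketch of a module whose start method hands the slow work to a worker thread and whose stop method waits only for a bounded time. The class name SlowInitModule, its method names and the connectToBackend() placeholder are assumptions made for illustration; this is not the real Common Agent Container module API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical module skeleton; not the real container module interface. */
class SlowInitModule {
    private ExecutorService worker;
    private Future<?> initTask;
    private volatile boolean ready;

    /** Returns at once: the potentially blocking work runs on a worker thread. */
    synchronized void start() {
        worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "slow-init");
            t.setDaemon(true); // never keeps the JVM alive on its own
            return t;
        });
        initTask = worker.submit(() -> {
            try {
                connectToBackend();
                ready = true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop() asked us to give up
            }
        });
    }

    /** Cancels pending work and waits a bounded time, so it always returns. */
    synchronized void stop() {
        if (worker == null) {
            return;
        }
        initTask.cancel(true);   // interrupts the worker if it is still busy
        worker.shutdownNow();
        try {
            worker.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        worker = null;
        ready = false;
    }

    boolean isReady() {
        return ready;
    }

    /** Stand-in for a slow action such as opening a database connection. */
    protected void connectToBackend() throws InterruptedException {
        Thread.sleep(60_000);
    }
}
```

With this shape, a backend that never answers delays only the module's own readiness, not the start or stop sequence of the container.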
On S1x the Common Agent Container is registered as an SMF service. Processes
created by a service live in a process contract (see contract(4)).
A process contract has a lifetime; by default it lasts as long as there are
processes running in it (the contract is not empty). So, like any other service, the Common Agent Container has an associated contract.
# svcs -p -o CTID,SVC svc:/application/management/common-agent-container-1:default
CTID   SVC
4924   application/management/common-agent-container-1
       18:40:23     2536 launch
       18:40:23     2537 java
# /usr/bin/ptree -c 2537
[process contract 1]
1 /sbin/init
[process contract 4]
7 /lib/svc/bin/svc.startd
[process contract 4924]
2536 /usr/lib/cacao/lib/tools/launch -w /usr/lib/cacao -f -U root -G sys -- /usr/jdk
2537 /usr/jdk/jdk1.6.0_03/bin/java -Xms4M -Xmx128M -classpath /usr/share/lib/jdmk/jd
SMF considers the service "terminated" only when the associated process contract becomes empty. A user module executing sub processes (using Runtime.exec(...)) populates the contract with new processes. As an example, I wrote a simple module executing a ping command. You can see that the ping process (2787) is now part of the contract:
# ctstat -v -i `/usr/bin/svcs -H -o CTID svc:/application/management/common-agent-container-1:default`
CTID    ZONEID  TYPE    STATE   HOLDER  EVENTS  QTIME   NTIME
4924    0       process owned   7       0       -       -
        cookie:                0x20
        informative event set: none
        critical event set:    core signal hwerr empty
        fatal event set:       none
        parameter set:         inherit regent
        member processes:      2536 2537 2785 2787
        inherited contracts:   none
# pargs 2787
2787:   /usr/sbin/ping -s 127.0.0.1
argv[0]: /usr/sbin/ping
argv[1]: -s
argv[2]: 127.0.0.1
When I try to stop the Common Agent Container, the associated svcadm(1M) command hangs waiting for the service to terminate, i.e. waiting for the contract to become empty. The Common Agent Container is properly stopped (the JVM is no longer running) but the command never returns until I kill the orphan process (pid = 2787). You can also notice that the service state is now wrong: the real logic has gone but the state is still online.
# /usr/sbin/cacaoadm stop
(we are hanging here...)
# ptree -c `pgrep -x cacaoadm`
[process contract 1]
1 /sbin/init
[process contract 4]
7 /lib/svc/bin/svc.startd
[process contract 87]
1346 gnome-terminal
27663 /bin/bash
27699 -bash
2842 /bin/sh /usr/sbin/cacaoadm stop
2951 /usr/sbin/svcadm disable -st svc:/application/management/common-agent-container
# svcs -p -o CTID,STATE,SVC svc:/application/management/common-agent-container-1:default
CTID   STATE    SVC
4924   online*  application/management/common-agent-container-1
       18:52:26     2787 ping
# kill -INT 2787
(cacaoadm command is now completed)
What should users do about that? There are two solutions:
- The module keeps a list of all the commands it has executed and terminates the ones that are still running
during its stop method.
- All sub commands are executed using the container's InvokeCommand or UserProcess class. These two helpers execute the command and take care of
the sub-contract: the sub command is launched in a sub-contract using the
ctrun(1) command.
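The first solution can be sketched as follows. This is only an illustration under assumptions: the TrackedCommands class and its method names are hypothetical, and it relies on Process.descendants() and ProcessBuilder.Redirect.DISCARD, which require Java 9 or later. The module would call terminateAll() from its stop method.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: remembers every sub process a module launches. */
final class TrackedCommands {
    private final List<Process> launched = new ArrayList<>();

    /** Launches a command and records it so that stop can find it later. */
    synchronized Process exec(String... command) {
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid a full pipe blocking the child
                    .start();
            launched.add(p);
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Kills whatever is still running; returns how many processes had to be killed. */
    synchronized int terminateAll() {
        int killed = 0;
        for (Process p : launched) {
            if (!p.isAlive()) {
                continue;
            }
            // Ask grandchildren first: they would otherwise stay in the contract.
            p.descendants().forEach(ProcessHandle::destroy);
            p.destroy();
            try {
                if (!p.waitFor(2, TimeUnit.SECONDS)) {
                    p.destroyForcibly();
                    p.waitFor(2, TimeUnit.SECONDS);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                p.destroyForcibly();
            }
            killed++;
        }
        launched.clear();
        return killed;
    }
}
```

The waits are bounded on purpose, so that the stop method itself still returns quickly even if a sub process refuses to die.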
I ran the previous example again, but this time a second ping command is launched using the InvokeCommand class. You can see two ping commands: the one launched through InvokeCommand (pid 3746, under ctrun) runs in its own contract (4950), while the other one (pid 3745) remains in the container's contract (4947).
# svcs -p -o CTID,STATE,SVC svc:/application/management/common-agent-container-1:default
CTID   STATE    SVC
4947   online   application/management/common-agent-container-1
       19:32:33     3590 launch
       19:32:33     3591 java
       19:33:42     3744 ctrun
       19:33:42     3745 ping
# ptree 3591
3590 /usr/lib/cacao/lib/tools/launch -w /usr/lib/cacao -f -U root -G sys -- /usr/jdk
3591 /usr/jdk/jdk1.6.0_03/bin/java -Xms4M -Xmx128M -classpath /usr/share/lib/jdmk/jd
3744 /usr/bin/ctrun -l child -o pgrponly /usr/lib/cacao/lib/tools/suexec /usr/sbin/p
3746 /usr/sbin/ping -s localhost
# ptree -c 3591
[process contract 1]
1 /sbin/init
[process contract 4]
7 /lib/svc/bin/svc.startd
[process contract 4947]
3590 /usr/lib/cacao/lib/tools/launch -w /usr/lib/cacao -f -U root -G sys -- /usr/jdk
3591 /usr/jdk/jdk1.6.0_03/bin/java -Xms4M -Xmx128M -classpath /usr/share/lib/jdmk/jd
3744 /usr/bin/ctrun -l child -o pgrponly /usr/lib/cacao/lib/tools/suexec /usr/sbin/p
[process contract 4950]
3746 /usr/sbin/ping -s localhost
# ptree -c 3745
[process contract 1]
1 /sbin/init
[process contract 4]
7 /lib/svc/bin/svc.startd
[process contract 4947]
3745 /usr/sbin/ping -s 127.0.0.1
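For a module that cannot use the container helpers, the same effect can be approximated by prefixing the command with ctrun, using the options visible in the ptree output above. The sketch below is an assumption-laden illustration, not the InvokeCommand implementation: the SubContractLauncher class is hypothetical, and it leaves out the suexec step that the container inserts.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; NOT the container's InvokeCommand implementation. */
final class SubContractLauncher {

    /** Prefixes a command with the ctrun invocation seen in the ptree output. */
    static List<String> wrap(String... command) {
        List<String> argv = new ArrayList<>(Arrays.asList(
                "/usr/bin/ctrun",
                "-l", "child",      // ctrun lives as long as the command it started
                "-o", "pgrponly")); // option copied as-is from the ptree output
        argv.addAll(Arrays.asList(command));
        return argv;
    }

    /** Solaris only: starts the command in a process contract of its own. */
    static Process launch(String... command) throws IOException {
        return new ProcessBuilder(wrap(command)).start();
    }
}
```

For example, SubContractLauncher.wrap("/usr/sbin/ping", "-s", "localhost") yields the argument list /usr/bin/ctrun -l child -o pgrponly /usr/sbin/ping -s localhost, so the ping process no longer populates the container's own contract.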
When the application is a container which may run external code executing sub-processes, developers should keep contracts in mind. For a C application, libcontract(3LIB) is there to help; for a Java application this becomes a little more problematic.
Another point which can be a problem is that the critical events of a process contract can include the core dump of a process or a signal received by it. For instance, take the Common Agent Container case: if a process executed by one of the registered modules crashes or is killed by a signal, the entire contract will be restarted, and so will the container... oops...
Posted at 04:56PM Jan 13, 2008 by ejannett in Common Agent Container FAQ