Thursday Jul 14, 2005

Someone asked on an internal Sun alias how svc.startd(1M) determines whether a service has faulted and when to put the service into maintenance. Unfortunately, this is not well described in our existing manpages, but I'm working on that. Still, it takes a little while for manpage changes to propagate into the mainline Solaris release, so I figured I'd include my description from email here. A more formal version will be coming, and I'll update this post if subsequent questions yield a better description.

It is important to mention first that svc.startd(1M) offers three separate service models: contract, transient, and wait. These are described in the Service Developer Introduction. Here I'll only touch on the fault/retry behavior of the two common models, contract and transient.

Next, I'd like to point out that there's a distinction between method failures and service failures, from svc.startd(1M)'s point of view. So, I'll go over each type of failure and how it is handled.

svc.startd(1M) considers a method to have failed if it returns a non-zero exit code. A method failure puts the service into the maintenance state immediately if the exit code is $SMF_EXIT_ERR_CONFIG or $SMF_EXIT_ERR_FATAL. Any other failure sends the service back to offline. Remember, as smf(5) describes, if a service is offline and its dependencies are satisfied, we try to start the service again. But if three method failures happen in a row, or if the service is restarting too quickly, the service goes into maintenance.
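
To make the method-failure rules concrete, here is a rough Python sketch of the decision. It is only a model of the behavior described above, not svc.startd(1M)'s actual code. The function name next_state and the state strings are made up for illustration, the numeric exit codes are the values I recall from /lib/svc/share/smf_include.sh (check your own copy), and the "restarting too quickly" check is left out because its threshold isn't given here.

```python
# Exit-code values as recalled from /lib/svc/share/smf_include.sh;
# treat them as assumptions and verify against your system.
SMF_EXIT_OK = 0
SMF_EXIT_ERR_FATAL = 95
SMF_EXIT_ERR_CONFIG = 96

# "3 method failures in a row" from the description above.
MAX_CONSECUTIVE_FAILURES = 3


def next_state(exit_code, prior_failures):
    """Return (outcome, consecutive_failures) after one method run.

    Hypothetical helper: "ok" means the method did not fail; the other
    outcomes name the state the service moves to.
    """
    if exit_code == SMF_EXIT_OK:
        # Success resets the run of consecutive failures.
        return "ok", 0
    failures = prior_failures + 1
    if exit_code in (SMF_EXIT_ERR_CONFIG, SMF_EXIT_ERR_FATAL):
        # These two codes mean retrying will not help.
        return "maintenance", failures
    if failures >= MAX_CONSECUTIVE_FAILURES:
        # Too many failures in a row: stop retrying.
        return "maintenance", failures
    # Any other failure: back to offline, where a retry can happen
    # once dependencies are satisfied.
    return "offline", failures
```

For example, next_state(1, 0) sends the service back to offline with one failure recorded, while next_state(1, 2) is the third failure in a row and lands in maintenance.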

A service failure is determined by a combination of the service model (transient or contract) and the value of the startd/ignore_error property.

A contract type service is considered to have failed if any of the following conditions occur:

  • all processes in the service exit

  • any process in the service dumps core

  • a process outside the service sends a service process a fatal signal (e.g. an admin pkills a service process)

The service can tell svc.startd(1M) to ignore the latter two of these conditions by specifying core and/or signal in startd/ignore_error. All of these service failures are detected by contract events. I've talked earlier about contracts and fault isolation in smf(5) too.

Defining a service as transient means that svc.startd(1M) doesn't track processes for that service, so none of the service errors above matter. Thus, a transient service only goes to maintenance if a method failure occurs.
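
The service-failure rules for both models can be summarized in a few lines of Python. Again, this is just a sketch of the logic as described in this post, under some assumptions: service_failed is a hypothetical name, the event strings "empty", "core", and "signal" are labels I chose for the three conditions listed above, and ignore_error stands in for the contents of the startd/ignore_error property.

```python
def service_failed(model, event, ignore_error=frozenset()):
    """Decide whether an event counts as a service failure.

    model        -- "contract" or "transient" (wait is not covered here)
    event        -- illustrative labels: "empty" (all processes exited),
                    "core" (a process dumped core), "signal" (fatal
                    signal from outside the service)
    ignore_error -- set of values from startd/ignore_error,
                    i.e. some subset of {"core", "signal"}
    """
    if model == "transient":
        # Processes are not tracked, so no event is a service failure.
        return False
    if model != "contract":
        raise ValueError("only contract and transient are modeled here")
    if event == "empty":
        # Cannot be ignored: the service has no processes left.
        return True
    if event in ("core", "signal"):
        # Counts as a failure unless the service opted out.
        return event not in ignore_error
    return False
```

So service_failed("contract", "core", {"core"}) is False, while the same event with an empty ignore_error is a failure, and a transient service never fails this way regardless of the event.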


This blog copyright 2009 by lianep