Wednesday Jul 27, 2005

A really simple tip today. If you want rpcbind(1M) or any other service running while you're booted to single-user mode (boot -s) or netbooted to single-user for repairs, all you need to do is temporarily enable the service. A temporary enable (svcadm enable -t) gets around the repository not yet being writable, and it only lasts until the next reboot -- after all, you weren't looking for your change to take effect outside of your maintenance environment.

You may also find that the service you want requires other services. svcs -l will show you the complete dependency list with current states displayed. Rather than enabling each of them by hand, you can use the -r option to svcadm enable to recursively enable every service that is required. So, to temporarily enable rpcbind(1M) and all the services it requires while in single-user mode, use:

   # svcadm enable -rt rpc/bind
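If you find yourself doing this for several services during a repair session, the enable and the follow-up check can be wrapped in a small function. This is only a sketch: enable_for_maintenance and the RUN dry-run hook are names I made up for illustration, and the only real commands involved are svcadm enable -rt and svcs -l.

```shell
#!/bin/sh
# Sketch: temporarily enable a service plus everything it requires,
# then list its dependencies and their current states.
# RUN is a hypothetical dry-run hook: set RUN=echo to print the
# commands instead of running them.
enable_for_maintenance() {
    fmri=$1
    ${RUN:-} svcadm enable -rt "$fmri" || return 1
    ${RUN:-} svcs -l "$fmri"
}
```

Calling enable_for_maintenance rpc/bind runs the same command shown above and then shows what svcs -l reports for the service.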

As promised, pretty simple. Someone commented this isn't well covered in our current documentation set. I've filed a bug, so you should see a similar task in the System Administration Guide in a future release.


Thursday Jul 14, 2005

Someone asked on an internal Sun alias how svc.startd(1M) determines whether there was a fault in the service and when to put the service in maintenance. Unfortunately, this is not well-described in our existing manpages, but I'm working on that. Still, it takes a little while for manpage changes to propagate into the mainline Solaris release so I figured I'd include my description from email here. A more formal version will be coming, and I'll update this post if subsequent questions yield a better description.

It is important to mention first that svc.startd(1M) offers three separate service models: contract, transient, and wait. These are described in the Service Developer Introduction. I'll only touch on the fault/retry models for the two common ones, contract and transient, here.

Next, I'd like to point out that there's a distinction between method failures and service failures, from svc.startd(1M)'s point of view. So, I'll go over each type of failure and how it is handled.

svc.startd(1M) believes a method has failed if it returns a non-zero exit code. Method failures cause a service to go into the maintenance state immediately if the exit code is $SMF_EXIT_ERR_CONFIG or $SMF_EXIT_ERR_FATAL. Any other failure sends the service back to offline. Remember, as smf(5) describes, if a service is offline and its dependencies are satisfied, we try to start the service again. But if three method failures happen in a row, or if the service is restarting too quickly, that service will go into maintenance.
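To make those rules concrete, here is a rough model of the decision in shell. It is not how svc.startd(1M) is implemented; method_outcome is a hypothetical helper, the numeric values are the ones defined in /lib/svc/share/smf_include.sh, and the "restarting too quickly" check is left out because it depends on timing.

```shell
#!/bin/sh
# Exit code values as defined in /lib/svc/share/smf_include.sh.
SMF_EXIT_OK=0
SMF_EXIT_ERR_FATAL=95
SMF_EXIT_ERR_CONFIG=96

# method_outcome EXIT_CODE [PRIOR_FAILURES]
# Hypothetical helper: prints what happens to the service when a method
# exits with EXIT_CODE after PRIOR_FAILURES consecutive earlier failures.
method_outcome() {
    code=$1
    prior=${2:-0}
    if [ "$code" -eq "$SMF_EXIT_OK" ]; then
        echo "ok"                  # not a method failure
    elif [ "$code" -eq "$SMF_EXIT_ERR_CONFIG" ] ||
         [ "$code" -eq "$SMF_EXIT_ERR_FATAL" ]; then
        echo "maintenance"         # immediate, no retry
    elif [ $((prior + 1)) -ge 3 ]; then
        echo "maintenance"         # third failure in a row
    else
        echo "offline"             # will be retried
    fi
}
```

For example, method_outcome 1 0 models a first ordinary failure, and method_outcome 1 2 models the third one in a row.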

A service failure is determined by a combination of the service model (transient or contract) and the value of the startd/ignore_error property.

A contract type service is considered to have failed if any of the following conditions occur:

  • all processes in the service exit

  • any process in the service dumps core

  • a process outside the service sends a service process a fatal signal (e.g. an admin pkills a service process)

The latter two of these conditions may be ignored by the service by specifying core and/or signal in startd/ignore_error. All of these service failures are detected by contract events. I've talked earlier about contracts and fault isolation in smf(5) too.
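As an illustration of setting that property, the sketch below uses svccfg setprop followed by svcadm refresh. Treat it as an assumption-laden example: ignore_core_and_signal and the RUN dry-run hook are made-up names, the FMRI is whatever you pass in, and it assumes the service already has a startd property group.

```shell
#!/bin/sh
# Sketch: tell svc.startd to ignore core dumps and external fatal
# signals for one contract service.
# RUN is a hypothetical dry-run hook: set RUN=echo to print the
# commands instead of running them.
ignore_core_and_signal() {
    fmri=$1
    # Assumes the startd property group already exists on the service.
    ${RUN:-} svccfg -s "$fmri" setprop startd/ignore_error = astring: "core,signal" || return 1
    # Refresh so the changed property is picked up.
    ${RUN:-} svcadm refresh "$fmri"
}
```

To ignore only one of the two conditions, the value would be just core or just signal.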

Defining a service as transient means that svc.startd(1M) doesn't track processes for that service, so none of the service errors above matter. Thus, a transient service only goes to maintenance if a method failure occurs.
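Since the method's exit code is the only thing that can send a transient service to maintenance, it is worth being deliberate about it. Here is a minimal sketch of a start routine that reports a missing configuration file with $SMF_EXIT_ERR_CONFIG. The start_method name and the configuration file argument are hypothetical; the fallback values mirror /lib/svc/share/smf_include.sh for systems where that file isn't present.

```shell
#!/bin/sh
# On Solaris, pick up the SMF_EXIT_* definitions from the standard include.
if [ -r /lib/svc/share/smf_include.sh ]; then
    . /lib/svc/share/smf_include.sh
fi
# Fallbacks (same values as smf_include.sh) so the sketch runs elsewhere.
: "${SMF_EXIT_OK:=0}" "${SMF_EXIT_ERR_CONFIG:=96}"

# start_method CONFIG_FILE
# Hypothetical start routine: a missing or unreadable configuration is
# reported as a configuration error, which means maintenance right away.
start_method() {
    conf=$1
    if [ ! -r "$conf" ]; then
        echo "missing configuration: $conf" >&2
        return "$SMF_EXIT_ERR_CONFIG"
    fi
    # ... the real setup work for the service would go here ...
    return "$SMF_EXIT_OK"
}
```

A real method script would finish by calling the routine and exiting with its status, so that svc.startd(1M) sees the code.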


This blog copyright 2009 by lianep