"I'm sorry Mr. Bond, even if you are a secret agent for Her Majesty's secret service, now that you have seen the source code of Solaris I am afraid that you leave me no option other than to kill you. These Sun Blades are not sharp to the touch however when dropped from a height, ha, ha, ha."
"I'm sorry to disappoint you Dr. X but have you seen this CDDL license for OpenSolaris?"
My first integration into Solaris 10 was a fix for bug 4530367. The observed issue was that poor communication from a Domain Name System (DNS) server resulted in the client making fewer retries when attempting to resolve host names.
Those familiar with DNS will know that the resolver(3RESOLV) by default may query up to three name servers (configured in resolv.conf(4)) up to four times each (one initial query plus three retries), making a total number of twelve requests.
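The worst-case arithmetic can be spelled out in a few lines. This is only an illustration of the defaults described above; the names are made up for the example and are not part of the resolver API.

```python
# Worst-case number of DNS queries for one lookup, using the defaults
# described above. All names here are illustrative only.
NAMESERVERS = 3                      # name servers consulted from resolv.conf
RETRIES = 3                          # retries after the initial query
ATTEMPTS_PER_SERVER = 1 + RETRIES    # one initial query plus the retries


def worst_case_queries(nameservers=NAMESERVERS, attempts=ATTEMPTS_PER_SERVER):
    """Total queries sent if no name server ever answers."""
    return nameservers * attempts


print(worst_case_queries())    # 12
print(worst_case_queries(1))   # 4, for a client with a single name server
```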
Solaris provides a configurable name service known as the Name Service Switch (NSS). This is configured by the administrator in nsswitch.conf(4). Applications using the Networking Services Library Functions (gethostbyname(3NSL) and the like) are therefore under control of the NSS.
For DNS, NSS also provides the same 'resolver' functionality by default: within nsswitch.conf the default 'dns' hosts entry translates to [SUCCESS=return NOTFOUND=continue UNAVAIL=continue TRYAGAIN=3].
Or to put it another way: internally, the state of NSS and nsswitch.conf is recorded in an external static structure (nss_db_state), which is initialised on the first invocation of the NSS front end. For each 'status' there is an 'action'. For 'dns', the default action for the 'tryagain' status is 'tryagain_ntimes', where 'n' is 3, equivalent to 'dns [tryagain=3]' in nsswitch.conf.
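To make the status-to-action mapping concrete, here is a minimal sketch that turns a criteria string into a lookup table. It is not the Solaris parser and says nothing about how nss_db_state is laid out; parse_criteria and the tuple representation are assumptions made for the example.

```python
def parse_criteria(text):
    """Parse '[STATUS=action ...]' into {status: (action, count)}.

    Hypothetical helper for illustration; the real parsing lives in the
    NSS front end and is not reproduced here.
    """
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("criteria must be enclosed in [ ]")
    actions = {}
    for pair in body[1:-1].split():
        status, _, action = pair.partition("=")
        status, action = status.lower(), action.lower()
        if action.isdigit():
            # A bare number n is treated as 'tryagain_ntimes' with that n.
            actions[status] = ("tryagain_ntimes", int(action))
        else:
            actions[status] = (action, None)
    return actions


default_dns = parse_criteria(
    "[SUCCESS=return NOTFOUND=continue UNAVAIL=continue TRYAGAIN=3]")
print(default_dns["tryagain"])   # ('tryagain_ntimes', 3)
print(default_dns["notfound"])   # ('continue', None)
```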
Solaris 8 saw an overhaul of the NSS functions and also added an undocumented Name Service Switch environment variable, NSS_OPTIONS. Using this variable it is possible to see how NSS is applying its configuration. Using a test server that receives DNS queries but replies only to certain names, and a client with only one name server entry, the following trace was recorded:
    # nscd -e hosts,no
    # NSS_OPTIONS='debug_eng_loop=1'
    # export NSS_OPTIONS
    # getent hosts host1 host2
    NSS_retry(0): 'services': trying 'files' ... result=SUCCESS, action=RETURN
    NSS: 'services': return.
    NSS_retry(0): 'hosts': trying 'files' ... result=NOTFOUND, action=CONTINUE
    NSS: 'hosts': continue ...
    NSS_retry(0): 'hosts': trying 'dns' ... result=TRYAGAIN, action=TRYAGAIN_NTIMES (N=3)
    NSS_retry(2): 'hosts': trying 'dns' ... result=TRYAGAIN, action=TRYAGAIN_NTIMES (N=3)
    NSS_retry(3): 'hosts': trying 'dns' ... result=TRYAGAIN, action=TRYAGAIN_NTIMES (N=3)
    NSS: 'hosts': return.
    NSS_retry(0): 'hosts': trying 'files' ... result=NOTFOUND, action=CONTINUE
    NSS: 'hosts': continue ...
    NSS_retry(0): 'hosts': trying 'dns' ... result=TRYAGAIN, action=CONTINUE
    NSS: 'hosts': return.
Note that, as the bug report described, the action has changed from TRYAGAIN_NTIMES to CONTINUE. In nsswitch.conf terms, the rule has changed to [SUCCESS=return NOTFOUND=continue UNAVAIL=continue TRYAGAIN=continue].
As a result, subsequent lookups were not retried: a client configured with only one name server would make only one attempt through dns for each later lookup.
The cause of this change in behaviour was then easy to track down to retry_test(). This function changed the action from 'tryagain_ntimes' to 'continue' if the name service still returned the status 'tryagain' after the configured number of successive retries. The idea was to remove the burden of performing further lookups against what appears to be a malfunctioning service. This limitation, however, persisted for the remaining lifetime of the process. For long-running processes such as nscd(1M) this was disastrous.
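The effect can be reproduced with a toy model. The sketch below is not the Solaris source and does not mirror retry_test() line by line; lookup_before_fix, the state dictionary and the two modelled statuses are assumptions chosen only to show how a one-way downgrade outlives the lookup that triggered it.

```python
NTIMES, CONTINUE = "TRYAGAIN_NTIMES", "CONTINUE"


def lookup_before_fix(state, answers):
    """One lookup against a single source, modelling the pre-fix behaviour.

    answers lists the status the source returns on each attempt; only
    "TRYAGAIN" and "SUCCESS" are modelled. Returns (retry, result, action)
    tuples shaped like the debug trace.
    """
    trace = []
    retry = 0
    for result in answers:
        if result == "SUCCESS":
            trace.append((retry, result, "RETURN"))
            return trace
        if result != "TRYAGAIN":
            raise ValueError("status not modelled: " + result)
        action = state["tryagain_action"]
        trace.append((retry, result, action))
        if action == NTIMES and retry < state["n"]:
            retry += 1
            continue
        if action == NTIMES:
            # Retries exhausted: the action is downgraded and never restored.
            state["tryagain_action"] = CONTINUE
        return trace
    return trace


state = {"tryagain_action": NTIMES, "n": 3}
first = lookup_before_fix(state, ["TRYAGAIN"] * 4)
second = lookup_before_fix(state, ["TRYAGAIN", "SUCCESS"])
third = lookup_before_fix(state, ["SUCCESS"])
print(len(first), second, state["tryagain_action"])
```

In this model the second lookup gives up after one attempt even though a retry would have succeeded, and a later success leaves the action at CONTINUE.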
Several possible solutions were considered:
- Remove the functionality that changes the action based on the current "action and results" for different states.
- Modify the functionality such that it does not remove the automatic retry for 'dns' service.
- Implement a new state that restores the original action once the service is deemed to be working again.
- Leave it as is and correct the man page (which will need to be addressed either way (6275116)).
- Put an algorithm in place that ramps up to 'tryagain=n'.
- Implement some new action.
After consultation I opted for the third solution and implemented a new state named TRYAGAIN_PAUSED that restores the TRYAGAIN_NTIMES action once a successful result occurs:
    # getent hosts host1 host2 host3 host4
    NSS_retry(0): 'hosts': trying 'files' ... result=NOTFOUND, action=CONTINUE
    NSS: 'hosts': continue ...
    NSS_retry(0): 'hosts': trying 'dns' ... result=TRYAGAIN, action=TRYAGAIN_NTIMES (N=3)
    NSS_retry(1): 'hosts': trying 'dns' ... result=TRYAGAIN, action=TRYAGAIN_NTIMES (N=3)
    NSS_retry(2): 'hosts': trying 'dns' ... result=TRYAGAIN, action=TRYAGAIN_NTIMES (N=3)
    NSS_retry(3): 'hosts': trying 'dns' ... result=TRYAGAIN, action=TRYAGAIN_NTIMES (N=3)
    NSS: 'hosts': return.
    NSS_retry(0): 'hosts': trying 'files' ... result=NOTFOUND, action=CONTINUE
    NSS: 'hosts': continue ...
    NSS_retry(0): 'hosts': trying 'dns' ... result=TRYAGAIN, action=TRYAGAIN_PAUSED
    NSS: 'hosts': return.
    NSS_retry(0): 'hosts': trying 'files' ... result=NOTFOUND, action=CONTINUE
    NSS: 'hosts': continue ...
    NSS_retry(0): 'hosts': trying 'dns' ... result=SUCCESS, action=RETURN
    NSS: 'hosts': return.
    192.168.0.3 host3.sun.com
    NSS_retry(0): 'hosts': trying 'files' ... result=NOTFOUND, action=CONTINUE
    NSS: 'hosts': continue ...
    NSS_retry(0): 'hosts': trying 'dns' ... result=TRYAGAIN, action=TRYAGAIN_NTIMES (N=3)
    NSS_retry(1): 'hosts': trying 'dns' ... result=TRYAGAIN, action=TRYAGAIN_NTIMES (N=3)
    NSS_retry(2): 'hosts': trying 'dns' ... result=SUCCESS, action=RETURN
    NSS: 'hosts': return.
    192.168.0.4 host4.sun.com
    # nscd -e hosts,yes
    #
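The 'dns' part of the four lookups above can be replayed with a small model of the paused-and-restored behaviour. This is a sketch under stated assumptions rather than the integrated code: the lookup function, the state dictionary and the restriction to the TRYAGAIN and SUCCESS statuses are all simplifications of mine.

```python
NTIMES, PAUSED = "TRYAGAIN_NTIMES", "TRYAGAIN_PAUSED"


def lookup(state, answers):
    """One lookup against a single source, modelling the fixed behaviour.

    answers lists the status the source returns on each attempt; only
    "TRYAGAIN" and "SUCCESS" are modelled. Returns (retry, result, action)
    tuples shaped like the debug trace.
    """
    trace = []
    retry = 0
    for result in answers:
        if result == "SUCCESS":
            # A working service restores the original retry action.
            if state["tryagain_action"] == PAUSED:
                state["tryagain_action"] = NTIMES
            trace.append((retry, result, "RETURN"))
            return trace
        if result != "TRYAGAIN":
            raise ValueError("status not modelled: " + result)
        action = state["tryagain_action"]
        trace.append((retry, result, action))
        if action == NTIMES and retry < state["n"]:
            retry += 1
            continue
        if action == NTIMES:
            # Retries exhausted: pause retrying instead of dropping it.
            state["tryagain_action"] = PAUSED
        return trace
    return trace


state = {"tryagain_action": NTIMES, "n": 3}
host1 = lookup(state, ["TRYAGAIN"] * 4)
host2 = lookup(state, ["TRYAGAIN"])
host3 = lookup(state, ["SUCCESS"])
host4 = lookup(state, ["TRYAGAIN", "TRYAGAIN", "SUCCESS"])
for name, trace in (("host1", host1), ("host2", host2),
                    ("host3", host3), ("host4", host4)):
    print(name, trace)
```

The design point is that the paused action is temporary: the first successful answer puts the source back on TRYAGAIN_NTIMES, so a long-running process recovers without a restart.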
While the fix was rather straightforward, getting to know the code (its various structures, states, actions, results and macros in nsswitch_priv.h), creating suitable tests and verifying that my new state remained opaque was time consuming and yet very satisfying, which is, after all, what I love about this job.
Stace

