Layers of Complexity
I've been learning a lot lately about the layering of software, from the kernel up through userland utilities. There are frequently many more layers of indirection than are immediately obvious.
Many of the tests I'm running that still have problems fail as a side effect of one test which turns off the default binding property of all pools. Normally this is fine to do, because pooladm -x; pooladm -d; pooladm -e will clear everything out. Disabling pools wipes out all properties on the default pool and pool_pset, and enabling recreates the properties with default values. When I made disabling a no-op at the kernel level, this prevented these properties from being cleared by the test suite's environment cleanup routines, so a number of tests failed because they couldn't bind to anything. Another, more obnoxious side effect is that sshd is unable to accept new sessions because it can't find a pool to bind to.
Changing the kernel code for pool_set_status() to reinitialize the properties of the default resources should fix things, right? Well, it would if the sequence of events I assumed was correct. What I assumed initially, without thinking about it, was that the sequence of events would be:
- Execute pooladm -d in a shell
- The above command opens /dev/pool and performs the ioctl() for disabling pools
- The ioctl() code calls pool_ioctl(), which dispatches to pool_status(), now modified to do the right thing with disable and enable requests.
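As a rough illustration of the last step in that assumed sequence, here is a toy model in C of an ioctl-style dispatcher handing a status request to a handler that reinitializes the default properties. It is only a sketch: every name prefixed with toy_ (the command code, the state codes, the struct and its fields) is hypothetical and is not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical command and state codes, invented for this sketch. */
enum toy_cmd { TOY_POOL_STATUS = 1 };
enum toy_state { TOY_POOL_DISABLED = 0, TOY_POOL_ENABLED = 1 };

/* Toy stand-in for the properties of the default pool. */
struct toy_default_pool {
    bool default_binding;   /* the property that one test turns off */
    int  resets;            /* how many times properties were reinitialized */
};

static struct toy_default_pool toy_pool = { true, 0 };

/*
 * Modified status handler: a disable or enable request reinitializes the
 * default properties rather than leaving them as the last test left them.
 */
static int toy_pool_status(int state)
{
    if (state != TOY_POOL_DISABLED && state != TOY_POOL_ENABLED)
        return -1;                      /* unknown state: reject */
    toy_pool.default_binding = true;    /* back to the default value */
    toy_pool.resets++;
    assert(toy_pool.default_binding);   /* postcondition of a reset */
    return 0;
}

/* Dispatcher in the style of an ioctl entry point. */
static int toy_pool_ioctl(int cmd, int arg)
{
    switch (cmd) {
    case TOY_POOL_STATUS:
        return toy_pool_status(arg);
    default:
        return -1;                      /* unsupported command */
    }
}
```

In this model, setting toy_pool.default_binding to false and then calling toy_pool_ioctl(TOY_POOL_STATUS, TOY_POOL_DISABLED) restores the property, which is the behaviour the kernel change was meant to provide, as long as the request actually reaches the dispatcher.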
The psrset command simply calls the function in the syscall layer which carries out the appropriate task. I completely forgot about both libpool and poold. The actual call sequence is:
- Execute pooladm -d in a shell
- The above code calls into libpool.
- The code called executes 'svcadm disable svc:/system/pools:default'
- The shutdown code for the pools service, /lib/svc/method/svc-pools, executes pooladm -d in a shell with a particular environment variable set
- With this environment variable set, when the command calls into libpool, this time it actually opens /dev/pool and performs the ioctl() for disabling pools
- The ioctl() code calls pool_ioctl(), which dispatches to pool_status(), now modified to do the right thing with disable and enable requests.
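The layering in the sequence above can be modelled with a few functions. The sketch below is a toy, not the real libpool or service manager code: all toy_ names are hypothetical, the from_service_method flag stands in for the unnamed environment variable, and the kernel side is reduced to a counter. It also models the short-circuit in which svcadm does nothing when the service is already disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy system state (all names are hypothetical). */
struct toy_system {
    bool service_enabled;   /* pools service state as the service manager sees it */
    int  kernel_disables;   /* times a disable request actually reached the kernel */
};

/* Last two steps: the ioctl() on /dev/pool, reduced to a counter. */
static void toy_kernel_disable(struct toy_system *s)
{
    assert(s != NULL);
    s->kernel_disables++;
}

static void toy_pooladm_disable(struct toy_system *s, bool from_service_method);

/*
 * Middle steps: "svcadm disable" runs the service's stop method only if the
 * service is currently enabled; otherwise it has nothing to do.
 */
static void toy_svcadm_disable(struct toy_system *s)
{
    if (!s->service_enabled)
        return;                         /* already disabled: stop method never runs */
    s->service_enabled = false;
    toy_pooladm_disable(s, true);       /* stop method re-runs pooladm -d */
}

/*
 * First steps: pooladm -d calls into the library, which either defers to
 * the service manager or, when invoked from the stop method, talks to
 * the kernel directly.
 */
static void toy_pooladm_disable(struct toy_system *s, bool from_service_method)
{
    if (from_service_method)
        toy_kernel_disable(s);
    else
        toy_svcadm_disable(s);
}
```

Starting from service_enabled = true, a call to toy_pooladm_disable(&s, false) reaches the kernel exactly once. Starting from service_enabled = false, the same call returns without touching the kernel at all, so no kernel-side fix to the disable path ever gets a chance to run.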
libpool saw that pools were already enabled, and didn't restart the pools service. So when the later tests which disable pools partway through ran, the invocation of svcadm saw that the pools service was already disabled and did not execute the additional pooladm -d, so no call was ever made into the kernel to reset these properties.
I had not really noticed before how differently userland and kernelspace are written. The complexity in kernelspace tends to be intricacies: locking, complicated invariants and such. Userspace's complexity seems to be largely layering and indirection. Or perhaps that's just what I perceive most because userspace is so large. In the kernel, the definition a symbol corresponds to is always clear, which makes following the flow of control with tools like cscope or OpenGrok usually straightforward; between programs in userspace, it's not always easy to figure out the flow of control without some preexisting knowledge.
Fixing this cleaned up a lot of inconclusive tests, and a number of tests which were failing because of properties not being reset. After this and fixing a couple of tests which made assumptions about processor sets and pools being mutually exclusive, as of this afternoon the pools test tally has improved from that in the last entry to:
Result Total:
FAIL: 7
PASS: 705
UNSUPPORTED: 5
In addition to this, I've cleaned up some code, merged redundant code paths, done the final implementation for the last class of process sets, and fixed several race conditions and deadlocks related to the conversion of pool reference counting to per-thread. And because two of the failures also fail on the gate (it was one the other week, but I've found bugs in several test cases), there are really only 5 failures left. Current builds also hold up to as many runs of the processor set stress test as I've been able to subject them to, without deadlock or panic. Things are looking good.
Posted by csg [Personal] (July 24, 2007 05:03 PM)
