http://blogs.sun.com/csg/date/20070724 Tuesday July 24, 2007

Layers of Complexity

I've been learning a lot lately about the layering of software from the kernel up through userland utilities. There are frequently many more layers of indirection than are immediately obvious.

Many of the tests I'm running that still fail do so because of a side effect of one test, which turns off the default binding property of all pools. Normally this is fine to do because pooladm -x; pooladm -d; pooladm -e will clear everything out. Disabling pools wipes out all properties on the default pool and pool_pset, and enabling recreates the properties with default values. When I made disabling a no-op at the kernel level, this prevented these properties from being cleared by the test suite's environment cleanup routines. So a number of tests failed because they couldn't bind to anything. Another, more obnoxious side effect is that sshd is unable to accept new sessions because it can't find a pool to bind to.

Changing the kernel code for pool_set_status() to reinitialize the properties of the default resources should fix things, right? Well, it would if the sequence of events I assumed were correct. What I assumed initially, without thinking about it, was that the sequence of events would be:

Unfortunately, I had been spending too much time in pset.c, where it really is that simple - the psrset command simply calls the function in the syscall layer which carries out the appropriate task. I completely forgot about both libpool and poold. The actual call sequence is:

But this is enclosed in a conditional which only does all of this if you are setting the pools state to a different state. So because pools are permanently enabled, when tests cleaned up after themselves to set up a fresh testing environment, libpool saw that pools were already enabled, and didn't restart the pools service. So when the later tests ran which disable pools partway through, the invocation of svcadm saw the pools service was already disabled, and did not execute the additional pooladm -d, so no call was ever made into the kernel to reset these properties.

I had not really noticed before how differently userland and kernelspace were written. The complexity in kernelspace tends to be intricacies - locking, complicated invariants and such. Userspace's complexity seems to be largely layering and indirection. Or perhaps that's just what I perceive most because userland is so large. In the kernel, the definition a symbol corresponds to is always clear (which usually makes following the flow of control with tools like cscope or OpenGrok straightforward); in userspace, it's not always easy to figure out the flow of control between programs without some preexisting knowledge.
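To make the failure mode concrete, here is a minimal toy model of a "only act if the state changes" guard, roughly the shape of the conditional described above. It is a sketch under assumptions: the function names, state strings, and recorded call labels are hypothetical and are not the libpool or SMF API.

```python
# Toy model only: the names below are hypothetical, not the libpool or SMF API.

def set_state_if_changed(current, requested, action):
    """Run action() only when the requested state differs from the current one."""
    if current == requested:
        return False  # guard short-circuits; nothing below this layer is called
    action()
    return True


def run_cleanup():
    """Replay the cleanup steps from the post and record what actually ran."""
    calls_made = []
    kernel_state = "enabled"     # the prototype kernel always reports enabled
    service_state = "disabled"   # the pools service was never restarted

    # Enable step: the library sees pools already enabled and skips the restart.
    set_state_if_changed(kernel_state, "enabled",
                         lambda: calls_made.append("restart pools service"))

    # Disable step: the service is already disabled, so pooladm -d is not run.
    set_state_if_changed(service_state, "disabled",
                         lambda: calls_made.append("pooladm -d"))

    return calls_made


print(run_cleanup())  # nothing is recorded: no step reaches the kernel
```

Each guard is reasonable on its own; the problem only appears when two layers each believe there is nothing to do.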

Fixing this cleaned up a lot of inconclusive tests, and a number of tests which were failing because of properties not being reset. After this and fixing a couple of tests which made assumptions about processor sets and pools being mutually exclusive, as of this afternoon the pools test tally has improved from that in the last entry to:
Result Total:
        FAIL: 7
        PASS: 705
        UNSUPPORTED: 5
In addition to this, I've cleaned up some code, merged redundant code paths, done the final implementation for the last class of processor sets, and fixed several race conditions and deadlocks related to the conversion of pool reference counting to per-thread. And because two of the failures also fail on the gate (it was one the other week, but I've found bugs in several test cases), there are really only 5 failures left. Current builds also hold up to as many runs of the processor set stress test as I've been able to subject them to, without deadlock or panic. Things are looking good.

Posted by csg [Personal] ( July 24, 2007 05:03 PM )
http://blogs.sun.com/csg/date/20070713 Friday July 13, 2007

A Cleaner Prototype - with full pset support

I haven't written in a while, but I've been busy.

At the start of July, I started digging through all of the old processor sets and processor pools correctness and stress tests. Right now I'm basically working on getting the pools functionality tests to pass. Here are the results of testing the gate from a couple of weeks ago (output from a test result parsing script):

Result Total:
        FAIL: 1
        PASS: 711
        UNSUPPORTED: 5
UNSUPPORTED means the test doesn't apply to the platform - I haven't looked into them, but I suspect those tests require more than 4 processors to run. Right now, I'll be happy to match the gate.
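For readers curious what a tally like the one above involves, here is a minimal sketch of a result-counting script. It is not the script I used; the input format (one test per line, with the result code as the last word) and the sample test names are assumptions made for illustration.

```python
from collections import Counter


def tally(lines):
    """Count result codes, assuming each line ends with its result code."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if parts:  # skip blank lines
            counts[parts[-1]] += 1
    return counts


def format_totals(counts):
    """Render the counts in the 'Result Total:' layout, codes sorted by name."""
    out = ["Result Total:"]
    for code in sorted(counts):
        out.append(f"        {code}: {counts[code]}")
    return "\n".join(out)


# Hypothetical sample input; real test journals will look different.
sample = ["pool_001 PASS", "pool_002 FAIL", "pool_003 PASS", "pool_004 UNSUPPORTED"]
print(format_totals(tally(sample)))
```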

Just before the 4th, I ran the test suite on a machine running my mostly-finished prototype:

Result Total:
        FAIL: 34
        OTHER_139: 2
        PASS: 552
        UNRESOLVED: 124
        UNSUPPORTED: 5
UNRESOLVED and OTHER mean various random error conditions occurred which prevented the test from completing (e.g. some of the setup for the test failed) but the actual test does not necessarily yield incorrect results. The UNSUPPORTED tests in this run (and the next) are the same as those which were unsupported on the gate build. The prototype this ran on is the one mentioned in my previous entry, which implements support for most of the basic use cases of psets as temporary pools. It lacked support for (read: crashes upon) performing some tasks under conditions the pools framework doesn't like (e.g. deleting a pset which has things bound to it is valid for processor sets, but invalid for a pool) or things without direct analogs in the pools world (e.g. binding individual threads).

Some random stress testing, however, frequently took the system down, failing an assertion that the number of processes in a pool was greater than 0. It turns out much of the pools framework verifies consistency using counts of how many processes are in a pool - which doesn't make sense when we want to coexist with an API which permits binding (potentially) every thread of a program to a different pool! struct proc contains a pointer to the pool the process is in, used for querying where processes are - which also makes no sense if this varies within a process. So in my latest build, in addition to having various bug fixes and more feature support, the system is converted to having a per-thread pool pointer, and having the pool reference count be a thread count rather than a process count. There is still one race condition (famous last words) to fix in the new refcount code, but it's almost there. The build is also the first which actually turns pools on at boot, and does not permit turning them off - which, along with psets and pools coexisting, is the main goal of the whole project.
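The idea of a per-thread pool pointer with a thread-based reference count can be sketched with a toy model. This is illustrative only: the class and function names are hypothetical, it is not the kernel's data layout, and it deliberately omits the locking that real kernel code needs.

```python
class Pool:
    def __init__(self, name):
        self.name = name
        self.thread_count = 0   # reference count kept per thread, not per process


class Thread:
    def __init__(self, pool):
        self.pool = pool        # per-thread pool pointer
        pool.thread_count += 1


def bind_thread(thread, new_pool):
    """Move one thread to new_pool, keeping both counts consistent."""
    if thread.pool is new_pool:
        return
    thread.pool.thread_count -= 1
    new_pool.thread_count += 1
    thread.pool = new_pool


default = Pool("pool_default")
temp = Pool("temp_pset_pool")
threads = [Thread(default) for _ in range(3)]   # one process with three threads

bind_thread(threads[0], temp)                   # bind a single thread elsewhere

# The one process now spans two pools, so a single per-process pool pointer
# could not describe it, while the per-thread counts still add up.
pools_in_use = {t.pool.name for t in threads}
print(default.thread_count, temp.thread_count, sorted(pools_in_use))
```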

The latest build test results from 7/12, before I turned my attention back to the refcount race condition in the psets code:

Result Total:
        FAIL: 16
        OTHER_13: 32
        PASS: 664
        UNSUPPORTED: 5

Of course, some of these tests no longer make sense to have, and some of them simply expect the wrong thing - mainly all of the tests which try to reproduce different behavior with pools turned on or off. Right now my kernel simply lies and always says that enabling or disabling succeeded, though queries for state always return POOL_ENABLED. This accounts for at least a couple of the remaining failures. Many of the OTHER results are hard to interpret - most of the tests giving these results are written differently from most of the others, and don't provide as much information - but this is probably an indication of something giving them errno 13 - EACCES.
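As a small aid for reading labels like OTHER_13, the sketch below maps the numeric suffix to an errno name. It assumes the suffix really is an errno value, which is only my guess above, and it uses the errno table of whatever host runs it, so the mapping is platform-dependent. The helper name is hypothetical.

```python
import errno
import os


def decode_other(label):
    """Map a label such as 'OTHER_13' to an errno name, or None if unknown."""
    prefix, _, number = label.partition("_")
    if prefix != "OTHER" or not number.isdigit():
        return None
    return errno.errorcode.get(int(number))


print(decode_other("OTHER_13"), "-", os.strerror(errno.EACCES))
```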



Posted by csg [Sun] ( July 13, 2007 02:12 PM )