A method to the madness

One feature that is sorely lacking in our test automation harness is the ability to share and lock resources between competing groups of tests. Consider this situation: Our group is responsible for testing some of the systems in Sun's X64 family of servers. In particular, much of our focus is on the Sun Blade 8000 (code named Andromeda). Given two indisputable facts:

  1. These systems are expensive, and thus there are not an infinite supply for us to test.
  2. There are many different components to one system, some of which can be tested independently.
It is possible (and in fact very likely) that we might try to run competing groups of tests on the same hardware at the same time. The reality is that we've run into this problem a lot more than we care for, and our current test automation harness provides no mechanism to ensure that separate groups of tests do not collide in this manner.

Thus we are faced with a somewhat classic resource sharing problem, but with a unique twist: some resources (such as an Andromeda chassis) might have sub-resources which can be locked and used independently and without interference. Therefore, a resource can only be locked for use if it and all of its sub-resources are available. Our specific problem details can be described as follows:

  • Resources are structured in a hierarchical format as just described.
  • The resource hierarchy is stored in a centralized SQL database.
  • Each logical group of automated tests defines a set of resources that must be locked before execution can begin.
  • Each group of automated tests is run by an execution daemon. There may be multiple execution daemons running on possibly different servers which will be attempting to lock the same resources in the centralized database before executing the tests.
One of the considerations in coming up with a resource sharing policy for this situation is that we should strive for simplicity over any optimizations made for speed or to maximize concurrent execution. While both of these benefits would be nice, the number one priority is to ensure a consistent, well-known, and repeatable test environment so that the test results are not invalidated by external interference. Any complexity that we add to the design introduces additional possibilities for infrastructure problems which could create overhead in test maintenance and debugging. Throughput priorities should be secondary which is why I have enforced the policy that all resources required for the entire life of a group of tests (even if it is only required for a short period of time or only a few test cases) must be locked by an execution group before any tests in that group can begin.

It turns out that having a hierarchy of resources does not present a significant problem in coming up with a resource sharing policy. As long as resources and their children are properly ordered per Dijkstra's solution to the infamous dining philosopher's problem, and resource locks are always acquired in the same order, the same basic principles apply. Thus here is my basic locking algorithm presented as a solution to the above described problem:

  • An execution group requires a set of resources, we'll call them R
  • Foreach resource r in R (in alphabetical order)
    • Lock the database tables associated with resources (to avoid race conditions in database access).
    • If r is unlocked AND there is nobody waiting in the queue for r (the queue is maintained in the database), lock it (by assigning your name to the resource as the locker in the database).
      • Recursively lock each child resource of r in alphabetical order.
      • If any children of r are locked, release all children of r (but do NOT release r itself). This will allow any execution groups that require only children of r to execute while we are waiting for resources to be released. After waiting n (10?) minutes, go back to the beginning of the children locking loop and re-attempt to lock the children. This time, do NOT release the child locks if a locked child is encountered, enter into the queue for that child instead and initiate the pseudo-busy waiting process for that resource.
    • Else enter into the queue for resource r and initiate the pseudo-busy waiting process for r.
    • Unlock the database tables associated with resources
  • Once all resources are acquired, execute all tests in the group and then release the locked resources in R
  • Pseudo-busy waiting process:
    • sleep for (5?) minutes.
    • check if the resource is available. If it is, lock it and remove yourself from queue. Otherwise, go back to the previous step.
As you can see, the algorithm is not built for speed and is decidedly greedy in its locking policy. Like I said above, though, speed and concurrency are not the main goals. Reliability, consistency, and simplicity are. There are still a few problems to iron out, but this basic algorithm should ensure that groups of automated tests contending for the same resources do not interfere.