Spending a lot of time writing test cases to try to reproduce system panics has led me to an interesting(ish) methodology. You stare at the data structures in the dump, then you stare at the code and see if you can work out how to get things into a similar state. Then comes the difficult bit: I've taken to working out which operations are possible from userland code and then randomising them using lrand48().
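To make the "randomise the operations" idea concrete, here is a minimal sketch in C. The operation names and the helpers pick_op() and gen_ops() are made up for illustration; they are not taken from the real test case. Seeding with srand48() is my addition, on the assumption that you would want to replay a sequence that triggered something.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stdlib.h>

/* Hypothetical set of operations the test can perform from userland. */
enum op { OP_POLL, OP_CLOSE, OP_EXIT, OP_SPAWN, OP_COUNT };

/* Pick the next operation at random using lrand48(). */
static enum op pick_op(void)
{
    return (enum op)(lrand48() % OP_COUNT);
}

/* Seed the generator, then fill buf with n randomly chosen operations.
 * The same seed always produces the same sequence (single-threaded). */
static void gen_ops(long seed, enum op *buf, size_t n)
{
    assert(buf != NULL);
    srand48(seed);
    for (size_t i = 0; i < n; i++)
        buf[i] = pick_op();
}
```

Logging the seed at start-up means a single-threaded run can be repeated exactly; once several threads are involved, scheduling adds its own randomness on top.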
So this week's exercise has been to reproduce a panic in the poll() code that, from code inspection, is impossible. The file table (indexed by file descriptor) is per process, but the poll structures are per lwp, so any file entry that threads are polling on has a linked list of the interested threads attached to it. In the dump there are 3 threads chained off one file entry, so we know that we have a multi-threaded process performing poll() on a single file descriptor from several threads at once. We panicked in close as we traversed that list, because one of the threads had been reused by a process that does no polling, so its per-thread poll structures were null. So now we know that the process has threads exiting and threads closing the file that we are polling on.
So I wrote a threaded program that opened a net connection and then went into a loop in which it randomly started new threads; those threads then possibly polled on that connection, possibly exited, or possibly closed the connection. The main thread dealt with all of this, re-opening the connection if it got closed and starting new threads as old ones exited, all under the choice of lrand48().
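A rough sketch of that kind of program follows. It is not the original test case: it uses a local socketpair() instead of a real net connection, runs the threads in rounds and joins them rather than replacing each one as it exits, and every name in it (NTHREADS, rnd(), worker(), run_stress() and so on) is invented for the example. Since lrand48() keeps hidden shared state, the sketch wraps it in a mutex.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define NTHREADS 8                  /* hypothetical: max threads per round */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int conn[2] = { -1, -1 };    /* conn[0] is the descriptor we poll on */

/* lrand48() is not thread safe, so serialise calls to it. */
static long rnd(void)
{
    pthread_mutex_lock(&lock);
    long r = lrand48();
    pthread_mutex_unlock(&lock);
    return r;
}

/* Main thread: (re)open the connection if a worker has closed it. */
static void reopen_conn(void)
{
    pthread_mutex_lock(&lock);
    if (conn[0] < 0) {
        int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, conn);
        assert(rc == 0);
        (void)rc;
    }
    pthread_mutex_unlock(&lock);
}

/* Close both ends of the connection, if it is currently open. */
static void close_conn(void)
{
    pthread_mutex_lock(&lock);
    if (conn[0] >= 0) {
        close(conn[0]);
        close(conn[1]);
        conn[0] = conn[1] = -1;
    }
    pthread_mutex_unlock(&lock);
}

/* Each worker randomly polls, closes the connection, or exits. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 16; i++) {
        switch (rnd() % 3) {
        case 0: {
            pthread_mutex_lock(&lock);
            struct pollfd pfd = { .fd = conn[0], .events = POLLIN };
            pthread_mutex_unlock(&lock);
            /* 1 ms timeout; another thread may close the fd under us,
             * which is exactly the race we are trying to provoke. */
            (void)poll(&pfd, 1, 1);
            break;
        }
        case 1:
            close_conn();
            break;
        default:
            return NULL;            /* thread exit */
        }
    }
    return NULL;
}

/* Run the given number of rounds; returns how many threads were started. */
static int run_stress(long seed, int rounds)
{
    int started = 0;
    srand48(seed);
    for (int r = 0; r < rounds; r++) {
        pthread_t tid[NTHREADS];
        int n = 1 + (int)(rnd() % NTHREADS);
        for (int i = 0; i < n; i++) {
            reopen_conn();
            int rc = pthread_create(&tid[i], NULL, worker, NULL);
            assert(rc == 0);
            (void)rc;
            started++;
        }
        for (int i = 0; i < n; i++)
            pthread_join(tid[i], NULL);
    }
    close_conn();
    return started;
}
```

Calling something like run_stress(42, 1000) from main() and building with the platform's threads option should be enough to try it. The seed only fixes the sequence of random numbers, not the order in which threads consume them, so two runs with the same seed will not behave identically.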
Did this reproduce it? No. So then I randomised the number and contents of the pollfd array passed to poll(), and suddenly the machine panicked with a stack trace identical to the one from the customer's machine: the power of lrand48().
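For illustration, randomising the pollfd array might look something like the sketch below. Exactly what "contents" means here is my guess: this version randomly mixes entries for the descriptor under test with entries whose fd is -1 (which poll() ignores), and gives each a random set of event flags. The function name random_pollfds() and the MAX_PFDS bound are hypothetical.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <poll.h>
#include <stdlib.h>

#define MAX_PFDS 16     /* hypothetical upper bound on the array size */

/* Fill fds[] with a random number (1..max) of random entries and
 * return how many were filled in. */
static nfds_t random_pollfds(struct pollfd *fds, nfds_t max, int target_fd)
{
    static const short ev[] = { POLLIN, POLLOUT, POLLPRI };

    assert(fds != NULL && max > 0);
    nfds_t n = 1 + (nfds_t)(lrand48() % (long)max);

    for (nfds_t i = 0; i < n; i++) {
        /* Roughly half the entries name the fd under test; the rest
         * are -1, which poll() skips over. */
        fds[i].fd = (lrand48() & 1) ? target_fd : -1;

        /* Each event flag is switched on with probability 1/2. */
        fds[i].events = 0;
        for (size_t b = 0; b < sizeof ev / sizeof ev[0]; b++)
            if (lrand48() & 1)
                fds[i].events |= ev[b];

        fds[i].revents = 0;
    }
    return n;
}
```

A polling thread would then declare struct pollfd fds[MAX_PFDS], call n = random_pollfds(fds, MAX_PFDS, fd) and pass the result straight to poll(fds, n, timeout).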
The good news is that it is already fixed in Solaris 10...
Posted by Alan Burlison on March 21, 2006 at 12:14 AM GMT+00:00 #