
Friday August 13, 2004
Stop Using Customers as a Test Group
A few years ago, I began looking into RAS (Reliability, Availability, Serviceability)
improvements on our higher-end platforms. I'm a software guy, so I focused on SW RAS.
My first observation was that customers often cannot tell what is wrong when something
goes wrong. There are many ways to attack this problem, of course, not the least of
which is making the system more idiot-proof, providing a way to detect down-rev SW, etc.
But what really grabbed my attention is the practice in our industry of
forcing our customers to be our test group. Sure, every test plan we write has
some things that might be considered "fault injection," but in my opinion
we were barely scratching the surface of the single most important type
of product test. In fact, to back up a step, I felt we did not even
perform a careful analysis of how faults were expected to be handled
on our platforms. Depending on the platform, some amount of analysis
was done, but it was done non-uniformly by the various groups involved.
Looking at it from a Fault Management viewpoint, we did not even fully
understand how our products would behave for a good percentage of faults.
Of course, in all fairness, we were no worse than many of our competitors.
But who wants to settle for being like the competition? We can do better
than that.
So I set out to define a methodology for fault tree analysis on our products,
and algorithms to enable our products to diagnose themselves. Within a
couple of months, I discovered that a hardware RAS engineer, Emrys Williams, was
attacking the very same problems. Both of us had starting points, but had
not completed the work. We merged our two projects into one, and the result
was code-named "eversholt" (no longer proprietary information, since
we've submitted our patents -- at least, some portions of it are no longer
proprietary and I can blog about those portions).
Over the coming blog entries, I will describe our research. That research
has led to new methodologies for fault tree analysis which are taking hold
in Sun. It has resulted in a new feature, buried under the covers in Solaris,
which can consume error reports and produce fault diagnoses. This is just
the beginning, but it is a very exciting beginning!
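
To give a rough feel for what "consume error reports and produce fault diagnoses" can mean, here is a toy sketch in Python. It is not eversholt and not the Solaris implementation. The fault names, error report names, and the `diagnose` function are all hypothetical, made up for illustration. The idea is only that each fault is expected to produce certain error reports, and a diagnosis is any fault that could explain everything observed.

```python
# Toy model only: every name below is hypothetical, not taken from any real product.
# Each fault maps to the set of error reports it is expected to produce.
FAULT_TREE = {
    "cpu-cache": {"cpu-correctable", "cpu-uncorrectable"},
    "memory-dimm": {"mem-correctable"},
    "io-bus": {"io-timeout", "cpu-uncorrectable"},
}


def diagnose(observed, tree=FAULT_TREE):
    """Return the sorted list of faults that could explain every observed error report."""
    observed = set(observed)
    if not observed:
        # Nothing was reported, so there is nothing to diagnose.
        return []
    # A fault is a candidate only if all observed reports are among its expected reports.
    return sorted(fault for fault, expected in tree.items() if observed <= expected)


if __name__ == "__main__":
    # One report alone is ambiguous between two faults.
    print(diagnose(["cpu-uncorrectable"]))
    # A second report narrows the candidates.
    print(diagnose(["cpu-correctable", "cpu-uncorrectable"]))
```

A single error report can leave several candidate faults, and more reports narrow the list. A report that no fault in the table explains yields an empty diagnosis, which is the kind of gap a careful fault analysis is meant to expose before a customer finds it.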