andy's blog
Friday August 13, 2004
Stop Using Customers as a Test Group


A few years ago, I began looking into RAS (Reliability, Availability, Serviceability) improvements on our higher-end platforms. I'm a software guy, so I focused on software RAS. My first impression was that customers cannot tell what is wrong when something goes wrong. There are many ways to attack this problem, of course, not the least of which are making the system more idiot-proof, providing a way to detect down-rev software, and so on.

But what really grabbed my attention is the practice in our industry of forcing our customers to be our test group. Sure, every test plan we write has some things that might be considered "fault injection," but in my opinion we were barely scratching the surface of the single most important type of product test. In fact, to back up a step, I felt we did not even perform a careful analysis of how faults were expected to be handled on our platforms. Depending on the platform, some amount of analysis was done, but it was done non-uniformly by the various groups involved. Looking at it from a Fault Management viewpoint, we did not even fully understand how our products would behave for a good percentage of faults.

Of course, in all fairness, we were no worse than many of our competitors. But who wants to settle for being like the competition? We can do better than that.

So I set out to define a methodology for fault tree analysis on our products, and algorithms to enable our products to diagnose themselves. Within a couple of months, I discovered that a hardware RAS engineer, Emrys Williams, was attacking the very same problems. Both of us had starting points, but neither of us had completed the work. We merged our two projects into one, and the result was code-named "eversholt" (no longer proprietary information, since we've submitted our patents -- at least, some portions of it are no longer proprietary and I can blog about those portions).

Over the coming blog entries, I will describe our research. That research has led to new methodologies for fault tree analysis that are taking hold at Sun. It has resulted in a new feature, buried under the covers in Solaris, which can consume error reports and produce fault diagnoses. This is just the beginning, but it is a very exciting beginning!

Aug 13 2004, 10:43:27 PM MDT