andy's blog
Monday August 15, 2005
A Brief Hello
你好! (Hello!)


[thanks to my penpal and fellow Sun engineer, 吴君超, for proofreading & grammar help]

Aug 15 2005, 10:53:52 PM MDT

Monday June 13, 2005
Predictive Self-Healing/EFT Overview

Whose Fault Is It?

Today is the OpenSolaris launch! I would like to take this opportunity to blog about the part of Solaris I'm currently involved in. This blog entry is about Fault Management. The fault management framework in Solaris is known as Predictive Self-Healing and actually encompasses quite a bit. I will provide a general overview of the framework, and then focus in on my area, which uses Fault Trees to model parts of the system and perform fault diagnosis. Now, you may think this topic sounds a bit on the dry side, so let me start with a few words about the motivation behind all this.

To Err is Human

For years people have had to deal with UNIX machines spewing error messages at them. I'm not talking about errors like the usage message from ls(1), I'm talking about things like driver-detected errors when a piece of hardware has flaked out, or assertion failures in software modules. When these errors are detected, some free-format message gets spewed to one of the various logs in the system in hopes that some God-like sys admin with enough experience will understand what's broken and fix it. To make up for the lack of God-like sys admins, companies have produced add-on software which, using a rule-based system or some other heuristics, attempts to deduce the problems experienced on a system.

Obviously, the problem is that you cannot depend on an all-knowing sys admin to correctly diagnose a problem, and even the add-on heuristics aren't very reliable because they depend on very fragile information (I once heard of someone naming their system panic and finding that every line written to syslog was flagged as a problem by some poorly-written diagnosis software which pattern matched for the word panic in syslog).
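To make the fragility concrete, here is a minimal sketch of that kind of keyword matching. The log lines, the hostname, and the naive_flag helper are all made up for illustration; this is not taken from any real diagnosis product.

```python
import re

# Hypothetical syslog lines from a machine whose hostname happens to be "panic".
LOG_LINES = [
    "Jun 13 10:00:01 panic sshd[311]: Accepted publickey for andy",
    "Jun 13 10:00:05 panic cron[402]: (root) CMD (run nightly job)",
]


def naive_flag(line: str) -> bool:
    """Flag a line as a problem if the word 'panic' appears anywhere in it."""
    return re.search(r"\bpanic\b", line) is not None


# Every line carries the hostname, so every line gets flagged.
flagged = [line for line in LOG_LINES if naive_flag(line)]
print(len(flagged), "of", len(LOG_LINES), "lines flagged")
```

Neither line reports a real problem, yet both are flagged, because free-form text gives the matcher no way to tell a hostname from an event.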

What we really want is for diagnosis to be built into the system from the beginning. That's more than just adding software to diagnose the problem -- much more. It also means modifying every part of the system that produces random, free-form error messages and converting those points to error telemetry collection points instead. This is exactly the massive undertaking of the Solaris Predictive Self-Healing project.

In Simple Terms

Before going on, let me define a few of the terms we use in the Fault Management project. These terms are not new, they are not revolutionary, and they are not even as complex as many of the so-called industry standard terms out there. The point was to come up with a simple set of concepts to use throughout our project, so I describe those concepts here.



Go With the Flow

I like to think that any well-designed software architecture is simple and easy to explain. The fault management framework, at the high-level view, certainly is simple. It is basically divided into three areas: Error Handlers, Diagnosis Engines, and Agents. Of course there's much more detail to it, and that will surely be the subject of many future blog entries, but the overall architecture is shown by the following diagram.

[Fault Management Flow]

As an example, think of a device driver as the box on the left. When it detects errors, it packages up the information into Ereports and sends them to the Fault Management Daemon (fmd), which routes the information to the appropriate Diagnosis Engine. When a diagnosis engine arrives at a diagnosis, it emits a Suspect List, which the fmd routes to the appropriate Agent to take action. Without going into great detail, I will mention there are a number of diagnosis engines and agents as part of the fault management framework delivered in Solaris so far. You can see them with the fmadm command:

The source for the fault management daemon (another fine product from Mike Shapiro) lives in the Solaris source tree under the directory:

In that directory you'll find the fmd, the related commands, and all the modules (diagnosis engines and agents). You'll also find the eversholt compiler and diagnosis engine, which I'll describe in a little more detail later in this blog entry.

When a subsystem is converted to use the fault management framework, it no longer spews random error messages into syslog (or anywhere else). Instead, it packages up the error information into well-defined ereports and sends them to the fmd (the exact mechanisms for defining ereports and sending them to the fmd are details we'll skip for now).
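As a rough illustration of the difference between a free-form message and a structured report, here is a toy sketch of the flow. Everything in it is an assumption made for the example: the Ereport and Router classes, the prefix-matching rule, and the "ereport.example..." class names are hypothetical and are not the real fmd interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Ereport:
    """A structured error report: a class name plus named payload fields."""
    cls: str
    payload: Dict[str, str] = field(default_factory=dict)


class Router:
    """Toy stand-in for the daemon's routing role (not the real fmd API)."""

    def __init__(self) -> None:
        self._subs: List[Tuple[str, Callable[[Ereport], None]]] = []

    def subscribe(self, prefix: str, handler: Callable[[Ereport], None]) -> None:
        """Register a handler for every ereport class starting with prefix."""
        self._subs.append((prefix, handler))

    def post(self, report: Ereport) -> int:
        """Deliver the report to each matching subscriber; return the count."""
        delivered = 0
        for prefix, handler in self._subs:
            if report.cls.startswith(prefix):
                handler(report)
                delivered += 1
        return delivered


# A pretend diagnosis engine that only cares about (made-up) disk ereports.
disk_engine_inbox: List[Ereport] = []
router = Router()
router.subscribe("ereport.example.disk.", disk_engine_inbox.append)

router.post(Ereport("ereport.example.disk.timeout", {"device": "disk0"}))
router.post(Ereport("ereport.example.net.link-down", {"device": "net0"}))

print([r.cls for r in disk_engine_inbox])  # ['ereport.example.disk.timeout']
```

The point of the sketch is only that routing keys off a well-defined class name and named fields, so nothing has to guess at the meaning of a string in a log.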

Before describing what happens next, let's talk about how you know what should happen in your system when a problem occurs.

Introspection

Before shipping any product, we should all understand how faults, defects, and upsets will be handled by that product, right? Well, maybe that sounds obvious, and maybe some people assume this is exactly what "product test" or "quality assurance" is testing, but as far as I can tell, most of the industry is quite lacking in this area. Without going into the gory details of fault injection testing, etc. let me just say a good place to start would be to define a model of how these problems impact a system. Once we have a model, we can review it with the system designers and perhaps write simulations based on the model to understand the system's fault management strategy.

One way to define such a model is by constructing a Fault Tree for the system. As described above, a fault tree captures the causal relationships between events, as in the following diagram:

[Fault Tree]

Of course a real fault tree contains zillions of events and zillions of propagations, but the above picture shows the basics of how problems flow to errors and then to ereports. (The full nomenclature for fault trees, including what the little numbers in the circles mean, is beyond the scope of this particular blog entry -- let's tackle that at a later date!)

Constructing a fault tree is a laborious process. But you really don't get anything for free -- if you want to understand how these events propagate through a system, you need to work through the details thoroughly. The eversholt technology provides a way to construct these fault tree models, along with tools to use them in simulations in order to validate the models. (eversholt is the original code-name for the project inside Sun. It is the name of the town in the UK where the co-inventor of the technology lives and where we were working at the time.)

What's the Problem?

Remember that we never observe problems directly, we only see their effects. The observations are the events at the leaves of the fault tree -- the ereports. Once a fault tree model is created for a system, one might imagine that you could infer which problems are present in the system by taking the ereports you've seen and following your way back up the fault tree to the problems. Well, that's exactly how the eversholt diagnosis engine does its job. In other words, once you go through the exercise of understanding how a subsystem works, once you write the fault tree model (and review it with the appropriate people and run simulations to make sure it is correct), you get the diagnosis engine for free! That's because the eversholt diagnosis engine operates directly from the fault tree.
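The idea of following ereports back up the tree can be sketched in a few lines. This is a deliberately simplified toy, not the eversholt algorithm: the event names and the PROPAGATIONS table are invented, the graph is assumed to be acyclic, and the sketch assumes a single problem must explain every observed ereport.

```python
from typing import Dict, List, Set

# Hypothetical propagations: each event maps to the events it can cause.
PROPAGATIONS: Dict[str, List[str]] = {
    "fault.disk": ["error.read_timeout", "error.bad_checksum"],
    "fault.cable": ["error.read_timeout"],
    "error.read_timeout": ["ereport.io.timeout"],
    "error.bad_checksum": ["ereport.io.checksum"],
}


def causes_of(event: str) -> Set[str]:
    """Walk propagations backwards; return the root problems behind event."""
    parents = [src for src, dsts in PROPAGATIONS.items() if event in dsts]
    if not parents:
        return {event}  # nothing causes it, so treat it as a root problem
    roots: Set[str] = set()
    for parent in parents:
        roots |= causes_of(parent)
    return roots


def diagnose(observed: List[str]) -> Set[str]:
    """Suspects are the problems that could explain every observed ereport."""
    suspects = None
    for ereport in observed:
        roots = causes_of(ereport)
        suspects = roots if suspects is None else suspects & roots
    return suspects or set()


print(sorted(diagnose(["ereport.io.timeout"])))
# ['fault.cable', 'fault.disk']
print(sorted(diagnose(["ereport.io.timeout", "ereport.io.checksum"])))
# ['fault.disk']
```

A timeout alone leaves two suspects; adding the checksum ereport narrows the list to the disk. The diagnosis logic never mentions disks or cables, it only reads the model, which is the sense in which the engine comes "for free" once the tree is written.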

And now for a brief reality check: the OpenSolaris launch makes all the eversholt code open for anyone to see. But the theory and practice behind fault trees and getting the fault trees correct is really the hard part. Sorry, I do not have the time to describe all the very-involved details here just now, but keep watching out for more information! (And just in case it isn't obvious, I'm always interested in hearing more ideas from the community to build on what we've been doing in this space!)

Where Next?

Now that we have our framework in place, as well as some initial subsystems using it, it opens up the fault management world to all the other subsystems. Our immediate goals are to convert the higher pay-off systems out there -- systems where correct fault diagnosis is a big win. We'll also want to improve on our framework, so we've been very careful to design it to evolve over time without ever forcing us to start over yet again. Many of these design points, future enhancements, and hardening of more and more subsystems will all be the subject of future blog entries, no doubt.


Thanks for reading... if you made it this far and want to contact me, join my first name, Andy, with my last name, Rudoff, with a dot in between them and add an "at" sign and Sun's domain name and drop me a line.





Jun 13 2005, 11:39:31 PM MDT

Friday August 13, 2004
Stop Using Customers as a Test Group


A few years ago, I began looking into RAS (Reliability, Availability, Serviceability) improvements on our higher-end platforms. I'm a software guy, so I focused on SW RAS. My first inclination was that customers cannot tell what is wrong when something is wrong. There are many ways to attack this problem, of course, not the least of which is making the system more idiot proof, providing a way to detect downrev SW, etc.

But what really grabbed my attention is the practice in our industry of forcing our customers to be our test group. Sure, every test plan we write has some things that might be considered "fault injection," but in my opinion we were barely scratching the surface of the single most important type of product test. In fact, to back up a step, I felt we did not even perform a careful analysis of how faults were expected to be handled on our platforms. Depending on the platform, some amount of analysis was done, but it was done non-uniformly by the various groups involved. Looking at it from a Fault Management viewpoint, we did not even fully understand how our products would behave for a good percentage of faults.

Of course, in all fairness, we were no worse than many of our competitors. But who wants to settle for being like the competition? We can do better than that.

So I set out to define a methodology for fault tree analysis on our products, and algorithms to enable our products to diagnose themselves. Within a couple months, I discovered a hardware RAS engineer, Emrys Williams, was attacking the very same problems. Both of us had starting points, but had not completed the work. We merged our two projects into one, and the result was code-named "eversholt" (no longer proprietary information, since we've submitted our patents -- at least, some portions of it are no longer proprietary and I can blog about those portions).

Over the coming blog entries, I will describe our research. That research has led to new methodologies for fault tree analysis which are taking hold in Sun. It has resulted in a new feature, buried under the covers in Solaris, which can consume error reports and produce fault diagnoses. This is just the beginning, but it is a very exciting beginning!

Aug 13 2004, 10:43:27 PM MDT

Wednesday June 16, 2004
Woken from a blog slumber, I ponder Fault Diagnosis...

Computers should diagnose faults rather than simply spewing errors that result from those faults. Okay, I just wanted to get that off my chest, and you may think anyone who would open a new blog with that statement is, well, not a lot of fun at parties. You may be right, but that's not my fault (it is, in fact, my defect).

Why don't most computer systems print faults instead of errors? Well, it is very hard, for one thing. Most attempts to diagnose faults take only a small portion of the computer, or a single sub-system into account. The reason is that the error telemetry from different system components is handled in completely different ways. Consider all the various free-form messages in all the various logs on a typical UNIX system. In order to use that telemetry for fault diagnosis, one would have to grok all those formats and, even worse, keep up with the changing world of each subsystem.

Anyway, once you've solved the telemetry problem (left as an exercise to the blogger, for now), you then have to decide how to consume the telemetry. Conventional logic leads to some sort of rule-based, heuristic approach at guessing what's wrong based on the error information. Historically this leads to simplistic diagnoses, which may be good enough in certain circumstances. But any complex, real-world system will soon cause the heuristics to grow out of control.

After solving the diagnosis problem, there's a vast system management and administrative rat's nest to tackle. Even if a system knows what is wrong with itself, what does it do with that information? What approaches actually cause the correct service action, minimize system unavailability, and feed information back into the process to improve the failure rate over time?

I have discovered a truly elegant solution to these problems, but this blog entry is too narrow to contain it. Perhaps a future blog entry will spill the beans...


Jun 16 2004, 06:54:49 AM MDT