andy's blog
Monday June 13, 2005
Predictive Self-Healing/EFT Overview

Whose Fault Is It?

Today is the OpenSolaris launch! I would like to take this opportunity to blog about the part of Solaris I'm currently involved in. This blog entry is about Fault Management. The fault management framework in Solaris is known as Predictive Self-Healing and actually encompasses quite a bit. I will provide a general overview of the framework, and then focus on my area, which uses Fault Trees to model parts of the system and perform fault diagnosis. Now, you may think this topic sounds a bit on the dry side, so let me start with a few words about the motivation behind all this.

To Err is Human

For years people have had to deal with UNIX machines spewing error messages at them. I'm not talking about errors like the usage message from ls(1), I'm talking about things like driver-detected errors when a piece of hardware has flaked out, or assertion failures in software modules. When these errors are detected, some free-format message gets spewed to one of the various logs in the system in hopes that some God-like sys admin with enough experience will understand what's broken and fix it. To make up for the lack of God-like sys admins, companies have produced add-on software which, using a rule-based system or some other heuristics, attempts to deduce the problems experienced on a system.

Obviously, the problem is that you cannot depend on an all-knowing sys admin to correctly diagnose a problem, and even the add-on heuristics aren't very reliable because they depend on very fragile information. (I once heard of someone naming their system "panic" and finding that every line written to syslog was flagged as a problem by some poorly-written diagnosis software which pattern matched for the word "panic" in syslog.)
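To make that anecdote concrete, here is a minimal sketch of the kind of fragile check involved. The log lines, the hostname, and the function name are all made up for illustration; this is not the logic of any real product.

```python
import re

# Hypothetical naive rule: any syslog line containing "panic" is a problem.
PANIC_RULE = re.compile(r"panic")

def flag_problems(lines):
    """Return the lines the naive rule considers problems."""
    return [line for line in lines if PANIC_RULE.search(line)]

# A machine whose hostname happens to be "panic" puts that word in every line.
syslog = [
    "Jun 13 10:00:01 panic sshd[101]: Accepted publickey for andy",
    "Jun 13 10:00:05 panic cron[202]: job finished normally",
    "Jun 13 10:00:09 panic ntpd[303]: clock synchronized",
]

flagged = flag_problems(syslog)
print(f"{len(flagged)} of {len(syslog)} lines flagged")
```

Every line is perfectly healthy, yet every line gets flagged, because the rule is matching on free-form text rather than on anything with defined meaning.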

What we really want is for diagnosis to be built into the system from the beginning. That's more than just adding software to diagnose the problem -- much more. It also means modifying every part of the system that produces random, free-form error messages and converting those points to error telemetry collection points instead. This is exactly the massive undertaking of the Solaris Predictive Self-Healing project.

In Simple Terms

Before going on, let me define a few of the terms we use in the Fault Management project. These terms are not new, they are not revolutionary, and they are not even as complex as many of the so-called industry standard terms out there. The point was to come up with a simple set of concepts to use throughout our project, so I describe those concepts here.



Go With the Flow

I like to think that any well-designed software architecture is simple and easy to explain. The fault management framework, at the high-level view, certainly is simple. It is basically divided into three areas: Error Handlers, Diagnosis Engines, and Agents. Of course there's much more detail to it, and that will surely be the subject of many future blog entries, but the overall architecture is shown by the following diagram.

[Fault Management Flow]

As an example, think of a device driver as the box on the left. When it detects errors, it packages up the information into Ereports and sends them to the Fault Management Daemon (fmd), which routes the information to the appropriate Diagnosis Engine. When a diagnosis engine arrives at a diagnosis, it emits a Suspect List which is routed by the fmd to the appropriate Agent to take action. Without going into great detail, I will mention there are a number of diagnosis engines and agents as part of the fault management framework delivered in Solaris so far. You can see them with the fmadm command:

The source for the fault management daemon (another fine product from Mike Shapiro) lives in the Solaris source tree under the directory:

In that directory you'll find the fmd, the related commands, and all the modules (diagnosis engines and agents). You'll also find the eversholt compiler and diagnosis engine, which I'll describe in a little more detail later in this blog entry.

When a subsystem is converted to use the fault management framework, it no longer spews random error messages into syslog (or anywhere else). Instead, it packages up the error information into well-defined ereports and sends them to the fmd (the exact mechanisms for defining ereports and sending them to the fmd are details we'll skip for now).
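The flow described in this section can be sketched in a few lines of Python. Treat this as a toy model only: the `Ereport` fields, the `Dispatcher` class, the event class names, and the one-ereport diagnosis rule are all assumptions made for illustration, and none of it reflects the real fmd interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Ereport:
    """A structured error report (field names are illustrative assumptions)."""
    event_class: str                      # hypothetical dotted class name
    detector: str                         # component that observed the error
    payload: dict = field(default_factory=dict)

class Dispatcher:
    """Toy stand-in for the daemon: routes ereports and suspect lists."""
    def __init__(self):
        self.engines = {}   # class prefix -> diagnosis engine callable
        self.agents = []    # callables that act on a suspect list

    def subscribe_engine(self, class_prefix, engine):
        self.engines[class_prefix] = engine

    def subscribe_agent(self, agent):
        self.agents.append(agent)

    def post(self, ereport):
        # Route the ereport to every engine subscribed to its class prefix.
        for prefix, engine in self.engines.items():
            if ereport.event_class.startswith(prefix):
                suspects = engine(ereport)
                if suspects:
                    # Route the resulting suspect list to the agents.
                    for agent in self.agents:
                        agent(suspects)

def disk_engine(ereport):
    # Hypothetical rule: blame the detecting device after a single ereport.
    return [f"fault.io.disk@{ereport.detector}"]

actions = []

def retire_agent(suspects):
    # Record the action instead of really taking a device offline.
    actions.extend(f"retire {s}" for s in suspects)

daemon = Dispatcher()
daemon.subscribe_engine("ereport.io.disk", disk_engine)
daemon.subscribe_agent(retire_agent)

daemon.post(Ereport("ereport.io.disk.timeout", "disk0", {"retries": 3}))
daemon.post(Ereport("ereport.cpu.cache", "cpu1"))  # no engine subscribed

print(actions)
```

The point of the sketch is the shape of the data: the error handler sends a typed record instead of a string, so nothing downstream has to guess what a message means.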

Before describing what happens next, let's talk about how you know what should happen in your system when a problem occurs.

Introspection

Before shipping any product, we should all understand how faults, defects, and upsets will be handled by that product, right? Well, maybe that sounds obvious, and maybe some people assume this is exactly what "product test" or "quality assurance" is testing, but as far as I can tell, most of the industry is quite lacking in this area. Without going into the gory details of fault injection testing, etc., let me just say a good place to start would be to define a model of how these problems impact a system. Once we have a model, we can review it with the system designers and perhaps write simulations based on the model to understand the system's fault management strategy.

One way to define such a model is by constructing a Fault Tree for the system. The fault tree captures the causal relationships among the events described above, as shown in the following diagram:

[Fault Tree]

Of course a real fault tree contains zillions of events and zillions of propagations, but the above picture shows the basics of how problems flow to errors and then to ereports. (The full nomenclature for fault trees, including what the little numbers in the circles mean, is beyond the scope of this particular blog entry -- let's tackle that at a later date!)
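As a rough illustration of problems flowing to errors and then to ereports, here is a tiny fault tree written as a Python dictionary, with a function that follows the propagations downward. The event names and the dictionary representation are invented for this sketch; it is not the eversholt language or its data structures.

```python
# Toy fault tree: each entry is a propagation from a cause to its effects.
# All event names here are hypothetical.
PROPAGATIONS = {
    "fault.disk.media":  ["error.disk.read"],
    "fault.cable.loose": ["error.disk.read", "error.bus.timeout"],
    "error.disk.read":   ["ereport.disk.read-failed"],
    "error.bus.timeout": ["ereport.bus.timeout"],
}

def expected_ereports(problem, propagations=PROPAGATIONS):
    """Follow propagations down from a problem; collect ereports at the leaves."""
    seen, stack, leaves = set(), [problem], set()
    while stack:
        event = stack.pop()
        if event in seen:
            continue
        seen.add(event)
        effects = propagations.get(event, [])
        if not effects and event.startswith("ereport."):
            leaves.add(event)   # an observable leaf of the tree
        stack.extend(effects)
    return leaves

for problem in ("fault.disk.media", "fault.cable.loose"):
    print(problem, "->", sorted(expected_ereports(problem)))
```

Running a model forward like this is a very small version of the simulation idea: pick a problem and ask which ereports the model says you should see.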

Constructing a fault tree is a laborious process. But you really don't get anything for free -- if you want to understand how these events propagate through a system, you need to work through the details thoroughly. The eversholt technology provides a way to construct these fault tree models, along with tools to use them in simulations in order to validate the models. (eversholt is the original code-name for the project inside Sun. It is the name of the town in the UK where the co-inventor of the technology lives and where we were working at the time.)

What's the Problem?

Remember that we never observe problems directly, we only see their effects. The observations are the events at the leaves of the fault tree -- the ereports. Once a fault tree model is created for a system, one might imagine that you could infer which problems are present in the system by taking the ereports you've seen and following your way back up the fault tree to the problems. Well, that's exactly how the eversholt diagnosis engine does its job. In other words, once you go through the exercise of understanding how a subsystem works, once you write the fault tree model (and review it with the appropriate people and run simulations to make sure it is correct), you get the diagnosis engine for free! That's because the eversholt diagnosis engine operates directly from the fault tree.
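The idea of following your way back up the tree can be sketched as below. This is a deliberately simplified stand-in, not the eversholt algorithm: the event names are made up, the tree is stored as a plain dictionary from each effect to its possible causes, and the sketch assumes that a single problem explains everything observed.

```python
# Toy fault tree stored as effect -> possible causes, so we can walk upward.
# All event names here are hypothetical.
CAUSES = {
    "ereport.disk.read-failed": ["error.disk.read"],
    "ereport.bus.timeout":      ["error.bus.timeout"],
    "error.disk.read":          ["fault.disk.media", "fault.cable.loose"],
    "error.bus.timeout":        ["fault.cable.loose"],
}

def problems_behind(ereport, causes=CAUSES):
    """Walk up from one ereport to every problem that could explain it."""
    roots, stack, seen = set(), [ereport], set()
    while stack:
        event = stack.pop()
        if event in seen:
            continue
        seen.add(event)
        parents = causes.get(event, [])
        if not parents:
            roots.add(event)    # nothing above it: treat it as a problem
        stack.extend(parents)
    return roots

def diagnose(observed):
    """Suspects are the problems that could explain every observed ereport."""
    candidate_sets = [problems_behind(e) for e in observed]
    return set.intersection(*candidate_sets) if candidate_sets else set()

print(sorted(diagnose(["ereport.disk.read-failed"])))
print(sorted(diagnose(["ereport.disk.read-failed", "ereport.bus.timeout"])))
```

With only the read failure observed, both the media fault and the loose cable remain suspects. Once the bus timeout shows up as well, only the loose cable explains both observations, so the suspect list narrows to one entry.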

And now for a brief reality check: the OpenSolaris launch makes all the eversholt code open for anyone to see. But the theory and practice behind fault trees and getting the fault trees correct is really the hard part. Sorry, I do not have the time to describe all the very-involved details here just now, but keep watching out for more information! (And just in case it isn't obvious, I'm always interested in hearing more ideas from the community to build on what we've been doing in this space!)

Where Next?

Now that our framework is in place, along with some initial subsystems using it, the fault management world is open to all the other subsystems. Our immediate goal is to convert the higher pay-off systems out there -- systems where correct fault diagnosis is a big win. We'll also want to improve on our framework, so we've been very careful to design it to evolve over time without ever forcing us to start over yet again. Many of these design points, future enhancements, and hardening of more and more subsystems will all be the subject of future blog entries, no doubt.


Thanks for reading... if you made it this far and want to contact me, join my first name, Andy, with my last name, Rudoff, with a dot in between them and add an "at" sign and Sun's domain name and drop me a line.





Posted Jun 13 2005, 11:39:31 PM MDT
