Monday June 13, 2005
Today is the
OpenSolaris
launch! I would like to take this opportunity to
blog about the part of Solaris I'm currently involved in.
This blog entry is about Fault Management. The
fault management framework in Solaris is known as
Predictive Self-Healing
and actually encompasses quite a bit. I will provide a general
overview of the framework, and then focus in on my area, which uses
Fault Trees to model parts of the system and perform
fault diagnosis. Now, you may think
this topic sounds a bit on the dry side, so let me start with a few
words about the motivation behind all this.
For years people have had to deal with UNIX machines spewing error messages at them. I'm not talking about errors like the usage message from ls(1), I'm talking about things like driver-detected errors when a piece of hardware has flaked out, or assertion failures in software modules. When these errors are detected, some free-format message gets spewed to one of the various logs in the system in hopes that some God-like sys admin with enough experience will understand what's broken and fix it. To make up for the lack of God-like sys admins, companies have produced add-on software which, using a rule-based system or some other heuristics, attempts to deduce the problems experienced on a system.
Obviously, the problem is that you cannot depend on an all-knowing sys admin to correctly diagnose a problem, and even the add-on heuristics aren't very reliable because they depend on very fragile information. (I once heard of someone naming their system "panic" and finding that every line written to syslog was flagged as a problem by some poorly written diagnosis software that pattern-matched for the word "panic" in syslog.)
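To make that failure mode concrete, here is a minimal sketch of the kind of matcher involved. The log lines are invented for illustration (they are not real Solaris output), and `naive_flag` is a hypothetical stand-in, not code from any actual product.

```python
import re

# Hypothetical syslog lines from a host that happens to be named "panic".
LOG_LINES = [
    "Jun 13 09:00:01 panic sshd[312]: Accepted publickey for andy",
    "Jun 13 09:00:07 panic cron[401]: (root) CMD (logadm)",
    "Jun 13 09:01:12 panic unix: panic[cpu0]: BAD TRAP",
]


def naive_flag(line: str) -> bool:
    """Flag any line that contains the word 'panic' anywhere at all."""
    return re.search(r"panic", line) is not None


# Every line carries the hostname, so every line gets flagged,
# even though only the last one reports an actual panic.
flagged = [line for line in LOG_LINES if naive_flag(line)]
print(len(flagged), "of", len(LOG_LINES), "lines flagged")
```

The matcher has no way of knowing which field of the line it is looking at, which is exactly the fragility that comes from diagnosing out of free-form text.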
What we really want is for diagnosis to be built into the system from
the beginning. That's more than just adding software to diagnose
the problem -- much more. It also means modifying every part of the
system that produces random, free-form error messages and converting
those points to error telemetry collection points instead. This
is exactly the massive undertaking of the
Solaris Predictive Self-Healing project.
Before going on, let me define a few of the terms we use in the Fault Management project. These terms are not new, they are not revolutionary, and they are not even as complex as many of the so-called industry standard terms out there. The point was to come up with a simple set of concepts to use throughout our project, so I describe those concepts here.
I like to think that any well-designed software architecture is
simple and easy to explain. The fault management framework, at
the high-level view, certainly is simple. It is basically divided
into three areas: Error Handlers,
Diagnosis Engines, and Agents. Of course there's
much more detail to it, and that will surely be the subject of
many future blog entries, but the overall architecture is shown
by the following diagram.
![[Fault Management Flow]](http://blogs.sun.com/roller/resources/andy/fmaflow.jpg)
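As a rough sketch of that three-part flow, the toy pipeline below passes an event from an error handler to a diagnosis engine to an agent. Everything in it is an assumption made for illustration: the function names, the dictionary layout, and the `ereport.example...` and `fault.example...` class strings are hypothetical and are not the fmd interfaces.

```python
# Toy model of the flow: Error Handler -> Diagnosis Engine -> Agent.
# All names and event shapes here are hypothetical.


def error_handler(raw_status: int) -> dict:
    """Detect an error and emit telemetry (an ereport) instead of a log line."""
    return {"class": "ereport.example.device.timeout", "status": raw_status}


def diagnosis_engine(ereport: dict) -> dict:
    """Consume an ereport and name the suspected underlying problem."""
    return {"class": "fault.example.device", "caused_by": ereport["class"]}


def agent(fault: dict) -> str:
    """Act on the diagnosis, for example by retiring the suspect component."""
    return f"retire component named in {fault['class']}"


action = agent(diagnosis_engine(error_handler(0x1F)))
print(action)  # -> retire component named in fault.example.device
```

The point of the split is that each stage only needs to understand the event handed to it, so new diagnosis engines and agents can be added without touching the error handlers.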

    # fmadm config
    MODULE                   VERSION STATUS  DESCRIPTION
    cpumem-diagnosis         1.3     active  UltraSPARC-III CPU/Memory Diagnosis
    cpumem-retire            1.0     active  CPU/Memory Retire Agent
    eft                      1.12    active  eft diagnosis engine
    fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
    io-retire                1.0     active  I/O Retire Agent
    syslog-msgs              1.0     active  Syslog Messaging Agent

The source for the fault management daemon (another fine product from Mike Shapiro) lives in the Solaris source tree under the directory:
In that directory you'll find the fmd, the related commands, and all the modules (diagnosis engines and agents). You'll also find the eversholt compiler and diagnosis engine, which I'll describe in a little more detail later in this blog entry.
When a subsystem is converted to use the fault management framework, it no longer spews random error messages into syslog (or anywhere else). Instead, it packages up the error information into well-defined ereports and sends them to the fmd (the exact mechanisms for defining ereports and sending them to the fmd are details we'll skip for now).
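The difference between a free-form message and an ereport can be sketched as follows. This is only an illustration under stated assumptions: the `Ereport` class, its field names, and the `ereport.example.disk.read-failed` class string are hypothetical and do not reflect the actual ereport definitions or the mechanism for sending them to the fmd.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Ereport:
    """Hypothetical stand-in for an ereport: named fields instead of free text."""

    event_class: str  # what kind of error was observed
    detector: str     # which component observed it
    payload: dict = field(default_factory=dict)  # error-specific details


def free_form(detector: str, block: int) -> str:
    """The old style: a message a human (or a regex) has to interpret."""
    return f"WARNING: {detector}: something bad reading block {block}, retrying"


def structured(detector: str, block: int) -> Ereport:
    """The converted style: the same facts as well-defined telemetry."""
    return Ereport("ereport.example.disk.read-failed", detector, {"block": block})


report = structured("disk0", 1234)
print(free_form("disk0", 1234))
print(report.event_class, report.detector, report.payload["block"])
```

A consumer of `report` never has to parse prose; it can select events by class and read the block number directly from the payload.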
Before describing what happens next, let's talk about how you
know what should happen in your system when a problem occurs.
Before shipping any product, we should all understand how faults, defects, and upsets will be handled by that product, right? Well, maybe that sounds obvious, and maybe some people assume this is exactly what "product test" or "quality assurance" is testing, but as far as I can tell, most of the industry is quite lacking in this area. Without going into the gory details of fault injection testing and so on, let me just say a good place to start would be to define a model of how these problems impact a system. Once we have a model, we can review it with the system designers and perhaps write simulations based on the model to understand the system's fault management strategy.
One way to define such a model is by constructing a Fault Tree
for the system. A fault tree captures the causal relationships
between the problems in a system and the ereports they lead to,
as shown in the following diagram:
![[Fault Tree]](http://blogs.sun.com/roller/resources/andy/ft.jpg)
Constructing a fault tree is a laborious process. But you really
don't get anything for free -- if you want to understand how
these events propagate through a system, you need to work through
the details thoroughly. The eversholt technology provides
a way to construct these fault tree models, along with tools to
use them in simulations in order to validate the models.
(eversholt is the original code-name for the project inside Sun.
It is the name of the town in the UK where the co-inventor of
the technology lives and where we were working at the time.)
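To give a feel for what simulating against such a model means, here is a minimal sketch. It is not the eversholt language or its tools: the dictionary representation, the node names, and the `simulate` function are all assumptions chosen for illustration.

```python
# Hypothetical fault tree: each node maps to the events it can cause.
# Nodes with no entry of their own are leaves, i.e. observable ereports.
TREE = {
    "fault.fan": ["error.overtemp"],
    "fault.cpu": ["error.overtemp", "error.cache"],
    "error.overtemp": ["ereport.temp-high"],
    "error.cache": ["ereport.cache-ce"],
}


def simulate(problem: str, tree: dict = TREE) -> set:
    """Inject one problem and return the ereports the model predicts."""
    leaves, seen, stack = set(), set(), [problem]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        effects = tree.get(node, [])
        if not effects:
            leaves.add(node)  # nothing further downstream: an ereport
        stack.extend(effects)
    return leaves


print(sorted(simulate("fault.fan")))  # -> ['ereport.temp-high']
print(sorted(simulate("fault.cpu")))  # -> ['ereport.cache-ce', 'ereport.temp-high']
```

Comparing predictions like these with what a real system reports under fault injection is one way to check whether a model is correct.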
Remember that we never observe problems directly, we only see their effects. The observations are the events at the leaves of the fault tree -- the ereports. Once a fault tree model is created for a system, one might imagine that you could infer which problems are present in the system by taking the ereports you've seen and following your way back up the fault tree to the problems. Well, that's exactly how the eversholt diagnosis engine does its job. In other words, once you go through the exercise of understanding how a subsystem works, once you write the fault tree model (and review it with the appropriate people and run simulations to make sure it is correct), you get the diagnosis engine for free! That's because the eversholt diagnosis engine operates directly from the fault tree.
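The idea of following ereports back up the tree can be sketched in a few lines. This is a deliberate simplification and not the eversholt diagnosis engine: the tree contents and the `diagnose` function are hypothetical, and the sketch assumes a single problem must explain every observed ereport.

```python
# Hypothetical fault tree: each cause maps to the events it can produce.
TREE = {
    "fault.disk": ["error.read"],
    "fault.cable": ["error.read", "error.link"],
    "error.read": ["ereport.io.read-timeout"],
    "error.link": ["ereport.io.link-reset"],
}


def diagnose(observed, tree: dict = TREE) -> set:
    """Return the problems that could explain every observed ereport."""
    # Invert the tree so we can walk from an effect up to its causes.
    parents = {}
    for cause, effects in tree.items():
        for effect in effects:
            parents.setdefault(effect, set()).add(cause)

    def roots_above(ereport: str) -> set:
        roots, seen, stack = set(), set(), [ereport]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            causes = parents.get(node, set())
            if causes:
                stack.extend(causes)
            elif node in tree:
                roots.add(node)  # a cause with nothing above it: a problem
        return roots

    candidate_sets = [roots_above(e) for e in observed]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)


print(sorted(diagnose(["ereport.io.read-timeout"])))
# -> ['fault.cable', 'fault.disk']
print(sorted(diagnose(["ereport.io.read-timeout", "ereport.io.link-reset"])))
# -> ['fault.cable']
```

One ereport alone leaves two suspects, while the second ereport narrows the list to one. The diagnosis logic contains no knowledge of disks or cables; all of that lives in the tree.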
And now for a brief reality check: the
OpenSolaris launch
makes all the eversholt code open for anyone to see. But the
theory and practice behind fault trees and getting the fault
trees correct is really the hard part. Sorry, I do not have
the time to describe all the very-involved details here just
now, but keep watching out for more information! (And just
in case it isn't obvious, I'm always interested in hearing
more ideas from the community to build on what we've been
doing in this space!)
Now that we have our framework in place, as well as some initial
subsystems using it, it opens up the fault management world to
all the other subsystems. Our immediate goals are to convert the
higher pay-off systems out there -- systems where correct fault
diagnosis is a big win. We'll also want to keep improving the
framework, so we've been very careful to design it to evolve
without ever forcing us to start over yet again. Many
of these design points, future enhancements, and hardening of
more and more subsystems will all be the subject of future blog
entries, no doubt.
Thanks for reading... if you made it this far and want to
contact me, join my first name, Andy, with my last name,
Rudoff, with a dot in between them and add an "at" sign
and Sun's domain name and drop me a line.