The Solaris Fault Management Architecture has come a long way since
Mike Shapiro and I started talking about it way back in 2001. We
started out with a bang as the industry leader in fault management
technology:
- August 10, 2001: First discussions of a new approach to fault
management begin at Sun.
- January 15, 2002: First internal presentation of plans for a
Solaris Fault Management Architecture.
- March 18, 2004: FMA integrates into Solaris 10 Build 56,
providing CPU/memory support for US-III and US-IV.
- March 7, 2005: FMA ships to customers as part of Solaris 10 GA.
The members of our original development team have
changed along the way, but our commitment to improving the
architecture and adding new content remains steadfast. Since the
introduction of FMA in Solaris 10, additional content has been added
to support new platforms and extend FMA concepts into other
subsystems. Just look at what we've delivered since S10 was released
a short two years ago:
- New for SPARC: US-IV+, US-T1, Niagara & Niagara-2, Fire
PCI-E I/O
- New for x64: CPU/Memory error handling and diagnosis for AMD
Opteron and Athlon 64
  - Enables all detector banks and sets all documented MCi_CTL bits
  - Full machine-check and error-poller handling for all error types
  documented in the BKDG
  - Diagnosis engine rules for all error types
  - Response agent: core offline, page retire
- Diagnostic correlation based on transmitter/receiver error
information
- Connections to platform machine-check error handling
- Connections to FMA-aware leaf drivers for increased availability
and diagnosability
- Diagnosis engine rules for all errors described in the PCI-E Base
Specification
- Generates SNMP traps (notifications) for FMA diagnoses
- FM MIB permits retrieval of additional details by UUID
- Web-browsable interface to view:
  - 3730 FMA Events
  - 338 FMA Knowledge Articles
- CLIs to extract event payload and message content
- Updated WDD chapter for writing FMA-aware drivers
- Infrastructure to inject errors in a simulation environment
Best of all, Solaris FMA is getting noticed and showing real
benefits. The Sun Service organization estimates that platforms
shipping without FMA support can cost $252 per unit per year. Let's
do the math: if Sun sells 100,000 units per year, that means that
after 3 years, Solaris with FMA is saving Sun $75,600,000.
100,000 units per year x $252 per unit x 3 years = $75,600,000
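For anyone who wants to plug in their own numbers, here is a minimal sketch of the same back-of-the-envelope calculation. The function name and constants are just illustrative; it reproduces the simple product above and does not model a growing installed base.

```python
# Back-of-the-envelope estimate using the figures quoted in this post.
UNITS_PER_YEAR = 100_000       # hypothetical sales volume from the example
COST_PER_UNIT_PER_YEAR = 252   # dollars, per the Sun Service estimate
YEARS = 3


def fma_savings(units_per_year: int, cost_per_unit: int, years: int) -> int:
    """Return the simple product units x cost x years (illustrative helper)."""
    return units_per_year * cost_per_unit * years


total = fma_savings(UNITS_PER_YEAR, COST_PER_UNIT_PER_YEAR, YEARS)
print(f"Total over {YEARS} years: ${total:,}")
print(f"Average per year: ${total // YEARS:,}")
```

Averaged out, the three-year total works out to $25,200,000 per year.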
I don't know about you, but I wouldn't mind saving $75,600,000
over three years. A paper
presented by Mike Shapiro and Dong Tang at the Dependable Systems
and Networks (DSN) 2006 conference demonstrated a 37-54% decrease in
annual system downtime using quantitative analysis of the FMA memory
retirement capabilities. InfoWorld gave Solaris FMA a nod by awarding
our team members its 2005 Innovation of the Year Award.
So, what are we working on now? Well, we are continuing to deliver
on the promise of Predictive
Self-Healing. Work is ongoing to support out-the-door fault
management capabilities for new processors, platforms and I/O
subsystems. With the announced support for Intel on Solaris (or is it
Solaris on Intel?), we are busily working on an FMA
implementation for Intel processors. Solaris will be the first OS
to take full advantage of industry-leading x86 processor error
handling features. In the I/O space, we are beefing up leaf drivers,
adding FMA error handling and diagnosis for SCSI problems and using
SMART disk data to actively predict impending disk failures for all
platforms. The Xen project gives us an opportunity to deploy FMA in
a virtualized environment. We'll take some of the infrastructure we
delivered for LDOMs and use it to connect hypervisor error handling
to a DOM0 diagnosis environment. But that's not all: we are looking
at ways to use sensor telemetry to offer better fault prediction and
to manage resource guarantees and power budgets. On the software
front, we are adapting the techniques we've used to diagnose hardware
problems so that they are useful for software diagnosis. This is a
huge, under-explored area that will keep Solaris at the forefront of
leading-edge availability and serviceability.
Stay tuned, we're not done with FMA just yet.
Cindi