The Solaris Fault Management Architecture has come a long way since Mike Shapiro and I started talking about it way back in 2001. We started out with a bang as the industry leader in fault management technology.
The members of our original development team have changed along the way, but our commitment to improving the architecture and adding new content remains steadfast. Since the introduction of FMA in Solaris 10, additional content has been added to support new platforms and extend FMA concepts into other subsystems. Just look at what we've delivered since S10 was released a short 2 years ago:
- New for SPARC: US-IV+, US-T1, Niagara & Niagara-2, Fire PCI-E I/O
- New for x64: CPU/Memory error handling and diagnosis for AMD Opteron and Athlon 64
  - Enables all detector banks and sets all documented MCi_CTL bits
  - Full machine-check and error-poller handling for all error types documented in the BKDG
  - Diagnosis engine rules for all error types
  - Response agent: core offline, page retire
- New for x64: PCI-Express
  - Diagnostic correlation based on transmit/receiver error information
  - Connections to platform machine-check error handling
  - Connections to FMA-aware leaf drivers for increased availability and diagnosability
  - Diagnosis engine rules for all errors described in the PCI-E Base Specification
- New for Sun Fire X4500: SMART-based disk diagnosis
- New for Solaris: Initial ZFS integration with out-of-the-box FMA support
- New for Admins: FMA SNMP trap and MIB support
  - Generates SNMP traps (notifications) for FMA diagnosis
  - FM MIB permits additional details by UUID
- New for Developers: FMA Event Registry
  - Web browsable interface to view
    - 3730 FMA Events
    - 338 FMA Knowledge Articles
  - CLIs to extract event payload and message content
- New for Developers: Public interfaces for IO FMA
  - Updated WDD chapter for writing FMA-aware drivers
- Deployment: FMA Demo Package
  - Infrastructure to inject errors in a simulation environment
What's best is that Solaris FMA is getting noticed and showing real benefits. The Sun Service organization estimates that platforms shipping without FMA support can cost $252 per unit per year. Let's do the math: if Sun sells 100,000 units per year, that means after 3 years, Solaris with FMA is saving Sun $75,600,000.
100,000 units per year x $252 per unit per year x 3 years = $75,600,000
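For anyone who wants to plug in their own numbers, here is a minimal Python sketch of that back-of-the-envelope estimate. The function and parameter names are made up for illustration; only the $252 figure, the 100,000-unit volume and the 3-year window come from the estimate above, and the model is the same simple product, nothing more refined.

```python
def estimated_savings(units_per_year, cost_per_unit_per_year, years):
    """Simple product model: units x cost per unit per year x years.

    This mirrors the blog's arithmetic only; it is not a Sun Service formula.
    """
    return units_per_year * cost_per_unit_per_year * years


# Figures quoted in the post.
UNITS_PER_YEAR = 100_000
COST_PER_UNIT_PER_YEAR = 252  # US dollars
YEARS = 3

total = estimated_savings(UNITS_PER_YEAR, COST_PER_UNIT_PER_YEAR, YEARS)
per_year = estimated_savings(UNITS_PER_YEAR, COST_PER_UNIT_PER_YEAR, 1)

print(f"Per year:      ${per_year:,}")
print(f"Over {YEARS} years:  ${total:,}")
```

Changing the unit volume or the per-unit cost shows quickly how sensitive the total is to either input.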
I don't know about you, but I wouldn't mind saving more than $75,000,000 over three years. A paper presented by Mike Shapiro and Dong Tang at the Dependable Systems and Networks (DSN) 2006 conference used quantitative analysis of the FMA memory retirement capabilities to demonstrate a 37-54% decrease in annual system downtime. InfoWorld gave Solaris FMA a nod by awarding our team members its 2005 Innovation of the Year Award.
So, what are we working on now? Well, we are continuing to deliver on the promise of Predictive Self-Healing. Work is ongoing to support out-the-door fault management capabilities for new processors, platforms and I/O subsystems. With the announced support for Intel on Solaris (or is it Solaris on Intel?), we are busily working on an FMA implementation for Intel processors. Solaris will be the first OS to take full advantage of industry-leading x86 processor error handling features. In the I/O space, we are beefing up leaf drivers, adding FMA error handling and diagnosis for SCSI problems, and using SMART disk data to actively predict impending disk failures for all platforms. The Xen project gives us an opportunity to deploy FMA in a virtualized environment. We'll take some of the infrastructure we delivered for LDOMs and use it to connect hypervisor error handling to a DOM0 diagnosis environment. But that's not all...we are looking at ways to use sensor telemetry to offer better fault prediction and to manage resource guarantees and power budgeting. On the software front, we are adapting the techniques we've used to diagnose hardware problems so they are useful for software diagnosis. This is a huge under-explored area that will keep Solaris in the forefront with leading-edge availability and serviceability.
Stay tuned, we're not done with FMA just yet.
Cindi