Top Solaris 10 features for fault management
A number of Sun bloggers have posted top-N lists of cool new features in Solaris 10 (e.g., in Adam's blog). I thought I'd have a go at a Solaris 10 top 10 from the point of view of error handling, fault management, and "Predictive Self-Healing", and then go into each item in a bit more detail in future entries.
- Sun has adopted a fault management architecture, and Solaris 10 delivers the first (of many planned) offerings implementing this architecture in the fault management daemon (part of the svc:/system/fmd:default service).
- Error event handlers now propagate structured error reports to the fault manager, where they are logged in perpetuity.
- Diagnosis engines now exist to automate much of the diagnosis.
- Agent software can implement various policies given a diagnosis, e.g., to offline and blacklist a CPU or to retire some memory.
- Only diagnoses will appear on the console and the like, and they reference web-based knowledge articles.
- The contract filesystem ctfs provides a mechanism by which we can communicate hardware errors to groups of affected processes.
- The Service Management Facility is available to manage services affected by errors.
- Error trap handlers are more robust now that there is a clear separation of responsibilities.
- Getting error telemetry out of the kernel is dead easy now.
- Fault management is no longer an afterthought! And it is set to grow and grow.
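
To make the flow in the list above a little more concrete, here is a toy sketch in Python of the three roles: error reports are logged, a diagnosis step decides when a resource is faulty, and an agent acts on that diagnosis. This is not the fmd API. Every name in it (`ErrorReport`, `FaultManager`, the `ereport.example.mem.ce` class string, the `dimm0`/`dimm1` resources) and the simple count-to-threshold policy are assumptions made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ErrorReport:
    """A structured error report: an event class plus the affected resource."""
    event_class: str  # hypothetical class name, e.g. "ereport.example.mem.ce"
    resource: str     # hypothetical resource name, e.g. "dimm0"


@dataclass
class FaultManager:
    """Toy stand-in for a fault manager; not how fmd is actually built."""
    threshold: int = 3  # assumed policy: errors tolerated per resource
    error_log: list = field(default_factory=list)
    counts: Counter = field(default_factory=Counter)
    retired: set = field(default_factory=set)
    console: list = field(default_factory=list)

    def post(self, report):
        """Log every report, then hand it to the diagnosis step."""
        self.error_log.append(report)
        self.diagnose(report)

    def diagnose(self, report):
        """Toy diagnosis engine: enough errors on one resource means a fault."""
        if report.resource in self.retired:
            return  # already diagnosed and handled
        self.counts[report.resource] += 1
        if self.counts[report.resource] >= self.threshold:
            self.respond(report.resource)

    def respond(self, resource):
        """Toy agent: retire the resource and announce only the diagnosis."""
        self.retired.add(resource)
        self.console.append(f"fault diagnosed: {resource} retired")


def run_demo():
    """Feed a handful of made-up error reports through the toy fault manager."""
    fm = FaultManager(threshold=3)
    for resource in ["dimm0", "dimm1", "dimm0", "dimm0", "dimm0"]:
        fm.post(ErrorReport("ereport.example.mem.ce", resource))
    return fm


if __name__ == "__main__":
    fm = run_demo()
    print(len(fm.error_log), "reports logged")
    print("\n".join(fm.console))
```

The point of the sketch is the separation of responsibilities: the code that posts a report knows nothing about diagnosis or policy, every report lands in the log, and only the diagnosis reaches the console.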