And one retro fave:
:wq
Tuesday Apr 14, 2009
Sun's Xeon 5500 servers and blades have two fault management precincts - the service processor (SP) and the host operating system. This new line of systems has SP-based fault detection and diagnosis for several subsystems, providing a solid base level of fault management irrespective of the host operating system (but you're all running Solaris or OpenSolaris, right? :) A quick rundown of the subsystems that are fault managed:
- Nehalem CPUs: The SP detects and diagnoses all processor uncorrectable errors (UEs). Correctable errors (CEs) are remanded to the host operating system. If the OS is Solaris, processors can be offlined by Solaris FMA if CEs occur too frequently. Also, Solaris FMA will capture error state on CPU UEs and report them upon the next Solaris restart.
- Memory: As with CPUs, memory UEs are diagnosed and reported within the SP. Unlike CPU errors, memory UEs are not visible to the host operating system.
- IO Subsystem: Errors in the IO Hub (Tylersburg) itself are detected and reported via the SP. PCI/PCIE fabric errors are handled by the host operating system (via AER).
- Power, Cooling & Environmentals: The SP detects and diagnoses problems with the bulk power supplies and fan trays. It also monitors the various component sensors (temp, voltage, etc.), reporting components that have gone out of tolerance.
The service processor provides coverage for a good portion of the errors in the system, yet the host operating system can augment the SP, notably in the area of recovery and/or isolation of problematic resources. If you're at all familiar with Solaris, you'll know that Solaris has the capacity to offline individual processor strands (no further software threads are scheduled on the affected strand), retire individual pages of memory (8KB granularity), and cease using problematic IO devices (configurations & active usage permitting).
Another interesting look at Sun's new systems is the fault management functionality for various host operating systems - what diagnosis and recovery features are available straight out of the box:
| | Host Operating System | | |
| Subsystem Diagnosis | Solaris | Windows | Linux |
| CPU Correctable | Yes | No | No |
| CPU Uncorrectable | Yes | Yes | Yes |
| Memory Correctable | Yes | Yes | Yes |
| Memory Uncorrectable | Yes | Yes | Yes |
| IO Hub (Tylersburg) | Yes | Yes | Yes |
| PCI/PCIE Fabric (assuming AER support) | Yes | Yes | Yes 1 |
| Recovery/Isolation | Solaris | Windows | Linux |
| CPU Strand Offline | Yes | No | No |
| Memory Page Retire | Yes | No | No |
| IO Device Retire | Yes | No | No |
| 1 Provided Linux kernel has PCI AER support | | | |
Irrespective of your choice of host operating system, Sun's Xeon 5500 suite of systems provides solid fault detection and reporting capabilities. And when coupled with Solaris, fault resilience is improved thanks to its recovery and isolation capabilities.
Click here for more Sun blogs about Sun's new Xeon 5500 systems.
:wq
1 If running OpenSolaris on an Ultra 27, it is recommended to install build 111 or later to include the fix to CR 6804867. This fix is also planned for a future Solaris 10 Update release.
Tuesday Apr 07, 2009
For the purposes of this discussion, I'll refer to an SNMP trap from Solaris FMA as an "FMA SNMP trap" and one from ILOM as an "ILOM SNMP trap".
The key to understanding why both ILOM and Solaris must be monitored is the flow of fault information in the system. This picture (admittedly simplified) should help:
For faults generated in their respective precincts, ILOM-diagnosed faults produce an ILOM SNMP trap, and Solaris FMA faults produce an FMA SNMP trap (via the snmp-trapgen plugin). There's also a level of fault sharing between ILOM and Solaris, but notice that the flow of fault information is from Solaris to ILOM.
In Solaris, an FMD plugin called the Event Transport Module (ETM) subscribes to selected fault events and (you guessed it) transports them to ILOM. ILOM then updates its state and view of the components in the system. And for faults received from Solaris, ILOM will also generate an SNMP trap. However, ETM does not transport all fault events. Some fault events are not meaningful to ILOM as they represent components beyond ILOM's visibility. Precisely which faults are forwarded by ETM is driven by a configuration file, etm.conf, tailored for each platform or platform family.
This gives us a few flows for SNMP trap generation.
- FMA diagnosed fault that is transported to ILOM: snmp-trapgen in Solaris generates an FMA SNMP trap. And ILOM generates an ILOM SNMP trap when the fault is received in the service processor.
- FMA diagnosed fault that is not transported to ILOM: snmp-trapgen in Solaris generates an FMA SNMP trap.
- ILOM diagnosed fault: ILOM generates an ILOM SNMP trap.
- ILOM chassis event: ILOM generates an ILOM SNMP trap.
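To make those flows concrete, here's a minimal Python sketch. It is only a model of the rules listed above, not code from Solaris or ILOM; the names `Diagnoser` and `expected_traps` and the `forwarded_by_etm` flag are made up for illustration.

```python
from enum import Enum


class Diagnoser(Enum):
    """Where the fault (or chassis event) originates. Hypothetical names."""
    SOLARIS_FMA = "Solaris FMA"
    ILOM = "ILOM"


def expected_traps(diagnoser: Diagnoser, forwarded_by_etm: bool = False) -> set[str]:
    """Return the SNMP traps expected for one fault, per the flows listed above."""
    if diagnoser is Diagnoser.ILOM:
        # ILOM-diagnosed faults and chassis events produce only an ILOM trap;
        # fault information does not flow from ILOM to Solaris.
        return {"ILOM SNMP trap"}

    # FMA-diagnosed faults produce an FMA trap via snmp-trapgen.
    traps = {"FMA SNMP trap"}
    if forwarded_by_etm:
        # ILOM also generates a trap for faults it receives through ETM.
        traps.add("ILOM SNMP trap")
    return traps


if __name__ == "__main__":
    print(sorted(expected_traps(Diagnoser.SOLARIS_FMA, forwarded_by_etm=True)))
    print(sorted(expected_traps(Diagnoser.SOLARIS_FMA, forwarded_by_etm=False)))
    print(sorted(expected_traps(Diagnoser.ILOM)))
```

Whether a particular fault is actually forwarded depends on the platform's etm.conf, which this sketch reduces to a single boolean.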
Summing this up, taking into account the faults that ETM will forward to ILOM, we can expect the following SNMP trap generation for the various subsystems:
| Subsystem | FMA SNMP Trap | ILOM SNMP Trap |
| Processor/Cache | yes | yes |
| Memory | yes | yes |
| PCI/PCIE | yes | yes |
| Coherency Links 1 | yes | yes |
| ZFS | yes | no |
| Disks | yes | no |
| SCSI | yes | no |
| Power/Cooling | no | yes |
| Environmental/Sensors | no | yes |
| ASR Disables | no | yes |
| Component Insertion/Removal | no | yes |
| 1 T5440 and derivatives only | | |
As I'm not an expert on ILOM, the table above may not be an exhaustive list of all of the ILOM events that can trigger an ILOM SNMP trap. But I believe it's sufficient to illustrate the point that if you're monitoring your SPARC CMT systems via SNMP, you must monitor both ILOM SNMP and FMA SNMP traps.
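If you want to see what you'd miss by listening to only one trap source, a small Python sketch can answer that. The dictionary is transcribed from the table above (so it inherits the same "may not be exhaustive" caveat), and the names `TRAP_SOURCES` and `unmonitored` are hypothetical.

```python
# (FMA SNMP trap, ILOM SNMP trap) per subsystem, transcribed from the table above.
TRAP_SOURCES = {
    "Processor/Cache": (True, True),
    "Memory": (True, True),
    "PCI/PCIE": (True, True),
    "Coherency Links": (True, True),  # T5440 and derivatives only
    "ZFS": (True, False),
    "Disks": (True, False),
    "SCSI": (True, False),
    "Power/Cooling": (False, True),
    "Environmental/Sensors": (False, True),
    "ASR Disables": (False, True),
    "Component Insertion/Removal": (False, True),
}


def unmonitored(watch_fma: bool, watch_ilom: bool) -> list[str]:
    """List subsystems whose faults raise no trap that you are listening for."""
    return [
        name
        for name, (fma, ilom) in TRAP_SOURCES.items()
        if not ((fma and watch_fma) or (ilom and watch_ilom))
    ]


if __name__ == "__main__":
    print("FMA traps only, blind to:", unmonitored(watch_fma=True, watch_ilom=False))
    print("ILOM traps only, blind to:", unmonitored(watch_fma=False, watch_ilom=True))
    print("Both, blind to:", unmonitored(watch_fma=True, watch_ilom=True))
```

Listening to FMA traps alone leaves the power, cooling, environmental, ASR, and insertion/removal rows uncovered, while ILOM traps alone leave out ZFS, disks, and SCSI. Only the combination covers every row.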
:wq
This blog copyright 2009 by Scott Davenport