UltraSPARC-T2 Fault Management
As Sun announced the UltraSPARC-T2 processor today, I thought I'd take a quick dip into the blogging pool and describe, at a high level, some of the fault management and predictive self-healing capabilities we have forthcoming in support of the chip. If you haven't read about the UltraSPARC-T2, the chip overview is:
- 8 cores, each core with 8 CPU strands
- integrated IO root complex
- integrated 10GbE networking
- dedicated cryptographic and floating point units per core
And here are the fault management features in support of it:
- Diagnosis of CPU errors at the strand, core, and chip level: This is an improvement over UltraSPARC-T1, which did everything at the strand level. Now, with the next release of Solaris 10, a fault in a resource that is shared across all strands within a core will offline all impacted strands, not just the detecting one. And, yes, this will apply retroactively to UltraSPARC-T1 with the next release of Solaris 10 (or today, if you're running OpenSolaris).
- Diagnosis of the memory subsystem: Diagnosis to the memory page level and page retire operations are available on UltraSPARC-T2. Additionally, memory SERDing (soft error rate discrimination) is done at the page level (vs. the DIMM level).
- Diagnosis of the IO subsystem, including PCI Express Fabric: In addition to diagnosis of the root complex itself, UltraSPARC-T2 takes advantage of additional support put into Solaris for PCIE Fabric diagnosis.
- Offlining of cryptographic units: If a fault is diagnosed to one of the crypto units in a core, the crypto drivers will stop using that crypto unit. Other crypto units in your set of domain resources are still available. And all CPU strands remain active.
- Diagnosis of the on-chip network unit: the new 10 GbE network unit can report errors, and we'll diagnose them.
- POST/FMA interaction: In T1000/T2000 systems, POST initially would fail a DIMM based on a single correctable error. By virtue of the configuration requirements of the memory subsystem, this had the net effect of taking away half of system memory. Starting with UltraSPARC-T2 platforms, when POST encounters a correctable memory error, the error is queued up for FMA diagnosis. When the domain comes online, if a page retire is necessary, it is performed. At worst, you lose an 8K page instead of 50% of memory.
- Inclusion of part/serial numbers in fault events: For those who service systems, fault events on UltraSPARC-T2 platforms now include the FRU part and serial number of the faulted component(s).
- Single 'fmadm repair' operation: Faults diagnosed by FMA in the Solaris domain can be repaired with a single 'fmadm repair' command on the OS side. The SP state of the component(s) is kept in sync. For those familiar with the two-step process on T1000/T2000 systems, this goes away.
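To make the strand/core/chip distinction above a bit more concrete, here is a minimal Python sketch of how the scope of a diagnosis could map to the set of strands taken offline. This is only an illustration, not the actual diagnosis engine: the function name `strands_to_offline`, the scope labels, and the contiguous strand numbering (strand IDs 0-7 on core 0, 8-15 on core 1, and so on) are all assumptions made for the example. Only the 8 cores x 8 strands layout comes from the chip overview.

```python
CORES = 8             # from the chip overview
STRANDS_PER_CORE = 8  # from the chip overview


def strands_to_offline(scope, detecting_strand):
    """Return the strand IDs to offline for a fault of the given scope.

    Hypothetical helper: assumes strand IDs are numbered contiguously
    per core, which is an assumption of this sketch.
    """
    total = CORES * STRANDS_PER_CORE
    if not 0 <= detecting_strand < total:
        raise ValueError("strand ID out of range")

    if scope == "strand":
        # Fault is private to one strand: only the detector goes offline.
        return [detecting_strand]
    if scope == "core":
        # Fault is in a resource shared by every strand in the core.
        first = (detecting_strand // STRANDS_PER_CORE) * STRANDS_PER_CORE
        return list(range(first, first + STRANDS_PER_CORE))
    if scope == "chip":
        # Fault is in a chip-wide resource: every strand is impacted.
        return list(range(total))
    raise ValueError("unknown scope: %r" % (scope,))


if __name__ == "__main__":
    print(strands_to_offline("strand", 10))
    print(strands_to_offline("core", 10))
```

With strand-only diagnosis, a fault in a core-shared resource reported by strand 10 would offline just that strand; with core-level diagnosis, all eight strands sharing the resource go offline together.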
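Page-level SERDing and page retire can also be sketched in a few lines. The toy Python class below counts correctable errors per 8K page and marks a page retired once enough errors land on it within a time window. It is a sketch under stated assumptions, not how FMA implements it: the class name `PageSerd`, the "at least n events within t seconds" firing rule, and the thresholds used in the example are all hypothetical. Only the 8K page size and the per-page (rather than per-DIMM) granularity come from the list above.

```python
from collections import defaultdict, deque

PAGE_SIZE = 8 * 1024  # 8K pages, as mentioned above


class PageSerd:
    """Toy per-page SERD engine (hypothetical, for illustration only).

    Fires for a page when at least n correctable errors hit that page
    within a sliding window of t seconds. The exact firing rule is an
    assumption of this sketch.
    """

    def __init__(self, n, t):
        self.n = n
        self.t = t
        self.events = defaultdict(deque)  # page number -> recent timestamps
        self.retired = set()              # page numbers already retired

    def record(self, phys_addr, timestamp):
        """Record one correctable error; return True if it retires the page."""
        page = phys_addr // PAGE_SIZE
        if page in self.retired:
            return False

        window = self.events[page]
        window.append(timestamp)
        # Discard events that are older than the time window.
        while window and timestamp - window[0] > self.t:
            window.popleft()

        if len(window) >= self.n:
            self.retired.add(page)
            del self.events[page]
            return True
        return False


if __name__ == "__main__":
    serd = PageSerd(n=3, t=3600)
    for addr, ts in [(0x2000, 0), (0x2010, 10), (0x4000, 20), (0x3FFF, 30)]:
        print(hex(addr), serd.record(addr, ts))
    print("retired pages:", sorted(serd.retired))
```

Because the counting is per page, errors scattered across many pages of the same DIMM do not add up against a single counter, and a page that does trip the threshold costs 8K rather than a whole DIMM.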
:wq