AMD Opteron/Athlon64/Turion64 Fault Management
In February we (the Solaris Kernel/RAS group) integrated the "fma x64" project into Solaris Nevada, delivering Fault Management for the AMD K8 family of chips (Athlon(TM) 64, Opteron(TM), Turion(TM) 64). This brings fault management on the Solaris AMD64 platform for cpu and memory up to par with that already present on Sun's current SPARC platforms, and addresses one of the most-requested missing functionalities required by customers (or potential customers) of Sun's impressive (and growing) AMD Opteron family offerings (the project, of course, benefits all AMD 64-bit systems not just those from Sun). We had planned some blogging from the rooftops about the project back at integration, but instead we all concentrated on addressing the sleep deficit that the last few hectic weeks of a big project brought and since putback to the Solaris 11 gate there has also been much effort in preparing the backport to Solaris 10 Update 2 (aka Solaris 10 06/06).
Well, it has already hit the streets now that Solaris Express Community Edition build 34 is available for download and the corresponding source is available at cvs.opensolaris.org (around 315 files, search for "2006/020" in file history). There are a few bug fixes that will appear in build 36, but build 34 has all of the primary fault management functionality.
In this blog I'll attempt an overview of the project functionality, with some examples. In future entries I'll delve into some of the more subtle nuances. Before I begin I'll highlight something the project does not deliver: any significant improvement in machine error handling and fault diagnosis for Intel chips (i.e., anything more than a terse console message). This stuff is very platform/chip dependent, and since Sun has a number of AMD Opteron workstation and server offerings with much more in the pipeline it was the natural first target for enhanced support. The project also does not deliver hardened IO drivers and corresponding diagnosis - that is the subject of a follow-on project due in a couple of months.
AMD64 Platform Error Detection Architecture Overview
The following image shows the basics of a 2 chip dual-core (4 cores total) AMD system (with apologies to StarOffice power users):

This project is concerned with the initial handling of an error, marshalling of error telemetry from the cpu and memory components (a followup project, due in the next few months, will do the same for telemetry from the IO subsystem), and then consuming that telemetry to produce any appropriate diagnosis of any fault that is determined to be present. These chips have a useful array of error detectors, as described in the following table:
| Functional Unit | Array | Protection |
|---|---|---|
| Instruction Cache (IC) | Icache main tag array | Parity |
| | Icache snoop tag array | Parity |
| | Instruction L1 TLB | Parity |
| | Instruction L2 TLB | Parity |
| | Icache data array | Parity |
| Data Cache (DC) | Dcache main tag array | Parity |
| | Dcache snoop tag array | Parity |
| | Dcache L1 TLB | Parity |
| | Dcache L2 TLB | Parity |
| | Dcache data array | ECC |
| L2 Cache ("bus unit") (BU) | L2 cache main tag array | ECC and Parity |
| | L2 cache data array | ECC |
| Northbridge (NB) | Memory controlled by this node | ECC (depends on dimms) |
If an error is recoverable then it does not raise a Machine Check Exception (MCE or mc#) when detected. The recoverable errors, broadly speaking, are single-bit ECC errors from ECC-protected arrays and parity errors on clean parity-protected arrays such as the Icache and the TLBs (translation lookaside buffers - virtual to physical address translations). Instead of raising a mc#, the recoverable errors simply log error data into the machine check architecture registers of the detecting bank (IC/DC/BU/NB, plus one not mentioned in the table above, the Load-Store unit LS) and the operating system (or advanced BIOS implementations) can poll those registers to harvest the information. No special handling of the error is required (e.g., no need to flush caches, panic etc).
If an error is irrecoverable then detection of that error will raise a machine check exception (if the bit that controls mc# for that error type is set; if not you'll either never know or you pick it up by polling). The mc# handler can extract information about the error from the machine check architecture registers as before, but has the additional responsibility of deciding what further actions (which may include panic and reboot) are required. A machine check exception is a form of interrupt which allows immediate notification of an error condition - you can't afford to wait to poll for the error since that could result in the use of bad data and associated data corruption.
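To make the polling half of that concrete, here is a minimal sketch of a poller walking a set of bank status registers. It is a toy model in Python, not the Solaris implementation: the `Bank` class and `poll` function are invented for illustration, and the bit positions for "valid" and "uncorrected" and the bank numbering are my assumptions about the register layout rather than anything stated above.

```python
from dataclasses import dataclass

# Assumed bit positions in a bank status register.
VAL = 1 << 63   # register holds a valid error record
UC = 1 << 61    # error was not corrected by hardware

# Assumed bank numbering for these chips; hypothetical for this sketch.
BANKS = {0: "DC", 1: "IC", 2: "BU", 3: "LS", 4: "NB"}


@dataclass
class Bank:
    """Stand-in for one bank's status register."""
    status: int = 0


def poll(banks):
    """Harvest valid records from every bank, then clear them."""
    harvested = []
    for number, bank in sorted(banks.items()):
        if not bank.status & VAL:
            continue  # nothing logged in this bank
        kind = "uncorrected" if bank.status & UC else "corrected"
        harvested.append((BANKS[number], kind, bank.status))
        bank.status = 0  # clear so the next error can be logged
    return harvested


banks = {number: Bank() for number in BANKS}
banks[0].status = VAL          # pretend the DC bank logged a corrected error
first = poll(banks)
second = poll(banks)           # nothing left to harvest
print(first)
print(second)
```

The point of the sketch is only the shape of the loop: check the valid bit, harvest, clear. The exception path differs in that it cannot wait for the next polling interval.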
Traditional Error Handling - The "Head in the Sand" Approach
The traditional operating system (all OS, not just Solaris) approach to errors in the x86 cpu architecture is as follows:
- leave the BIOS to choose which error detecting banks to enable, and which irrecoverable errors that are detected will raise a mc#
- if a mc# is raised the OS fields it and terminates with a useful diagnostic message such as "Machine Check Occured", not even hinting at the affected resource
- ignore recoverable errors, i.e. don't poll for their occurrence; a more advanced BIOS will perhaps poll for these errors but is not in a position to do anything about them while the OS is running
Recognising the increased need for error handling and fault management on the x86 platform, some operating systems have begun to offer limited support in this area. Solaris has been doing this for some time on sparc (let's just say the US-II E-cache disaster did have some good side-effects!) and so in Solaris we will offer the well-rounded end-to-end fault management on amd64 platforms that we already have on sparc.
A Better Approach - Sun's Fault Management Architecture "FMA"
In a previous blog entry I described the Sun Fault Management Architecture. Error events flow into a Fault Manager and associated Diagnosis Engines which may produce fault diagnoses which can be acted upon not just for determining repair actions but also to isolate the fault before it affects system availability (e.g., to offline a cpu that is experiencing errors at a sustained rate). This architecture has been in use for some time now in the sparc world, and this project expands it to include AMD chips.
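The "errors at a sustained rate" idea can be sketched as a sliding-window counter: tolerate the occasional error, but trip once N arrive within T seconds. This is only an illustration in Python, not the actual diagnosis rules; the `RateThreshold` class and the threshold of 3 errors in 60 seconds are hypothetical, and the sketch assumes events arrive in time order.

```python
from collections import deque


class RateThreshold:
    """Trip once n events have been seen within any window of t seconds."""

    def __init__(self, n, t):
        self.n = n
        self.t = t
        self.times = deque()

    def record(self, when):
        """Record an event at time `when` (seconds); return True if tripped."""
        self.times.append(when)
        # Discard events that have aged out of the window ending at `when`.
        while when - self.times[0] > self.t:
            self.times.popleft()
        return len(self.times) >= self.n


# Hypothetical threshold: 3 errors within 60 seconds.
engine = RateThreshold(n=3, t=60)
results = [engine.record(ts) for ts in (0, 100, 130, 150)]
print(results)
```

The isolated error at time 0 ages out and is forgotten; the three errors at 100, 130 and 150 fall inside one window, so only the last call reports a trip.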
FMA for AMD64
To deliver FMA for AMD64 systems the project has:
- made Solaris take responsibility (in addition to the BIOS) for deciding which error-detecting banks to enable and which error types will raise machine-check exceptions
- taught Solaris how to recognize all the error types documented in the AMD Bios and Kernel Developer's Guide
- delivered an intelligent machine-check exception handler and periodic poller (for recoverable errors) which collect all error data available, determine what error type has occurred and propagate it for logging, and take appropriate action (if any)
- introduced cpu driver modules to the Solaris x86 kernel (as have existed on sparc for many years) so that features of a particular processor family (such as the AMD processors) may be specifically supported
- introduced a memory-controller kernel driver module whose job it is to understand everything about the memory configuration of a node (e.g., to provide translation from a fault address to which dimm is affected)
- developed rules for consuming the error telemetry with the "eft" diagnosis engine; these are written using the "eversholt" diagnosis language, and their task is to diagnose any faults that the incoming telemetry may indicate
- delivered an enhanced "platform topology" library to describe the inter-relationship of the hardware components of a platform and to provide a repository for hardware component properties
With all this in place we are now able to diagnose the following fault classes:
| Fault Class | Description |
|---|---|
| fault.cpu.amd.dcachedata | DC data array fault |
| fault.cpu.amd.dcachetag | DC main tag array fault |
| fault.cpu.amd.dcachestag | DC snoop tag array fault |
| fault.cpu.amd.l1dtlb | DC L1TLB fault |
| fault.cpu.amd.l2dtlb | DC L2TLB fault |
| fault.cpu.amd.icachedata | IC data array fault |
| fault.cpu.amd.icachetag | IC main tag array fault |
| fault.cpu.amd.icachestag | IC snoop tag array fault |
| fault.cpu.amd.l1itlb | IC L1TLB fault |
| fault.cpu.amd.l2itlb | IC L2TLB fault |
| fault.cpu.amd.l2cachedata | L2 data array fault |
| fault.cpu.amd.l2cachetag | L2 tag fault |
| fault.memory.page | Individual page fault |
| fault.memory.dimm_sb | A DIMM experiencing sustained excessive single-bit errors |
| fault.memory.dimm_ck | A DIMM with ChipKill-correctable multiple-bit faults |
| fault.memory.dimm_ue | A DIMM with an uncorrectable (not even with ChipKill, if present and enabled) multiple-bit fault |
An Example - A CPU With Single-bit Errors
The system is a v40z with hostname 'parity' (we also have other cheesy
hostnames such as 'chipkill', 'crc', 'hamming' etc!). It has 4 single-core
Opteron cpus. If we clear all fault management history and let it run
for a while (or give it a little load to speed things up) we very soon
see the following message on the console:
SUNW-MSG-ID: AMD-8000-5M, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Mar 15 08:06:08 PST 2006
PLATFORM: i86pc, CSN: -, HOSTNAME: parity
SOURCE: eft, REV: 1.16
EVENT-ID: 1e8e6d7a-fd3a-e4a4-c211-dece00e68c65
DESC: The number of errors associated with this CPU has exceeded acceptable levels. Refer to http://sun.com/msg/AMD-8000-5M for more information.
AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU. Use fmdump -v -u <EVENT-ID> to identify the module.
Running the indicated command we see that cpu 3 has a fault:
# fmdump -v -u 1e8e6d7a-fd3a-e4a4-c211-dece00e68c65
TIME UUID SUNW-MSG-ID
Mar 15 08:06:08.5797 1e8e6d7a-fd3a-e4a4-c211-dece00e68c65 AMD-8000-5M
100% fault.cpu.amd.l2cachedata
Problem in: hc:///motherboard=0/chip=3/cpu=0
Affects: cpu:///cpuid=3
FRU: hc:///motherboard=0/chip=3
That tells us the resource affected (chip3, cpu core 0), its logical
identifier (cpuid 3, as used in psrinfo etc), and the field replaceable
unit that should be replaced (the chip, you can't replace a core).
In future we intend to extract FRU labelling information from SMBIOS
but at the moment there are difficulties with smbios data and the
accuracy thereof that make that harder than it should be.
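If you want to pick those hc-scheme strings apart in a script, something like the following would do. It is a minimal Python sketch based only on the shape of the strings in the fmdump output above; `parse_hc_fmri` is a name made up for this example, not a Solaris library call, and it does not attempt to handle any other FMRI scheme or syntax.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI into an ordered list of (name, instance) pairs."""
    prefix = "hc:///"
    if not fmri.startswith(prefix):
        raise ValueError(f"not an hc FMRI: {fmri!r}")
    pairs = []
    for component in fmri[len(prefix):].split("/"):
        name, _, instance = component.partition("=")
        pairs.append((name, int(instance)))
    return pairs


resource = parse_hc_fmri("hc:///motherboard=0/chip=3/cpu=0")
fru = parse_hc_fmri("hc:///motherboard=0/chip=3")
print(resource)
# The FRU path is a prefix of the resource path: the core lives on that chip.
print(resource[:len(fru)] == fru)
```

The prefix check at the end mirrors the point made above: the faulty resource is a core, but the thing you replace is the chip that contains it.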
If you didn't see or notice the console message then running
fmadm faulty highlights resources that have been diagnosed
as faulty:
# fmadm faulty
STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
faulted cpu:///cpuid=3
1e8e6d7a-fd3a-e4a4-c211-dece00e68c65
-------- ----------------------------------------------------------------------
In Solaris 11 already and coming to Solaris 10 Update 2 is
SNMP trap support for FMA fault events,
which provides another avenue by which you can become aware of a
newly-diagnosed fault.
We can see the automated response that was performed upon making
the diagnosis of a cpu fault:
# psrinfo
0 on-line since 03/11/2006 00:27:08
1 on-line since 03/11/2006 00:27:08
2 on-line since 03/10/2006 23:28:51
3 faulted since 03/15/2006 08:06:08
The faulted resource has been isolated by offlining the cpu. If you reboot then the cache of faults will cause the cpu to be offlined again.
Note that the event id appears in the fmadm faulty output,
so you can formulate the
fmdump command line shown in the console message if you wish and visit
sun.com/msg and enter the
indicated SUNW-MSG-ID (quick aside: we have some people working on
beefing up the amd64 knowledge articles there, the current ones
are pretty uninformative). We can also use the event id to see what
error reports led to this diagnosis:
# fmdump -e -u 1e8e6d7a-fd3a-e4a4-c211-dece00e68c65
TIME CLASS
Mar 15 08:05:18.1624 ereport.cpu.amd.bu.l2d_ecc1
Mar 15 08:04:48.1624 ereport.cpu.amd.bu.l2d_ecc1
Mar 15 08:04:48.1624 ereport.cpu.amd.dc.inf_l2_ecc1
Mar 15 08:06:08.1624 ereport.cpu.amd.dc.inf_l2_ecc1
The -e option selects dumping of the error log instead of the fault log, so we can see the error telemetry that led to the diagnosis. So we see that in the space of less than two minutes this cpu experienced 4 single-bit errors from the L2 cache - we are happy to tolerate occasional single-bit errors but not at this rate, so we diagnose a fault. If we use option -V we can see the full error report contents, for example for the last ereport above:
Mar 15 2006 08:06:08.162418201 ereport.cpu.amd.dc.inf_l2_ecc1
nvlist version: 0
class = ereport.cpu.amd.dc.inf_l2_ecc1
ena = 0x62a5aaa964f00c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = hc
hc-list = (array of embedded nvlists)
(start hc-list[0])
nvlist version: 0
hc-name = motherboard
hc-id = 0
(end hc-list[0])
(start hc-list[1])
nvlist version: 0
hc-name = chip
hc-id = 3
(end hc-list[1])
(start hc-list[2])
nvlist version: 0
hc-name = cpu
hc-id = 0
(end hc-list[2])
(end detector)
bank-status = 0x9432400000000136
bank-number = 0x0
addr = 0x5f76a9ac0
addr-valid = 1
syndrome = 0x64
syndrome-type = E
ip = 0x0
privileged = 1
__ttl = 0x1
__tod = 0x44183b70 0x9ae4e19
One day we'll teach fmdump (or some new command) to mark all that
stuff up into human-readable output. For now it shows the raw(ish)
telemetry read from the machine check architecture registers when
we polled for this event. This telemetry is consumed by the
diagnosis rules
to produce any appropriate fault diagnosis.
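Until that markup exists, a few lines of script can decode the bank-status value by hand. The following Python sketch is my own reading of the status register layout, not code from the project: the bit positions are assumptions, `decode_bank_status` is a hypothetical helper, and only the handful of fields shown are decoded.

```python
def decode_bank_status(status):
    """Pick apart a 64-bit machine check bank status value."""
    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        "syndrome": (status >> 47) & 0xFF,  # assumed position: bits 54:47
        "corrected_ecc": bit(46),           # assumed position
        "uncorrected_ecc": bit(45),         # assumed position
        "error_code": status & 0xFFFF,
    }


# The bank-status value from the ereport above.
fields = decode_bank_status(0x9432400000000136)
for name, value in fields.items():
    shown = value if isinstance(value, bool) else hex(value)
    print(f"{name:16} {shown}")
```

Under those assumptions the decoded syndrome comes out as 0x64 and the address-valid bit is set, which agrees with the syndrome and addr-valid members of the ereport, and the corrected-ECC bit is set while the uncorrected bit is clear, as you would expect for a single-bit error harvested by polling.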
We're Not Done Yet
There are a number of additional features that we'd like to bring to amd64 fault management. For example:
- use more SMBIOS info (on platforms that have SMBIOS support, and which give accurate data!) to discover FRU labelling etc
- introduce serial number support so that we can detect when a cpu or dimm has been replaced (currently you have to perform a manual fmadm repair)
- introduce some form of communication (if only one-way) between the service processor (on systems that have such a thing, such as the current Sun AMD offerings) and the diagnosis software
- extend the diagnosis rules to perform more complex predictive diagnosis for DIMM errors based on what we have learned on sparc