Fault Management Top 10: #2
Structured Error Events
When subsystems are left to invent their own error handling infrastructure the natural baseline they tend towards is either that of zero infrastructure (just die horribly on any error) or printf logging/debugging (some odd message that has little or no structure, is not registered in any repository, probably means little to anyone other than the developer, and often contains no useful diagnosis information). I'll term this unstructured printf messaging for this discussion.There was a time several years ago when Solaris hardware error handling amounted to little greater than either of those baselines. "Fortunately" the UltraSPARC-II Ecache ... errmmm ... niggle provided the much-needed catalyst towards the very substantial improvements we see today (and a grand future roadmap). I'll use this as an example of what I'll term structured printf messaging below.
Finally I will contrast these with structured error events.
Unstructured Printf Messaging
The following is an example message that would be barfed (I find that word really suitable in these descriptions) to /var/adm/messages and console in older/unpatched versions of Solaris prior to the first round of hardware error handling improvements. The message is the result of a corrected error from memory.At least there is some diagnostic information, such as telling as which bit was in error (we could grok for patterns). While you could argue that there is some structure to that message you still have to write/maintain a custom grokker to extract it. The task, of course, is complicated by other related messages sharing little or no common structure. Here's an uncorrectable error from memory:May 8 14:35:30 thishost SUNW,UltraSPARC-II: CPU1 CE Error: AFSR 0x00000000.00100000 AFAR 0x00000000.8abb5a00 UDBH Syndrome 0x85 MemMod U0904 May 8 14:35:30 thishost SUNW,UltraSPARC-II: ECC Data Bit 63 was corrected May 8 14:35:30 thishost unix: Softerror: Intermittent ECC Memory Error, U0904
Making things worse, in the corrected memory error case, these things were poured out to the console on every occurence. The implicit expectation had been that you'd see very few such errors - not allowing for a bad pin, for instance. So while the errors are correctable and survivable (albeit coming thick and fast) the handlers chose to spam the console making it near useless!panic[CPU1]/thread=2a1000R7dd40: UE Error: AFSR 0x00000000.80200000 AFAR 0x00000000.089cd740 Id 0 Inst 0 MemMod U0501 U0401
Structured Printf Messaging
Taking the time to plan and coordinate error messaging can pay substantial dividends. This was one of the first reponses to the UltraSPARC-II Ecache problem - to analyse the requirements for error handling and diagnosis on that platform and, with understanding of that umbrella-view in place, restructure associated handling and messaging. Here are correctable and uncorrectable memory errors in this improved format:Behind the scenes a new shared error/fault information structure was introduced, and all kernel hardware error handlers were changed to accrue relevant information in an instance of one of those structures. A new and robust (lockless) error dispatching mechanism would carry these structures to logging code which, instead of multiple independent printf (cmn_err) calls passed the error structure to a common function to log the error and passed flags to that function indicating what was relevant for the particular error.[AFT0] Corrected Memory Error on CPU1, errID 0x00000036.629edc25 AFSR 0x00000000.00100000<CE> AFAR 0x00000000.00347dc0 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1002fe20 UDBH Syndrome 0x85 Memory Module 1904 [AFT0] errID 0x00000036.629edc25 Corrected Memory Error on 1904 is Intermittent [AFT0] errID 0x00000036.629edc25 ECC Data Bit 63 was in error and corrected WARNING: [AFT1] Uncorrectable Memory Error on CPU1 Instruction access at TL=0, errID 0x0000004f.818d9280 AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.0685c7a0 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x7815c7a0 UDBH 0x0203<UE> UDBH.ESYND 0x03 UDBL 0x0000 UDBL.ESYND 0x00 UDBH Syndrome 0x3 Memory Module 190x WARNING: [AFT1] errID 0x0000004f.818d9280 Syndrome 0x3 indicates that this may not be a memory module problem
Such messaging (along with many other improvements) significantly enhanced diagnosis and trend analysis. You still had to grok messages files, but at least the messages were in a structured format that was quite readily extracted. Of course there was still much to be done. Programmtic access to the logged data was not readily available, syslog is not the most reliable message transport, syslog logs get aged away, harmless errors (say from an isolated cosmic ray incident) litter messages files and cause unnecessary alarm, etc.
Structured Error Events
Roll on the structured events described in my last blog entry. When we prepare an error report we tag it with a pre-registered name (event class) and fill it with all relevant (and also pre-registered) event payload data. That event payload data is presented to us a name-value list (really a name-type-value list). When it is received by the fault manager is it logged away (independently of any diagnosis modules) in a structured fault log (binary, records are kept in the extended accounting format and the record data is in the form of name-value lists). Here is a corrected memory event dumped from the error log using fmdump(1M):
TIME CLASS
Jun 23 2004 02:59:03.207995640 ereport.cpu.ultraSPARC-IIIplus.ce
nvlist version: 0
class =
ereport.cpu.ultraSPARC-IIIplus.ce
ena = 0xd67a7253e3002c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = cpu
cpuid = 0xb
cpumask = 0x23
serial = 0xd2528092482
(end detector)
afsr = 0x200000187
afar-status = 0x1
afar = 0x23c8f72dc0
pc = 0x10391e4
tl = 0x0
tt = 0x63
privileged = 1
multiple = 0
syndrome-status = 0x1
syndrome = 0x187
error-type = Persistent
l2-cache-ways = 0x2
l2-cache-data = 0xec0106f1a6
0x772dc0 0x0 0xbd000000 0xe8 0xc05b9 0x840400003e9287b6 0xc05b9
0x850400003e92a7b6 0xfeb5 0xc05b9 0x860400003e92c7b6 0xc05b9
0x870400003e92e7b6 0x3dd24 0xec0106f1a6 0x372dc0 0x1 0xbd000000 0xe8
0x0 0x0 0x0 0x4000000001 0x2e 0x0 0x0 0x0 0x0 0x0
dcache-ways = 0x0
icache-ways = 0x0
resource = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = mem
unum = /N0/SB4/P3/B1/D2 J16501
(end resource)
In software (such as a diagnosis module) given a name-value list we can access the associated data either through generic event walkers (use the libnvpair(3LIB) API to iterate over list pairs, get the type for a pair, lookup that name-value pair for the given type) or (for example in the case of a diagnosis module that has subscribed to corrected memory error events and knows, from the registry, what payload data and types to expect):
void
analyze_error(nvlist_t *nvl, ...)
{
uint64_t afar;
uint16_t synd;
char *type;
if (nvlist_lookup_pairs(nvl, 0,
"afar", DATA_TYPE_UINT64, &afar,
"syndrome", DATA_TYPE_UINT16, &synd,
"error-type", DATA_TYPE_STRING, &type,
NULL) != 0)
return (BAD_PAYLOAD);
...
}
Bye-bye hacky grokking scripts - we now have self-describing error event information!
Posted by 66.223.59.220 on February 19, 2007 at 03:51 PM GMT #