The FMA Triad: Topology, Telemetry & Diagnosis Rules - Part 3
As a reminder, the intention of this series is to illustrate how topology, telemetry, and diagnosis rules fit together, where they must agree, and - as a teaser for the last installment - what problems arise when they don't agree.
Part 3 - Diagnosis Rules
Many of the diagnosis rules are written in the Eversholt language.
An excerpt from the rules that diagnose the I/O root complex on UltraSPARC-T1/T2/T2plus systems:
event ereport.io.fire.pec.lup@hostbridge/pciexrc{within(5s)};
...
prop upset.io.fire.nodiag@hostbridge/pciexrc (0)->
    ereport.io.fire.jbc.ce_asyn@hostbridge/pciexrc, /* CPU */
    ereport.io.fire.jbc.jbe@hostbridge/pciexrc, /* CPU */
    ereport.io.fire.jbc.jte@hostbridge/pciexrc, /* CPU */
    ereport.io.fire.jbc.ue_asyn@hostbridge/pciexrc, /* CPU */
    ereport.io.fire.jbc.unsol_intr@hostbridge/pciexrc, /* CPU */
    ereport.io.fire.jbc.unsol_rd@hostbridge/pciexrc, /* CPU */
    ereport.io.fire.pec.lin@hostbridge/pciexrc,
    ereport.io.fire.pec.lup@hostbridge/pciexrc,
    ...
Sadly, there's no language description I can link to here; the language specs aren't public anywhere I can find. But very briefly, the event line declares an ereport event. This one in particular is for a link-up event detected by the root complex (more details on this event are available in the events registry). The prop line describes a propagation between an upset (an uninteresting fault) and a list of ereport events. The portion of each rule following the @ symbol describes a component in the system.
For our purposes here, this short explanation is enough, and it's the component description I will focus on. In the first two parts of this series, we looked at fully qualified FMRIs that describe resources in the system, in both the topology and the telemetry. The example we used in Part 2 was:
hc:///ioboard=0/hostbridge=0/pciexrc=0
The path to the root complex differs by platform:
- T1000: hc:///motherboard=0/hostbridge=0/pciexrc=0
- T2000: hc:///ioboard=0/hostbridge=0/pciexrc=0
- T5120: hc:///motherboard=0/chip=0/hostbridge=0/pciexrc=0
- T5140: hc:///motherboard=0/hostbridge=0/pciexrc=0 and hc:///motherboard=0/hostbridge=1/pciexrc=1
The rules' relative path hostbridge/pciexrc has to cover all of these, which generalize to:
hc:///foo=#/hostbridge=#/foobar=#/pciexrc=#
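One way to picture how a relative path can line up with each of the fully qualified FMRIs above is the Python sketch below. It is an illustration only, not how the Eversholt engine actually resolves paths. It assumes a relative path matches when its component names appear in order in the FMRI, instance numbers are ignored, and the last name is the FMRI's final component. The function names and the PLATFORM_FMRIS table are hypothetical; the FMRI strings are the ones listed above.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI into a list of (name, instance) pairs."""
    prefix = "hc:///"
    if not fmri.startswith(prefix):
        raise ValueError(f"not an hc FMRI: {fmri!r}")
    pairs = []
    for part in fmri[len(prefix):].split("/"):
        name, _, instance = part.partition("=")
        pairs.append((name, int(instance)))
    return pairs


def matches_relative(fmri, relative_path):
    """Assumed matching rule (illustrative only).

    The names in relative_path must occur in the FMRI in the same order,
    other components may sit in between, and the final name must be the
    FMRI's last component. Instance numbers are ignored.
    """
    names = [name for name, _ in parse_hc_fmri(fmri)]
    wanted = relative_path.split("/")
    if names[-1] != wanted[-1]:
        return False
    remaining = iter(names)
    # "w in remaining" consumes the iterator, so order is enforced.
    return all(w in remaining for w in wanted)


# Hypothetical lookup table built from the platform list above.
PLATFORM_FMRIS = {
    "T1000": ["hc:///motherboard=0/hostbridge=0/pciexrc=0"],
    "T2000": ["hc:///ioboard=0/hostbridge=0/pciexrc=0"],
    "T5120": ["hc:///motherboard=0/chip=0/hostbridge=0/pciexrc=0"],
    "T5140": [
        "hc:///motherboard=0/hostbridge=0/pciexrc=0",
        "hc:///motherboard=0/hostbridge=1/pciexrc=1",
    ],
}

if __name__ == "__main__":
    for platform, fmris in PLATFORM_FMRIS.items():
        for fmri in fmris:
            print(platform, fmri, matches_relative(fmri, "hostbridge/pciexrc"))
```

Under these assumptions, the single relative path hostbridge/pciexrc lines up with every platform's root complex, which is the point of writing the rules against relative paths.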
A glimpse into the final installment: while the diagnosis rules use relative FMRIs, the detector FMRI in telemetry is fully qualified (we saw this in Part 2), and the topology itself uses fully qualified FMRIs to describe resources and FRUs (we saw this in Part 1). When the topology and telemetry aren't aligned, the diagnosis rules don't work and we see "undiagnosable" messages on the console.
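To make that failure mode concrete, here is a deliberately simplified Python sketch. It assumes the topology is just a set of fully qualified FMRI strings, and that an ereport is usable only when its detector FMRI appears in that set. The real diagnosis engine does considerably more. The function name and the sample data are hypothetical.

```python
def classify_ereport(detector_fmri, topology_fmris):
    """Label an ereport by whether its detector FMRI exists in the topology.

    Simplifying assumption: exact string membership stands in for the
    lookup the real diagnosis engine performs.
    """
    if detector_fmri in topology_fmris:
        return "diagnosable"
    return "undiagnosable"


# Hypothetical topology snapshot using the T1000-style path.
topology = {"hc:///motherboard=0/hostbridge=0/pciexrc=0"}

# Telemetry whose detector agrees with the topology.
print(classify_ereport("hc:///motherboard=0/hostbridge=0/pciexrc=0", topology))

# Telemetry that names the T2000-style path instead: no match in this topology.
print(classify_ereport("hc:///ioboard=0/hostbridge=0/pciexrc=0", topology))
```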
:wq