In Part 1 of the "FMA Triad: Topology, Telemetry, and Diagnosis Rules" I focused on topology. Part 2 focused on telemetry. This third installment examines diagnosis rules, specifically how the rules must associate and align with the topology and telemetry.

As a reminder, the intention of this series is to illustrate how topology, telemetry, and diagnosis rules fit together, where they must agree, and - as a teaser for the last installment - what problems arise when they don't agree.


Part 3 - Diagnosis Rules

Many of the diagnosis rules are written in the Eversholt language.

An excerpt from the rules that diagnose the I/O root complex on UltraSPARC-T1/T2/T2plus systems:

    event ereport.io.fire.pec.lup@hostbridge/pciexrc{within(5s)};
    ...
    prop upset.io.fire.nodiag@hostbridge/pciexrc (0)->
        ereport.io.fire.jbc.ce_asyn@hostbridge/pciexrc,     /* CPU */
        ereport.io.fire.jbc.jbe@hostbridge/pciexrc,         /* CPU */
        ereport.io.fire.jbc.jte@hostbridge/pciexrc,         /* CPU */
        ereport.io.fire.jbc.ue_asyn@hostbridge/pciexrc,     /* CPU */
        ereport.io.fire.jbc.unsol_intr@hostbridge/pciexrc,  /* CPU */
        ereport.io.fire.jbc.unsol_rd@hostbridge/pciexrc,    /* CPU */
        ereport.io.fire.pec.lin@hostbridge/pciexrc,
        ereport.io.fire.pec.lup@hostbridge/pciexrc,
        ...

Now sadly, there's no reference to a language description I can include here; the language specs aren't public anywhere I can find. But very briefly, the event line declares an ereport event. This one in particular is for a link-up event detected by the root complex (more details on this event are available in the events registry). The prop line describes a propagation from an upset (an uninteresting fault) to the ereport events listed after the arrow. The portion of each rule following the @ symbol describes a component in the system.

For our purposes here, this short explanation will be enough, and it's the component description I will focus on. In the first two parts of this series, we looked at fully qualified FMRIs that describe resources in the system, in both the topology and the telemetry. The example we used in Part 2 was:

    hc:///ioboard=0/hostbridge=0/pciexrc=0
The components in Eversholt can be thought of as relative, uninstantiated FMRIs. Comparing the rules to the fully qualified FMRI, the hostbridge/pciexrc relationship is the same. This is quite powerful - these rules can apply to many different systems with many different topologies. Case in point - the rule excerpt above is used on several SPARC sun4v systems, each with a different topology:
  • T1000: hc:///motherboard=0/hostbridge=0/pciexrc=0
  • T2000: hc:///ioboard=0/hostbridge=0/pciexrc=0
  • T5120: hc:///motherboard=0/chip=0/hostbridge=0/pciexrc=0
  • T5140: hc:///motherboard=0/hostbridge=0/pciexrc=0 and hc:///motherboard=0/hostbridge=1/pciexrc=1
Where hostbridge/pciexrc appears in the topology is immaterial to the rule set. As long as hostbridge/pciexrc can be found somewhere in the topology, the diagnosis rules can be applied. An example of an FMRI that would not work with the rules above is:
    hc:///foo=#/hostbridge=#/foobar=#/pciexrc=#
In this case, the hostbridge/pciexrc relationship is broken. The Eversholt diagnosis engine would not be able to match up the rules with the topology node (although a different set of rules representing hostbridge/foobar/pciexrc would work).
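
To make the matching idea concrete, here is a minimal Python sketch. It is not how the Eversholt diagnosis engine is implemented; it only assumes, based on the examples above, that a rule's component path applies when its component names appear as an unbroken run somewhere in the fully qualified FMRI. The function names parse_hc_fmri and rule_path_matches are hypothetical, invented for this illustration.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI such as 'hc:///a=0/b=1' into [('a', '0'), ('b', '1')]."""
    prefix = "hc:///"
    if not fmri.startswith(prefix):
        raise ValueError(f"not an hc-scheme FMRI: {fmri}")
    pairs = []
    for part in fmri[len(prefix):].split("/"):
        # Each path element is name=instance; the instance may be a number or a placeholder.
        name, _, instance = part.partition("=")
        pairs.append((name, instance))
    return pairs


def rule_path_matches(rule_path, fmri):
    """Return True if the rule's component names form a contiguous run in the FMRI.

    Assumption for illustration only: instance numbers are ignored, and the
    run may start at any depth of the topology path.
    """
    wanted = rule_path.split("/")
    names = [name for name, _ in parse_hc_fmri(fmri)]
    n = len(wanted)
    return any(names[i:i + n] == wanted for i in range(len(names) - n + 1))


if __name__ == "__main__":
    fmris = [
        "hc:///motherboard=0/hostbridge=0/pciexrc=0",          # T1000
        "hc:///ioboard=0/hostbridge=0/pciexrc=0",              # T2000
        "hc:///motherboard=0/chip=0/hostbridge=0/pciexrc=0",   # T5120
        "hc:///foo=#/hostbridge=#/foobar=#/pciexrc=#",         # broken relationship
    ]
    for fmri in fmris:
        print(fmri, rule_path_matches("hostbridge/pciexrc", fmri))
```

Under these assumptions, the three real topologies match hostbridge/pciexrc regardless of what sits above the hostbridge, while the foobar example does not, because foobar separates the two components. A rule path of hostbridge/foobar/pciexrc would match that last FMRI instead.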


A glimpse into the final installment - while the diagnosis rules use relative FMRIs, the detector FMRI in telemetry is fully qualified (we saw this in Part 2). And of course the topology itself uses fully qualified FMRIs to describe resources and FRUs (we saw this in Part 1). When the topology and telemetry aren't aligned, the diagnosis rules don't work and we see "undiagnosable" messages on the console.

Part 4-->

:wq


This blog copyright 2009 by Scott Davenport