Over the course of the last several months, I've described how topology, telemetry, and diagnosis rules work together within the Fault Management Architecture (FMA), something I've dubbed the FMA Triad. It's high time I finished off this little mini-series with an example of what happens when the members of the Triad don't play nicely together.

Part 4 - When Things Go Wrong

This is a real-world example. I briefly mentioned at the tail end of Part 3 that the diagnosis rules often use relative FMRIs, but the telemetry and the topology must use fully qualified FMRIs. When the telemetry and the topology disagree on an FMRI, the diagnosis engine cannot determine which system resource is involved and is unable to understand the incoming telemetry.
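To make the relative versus fully qualified distinction concrete, here's a rough sketch in Python. This is not how the eft diagnosis engine is implemented; the function names are mine and exist only for illustration. The point is simply that a rule written against a relative path such as hostbridge/pciexrc only cares about the tail of the path, while telemetry and topology carry the whole thing.

```python
def component_names(fmri):
    """Return the component names of an hc-scheme FMRI, without instance numbers."""
    prefix = "hc:///"
    if not fmri.startswith(prefix):
        raise ValueError(f"not a fully qualified hc FMRI: {fmri}")
    return [part.split("=", 1)[0] for part in fmri[len(prefix):].split("/")]


def matches_relative(fmri, relative_path):
    """True if the trailing components of the FMRI match the relative path."""
    wanted = relative_path.split("/")
    return component_names(fmri)[-len(wanted):] == wanted


# A fully qualified FMRI and the relative path a rule might be written against.
full = "hc:///motherboard=0/hostbridge=0/pciexrc=0"
print(matches_relative(full, "hostbridge/pciexrc"))  # True
```

Note that the match says nothing about whether the resource actually exists on the system. That part is the topology's job.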

Customers upgraded their T2000 systems to Solaris 10 Update 4 and when starting the OS were greeted with this:

SUNW-MSG-ID: SUNOS-8000-1L, TYPE: Defect, VER: 1, SEVERITY: Minor
EVENT-TIME: Feb 08 15:30:39 CST 2008
PLATFORM: SUNW,Netra-T2000, CSN: -, HOSTNAME: FOO
SOURCE: eft, REV: 1.16
EVENT-ID: 1be16b2d-158e-e73b-f097-b744c2eb8cd3
DESC: The EFT Diagnosis Engine encountered telemetry for which it is unable to produce a diagnosis. Refer to http://sun.com/msg/SUNOS-8000-1L for more information.
AUTO-RESPONSE: Error reports from the component will be logged for examination by Sun.
IMPACT: Automated diagnosis and response for these events will not occur.
REC-ACTION: Run pkgchk -n SUNWfmd to ensure that fault management software is installed properly. Contact Sun for support.

Ouch. Not good. Not only is there an apparent hardware problem, FMA can't make heads or tails of it. First thing to do is find out what events led to this "diagnosis". We can use fmdump for that:

# fmdump -e -u 1be16b2d-158e-e73b-f097-b744c2eb8cd3
TIME                 CLASS
Feb 08 15:27:30.4960 ereport.io.fire.pec.lup

Ok. According to the event registry, "lup" is a link up error report. Normally, the diagnosis engines ignore these errors - and on boot we flat out expect them (the links have to come up). In fact, I tipped my hand in Part 3 by showing you the rules for a "lup" error. Looking deeper at the telemetry:

# fmdump -eV -u 1be16b2d-158e-e73b-f097-b744c2eb8cd3
TIME                           CLASS
Feb 08 2008 15:27:30.496065920 ereport.io.fire.pec.lup
nvlist version: 0
        class = ereport.io.fire.pec.lup
        ena = 0xa3e9ab80002
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-root =
                hc-list-sz = 3
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = ioboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = hostbridge
                        hc-id = 0
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = pciexrc
                        hc-id = 0
                (end hc-list[2])
        (end detector)
        primary = 1
        tlu-oeele = 0xffffff
        tlu-oeie = 0xffffff00ffffff
        tlu-oeis = 0x100
        tlu-oeess = 0x0
        __ttl = 0x1
        __tod = 0x47ace562 0x1d915d80

The telemetry's detector describes an FMRI of hc:///ioboard=0/hostbridge=0/pciexrc=0, and the diagnosis engine can't make sense of it. The rules, however, only look for errors against the relative path hostbridge/pciexrc, which this FMRI satisfies, so the rules themselves are fine. The problem must be that this FMRI can't be located in the topology. Turning to fmtopo:

# /usr/lib/fm/fmd/fmtopo
...
hc:///motherboard=0/hostbridge=0/pciexrc=0
...

Eureka! There is in fact a disconnect between the telemetry and the topology. This explains the undiagnosable errors from FMA.
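The comparison I just did by eye can be sketched in a few lines of Python. Treat this as an illustration under my own assumptions, not as what fmd does internally: the helper names are made up, the detector is hand-copied from the hc-list in the ereport above, and the "topology" is just the one fmtopo line shown above.

```python
def fmri_from_hc_list(hc_list):
    """Build an hc-scheme FMRI string from (hc-name, hc-id) pairs."""
    return "hc:///" + "/".join(f"{name}={inst}" for name, inst in hc_list)


def locate(detector_hc_list, topology_fmris):
    """Return the detector's FMRI if the topology contains it, else None."""
    fmri = fmri_from_hc_list(detector_hc_list)
    return fmri if fmri in topology_fmris else None


# Detector from the ereport: ioboard=0 / hostbridge=0 / pciexrc=0.
detector = [("ioboard", 0), ("hostbridge", 0), ("pciexrc", 0)]

# The single fmtopo line quoted in this post; the rest of the output is elided.
topology = {"hc:///motherboard=0/hostbridge=0/pciexrc=0"}

print(fmri_from_hc_list(detector))  # hc:///ioboard=0/hostbridge=0/pciexrc=0
print(locate(detector, topology))   # None
```

The lookup comes back empty because the first component differs (ioboard versus motherboard), even though everything after it lines up.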

Now, to the question: what changed to cause this problem? On T2000 class systems, telemetry for root complex errors (like the "lup" error) is generated in the Service Processor. Topology is constructed by enumerators in Solaris. Since the only change made to the system was an upgrade of Solaris, the telemetry side was untouched, and something must have gone wrong with the topology in Solaris 10 Update 4.

The mechanisms for generating a topology changed significantly between S10U3 and S10U4. In S10U3, there were .topo files, and the one for Netra,T2000 described an ioboard/hostbridge/pciexrc arrangement. When the newer XML map mechanism came with S10U4 (the one I described in Part 1 of this series), a platform-specific topology map for Netra,T2000 was overlooked. As Part 1 details, when there is no platform-specific topology map, FMD reverts to the architecture-specific topology map - sun4v in this case. The sun4v XML map in S10U4 describes a motherboard/hostbridge/pciexrc arrangement.
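Here is the fallback behavior as a small Python sketch. Everything in it is a stand-in: the function name is hypothetical, the dictionaries take the place of the XML map files, and the strings are just the arrangements described above rather than real map contents. The platform name is the one from the PLATFORM line of the fault message.

```python
def pick_topology_map(platform, arch, available_maps):
    """Prefer a platform-specific map; otherwise fall back to the architecture map."""
    for key in (platform, arch):
        if key in available_maps:
            return available_maps[key]
    raise LookupError(f"no topology map for {platform} or {arch}")


# Stand-in for S10U4 as shipped: only the sun4v architecture map exists.
s10u4_maps = {"sun4v": "motherboard/hostbridge/pciexrc"}

# Stand-in for the fix: a platform-specific map is added alongside it.
fixed_maps = {**s10u4_maps, "SUNW,Netra-T2000": "ioboard/hostbridge/pciexrc"}

print(pick_topology_map("SUNW,Netra-T2000", "sun4v", s10u4_maps))
# motherboard/hostbridge/pciexrc
print(pick_topology_map("SUNW,Netra-T2000", "sun4v", fixed_maps))
# ioboard/hostbridge/pciexrc
```

With only the architecture map present, the platform silently inherits an arrangement that doesn't match what its Service Processor reports.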

The solution was to provide a platform-specific topo map for the Netra,T2000 system. If memory serves, a similar change for the Netra,CP3060 was also needed.

I hope you found this to be a good example that ties this series up. If you've enjoyed this half as much as I have, then I've enjoyed it twice as much as you. :)

:wq

This blog copyright 2009 by Scott Davenport