Wednesday Aug 06, 2008

Solaris Nevada build 96 is an important milestone build for the Sensor Abstraction Layer project for FMA, as it introduces the software infrastructure (the plumbing) on which the functionality described in the original design doc[1] will be built.

This was a collaborative effort between the Solaris FMA team and the Fishworks team (specifically Eric Schrock) and involved over 7000 lines of change to over 60 source files.  Below are the two putback notifications that comprised the combined work:

First my putback done on July 31st:

Comment:
PSARC 2008/428 Extending libnvpair for type double
PSARC 2008/463 Extending HC FMRI scheme to represent sensors/indicators
6579615 fmtopo -e has lots of memory leaks
6635159 libtopo: extend hc scheme to allow for representing sensors and indicators in the topology
6692392 fmtopo -x doesn't handle property methods properly
6718703 Need to extend libnvpair to support type double
6718712 libtopo: Need to implement facility provider module for IPMI
6722594 libtopo: the topo_prop_set_* interfaces need to learn to play well with propmethods
6727190 libtopo: add support for node properties of type double
6727459 libipmi: need interface to convert raw sensor readings to unit-based values
6727470 libipmi: need convenience routine to convert sensor unit defines to string
6729595 libtopo: add <set> case in fan and psu xml maps for SUN-FIRE-X4600-M2
6732318 fmd: small leak in sysevent modelling code

Files:
update: usr/src/cmd/fm/fmd/common/fmd_sysevent.c
update: usr/src/cmd/fm/fmtopo/common/fmtopo.c
update: usr/src/common/nvpair/nvpair.c
update: usr/src/lib/fm/topo/libtopo/Makefile.com
update: usr/src/lib/fm/topo/libtopo/common/hc.c
update: usr/src/lib/fm/topo/libtopo/common/libtopo.h
update: usr/src/lib/fm/topo/libtopo/common/mapfile-vers
update: usr/src/lib/fm/topo/libtopo/common/topo_2xml.c
update: usr/src/lib/fm/topo/libtopo/common/topo_error.h
update: usr/src/lib/fm/topo/libtopo/common/topo_fmri.c
update: usr/src/lib/fm/topo/libtopo/common/topo_method.c
update: usr/src/lib/fm/topo/libtopo/common/topo_method.h
update: usr/src/lib/fm/topo/libtopo/common/topo_mod.h
update: usr/src/lib/fm/topo/libtopo/common/topo_node.c
update: usr/src/lib/fm/topo/libtopo/common/topo_parse.h
update: usr/src/lib/fm/topo/libtopo/common/topo_prop.c
update: usr/src/lib/fm/topo/libtopo/common/topo_subr.c
update: usr/src/lib/fm/topo/libtopo/common/topo_subr.h
update: usr/src/lib/fm/topo/libtopo/common/topo_xml.c
update: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4500/Sun-Fire-X4500-hc-topology.xmlgen
update: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4540/Sun-Fire-X4540-hc-topology.xmlgen
update: usr/src/lib/fm/topo/maps/common/topology.dtd.1
update: usr/src/lib/fm/topo/maps/i86pc/chip-hc-topology.xml
update: usr/src/lib/fm/topo/maps/i86pc/fan-hc-topology.xmlgen
update: usr/src/lib/fm/topo/maps/i86pc/i86pc-hc-topology.xml
update: usr/src/lib/fm/topo/maps/i86pc/psu-hc-topology.xml
update: usr/src/lib/fm/topo/modules/common/Makefile
update: usr/src/lib/libipmi/Makefile.com
update: usr/src/lib/libipmi/common/ipmi_impl.h
update: usr/src/lib/libipmi/common/ipmi_sdr.c
update: usr/src/lib/libipmi/common/ipmi_util.c
update: usr/src/lib/libipmi/common/libipmi.h
update: usr/src/lib/libipmi/common/mapfile-vers
update: usr/src/lib/libipmi/common/mktables.sh
update: usr/src/lib/libnvpair/libnvpair.c
update: usr/src/lib/libnvpair/mapfile-vers
update: usr/src/pkgdefs/SUNWfmd/prototype_com
update: usr/src/uts/common/sys/fm/protocol.h
update: usr/src/uts/common/sys/nvpair.h
create: usr/src/lib/fm/topo/modules/common/fac_prov_ipmi/Makefile
create: usr/src/lib/fm/topo/modules/common/fac_prov_ipmi/fac_prov_ipmi.c

Examined files: 41

Contents Summary:
       2   create
      39   update

Names Summary:
       2   update parent's name history
       2   update children's name history

And now Eric Schrock's putback, done on August 1st:

Comment:
PSARC 2008/485 SES Sensors and Enumerator
6720433 SES enumerator should provide controller revision information
6720435 SES enumerator should prefer description over class-description
6720452 SES enumerator should support indicators and sensors
6722807 SES enumerator should work with internal enclosures
6722809 want a way to identify enclosures as internal
6722811 SES enumerator should prefer elements with known status
6723603 x86 xmlgen topo scripts should make use of propmap
6732875 typo in fan-hc-topology.xmlgen
6732879 broken logic in pad_process()

Files:
update: usr/src/lib/fm/topo/libtopo/common/topo_parse.h
update: usr/src/lib/fm/topo/libtopo/common/topo_xml.c
update: usr/src/lib/fm/topo/maps/Makefile.map
update: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4200-M2/Makefile
update: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4200-Server/Makefile
update: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4500/Makefile
update: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4540/Makefile
update: usr/src/lib/fm/topo/maps/common/topology.dtd.1
update: usr/src/lib/fm/topo/maps/i86pc/fan-hc-topology.xmlgen
update: usr/src/lib/fm/topo/maps/i86pc/i86pc-hc-topology.xml
update: usr/src/lib/fm/topo/modules/common/ses/Makefile
update: usr/src/lib/fm/topo/modules/common/ses/ses.c
update: usr/src/lib/scsi/plugins/ses/Makefile
update: usr/src/lib/scsi/plugins/ses/libses/common/libses.h
update: usr/src/pkgdefs/SUNWfmd/prototype_i386
update: usr/src/pkgdefs/SUNWscsip/prototype_com
update: usr/src/pkgdefs/SUNWscsip/prototype_i386
update: usr/src/pkgdefs/SUNWscsip/prototype_sparc
update: usr/src/tools/scripts/bfu.sh
update: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4200-M2/Sun-Fire-X4200-M2-disk-hc-topology.xmlgen
rename from: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4200-M2/Sun-Fire-X4200-M2-hc-topology.xmlgen
         to: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4200-M2/Sun-Fire-X4200-M2-disk-hc-topology.xmlgen
update: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4200-Server/Sun-Fire-X4200-Server-disk-hc-topology.xmlgen
rename from: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4200-Server/Sun-Fire-X4200-Server-hc-topology.xmlgen
         to: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4200-Server/Sun-Fire-X4200-Server-disk-hc-topology.xmlgen
update: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4500/Sun-Fire-X4500-disk-hc-topology.xmlgen
rename from: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4500/Sun-Fire-X4500-hc-topology.xmlgen
         to: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4500/Sun-Fire-X4500-disk-hc-topology.xmlgen
update: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4540/Sun-Fire-X4540-disk-hc-topology.xmlgen
rename from: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4540/Sun-Fire-X4540-hc-topology.xmlgen
         to: usr/src/lib/fm/topo/maps/SUNW,Sun-Fire-X4540/Sun-Fire-X4540-disk-hc-topology.xmlgen
create: usr/src/lib/fm/topo/maps/common/xmlgen-header.xml
create: usr/src/lib/fm/topo/modules/common/ses/ses.h
create: usr/src/lib/fm/topo/modules/common/ses/ses_facility.c
create: usr/src/lib/scsi/plugins/ses/LSILOGIC-SASX28-A.0/Makefile
create: usr/src/lib/scsi/plugins/ses/LSILOGIC-SASX28-A.0/Makefile.com
create: usr/src/lib/scsi/plugins/ses/LSILOGIC-SASX28-A.0/amd64/Makefile
create: usr/src/lib/scsi/plugins/ses/LSILOGIC-SASX28-A.0/common/lsilogic.c
create: usr/src/lib/scsi/plugins/ses/LSILOGIC-SASX28-A.0/i386/Makefile
create: usr/src/lib/scsi/plugins/ses/LSILOGIC-SASX28-A.0/sparc/Makefile
create: usr/src/lib/scsi/plugins/ses/LSILOGIC-SASX28-A.0/sparcv9/Makefile

Examined files: 33

Contents Summary:
      10   create
      23   update

Names Summary:
       4   renamed
      10   update parent's name history
      14   update children's name history

The Sensor Abstraction Layer project page has been updated with links to some new documentation. Below are some more details on three of the key new FMA infrastructure changes: hc FMRI scheme extensions, facility nodes and facility providers.



First some background...

As touched on in my previous blog entry, the Solaris Fault Manager maintains a snapshot of the hardware topology in a tree-like structure that includes a node for all hardware resources and FRUs that are managed/monitored by FMA.  The interfaces for generating a topology snapshot, walking the resulting tree and for manipulating the individual nodes in the tree are provided by libtopo and documented in Chapter 9 of the Fault Manager Programmer's Reference Guide.

The Sensor Abstraction Layer for FMA extends libtopo so that sensors and indicators can also be represented in our topology, in a fashion that allows the association of sensors and indicators with hardware resources to be programmatically determined.

Additionally it introduces the concept of a facility provider module which provides an abstraction layer between libtopo and the lower-level interfaces that are used to control a given sensor or indicator.

Together this provides a set of common infrastructure to enable future FMA projects to manipulate sensors and indicators as part of Fault Management activities.


hc FMRI scheme extensions

The existing hc-scheme allows for a hierarchical representation of hardware resources, according to their physical connection properties. However, this is not a very useful way to represent sensors and indicators in the topology because it does not allow consumers to programmatically determine the association of sensors/indicators with the hardware resource that they're monitoring.

In Solaris Nevada build 96, we've extended the hc FMRI scheme to allow for this association to be represented using a new type of node in the topology: a facility node.

A facility node is a special leaf node in the hc-scheme topology that represents either a sensor or an indicator. A fault managed resource may have one or more child facility nodes that represent sensors or indicators that are associated with it. The hc-scheme was extended as shown below to allow for an additional facility node member:

Name        Data Type   Description
scheme      uint32      scheme used for FMRI
version     uint32      version of scheme specification
authority   nvlist      optional authority of FMRI
payload
resource path
facility    nvlist      facility component of FMRI

The facility nvlist will have two members:

Name            Data Type   Description
facility-type   string      type of facility node: "sensor" or "indicator"
facility-name   string      name of the facility

The string representation of an hc scheme FMRI will also be extended, as shown below:

<scheme>://[authority]/<resource-path>[?<fac-type>=<fac-name>]

where fac-type can be either "sensor" or "indicator" and fac-name is the name of the facility.

For example:

hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E4-B0:server-id=lollipop/chassis=0/fanmodule=0/fan=0?sensor=speed
hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E4-B0:server-id=lollipop/chassis=0/bay=47?indicator=ok2rm
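
To make the extended string form concrete, here is a minimal Python sketch that splits an hc FMRI string into authority, resource path and the optional facility component. It is not libtopo code: the function name parse_hc_fmri and the dictionary layout are invented for illustration, and it only handles the simple cases shown in the examples above.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI string into its parts (illustrative only)."""
    scheme, sep, rest = fmri.partition("://")
    if not sep or scheme != "hc":
        raise ValueError("not an hc-scheme FMRI: %r" % fmri)

    # Optional facility component: ?<fac-type>=<fac-name>
    rest, _, fac = rest.partition("?")
    facility = None
    if fac:
        fac_type, eq, fac_name = fac.partition("=")
        if not eq or fac_type not in ("sensor", "indicator") or not fac_name:
            raise ValueError("bad facility component: %r" % fac)
        facility = {"facility-type": fac_type, "facility-name": fac_name}

    # Everything before the first '/' is the authority; the rest is the path.
    auth_str, _, path_str = rest.partition("/")
    authority = dict(f.split("=", 1) for f in auth_str.split(":") if f)
    path = [tuple(p.split("=", 1)) for p in path_str.split("/") if p]

    return {
        "scheme": scheme,
        "authority": authority,
        "resource-path": path,
        "facility": facility,
    }
```

Calling it on the first example FMRI would give a facility entry with facility-type "sensor" and facility-name "speed", mirroring the two-member facility nvlist described above.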


Anatomy of a Facility Node

Facility nodes are required to have the following properties specified in a "facility" property group:

Property Name   Threshold Sensors   Discrete Sensors   Indicators
type            Yes                 Yes                Yes
sensor-class    Yes                 Yes                No
reading         Yes                 No                 No
state           Yes                 Yes                No
units           Yes                 No                 No
mode            No                  No                 Yes

These properties allow for the classification of the facility node to be programmatically determined and are used by the new topo_fmri_facility() interface to check for the existence of sensors or indicators of a given type.
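
As a rough illustration of how a consumer could use these requirements, the Python sketch below encodes the table as a lookup and reports which required properties a facility node is missing. The names REQUIRED_PROPS, classify_facility and missing_props are hypothetical and are not part of libtopo.

```python
# Required "facility" group properties per facility class, per the table above.
REQUIRED_PROPS = {
    "threshold": {"type", "sensor-class", "reading", "state", "units"},
    "discrete": {"type", "sensor-class", "state"},
    "indicator": {"type", "mode"},
}


def classify_facility(fac_type, props):
    """Map a facility node to 'threshold', 'discrete' or 'indicator'."""
    if fac_type == "indicator":
        return "indicator"
    if fac_type == "sensor":
        sensor_class = props.get("sensor-class")
        if sensor_class in ("threshold", "discrete"):
            return sensor_class
        raise ValueError("sensor node needs a valid sensor-class property")
    raise ValueError("unknown facility type: %r" % fac_type)


def missing_props(fac_type, props):
    """Return the sorted names of required properties that are absent."""
    required = REQUIRED_PROPS[classify_facility(fac_type, props)]
    return sorted(required - set(props))
```

Extra properties (such as entity_ref in the fmtopo output later in this entry) are simply ignored by this check.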

'sensor-class' property

All facility nodes of type "sensor" must specify a "sensor-class" property that is set to one of the following values.

Value         Data Type   Define
"threshold"   string      TOPO_SENSOR_CLASS_THRESHOLD
"discrete"    string      TOPO_SENSOR_CLASS_DISCRETE

'units' property

All 'sensor' facility nodes with a "sensor-class" property value of TOPO_SENSOR_CLASS_THRESHOLD are required to specify a "units" property of type uint32. The value should be set to one of the predefined unit types specified in libtopo.h (see the TOPO_SENSOR_UNIT_* defines).

'type' property

All 'sensor' and 'indicator' facility nodes must provide a "type" property of type uint32. The value should be set to one of the predefined sensor or indicator types specified in libtopo.h (see the TOPO_SENSOR_TYPE_* and TOPO_LED_TYPE_* defines).


Facility Providers

A facility provider is a logical collection of node and property methods that provide an abstraction layer between libtopo and the underlying lower-level interfaces that are used to actually manipulate the sensors and indicators. This allows library consumers (namely fmd) to access sensor readings and manipulate indicators via standard libtopo interfaces (e.g. topo_prop_{get|set}_{type}). Nevada build 96 includes the implementation of facility provider modules for IPMI[2] and SES[3], which will provide broad coverage across Sun's x64 server platforms.  The diagram below shows how the provider modules fit into the overall software structure:

[Figure: Facility Provider diagram]


Facility providers are implemented as simplified libtopo plugin modules (similar to enumerator modules). However, in the implementation of their tmo_enum entry point, a facility provider will simply register its methods on the node that is passed in.

At a minimum, a facility provider must implement the following property methods:

'reading' property method

Sensor nodes with a "sensor-class" property value of TOPO_SENSOR_CLASS_THRESHOLD must provide a property method for the "reading" property of type TOPO_TYPE_DOUBLE that should return the current analog reading from the sensor.

'state' property method

All sensor nodes (threshold and discrete) should provide a property method for a "state" property of type uint32. The property value should be set to one of the predefined sensor-type specific discrete states defined in libtopo.h (see the TOPO_SENSOR_STATE_* defines).


'mode' property method

For 'indicator' facility nodes, the facility provider must implement a property method to get/set the LED mode.  The mode property can be set to one of the following two values: 0 (OFF) or 1 (ON).

Facility providers can also optionally implement a node method (fac_enum) that can be invoked on a given hardware resource node to automatically discover and enumerate facility nodes that should be bound (associated) with it.
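
The provider idea can be modeled in a few lines of Python. This is only a toy: TopoNode and FakeLedProvider are invented stand-ins, the backing store is an in-memory dict rather than IPMI or SES, and none of the method names correspond to real libtopo entry points. What it does show is the shape described above: the provider's enum step just registers property methods on the node it is handed, and consumers then go through ordinary get/set calls.

```python
class TopoNode:
    """Toy stand-in for a libtopo node; not the real libtopo API."""

    def __init__(self, name):
        self.name = name
        self._static = {}    # plain property values
        self._methods = {}   # property name -> (getter, setter or None)

    def register_propmethod(self, prop, getter, setter=None):
        self._methods[prop] = (getter, setter)

    def prop_get(self, prop):
        if prop in self._methods:
            return self._methods[prop][0](self)
        return self._static[prop]

    def prop_set(self, prop, value):
        if prop in self._methods:
            setter = self._methods[prop][1]
            if setter is None:
                raise PermissionError("property %r is read-only" % prop)
            setter(self, value)
        else:
            self._static[prop] = value


class FakeLedProvider:
    """Hypothetical indicator provider backed by a dict instead of hardware."""

    def __init__(self):
        self._hw = {}   # node name -> LED mode (0 = OFF, 1 = ON)

    def enum(self, node):
        # Mirrors the description above: simply register methods on the node.
        self._hw.setdefault(node.name, 0)
        node.register_propmethod("mode", self._get_mode, self._set_mode)

    def _get_mode(self, node):
        return self._hw[node.name]

    def _set_mode(self, node, value):
        if value not in (0, 1):
            raise ValueError("mode must be 0 (OFF) or 1 (ON)")
        self._hw[node.name] = value
```

A consumer would create a node, pass it to the provider's enum(), and then call node.prop_set("mode", 1) without knowing what sits underneath.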

Below are some example excerpts of fmtopo[4] output for a threshold sensor, a discrete sensor and an indicator node. These examples were taken from a Sun Fire X4500.

hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E4-B0:server-id=lollipop/motherboard=0/chip=0?sensor=proc.p0.t_core
  group: facility                       version: 1   stability: Private/Private
    entity_ref        string    proc.p0.t_core
    sensor-class      string    threshold
    type              uint32    0x101 (THRESHOLD_STATE)
    state             uint32    0x0 (0x00)
    reading           double    49.000000
    units             uint32    0x1 (DEGREES_C)

hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E4-B0:server-id=lollipop/chassis=0/psu=0?sensor=ps0.prsnt
  group: facility                       version: 1   stability: Private/Private
    entity_ref        string    ps0.prsnt
    sensor-class      string    discrete
    type              uint32    0x108 (GENERIC_PRESENCE)
    state             uint32    0x2 (ASSERTED)

hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E4-B0:server-id=lollipop/chassis=0/bay=40?indicator=present
  group: facility                       version: 1   stability: Private/Private
    entity_ref        string    hdd40.state
    mode              uint32    0x1 (ON)
    type              uint32    0x3 (PRESENT)
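
For anyone who wants to post-process output like the excerpts above, here is a small Python sketch that pulls the property lines into a dictionary. It assumes exactly the three-column layout shown (name, type, value) and only the three types that appear here; parse_fmtopo_props is a made-up helper, not something fmtopo provides.

```python
KNOWN_TYPES = {"string", "uint32", "double"}


def parse_fmtopo_props(text):
    """Collect 'name type value' property lines from fmtopo-style output."""
    props = {}
    for line in text.splitlines():
        parts = line.split(None, 2)
        # Skip FMRI lines, "group:" headers and anything else we don't know.
        if len(parts) != 3 or parts[1] not in KNOWN_TYPES:
            continue
        name, ptype, value = parts
        value = value.strip()
        if ptype == "uint32":
            # Keep the numeric value; drop the symbolic name in parentheses.
            props[name] = int(value.split()[0], 0)
        elif ptype == "double":
            props[name] = float(value)
        else:
            props[name] = value
    return props
```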

Well - that's it for now.  In my next blog entry I'll give some examples that demonstrate how easy it is to use the new interfaces in libtopo to get sensor readings or flip LEDs on or off.


[1] The hc FMRI scheme extensions and the concept of facility nodes are based on Cynthia McGuire's original design document for the Sensor Abstraction Layer, so readers may find it beneficial to review section 2.3 of that document to gain additional background.

[2] IPMI is an acronym for Intelligent Platform Management Interface, which is an Intel specification for doing out-of-band management of computers.  Over the last few years it has become an industry standard and thus most x86 server platforms (including those made by Sun) support IPMI.  Through IPMI we can get access to platform sensors and indicators.

[3] SES is an acronym for SCSI Enclosure Services which is a protocol for accessing diagnostic services for SCSI storage enclosures including things like temperature and voltage sensors.

[4] fmtopo is a command-line utility that developers can use to take a snapshot and dump the resulting topology.  Its usage is documented in Chapter 12 of the Fault Manager Programmer's Reference Guide.

Thursday Apr 17, 2008


I've pretty much had my head down working on various FMA bug fixes and enhancements for the last few months.  Now that I've finally gotten them putback, I have some time to take a (short) breather and so I thought I'd blog about a few of the things I've been working on here.  Here's the first installment:


The Solaris Fault Manager maintains a snapshot of the hardware topology in a tree-like structure that includes a node for all hardware resources and FRUs that are managed/monitored by FMA.  The interfaces for generating a topology snapshot, walking the resulting tree and for manipulating the individual nodes in the tree are provided by libtopo and documented in Chapter 9 of the Fault Manager Programmer's Reference Guide.  Scott Davenport also has some nice overview material here.  The nodes in the tree are represented by a unique identifier called an FMRI (fault managed resource identifier).  The format of the FMRI for hardware resources is the following:

hc://[<authority>],[hardware-id]/[<hc-root.>][<hardware-component-path>]

where hardware-id would be:
 [:serial=<serial-number>][:part=<part-number>][:revision=<revision-number>]
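
As a quick illustration of the optional hardware-id fields, the Python sketch below assembles that portion of the string from whichever values are known. The helper name hardware_id is invented, and the sketch does no escaping or validation of the values.

```python
def hardware_id(serial=None, part=None, revision=None):
    """Build the optional hardware-id portion of an hc FMRI string."""
    fields = (("serial", serial), ("part", part), ("revision", revision))
    # Each field is optional; emit only the ones we actually have.
    return "".join(":%s=%s" % (key, val) for key, val in fields if val is not None)
```

With only a serial number known, this yields a single ":serial=..." field; with nothing known it yields an empty string, which matches the "historically not filled in" case on x86.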


Among other things, the optional "hardware-id" fields (in particular the serial) can be used by the fault manager to detect when a FRU has been replaced by service personnel.  In the absence of hardware identity information, administrators must manually inform the fault manager after they've replaced a faulty component via the "repair" subcommand to fmadm(1M).  Otherwise, the fault manager will continue to report the component as faulty and attempt to isolate it.  On our UltraSPARC systems much of this information is provided by the OpenBoot Platform firmware.  On x86 we don't have the benefit of sitting on top of a common firmware layer that we control.  As a result, we historically haven't filled in the hardware-id fields because we haven't found a generalized, reliable mechanism for fetching this data.  However, on our newer AMD-based server platforms[1], some FRU information is maintained in non-volatile storage by the service processor and is accessible using a common protocol: IPMI.

In Solaris Nevada, build 87 we've added the capability to leverage IPMI to find and attach serial numbers to the dimm nodes in our topology on our AMD-based server platforms and we've extended the fault manager to check for this serial property and use it, if found, to detect when a faulted DIMM has been replaced.  For people who like ugly details :), here's a brief rundown of the code changes:

It all starts with a new topo node property method that is registered to the dimm nodes in our topology on our AMD-based server platforms.  The XML for this looks like the example below and the complete XML changes are in usr/src/lib/fm/topo/maps/i86pc/chip-hc-topology.xml

<propmethod name='get_dimm_serial' version='0' propname='serial' proptype='string' >
    <argval name='format' type='string' value='p%d.d%d.fru' />
    <argval name='offset' type='uint32' value='0' />
</propmethod>

This property method uses interfaces from libipmi to communicate with the service processor to look up the FRU locator record for the associated DIMM.  The FRU locator record provides the offset into the FRU inventory on the service processor where we can fetch information such as manufacturer name and the serial number.  Using the manufacturer name and serial number we synthesize a Sun serial ID[2] and attach it as a property to the dimm node.  This all happens in usr/src/lib/fm/topo/modules/i86pc/chip/chip_serial.c
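
To show how the propmethod arguments might be consumed, here is a Python sketch that reads the XML fragment and expands the format string into a FRU entity name. Two things here are guesses, not taken from chip_serial.c: that the two %d slots are the chip and DIMM instance numbers, and that 'offset' is added to the DIMM number. The helper names are invented as well.

```python
import xml.etree.ElementTree as ET

PROPMETHOD_XML = """
<propmethod name='get_dimm_serial' version='0' propname='serial' proptype='string' >
    <argval name='format' type='string' value='p%d.d%d.fru' />
    <argval name='offset' type='uint32' value='0' />
</propmethod>
"""


def propmethod_args(xml_text):
    """Return (method name, property name, argument dict) for a propmethod."""
    elem = ET.fromstring(xml_text)
    args = {}
    for argval in elem.findall("argval"):
        value = argval.get("value")
        if argval.get("type") == "uint32":
            value = int(value)
        args[argval.get("name")] = value
    return elem.get("name"), elem.get("propname"), args


def fru_entity_name(args, chip, dimm):
    # ASSUMPTION: first %d is the chip instance, second is the DIMM
    # instance plus 'offset'. The real module may compute this differently.
    return args["format"] % (chip, dimm + args["offset"])
```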

Next we've modified the Fault Manager to look for the existence of the serial property method, and if found, invoke it and attach the serial to the FMRIs that are included in the payload of a fault event.  See fmd_nvl_create_fault() in usr/src/cmd/fm/fmd/common/fmd_api.c

The fault manager maintains a persistent cache of resources that have been the subject of a diagnosis (see Chapter 6 of the Fault Manager Programmer's Reference Guide).  The fault manager uses this cache to keep track of what's faulty, which enables it to re-report and re-isolate a faulted component after a system restart.  However, before doing this, the fault manager first attempts to determine if the faulted component is still present in the system.  (No need to report or isolate something that's been removed.)  The code for determining whether a faulted resource is still present is a bit hard to follow and in some cases varies based on the type of component and whether we're on SPARC or x86, but the basic idea is to determine what scheme the FMRI of the faulted resource is in and then call the appropriate is_present method, which should return TRUE if the resource is still present and FALSE otherwise.  For the DIMM case on our AMD-based platforms, the code flow looks like this:

usr/src/cmd/fm/fmd/common/fmd_asru.c::fmd_asru_hash_recreate()
|
|-> usr/src/cmd/fm/fmd/common/fmd_fmri.c::fmd_fmri_present()
    |
    |-> usr/src/cmd/fm/schemes/mem/mem.c::fmd_fmri_present()
        |
        |-> usr/src/lib/fm/topo/libtopo/common/topo_fmri.c::topo_fmri_present()
            |
            |-> usr/src/lib/fm/topo/libtopo/common/hc.c::hc_is_present()
                |
                |-> usr/src/lib/fm/topo/modules/i86pc/chip/chip_subr.c::rank_is_present()


The rank_is_present method in the chip enumerator module compares the serial numbers and returns FALSE if the serial number of the faulted resource doesn't match the current serial number in the topology snapshot.  If any errors occur along the path above, thus preventing us from determining if the resource is still present, we err on the side of caution and return TRUE.
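
The decision rule is simple enough to sketch in Python. This is not the code in chip_subr.c: the function name and the callback argument are invented, and treating a missing serial as "can't tell, so assume present" is my reading of the err-on-the-side-of-caution rule rather than something stated explicitly.

```python
def serial_is_present(faulted_serial, read_current_serial):
    """Return False only when we are sure the faulted part has been replaced.

    read_current_serial is a hypothetical callable that fetches the serial
    from the current topology snapshot and may raise on any failure.
    """
    try:
        current = read_current_serial()
    except Exception:
        # Any error along the way: err on the side of caution.
        return True
    if not faulted_serial or not current:
        # ASSUMPTION: with no serial to compare, we cannot prove replacement.
        return True
    return faulted_serial == current
```

The asymmetry is deliberate: wrongly reporting "still present" only means a repaired fault keeps being reported, whereas wrongly reporting "not present" would silently stop isolating a component that is still faulty.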

Ok - so that's some of the gory code details, but what will it look like from the user's perspective?

If a DIMM is diagnosed as faulty on an x64 system, the user will see something like this on the console (no change here):

SUNW-MSG-ID: AMD-8000-48, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Mar 19 14:04:01 PDT 2008
PLATFORM: Sun Fire X4500, CSN: 00:14:4F:20:E4:B0     , HOSTNAME: lollipop
SOURCE: eft, REV: 1.16
EVENT-ID: 44384620-5c7d-4073-edbc-ff0664004de4
DESC: The number of errors associated with this memory module has exceeded acceptable levels.  Refer to http://sun.com/msg/AMD-8000-48 for more information.
AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
IMPACT: Total system memory capacity will be reduced as pages are retired.
REC-ACTION: Schedule a repair procedure to replace the affected memory module.  Use fmdump -v -u <EVENT_ID> to identify the module.


If the user runs "fmadm faulty", they'll see this (note the DIMM serial number is now included in the FRU FMRI)

lollipop# fmadm faulty -a
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Mar 19 14:04:01 44384620-5c7d-4073-edbc-ff0664004de4  AMD-8000-48    Major   

Fault class : fault.memory.dimm_ue
Affects     : mem:///motherboard=0/chip=0/memory-controller=0/dimm=0/rank=0
                  degraded but still in service
FRU         : "CPU 0 DIMM 0" (hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E4-B0:server-id=lollipop:serial=002C000000DA062AF3/motherboard=0/chip=0/memory-controller=0/dimm=0)
                  faulty

Description : The number of errors associated with this memory module has
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/AMD-8000-48 for more information.

Response    : Pages of memory associated with this memory module are being
              removed from service as errors are reported.

Impact      : Total system memory capacity will be reduced as pages are
              retired.

Action      : Schedule a repair procedure to replace the affected memory
              module.  Use fmdump -v -u <EVENT_ID> to identify the module.

Now if the user/service guy replaces "CPU 0 DIMM 0" and then reruns "fmadm faulty" after bringing the system back up, they'll see this (note the states of the ASRU and FRU have changed to "faulted and taken out of service" and "not present", respectively):

lollipop# fmadm faulty -a
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Mar 19 14:04:01 44384620-5c7d-4073-edbc-ff0664004de4  AMD-8000-48    Major   

Fault class : fault.memory.dimm_ue
Affects     : mem:///motherboard=0/chip=0/memory-controller=0/dimm=0/rank=0
                  faulted and taken out of service
FRU         : "CPU 0 DIMM 0" (hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E4-B0:server-id=lollipop:serial=002C000000DA062AF3/motherboard=0/chip=0/memory-controller=0/dimm=0)
                  not present

Description : The number of errors associated with this memory module has
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/AMD-8000-48 for more information.

Response    : Pages of memory associated with this memory module are being
              removed from service as errors are reported.

Impact      : Total system memory capacity will be reduced as pages are
              retired.

Action      : Schedule a repair procedure to replace the affected memory
              module.  Use fmdump -v -u <EVENT_ID> to identify the module.

I also putback a handful of other bug fixes into build 87 - here's the complete putback notification:

Event:            putback-to
Parent workspace: /ws/onnv-gate
(elpaso:/ws/onnv-gate)
Child workspace: /net/hyper/tank/ws/robj/fma-dimm-serial2
(hyper:/tank/ws/robj/fma-dimm-serial2)
User: robj

Comment:
6593380 topology for Sun x64 platforms should include serial numbers for dimms
6671247 missing DIMM FRU labels on 4600/4600M2 platforms with family 15 modules
6672188 chip FRU labels computed incorrectly on 2-socket AF4+ blades
6675806 libipmi: ipmi_fru_read() can leak memory on failure

Files:
update: usr/src/cmd/fm/eversholt/files/i386/i86pc/amd64.esc
update: usr/src/cmd/fm/eversholt/files/i386/i86pc/intel.esc
update: usr/src/cmd/fm/fmd/common/fmd_api.c
update: usr/src/cmd/fm/schemes/mem/mem.c
update: usr/src/lib/fm/topo/libtopo/common/libtopo.h
update: usr/src/lib/fm/topo/libtopo/common/mapfile-vers
update: usr/src/lib/fm/topo/libtopo/common/topo_fmri.c
update: usr/src/lib/fm/topo/maps/i86pc/chip-hc-topology.xml
update: usr/src/lib/fm/topo/modules/i86pc/chip/Makefile
update: usr/src/lib/fm/topo/modules/i86pc/chip/chip.c
update: usr/src/lib/fm/topo/modules/i86pc/chip/chip.h
update: usr/src/lib/fm/topo/modules/i86pc/chip/chip_amd.c
update: usr/src/lib/fm/topo/modules/i86pc/chip/chip_label.c
update: usr/src/lib/fm/topo/modules/i86pc/chip/chip_subr.c
update: usr/src/lib/libipmi/common/ipmi_fru.c
create: usr/src/lib/fm/topo/modules/i86pc/chip/chip_serial.c

Examined files: 16

Contents Summary:
1 create
15 update



[1] I'm qualifying the statement by saying "on our newer AMD-based server platforms" for a few reasons:

  1. Since we're sourcing the serial number from the service processor, we obviously won't be able to support this on our AMD-based desktop platforms which don't have baseboard management controllers.
  2. The third-party service processor firmware on some of our older AMD-based server platforms does not export sufficient FRU information to allow us to get the serial numbers.  This mainly affects the lower-end X2100/2200 line.
  3. Our Intel-based platforms use a completely different mechanism to get the DIMM serial numbers.

[2] You might be wondering why we need to synthesize a Sun serial ID as opposed to simply using the manufacturer serial number.  There are a couple of problems with using the manufacturer serial number as is.  First, different DIMM manufacturers could use the same serial number.  Secondly, because the serial space is limited (8 characters) and DIMM manufacturers pump out DIMMs at a staggering rate, the same manufacturer could cycle through and then reuse serial numbers as frequently as every week.  Because FMA needs to use the serial number to determine whether a given DIMM has been replaced, we need to know that the serial is as unique as possible.  Newer versions of our service processor firmware (ILOM) will concatenate the following three additional pieces of information to the manufacturer serial to form a globally unique 18 character Sun serial ID:

  1. The JEDEC ID of the manufacturer
  2. The manufacturing location
  3. The manufacturing date

For the cases where we encounter older ILOM software that doesn't synthesize a globally unique Sun serial ID, Solaris will synthesize an 18 character serial ID based on the manufacturer JEDEC ID and the manufacturer serial (filling in zeroes for the location and date).  While this isn't guaranteed to be unique, it is more likely to be unique than just using the manufacturer serial alone.
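
Here is a Python sketch of the fallback synthesis. The field order and widths (4 characters of JEDEC ID, 2 of location, 4 of date, 8 of manufacturer serial) are an assumption inferred from the 18 character total and from the serial shown in the fmadm output earlier in this entry; they are not taken from the ILOM or Solaris sources, and synthesize_serial_id is a made-up name.

```python
def synthesize_serial_id(jedec_id, mfg_serial, location="00", date="0000"):
    """Build an 18 character serial ID; location/date default to zeroes.

    ASSUMED layout: JEDEC ID (4) + location (2) + date (4) + serial (8).
    """
    fields = (
        ("jedec_id", jedec_id, 4),
        ("location", location, 2),
        ("date", date, 4),
        ("mfg_serial", mfg_serial, 8),
    )
    for name, value, width in fields:
        if len(value) != width:
            raise ValueError("%s must be %d characters" % (name, width))
    return "".join(value for _, value, _ in fields).upper()
```

Under that assumed layout, a JEDEC ID of "002C" and a manufacturer serial of "DA062AF3" with zeroed location and date produce the same string that appears in the FRU FMRI above.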