The Solaris reference implementation of the fault manager recently got a boost in its ability to report faults with the introduction of a two-part SNMP agent. This agent makes it easy to integrate the Solaris fault manager into existing SNMP-based monitoring infrastructure.
Background
The fault manager has always been able to report faults to the system
log and console(s), and to provide a wealth of status information via
fmadm(1M)
and fmdump(1M).
But these reporting mechanisms leave much to be desired; syslog messages
must be parsed, and a busy central log host can easily lose important
messages in the noise. Worse still, a privileged user must log into the
affected system and run administrative commands to get any information
that isn't contained in the message.
SNMP is a natural choice for extending the reach of the fault manager's voice; it's widely used to facilitate centralised monitoring of events throughout and even across administrative domains. The basic model is simple and extensible; information can be pushed from any device to one or more network management stations (NMSs), or pulled by an administrator or automated utility from a particular device of interest. Managed devices - in this case, a Solaris system - signify events using traps (also called notifications in SNMPv2), which provide a limited amount of information to designated NMSs. They also provide access to a management information base (MIB) on demand. Generally, the MIB provides access to a much greater breadth and depth of information than is transmitted with a trap or notification. An NMS can be configured to retrieve additional data from the MIB upon receipt of a trap if desired.
Availability
The technology described here is available in Solaris Nevada builds 33
and later. OpenSolaris
offers access to the sources. A prerequisite for building or using
these applications is the installation of the SMA packages provided by
the SFW consolidation; BFUing newer ON bits is not sufficient. If you
have SWAN access, you can run
/ws/onnv-gate/public/bin/update_sma to get the necessary
packages; otherwise see the OpenSolaris
download center for the packages.
A Note on NMS Configuration
If you use the Net-SNMP-based NMS software delivered in Solaris, as I do
below, you will want to tell the client utilities to use the fault
management MIB to encode and decode OIDs. The easiest way to do this is
to add MIBS=+ALL to your environment. You can also make
this permanent by creating (or adding to)
/etc/sma/snmp/snmp.conf the line:
mibs +ALL
See snmp.conf(4)
for more information on MIB searching and importing. If you use a
different NMS, consult your vendor's documentation to learn how to
import a new MIB.
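For a single session, the environment-variable approach might look like the sketch below. It assumes a POSIX-style shell and that the Net-SNMP client utilities are on your PATH (on Solaris they are delivered in /usr/sfw/bin); the snmptranslate check is only a convenient way to confirm that SUN-FM-MIB names resolve, and is skipped if the utility is not found.

```shell
# Ask the Net-SNMP client utilities to load every MIB they can find.
MIBS=+ALL
export MIBS

# Optional sanity check: translate a SUN-FM-MIB name to its numeric OID.
# If the name cannot be resolved, the MIB file is probably not in the search path.
if command -v snmptranslate >/dev/null 2>&1; then
    snmptranslate -On SUN-FM-MIB::sunFmProblemTrap ||
        echo "SUN-FM-MIB not found; check the MIB search path" >&2
fi
```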
snmp-trapgen: an SNMP plugin for fmd(1M)
The trap or notification generator component is snmp-trapgen. This is a
very simple fault manager plugin similar to that which logs fault
information to the system log and console. Instead of writing formatted
text to a log device, however, this plugin generates SNMPv1 traps and/or
SNMPv2 notifications, one for each destination configured in the
systemwide snmpd.conf(4).
No additional configuration is required; if you have already configured
a system to send traps to one or more NMSs, you don't need to do
anything else to be notified upon fault diagnosis. If not, you'll want
to add v1 or v2 trap destinations to
/etc/sma/snmp/snmpd.conf. The hostnames or addresses you
use will need to be configured to receive and act upon SNMP traps or
notifications. If you don't have an NMS on your network, you can use
the snmptrapd(1M)
server included with Solaris.
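As a rough illustration, trap destinations are declared with the trapsink (SNMPv1) and trap2sink (SNMPv2) directives. The sketch below appends one of each to a scratch file rather than the live configuration; the host names nms1.example.com and nms2.example.com and the community string are placeholders, not values from any real deployment.

```shell
# Write to a scratch file by default; set CONF=/etc/sma/snmp/snmpd.conf
# (as a privileged user) once the lines look right for your site.
CONF=${CONF:-./snmpd.conf.example}

cat >> "$CONF" <<'EOF'
# SNMPv1 trap destination (hypothetical host)
trapsink  nms1.example.com public
# SNMPv2 notification destination (hypothetical host)
trap2sink nms2.example.com public
EOF
```

The master agent reads snmpd.conf at startup, so a restart is needed before new destinations take effect.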
A fault diagnosis trap (sunFmProblemTrap) includes a limited subset of
the information contained in the syslog message associated with the
fault. Specifically, the diagnosis's UUID, diagnostic code, and
reference URL are included. The object identifiers (OIDs) for these
data are defined by the fault management MIB, SUN-FM-MIB, installed in
/etc/sma/snmp/mibs/. The same information is delivered to
both SNMPv1 and SNMPv2 trap sinks. At present, this is the only trap
defined by the fault management MIB, but others may be generated in the
future. Here's an example of an SNMPv2 notification as decoded by
snmptrapd(1M):
2006-02-07 16:36:34 stomper [192.xx.xx.xx]:
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (2266748911) 262 days, 8:31:29.11
SNMPv2-MIB::snmpTrapOID.0 = OID: SUN-FM-MIB::sunFmProblemTrap
SUN-FM-MIB::sunFmProblemUUID."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: "a58aa105-4fab-6e16-8557-ab7687113de7"
SUN-FM-MIB::sunFmProblemCode."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: SUN4U-8000-KA
SUN-FM-MIB::sunFmProblemURL."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: http://sun.com/msg/SUN4U-8000-KA
The diagnostic code and URL can be used to find knowledge base articles
describing the fault and suggested corrective action. The diagnosis
UUID can be used to get further detail from fmdump(1M),
or from the MIB, as seen in the next section.
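If you script around snmptrapd(1M) output, pulling the UUID out of a decoded notification is a one-line job. The following is a minimal sketch, assuming the output format shown above; trap_uuid is a hypothetical helper name, not something delivered with Solaris or Net-SNMP.

```shell
# trap_uuid: read decoded trap text on stdin, print the diagnosis UUID.
# It matches the sunFmProblemUUID varbind and keeps the quoted index value.
trap_uuid() {
    sed -n 's/^SUN-FM-MIB::sunFmProblemUUID\."\([0-9a-f-]*\)".*/\1/p'
}

# Example with the varbind from the notification above.
uuid=$(printf '%s\n' \
    'SUN-FM-MIB::sunFmProblemUUID."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: "a58aa105-4fab-6e16-8557-ab7687113de7"' |
    trap_uuid)
echo "$uuid"

# On the affected system, the UUID selects the matching diagnosis:
#   fmdump -v -u "$uuid"
```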
libfmd_snmp: a MIB plugin for the System Management Agent (SMA)
Knowing that a fault has been diagnosed is important, but the amount of
information delivered with the trap or notification may not be enough to
provide an administrator with a complete understanding of the problem.
The fault management MIB defines a wealth of detail, and this detail is
made available via SMA by libfmd_snmp. In addition to fault diagnosis
detail, this MIB also offers information about faulty components and the
configuration of the fault manager itself, similar to that offered by
fmadm(1M).
Enabling the plugin requires configuring the master SNMP agent on each server you wish to query. Adding the architecture-dependent line
dlmod sunFM /usr/lib/fm/sparcv9/libfmd_snmp.so.1
to /etc/sma/snmp/snmpd.conf will cause the MIB plugin to be
automatically loaded and initialised the next time the master agent is
started, such as via /etc/init.d/init.sma. In the future, SMA will be
managed via SMF; see 6349499[0].
No further configuration is necessary, although the usual snmpd.conf(4)
directives will allow you to restrict access to the MIB, which may be
important to you since some of the information it provides is ordinarily
restricted to privileged users.
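One simple way to restrict access is a read-only community that is honoured only from your management station. This sketch again writes to a scratch file; the community string fm-readonly and the host nms1.example.com are made up for illustration. Note that any broader rocommunity line already present (such as one for public) would still grant access and would need to be removed.

```shell
# Write to a scratch file by default; point CONF at the real snmpd.conf when ready.
CONF=${CONF:-./snmpd.conf.example}

cat >> "$CONF" <<'EOF'
# Read-only access, accepted only from one management station (hypothetical names).
rocommunity fm-readonly nms1.example.com
EOF
```

The rocommunity directive also accepts an optional OID after the source, which limits the community to a single subtree.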
The fault management MIB provides four tables and a single scalar, in
addition to the trap/notification described above. sunFmProblemTable
and sunFmFaultEventTable are logically two pieces of the same table;
they are separated only because MIBs do not support nested tables. The
problem table contains the scalar information about each diagnosis,
while the fault event table contains lists of the events associated with
each diagnosis. Both tables are indexed by diagnosis UUID; the fault
event table utilises a second scalar index to distinguish between
multiple events associated with a diagnosis. In response to the trap
above, you might want to know which Automated System Reconfiguration
Unit(s) (ASRU(s)) the fault manager believes may have caused the fault.
This is
just a fancy way of saying we want to know what broke to trigger the
diagnosis. Because each ASRU is associated with a fault event, we'll
first need to know how many fault events were associated with this
diagnosis so that we can then look up each one's ASRU in the fault event
table. To do this, we'll use snmpget(1M),
delivered by Solaris in /usr/sfw/bin. Of course, you can
use any NMS software.
nms$ snmpget -c public -v 2c stomper \
sunFmProblemSuspectCount.\"a58aa105-4fab-6e16-8557-ab7687113de7\"
SUN-FM-MIB::sunFmProblemSuspectCount."a58aa105-4fab-6e16-8557-ab7687113de7" = Gauge32: 1
This diagnosis has only one fault event associated with it. To look up
the ASRU, we'll look in the fault event table entry indexed by the UUID
and the fault index. Since fault events are indexed starting from 1,
we'll need to do:
nms$ snmpget -c public -v 2c stomper \
sunFmFaultEventASRU.\"a58aa105-4fab-6e16-8557-ab7687113de7\".1
SUN-FM-MIB::sunFmFaultEventASRU."a58aa105-4fab-6e16-8557-ab7687113de7".1 = STRING: cpu:///cpuid=4/serial=23EBEC1505
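When a diagnosis has more than one suspect, the two lookups above generalise to a loop: fetch the suspect count, then query each fault event index from 1 up to that count. Here is a sketch of that idea, assuming a POSIX-style shell and the public community used in the examples; asru_oids and list_asrus are hypothetical helper names.

```shell
# asru_oids UUID COUNT
# Print one sunFmFaultEventASRU OID per fault event, indexed from 1 to COUNT.
asru_oids() {
    uuid=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'sunFmFaultEventASRU."%s".%d\n' "$uuid" "$i"
        i=$((i + 1))
    done
}

# list_asrus HOST UUID
# Fetch the suspect count, then look up the ASRU of each fault event.
# -Oqv asks snmpget to print only the value (e.g. "1"), not the full varbind.
list_asrus() {
    host=$1
    uuid=$2
    count=$(snmpget -c public -v 2c -Oqv "$host" \
        "sunFmProblemSuspectCount.\"$uuid\"") || return 1
    asru_oids "$uuid" "$count" | while read -r oid; do
        snmpget -c public -v 2c "$host" "$oid" </dev/null
    done
}

# Usage, against the system from the examples above:
#   list_asrus stomper a58aa105-4fab-6e16-8557-ab7687113de7
```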
Most NMSs offer scripting facilities that allow you to perform actions
similar to these in response to a trap. Alternatively, you could poll the
data on a regular basis. Many implementations do both, using polling to
offset the risk of losing traps, which, like all SNMP datagrams, are not
delivered reliably. Informs, also known as acknowledged notifications
and available since SNMPv2, offer only a partial remedy to this problem,
and are not supported by snmp-trapgen at this time.
A polling NMS may wish to poll the systemwide faulty component count,
provided by the MIB as sunFmFaultCount. An increase in this gauge
without a corresponding problem trap is a good indication that the trap
has been lost. More detail about devices the fault manager believes to
be in degraded or faulted states is available via the
sunFmResourceTable; walking this table provides a ready - and remote -
answer to the common question "What's broken on that machine?" For
this, we use the snmpwalk(1M)
utility:
nms$ snmpwalk -c public -v 2c stomper sunFmResourceTable
SUN-FM-MIB::sunFmResourceFMRI.1 = STRING: cpu:///cpuid=4/serial=23EBEC1505
SUN-FM-MIB::sunFmResourceStatus.1 = INTEGER: degraded(3)
SUN-FM-MIB::sunFmResourceDiagnosisUUID.1 = STRING: "a58aa105-4fab-6e16-8557-ab7687113de7"
Finally, the sunFmConfigTable offers remote access to the same
information provided by fmadm(1M)'s
config subcommand; like the other tables, it can be
accessed using snmpget(1M),
snmpwalk(1M),
or any other SNMP-compatible NMS implementation. You can find the
complete fault management MIB at the Fault Management
community site, and in build 33 and later at
/etc/sma/snmp/mibs/SUN-FM-MIB.mib.
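To make the polling approach described earlier concrete, here is a minimal sketch of a check that could run from cron on an NMS. It remembers the last sunFmFaultCount seen for a host and complains when the count rises, which is the cue to look for a missed sunFmProblemTrap. The helper names and the state-file convention are assumptions of this example, as is the use of the public community.

```shell
# check_fault_count PREVIOUS CURRENT
# Return 1 and print a warning if the fault count has increased.
check_fault_count() {
    prev=$1
    cur=$2
    if [ "$cur" -gt "$prev" ]; then
        echo "sunFmFaultCount rose from $prev to $cur; look for a missed sunFmProblemTrap"
        return 1
    fi
    return 0
}

# poll_fault_count HOST STATEFILE
# Read the scalar (instance .0), compare it with the value saved last time,
# and save the new value for the next run. Returns 2 if the query fails.
poll_fault_count() {
    host=$1
    state=$2
    last=$(cat "$state" 2>/dev/null || echo 0)
    now=$(snmpget -c public -v 2c -Oqv "$host" sunFmFaultCount.0) || return 2
    echo "$now" > "$state"
    check_fault_count "${last:-0}" "$now"
}

# Usage:
#   poll_fault_count stomper /var/tmp/stomper.faultcount ||
#       snmpwalk -c public -v 2c stomper sunFmResourceTable
```

Following a failed check with a walk of sunFmResourceTable, as in the usage line, gives the operator the list of affected resources straight away.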
[0] The bug should be visible, but it isn't. This is itself a bug, which the SFW team is working to fix.