The wedding:
Dearly beloved, we are gathered here today to unite the community to this code, OpenSolaris, in the bonds of holy hacking, which is an honorable estate. Into this, these two now come to be joined. If anyone present can show just and legal cause why they may not be joined, let them speak now or forever hold their peace. :::sorry, time's up::: Who gives this code to this community? (Sun) Community, will you have this code as your lawful source, to live together in the estate of hackery? Will you love her, debug her, test her, enhance her, and keep her bug-free and in health; forsaking all others, be true to her as long as you both shall live? (you will). OpenSolaris, will you have the community as your lawful developer, to live together in the estate of hackery? Will you love them, take putbacks from them, teach them, learn from them, and keep them with some defects (as few as possible) and in health; forsaking all others, be true to them as long as you both shall live? (it will). I now pronounce you code and developers. You may now look at the source.
The honeymoon:
Now that that's taken care of, we can finally share our favorite little pieces of code with the community. For this blog I would like to focus on the error handling portions of the I/O subsystem. Discussing fault diagnosis first would be more appropriate, but I will go backwards, especially since Andy wrote a great introduction to fault diagnosis in his blog.
As part of the Predictive Self-Healing work that went into S10 we "hardened" [1] the nexus drivers that connect the system bus to the PCI bus. One such driver is the pcisch nexus driver (the driver name can be seen via 'prtconf -D'), which attaches to the Schizo/Tomatillo/Xmits bridge chips (internal names which can be seen via 'prtdiag -v').
To "harden" this driver (and actually to "harden" any driver) we first needed to understand the underlying hardware and the various fault conditions that could lead to errors being detected and/or reported by the hostbridge. I will table this discussion for a later blog so that we can focus on error handling.
Once we had a clear picture of what faults exist and what their associated detectors were, the next step was to determine how the errors should be reported and handled in the driver, so that we can recover (wherever possible) and persist the error information for diagnosis (via a diagnosis engine which understands the relationship between faults and errors for this subsystem).
Since this driver controls hardware that bridges two bus standards, as is usually the case, we also need to "harden" the devices which sit on those two buses to take full advantage of the logging/recovery options available. For example, suppose we encounter a PCI master abort during a PIO [2] read transaction (possible faults include a leaf [3] device not responding, the requesting driver/user addressing a nonexistent device, or an attempt to access the device after power managing it):
- The PIO read request is sent to the hostbridge by the CPU and is carried out synchronously (traps the executing thread which caused the error versus interrupting the driver after the offending piece of code has moved on).
- The hostbridge sends the address on to the PCI bus and awaits a DEVSEL# [4] from the target.
- When the allowed time limit for DEVSEL# is reached the hostbridge records a Master Abort, and sends a failure (Bus Error) on the host bus to the requesting CPU.
- The CPU then records the address which was being accessed, sets the Bus Error bit in the appropriate register and traps.
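The four steps above can be modeled in a few lines of user-space C. This is only a sketch for illustration: every type, function name, and the clock limit below is hypothetical and does not come from the Solaris source or the PCI specification. It shows the one point that matters for the rest of the discussion: after a master abort, the error state is split between the hostbridge (which latched the abort) and the CPU (which captured the address).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative DEVSEL# limit in bus clocks (an assumption, not a spec value). */
#define DEVSEL_LIMIT_CLOCKS 5

/* Hypothetical leaf device: an address window plus how quickly it claims. */
typedef struct {
    uint64_t base;
    uint64_t size;
    bool     powered;        /* false models a power-managed device */
    int      devsel_clocks;  /* clocks the device needs to assert DEVSEL# */
    uint8_t  data;           /* value returned by a successful read */
} leaf_dev_t;

/* Hypothetical hostbridge state: only the latched master abort bit. */
typedef struct { bool master_abort; } hostbridge_t;

/* Hypothetical CPU state: Bus Error bit, trap flag, captured address. */
typedef struct { bool berr; bool trapped; uint64_t fault_addr; } cpu_t;

/* Return the device that claims addr in time, or NULL if nobody does. */
static const leaf_dev_t *pci_claim(const leaf_dev_t *devs, size_t n,
                                   uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        const leaf_dev_t *d = &devs[i];
        bool decodes = addr >= d->base && addr - d->base < d->size;
        if (decodes && d->powered && d->devsel_clocks <= DEVSEL_LIMIT_CLOCKS)
            return d;
    }
    return NULL;
}

/* Synchronous PIO read. Returns true on success and false on a trap. */
static bool pio_read8(cpu_t *cpu, hostbridge_t *hb, const leaf_dev_t *devs,
                      size_t n, uint64_t addr, uint8_t *out)
{
    const leaf_dev_t *d = pci_claim(devs, n, addr);
    if (d != NULL) {
        *out = d->data;
        return true;
    }
    hb->master_abort = true;  /* the detector (hostbridge) latches the abort */
    cpu->fault_addr = addr;   /* the reporter (CPU) records the address */
    cpu->berr = true;         /* Bus Error bit is set */
    cpu->trapped = true;      /* and the executing thread traps */
    return false;
}
```

Reading an address inside a powered device's window succeeds; reading an unmapped address, or one belonging to a powered-down device, leaves `master_abort` set in the bridge and `fault_addr` set in the CPU.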
As you can see from the above, the detector in this instance was
actually the hostbridge, but the CPU reported the error
up to the kernel. Both locations have important data: the CPU
recorded the address of the device which either failed to
respond or did not exist due to an incorrect address, and the
hostbridge recorded the error condition which caused it
to send the Bus Error to the CPU in the first place. Without
both pieces of information, error handling *and* fault diagnosis are
severely limited. The device in question may also have some status
which could help solve this case; for example, it could currently
be power managed.
So to reap all this information the CPU, nexus, and leaf must be
"hardened".
The "hardened" response to the above example:
- cpu_deferred_error is called to handle the trap.
  - It checks and logs the CPU state, then checks whether the current thread is under any type of trap protection (on_trap/t_lofault: topics for future blogs). Let's assume we are not protected for this particular PIO read (ddi_get instead of ddi_peek).
  - It also checks whether the current thread was executing in privileged mode. Let's assume that this is a user thread.
  - Since the BERR bit is set in response to the detected error, it calls into the registered I/O error handlers (cpu_run_bus_error_handlers) so that they may log their error state and carry out any required error handling steps.
- pci_pbm_err_handler is eventually called to log any detected errors seen by the pcisch nexus.
  - It gathers its chip register information.
  - It reports it persistently (via ddi_fm_ereport_post).
  - It looks up the failed address in the access handle cache, fails the handle and calls the registered error handler for the driver (via ndi_fmc_error).
  - The leaf error handler might also have more device-specific information to log (such as its current state or device-specific error registers).
  - It returns nonfatal if no fatal errors were discovered.
- cpu_deferred_error will then send a SIGBUS to the offending user process.
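To make the call chain concrete, here is a small user-space model of it in C. It is a sketch under stated assumptions, not the kernel code: all names, types and status values are hypothetical stand-ins (the real routines are the ones named in the list), the "ereport" is just a counter, and treating a fatal status as a panic is an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status and action codes for this model only. */
typedef enum { ERR_NONFATAL, ERR_FATAL } err_status_t;
typedef enum { ACT_SIGBUS, ACT_PANIC } cpu_action_t;

/* One entry of a modeled access handle cache: a range mapped by a leaf. */
typedef struct {
    uint64_t base, size;
    bool     failed;                   /* set when an error hits this range */
    void   (*leaf_cb)(uint64_t addr);  /* the leaf driver's error callback */
} acc_handle_t;

/* Modeled nexus state. */
typedef struct {
    acc_handle_t *handles;
    size_t        nhandles;
    bool          master_abort;  /* latched chip status */
    bool          fatal_seen;    /* some error the nexus cannot survive */
    int           ereports;      /* stands in for posted error reports */
} nexus_t;

typedef err_status_t (*bus_err_handler_t)(void *arg, uint64_t addr);

#define MAX_HANDLERS 8
static struct { bus_err_handler_t fn; void *arg; } handlers[MAX_HANDLERS];
static size_t nhandlers;

/* Nexus drivers register here so the CPU trap code can reach them. */
static void register_bus_error_handler(bus_err_handler_t fn, void *arg)
{
    assert(nhandlers < MAX_HANDLERS);
    handlers[nhandlers].fn = fn;
    handlers[nhandlers].arg = arg;
    nhandlers++;
}

/* Example leaf callback: counts how often the leaf logged its state. */
static int leaf_errors;
static void leaf_log_state(uint64_t addr) { (void)addr; leaf_errors++; }

/* Modeled nexus handler: log, fail the matching handle, notify the leaf. */
static err_status_t nexus_err_handler(void *arg, uint64_t addr)
{
    nexus_t *nx = arg;
    if (nx->master_abort) {
        nx->ereports++;            /* report the chip state persistently */
        nx->master_abort = false;  /* clear the latched status */
    }
    for (size_t i = 0; i < nx->nhandles; i++) {
        acc_handle_t *h = &nx->handles[i];
        if (addr >= h->base && addr - h->base < h->size) {
            h->failed = true;
            if (h->leaf_cb != NULL)
                h->leaf_cb(addr);
        }
    }
    return nx->fatal_seen ? ERR_FATAL : ERR_NONFATAL;
}

/* Modeled trap handler for an unprotected user-thread PIO read. */
static cpu_action_t cpu_bus_error_trap(uint64_t fault_addr)
{
    err_status_t worst = ERR_NONFATAL;
    for (size_t i = 0; i < nhandlers; i++) {
        if (handlers[i].fn(handlers[i].arg, fault_addr) == ERR_FATAL)
            worst = ERR_FATAL;
    }
    return worst == ERR_FATAL ? ACT_PANIC : ACT_SIGBUS;
}
```

The registration table is the key design point: the CPU trap code knows nothing about PCI, so each nexus registers a callback and the trap handler simply walks the list with the faulting address.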
If you would like some more details on the inner workings of the PCI nexus error handling code, please read the comment block here
The happy ending:
Previously, the same situation would have caused us to send a
SIGBUS to the offending process and it would have either cored
or printed some cryptic message. We did not reap the
information in the nexus or the leaf and did no diagnosis.
Now we send the SIGBUS to the offending application (and if the
application is not able to recover and is managed by SMF it
will be restarted) and diagnose the detected error
telemetry to the appropriate suspects (the device which failed
to respond, and the driver of the device).
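On the application side, "able to recover" could look something like the following. This is a minimal sketch using standard POSIX calls (sigaction, sigsetjmp, siglongjmp); the helper names run_guarded, simulate_bus_error and store_byte are hypothetical, and the raise() call merely stands in for a real failed device access.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf recover_point;
static void *volatile last_fault_addr;  /* si_addr of the last SIGBUS */

/* SIGBUS handler: remember the address, then unwind to the guard. */
static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    last_fault_addr = info->si_addr;
    siglongjmp(recover_point, 1);
}

/*
 * Run op(arg) with SIGBUS caught.
 * Returns 0 on success, -1 if a SIGBUS was delivered while op ran,
 * and -2 if the handler could not be installed.
 */
static int run_guarded(void (*op)(void *), void *arg)
{
    struct sigaction sa, old;
    int rc;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old) != 0)
        return -2;

    if (sigsetjmp(recover_point, 1) == 0) {
        op(arg);
        rc = 0;
    } else {
        rc = -1;  /* arrived here via siglongjmp from the handler */
    }

    sigaction(SIGBUS, &old, NULL);  /* restore the previous disposition */
    return rc;
}

/* Stand-in for a device access that fails (hypothetical helper). */
static void simulate_bus_error(void *arg) { (void)arg; raise(SIGBUS); }

/* Stand-in for a device access that works (hypothetical helper). */
static void store_byte(void *p) { *(volatile unsigned char *)p = 1; }
```

An application that does not install such a handler simply dies on the SIGBUS, which is the case where an SMF restart comes into play.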
The above is only one simple example. Another is an Uncorrected ECC Error (UE) detected while executing a non-privileged thread. Previously, a UE taken by a user thread would have caused you downtime (the system would panic) with a cryptic panic message. Now the system is able to restart your application (if managed by SMF), diagnose the fault (pointing you to the failed component, if user action is required), and retire the faulty page.
1:"hardened" for this discussion means handling and reporting errors which are detected by the underlying hardware in a manner which aids fault diagnosis.
2:PIO stands for programmed I/O and is a request, either read or write, to an I/O device.
3:leaf refers to the endpoint device.
4:DEVSEL# is an active-low signal on the PCI bus which is asserted when a device accepts an address sent on the bus.

