Dilpreet Bindra's Weblog

Tuesday June 14, 2005

Match Made in Heaven

The wedding:

Dearly beloved, we are gathered here today to unite the community to this code, OpenSolaris, in the bonds of holy hacking, which is an honorable estate. Into this, these two now come to be joined. If anyone present can show just and legal cause why they may not be joined, let them speak now or forever hold their peace. :::sorry, time's up:::

Who gives this code to this community? (Sun)

Community, will you have this code as your lawful source, to live together in the estate of hackery? Will you love her, debug her, test her, enhance her, and keep her bug free and in health; forsaking all others, be true to her as long as you both shall live? (you will).

OpenSolaris, will you have the community as your lawful developer, to live together in the estate of hackery? Will you love them, take putbacks from them, teach them, learn from them, and keep them with some defects (as few as possible) and in health; forsaking all others, be true to them as long as you both shall live? (it will).

I now pronounce you code and developers, you may now look at the source.

The honeymoon:

Now that that's taken care of, we can finally share our favorite little pieces of code with the community. For this blog entry I would like to focus on the error handling portions of the I/O subsystem. Discussing fault diagnosis first would be more appropriate, but I will go backwards, especially since Andy wrote a great introduction to fault diagnosis in his blog.

As part of the Predictive Self-Healing work that went into Solaris 10 (S10), we "hardened" [1] the nexus drivers that connect the system bus to the PCI bus. One such driver is the pcisch nexus driver (the driver name can be seen via 'prtconf -D'), which attaches to the Schizo/Tomatillo/Xmits bridge chips (internal names, which can be seen via 'prtdiag -v').

To "harden" this driver (and actually to "harden" any driver) we first needed to understand the underlying hardware and the various fault conditions that could lead to errors being detected and/or reported by the hostbridge. I will table this discussion for a later blog so that we can focus on error handling.

Once we had a clear picture of which faults exist and which detectors are associated with them, the next step was to determine how the driver should report and handle the errors, so that we can recover wherever possible and persist the error information for diagnosis (via a diagnosis engine which understands the relationship between faults and errors for this subsystem).

Since this driver controls hardware that bridges two bus standards, as is usually the case, we also need to "harden" the devices that sit on those two buses in order to take full advantage of the logging and recovery options available. For example, suppose we encounter a PCI master abort during a PIO [2] read transaction (possible faults: the leaf [3] device is not responding, the requesting driver or user is addressing a nonexistent device, or the device is being accessed after it has been power managed):

As you can see from the above, the detector in this instance was actually the hostbridge, but the CPU reported the error up to the kernel. Both locations hold important data: the CPU recorded the address of the device that either failed to respond or did not exist (due to an incorrect address), and the hostbridge recorded the error condition that caused it to send the Bus Error to the CPU in the first place. Without both pieces of information, error handling *and* fault diagnosis are severely limited. The device in question may also have status that could help solve this case; for example, it could currently be power managed.
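To make the point about needing all three sources concrete, here is a minimal sketch in C of how the CPU's fault address, the hostbridge's error status, and the leaf's state could be combined to tell apart the three possible faults listed earlier. Every type and function name in it (cpu_report, hostbridge_report, leaf_report, classify_pio_fault) is hypothetical and invented for illustration; this is not the pcisch driver's code or the actual diagnosis engine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All types and names below are hypothetical, for illustration only. */

/* What the CPU recorded when it took the bus error. */
struct cpu_report {
    uint64_t fault_addr;   /* address of the failed PIO access */
    bool     bus_error;    /* true if the CPU saw a bus error */
};

/* What the hostbridge (the detector) recorded. */
enum hb_error { HB_ERR_NONE, HB_ERR_MASTER_ABORT };

struct hostbridge_report {
    enum hb_error error;
};

/* What a hardened leaf driver can tell us about its device. */
struct leaf_report {
    uint64_t base;          /* start of the device's address range */
    uint64_t size;          /* length of that range in bytes */
    bool     powered_down;  /* device is currently power managed */
};

enum pio_verdict {
    PIO_VERDICT_INCOMPLETE,     /* not enough data to say anything */
    PIO_VERDICT_BAD_ADDRESS,    /* no known device claims the address */
    PIO_VERDICT_POWERED_DOWN,   /* device accessed while power managed */
    PIO_VERDICT_NOT_RESPONDING  /* device exists and is powered, but was silent */
};

/* Combine the three reports into one verdict for a failed PIO read. */
enum pio_verdict
classify_pio_fault(const struct cpu_report *cpu,
                   const struct hostbridge_report *hb,
                   const struct leaf_report *leaves, size_t nleaves)
{
    assert(cpu != NULL && hb != NULL);
    assert(leaves != NULL || nleaves == 0);

    /* Without both the CPU's and the hostbridge's view we cannot classify. */
    if (!cpu->bus_error || hb->error != HB_ERR_MASTER_ABORT)
        return PIO_VERDICT_INCOMPLETE;

    /* Find the leaf whose address range contains the faulting address. */
    for (size_t i = 0; i < nleaves; i++) {
        const struct leaf_report *leaf = &leaves[i];
        if (cpu->fault_addr >= leaf->base &&
            cpu->fault_addr - leaf->base < leaf->size) {
            return leaf->powered_down ? PIO_VERDICT_POWERED_DOWN
                                      : PIO_VERDICT_NOT_RESPONDING;
        }
    }

    /* Nobody owns the address: the requester used a bad address. */
    return PIO_VERDICT_BAD_ADDRESS;
}
```

Note that the sketch returns PIO_VERDICT_INCOMPLETE as soon as either the CPU's or the hostbridge's report is missing, and that it cannot separate a powered-down device from an unresponsive one without the leaf's status.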

So to reap all this information the CPU, nexus, and leaf must be "hardened".

The "hardened" response to the above example:

If you would like some more details on the inner workings of the PCI nexus error handling code, please read the comment block here

The happy ending:

Previously, the same situation would have caused us to send a SIGBUS to the offending process and it would have either cored or printed some cryptic message. We did not reap the information in the nexus or the leaf and did no diagnosis.

Now we send the SIGBUS to the offending application (if the application is not able to recover and is managed by SMF, it will be restarted) and diagnose the detected error telemetry down to the appropriate suspects (the device which failed to respond, and the driver of the device).

The above is only one simple example. Another is Uncorrected ECC Errors (UEs) that we detect while executing a non-privileged thread. Previously, a UE taken by a user thread would have caused you downtime (the system would panic) with a cryptic panic message. Now the system is able to restart your application (if managed by SMF), diagnose the fault (pointing you to the failed component, if user action is required), and retire the faulty page.
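For the SIGBUS case described earlier, an application that wants to try recovering on its own (rather than being restarted) might guard a risky access roughly as follows. This is a minimal sketch, assuming standard POSIX sigaction() and sigsetjmp()/siglongjmp(); the names guarded_access, ok_access, and faulting_access are hypothetical, and the fault is simulated with raise(SIGBUS) rather than a real PIO access. It is not thread-safe, and whether continuing after a real bus error is safe depends entirely on the application.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; a sketch, not a definitive implementation. */

static sigjmp_buf recover_point;               /* where to resume after SIGBUS */
static volatile sig_atomic_t guard_armed = 0;  /* set only inside the guard */

static void on_sigbus(int signo)
{
    (void)signo;
    if (guard_armed)
        siglongjmp(recover_point, 1);  /* unwind back into guarded_access() */

    /* SIGBUS outside a guarded region: fall back to the default action. */
    signal(SIGBUS, SIG_DFL);
    raise(SIGBUS);
}

typedef int (*access_fn)(void *ctx);

/*
 * Run access(ctx) with a SIGBUS guard installed.
 * Returns 0 and stores the result in *out on success,
 * -1 if a SIGBUS was caught, -2 if the handler could not be installed.
 */
int guarded_access(access_fn access, void *ctx, int *out)
{
    struct sigaction sa, old;
    int rc;

    assert(access != NULL && out != NULL);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old) != 0)
        return -2;

    /* savemask = 1 so the signal mask is restored after siglongjmp(). */
    if (sigsetjmp(recover_point, 1) == 0) {
        guard_armed = 1;
        *out = access(ctx);
        rc = 0;
    } else {
        rc = -1;  /* we arrived here from the signal handler */
    }

    guard_armed = 0;
    sigaction(SIGBUS, &old, NULL);  /* restore the previous disposition */
    return rc;
}

/* A well-behaved access: just reads an int through ctx. */
int ok_access(void *ctx)
{
    return *(int *)ctx;
}

/* Stand-in for an access to a device that does not respond. */
int faulting_access(void *ctx)
{
    (void)ctx;
    raise(SIGBUS);
    return 0;
}
```

A caller would check the return value of guarded_access() and, on -1, skip or retry the operation instead of dying. An application that cannot do anything sensible at that point is better off exiting and letting SMF restart it.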


1: "hardened" for this discussion means handling and reporting errors which are detected by the underlying hardware in a manner which aids fault diagnosis.
2: PIO stands for programmed I/O and is a request, either read or write, to an I/O device.
3: leaf refers to the endpoint device.
4: DEVSEL# is an active-low signal on the PCI bus which is asserted when a device accepts an address sent on the bus.


(2005-06-14 08:50:16.0)

