jdh's blog
theme and variations
Saturday Nov 11, 2006
Complexity and Completeness: FMA
Complexity brings joy into the life of an engineer: it is so satisfying to find all the nooks and crannies of a problem and come up with a solution that covers them all. No, wait. Complexity is the nemesis of the engineer: it is so hard to be satisfied with an 80% solution. The desire to provide a complete solution is almost unbearable, beyond all reasonable expectations of the company or customer. Must an engineer resort to damned statistics to prove that an inelegant or incomplete solution is sufficient, and a more efficient use of time? Sufficient for whom? For the consumer, of course. Being associated with a less than perfect solution repulses me, yet I am the very customer that would never pay what it costs for perfection.

One of my favorite examples of the agony and ecstasy of complexity is FMA (fault management architecture - see Mike Shapiro's blog). FMA is hard. [1] I've looked at the blog entry by past Sun luminary Andy Rudoff, in which he provides a summary of the concepts of FMA. The concepts are so pure and beautiful. And simple!

But it all starts with fault trees. One must explain every fault that could occur, and every symptom (error) it might produce. Then in the middle there are all the timing issues - how long will related errors take to show up? And will they show up at all, given that the paths for communication are not perfect? And at the end is the problem of how to isolate the fault. Who are all the constituents who will be affected, and how do we guarantee there is no race condition between multiple actors?

Right now we take an all-or-nothing approach to a vertical segment of faults -- all CPU faults, for example. Essentially we diagnose a complete subtree of faults, but ignore faults caused by a component closer to the root of the fault tree (maybe the fault is really a power supply problem that affects all components in the box). This is how we make the problem tractable.
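To make the subtree idea concrete, here is a toy sketch in Python. It is not how FMA represents fault trees; the `Fault` class, the `diagnose` function, and every fault and error name are invented for illustration. It only shows what "diagnose a complete subtree, ignore what sits above it" might look like.

```python
from dataclasses import dataclass, field


@dataclass
class Fault:
    """A node in a toy fault tree (hypothetical, not FMA's real data model)."""
    name: str
    symptoms: frozenset = frozenset()  # error names this fault can produce
    children: list = field(default_factory=list)

    def subtree(self):
        """Yield this fault and every fault beneath it, parents first."""
        yield self
        for child in self.children:
            yield from child.subtree()


def diagnose(subtree_root, observed_errors):
    """Return names of faults in the subtree that could explain an observed error.

    Faults above subtree_root are never considered, which mirrors the
    all-or-nothing vertical segment described above.
    """
    observed = set(observed_errors)
    return [f.name for f in subtree_root.subtree() if f.symptoms & observed]


# Hypothetical tree: a power supply fault sits above the CPU faults and
# can produce the same errors the CPU faults do.
l2_cache = Fault("cpu.l2cache", frozenset({"ecc-correctable", "ecc-uncorrectable"}))
core = Fault("cpu.core", frozenset({"core-trap"}))
cpu = Fault("cpu", frozenset(), [l2_cache, core])
power = Fault("power-supply", frozenset({"ecc-uncorrectable", "core-trap"}), [cpu])

cpu_only = diagnose(cpu, ["ecc-uncorrectable"])     # CPU subtree only
whole_box = diagnose(power, ["ecc-uncorrectable"])  # start from the root
```

Diagnosing from `cpu` blames only the cache, while diagnosing from the root also surfaces the power supply, which is exactly the kind of suspect the subtree approach leaves out in exchange for tractability.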
But the complete subtree approach gets particularly hard for I/O, where components can have a great deal of interaction, not necessarily all in a nice hierarchy, and errors are reflected in all directions. Cindi McGuire has led a herculean effort to wrestle I/O fault management down into something containable and expressible, but getting every device to participate in FMA has stumped this effort. The job is not complete. The ability to diagnose with complete accuracy remains an unrealized vision.

So I wonder - would it be so bad to just sprinkle some FMA around? For example, if we have evidence that a particular subset of faults is most common or catastrophic, can we provide just sufficient error reporting and diagnosis to narrow the fault landscape and find the instigators of those most egregious faults? Maybe we allow drivers to be enhanced to some minimal level of error reporting to enable just that type of diagnosis. This might make it easier for driver writers inside and outside of Sun to inch their way along the path of enabling the full FMA vision.

[1] let's go shopping

Posted at 06:45PM Nov 11, 2006 by Julia Harper in Sun