Wednesday June 16, 2004

Computers should diagnose faults rather than simply spewing the errors that result from those faults. Okay, I just wanted to get that off my chest, and you may think anyone who would open a new blog with that statement is, well, not a lot of fun at parties. You may be right, but that's not my fault (it is, in fact, my defect).
Why don't most computer systems print faults instead of errors? Well, it is very hard, for one thing. Most attempts to diagnose faults take only a small portion of the computer, or a single subsystem, into account. The reason is that the error telemetry from different system components is handled in completely different ways. Consider all the various free-form messages in all the various logs on a typical UNIX system. In order to use that telemetry for fault diagnosis, one would have to grok all those formats and, even worse, keep up with the changing world of each subsystem.
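To make that concrete, here is a rough sketch in Python of what "grokking all those formats" tends to look like: one hand-written pattern per log format, all funneled into a common record. The two message formats, the subsystem names, and the `ErrorEvent` type are made up for illustration; they are not taken from any real UNIX log.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ErrorEvent:
    """A common shape for error telemetry (hypothetical, for illustration)."""
    subsystem: str
    component: str
    detail: str


# Two invented free-form message formats, one per imaginary subsystem.
DISK_RE = re.compile(r"^WARNING: (?P<dev>\S+): (?P<msg>.+)$")
NET_RE = re.compile(r"^\[net\] iface=(?P<iface>\S+) err=(?P<msg>.+)$")


def parse_line(line: str) -> Optional[ErrorEvent]:
    """Normalize one log line, or return None if no known format matches."""
    line = line.strip()
    m = DISK_RE.match(line)
    if m:
        return ErrorEvent("disk", m.group("dev"), m.group("msg"))
    m = NET_RE.match(line)
    if m:
        return ErrorEvent("net", m.group("iface"), m.group("msg"))
    # Every new subsystem, and every change to an existing message,
    # means another pattern to write and maintain here.
    return None
```

Each branch is a private agreement with one subsystem's message text, which is why this approach struggles to keep up as those subsystems change.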
Anyway, once you've solved the telemetry problem (left as an exercise to the blogger, for now), you then have to decide how to consume the telemetry. Conventional logic leads to some sort of rule-based, heuristic approach to guessing what's wrong based on the error information. Historically this has led to simplistic diagnoses, which may be good enough in certain circumstances. But in any complex, real-world system the heuristics soon grow out of control.
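For a feel of how the rule-based approach starts out, and why it grows, here is a small Python sketch. The error names, the fault descriptions, and the rules themselves are all hypothetical; this illustrates the general style, not any particular diagnosis engine.

```python
from typing import Callable, List, NamedTuple, Set


class Rule(NamedTuple):
    """One heuristic: if the observed errors match, blame this fault."""
    fault: str
    matches: Callable[[Set[str]], bool]


# Invented rules over invented error names.
RULES: List[Rule] = [
    Rule("disk sd0 is failing",
         lambda errs: {"sd0 read retry", "sd0 timeout"} <= errs),
    Rule("controller c0 is failing",
         lambda errs: {"sd0 timeout", "sd1 timeout"} <= errs),
]


def diagnose(errors: Set[str]) -> List[str]:
    """Return every fault whose rule matches the observed errors."""
    return [rule.fault for rule in RULES if rule.matches(errors)]
```

With only the two `sd0` errors, `diagnose` names the disk. Add an `sd1 timeout` and both rules fire, so the diagnosis is now ambiguous and someone has to write a third rule to break the tie. Repeat that for every interaction between components and the rule set balloons.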
After solving the diagnosis problem, there's a vast system management and administrative rat's nest to tackle. Even if a system knows what is wrong with itself, what does it do with that information? What approaches actually cause the correct service action, minimize system unavailability, and feed information back into the process to reduce the failure rate over time?
I have discovered a truly elegant solution to these problems, but this blog entry is too narrow to contain it. Perhaps a future blog entry will spill the beans...