The Observation Deck
Views on software from Bryan Cantrill's deck chair
Comments:
I look forward to the blog post, as I didn't quite follow the deck. (I would love an example of a pathological system.) Then again, I couldn't even get professor "Detner's" name right, so maybe there is no hope for me.
BTW, I never much trusted Brown Grads. Bunch of hippie communists.
Posted by Scott Johnston on May 06, 2007 at 08:02 AM PDT #
A pathological system is any that is malfunctioning at some systemic level -- so anything from a cancerous cell to an economic recession to an unhandled software exception represents a pathological system at some level. And as mentioned in the deck, while system pathologies can be fatal, they need not be to still be pathological. Indeed, non-fatal pathologies are often the more difficult to diagnose -- it's easier to give an autopsy than a prognosis...
Posted by Bryan Cantrill on May 06, 2007 at 10:12 AM PDT #
Bryan,
The best author I've come across for exploring exactly these sorts of systems is Petroski. His main thesis is that we learn from failures rather than from successes.
Interestingly we don't seem to focus much on system failures in computer science and computer engineering classes.
One problem with the focus on failures is that we tend to pick case studies that are too new, and hence can be analyzed too closely with modern tools and analysis. As a result, we don't learn the design and engineering lessons; we simply try to re-solve the problem.
A key to learning from failures then is sometimes to have the distance of history (and I'm paraphrasing Petroski here) so that we don't get bogged down in the particulars of the failure itself, and can focus on the methodology and assumptions and climate that led to it.
A problem with applying this approach in the computer world is that we don't have the distance of history yet for many of the failures.
Andy, agreed about Petroski, but I might differ on the necessity of distance: in Petroski's domain (civil engineering), failures are exceedingly rare, and often take several years to completely understand. In our domain (software), failure is, to put it euphemistically, much more common, and a single failure can often be completely understood within minutes or hours. Indeed, a single bug that consumes (say) more than a week of a single engineer's time remains quite rare (and such bugs are nearly always a great story when they are to be had!). I very much agree with you that one wants to abstract away from the details of the solution to the larger issues around design and methodology -- I would just contend that one does not need much distance from a software problem to be able to do just that.
Posted by Bryan Cantrill on May 07, 2007 at 09:28 PM PDT #
Bryan,
One counter to that is that despite lots of engineering effort, good testing processes, etc. we still end up with certain bugs that defy detection in an easy manner. A nice example would be the ANI bug in MS recently, and Michael Howard's excellent writeup.
http://blogs.msdn.com/sdl/archive/2007/04/26/lessons-learned-from-the-animated-cursor-security-bug.aspx
The interesting lesson here surrounds complicated failures that are hard to detect.
I'll concede that this isn't necessarily an example of a pathological system, yet at the same time even isolated bugs like that can be exceedingly difficult to detect despite people's excellent efforts. And this isn't even one that relies on weird hardware timings, clock skew, etc.
Bryan, I was researching some DTrace topics and came across this entry in your blog. Not only did it consume 10 minutes of my life (very satisfactorily, I might add), it also gave me some good ideas about how to teach DTrace to the uninitiated. I wholeheartedly agree that we can learn more from observing mistakes than we can from observing perfection. As an instructor who has been teaching IT for more than 20 years, I am also heartened by your admiration for professor Doeppner. If he's been inspiring people in his education for 30 years, then perhaps we older IT professionals still have something to offer to the youth of IT?
Posted by Jeff Turner on May 11, 2007 at 01:33 AM PDT #
Andy, you're exactly right: no amount of engineering effort will eliminate all pathology and the remaining failures can be very difficult to understand -- which is why studying it is so critically important. When we honor pathology, we naturally develop systems in which we can better diagnose it. Thanks for the pointer to the ANI bug writeup -- this is exactly what we collectively need much more of.
Posted by Bryan Cantrill on May 11, 2007 at 11:33 AM PDT #
Jeff, glad you enjoyed it! And yes, I have great reverence for the wisdom that only time can give you -- and therefore for the IT professionals that have been around long enough to acquire it. ;) And if you know Tom, one of his greatest strengths is his fascination with history -- a subject that is (frankly) sorely neglected in our domain...
Posted by Bryan Cantrill on May 11, 2007 at 11:37 AM PDT #