I am hosting this month's IEEE Computer Society Coastal Section meeting tonight, where our guest speaker is Cristian Tapus from Caltech, speaking on software reliability. Why should we all care about software reliability? Well, if you work for a computer company, it is pretty obvious. But as Cristian stated in his talk, more and more of what we do in our everyday lives is impacted by software. Anyone who travels a lot, like I do, has probably seen one of those airport flight arrival/departure monitors showing a blank blue screen caused by a software bug. I have become accustomed to rebooting my cell phone every few days when it locks up due to software bugs. I can live with software bugs like that. Other software bugs are more troublesome. Cristian shared a detailed explanation of the software bug that caused the Mars Pathfinder to lose data and malfunction in 1997. A simple reboot didn't help in that case. I also don't want the pilot on my next commercial flight to have to reboot the flight control system when the plane is landing.

Cristian has done some pretty interesting work in software reliability. Check out his paper on Distributed Speculative Execution for Reliability and Fault Tolerance. The paper talks about speculative execution in a grid environment, but with CPU vendors planning CPUs this decade that can run not just 1 or 2 threads but 32 or 64 or more in parallel, perhaps we will figure out how to use some of those extra threads to make software more reliable. Sun's upcoming Rock processor, with 16 cores, employs scout threads to increase performance through speculative execution of instructions ahead of the main thread, for instance pre-executing instructions after a branch. Branch prediction and out-of-order execution are already done today to a limited extent on modern processors, but scout threads, by employing separate hardware threads, promise to do so more efficiently than current processors. Maybe in another few years, if we have a 32 core CPU, we will use some of the extra threads not to increase performance but to increase reliability, similar to how Cristian describes his research doing this at the cluster level.
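
To make the idea of spending spare threads on reliability a bit more concrete, here is a minimal sketch in Python. It is not Cristian's method and has nothing to do with how Rock's scout threads work; it is just plain redundant execution with a majority vote, which is my own assumption about one simple way extra threads could be put to use. The function name run_redundantly and the replicas parameter are made up for illustration, and the results must be hashable for the vote to work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_redundantly(task, args, replicas=3):
    """Run task(*args) on several threads and return the majority result.

    Hypothetical helper for illustration only: a replica that raises
    an exception is ignored, and the remaining results are put to a vote.
    """
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        futures = [pool.submit(task, *args) for _ in range(replicas)]
        results = []
        for future in futures:
            try:
                results.append(future.result())
            except Exception:
                # A replica that crashes simply does not get a vote.
                pass

    if not results:
        raise RuntimeError("all replicas failed")

    # Pick the most common answer and require a strict majority of replicas.
    value, votes = Counter(results).most_common(1)[0]
    if votes <= replicas // 2:
        raise RuntimeError("no majority among replicas")
    return value
```

With three replicas, one crashed or wrong replica is outvoted by the other two. On real hardware the replicas would need to run on independent hardware threads or machines for this to catch anything but software-level faults, so treat this only as a picture of the voting idea.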

This is exactly why, ever since I graduated college, I have been an IEEE member. Be it local chapter meetings, worldwide conferences, or their many publications, my interactions with the IEEE and its members have always sparked my intellectual curiosity. Maybe when my cell phone gets a 32 core processor with scout threads, I can finally stop rebooting it.

Comments:

As long as end users accept and are willing to pay for unreliable software, that is what we will get. So far, it seems, the increases in functionality offered by software companies are enough to offset the lack of reliability of their products. In the life cycle of every software product that is designed for a specific function, there will be a diminishing number of functionality additions that can be made. At that point, if a software producer wants to sell a new version of the product, increasing the reliability of the product will be the only way to induce additional sales.

Posted by Ravenor on February 23, 2007 at 09:53 AM PST #


This blog copyright 2009 by marchamilton