Kristien's Weblog
Friday, 21 April 2006
The challenges of doing Root Cause Analysis

Very often we get a request to do a Root Cause Analysis (RCA) of a problem that has already gone away. In some cases this is possible, but in others it is not, because the data needed is already gone. Sometimes we are faced with very insistent customers who keep pushing for an RCA even after we tell them that we do not have enough data and that, unfortunately, our crystal ball got broken during the last spring cleaning. Needless to say, this results in an unhappy customer and yours truly in despair.

Let us describe an example. It is fictitious, but similar to situations we end up in from time to time. Customer A experiences probe timeouts on their Oracle database. This means that the Sun Cluster fault probe, which runs some tests on the availability of the Oracle database, does not manage to finish those tests in time. As a result, the Sun Cluster agent tries to stop and restart the Oracle database, and since the database cannot be stopped in time either, the Oracle resource goes into STOP_FAILED. Now, when an application cannot be stopped, the cluster will not fail it over. First question from the customer: why did it not fail over? Answer: it would be very dangerous to start up another instance of an application on another node when we are not absolutely sure that the application has stopped on the first node. The cluster chooses here to protect data integrity rather than availability. So far so good.
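To make the chain of events concrete, here is a toy model of the decision described above. It is only a sketch of the reasoning, not how Sun Cluster is actually implemented: the function name, the `OFFLINE` state for a cleanly stopped resource, and the timeout values are all made up for illustration.

```python
from enum import Enum


class ResourceState(Enum):
    ONLINE = "ONLINE"
    OFFLINE = "OFFLINE"
    STOP_FAILED = "STOP_FAILED"


def handle_probe(probe_seconds, probe_timeout, stop_seconds, stop_timeout):
    """Return (resulting state, safe_to_start_elsewhere).

    Hypothetical model: all durations and timeouts are in seconds.
    """
    if probe_seconds <= probe_timeout:
        # The probe finished in time: the database stays where it is.
        return ResourceState.ONLINE, False
    # The probe timed out, so the agent first tries to stop the database.
    if stop_seconds > stop_timeout:
        # The stop did not complete either. We cannot be sure the database
        # is really down, so starting it on another node is not allowed.
        return ResourceState.STOP_FAILED, False
    # The stop completed: it is now safe to start the database again,
    # on this node or on another one.
    return ResourceState.OFFLINE, True
```

With these made-up numbers, `handle_probe(45, 30, 90, 60)` ends in `STOP_FAILED` with no permission to start elsewhere, which is exactly the situation Customer A found their cluster in.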

However, once the customer noticed their cluster in this state, they panicked and decided to reboot the node. After the reboot the STOP_FAILED condition was cleared and Oracle is now running just fine. No more probe timeouts. The customer, however, is anxious to know exactly what happened, gathers explorers and sends them to us, demanding an RCA.

What we are able to do at this point is explain what has happened: the Oracle commands that the fault probe executes took too much time, so the resource was restarted, but the Oracle STOP method took too much time as well, and hence the resource went into STOP_FAILED. What is very difficult now is to explain **why** the Oracle commands took so long to complete. With the data we have we can only guess: maybe there were storage failures making disk access slow, maybe the machine was running out of memory, maybe it was only Oracle itself being slow. In some cases there may be some hints in the messages files or in the Oracle alert log, but failing that it will be impossible to pin down a root cause. After all, the machine was rebooted and the situation that may have led to the phenomenon has been cleared. It would have been better to gather more data (in the form of a crash dump or GUDS output) at the time of the issue; that way we would at least have had some chance of nailing down the problem. But of course, when you run into a situation like that your first concern is probably to get the machine up and running again, and many machines are rebooted without further ado.

You may now object that the cluster or Oracle or whatever should at least gather **more** information at that point. Maybe it should automatically gather performance data. But there are always drawbacks to that as well: gathering more information means more disk space needed, more load on the machine, etc. Since most clusters run fine throughout their lives, it seems a logical decision to gather more in-depth information only when a problem presents itself.
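As a rough illustration of what "gather it when the problem presents itself" could look like, here is a minimal sketch of a helper that an administrator might run just before rebooting. It is not a Sun tool and no replacement for a crash dump or GUDS output; the function name and the idea of passing in your own list of commands are assumptions made for this example.

```python
import subprocess
import time
from pathlib import Path


def capture_snapshot(commands, out_root, timeout=30):
    """Run each command and save its output in a timestamped directory.

    `commands` maps a label to an argument list; which commands are worth
    running is up to the site (hypothetical helper, not a vendor tool).
    """
    out_dir = Path(out_root) / time.strftime("snapshot-%Y%m%d-%H%M%S")
    out_dir.mkdir(parents=True, exist_ok=True)
    for label, argv in commands.items():
        target = out_dir / f"{label}.txt"
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            target.write_text(result.stdout + result.stderr)
        except (subprocess.TimeoutExpired, OSError) as exc:
            # A command that hangs or is missing is itself a useful data point.
            target.write_text(f"command failed: {exc!r}\n")
    return out_dir
```

For example, `capture_snapshot({"uname": ["uname", "-a"]}, "/var/tmp")` would leave a `uname.txt` in a new snapshot directory. The timeout matters here: on a machine that is already struggling, a diagnostic command may hang just like the Oracle commands did, and the reboot should not have to wait for it forever.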

So, dear customers, if you are reading this: if we cannot give you an RCA based on explorers alone and we tell you that we do not have enough data to know what happened, please believe us. We are doing all we can, but we cannot do the impossible.


21 Apr 2006, 15:53:22 MEST
