Kristien's Weblog
Monday, July 25, 2005
So You Got Yourself a Reservation Conflict Panic
For more than 3 years your cluster has been doing fine. Then suddenly you see the following appear on one node's terminal:

[ID 747640 kern.notice] Reservation Conflict

The node has panicked! Fortunately, this is a cluster, and the other node takes over all services that were running on the panicked node. The panicked node comes back up fine and rejoins the cluster. Still, you would like to know WHY this happened. So you gather Explorer output from both nodes, plus the crash dump that was saved after the panic, submit them to Sun Service, and wait eagerly for an explanation ....

Unfortunately, it is not as simple as that. As I have discussed in previous blog entries (here, for example, and here), SCSI reservations are used to kick a node out of the cluster if a split brain occurs, or to prevent a node from joining the cluster on its own when it may suffer from amnesia. In both cases, the node that kicked the other node out believed that the other node was dead, because it no longer received any heartbeats from it. That may be because the other node really **is** dead. Or, as in this case, the other node was still alive, and the surviving node only **thought** it was dead and worth kicking out. This can happen because something was physically wrong with the interconnect links (a power-down of the switches, for example), or because one of the nodes was unable to send or receive heartbeats, for example because it suffers from SunAlert 57666.

And here we see our problem: the cause of the reservation conflict on nodea may very well be an issue on nodeb. Therefore, when a reservation conflict occurs and you want a proper Root Cause Analysis, please also generate a Live Core Dump of the 'good' node ASAP and supply this to us as well.
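To make the reasoning above concrete, here is a minimal toy model in Python. It is only a sketch under simplifying assumptions: the `SharedDisk` class, the `fence_race` function, and the idea of a plain set of per-node keys are all made up for illustration and are not how the cluster software is actually implemented. It shows only the point that matters here: the node that wins the race survives, and the other node gets a Reservation Conflict on its next I/O, regardless of which node actually had the heartbeat problem.

```python
from dataclasses import dataclass, field


@dataclass
class SharedDisk:
    """Toy shared disk: a node may write only while its key is registered."""
    keys: set = field(default_factory=set)

    def preempt(self, by: str, victim: str) -> bool:
        """`by` removes `victim`'s key. Fails if `by` was itself removed already."""
        if by not in self.keys:
            return False
        self.keys.discard(victim)
        return True

    def write(self, node: str) -> None:
        """Any I/O from a node whose key is gone is rejected."""
        if node not in self.keys:
            raise RuntimeError("Reservation Conflict")


def fence_race(disk: SharedDisk, order: list) -> dict:
    """Every node has stopped seeing heartbeats and tries to fence its peers.

    `order` is the order in which the nodes reach the disk (hypothetical).
    Returns a mapping of node name to "survives" or "fenced".
    """
    result = {}
    for node in order:
        peers = [n for n in order if n != node]
        ok = all(disk.preempt(by=node, victim=p) for p in peers)
        result[node] = "survives" if ok else "fenced"
    return result


# Assumption for the demo: nodeb is the node with the heartbeat problem,
# yet it happens to reach the disk first and wins the race.
disk = SharedDisk(keys={"nodea", "nodeb"})
outcome = fence_race(disk, ["nodeb", "nodea"])

try:
    disk.write("nodea")
    nodea_io = "ok"
except RuntimeError as err:
    nodea_io = str(err)  # this is where the real nodea would panic
```

In this run nodea is perfectly healthy and still ends up as the fenced node, which is why the crash dump of nodea alone often cannot explain the panic.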

So here is a To-Do list for Reservation Conflicts:

-Ask yourself if there are any other clusters or stand-alone nodes that can access the disks seen by the affected cluster (this is unsupported and can lead to reservation conflicts).
-Generate a Live Dump on the node that did not experience the reservation conflict. If you are not 100% sure how to do this, engage a Sun engineer.
-If you cannot generate a live dump, engage a Sun engineer who will provide you with a script to gather some vital information on the 'good' node.
-If you are running Solaris 8, please check that all patches to prevent SunAlert 57666 are installed. I CANNOT STRESS ENOUGH HOW IMPORTANT THIS IS!
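The patch check in the last item can be partly automated. The following Python sketch assumes you have saved the output of `showrev -p` to a string, and that each installed patch appears on a line starting with `Patch: NNNNNN-RR`. The patch IDs in `REQUIRED` and the `SAMPLE` text are placeholders invented for this example, not the real patch list: take the actual IDs and minimum revisions from SunAlert 57666 itself.

```python
import re

# Assumed line format of `showrev -p`: "Patch: 123456-07 Obsoletes: ..."
PATCH_LINE = re.compile(r"^Patch:\s+(\d{6})-(\d{2})\b", re.MULTILINE)


def installed_patches(showrev_output: str) -> dict:
    """Map each patch base ID to the highest revision found in the output."""
    found = {}
    for base, rev in PATCH_LINE.findall(showrev_output):
        found[base] = max(found.get(base, 0), int(rev))
    return found


def missing_patches(required: dict, installed: dict) -> list:
    """List required patches that are absent or below the required revision."""
    return sorted(
        f"{base}-{rev:02d}"
        for base, rev in required.items()
        if installed.get(base, -1) < rev
    )


# PLACEHOLDER values only: replace with the patch list from SunAlert 57666.
REQUIRED = {"000001": 5, "000002": 1}

# Synthetic sample input, not output from a real system.
SAMPLE = """\
Patch: 000001-03 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 000003-12 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsr
"""

missing = missing_patches(REQUIRED, installed_patches(SAMPLE))
```

With the placeholder data, `missing` lists one patch that is installed at too low a revision and one that is not installed at all. Run the check on both nodes, since the problem node is not necessarily the one that panicked.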






Jul 25, 2005, 15:07:02 MEST