For more than 3 years your cluster has been doing fine. Then suddenly you see the following appear on one node's terminal:
[ID 747640 kern.notice]
Reservation Conflict
The node has panicked! Fortunately this is a cluster, and the
other node takes over all services that were running on the panicked
node. The panicked node comes up fine and rejoins the cluster. Still, you
would like to know WHY this has happened. So you gather explorers from
both nodes and the crash dump that was saved after the panic. You
submit them to Sun Service and you wait eagerly for an explanation...
Unfortunately, it is not as simple as that. As I have discussed in previous blog entries (here, for example, and
here), SCSI reservations are used to kick a node out of the cluster if a
split brain occurs, or to prevent a node from joining the cluster on its own when it may suffer from
amnesia.
In both cases, the node that kicked the other node out believed that
the other node was dead because it no longer received any heartbeats
from it. That may be because the other node really **is** dead, or, as in
this case, where it was still alive, because the surviving node merely
**thought** it was dead and worth kicking out. This can happen because
something was physically wrong with the interconnect links (the switches
losing power, for example), or because one of the nodes was unable
to send or receive heartbeats, for example because it suffers from
SunAlert 57666.
And here we see our problem: the cause of the reservation conflict on
nodea may very well be an issue on nodeb. Therefore, when a reservation
conflict occurs and you want a proper Root Cause Analysis, please also generate
a Live Core Dump of the 'good' node ASAP and supply it to us as well.
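To make the sequence above concrete, here is a small toy model in Python. It is only a sketch of the reasoning, not Sun Cluster code: the names SharedDisk, preempt, write and simulate are made up for illustration, and the real SCSI reservation mechanics are far more involved. It assumes a two-node cluster (nodea and nodeb) sharing one disk, and shows that nodea ends up with the reservation conflict even though the decision to fence it was taken on nodeb.

```python
class ReservationConflict(Exception):
    """Raised when a node without a valid registration touches the disk."""


class SharedDisk:
    """Hypothetical stand-in for a shared disk that tracks registered nodes."""

    def __init__(self, nodes):
        self.registered = set(nodes)

    def preempt(self, by, victim):
        # Only a node that is still registered may remove another node's key.
        if by not in self.registered:
            raise ReservationConflict(by)
        self.registered.discard(victim)

    def write(self, node):
        # Any access by a node whose key was removed is rejected.
        if node not in self.registered:
            raise ReservationConflict(node)
        return "ok"


def simulate(heartbeats_arrive):
    """Return what happens to nodea, depending on what nodeb observes."""
    disk = SharedDisk({"nodea", "nodeb"})
    if not heartbeats_arrive:
        # nodeb hears nothing from nodea, assumes it is dead and fences it.
        # Whether nodea really is dead does not matter to nodeb.
        disk.preempt(by="nodeb", victim="nodea")
    try:
        disk.write("nodea")
        return "nodea keeps running"
    except ReservationConflict:
        return "nodea panics: Reservation Conflict"


if __name__ == "__main__":
    print(simulate(heartbeats_arrive=True))
    print(simulate(heartbeats_arrive=False))
```

Note that nothing in the model records on nodea why the heartbeats stopped arriving; that information lives on nodeb, which is why the state of the 'good' node matters for the analysis.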
So here is a To-Do list for Reservation Conflicts:
-Ask yourself if there are any other clusters or stand-alone nodes that
can access the disks seen by the affected cluster (this is unsupported
and can lead to reservation conflicts).
-Generate a Live Dump on the node that did not experience the
reservation conflict. If you are not 100% sure how to do this,
engage a Sun engineer.
-If you cannot generate a live dump, engage a Sun engineer who will
provide you with a script to gather some vital information on the
'good' node.
-If you are running Solaris 8, please check that all patches to prevent
SunAlert 57666 are installed. I CANNOT STRESS ENOUGH HOW IMPORTANT THIS IS!