Kristien's Weblog
Monday, 25 July 2005
So You Got Yourself a Reservation Conflict Panic
For more than 3 years your cluster has been doing fine. Then suddenly you see the following appear on one node's terminal:

[ID 747640 kern.notice] Reservation Conflict

The node has panicked! Fortunately this is a cluster, and the other node takes over all services that were running on the panicked node. The panicked node comes up fine and rejoins the cluster. Still, you would like to know WHY this happened. So you gather explorers from both nodes and the crash dump that was saved after the panic. You submit them to Sun Service and wait eagerly for an explanation ...

Unfortunately, it is not as simple as that. As I have discussed in previous blog entries (here, for example, and here), SCSI reservations are used to kick a node out of the cluster if a split brain occurs, or to prevent a node from joining the cluster on its own when it may suffer from amnesia. In both cases, the node that kicked the other node out believed the other node was dead because it no longer received any heartbeats from it. That may be because the other node **is** dead, or, as in this case, where the other node was still alive, because the surviving node merely **thought** it was dead and worth kicking out. This can happen because something was physically wrong with the interconnect links (a power-down on the switches, for example), or because one of the nodes was unable to send or receive heartbeats, for example because it suffers from SunAlert 57666. And here we see our problem: the cause of the reservation conflict on nodea may very well be an issue on nodeb. Therefore, when a reservation conflict occurs and you want a proper Root Cause Analysis, please also generate a Live Core Dump of the 'good' node ASAP and supply this to us as well.
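To make the chain of events concrete, here is a toy model in Python. It is only a sketch of the reasoning above, not how Sun Cluster implements fencing: the class and function names (SharedDisk, simulate) are made up for illustration, and the "keys" merely stand in for the registrations a node holds on a shared disk.

```python
class ReservationConflict(Exception):
    """Raised when a node without a registered key touches the shared disk."""


class SharedDisk:
    """Hypothetical stand-in for a shared disk that tracks registered nodes."""

    def __init__(self):
        self.keys = set()  # nodes currently allowed to access the disk

    def register(self, node):
        self.keys.add(node)

    def preempt(self, by, victim):
        # Only a node that is still registered may remove another node's key.
        if by not in self.keys:
            raise ReservationConflict(by)
        self.keys.discard(victim)

    def write(self, node):
        if node not in self.keys:
            raise ReservationConflict(node)
        return "ok"


def simulate(heartbeats_from_a_arrive):
    """Return the list of events for a two-node cluster (nodea, nodeb)."""
    disk = SharedDisk()
    disk.register("nodea")
    disk.register("nodeb")
    events = []

    if not heartbeats_from_a_arrive:
        # nodeb hears nothing, assumes nodea is dead, and fences it off.
        disk.preempt(by="nodeb", victim="nodea")
        events.append("nodeb fenced nodea")

    # nodea is in fact still alive and tries to use the disk.
    try:
        disk.write("nodea")
        events.append("nodea write ok")
    except ReservationConflict:
        events.append("nodea panics: Reservation Conflict")
    return events


if __name__ == "__main__":
    for arrive in (True, False):
        print(arrive, simulate(arrive))
```

Note that in the failing case nothing in nodea's own state explains the panic: the decision to fence was taken on nodeb. That is why the data from the 'good' node matters for the analysis.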

So here is a To-Do list for Reservation Conflicts:

-Ask yourself if there are any other clusters or stand-alone nodes that can access the disks seen by the affected cluster (this is unsupported and can lead to reservation conflicts).
-Generate a Live Dump on the node that did not experience the reservation conflict. If you are not 100% sure how to do this, engage a Sun engineer.
-If you cannot generate a live dump, engage a Sun engineer who will provide you with a script to gather some vital information on the 'good' node.
-If you are running Solaris 8, please check that all patches to prevent SunAlert 57666 are installed. I CANNOT STRESS ENOUGH HOW IMPORTANT THIS IS!






25 Jul 2005, 15:07:02 MEST Permalink Comments [2]

Trackback URL: http://blogs.sun.com/kristien/entry/so_you_got_yourself_a
Comments:

So, is there a way to reset the reservations? I'm trying to set up a new cluster and when I rebooted the active node, it wasn't able to deport all the vxvm volumes but put reservations on half of them. Now either node only sees half the disk, and I tried to do a vxdisk destroy on all the disks but it panics ... I think I have a mess now. I just want to reset everything without fscking it :)

Posted by Nathan on 06 November 2005 at 18:31 MET #

Hi, are you sure SCSI reservations are the problem? How to get rid of SCSI reservations depends on what kind of reservations you have. SCSI-2 reserve/release reservations, for example, you can clear by simply power cycling the storage array. If these are SCSI-3 reservations, you can use the command /usr/cluster/lib/sc/scsi to scrub them off. But before you do that, it may be best to log a call with Sun to double-check what the problem is. Kristien

Posted by kristien on 07 November 2005 at 09:42 MET #
