Kristien's Weblog
20050519 Thursday 19 May 2005
Split Brain

For those who want to see a real split brain, I'd suggest renting this DVD.

For the rest of us, today I shall explain what a Split Brain is in clustering theory and how Sun Cluster protects against it. As discussed a few cluster posts ago, Sun Cluster nodes communicate with each other through the private interconnect. This is a redundant network that is used exclusively for communication between the nodes. A split brain is a situation where all the links of the private interconnect fail, but the nodes themselves are still running. So each node thinks that the other one(s) is/are dead, and that it should take over the applications and stuff. And here lies the danger: in a split brain, nodes would independently start up the applications and access the data, because they do not know the other nodes are doing the same thing. Data corruption is waiting to happen. So to prevent this kind of situation we must do 2 things. Most clusters only do the first. Our Sun Cluster does both, and that is why it is a brilliant piece of software:

1) Make sure that when all private interconnects are lost and subpartitions (of one or more nodes) can form, only one of the subpartitions is allowed to continue cluster operation and the rest are kicked out. Sun Cluster decides this with a majority (quorum) algorithm, which I will discuss next week. This is the same majority algorithm that is used to prevent amnesia.

2) But even more protection is needed. I think this is best explained with a case I had. In this case, only one node experienced problems on all interconnect links: due to issues in its network stack it no longer received heartbeats from the other node. This node went through a reconfiguration, decided to kick the other node out, and started up the application etc... However, the other node was still receiving heartbeats from this node, so it did not reconfigure or check whether it was allowed to stay in the cluster. Eventually it would have been kicked out by the winning node anyway, but at this point it was still happily running applications. It is imperative that once a node has decided to continue as the cluster, take over the application and access the data, it makes sure that the other node is unable to access any of the shared data. So as soon as the 'unaware' node tried to access the data on one of the shared disks, it was kicked out to prevent any data corruption. The other node then remained in the cluster and the applications were available. The mechanism used for this is called Disk Fencing, and it uses SCSI reservations. More on SCSI reservations once I get back from holiday in June...
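To make the two steps a bit more concrete, here is a minimal Python sketch. It is a toy model and not how Sun Cluster is implemented: the real quorum algorithm is not covered in this post, so I simply assume a strict-majority vote count, and the disk reservation is simulated with an ordinary object instead of real SCSI reservations. All names (`has_quorum`, `SharedDisk`, `ReservationConflict`, `node1`, `node2`) and the vote numbers are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """Step 1 (assumed rule): continue only with a strict majority of all votes."""
    return 2 * partition_votes > total_votes


class ReservationConflict(Exception):
    """Raised when a fenced-off node touches the shared disk."""


@dataclass
class SharedDisk:
    """Hypothetical stand-in for a shared disk that supports a reservation."""
    owner: Optional[str] = None
    blocks: Dict[int, str] = field(default_factory=dict)

    def reserve(self, node: str) -> None:
        # Step 2: the winning node takes the reservation for itself.
        self.owner = node

    def write(self, node: str, block: int, data: str) -> None:
        # Any node other than the reservation holder is refused.
        if self.owner is not None and self.owner != node:
            raise ReservationConflict(f"{node} is fenced off from the disk")
        self.blocks[block] = data


# Replay of the case above, with assumed vote counts: node1 stops receiving
# heartbeats, reconfigures, and ends up holding 2 of 3 votes.
disk = SharedDisk()
if has_quorum(partition_votes=2, total_votes=3):
    disk.reserve("node1")
    disk.write("node1", 0, "written by node1")

# node2 is unaware of the reconfiguration and tries to keep writing.
try:
    disk.write("node2", 0, "written by node2")
except ReservationConflict as err:
    print(f"fencing kicked in: {err}")
```

The point of the second step is visible in the last few lines: even though the hypothetical `node2` never noticed that anything was wrong, its write is refused at the disk, so the data written by `node1` stays intact.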

 

 

 


19 May 2005, 08:50:49 MEST Permalink Comments [1]

Trackback URL: http://blogs.sun.com/kristien/entry/split_brain
Comments:

great stuff

Posted by 216.113.168.128 on 16 March 2006 at 00:09 MET #
