Kristien's Weblog
Friday 24 June 2005
SCSI reservations in Sun Cluster 3.x
Some time ago I promised to write something about the mechanisms that Sun Cluster uses to prevent split brain and amnesia. As said, in a two-node cluster a node can get the vote count from the quorum device by 'reserving' the quorum device or by making sure that the other node cannot reserve it. We also discussed that reserving the quorum device is not enough: you should also make sure that all shared disks are fenced off from a node that has to leave the cluster. This is called disk fencing. SCSI reservations are used for both the quorum disk and all the other disks.

You have probably heard of SCSI-2 versus SCSI-3. When Sun Cluster 3.x was designed, the engineers reckoned that all disks would understand SCSI-3 by the time Sun Cluster was released, but unfortunately this turned out not to be true. So they decided to have Sun Cluster use either SCSI-2 or SCSI-3. Big question: when does it use which? And why not use SCSI-2 all the time? Let's first answer the last question. SCSI-2 is an exclusive reservation, which means that only one node can own the disk. The other nodes will not be able to reserve the disk and will panic. Not so handy when you have a 4-node cluster and you want to kick out only one node. SCSI-3 is a group reservation: every node has a key in a dedicated area on the disk, and when a node has to leave, another node simply removes its key.
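The difference between the two reservation styles can be illustrated with a small toy model. This is only a sketch of the idea, not how Sun Cluster or a disk actually implements it: the class names `Scsi2Disk` and `Scsi3Disk`, their methods, and the node names are all made up for illustration.

```python
class Scsi2Disk:
    """Toy model of an exclusive reservation: one owner at a time."""

    def __init__(self):
        self.owner = None

    def reserve(self, node: str) -> bool:
        # Succeeds only if the disk is free or already reserved by this node.
        if self.owner in (None, node):
            self.owner = node
            return True
        return False


class Scsi3Disk:
    """Toy model of a group reservation: one key per registered node."""

    def __init__(self):
        self.keys = set()

    def register(self, node: str) -> None:
        self.keys.add(node)

    def preempt(self, by: str, victim: str) -> bool:
        # Only a node that still holds a key may remove another node's key.
        if by not in self.keys:
            return False
        self.keys.discard(victim)
        return True

    def may_write(self, node: str) -> bool:
        return node in self.keys


nodes = ["node1", "node2", "node3", "node4"]

# Exclusive: once node1 owns the disk, all three other nodes are locked out.
scsi2 = Scsi2Disk()
scsi2.reserve("node1")
print([n for n in nodes if scsi2.reserve(n)])    # ['node1']

# Group: node1 removes only node4's key; the other nodes keep access.
scsi3 = Scsi3Disk()
for n in nodes:
    scsi3.register(n)
scsi3.preempt(by="node1", victim="node4")
print([n for n in nodes if scsi3.may_write(n)])  # ['node1', 'node2', 'node3']
```

In the 4-node case the exclusive model can only express "one node in, everyone else out", while the group model can remove exactly one node.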

The next question, when Sun Cluster uses SCSI-2 and when SCSI-3, is an easy one to answer, but there are lots of misunderstandings about it. Sun Cluster will not 'test' whether the disk understands SCSI-2 or SCSI-3. The reason is that we use a specific SCSI-3 feature called Persistent (Group) Reservation (PGR), which is optional in the specs. So it is perfectly possible that a disk understands SCSI-3 but does not have PGR functionality enabled. Instead, Sun Cluster decides which mechanism to use based on the number of paths to the disk cluster-wide. You can check this in the output of scdidadm -L.
An example in a 2-node cluster:

14       moon1:/dev/rdsk/c1t2d0         /dev/did/rdsk/d14
14       moon2:/dev/rdsk/c1t2d0         /dev/did/rdsk/d14

--> Here we see that there is one path from moon1 to /dev/did/rdsk/d14 and one path from moon2. That makes two paths cluster-wide, hence SCSI-2 will be used.
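If you have many disks, counting the paths by hand gets tedious. Here is a minimal sketch that counts the paths per DID device from text in the scdidadm -L format shown above and applies the rule from this post (two paths: SCSI-2, more than two: SCSI-3). The function names are made up, and treating a single-path disk as "not shared" is my own assumption for the sketch.

```python
from collections import defaultdict


def paths_per_did(scdidadm_output: str) -> dict:
    """Count cluster-wide paths per DID device from `scdidadm -L` style lines."""
    counts = defaultdict(int)
    for line in scdidadm_output.splitlines():
        fields = line.split()
        # Expected shape: <instance> <node>:<raw device> <DID device>
        if len(fields) != 3 or ":" not in fields[1]:
            continue
        counts[fields[2]] += 1
    return dict(counts)


def reservation_type(path_count: int) -> str:
    """Rule from the post: two paths -> SCSI-2, more than two -> SCSI-3."""
    if path_count < 2:
        return "not shared"  # assumption: a single-path disk is a local disk
    return "SCSI-3" if path_count > 2 else "SCSI-2"


sample = """\
14       moon1:/dev/rdsk/c1t2d0         /dev/did/rdsk/d14
14       moon2:/dev/rdsk/c1t2d0         /dev/did/rdsk/d14
"""

for did, count in paths_per_did(sample).items():
    print(did, count, reservation_type(count))  # /dev/did/rdsk/d14 2 SCSI-2
```

On a real cluster you would feed it the saved output of scdidadm -L instead of the `sample` string.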

The next thing we need to discuss is the difference between the SCSI reservations used for the quorum device and the ones used for disk fencing. There is no overlap: the disk fencing code will issue SCSI reservations on all shared disks except the quorum disk.
Let us start with the SCSI mechanism used by disk fencing (i.e. the protection of disks against 'rogue' nodes that have unexpectedly left the cluster). As said, SCSI-2 will be used when there are two paths to the disk cluster-wide (the typical 2-node cluster), and SCSI-3 when there are more than two paths to the disk cluster-wide. SCSI-3 is needed in that case because of what we discussed before: we need more granularity than the all-or-nothing 'kick everyone out' of SCSI-2. The SCSI-2 reservations used are the typical MHIOCTKOWN and MHIOCRELEASE ioctls.

For the quorum device it is not as straightforward. As said, the quorum rule is used to protect against amnesia. This implies that any reservation of the quorum device should be able to persist across reboots of the storage. This is true for SCSI-3 (hence the Persistent in PGR) but not for SCSI-2. Therefore, Sun invented a mechanism it has called SCSI-2 PGRE (Persistent Group Reservation Emulation). This is an emulation of the SCSI-3 mechanism using SCSI-2 ioctls: keys are put in a designated area on the disk. These keys are able to survive a power cycle of the disk subsystem. One additional remark: putting your key on a disk or kicking another node's key off the disk has to be an atomic operation, but the SCSI-2 emulation consists of many commands. Therefore a traditional SCSI-2 MHIOCTKOWN is still used to ensure atomicity.
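To make the PGRE idea a bit more concrete, here is a toy sketch: keys live in a persistent area on the disk, and an exclusive reservation is held around every change to that area so the multi-step update behaves like one atomic step. This is my own simplified model, not the real on-disk format or driver code. The class `EmulatedPgreDisk`, the helper `update_keys` and the node names are hypothetical, and `take_ownership` and `release` only stand in for the MHIOCTKOWN and MHIOCRELEASE ioctls.

```python
class EmulatedPgreDisk:
    """Toy disk: a volatile exclusive reservation plus a persistent key area."""

    def __init__(self):
        self.owner = None      # exclusive reservation; lost on a power cycle
        self.key_area = set()  # keys stored on the disk; survive a power cycle

    def take_ownership(self, node: str) -> bool:
        # Stand-in for MHIOCTKOWN (simplified: fails if another node owns the disk).
        if self.owner in (None, node):
            self.owner = node
            return True
        return False

    def release(self, node: str) -> None:
        # Stand-in for MHIOCRELEASE.
        if self.owner == node:
            self.owner = None

    def power_cycle(self) -> None:
        self.owner = None  # the reservation is gone, the keys are not


def update_keys(disk: EmulatedPgreDisk, node: str, add=(), remove=()) -> bool:
    """Change the key area as one step, guarded by the exclusive reservation."""
    if not disk.take_ownership(node):
        return False
    try:
        disk.key_area |= set(add)
        disk.key_area -= set(remove)
    finally:
        disk.release(node)
    return True


disk = EmulatedPgreDisk()
update_keys(disk, "moon1", add=["moon1"])
update_keys(disk, "moon2", add=["moon2"])
update_keys(disk, "moon1", remove=["moon2"])  # moon1 kicks moon2's key off
disk.power_cycle()
print(sorted(disk.key_area))  # ['moon1']
```

The point of the sketch is the split of roles: the keys provide the persistence, and the exclusive reservation only provides the atomicity while the keys are being changed.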

Oh: both SCSI-3 and SCSI-2 keys are invisible and are not placed in a specific partition. SCSI-2 keys are in a designated area on the disk or LUN, and the location of SCSI-3 keys is implementation-dependent. A quorum disk can still be used to store whatever data you want. I will show in a next post how you can see these mysterious keys.






24 Jun 2005, 09:59:50 MEST Permalink Comments [4]

Trackback URL: http://blogs.sun.com/kristien/entry/scsi_reservations_in_sun_cluster
Comments:

Kristien, thank you for the explanation. Is it correct that SCSI-3 PGR will be used in a dual-node cluster when there are 2 paths to the quorum disk from each node and multipathing is disabled? (I am asking here because the Clusters forum http://forum.sun.com/forum.jspa?forumID=1 is unavailable right now.) TIA

Posted by Leon Koll on 17 December 2005 at 21:58 MET #

Kristien, thanks a lot for the SCSI details. I have one question. Every node in the cluster registers a key on the SCSI-3 PGR disks. Now, in case of split brain, every sub-cluster will try to eject the keys belonging to the other sub-cluster. After ejecting the keys, the sub-cluster that grabs the most SCSI-3 disks will become the new cluster, and the remaining nodes (those in the other sub-cluster) will panic and reboot. What I do not quite understand is who reboots the remaining nodes: the SCSI-3 framework or the fencing algorithm? If the SCSI-3 framework does it, is there any need for the fencing algorithm? If yes, what exactly will it do? Thanks & regards, Prasad Joshi.

Posted by Prasad Joshi on 13 March 2006 at 08:16 MET #

In my personal opinion, this is very helpful. Guys, I have also posted some more relevant info on this; not sure if you will find it useful: http://www.bidmaxhost.com/forum/

Posted by ocnsss on 30 March 2007 at 05:35 MEST #

Thanks for this post, Kristien.
I was wondering on which part of the disk these reservation keys are stored. I have read in multiple documents that the PGRE keys are stored in the private cylinders of the disk. However, I couldn't find an answer yet on where exactly the reservation keys of SCSI-3 as well as SCSI-2 are stored. (As one is persistent by design and the other is not, do they have any registers inside the disks or storage controllers for storing the keys?)

Posted by Abhilash V M on 11 January 2008 at 13:15 MET #
