Kristien's Weblog
Friday 05 May 2006
Replacing a disk in Sun Cluster
There is plenty of documentation on how to replace a disk in Sun Cluster, including everything you need to do on the volume manager layer. Just check out the cluster collection on http://docs.sun.com

I just want to point out some of the common mistakes.
Especially, what NOT to do when replacing a disk in Sun Cluster.

First of all, if you are replacing a disk in a hardware RAID box, there is **absolutely nothing** you need to do on the cluster or OS layer. Just follow the instructions for the box. There is no need to run any commands, as the LUNs that the OS (and hence the cluster) sees do not change.

Do **not** start to run scdidadm -C, scdidadm -r etc. I repeat: do NOT start to run scdidadm -C or scdidadm -r. It is completely useless.

If you are replacing a physical disk or an entire LUN, and the WWN/DiskID of the disk changes and hence the way the OS sees the disk changes, you have to make Sun Cluster aware of that, so that it can update its DID database.

But again, DO NOT run scdidadm -C!
Over the last few months I have seen quite a few escalations where someone just happily ran scdidadm -C as part of a disk replacement procedure, and it screwed up the DID database. No problem for me, as it is always fun to fix, but not fun for the owners of that cluster, as in many cases this means downtime.
Now what is the much-feared scdidadm -C for? It 'clears out' the DID database. This means that if you permanently (and here I say **permanently**, ie not as part of a disk replacement procedure) remove a disk, you may run it to free up the DID the disk was using. But this is a very unlikely situation. Let me sketch one thing that can go wrong when you run it out of the blue because you think that is the right thing to do. Let's say you had a disk represented by the DID number d12. There are some problems on the fabric and for some reason the disk is temporarily unavailable. You see errors and you think 'oh, let's run scdidadm -C, that'll fix it'. The command will not find the disk associated with d12 and will free up that DID number. Next time you reboot or run scdidadm -r or whatever, the disk is back but may be associated with a different DID number. Which, of course, can cause many problems.
So: scdidadm -C: Don't do it. Unless you really have to. But not as part of a disk replacement procedure.
What you need to do when you replace a physical disk is write down the DID with which it is associated. When you insert the new replacement disk, you type
scdidadm -R d# where d# is the DID number. This will update the DID database with the information (ie the disk ID) of the new disk and everybody will live happily ever after.
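
To make the d12 scenario concrete, here is a toy model in Python. It is only a sketch and not how scdidadm is implemented: the DID database is reduced to a dictionary, the function names and the device IDs ("wwn-a" and so on) are made up, and the rule that a rediscovered disk gets the highest number plus one is purely an assumption for illustration.

```python
# Toy model of the DID database: DID number -> device ID (for example a WWN).
# All names and the numbering policy are assumptions, not scdidadm internals.

def clear_unreachable(did_db, visible_ids):
    """Mimic 'scdidadm -C': drop every DID whose device is not visible now."""
    return {did: dev for did, dev in did_db.items() if dev in visible_ids}


def rescan(did_db, visible_ids):
    """Mimic a later rescan: unknown devices get a new DID (assumed: max + 1)."""
    updated = dict(did_db)
    known = set(updated.values())
    for dev in sorted(visible_ids):
        if dev not in known:
            updated[max(updated, default=0) + 1] = dev
            known.add(dev)
    return updated


def repair(did_db, did, new_device_id):
    """Mimic 'scdidadm -R d#': keep the DID number, record the new device ID."""
    updated = dict(did_db)
    updated[did] = new_device_id
    return updated


db = {11: "wwn-a", 12: "wwn-b", 13: "wwn-c"}

# Fabric glitch: wwn-b is temporarily invisible and someone runs the clear.
after_clear = clear_unreachable(db, {"wwn-a", "wwn-c"})
# The disk comes back and is rediscovered under a different number.
after_rescan = rescan(after_clear, {"wwn-a", "wwn-b", "wwn-c"})
print(after_rescan)

# Proper replacement of the disk behind d12: the number stays the same.
after_repair = repair(db, 12, "wwn-new")
print(after_repair)
```

In this model the disk that used to be d12 ends up under a new number after the clear, while the repair keeps d12 and only swaps the device ID behind it.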







05 May 2006, 15:59:27 MEST

Friday 21 April 2006
The challenges of doing Root Cause Analysis

Very often we get a request to do a Root Cause Analysis (RCA) of a problem that has already gone away. While in some cases this is possible, in others it is not, because the data needed is already gone. Sometimes we are faced with very insistent customers who keep pushing for an RCA even when we tell them that we do not have enough data and that, unfortunately, our crystal ball got broken the last time we did a general spring cleanup. Needless to say, this results in an unhappy customer and yours truly in despair.

Let us describe an example. This is a fictitious example, but similar to situations we end up in from time to time. Customer A experiences probe timeouts on their Oracle database. This means that the Sun Cluster fault probe, which runs some tests on the availability of the Oracle database, does not succeed in finishing those tests in time. As a result, the Sun Cluster agent tries to stop and start the Oracle database, and since the database cannot be stopped in time, the Oracle resource goes into STOP_FAILED. Now when an application cannot be stopped, there will be no failover by the cluster. First question of the customer: why did it not fail over? Answer: it would be very dangerous to start up another instance of an application on another node when we are not absolutely sure that the application has stopped on the first node. The cluster decides here to protect data integrity rather than availability. So far so good.
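
The chain of events in this example can be written down as a very small decision function. This is a simplified sketch of the behaviour described here, not the logic of the real Oracle agent: the function names, the state name RESTARTED and the timings are assumptions for illustration.

```python
def probe_cycle(probe_seconds, probe_timeout, stop_seconds, stop_timeout):
    """Return the resource state after one fault-probe cycle (simplified)."""
    if probe_seconds <= probe_timeout:
        return "ONLINE"            # probe finished in time, nothing to do
    # Probe timed out: the agent tries to stop and restart the database.
    if stop_seconds <= stop_timeout:
        return "RESTARTED"
    return "STOP_FAILED"           # the stop did not finish in time either


def cluster_may_fail_over(state):
    """No failover while we cannot be sure the application is stopped."""
    return state != "STOP_FAILED"


state = probe_cycle(probe_seconds=400, probe_timeout=300,
                    stop_seconds=200, stop_timeout=120)
print(state, cluster_may_fail_over(state))
```

Note that the function only tells you which timeouts were exceeded. It says nothing about why the commands were slow, and that is exactly the gap an RCA has to fill.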

However, once the customer noticed their cluster in this state, they panicked and decided to reboot the node. After the reboot the STOP_FAILED situation was cleared and Oracle is now running just fine. No more probe timeouts. The customer, however, is anxious to know what exactly happened, gathers explorers and sends these to us, demanding an RCA.

What we are able to do at this point is explain what has happened: the Oracle commands that the fault probe executes took too much time, as a result the resource was restarted, but the Oracle STOP method took too much time too, and hence the resource went into STOP_FAILED. What is very difficult now is to explain **why** the Oracle commands took so long to complete. With the data we have we can only guess: maybe there were storage failures making disk access slow, maybe the machine was running out of memory, maybe it was only Oracle itself being slow. In some cases there may be some hints in the messages files or in the Oracle Alert log, but failing that it will be impossible to pin down a root cause. After all, the machine was rebooted and the situation which may have led to the phenomenon has been cleared. It would have been better to gather more data (in the form of a crash dump or GUDS output) at the time of the issue; that way we would at least have had some chance of nailing down the problem. But of course when you run into a situation like that your first concern is probably to get the machine up and running again, and many machines are rebooted without further ado.

You may now object that the cluster or Oracle or whatever should at least gather **more** information at that point. Maybe it should automatically gather performance data. But there are always drawbacks to that as well: gathering more information means more disk space needed, more burden on the machine etc. Since most clusters run fine throughout their lives, it seems a logical decision to gather more in-depth information only when a problem occurs.

So dear customers, if you are reading this: if we cannot give you an RCA based on explorers alone and we tell you that we do not have enough data to know what happened, please believe us. We are doing all we can, but we cannot do the impossible.


21 Apr 2006, 15:53:22 MEST

Thursday 15 December 2005
Sun Cluster Forum
Just a short post to let you know that there is a Forum on Clustering where you can post all your questions. I am checking this regularly together with some of my colleagues and it is always good to read some posts about the clusters that are out there.
Here is the URL: http://forum.sun.com/forum.jspa?forumID=1

15 Dec 2005, 15:28:48 MET

Monday 26 September 2005
The DID database
A long time ago I blogged about the difference between /dev/did and /dev/global device names in Sun Cluster 3.x.
I'd like to discuss some of the files in the Cluster Configuration Repository. The Cluster Configuration Repository (CCR) is the Cluster database containing the information about the current cluster setup. Changes to this setup are saved across reboots in files in the directory /etc/cluster/ccr which is replicated on each node.
One of the things kept in this database is the DID database. Just take a look at the file /etc/cluster/ccr/did_instances, and you will see it looks as follows:

ccr_gennum      3
ccr_checksum    282788695BAD93E939748ECE92B52B4B
19      disk|DEVID_SCSI_SERIAL|SEAGATE ST39102LCSUN9.0GLJW992510000U0010JDH|5345414741544520535433393130324c4353554e392e30474c4a5739393235313030303055303031304a4448|2:/dev/rdsk/c0t1d0
20      disk||||2:/dev/rdsk/c0t6d0
1       disk|DEVID_SCSI_SERIAL|SEAGATE ST39102LCSUN9.0GLJW8793900001001J327|5345414741544520535433393130324c4353554e392e30474c4a57383739333930303030313030314a333237|1:/dev/rdsk/c0t1d0
2       disk||||1:/dev/rdsk/c0t6d0
3       disk|DEVID_SCSI3_WWN| |200000203714ce27|2:/dev/rdsk/c1t21d0|1:/dev/rdsk/c1t21d0
4       disk|DEVID_SCSI3_WWN| |20000020370d3f7d|2:/dev/rdsk/c1t16d0|1:/dev/rdsk/c1t16d0
5       disk|DEVID_SCSI3_WWN| |20000020370d3f5f|2:/dev/rdsk/c1t0d0|1:/dev/rdsk/c1t0d0
6       disk|DEVID_SCSI3_WWN| |20000020370d3f03|2:/dev/rdsk/c1t3d0|1:/dev/rdsk/c1t3d0
7       disk|DEVID_SCSI3_WWN| |20000020370d3590|2:/dev/rdsk/c1t17d0|1:/dev/rdsk/c1t17d0
8       disk|DEVID_SCSI3_WWN| |200000203714ca15|2:/dev/rdsk/c1t22d0|1:/dev/rdsk/c1t22d0
9       disk|DEVID_SCSI3_WWN| |20000020370d3d6d|2:/dev/rdsk/c1t4d0|1:/dev/rdsk/c1t4d0
10      disk|DEVID_SCSI3_WWN| |20000020370a2b24|2:/dev/rdsk/c1t1d0|1:/dev/rdsk/c1t1d0
11      disk|DEVID_SCSI3_WWN| |20000020370dc6ac|2:/dev/rdsk/c1t19d0|1:/dev/rdsk/c1t19d0
12      disk|DEVID_SCSI3_WWN| |200000203714c427|2:/dev/rdsk/c1t20d0|1:/dev/rdsk/c1t20d0
13      disk|DEVID_SCSI3_WWN| |20000020370d10e2|2:/dev/rdsk/c1t5d0|1:/dev/rdsk/c1t5d0
14      disk|DEVID_SCSI3_WWN| |20000020370d4094|2:/dev/rdsk/c1t2d0|1:/dev/rdsk/c1t2d0
15      disk|DEVID_SCSI3_WWN| |20000020370d3ed9|2:/dev/rdsk/c1t6d0|1:/dev/rdsk/c1t6d0
16      disk|DEVID_SCSI3_WWN| |20000020370d4039|2:/dev/rdsk/c1t18d0|1:/dev/rdsk/c1t18d0
17      disk|DEVID_SCSI_SERIAL|IBM     DNES30917SUN9.0G1QK087          |49424d2020202020444e4553333039313753554e392e304731514b30383720202020202020202020|1:/dev/rdsk/c2t10d0
18      disk|DEVID_SCSI_SERIAL|IBM     DNES30917SUN9.0G1QM765          |49424d2020202020444e4553333039313753554e392e304731514d37363520202020202020202020|1:/dev/rdsk/c2t11d0
8191    tape||||1:/dev/rmt/0

All CCR files start with a gennum (generation number) and a checksum (second line). These files are indeed checksum protected and should NOT be edited manually. You CAN edit them on some occasions, but if that is required you will need to contact Sun for assistance.
Let us look at one of these lines:

13      disk|DEVID_SCSI3_WWN| |20000020370d10e2|2:/dev/rdsk/c1t5d0|1:/dev/rdsk/c1t5d0

This is the entry for DID 13, which we can see in the output of scdidadm -L as follows:

13       moon1:/dev/rdsk/c1t5d0         /dev/did/rdsk/d13
13       moon2:/dev/rdsk/c1t5d0         /dev/did/rdsk/d13

The second field is 'disk'. This is the type of DID device. These types are defined in the file /etc/cluster/ccr/did_types and right now you have 'disk' and 'tape'.
The third field is 'DEVID_SCSI3_WWN'. This defines the type of device ID that this device provides. Each DID device is normally identified by a unique ID, such as a serial number, WWN etc.
The field after that holds a printable form of the ID. It is blank for WWN-type IDs like this one, while the DEVID_SCSI_SERIAL entries above show a readable vendor/model/serial string there.
Then comes the actual device ID, in this case: 20000020370d10e2. This also means that this disk is uniquely identified in the DID database and we cannot just replace it with another disk without telling the cluster about it.
The next field is 2:/dev/rdsk/c1t5d0. This means that on the node with nodeid 2, this disk is referred to as 'c1t5d0'.
The last field is 1:/dev/rdsk/c1t5d0. This means that on the node with nodeid 1, this disk is referred to as 'c1t5d0'. Please be aware that these names may differ on different nodes. The DID layer uses the device ID to make sure we are talking about the same disk, even if it has different Solaris names on the different nodes.
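
If you want to read such an entry from a script, the field layout above translates into a few lines of Python. This is a minimal sketch for reading only: the function name and the dictionary keys are my own, and it assumes every entry looks like the sample lines shown here (the ccr_gennum and ccr_checksum header lines must be skipped by the caller).

```python
def parse_did_instance(line):
    """Split one did_instances entry into named parts (for reading only)."""
    instance, rest = line.split(None, 1)
    fields = rest.strip().split("|")
    device_type, id_type, printable_id, device_id = fields[:4]
    paths = {}
    for entry in fields[4:]:
        # Each remaining field is <nodeid>:<Solaris device name>.
        node_id, solaris_name = entry.split(":", 1)
        paths[int(node_id)] = solaris_name
    return {
        "instance": int(instance),
        "device_type": device_type,
        "id_type": id_type,
        "printable_id": printable_id.strip(),
        "device_id": device_id,
        "paths": paths,
    }


entry = parse_did_instance(
    "13      disk|DEVID_SCSI3_WWN| |20000020370d10e2"
    "|2:/dev/rdsk/c1t5d0|1:/dev/rdsk/c1t5d0"
)
print(entry["instance"], entry["device_id"], entry["paths"])
```

Never write anything back to the file from a script like this. The checksum protects it for a reason.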

If you are seeing error messages about DID devices, it may be that you have replaced a disk without following the official procedure to replace a disk in the cluster. Let us say you replaced the disk at c1t5d0 with another one. You must now tell the cluster that it should update the DID database for DID number 13 with the new device ID. You can do that as follows:
#scdidadm -R c1t5d0
OR:
#scdidadm -R 13

Sometimes the DID configuration is completely messed up, for example because you have been switching cables/controllers without following the correct procedure. To check this, compare the entries in the did_instances file with, for example, the 'diskinfo' output in the explorer. To fix this, contact your Sun Resolution Center.



 



26 Sep 2005, 11:56:46 MEST

Monday 25 July 2005
So You Got Yourself a Reservation Conflict Panic
For more than 3 years your cluster has been doing fine. Then suddenly you see the following appear on one node's terminal:

[ID 747640 kern.notice] Reservation Conflict

The node has panicked! Fortunately this is a cluster and the other node takes over all services that were running on the panicked node. The panicked node comes up fine and rejoins the cluster. Still, you would like to know WHY this has happened. So you gather explorers from both nodes and the crash dump that was saved after the panic. You submit them to Sun Service and you wait eagerly for an explanation ....

Unfortunately, it is not as simple as that. As I have discussed in previous blog entries, SCSI reservations are used to kick a node out of the cluster if a split brain occurs, or to prevent a node from joining the cluster on its own when it may suffer from amnesia. In both cases, the node that kicked the other node out thought that this other node was dead because it did not receive any heartbeats from it anymore. This may be because the other node really **is** dead or, as in this case, where it was still alive, because the first node only **thought** it was dead and worth kicking out. This can happen because there was something physically wrong with the interconnect links (power down on the switches, for example), or because one of the nodes was unable to send or receive heartbeats, for example because it suffers from SunAlert 57666. And here we see our problem: the cause of the reservation conflict on nodea may very well be an issue on nodeb. Therefore, when a reservation conflict occurs and you want a proper Root Cause Analysis, please also generate a Live Core Dump of the 'good' node ASAP and supply this to us as well.

So here is a To-Do list for Reservation Conflicts:

-Ask yourself if there are any other clusters or stand-alone nodes that can access the disks seen by the affected cluster (this is unsupported and can lead to reservation conflicts).
-Generate a Live Dump on the node that did not experience the reservation conflict. If you are not 100% sure how to do this, engage a Sun engineer.
-If you cannot generate a live dump, engage a Sun engineer who will provide you with a script to gather some vital information on the 'good' node.
-If you are running Solaris 8, please check if all patches to prevent SunAlert 57666 are installed. I CANNOT STRESS HOW IMPORTANT THIS IS!






25 Jul 2005, 15:07:02 MEST

Wednesday 06 July 2005
Show me the Key!
As promised in a previous blog entry, I will now show you the commands to look at SCSI-2 PGRE or SCSI-3 PGR keys.
As discussed, such keys are used on the quorum disk because they are persistent, and persistence is what you need if you want to avoid amnesia. We also discussed that SCSI-2 PGRE keys are an emulation of SCSI-3 keys. They were invented by Sun Cluster engineering, whereas SCSI-3 PGRs are part of the SCSI-3 specification.

The first cluster is a 2-node cluster with 2 paths to the quorum device cluster-wide:

# scdidadm -L d4
4        node1:/dev/rdsk/c2t0d0     /dev/did/rdsk/d4    
4        node2:/dev/rdsk/c2t0d0     /dev/did/rdsk/d4 

So the keys used will be PGRE keys. The command to use is, guess what, pgre:
# /usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d4s2
key[0]=0x42b2c3e500000001.
key[1]=0x42b2c3e500000002.

The second cluster is a 3 node cluster. It has more than 2 paths to the quorum disk:
# scdidadm -L d5
5        node1:/dev/rdsk/c3t50020F2300002A89d0 /dev/did/rdsk/d5    
5        node2:/dev/rdsk/c3t50020F2300002A89d0 /dev/did/rdsk/d5    
5        node3:/dev/rdsk/c3t50020F2300002A89d0 /dev/did/rdsk/d5 

So the keys used will be SCSI-3 PGR keys. The command to use is scsi:
# /usr/cluster/lib/sc/scsi -c inkeys -d /dev/did/rdsk/d5s2
Reservation keys(3):
0x4225e25100000003
0x4225e25100000001
0x4225e25100000002

Both the pgre and scsi commands have options to scrub the keys off the disk. Please don't ever do that unless instructed to by authorised Sun personnel! Sometimes this is mistakenly done to solve amnesia, but unfortunately it will only make things worse: the idea with amnesia is to get that node's key on the disk, not to remove all the other nodes' keys!
An example of when scrubbing the keys would be useful is when you are getting reservation conflict panics because a disk was previously used in another cluster and still has that other cluster's keys. But again, this will have to be diagnosed first before any action is taken.



06 Jul 2005, 15:39:23 MEST

Friday 01 July 2005
What to do when your cluster doesn't start up your application
Don't panic! The first thing you need to find out is whether the application itself may be at fault.

Here is a scenario:

Let us say you have a resource group called myapp-rg. Inside myapp-rg you have three resources:
A LogicalHostname resource named loghost-rs
A HAStoragePlus resource named haplus-rs
A home-brewed application resource named myapp-rs

One day you stop the resource group and you try to start it up again. However, the application resource does not come online and you see a message like this, or something similar:
resource myapp-rs status on node1 change to R_FM_FAULTED
Your first thought may be that there is something wrong with the cluster. My experience, however, is that in 99% of the cases it is the application itself that is not able to start up in a timely manner.
Here is how you can check this:

1) Switch offline the resource group:

#scswitch -F -g myapp-rg

2) Disable the application resource:

#scswitch -n -j myapp-rs

3) Start up the resource group. Use scswitch -z:

#scswitch -z -h nodea -g myapp-rg

After this has been done, use the Start script of myapp-rs (you can find it by grepping for START_COMMAND in the output of scrgadm -pvv) to launch the application manually.
If the application fails to come online, you know that it is the application which is at fault, and that it should be fixed by checking the application logs and contacting the appropriate vendor.
If the application comes online, but it takes longer than the START_TIMEOUT value of the application resource (again, find this by grepping for it in the output of scrgadm -pvv), you should increase that value:

#scrgadm -c -j myapp-rs -y START_TIMEOUT=<appropriate value>
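
To compare the manual start against START_TIMEOUT you need to know how long the start script really takes. Here is a small Python sketch that times a command and tells you which of the cases above you are in. The function name and the verdict strings are made up, and it assumes the start script returns only once the application is up and signals failure with a non-zero exit code, which you should verify for your own script.

```python
import subprocess
import time


def time_start_command(command, start_timeout):
    """Run a start command, time it and compare against START_TIMEOUT (seconds)."""
    began = time.monotonic()
    result = subprocess.run(command)
    elapsed = time.monotonic() - began
    if result.returncode != 0:
        verdict = "application failed to start: check the application logs"
    elif elapsed > start_timeout:
        verdict = "started, but slower than START_TIMEOUT: consider raising it"
    else:
        verdict = "started within START_TIMEOUT"
    return result.returncode, elapsed, verdict
```

You would call it with the command found via START_COMMAND, given as a list of arguments, and the current START_TIMEOUT value of the resource.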



01 Jul 2005, 14:20:27 MEST

Friday 24 June 2005
Interesting SVM Blog!
Found Sanjay Nadkarni's weblog, which discusses some interesting implementation details of Solaris Volume Manager, which as you know is the absolute best Volume Manager to walk the face of this earth...
Keep it coming Sanjay!


24 Jun 2005, 17:06:08 MEST

SCSI reservations in Sun Cluster 3.x
I promised some time ago to write something about the mechanisms that Sun Cluster uses to prevent split brain and amnesia. As said, in a two node cluster, a node can get the vote count from the quorum device by 'reserving' the quorum device or making sure that the other node cannot reserve it. We also discussed that reserving quorum devices is not enough: you should also make sure that all disks are fenced out from a node that has to leave the cluster. This is called disk fencing.  SCSI reservations are used for both the quorum disk and all the other disks.

You have probably heard of SCSI-2 versus SCSI-3. When Sun Cluster 3.x was designed, the designers reckoned all disks would be ready to understand SCSI-3 by the time Sun Cluster was released, but unfortunately this turned out not to be true. So they decided to have Sun Cluster use either SCSI-2 or SCSI-3. Big question: when does it use what? And why not use SCSI-2 all the time? Let's first try to answer the last question: SCSI-2 is an exclusive reservation, which means that only one node can own the disk. Other nodes will not be able to reserve the disk and they will panic. Not so handy when you have a 4-node cluster and you want to kick out only one node. SCSI-3 is a group reservation: every node has a key in a dedicated area on the disk, and when a node has to leave, another node will just kick its key off the disk.

The next question, when Sun Cluster uses SCSI-2 and when SCSI-3, is an easy one to answer, but there are lots of misunderstandings. Sun Cluster will not 'test' whether the disk understands SCSI-2 or SCSI-3. The reason for that is that we use a specific functionality of SCSI-3 called Persistent (Group) Reservation (PGR), which is optional in the specs. So it is perfectly possible that a disk understands SCSI-3 but does not have PGR functionality enabled. Instead, Sun Cluster decides what mechanism to use based on the number of paths to the disk cluster-wide. You can check this in the output of scdidadm -L.
An example in a 2-node cluster:

14       moon1:/dev/rdsk/c1t2d0         /dev/did/rdsk/d14
14       moon2:/dev/rdsk/c1t2d0         /dev/did/rdsk/d14

-->  Here we see that there is one path from moon1 to /dev/did/rdsk/d14, and one path from moon2 --> hence scsi-2 will be used.

The next thing we need to do is discuss the difference between the SCSI reservations used for the Quorum device and the ones used for disk fencing. There is no overlap: disk fencing code will issue SCSI reservations on all shared disks except the Quorum Disk.
Let us first start with the SCSI mechanism used by disk fencing (ie the protection of disks against 'rogue' nodes that have unexpectedly left the cluster). As said, SCSI-2 will be used when there are 2 paths to the disk cluster-wide, as in a 2-node cluster, and SCSI-3 when there are more than 2 paths to the disk cluster-wide. SCSI-3 is needed in that case because of what we have discussed before: we need more granularity than the all-or-nothing 'kick everyone out' of SCSI-2. The SCSI-2 reservations used are the typical MHIOCTKOWN and MHIOCRELEASE ioctls.
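
The path-counting rule is easy to check from a script. The Python sketch below counts the paths per DID device in scdidadm -L output and applies the rule as stated here. The function names are my own, the three-column layout is assumed from the samples in this post, and the result is only meaningful for shared disks.

```python
from collections import Counter


def paths_per_did(scdidadm_output):
    """Count how many node paths 'scdidadm -L' lists for each DID device."""
    counts = Counter()
    for line in scdidadm_output.splitlines():
        parts = line.split()
        # Expected shape: <instance> <node>:<Solaris path> <DID path>
        if len(parts) == 3 and parts[2].startswith("/dev/did/"):
            counts[parts[2]] += 1
    return dict(counts)


def reservation_type(path_count):
    """Rule from the text: more than 2 paths cluster-wide means SCSI-3."""
    return "SCSI-3" if path_count > 2 else "SCSI-2"


sample = """\
14       moon1:/dev/rdsk/c1t2d0         /dev/did/rdsk/d14
14       moon2:/dev/rdsk/c1t2d0         /dev/did/rdsk/d14
"""
for did, count in paths_per_did(sample).items():
    print(did, count, reservation_type(count))
```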

For the quorum device it is not as straightforward. As said, the quorum rule is used to protect against amnesia. This implies that any reservation of the quorum device should be able to persist across reboots of the storage. This is true for SCSI-3 (hence the Persistent in PGR) but not for SCSI-2. Therefore, Sun invented a mechanism it has called SCSI-2 PGRE (Persistent Group Reservation Emulation). This is an emulation of the SCSI-3 mechanism using SCSI-2 ioctls: keys are put in a designated area on the disk. These keys are able to survive a power cycle of the disk subsystem. One additional remark: putting your key on a disk or kicking another node's key off the disk has to be an atomic operation, but the SCSI-2 emulation consists of many commands. Therefore a traditional SCSI-2 MHIOCTKOWN is still used to ensure atomicity.

Oh: both SCSI-3 and SCSI-2 keys are invisible and are not placed in a specific partition. SCSI-2 keys are in a designated area on the disk or LUN, and the location of SCSI-3 keys is implementation-dependent. A quorum disk can still be used to store whatever data you want. I will show in a later post how you can see these mysterious keys.






24 Jun 2005, 09:59:50 MEST

Wednesday 25 May 2005
Sun Cluster 3.x Quorum algorithm
So let me try to explain the mechanism Sun Cluster uses to prevent both Amnesia and Split Brain. This is a majority algorithm: only a cluster node or a subset of cluster nodes that holds a majority of the possible votes can start up (in the case of amnesia) or continue (in the case of split brain) cluster operation. The other partitions must leave the cluster. So let us first discuss the Split Brain scenario: a node cannot communicate with the other node over the private interconnect, but both nodes are fine. As discussed before, we must not allow both nodes to continue cluster operation, so one has to leave. Each node has a vote, but in a 2-node cluster this would mean that in case of a split brain nobody would continue cluster operation. So in a 2-node cluster we assign a quorum device: a LUN in shared storage that also has a vote. That way there are 3 possible votes in the cluster, and a majority of 3 is 2 votes. Once a split brain occurs, both nodes race for the quorum device: the one that is fastest gets its vote. The other one notices that it is too late and panics with a 'Lost Operational Quorum' message. The mechanism for reserving Quorum Devices is SCSI reservations, which we will discuss in 2 weeks.
Now how can the quorum mechanism prevent amnesia? To prevent amnesia we must only allow the last node to have left the cluster to start up the cluster. Same story: when a node leaves the cluster, the other node(s) will make sure that it cannot acquire the quorum disk when it starts up. Only the last node in the cluster will be able to do so. So when the first node to have left the cluster tries to start up, it has 1 vote of its own and knows that there are 3 possible votes in the cluster, but it cannot get the vote of the quorum device: it waits for the other node to form the cluster first, with the message 'waiting for operational quorum'. The last node that left the cluster starts up, gets the vote of the quorum disk, starts talking to the waiting node and passes the latest cluster database to that waiting node, so that this node is up to date with all information that may have changed while it was down.
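
The vote arithmetic is simple enough to write down. This Python sketch only illustrates the counting described above for the two-node case. The function names are made up, and it says nothing about how the race for the quorum device itself is decided.

```python
def majority(possible_votes):
    """Smallest vote count that is more than half of all possible votes."""
    return possible_votes // 2 + 1


def may_run_cluster(own_votes, got_quorum_device, possible_votes=3):
    """True if this partition holds a majority of the possible votes."""
    votes = own_votes + (1 if got_quorum_device else 0)
    return votes >= majority(possible_votes)


# Two nodes and no quorum device: after a split each side has 1 of 2 votes.
print(may_run_cluster(1, False, possible_votes=2))   # False
# Two nodes plus a quorum device: 3 possible votes, majority is 2.
print(may_run_cluster(1, True))    # True: this node got the quorum device
print(may_run_cluster(1, False))   # False: must leave, or wait at boot
```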

I realise there is a lot more to be said about this, and there are a lot more scenarios when we add more nodes. However, it is the end of my day, the weather is beautiful and warm (27 degrees C) and it is time to take a nice walk with my dog Lukka, followed by a nice glass of cool white wine...


25 May 2005, 18:14:58 MEST

Thursday 19 May 2005
Split Brain

For those who want to see a real split brain, I'd suggest renting this DVD.

For the rest of us, I shall today explain what a Split Brain is in clustering theory and how Sun Cluster protects against it. As discussed some cluster posts ago, Sun Cluster nodes communicate with each other through the private interconnect. This is a redundant network that is exclusively used for inter-node communication. A split brain is a situation where all the links of the private interconnect fail, but the nodes are still running. So each node thinks that the other one(s) is/are dead, and that it should take over applications and stuff. And here lies the danger: in a split brain, nodes would independently start up the applications and access the data, because they do not know the other nodes are doing the same thing. Data corruption is waiting to happen. So to prevent this kind of situation we must do 2 things. Most clusters only do the first; our Sun Cluster does both, and that is why it is a brilliant piece of software:

1) Make sure that when a loss of all private interconnects occurs, and subpartitions (of one node or different nodes) can form, only one of the subpartitions or nodes is allowed to continue cluster operation and the rest are kicked out. Sun Cluster decides this based on a majority (quorum) algorithm, which I will discuss next week. This majority algorithm is the same as the one that is used to prevent amnesia.

2) But even more protection is needed. I think this is best explained with a case I had. In this case, only one node experienced problems on all interconnect links: due to issues in the network stack it did not receive heartbeats from the other node anymore. This node went through a reconfiguration and decided to kick the other node out and start up the application etc... However, the other node was still receiving heartbeats from this node and did not reconfigure or check whether it could start up the cluster or not. Eventually it would have been kicked out by the winning node anyway, but at this point it was still happily continuing to run applications. It is imperative that once a node has decided to continue the cluster, take over the application and access the data etc, it makes sure that the other node is unable to access any of the shared data. So as soon as the 'unaware' node tried to access the data on one of the shared disks, it was kicked out to prevent any data corruption. The other node then remained in the cluster and the applications were available. The mechanism used for this is called Disk Fencing, and it uses SCSI reservations. More on SCSI reservations once I get back from holiday in June...

 

 

 


19 May 2005, 08:50:49 MEST

Monday 09 May 2005
Amnesia

You forgot that I need ya
You must've caught amnesia
That's why you don't believe

(Black Eyed Peas)

I am not really a Black Eyed Peas fan (not my type of music) but I kind of like this song.

After having discussed the private interconnect some time ago, I feel it is necessary to chat a bit about the Cluster Membership Monitor (CMM), but this is impossible without first discussing the typical issues in clustering theory called 'amnesia' and 'split brain'. Amnesia is for this week, and I hope to get to Split Brain next week, in order to be able to finish up the CMM discussion before I leave on holiday the 28th of May (hurray !!!!!!!).

Let's imagine the following situation: You have a 2-node cluster. At 12pm you shut down a cluster node, nodea, for maintenance. Nodeb is still running. At 4pm you decide to change some settings (such as timeouts). Where will these changes go? The cluster has a central repository, called the Cluster Configuration Repository (CCR), which has a local copy on each node (in /etc/cluster/ccr --> check it out). So, because nodea is down, the update will only make it into nodeb's copy of the database. Nodea is unaware of the change. If it joined the cluster now, it would receive the most recent copy of the CCR from nodeb.

Let us now say that at 6pm, you shut down nodeb as well. Both nodes are down. If we now booted nodea, it would start up with an old copy of the database. The cluster would have 'forgotten' the changes that happened between 12pm and 6pm, as these changes are only known on nodeb, and that one is down. This is called amnesia.
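
The timeline can be replayed in a few lines of Python. This is a toy model only: the dictionary layout, the function name and the idea of bumping a generation number on every change are simplifications I made up to show which copy ends up stale.

```python
# Each node keeps a local CCR copy; "gennum" stands in for a generation number.
ccr = {
    "nodea": {"gennum": 1, "timeout": 30},
    "nodeb": {"gennum": 1, "timeout": 30},
}


def apply_change(ccr, running_nodes, **settings):
    """Record a configuration change on the nodes that are up at that moment."""
    for node in running_nodes:
        ccr[node].update(settings)
        ccr[node]["gennum"] += 1


running = {"nodea", "nodeb"}
running.discard("nodea")                 # 12pm: nodea goes down for maintenance
apply_change(ccr, running, timeout=60)   # 4pm: change made while only nodeb is up
running.discard("nodeb")                 # 6pm: nodeb goes down as well

# Booting nodea alone would bring the cluster up with the old value.
print(ccr["nodea"])   # {'gennum': 1, 'timeout': 30}
print(ccr["nodeb"])   # {'gennum': 2, 'timeout': 60}
```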

I can already tell you that this is a situation that will never happen in Sun Cluster.

If we take a look at how amnesia prevention was historically done in different clustering solutions, there are several options:

1) Just DO NOT allow changes when members are not part of the cluster
2) Store the cluster database on shared storage, always
3) Store the cluster database on local storage and on shared location and use shared location to override local copies when nodes have been down

Sun Cluster 3.x uses a more elegant approach. In fact, it uses the same mechanism that it uses to prevent Split Brain: A majority algorithm. Before I explain this, however, I will first explain what IS Split Brain. That's for next week.

 


09 May 2005, 10:41:16 MEST

Thursday 28 April 2005
Why Oracle RAC on Sun Cluster

From time to time I see the following question popping up: Why Oracle RAC on top of Sun Cluster, especially since Oracle 10g RAC ships with its own clusterware? In our book we dedicated a chapter to the benefits of Sun Cluster, and this chapter was also published as a Blueprints article. You can find it here:

Understanding the Benefits of Implementing Oracle RAC on Sun Cluster Software  

We refrained in this article from making explicit comparisons to other clusterware, as we felt this to be somewhat politically incorrect. We only stress what is good about our product (a lot :o) )

Also if you like the article, you can order the book here.


28 Apr 2005, 09:28:58 MEST

Tuesday 26 April 2005
Sun Cluster 3.x Private Interconnect: What's on?
(OK I feel guilty for writing so much about personal stuff and animals on this blog. All my colleagues seem to write about **computers** all the time so my spring time resolution is to write at least one blog entry a week about the best piece of software ever to see the face of this earth: Sun Cluster 3.x)

Have you ever wondered what Sun Cluster uses the private interconnect for? Heartbeats yes, but also other stuff? And what's the story about Remote Shared Memory? Let me try to enlighten you a bit.

I tend to split up Private Interconnect Traffic in 3 big chunks, based on functionality and protocols used.

1) First you have the heartbeat messages. These are sent through all available private interconnect links and are basically 'I am alive' messages. Unlike Sun Cluster 2.2, which used ICMP messages that were sent through one private link, Sun Cluster 3.x uses the low-level protocol DLPI to send the heartbeats. If a heartbeat is not received on one path in time, this path is declared dead. If all paths are dead, Sun Cluster's Path Manager warns the Cluster Membership Monitor that the other node is presumably dead (more on this, maybe, in a future blog entry).
So basically the heartbeat messages are used for path and node monitoring.

2) Secondly, Sun Cluster sends all kinds of ORB communication through the private interconnect. As you may know, Sun Cluster inter- and intra-node communication is implemented in a CORBA-like way: method invocations on remote objects. This is done using TCP/IP. Examples of this kind of traffic are:
-Global Filesystem and Global Devices Traffic: Only one node is the primary of the global device. If a device or filesystem is accessed on a non-primary node, this traffic is redirected through the private interconnect.
-Cluster Configuration Repository  (ie the cluster database) updates
-Messages from the Cluster Membership Monitor about cluster reconfiguration.
These Cluster Framework Messages use all available paths on the private interconnect.

3) Thirdly, applications may want to use the private interconnect. Examples of such applications are NTP and Oracle RAC. Sun Cluster provides IP addresses that can be used to contact other cluster nodes via the private interconnect. These IP addresses are typically of the format clusternode<nodeID>-priv.
Up till Sun Cluster 3.1 GA, this application-level traffic used only one link on the private interconnect. However, this link was HA: if the link failed, the IP address was moved to another available private interconnect link. As of Sun Cluster 3.1 10/03, application-level traffic is striped over all available private interconnects, thus allowing a higher throughput.
Application-level traffic can be UDP or TCP. Also, if you are using an SCI (Scalable Coherent Interface)-based interconnect, the application may use the Remote Shared Memory (RSM) functionality of these cards, to reduce latency.
So if you have SCI cards, 1) and 2) (heartbeats & Sun Cluster framework messages) will still use DLPI and TCP/IP over DLPI, but the application (mostly Oracle RAC) can exploit the card's RSM functionality. These protocols can live next to each other on the same interconnect.
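
The path monitoring from 1) and the striping from 3) can be sketched in a few lines of Python. This is a toy illustration only: the PathMonitor class, the link names, the timeout value and the round-robin policy are all assumptions of mine, not the actual Path Manager or the real striping algorithm.

```python
from itertools import cycle


class PathMonitor:
    """Tracks the last heartbeat seen on each private interconnect path."""

    def __init__(self, paths, timeout):
        self.timeout = timeout                      # assumed value, in seconds
        self.last_seen = {p: 0.0 for p in paths}

    def heartbeat(self, path, now):
        self.last_seen[path] = now

    def live_paths(self, now):
        # A path is declared dead when its heartbeat is overdue.
        return [p for p, t in self.last_seen.items() if now - t <= self.timeout]

    def node_presumed_dead(self, now):
        # Only when every path is dead is the peer node presumed dead.
        return not self.live_paths(now)


def private_hostname(node_id):
    """Build a name in the clusternode<nodeID>-priv format."""
    return f"clusternode{node_id}-priv"


def stripe(paths, n_packets):
    """Round-robin n_packets over the given paths (assumed striping policy)."""
    if not paths:
        return []
    rr = cycle(paths)
    return [next(rr) for _ in range(n_packets)]


monitor = PathMonitor(["link0", "link1"], timeout=1.0)
monitor.heartbeat("link0", now=0.5)
monitor.heartbeat("link1", now=0.5)
monitor.heartbeat("link0", now=1.4)        # link1 stays silent from here on

print(private_hostname(2))
print(stripe(monitor.live_paths(1.2), 4))  # both links still within the timeout
print(stripe(monitor.live_paths(2.0), 4))  # link1 is overdue, only link0 is used
print(monitor.node_presumed_dead(3.0))     # every path is overdue by now
```

The point of the sketch is the distinction made above: one dead path only removes that path from the set that traffic is spread over, and it takes all paths being dead before the other node is presumed dead.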

26 Apr 2005, 09:22:50 MEST

20050414 Thursday 14 April 2005
Sun Cluster 3.x: Difference between /dev/global and /dev/did devices
People sometimes wonder what the difference is between a /dev/global and a /dev/did device in Sun Cluster.
Let us first discuss the /dev/did device. You can see the /dev/did device as just another name for a Solaris /dev/dsk/c#t#d#s# device, with the only difference that this name is guaranteed to be consistent throughout the cluster: The cluster nodes have synchronised their view on the storage and have ended up with the same name for a specific device. It is quite possible that a device is called /dev/dsk/c1t0d0s0 on nodea but /dev/dsk/c2t0d0s0 on nodeb. A common name for these disks comes in handy in the following situations:
-When assigning a Quorum Disk, all nodes must know which disk I am talking about. If I said /dev/dsk/c1t0d0s0, nodeb would be confused
-When adding disks to a diskset
-When creating datafiles on raw LUN partitions in Oracle RAC.
The last example is an interesting one and points us to an important difference between /dev/did and /dev/global devices. In Oracle RAC each instance on each node writes directly to the disks. Synchronisation of the I/O is done by Oracle RAC itself, the OS or cluster layer does not have to worry about that.
If you deploy Sun's HA Oracle agent and want to create datafiles on raw LUN partitions, you have to use the /dev/global devices. HA Oracle is a single-instance Oracle which is cluster-unaware. This means that when a node fails, the HA Oracle scripts take care of switching the instance over to the other node. One of the important issues with such cluster-unaware applications that are made HA through scripts is that we must make sure that I/O to the disk only ever happens from one node. During a failover from nodea to nodeb, we must prevent nodea from still doing I/O to the disks. This could happen if it weren't aware of nodeb's takeover (e.g. in the case of a Split Brain).
In Sun Cluster, the Device Configuration System (DCS) takes care of that. The DCS makes sure that any global device has only one primary node: only this primary node can do I/O to the disk. Other nodes, when they try to access the global device, will have their I/O rerouted through the private interconnect. When a failover of the global device occurs, the DCS will use fencing techniques to make sure that the old primary cannot accidentally access the device.
/dev/global/dsk/d4s2 is a global device whereas /dev/did/dsk/d4s2 is not.
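
The primary-node rule can be pictured with a small toy model. It is a sketch under assumptions: the GlobalDevice class and its methods are invented for illustration, and real fencing happens at the storage level rather than in a Python set.

```python
class GlobalDevice:
    """Toy model of a global device with exactly one primary node."""

    def __init__(self, name, primary):
        self.name = name
        self.primary = primary
        self.fenced = set()   # nodes that must not touch the device

    def write(self, node):
        if node in self.fenced:
            raise PermissionError(f"{node} is fenced off from {self.name}")
        if node == self.primary:
            return f"{node}: direct I/O to {self.name}"
        # Non-primary nodes never touch the disk themselves.
        return f"{node}: I/O rerouted over the interconnect to {self.primary}"

    def failover(self, new_primary):
        # Fence the old primary before the new one takes over.
        self.fenced.add(self.primary)
        self.fenced.discard(new_primary)
        self.primary = new_primary


dev = GlobalDevice("/dev/global/dsk/d4s2", primary="nodea")
print(dev.write("nodea"))
print(dev.write("nodeb"))

dev.failover("nodeb")
print(dev.write("nodeb"))
try:
    dev.write("nodea")        # the old primary is not aware of the takeover
except PermissionError as err:
    print(err)
```

Whatever the state of the cluster, at most one node in this model does direct I/O, which is the guarantee a cluster-unaware application such as HA Oracle relies on.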

Let us take a look at how the device files look:

1) For a /dev/did device we see that it is a device file just like any other:
etc-1 # ls -l /dev/did/dsk/d4s2
lrwxrwxrwx   1 root     root          39 Jul 10  2003 /dev/did/dsk/d4s2 -> ../../../devices/pseudo/did@0:4,4s2,blk
etc-1 # ls -l /devices/pseudo/did@0:4,4s2,blk
brw-------   1 root     sys      228,130 Jul 11  2001 /devices/pseudo/did@0:4,4s2,blk


2) For the /dev/global access path of this device d4s2, we see that it is a bit more difficult:
etc-1 # ls -l /dev/global
lrwxrwxrwx   1 root     root          34 Apr 14 13:15 /dev/global -> /global/.devices/node@1/dev/global

So /dev/global/dsk/d4s2 is actually /global/.devices/node@1/dev/global/dsk/d4s2

etc-1 # ls -l /global/.devices/node@1/dev/global/dsk/d4s2
lrwxrwxrwx   1 root     root          39 Jan 23 21:03 /global/.devices/node@1/dev/global/dsk/d4s2 -> ../../../devices/pseudo/did@0:4,4s2,blk

This points to a physical device file /global/.devices/node@1/devices/pseudo/did@0:4,4s2,blk

The thing to notice here is that the file system /global/.devices/node@1 is actually a globally mounted, pxfs filesystem. If you access a device file in such a filesystem, the access is 'cluster aware', which means the actual request will be sent to the node that is the primary. Thus, although /dev/did/dsk/d4s2 and /dev/global/dsk/d4s2 point to the same device, their access path is different: non-cluster aware in the first case, cluster aware in the second case.
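
The symlink arithmetic in the listings above can be checked with a few lines of Python. This is only a sketch that manipulates path strings (it does not touch a real cluster node); resolve_link is a helper name I made up, and the link target is copied from the ls output above.

```python
import posixpath


def resolve_link(link_path, target):
    """Resolve a relative symlink target against the directory of the link."""
    link_dir = posixpath.dirname(link_path)
    return posixpath.normpath(posixpath.join(link_dir, target))


# Both links carry the same relative target, as seen in the ls -l output.
target = "../../../devices/pseudo/did@0:4,4s2,blk"

did_path = resolve_link("/dev/did/dsk/d4s2", target)
global_path = resolve_link("/global/.devices/node@1/dev/global/dsk/d4s2", target)

print(did_path)      # /devices/pseudo/did@0:4,4s2,blk
print(global_path)   # /global/.devices/node@1/devices/pseudo/did@0:4,4s2,blk
```

The same relative target ends up under / in the first case and under the globally mounted /global/.devices/node@1 in the second, which is why only the second access path is cluster aware.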

14 Apr 2005, 15:57:19 MEST