Friday 05 May 2006
Replacing a disk in Sun Cluster
There is plenty of documentation on how to change a disk in Sun Cluster,
including everything that you need to do on the volume manager layer etc.
Just check out the cluster collection on http://docs.sun.com
I just want to point out some of the common mistakes.
Especially, what NOT to do when replacing a disk in Sun Cluster.
First of all, if you are replacing a disk in a Hardware RAID box there
is **absolutely nothing** you need to do on the cluster or OS layer.
Just follow the instructions of the box. No need to do any commands as
the LUNs that the OS (and hence the cluster) sees do not change.
Do **not** start to run scdidadm -C, scdidadm -r etc. I repeat: Do NOT
start to run scdidadm -C scdidadm -r. It is completely useless.
If you are replacing a physical disk or an entire LUN, and the
WWN/DiskID of the disk changes and hence the way the OS sees the disk
changes, you have to make Sun Cluster aware of that, so that it can
update its DID database.
But again DO NOT run scdidadm -C!
I have seen over the last few months quite a few escalations where
someone just happily ran scdidadm -C as part of a disk replacement
procedure. And it screwed up the DID database. No problem for me as it
is always fun to fix but not fun for the owners of that cluster as in
many cases this means downtime.
Now what is the so-feared scdidadm -C for? It 'clears out' the DID
database: if you permanently (and here I say **permanently**, ie not as
part of a disk replacement procedure) remove a disk, you may run it to
free up the DID it was using. But this is a very unlikely situation. Let
me sketch one thing that can go wrong when you run it out of the blue
because you think that is the right thing to do.
Let's say you had a disk represented by the DID number d12. There are
some problems on the fabric and for some reason the disk is temporarily
unavailable. You see errors and you think 'oh, let's run scdidadm -C,
that'll fix it'. The command will not find the disk associated with d12
and will free up that DID number. Next time you reboot or run scdidadm -r or
whatever, the disk is back but may be associated with a different DID
number. Which of course, can cause many problems.
So: scdidadm -C: Don't do it. Unless you really have to. But not as part of a disk replacement procedure.
The thing you need to do if you replace a physical disk is write
down the DID with which it is associated. When you insert the new
replacement disk you type
scdidadm -R d# where d# is the DID number. This will update the DID
database with the information (ie disk ID) of the new disk and everybody
will live happily ever after.
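To make the difference between the two commands concrete, here is a toy Python model of a DID table. It is only an illustration: the DidTable class, its method names and the disk IDs are invented for this sketch, and the rule that a rediscovered disk simply gets the next free number is an assumption, not a description of how Sun Cluster really allocates DID instances.

```python
class DidTable:
    """Toy model of a DID database: DID number -> device ID (hypothetical)."""

    def __init__(self):
        self.entries = {}    # DID number -> device ID
        self.next_free = 1   # assumption: numbers are handed out in order

    def add(self, device_id):
        """Register a newly discovered device under the next DID number."""
        did = self.next_free
        self.entries[did] = device_id
        self.next_free += 1
        return did

    def clear(self, visible_ids):
        """Rough model of 'scdidadm -C': drop DIDs whose device is not visible."""
        missing = [d for d, dev in self.entries.items() if dev not in visible_ids]
        for did in missing:
            del self.entries[did]

    def rediscover(self, visible_ids):
        """Rough model of 'scdidadm -r': add visible devices that are unknown."""
        known = set(self.entries.values())
        for dev in sorted(visible_ids):
            if dev not in known:
                self.add(dev)

    def repair(self, did, new_device_id):
        """Rough model of 'scdidadm -R d#': keep the number, swap the device ID."""
        self.entries[did] = new_device_id

    def did_of(self, device_id):
        """Return the DID number currently mapped to device_id, or None."""
        for did, dev in self.entries.items():
            if dev == device_id:
                return did
        return None


table = DidTable()
d_a = table.add("wwn-a")
d_b = table.add("wwn-b")

# Fabric glitch: wwn-a is temporarily invisible and someone clears the table.
table.clear({"wwn-b"})
table.rediscover({"wwn-a", "wwn-b"})
print("wwn-a was d%d, is now d%d" % (d_a, table.did_of("wwn-a")))

# Proper replacement of wwn-b by a new disk: the DID number stays the same.
table.repair(d_b, "wwn-c")
print("replacement disk is d%d" % table.did_of("wwn-c"))
```

In this model the disk that was merely hidden for a moment comes back under a different number, while the properly replaced disk keeps its number.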

Friday 21 April 2006
The challenges of doing Root Cause Analysis
Very often we get a request to do a Root Cause Analysis (RCA) of a problem that has already gone away. While in some cases this is possible, in others it is not, because the data needed is already gone. Sometimes we are faced with very insistent customers who keep pushing for an RCA even if we tell them that we do not have enough data and that, unfortunately, our crystal ball was broken during our last spring cleanup. Needless to say, this results in an unhappy customer and yours truly in despair.
Let us describe an example. This is a fictitious example but similar to situations we end up in from time to time. Customer A experiences probe timeouts on their Oracle Database. Which means, the Sun Cluster Fault probe, which does some tests on the availability of the Oracle database, does not succeed in finishing the tests in time. As a result, the Sun Cluster agent tries to stop and start the Oracle database and since the database cannot be stopped in time, the Oracle resource goes into STOP_FAILED. Now when an application cannot be stopped, there will be no failover by the cluster. First question of the customer: why did it not fail over? Answer: it would be very dangerous to startup another instance of an application on another node when we are not absolutely sure that the application is stopped on one node. Cluster decides here to protect data integrity rather than availability. So far so good.
However, once the customer noticed their cluster in this state, they panicked and decided to reboot the node. After the reboot the STOP_FAILED situation was cleared and Oracle is now running just fine. No more probe timeouts. The customer, however, is anxious to know what exactly happened, gathers explorers and sends these to us, demanding an RCA.
What we are able to do at this point is explain what has happened: the Oracle commands that the fault probe executes took too much time; as a result the resource was restarted, but the Oracle STOP method took too much time too, and hence the resource went into STOP_FAILED. What is very difficult now is to explain **why** the Oracle commands took so long to complete. With the data we have we can only guess: maybe there were storage failures making disk access slow, maybe the machine was running out of memory, maybe it was only Oracle itself being slow. In some cases there may be some hints in the messages files or in the Oracle Alert log, but failing that it will be impossible to pin down a root cause. After all, the machine was rebooted and the situation which may have led to the problem has been cleared. It would have been better to gather more data (in the form of a crash dump or GUDS output) at the time of the issue; that way we would at least have had some chance of nailing down the problem. But of course when you run into a situation like that your first concern is probably to get the machine up and running again, and many machines are rebooted without further ado.
You may now object that at least the cluster or Oracle or whatever should gather **more** information at that point. Maybe it should automatically gather performance data. But there are always drawbacks to that as well: gathering more information means more disk space needed, more burden on the machine etc. Since most clusters run fine throughout their lives, it seems a logical decision to gather more in-depth information only when a problem occurs.
So dear customers, if you are reading this: if we cannot give you an RCA based on explorers alone and we tell you that we do not have enough data to know what happened, please believe us. We are doing all we can but we cannot do the impossible.

Thursday 15 December 2005
Sun Cluster Forum
Just a short post to let you know that there is a Forum on Clustering
where you can post all your questions. I am checking this regularly
together with some of my colleagues and it is always good to read some
posts about the clusters that are out there.
Here is the URL:
http://forum.sun.com/forum.jspa?forumID=1

Monday 26 September 2005
The DID database
A long time ago I blogged about
the difference between /dev/did and /dev/global device names in Sun Cluster 3.x.
I'd like to discuss some of the files in the Cluster Configuration
Repository. The Cluster Configuration Repository (CCR) is the Cluster
database containing the information about the current cluster setup.
Changes to this setup are saved across reboots in files in the
directory /etc/cluster/ccr which is replicated on each node.
One of the things kept in this database is the DID database. Just take
a look at the file /etc/cluster/ccr/did_instances, and you will see it
looks as follows:
ccr_gennum 3
ccr_checksum 282788695BAD93E939748ECE92B52B4B
19 disk|DEVID_SCSI_SERIAL|SEAGATE ST39102LCSUN9.0GLJW992510000U0010JDH|5345414741544520535433393130324c4353554e392e30474c4a5739393235313030303055303031304a4448|2:/dev/rdsk/c0t1d0
20 disk||||2:/dev/rdsk/c0t6d0
1 disk|DEVID_SCSI_SERIAL|SEAGATE ST39102LCSUN9.0GLJW8793900001001J327|5345414741544520535433393130324c4353554e392e30474c4a57383739333930303030313030314a333237|1:/dev/rdsk/c0t1d0
2 disk||||1:/dev/rdsk/c0t6d0
3 disk|DEVID_SCSI3_WWN| |200000203714ce27|2:/dev/rdsk/c1t21d0|1:/dev/rdsk/c1t21d0
4 disk|DEVID_SCSI3_WWN| |20000020370d3f7d|2:/dev/rdsk/c1t16d0|1:/dev/rdsk/c1t16d0
5 disk|DEVID_SCSI3_WWN| |20000020370d3f5f|2:/dev/rdsk/c1t0d0|1:/dev/rdsk/c1t0d0
6 disk|DEVID_SCSI3_WWN| |20000020370d3f03|2:/dev/rdsk/c1t3d0|1:/dev/rdsk/c1t3d0
7 disk|DEVID_SCSI3_WWN| |20000020370d3590|2:/dev/rdsk/c1t17d0|1:/dev/rdsk/c1t17d0
8 disk|DEVID_SCSI3_WWN| |200000203714ca15|2:/dev/rdsk/c1t22d0|1:/dev/rdsk/c1t22d0
9 disk|DEVID_SCSI3_WWN| |20000020370d3d6d|2:/dev/rdsk/c1t4d0|1:/dev/rdsk/c1t4d0
10 disk|DEVID_SCSI3_WWN| |20000020370a2b24|2:/dev/rdsk/c1t1d0|1:/dev/rdsk/c1t1d0
11 disk|DEVID_SCSI3_WWN| |20000020370dc6ac|2:/dev/rdsk/c1t19d0|1:/dev/rdsk/c1t19d0
12 disk|DEVID_SCSI3_WWN| |200000203714c427|2:/dev/rdsk/c1t20d0|1:/dev/rdsk/c1t20d0
13 disk|DEVID_SCSI3_WWN| |20000020370d10e2|2:/dev/rdsk/c1t5d0|1:/dev/rdsk/c1t5d0
14 disk|DEVID_SCSI3_WWN| |20000020370d4094|2:/dev/rdsk/c1t2d0|1:/dev/rdsk/c1t2d0
15 disk|DEVID_SCSI3_WWN| |20000020370d3ed9|2:/dev/rdsk/c1t6d0|1:/dev/rdsk/c1t6d0
16 disk|DEVID_SCSI3_WWN| |20000020370d4039|2:/dev/rdsk/c1t18d0|1:/dev/rdsk/c1t18d0
17 disk|DEVID_SCSI_SERIAL|IBM DNES30917SUN9.0G1QK087 |49424d2020202020444e4553333039313753554e392e304731514b30383720202020202020202020|1:/dev/rdsk/c2t10d0
18 disk|DEVID_SCSI_SERIAL|IBM DNES30917SUN9.0G1QM765 |49424d2020202020444e4553333039313753554e392e304731514d37363520202020202020202020|1:/dev/rdsk/c2t11d0
8191 tape||||1:/dev/rmt/0
All CCR files start with a gennum (generation number) on the first line
and a checksum on the second. These files are checksum protected and
should NOT be edited manually. You CAN edit them on some occasions, but
if that is required you will need to contact Sun for assistance.
Let us look at one of these lines:
13 disk|DEVID_SCSI3_WWN| |20000020370d10e2|2:/dev/rdsk/c1t5d0|1:/dev/rdsk/c1t5d0
This is the entry for DID 13, which we can see in the output of scdidadm -L as follows:
13 moon1:/dev/rdsk/c1t5d0 /dev/did/rdsk/d13
13 moon2:/dev/rdsk/c1t5d0 /dev/did/rdsk/d13
The second field is 'disk'. This is the type of DID device. These types
are defined in the file /etc/cluster/ccr/did_types and right now you
have 'disk' and 'tape'.
The third field is 'DEVID_SCSI3_WWN': This defines the type of device
ID that this device provides. Each did device is normally identified by
a unique ID, such as a serial number, WWN etc.
The actual device ID is in the fourth field, in this case:
20000020370d10e2. This also means that this disk is uniquely identified
in the DID database and we cannot just replace it by another disk
without telling the cluster about it.
The fifth field is 2:/dev/rdsk/c1t5d0. This means that on the node with nodeid 2, this disk is referred to as 'c1t5d0'.
The sixth field is 1:/dev/rdsk/c1t5d0. This means that on the node with
nodeid 1, this disk is referred to as 'c1t5d0'. Please be aware that
these names may differ on different nodes. The DID layer uses the
device ID to make sure we are talking about the same disk, even if it
has different Solaris names on the different nodes.
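As a rough illustration of the field layout described above, here is a small Python sketch that splits one did_instances line into its parts. It is for reading a copy of the file only (never for writing it back), the names DidInstance and parse_did_line are made up for this example, and printable_id is a guess at the purpose of the field between the ID type and the device ID, which is not described above.

```python
from dataclasses import dataclass


@dataclass
class DidInstance:
    instance: int       # DID number, e.g. 13 for /dev/did/rdsk/d13
    dev_type: str       # 'disk' or 'tape'
    id_type: str        # e.g. 'DEVID_SCSI3_WWN'; empty if the device has no ID
    printable_id: str   # assumption: printable form of the ID, often blank
    device_id: str      # the actual device ID (hex string)
    paths: dict         # nodeid -> Solaris device path on that node


def parse_did_line(line):
    """Parse one data line of did_instances (not the ccr_* header lines)."""
    head, id_type, printable_id, device_id, *path_fields = line.strip().split("|")
    instance, dev_type = head.split(None, 1)
    paths = {}
    for field in path_fields:
        node_id, path = field.split(":", 1)
        paths[int(node_id)] = path
    return DidInstance(int(instance), dev_type, id_type,
                       printable_id.strip(), device_id, paths)


entry = parse_did_line(
    "13 disk|DEVID_SCSI3_WWN| |20000020370d10e2"
    "|2:/dev/rdsk/c1t5d0|1:/dev/rdsk/c1t5d0"
)
print(entry.instance, entry.device_id, entry.paths)
```

Keeping the paths in a dictionary keyed by nodeid makes it easy to spot a disk that has different Solaris names on different nodes.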
If you are seeing error messages about DID devices, it may be that you
have replaced a disk without following the official procedure to
replace a disk in the cluster. Let us say you replaced the disk at
c1t5d0 with another one. You must now tell the cluster that it should
update the DID database for DID number 13 with the new device ID. You
can do that as follows:
# scdidadm -R c1t5d0
OR:
# scdidadm -R 13
Sometimes it is possible that the DID configuration is completely
messed up, for example because you have been switching
cables/controllers without following the correct procedure. To check
this, compare the entries in the did_instances file with,
for example, the 'diskinfo' output in the explorer. To fix this, contact
your Sun Resolution Center.

Monday 25 July 2005
So You Got Yourself a Reservation Conflict Panic
For more than 3 years your cluster has been doing fine. Then suddenly you see the following appear on one node's terminal:
[ID 747640 kern.notice]
Reservation Conflict
The node has panicked! Fortunately this is a cluster and the
other node takes over all services that were running on the panicked
node. The panicked node comes up fine and joins the cluster. Still you
would like to know WHY this has happened. So you gather explorers of
both nodes and the crash dump that was gathered after the panic. You
submit them to Sun Service and you wait eagerly for an explanation ....
Unfortunately, it is not as simple as that. As I have discussed in previous blog entries
(here, for example, and here), SCSI reservations are used to kick a node out of the cluster if a
split brain occurs, or to prevent a node from joining the cluster on its own when it may suffer from
amnesia.
In both cases, the node that has kicked the other node out thought that
this other node was dead because it did not receive any heartbeats
anymore. This may be because the other node **is** dead, or, in this
case, where it was still alive, because this node **thought** it was
dead and worth kicking out. This can happen because there was
physically something wrong with the interconnect links (power down on
the switches for example), or because one of the nodes was unable
to send or receive heartbeats, for example because it suffers from
SunAlert 57666.
And here we see our problem: The cause of the reservation conflict of
nodea may very well be an issue on nodeb. Therefore, when a reservation
conflict occurs and you want a proper Root Cause Analysis, please also generate
a Live Core Dump of the 'good' node ASAP and supply this to us as well.
So here is a To-Do list for Reservation Conflicts:
-Ask yourself if there are any other clusters or stand-alone nodes that
can access the disks seen by the affected cluster (this is unsupported
and can lead to reservation conflicts).
-Generate a Live Dump on the node that did not experience the
reservation conflict. If you are not 100% sure how to do this,
engage a Sun engineer.
-If you cannot generate a live dump, engage a Sun engineer who will
provide you with a script to gather some vital information on the
'good' node.
-If you are running Solaris 8, please check if all patches to prevent
SunAlert 57666 are installed. I CANNOT STRESS HOW IMPORTANT THIS IS!

Wednesday 06 July 2005
Show me the Key!
As promised
in a previous blog entry, I will now show you the commands to look at SCSI-2 PGRE or SCSI-3 PGR keys.
As discussed, such keys are used on the quorum disk because they are
persistent, and persistence is what you need if you want to avoid
amnesia. We also discussed that SCSI-2 PGRE keys are an emulation of
SCSI-3 keys. They were invented by Sun Cluster engineering, whereas
SCSI-3 PGRs are part of the SCSI-3 specification.
The first cluster is a 2 node cluster with 2 paths cluster-wide to the quorum device:
# scdidadm -L d4
4 node1:/dev/rdsk/c2t0d0 /dev/did/rdsk/d4
4 node2:/dev/rdsk/c2t0d0 /dev/did/rdsk/d4
So the keys used will be PGRE's. The command to use is, guess what,
pgre:
# /usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d4s2
key[0]=0x42b2c3e500000001.
key[1]=0x42b2c3e500000002.
The second cluster is a 3 node cluster. It has more than 2 paths to the quorum disk:
# scdidadm -L d5
5 node1:/dev/rdsk/c3t50020F2300002A89d0 /dev/did/rdsk/d5
5 node2:/dev/rdsk/c3t50020F2300002A89d0 /dev/did/rdsk/d5
5 node3:/dev/rdsk/c3t50020F2300002A89d0 /dev/did/rdsk/d5
So the keys used will be scsi-3 PGR's. The command to use is
scsi:
# /usr/cluster/lib/sc/scsi -c inkeys -d /dev/did/rdsk/d5s2
Reservation keys(3):
0x4225e25100000003
0x4225e25100000001
0x4225e25100000002
Both the commands pgre and scsi contain options to scrub the keys off
the disk. Please don't ever do that if not so instructed by authorised
Sun personnel! Sometimes this is mistakenly done to solve
amnesia
but unfortunately this will only make things worse: the idea with
amnesia is to get that node's key on the disk, not to remove all other
nodes' keys!
Scrubbing the keys can be useful, for example,
when you are getting reservation conflict panics because a disk was
previously used in another cluster and still has that other cluster's
keys. But again, this will have to be diagnosed first before any action
is taken.

Friday 01 July 2005
What to do when your cluster doesn't start up your application
Don't panic! The first thing you need to find out is whether the application itself may be at fault.
Here is a scenario:
Let us say you have a resource group called myapp-rg. Inside myapp-rg you have three resources:
A logicalhostname resource named loghost-rs
A HAStorageplus resource named haplus-rs
A home-brewed application resource named myapp-rs
Someday you stop the resource group and you try to start it up again.
However, the application resource does not come online and you see a
message like the following, or something similar:
resource myapp-rs status on node1 change to R_FM_FAULTED
Your first thought may be that there is something wrong with the
cluster. My experience, however, is that in 99% of the cases it is the
application itself that is not able to start up in a timely manner.
Here is how you can check this:
1) Switch the resource group offline:
# scswitch -F -g myapp-rg
2) Disable the application resource:
# scswitch -n -j myapp-rs
3) Start up the resource group. Use scswitch -z:
# scswitch -z -h nodea -g myapp-rg
After this has been done, use the Start script of the myapp-rs (you can
check this by grepping for START_COMMAND in the output of scrgadm -pvv)
to launch the application manually.
If the application fails to come online, you know that it is the
application which is at fault and should be fixed by checking
application logs and contacting the appropriate vendor.
If the application comes online, but it takes longer than the
START_TIMEOUT value of the application resource (again, find this by
grepping for it in the output of scrgadm -pvv), you should increase
that value:
# scrgadm -c -j myapp-rs -y START_TIMEOUT=<appropriate value>
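If you want a number rather than a gut feeling when comparing the manual start against START_TIMEOUT, a small timing wrapper can help. The Python sketch below is only one possible way to do it: time_start_command is a made-up helper, and the command and timeout you pass in are whatever you found in the scrgadm -pvv output; nothing here talks to the cluster itself.

```python
import subprocess
import time


def time_start_command(command, start_timeout):
    """Run a start command by hand and time it.

    command       -- argument list of the start script (taken from START_COMMAND)
    start_timeout -- the resource's START_TIMEOUT value in seconds
    Returns (elapsed_seconds, exit_code, fits_in_timeout).
    """
    began = time.monotonic()
    result = subprocess.run(command)
    elapsed = time.monotonic() - began
    return elapsed, result.returncode, elapsed <= start_timeout


# Example call, with a hypothetical start script path and timeout:
# elapsed, rc, ok = time_start_command(["/opt/myapp/bin/start"], 300)
# print("start took %.1fs, exit code %d, within timeout: %s" % (elapsed, rc, ok))
```

A non-zero exit code points at the application itself; a clean start that takes longer than the timeout points at the START_TIMEOUT value. Note that this only measures how long the script takes to return, which is an approximation of what the cluster waits for.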

Friday 24 June 2005
Interesting SVM Blog!
Found Sanjay Nadkarni's weblog,
in which he discusses some interesting implementation details about Solaris
Volume Manager, which as you know is the absolute best Volume Manager
to walk the face of this earth...
Keep it coming Sanjay!
SCSI reservations in Sun Cluster 3.x
I promised some time ago to write something about the mechanisms that
Sun Cluster uses to prevent split brain and amnesia. As said, in a two
node cluster, a node can get the vote count from the quorum device by
'reserving' the quorum device or making sure that the other node cannot
reserve it. We also discussed that reserving quorum devices is not
enough: you should also make sure that all disks are fenced out from a
node that has to leave the cluster. This is called disk fencing.
SCSI reservations are used for both the quorum disk and all the other
disks.
You have probably heard of SCSI-2 versus SCSI-3. When Sun Cluster 3.x
was designed, the engineers reckoned all disks would understand
SCSI-3 by the time Sun Cluster was released, but unfortunately this
turned out not to be true. So they decided to have Sun Cluster use either
SCSI-2 or SCSI-3. Big question: when does it use what? And why
not use SCSI-2 all the time? Let's first try to answer the last
question: SCSI-2 is an exclusive reservation, which means that only one
node can own the disk. Which means that other nodes will not be able to
reserve the disk and they will panic. Not so handy when you have a 4
node cluster and you want to kick off only one node. SCSI-3 is a group
reservation: every node has a key on a dedicated area on the disk and
when a node has to leave, another node will just kick off its key.
The next question, when Sun Cluster uses SCSI-2 and when SCSI-3, is an
easy one to answer, but there are lots of misunderstandings. Sun Cluster
will not 'test' whether the disk understands SCSI-2 or SCSI-3. Reason
for that is that we use a specific functionality of SCSI-3 called
Persistent (Group) Reservation (PGR) which is optional in the specs. So
it is perfectly possible that a disk understands SCSI-3 but does not
have PGR functionality enabled. So Sun Cluster decides what mechanism
to use based on the number of paths to the disk cluster-wide. You can
check this with the output of scdidadm -L.
An example in a 2-node cluster:
14 moon1:/dev/rdsk/c1t2d0 /dev/did/rdsk/d14
14 moon2:/dev/rdsk/c1t2d0 /dev/did/rdsk/d14
--> Here we see that there is one path from moon1 to
/dev/did/rdsk/d14, and one path from moon2, so 2 paths cluster-wide
--> hence scsi-2 will be used.
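The path-counting rule can be written down in a few lines of Python. This is only a sketch of the rule as described here (more than 2 paths cluster-wide means SCSI-3, otherwise SCSI-2); count_paths and reservation_type are invented names, the parser assumes the three-column scdidadm -L layout shown above, and the answer is only meaningful for shared disks.

```python
from collections import defaultdict


def count_paths(scdidadm_output):
    """Map DID instance number -> number of distinct node:path entries."""
    paths = defaultdict(set)
    for line in scdidadm_output.splitlines():
        fields = line.split()
        # Expect: <instance> <node>:<device path> <did path>
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        paths[int(fields[0])].add(fields[1])
    return {did: len(entries) for did, entries in paths.items()}


def reservation_type(path_count):
    """SCSI-3 when a disk has more than 2 paths cluster-wide, else SCSI-2."""
    return "SCSI-3" if path_count > 2 else "SCSI-2"


SAMPLE = """\
14 moon1:/dev/rdsk/c1t2d0 /dev/did/rdsk/d14
14 moon2:/dev/rdsk/c1t2d0 /dev/did/rdsk/d14
"""

for did, count in count_paths(SAMPLE).items():
    print("d%d: %d paths -> %s" % (did, count, reservation_type(count)))
```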
The next thing we will need to do is discuss the difference between
scsi reservations used for the Quorum device and the ones used for disk
fencing. There is no overlap: Disk fencing code will issue scsi
reservations on all shared disks except the Quorum Disk.
Let us first start with the SCSI mechanism used by disk fencing (ie the
protection of disk against 'rogue' nodes that have unexpectedly left
the cluster). As said, SCSI-2 will be used when there are 2 paths to the
disk cluster-wide (as in a 2-node cluster), SCSI-3 when there are more than 2 paths.
SCSI-3 is needed in that case because of what we have discussed before:
we need more granularity than the all or nothing 'kick everyone out' of
SCSI-2. The SCSI-2 reservations used are the typical MHIOCTKOWN and
MHIOCRELEASE ioctls.
For the quorum device it is not as straightforward. As said, the quorum
rule is used to protect against amnesia. This implies that any reservation of
the quorum device should be able to persist across reboots of the
storage. This is true for SCSI-3 (hence the Persistent in PGR) but not
for SCSI-2. Therefore, Sun invented a mechanism it has called SCSI-2
PGRE (Persistent Group Reservation Emulation). This is an emulation
using SCSI-2 ioctls of the SCSI-3 mechanism: keys will be put on a
designated area on the disk. These keys are able to survive a power
cycle of the disk subsystem. One additional remark: putting your
key on a disk or kicking another node's key off the disk has to be an
atomic operation, but the SCSI-2 emulation consists of many commands.
Therefore a traditional SCSI-2 MHIOCTKOWN will still be used to ensure
atomicity.
Oh: both SCSI-3 and SCSI-2 keys are invisible and are not placed in a
specific partition. SCSI-2 keys are in a designated area on the disk or
LUN and the location of SCSI-3 keys is implementation-dependent. A
quorum disk can still be used to store whatever data you want. I will
show in a later post how you can see these mysterious keys.

Wednesday 25 May 2005
Sun Cluster 3.x Quorum algorithm
So let me try to explain the mechanism Sun Cluster uses to prevent both
Amnesia and
Split Brain.
This is a majority algorithm: only a cluster node or a subset of
cluster nodes that can have a majority of possible votes can start up
(in the case of amnesia) or continue (in the case of split brain)
cluster operation. The other partitions must leave the cluster.
So let us first discuss the Split Brain scenario: a node cannot
communicate with the other node over the private interconnect, but both
nodes are fine. As discussed
before
we must not allow both nodes to continue cluster operation, so one has
to leave. Each node has a vote, but in a 2 node cluster this would mean
that in case of a split brain nobody would continue cluster operation.
So in a 2 node cluster we would assign a quorum device: a LUN in shared
storage that also has a vote. There are then 3 possible votes in the
cluster and a majority of 3 is 2 votes. Once a split brain occurs, both
nodes race for the quorum device: the one that is fastest gets its
vote. The other one notices that it is too late and panics with a 'Lost
Operational Quorum' message. The mechanism for reserving Quorum Devices
is SCSI reservations, which we will discuss in 2 weeks.
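The vote arithmetic is simple enough to check in a few lines of Python. This is just the majority rule from the paragraph above written out; the function names are made up for the example.

```python
def majority(possible_votes):
    """Smallest number of votes that is more than half of all possible votes."""
    return possible_votes // 2 + 1


def has_quorum(votes_held, possible_votes):
    """True if a partition holding votes_held votes may run the cluster."""
    return votes_held >= majority(possible_votes)


# 2 nodes, no quorum device: 2 possible votes, each node holds only its own.
print(has_quorum(1, 2))

# 2 nodes plus a quorum device: 3 possible votes.
print(has_quorum(1 + 1, 3))   # node that won the race for the quorum device
print(has_quorum(1, 3))       # node that lost the race
```

Without the quorum device neither half of a split 2 node cluster reaches the majority of 2; with it, exactly one node does.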
Now how can the quorum mechanism prevent
amnesia?
To prevent amnesia we must allow only the node that left the cluster
last to start up the cluster. Same story: when a node leaves the
cluster, the other node(s) will make sure that it cannot acquire the
quorum disk when it starts up. Only the last node in the cluster will
be able to do so. So when the first node to have left the cluster tries
to start up, it has 1 vote of its own and knows that there are 3
possible votes in the cluster, but it cannot get the vote of the
quorum device: it waits for the other node to first form the
cluster with a message 'waiting for operational quorum'.
The last node that has left the cluster starts up, gets the vote of the
quorum disk, starts talking to the waiting node and passes the latest
cluster database to that waiting node so that this node is up to date
with all information that may have been changed when it was down.
I realise there is a lot more to be said about this, and there are a
lot more scenarios when we add more nodes. However it is the end of my
day, it is beautiful and warm (27 degrees C) weather and time to take a
nice walk with my dog Lukka followed by a nice glass of cool white
wine...

Thursday 19 May 2005
Split Brain
For those who want to see a real split brain, I'd suggest renting this DVD.
For the rest of us, I shall today explain what a Split Brain is in clustering theory and how Sun Cluster protects against it. As discussed some cluster posts ago, Sun Cluster nodes communicate with each other through the private interconnect. This is a redundant network that is exclusively used for inter-node communication. A split brain is a situation where all the links of the private interconnect fail, but the nodes are still running. So each node thinks that the other one(s) is/are dead, and that it should take over applications and stuff. And here lies the danger: in a split brain, nodes would independently start up the applications and access the data, because they do not know the other nodes are doing the same thing. Data corruption is waiting to happen. So to prevent this kind of situation we must do 2 things. Most clusters only do the first; our Sun Cluster does both, and that is why it is a brilliant piece of software:
1) Make sure that when a loss of all private interconnects occurs, and subpartitions (of one node or different nodes) can form, only one of the subpartitions or nodes is allowed to continue cluster operation and the rest are kicked out. Sun Cluster decides this based on a majority (quorum) algorithm, which I will discuss next week. This majority algorithm is the same as the one that is used to prevent amnesia.
2) But even more protection is needed. I think this is best explained with a case I had. In this case, only one node experienced problems on all interconnect links: due to issues in the network stack it did not receive heartbeats from the other node anymore. This node went through a reconfiguration and decided to kick the other node out and start up the application etc... However, the other node was still receiving heartbeats from this node and did not reconfigure or check whether it could start up the cluster or not. Eventually it would have been kicked out by the winning node anyway, but at this point it was still happily running applications. It is imperative that once a node has decided to continue the cluster, take over the application and access the data etc, it makes sure that the other node is unable to access any of the shared data. So as soon as the 'unaware' node tried to access the data on one of the shared disks, it was kicked out to prevent any data corruption. The other node then remained in the cluster and the applications were available. The mechanism used for this is called Disk Fencing, and it uses SCSI reservations. More on SCSI reservations once I get back from holiday in June...

Monday 09 May 2005
Amnesia
You forgot that I need ya
You must've caught amnesia
That's why you don't believe
(Black Eyed Peas)
I am not really a Black Eyed Peas fan (not my type of music) but I kind of like this song.
After having discussed the private interconnect some time ago I felt it is necessary to chat a bit about the Cluster Membership Monitor (CMM) but this is impossible without first discussing the typical issues in clustering theory called 'amnesia' and 'split brain'. Amnesia is for this week, and I hope to get to Split Brain next week, in order to be able to finish up the CMM discussion before I leave on holiday the 28th of May (hurray !!!!!!!).
Let's imagine the following situation: You have a 2 node cluster. At 12pm you shut down a cluster node, nodea, for maintenance. Nodeb is still running. At 4pm you decide to change some settings (such as timeouts). Where will these changes go? The cluster has a central repository, called the Cluster Configuration Repository (CCR), which has a local copy on each node (in /etc/cluster/ccr --> check it out). So, because nodea is down, the update will only make it in nodeb's copy of the database. Nodea is unaware of the change. If it would join the cluster now, it would receive the most recent copy of the CCR from nodeb.
Let us now say that at 6pm, you shut down nodeb as well. Both nodes are down. If we would now boot nodea, it would start up with an old copy of the database. The cluster would have 'forgotten' the changes that happened between 12pm and 6pm, as these changes are only known on nodeb, and this one is down. This is called amnesia.
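The timeline above can be replayed with a toy model in Python. Everything in it is hypothetical: the Node class, the update_cluster helper and the idea that every change bumps a generation counter by one are simplifications for illustration, not the real CCR implementation.

```python
class Node:
    """Toy cluster node holding its own local copy of the cluster database."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.ccr = {"gennum": 1, "settings": {}}


def update_cluster(nodes, key, value):
    """Apply a configuration change to the local copy of every node that is up."""
    for node in nodes:
        if node.up:
            node.ccr["settings"][key] = value
            node.ccr["gennum"] += 1   # assumption: one bump per change


def amnesia_scenario():
    nodea, nodeb = Node("nodea"), Node("nodeb")
    nodea.up = False                                    # 12pm: nodea down
    update_cluster([nodea, nodeb], "probe_timeout", 60)  # 4pm: change settings
    nodeb.up = False                                    # 6pm: nodeb down too
    nodea.up = True                                     # nodea boots on its own
    return nodea, nodeb


nodea, nodeb = amnesia_scenario()
print("nodea sees:", nodea.ccr)
print("nodeb holds:", nodeb.ccr)
```

If nodea were allowed to form the cluster alone at this point, it would run with the copy that is missing the 4pm change.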
I can already tell you that this is a situation that will never happen in Sun Cluster.
If we take a look at how amnesia prevention was historically done in different clustering solutions, there are several options:
1) Just DO NOT allow changes when members are not part of the cluster
2) Store the cluster database on shared storage, always
3) Store the cluster database on local storage and on shared location and use shared location to override local copies when nodes have been down
Sun Cluster 3.x uses a more elegant approach. In fact, it uses the same mechanism that it uses to prevent Split Brain: A majority algorithm. Before I explain this, however, I will first explain what IS Split Brain. That's for next week.

Thursday 28 April 2005
Why Oracle RAC on Sun Cluster
From time to time I see the following question popping up: Why Oracle RAC on top of Sun Cluster, especially since Oracle 10g RAC ships with its own clusterware? In our book we dedicated a chapter to the benefits of Sun Cluster, and this chapter was also published as a Blueprints article. You find it here:
Understanding the Benefits of Implementing Oracle RAC on Sun Cluster Software
We refrained in this article from making explicit comparisons to other clusterware, as we felt this to be somewhat politically incorrect. We only stress what is good about our product (a lot :o) )
Also if you like the article, you can order the book here.

Tuesday 26 April 2005
Sun Cluster 3.x Private Interconnect: What's on?
(OK I feel guilty for writing so much about personal stuff and animals on this blog. All my colleagues seem to write about **computers** all the time so my spring time resolution is to write at least one blog entry a week about the best piece of software ever to see the face of this earth: Sun Cluster 3.x)
Have you ever wondered what Sun Cluster uses the private interconnect for? Heartbeats yes, but also other stuff? And what's the story about Remote Shared Memory? Let me try to enlighten you a bit.
I tend to split up Private Interconnect Traffic in 3 big chunks, based on functionality and protocols used.
1) First you have the heartbeat messages. These are sent through all available private interconnect links and are basically 'I am alive' messages. Unlike Sun Cluster 2.2, which used ICMP messages that were sent through one private link, Sun Cluster 3.x uses the low-level protocol DLPI to send the heartbeats. If a heartbeat is not received on one path in time, this path is declared dead. If all paths are dead, Sun Cluster's Path Manager warns the Cluster Membership Monitor that the other node is presumably dead (more on this maybe in a future blog entry).
So basically the heartbeat messages are used for path and node monitoring.
2) Secondly, Sun Cluster sends all kinds of ORB communication through the private interconnect. As you may know, Sun Cluster inter and intra-node communication is implemented in a CORBA-like way: method invocations on remote objects. This is done using TCP/IP. Examples of this kind of traffic are:
-Global Filesystem and Global Devices Traffic: Only one node is the primary of the global device. If a device or filesystem is accessed on a non-primary node, this traffic is redirected through the private interconnect.
-Cluster Configuration Repository (ie the cluster database) updates
-Messages from the Cluster Membership Monitor about cluster reconfiguration.
These Cluster Framework Messages use all available paths on the private interconnect.
3) Thirdly, applications may want to use the private interconnect. Examples of such applications are NTP and Oracle RAC. Sun Cluster provides IP addresses that can be used to contact other cluster nodes via the private interconnect. These IP addresses are typically of the format clusternode<nodeID>-priv.
Up till Sun Cluster 3.1 GA, this application-level traffic used only one link on the private interconnect. However, this link was HA: if the link failed, the IP address was moved to another available private interconnect link. As of Sun Cluster 3.1 10/03, application-level traffic is striped over all available private interconnects, thus allowing a higher throughput.
Application-level traffic can be UDP or TCP. Also, if you are using an SCI (Scalable Coherent Interface)-based interconnect, the application may use the Remote Shared Memory (RSM) functionality of these cards, to reduce latency.
So if you have SCI cards, 1) and 2) (heartbeats & Sun Cluster framework messages) will still use DLPI and TCP/IP over DLPI, but the application (mostly Oracle RAC) can exploit the card's RSM functionality. These protocols can live next to each other on the same interconnect.
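To make the path and node monitoring from 1) a bit more tangible, here is a minimal sketch in Python. It is only a toy model of the rule described above (a path is dead when its heartbeat is late, the node is presumed dead when all paths are dead), not how Sun Cluster implements it. The class name, the path names qfe0/qfe1 and the 10 second timeout are all assumptions made up for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PathMonitor:
    """Toy model of per-path heartbeat tracking for one remote node."""
    timeout: float  # hypothetical: seconds of silence before a path is declared dead
    last_seen: dict = field(default_factory=dict)  # path name -> time of last heartbeat

    def heartbeat(self, path: str, now: float) -> None:
        """Record an 'I am alive' message received on one interconnect path."""
        self.last_seen[path] = now

    def dead_paths(self, now: float) -> list:
        """Return the paths whose last heartbeat is older than the timeout."""
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)

    def node_presumed_dead(self, now: float) -> bool:
        """True only when every known path has timed out."""
        return bool(self.last_seen) and len(self.dead_paths(now)) == len(self.last_seen)


monitor = PathMonitor(timeout=10.0)
monitor.heartbeat("qfe0", now=0.0)   # hypothetical interconnect path names
monitor.heartbeat("qfe1", now=0.0)
monitor.heartbeat("qfe1", now=8.0)   # qfe0 stays silent from here on

print(monitor.dead_paths(now=12.0))          # only qfe0 has timed out
print(monitor.node_presumed_dead(now=12.0))  # node is still reachable via qfe1
print(monitor.node_presumed_dead(now=30.0))  # every path is silent now
```

The point of the sketch is that losing one path is not the same as losing the node: only when the last path goes quiet does the membership layer need to get involved.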

Thursday 14 April 2005
Sun Cluster 3.x: Difference between /dev/global and /dev/did devices
People sometimes wonder what the difference is between a /dev/global and a /dev/did device in Sun Cluster.
Let us first discuss the /dev/did device. You can see the /dev/did
device as just another name for a Solaris /dev/dsk/c#t#d#s# device,
with the only difference that this name is guaranteed to be consistent
throughout the cluster: The cluster nodes have synchronised their view
on the storage and have ended up with the same name for a specific
device. It is quite possible that a device is called /dev/dsk/c1t0d0s0
on nodea but /dev/dsk/c2t0d0s0 on nodeb. A common name for these disks
comes in handy in the following situations:
-When assigning a Quorum Disk, all nodes must know which disk I am
talking about. If I said /dev/dsk/c1t0d0s0, nodeb would be confused
-When adding disks to a diskset
-When creating datafiles on raw LUN partitions in Oracle RAC.
The last example is an interesting one and points us to an important
difference between /dev/did and /dev/global devices. In Oracle RAC each
instance on each node writes directly to the disks. Synchronisation of
the I/O is done by Oracle RAC itself; the OS or cluster layer does not
have to worry about that.
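Stepping back to the naming point for a moment, the idea of one consistent name per disk can be sketched in a few lines of Python. This is purely an illustration under assumptions: the function name, the "disk-A" identifier and the numbering scheme are made up, and the real DID database is built by the cluster software, not like this.

```python
def build_did_map(node_views):
    """Assign one cluster-wide DID name per physical disk.

    node_views: {node: {local_path: disk_id}}, where disk_id stands for some
    unique identifier of the physical disk (hypothetical strings here).
    Returns {disk_id: (did_name, {node: local_path})}.
    """
    did_map = {}
    for node in sorted(node_views):
        for local_path, disk_id in sorted(node_views[node].items()):
            if disk_id not in did_map:
                # First time we see this disk: hand out the next DID number.
                did_map[disk_id] = (f"/dev/did/dsk/d{len(did_map) + 1}", {})
            did_map[disk_id][1][node] = local_path
    return did_map


# Same physical disk, different controller numbers on the two nodes.
views = {
    "nodea": {"/dev/dsk/c1t0d0": "disk-A"},
    "nodeb": {"/dev/dsk/c2t0d0": "disk-A"},
}
for disk_id, (did_name, paths) in build_did_map(views).items():
    print(did_name, paths)
```

Both nodes end up referring to the disk by the same DID name, whatever their local c#t#d# name happens to be.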
If you deploy Sun's HA Oracle agent, and you want to create datafiles
on raw LUN partitions, you have to use the /dev/global devices. HA
Oracle is a single instance Oracle which is cluster-unaware. This means
that when a node fails, the HA Oracle scripts will take care of the
switchover of the instance to the other node. One of the important
issues with such cluster-unaware applications that are made HA through
scripts is that we must make sure that I/O to the disk always only
happens from one node. During a failover from nodea to nodeb, we must
prevent nodea from still doing I/O to the disks. This could happen
if it weren't aware of nodeb's takeover (eg in the case of a Split
Brain). In Sun Cluster, the Device Configuration System (DCS) takes
care of that. The DCS makes sure that any global device has only one
primary node: only this primary node can do I/O to the disk. Other
nodes, when they try to access the global device, will have their I/O
rerouted through the private interconnect. When a failover of the
global device occurs, the DCS will use fencing techniques to make sure
that the old primary cannot accidentally access the device.
/dev/global/dsk/d4s2 is a global device whereas /dev/did/dsk/d4s2 is
not.
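The single-primary rule can be sketched as follows. Again, this is a toy model in Python of the behaviour described above (direct I/O on the primary, rerouting from other nodes, fencing of the old primary on failover). The class and method names are hypothetical and do not correspond to any real DCS interface.

```python
class GlobalDevice:
    """Toy model of a global device with exactly one primary node."""

    def __init__(self, name, primary):
        self.name = name
        self.primary = primary
        self.fenced = set()  # nodes that are blocked from touching the disk

    def io(self, from_node):
        """Describe how an I/O request issued on from_node reaches the disk."""
        if from_node in self.fenced:
            raise PermissionError(f"{from_node} is fenced off from {self.name}")
        if from_node == self.primary:
            return f"{from_node}: direct I/O"
        return f"{from_node}: rerouted over the private interconnect to {self.primary}"

    def failover(self, new_primary):
        """Move the primary role and fence the old primary."""
        self.fenced.add(self.primary)
        self.fenced.discard(new_primary)
        self.primary = new_primary


dev = GlobalDevice("/dev/global/dsk/d4s2", primary="nodea")
print(dev.io("nodea"))   # the primary talks to the disk directly
print(dev.io("nodeb"))   # the other node goes through the interconnect
dev.failover("nodeb")
print(dev.io("nodeb"))   # nodeb is the primary now
```

After the failover, a call to dev.io("nodea") raises PermissionError in this model, which is the whole point: the old primary cannot accidentally write to the disk.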
Let us take a look at how the device files look:
1) For a /dev/did device we see that it is a device file just like any other:
etc-1 # ls -l /dev/did/dsk/d4s2
lrwxrwxrwx 1 root root 39 Jul 10 2003 /dev/did/dsk/d4s2 -> ../../../devices/pseudo/did@0:4,4s2,blk
etc-1 # ls -l /devices/pseudo/did@0:4,4s2,blk
brw------- 1 root sys 228,130 Jul 11 2001 /devices/pseudo/did@0:4,4s2,blk
2) For the /dev/global access path of this device d4s2, we see that it is a bit more difficult:
etc-1 # ls -l /dev/global
lrwxrwxrwx 1 root root 34 Apr 14 13:15 /dev/global -> /global/.devices/node@1/dev/global
So /dev/global/dsk/d4s2 is actually /global/.devices/node@1/dev/global/dsk/d4s2
etc-1 # ls -l /global/.devices/node@1/dev/global/dsk/d4s2
lrwxrwxrwx 1 root root 39 Jan 23 21:03 /global/.devices/node@1/dev/global/dsk/d4s2 -> ../../../devices/pseudo/did@0:4,4s2,blk
This points to a physical device file /global/.devices/node@1/devices/pseudo/did@0:4,4s2,blk
The thing to notice here is that the file system /global/.devices/node@1 is actually a globally mounted, pxfs filesystem.
If you access a device file in such a filesystem, the access is
'cluster aware', which means the actual request will be sent to the
node that is the primary. Thus, although /dev/did/dsk/d4s2 and
/dev/global/dsk/d4s2 point to the same device, their access path is
different: non-cluster aware in the first case, cluster aware in the
second case.
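If you want to check for yourself why the same relative link target ends up in two different places, here is a small Python sketch. It only does lexical path arithmetic on the names from the listings above, so it runs anywhere and needs no cluster; the helper name resolve_link is my own.

```python
import posixpath


def resolve_link(link_path, target):
    """Resolve a relative symlink target against the directory holding the link.

    Purely lexical: nothing is looked up on disk.
    """
    return posixpath.normpath(posixpath.join(posixpath.dirname(link_path), target))


target = "../../../devices/pseudo/did@0:4,4s2,blk"

# The /dev/did link lives under /dev, so it lands in the local /devices tree.
print(resolve_link("/dev/did/dsk/d4s2", target))

# The /dev/global link really lives under /global/.devices/node@1, so the
# same target lands inside the globally mounted filesystem.
print(resolve_link("/global/.devices/node@1/dev/global/dsk/d4s2", target))
```

The first call gives /devices/pseudo/did@0:4,4s2,blk and the second gives /global/.devices/node@1/devices/pseudo/did@0:4,4s2,blk: same link target, but a different filesystem, and that is what makes the second access path cluster aware.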