Sunday 20 May 2007
Goodbye!
I have not written anything for over a year now. There have been some changes in my life, positive ones for sure. But it is with mixed feelings that I close this blog and finish my life as a sunny. Over the last few years I have studied philosophy and got a Master's in Applied Ethics. Now I have the opportunity to do a PhD in exactly that field here at my hometown university. I start on the 1st of June; my last day at Sun will be the 31st of May. Hence no more Sun Cluster for me. No more computer industry. I have worked here for eight years and they were eight great years. I would never dream of working for another company or working with a different product. But now it is time to change my life completely.
The other change is of course the birth of my son Roman on the first of September 2006. He is doing fine and is now a cute 8-month-old baby who is trying to crawl, but still with no great success. Being a mother has been great!
So I say goodbye now to you all and close this blog. Take care and remember that Sun Cluster is still the best piece of software around! Cheers...

Sunday 21 May 2006
David's Blog
David, my 'SO', has a blog as of today, and I promised to give it some publicity. He loves Java, so I guess there is a lot of interesting stuff for people visiting this site too.
Here is the link: http://www.jroller.com/page/upperdowner

Friday 05 May 2006
Replacing a disk in Sun Cluster
There is plenty of documentation on how to change a disk in Sun Cluster, including everything that you need to do on the volume manager layer etc. Just check out the cluster collection on http://docs.sun.com
I just want to point out some of the common mistakes. Especially, what NOT to do when replacing a disk in Sun Cluster.
First of all, if you are replacing a disk in a Hardware RAID box there is **absolutely nothing** you need to do on the cluster or OS layer. Just follow the instructions of the box. No need to run any commands, as the LUNs that the OS (and hence the cluster) sees do not change.
Do **not** start to run scdidadm -C, scdidadm -r etc. I repeat: Do NOT start to run scdidadm -C or scdidadm -r. It is completely useless.
If you are replacing a physical disk or an entire LUN, and the
WWN/DiskID of the disk changes and hence the way the OS sees the disk
changes, you have to make Sun Cluster aware of that, so that it can
update its DID database.
But again DO NOT run scdidadm -C!
I have seen over the last few months quite a few escalations where
someone just happily ran scdidadm -C as part of a disk replacement
procedure. And it screwed up the DID database. No problem for me as it
is always fun to fix but not fun for the owners of that cluster as in
many cases this means downtime.
Now what is the much-feared scdidadm -C for? What it does is 'clear out' the DID database. Which means, if you permanently (and here I mean **permanently**, i.e. not as part of a disk replacement procedure) remove a disk, you may run it to free up the DID it was using. But this is a very unlikely situation. Let me sketch one thing that can go wrong when you run it out of the blue because you think that is the right thing to do.
Let's say you had a disk represented by the DID number d12. There are some problems on the fabric and for some reason the disk is temporarily unavailable. You see errors and you think 'oh, let's run scdidadm -C, that'll fix it'. The command will not find the disk associated with d12 and will free up that DID number. Next time you reboot or run scdidadm -r or whatever, the disk is back but may be associated with a different DID number. Which, of course, can cause many problems.
So: scdidadm -C: Don't do it. Unless you really have to. But not as part of a disk replacement procedure.
The thing you need to do if you replace a physical disk is write down the DID with which it is associated. When you insert the new replacement disk you type
scdidadm -R d#
where d# is the DID number. This will update the DID database with the information (i.e. the disk ID) of the new disk and everybody will live happily ever after.
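To make the difference between the two commands a bit more tangible, here is a toy model in Python. It is only a sketch: the class `ToyDidTable`, its method names and the "lowest free number" allocation rule are assumptions made up for this illustration, not the way Sun Cluster really assigns DID numbers. It simply replays the d12-style accident described above next to an in-place update.

```python
class ToyDidTable:
    """Toy stand-in for the DID database: DID number -> device ID."""

    def __init__(self):
        self.entries = {}

    def discover(self, device_id):
        """Return the DID number of device_id, allocating one if it is new."""
        for number, known_id in self.entries.items():
            if known_id == device_id:
                return number
        number = 1
        while number in self.entries:  # assumed rule: take the lowest free number
            number += 1
        self.entries[number] = device_id
        return number

    def clean(self, visible_ids):
        """Mimics the effect described for scdidadm -C: forget disks that are not visible."""
        missing = [n for n, d in self.entries.items() if d not in visible_ids]
        for number in missing:
            del self.entries[number]

    def repair(self, number, new_device_id):
        """Mimics the effect described for scdidadm -R d#: same number, new device ID."""
        self.entries[number] = new_device_id


# Accident: the disk is only temporarily unreachable when the table is cleaned.
table = ToyDidTable()
for disk in ("wwn-a", "wwn-b", "wwn-c"):
    table.discover(disk)
old_number = table.discover("wwn-b")
table.clean(visible_ids={"wwn-a", "wwn-c"})  # wwn-b is hidden by a fabric problem
table.discover("wwn-d")                      # some other disk takes the freed number
new_number = table.discover("wwn-b")         # the old disk returns under a new number
print("before:", old_number, "after:", new_number)

# Proper replacement: the number stays, only the device ID changes.
table2 = ToyDidTable()
for disk in ("wwn-a", "wwn-b", "wwn-c"):
    table2.discover(disk)
table2.repair(2, "wwn-b-replacement")
print("d2 now points at:", table2.entries[2])
```

In the first scenario everything that still refers to the old number ends up pointing at the wrong disk, which is exactly the kind of mess that needs downtime to untangle.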

Friday 21 April 2006
The challenges of doing Root Cause Analysis
Very often we get a request to do a Root Cause Analysis (RCA) of a problem that has already gone away. While in some cases this is possible, in others it is not because the data needed is already gone. Sometimes we are faced with very insistent customers who keep pushing for an RCA even if we tell them that we do not have enough data and that unfortunately our crystal ball got broken last time we did a general spring cleanup. Needless to say, this results in an unhappy customer and yours truly in despair.
Let us describe an example. This is a fictitious example but similar to situations we end up in from time to time. Customer A experiences probe timeouts on their Oracle database. Which means, the Sun Cluster fault probe, which does some tests on the availability of the Oracle database, does not succeed in finishing the tests in time. As a result, the Sun Cluster agent tries to stop and start the Oracle database and, since the database cannot be stopped in time, the Oracle resource goes into STOP_FAILED. Now when an application cannot be stopped, there will be no failover by the cluster. First question of the customer: why did it not fail over? Answer: it would be very dangerous to start up another instance of an application on another node when we are not absolutely sure that the application is stopped on the first node. The cluster decides here to protect data integrity rather than availability. So far so good.
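The chain of decisions in this example can be written down in a few lines of Python. This is a simplified sketch of the behaviour just described, not the agent's real logic: the function name, its parameters and the labels other than STOP_FAILED are invented for the illustration.

```python
def resource_outcome(probe_seconds, probe_timeout, stop_seconds, stop_timeout):
    """Return a label for what happens to the resource after one probe.

    All arguments are durations in seconds; the names are hypothetical.
    """
    if probe_seconds <= probe_timeout:
        return "ONLINE"  # the probe finished in time, nothing to do
    # The probe timed out, so the agent tries to stop and start the application.
    if stop_seconds > stop_timeout:
        # The stop did not finish in time. Nobody knows whether the application
        # is really down, so the cluster does not fail over.
        return "STOP_FAILED"
    return "RESTARTED"  # the stop finished in time, so the start can follow


print(resource_outcome(probe_seconds=90, probe_timeout=60,
                       stop_seconds=400, stop_timeout=300))
```

Note that nothing in this logic records why the probe or the stop was slow, which is the whole problem once the node has been rebooted.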
However, once the customer noticed their cluster in this state, they panicked and decided to reboot the node. After the reboot the STOP_FAILED situation was cleared and Oracle is now running just fine. No more probe timeouts. The customer, however, is anxious to know what exactly happened, gathers explorers and sends these to us, demanding an RCA.
What we are able to do at this point is explain what has happened: the Oracle commands that the fault probe executes took too much time, as a result the resource was restarted but the Oracle STOP method took too much time too, and hence the resource went into STOP_FAILED. What is very difficult now is to explain **why** the Oracle commands took so long to complete. With the data we have we can only guess: maybe there were storage failures making disk access slow, maybe the machine was running out of memory, maybe it was only Oracle itself being slow. In some cases there may be some hints in the messages files or in the Oracle Alert log, but failing that it will be impossible to pin down a root cause. After all, the machine was rebooted and the situation which may have led to the phenomenon has been cleared. It would have been better to gather more data (in the form of a crash dump or GUDS output) at the time of the issue; that way we would at least have had some chance of nailing down the problem. But of course when you run into a situation like that your first concern is probably to get the machine up and running again, and many machines are rebooted without further ado.
You may now object that the cluster or Oracle or whatever should at least gather **more** information at that point. Maybe it should automatically gather performance data. But there are always drawbacks to that as well: gathering more information means more disk space needed, more burden on the machine etc. Since most clusters run fine throughout their lives, it seems a logical decision to gather more in-depth information only when a problem presents itself.
So dear customers, if you are reading this: if we cannot give you an RCA based on explorers alone and we tell you that we do not have enough data to know what happened, please believe us. We are doing all we can but we cannot do the impossible.

Thursday 30 March 2006
Scanner problems and other stuff :O)
I have not been blogging for quite a while now and have been receiving some complaints about it...
I wanted to post a picture of the reason for the delay here, but I have been unable to configure my printer-cum-copier-cum-scanner to scan. Printing and copying work but not scanning. In fact this is the worst piece of hardware I have ever laid my hands on, but I'll be fair and not mention any brand names.
So I shall just write about it. As of today I am 17 weeks pregnant :o) The picture I was planning to post was the ultrasound taken at 12 weeks where you can actually see the baby in its entirety: legs, arms, body, head. Admittedly, the head is very big compared to the body so he or she looked a bit like an alien at that stage, but we have been told that that is normal :o)
The reason why I haven't been blogging about it earlier is that I have felt sooooo miserable. Not psychologically, because the baby is of course very welcome, but physically. The first 14 weeks I was tired the whole day long and instead of morning sickness I was suffering from afternoon and evening sickness. To put it nicely, whatever I ate for dinner did not stay in my body for long and when it came out it was the wrong way :o). Even now I sometimes feel sick but at least I am able to eat a little bit. Also strange, but less annoying, is that I have developed odd tastes: I am eating up to 6 nectarines a day and lots of oranges, but have stopped drinking coffee altogether. I just can't stand the taste of it (whereas before I drank up to 5 cups a day). As an ex-smoker I never really minded other people smoking or sitting in smoky areas, but right now I can't even stand the smell of it; it makes me physically ill. I have thus become an unpopular person amongst smokers who have to share my table. Oh well.
So that was it, stay tuned, I shall post some pictures if ever I find out how this scanning thing works. In 3 weeks I have a 3-D ultrasound which normally produces very nice pictures of the baby's face, I hope I can post them then.
PS As for the FAQ: No we do not know whether it is a boy or a girl. We have decided we'd like to be surprised at birth so we are not going to ask next time we have the ultrasound. FAQ2: Due date is 2nd of September.

Monday 26 December 2005
Merry Christmas and Happy New Year!
First of all I want to wish everyone a merry Christmas and a happy New Year. I especially picked out this New Year's card for you, but beware: it is only for those with a sense of humor and it contains nudity:
http://www.mortierbrigade.com/christmas/
I am not taking holidays this year as I have spent last year's holidays and have to save next year's holidays for the Master I am taking (and its associated exams and seminars). But I am a bit jealous of those who do have holidays! Anyway, enjoy them and think sometimes about the poor sods like me who have to work and write papers :o)
Anyway:
LOTS OF KISSES HUGS AND GOOD LUCK IN 2006

Thursday 15 December 2005
Sun Cluster Forum
Just a short post to let you know that there is a Forum on Clustering
where you can post all your questions. I am checking this regularly
together with some of my colleagues and it is always good to read some
posts about the clusters that are out there.
Here is the URL:
http://forum.sun.com/forum.jspa?forumID=1

Saturday 03 December 2005
Asking you all a favour
Dear everyone,
I know it has been a very long time since I have written something here, and I am really ashamed that my first post in two months is actually about a favour I want from you all.
A small community of dog-loving people has started a petition against puppy mills in Belgium. Dogs are often still bred under awful conditions here and presented behind glass walls in pet shops. Needless to say, this is horrible for the bitches that have to produce litter after litter, and for the puppies and their new owners, who take home an often ill and badly socialized pup.
Please help us and sign the petition at http://www.gopetition.com/online/7635.html
I promise my next post will be about Sun Cluster...

Monday 26 September 2005
The DID database
A long time ago I blogged about the difference between /dev/did and /dev/global device names in Sun Cluster 3.x.
I'd like to discuss some of the files in the Cluster Configuration
Repository. The Cluster Configuration Repository (CCR) is the Cluster
database containing the information about the current cluster setup.
Changes to this setup are saved across reboots in files in the
directory /etc/cluster/ccr which is replicated on each node.
One of the things kept in this database is the DID database. Just take
a look at the file /etc/cluster/ccr/did_instances, and you will see it
looks as follows:
ccr_gennum 3
ccr_checksum 282788695BAD93E939748ECE92B52B4B
19 disk|DEVID_SCSI_SERIAL|SEAGATE ST39102LCSUN9.0GLJW992510000U0010JDH|5345414741544520535433393130324c4353554e392e30474c4a5739393235313030303055303031304a4448|2:/dev/rdsk/c0t1d0
20 disk||||2:/dev/rdsk/c0t6d0
1 disk|DEVID_SCSI_SERIAL|SEAGATE ST39102LCSUN9.0GLJW8793900001001J327|5345414741544520535433393130324c4353554e392e30474c4a57383739333930303030313030314a333237|1:/dev/rdsk/c0t1d0
2 disk||||1:/dev/rdsk/c0t6d0
3 disk|DEVID_SCSI3_WWN| |200000203714ce27|2:/dev/rdsk/c1t21d0|1:/dev/rdsk/c1t21d0
4 disk|DEVID_SCSI3_WWN| |20000020370d3f7d|2:/dev/rdsk/c1t16d0|1:/dev/rdsk/c1t16d0
5 disk|DEVID_SCSI3_WWN| |20000020370d3f5f|2:/dev/rdsk/c1t0d0|1:/dev/rdsk/c1t0d0
6 disk|DEVID_SCSI3_WWN| |20000020370d3f03|2:/dev/rdsk/c1t3d0|1:/dev/rdsk/c1t3d0
7 disk|DEVID_SCSI3_WWN| |20000020370d3590|2:/dev/rdsk/c1t17d0|1:/dev/rdsk/c1t17d0
8 disk|DEVID_SCSI3_WWN| |200000203714ca15|2:/dev/rdsk/c1t22d0|1:/dev/rdsk/c1t22d0
9 disk|DEVID_SCSI3_WWN| |20000020370d3d6d|2:/dev/rdsk/c1t4d0|1:/dev/rdsk/c1t4d0
10 disk|DEVID_SCSI3_WWN| |20000020370a2b24|2:/dev/rdsk/c1t1d0|1:/dev/rdsk/c1t1d0
11 disk|DEVID_SCSI3_WWN| |20000020370dc6ac|2:/dev/rdsk/c1t19d0|1:/dev/rdsk/c1t19d0
12 disk|DEVID_SCSI3_WWN| |200000203714c427|2:/dev/rdsk/c1t20d0|1:/dev/rdsk/c1t20d0
13 disk|DEVID_SCSI3_WWN| |20000020370d10e2|2:/dev/rdsk/c1t5d0|1:/dev/rdsk/c1t5d0
14 disk|DEVID_SCSI3_WWN| |20000020370d4094|2:/dev/rdsk/c1t2d0|1:/dev/rdsk/c1t2d0
15 disk|DEVID_SCSI3_WWN| |20000020370d3ed9|2:/dev/rdsk/c1t6d0|1:/dev/rdsk/c1t6d0
16 disk|DEVID_SCSI3_WWN| |20000020370d4039|2:/dev/rdsk/c1t18d0|1:/dev/rdsk/c1t18d0
17 disk|DEVID_SCSI_SERIAL|IBM DNES30917SUN9.0G1QK087 |49424d2020202020444e4553333039313753554e392e304731514b30383720202020202020202020|1:/dev/rdsk/c2t10d0
18 disk|DEVID_SCSI_SERIAL|IBM DNES30917SUN9.0G1QM765 |49424d2020202020444e4553333039313753554e392e304731514d37363520202020202020202020|1:/dev/rdsk/c2t11d0
8191 tape||||1:/dev/rmt/0
All CCR files start with a gennum (generation number) on the first line and a checksum on the second line. These files are indeed checksum protected and should NOT be edited manually. You CAN edit them on some occasions, but if that is required you will need to contact Sun for assistance.
Let us look at one of these lines:
13 disk|DEVID_SCSI3_WWN| |20000020370d10e2|2:/dev/rdsk/c1t5d0|1:/dev/rdsk/c1t5d0
This is the entry for DID 13, which we can see in the output of scdidadm -L as follows:
13 moon1:/dev/rdsk/c1t5d0 /dev/did/rdsk/d13
13 moon2:/dev/rdsk/c1t5d0 /dev/did/rdsk/d13
The second field is 'disk'. This is the type of DID device. These types
are defined in the file /etc/cluster/ccr/did_types and right now you
have 'disk' and 'tape'.
The third field is 'DEVID_SCSI3_WWN'. This defines the type of device ID that this device provides. Each DID device is normally identified by a unique ID, such as a serial number, WWN etc.
The actual device ID is in the fourth field, in this case: 20000020370d10e2. This also means that this disk is uniquely identified in the DID database and we cannot just replace it by another disk without telling the cluster about it.
The fifth field is 2:/dev/rdsk/c1t5d0. This means that on the node with nodeid 2, this disk is referred to as 'c1t5d0'.
The sixth field is 1:/dev/rdsk/c1t5d0. This means that on the node with
nodeid 1, this disk is referred to as 'c1t5d0'. Please be aware that
these names may differ on different nodes. The DID layer uses the
device ID to make sure we are talking about the same disk, even if it
has different Solaris names on the different nodes.
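For those who like to see the layout as code, here is a minimal Python sketch that splits one line of did_instances into the fields discussed above. The names `DidEntry` and `parse_did_line` are my own and hypothetical; this is a read-only illustration and certainly not a tool for editing CCR files. Note that the sample lines contain one extra field between the ID type and the device ID, which the walkthrough above does not count: it is blank for the WWN entries and holds the vendor and serial text for the DEVID_SCSI_SERIAL entries. In the sample listing the device ID of those serial entries appears to be that same text written out as hexadecimal ASCII; the last lines check this for DID 19 (with the wrapped entry joined onto one line).

```python
from dataclasses import dataclass


@dataclass
class DidEntry:
    instance: int    # DID number, e.g. 13 for d13
    dev_type: str    # 'disk' or 'tape'
    id_type: str     # e.g. 'DEVID_SCSI3_WWN'; empty when there is no device ID
    label: str       # text between ID type and device ID (blank for WWN entries)
    device_id: str   # the unique ID exactly as stored in the file
    paths: dict      # nodeid -> Solaris device path on that node


def parse_did_line(line):
    """Split one data line of did_instances into its fields."""
    head, id_type, label, device_id, *path_fields = line.strip().split("|")
    instance, dev_type = head.split()
    paths = {}
    for field in path_fields:
        nodeid, path = field.split(":", 1)
        paths[int(nodeid)] = path
    return DidEntry(int(instance), dev_type, id_type, label.strip(), device_id, paths)


entry = parse_did_line(
    "13 disk|DEVID_SCSI3_WWN| |20000020370d10e2|2:/dev/rdsk/c1t5d0|1:/dev/rdsk/c1t5d0"
)
print(entry.instance, entry.device_id, entry.paths)

serial = parse_did_line(
    "19 disk|DEVID_SCSI_SERIAL|SEAGATE ST39102LCSUN9.0GLJW992510000U0010JDH|"
    "5345414741544520535433393130324c4353554e392e30474c4a5739393235313030303055303031304a4448"
    "|2:/dev/rdsk/c0t1d0"
)
# Decode the hexadecimal device ID back to text and compare it with the label.
print(bytes.fromhex(serial.device_id).decode("ascii") == serial.label)
```

Something like this is handy when comparing did_instances against the disk information in an explorer, because you can line up the entries per node and per device ID instead of eyeballing the raw file.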
If you are seeing error messages about DID devices, it may be that you have replaced a disk without following the official procedure to replace a disk in the cluster. Let us say you replaced the disk at c1t5d0 with another one. You must now tell the cluster that it should update the DID database for DID number 13 with the new device ID. You can do that as follows:
#scdidadm -R c1t5d0
OR:
#scdidadm -R 13
Sometimes the DID configuration is completely messed up, for example because you have been switching cables/controllers without following the correct procedure. To check this, please compare the entries in the did_instances file with, for example, the 'diskinfo' output in the explorer. To fix this, contact your Sun Resolution Center.

Sunday 18 September 2005
Beautiful Belgium?

Tuesday 23 August 2005
24 (CONTAINS SPOILERS)
Well, cocooning has been the next big thing for quite a while now and
what is better than a nice glass of wine and a TV series on DVD? It all
started when I bought 24 series 1 two years ago and since then we have
seen: 24 series 2 & 3, Blake's 7 series 1 & 2 & 3, The
Shield series 1 & 2, The Singing Detective, Twin Peaks (a SHAME
that they don't release the episodes after 7 !!), Murder One,
Star Trek (the original one with Captain Kirk and friends), quite
some Doctor Who, Red Dwarf, and I am sure I miss some..
So 2 weeks ago I got 24 series 4. As this was one of our favourite
series it was anxiously awaited and we did have some nice evenings but
overall it was a disappointment...
Here's why (contains spoilers):
-It wasn't Real Time!! Chloe for example in one episode does the following in less than 10 minutes: change from pyjamas to real clothes, put on some makeup, make it from her apartment to the CTU office. And there are many examples of that. The idea of 24 was exactly that it was Real Time, and they should stick to it as best as they can.
-There just was too much going on in 24 hours. Focus on one big issue
to be solved in the end and one smaller issue in the beginning like in
previous series.
-Some episodes were just plain boring.
-The Chinese consulate storyline was just totally unbelievable.
-What ever happened to Behrooz? I liked him!!! You cannot have your
audience sympathize with a character for some hours and then just let
him vanish from the story. That is not how television works...
-I thought that some of the content was not politically correct, but that is my personal opinion. There were some meager attempts in the story to avoid that accusation (the 2 brothers helping Jack during the firefight in the shop, Behrooz...) but they were a bit stupid.
OK I AM looking forward to 24 series 5 because 24 has always been good fun but this time it was just less fun than usual.

Thursday 28 July 2005
About Insanity
So I want to give Coolboy some tips about his puppy and I write a comment on his blog. When I
check it out later I do not see it. Fortunately the entry is in my
history so I press the back button a couple of times: it seems that to
the 'simple math question'
Which was 7 + 39
I answered
49
Really!!
(Do I have to start worrying now?)

Monday 25 July 2005
So You Got Yourself a Reservation Conflict Panic
For more than 3 years your cluster has been doing fine. Then suddenly you see the following appear on one node's terminal:
[ID 747640 kern.notice]
Reservation Conflict
The node has panicked! Fortunately this is a cluster and the
other node takes over all services that were running on the panicked
node. The panicked node comes up fine and joins the cluster. Still you
would like to know WHY this has happened. So you gather explorers of
both nodes and the crash dump that was gathered after the panic. You
submit them to Sun Service and you wait eagerly for an explanation ....
Unfortunately, it is not as simple as that. As I have discussed in previous blog entries (here, for example, and here), SCSI reservations are used to kick a node out of the cluster if a split brain occurs, or to prevent a node from joining the cluster on its own when it may suffer from amnesia.
In both cases, the node that kicked the other node out thought that this other node was dead because it was no longer receiving any heartbeats from it. This may be because the other node **is** dead or, as in this case where it was still alive, because the surviving node merely **thought** it was dead and worth kicking out. This can happen because there was physically something wrong with the interconnect links (power down on the switches for example), or because one of the nodes was unable to send or receive heartbeats, for example because it suffers from SunAlert 57666.
And here we see our problem: the cause of the reservation conflict on nodea may very well be an issue on nodeb. Therefore, when a reservation conflict occurs and you want a proper Root Cause Analysis, please also generate a Live Core Dump of the 'good' node ASAP and supply this to us as well.
So here is a To-Do list for Reservation Conflicts:
-Ask yourself if there are any other clusters or stand-alone nodes that
can access the disks seen by the affected cluster (this is unsupported
and can lead to reservation conflicts).
-Generate a Live Dump on the node that did not experience the reservation conflict. If you are not 100% sure how to do this, engage a Sun engineer.
-If you cannot generate a live dump, engage a Sun engineer who will
provide you with a script to gather some vital information on the
'good' node.
-If you are running Solaris 8, please check if all patches to prevent
SunAlert 57666 are installed. I CANNOT STRESS HOW IMPORTANT THIS IS!

Monday 11 July 2005
How to transform a Dobermann into a Poodle
Well, I must admit I do not have much inspiration today to write on my blog. Moreover, work is piling up.
However, you should check this one out:
http://www.attackchi.org.au/kits.htm
(It is kind of funny although I am not really sure whether the dog agrees).

Thursday 07 July 2005
Books, Books, Books
During the months of May and June I strictly allowed myself to think only about my 2 philosophy exams during my spare time. This means that I have only read stuff like Richard Rorty's Contingency, Irony and Solidarity (which was quite good though) and that I had a pile of books staring at me from the shelves. They were my after-exam treats and now that the exams are over I am anxious to finish them all before the new academic year starts in October. Since I have discovered that I spend far too much money on books anyway, I plan not to allow myself to buy anything new before these are finished.
The books are a combination of books about ethics (I plan to tackle a
Master in Applied Ethics course next year in evening classes) and
fiction.
I am currently reading:
* Animal Liberation, by Peter Singer
* Solaris, by Stanislaw Lem (NOTHING to do with the Operating System, but a brilliant SF book from 1970)
Here is an overview of what I plan to read:
* A Companion to Ethics, Edited by Peter Singer
* Birdsong, Sebastian Faulks
* Applied Ethics: A Reader, Edited by Jerrold R. Coombs and Earl Winkler
* Omega Minor, by Paul Verhaeghen (Dutch)
* Pleidooi voor een moraal der Dubbelzinnigheid, Simone de Beauvoir
* Animal Ethics Reader, Edited by Susan Armstrong and Richard Botzler
* Paddy Clarke Ha Ha Ha, Roddy Doyle