Kristien's Weblog
20070520 Sunday 20 May 2007
Goodbye!

I have not written anything for over a year now. There have been some changes in my life, positive ones for sure. But it is with mixed feelings that I close this blog and finish my life as a Sunny. Over the last few years I have studied philosophy and got a Master in Applied Ethics. Now I have the opportunity to do a PhD in exactly that field here at my hometown university. I start on the 1st of June; my last day at Sun will be the 31st of May. Hence no more Sun Cluster for me. No more computer industry. I have worked here for eight years and they were eight great years. I would never dream of working for another company or working with a different product. But now it is time to change my life completely.

The other change is of course the birth of my son Roman on the first of September 2006. He is doing fine and is now a cute 8-month-old baby who is trying to crawl, but so far without much success. Being a mother has been great!

So I say goodbye now to you all and close this blog. Take care and remember that Sun Cluster is still the best piece of software around! Cheers...

20 May 2007, 12:02:15 MEST Permalink Comments [2]

20060521 Sunday 21 May 2006
David's Blog

David, my 'SO', has a blog as of today, so I promised to give it some publicity. He loves Java, so I guess there will be a lot of interesting stuff for people visiting this site too.

Here is the link: http://www.jroller.com/page/upperdowner


21 May 2006, 19:19:51 MEST Permalink

20060505 Friday 05 May 2006
Replacing a disk in Sun Cluster
There is plenty of documentation on how to change a disk in Sun Cluster, including everything that you need to do at the volume manager layer. Just check out the cluster collection on http://docs.sun.com

I just want to point out some of the common mistakes.
Especially, what NOT to do when replacing a disk in Sun Cluster.

First of all, if you are replacing a disk in a Hardware RAID box there is **absolutely nothing** you need to do on the cluster or OS layer. Just follow the instructions for the box. There is no need to run any commands, as the LUNs that the OS (and hence the cluster) sees do not change.

Do **not** start to run scdidadm -C, scdidadm -r etc. I repeat: Do NOT start to run scdidadm -C scdidadm -r. It is completely useless.

If you are replacing a physical disk or an entire LUN, and the WWN/DiskID of the disk changes and hence the way the OS sees the disk changes, you have to make Sun Cluster aware of that, so that it can update its DID database.

But again DO NOT run scdidadm -C!
I have seen over the last few months quite a few escalations where someone just happily ran scdidadm -C as part of a disk replacement procedure. And it screwed up the DID database. No problem for me as it is always fun to fix but not fun for the owners of that cluster as in many cases this means downtime.
Now what is the much-feared scdidadm -C for? It 'clears out' the DID database. This means that if you permanently (and here I say **permanently**, i.e. not as part of a disk replacement procedure) remove a disk, you may run it to free up the DID that the disk was using. But this is a very unlikely situation. Let me sketch one thing that can go wrong when you run it out of the blue because you think it is the right thing to do. Let's say you had a disk represented by the DID number d12. There are some problems on the fabric and for some reason the disk is temporarily unavailable. You see errors and you think 'oh, let's run scdidadm -C, that'll fix it'. The command will not find the disk associated with d12 and will free up that DID number. Next time you reboot or run scdidadm -r or whatever, the disk is back but may be associated with a different DID number. Which, of course, can cause many problems.
So: scdidadm -C: Don't do it. Unless you really have to. But not as part of a disk replacement procedure.
What you need to do if you replace a physical disk is write down the DID with which it is associated. When you insert the new replacement disk you type
scdidadm -R d# where d# is the DID number. This will update the DID database with the information (i.e. the disk ID) of the new disk and everybody will live happily ever after.
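
The difference between the two commands can be illustrated with a small toy model in Python. This is only a sketch to make the scenario above concrete: the class, its method names and the "lowest free number" allocation policy are assumptions made for illustration, not how Sun Cluster actually implements the DID database.

```python
# Toy model only: all names and the allocation policy are hypothetical,
# this is NOT the real Sun Cluster DID implementation.
class ToyDidDatabase:
    def __init__(self):
        self.by_instance = {}  # DID number -> device ID (e.g. a WWN)

    def _next_free(self):
        # Assumption: a new device gets the lowest unused DID number.
        n = 1
        while n in self.by_instance:
            n += 1
        return n

    def discover(self, visible_device_ids):
        """Loosely models 'scdidadm -r': register device IDs not seen before."""
        known = set(self.by_instance.values())
        for dev in visible_device_ids:
            if dev not in known:
                self.by_instance[self._next_free()] = dev
                known.add(dev)

    def clean(self, visible_device_ids):
        """Loosely models 'scdidadm -C': drop DIDs whose device is not visible."""
        visible = set(visible_device_ids)
        self.by_instance = {n: d for n, d in self.by_instance.items() if d in visible}

    def repair(self, instance, new_device_id):
        """Loosely models 'scdidadm -R d#': keep the DID, swap in the new device ID."""
        self.by_instance[instance] = new_device_id

    def instance_of(self, device_id):
        for n, d in self.by_instance.items():
            if d == device_id:
                return n
        return None


# Correct replacement: the disk behind d1 is swapped and d1 is repaired.
good = ToyDidDatabase()
good.discover(["wwn-A", "wwn-B"])
good.repair(1, "wwn-A-new")

# The mistake: wwn-A is only temporarily invisible, but its DID gets cleaned.
bad = ToyDidDatabase()
bad.discover(["wwn-A", "wwn-B"])           # wwn-A is d1, wwn-B is d2
bad.clean(["wwn-B"])                       # d1 is freed while wwn-A is away
bad.discover(["wwn-B", "wwn-C"])           # another disk takes the free number
bad.discover(["wwn-A", "wwn-B", "wwn-C"])  # wwn-A returns under a new number

print("good:", good.by_instance)
print("bad: wwn-A moved from d1 to d%d" % bad.instance_of("wwn-A"))
```

In the toy model the repaired database keeps d1 pointing at the replacement disk, while the cleaned database ends up with the original disk under a different number, which is the kind of mismatch that everything built on top of the DID names would then trip over.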







05 May 2006, 15:59:27 MEST Permalink Comments [1]

20060421 Friday 21 April 2006
The challenges of doing Root Cause Analysis

Very often we get a request to do a Root Cause Analysis (RCA) of a problem that has already gone away. While in some cases this is possible, in others it is not, because the data needed is already gone. Sometimes we are faced with very insistent customers who keep pushing for an RCA even if we tell them that we do not have enough data and that, unfortunately, our crystal ball got broken the last time we did a general spring cleanup. Needless to say, this results in an unhappy customer and yours truly in despair.

Let us describe an example. This is a fictitious example but similar to situations we end up in from time to time. Customer A experiences probe timeouts on their Oracle database. That is, the Sun Cluster fault probe, which runs some tests on the availability of the Oracle database, does not manage to finish its tests in time. As a result, the Sun Cluster agent tries to stop and start the Oracle database, and since the database cannot be stopped in time, the Oracle resource goes into STOP_FAILED. Now when an application cannot be stopped, there will be no failover by the cluster. First question of the customer: why did it not fail over? Answer: it would be very dangerous to start up another instance of an application on another node when we are not absolutely sure that the application has stopped on the first node. The cluster decides here to protect data integrity rather than availability. So far so good.
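
The decision described above can be written down as a tiny Python sketch. It is only an illustration of the reasoning in this example: the function names and the "RESTART" label are made up for the sketch, only the STOP_FAILED state comes from the text, and the real agent logic is of course more involved.

```python
# Illustrative sketch only: function names and the "RESTART" label are hypothetical.
def handle_probe_timeout(stop_completed_in_time):
    """What happens to the resource after the fault probe has timed out."""
    if stop_completed_in_time:
        # The application is known to be down, so it is safe to start it again.
        return "RESTART"
    # Nobody knows whether the application is still running on this node.
    return "STOP_FAILED"


def failover_allowed(state):
    """Starting a second instance elsewhere is only safe if the first one is known to be stopped."""
    return state != "STOP_FAILED"


state = handle_probe_timeout(stop_completed_in_time=False)
print(state, "-> failover allowed:", failover_allowed(state))
```

The point of the sketch is the last function: availability loses out to data integrity as soon as the stop cannot be confirmed.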

However, once the customer noticed their cluster in this state, they panicked and decided to reboot the node. After the reboot the STOP_FAILED situation was cleared and Oracle is now running just fine. No more probe timeouts. The customer, however, is anxious to know what exactly happened, gathers explorers and sends these to us, demanding an RCA.

What we are able to do at this point is explain what has happened: the Oracle commands that the fault probe executes took too much time, as a result the resource was restarted, but the Oracle STOP method took too much time too, and hence the resource went into STOP_FAILED. What is very difficult now is to explain **why** the Oracle commands took so long to complete. With the data we have we can only guess: maybe there were storage failures making disk access slow, maybe the machine was running out of memory, maybe it was only Oracle itself being slow. In some cases there may be some hints in the messages files or in the Oracle alert log, but failing that it will be impossible to pin down a root cause. After all, the machine was rebooted and the situation which may have led to the phenomenon has been cleared. It would have been better to gather more data (in the form of a crash dump or GUDS output) at the time of the issue; that way we would at least have had some chance of nailing down the problem. But of course when you run into a situation like that your first concern is probably to get the machine up and running again, and many machines are rebooted without further ado.

You may now object that at least the cluster or Oracle or whatever should gather **more** information at that point. Maybe it should automatically gather performance data. But there are always drawbacks to that as well: gathering more information means more disk space needed, more load on the machine etc. Since most clusters run fine throughout their lives, it seems a logical decision to gather more in-depth information only when a problem actually occurs.

So dear customers, if you are reading this: if we cannot give you an RCA based on explorers alone and we tell you that we do not have enough data to know what happened, please believe us. We are doing all we can but we cannot do the impossible.


21 Apr 2006, 15:53:22 MEST Permalink

20060330 Thursday 30 March 2006
Scanner problems and other stuff :O)

I have not been blogging for quite a while now and have been receiving some complaints about it...

I wanted to post a picture of the reason for the delay here, but I have been unable to configure my combined printer, copier and scanner to scan. Printing and copying work, but not scanning. In fact this is the worst piece of hardware I have ever laid my hands on, but I'll be fair and not mention any brand names.

So I shall just write about it. As of today I am 17 weeks pregnant :o) The picture I was planning to post was the ultrasound taken at 12 weeks where you can actually see the baby in its entirety: legs, arms, body, head. Admittedly, the head is very big compared to the body so he or she looked a bit like an alien at that stage, but we have been told that that is normal :o)

The reason why I haven't been blogging about it earlier is that I have felt sooooo miserable. Not psychologically, because the baby is of course very welcome, but physically. The first 14 weeks I was tired the whole day long and instead of morning sickness I was suffering from afternoon and evening sickness. To put it nicely, whatever I ate for dinner did not stay in my body for long, and when it came out it was the wrong way :o). Even now I sometimes feel sick but at least I am able to eat a little bit. Also strange, but less annoying, is that I have developed odd tastes: I am eating up to 6 nectarines a day and lots of oranges, but have stopped drinking coffee altogether. I just can't stand the taste of it (whereas before I drank up to 5 cups a day). As an ex-smoker I never really minded other people smoking or sitting in smoky areas, but right now I can't even stand the smell of it; it makes me physically ill. I have thus become an unpopular person amongst smokers who have to share my table. Oh well.

So that was it, stay tuned, I shall post some pictures if ever I find out how this scanning thing works. In 3 weeks I have a 3-D ultrasound which normally produces very nice pictures of the baby's face, I hope I can post them then.

PS As for the FAQ: No we do not know whether it is a boy or a girl. We have decided we'd like to be surprised at birth so we are not going to ask next time we have the ultrasound. FAQ2: Due date is 2nd of September.


30 Mar 2006, 17:38:38 MEST Permalink Comments [2]

20051226 Monday 26 December 2005
Merry Christmas and Happy New Year!
First of all I want to wish everyone a merry Christmas and a happy New Year. I especially picked out this New Year's card for you, but beware: it is only for those with a sense of humor and it contains nudity:
http://www.mortierbrigade.com/christmas/

I am not taking holidays this year as I have spent last year's holidays and have to save next year's holidays for the Master I am taking (and its associated exams and seminars). But I am a bit jealous though of those who do have holidays! Anyway enjoy them and think sometimes about the poor sods like me who have to work and write papers :o)

Anyway:

LOTS OF KISSES HUGS AND GOOD LUCK IN 2006



26 Dec 2005, 11:02:42 MET Permalink

20051215 Thursday 15 December 2005
Sun Cluster Forum
Just a short post to let you know that there is a Forum on Clustering where you can post all your questions. I am checking this regularly together with some of my colleagues and it is always good to read some posts about the clusters that are out there.
Here is the URL: http://forum.sun.com/forum.jspa?forumID=1

15 Dec 2005, 15:28:48 MET Permalink Comments [2]

20051203 Saturday 03 December 2005
Asking you all a favour

Dear everyone,

I know it has been a very long time since I have written something here, and I am really ashamed that my first post in two months is actually about a favour I want from you all.

A small community of dog-loving people has started a petition against puppy mills in Belgium. Dogs are often still bred under awful conditions here and presented behind glass walls in pet shops. Needless to say, this is horrible for the bitches that have to produce litter after litter, and for the puppies and their new owners, who take home an often ill and badly socialized pup.

Please help us and sign the petition at http://www.gopetition.com/online/7635.html

I promise my next post will be about Sun Cluster...

03 Dec 2005, 22:03:07 MET Permalink Comments [1]

20050926 Monday 26 September 2005
The DID database
A long time ago I blogged about the difference between /dev/did and /dev/global device names in Sun Cluster 3.x.
I'd like to discuss some of the files in the Cluster Configuration Repository. The Cluster Configuration Repository (CCR) is the Cluster database containing the information about the current cluster setup. Changes to this setup are saved across reboots in files in the directory /etc/cluster/ccr which is replicated on each node.
One of the things kept in this database is the DID database. Just take a look at the file /etc/cluster/ccr/did_instances, and you will see it looks as follows:

ccr_gennum      3
ccr_checksum    282788695BAD93E939748ECE92B52B4B
19      disk|DEVID_SCSI_SERIAL|SEAGATE ST39102LCSUN9.0GLJW992510000U0010JDH|5345414741544520535433393130324c4353554e392e30474c4a5739393235313030303055303031304a4448|2:/dev/rdsk/c0t1d0
20      disk||||2:/dev/rdsk/c0t6d0
1       disk|DEVID_SCSI_SERIAL|SEAGATE ST39102LCSUN9.0GLJW8793900001001J327|5345414741544520535433393130324c4353554e392e30474c4a57383739333930303030313030314a333237|1:/dev/rdsk/c0t1d0
2       disk||||1:/dev/rdsk/c0t6d0
3       disk|DEVID_SCSI3_WWN| |200000203714ce27|2:/dev/rdsk/c1t21d0|1:/dev/rdsk/c1t21d0
4       disk|DEVID_SCSI3_WWN| |20000020370d3f7d|2:/dev/rdsk/c1t16d0|1:/dev/rdsk/c1t16d0
5       disk|DEVID_SCSI3_WWN| |20000020370d3f5f|2:/dev/rdsk/c1t0d0|1:/dev/rdsk/c1t0d0
6       disk|DEVID_SCSI3_WWN| |20000020370d3f03|2:/dev/rdsk/c1t3d0|1:/dev/rdsk/c1t3d0
7       disk|DEVID_SCSI3_WWN| |20000020370d3590|2:/dev/rdsk/c1t17d0|1:/dev/rdsk/c1t17d0
8       disk|DEVID_SCSI3_WWN| |200000203714ca15|2:/dev/rdsk/c1t22d0|1:/dev/rdsk/c1t22d0
9       disk|DEVID_SCSI3_WWN| |20000020370d3d6d|2:/dev/rdsk/c1t4d0|1:/dev/rdsk/c1t4d0
10      disk|DEVID_SCSI3_WWN| |20000020370a2b24|2:/dev/rdsk/c1t1d0|1:/dev/rdsk/c1t1d0
11      disk|DEVID_SCSI3_WWN| |20000020370dc6ac|2:/dev/rdsk/c1t19d0|1:/dev/rdsk/c1t19d0
12      disk|DEVID_SCSI3_WWN| |200000203714c427|2:/dev/rdsk/c1t20d0|1:/dev/rdsk/c1t20d0
13      disk|DEVID_SCSI3_WWN| |20000020370d10e2|2:/dev/rdsk/c1t5d0|1:/dev/rdsk/c1t5d0
14      disk|DEVID_SCSI3_WWN| |20000020370d4094|2:/dev/rdsk/c1t2d0|1:/dev/rdsk/c1t2d0
15      disk|DEVID_SCSI3_WWN| |20000020370d3ed9|2:/dev/rdsk/c1t6d0|1:/dev/rdsk/c1t6d0
16      disk|DEVID_SCSI3_WWN| |20000020370d4039|2:/dev/rdsk/c1t18d0|1:/dev/rdsk/c1t18d0
17      disk|DEVID_SCSI_SERIAL|IBM     DNES30917SUN9.0G1QK087          |49424d2020202020444e4553333039313753554e392e304731514b30383720202020202020202020|1:/dev/rdsk/c2t10d0
18      disk|DEVID_SCSI_SERIAL|IBM     DNES30917SUN9.0G1QM765          |49424d2020202020444e4553333039313753554e392e304731514d37363520202020202020202020|1:/dev/rdsk/c2t11d0
8191    tape||||1:/dev/rmt/0

All CCR files start with a gennum (generation number) on the first line and a checksum on the second line. These files are indeed checksum protected and should NOT be edited manually. You CAN edit them on some occasions, but if that is required you will need to contact Sun for assistance.
Let us look at one of these lines:

13      disk|DEVID_SCSI3_WWN| |20000020370d10e2|2:/dev/rdsk/c1t5d0|1:/dev/rdsk/c1t5d0

This is the entry for DID 13, which we can see in the output of scdidadm -L as follows:

13       moon1:/dev/rdsk/c1t5d0         /dev/did/rdsk/d13
13       moon2:/dev/rdsk/c1t5d0         /dev/did/rdsk/d13

The second field is 'disk'. This is the type of DID device. These types are defined in the file /etc/cluster/ccr/did_types and right now you have 'disk' and 'tape'.
The third field is 'DEVID_SCSI3_WWN'. This defines the type of device ID that this device provides. Each DID device is normally identified by a unique ID, such as a serial number or a WWN.
The fourth field is blank for this disk; for the DEVID_SCSI_SERIAL entries in the listing above it holds the readable vendor and serial string.
The actual device ID is in the fifth field, in this case 20000020370d10e2. This also means that this disk is uniquely identified in the DID database and we cannot just replace it with another disk without telling the cluster about it.
The sixth field is 2:/dev/rdsk/c1t5d0. This means that on the node with nodeid 2, this disk is referred to as 'c1t5d0'.
The seventh field is 1:/dev/rdsk/c1t5d0. This means that on the node with nodeid 1, this disk is referred to as 'c1t5d0'. Please be aware that these names may differ between nodes. The DID layer uses the device ID to make sure we are talking about the same disk, even if it has different Solaris names on the different nodes.
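
To make the layout of such a line concrete, here is a minimal read-only parsing sketch in Python. The DidEntry class and the parse_did_line function are hypothetical names invented for this illustration, and the format is inferred purely from the sample listing above, so treat it as a reading aid rather than a specification. Note that a literal split on '|' also yields the field between the ID type and the hex ID, which is blank for the WWN entries. In the sample, the hex ID of the DEVID_SCSI_SERIAL entry happens to decode to that readable string; whether this holds in general is an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DidEntry:
    instance: int          # DID number, e.g. 13 for /dev/did/rdsk/d13
    dev_type: str          # 'disk' or 'tape'
    id_type: str           # e.g. 'DEVID_SCSI3_WWN'
    id_ascii: str          # readable ID string (blank for WWN entries)
    id_hex: str            # the device ID itself
    paths: Dict[int, str]  # nodeid -> Solaris device path on that node


def parse_did_line(line: str) -> DidEntry:
    """Parse one data line of did_instances (format inferred from the sample)."""
    instance, rest = line.rstrip("\n").split(None, 1)
    fields = rest.split("|")
    dev_type, id_type, id_ascii, id_hex = fields[:4]
    paths = {}
    for item in fields[4:]:
        nodeid, path = item.split(":", 1)
        paths[int(nodeid)] = path
    return DidEntry(int(instance), dev_type, id_type, id_ascii, id_hex, paths)


wwn_line = "13      disk|DEVID_SCSI3_WWN| |20000020370d10e2|2:/dev/rdsk/c1t5d0|1:/dev/rdsk/c1t5d0"
serial_line = ("19      disk|DEVID_SCSI_SERIAL|SEAGATE ST39102LCSUN9.0GLJW992510000U0010JDH|"
               "5345414741544520535433393130324c4353554e392e30474c4a5739393235313030303055303031304a4448|"
               "2:/dev/rdsk/c0t1d0")

entry = parse_did_line(wwn_line)
serial = parse_did_line(serial_line)

print(entry.instance, entry.id_hex, entry.paths)
print(bytes.fromhex(serial.id_hex).decode("ascii"))
```

Such a parser is only meant for looking at a copy of the file, for example one taken from an explorer; the files under /etc/cluster/ccr themselves should not be touched.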

If you are seeing error messages about DID devices, it may be that you have replaced a disk without following the official procedure for replacing a disk in the cluster. Let us say you replaced the disk at c1t5d0 with another one. You must now tell the cluster to update the DID database entry for DID number 13 with the new device ID. You can do that as follows:
# scdidadm -R c1t5d0
OR:
# scdidadm -R 13

Sometimes the DID configuration is completely messed up, for example because you have been switching cables or controllers without following the correct procedure. To check this, compare the entries in the did_instances file with, for example, the 'diskinfo' output in the explorer. To fix this, contact your Sun Resolution Center.

26 Sep 2005, 11:56:46 MEST Permalink Comments [3]

20050918 Sunday 18 September 2005
Beautiful Belgium?

I used to think of my own Belgium as probably the ugliest place in the world. Mind you, I wouldn't really want to move somewhere else, but that doesn't mean it isn't ugly: too many buildings, bad architecture (not necessarily due to bad architects: everyone is just allowed to build their own house how they want), no nature, and the only nature you will find is agricultural...

However, since we got the dog we have discovered quite a few nice walks in the woods and along rivers, even not so far from where we live. And just recently, since we have to spend our money on the refurbishment of the kitchen, we decided to do a 4-day walking trip in the Ardennes during our vacation and not go abroad like we usually do. The Ardennes is an area in the south of Belgium with lots of woods; you may know it from the 'Battle of the Ardennes'.

Anyway, the hotels we stayed in and the food were crap. However, we did do some great walks and saw some great nature. One day we even did a 16-kilometer walk without seeing another soul! Here are some pictures to prove that Belgium CAN be beautiful.

[Five pictures, hosted by Photobucket.com]

18 Sep 2005, 16:05:34 MEST Permalink Comments [2]

20050823 Tuesday 23 August 2005
24 (CONTAINS SPOILERS)
Well, cocooning has been the next big thing for quite a while now, and what is better than a nice glass of wine and a TV series on DVD? It all started when I bought 24 series 1 two years ago and since then we have seen: 24 series 2 & 3, Blake's 7 series 1 & 2 & 3, The Shield series 1 & 2, The Singing Detective, Twin Peaks (a SHAME that they don't release the episodes after 7 !!), Murder One, Star Trek (the original one with Captain Kirk and friends), quite some Doctor Who, Red Dwarf, and I am sure I am forgetting some...

So 2 weeks ago I got 24 series 4. As this was one of our favourite series it was anxiously awaited and we did have some nice evenings but overall it was a disappointment...

Here's why (contains spoilers):

-It wasn't Real Time !! Chloe for example in one episode does the following in less than 10 minutes: change from pyjamas to real clothes, put on some makeup, make it from her apartment to the CTU office.
And there are many examples of that. The idea of 24 was exactly that it was Real Time, and they should stick to it as best as they can.

-There just was too much going on in 24 hours. Focus on one big issue to be solved in the end and one smaller issue in the beginning like in previous series.

-Some episodes were just plain boring.

-The Chinese consulate storyline was just totally implausible.

-What ever happened to Behrooz? I liked him!!! You cannot have your audience sympathize with a character for some hours and then just let him vanish from the story. That is not how television works...

-I thought that some of the content was not politically correct but that is my personal opinion. There were some meager attempts in the story to avoid that accusation (the 2 brothers helping Jack during the firefight in the shop, Behrooz...) but they were a bit stupid.

OK I AM looking forward to 24 series 5 because 24 has always been good fun but this time it was just less fun than usual.

23 Aug 2005, 12:39:26 MEST Permalink Comments [0]

20050728 Thursday 28 July 2005
About Insanity
So I want to give Coolboy some tips about his puppy and I write a comment on his blog. When I check it out later I do not see it. Fortunately the entry is in my history so I press the back button a couple of times: it seems that to the 'simple math question'

Which was 7 + 39

I answered

49

Really!!

(Do I have to start worrying now?)

28 Jul 2005, 15:48:49 MEST Permalink Comments [1]

20050725 Monday 25 July 2005
So You Got Yourself a Reservation Conflict Panic
For more than 3 years your cluster has been doing fine. Then suddenly you see the following appear on one node's terminal:

[ID 747640 kern.notice] Reservation Conflict

The node has panicked! Fortunately this is a cluster and the other node takes over all services that were running on the panicked node. The panicked node comes up fine and joins the cluster. Still, you would like to know WHY this has happened. So you gather explorers of both nodes and the crash dump that was saved after the panic. You submit them to Sun Service and you wait eagerly for an explanation ....

Unfortunately, it is not as simple as that. As I have discussed in previous blog entries (here, for example, and here), SCSI reservations are used to kick a node out of the cluster if a split brain occurs, or to prevent a node from joining the cluster on its own when it may suffer from amnesia. In both cases, the node that kicked the other node out thought that this other node was dead because it no longer received any heartbeats from it. This may be because the other node really **is** dead or, as in this case, because it was still alive but its peer **thought** it was dead and worth kicking out. This can happen because there was something physically wrong with the interconnect links (a power outage on the switches, for example), or because one of the nodes was unable to send or receive heartbeats, for example because it suffers from SunAlert 57666. And here we see our problem: the cause of the reservation conflict on nodea may very well be an issue on nodeb. Therefore, when a reservation conflict occurs and you want a proper Root Cause Analysis, please also generate a live core dump of the 'good' node ASAP and supply this to us as well.

So here is a To-Do list for Reservation Conflicts:

-Ask yourself if there are any other clusters or stand-alone nodes that can access the disks seen by the affected cluster (this is unsupported and can lead to reservation conflicts).
-Generate a live dump on the node that did not experience the reservation conflict. If you are not 100% sure how to do this, engage a Sun engineer.
-If you cannot generate a live dump, engage a Sun engineer who will provide you with a script to gather some vital information on the 'good' node.
-If you are running Solaris 8, please check if all patches to prevent SunAlert 57666 are installed. I CANNOT STRESS HOW IMPORTANT THIS IS!

25 Jul 2005, 15:07:02 MEST Permalink Comments [2]

20050711 Monday 11 July 2005
How to transform a Dobermann into a Poodle
Well, I must admit I do not have much inspiration today to write on my blog. Moreover, work is piling up.
However, you should check this one out: http://www.attackchi.org.au/kits.htm

(It is kind of funny although I am not really sure whether the dog agrees).

11 Jul 2005, 12:02:14 MEST Permalink Comments [0]

20050707 Thursday 07 July 2005
Books, Books, Books
During the months of May and June I strictly allowed myself to think only about my 2 philosophy exams in my spare time. This means that I have only read stuff like Richard Rorty's Contingency, Irony and Solidarity (which was quite good though) and that I had a pile of books staring at me from the shelves. They were my after-exam treats, and now that the exams are over I am anxious to finish them all before the new academic year starts in October. Since I have discovered that I spend far too much money on books anyway, I plan not to allow myself to buy anything new before these are finished.
The books are a combination of books about ethics (I plan to tackle a Master in Applied Ethics course next year in evening classes) and fiction.

I am currently reading:
*Animal Liberation, by Peter Singer
*Solaris, by Stanislaw Lem (NOTHING to do with the Operating System, but a brilliant SF book from 1970)

Here is an overview of what I plan to read:
*A Companion to Ethics, Edited by Peter Singer
*Birdsong, Sebastian Faulks
*Applied Ethics: A Reader, Edited by Jerrold R. Coombs and Earl Winkler
*Omega Minor, by Paul Verhaeghen (Dutch)
*Pleidooi voor een moraal der Dubbelzinnigheid, Simone de Beauvoir
*Animal Ethics Reader, Edited by Susan Armstrong and Richard Botzler
*Paddy Clarke Ha Ha Ha, Roddy Doyle


07 Jul 2005, 12:21:50 MEST Permalink Comments [0]