Paul Humphreys rambles on....
News and Views

Sunday October 17, 2004

Outages, notifications, Sunrays, clusters, UPS and generator sets...

OK, here is a question for you. What do all of the above have in common? Well, they are all about providing a reliable service, using whatever technology is available to make sure the service stays up for as long as YOU want it to.

While I was working in Paris last week an annoying thing happened, one which is a common occurrence in many parts of the world. We had a power cut. Of course Sun has been paying its electric bill, so it was an unforeseen outage. The power came back on quite quickly, but it was a while before I could use my Sunray again (there was some kind of network problem when power was restored). It is interesting to consider this event in terms of availability, and what you expect and need. Sunrays, as it turns out, improve availability for users if you do sensible things like putting the server on a UPS. The whole building can tip over when you have a power blip, but as long as you keep power to the Sunray server (using a UPS etc.), then as soon as building power is restored and the network switches and Sunrays boot up, your session will be restored as it was. The sad people who have PCs or desktops will lose what they were doing and, worse still, may have to coax their machines back into life - if they know how to do so. For Sunray users, if the unit fails to survive the power blip you just plug another in its place. As simple as plugging in a kettle.

On the Sun campus where we work there are several Sunray servers in a failover group. So if one fails (which does not happen very often) you lose your session, but you can soon establish another one. The file servers and mail servers are clustered, so if one node fails you move onto the other - stateless. A UPS and a diesel generator improve the uptime even more. I could add, of course, that by using Ethernet IPMP and dynamic reconfiguration you can protect yourself further still.

However, in this world of goodness there is a problem. It has been there since the day I sent a message from the console of a Prime 850 computer to its users, telling them it was on fire, to persuade them to log off ASAP. It had the desired effect on those looking at their VDUs at the time.

The problem is this. If an application is failing, or there is a non-terminal problem with your computer system or network, you can let users know. You can use email, or on Sunrays there is a very clever program called utwall, which allows messages or sounds to be sent to the appliances. A toilet flush would be a good one to indicate impending disaster...

The problem is when things are really down. Users will have an OCF display on their Sunrays which tells them the server is not working, and if they know the OCF numbers/displays they will have a reasonable idea of what is going on. But this is my problem: how can you communicate with users and tell them what is really going on and what you are doing about it - before they start coming to see you or calling you up - when you need the time to fix the problem you are dealing with? In the Prime days we always thought of an LED panel, like a set of traffic lights, indicating the health of a machine. Netconnect can tell you when a system is down. But how do you get that message to users when their computers, which rely on the server, are down too? Answers please!
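The traffic-light idea itself can be sketched in a few lines of Python. This is only an illustration and does not answer the harder question of how to get the result in front of users: the service names, the `critical` set and both function names are made up for the example, and the up/down results are assumed to come from whatever monitoring you already have.

```python
from enum import Enum


class Light(Enum):
    GREEN = "green"  # every checked service is up
    AMBER = "amber"  # something is down, but nothing critical
    RED = "red"      # at least one critical service is down


def overall_light(checks, critical):
    """Reduce per-service up/down results to one traffic-light colour.

    checks   -- dict mapping a (hypothetical) service name to True/False
    critical -- names of services whose failure stops users working
    """
    down = {name for name, up in checks.items() if not up}
    if down & set(critical):
        return Light.RED
    if down:
        return Light.AMBER
    return Light.GREEN


def status_line(checks, critical, note=""):
    """Build the one-line message a status board could display."""
    light = overall_light(checks, critical)
    down = sorted(name for name, up in checks.items() if not up)
    text = light.value.upper()
    if down:
        text += ": down: " + ", ".join(down)
    if note:
        text += " (" + note + ")"
    return text
```

The optional note is where "what you are doing about it" would go. The board showing the line would have to sit on something independent of the failed server, which is exactly the open question.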

( Oct 17 2004, 08:00:00 AM PDT )
