Disruption in the Space/Time Continuum
One way to start collecting some basic availability data is to poll sysUpTime over SNMP. sysUpTime reports the number of timeticks (hundredths of a second) since the machine last rebooted, and it is part of a standard branch of SNMP that’s available on pretty much anything that supports SNMP. Its benefit over other values is that it’s a counter. For instance, if you poll an absolute up/down value on a 5-minute interval and the box reboots 2 minutes in, there is a good chance you will miss the reboot entirely. With sysUpTime you can detect a reboot because the counter fails to advance between polls: it comes back lower than the previous reading. We tend to use sysUpTime along with other metrics as a proxy for certain types of OS availability.
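To make the reboot check concrete, here is a minimal sketch in Python. It assumes you have already collected the sysUpTime readings by whatever poller you use; the function name and the sample values are made up for illustration. It also ignores the case where the 32-bit timetick counter wraps (after roughly 497 days of uptime), which this simple comparison would mistake for a reboot.

```python
def detect_reboot(previous_ticks, current_ticks):
    """Return True if sysUpTime went backwards between two polls.

    A healthy box always reports a larger value than last time, so a
    smaller value means the counter was reset (most likely a reboot).
    """
    return current_ticks < previous_ticks


# Hypothetical readings from a 5-minute poll (30000 timeticks apart).
# The third reading is lower than the second: the box rebooted about
# 2 minutes into that interval and had been up for 3 minutes when polled.
samples = [30000, 60000, 18000, 48000]

# Indexes of the polls at which a reboot was noticed.
reboots = [
    i for i in range(1, len(samples))
    if detect_reboot(samples[i - 1], samples[i])
]
```

An up/down check at the same four poll times would have seen the box "up" every time, which is exactly the reboot it misses.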
Why the Star Trek reference? Well, while working on a customer issue we seemed to be experiencing a temporal anomaly where time can slow down or speed up! The customer is running a shiny new Dell server with Windows 2000. If you poll every two minutes you would expect a constant number of timeticks to pass between polls (with a small variance to cover the overhead of the poll, etc.). On the box in question we have seen time slow down and speed up.
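For the curious, the "constant number of timeticks" check can be sketched like this. SNMP timeticks are hundredths of a second, so a two-minute poll should see about 12000 of them go by. The function names and the 200-tick tolerance are my own assumptions for illustration, not values from the customer's setup.

```python
TICKS_PER_SECOND = 100  # SNMP TimeTicks are hundredths of a second


def tick_drift(prev_ticks, curr_ticks, elapsed_seconds):
    """Observed minus expected timeticks for one poll interval.

    Negative means the box's clock ran slow, positive means it ran fast.
    """
    expected = elapsed_seconds * TICKS_PER_SECOND
    observed = curr_ticks - prev_ticks
    return observed - expected


def is_anomalous(prev_ticks, curr_ticks, elapsed_seconds, tolerance_ticks=200):
    """Flag an interval whose drift exceeds the (assumed) tolerance."""
    drift = tick_drift(prev_ticks, curr_ticks, elapsed_seconds)
    return abs(drift) > tolerance_ticks


# A 120-second interval in which only 11000 ticks were counted:
# the box "lost" 10 seconds, so time appeared to slow down.
slow = tick_drift(500000, 511000, 120)
```

The elapsed time has to come from the poller's own clock, not from the box being polled, otherwise you are comparing the suspect clock against itself.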
This begged deeper investigation, as I wondered if Dell had come up with a new way to improve the throughput of their hardware by utilizing a wormhole. It turns out that the explanation was a lot simpler and didn’t require a call to Dr Hawking. When the box gets busy, it’s not providing enough cycles to keep the counter used by SNMP updated. Then, when it gets a couple of spare cycles, it catches back up. We managed to prove this by instrumenting the box with a management agent, dumping the CPU load every few seconds and cross-referencing it against the timetick readings.
Makes me start to wonder what other “neat” things this hardware can provide. When you look at the specs of the box, it looks pretty reasonable; it has many of the items you would expect in a server. One gaping hole appears when you dig deeper into the architecture: the box has obviously not been designed to be managed. Living in the world of remote management, I look for various elements that allow me to perform my job in harmony with the hardware. Typically the manageability components of a server are referred to as a service processor. This is much more than a KVM switch; it’s a way to poll for, or be alarmed on, fault conditions that can be pushed to an upstream management console for processing. Many high-end service processors are really specialized computers in their own right, with a dedicated CPU, memory, network stack and a real-time OS. This allows an independent view of the hardware that’s not influenced by the running OS. Adding a lights-out board after market is certainly an option, but an ugly hack. The manageability needs to be designed into the hardware to gain the full benefit.
One little-known fact about the Sun Fire V20z (and V40z) servers is that they have a strong service processor embedded into the hardware. This is not simply an afterthought but designed into the system for manageability. The SP has its own CPU (Motorola MPC855T), 64 MB of RAM, a built-in three-port network switch and even an auxiliary power source so you can access the SP functions when the main unit is powered down. Take a look at the whitepaper to see what a well-designed SP looks like.
Certainly not all data center hardware is created equal, even in the entry-level server market. I would strongly recommend that anyone considering a purchase ask their vendor for more info on the manageability of their server before spending a cent.

(they are by no means the end-all-be-all... ESPECIALLY when one is used to the LOM/ALOM/SSP solutions that we have on other SUN SPARC hardware...)
[ Wouldn't you just KILL for an OBP/OK-prompt on these types of systems... ooh a boy can dream... ]
Console connectivity in particular seems really a bit of a hack... Requiring IPMI drivers at the OS level to function.... and different ways of handling the console depending on whether you run SOLARIS or LINUX....
Don't get me wrong, HP's ILO/RILO cards aren't the end-all-be-all either (although for windows-systems they are quite good...) expensive... but quite good. (for CLI they are not that great, but you know what... I can RAS those systems regardless of LINUX, WINDOWS, or SOLARIS.... nice and clean.)
[ RANT-ON ] How come we can't just have a stinkin' serial-port (or LOM/ALOM type network deal), and a bootup BIOS that doesn't expect us to hit F12 or some other such (non-ascii/vt100) nonsense with arrow-keys to navigate a (mangled) ANSI-blue-colored RAID-controller screen.... MAN that irks me....
When it comes down to it, these systems are still very much a windows-type machine with their CONNECT-the-Keyboard-and-watch-the-VGA-monitor type of requirements.
I'm hopeful SUN's forthcoming in-house designed x86 machine (due to their designer's background) will have better remote/CLI/CONSOLE management capability.... something built from the ground-up for UNIX console connectivity.....
[ RANT-OFF ]
Keep up the good work guys.
Posted by 192.223.243.5 on March 06, 2005 at 08:33 PM MST #