BM Seer Unofficial thoughts from an anonymous Sun employee

IBM p595 reboot 60 to 90 min

Tuesday Nov 25, 2008

Heard about rebooting power6? Here is some info from:
www.pseriestech.org/forum/pseries-system-p/p595-start-up-2531.html

Q: "hi! has anyone experienced a start-up of the p595 that takes 1 hour?"

A: "Hey, I manage 7 p595 squadrons and they are all different. For the most part, the startup time depends on the amount of I/O you have installed. Two of mine are 3 frame p595's with 64 hba, 64 nic, about 20 internal HDD, dual power systems and I think 10 d20 I/O drawers. These servers take about 90 minutes just to initialise the hardware, not including starting any LPAR's. So tell your friend that this is normal and take a good book and a really large coffee with him each time he needs a restart"

Comments:

Not unlike large Sun frames, a p595 frame rarely has to be restarted. Even individual LPARs can be run a LONG time between reboots given mature operational practices such as strict change control, anticipatory capacity planning, limited privileged access, a limited number of tested "gold OS/App images", scheduled and tested revisions, well planned floor placement, etc. All the common sense stuff. So, yes, bring a good book. But, like the Maytag Repair Man, it is best to enjoy the time between events, rather than the wait time during state changes within an event.

Posted by Dave Brillhart on November 25, 2008 at 07:57 AM PST #

Not to mention concurrent updates, concurrent firmware, capacity upgrade on demand, virtualization mobility etc etc.

You're scraping the FUD barrel now, Mr Seer.

Posted by Alex on November 25, 2008 at 09:31 AM PST #

Where is the Fear, Uncertainty and Doubt (FUD)? This is simply a pointer to facts that are hard to find.

Posted by BM Seer on November 25, 2008 at 09:40 AM PST #

Why are you posting this comment? It's pure FUD, nothing more, and nothing less.

So a p595 with large numbers of devices installed takes an hour to boot. SO WHAT ??? This isn't a system you reboot very often, if at all.

I would expect nothing less from a similar spec'd system from another vendor.

In summary, another Bum Steer, from BM Seer !

Posted by Alex on November 25, 2008 at 09:50 AM PST #

"The Alex doth protest too much, methinks."

I simply pointed to info from a public website in full context. Wow, must you smear me with each of your comments?

Posted by BM Seer on November 25, 2008 at 10:09 AM PST #

How long does it take a single domain starcat to post, huh?

Posted by Andre on November 25, 2008 at 08:50 PM PST #

I remember when, back in 1995 while I was an undergrad, I visited a friend's workplace, a large state bank IT facility. They had a legacy mainframe whose boot process took several minutes just to load its *firmware* from a huge floppy disk (5.25 inch or bigger, I don't remember).

Now on IBM's (or anybody's) modern machines: I'm not an expert in these mammoth machines, but I feel something is wrong if a boot takes 1.5 h on current tech. If the system has hundreds of I/O devices, this shouldn't be any excuse, because the OS should be able to initialize most stuff in parallel. Now, if somebody tells me that AIX's or z/OS's boot process is completely serialized (single-threaded) I will laugh in his face; we're in 2008 and concurrency is an essential quality of ANY kind of system software.

Posted by Osvaldo Pinali Doederlein on November 26, 2008 at 08:23 AM PST #

Osvaldo, a P595 of the spec mentioned is an enormous beast of a system.

You seriously cannot expect such a system to boot in the time it takes a laptop to boot?

And as already stated, this isn't an issue because such systems are designed so that booting them is a very rare event.

Posted by Alex on November 26, 2008 at 09:33 AM PST #

Alex: Granted that it's a huge box, but I just need to understand the reasons why its boot is so slow. My Windows Vista laptop takes some 30s to boot; some people complain that such times are slow, but I know everything it does: reading over a gigabyte of system files, loading and starting a couple hundred drivers and services, performing Plug-and-Play discovery, initializing devices and connections, checking that all filesystems are clean, etc. Now, a mainframe is not just a large PC - I don't think the P595 boots in 90 minutes because it's loading 10,000 drivers ;-) And I understand the argument that a P595 box is not restarted often, but this is no excuse for having a boot process stupid enough to start a hundred disk devices, NICs and other stuff, one at a time (if this is the reason for the zombie-slow boot). Most OSes are forced to start their kernels in serial mode because it's risky and perhaps impossible to parallelize critical tasks like loading kernel-mode drivers or allocating interrupts and other low-level resources, not to mention initializing the physical disk that hosts the OS itself. But on that big bad IBM box, I suppose the core OS relies only on a tiny fraction of the entire hardware pool to load, and the remaining 99% can be initialized after the OS is fully loaded up to userland, right?

Posted by Osvaldo Pinali Doederlein on November 26, 2008 at 12:39 PM PST #

It has nothing to do with the OS. The long boot-up is essentially due to the firmware POST of all of the devices: disks, adapters, I/O drawers and so on.

As far as I'm aware, it's the same with all other similar size/spec systems.

If you, or Bum Steer, have evidence to the contrary, let's hear it.

Posted by Alex on November 27, 2008 at 02:44 AM PST #

I fully admit my ignorance about big iron. I write many apps that run on an LPAR of some big IBM/AIX machine, but these are servers that not even my client can touch; they're hidden in a datacenter that only IBM support staff can admin (I think that's the rule for huge support contracts). So it's the POST that takes so much time, that's fine, but you fail to explain why this process can't execute in parallel over hundreds of devices. Does the firmware wait for each disk #N to turn on, POST, perhaps fsck, before starting the same procedure on disk #N+1? This is insane. I understand that the system has a hierarchy, so you can only init a hard disk after you init its controller, and the controller after its bus and so on, but this only means that initialization parallelism is limited by the height of the hierarchy. Assuming independent devices on each layer, you first init all layer-1 devices, then all layer-2 devices and so on. Well, perhaps I should just design a better mainframe and put IBM and the other guys out of business. ;-)

Posted by Osvaldo Pinali Doederlein on November 27, 2008 at 08:27 AM PST #
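
To make the layer-by-layer argument in the comment above concrete, here is a minimal sketch in Python. It is only arithmetic under stated assumptions: the layer contents and per-device POST times are invented for illustration and are not measurements from a p595, and the function names `serial_time` and `layered_parallel_time` are hypothetical. It compares initializing every device one at a time with initializing all devices of a layer together before moving to the next layer.

```python
# Hypothetical device tree: layer number -> POST time in seconds per device.
# All numbers are made up for illustration; they are not p595 measurements.
layers = {
    1: [120] * 4,   # e.g. I/O drawers
    2: [30] * 16,   # e.g. adapters
    3: [10] * 64,   # e.g. disks
}

def serial_time(layers):
    """Total time if every device is initialized one after another."""
    return sum(sum(times) for times in layers.values())

def layered_parallel_time(layers):
    """Total time if all devices within a layer start together and the
    next layer only begins once the slowest device of this layer is done."""
    return sum(max(times) for times in layers.values())

print("serial:", serial_time(layers), "s")
print("layer-parallel:", layered_parallel_time(layers), "s")
```

Under these assumptions the layered total depends only on the number of layers and the slowest device in each, which is the point the comment makes. Real firmware may have constraints (shared buses, power sequencing, diagnostics) that this sketch ignores.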

Seriously, I don't see the FUD. FUD is typically a lie, or twisting the truth so it sounds more bad than it is. I don't see how text copied from testimonies in IBM forums can be a lie. Where is the lying in citing someone else? OK, if the person never said those things, then it is a lie and FUD. But here we all see the source. It was never a lie. Maybe those who think that this is a lie should reconsider what lying means. Typically, when you lie, you NEVER give links to your sources, because you cannot: it was a lie, so no links or sources exist. So if you state something without being able to back it up, then it is typically a lie.

For instance, if you state that Sun machines require reboots all the time without giving sources, then it is you who is lying. And if you have links and reports that support you, then it is an indication that you are telling the truth. So show us the money: where are your links?

Posted by Kebabbert on November 29, 2008 at 07:12 AM PST #

The key line there, in your own words:

"twisting the truth so it sounds more bad than it is."

I would argue that was BM Seer's motive for re-posting that link. The fact that it makes very little difference in the real world, and that there is unlikely to be much of a difference from a similar size machine from another vendor, leads me to suspect that this was just another attempt by BMS to discredit all things IBM.

Posted by Alex on December 01, 2008 at 08:06 AM PST #

Alex, no FUD except you calling FUD to try to discredit me whenever I post facts.

I post benchmark facts and benchmark comparisons. Only when I post things about IBM will you smear me with:

"marketing droid" - I'm an engineer

FUD = I show facts and benchmark comparisons.

If you want comparisons to big Sun servers, maybe another Sun engineer who has measured reboot will post a comparison. I haven't measured that particular fact myself, yet.

Posted by BM Seer on December 01, 2008 at 10:30 AM PST #

In which case, you have no data to show it's good, bad or no different.

So why post it?

Posted by Alex on December 01, 2008 at 10:47 AM PST #

I did have IBM's data.

More importantly why did you want it hidden?

Posted by BM Seer on December 01, 2008 at 11:18 AM PST #

You know what, I really couldn't care less.

Carry on posting your rubbish. I won't be reading it anymore.

Posted by Alex on December 02, 2008 at 03:07 AM PST #

What I'd like to know is why Alex was reading this blog in the first place... It appears that he hasn't enjoyed any of the posts in the past... I imagine that whatever the reason, for that same reason, he won't be able to stay away. :)

Posted by jruss on December 02, 2008 at 06:20 AM PST #

Alex, if you really want to see what FUD looks like, look at this blog that comments on OFFICIAL IBM FUD and messing about with facts.

http://www.c0t0d0s0.org/permalink/IBM-benchmarketing-now-with-100%25-more-nonsense.html

Posted by BM Seer on December 02, 2008 at 11:30 AM PST #
