Thursday December 23, 2004 | Big -vs- Small Servers? | Computers |
Big Iron -vs- Blades. Mainframe -vs- Micro. Hmmm. We're talking Aircraft Carriers -vs- Jet Skis, right?
Sun designs and sells servers that cost from ~$1000 to ~$10 million. Each! We continue to pour billions into R&D and constantly raise the bar on the quality, performance, reliability, and feature set that we deliver in our servers. No wonder we lead in too many categories to mention. Okay, I'll mention some :-)

While
the bar keeps rising on our "Enterprise Class", the Commodity/Volume
Class is never too far behind. In fact, I think it may be inappropriate
to continue to refer to our high-end as our Enterprise-class Servers,
because that could imply that our "Volume" Servers are only for
workgroups or non-mission-critical services. That is hardly the case.
Both are important and play a role in even the most critical service
platforms.

Let's look at the next-generation Opterons, which are only months
away, and at how modern S/W architectures are fueling the adoption of
these types of servers...

Today's AMD CPUs, with on-board HyperTransport pathways, can scale up
to 8 CPUs per server! And in mid-2005, AMD will ship dual-core Opterons.
That means that a server, by mid-2005 or so, could plausibly have 16
Opteron cores (8 dual-core sockets) in just a few rack units of space!!
If you compare SPECrate values, such a server would have the raw
compute performance of a full-up $850K E6800. Wow!
AMD CPU Roadmap: http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_608,00.html
AMD 8-socket Support: http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~72268,00.html
SPECint_rate: http://www.spec.org/cpu2000/results/rint2000.html
E6800 Price: http://tinyurl.com/3xbq2

Clearly, there are many reasons why our customers buy, and will
continue to buy, our large SMP servers. They offer Mainframe-class
on-line maintenance, redundancy, and upgradability. They even exceed
the ability of a Mainframe in terms of raw I/O capability, compute
density, on-the-fly expansion, etc.

But H/W RAS continues to improve in the Opteron line as well. One
feature I hope to see soon is on-the-fly, PFA-orchestrated CPU
off-lining. If this is delivered, it'll be on Solaris x86 rather than
Linux. Predictive Fault Analysis would detect when one of those 16
cores or 32 DIMMs starts to experience soft errors, in time to fence
off that component before the server and all of its services crash.
The blacklisted component could then be serviced at the next scheduled
maintenance event. We can already do that on our Big Iron. But with
that much power, and that many stacked services in a 16-way Opteron
box, it would be nice not to take a node panic and an extended node
outage.
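
To make the idea concrete, here is a minimal sketch of predictive off-lining as a sliding-window count of soft errors per component. It is only an illustration, not how Solaris or any Sun product does it: the class name, the threshold of 3 errors, and the one-hour window are all hypothetical choices.

```python
from collections import defaultdict, deque


class SoftErrorMonitor:
    """Toy predictive-fault monitor: fence off a component once it logs
    too many correctable (soft) errors inside a sliding time window.
    The threshold and window values are illustrative assumptions."""

    def __init__(self, threshold=3, window_seconds=3600.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # component -> error timestamps
        self.blacklist = set()             # components already fenced off

    def record_soft_error(self, component, timestamp):
        """Record one soft error; return True if the component is fenced off now."""
        if component in self.blacklist:
            return False
        events = self._events[component]
        events.append(timestamp)
        # Drop errors that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        if len(events) >= self.threshold:
            self.blacklist.add(component)
            return True
        return False

    def online(self, components):
        """Components still in service, in their original order."""
        return [c for c in components if c not in self.blacklist]


monitor = SoftErrorMonitor(threshold=3, window_seconds=3600.0)
cores = [f"core{i}" for i in range(16)]
for t in (0.0, 600.0, 1200.0):
    fenced = monitor.record_soft_error("core5", t)
remaining = monitor.online(cores)
print(len(remaining), "cores still online; blacklisted:", sorted(monitor.blacklist))
```

In this toy run the third soft error on one core trips the threshold, so the box carries on with 15 cores and the bad one waits for the next maintenance window.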

On the other hand, 80% of the service layers we deploy already follow,
or are attempting to move to, the horizontal model. And modern S/W
architectures are increasingly designed to provide continuity of
service level even in the presence of various fault scenarios. Look at
Oracle RAC, replicated-state App Servers with Web-Server plug-ins to
seamlessly transfer user connections, Load Balanced web services, TP
monitors, Object Brokers, Grid Engines and Task Dispatchers, and SOA
designs in which an alternate for a failed dependency is rebound
on-the-fly.
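
The "rebind an alternate on-the-fly" pattern can be sketched in a few lines. This is a hedged toy, not the mechanism of Oracle RAC or any particular App Server: `FailoverBinding`, `DependencyUnavailable`, and the two node functions are hypothetical names invented for the example.

```python
class DependencyUnavailable(Exception):
    """Raised when a provider (or every provider) cannot serve a request."""


class FailoverBinding:
    """Holds an ordered list of interchangeable providers and rebinds to
    the next one when the current provider fails."""

    def __init__(self, providers):
        if not providers:
            raise ValueError("at least one provider is required")
        self._providers = list(providers)
        self._current = 0

    def call(self, request):
        # Try each provider at most once, starting with the bound one.
        for attempt in range(len(self._providers)):
            index = (self._current + attempt) % len(self._providers)
            try:
                result = self._providers[index](request)
            except DependencyUnavailable:
                continue
            self._current = index  # stay bound to the provider that worked
            return result
        raise DependencyUnavailable("no provider could serve the request")


def failed_node(request):
    # Stand-in for a node that has panicked or dropped off the network.
    raise DependencyUnavailable("node A is down")


def healthy_node(request):
    return f"node B handled {request}"


binding = FailoverBinding([failed_node, healthy_node])
reply = binding.call("order-42")
print(reply)
```

The caller never sees the failure of the first node; it only sees a reply, which is the point of pushing resilience up into the S/W layer.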

These kinds of things, and many others, are used to build resilient
services that are much more immune to component or node failures. In
that regard, node-level RAS is less critical to achieving a service
level objective. Recovery Oriented Computing admits that H/W
fails [http://roc.cs.berkeley.edu/papers/ROC_TR02-1175.pdf].

We do need to reduce the failure rate at the node/component level...
but as Solution Architects, we need to design services such that a
node/component failure can occur, where possible, without a service
interruption or degradation of "significance".
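
A back-of-the-envelope calculation shows why node-level RAS matters less once a service is spread across nodes. The sketch below uses the textbook formula for redundant components and assumes node failures are independent, which real data centers only approximate. The 99% figure is an illustrative input, not a measured number for any server.

```python
def redundant_availability(node_availability, replicas):
    """Probability that at least one of `replicas` nodes is up,
    assuming independent failures (a simplifying assumption)."""
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - node_availability) ** replicas


# Illustrative input: each node is up 99% of the time.
for replicas in (1, 2, 3):
    print(replicas, f"{redundant_availability(0.99, replicas):.6f}")
```

Under those assumptions, two modest 99% nodes already give a service that is up roughly 99.99% of the time, provided any surviving node can carry the load.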

In the brave new world (or, the retro mainframe mindset) we'll stack
services in partitions across a grid of servers. Solaris 10 gives us
breakthrough new Container technology that will provide this option.
Those servers might be huge million-dollar SMP behemoths, or $2K
Opteron blades... doesn't matter from the architectural perspective.
We could have dozens of services running on each server... however,
most individual services will be distributed across partitions
(Containers) on multiple servers, such that a partition panic or node
failure has minimal impact. This is "service consolidation", which
includes server consolidation as a side effect. Not into one massive
server, but across a limited set of networked servers that balance
performance, adaptability, service reliability, etc.
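
Here is a small sketch of that placement idea: every service gets partitions on more than one server, so losing any single node leaves each service with a survivor. It is a toy round-robin scheme under assumed names (`place_services`, `services_lost`, the node and service labels), not how Solaris Containers or any resource manager actually schedules work.

```python
def place_services(services, nodes, replicas=2):
    """Spread each service's partitions over `replicas` distinct nodes,
    rotating the starting node so load is roughly even."""
    if replicas > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    placement = {}
    for i, service in enumerate(services):
        placement[service] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement


def services_lost(placement, failed_node):
    """Services with no surviving partition after `failed_node` goes down."""
    return [s for s, hosts in placement.items()
            if all(h == failed_node for h in hosts)]


nodes = ["node1", "node2", "node3"]
services = ["web", "app", "db", "mail"]
placement = place_services(services, nodes, replicas=2)
for service, hosts in placement.items():
    print(service, hosts)
print("lost if node2 fails:", services_lost(placement, "node2"))
```

With two partitions per service on three nodes, no single node failure takes a whole service down; a real design would also check that the surviving nodes have enough headroom for the displaced load.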

Server RAS matters. Competitive pressure will drive continuous
improvement in quality and feature sets in increasingly powerful and
inexpensive servers. At the same time, new patterns in S/W architecture
will make "grids" of these servers work together to deliver
increasingly reliable services. Interconnect breakthroughs will only
accelerate this trend.
The good news for those of us who love the big iron is that there will always be a need for aircraft carriers even in an age of powerful jet skis.
December 23, 2004 04:57 PM EST