Monday Oct 13, 2008
Reliability, Availability, and Serviceability (RAS) in the Sun
SPARC Enterprise T5440 builds upon the solid foundations created for
the Sun SPARC
Enterprise T5140, T5240,
and Sun Fire X4600 M2
servers. The large number of CPU cores available in the T5440 needs
large amounts of I/O capability to balance the design. The physical
design of the X4600 M2 servers was a natural candidate for the new
design – modular CPU and memory cards along with plenty of slots
for I/O expansion. We've also seen good field reliability from the
X4600 M2 servers and their components. The T5440 is an excellent
example of how leveraging the best parts of these other designs has
resulted in a very reliable and serviceable system.
The trade-offs required for scaling from a single board design to
a larger, multiple board design always impact reliability of the
server. Additional connectors and other parts also contribute to
increased failure rates, or lower reliability. On the other hand, the
ability to replace a major component without replacing a whole
motherboard increases serviceability – and lowers operating costs.
The additional parts which enable the system to scale also have an
impact on performance, as some of my colleagues have noted. When
comparing systems on a single aspect of the RAS and performance
spectrum, you can miss important design characteristics, or worse,
misunderstand how the trade-offs impact the overall suitability of a
system. To gain better insight into how to apply highly scalable
systems to a complex task, we prefer to do a performability
analysis.
The T5440 has almost exactly twice the performance capabilities of
the T5220. If you have a workload which previously required four
T5220s with a spare (for availability), then you should be able to
host that workload on only two T5440s, and a spare. Using benchmarks
for sizing is the best way to compare, and we can generally see that
a T5440 is six times more capable than a Sun
Fire V490 server. This will complete a comparable performance
sizing.
On the RAS side, a single T5440 is more reliable than two T5220s,
so there is a reliability gain. But for a performability analysis,
that gain must be weighed against the smaller number of T5440s. For example, if
the workload requires 4 servers and we add a spare, then the system
is considered performant when 4 of 5 servers are available. As we
consolidate onto fewer servers, the model changes accordingly: for 2
servers and a spare, the system is performant when 2 of 3 servers are
available. The reliability gain of using fewer servers can be readily
seen in the number of yearly service calls expected. Fewer servers
tend to mean fewer service calls. The math behind this can become
complicated for large clusters and is arguably counter-intuitive at
times. Fortunately, our RAS modeling tools can handle very
complicated systems relatively easily.
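As a rough illustration of the "k of n servers available" idea above, the sketch below computes the probability that at least k of n independent, identical servers are up. The per-server availability is a hypothetical placeholder, not a figure from our models, and the real models account for much more (repair times, service response, component-level failure modes).

```python
from math import comb

def k_of_n_availability(k: int, n: int, a: float) -> float:
    """Probability that at least k of n independent servers are up,
    given each server is up with probability a (assumed identical)."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))

# Hypothetical per-server availability, chosen only for illustration.
per_server = 0.999

print(f"4 of 5 available: {k_of_n_availability(4, 5, per_server):.8f}")
print(f"2 of 3 available: {k_of_n_availability(2, 3, per_server):.8f}")
```

With the same per-server availability in both cases, the 2-of-3 configuration comes out ahead simply because there are fewer ways for two servers to be down at once (3 pairs versus 10).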
We build availability models for all of our systems and use the
same service parameters to permit easy comparisons. For example, we
would model all systems with 8 hour service response time. The models
are then compared, thusly:
| System | Units | Performability | Yearly Services |
|---|---|---|---|
| Sun SPARC Enterprise T5440 server | 2 + 1 | 0.99999903 | 0.585 |
| Sun SPARC Enterprise T5240 server | 4 + 1 | 0.99999909 | 0.661 |
| Sun SPARC Enterprise T5140 server | 4 + 1 | 0.99999915 | 0.687 |
| Sun Fire V490 server | 12 + 1 | 0.99998644 | 1.402 |
In these results, you can see that the T5440 clearly wins on the number
of units and yearly services. Both of these metrics impact total cost
of ownership (TCO) as the complexity of an environment is generally
attributed to the number of OS instances – fewer servers generally
means fewer OS instances. Fewer service calls means fewer problems
that require physical human interactions.
You can also see that the performability of the T5x40 systems is
very similar. Any of these systems will be much better than a system
of V490 servers.
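To see how fewer units can win on service calls even when each unit is a larger machine, it can help to divide the yearly services by the unit count. This small sketch uses the figures from the table above and assumes that the Yearly Services column is the expected total for the whole group of servers, spare included. That reading is an assumption made for this example.

```python
# Figures copied from the table above: (total units including spare, yearly services).
# Assumption: "Yearly Services" is the expected total for the whole group of servers.
systems = {
    "T5440": (3, 0.585),
    "T5240": (5, 0.661),
    "T5140": (5, 0.687),
    "V490": (13, 1.402),
}

def per_unit_services(units: int, yearly_services: float) -> float:
    """Expected yearly service calls per server under the assumption above."""
    return yearly_services / units

for name, (units, services) in systems.items():
    print(f"{name}: {per_unit_services(units, services):.3f} per unit, "
          f"{services:.3f} total")
```

Under that assumption, each T5440 sees more service calls than each of the smaller servers, as you might expect from a larger machine with more parts, yet the group total is still the lowest because there are so few of them.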
More information on the RAS features of these servers can be found in
the white paper we wrote, Maximizing
IT Service Uptime by Utilizing Dependable Sun SPARC Enterprise T5140,
T5240, and T5440 Servers. Ok, I'll admit that someone else wrote
the title...
Wednesday Apr 09, 2008
Today, Sun introduced two new CMT servers, the Sun
SPARC Enterprise T5140 and T5240
servers.
I'm really excited about this next stage of server development.
Not only have we effectively doubled the performance capacity of the
system, we did so without significantly decreasing the reliability.
When we try to predict reliability of products which are being
designed, we make those predictions based on previous generation
systems. At Sun, we make these predictions at the component level.
Over the years we have collected detailed failure rate data for a
large variety of electronic components as used in the environments
often found at our customer sites. We use these component failure
rates to determine the failure rate of collections of components. For
example, a motherboard may have more than 2,000 components:
capacitors, resistors, integrated circuits, etc. The key to improving
motherboard reliability is, quite simply, to reduce the number of
components. There is some practical limit, though, because we could
remove many of the capacitors, but that would compromise signal
integrity and performance -- not a good trade-off. The big
difference in the open source UltraSPARC
T2 and UltraSPARC
T2plus processors is the high level of integration onto the chip.
They really are systems on a chip, which means that we need very few
additional components to complete a server design. Fewer components
means better reliability, a win-win situation. On average, the T5140
and T5240 only add about 12% more components over the T5120 and T5220
designs. But considering that you get two or four times as many
disks, twice as many DIMM slots, and twice the computing power, this
is a very reasonable trade-off.
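The component-level approach described above can be sketched as a simple parts-count calculation: if any component failure takes down the board, the board's failure rate is approximately the sum of the component failure rates. The component types, counts, and FIT values (failures per billion hours) below are made up for illustration and are not Sun data.

```python
HOURS_PER_YEAR = 8760

def board_failure_rate_fit(parts: dict[str, tuple[int, float]]) -> float:
    """Sum component failure rates for a series system.
    parts maps a component type to (count, FIT per component)."""
    return sum(count * fit for count, fit in parts.values())

def annual_failure_rate(fit: float) -> float:
    """Convert FIT (failures per 1e9 hours) to expected failures per year."""
    return fit * HOURS_PER_YEAR / 1e9

# Hypothetical bill of materials; values are placeholders, not measured data.
baseline = {
    "capacitor": (1200, 0.5),
    "resistor": (700, 0.2),
    "integrated_circuit": (100, 20.0),
}
# Same board with 12% more of every component type (a simplifying assumption).
larger = {name: (round(count * 1.12), fit) for name, (count, fit) in baseline.items()}

for label, parts in (("baseline", baseline), ("12% more parts", larger)):
    fit = board_failure_rate_fit(parts)
    print(f"{label}: {fit:.0f} FIT, {annual_failure_rate(fit):.4f} failures/year")
```

In this simplified model the failure rate grows in proportion to the part count, which is why a small increase in components for a doubling of capacity is an attractive trade.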
Let's take a look at the system block diagram to see where all of
the major components live.

You will notice that the two PCI-e switches are peers and not
cascaded. This allows good flexibility and fault isolation. Compared
to the cascaded switches in the T5120 and T5220 servers, this is a
simpler design. Simple is good for RAS.
You will also notice that we use the same LSI1068E SAS/SATA
controller with onboard RAID. The T5140 is limited to 4 disk bays,
but the T5240 can accommodate 16 disk bays. This gives plenty of disk
targets for implementing a number of different RAID schemes. I
recommend at least some redundancy, dual parity if possible.
Some people have commented that the Neptune Ethernet chip, which
provides dual-10Gb Ethernet or quad-1Gb Ethernet interfaces, is a
single point of failure. There is also one quad GbE PHY chip. The
Neptune is there to begin with because, when we
implemented the coherency links in the UltraSPARC T2plus processor, we
had to sacrifice the built-in Neptune interface which is available in
the UltraSPARC T2 processor. Moore's Law assures us that this is a
somewhat temporary condition and soon we'll be able to cram even more
transistors onto a chip. This is a case where high integration is
apparent in the packaging. Even though all four GbE ports connect to
a single package, the electronics inside the package are still
isolated. In other words, we don't consider the PHY to be a single
point of failure because the failure modes do not cross the isolation
boundaries. Of course, if your Ethernet gets struck by lightning,
there may be a lot of damage to the server, so there is always the
possibility that a single event will create massive damage. But for
the more common cabling problems, the system offers suitable
isolation. If you are really paranoid about this, then you can
purchase a PCI-e card version of the Neptune and put it in PCI-e slot
1, 2, or 3 to ensure that it uses the other PCI-e switch.
The ILOM service processor is the same as we use in most of our
other small servers and has been a very reliable part of our systems.
It is connected to the rest of the system through an FPGA which
manages all of the service bus connections. This allows the service
processor to be the serviceability interface for the entire server.
The server also uses ECC FB-DIMMs with Extended ECC, which is
another common theme in Sun servers. We have recently been studying
the effects of Solaris Fault Management Architecture and Extended ECC
on systems in the field and I am happy to report that this
combination provides much better system resiliency than is possible
through the individual features alone. In RAS, the whole can be much better
than the sum of the parts.
For more information on the RAS features of the new T5140 and
T5240 servers, see the white paper, Maximizing
IT Service Uptime by Utilizing Dependable Sun SPARC Enterprise T5140
and T5240 Servers. The white paper has results of our RAS
benchmarks as well as some performability
calculations.
Tuesday Oct 09, 2007
In complex systems, we must often trade off performance against
reliability, availability, or serviceability. In many cases, a system
design will include both performance and availability requirements.
We use performability analysis to examine the
performance versus availability trade-off. Performability is simply
the ability to perform. A performability analysis combines
performance characterization for systems under the possible
combinations of degraded states with the probability that the system
will be operating in those degraded states.
The simplest performability analysis is often appropriate for
multiple node, shared nothing clusters which scale performance
perfectly. For example, in a simple web server farm, you might have N
servers capable of delivering M pages per server. Disregarding other
bottlenecks in the system, such as the capacity of the internet
connection to the server farm, we can say that N+1 servers will
deliver M*(N+1) performance. Thus we can estimate the aggregate
performance of any number of web servers.
We can also perform an availability analysis on a web server. We
can build Markov models which consider the reliability of the
components in a server and their expected time to repair. The output
of the models will provide the estimated time per year that each web
server may be operational. More specifically, we will know the
staying time per year for each of the model states. For a simple
model, the performance reward for an up state is M and a down
state is 0. A system which provides 99.99% (four-nines) availability
can be expected to be down for approximately 53 minutes per year and
up for the remainder.
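For the simplest case, a single server with one up state and one down state, the steady-state availability reduces to mean up time divided by mean up time plus MTTR. The sketch below uses that textbook two-state result rather than a full Markov model, and the interruption and repair figures for the example server are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def steady_state_availability(mean_up_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time in the up state for a two-state (up/down) model."""
    return mean_up_hours / (mean_up_hours + mttr_hours)

def annual_downtime_minutes(availability: float) -> float:
    """Expected minutes per year spent in the down state."""
    return (1.0 - availability) * MINUTES_PER_YEAR

def expected_reward(availability: float, m: float) -> float:
    """Average performance when the up state earns M and the down state earns 0."""
    return availability * m

# Four-nines availability, as in the text.
print(f"{annual_downtime_minutes(0.9999):.2f} minutes down per year")

# Hypothetical server: 20,000 hours between interruptions, 8 hour repair.
a = steady_state_availability(mean_up_hours=20_000, mttr_hours=8)
print(f"availability {a:.6f}, downtime {annual_downtime_minutes(a):.1f} min/year, "
      f"average reward {expected_reward(a, m=1000):.1f} pages")
```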
For a shared nothing cluster, we can further simplify the analysis
by ignoring common fault effects. In practice, this means that a
failure or repair in one web server does not affect any other web
servers. In many respects, this is the same simplifying assumption we
made with performance, where the performance of a web server is
independent of the other web servers.
The shared nothing cluster availability model will contain the
following system states and the annual staying time in each state:
all up, one down (N-1 up), two down (N-2 up), three down (N-3 up),
and so on. The availability model inputs include the unscheduled mean
time between system interruption (U_MTBSI) and mean time to repair
(MTTR) for the nodes. We often choose an MTTR value by considering
the cost of service response time. For many shared nothing clusters,
a service response time of 48 hours may be reasonable – a value
which may not be reasonable for a database or storage tier. Model
results might look like this:
| System State | Annual Staying Time (minutes) | Cumulative Uptime (%) | Performance Reward |
|---|---|---|---|
| All up | 521,395.20 | 99.2 | M * N |
| 1 down | 4,162.75 | 99.992 | M * (N - 1) |
| 2 down | 39.95 | 99.9996 | M * (N - 2) |
| 3 down | 2.00 | 99.99998 | M * (N - 3) |
| > 3 down | 0.11 | 100 | < M * (N - 4) |
| Total | 525,600.00 | 100 | |
Now we have enough data to evaluate the performability of the
system. For the simple analysis, we accept the cumulative uptime
result for the minimum required performance. We can then compare
various systems considering performability.
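The cumulative uptime column can be recomputed from the staying times, which makes the step of accepting the cumulative uptime for the minimum required performance more concrete. The sketch below uses the numbers from the example table above. The function name and the way the requirement is expressed (at most k nodes down) are illustrative choices, not part of our modeling tools.

```python
from itertools import accumulate

MINUTES_PER_YEAR = 525_600.0

# Annual staying time in minutes, indexed by number of nodes down (0, 1, 2, 3),
# copied from the example table. The "> 3 down" row is left out because it
# never satisfies a requirement of "at most k nodes down" for k <= 3.
staying_minutes = [521_395.20, 4_162.75, 39.95, 2.00]

def performability(max_nodes_down: int) -> float:
    """Percent of the year spent with at most max_nodes_down nodes down."""
    cumulative = list(accumulate(staying_minutes))
    return 100.0 * cumulative[max_nodes_down] / MINUTES_PER_YEAR

for k in range(len(staying_minutes)):
    print(f"at most {k} down: {performability(k):.5f}%")
```

If the cluster is provisioned with one spare, the minimum required performance is met with at most one node down, so the performability for this example would be the 1-down figure of roughly 99.992%.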
We have modeled the new Sun SPARC Enterprise T5120 and Sun SPARC
Enterprise T5220 servers against the venerable Sun Fire V490 servers.
For this analysis we chose a performance benchmark with a metric that
showed we needed 6 T5120 or T5220 servers to match the performance of
9 V490 servers. We will choose to overprovision by one server, which
is often optimum for such architectures. The performability results
are:
| Servers | Units | Performability (%) |
|---|---|---|
| Sun SPARC Enterprise T5120 | 6 + 1 | 99.99988 |
| Sun SPARC Enterprise T5220 | 6 + 1 | 99.99988 |
| Sun Fire V490 | 9 + 1 | 99.99893 |
You might notice that the T5120 and T5220 have the same
performability results. This is because they share the same
motherboard design, disks, power supplies, etc. It is much more
interesting to compare these to the V490. Even though we use more
V490 systems, the T5120 and T5220 solution provides better
performability. Fewer, faster, more reliable servers should generally
have better performability than more, slower, less reliable servers.