Monday Oct 13, 2008
Reliability, Availability, and Serviceability (RAS) in the Sun
SPARC Enterprise T5440 builds upon the solid foundations created for
the Sun SPARC
Enterprise T5140, T5240,
and Sun Fire X4600 M2
servers. The large number of CPU cores available in the T5440 needs
large amounts of I/O capability to balance the design. The physical
design of the X4600 M2 servers was a natural candidate for the new
design – modular CPU and memory cards along with plenty of slots
for I/O expansion. We've also seen good field reliability from the
X4600 M2 servers and their components. The T5440 is an excellent
example of how leveraging the best parts of these other designs has
resulted in a very reliable and serviceable system.
The trade-offs required for scaling from a single board design to
a larger, multiple board design always impact reliability of the
server. Additional connectors and other parts also contribute to
increased failure rates, or lower reliability. On the other hand, the
ability to replace a major component without replacing a whole
motherboard increases serviceability – and lowers operating costs.
The additional parts which enable the system to scale also have an
impact on performance, as some of my colleagues have noted. When
comparing systems on a single aspect of the RAS and performance
spectrum, you can miss important design characteristics, or worse,
misunderstand how the trade-offs impact the overall suitability of a
system. To get better insight into how to apply highly scalable
systems to a complex task, we prefer to do a performability
analysis.
The T5440 has almost exactly twice the performance capabilities of
the T5220. If you have a workload which previously required four
T5220s with a spare (for availability), then you should be able to
host that workload on only two T5440s, and a spare. Using benchmarks
for sizing is the best way to compare, and we can generally see that
a T5440 is six times more capable than a Sun
Fire V490 server. That completes a comparable performance
sizing.
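As a rough sketch of that sizing arithmetic, assuming the 2x and 6x
ratios quoted above hold for your workload (only a benchmark can
confirm that), the unit counts fall out of a simple division. The
names RELATIVE_CAPACITY and servers_needed are hypothetical helpers
made up for this illustration, not part of any Sun tool.

```python
import math

# Relative capacity per server, normalized so that one T5220 = 1.0.
# Assumption: the ratios are the rough ones quoted in the text
# (T5440 ~ 2x a T5220, T5440 ~ 6x a V490).
RELATIVE_CAPACITY = {
    "T5440": 2.0,
    "T5220": 1.0,
    "V490": 2.0 / 6.0,
}


def servers_needed(workload, model, spares=1):
    """Return (active, spares) needed to carry `workload`,
    where `workload` is expressed in T5220-equivalents."""
    # Round before ceil() so floating-point noise such as
    # 12.000000000000002 does not add a phantom server.
    active = math.ceil(round(workload / RELATIVE_CAPACITY[model], 9))
    return active, spares


workload = 4.0  # a workload that previously needed four T5220s
for model in RELATIVE_CAPACITY:
    active, spares = servers_needed(workload, model)
    print(f"{model}: {active} + {spares}")
```

The "N + 1" form printed here is the same notation used for the
Units column in the availability comparison.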
On the RAS side, a single T5440 is more reliable than two T5220s,
so there is a reliability gain. But for a performability analysis,
that gain must be weighed against the smaller number of T5440s. For example, if
the workload requires 4 servers and we add a spare, then the system
is considered performant when 4 of 5 servers are available. As we
consolidate onto fewer servers, the model changes accordingly: for 2
servers and a spare, the system is performant when 2 of 3 servers are
available. The reliability gain of using fewer servers can be readily
seen in the number of yearly service calls expected. Fewer servers
tend to mean fewer service calls. The math behind this can become
complicated for large clusters and is arguably counter-intuitive at
times. Fortunately, our RAS modeling tools can handle very
complicated systems relatively easily.
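To make the "k of n servers available" idea concrete, here is a
minimal sketch of a textbook k-of-n availability calculation. It is
not our RAS modeling tool. It assumes identical, independent servers
with constant failure and repair rates, and the MTBF and MTTR values
are hypothetical numbers picked only for illustration.

```python
from math import comb

HOURS_PER_YEAR = 8760


def unit_availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single server."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def k_of_n_availability(k, n, a):
    """Probability that at least k of n independent servers are up,
    given per-server availability a (binomial model)."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i)
               for i in range(k, n + 1))


def yearly_service_calls(n, mtbf_hours):
    """Expected failures per year across n powered-on servers."""
    return n * HOURS_PER_YEAR / mtbf_hours


# Hypothetical per-server figures, NOT measured data.
mtbf, mttr = 50_000.0, 12.0
a = unit_availability(mtbf, mttr)

for k, n in [(4, 5), (2, 3)]:
    print(f"{k} of {n}: availability={k_of_n_availability(k, n, a):.10f}, "
          f"service calls/year={yearly_service_calls(n, mtbf):.3f}")
```

In a real comparison the per-server MTBF differs between models,
because a bigger server has more parts, so both columns move when
you consolidate.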
We build availability models for all of our systems and use the
same service parameters to permit easy comparisons. For example, we
would model all systems with 8 hour service response time. The models
are then compared, thusly:
| System | Units | Performability | Yearly Services |
|---|---|---|---|
| Sun SPARC Enterprise 5440 server | 2 + 1 | 0.99999903 | 0.585 |
| Sun SPARC Enterprise 5240 server | 4 + 1 | 0.99999909 | 0.661 |
| Sun SPARC Enterprise 5140 server | 4 + 1 | 0.99999915 | 0.687 |
| Sun Fire V490 server | 12 + 1 | 0.99998644 | 1.402 |
In these results, you can see that the T5440 clearly wins on the
number of units and yearly services. Both of these metrics impact total cost
of ownership (TCO) as the complexity of an environment is generally
attributed to the number of OS instances – fewer servers generally
mean fewer OS instances. Fewer service calls mean fewer problems
that require physical human interactions.
You can also see that the performability of the T5x40 systems is
very similar. Any of these systems will be much better than a system
of V490 servers.
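A string of nines is hard to read at a glance, so here is a small
sketch that converts the performability figures from the table above
into minutes per year. It assumes performability can be read as the
steady-state fraction of time the system delivers the required
capacity, which is a simplification. The short system labels and the
function name are mine.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

# (units, performability, yearly services), copied from the table above.
RESULTS = {
    "T5440": ("2 + 1", 0.99999903, 0.585),
    "T5240": ("4 + 1", 0.99999909, 0.661),
    "T5140": ("4 + 1", 0.99999915, 0.687),
    "V490": ("12 + 1", 0.99998644, 1.402),
}


def shortfall_minutes_per_year(performability):
    """Expected minutes per year below the required capacity,
    assuming performability is a steady-state time fraction."""
    return (1.0 - performability) * MINUTES_PER_YEAR


for system, (units, perf, services) in RESULTS.items():
    minutes = shortfall_minutes_per_year(perf)
    print(f"{system:6s} {units:7s} {minutes:6.2f} min/year, "
          f"{services:.3f} service calls/year")
```

Seen this way, the three T5x40 configurations land within a few
seconds of each other per year, while the V490 configuration is
measured in minutes.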
More information on the RAS features of these servers can be found in
the white paper we wrote, Maximizing
IT Service Uptime by Utilizing Dependable Sun SPARC Enterprise T5140,
T5240, and T5440 Servers. Ok, I'll admit that someone else wrote
the title...
Wednesday Aug 20, 2008
Over the past few years, a number of people have been working to develop benchmarks for dependability of computer systems. After all, why should the performance guys have all of the fun? We've collected a number of papers on the subject in a new book, Dependability Benchmarking for Computer Systems, available from the IEEE Computer Society Press and Wiley.
The table of contents includes:
- The Autonomic Computing Benchmark
- Analytical Reliability, Availability, and Serviceability Benchmarks
- System Recovery Benchmarks
- Dependability Benchmarking Using Environmental Test Tools
- Dependability Benchmark for OLTP Systems
- Dependability Benchmarking of Web Servers
- Dependability Benchmark of Automotive Engine Control Systems
- Toward Evaluating the Dependability of Anomaly Detectors
- Vajra: Evaluating Byzantine-Fault-Tolerant Distributed Systems
- User-Relevant Software Reliability Benchmarking
- Interface Robustness Testing: Experience and Lessons Learned from the Ballista Project
- Windows and Linux Robustness Benchmarks with Respect to Application Erroneous Behavior
- DeBERT: Dependability Benchmarking of Embedded Real-Time Off-the-Shelf Components for Space Applications
- Benchmarking the Impact of Faulty Drivers: Application to the Linux Kernel
- Benchmarking the Operating System against Faults Impacting Operating System Functions
- Neutron Soft Error Rate Characterization of Microprocessors
Wow, you can see that there has been a lot of work by a lot of people to measure system dependability and improve system designs.
The work described in Chapter 2, Analytical Reliability, Availability, and Serviceability Benchmarks, is now showing up in practice: we are beginning to publish these benchmark results in various product white papers.
Performance benchmarks have proven useful in driving innovation in the computer industry, and I think dependability benchmarks can do likewise. If you feel that these benchmarks are valuable, then please drop me a
note, or better yet, ask your computer vendors for some benchmark
results.
I'd like to thank all of the contributors to the book, the IEEE, and Wiley. Karama Kanoun and Lisa Spainhower worked tirelessly to get all of the works compiled (herding the cats) and interfaced with the publisher. Great job! Ira Pramanick, Jim Mauro, William Bryson, and Dong Tang collaborated with me on Chapters 2 & 3. Thanks, team!
Wednesday Apr 09, 2008
Today, Sun introduced two new CMT servers, the Sun
SPARC Enterprise T5140 and T5240
servers.
I'm really excited about this next stage of server development.
Not only have we effectively doubled the performance capacity of the
system, we did so without significantly decreasing the reliability.
When we try to predict reliability of products which are being
designed, we make those predictions based on previous generation
systems. At Sun, we make these predictions at the component level.
Over the years we have collected detailed failure rate data for a
large variety of electronic components as used in the environments
often found at our customer sites. We use these component failure
rates to determine the failure rate of collections of components. For
example, a motherboard may have more than 2,000 components:
capacitors, resistors, integrated circuits, etc. The key to improving
motherboard reliability is, quite simply, to reduce the number of
components. There is some practical limit, though, because we could
remove many of the capacitors, but that would compromise signal
integrity and performance -- not a good trade-off. The big
difference in the open source UltraSPARC
T2 and UltraSPARC
T2plus processors is the high level of integration onto the chip.
They really are systems on a chip, which means that we need very few
additional components to complete a server design. Fewer components
means better reliability, a win-win situation. On average, the T5140
and T5240 only add about 12% more components over the T5120 and T5220
designs. But considering that you get two or four times as many
disks, twice as many DIMM slots, and twice the computing power, this
is a very reasonable trade-off.
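Here is a minimal sketch of how component failure rates roll up to a
board-level prediction. It assumes a simple series model (any
component failure fails the board) with constant failure rates
expressed in FIT, failures per billion device-hours. The part counts
and FIT values below are hypothetical placeholders, not Sun's field
data.

```python
HOURS_PER_YEAR = 8760


def board_fit(parts):
    """Sum the failure rates of a series system.
    `parts` is a list of (count, fit_per_part) pairs."""
    return sum(count * fit for count, fit in parts)


def mtbf_hours(fit):
    """Mean time between failures implied by a FIT rate."""
    return 1e9 / fit


def annual_failure_rate(fit):
    """Expected failures per year of continuous operation."""
    return fit * HOURS_PER_YEAR / 1e9


# Hypothetical bill of materials: (count, FIT per part).
BOM = [
    (1200, 0.5),   # capacitors
    (700, 0.2),    # resistors
    (100, 20.0),   # integrated circuits
]

base = board_fit(BOM)
# Assumption: 12% more parts with the same mix of part types.
scaled = board_fit([(count * 1.12, fit) for count, fit in BOM])

print(f"base board:   {base:.0f} FIT, MTBF {mtbf_hours(base):,.0f} h, "
      f"{annual_failure_rate(base):.4f} failures/year")
print(f"+12% parts:   {scaled:.0f} FIT, ratio {scaled / base:.2f}")
```

Under these assumptions the failure rate grows in proportion to the
part count, which is why a high level of integration on the chip pays
off so directly.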
Let's take a look at the system block diagram to see where all of
the major components live.

You will notice that the two PCI-e switches are peers and not
cascaded. This allows good flexibility and fault isolation. Compared
to the cascaded switches in the T5120 and T5220 servers, this is a
simpler design. Simple is good for RAS.
You will also notice that we use the same LSI1068E SAS/SATA
controller with onboard RAID. The T5140 is limited to 4 disk bays,
but the T5240 can accommodate 16 disk bays. This gives plenty of disk
targets for implementing a number of different RAID schemes. I
recommend at least some redundancy, dual parity if possible.
Some people have commented that the Neptune Ethernet chip, which
provides dual-10Gb Ethernet or quad-1Gb Ethernet interfaces, is a
single point of failure. There is also one quad GbE PHY chip. The
reason the Neptune is there to begin with is that, when we
implemented the coherency links in the UltraSPARC T2plus processor, we
had to sacrifice the built-in Neptune interface which is available in
the UltraSPARC T2 processor. Moore's Law assures us that this is a
somewhat temporary condition and soon we'll be able to cram even more
transistors onto a chip. This is a case where high integration is
apparent in the packaging. Even though all four GbE ports connect to
a single package, the electronics inside the package are still
isolated. In other words, we don't consider the PHY to be a single
point of failure because the failure modes do not cross the isolation
boundaries. Of course, if your Ethernet gets struck by lightning,
there may be a lot of damage to the server, so there is always the
possibility that a single event will create massive damage. But for
the more common cabling problems, the system offers suitable
isolation. If you are really paranoid about this, then you can
purchase a PCI-e card version of the Neptune and put it in PCI-e slot
1, 2, or 3 to ensure that it uses the other PCI-e switch.
The ILOM service processor is the same as we use in most of our
other small servers and has been a very reliable part of our systems.
It is connected to the rest of the system through an FPGA which
manages all of the service bus connections. This allows the service
processor to be the serviceability interface for the entire server.
The server also uses ECC FB-DIMMs with Extended ECC, which is
another common theme in Sun servers. We have recently been studying
the effects of Solaris Fault Management Architecture and Extended ECC
on systems in the field and I am happy to report that this
combination provides much better system resiliency than is possible
through the individual features. In RAS, the whole can be much better
than the sum of the parts.
For more information on the RAS features of the new T5140 and
T5240 servers, see the white paper, Maximizing
IT Service Uptime by Utilizing Dependable Sun SPARC Enterprise T5140
and T5240 Servers. The white paper has results of our RAS
benchmarks as well as some performability
calculations.