Wednesday Apr 09, 2008
Today, Sun introduced two new CMT servers, the Sun
SPARC Enterprise T5140 and T5240
servers.
I'm really excited about this next stage of server development.
Not only have we effectively doubled the performance capacity of the
system, we did so without significantly decreasing the reliability.
When we try to predict reliability of products which are being
designed, we make those predictions based on previous generation
systems. At Sun, we make these predictions at the component level.
Over the years we have collected detailed failure rate data for a
large variety of electronic components as used in the environments
often found at our customer sites. We use these component failure
rates to determine the failure rate of collections of components. For
example, a motherboard may have more than 2,000 components:
capacitors, resistors, integrated circuits, etc. The key to improving
motherboard reliability is, quite simply, to reduce the number of
components. There is some practical limit, though, because we could
remove many of the capacitors, but that would compromise signal
integrity and performance -- not a good trade-off. The big
difference in the open source UltraSPARC
T2 and UltraSPARC
T2plus processors is the high level of integration onto the chip.
They really are systems on a chip, which means that we need very few
additional components to complete a server design. Fewer components
means better reliability, a win-win situation. On average, the T5140
and T5240 only add about 12% more components over the T5120 and T5220
designs. But considering that you get two or four times as many
disks, twice as many DIMM slots, and twice the computing power, this
is a very reasonable trade-off.
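As a rough illustration of how component-level failure rates roll up into a board-level number, here is a minimal sketch. It assumes a simple series model (any component failure fails the board) with constant failure rates expressed in FITs (failures per billion hours). The part counts and FIT values are invented for the example and are not Sun data.

```python
FIT_HOURS = 1e9  # one FIT = one failure per billion device-hours


def board_fit(parts):
    """Sum component failure rates for a series model.

    parts maps a component name to (count, FIT per part).
    """
    return sum(count * fit for count, fit in parts.values())


def mtbf_hours(total_fit):
    """Convert a total FIT rate into a mean time between failures in hours."""
    return FIT_HOURS / total_fit


# Hypothetical bill of materials, for illustration only.
baseline = {
    "capacitor": (1200, 0.5),
    "resistor": (700, 0.2),
    "ic": (100, 20.0),
}

total = board_fit(baseline)
print(f"board FIT: {total:.0f}, MTBF: {mtbf_hours(total):.0f} hours")
```

Under this model, removing parts lowers the summed failure rate directly, which is why integration onto the processor helps.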
Let's take a look at the system block diagram to see where all of
the major components live.

You will notice that the two PCI-e switches are peers and not
cascaded. This allows good flexibility and fault isolation. Compared
to the cascaded switches in the T5120 and T5220 servers, this is a
simpler design. Simple is good for RAS.
You will also notice that we use the same LSI1068E SAS/SATA
controller with onboard RAID. The T5140 is limited to 4 disk bays,
but the T5240 can accommodate 16 disk bays. This gives plenty of disk
targets for implementing a number of different RAID schemes. I
recommend at least some redundancy, dual parity if possible.
Some people have commented that the Neptune Ethernet chip, which
provides dual-10Gb Ethernet or quad-1Gb Ethernet interfaces, is a
single point of failure. There is also one quad GbE PHY chip. The
reason the Neptune is there to begin with is because when we
implemented the coherency links in the UltraSPARC T2plus processor we
had to sacrifice the builtin Neptune interface which is available in
the UltraSPARC T2 processor. Moore's Law assures us that this is a
somewhat temporary condition and soon we'll be able to cram even more
transistors onto a chip. This is a case where high integration is
apparent in the packaging. Even though all four GbE ports connect to
a single package, the electronics inside the package are still
isolated. In other words, we don't consider the PHY to be a single
point of failure because the failure modes do not cross the isolation
boundaries. Of course, if your Ethernet gets struck by lightning,
there may be a lot of damage to the server, so there is always the
possibility that a single event will create massive damage. But for
the more common cabling problems, the system offers suitable
isolation. If you are really paranoid about this, then you can
purchase a PCI-e card version of the Neptune and put it in PCI-e slot
1, 2, or 3 to ensure that it uses the other PCI-e switch.
The ILOM service processor is the same as we use in most of our
other small servers and has been a very reliable part of our systems.
It is connected to the rest of the system through an FPGA which
manages all of the service bus connections. This allows the service
processor to be the serviceability interface for the entire server.
The server also uses ECC FB-DIMMs with Extended ECC, which is
another common theme in Sun servers. We have recently been studying
the effects of Solaris Fault Management Architecture and Extended ECC
on systems in the field and I am happy to report that this
combination provides much better system resiliency than possible
through the individual features. In RAS, the whole can be much better
than the sum of the parts.
For more information on the RAS features of the new T5140 and
T5240 servers, see the white paper, Maximizing
IT Service Uptime by Utilizing Dependable Sun SPARC Enterprise T5140
and T5240 Servers. The white paper has the results of our RAS
benchmarks as well as some performability
calculations.
Wednesday Feb 20, 2008
When we build large clusters, such as high performance clusters or any cluster with a large number of computing nodes, we begin to look in detail at the repair models for the system. You are probably aware of the need to study power usage, air conditioning, weight, system management, networking, and cost for such systems. So you are also aware of how multiplying the environmental needs of one computing node times the number of nodes can become a large number. This can be very intuitive for most folks. But availability isn't quite so intuitive. Deferred repair models can also affect the intuition of the design. So, I thought that a picture would help show how we analyze the RAS characteristics of such systems and why we always look to deferred repair models in their design.
To begin, we have to make some assumptions:
- The availability of the whole is not interesting. The service provided by a big cluster is not dependent on all parts being functional. Rather, we look at it like a swarm of bees. Each bee can be busy, and the whole swarm can contribute towards making honey, but the loss of a few bees (perhaps due to a hungry bee eater) doesn't cause the whole honey producing process to stop. Sure, there may be some components of the system which are more critical than others, like the queen bee, but work can still proceed forward even if some of these systems are temporarily unavailable (the swarm will create new queens, as needed). This is a very different view than looking at the availability of a file service, for example.
- The performability might be interesting. How many dead bees can we have before the honey production falls below our desired level? But for very, very large clusters, the performability will be generally good, so a traditional performability analysis is also not very interesting. It is more likely that a performability analysis of the critical components, such as networking and storage, will be interesting. But the performability of thousands of compute nodes will be less interesting.
- Common root cause failures are not considered. If a node fails, the root cause of the failure is not common to other nodes. A good example of a common root cause failure is loss of power -- if we lose power to the cluster, all nodes will fail. Another example is software -- a software bug which causes the nodes to crash may be common to all nodes.
- What we will model is a collection of independent nodes, each with their own, independent failure causes. Or just think about bees.
For a large number of compute nodes, even using modern, reliable designs, we know that the probability of all nodes being up at the same time is quite small. This is obvious if we look at the simple availability equation:
Availability = MTBF / (MTBF + MTTR)
where MTBF (mean time between failures) is MTBF[compute node] / N[nodes]
and MTTR (mean time to repair) is > 0
The killer here is N. As N becomes large (thousands) and MTTR depends on people, the availability becomes quite small. The time required to repair a machine is included in the MTTR, so as N becomes large, there is more repair work to be done. I don't know about you, but I'd rather not spend my life in constant repair mode, so we need to look at the problem from a different angle.
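To see how quickly N dominates, here is a minimal sketch of the availability equation above. The node MTBF and repair time are hypothetical values chosen only to show the trend.

```python
def all_up_availability(node_mtbf_hours, n_nodes, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR), where MTBF = node MTBF / N."""
    cluster_mtbf = node_mtbf_hours / n_nodes
    return cluster_mtbf / (cluster_mtbf + mttr_hours)


# Hypothetical inputs: 50,000-hour node MTBF and a 24-hour repair time.
for n in (1, 10, 100, 1000, 2000):
    print(n, round(all_up_availability(50_000, n, 24), 4))
```

With these assumed numbers, a single node is available well over 99% of the time, while the chance of finding all 2,000 nodes up at once is only about one half.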
If we make MTTR large, then the availability will drop to near zero. But if we have some spare compute nodes, then we might be able to maintain a specified service level. Or, from a practical perspective, we could ask the question, "how many spare compute nodes do I need to keep at least M compute nodes operational?" The next, related question is, "how often do we need to schedule service actions?" To solve this problem, we need a model.
Before I dig into the model results, I want to digress for a moment and talk about Mean Time Between Service (MTBS) and Mean Time Between System Interruption (MTBSI). I've blogged in detail about these before, but to put their use in context here, we will actually use MTBSI and not MTBF for the model. Why? Because if a compute node has any sort of redundancy (ECC memory, mirrored disks, etc.) then the node may still work after a component has failed. But we want to model our repair schedule based on how often we need to fix nodes, so we need to look at how often things break for two cases. The models will show us those details, but I won't trouble you with them today.
The figure below shows a proposed 2000+ node HPC cluster with two different deferred repair models. For one solution, we use a one week (168 hour) deferred repair time. For the other solution, we use a two week deferred repair time. I could show more options, but these two will be sufficient to provide the intuition for solving such mathematical problems.
We build a model showing the probability that some number of nodes will be down. The OK state is when all nodes are operational. It is very clear that the longer we wait to repair the nodes, the less probable it is that the cluster will be in the OK state. I would say that with a two-week deferred maintenance model, there is nearly zero probability that all nodes will be operational. Looking at this another way, if you want all nodes to be available, you need to have a very, very fast repair time (MTTR approaching 0 time). Since fast MTTR is very expensive, accepting a deferred repair and using spares is usually a good cost trade-off.
OK, so we're convinced that a deferred repair model is the way to go, so how many spare compute nodes do we need? A good way to ask that question is, "how many spares do I need to ensure that there is a 95% probability that I will have a minimum of M nodes available?" From the above graph, we would accumulate the probability until we reached the 95% threshold. Thus we see that for the one week deferred repair case, we need at least 8 spares and for the two week deferred repair case we need at least 12 spares. Now this is something we can work with.
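The "accumulate the probability until we reach the threshold" step can be sketched as follows. This is not the model used for the figure. It is a simplified stand-in that assumes the number of nodes down at the end of a repair interval is Poisson distributed with mean N * interval / MTBSI, and the MTBSI value in the example is hypothetical, so it will not reproduce the 8 and 12 spares quoted above.

```python
import math


def spares_needed(n_nodes, mtbsi_hours, repair_interval_hours, confidence=0.95):
    """Smallest k such that P(nodes down <= k) >= confidence.

    Assumes independent node failures and a Poisson count of failed nodes
    accumulated over one deferred repair interval.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    mean_down = n_nodes * repair_interval_hours / mtbsi_hours
    term = math.exp(-mean_down)  # probability that zero nodes are down
    cumulative = term
    k = 0
    while cumulative < confidence:
        k += 1
        term *= mean_down / k  # Poisson recurrence: P(k) = P(k-1) * mean / k
        cumulative += term
    return k


# Hypothetical 2,000-node cluster with a 100,000-hour node MTBSI.
for interval in (168, 336):
    print(interval, spares_needed(2000, 100_000, interval))
```

The same loop works with any per-state probability distribution produced by a richer model. Only the source of the probabilities changes.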
The model results will change based on the total number of compute nodes and their MTBSI. If you have more nodes, you'll need more spares. If you have more reliable or redundant nodes, you need fewer spares. If we know the reliability of the nodes and their redundancy characteristics, we have models which can tell you how many spares you need.
This sort of analysis also lets you trade off the redundancy characteristics of the nodes to see how that affects the system, too. For example, we could look at the effect of zero, one, or two disks (mirrored) per node on the service levels. I personally like the zero disk case, where the nodes boot from the network, and we can model such complex systems quite easily, too. This point should not be underestimated: as you add redundancy to increase the MTBSI, you also decrease the MTBS, because there are more parts to service, and that impacts your service costs. The engineer's life is a life full of trade-offs.
In conclusion, building clusters with lots of nodes (red shift designs) requires additional analysis beyond what we would normally use for critical systems with few nodes (blue shift designs). We often look at service costs using a deferred service interval and how that affects the overall system service level. We also look at the trade-offs between per-node redundancy and the overall system service level. With proper analysis, we can help determine the best performance and best cost for large, red shift systems.
Tuesday Oct 16, 2007
Modern systems are continuing to evolve and become more tolerant
to failures. For many systems today, a simple performance or
availability analysis does not reveal how well a system will operate
when in a degraded mode. A performability analysis can help answer
these questions for complex systems. In this blog, I'll show one of
the methods we use for performability analysis.
We often begin with a small set of components for test and
analysis. Traditional benchmarking or performance characterization is
a good starting point. For this example, we will analyze a storage
array. We begin with an understanding of the performance
characteristics of our desired workload, which can vary widely for
storage subsystems. In our case, we will create a performance
workload which includes a mix of reads and writes, with a consistent
iop size, and a desired performance metric of iops/second. Storage
arrays tend to have many possible RAID configurations which will have
different performance and data protection trade-offs, so we will pick
a RAID configuration which we think will best suit our requirements.
If it sounds like we're making a lot of choices early, it is because
we are. We know that some choices are clearly bad, some are clearly
good, and there are a whole bunch of choices in between. If we can't
meet our design targets after the performability analysis, then we
might have to go back to the beginning and start again - such is the
life of a systems engineer.
Once we have a reasonable starting point, we will set up a baseline
benchmark to determine the best performance for a fully functional
system. We will then use fault injection to measure the system
performance characteristics under the various failure modes expected
in the system. For most cases, we are concerned with hardware
failures. Often the impact on the performance of a system under
failure conditions is not constant. There may be a fault diagnosis
and isolation phase, a degraded phase, and a repair phase. There may
be several different system performance behaviors during these
phases. The transient diagram below shows the performance
measurements of a RAID array with dual redundant controllers
configured in a fully redundant, active/active operating mode. We
bring the system to a steady state and then inject a fault into one
of the controllers.
This analysis is interesting for several different reasons. We see
that when the fault was injected, there was a short period where the
array serviced no I/O operations. Once the fault was isolated, then a
recovery phase was started during which the array was operating at
approximately half of its peak performance. Once recovery was
completed, the performance returned to normal, even though the system
is in a degraded state. Next we repaired the fault. After the system
reconfigured itself, performance returned to normal for the
non-degraded system. You'll note that during the post-repair
reconfiguration the array stopped servicing I/O operations and this
outage was longer than the outage in the original fault. Sometimes, a
trade-off is made such that the impact of the unscheduled fault is
minimized at the expense of the repair activity. This is usually a
good trade-off because the repair activity is usually a scheduled
event, so we can limit the impact via procedures and planning. If you
have ever waited for an fsck to finish when booting a system, then
you've felt the impact of such decisions and understand why modern
file systems have attempted to minimize the performance costs of
fsck, or eliminated the need for fsck altogether. Modeling the system in
this way means that we will consider both the unscheduled faults as
well as the planned repair, though we usually make the simplifying
assumption that there will be one repair action for each unscheduled
fault.
If this sort of characterization sounds tedious, well it is. But
it is the best way for us to measure the performance of a subsystem
under faulted conditions. Trying to measure the performance of a
more complex system with multiple servers, switches, and arrays under
a comprehensive set of fault conditions would be untenable. We do
gain some reduction of the test matrix because we know that some
components have no impact on performance when they fail.
Next we build a RAScad model for the system. I usually use a
hierarchical model built from components which hides much of the
complexity from me, but for this simpler example, the Markov model
looks like this:
Where the states are explained by this table:
| State | Explanation | Transition Rate | Explanation |
| 28,0,1 | No failures | m_repair | Repair rate (=1/MTTR) |
| 1 UIC_Dn | 1 UIC is down | l_uic | UIC failure rate |
| Down | System is down | l_mp | Midplane failure rate |
| 1 Ctlr_Dn | 1 Controller is down | l_cntl | Controller failure rate |
| 1PCU_Dn | 1 PCU is down | l_pcu | PCU failure rate |
| 27,1,0 | 1 disk is under reconstruction | l_recon | Disk reconstruction rate |
| 28,1,1 | 1 disk is under reconstruction, 1 spare disk available | l_disk | Disk failure rate |
| 27,0,0 | No spare disk | | |
| 26,0,0 | One parity group loses 1 disk, no spare available, no disk reconstruction | | |
Solving the Markov model will provide us with the average staying
time per year in each of the states. Note that we must make some sort
of assumptions about the service response time. We will usually use 4
hour service response time for enterprise-class operations. Is that
assumption optimal? We don't always know, so that is another feature
of a system I'll explore in a later blog.
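To make "average staying time per year" concrete, here is a minimal sketch that solves a much smaller chain than the one above. It is not RAScad output. It assumes a simple birth-death chain (OK, one failure, two failures) in which each failure is repaired one step at a time at the same repair rate, and the failure rates are hypothetical. The repair rate of 1/4 per hour corresponds to the 4 hour service response assumption.

```python
HOURS_PER_YEAR = 8760.0


def staying_hours_per_year(failure_rates, repair_rate):
    """Steady-state hours per year spent in each state of a birth-death chain.

    State i moves to state i+1 at failure_rates[i] and back to state i-1 at
    repair_rate. Balance equations give pi[i+1] = pi[i] * failure_rates[i] / repair_rate.
    """
    weights = [1.0]
    for rate in failure_rates:
        weights.append(weights[-1] * rate / repair_rate)
    total = sum(weights)
    return [HOURS_PER_YEAR * w / total for w in weights]


# Hypothetical rates per hour: first failure 1e-4, second failure 2e-4.
hours = staying_hours_per_year([1e-4, 2e-4], repair_rate=1.0 / 4.0)
for name, h in zip(("OK", "degraded", "down"), hours):
    print(f"{name}: {h:.4f} hours/year")
```

A real model has many more states and transitions, but the output has the same shape: one staying time per state, summing to a full year.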
So now we have the performance for each state, and the average
staying time per year. These are two variables, so let's graph them
on an X-Y plot. To make it easier to compare different systems, we
sort by the performance (in the Y-axis). We call the resulting graph
a performability graph or P-Graph for short. Here is an
example of a performability graph showing the results for three
different RAID array configurations.
I usually label availability targets across the top as an
alternate X-axis label because many people are more comfortable with
availability targets represented as "nines" than seconds or
minutes. In order to show the typically small staying time, we use a
log scale on the X-axis. The Y-axis shows the performance metric. I
refer to the system's performability curve as a
performability
envelope because it represents the boundaries of performance and
availability, where we can expect the actual use to fall below the
curve for any interval.
Suppose you have a requirement for an array that delivers 1,500
iops with "four-nines" availability. You can see from the
performability graph that Product A and C can deliver 1,500 iops,
Product C can deliver "four-nines" availability, but only
Product A can deliver both 1,500 iops and "four-nines"
availability.
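The same check can be done numerically. The sketch below assumes each product is described by a list of (iops, staying minutes per year) pairs, one per model state, and asks whether the time spent below the required performance fits within the downtime budget implied by the number of nines. The two state lists are invented for illustration and do not correspond to Products A, B, or C.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_minutes(nines):
    """Yearly time budget for an availability target of the given number of nines."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


def meets_requirement(states, required_iops, nines):
    """True if time spent below required_iops fits inside the availability budget.

    states is a list of (iops, staying_minutes_per_year) pairs.
    """
    below = sum(minutes for iops, minutes in states if iops < required_iops)
    return below <= allowed_minutes(nines)


# Hypothetical arrays: performance in each state and minutes per year spent there.
array_x = [(2000, 525000), (1600, 560), (900, 30), (0, 10)]
array_y = [(1800, 525095), (1200, 500), (0, 5)]

print(meets_requirement(array_x, 1500, 4))
print(meets_requirement(array_y, 1500, 4))
```

Four nines leaves a budget of about 52.6 minutes per year, so an array only qualifies if every state slower than 1,500 iops adds up to less than that.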
To help you understand the composition of the graph, I colored
some of the states which have longer staying times.
You can see that some of the failure states have little impact on
performance, whereas others will have a significant impact on
performance. For this array, when a power supply/battery unit fails,
the write cache is placed in write through mode, which has a
significant performance impact. Also, when a disk fails and is being
reconstructed, the overall performance is impacted. Now we have a
clearer picture of what performance we can expect from this array per
year.
This composition view is particularly useful for product
engineers, but is less useful to systems engineers. For complex
systems, there are many products, many failure modes, and many more
trade-offs to consider. More on that later...
Wednesday Oct 03, 2007
Adaptec has put together a nice webinar called Nearline Data Drives and Error Handling. If you work with disks or are contemplating building your own home data server, I recommend that you take 22 minutes to review the webinar. As a systems vendor, we are often asked why we made certain design decisions to favor data over costs, and I think this webinar does a good job of showing how some of the complexity of systems design covers a large number of decision points. Here in the RAS Engineering group we tend to gravitate towards the best reliability and availability of systems, which still requires a staggering number of design trade-offs. Rest assured that we do our best to make these decisions with your data in mind.
For the ZFSers in the world, this webinar also provides some insight into how RAID systems like ZFS are designed, and why end-to-end data protection is vitally important.
Enjoy! And if you don't want your Starbucks gift card, send it to me :-)
Friday May 04, 2007
OpenSolaris build 61 (or later) is now available for download. ZFS
has added a new feature that will improve data protection: redundant
copies for data (aka ditto blocks for data). Previously, ZFS stored redundant copies of metadata.
Now this feature is available for data, too.
This represents a new feature which is unique to ZFS: you can set
the data protection policy on a per-file system basis, beyond that
offered by the underlying device or volume. For single-device
systems, like my laptop with its single disk drive, this is very
powerful. I can have a different data protection policy for the files
that I really care about (my personal files) than the files that I
really don't care about or that can be easily reloaded from the OS
installation DVD. For systems with multiple disks assembled in a RAID
configuration, the data protection is not quite so obvious. Let's
explore this feature, look under the hood, and then analyze some
possible configurations.
Using Copies
To change the number of data copies, set the copies
property. For example, suppose I have a zpool named "zwimming."
The default number of data copies is 1. But you can change that to 2
quite easily.
# zfs set copies=2 zwimming
The copies property works for all new writes, so I recommend that
you set that policy when you create the file system or immediately
after you create a zpool.
You can verify the copies setting by looking at the properties.
# zfs get copies zwimming
NAME PROPERTY VALUE SOURCE
zwimming copies 2 local
ZFS will account for the space used. For example, suppose I create
three new file systems and copy some data to them. You can then see
that the space used reflects the number of copies. If you use quotas,
then the copies will be charged against the quotas, too.
# zfs create -o copies=1 zwimming/single
# zfs create -o copies=2 zwimming/dual
# zfs create -o copies=3 zwimming/triple
# cp -rp /usr/share/man1 /zwimming/single
# cp -rp /usr/share/man1 /zwimming/dual
# cp -rp /usr/share/man1 /zwimming/triple
# zfs list -r zwimming
NAME             USED   AVAIL  REFER  MOUNTPOINT
zwimming         48.2M  310M   33.5K  /zwimming
zwimming/dual    16.0M  310M   16.0M  /zwimming/dual
zwimming/single  8.09M  310M   8.09M  /zwimming/single
zwimming/triple  23.8M  310M   23.8M  /zwimming/triple
This makes sense. Each file system has one, two, or three copies
of the data and will use correspondingly one, two, or three times as
much space to store the data.
Under the Covers
ZFS will spread the ditto blocks across the vdev or vdevs to
provide spatial diversity. Bill
Moore has previously blogged about this, or you can see
it in the code for yourself. From a RAS perspective, this is a
good thing. We want to reduce the possibility that a single failure,
such as a drive head impact with media, could disturb both copies of
our data. If we have multiple disks, ZFS will try to spread the
copies across multiple disks. This is different than mirroring, in
subtle ways. The actual placement is ultimately based upon available
space. Let's look at some simplified examples. First, for the default
file system configuration settings on a single disk.

Note that there are two copies of the metadata, by default. If we
have two or more copies of the data, the number of metadata copies is
three.
Suppose you have a 2-disk stripe. In that case, ZFS will try to
spread the copies across the disks.

Since the copies are created above the zpool, a mirrored zpool
will faithfully mirror the copies.
Since the copies policy is set at the file system level, not the
zpool level, a single zpool may contain multiple file systems, each
with different policies. In other words, you could have data which is
not copied allocated along with data that is copied.
Setting the policy per file system lets you improve data protection where it matters most, and it offers many more permutations of configurations for you to weigh in your designs.
RAS Modeling
It is obvious that increasing the number of data copies will
effectively reduce the amount of available space accordingly. But how
will this affect reliability? To answer that question we use the
MTTDL[2]
model I previously described, with the following changes:
First, we calculate the probability of unsuccessful
reconstruction due to a UER for N disks of a given size (unit
conversion omitted). The number of copies decreases this probability.
This makes sense as we could use another copy of the data for
reconstruction and to completely fail, we'd need to lose all copies:
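$$P_{\mathrm{recon\_fail}} = \left( \frac{(N-1) \cdot \mathrm{size}}{\mathrm{UER}} \right)^{\mathrm{copies}}$$
For single-disk failure protection:
$$\mathrm{MTTDL}[2] = \frac{\mathrm{MTBF}}{N \cdot P_{\mathrm{recon\_fail}}}$$
For double-disk failure protection:
$$\mathrm{MTTDL}[2] = \frac{\mathrm{MTBF}^2}{N \cdot (N-1) \cdot \mathrm{MTTR} \cdot P_{\mathrm{recon\_fail}}}$$
Note that as the number of copies increases, $P_{\mathrm{recon\_fail}}$
approaches zero quickly. This will increase the MTTDL. We want higher
MTTDL, so this is a good thing.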
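Here is a minimal sketch of the formulas above, with the unit conversion made explicit. It assumes the disk size and the UER are both expressed in bits (UER as bits read per unrecoverable error), and the disk size, UER, MTBF, and MTTR values are hypothetical examples rather than figures from the analysis that follows.

```python
def p_recon_fail(n_disks, disk_bits, uer_bits, copies=1):
    """Probability of unsuccessful reconstruction: ((N-1) * size / UER) ** copies."""
    return ((n_disks - 1) * disk_bits / uer_bits) ** copies


def mttdl_single(mtbf_hours, n_disks, p_fail):
    """MTTDL[2] for single-disk failure protection: MTBF / (N * P_recon_fail)."""
    return mtbf_hours / (n_disks * p_fail)


def mttdl_double(mtbf_hours, n_disks, mttr_hours, p_fail):
    """MTTDL[2] for double-disk failure protection."""
    return mtbf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours * p_fail)


# Hypothetical inputs: 5 disks of 500 GB (4e12 bits), UER of 1e14 bits,
# 1,000,000-hour disk MTBF, 24-hour MTTR.
for copies in (1, 2, 3):
    p = p_recon_fail(5, 4e12, 1e14, copies)
    print(copies, p, mttdl_single(1_000_000, 5, p), mttdl_double(1_000_000, 5, 24, p))
```

With these inputs the single-copy probability is 0.16, and each extra copy multiplies it by 0.16 again, which is where the MTTDL gain comes from.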
OK, now that we can calculate available space and MTTDL, let's
look at some configurations for 46 disks available on a Sun
Fire X4500 (aka Thumper). We'll look at single parity schemes, to
reduce the clutter, but double parity schemes will show the same,
relative improvements.
You can see that we are trading off space for MTTDL. You can also
see that for raidz zpools, having more disks in the sets reduces the
MTTDL. It gets more interesting to see that the 2-way mirror with
copies=2 is very similar in space and MTTDL to the 5-disk raidz with
copies=3. Hmm. Also, the 2-way mirror with copies=1 is similar in
MTTDL to the 7-disk raidz with copies=2, though the mirror
configurations allow more space. This information may be useful as
you make trade-offs. Since the copies parameter is set per file
system, you can still set the data protection policy for important
data separately from unimportant data. This might be a good idea for
some situations where you might have permanent originals (e.g., CDs,
DVDs) and want to apply a different data protection policy.
In the future, once we have a better feel for the real performance
considerations, we'll be able to add a performance component into the
analysis.
Single Device Revisited
Now that we see how data protection is improved, let's revisit the
single device case. I use the term device here because there is a
significant change occurring in storage as we replace disk drives
with solid state, non-volatile memory devices (e.g., flash disks and
future MRAM or PRAM devices). A large number of enterprise customers
demand dual disk drives for mirroring root file systems in servers.
However, there is also a growing demand for solid state boot devices,
and we
have some Sun servers with this option. Some believe that by
2009, the majority of laptops will also have solid state devices
instead of disk drives. In the interim, there are also hybrid disk
drives.
What effect will these devices have on data retention? We know
that if the entire device completely fails, then the data is most
likely unrecoverable. In real life, these devices can suffer many
failures which result in data loss, but which are not complete device
failures. For disks, we see the most common failure is an
unrecoverable read where data is lost from one or more sectors (bar 1
in the graph below). For flash memories, there is an endurance issue
where repeated writes to a cell may reduce the probability of reading
the data correctly. If you only have one copy of the data, then the
data is lost, never to be read correctly again.
We captured disk error codes returned from a number of disk drives
in the field. The Pareto chart below shows the relationship between
the error codes. Bar 1 is the unrecoverable read which accounts for
about 24% of the errors recorded. The violet bars show recoverable
errors which did succeed. Examples of successfully recovered errors
are: write error - recovered with block reallocation, read error -
recovered by ECC using normal retries, etc. The recovered errors do
not (immediately) indicate a data loss event, so they are largely
transparent to applications. We worry more about the unrecoverable
errors.
Approximately 1/3 of the errors were unrecoverable. If such an
error occurs in ZFS metadata, then ZFS will try to read an alternate
metadata copy and repair the metadata. If the data has multiple
copies, then it is likely that we will not lose any data. This is a
more detailed view of the storage device because we are not treating
all failures as a full device failure.
Both real and anecdotal evidence suggests that unrecoverable
errors can occur while the device is still largely operational. ZFS
has the ability to survive such errors without data loss. Very cool.
Murphy's Law will ultimately catch up with you, though. In the case
where ZFS cannot recover the data, ZFS will tell you which file is
corrupted. You can then decide whether or not you should recover it
from backups or source media.
Another Single Device
Now that I've got you to think of the single device as a single
device, I'd like to extend the thought to RAID arrays. There is much
confusion amongst people about whether ZFS should or should not be
used with RAID arrays. If you search, you'll find comments and
recommendations both for and against using hardware RAID for ZFS. The main
argument is centered around the ability of ZFS to correct errors. If
you have a single device backed by a RAID array with some sort of
data protection, then previous versions of ZFS could not recover data
which was lost. Hold it right there, fella! Do I mean that RAID
arrays and the channel from the array to main memory can have errors?
Yes, of course! We have seen cases where errors were introduced
somewhere along the path between disk media to main memory where data
was lost or corrupted. Prior to ZFS, these were silent errors and
blissfully ignored. With ZFS, the checksum now detects these errors
and tries to recover. If you don't believe me, then watch the ZFS
forum on opensolaris.org where we get reports like this about
once a month or so. With ZFS copies, you can now recover from such
errors without changing the RAID array configuration.
If ZFS can correct a data error, it will attempt to do so. You now
have the option to improve your data protection even when using a
single RAID LUN. And this is the same mechanism we can use for a
single disk or flash drive: data copies. You can implement the copies
on a per-file system basis and thus have different data protection
policies even though the data is physically stored on a RAID LUN in a
hardware RAID array. I really hope we can put to rest the "ZFS
prefers JBOD" argument and just concentrate our efforts on
implementing the best data protection policies for the requirements.
ZFS with data copies is another tool in your toolbelt to improve your
life, and the life of your data.
Monday Apr 23, 2007
My colleague, Gary Combs, put together a podcast describing the new RAS features found in the Sun SPARC Enterprise Servers. The M4000, M5000, M8000, and M9000 servers have very advanced RAS features, which put them head and shoulders above the competition. Here is my list of favorites, in no particular order:
- Memory mirroring. This is like RAID-1 for main memory. As I've said many times, there are 4 types of components which tend to break most often: disks, DIMMs (memory), fans, and power supplies. Memory mirroring brings the fully redundant reliability techniques often used for disks, fans, and power supplies to DIMMs.
- Extended ECC for main memory. Full chip failures on a DIMM can be tolerated.
- Instruction retry. The processor can detect faulty operation and retry instructions. This feature has been available on mainframes, and is now available for the general purpose computing markets.
- Improved data path protection. Many improvements here, along the entire data path. ECC protection is provided for all of the on-processor memory.
- Reduced part count from the older generation Sun Fire E25K. Better integration allows us to do more with fewer parts while simultaneously improving the error detection and correction capabilities of the subsystems.
- Open-source Solaris Fault Management Architecture (FMA) integration. This lets systems administrators see what faults the system has detected, and lets the system heal itself automatically.
- Enhanced dynamic reconfiguration. Dynamic reconfiguration can be done at the processor, DIMM (bank), and PCI-E (pairs) level of granularity.
- Solaris Cluster support. Of course Solaris Cluster is supported including clustering between Solaris containers, dynamic system domains, or chassis.
- Comprehensive service processor. The service processor monitors the health of the system and controls system operation and reconfiguration. This is the most advanced service processor we've developed. Another welcome feature is the ability to delegate responsibilities to different system administrators with restrictions so that they cannot control the entire chassis. This will be greatly appreciated in large organizations where multiple groups need computing resources.
- Dual power grid. You can connect the power supplies to two different power grids. Many people do not have the luxury of access to two different power grids, but those who have been bitten by a grid outage will really appreciate this feature. Think of this as RAID-1 for your power source.
I don't think you'll see anything revolutionary in my favorites list. This is due to the continuous improvements in the RAS technologies. The older Sun Fire servers were already very reliable, and it is hard to create a revolutionary change for mature technologies. We have goals to make every generation better, and we've made many advances with this new generation. If the RAS guys do their job right, you won't notice it - things will just keep working.