Ramblings from Richard's Ranch

RAS in the T5140 and T5240

Wednesday Apr 09, 2008

Today, Sun introduced two new CMT servers, the Sun SPARC Enterprise T5140 and T5240 servers.

I'm really excited about this next stage of server development. Not only have we effectively doubled the performance capacity of the system, we did so without significantly decreasing the reliability. When we try to predict reliability of products which are being designed, we make those predictions based on previous generation systems. At Sun, we make these predictions at the component level. Over the years we have collected detailed failure rate data for a large variety of electronic components as used in the environments often found at our customer sites. We use these component failure rates to determine the failure rate of collections of components. For example, a motherboard may have more than 2,000 components: capacitors, resistors, integrated circuits, etc. The key to improving motherboard reliability is, quite simply, to reduce the number of components. There is some practical limit, though, because we could remove many of the capacitors, but that would compromise signal integrity and performance -- not a good trade-off. The big difference in the open source UltraSPARC T2 and UltraSPARC T2plus processors is the high level of integration onto the chip. They really are systems on a chip, which means that we need very few additional components to complete a server design. Fewer components means better reliability, a win-win situation. On average, the T5140 and T5240 only add about 12% more components over the T5120 and T5220 designs. But considering that you get two or four times as many disks, twice as many DIMM slots, and twice the computing power, this is a very reasonable trade-off.

Let's take a look at the system block diagram to see where all of the major components live.



You will notice that the two PCI-e switches are peers and not cascaded. This allows good flexibility and fault isolation. Compared to the cascaded switches in the T5120 and T5220 servers, this is a simpler design. Simple is good for RAS.

You will also notice that we use the same LSI1068E SAS/SATA controller with onboard RAID. The T5140 is limited to 4 disk bays, but the T5240 can accommodate 16 disk bays. This gives plenty of disk targets for implementing a number of different RAID schemes. I recommend at least some redundancy, dual parity if possible.

Some people have commented that the Neptune Ethernet chip, which provides dual-10Gb Ethernet or quad-1Gb Ethernet interfaces is a single point of failure. There is also one quad GbE PHY chip. The reason the Neptune is there to begin with is because when we implemented the coherency links in the UltraSPARC T2plus processor we had to sacrifice the builtin Neptune interface which is available in the UltraSPARC T2 processor. Moore's Law assures us that this is a somewhat temporary condition and soon we'll be able to cram even more transistors onto a chip. This is a case where high integration is apparent in the packaging. Even though all four GbE ports connect to a single package, the electronics inside the package are still isolated. In other words, we don't consider the PHY to be a single point of failure because the failure modes do not cross the isolation boundaries. Of course, if your Ethernet gets struck by lightning, there may be a lot of damage to the server, so there is always the possibility that a single event will create massive damage. But for the more common cabling problems, the system offers suitable isolation. If you are really paranoid about this, then you can purchase a PCI-e card version of the Neptune and put it in PCI-e slot 1, 2, or 3 to ensure that it uses the other PCI-e switch.

The ILOM service processor is the same as we use in most of our other small servers and has been a very reliable part of our systems. It is connected to the rest of the system through a FPGA which manages all of the service bus connections. This allows the service processor to be the serviceability interface for the entire server.

The server also uses ECC FB-DIMMs with Extended ECC, which is another common theme in Sun servers. We have recently been studying the affects of Solaris Fault Management Architecture and Extended ECC on systems in the field and I am happy to report that this combination provides much better system resiliency than possible through the individual features. In RAS, the whole can be much better than the sum of the parts.

For more information on the RAS features of the new T5140 and T5240 servers, see the white paper, Maximizing IT Service Uptime by Utilizing Dependable Sun SPARC Enterprise T5140 and T5240 Servers. The whitepaper has results of our RAS benchmarks as well as some performability calculations.



[1] Comments

Performability analysis of T5120 and T5220

Tuesday Oct 09, 2007

In complex systems, we must often trade-off performance against reliability, availability, or serviceability. In many cases, a system design will include both performance and availability requirements. We use performability analysis to examine the performance versus availability trade-off. Performability is simply the ability to perform. A performability analysis combines performance characterization for systems under the possible combinations of degraded states with the probability that the system will be operating the degraded states.

The simplest performability analysis is often appropriate for multiple node, shared nothing clusters which scale performance perfectly. For example, in a simple web server farm, you might have N servers capable of delivering M pages per server. Disregarding other bottlenecks in the system such, as the capacity of the internet connection to the server farm, we can say that N+1 servers will deliver M*(N+1) performance. Thus we can estimate the aggregate performance of any number of web servers.

We can also perform an availability analysis on a web server. We can build Markov models which consider the reliability of the components in a server and their expected time to repair. The output of the models will provide the estimated time per year that each web server may be operational. More specifically, we will know the staying time per year for each of the model states. For a simple model, the performance reward for an up state is M and a down state is 0. A system which provides 99.99% (four-nines) availability can be expected to be down for approximately 53 minutes per year and up for the remainder.

For a shared nothing cluster, we can further simplify the analysis by ignoring common fault effects. In practice, this means that a failure or repair in one web server does not affect any other web servers. In many respects, this is the same simplifying assumption we made with performance, where the performance of a web server is dependent on any of the other web servers.

The shared nothing cluster availability model will contain the following system states and the annual staying time in each state: all up, one down (N-1 up), two down (N-2 up), three down (N-3 up), and so on. The availability model inputs include the unscheduled mean time between system interruption (U_MTBSI) and mean time to repair (MTTR) for the nodes. We often choose a MTTR value by considering the cost of service response time. For many shared nothing clusters, a service response time of 48 hours may be reasonable – a value which may not be reasonable for a database or storage tier. Model results might look like this:

System State

Annual Staying Time (minutes)

Cumulative Uptime (%)

Performance Reward

All up

521,395.20

99.2

M * N

1 down

4,162.75

99.992

M * (N - 1)

2 down

39.95

99.9996

M * (N - 2)

3 down

2.00

99.99998

M * (N - 3)

> 3 down

0.11

100

< M * (N - 4)

Total

525,600.00

100


Now we have enough data to evaluate the performability of the system. For the simple analysis, we accept the cumulative uptime result for the minimum required performance. We can then compare various systems considering performability.

We have modeled the new Sun SPARC Enterprise T5120 and Sun SPARC Enterprise T5220 servers against the venerable Sun Fire V490 servers. For this analysis we chose a performance benchmark with a metric that showed we needed 6 T5120 or T5220 servers to match the performance of 9 V490 servers. We will choose to overprovision by one server, which is often optimum for such architectures. The performability results are:

Servers

Units

Performability (%)

Sun SPARC Enterprise T5120

6 + 1

99.99988

Sun SPARC Enterprise T5220

6 + 1

99.99988

Sun Fire V490

9 + 1

99.99893

You might notice that the T5120 and T5220 have the same performability results. This is because they share the same motherboard design, disks, power supplies, etc. It is much more interesting to compare these to the V490. Even though we use more V490 systems, the T5120 and T5220 solution provides better performability. Fewer, faster, more reliable servers should generally have better performability than more, slower, less reliable servers.

 

Like this post? del.icio.us | furl | slashdot | technorati | digg