Ramblings from Richard's Ranch

RAS in the T5140 and T5240

Wednesday Apr 09, 2008

Today, Sun introduced two new CMT servers, the Sun SPARC Enterprise T5140 and T5240 servers.

I'm really excited about this next stage of server development. Not only have we effectively doubled the performance capacity of the system, we did so without significantly decreasing the reliability. When we try to predict reliability of products which are being designed, we make those predictions based on previous generation systems. At Sun, we make these predictions at the component level. Over the years we have collected detailed failure rate data for a large variety of electronic components as used in the environments often found at our customer sites. We use these component failure rates to determine the failure rate of collections of components. For example, a motherboard may have more than 2,000 components: capacitors, resistors, integrated circuits, etc. The key to improving motherboard reliability is, quite simply, to reduce the number of components. There is some practical limit, though, because we could remove many of the capacitors, but that would compromise signal integrity and performance -- not a good trade-off. The big difference in the open source UltraSPARC T2 and UltraSPARC T2plus processors is the high level of integration onto the chip. They really are systems on a chip, which means that we need very few additional components to complete a server design. Fewer components means better reliability, a win-win situation. On average, the T5140 and T5240 only add about 12% more components over the T5120 and T5220 designs. But considering that you get two or four times as many disks, twice as many DIMM slots, and twice the computing power, this is a very reasonable trade-off.

Let's take a look at the system block diagram to see where all of the major components live.



You will notice that the two PCI-e switches are peers and not cascaded. This allows good flexibility and fault isolation. Compared to the cascaded switches in the T5120 and T5220 servers, this is a simpler design. Simple is good for RAS.

You will also notice that we use the same LSI1068E SAS/SATA controller with onboard RAID. The T5140 is limited to 4 disk bays, but the T5240 can accommodate 16 disk bays. This gives plenty of disk targets for implementing a number of different RAID schemes. I recommend at least some redundancy, dual parity if possible.

Some people have commented that the Neptune Ethernet chip, which provides dual-10Gb Ethernet or quad-1Gb Ethernet interfaces is a single point of failure. There is also one quad GbE PHY chip. The reason the Neptune is there to begin with is because when we implemented the coherency links in the UltraSPARC T2plus processor we had to sacrifice the builtin Neptune interface which is available in the UltraSPARC T2 processor. Moore's Law assures us that this is a somewhat temporary condition and soon we'll be able to cram even more transistors onto a chip. This is a case where high integration is apparent in the packaging. Even though all four GbE ports connect to a single package, the electronics inside the package are still isolated. In other words, we don't consider the PHY to be a single point of failure because the failure modes do not cross the isolation boundaries. Of course, if your Ethernet gets struck by lightning, there may be a lot of damage to the server, so there is always the possibility that a single event will create massive damage. But for the more common cabling problems, the system offers suitable isolation. If you are really paranoid about this, then you can purchase a PCI-e card version of the Neptune and put it in PCI-e slot 1, 2, or 3 to ensure that it uses the other PCI-e switch.

The ILOM service processor is the same as we use in most of our other small servers and has been a very reliable part of our systems. It is connected to the rest of the system through a FPGA which manages all of the service bus connections. This allows the service processor to be the serviceability interface for the entire server.

The server also uses ECC FB-DIMMs with Extended ECC, which is another common theme in Sun servers. We have recently been studying the affects of Solaris Fault Management Architecture and Extended ECC on systems in the field and I am happy to report that this combination provides much better system resiliency than possible through the individual features. In RAS, the whole can be much better than the sum of the parts.

For more information on the RAS features of the new T5140 and T5240 servers, see the white paper, Maximizing IT Service Uptime by Utilizing Dependable Sun SPARC Enterprise T5140 and T5240 Servers. The whitepaper has results of our RAS benchmarks as well as some performability calculations.



[1] Comments

Big Clusters and Deferred Repair

Wednesday Feb 20, 2008

When we build large clusters, such as high performance clusters or any cluster with a large number of computing nodes, we begin to look in detail at the repair models for the system. You are probably aware of the need to study power usage, air conditioning, weight, system management, networking, and cost for such systems. So you are also aware of how multiplying the environmental needs of one computing node times the number of nodes can become a large number. This can be very intuitive for most folks. But availability isn't quite so intuitive. Deferred repair models can also affect the intuition of the design. So, I thought that a picture would help show how we analyze the RAS characteristics of such systems and why we always look to deferred repair models in their design.

To begin, we have to make some assumptions:

  • The availability of the whole is not interesting.  The service provided by a big cluster is not dependent on all parts being functional. Rather, we look at it like a swarm of bees. Each bee can be busy, and the whole swarm can contribute towards making honey, but the loss of a few bees (perhaps due to a hungry bee eater) doesn't cause the whole honey producing process to stop. Sure, there may be some components of the system which are more critical than others, like the queen bee, but work can still proceed forward even if some of these systems are temporarily unavailable (the swarm will create new queens, as needed). This is a very different view than looking at the availability of a file service, for example.
  • The performability will might be interesting. How many dead bees can we have before the honey production falls below our desired level? But for very, very large clusters, the performability will be generally good, so a traditional performability analysis is also not very interesting. It is more likely that a performability analysis of the critical components, such as networking and storage, will be interesting. But the performability of thousands of compute nodes will be less interesting.
  • Common root cause failures are not considered. If a node fails, the root cause of the failure is not common to other nodes. A good example of a common root cause failure is loss of power -- if we lose power to the cluster, all nodes will fail. Another example is software -- a software bug which causes the nodes to crash may be common to all nodes.
  • What we will model is a collection of independent nodes, each with their own, independent failure causes.  Or just think about bees.
For a large number of compute nodes, even using modern, reliable designs, we know that the probability of all nodes being up at the same time is quite small. This is obvious if we look at the simple availability equation:
Availability = MTBF / (MTBF + MTTR)

where, MTBF (mean time between failure) is MTBF[compute node]/N[nodes]
and, MTTR (mean time to repair) is > 0

The killer here is N. As N becomes large (thousands) and MTTR is dependent on people, then the availability becomes quite small. The time required to repair a machine is included in the MTTR. So as N becomes large, there is more repair work to be done. I don't know about you, but I'd rather not spend my life in constant repair mode, so we need to look at the problem from a different angle.

If we make MTTR large, then the availability will drop to near zero. But if we have some spare compute nodes, then we might be able to maintain a specified service level. Or, some a practical perspective, we could ask the question, "how many spare compute nodes do I need to keep at least M compute nodes operational?" The next, related question is, "how often do we need to schedule service actions?" To solve this problem, we need a model.

Before I dig into the model results, I want to digress for a moment and talk about Mean Time Between Service (MTBS) and Mean Time Between System Interruption (MTBSI).  I've blogged in detail about these before, but to put there use in context here, we will actually use MTBSI and not MTBF for the model.  Why? Because if a compute node has any sort of redundancy (ECC memory, mirrored disks, etc.) then the node may still work after a component has failed. But we want to model our repair schedule based on how often we need to fix nodes, so we need to look at how often things break for two cases. The models will show us those details, but I won't trouble you with them today.

The figure below shows a proposed 2000+ node HPC cluster with two different deferred repair models. For one solution, we use a one week (168 hour) deferred repair time. For the other solution, we use a two week deferred repair time. I could show more options, but these two will be sufficient to provide the intuition for solving such mathematical problems.

Deferred Repair Model Results 

We build a model showing the probability that some number of nodes will be down. The OK state is when all nodes are operational. It is very clear that the longer we wait to repair the nodes, the less probable it is that the cluster will be in the OK state. I would say, that that with a two week deferred maintenance model, there is nearly zero probability that all nodes will be operational. Looking at this another way, if you want all nodes to be available, you need to have a very, very fast repair time (MTTR approaching 0 time). Since fast MTTR is very expensive, accepting a deferred repair and using spares is usually a good cost trade-off.

OK, so we're convinced that a deferred repair model is the way to go, so how many spare compute nodes do we need? A good way to ask that question is, "how may spares do I need to ensure that there is a 95% probability that I will have a minumum of M nodes available?" From the above graph, we would accumulate the probability until we reached the 95% threshold. Thus we see that for the one week deferred repair case, we need at least 8 spares and for the two week deferred repair case we need at least 12 spares. Now this is something we can work with.

The model results will change based on the total number of compute nodes and their MTBSI. If you have more nodes, you'll need more spares. If you have more reliable or redundant nodes, you need fewer spares. If we know the reliability of the nodes and their redundancy characteristics, we have models which can tell you how many spares you need.

This sort of analysis also lets you trade-off the redundancy characteristics of the nodes to see how that affects the system, too. For example, we could look at the affect of zero, one, or two disks (mirrored) per node on the service levels. I personally like the zero disk case, where the nodes boot from the network, and we can model such complex systems quite easily, too. This point should not be underestimated, as you add redundancy to increase the MTBSI, you also increase the MTBS, which impacts your service costs.  The engineer's life is a life full of trade-offs.

 

In conclusion, building clusters with lots of nodes (red shift designs) requires additional analysis beyond what we would normally use for critical systems with few nodes (blue shift designs). We often look at service costs using a deferred service interval and how that affects the overall system service level. We also look at the trade-offs between per-node redundancy and the overall system service level. With proper analysis, we can help determine the best performance and best cost for large, red shift systems.

 

 

[1] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Introduction to Performability Analysis

Tuesday Oct 16, 2007

Modern systems are continuing to evolve and become more tolerant to failures. For many systems today, a simple performance or availability analysis does not reveal how well a system will operate when in a degraded mode. A performability analysis can help answer these questions for complex systems. In this blog, I'll show one of the methods we use for performability analysis.

We often begin with a small set of components for test and analysis. Traditional benchmarking or performance characterization is a good starting point. For this example, we will analyze a storage array. We begin with an understanding of the performance characteristics of our desired workload, which can vary widely for storage subsystems. In our case, we will create a performance workload which includes a mix of reads and writes, with a consistent iop size, and a desired performance metric of iops/second. Storage arrays tend to have many possible RAID configurations which will have different performance and data protection trade-offs, so we will pick a RAID configuration which we think will best suit our requirements. If it sounds like we're making a lot of choices early, it is because we are. We know that some choices are clearly bad, some are clearly good, and there are a whole bunch of choices in between. If we can't meet our design targets after the performability analysis, then we might have to go back to the beginning and start again - such is the life of a systems engineer.

Once we have a reasonable starting point, we will setup a baseline benchmark to determine the best performance for a fully functional system. We will then use fault injection to measure the system performance characteristics under the various failure modes expected in the system. For most cases, we are concerned with hardware failures. Often the impact on the performance of a system under failure conditions is not constant. There may be a fault diagnosis and isolation phase, a degraded phase, and a repair phase. There may be several different system performance behaviors during these phases. The transient diagram below shows the performance measurements of a RAID array with dual redundant controllers configured in a fully redundant, active/active operating mode. We bring the system to a steady state and then inject a fault into one of the controllers.

array fault transient analysis 

This analysis is interesting for several different reasons. We see that when the fault was injected, there was a short period where the array serviced no I/O operations. Once the fault was isolated, then a recovery phase was started during which the array was operating at approximately half of its peak performance. Once recovery was completed, the performance returned to normal, even though the system is in a degraded state. Next we repaired the fault. After the system reconfigured itself, performance returned to normal for the non-degraded system. You'll note that during the post-repair reconfiguration the array stopped servicing I/O operations and this outage was longer than the outage in the original fault. Sometimes, a trade-off is made such that the impact of the unscheduled fault is minimized at the expense of the repair activity. This is usually a good trade-off because the repair activity is usually a scheduled event, so we can limit the impact via procedures and planning. If you have ever waited for an fsck to finish when booting a system, then you've felt the impact of such decisions and understand why modern file systems have attempted to minimize the performance costs of fsck, or eliminated the need for fsck altogether.

Modeling the system in this way means that we will consider both the unscheduled faults as well as the planned repair, though we usually make the simplifying assumption that there will be one repair action for each unscheduled fault.

If this sort of characterization sounds tedious, well it is. But it is the best way for us to measure the performance of a subsystem under faulted conditions. Trying to measure the performance of a more complex system with multiple servers, switches, and arrays under a comprehensive set of fault conditions would be untenable. We do gain some reduction of the test matrix because we know that some components have no impact on performance when they fail.

Next we build a RAScad model for the system. I usually use a heirarchial model built from components which hides much of the complexity from me, but for this simpler example, the Markov model looks like this:

Markov model 

Where the states are explained by this table:

State

Explanation

Transition Rate

Explanation

28,0,1

No failures

m_repair

rate (=1/MTTR)

1 UIC_Dn

1 UIC is down

l_uic

UIC failure rate

Down

System is down

l_mp

Midplane failure rate

1 Ctlr_Dn

1 Controller is down

l_cntl

Controller failure rate

1PCU_Dn

1 PCU is down

l_pcu

PCU failure rate

27,1,0

1 disk is under reconstruction

l_recon

Disk reconstruction rate

28,1,1

1 disk is under reconstruction, 1 spare disk available

l_disk

Disk failure rate

27,0,0

No spare disk



26,0,0

One parity group loses 1 disk, no

spare available, no disk reconstruction



Solving the Markov model will provide us with the average staying time per year in each of the states. Note that we must make some sort of assumptions about the service response time. We will usually use 4 hour service response time for enterprise-class operations. Is that assumption optimal? We don't always know, so that is another feature of a system I'll explore in a later blog.

So now we have the performance for each state, and the average staying time per year. These are two variables, so lets graph them on an X-Y plot. To make it easier to compare different systems, we sort by the performance (in the Y-axis). We call the resulting graph a performability graph or P-Graph for short. Here is an example of a performability graph showing the results for three different RAID array configurations.

simple performability graph 

I usually label availability targets across the top as an alternate X-axis label because many people are more comfortable with availability targets represented as "nines" than seconds or minutes. In order to show the typically small staying time, we use a log scale on the X-axis. The Y-axis shows the performance metric. I refer to the system's performability curve as a performability envelope

because it represents the boundaries of performance and availability, where we can expect the actual use to fall below the curve for any interval.

Suppose you have a requirement for an array that delivers 1,500 iops with "four-nines" availability. You can see from the performability graph that Product A and C can deliver 1,500 iops, Product C can deliver "four-nines" availability, but only Product A can deliver both 1,500 iops and "four-nines" availability.

To help you understand the composition of the graph, I colored some of the states which have longer staying times.

composite fault performability graph 

You can see that some of the failure states have little impact on performance, whereas others will have a significant impact on performance. For this array, when a power supply/battery unit fails, the write cache is placed in write through mode, which has a significant performance impact. Also, when a disk fails and is being reconstructed, the overall performance is impacted. Now we have a clearer picture of what performance we can expect from this array per year.

This composition view is particularly useful for product engineers, but is less useful to systems engineers. For complex systems, there are many products, many failure modes, and many more trade-offs to consider. More on that later...

Like this post? del.icio.us | furl | slashdot | technorati | digg

Performability analysis of T5120 and T5220

Tuesday Oct 09, 2007

In complex systems, we must often trade-off performance against reliability, availability, or serviceability. In many cases, a system design will include both performance and availability requirements. We use performability analysis to examine the performance versus availability trade-off. Performability is simply the ability to perform. A performability analysis combines performance characterization for systems under the possible combinations of degraded states with the probability that the system will be operating the degraded states.

The simplest performability analysis is often appropriate for multiple node, shared nothing clusters which scale performance perfectly. For example, in a simple web server farm, you might have N servers capable of delivering M pages per server. Disregarding other bottlenecks in the system such, as the capacity of the internet connection to the server farm, we can say that N+1 servers will deliver M*(N+1) performance. Thus we can estimate the aggregate performance of any number of web servers.

We can also perform an availability analysis on a web server. We can build Markov models which consider the reliability of the components in a server and their expected time to repair. The output of the models will provide the estimated time per year that each web server may be operational. More specifically, we will know the staying time per year for each of the model states. For a simple model, the performance reward for an up state is M and a down state is 0. A system which provides 99.99% (four-nines) availability can be expected to be down for approximately 53 minutes per year and up for the remainder.

For a shared nothing cluster, we can further simplify the analysis by ignoring common fault effects. In practice, this means that a failure or repair in one web server does not affect any other web servers. In many respects, this is the same simplifying assumption we made with performance, where the performance of a web server is dependent on any of the other web servers.

The shared nothing cluster availability model will contain the following system states and the annual staying time in each state: all up, one down (N-1 up), two down (N-2 up), three down (N-3 up), and so on. The availability model inputs include the unscheduled mean time between system interruption (U_MTBSI) and mean time to repair (MTTR) for the nodes. We often choose a MTTR value by considering the cost of service response time. For many shared nothing clusters, a service response time of 48 hours may be reasonable – a value which may not be reasonable for a database or storage tier. Model results might look like this:

System State

Annual Staying Time (minutes)

Cumulative Uptime (%)

Performance Reward

All up

521,395.20

99.2

M * N

1 down

4,162.75

99.992

M * (N - 1)

2 down

39.95

99.9996

M * (N - 2)

3 down

2.00

99.99998

M * (N - 3)

> 3 down

0.11

100

< M * (N - 4)

Total

525,600.00

100


Now we have enough data to evaluate the performability of the system. For the simple analysis, we accept the cumulative uptime result for the minimum required performance. We can then compare various systems considering performability.

We have modeled the new Sun SPARC Enterprise T5120 and Sun SPARC Enterprise T5220 servers against the venerable Sun Fire V490 servers. For this analysis we chose a performance benchmark with a metric that showed we needed 6 T5120 or T5220 servers to match the performance of 9 V490 servers. We will choose to overprovision by one server, which is often optimum for such architectures. The performability results are:

Servers

Units

Performability (%)

Sun SPARC Enterprise T5120

6 + 1

99.99988

Sun SPARC Enterprise T5220

6 + 1

99.99988

Sun Fire V490

9 + 1

99.99893

You might notice that the T5120 and T5220 have the same performability results. This is because they share the same motherboard design, disks, power supplies, etc. It is much more interesting to compare these to the V490. Even though we use more V490 systems, the T5120 and T5220 solution provides better performability. Fewer, faster, more reliable servers should generally have better performability than more, slower, less reliable servers.

 

Like this post? del.icio.us | furl | slashdot | technorati | digg

Performability Analysis for Storage

Thursday Oct 04, 2007

I'll be blogging about performability analysis over the next few weeks. Last year Hairong Sun, Tina Tyan, Steven Johnson, Nisha Talagala, Bob Wood, and I published a paper on how we do performability analysis at Sun.  It is titled Performability Analysis of Storage Systems in Practice: Methodology and Tools, and is available online at SpringerLink. Here is the abstract:

This paper presents a methodology and tools used for performability analysis of storage systems in Sun Microsystems. A Markov modeling tool is used to evaluate the probabilities of normal and fault states in the storage system, based on field reliability data collected from customer sites. Fault injection tests are conducted to measure the performance of the storage system in various degraded states with a performance benchmark developed within Sun Microsystems. A graphic metric is introduced for performability assessment and comparison. An example is used throughout the paper to illustrate the methodology and process.

I'm giving a presentation on performability at Sun's Customer Engineering Conference next week, so if you're attending stop by and visit.

[2] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Adaptec webinar on disks and error handling

Wednesday Oct 03, 2007

Adaptec has put together a nice webinar called Nearline Data Drives and Error Handling. If you work with disks or are contemplating building your own home data server, I recommend that you take 22 minutes to review the webinar. As a systems vendor, we are often asked why we made certain design decisions to favor data over costs, and I think this webinar does a good job of showing how some of the complexity of systems design covers a large number of decision points.  Here in the RAS Engineering group we tend to gravitate towards the best reliability and availability of systems, which still requires a staggering number of design trade-offs.  Rest assured that we do our best to make these decisions with your data in mind.

For the ZFSers in the world, this webinar also provides some insight into how RAID systems like ZFS are designed, and why end-to-end data protection is vitally important.

Enjoy!  And if you don't want your Starbuck's gift card, send it to me :-)
 

Like this post? del.icio.us | furl | slashdot | technorati | digg

Mainframe inspired RAS features in new SPARC Enterprise Servers

Monday Apr 23, 2007

My colleague, Gary Combs, put together a podcast describing the new RAS features found in the Sun SPARC Enterprise Servers. The M4000, M5000, M8000, and M9000 servers have very advanced RAS features, which put them head and shoulders above the competition. Here is my list of favorites, in no particular order:

  1. Memory mirroring. This is like RAID-1 for main memory. As I've said many times, there are 4 types of components which tend to break most often: disks, DIMMs (memory), fans, and power supplies. Memory mirroring brings the fully redundant reliability techniques often used for disks, fans, and power supplies to DIMMs.
  2. Extended ECC for main memory.  Full chip failures on a DIMM can be tolerated.
  3. Instruction retry. The processor can detect faulty operation and retry instructions. This feature has been available on mainframes, and is now available for the general purpose computing markets.
  4. Improved data path protection. Many improvements here, along the entire data path.  ECC protection is provided for all of the on-processor memory.
  5. Reduced part count from the older generation Sun Fire E25K.  Better integration allows us to do more with fewer parts while simultaneously improving the error detection and correction capabilities of the subsystems.
  6. Open-source Solaris Fault Management Architecture (FMA) integration. This allows systems administrators to see what faults the system has detected and the system will automatically heal itself.
  7. Enhanced dynamic reconfiguration.  Dynamic reconfiguration can be done at the processor, DIMM (bank), and PCI-E (pairs) level of grainularity.
  8. Solaris Cluster support.  Of course Solaris Cluster is supported including clustering between Solaris containers, dynamic system domains, or chassis.
  9. Comprehensive service processor. The service processor monitors the health of the system and controls system operation and reconfiguration. This is the most advanced service processor we've developed. Another welcome feature is the ability to delegate responsibilities to different system administrators with restrictions so that they cannot control the entire chassis.  This will be greatly appreciated in large organizations where multiple groups need computing resources.
  10. Dual power grid. You can connect the power supplies to two different power grids. Many people do not have the luxury of access to two different power grids, but those who have been bitten by a grid outage will really appreciate this feature.  Think of this as RAID-1 for your power source.

I don't think you'll see anything revolutionary in my favorites list. This is due to the continuous improvements in the RAS technologies.  The older Sun Fire servers were already very reliable, and it is hard to create a revolutionary change for mature technologies.  We have goals to make every generation better, and we've made many advances with this new generation.  If the RAS guys do their job right, you won't notice it - things will just keep working.

[1] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Intro to Mean Time Between Service (MTBS) Analysis

Friday Feb 09, 2007

Mean Time Between Service (MTBS)

Systems which include redundancy can often have very high reliability and availability. When systems get to high levels of redundancy, such as reliability estimated in decades or availability above 99%, then we need to look at other measurements to differentiate design choices. One such measurement is the Mean Time Between Service (MTBS). For example, a 2-way mirror RAID system will offer data redundancy, high Mean Time Between System Interruption (MTBSI), and good availability. Adding a hot spare disk will decrease the Mean Time To Repair (MTTR) and thus improve MTBSI and availability. But what about adding two hot spare disks? The answer is intuitively obvious, more spare disks means that the system should be able to survive even longer in the face of failures. In general, adding more hot spares is better. When comparing two different RAID configurations, it can be helpful to know the MTBS as well as other Reliability, Availability, and Serviceability (RAS) metrics.

We define MTBS as the mean time between failures of a system which would lead to a service call. For a non-redundant system, a service call would occur immediately after a system interruption. For redundant systems, a service call may be scheduled to replace a failed component, even if an interruption did not occur. Intuitively, if we can defer service calls, then the system is easier manage. In the RAID systems, more hot spares means that we can defer scheduled replacement service in the event that one disk fails. The more hot spares we have, the longer we can defer the scheduled replacement service.

Mathematical Details

To calculate MTBS we need to define a fixed time, T, as an interval for the analysis. At Sun, we often use 5 years for disk drives.

T = 5 years = 43,800 hours

Next, we need to calculate the reliability of the device over that period of time. For the general cases, we use an exponential distribution. For a disk, we might see:

MTBF = 800,000 hours

For which we calculate the reliability for time T as:

Rdisk(T) = e-(T/MTBF) = e-(43,800/800,000) = 0.9467

The average failure rate for time T is:

FRdisk(T) = (1 - Rdisk(T))/T = (1 - 0.9467)/43,800 = 0.000001217

For a single disk, which requires a service call when it fails, the MTBS is simply the reciprocal of the failure rate:

MTBS = 1/FRdisk(T) = 1/0.000001217 = 821,764 hours

For this simple case, you can see that MTBS is simply the time-bounded evaluation of the infinite-time MTBF and isn't very interesting. However, when we add more disks, the calculations become more interesting. We can use binomial theorem analysis to determine the probability that multiple disks have failed. Consider the case where you have two identical disks with the same expected MTBF as above. We can see that the possible combinations of disks being functional follow the binomial theorem distribution. First, we calculate the probability of being in a state where K of N (N choose K) drives are failed, which we'll call Pfailed:

Pfailed = (N choose K) * Rdisk(T)N-K * (1 - Rdisk(T))K

This is obviously a good job for a spreadsheet or other automation. The results can be easily seen in the table below, where we have 2 disks, N=2:

Number of Failed Disks (K)

Binomial Distribution
(N choose K)

Pfailed

Cumulative Pfailed

FR(T)

MTBS (hours)

0

1

0.8963

0.8963

2.368e-6

422,300

1

2

0.1009

0.9972

6.481e-8

15,430,323

2

1

0.0028

1.0000

0

If we have a service policy such that we immediately replace a failed drive, then the MTBS is 422,300 hours or approximately MTBF/2, as you would expect.

Suppose we have a mirrored pair and one hot spare (N=3), then the table looks like:

Number of Failed Disks (K)

Binomial Distribution
(N choose K)

Pfailed

Cumulative Pfailed

FR(T)

MTBS (hours)

0

1

0.8485

0.8485

3.458e-6

289,166

1

3

0.1433

0.9918

1.875e-7

5,332,858

2

3

0.0081

0.9998

3.453e-9

289,617,933

3

1

0.0002

1.0000

0

Here we can see that more disks means that the service rate increases and MTBS is lower. However, now we can implement a service policy such that if we only replace a disk when we get down to no redundancy, then the MTBS is larger than for the 2-disk case with the same policy, 289,617,933 vs 15,430,323 hours. A side benefit is that the MTTR of a hot spare does not include a human reaction component, so it is as low as can be reasonably expected. Low MTTR leads to higher availability. Clearly, adding a hot spare is a good thing, from a RAS perspective.

As we add disks, we tend to prefer a service policy such that we don't have to risk getting down to a non-redundant state before we take corrective action. In other words, we may want to look at the MTBS value when the number of failed disks (K) is equal to the number of hot spares. This isn't at all interesting in the 2-disk case, and only mildly interesting in the 3-disk case. Suppose we have 10 disks, then the table looks like:

Number of Failed Disks (K)

Binomial Distribution
(N choose K)

Pfailed

Cumulative Pfailed

FR(T)

MTBS (hours)

0

1

0.5784

0.5784

9.626e-6

103,888

1

10

0.3255

0.9039

2.194e-6

455,747

2

45

0.0824

0.9863

3.122e-7

3,202,918

3

120

0.0124

0.9987

2.978e-8

33,574,752

4

210

0.0012

0.9999

1.969e-9

507,775,096

5

252

8.227e-5

1.0000

9.098e-11

10,990,887,270

6

210

3.3858e-6

1.0000

2.893e-12

345,617,834,855

7

120

1.241e-7

1.0000

6.054e-14

16,519,338,761,278

8

45

2.619e-9

1.0000

7.519e-16

1,330,049,617,377,480

9

10

3.275e-11

1.0000

4.208e-18

237,659,835,757,624,000

10

1

1.843e-13

1.0000

0

Now you can begin to see the trade-off more clearly as you compare the 3-disk scenario with the 10-disk scenario. If we have a service policy which says to replace each disk as they fail, then we will be doing so much more frequently for the 10-disk scenario than for the 3-disk scenario. Again, this should make intuitive sense. But the case can also be made that if we only want to service the system at a given (effective, relative) rate of 3,000,000 hours, then we'd need to have at least two hot spares for the 10-disk scenario versus only one hot spare for the 3-disk scenario. Which is better? Well, it depends on other factors you need to consider when designing such systems. The MTBS analysis provides just one view into the problem.

Caveat Emptor

When we look at failure rates, MTBF, and the MTBS derived from them, people are often shocked by the magnitude of the values. 800,000 hours is 91 years. 91 years ago disk drives did not exist and any current disk drive is unlikely to function 91 years from now. Most disk drives have an expected lifetime of 5 years or so. The affect this has on the math is that the failure rate is not constant and changes over time. As the disk drive gets older, the failure rate increases. From an academic perspective, it would be nice to know the shape of that curve and adjust the algorithms accordingly. From a practical perspective, such data is very rarely publicly available. You can also inoculate your design from this by planning for a technology refresh every 5 years or so. The accuracy of the magnitude does not make a practical impact on system analysis. It is more important to judge the relative merits of competing designs and configurations. For example, should you use 2-way RAID-1 with a spare versus a 3-way RAID-5? The analysis described here can help answer that question by evaluating the impact of the decisions relative to each other. You should interpret the results accordingly.



Like this post? del.icio.us | furl | slashdot | technorati | digg

A story of two MTTDL models

Wednesday Jan 17, 2007

Mean Time to Data Loss (MTTDL) is a metric that we find useful for comparing data storage systems. I think it is particularly useful for determining what sort of data protection you may want to use for a RAID system. For example, suppose you have a Sun Fire X4550 (aka Thumper) server with 48 internal disk drives. What would be the best way to configure the disks for redundancy? Previously, I explored space versus MTTDL and space versus unscheduled Mean Time Between System Interruptions (U_MTBSI) for the X4500 running ZFS. The same analysis works for SVM or LVM, too.

For this blog, I want to explore the calculation of MTTDL for a bunch of disks. It turns out, there are multiple models for calculating MTTDL. The one described previously here is the simplest and only considers the Mean Time Between Failure (MTBF) of a disk and the Mean Time to Repair (MTTR) of the repair and reconstruction process. I'll call that model #1 which solves for MTTDL[1]. To quickly recap:

For non-protected schemes (dynamic striping, RAID-0)
MTTDL[1] = MTBF / N
For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):
MTTDL[1] = MTBF2 / (N * (N-1) * MTTR)
For double parity schemes (3-way mirror, raidz2, RAID-6):
MTTDL[1] = MTBF3 / (N * (N-1) * (N-2) * MTTR2)
You can often get MTBF data from your drive vendor and you can measure or estimate your MTTR with reasonable accuracy. But MTTDL[1] does not consider the Unrecoverable Error Rate (UER) for read operations on disk drives. It turns out that the UER is often easier to get from the disk drive data sheets, because sometimes the drive vendors don't list MTBF (or Annual Failure Rate, AFR) for all of their drive models. Typically, UER will be 1 per 1014 bits read for consumer class drives and 1 per 1015 for enterprise class drives. This can be alarming, because you could also say that consumer class drives should see 1 UER per 12.5 TBytes of data read. Today, 500-750 GByte drives are readily available and 1 TByte drives are announced. Most people will be unhappy if they get an unrecoverable read error once every dozen or so times they read the whole disk. Worse yet, if we have the data protected with RAID and we have to replace a drive, we really do hope that the data reconstruction completes correctly. To add to our nightmare, the UER does not decrease by adding disks. If we can't rely on the data to be correctly read, we can't be sure that our data reconstruction will succeed, and we'll have data loss. Clearly, we need a model which takes this into account. Let's call that model #2, for MTTDL[2]:
First, we calculate the probability of unsuccessful reconstruction due to a UER for N disks of a given size (unit conversion omitted):
Precon_fail = (N-1) * size / UER
For single-disk failure protection:
MTTDL[2] = MTBF / (N * Precon_fail)
For double-disk failure protection:
MTTDL[2] = MTBF2/ (N * (N-1) * MTTR * Precon_fail)
Comparing the MTTDL[1] model to the MTTDL[2] model shows some interesting aspects of design. First, there is no MTTDL[2] model for RAID-0 because there is no data reconstruction – any failure and you lose data. Second, the MTTR doesn't enter into the MTTDL[2] model until you get to double-disk failure scenarios. You could nit pick about this, but as you'll soon see, it really doesn't make any difference for our design decision process. Third, you can see that the Precon_fail is a function of the size of the data set. This is because the UER doesn't change as you grow the data set. Or, to look at it from a different direction, if you use consumer class drives with 1 UER for 1014 bits, and you have 12.5 TBytes of data, the probability of an unrecoverable read during the data reconstruction is 1. Ugh. If the Precon_fail is 1, then the MTTDL[2] model looks a lot like the RAID-0 model and friends don't let friends use RAID-0! Maybe you could consider a smaller sized data set to offset this risk. Let's see how that looks in pictures.

 MTTDL models for 2-way mirror

2-way mirroring is an example of a configuration which provides single-disk failure protection. Each data point represents the space available when using 2-way mirrors in a zpool. Since this is for a X4500, we consider 46 total available disks and any disks not used for data are available as spares. In this graph you can clearly see that the MTTDL[1] model encourages the use of hot spares. More importantly, although the results of the calculations of the two models are around 5 orders of magnitude different, the overall shape of the curve remains the same. Keep in mind that we are talking years here, perhaps 10 million years, which is well beyond the 5-year expected life span of a disk. This is the nature of the beast when using a constant MTBF. For models which consider the change in MTBF as the device ages, you should never see such large numbers. But the wish for more accurate models does not change the relative merits of the design decision, which is what we really care about – the best RAID configuration given a bunch of disks. Should I use single disk failure protection or double disk failure protection? To answer that, lets look at the model for raidz2.

MTTDL models for raidz2 

From this graph you can see that double disk protection is clearly better than single disk protection above, regardless of which model we choose. Good, this makes sense. You can also see that with raidz2 we have a larger number of disk configuration options. A 3-disk raidz2 set is somewhat similar to a 3-way mirror with the best MTTDL, but doesn't offer much available space. A 4-disk set will offer better space, but not quite as good MTTDL. This pattern continues through 8 disks/set. Judging from the graphs, you should see that a 3-disk set will offer approximately an order of magnitude better MTTDL than an 8-disk, for either MTTDL model. This is because the UER remains constant while the data to be reconstructed increases.
I hope that these models give you an insight into how you can model systems for RAS. In my experience, most people get all jazzed up with the space and forget that they are often making a space vs. RAS trade-off. You can use these models to help you make good design decisions when configuring RAID systems. Since the graphs use Space on the X-axis, it is easy to look at the design trade-offs for a given amount of available space.

Just one more teaser... there are other MTTDL models, but it is unclear if they would help make better decisions, and I'll explore those in another blog.

[5] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Diversity revisited

Tuesday Jan 16, 2007

The recent disruption of the internet connection between Taiwan and the mainland is an example of a system where there is not much diversity. This is a rather large scale example, perhaps one of the largest such examples on earth. The lesson is that the diversity in your design needs to be able to cover the expected fault (pun intended) zone. In this case, what was lost was capacity, the international connection capacity was reduced to approximately 50%.

Another, more local example of diversity in communications recently occurred to my colleague, Mike Anuta, in  Wisconsin. A construction accident severed the underground fiber link. This resulted in disconnecting the local area from the rest of the world. Mike says the accident also clobbered a bunch of the cellular service. Being a clever engineer, and having a working internet link, Mike was able to access the most important application at Sun, e-mail. The big question is whether it is also worth the cost to add a VOIP account, if the pricing was reasonable. So, this is another example of how you can work around systems faults using diversity. Remember, you can run IP over almost anything from cups-n-strings to radio links to fiber -- all using a protocol (IP) designed to overcome link faults.


Like this post? del.icio.us | furl | slashdot | technorati | digg

ZFS RAID recommendations: space vs MTTDL

Thursday Jan 11, 2007

It is not always obvious what the best RAID set configuration should be for a given set of disks. This is even more difficult to see as the number of disks grows large, like on a Sun Fire X4500 (aka Thumper) server. By default, the X4500 ships with 46 disks available for data. This leads to hundreds of possible permutations of RAID sets.  Which would be best? One analysis is the trade-off space and Mean Time To Data Loss (MTTDL). For this blog, I will try to stick with ZFS terminology in the text, but the principles apply to other RAID systems, too.

The space calculation is straightforward.  Configure the RAID sets and sum the total space available.

The MTTDL calculation is one attribute of Reliability, Availability, and Serviceability (RAS) which we can also calculate relatively easily. For large numbers of disks, MTTDL is particularly useful because we only need to consider the reliability of the disks, and not the other parts of the system (fodder for a later blog :-). While this doesn't tell the whole RAS story, it is a very good method for evaluating a big bunch of disks. The equations are fairly straightforward:

For non-protected schemes (dynamic striping, RAID-0)

MTTDL = MTBF / N

For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):

MTTDL = MTBF2 / (N * (N-1) * MTTR)

For double parity schemes (3-way mirror, raidz2, RAID-6):

MTTDL = MTBF3 / (N * (N-1) * (N-2) * MTTR2)

Where MTBF is the Mean Time Between Failure and MTTR is the Mean Time To Recover. You can get MTBF values from disk data sheets which are usually readily available. You could also adjust them for your situation or based upon your actual experience. At Sun, we have many years of field failure data for disks and use design values which are consistent with our experiences. YMMV, of course. For MTTR you need to consider the logistical repair time, which is usually the time required to identify the failed disk and physically replace it.  You also need to consider the data reconstruction time, which may be a long time for large disks, depending on how rapidly ZFS or your logical volume manager (LVM) will reconstruct the data. Obviously, a spreadsheet or tool helps ease the computational burden.

Note: the reconstruction time for ZFS is a function of the amount of data, not the size of the disk. Traditional LVMs or hardware RAID arrays have no context of the data and therefore have to reconstruct the entire disk rather than just reconstruct the data. In the best case (0% used), ZFS will reconstruct the data almost instantaneously.  In the worst case (100% used) ZFS will have to reconstruct the entire disk, just like a traditional LVM.  This is one of the advantages of ZFS over traditional LVMs: faster reconstruction time, lower MTTR, better MTTDL.

Note: if you have a multi-level RAID set, such as RAID-1+0, then you need to use both the single parity and no protection MTTDL calculations to get the MTTDL of the top-level volume. 

So, I took a large number of possible ZFS configurations for a X4500 and calculated the space and MTTDL for the zpool. The interesting thing is that the various RAID protection schemes fall out in clumps. For example, you would expect that a 3-way mirror has better MTTDL and less available space than a 2-way mirror. As you vary the configurations, you can see the changes in space and MTTDL, but you would never expect a 2-way mirror to have better MTTDL than a 3-way mirror. The result is that if you plot the available space against the MTTDL, then the various RAID configurations will tend to clump together.

X4500 MTTDL vs Space

The most obvious conclusion from the above data is that you shouldn't use simply dynamic striping or RAID-0. Friends don't let friends use RAID-0!

You will also notice that I've omitted the values on the MTTDL axis. You've noticed that the MTTDL axis uses a log scale, so that should give you a clue as to the magnitude of the differences. The real reason I've omitted the values is because they are a rat hole opportunity with a high entrance probability. It really doesn't matter if the MTTDL calculation shows that you should see a trillion years of MTTDL because the expected lifetime of a disk is on the order of 5 years.  I don't really expect any disk to last more than a decade or two. What you should take away from this is that bigger MTTDL is better, and you get a much bigger MTTDL as you increase the number of redundant copies of the data. It is better to stay out of the MTTDL value rat hole. 

The other obvious conclusion is that you should use hot spares. The reason for this is that when a hot spare is used, the MTTR is decreased because we don't have to wait for the physical disk to be replaced before we start data reconstruction on a spare disk. The time you must wait for the data to be reconstructed and available is time where you are exposed to another failure which may cause data loss. In general, you always want to increase MTBF (the numerator) and decrease MTTR (the denominator) to get high RAS.

The most interesting result of this analysis is that the RAID configurations will tend to clump together. For example, there isn't much difference between the MTTDL of a 5-disk zpool versus a 6-disk raidz zpool.

But if you look at this data another way, there is a huge difference in the RAS. For example, suppose you want 15,000 GBytes of space in your X4500.  You could use either raidz or raidz2 with or without spares. Clearly, you would have better RAS if you choose raidz2 with spares than any of the other options for the space requirement. Whether you use 6, 7, 8, or 9 disks in your raidz2 set makes less difference in MTTDL.

 

There are other considerations when choosing the ZFS or RAID configurations which I plan to address in later blogs. For now, I hope that this will encourage you to think about how you might approach the space and RAS trade-offs for your storage configurations.

 

[10] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg