Wednesday Feb 20, 2008
When we build large clusters, such as high performance clusters or any cluster with a large number of computing nodes, we begin to look in detail at the repair models for the system. You are probably aware of the need to study power usage, air conditioning, weight, system management, networking, and cost for such systems. So you are also aware of how multiplying the environmental needs of one computing node times the number of nodes can become a large number. This can be very intuitive for most folks. But availability isn't quite so intuitive. Deferred repair models can also affect the intuition of the design. So, I thought that a picture would help show how we analyze the RAS characteristics of such systems and why we always look to deferred repair models in their design.
To begin, we have to make some assumptions:
- The availability of the whole is not interesting. The service provided by a big cluster is not dependent on all parts being functional. Rather, we look at it like a swarm of bees. Each bee can be busy, and the whole swarm can contribute towards making honey, but the loss of a few bees (perhaps due to a hungry bee eater) doesn't cause the whole honey producing process to stop. Sure, there may be some components of the system which are more critical than others, like the queen bee, but work can still proceed forward even if some of these systems are temporarily unavailable (the swarm will create new queens, as needed). This is a very different view than looking at the availability of a file service, for example.
- The performability might be interesting. How many dead bees can we have before the honey production falls below our desired level? But for very, very large clusters, the performability will be generally good, so a traditional performability analysis is also not very interesting. It is more likely that a performability analysis of the critical components, such as networking and storage, will be interesting. But the performability of thousands of compute nodes will be less interesting.
- Common root cause failures are not considered. If a node fails, the root cause of the failure is not common to other nodes. A good example of a common root cause failure is loss of power -- if we lose power to the cluster, all nodes will fail. Another example is software -- a software bug which causes the nodes to crash may be common to all nodes.
- What we will model is a collection of independent nodes, each with their own, independent failure causes. Or just think about bees.
For a large number of compute nodes, even using modern, reliable designs, we know that the probability of all nodes being up at the same time is quite small. This is obvious if we look at the simple availability equation:
Availability = MTBF / (MTBF + MTTR)
where MTBF (mean time between failures) is MTBF[compute node]/N[nodes]
and MTTR (mean time to repair) is > 0
The killer here is N. As N becomes large (thousands) and MTTR is dependent on people, then the availability becomes quite small. The time required to repair a machine is included in the MTTR. So as N becomes large, there is more repair work to be done. I don't know about you, but I'd rather not spend my life in constant repair mode, so we need to look at the problem from a different angle.
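To make the effect of N concrete, here is a minimal Python sketch of the equation above. The node MTBF of 100,000 hours and the 24 hour MTTR are made-up numbers chosen only for illustration; they are not measurements of any real system.

```python
def cluster_availability(node_mtbf_hours, n_nodes, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR), with MTBF = MTBF[compute node] / N."""
    cluster_mtbf = node_mtbf_hours / n_nodes
    return cluster_mtbf / (cluster_mtbf + mttr_hours)


# Hypothetical inputs: 100,000 hour node MTBF, 24 hour MTTR.
for n in (1, 10, 100, 1000, 2000):
    a = cluster_availability(node_mtbf_hours=100_000, n_nodes=n, mttr_hours=24)
    print(f"N = {n:5d}  availability of 'all nodes up' = {a:.4f}")
```

Running it with your own numbers shows how quickly the "all nodes up" availability falls as N grows while the per-node figures stay the same.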
If we make MTTR large, then the availability will drop to near zero. But if we have some spare compute nodes, then we might be able to maintain a specified service level. Or, from a practical perspective, we could ask the question, "how many spare compute nodes do I need to keep at least M compute nodes operational?" The next, related question is, "how often do we need to schedule service actions?" To solve this problem, we need a model.
Before I dig into the model results, I want to digress for a moment and talk about Mean Time Between Service (MTBS) and Mean Time Between System Interruption (MTBSI). I've blogged in detail about these before, but to put their use in context here, we will actually use MTBSI and not MTBF for the model. Why? Because if a compute node has any sort of redundancy (ECC memory, mirrored disks, etc.) then the node may still work after a component has failed. But we want to model our repair schedule based on how often we need to fix nodes, so we need to look at how often things break for both cases. The models will show us those details, but I won't trouble you with them today.
The figure below shows a proposed 2000+ node HPC cluster with two different deferred repair models. For one solution, we use a one week (168 hour) deferred repair time. For the other solution, we use a two week deferred repair time. I could show more options, but these two will be sufficient to provide the intuition for solving such mathematical problems.
We build a model showing the probability that some number of nodes will be down. The OK state is when all nodes are operational. It is very clear that the longer we wait to repair the nodes, the less probable it is that the cluster will be in the OK state. I would say that with a two week deferred maintenance model, there is nearly zero probability that all nodes will be operational. Looking at this another way, if you want all nodes to be available, you need to have a very, very fast repair time (MTTR approaching 0 time). Since fast MTTR is very expensive, accepting a deferred repair and using spares is usually a good cost trade-off.
OK, so we're convinced that a deferred repair model is the way to go, so how many spare compute nodes do we need? A good way to ask that question is, "how many spares do I need to ensure that there is a 95% probability that I will have a minimum of M nodes available?" From the above graph, we would accumulate the probability until we reached the 95% threshold. Thus we see that for the one week deferred repair case, we need at least 8 spares and for the two week deferred repair case we need at least 12 spares. Now this is something we can work with.
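The "accumulate the probability until we reach the threshold" step can be sketched in a few lines of Python. This is not the model behind the figure. It is a simplified stand-in which assumes that node failures are independent, that the number of nodes failing during one deferred repair interval is Poisson distributed, and that every failed node stays down until the next service visit. The 2,000 node count and the 84,000 hour MTBSI are hypothetical values, so the results should not be expected to match the numbers quoted above.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))


def spares_needed(nodes, mtbsi_hours, repair_interval_hours, confidence=0.95):
    """Smallest number of spares covering the failures expected in one
    deferred repair interval with at least the given probability."""
    # Assumption: expected failures per interval = N * interval / MTBSI.
    lam = nodes * repair_interval_hours / mtbsi_hours
    spares = 0
    while poisson_cdf(spares, lam) < confidence:
        spares += 1
    return spares


# Hypothetical cluster: 2,000 nodes, 84,000 hour MTBSI per node.
one_week = spares_needed(nodes=2000, mtbsi_hours=84_000, repair_interval_hours=168)
two_weeks = spares_needed(nodes=2000, mtbsi_hours=84_000, repair_interval_hours=336)
print("one week deferred repair:", one_week, "spares")
print("two week deferred repair:", two_weeks, "spares")
```

The useful part is the shape of the calculation: a longer repair interval or a lower MTBSI pushes the 95% point further out, which means more spares.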
The model results will change based on the total number of compute nodes and their MTBSI. If you have more nodes, you'll need more spares. If you have more reliable or redundant nodes, you need fewer spares. If we know the reliability of the nodes and their redundancy characteristics, we have models which can tell you how many spares you need.
This sort of analysis also lets you trade off the redundancy characteristics of the nodes to see how that affects the system, too. For example, we could look at the effect of zero, one, or two disks (mirrored) per node on the service levels. I personally like the zero disk case, where the nodes boot from the network, and we can model such complex systems quite easily, too. This point should not be underestimated: as you add redundancy to increase the MTBSI, you also decrease the MTBS, because there are more parts to service, which impacts your service costs. The engineer's life is a life full of trade-offs.
In conclusion, building clusters with lots of nodes (red shift designs) requires additional analysis beyond what we would normally use for critical systems with few nodes (blue shift designs). We often look at service costs using a deferred service interval and how that affects the overall system service level. We also look at the trade-offs between per-node redundancy and the overall system service level. With proper analysis, we can help determine the best performance and best cost for large, red shift systems.
Tuesday Oct 16, 2007
Modern systems are continuing to evolve and become more tolerant
to failures. For many systems today, a simple performance or
availability analysis does not reveal how well a system will operate
when in a degraded mode. A performability analysis can help answer
these questions for complex systems. In this blog, I'll show one of
the methods we use for performability analysis.
We often begin with a small set of components for test and
analysis. Traditional benchmarking or performance characterization is
a good starting point. For this example, we will analyze a storage
array. We begin with an understanding of the performance
characteristics of our desired workload, which can vary widely for
storage subsystems. In our case, we will create a performance
workload which includes a mix of reads and writes, with a consistent
I/O size, and a desired performance metric of iops. Storage
arrays tend to have many possible RAID configurations which will have
different performance and data protection trade-offs, so we will pick
a RAID configuration which we think will best suit our requirements.
If it sounds like we're making a lot of choices early, it is because
we are. We know that some choices are clearly bad, some are clearly
good, and there are a whole bunch of choices in between. If we can't
meet our design targets after the performability analysis, then we
might have to go back to the beginning and start again - such is the
life of a systems engineer.
Once we have a reasonable starting point, we will set up a baseline
benchmark to determine the best performance for a fully functional
system. We will then use fault injection to measure the system
performance characteristics under the various failure modes expected
in the system. For most cases, we are concerned with hardware
failures. Often the impact on the performance of a system under
failure conditions is not constant. There may be a fault diagnosis
and isolation phase, a degraded phase, and a repair phase. There may
be several different system performance behaviors during these
phases. The transient diagram below shows the performance
measurements of a RAID array with dual redundant controllers
configured in a fully redundant, active/active operating mode. We
bring the system to a steady state and then inject a fault into one
of the controllers.
This analysis is interesting for several different reasons. We see
that when the fault was injected, there was a short period where the
array serviced no I/O operations. Once the fault was isolated, then a
recovery phase was started during which the array was operating at
approximately half of its peak performance. Once recovery was
completed, the performance returned to normal, even though the system
is in a degraded state. Next we repaired the fault. After the system
reconfigured itself, performance returned to normal for the
non-degraded system. You'll note that during the post-repair
reconfiguration the array stopped servicing I/O operations and this
outage was longer than the outage in the original fault. Sometimes, a
trade-off is made such that the impact of the unscheduled fault is
minimized at the expense of the repair activity. This is usually a
good trade-off because the repair activity is usually a scheduled
event, so we can limit the impact via procedures and planning. If you
have ever waited for an fsck to finish when booting a system, then
you've felt the impact of such decisions and understand why modern
file systems have attempted to minimize the performance costs of fsck,
or eliminated the need for fsck altogether. Modeling the system in
this way means that we will consider both the unscheduled faults as
well as the planned repair, though we usually make the simplifying
assumption that there will be one repair action for each unscheduled
fault.
If this sort of characterization sounds tedious, well it is. But
it is the best way for us to measure the performance of a subsystem
under faulted conditions. Trying to measure the performance of a
more complex system with multiple servers, switches, and arrays under
a comprehensive set of fault conditions would be untenable. We do
gain some reduction of the test matrix because we know that some
components have no impact on performance when they fail.
Next we build a RAScad model for the system. I usually use a
hierarchical model built from components which hides much of the
complexity from me, but for this simpler example, the Markov model
looks like this:
Where the states are explained by this table:
| State | Explanation | Transition Rate | Explanation |
|---|---|---|---|
| 28,0,1 | No failures | m_repair | Repair rate (=1/MTTR) |
| 1 UIC_Dn | 1 UIC is down | l_uic | UIC failure rate |
| Down | System is down | l_mp | Midplane failure rate |
| 1 Ctlr_Dn | 1 Controller is down | l_cntl | Controller failure rate |
| 1PCU_Dn | 1 PCU is down | l_pcu | PCU failure rate |
| 27,1,0 | 1 disk is under reconstruction | l_recon | Disk reconstruction rate |
| 28,1,1 | 1 disk is under reconstruction, 1 spare disk available | l_disk | Disk failure rate |
| 27,0,0 | No spare disk | | |
| 26,0,0 | One parity group loses 1 disk, no spare available, no disk reconstruction | | |
Solving the Markov model will provide us with the average staying
time per year in each of the states. Note that we must make some sort
of assumptions about the service response time. We will usually use 4
hour service response time for enterprise-class operations. Is that
assumption optimal? We don't always know, so that is another feature
of a system I'll explore in a later blog.
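To show what "solving the Markov model" produces, here is a minimal Python sketch. It is not the RAScad model described above. It assumes a much simpler birth-death chain (states for 0, 1, or 2 controllers down, one repair at a time), and the 200,000 hour controller MTBF and 8 hour MTTR are hypothetical numbers. The names l_cntl and m_repair are borrowed from the table only for readability.

```python
HOURS_PER_YEAR = 8760.0


def birth_death_steady_state(failure_rates, repair_rates):
    """Steady-state probabilities of a birth-death Markov chain.

    failure_rates[i] is the rate from state i to state i+1 (one more unit down);
    repair_rates[i] is the rate from state i+1 back to state i.
    """
    weights = [1.0]
    for lam, mu in zip(failure_rates, repair_rates):
        # Balance equation: pi[i] * lam = pi[i+1] * mu
        weights.append(weights[-1] * lam / mu)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical inputs, not field data.
l_cntl = 1.0 / 200_000   # controller failure rate (per hour)
m_repair = 1.0 / 8       # repair rate = 1/MTTR (per hour)

probabilities = birth_death_steady_state(
    failure_rates=[2 * l_cntl, l_cntl],   # two controllers can fail, then one
    repair_rates=[m_repair, m_repair],    # assumes one repair at a time
)
for down, p in enumerate(probabilities):
    print(f"{down} controller(s) down: {p * HOURS_PER_YEAR:.6f} hours/year")
```

Multiplying each steady-state probability by the hours in a year gives the average staying time per year for that state. A real model has more states and is not a simple chain, so it needs a general linear solver instead of this closed form.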
So now we have the performance for each state, and the average
staying time per year. These are two variables, so let's graph them
on an X-Y plot. To make it easier to compare different systems, we
sort by the performance (in the Y-axis). We call the resulting graph
a performability graph or P-Graph for short. Here is an
example of a performability graph showing the results for three
different RAID array configurations.
I usually label availability targets across the top as an
alternate X-axis label because many people are more comfortable with
availability targets represented as "nines" than seconds or
minutes. In order to show the typically small staying time, we use a
log scale on the X-axis. The Y-axis shows the performance metric. I
refer to the system's performability curve as a performability
envelope because it represents the boundaries of performance and
availability, where we can expect the actual use to fall below the
curve for any interval.
Suppose you have a requirement for an array that delivers 1,500
iops with "four-nines" availability. You can see from the
performability graph that Product A and C can deliver 1,500 iops,
Product C can deliver "four-nines" availability, but only
Product A can deliver both 1,500 iops and "four-nines"
availability.
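Reading a requirement off the graph amounts to a small calculation, sketched here in Python. The two arrays and their per-state numbers are hypothetical; they are not Product A, B, or C. Each state is a pair of measured performance (iops) and annual staying time (minutes).

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def performability_envelope(states):
    """Sort states by performance (best first) and accumulate staying time.

    states: list of (iops, staying_minutes_per_year) pairs.
    Returns (iops, cumulative_fraction_of_year) points for plotting.
    """
    points = []
    running = 0.0
    for iops, minutes in sorted(states, reverse=True):
        running += minutes
        points.append((iops, running / MINUTES_PER_YEAR))
    return points


def meets_requirement(states, min_iops, min_availability):
    """True if the time spent at or above min_iops covers min_availability."""
    minutes = sum(t for iops, t in states if iops >= min_iops)
    return minutes / MINUTES_PER_YEAR >= min_availability


# Hypothetical arrays: (iops, staying minutes per year) for each state.
array_x = [(2000, 525_000.0), (1600, 570.0), (900, 25.0), (0, 5.0)]
array_y = [(1800, 525_300.0), (1200, 280.0), (0, 20.0)]

print("array_x meets 1,500 iops at four-nines:", meets_requirement(array_x, 1500, 0.9999))
print("array_y meets 1,500 iops at four-nines:", meets_requirement(array_y, 1500, 0.9999))
```

An array can have a higher peak and still fail the requirement if it spends too many minutes per year in states below the performance target.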
To help you understand the composition of the graph, I colored
some of the states which have longer staying times.
You can see that some of the failure states have little impact on
performance, whereas others will have a significant impact on
performance. For this array, when a power supply/battery unit fails,
the write cache is placed in write through mode, which has a
significant performance impact. Also, when a disk fails and is being
reconstructed, the overall performance is impacted. Now we have a
clearer picture of what performance we can expect from this array per
year.
This composition view is particularly useful for product
engineers, but is less useful to systems engineers. For complex
systems, there are many products, many failure modes, and many more
trade-offs to consider. More on that later...
Tuesday Oct 09, 2007
In complex systems, we must often trade off performance against
reliability, availability, or serviceability. In many cases, a system
design will include both performance and availability requirements.
We use performability analysis to examine the
performance versus availability trade-off. Performability is simply
the ability to perform. A performability analysis combines
performance characterization for systems under the possible
combinations of degraded states with the probability that the system
will be operating in those degraded states.
The simplest performability analysis is often appropriate for
multiple node, shared nothing clusters which scale performance
perfectly. For example, in a simple web server farm, you might have N
servers capable of delivering M pages per server. Disregarding other
bottlenecks in the system, such as the capacity of the internet
connection to the server farm, we can say that N+1 servers will
deliver M*(N+1) performance. Thus we can estimate the aggregate
performance of any number of web servers.
We can also perform an availability analysis on a web server. We
can build Markov models which consider the reliability of the
components in a server and their expected time to repair. The output
of the models will provide the estimated time per year that each web
server may be operational. More specifically, we will know the
staying time per year for each of the model states. For a simple
model, the performance reward for an up state is M and a down
state is 0. A system which provides 99.99% (four-nines) availability
can be expected to be down for approximately 53 minutes per year and
up for the remainder.
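The conversion from "nines" to minutes of downtime per year is simple arithmetic, sketched here in Python. It assumes a 365 day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def annual_downtime_minutes(availability):
    """Expected minutes per year in the down state for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, a in [("two-nines", 0.99), ("three-nines", 0.999),
                 ("four-nines", 0.9999), ("five-nines", 0.99999)]:
    print(f"{label:12s} {annual_downtime_minutes(a):10.2f} minutes/year")
```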
For a shared nothing cluster, we can further simplify the analysis
by ignoring common fault effects. In practice, this means that a
failure or repair in one web server does not affect any other web
servers. In many respects, this is the same simplifying assumption we
made with performance, where the performance of a web server is
not dependent on any of the other web servers.
The shared nothing cluster availability model will contain the
following system states and the annual staying time in each state:
all up, one down (N-1 up), two down (N-2 up), three down (N-3 up),
and so on. The availability model inputs include the unscheduled mean
time between system interruption (U_MTBSI) and mean time to repair
(MTTR) for the nodes. We often choose a MTTR value by considering
the cost of service response time. For many shared nothing clusters,
a service response time of 48 hours may be reasonable – a value
which may not be reasonable for a database or storage tier. Model
results might look like this:
| System State | Annual Staying Time (minutes) | Cumulative Uptime (%) | Performance Reward |
|---|---|---|---|
| All up | 521,395.20 | 99.2 | M * N |
| 1 down | 4,162.75 | 99.992 | M * (N - 1) |
| 2 down | 39.95 | 99.9996 | M * (N - 2) |
| 3 down | 2.00 | 99.99998 | M * (N - 3) |
| > 3 down | 0.11 | 100 | <= M * (N - 4) |
| Total | 525,600.00 | 100 | |
Now we have enough data to evaluate the performability of the
system. For the simple analysis, we accept the cumulative uptime
result for the minimum required performance. We can then compare
various systems considering performability.
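As a check on how the cumulative uptime column is derived, here is a short Python sketch that recomputes it from the annual staying times in the table above. The only assumption is a 365 day year. The staying times are rounded to two decimals, so their sum runs 0.01 minutes over and the last value is capped at 100.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

# (system state, annual staying time in minutes), copied from the table
staying_time = [
    ("All up", 521_395.20),
    ("1 down", 4_162.75),
    ("2 down", 39.95),
    ("3 down", 2.00),
    ("> 3 down", 0.11),
]


def cumulative_uptime(rows):
    """Running total of staying time as a percentage of the year."""
    result = []
    running = 0.0
    for state, minutes in rows:
        running += minutes
        result.append((state, min(100.0, 100.0 * running / MINUTES_PER_YEAR)))
    return result


for state, percent in cumulative_uptime(staying_time):
    print(f"{state:10s} {percent:.5f} %")
```

If the minimum required performance is M * (N - 1), meaning we can tolerate one server down, then the performability is the cumulative uptime on the "1 down" row.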
We have modeled the new Sun SPARC Enterprise T5120 and Sun SPARC
Enterprise T5220 servers against the venerable Sun Fire V490 servers.
For this analysis we chose a performance benchmark with a metric that
showed we needed 6 T5120 or T5220 servers to match the performance of
9 V490 servers. We will choose to overprovision by one server, which
is often optimum for such architectures. The performability results
are:
| Servers | Units | Performability (%) |
|---|---|---|
| Sun SPARC Enterprise T5120 | 6 + 1 | 99.99988 |
| Sun SPARC Enterprise T5220 | 6 + 1 | 99.99988 |
| Sun Fire V490 | 9 + 1 | 99.99893 |
You might notice that the T5120 and T5220 have the same
performability results. This is because they share the same
motherboard design, disks, power supplies, etc. It is much more
interesting to compare these to the V490. Even though we use more
V490 systems, the T5120 and T5220 solution provides better
performability. Fewer, faster, more reliable servers should generally
have better performability than more, slower, less reliable servers.
Thursday Oct 04, 2007
I'll be blogging about performability analysis over the next few weeks. Last year Hairong Sun, Tina Tyan, Steven Johnson, Nisha Talagala, Bob Wood, and I published a paper on how we do performability analysis at Sun. It is titled Performability Analysis of Storage Systems in Practice: Methodology and Tools, and is available online at SpringerLink. Here is the abstract:
This paper presents a methodology and tools used for performability
analysis of storage systems in Sun Microsystems. A Markov modeling tool
is used to evaluate the probabilities of normal and fault states in the
storage system, based on field reliability data collected from customer
sites. Fault injection tests are conducted to measure the performance
of the storage system in various degraded states with a performance
benchmark developed within Sun Microsystems. A graphic metric is
introduced for performability assessment and comparison. An example is
used throughout the paper to illustrate the methodology and process.
I'm giving a presentation on performability at Sun's Customer Engineering Conference next week, so if you're attending, stop by and visit.