Introduction to Performability Analysis
Tuesday Oct 16, 2007
Modern systems are continuing to evolve and become more tolerant to failures. For many systems today, a simple performance or availability analysis does not reveal how well a system will operate when in a degraded mode. A performability analysis can help answer these questions for complex systems. In this blog, I'll show one of the methods we use for performability analysis.
We often begin with a small set of components for test and analysis. Traditional benchmarking or performance characterization is a good starting point. For this example, we will analyze a storage array. We begin with an understanding of the performance characteristics of our desired workload, which can vary widely for storage subsystems. In our case, we will create a performance workload which includes a mix of reads and writes, with a consistent iop size, and a desired performance metric of iops/second. Storage arrays tend to have many possible RAID configurations which will have different performance and data protection trade-offs, so we will pick a RAID configuration which we think will best suit our requirements. If it sounds like we're making a lot of choices early, it is because we are. We know that some choices are clearly bad, some are clearly good, and there are a whole bunch of choices in between. If we can't meet our design targets after the performability analysis, then we might have to go back to the beginning and start again - such is the life of a systems engineer.
Once we have a reasonable starting point, we will setup a baseline benchmark to determine the best performance for a fully functional system. We will then use fault injection to measure the system performance characteristics under the various failure modes expected in the system. For most cases, we are concerned with hardware failures. Often the impact on the performance of a system under failure conditions is not constant. There may be a fault diagnosis and isolation phase, a degraded phase, and a repair phase. There may be several different system performance behaviors during these phases. The transient diagram below shows the performance measurements of a RAID array with dual redundant controllers configured in a fully redundant, active/active operating mode. We bring the system to a steady state and then inject a fault into one of the controllers.
Modeling the system in this way means that we will consider both the unscheduled faults as well as the planned repair, though we usually make the simplifying assumption that there will be one repair action for each unscheduled fault.
If this sort of characterization sounds tedious, well it is. But it is the best way for us to measure the performance of a subsystem under faulted conditions. Trying to measure the performance of a more complex system with multiple servers, switches, and arrays under a comprehensive set of fault conditions would be untenable. We do gain some reduction of the test matrix because we know that some components have no impact on performance when they fail.
Next we build a RAScad model for the system. I usually use a heirarchial model built from components which hides much of the complexity from me, but for this simpler example, the Markov model looks like this:
|
State |
Explanation |
Transition Rate |
Explanation |
|---|---|---|---|
|
28,0,1 |
No failures |
m_repair |
rate (=1/MTTR) |
|
1 UIC_Dn |
1 UIC is down |
l_uic |
UIC failure rate |
|
Down |
System is down |
l_mp |
Midplane failure rate |
|
1 Ctlr_Dn |
1 Controller is down |
l_cntl |
Controller failure rate |
|
1PCU_Dn |
1 PCU is down |
l_pcu |
PCU failure rate |
|
27,1,0 |
1 disk is under reconstruction |
l_recon |
Disk reconstruction rate |
|
28,1,1 |
1 disk is under reconstruction, 1 spare disk available |
l_disk |
Disk failure rate |
|
27,0,0 |
No spare disk |
|
|
|
26,0,0 |
One parity group loses 1 disk, no spare available, no disk reconstruction |
|
|
Solving the Markov model will provide us with the average staying time per year in each of the states. Note that we must make some sort of assumptions about the service response time. We will usually use 4 hour service response time for enterprise-class operations. Is that assumption optimal? We don't always know, so that is another feature of a system I'll explore in a later blog.
So now we have the performance for each state, and the average staying time per year. These are two variables, so lets graph them on an X-Y plot. To make it easier to compare different systems, we sort by the performance (in the Y-axis). We call the resulting graph a performability graph or P-Graph for short. Here is an example of a performability graph showing the results for three different RAID array configurations.
because it represents the boundaries of performance and availability, where we can expect the actual use to fall below the curve for any interval.
Suppose you have a requirement for an array that delivers 1,500 iops with "four-nines" availability. You can see from the performability graph that Product A and C can deliver 1,500 iops, Product C can deliver "four-nines" availability, but only Product A can deliver both 1,500 iops and "four-nines" availability.
To help you understand the composition of the graph, I colored some of the states which have longer staying times.
This composition view is particularly useful for product engineers, but is less useful to systems engineers. For complex systems, there are many products, many failure modes, and many more trade-offs to consider. More on that later...
Tags: availability benchmark performability ras rascad










