Wednesday Aug 20, 2008
Over the past few years, a number of people have been working to develop benchmarks for dependability of computer systems. After all, why should the performance guys have all of the fun? We've collected a number of papers on the subject in a new book, Dependability Benchmarking for Computer Systems, available from the IEEE Computer Society Press and Wiley.
The table of contents includes:
- The Autonomic Computing Benchmark
- Analytical Reliability, Availability, and Serviceability Benchmarks
- System Recovery Benchmarks
- Dependability Benchmarking Using Environmental Test Tools
- Dependability Benchmark for OLTP Systems
- Dependability Benchmarking of Web Servers
- Dependability Benchmark of Automotive Engine Control Systems
- Toward Evaluating the Dependability of Anomaly Detectors
- Vajra: Evaluating Byzantine-Fault-Tolerant Distributed Systems
- User-Relevant Software Reliability Benchmarking
- Interface Robustness Testing: Experience and Lessons Learned from the Ballista Project
- Windows and Linux Robustness Benchmarks with Respect to Application Erroneous Behavior
- DeBERT: Dependability Benchmarking of Embedded Real-Time Off-the-Shelf Components for Space Applications
- Benchmarking the Impact of Faulty Drivers: Application to the Linux Kernel
- Benchmarking the Operating System against Faults Impacting Operating System Functions
- Neutron Soft Error Rate Characterization of Microprocessors
Wow, you can see that there has been a lot of work by a lot of people to measure system dependability and improve system designs.
The work described in Chapter 2, Analytical Reliability, Availability, and Serviceability Benchmarks, is already in use: we are beginning to publish these benchmark results in various product white papers.
Performance benchmarks have proven useful in driving innovation in the computer industry, and I think dependability benchmarks can do likewise. If you feel that these benchmarks are valuable, then please drop me a note, or better yet, ask your computer vendors for some benchmark results.
I'd like to thank all of the contributors to the book, the IEEE, and Wiley. Karama Kanoun and Lisa Spainhower worked tirelessly to get all of the works compiled (herding the cats) and interfaced with the publisher. Great job! Ira Pramanick, Jim Mauro, William Bryson, and Dong Tang collaborated with me on Chapters 2 and 3. Thanks, team!
Tuesday Oct 16, 2007
Modern systems are continuing to evolve and become more tolerant
to failures. For many systems today, a simple performance or
availability analysis does not reveal how well a system will operate
when in a degraded mode. A performability analysis can help answer
these questions for complex systems. In this blog, I'll show one of
the methods we use for performability analysis.
We often begin with a small set of components for test and
analysis. Traditional benchmarking or performance characterization is
a good starting point. For this example, we will analyze a storage
array. We begin with an understanding of the performance
characteristics of our desired workload, which can vary widely for
storage subsystems. In our case, we will create a performance
workload which includes a mix of reads and writes, with a consistent
I/O size, and a desired performance metric of I/O operations per
second (iops). Storage
arrays tend to have many possible RAID configurations which will have
different performance and data protection trade-offs, so we will pick
a RAID configuration which we think will best suit our requirements.
If it sounds like we're making a lot of choices early, it is because
we are. We know that some choices are clearly bad, some are clearly
good, and there are a whole bunch of choices in between. If we can't
meet our design targets after the performability analysis, then we
might have to go back to the beginning and start again - such is the
life of a systems engineer.
Once we have a reasonable starting point, we will set up a baseline
benchmark to determine the best performance for a fully functional
system. We will then use fault injection to measure the system
performance characteristics under the various failure modes expected
in the system. For most cases, we are concerned with hardware
failures. Often the impact on the performance of a system under
failure conditions is not constant. There may be a fault diagnosis
and isolation phase, a degraded phase, and a repair phase. There may
be several different system performance behaviors during these
phases. The transient diagram below shows the performance
measurements of a RAID array with dual redundant controllers
configured in a fully redundant, active/active operating mode. We
bring the system to a steady state and then inject a fault into one
of the controllers.
This analysis is interesting for several different reasons. We see
that when the fault was injected, there was a short period where the
array serviced no I/O operations. Once the fault was isolated, then a
recovery phase was started during which the array was operating at
approximately half of its peak performance. Once recovery was
completed, the performance returned to normal, even though the system
was still in a degraded state. Next, we repaired the fault. After the system
reconfigured itself, performance returned to normal for the
non-degraded system. You'll note that during the post-repair
reconfiguration the array stopped servicing I/O operations and this
outage was longer than the outage in the original fault. Sometimes, a
trade-off is made such that the impact of the unscheduled fault is
minimized at the expense of the repair activity. This is usually a
good trade-off because the repair activity is usually a scheduled
event, so we can limit the impact via procedures and planning. If you
have ever waited for an fsck to finish when booting a system, then
you've felt the impact of such decisions and understand why modern
file systems have attempted to minimize the performance costs of
fsck, or eliminated the need for fsck altogether. Modeling the system in
this way means that we will consider both the unscheduled faults and
the planned repair, though we usually make the simplifying
assumption that there will be one repair action for each unscheduled
fault.
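As a rough illustration of how the phases of such a transient can be summarized, here is a minimal Python sketch. The phase durations and iops values are hypothetical and are not taken from the measurement described above; the function names are mine. It totals the I/O operations that were not serviced over one fault-and-repair event, relative to running at peak the whole time, and expresses that loss as an equivalent full outage.

```python
# Hypothetical fault-injection timeline for one fault plus its repair:
# (phase name, duration in seconds, measured iops). Illustrative values only.
PEAK_IOPS = 2000.0
timeline = [
    ("fault isolation", 30.0, 0.0),                # no I/O serviced
    ("recovery", 600.0, 1000.0),                   # about half of peak
    ("degraded, steady", 14400.0, 2000.0),         # back to normal performance
    ("post-repair reconfiguration", 90.0, 0.0),    # longer outage than the fault
]


def lost_operations(phases, peak_iops):
    """I/O operations not serviced, relative to running at peak throughout."""
    return sum(duration * (peak_iops - iops) for _, duration, iops in phases)


def equivalent_outage_seconds(phases, peak_iops):
    """Express the lost work as seconds of complete outage at peak load."""
    return lost_operations(phases, peak_iops) / peak_iops


print("lost operations:", lost_operations(timeline, PEAK_IOPS))
print("equivalent outage (s):", equivalent_outage_seconds(timeline, PEAK_IOPS))
```

A single number like this hides the shape of the transient, so it complements the per-phase measurements rather than replacing them.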
If this sort of characterization sounds tedious, well, it is. But
it is the best way for us to measure the performance of a subsystem
under faulted conditions. Trying to measure the performance of a
more complex system with multiple servers, switches, and arrays under
a comprehensive set of fault conditions would be untenable. We do
gain some reduction of the test matrix because we know that some
components have no impact on performance when they fail.
Next we build a RAScad model for the system. I usually use a
hierarchical model built from components which hides much of the
complexity from me, but for this simpler example, the Markov model
looks like this:
Where the states are explained by this table:
| State | Explanation | Transition Rate | Explanation |
|---|---|---|---|
| 28,0,1 | No failures | m_repair | Repair rate (=1/MTTR) |
| 1 UIC_Dn | 1 UIC is down | l_uic | UIC failure rate |
| Down | System is down | l_mp | Midplane failure rate |
| 1 Ctlr_Dn | 1 Controller is down | l_cntl | Controller failure rate |
| 1PCU_Dn | 1 PCU is down | l_pcu | PCU failure rate |
| 27,1,0 | 1 disk is under reconstruction | l_recon | Disk reconstruction rate |
| 28,1,1 | 1 disk is under reconstruction, 1 spare disk available | l_disk | Disk failure rate |
| 27,0,0 | No spare disk | | |
| 26,0,0 | One parity group loses 1 disk, no spare available, no disk reconstruction | | |
Solving the Markov model will provide us with the average staying
time per year in each of the states. Note that we must make some
assumptions about the service response time. We usually assume a
4-hour service response time for enterprise-class operations. Is that
assumption optimal? We don't always know, so that is another feature
of a system I'll explore in a later blog.
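To make the idea of "solving the Markov model" concrete, here is a minimal Python sketch. It is not RAScad and not the model above: the three-state chain, the state names, and the rates are hypothetical, and the 4-hour repair time is used only for illustration. It solves the steady-state balance equations (pi Q = 0 with the probabilities summing to 1) by Gaussian elimination, then converts each probability into staying time per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def steady_state(states, rates):
    """Solve pi * Q = 0, sum(pi) = 1 for a continuous-time Markov chain.

    states: list of state names.
    rates:  dict mapping (source, destination) to a transition rate.
    """
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    # Generator matrix Q: off-diagonal entries are rates, rows sum to zero.
    q = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in rates.items():
        q[idx[src]][idx[dst]] += rate
        q[idx[src]][idx[src]] -= rate
    # Balance equations use the transpose of Q; the last equation is
    # replaced by the normalization constraint sum(pi) = 1.
    a = [[q[j][i] for j in range(n)] for i in range(n)]
    b = [0.0] * n
    a[n - 1] = [1.0] * n
    b[n - 1] = 1.0
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    pi = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(a[row][k] * pi[k] for k in range(row + 1, n))
        pi[row] = (b[row] - known) / a[row][row]
    return dict(zip(states, pi))


# Hypothetical rates, per hour (assumptions, not values from the model above).
L_CTLR = 1.0 / 100_000   # controller failure rate
M_REPAIR = 1.0 / 4.0     # repair rate = 1/MTTR, assuming MTTR = 4 hours

states = ["ok", "1_ctlr_dn", "down"]
rates = {
    ("ok", "1_ctlr_dn"): 2 * L_CTLR,   # either of two controllers fails
    ("1_ctlr_dn", "ok"): M_REPAIR,     # failed controller is repaired
    ("1_ctlr_dn", "down"): L_CTLR,     # surviving controller fails too
    ("down", "ok"): M_REPAIR,
}

pi = steady_state(states, rates)
for name in states:
    print(f"{name:10s} {pi[name] * MINUTES_PER_YEAR:14.4f} minutes/year")
```

Changing `M_REPAIR` shows directly how sensitive the staying times are to the service response assumption.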
So now we have the performance for each state, and the average
staying time per year. These are two variables, so let's graph them
on an X-Y plot. To make it easier to compare different systems, we
sort by the performance (on the Y-axis). We call the resulting graph
a performability graph or P-Graph for short. Here is an
example of a performability graph showing the results for three
different RAID array configurations.
I usually label availability targets across the top as an
alternate X-axis label because many people are more comfortable with
availability targets represented as "nines" than seconds or
minutes. In order to show the typically small staying time, we use a
log scale on the X-axis. The Y-axis shows the performance metric. I
refer to the system's performability curve as a performability
envelope because it represents the boundaries of performance and
availability, where we can expect the actual use to fall below the
curve for any interval.
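One way to compute the points of such a curve is sketched below in Python. The state list, its iops values, and its staying times are hypothetical, and the function names are mine. The sketch sorts the states by performance, accumulates staying time starting from the worst state, and converts an availability target in "nines" into a budget of minutes per year, so that we can read off the performance the system meets or exceeds at that target.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget(nines):
    """Minutes per year allowed by an availability of N nines (4 -> 99.99%)."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


def envelope(states):
    """Sort states by performance (worst first) and accumulate staying time.

    states: list of (name, iops, minutes_per_year).
    Returns (cumulative_minutes, iops) points of the step curve.
    """
    points, cumulative = [], 0.0
    for _, iops, minutes in sorted(states, key=lambda s: s[1]):
        cumulative += minutes
        points.append((cumulative, iops))
    return points


def performance_at(states, nines):
    """Performance the system meets or exceeds at the given availability."""
    budget = downtime_budget(nines)
    curve = envelope(states)
    for cumulative, iops in curve:
        if cumulative > budget:
            return iops
    return curve[-1][1]  # the budget covers the whole year


# Hypothetical states: (name, iops, staying time in minutes per year).
states = [
    ("ok", 2000, MINUTES_PER_YEAR - 505),
    ("disk_recon", 1200, 400),
    ("pcu_dn", 800, 100),
    ("down", 0, 5),
]

for nines in (2, 3, 4, 5):
    print(f"{nines} nines: at least {performance_at(states, nines)} iops")
```

Plotting the `envelope` points with a log-scaled X-axis gives the step shape described above.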
Suppose you have a requirement for an array that delivers 1,500
iops with "four-nines" availability. You can see from the
performability graph that Products A and C can deliver 1,500 iops and
Product C can deliver "four-nines" availability, but only
Product A can deliver 1,500 iops and "four-nines" availability
at the same time.
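The same check can be written as a small Python sketch. The three product data sets here are hypothetical and are not read from the graph; they are only constructed to reproduce the situation described. A product meets the requirement if the total time it spends below the required performance fits within the downtime budget implied by the availability target.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def meets(states, required_iops, availability):
    """True if time spent below required_iops fits the downtime budget.

    states: list of (iops, minutes_per_year) pairs, one per model state.
    availability: target as a fraction, e.g. 0.9999 for four nines.
    """
    budget = MINUTES_PER_YEAR * (1.0 - availability)
    below = sum(minutes for iops, minutes in states if iops < required_iops)
    return below <= budget


# Hypothetical products (illustrative values, not taken from the graph).
products = {
    "A": [(0, 5), (1550, 300), (1600, MINUTES_PER_YEAR - 305)],
    "B": [(0, 100), (1200, MINUTES_PER_YEAR - 100)],
    "C": [(0, 20), (1000, 400), (2000, MINUTES_PER_YEAR - 420)],
}

for name, states in products.items():
    verdict = "meets" if meets(states, 1500, 0.9999) else "misses"
    print(f"Product {name} {verdict} 1,500 iops at four nines")
```

In this made-up data, Product C has the highest peak and does reach four nines at 1,000 iops, but it spends too long below 1,500 iops, which is exactly the kind of distinction a peak benchmark or a plain availability number would miss.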
To help you understand the composition of the graph, I colored
some of the states which have longer staying times.
You can see that some of the failure states have little impact on
performance, whereas others will have a significant impact on
performance. For this array, when a power supply/battery unit fails,
the write cache is placed in write-through mode, which has a
significant performance impact. Also, when a disk fails and is being
reconstructed, the overall performance is impacted. Now we have a
clearer picture of what performance we can expect from this array per
year.
This composition view is particularly useful for product
engineers, but is less useful to systems engineers. For complex
systems, there are many products, many failure modes, and many more
trade-offs to consider. More on that later...