Tuesday Oct 09, 2007
In complex systems, we must often trade off performance against
reliability, availability, or serviceability. In many cases, a system
design will include both performance and availability requirements.
We use performability analysis to examine the
performance versus availability trade-off. Performability is simply
the ability to perform. A performability analysis combines
performance characterization for systems under the possible
combinations of degraded states with the probability that the system
will be operating in those degraded states.
The simplest performability analysis is often appropriate for
multiple node, shared nothing clusters which scale performance
perfectly. For example, in a simple web server farm, you might have N
servers capable of delivering M pages per server. Disregarding other
bottlenecks in the system, such as the capacity of the internet
connection to the server farm, we can say that N+1 servers will
deliver M*(N+1) performance. Thus we can estimate the aggregate
performance of any number of web servers.
We can also perform an availability analysis on a web server. We
can build Markov models which consider the reliability of the
components in a server and their expected time to repair. The output
of the models will provide the estimated time per year that each web
server may be operational. More specifically, we will know the
staying time per year for each of the model states. For a simple
model, the performance reward for an up state is M and a down
state is 0. A system which provides 99.99% (four-nines) availability
can be expected to be down for approximately 53 minutes per year and
up for the remainder.
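The downtime figure follows directly from the availability percentage. Here is a minimal sketch of that arithmetic, assuming a 365-day year (525,600 minutes); the function name is only illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def annual_downtime_minutes(availability_percent):
    """Expected minutes of downtime per year for a given availability."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


# Four-nines availability
print(round(annual_downtime_minutes(99.99), 2))
```

For 99.99% this works out to a little under 53 minutes per year, matching the estimate above.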
For a shared nothing cluster, we can further simplify the analysis
by ignoring common fault effects. In practice, this means that a
failure or repair in one web server does not affect any other web
servers. In many respects, this is the same simplifying assumption we
made with performance, where the performance of a web server is
independent of the other web servers.
The shared nothing cluster availability model will contain the
following system states and the annual staying time in each state:
all up, one down (N-1 up), two down (N-2 up), three down (N-3 up),
and so on. The availability model inputs include the unscheduled mean
time between system interruption (U_MTBSI) and mean time to repair
(MTTR) for the nodes. We often choose an MTTR value by considering
the cost of service response time. For many shared nothing clusters,
a service response time of 48 hours may be reasonable – a value
which may not be reasonable for a database or storage tier. Model
results might look like this:
| System State | Annual Staying Time (minutes) | Cumulative Uptime (%) | Performance Reward |
|---|---|---|---|
| All up | 521,395.20 | 99.2 | M * N |
| 1 down | 4,162.75 | 99.992 | M * (N - 1) |
| 2 down | 39.95 | 99.9996 | M * (N - 2) |
| 3 down | 2.00 | 99.99998 | M * (N - 3) |
| > 3 down | 0.11 | 100 | < M * (N - 4) |
| Total | 525,600.00 | 100 | |
Now we have enough data to evaluate the performability of the
system. For the simple analysis, we accept the cumulative uptime
result for the minimum required performance. We can then compare
various systems considering performability.
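To make the "cumulative uptime" step concrete, here is a small sketch that recomputes that column from the annual staying times in the table above. The names are illustrative, and the last entry lumps together every state with more than three servers down.

```python
# Annual staying times (minutes) from the table above, indexed by the
# number of servers down. The last entry covers "> 3 down".
STAYING_MINUTES = [521_395.20, 4_162.75, 39.95, 2.00, 0.11]


def cumulative_uptime_percent(staying_minutes, max_down):
    """Percent of the year spent with at most `max_down` servers down."""
    total = sum(staying_minutes)
    return 100.0 * sum(staying_minutes[: max_down + 1]) / total


for down in range(4):
    pct = cumulative_uptime_percent(STAYING_MINUTES, down)
    print(f"at most {down} down: {pct:.5f}%")
```

If the workload needs the performance of N - 1 servers, the performability is the result for `max_down=1`, which is about 99.992%.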
We have modeled the new Sun SPARC Enterprise T5120 and Sun SPARC
Enterprise T5220 servers against the venerable Sun Fire V490 servers.
For this analysis we chose a performance benchmark with a metric that
showed we needed 6 T5120 or T5220 servers to match the performance of
9 V490 servers. We will choose to overprovision by one server, which
is often optimum for such architectures. The performability results
are:
| Servers | Units | Performability (%) |
|---|---|---|
| Sun SPARC Enterprise T5120 | 6 + 1 | 99.99988 |
| Sun SPARC Enterprise T5220 | 6 + 1 | 99.99988 |
| Sun Fire V490 | 9 + 1 | 99.99893 |
You might notice that the T5120 and T5220 have the same
performability results. This is because they share the same
motherboard design, disks, power supplies, etc. It is much more
interesting to compare these to the V490. Even though we use more
V490 systems, the T5120 and T5220 solution provides better
performability. Fewer, faster, more reliable servers should generally
have better performability than more, slower, less reliable servers.
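The effect can be illustrated with a much simpler model than the Markov models used for the results above. This sketch assumes every node fails and is repaired independently, so the number of nodes up follows a binomial distribution. It also treats U_MTBSI as the mean uptime between interruptions. The MTBSI figures in the example are invented placeholders, not Sun field data, so the output will not reproduce the table.

```python
from math import comb


def node_availability(mtbsi_hours, mttr_hours):
    """Steady-state availability of one node (assumes MTBSI is mean uptime)."""
    return mtbsi_hours / (mtbsi_hours + mttr_hours)


def performability_percent(total_nodes, required_nodes, availability):
    """Percent of time at least `required_nodes` of `total_nodes` are up.

    Assumes independent nodes, i.e. no common fault effects.
    """
    unavailability = 1.0 - availability
    probability = sum(
        comb(total_nodes, up)
        * availability**up
        * unavailability ** (total_nodes - up)
        for up in range(required_nodes, total_nodes + 1)
    )
    return 100.0 * probability


# Hypothetical inputs: 48-hour MTTR, made-up MTBSI values.
fewer_reliable = performability_percent(7, 6, node_availability(20_000, 48))
more_less_reliable = performability_percent(10, 9, node_availability(10_000, 48))
print(f"6 + 1 nodes: {fewer_reliable:.5f}%")
print(f"9 + 1 nodes: {more_less_reliable:.5f}%")
```

With these placeholder inputs the smaller cluster of more reliable nodes comes out ahead, which is the same direction as the modeled results.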
Thursday Oct 04, 2007
I'll be blogging about performability analysis over the next few weeks. Last year Hairong Sun, Tina Tyan, Steven Johnson, Nisha Talagala, Bob Wood, and I published a paper on how we do performability analysis at Sun. It is titled Performability Analysis of Storage Systems in Practice: Methodology and Tools, and is available online at SpringerLink. Here is the abstract:
This paper presents a methodology and tools used for performability
analysis of storage systems in Sun Microsystems. A Markov modeling tool
is used to evaluate the probabilities of normal and fault states in the
storage system, based on field reliability data collected from customer
sites. Fault injection tests are conducted to measure the performance
of the storage system in various degraded states with a performance
benchmark developed within Sun Microsystems. A graphic metric is
introduced for performability assessment and comparison. An example is
used throughout the paper to illustrate the methodology and process.
I'm giving a presentation on performability at Sun's Customer Engineering Conference next week, so if you're attending, stop by and visit.
Tuesday Jan 30, 2007
In this blog I connect the space vs MTTDL models with a performance model. This gives enough information for you to make RAID configuration trade-offs for various systems. The examples here target the Sun Fire X4500 (aka Thumper) server, but the models work for other systems too. In particular, the model results apply generically to RAID implementations on just a bunch of disks (JBOD) systems.
The best thing about a model is that it is a simplification of
real life.
The worst thing about a model is that it is a
simplification of real life.
Small, Random Read Performance Model
For this analysis, we will use a small, random read performance
model. The calculations for the model can be made with data which is
readily available from disk data sheets. We calculate the expected
I/O operations per second (iops) based on the average read seek and
rotational speed of the disk. We don't consider the command overhead,
as it is generally small for modern drives and is not always
specified in disk data sheets.
maximum rotational latency (ms) = 60,000 (ms/min) / rotational speed (rpm)
iops = 1000 (ms/s) / (average read seek time (ms) + (maximum rotational latency (ms) / 2))
Since most disks use consistent rotational speeds, this small
table may help you to see what the rotational speed contribution will
be.
| Rotational Speed (rpm) | Maximum Rotational Latency (ms) |
|---|---|
| 4,200 | 14.3 |
| 5,400 | 11.1 |
| 7,200 | 8.3 |
| 10,000 | 6.0 |
| 15,000 | 4.0 |
For example, if we have a 73 GByte, 2.5" Seagate Savvio SAS drive which has a 4.1 ms
average read seek and rotational speed of 10,000 rpm:
iops = 1000 / (4.1 + (6.0 / 2)) = 140.8
By comparison, a 750 GByte, 3.5" Seagate Barracuda SATA drive which has an 8.5 ms
average read seek and rotational speed of 7,200 rpm:
iops = 1000 / (8.5 + (8.3 / 2)) = 79.0
I purposely used those two examples because people are always
wondering why we tend to prefer smaller, faster, and (unfortunately)
more expensive drives over larger, slower, less expensive drives - a
78% performance improvement is rather significant. The 3.5"
drives also use about 25-75% more power than their smaller cousins,
largely due to the rotating mass. Small is beautiful in a SWaP
sense.
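The two drive calculations above are easy to script. This is a minimal sketch of the same formulas; the function names are mine, and the only inputs are the data sheet values quoted above.

```python
def max_rotational_latency_ms(rpm):
    """Time for one full rotation, in milliseconds."""
    return 60_000.0 / rpm


def small_random_read_iops(avg_read_seek_ms, rpm):
    """Expected small, random read iops; ignores command overhead and caching."""
    avg_rotational_latency_ms = max_rotational_latency_ms(rpm) / 2.0
    return 1000.0 / (avg_read_seek_ms + avg_rotational_latency_ms)


sas_iops = small_random_read_iops(4.1, 10_000)   # 2.5" SAS drive
sata_iops = small_random_read_iops(8.5, 7_200)   # 3.5" SATA drive
improvement = 100.0 * (sas_iops / sata_iops - 1.0)
print(f"SAS: {sas_iops:.1f} iops, SATA: {sata_iops:.1f} iops, "
      f"improvement: {improvement:.0f}%")
```

Because the script uses the unrounded rotational latency (8.33 ms rather than 8.3 ms), the SATA result comes out slightly under the hand-calculated 79.0. The improvement still rounds to 78%.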
Next we use the RAID set configuration information to calculate
the total small, random read iops for the zpool or volume. Here we
need to talk about sets of disks which may make up a multi-level
zpool or volume. For example, RAID-1+0 is a stripe (RAID-0) of
mirrored sets (RAID-1). RAID-0 is a stripe of disks.
For dynamic striping (RAID-0), add the iops for each set or
disk. On average the iops are spread randomly across all sets or
disks, gaining concurrency.
For mirroring (RAID-1), add the iops for each set or disk.
For reads, any set or disk can satisfy a read, so we also get
concurrency.
For single parity raidz (RAID-5), the set operates at the
performance of one disk. See below.
For double parity raidz2 (RAID-6), the set operates at the
performance of one disk. See below.
For example, if you have 6 disks, then there are many different
ways you can configure them, with varying performance calculations:
| RAID Configuration (6 disks) | Small, Random Read Performance Relative to a Single Disk |
|---|---|
| 6-disk dynamic stripe (RAID-0) | 6 |
| 3-set dynamic stripe, 2-way mirror (RAID-1+0) | 6 |
| 2-set dynamic stripe, 3-way mirror (RAID-1+0) | 6 |
| 6-disk raidz (RAID-5) | 1 |
| 2-set dynamic stripe, 3-disk raidz (RAID-5+0) | 2 |
| 2-way mirror, 3-disk raidz (RAID-5+1) | 2 |
| 6-disk raidz2 (RAID-6) | 1 |
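The rules above compose naturally, so the table can be reproduced with a few small functions. This is only a sketch of the model as described here; the helper names are hypothetical and nothing in it measures a real pool.

```python
DISK = 1  # one disk's worth of small, random read iops


def stripe(members):
    """Dynamic stripe (RAID-0): reads spread across all members."""
    return sum(members)


def mirror(members):
    """Mirror (RAID-1): any member can satisfy a read."""
    return sum(members)


def raidz(disk_count):
    """raidz or raidz2 set: performs like one disk, whatever its width."""
    return DISK


configs = {
    "6-disk dynamic stripe": stripe([DISK] * 6),
    "3-set stripe of 2-way mirrors": stripe([mirror([DISK] * 2)] * 3),
    "2-set stripe of 3-way mirrors": stripe([mirror([DISK] * 3)] * 2),
    "6-disk raidz": raidz(6),
    "2-set stripe of 3-disk raidz": stripe([raidz(3)] * 2),
    "2-way mirror of 3-disk raidz": mirror([raidz(3)] * 2),
    "6-disk raidz2": raidz(6),
}

for name, relative_iops in configs.items():
    print(f"{name}: {relative_iops}x a single disk")
```

Multiply the relative figure by the per-disk iops estimate to get an absolute number for a given drive model.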
Clearly, using mirrors improves both performance and data
reliability. Using stripes increases performance, at the cost of data
reliability. raidz and raidz2 offer data reliability, at the cost of
performance. This leads us down a rathole...
The Parity Performance Rathole
Many people expect that data protection schemes based on parity,
such as raidz (RAID-5) or raidz2 (RAID-6), will offer the performance
of striped volumes, except for the parity disk. In other words, they
expect that a 6-disk raidz zpool would have the same small, random
read performance as a 5-disk dynamic stripe. Similarly, they expect
that a 6-disk raidz2 zpool would have the same performance as a
4-disk dynamic stripe. ZFS doesn't work that way, today. ZFS uses a
checksum to validate the contents of a block of data written. The
block is spread across the disks (vdevs) in the set. In order to
validate the checksum, ZFS must read the blocks from more than one
disk, thus not taking advantage of spreading unrelated, random reads
concurrently across the disks. In other words, the small, random read
performance of a raidz or raidz2 set is, essentially, the same as the
single disk performance. The benefit of this design is that writes should be more reliable and faster because you don't have the RAID-5 write hole or read-modify-write performance penalty.
Many people also think that this is a design deficiency. As a RAS
guy, I value the data validation offered by the checksum over the
performance supposedly gained by RAID-5. Reasonable people can
disagree, but perhaps some day a clever person will solve this for
ZFS.
So, what do other logical volume managers or RAID arrays do? The
results seem mixed. I have seen some RAID array performance
characterization data which is very similar to the ZFS performance
for parity sets. I have heard anecdotes that other implementations
will read the blocks and only reconstruct a failed block as
needed. The problem is, how do such systems know that a block has
failed? Anecdotally,
it seems that some of them trust what is read from the disk. To
implement a per-disk block checksum verification, you'd still have to
perform at least two reads from different disks, so it seems to me
that you are trading off data integrity for performance. In ZFS, data
integrity is paramount. Perhaps there is more room for research here,
or perhaps it is just one of those engineering trade-offs that we
must live with.
Other Performance Models
I'm also looking for other performance models which can be applied
to generic disks with data that is readily available to the public.
The reason that the small, random read iops model works is that it
doesn't need to consider caching or channel resource utilization.
Adding these variables would require some knowledge of the
configuration topology and the cache policies (which may also change
with firmware updates). I've kicked around the idea of a total disk
bandwidth model which will describe a range of possible bandwidths
based upon the media speed of the drives, but it is not clear to me
that it will offer any satisfaction. Drop me a line if you have a
good model or further thoughts on this topic.
You should be cautious about extrapolating the performance results
described here to other workloads. You could consider this to be a
worst-case model because it assumes 0% disk cache hits. I would hope
that most workloads exhibit better performance, but rather than
guessing (hoping) the best way to find out is to run the workload and
measure the performance. If you characterize a number of different
configurations, then you might build your own performance graphs
which fit your workload.
Putting It All Together
Now we have a method to compare a variety of different ZFS or RAID
disk configurations by evaluating space, performance, and MTTDL.
First, let's look at single parity schemes such as 2-way mirrors and
raidz on the Sun Fire X4500 (aka Thumper) server.
Here you can see that a 2-way mirror (RAID-1, RAID-1+0) has better
performance and MTTDL than raidz for any specific space requirement
except for the case where we run out of hot spares for the 2-way
mirror (using all 46 disks for data). By contrast, all of the raidz
configurations here have hot spares. You can use this to help make
design trade-offs by prioritizing space, performance, and MTTDL.
You'll also note that I did not label the left-side Y axis (MTTDL)
again, but I did label the right-side Y axis (small, random read
iops). I did this with mixed emotion. I didn't label the MTTDL axis
values as I explained previously. But I did label the performance
axis so that you can do a rough comparison to the double parity graph
below. Note that in the double parity graph, the MTTDL axis is in
units of millions of years, rather than years as in the graph above.

Here you can see the same sort of comparison between 3-way mirrors
and raidz2 sets. The mirrors still outperform the raidz2 sets and have better MTTDL.
Some people ask me which layout I prefer. For the vast majority of cases, I prefer mirroring. Then, of course, we get into some sort of discussion about how raidz, raidz2, RAID-5, RAID-6 or some other configuration offers more GBytes/$. For my 2007 New Year's resolution I've decided to help make the world a happier place. If you want to be happier, you should use mirroring with at least one hot spare.
Conclusion
We can make design trade-offs between space, performance, and
MTTDL for disk storage systems. As with most engineering decisions,
there often is not a clear best solution given all of the possible
solutions. By using some simple models, we can see the trade-offs
more clearly.