I see many reports about running campaigns of tests measuring
performance over a test matrix. One problem with this approach is, of
course, the matrix. That matrix is never big enough for the consumer
of the information ("can you run this instead?").
A more useful approach is to think in terms of performance
invariants. We all know that a 7.2K RPM disk drive can do 150-200 IOPS
as an invariant, and that disks have a throughput limit such as
80 MB/sec. Thinking in terms of those invariants helps in extrapolating
performance data (with caution), and observing a breakdown in an
invariant is often a sign that something else needs to be root caused.
So, using 11 metrics and our performance engineering effort, what can
be our guiding invariants? Bear in mind that those are expected to be
rough estimates. For real measured numbers, check out Amitabha
Banerjee's excellent post on Analyzing the Sun Storage 7000.
Streaming : 1 GB/s on server and 110 MB/sec on client
For read streaming, we're observing that 1 GB/s is roughly our
guiding number. This can be achieved with a fairly small number of
clients and threads, but will be easier to reach if the data is
prestaged in server caches. A client running typical 1GbE network
cards is able to extract 110 MB/sec rather easily. Read streaming will
be easier to achieve with the larger 128K records, probably due to the
lower CPU demand. While our results are with regular 1500-byte
Ethernet frames, using jumbo frames will also make this limit easier
to reach or even break. For a mirrored pool, data needs to be sent
twice to the storage, and we see a reduction of about 50% for write
streaming workloads.
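
As a rough illustration of how these streaming numbers combine, here is a minimal Python sketch. The constants come from the estimates above; the function names are made up for this example, and applying the 50% mirrored-write reduction directly to the 1 GB/s figure is an assumption of the sketch, not a measured result.

```python
import math

# Guiding numbers from the text (rough estimates, not measurements).
SERVER_STREAM_MBPS = 1000.0   # ~1 GB/s read streaming on the server
CLIENT_GBE_MBPS = 110.0       # ~110 MB/sec per 1GbE client
MIRROR_WRITE_FACTOR = 0.5     # ~50% reduction for mirrored write streaming


def clients_to_saturate(server_mbps=SERVER_STREAM_MBPS,
                        client_mbps=CLIENT_GBE_MBPS):
    """Smallest number of 1GbE clients whose combined rate reaches the server limit."""
    return math.ceil(server_mbps / client_mbps)


def expected_stream_mbps(n_clients, write_to_mirror=False):
    """Aggregate rate: limited by the clients until the server limit is reached."""
    # Assumption: the mirrored-write penalty scales the 1 GB/s guiding number.
    factor = MIRROR_WRITE_FACTOR if write_to_mirror else 1.0
    server_limit = SERVER_STREAM_MBPS * factor
    return min(n_clients * CLIENT_GBE_MBPS, server_limit)


if __name__ == "__main__":
    print("clients needed:", clients_to_saturate())
    for n in (1, 4, 10, 20):
        print(n, expected_stream_mbps(n), expected_stream_mbps(n, write_to_mirror=True))
```

If a measured aggregate rate falls well short of what this predicts, that is the kind of invariant breakdown worth root causing.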
Random Read I/Os per second : 150 random read IOPS per mirrored disk
This is probably a good guiding light also. When going to disk, that
is a reasonable expectation. But here caching can radically change
this. Since we can configure up to 128GB of host RAM and 4 times that
much of secondary caches, there are opportunities to break this
barrier. But when going to spindles, that limit needs to be kept under
consideration. We also know that RAID-Z spreads records across all
disks of a group, so the 150 IOPS limit basically applies to a whole
RAID-Z group. Do plan to have many groups to service random reads.
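
To see why the number of groups matters, here is a small Python sketch of the uncached random read budget. It assumes the 150 IOPS figure applies per disk in a mirrored pool (my reading of the rule above) and per group in a RAID-Z pool; the function names and the 48-disk example are hypothetical.

```python
IOPS_PER_UNIT = 150  # guiding number from the text, for reads that go to spindles


def mirrored_read_iops(n_disks, iops_per_disk=IOPS_PER_UNIT):
    """Assumption: every disk in a mirrored pool can serve random reads."""
    return n_disks * iops_per_disk


def raidz_read_iops(n_disks, disks_per_group, iops_per_group=IOPS_PER_UNIT):
    """Each RAID-Z group behaves like one disk for random reads."""
    if disks_per_group <= 0:
        raise ValueError("disks_per_group must be positive")
    n_groups = n_disks // disks_per_group  # leftover disks are ignored
    return n_groups * iops_per_group


if __name__ == "__main__":
    disks = 48  # hypothetical pool size
    print("mirrored      :", mirrored_read_iops(disks))
    print("raid-z, 4 x 12:", raidz_read_iops(disks, 12))
    print("raid-z, 8 x 6 :", raidz_read_iops(disks, 6))
```

With the same spindles, halving the group width doubles the random read budget, which is the point of planning for many groups.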
Random Read I/Os per second using SSDs : 3100 Read IOPS per Read Optimized SSD
In some instances, data evicted from main memory will be kept in
secondary caches. Small files and filesystems with a tuned recordsize
are good target workloads for this. Those read-optimized SSDs can
return this data at a rate of 3100 IOPS (L2 ARC). More importantly,
they can do so at much reduced latency, meaning that lightly threaded
workloads will be able to achieve high throughput.
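
The link between latency and lightly threaded throughput can be sketched with a simple closed-loop model: each thread issues one read, waits for it, then issues the next. The 3100 and 150 IOPS caps come from the text; the latency values below are hypothetical placeholders chosen only to show the shape of the effect.

```python
SSD_READ_IOPS_CAP = 3100.0   # per read-optimized SSD (from the text)
DISK_READ_IOPS_CAP = 150.0   # per spindle (from the text)


def closed_loop_iops(threads, latency_ms, device_cap_iops):
    """IOPS for `threads` synchronous readers, capped by the device limit."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    offered = threads * 1000.0 / latency_ms  # each thread completes 1000/latency reads per second
    return min(offered, device_cap_iops)


if __name__ == "__main__":
    ssd_latency_ms = 0.5    # hypothetical value, not a measurement
    disk_latency_ms = 7.0   # hypothetical value, not a measurement
    for t in (1, 2, 4):
        print(t,
              closed_loop_iops(t, ssd_latency_ms, SSD_READ_IOPS_CAP),
              closed_loop_iops(t, disk_latency_ms, DISK_READ_IOPS_CAP))
```

Under these assumed latencies a single thread gets far more out of the SSD than out of a spindle, well before either device limit comes into play.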
Synchronous writes per second : 5000-9000 Synchronous write per Write Optimized SSD
Synchronous writes can be generated by an O_DSYNC write (database) or
just as part of the NFS protocol (such as tar extract: open, write,
close workloads). Those will reach the NAS server and be coalesced
into a single transaction with the separate intent log. Those SSD
devices are great latency accelerators but are still devices with a
max throughput of around 110 MB/sec. However, our code actually
detects when the SSD devices become the bottleneck and will divert
some of the I/O requests to the main storage pool. The net of all this
is a complex equation, but we've easily observed 5000-8000 synchronous
writes per SSD with up to 3 devices (or 6 in mirrored pairs). Using a
smaller working set, which creates less competition for CPU resources,
we've even observed 48K synchronous writes per second.
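
One piece of that equation is easy to sketch: the 110 MB/sec device limit puts a ceiling on synchronous writes per second that depends on the write size. The Python below is a rough illustration only. It treats 1 MB as 10^6 bytes, ignores log record overhead and the diversion to the main pool, and uses hypothetical write sizes.

```python
SSD_MAX_MBPS = 110.0  # approximate max throughput per write-optimized SSD (from the text)


def bandwidth_bound_writes_per_sec(write_size_bytes, n_devices=1,
                                   device_mbps=SSD_MAX_MBPS):
    """Ceiling on sync writes/sec imposed purely by log device bandwidth."""
    if write_size_bytes <= 0:
        raise ValueError("write_size_bytes must be positive")
    # Assumption: 1 MB = 1e6 bytes, no per-write overhead.
    return n_devices * device_mbps * 1e6 / write_size_bytes


if __name__ == "__main__":
    for size in (4096, 8192, 32768, 131072):  # hypothetical write sizes
        print(size, round(bandwidth_bound_writes_per_sec(size)))
```

For small writes the bandwidth ceiling sits well above the 5000-8000 range, so latency and CPU dominate; for larger writes the device throughput becomes the limit, which is where diverting I/O to the main pool helps.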
Cycles per Byte : 30-40 cycles per byte for NFS and CIFS
Once we include the full NFS or CIFS protocol, the efficiency was
observed to be in the range of 30-40 cycles per byte (8 to 10 of those
coming from the pure network component at regular 1500-byte MTU). More
studies are required to figure out the extent to which this is valid,
but it's an interesting way to look at the problem. Having to run disk
I/O, versus being serviced directly from cached data, is expected to
cost an additional 10-20 cycles per byte. Obviously, for metadata
tests in which a small number of bytes is transferred per operation,
we probably need to come up with a cycles/MetaOps invariant, but that
is still TBD.
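
The cycles-per-byte view translates directly into a CPU budget. Here is a minimal Python sketch of that arithmetic; the function names and the 8-core, 2.0 GHz server are hypothetical, and the model assumes every cycle on every core is available to the protocol stack, which is optimistic.

```python
def cpu_ghz_needed(throughput_mb_per_s, cycles_per_byte):
    """Aggregate CPU (in GHz, summed over cores) to sustain a given throughput."""
    return throughput_mb_per_s * 1e6 * cycles_per_byte / 1e9


def max_throughput_mb_per_s(n_cores, ghz_per_core, cycles_per_byte):
    """Throughput ceiling if all cycles on all cores went to serving data."""
    return n_cores * ghz_per_core * 1e9 / cycles_per_byte / 1e6


if __name__ == "__main__":
    # 1 GB/s at the 30-40 cycles/byte observed for NFS and CIFS.
    print(cpu_ghz_needed(1000, 30), cpu_ghz_needed(1000, 40))
    # Same, adding 10-20 cycles/byte when reads must go to disk.
    print(cpu_ghz_needed(1000, 30 + 10), cpu_ghz_needed(1000, 40 + 20))
    # Hypothetical server: 8 cores at 2.0 GHz.
    print(max_throughput_mb_per_s(8, 2.0, 40))
```

So 1 GB/s at 30-40 cycles per byte needs on the order of 30-40 GHz of aggregate CPU, which fits the earlier remark that larger records, with their lower CPU demand, make the streaming limit easier to reach.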
Single Client NFS throughput : 1 TCP Window per round trip latency
This is one fundamental rule of network throughput, but it's a good
occasion to refresh it in everyone's mind. Clients, at least Solaris
clients, will establish a single TCP connection to a server. On that
connection there can be a large number of unrelated requests, as NFS
is a very scalable protocol. However, a single connection will
transport data at a maximum speed of one "socket buffer" divided by
the round trip latency. Since today's network speeds, particularly in
wide area networks, have grown somewhat faster than default socket
buffers, we can see such things becoming a performance bottleneck.
Now, given that I work in Europe but my test systems are often located
in California, I might be a little more sensitive than most to this
fact. So one important change we made early on in this project was to
simply bump up the default socket buffers in the 7000 line to 1MB. For
read throughput under similar conditions, however, we can only advise
you to do the same to your client infrastructure.
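
The window-per-round-trip rule is simple enough to put into a few lines of Python. In this sketch the 1MB buffer comes from the text; the 150 ms round trip (a plausible Europe to California figure) and the 48KB buffer are assumptions for illustration, not measured or documented defaults.

```python
def max_connection_mb_per_s(socket_buffer_bytes, rtt_ms):
    """Upper bound for one TCP connection: one socket buffer per round trip."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return socket_buffer_bytes * 1000.0 / rtt_ms / 1e6


def buffer_bytes_for(target_mb_per_s, rtt_ms):
    """Socket buffer needed to sustain a target rate at a given round trip time."""
    return target_mb_per_s * 1e6 * rtt_ms / 1000.0


if __name__ == "__main__":
    rtt_ms = 150.0  # assumed wide area round trip time
    print(max_connection_mb_per_s(48 * 1024, rtt_ms))     # hypothetical small buffer
    print(max_connection_mb_per_s(1024 * 1024, rtt_ms))   # 1MB buffer from the text
    print(buffer_bytes_for(110, rtt_ms))                  # buffer to fill 1GbE at this RTT
```

Under these assumptions even a 1MB buffer allows only about 7 MB/sec on a single connection across the ocean, while on a local network with sub-millisecond round trips the same buffer is no constraint at all.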