
Monday, November 10, 2008
Blogfest : Performance and the Hybrid Storage Pool
Today Sun is announcing a new line of
Unified
Storage designed by a core of the most brilliant engineers. For
starters Mike Shapiro provides a great introduction into this product,
the new economics behind it and the killer App in
Sun
Storage 7000.
The killer App is of course Bryan Cantrill's brainchild, the already
famous
Analytics.
As a performance engineer, it's been a great thrill to have given this
tool an early test drive. Working a full ocean (the Atlantic) plus a
continent (the USA) away from my system running Analytics, I was
skeptical at first that I would be visualizing all that information in
real time: the NFS/CIFS ops, the disk ops, the CPU load and network
throughput, per client, per disk, per file. ARE YOU CRAZY! All that
information available IN REAL TIME; I just have to say a big thank you
to the team that made it possible. I can't wait to see our customers
put this to productive use.
Also check out Adam Leventhal's great description of HSP, the
Hybrid Storage Pool, and read my own perspective on this topic
ZFS as a
Network Attach Storage Controller.
Let's not forget the immense contribution of the boundless energy bubble
that is Brendan Gregg, the man who brought the DTraceToolkit to the
semi-geek; he must be jumping with excitement as we now see the power
of DTrace delivered to each and every system administrator.
He talks here about the
Status
Dashboard. And Brendan's contribution does not stop there: he is
also the parent of this wonderful component of the
HSP known
as the L2ARC, which is how the readzillas become activated. See his own
previous work on the
L2ARC along with
Jing Zhang's more recent
studies. Quality assurance people don't often get into the spotlight, but check out
Tim Foster's post on how he tortured the zpool code
adding and removing L2ARC devices from pools.
For myself, it's been very exciting to see performance
improvement ideas get turned into product improvements from week to
week. Those interested should read how our group influenced the product that
is shipping today, see
Alan Chiu
and my own
Delivering Performance Improvements.
Such a product has a strong price/performance appeal and, given that we
fundamentally did not think that there were public benchmarks that
captured our value proposition, we had to come up with a
third millennium
participative way to talk about performance. Check out how we
designed our
Metrics
or maybe go straight to our numbers obtained by
Amitabha
Banerjee, a concise entry backed up by an immense, intense and
careful data gathering effort over the last few weeks. bmseer is putting his own light
on the
low level data (data to be updated with numbers from a grander config).
I've also posted here a few performance guiding lights to use when
thinking about this product; I call them
Performance
Invariants. Some further numbers can be found here about
RAID rebuild times.
On the application side, we have the great work of Sean (Hsianglung
Wu) and Arini Balakrishnan showing how a 7210 can deliver
more than 5000 concurrent video streams at an aggregate of,
you're kidding:
WOW ZA 750MB/sec.
More details on how this was achieved in
cdnperf.
Jignesh Shah shows step-by-step instructions for setting up
PostgreSQL over iSCSI.
See our Vice President, Solaris Data, Availability, Scalability &
HPC Bob Porras trying to tame this beast into a
nutshell
and pointing out code bits reminding everyone of the value of the
OpenStorage proposition.
See also what bmseer has to say on
Web
2.0 Consolidation and get from Marcus Heckel a walkthrough of
setting up
Olio
Web 2.0 kit with nice Analytics performance screenshots. Also get
the ISV reaction (a bit later) from
Georg Edelmann. Ryan Pratt reports on
Windows Server 2003 WHQL certification of the Sun Storage 7000 line.
And this just in : Data about what to expect from a
Database perspective.
We can talk all we want about performance but as Josh Simons points out,
these babies are available to you for your own
try and buy.
Or check out how you could be running the appliance within the next hour really :
Sun Storage 7000 in VMware.
It seems I am in competition with another, less verbose
aggregator.
Finally, capture the whole stream of related postings to
Sun Storage 7000
Delivering Performance Improvements to Sun Storage 7000
I describe here the effort I spearheaded studying the performance
characteristics of the OpenStorage platform and the ways in which our
team of engineers delivered real out of the box improvements to the
product that is shipping today.
One of the joys of working on the OpenStorage NAS appliance was
that solutions we found to performance issues could be immediately
transposed into changes to the appliance without further process.
The first big wins
We initially stumbled on 2 major issues, one for NFS synchronous writes
and one for the CIFS protocol in general. The NFS problem was a subtle
one involving the distinction between O_SYNC and O_DSYNC writes in the ZFS
intent log and was impacting our threaded synchronous writes test by
up to a 20X factor. Fortunately I had a history of studying that part
of the code and could quickly identify the problem and suggest a
fix. This was tracked as
6683293: concurrent O_DSYNC writes to a
fileset can be much improved over NFS.
The following week, turning to CIFS studies, we were seeing a great
scalability limitation in the code. Here again I was fortunate to be
the first one to hit this. The problem was that, to manage CIFS requests,
the kernel code was using simple kernel allocations that could
accommodate the largest possible request. Such large allocations and
deallocations cause what is known as a storm of TLB shootdown
cross-calls, limiting scalability.
Incredibly though, after implementing the trivial fix, I found that the
rest of the CIFS server was beautifully scalable code with no other
barriers. So in one quick and simple fix (using kmem caches) I could
demonstrate a great scalability improvement to CIFS. This was
tracked as
6686647
: smbsrv scalability impacted by memory
Since those 2 protocol problems were identified early on, I must say
that no serious protocol performance problems have come up. While we
can always find incremental improvements to any given test, our
current implementation has held up to our testing so far.
In the next phase of the project, we did a lot of work on improving
network efficiency at high data rates. In order to deliver the
throughput that the server is capable of, we must use 10Gbps network
interfaces, and the ones available on the NAS platforms are based on the
Neptune networking interface running the nxge driver.
Network Setup
I collaborated on this with
Alan Chiu, who already knew a
lot about this network card and driver tunables, and so we could
quickly hash out the issues. We had to decide on a proper out of the
box setup involving
- how many MSI-X interrupts to use
- whether to use networking soft rings or not
- what bcopy threshold to use in the driver as opposed to
binding dma
- whether or not to use the new Large Segment Offload (LSO)
technique for transmits.
We basically knew where we wanted to go here. We wanted many interrupts
on the receive side so as to not overload any CPU, and to avoid the use of
layered soft rings, which reduce efficiency. We wanted a low bcopy threshold so
that dma binding would be used more frequently, as the default value was too
high for this x64 based platform.
And LSO was providing a nice boost to efficiency. That got us to some
proper efficiency level.
However we noticed that under stress and a high number of connections
our efficiency would drop by 2 or 3X. After much head scratching we
traced this to the use of too many TX dma channels. It turns out that
with this driver and architecture, using only a few channels leads to more
stickiness in the scheduling and much, much greater efficiency. We
settled on 2 tx rings as a good compromise. That got us to a level
of 8-10 cpu cycles per byte transferred in network code (more on
Performance
Invariants).
Interrupt Blanking
Studying an open source alternative controller, we also found that on 1
of 14 metrics we were slower. That was rooted in the interrupt
blanking parameter that NICs use to gain efficiency. What we found here
was that by reducing our blanking to a small value we could leapfrog
the competition (from 2X worse to 2X better) on this test while
preserving our general network efficiency. We were then on par or
better for every one of the 14 tests.
Media Streaming
When we ran thousands of 1 Mb/s media streams from our systems we
quickly found that the file level software prefetching was hurting us.
So we initially disabled the code in our lab to run our media studies,
but at the end of the project we had to find an out of the box setup
that could preserve our media results without impairing maximum read
streaming. At some point we realized that what we were hitting was
6469558:
ZFS prefetch needs to be more aware of memory pressure. It turns out
that the internals of the zfetch code are set up to manage 8 concurrent
streams per file and can read ahead up to 256 blocks or records: in
this case 128K each. So when we realized that with 1000s of streams we
could read ahead ourselves out of memory, we knew what we needed to
do. We decided on setting up 2 streams per file reading ahead up to 16
blocks, and that seems quite sufficient to retain our media serving
throughput while keeping some prefetching capabilities. I note here also
that NFS client code will itself recognize streaming and issue
its own readahead. The backend code is then reading ahead of client
readahead requests. So we were kind of getting
ahead of
ourselves here. Read more about it @
cdnperf
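To put rough numbers on the readahead problem, here is a back-of-the-envelope sketch. It assumes a worst case where every one of 1000 files (a hypothetical count, one file per media stream) has all of its prefetch streams fully read ahead at once; real memory use would normally be lower.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def max_readahead_bytes(files, streams_per_file, blocks_per_stream,
                        block_size=128 * KIB):
    """Worst-case readahead footprint if every stream prefetches fully."""
    return files * streams_per_file * blocks_per_stream * block_size

# Settings quoted in the text: 8 streams x 256 blocks vs 2 streams x 16 blocks.
default = max_readahead_bytes(1000, 8, 256)
tuned = max_readahead_bytes(1000, 2, 16)

print(f"default: {default / GIB:.1f} GiB, tuned: {tuned / GIB:.2f} GiB")
```

Under these assumptions the original settings allow about 256 MiB of readahead per file, against 4 MiB per file with the tuned values, a 64X reduction.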
To slog or not to slog
One of the innovative aspects of this OpenStorage server is the use of
read and write optimized solid state devices; see for instance
The Value of Solid State Devices.
Those SSDs are beautiful devices designed to help
latency but not throughput. A massive commit is actually better handled by
regular storage, not SSD. It turns out that it was actually dead easy
to instruct the ZIL to recognize massive commits and divert its block
allocation strategy away from the SSD toward the common pool of
disks. We see two benefits here: the massive commits will be sped up
(preventing the SSD from becoming the bottleneck) but, more importantly,
the SSD will now be available as a low latency device to handle
workloads that rely on low latency synchronous operations. One should
note here that the ZIL is a "per filesystem" construct and so while a
filesystem might be working on a large commit, another filesystem from
the same pool might still be running a series of small transactions and
benefit from the write optimized SSD.
In a similar way, when we first tested the read-optimized SSDs, we quickly
saw that streamed data would install in this caching layer and that it
could slow down the processing later. Again the beauty of working on
an appliance and closely with developers meant that by the following
build, those problems had been solved.
Transaction Group Time
ZFS operates by issuing regular transaction groups in which
modifications since the last transaction group are recorded on disk and
the uberblock is updated. This used to be done at a 5 second interval
but with the recent improvement to the
write
throttling code this became a 30 second interval (on light
workloads) which aims to not generate more than 5 seconds of I/O per
transaction group. The 5 seconds of I/O per txg was chosen to
maximize the ratio of data to metadata in each txg, delivering more
application throughput. Now these Storage 7000 servers will typically
have lots of I/O capability on the storage side and the data/metadata
ratio is not as much of a concern as for a small JBOD storage. What we found
was that we could reduce the target of 5 seconds of I/O down to 1
while still preserving good throughput. Having this smaller value
smoothed out operation.
IT JUST WORKS
Well that is certainly the goal. In my group, we spent the
last year performance testing these OpenStorage systems, finding and
fixing bugs, suggesting code improvements, and looking for better
compromises for common tunables. At this point, we're happy with the
state of the systems, particularly for mirrored configurations with
write optimized SSD accelerators. Our code is based on a recent
OpenSolaris (from August) that already has a lot of improvements over
Solaris 10, particularly for ZFS, to which we've added specific
improvements relevant to NAS storage. We think these systems will at
times deliver great performance (see Amitabha's
results)
but almost always shine in the price performance categories.
Sun Storage 7000 Performance invariants
I see many reports about running campaigns of tests measuring
performance over a test matrix. One problem with this approach is of
course the
Matrix.
That matrix is never big enough for the consumer of the information ("can
you run this instead?").
A more useful approach is to think in terms of performance
invariants. We all know that a 7.2K RPM disk drive can do 150-200 IOPS
as an invariant and that disks will have a throughput limit such as
80MB/sec. Thinking in terms of those invariants helps in extrapolating
performance data (with caution), and observing a breakdown in an invariant
is often a sign that something else needs to be root caused.
So using 11 metrics and our performance engineering effort, what can be
our guiding invariants? Bear in mind that
those are expected to be rough estimates. For real measured numbers check out Amitabha
Banerjee's excellent post on
Analyzing the Sun Storage 7000.
Streaming : 1 GB/s on server and 110 MB/sec on client
For read streaming, we're observing that 1GB/s is somewhat our
guiding number. This can be achieved with a fairly
small number of clients and threads but will be easier to reach if the
data is prestaged in server caches. A client normally running 1GbE
network cards is able to extract 110 MB/sec rather easily. Read
streaming will be easier to achieve with the larger 128K records,
probably due to the lesser CPU demand. While our results are with
regular 1500 byte Ethernet frames, using jumbo frames will also make
this limit easier to reach or even break. For a mirrored pool, data
needs to be sent twice to the storage and we see a reduction of about
50% for write streaming workloads.
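As a quick illustration of how this invariant can be used, the sketch below estimates how many 1GbE clients it takes to reach the server-side streaming number. The 50% write factor is applied to the 1 GB/s read figure, which is my reading of the text rather than a measured value.

```python
import math

SERVER_READ_MB_S = 1000    # ~1 GB/s read streaming invariant
CLIENT_GBE_MB_S = 110      # what one 1GbE client extracts
MIRROR_WRITE_FACTOR = 0.5  # ~50% reduction for mirrored write streaming

def clients_to_saturate(server_mb_s, client_mb_s=CLIENT_GBE_MB_S):
    """Smallest number of clients whose combined rate reaches the server rate."""
    return math.ceil(server_mb_s / client_mb_s)

read_clients = clients_to_saturate(SERVER_READ_MB_S)
write_clients = clients_to_saturate(SERVER_READ_MB_S * MIRROR_WRITE_FACTOR)

print(f"read: {read_clients} clients, mirrored write: {write_clients} clients")
```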
Random Read I/Os per second : 150 random read IOPS per mirrored disk
This is probably a good guiding light also. When going to disks, that
will be a reasonable expectation. But here caching can radically
change this. Since we can configure up to 128GB of host RAM and 4
times that much of secondary caches, there are opportunities to break
this barrier. But when going to spindles, that needs to be kept under
consideration. We also know that raid-z spreads records to all
disks. So the 150 IOPS limit basically applies to
raid-z groups. Do
plan to have many groups to service random reads.
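The difference between the two layouts is easy to see with a small estimate. This is only a sketch of the invariant as stated (uncached reads, 150 IOPS per spindle); the 44-disk example and the 11-disk group width are hypothetical.

```python
IOPS_PER_DISK = 150  # rough invariant for uncached random reads

def mirrored_pool_read_iops(n_disks, iops_per_disk=IOPS_PER_DISK):
    """Each disk of a mirrored pool can service random reads independently."""
    return n_disks * iops_per_disk

def raidz_pool_read_iops(n_groups, iops_per_group=IOPS_PER_DISK):
    """A record is spread over all disks of a raid-z group, so for random
    reads a whole group behaves roughly like a single disk."""
    return n_groups * iops_per_group

# Hypothetical 44-disk pool: all mirrored, or 4 raid-z groups of 11 disks.
print("mirrored:", mirrored_pool_read_iops(44), "IOPS")
print("raid-z  :", raidz_pool_read_iops(4), "IOPS")
```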
Random Read I/Os per second using SSDs : 3100 Read IOPS per Read Optimized SSD
In some instances, data after eviction from main memory will be kept
in secondary caches. Small files and filesystems with a tuned recordsize are
good target workloads for this. Those read-optimized SSDs can return this
data at a rate of 3100 IOPS (see
L2 ARC). More
importantly, they can do so at much reduced latency, meaning that
lightly threaded workloads will be able to achieve high throughput.
Synchronous writes per second : 5000-9000 Synchronous write per Write Optimized SSD
Synchronous writes can be generated by an O_DSYNC write (database) or
just as part of the NFS protocol (such as the tar extract:
open, write, close workloads). Those will reach the NAS server and be
coalesced in a single transaction with the separate intent log. Those
SSD devices are great latency accelerators but are still devices with
a max throughput of around 110 MB/sec. However our code actually
detects when the SSD devices become the bottleneck and will
divert some of the I/O requests to use the main storage pool. The net
of all this is a complex equation but we've easily observed 5000-8000
synchronous writes per SSD with up to 3 devices (or 6 in mirrored pairs).
Using a smaller working set, which creates less competition for CPU
resources, we've even observed 48K synchronous writes per second.
Cycles per Byte : 30-40 cycles per byte for NFS and CIFS
Once we include the full NFS or CIFS protocol, the efficiency was
observed to be in the 30-40 cycles per byte range (8 to 10 of those coming
from the pure network component at the regular 1500 byte MTU). More
studies are required to figure out the extent to which this is valid
but it's an interesting way to look at the problem. Having to run
disk I/O vs being serviced directly from cached data is expected to
exert an additional 10-20 cycles per byte. Obviously for metadata tests,
in which a small number of bytes is transferred per operation, we probably
need to come up with a cycles/MetaOps invariant but that is still TBD.
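To see what a cycles-per-byte figure means in practice, the sketch below converts it into the aggregate CPU clock rate (summed over all cores) needed to sustain a given throughput. The three cases use the upper end of the ranges quoted above; the 1 GB/s target is just the streaming invariant reused as an example.

```python
def cpu_ghz_needed(throughput_mb_s, cycles_per_byte):
    """Aggregate clock rate in GHz (summed across cores) spent moving data."""
    bytes_per_second = throughput_mb_s * 1_000_000
    return bytes_per_second * cycles_per_byte / 1e9

cases = [
    ("network only", 10),
    ("NFS/CIFS from cache", 40),
    ("NFS/CIFS with disk I/O", 40 + 20),
]
for label, cycles_per_byte in cases:
    ghz = cpu_ghz_needed(1000, cycles_per_byte)
    print(f"{label:24s}: {ghz:.0f} GHz of CPU at 1 GB/s")
```

Read the other way around, a breakdown of this invariant (many more cycles per byte than expected) is a hint that something needs to be root caused.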
Single Client NFS throughput : 1 TCP Window per round
trip latency.
This is one fundamental rule of network throughput but it's a
good occasion to refresh this in everyone's mind. Clients, at least
Solaris clients, will establish a single TCP connection to a server.
On that connection there can be a large number of unrelated requests
as NFS is a very scalable protocol. However, a single connection will
transport data at a maximum speed of a "socket buffer" divided by the
round trip latency. Since today's network speeds, particularly in wide
area networks, have grown somewhat faster than default socket buffers,
we can see such things becoming a performance bottleneck. Now given that
I work in Europe but my test systems are often located in California,
I might be a little more sensitive than most to this fact. So one
important change we made early on in this project was to simply bump
up the default socket buffers in the 7000 line to 1MB. However for
read throughput under similar conditions, we can only advise you to do
the same to your client infrastructure.
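The rule is simple enough to compute directly. In this sketch the 150 ms round trip and the 48 KiB "small default" buffer are illustrative assumptions, not measured values; only the 1MB buffer comes from the text.

```python
def max_single_connection_mb_s(socket_buffer_bytes, rtt_ms):
    """Upper bound for one TCP connection: one window per round trip."""
    return socket_buffer_bytes / (rtt_ms / 1000.0) / 1e6

RTT_MS = 150  # hypothetical Europe-to-California round trip

for buf in (48 * 1024, 1024 * 1024):
    rate = max_single_connection_mb_s(buf, RTT_MS)
    print(f"{buf // 1024:5d} KiB buffer: {rate:.2f} MB/s")
```

With these assumed values the larger buffer moves the ceiling from a fraction of a MB/s to about 7 MB/s over the same link, which is why the buffer on both ends of the connection matters.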
Using ZFS as a Network Attach Controller and the Value of Solid State Devices
So Sun is coming out today with a line of Sun Storage 7000
systems that have ZFS as the integrated volume and filesystem manager
using both read and write optimized SSDs. What is this
Hybrid Storage Pool
and why is this a
good performance architecture for storage ?
A write optimized SSD is a custom designed device for
the purpose of
accelerating operations of the ZFS intent log (ZIL). The ZIL is the
part of ZFS that manages the important synchronous operations,
guaranteeing that such writes are acknowledged quickly to applications
while guaranteeing persistence in case of outage. Data stored in the
ZIL is also kept in memory until ZFS issues the next transaction group
(every few seconds).
The ZIL is what stores data urgently (when an application is waiting) but
the TXG is what stores data permanently. The ZIL on-disk blocks are
only ever re-read after a failure such as a power outage. So the SSDs
that are used to accelerate the ZIL are write-optimized : they need to handle data at
low latency on writes; reads are unimportant.
The TXG is an operation that is asynchronous to applications: apps
are generally not waiting for transaction groups to commit. The
exception here is when data is generated at a rate that exceeds the
TXG rate for a sustained period of time. In this case, we become
throttled by the pool throughput. In a NAS storage this will rarely
happen since network connectivity even at GB/s is still much less than
what storage is capable of and so we do not generate the imbalance.
The important thing now is that in a NAS server, the controller is also
running a file level protocol (NFS or CIFS) and so is knowledgeable
about the nature (synchronous or not) of the requested writes. As such
it can use the accelerated path (the SSD) only for the necessary
component of the workloads. Less competition for these devices means
we can deliver both high throughput and low latency together in the
same consolidated server.
But here is where it gets nifty. At times, a NAS server might
receive a huge synchronous request. We've observed this for instance
due to fsflush running on clients, which will turn non-synchronous
writes into a massive synchronous one. I note here that a way to
reduce this effect is to tune up fsflush (to say 600). This is
commonly done to reduce the cpu usage of fsflush but will be welcome
in the case of clients interacting with NAS storage. We can also
disable page flushing entirely by setting dopageflush to 0. But that
is a client issue. From the perspective of the server, we still need,
as a NAS, to manage large commit requests.
When subject to such a workload, say a 1GB commit, ZFS, being fully aware
of the situation, can now decide to bypass the SSD device and issue
requests straight to disk based pool blocks. It would do so for 2
reasons. One is that the pool of disks in its entirety has more
throughput capability than the few write optimized SSDs and so we will
service this request faster. But more importantly, the value of the
SSD is in its latency reduction aspect. Leaving the SSDs available to
service many low latency synchronous writes is considered valuable
here. Another way to say this is that large writes are generally well
served by regular disk operations (they are throughput bound) whereas
small synchronous writes (latency bound) can and will get help from
the SSDs.
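The policy described here can be pictured as a small routing decision. The following is only an illustrative sketch, not the ZFS code: the function name and the 1 MiB threshold are hypothetical, and the real ZIL logic weighs more than the commit size.

```python
LARGE_COMMIT_BYTES = 1 << 20  # hypothetical threshold, not the value ZFS uses

def choose_log_target(commit_bytes, has_log_ssd, threshold=LARGE_COMMIT_BYTES):
    """Pick where intent log blocks for a commit are allocated.

    Small commits are latency bound and go to the write optimized SSD;
    large commits are throughput bound and go to the main pool of disks.
    """
    if not has_log_ssd:
        return "pool"
    return "pool" if commit_bytes > threshold else "ssd"

print(choose_log_target(8 * 1024, has_log_ssd=True))  # small synchronous write
print(choose_log_target(1 << 30, has_log_ssd=True))   # 1GB commit
```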
Caches at work
On the read path we also have custom designed read optimized
SSDs to fit in these OpenStorage platforms. At Sun, we just believe
that many workloads will naturally lend themselves to caching technologies. In a
consolidated storage solution, we can offer up to 128GB of primary
memory based caching and approximately 500GB of SSD based caching.
We also recognized that the latency delta between memory cached
response and disk response was just too steep. By inserting a layer
of SSD between memory and disk, we have this intermediate step
providing lower latency access than disk to a working set which is
now many times greater than memory.
It's important here to understand how and when these read
optimized SSDs will work. The first thing to recognize is that the SSDs
will have to be primed with data. They feed off data being evicted
from the primary caches. So their effect will not be immediately seen at
the start of a benchmark. Second, one of the values of read optimized
SSDs is truly in low latency responses to small requests. Small requests
here means things of the order of 8K in size. Such requests will occur
either when dealing with small files (~8K) or when dealing with larger
files accessed by a fixed record based application, typically a database. For
those applications it is customary to set the recordsize and this will
allow those new SSDs to become more effective.
Our read optimized SSD can service up to 3000 read IOPS (see Brendan's work
on the
L2 ARC)
and this is close to or better than what a JBOD of 24 x 7.2K RPM disks can do. But the
key point is that the low latency response means it can do so using
many fewer threads than would be necessary to reach the same level on
a JBOD. Brendan demonstrated here that the response time of these
devices can be 20 times faster than disks and 8 to 10 times faster
from the client's perspective. So once data is installed in the
SSD, users will see their requests serviced much faster,
which means we are less likely to be subject to queuing delays.
The use of read optimized SSDs is configurable in the Appliance. Users
should learn to identify the parts of their datasets that end up gated
by lightly threaded read response time. For those workloads, enabling
the secondary cache is one way to deliver the value of the read optimized
SSD. For those filesystems, if the workload contains small files (such
as 8K) there is no need to tune anything; however for large files
accessed in small chunks, setting the filesystem recordsize to 8K is
likely to produce the best response time.
Another benefit of these SSDs will be in the $/IOPS case. Some
workloads are just IOPS hungry while not necessarily huge block
consumers. The SSD technology offers great advantages in this space,
where a single SSD can deliver the IOPS of a full JBOD at a fraction
of the cost. So with workloads that are more modestly sized but IOPS
hungry, a test drive of the SSD will be very interesting.
It's also important to recognize that these systems are used in
consolidation scenarios. It can be that some parts of the applications
will be sped up by read or write optimized SSDs, or by the large memory
based caches, while other consolidated workloads can exercise other
components.
There is another interesting implication of using SSDs in the
storage with regard to clustering. The read optimized SSDs acting as
caching layers actually never contain critical data. This means those
SSDs can go into disk slots of head nodes since there is no data to be
failed over. On the other hand, write optimized SSDs will store data
associated with the critical synchronous writes. But since those are located in
dual-ported backend enclosures, not the head nodes, it implies that,
during clustered operations, storage head nodes do not have to
exchange
any user level data.
So by using ZFS and read and write optimized SSDs, we can deliver low
latency writes for applications that rely on them, and good throughput
for synchronous and non-synchronous cases using cost effective SATA
drives. Similarly on the read side, the large amount of primary and
secondary caches enables delivering high IOPS at low latency (even if
the workload is not highly threaded) and it can do so using the more
cost and energy efficient SATA drives.
Our architecture allows us to take advantage of the latency
accelerators while never being gated by them.
Designing Performance Metrics for Sun Storage 7000
One of the necessary checkpoints before launching a product is to be
able to assess its performance. With Sun Storage 7xxx we had a
challenge in that the only NFS benchmark of note was SPEC
SFS. Now this benchmark will have its supporters and some customers
might be attached to it but it's important to understand what a
benchmark actually says.
The SFS benchmark is a lot about "cache busting" the server: this is
interesting but at Sun we think that caches are actually helpful in
real scenarios. Data goes in cycles in which it becomes hot at times.
Retaining that data in cache layers allows much lower latency access,
and much better human interaction with storage engines. Being a cache
busting benchmark, SFS numbers end up as a measure of the number of
disk rotations attached to the NAS server. So a good SFS result requires
100s or 1000s of expensive, energy hungry 15K RPM spindles. To get good
IOPS, layers of caching are more important to the end user experience
and cost efficiency of the solution.
So we needed another way to talk about performance. Benchmarks tend to
test the system in peculiar ways that do not necessarily reflect the
workloads each customer is actually facing. There are very many
workload generators for I/O but one interesting one that is open source
and extensible is
Filebench
available in
Source.
So we used filebench to gather basic performance information about our
system with the hope that customers will then use filebench to
generate profiles that map to their own workloads. That way, different
storage options can be tested on hopefully more meaningful tests than
benchmarks.
Another challenge is that a NAS server interacts with client systems
that themselves keep a cache of the data. Given that we wanted to
understand the back-end storage, we had to set up the tests to avoid
client side caching as much as possible. So for instance between the
phase of file creation and the phase of actually running the tests we
needed to clear the client caches and at times the server caches as
well. These possibilities are not readily accessible with the
simplest load generators and we had to do this in a rather ad hoc
fashion. One validation of our runs was to ensure that the amount of
data transferred over the wire, observed with
Analytics, was compatible
with the aggregate throughput measured at the clients.
Still another challenge was that we needed to test a storage system
designed to interact with a large number of clients. Again load
generators are not readily set up to coordinate multiple clients and
gather global metrics. During the course of the effort filebench did
come up with a clustered mode of operation but we were actually too
far along our path to take advantage of it.
This coordination of clients is important because the performance
information we want to report is actually the one that is delivered to
the client. Now each client will report its own value for a given
test and our tool will sum up the numbers; but such a sum is only
valid inasmuch as the tests ran on the clients in the same timeframe.
The possibility of skew between tests is something that needs to be
monitored by the person running the investigation.
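A minimal sketch of such a skew check is shown below. It is not the tool we used; the `ClientResult` record, the `aggregate` helper and the 10% skew tolerance are all hypothetical, chosen only to show the idea of summing per-client numbers while flagging runs whose time windows do not line up.

```python
from dataclasses import dataclass

@dataclass
class ClientResult:
    name: str
    start: float       # seconds since a common reference
    end: float
    mb_per_sec: float

def aggregate(results, max_skew_fraction=0.1):
    """Sum per-client throughput and report whether the sum is trustworthy.

    The sum is considered valid only if the window common to all clients
    covers at least (1 - max_skew_fraction) of the overall time span.
    """
    overlap = min(r.end for r in results) - max(r.start for r in results)
    span = max(r.end for r in results) - min(r.start for r in results)
    total = sum(r.mb_per_sec for r in results)
    valid = span > 0 and overlap / span >= 1 - max_skew_fraction
    return total, valid

aligned = [ClientResult("c1", 0, 30, 100.0), ClientResult("c2", 1, 31, 105.0)]
skewed = [ClientResult("c1", 0, 30, 100.0), ClientResult("c2", 10, 40, 105.0)]
print(aggregate(aligned))
print(aggregate(skewed))
```

The same idea extends to the wire-level validation mentioned earlier: the aggregate from the clients should be compatible with the throughput Analytics reports on the server over the common window.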
One way that we increased this coordination was that we
divided our tests into 2 categories: those that required precreated
files, and those that created files during the timed portion of the
runs. If not handled properly, file creation would actually cause
important result skew. The option we pursued here was to have a
pre-creation phase of files that was done once. From that point, our
full set of metrics could then be run and repeated many times with
much less human monitoring leading to better reproducibility of
results.
Another goal of this effort was that we wanted to be able to run our
standard set of metrics in a relatively short time, say less than 1
hour. In the end we got that to about 30 minutes per run to gather 10
metrics. Having a short amount of time here is important because there
are lots of possible ways that such tests can be misrun. Having someone
watch over the runs is critical to the value of the output and to its
reproducibility. So after having run the pre-creation of files
offline, one could run many repeated instances of the tests, validating
the runs with
Analytics and, through general observation of the system,
gaining some insight into the meaning of the output.
At this point we were ready to define our metrics.
Obviously we needed streaming reads and writes. We needed random reads.
We needed small synchronous writes, important to database workloads and
to the NFS protocol. Finally small file creation and stat operations
completed the mix. For random reading we also needed to distinguish
between operating from disks and from storage side caches, an
important aspect of our architecture.
Now another thing that was on my mind was that this is not a
benchmark. That means we would not be trying to fine-tune the metrics
in order to find out just exactly what is the optimal number of
threads and request size that leads to the best possible performance from
the server. This is not the way your workload is set up. Your number of
client threads running is not elastic at will. Your workload is what
it is (threading included); the question is how fast it is being
serviced.
So we defined precise
per client workloads with a preset number
of threads running the operations. We came up with this set just as an
illustration of what could be representative loads:
1- 1 thread streaming reads from 20G uncached set, 30 sec.
2- 1 thread streaming reads from same set, 30 sec.
3- 20 threads streaming reads from 20G uncached set, 30 sec.
4- 10 threads streaming reads from same set, 30 sec.
5- 20 threads 8K random read from 20G uncached set, 30 sec.
6- 128 threads 8K random read from same set, 30 sec.
7- 1 thread streaming write, 120 sec
8- 20 threads streaming write, 120 sec
9- 128 threads 8K synchronous writes to 20G set, 120 sec
10- 20 threads metadata (fstat) IOPS from pool of 400k files, 120 sec
11- 8 threads 8K file create IOPS, 120 sec.
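For readers who want to script something similar, the list above can be captured as plain data. This is a sketch only: the `Metric` record is a hypothetical structure, not a filebench profile format, and the values are copied from the list.

```python
from typing import NamedTuple

class Metric(NamedTuple):
    number: int
    threads: int       # threads per client
    operation: str
    duration_s: int    # timed portion of the run

METRICS = [
    Metric(1, 1, "streaming read, 20G uncached set", 30),
    Metric(2, 1, "streaming read, same set", 30),
    Metric(3, 20, "streaming read, 20G uncached set", 30),
    Metric(4, 10, "streaming read, same set", 30),
    Metric(5, 20, "8K random read, 20G uncached set", 30),
    Metric(6, 128, "8K random read, same set", 30),
    Metric(7, 1, "streaming write", 120),
    Metric(8, 20, "streaming write", 120),
    Metric(9, 128, "8K synchronous write, 20G set", 120),
    Metric(10, 20, "metadata (fstat), pool of 400k files", 120),
    Metric(11, 8, "8K file create", 120),
]

timed_seconds = sum(m.duration_s for m in METRICS)
print(f"{len(METRICS)} metrics, {timed_seconds / 60:.0f} minutes of timed runs")
```

The timed portions add up to 13 minutes; cache clearing and setup between tests are not included in that figure.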
For each of the 11 metrics, we could propose mapping these to relevant industries :
1- Backups, Database restoration (source), DataMining , HPC
2- Financial/Risk Analysis, Video editing, HPC
3- Media Streaming, HPC
4- Video Editing
5- DB consolidation, Mailserver, generic fileserving, Software development.
6- DB consolidation, Mailserver, generic fileserving, Software development.
7- User data Restore (destination)
8- Financial/Risk Analysis, backup server
9- Database/OLTP
10- Web 2.0, Mailserver/Mailstore, Software Development
11- Web 2.0, Mailserver/Mailstore, Software Development
We managed to get all these tests running except the fstat (test 10)
due to a technicality in filebench. Filebench insisted on creating
the files up front and this test required thousands of them; moreover
filebench used a method that ended up single threaded to do so and in
the end, the stat information was mostly cached on the client. While
we could have plowed through some of the issues, the conjunction of
all these made us put the fstat test to the side for now.
Concerning thread counts, we figured that the single stream read test was
at times critical (for administrative purposes) and an interesting
measure of the latency. Tests 1 and 2 were defined this way, with test
1 starting with cold client and server caches and test 2 continuing
the runs after having cleared the client cache (but not the server),
thus showing the boost from server side caching. Tests 3 and 4 are
similarly defined with more threads involved, for instance to mimic a
media server. Tests 5 and 6 did random read tests, again with test 5
starting with a cold server cache and test 6 continuing with some of
the data precached from test 5. Here, we did have to deal with client
caches, trying to ensure that we don't hit in the client cache too much
as the run progressed. Tests 7 and 8 showcased streaming writes for
single and 20 streams (per client). Reproducibility of tests 7 and 8
is more difficult, we believe, because of the client side
fsflush issue. We
found that we could get more stable results by tuning fsflush on the
clients. Test 9 is the all important synchronous write case (for
instance a database). This test truly showcases the benefit of our
write side SSD and also shows why tuning the recordsize to match ZFS
records with DB accesses is important. Test 10 was inoperative as
mentioned above and test 11, filecreate, completes the set.
Given that those were predefined test definitions, we're very happy to
see that our numbers actually came out really well with these tests,
particularly for the mirrored configs with write optimized SSDs.
See for instance results obtained by
Amitabha Banerjee.
I should add that these can now be used to give a ballpark estimate of the
capability of the servers. They were not designed to deliver the
topmost numbers from any one config. The variability of the runs is
at times greater than we'd wish and so your mileage will
vary. Using
Analytics to observe the running system can be quite
informative and a nice way to actually demo that capability. So use
the output with caution and use your own judgment when it comes to
performance issues.