
Thursday June 11, 2009
Compared Performance of Sun 7000 Unified Storage Array Line
The Sun Storage 7410 Unified Storage Array provides high performance
for NAS environments. Sun's product can be used on a wide variety of
applications. The Sun Storage 7410 Unified Storage Array with a _single_
10 GbE connection delivers the line speed of that 10 GbE link.
- The Sun Storage 7410 Unified Storage Array delivers 1 GB/sec throughput performance.
- The Sun Storage 7310 Unified Storage Array delivers over 500
MB/sec on streaming writes for backups and imaging applications.
- The Sun Storage 7410 Unified Storage Array delivers over 22000
8K synchronous writes per second, combining great DB
performance and the ease of deployment of Network Attached Storage
while delivering the economic benefits of inexpensive SATA disks.
- The Sun Storage 7410 Unified Storage Array delivers over 36000
random 8K reads per second from a 400GB working set for great Mail application
responsiveness. This corresponds to an enterprise of 100000 people
with every employee accessing new data every 3.6 seconds, consolidated
on a single server.
All those numbers characterise a single head of the clusterable 7410
technology. The 7000 clustering technology stores all data in dual
attached disk trays and no state is shared between cluster heads
(see
Sun 7000 Storage clusters). This
means that an active-active cluster of 2 healthy 7410 heads will deliver 2X
the performance posted here.
Also note that the performance posted here represents what is achieved
under a very tightly defined, constrained
workload (see
Designing 11 Storage metric) and does not represent the performance limits of the systems. This is testing 1 x 10 GbE port only; each product can have 2 or 4 10 GbE ports, and by running load across multiple ports the server can deliver even higher performance. Achieving maximum performance is a separate exercise done extremely well
by my friend Brendan :
Measurement Method
To measure our performance we used the open source Filebench tool
accessible from SourceForge (
Filebench
on solarisinternals.com). Measuring the performance of NAS storage
is not an easy task. One has to deal with the client side cache, which
needs to be bypassed, the synchronisation of multiple clients, and the
presence of client side page flushing daemons, which can turn asynchronous
workloads into synchronous ones. Because our Storage 7000 line
can have such large caches (up to 128GB of RAM and more than 500GB of secondary caches) and we wanted to test disk responses, we
needed to find backdoor ways to flush those caches on the servers. Read
Amitabha's
Filebench
Kit entry on the topic, in which he posts a link to the toolkit
used to produce the numbers.
We recently released our first major software update,
2009.Q2, and along with that a new lower cost clusterable 96 TB Storage, the
7310.
We report here the numbers of a 7310 with the latest software release compared to those
previously obtained for the 7410, 7210 and 7110
systems, each attached to an 18 to 20 client pool over a single 10GbE interface
with regular Ethernet frames (1500 bytes). By the way, looking at
Brendan's results above, I encourage you to upgrade to Jumbo Frames
Ethernet for even more performance, and note that our servers can drive
two 10GbE ports at line speed.
Tested Systems and Metrics
The tested setups are :
Sun Storage 7410, 4 x quad core: 16 cores @ 2.3 Ghz AMD.
128GB of host memory.
1 dual port 10Gbe Network Atlas Card. NXGE driver. 1500 MTU
Streaming Tests:
2 x J4400 JBOD, 44 x 500GB SATA drives 7.2K RPM, Mirrored pool,
3 Write optimized 18GB SSD, 2 Read Optimized 100GB SSD.
IOPS tests:
12 x J4400 JBOD, 280 x 500GB SATA drives 7.2K RPM, Mirrored pool,
272 Data drives + 8 spares.
8-Mirrored Write Optimised 18GB SSD, 6 Read Optimized 100GB SSD.
FW OS : ak/generic@2008.11.20,1-0
Sun Storage 7310, 2 x quad core: 8 cores @ 2.3 Ghz AMD.
32GB of host memory.
1 dual port 10Gbe Network Atlas Card (1 port used). NXGE driver. 1500 MTU
4 x J4400 JBOD for a total 92 SATA drives 7.2K RPM
43 mirrored pairs
4 Write Optimised 18GB SSD, 2 Read Optimized 100GB SSD.
FW OS : Q2 2009.04.10.2.0,1-1.15
Sun Storage 7210, 2 x quad core: 8 cores @ 2.3 Ghz AMD
32 GB of host memory.
1 dual port 10Gbe Network Atlas Card (1 port used). NXGE driver. 1500 MTU
44 x 500 GB SATA drives 7.2K RPM, Mirrored pool,
2 Write Optimised 18 GB SSD.
FW OS : ak/generic@2008.11.20,1-0
Sun Storage 7110, 2 x quad core Opteron: 8 cores @ 2.3 Ghz AMD
8 GB of host memory.
1 dual port 10Gbe Network Atlas Card (1 port used). NXGE driver. 1500 MTU
12 x 146 GB SAS drives, 10K RPM, in 3+1 Raid-Z pool.
FW OS : ak/generic@2008.11.20,1-0
The newly released 7310 was tested with the most recent software revision and that certainly is giving the 7310 an edge over its peers.
The 7410 on the other hand was measured here managing a much larger contingent of storage, including mirrored Logzillas and 3 times as many JBODs, and that is
expected to account for some of the performance delta being observed.
| Metrics | Short Name |
|---|---|
| 1 thread per client streaming cached reads | Stream Read light |
| 1 thread per client streaming cold cache reads | Cold Stream Read light |
| 10 threads per client streaming cached reads | Stream Read |
| 20 threads per client streaming cold cache reads | Cold Stream Read |
| 1 thread per client streaming write | Stream Write light |
| 20 threads per client streaming write | Stream Write |
| 128 threads per client 8k synchronous writes | Sync write |
| 128 threads per client 8k random read | Random Read |
| 20 threads per client 8k random read on cold caches | Cold Random Read |
| 8 threads per client 8k small file create IOPS | Filecreate |
There are 6 read tests, 2 write tests and 1 synchronous write test
which overwrites its data files as a database would. A final
filecreate test completes the metrics. Tests execute against a 20GB
working set _per client_ times 18 to 20 clients. There are 4 sets used
in total, running over independent shares, for a total of 80GB per
client. So before actual runs are taken, we create all working sets,
or 1.6 TB of precreated data. Then before each run, we clear all
caches on the clients and server.
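The working set sizes quoted above can be rechecked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and it assumes the full 20-client pool and decimal units (1 TB = 1000 GB).

```python
# Figures taken from the text; names are illustrative, not from the test kit.
GB_PER_SET = 20   # working set per client for one group of tests
SETS = 4          # independent shares, one working set each
CLIENTS = 20      # assumption: upper end of the 18 to 20 client pool

per_client_gb = GB_PER_SET * SETS            # data precreated per client
total_tb = per_client_gb * CLIENTS / 1000    # decimal TB across all clients
random_read_set_gb = GB_PER_SET * CLIENTS    # data touched by one random read test

print(f"{per_client_gb} GB per client, {total_tb} TB precreated, "
      f"{random_read_set_gb} GB random read working set")
```

With 18 clients instead of 20 the precreated total drops to about 1.44 TB, so the 1.6 TB figure corresponds to the larger pool.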
In each of the 3 groups of 2 read tests, the first one benefits from
no caching at all and the throughput delivered to the client over the
network is observed to come from disk. The test runs for N seconds,
priming data in the Storage caches. A second run (
non-cold) is
then started after clearing the client side caches. Those tests will
see 100% of the data delivered over the network link but not all
of it is coming off the disks. Streaming tests will race through the
cached data and then finish off reading from disks. The random read
test can also benefit from increasing cached responses as the test
progresses. The exact caching characteristics of the 7000 line will
depend on a large number of parameters including your application
access pattern. Numbers here reflect the performance of fully
randomized tests over 20GB per client x 20 clients, or a 400GB working
set. Upcoming studies will include more data (showing even higher
performance) for workloads with higher cache hit ratios than those used
here.
In a Storage 7000 server, disks are grouped together in one pool and
then individual Shares are created. Each share has access to all disk
resources subject to quota (a maximum) and reservation (a minimum) that
might be set. One important setup parameter associated with each share
is the DB record size. It is generally better for IOPS tests to use 8K
records and for streaming tests to use 128K records. The recordsize can
be dynamically set based on expected usage.
The tests shown here were obtained with NFSv4, the default for Solaris clients (NFSv3 is expected to
come out slightly better). The
clients were running Solaris 10, with tuned tcp_recv_hiwat of 400K and
dopageflush=0 to prevent buffered writes from being converted into
synchronous writes.
Compared Results of the 7000 Storage Line
| NFSv4 Test | 7410 Head Mirrored Pool | 7310 Head Mirrored Pool | 7210 Head Mirrored Pool | 7110 Head 3+1 Raid-Z |
|---|---|---|---|---|
| Throughput | | | | |
| Cold Stream Read light | 915 MB/sec | 685 MB/sec | 719 MB/sec | 378 MB/sec |
| Stream Read light | 1074 MB/sec | 751 MB/sec | 894 MB/sec | 416 MB/sec |
| Cold Stream Read | 959 MB/sec | 598 MB/sec | 752 MB/sec | 329 MB/sec |
| Stream Read | 1030 MB/sec | 620 MB/sec | 792 MB/sec | 386 MB/sec |
| Stream Write light | 480 MB/sec | 507 MB/sec | 490 MB/sec | 226 MB/sec |
| Stream Write | 447 MB/sec | 526 MB/sec | 481 MB/sec | 224 MB/sec |
| IOPS | | | | |
| Sync write | 22383 IOPS | 8527 IOPS | 10184 IOPS | 1179 IOPS |
| Filecreate | 5162 IOPS | 4909 IOPS | 4613 IOPS | 162 IOPS |
| Cold Random Read | 28559 IOPS | 5686 IOPS | 4006 IOPS | 1043 IOPS |
| Random Read | 36478 IOPS | 7107 IOPS | 4584 IOPS | 1486 IOPS |
| Per Spindle IOPS | 272 Spindles | 86 Spindles | 44 Spindles | 12 Spindles |
| Cold Random Read | 104 IOPS | 76 IOPS | 91 IOPS | 86 IOPS |
| Random Read | 134 IOPS | 94 IOPS | 104 IOPS | 123 IOPS |
Analysis
The data shows that the entire Sun Storage 7000 line is a throughput
workhorse, delivering 10 Gbps level NAS services per cluster head
node, using a single network interface and a single IP address for easy
integration into your existing network.
As with other storage technologies, write streaming performance requires
more involvement from the storage controller and this leads to about
50% less write throughput compared to read throughput.
The use of write optimized SSD in the 7410, 7310 and 7210 also gives
this storage very high synchronous write capabilities. This is one of
the most interesting results as it maps to database performance. The ability to
sustain 24000 O_DSYNC writes at 192MB/sec of synchronized user data
using only 48 inexpensive SATA disks and 3 write optimized SSD is one
of the many great performance characteristics of this novel storage
system.
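The 192MB/sec figure follows directly from the write rate and the 8K write size. The sketch below redoes that conversion; the function name is illustrative, and it assumes the text counts 8K as 8 KB and 1 MB as 1000 KB.

```python
KB_PER_WRITE = 8  # each synchronous write carries 8K of user data

def sync_write_mb_per_sec(writes_per_sec, kb_per_write=KB_PER_WRITE):
    """User data rate in MB/sec, assuming 1 MB = 1000 KB."""
    return writes_per_sec * kb_per_write / 1000

quoted = sync_write_mb_per_sec(24000)    # rate quoted in the paragraph above
measured = sync_write_mb_per_sec(22383)  # 7410 Sync write row of the table

print(f"24000 writes/sec -> {quoted:.0f} MB/sec")
print(f"22383 writes/sec -> {measured:.0f} MB/sec")
```

The same conversion applied to the 7410 row of the results table gives roughly 179 MB/sec of synchronized user data.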
Random Read tests generally map directly to individual disk
capabilities and are a measure of total disk rotations. The cold runs
show that all our platforms are delivering data at the expected 100
IOPS per spindle for those SATA disks. Recall that our offering is
based on the economical, energy efficient 7.2K RPM disk technology. For
cold random reads, a mirrored pair of 2 x 7.2K RPM disks offers the same
total disk rotation (and IOPS) as expensive and power hungry 15K
RPM disks but in a much more economical package.
Moreover the difference between the warm and cold random read runs
shows that the Hybrid Storage Pool (HSP) is providing a 30% boost
even on this workload that randomly addresses a 400GB working set on
128GB of controller cache. The effective boost from the HSP can be
much greater depending on the cacheability of workloads.
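The warm versus cold boost can be recomputed from the Random Read and Cold Random Read rows of the results table. This is a minimal sketch; the dictionary and function names are illustrative.

```python
# (cold, warm) random read IOPS per platform, copied from the results table.
RANDOM_READ_IOPS = {
    "7410": (28559, 36478),
    "7310": (5686, 7107),
    "7210": (4006, 4584),
    "7110": (1043, 1486),
}

def warm_boost_percent(cold, warm):
    """Extra IOPS of the warm run relative to the cold run, in percent."""
    return (warm / cold - 1.0) * 100.0

boosts = {model: warm_boost_percent(cold, warm)
          for model, (cold, warm) in RANDOM_READ_IOPS.items()}

for model, boost in boosts.items():
    print(f"{model}: {boost:.0f}% more IOPS on the warm run")
```

For the 7410 this comes out at about 28%, which the text rounds to 30%. The other platforms have different cache sizes and land at different values.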
If we consider an organisation in which the average mail message is 8K
in size, our results show that we could consolidate 100000 employees on
a single 7410 storage where each employee is accessing new data every
3.6 seconds with 70ms response time.
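The consolidation claim can be sanity checked against the measured 7410 numbers. The sketch below computes the read rate that 100000 employees would generate. It also shows one plausible reading of the 70ms figure, which the entry does not spell out: Little's law applied to the Random Read test, assuming 128 threads on each of 20 clients with one read in flight per thread.

```python
# Illustrative check; variable names are not from the original entry.
EMPLOYEES = 100_000
SECONDS_BETWEEN_READS = 3.6
WARM_RANDOM_READ_IOPS = 36_478   # 7410 Random Read, from the results table
COLD_RANDOM_READ_IOPS = 28_559   # 7410 Cold Random Read

# Read rate generated by the whole organisation.
demand_iops = EMPLOYEES / SECONDS_BETWEEN_READS

# Assumption: 128 threads x 20 clients, each with one outstanding 8K read.
outstanding_reads = 128 * 20
# Little's law: response time = requests in flight / throughput.
latency_ms = outstanding_reads / WARM_RANDOM_READ_IOPS * 1000

print(f"demand: {demand_iops:.0f} reads/sec")
print(f"implied response time: {latency_ms:.1f} ms")
```

Under these assumptions the demand stays below even the cold run rate, and the implied response time is close to 70 ms.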
Messaging systems are also big consumers of file creations. I've shown
in the past how efficient ZFS can be at creating small files
(
Need Inodes ?). For the NFS protocol,
file creation is a straining workload but the 7000 storage line comes
out not too bad with more than 5000 filecreates per second per storage
controller.
Conclusion
Performance can never be summarised with a few numbers and we have
just begun to scratch the surface here. The numbers presented here
along with the disruptive pricing of the Hybrid Storage Pool will, I
hope, go a long way to show the incredible power of the Open
Storage architecture being proposed. And keep in mind that this
performance is achievable using less expensive, less power hungry SATA
drives and that all data services (NFS, CIFS, iSCSI, ftp, HTTP etc.)
offered by our Sun Storage 7000 servers are available at 0 additional
software cost to you.
Disclosure Statement:
Sun Microsystems generated results using filebench. Results reported 11/10/08 and
26/05/2009. Analysis done on June 6 2009.

Wednesday May 27, 2009
Free Beer and Free Deep Dive
For those lucky enough to be in the Bay area next week, I just heard there will be free unlimited beer and free access to the technical deep dives at the Community One event at the Moscone Center and Intercontinental Hotel nearby (and I'm only lying about one of the free thingy).
The program looks great with lots of star speakers on both days, so while free is cool, don't overlook the June 1st program as well.
Lucky you.

Friday February 13, 2009
Need Inodes ?
It seems that some old school filesystems still need to statically allocate inodes to hold pointers to individual files. Normally this should not cause too many problems as default settings account for an average filesize of
32K. Or will it ?
If the avg filesize you need to store on the filesystem is much smaller than this, then you are likely to eventually run out of inodes even if the space consumed on the storage is far from exhausted.
In ZFS inodes are allocated on demand and so the question came up: how many files can I store onto a piece of storage ? I managed to scrape up an old disk of 33GB, created a pool and wanted to see how many 1K files I could store on that storage.
ZFS stores files with the smallest number of sectors possible and so 2 sectors was enough to store the data. Then of course one needs to also store some amount of metadata, indirect pointer, directory entries etc to complete the story. There I didn't know what to expect. My program would create 1000 files per directory. Max depth level is 2, nothing sophisticated attempted here.
So I let my program run for a while and eventually interrupted it at 81% of disk capacity :
Filesystem size used avail capacity Mounted on
space 33G 27G 6.5G 81% /space
Then I counted the files.
#ptime find /space/create | wc
real 51:26.697
user 1:16.596
sys 25:27.416
23823805 23823805 1405247330
So 23.8M files consuming 27GB of data. Basically less than 1.2K of used disk space per KB of files. A legacy type filesystem that would allocate one inode per 32K would have run out of inodes after a meager 1M files but ZFS managed to store 23X more on the disk without any tuning.
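The space and inode arithmetic above can be rechecked as follows. This is only a sketch: it assumes that the "27G" and "33G" shown by df are binary gigabytes, and the names are illustrative.

```python
FILES = 23_823_805                  # file count reported by find | wc
USED_BYTES = 27 * 2**30             # assumption: df "27G" means 27 GiB
POOL_BYTES = 33 * 2**30             # assumption: df "33G" means 33 GiB
LEGACY_BYTES_PER_INODE = 32 * 1024  # one statically allocated inode per 32K

# Disk space consumed per 1K file, metadata included.
kib_per_file = USED_BYTES / FILES / 1024

# Inodes a legacy filesystem would have preallocated on the same disk.
legacy_inodes = POOL_BYTES // LEGACY_BYTES_PER_INODE

# How many more files ZFS stored than that inode budget would allow.
ratio = FILES / legacy_inodes

print(f"{kib_per_file:.2f} KiB used per file")
print(f"{legacy_inodes} legacy inodes, ZFS stored {ratio:.0f}x more files")
```

Under these unit assumptions the legacy inode budget is a little above 1M and the ratio comes out near 22, in line with the rounded "1M files" and "23X" quoted above.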
The find command here is mostly gated on fstat performance and we see here that we did the 23.8M fstat in about 3087 seconds (51:26.697), or roughly 7700 fstat per second.
But here is the best part : And how long did it take to create all those files ?
real 1:09:20.558
user 9:20.236
sys 2:52:53.624
This is hard to believe but it took about 1 hour for 23.8 million files. This is on a single direct attached drive
3. c1t3d0 <FUJITSU-MAP3367N SUN36G-0401-33.92GB>
ZFS created on average 5721 files per second. Now obviously such a drive cannot do 5721 IOPS but with ZFS it
didn't need to. File create is actually more of a CPU benchmark because the application is interacting with the host cache. It's the task of the filesystem to then create the files on disk in the background. With ZFS, the combination of the Allocated on Write policy and the sophisticated I/O aggregation in the I/O scheduler (
dynamics) means that the I/O for multiple independent file creates can be coalesced. Using dtrace I counted the number of IOs required and filecreates per minute; typical samples show more than 200K files created per minute using about 3000 IOs per minute, or 3300 files per second using a mere 50 IOPS !!!
| Sample | Creates per minute | IOs per minute |
|---|---|---|
| #1 | 214643 | 2856 |
| #2 | 215409 | 3342 |
| #3 | 212797 | 2917 |
| #4 | 211545 | 2999 |
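The four per-minute samples can be turned into per-second rates and into a count of file creates coalesced into each disk I/O. This is a minimal sketch with illustrative names.

```python
# (file creates, disk IOs) counted per minute, from the four samples above.
SAMPLES = [
    (214643, 2856),
    (215409, 3342),
    (212797, 2917),
    (211545, 2999),
]

total_creates = sum(creates for creates, _ in SAMPLES)
total_ios = sum(ios for _, ios in SAMPLES)
minutes = len(SAMPLES)

files_per_sec = total_creates / minutes / 60  # average create rate
iops = total_ios / minutes / 60               # average disk I/O rate
creates_per_io = total_creates / total_ios    # creates coalesced per I/O

print(f"{files_per_sec:.0f} files/sec using {iops:.0f} IOPS, "
      f"about {creates_per_io:.0f} creates per I/O")
```

Averaged over the four samples this gives roughly 3560 files per second at about 50 IOPS, so each disk I/O carries on the order of 70 file creates.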
Finally with all these files, is scrubbing a problem ? It took 1h34m to actually scrub that many files at a pace of 4200 scrubbed files per second. No sweat.
pool: space
state: ONLINE
scrub: scrub completed after 1h34m with 0 errors on Wed Feb 11 12:17:20 2009
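The rates quoted in this entry can be rechecked from the elapsed times shown in the ptime and scrub output. The helper below is a sketch with illustrative names. Its results can differ slightly from the rounded figures in the text.

```python
def to_seconds(elapsed):
    """Convert a ptime-style 'h:mm:ss.fff' or 'mm:ss.fff' string to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

FILES = 23_823_805                       # from find | wc

find_secs = to_seconds("51:26.697")      # real time of the find run
create_secs = to_seconds("1:09:20.558")  # real time of the create run
scrub_secs = (1 * 60 + 34) * 60          # scrub completed after 1h34m

fstat_rate = FILES / find_secs
create_rate = FILES / create_secs
scrub_rate = FILES / scrub_secs

print(f"fstat:  {fstat_rate:.0f} per second")
print(f"create: {create_rate:.0f} per second")
print(f"scrub:  {scrub_rate:.0f} files per second")
```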
If you need to create, store and otherwise manipulate lots of small files efficiently, ZFS has got to be the
filesystem of choice for you.

Monday December 15, 2008
Decoding Bonnie++
I've been studying the popular Bonnie++ load generator to see if it
was a suitable benchmark to use with Network attached storage such as
Sun Storage 7000 line.
At this stage I've looked at the single client runs, and it doesn't appear
that Bonnie++ is an appropriate tool in this environment because
as we'll see here, for many of the tests, it either stresses the networking environment
or the strength of client side cpu.
The first interesting thing to note is that Bonnie will work
on a data set that is double the client's memory. This does address
some of the client side caching concern one could otherwise have. In a
NAS environment the amount of memory present on the server is not
considered by a default bonnie++ run. My client had 4GB, leading to a
working set of 8GB, while the server had 128GB of memory.
The Bonnie++ output looks like :
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03d ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
v2c01 8G 81160 92 109588 38 89987 67 69763 88 113613 36 2636 67
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 687 10 +++++ +++ 1517 9 647 10 +++++ +++ 1569 8
v2c01,8G,81160,92,109588,38,89987,67,69763,88,113613,36,2635.7,67,16,687,10,+++++,+++,1517,9,647,10,+++++,+++,1569,8
Method
I have used a combination of Solaris truss(1), reading
Bonnie++ code, looking at AmberRoad's
Analytics
data , as well as a custom Bonnie
d-script in
order to understand how each test triggered system calls on the client
and how those translated into a NAS server load. In the d-script, I
characterise the system calls by the average elapsed time as well as by
the time spent waiting for a response from the NAS server. The time
spent waiting is the operational latency that one should be interested
in when characterising a NAS, while the additional time relates to the
client CPU strength along with the client NFS implementation. Here is
what I found trying to explain how performant each test was.
Writing with putc()
So easy enough, that test creates a file using single character putc stdio library
call.
This test is clearly a client CPU test with most of the time spent
in user space running putc(). Every 8192 putc, stdio library will
issue a
write(2) system call. That syscall is still a client CPU test
since the data is absorbed on the client cache. What we test here is
the client single CPU performance and the client NFS implementation.
On a 2 CPU/ 4GB V20z running Solaris, we observed on the server using
analytics a network transfer rate of 87 MB/sec.
Results : 87 MB/sec of writes. Limited by single CPU speed.
Writing intelligently...done
Here it's more clever since it writes a file using sequential 8K write
system calls.
In this test the CPU is much relieved. So here the application is
running 8K write system call to client NFS. This is absorbed by memory
on the client. With an Opensolaris client, no over the wire request is
sent for such an 8K write. However after 4 such 8K writes we reach the
natural 32K chunk advertised by the server and that will cause the
client to asynchronously issue a write request to the server. The
asynchronous nature means that this will not cause the application to
wait for the response and the test will keep going on CPU. The process
will now race ahead generating more 8K writes and 32K asynchronous NFS
requests. If we manage to generate such request at a greater rate than
responses, we will consume all allocated asynchronous threads. On
Solaris this maps to nfs4_max_threads (8) threads. When all 8
asynchronous threads are waiting for a response, then the application
will finally block waiting for a previously issued request to get a
response and free an async thread.
Since generating 8K write system calls to fill the client cache is faster
than the network connection between the client and the server we will
eventually reach this point. The steady state of this test is that
Bonnie++ is waiting for data to transfer to the server. This happens
at the speed of a single NFS connection which for us saturated the
1Gbps link we had. We observed 113MB/sec which is network line rate
considering protocol overheads.
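The blocking behaviour described above can be approximated with a simple window model. This is a sketch under stated assumptions, not a description of the Solaris NFS client: it assumes each of the 8 asynchronous threads carries exactly one 32K request at a time and it ignores protocol headers. All names are illustrative.

```python
MAX_ASYNC_THREADS = 8          # nfs4_max_threads, as quoted in the text
REQUEST_BYTES = 32 * 1024      # 32K chunk advertised by the server
WINDOW_BYTES = MAX_ASYNC_THREADS * REQUEST_BYTES  # data that can be in flight

def max_throughput_mb_s(round_trip_ms):
    """Upper bound on write throughput for a given request round trip time."""
    return WINDOW_BYTES / (round_trip_ms / 1000) / 1e6

def max_round_trip_ms(target_mb_s):
    """Longest round trip that still allows the target throughput."""
    return WINDOW_BYTES / (target_mb_s * 1e6) * 1000

budget_ms = max_round_trip_ms(113)   # the observed 113 MB/sec
print(f"{WINDOW_BYTES} bytes in flight at most")
print(f"113 MB/sec needs a round trip under {budget_ms:.2f} ms")
```

In this model the 8 threads only become the limit when a 32K request takes more than about 2.3 ms end to end. Below that, the 1Gbps link is the bottleneck, which matches the network limited result reported for this test.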
To get more throughput on this test, one could use Jumbo Frame ethernet
instead of the 1500 Byte default frame size used as this would reduce the protocol overhead
slightly. One could also configure the server and client to use
10Gbps ethernet links.
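The effect of frame size on usable bandwidth can be estimated from standard Ethernet framing. The sketch below assumes IPv4 and TCP headers without options, a 9000 byte jumbo MTU, and ignores RPC and NFS headers, so it gives an upper bound rather than the exact NFS payload rate.

```python
# Per-frame overhead on the wire, in bytes:
# Ethernet header (14) + FCS (4) + preamble and SFD (8) + inter-frame gap (12).
ETHERNET_OVERHEAD = 14 + 4 + 8 + 12
IP_TCP_HEADERS = 20 + 20  # assumption: IPv4 and TCP, no options

def tcp_payload_mb_s(mtu, link_bits_per_sec=1e9):
    """Best case TCP payload rate in MB/sec for a given MTU and link speed."""
    payload = mtu - IP_TCP_HEADERS
    on_wire = mtu + ETHERNET_OVERHEAD
    return link_bits_per_sec / 8 * payload / on_wire / 1e6

standard = tcp_payload_mb_s(1500)
jumbo = tcp_payload_mb_s(9000)   # assumption: 9000 byte jumbo frames

print(f"1500 MTU: {standard:.1f} MB/sec")
print(f"9000 MTU: {jumbo:.1f} MB/sec")
```

On a 1Gbps link this bound moves from just under 119 MB/sec to almost 124 MB/sec, a gain of a few percent, consistent with the slight reduction in overhead mentioned above.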
One could also use LACP link aggregation of 1Gbps network ports to
increase the throughput. LACP increases throughput of multiple network
connections but not single socket protocol. By default a Solaris
client will establish a single connection (clnt_max_conns = 1) to a
server (1 connection per target IP). So using multiple aggregated
links _and_ tuning clnt_max_conns could yield extra throughput here.
Using single connection one could use a faster network between client
and server links to reach additional throughput.
More commonly, we expect to saturate the client 1Gbps connectivity
here, not much of a stress for a Sun Storage 7000 server.
Results : 113 MB/sec of writes. Network limited.
Rewriting...done
This gets a little interesting. It actually reads 8K, lseek back to
the start of the block, overwrites the 8K with new data and loops.
So here we read, lseek back, overwrite . For the NFS protocol lseek is
a noop since every over the wire write is tagged with the target
offset. In this test we are effectively stream reading the file from
the server and stream writing the file back to the server. The stream
write behavior will be much like the previous test. We never need to
block the process unless we consume all 8 asynchronous threads.
Similarly 8K sequential reads will be recognised by our client NFS as
streaming access which will deploy asynchronous readahead requests. We
will use 4 (nfs4_nra) request for 32K blocks ahead of the point being
currently read. What we observed here was that of 88 seconds of elapsed
time, 15 were spent in writes and 20 in reads. However only a small portion
of that was spent waiting for responses. It was mostly all spent on CPU
time to interact with the client NFS. This implies that readahead and
asynchronous writeback were behaving without becoming bottlenecks. The
Bonnie++ process took 50 sec of the 88 sec and a big chunk of this, 27
sec, was spent waiting off cpu. I struggle somewhat in this
interpretation but I do know from the Analytics data on the server
that the network is seeing 100 MB/sec of data flowing in each
direction. This must also be close to network saturation. The wait
time attributed to Bonnie++ in this test seems to be related to kernel
preemption. As Bonnie++ is coming out of its system calls we see such
events in dtrace.
unix`swtch+0x17f
unix`preempt+0xda
genunix`post_syscall+0x59e
genunix`syscall_exit+0x59
unix`0xfffffffffb800f06
17570
This must be to service the kernel threads of higher priority, likely
the asynchronous threads being spawned by the reads and writes.
This test is then a stress test of bidirectional flow of 32K data
transfers. Just like the previous test, to improve the numbers one
would need to improve the network connection throughput between the
client and server. It also potentially could then benefit from faster
and more client CPUs.
Results : 100MB/sec in each direction, network limited.
Reading with getc()...done
Reads the file one character at a time.
Back to a test of the client CPU much like the first one.
We see that the readaheads are working great since little time is spent
waiting (0.4 of 114 seconds). Given that this test does 1 million
reads in 114 seconds, the average latency could be evaluated to be 114
usec.
Results : 73MB/sec, single CPU limited on the client.
Reading intelligently...done
start 'em...done...done...done...
Reads with 8k system calls, sequential.
This test seems to be using 3 spawned bonnie processes to read files.
The reads are of size 8K and we needed 1M of them to read our 8GB
working set. We observed with analytics no I/O done on the server
since it had 128GB of cache available to it. The network on the other
hand is saturated at 118 MB/sec.
The dtrace script shows that the 1M read calls collectively spend 64
seconds waiting (most of that NFS response). So that implies a 64 usec
read response time for this sequential workload.
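The call count and the per-call wait can be rederived from the working set size. This is a minimal sketch with illustrative names. It counts the calls exactly (2^20 rather than a round million), so the result is slightly lower than the 64 usec obtained with 1M taken as 10^6.

```python
WORKING_SET_BYTES = 8 * 2**30   # 8GB working set (twice the client's 4GB)
READ_BYTES = 8 * 1024           # 8K per read system call

def usec_per_call(total_wait_seconds, calls):
    """Average wait per call in microseconds."""
    return total_wait_seconds / calls * 1e6

read_calls = WORKING_SET_BYTES // READ_BYTES
exact = usec_per_call(64, read_calls)      # with the exact call count
rounded = usec_per_call(64, 1_000_000)     # with 1M rounded as in the text

print(f"{read_calls} read calls")
print(f"{exact:.0f} usec per read (exact count), {rounded:.0f} usec (rounded)")
```

The same division applied to the 114 seconds of the getc() test gives the 114 usec figure quoted for that test when 1M is rounded the same way.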
Results : 118MB/sec, limited by Network environment.
start 'em...done...done...done...
Here it seems that Bonnie starts 3 helper processes
used to read the files in the "Reading Intelligently" test.
Create files in sequential order...done.
Here we see 16K files being created (with
creat(2)) then closed.
This test will create and close 16K files and took 22 seconds in our
environment. 19 seconds were used for the creates, 17.5 waiting for
responses. That means a 1ms response time for file creates. The test
seems single threaded. Using analytics we observe 13500 NFS ops per
second to handle those file creates. We do see some activity on the
Write bias SSD although very modest at 2.64MB /sec. Given that the
test is single threaded we can't estimate if this metric is
representative of the NAS server capability. More likely this is
representative of the single thread capability of the whole environment
made of : client CPU, client NFS implementation, client network driver
and configuration, network environment including switches, and the
NAS server.
Results : 744 filecreate per second per thread. Limited by operational latency.
Here is the analytics view captured for this test and the
following 5 tests.
Stat files in sequential order...done.
Test was too elusive possibly working against cached stat information.
Delete files in sequential order...done.
Here we
unlink(2) the 16K files.
Here we call the unlink system call for the 16K files. The run takes
10.294 seconds showing a 1591 unlink per second. Each call goes
off cpu, waiting for a server response for 600 usec.
Much like the create file test above, while we get information about
the single threaded unlink time present in the environment it's
obviously not representative of the server's capabilities.
Results : 1591 unlink per second per thread, Limited by operational latency.
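The single threaded create and unlink rates follow from the file count and the elapsed times given above. The sketch below redoes the division and also shows why the entry calls these tests latency limited: a single thread that waits for each response cannot exceed one operation per round trip. Names are illustrative.

```python
FILES = 16 * 1024            # 16K files created, then unlinked

creates_per_sec = FILES / 22             # create test took 22 seconds overall
create_wait_ms = 17.5 / FILES * 1000     # 17.5 seconds spent waiting in creat
unlinks_per_sec = FILES / 10.294         # unlink run took 10.294 seconds

# With one thread, the wait per call caps the achievable rate.
unlink_wait_usec = 600
unlink_rate_ceiling = 1e6 / unlink_wait_usec

print(f"{creates_per_sec:.0f} creates/sec, {create_wait_ms:.2f} ms wait per create")
print(f"{unlinks_per_sec:.0f} unlinks/sec, ceiling {unlink_rate_ceiling:.0f}/sec "
      f"at {unlink_wait_usec} usec per call")
```

The measured 1591 unlinks per second sits just under the ceiling implied by a 600 usec wait, so more threads or clients, not a faster server, would be needed to raise the rate.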
Create files in random order...done.
We recreate 16K files, closing each one but also running a stat()
system call on each.
Stat files in random order...done.
Elusive as above.
Delete files in random order...done.
We remove the 16K files.
I could not discern in the "random order" test any meaningful
differences to the sequential order ones.
Analytics screenshot of Bonnie++ run
Here is the full screen shot from analytics including Disk and CPU
data
The takeaway here is that single instance bonnie++ does not generally
stress one Sun Storage 7000 NAS server but will stress the client CPU
and 1Gbps network connectivity. There is no multi-client support in
Bonnie++ (that I could find).
One can certainly start multiple clients simultaneously, but since the
different tests would not be synchronized the output of bonnie++ would
be very questionable. Bonnie++ does have a multi-instance
synchronisation mode that is based on semaphores, which will only work if
all instances are running within the same OS environment.
So in a multi client test, only the total elapsed time would be of
interest here and that would be dominated by the streaming performance
as each client would read and write its working set 3 times over the
wire. Filecreate and unlink times would also contribute to the total
elapsed time of such a test.
For a single node multi-instance bonnie++ run, one would need to have
a large client, with at least 16 x 2Ghz CPUS, and about 10Gbps worth
of network capabilities in order to properly test one Sun Storage 7410
server. Otherwise, Bonnie++ is more likely to show client and network
limits, not server ones. As for unlink capabilities, the topic is a
pretty complex and important one that certainly cannot be captured
with simple commands. The interaction with snapshots and the I/O load
generated on the server during large unlink storms needs to be studied
carefully in order to understand the competitive merits of different
solutions.
In Summary, here is what governs the performance of the
individual Bonnie++ tests :
| Writing with putc()... | 87 MB/sec | Limited by client's single CPU speed |
| Writing intelligently... | 113 MB/sec | Limited by Network conditions |
| Rewriting... | 100MB/sec | Limited by Network conditions |
| Reading with getc()... | 73MB/sec | Limited by client's single CPU speed |
| Reading intelligently... | 118MB/sec | Limited by Network conditions |
| start 'em...done...done...done... | | |
| Create files in sequential order... | 744 create/s | Limited by operational latency |
| Stat files in sequential order... | not observable | |
| Delete files in sequential order... | 1591 unlink/s | Limited by operational latency |
| Create files in random order... | same as sequential | |
| Stat files in random order... | same as sequential | |
| Delete files in random order... | same as sequential | |
So Bonnie++ won't tell you much about our server's
capabilities. Unfortunately, the clustered mode of Bonnie++ won't coordinate
multiple client systems and so cannot be used to stress a server.
Bonnie++ could be used to stress a NAS server using a single large
multi-core client with very strong networking capabilities.
In the end though I don't expect to learn much about our servers over
and above what is already known. For that please check out our links here :
Low
Level Performance of Sun Storage
Analyzing
the Sun Storage 7000
Designing
Performance Metrics...
Sun
Storage 7xxx Performance Invariants
Here is the
bonnie.d
d-script used
and the output generated
bonnie.out.

Monday November 10, 2008
Blogfest : Performance and the Hybrid Storage Pool
Today Sun is announcing a new line of
Unified
Storage designed by a core of the most brilliant engineers . For
starters Mike Shapiro provides a great introduction into this product,
the new economics behind it and the killer App in
Sun
Storage 7000.
The killer App is of course Bryan Cantrill's brainchild, the already
famous
Analytics.
As a performance engineer, it's been a great thrill to have given this
tool an early test drive. Working a full 1 ocean (the Atlantic) + 1
continent (the USA) away from my system running Analytics I was
skeptical at first that I would be visualizing in real time all that
information : the NFS/CIFS ops, the disk ops, the CPU load and network
throughput, per client, per disk, per file ARE YOU CRAZY ! All that
information available IN REAL TIME; I just have to say a big thank you
to the team that made it possible. I can't wait to see our customers
put this to productive use.
Also check out Adam Leventhal's great description of HSP the
Hybrid Storage Pool and read my own perspective on this topic
ZFS as a
Network Attach Storage Controller.
Lest we forget the immense contribution of the boundless energy bubble
that is Brendan Gregg; the man that brought the DTraceToolkit to the
semi-geek; he must be jumping with excitement as we now see the power
of DTrace delivered to each and every system administrator.
He talks here about the
Status
Dashboard. And Brendan's contribution does not stop here, he is
also the parent of this wonderful component of the
HSP known
as the L2ARC which is how the readzillas become activated. See his own
previous work on the
L2ARC along with
Jing Zhang more recent
studies. Quality assurance people don't often get into the spotlight but check out
Tim Foster 's post on how he tortured the zpool code
adding and removing l2 arc devices from pools :
For myself, it's been very exciting to be able to see performance
improvement ideas get turned into product improvements from week to
week. Those interested should read how our group influenced the product that
is shipping today; see
Alan Chiu
and my own
Delivering Performance Improvements.
Such a product has a strong price/performance appeal, and given that we
fundamentally did not think that there were public benchmarks that
captured our value proposition, we had to come up with a
third-millennium,
participative way to talk about performance. Check out how we
designed our
Metrics
or maybe go straight to our numbers obtained by
Amitabha Banerjee, a concise entry backed up by an immense, intense and
careful data gathering effort in the last few weeks. bmseer is putting his own light
on the
low level data (data to be updated with numbers from a grander config).
I've also posted here a few performance guiding lights to be used when
thinking about this product; I call them
Performance Invariants. Further numbers can be found here about
RAID rebuild times.
On the application side, we have the great work of Sean (Hsianglung
Wu) and Arini Balakrishnan showing how a 7210 can deliver
> 5000 concurrent video streams at an aggregate of,
you're kidding:
WOW ZA 750MB/sec.
More details on how this was achieved in
cdnperf.
Jignesh Shah shows step by step instructions for setting up
PostgreSQL over iSCSI.
See our Vice President, Solaris Data, Availability, Scalability &
HPC Bob Porras trying to tame this beast into a
nutshell
and pointing out code bits reminding everyone of the value of the
OpenStorage proposition.
See also what bmseer has to say on
Web
2.0 Consolidation and get from Marcus Heckel a walkthrough of
setting up
Olio
Web 2.0 kit with nice Analytics performance screenshots. Also get
the ISV reaction (a bit later) from
Georg Edelmann. Ryan Pratt reports on
Windows Server 2003 WHQL certification of the Sun Storage 7000 line.
And this just in : Data about what to expect from a
Database perspective.
We can talk all we want about performance but as Josh Simons points out,
these babies are available to you for your own
try and buy.
Or check out how you could be running the appliance within the next hour really :
Sun Storage 7000 in VMware.
It seems I am in competition with another, less verbose
aggregator.
Finally, capture the whole stream of related postings to
Sun Storage 7000.
Designing Performance Metrics for Sun Storage 7000
One of the necessary checkpoints before launching a product is to be
able to assess its performance. With Sun Storage 7xxx we had a
challenge in that the only NFS benchmark of note was SPEC
SFS. Now this benchmark will have its supporters and some customers
might be attached to it, but it's important to understand what a
benchmark actually says.
The SFS benchmark is a lot about "cache busting" the server: this is
interesting, but at Sun we think that caches are actually helpful in
real scenarios. Data goes in cycles in which it becomes hot at times.
Retaining that data in cache layers allows much lower latency access,
and much better human interaction with storage engines. Being a cache
busting benchmark, SFS numbers end up as a measure of the number of
rotating disks attached to the NAS server. So a good SFS result requires
hundreds or thousands of expensive, energy hungry 15K RPM spindles. To get good
IOPS, layers of caching are more important to the end user experience
and cost efficiency of the solution.
So we needed another way to talk about performance. Benchmarks tend to
test the system in peculiar ways that do not necessarily reflect the
workloads each customer is actually facing. There are very many
workload generators for I/O, but one interesting one that is open source
and extensible is
Filebench,
available in
Source.
So we used filebench to gather basic performance information about our
system with the hope that customers will then use filebench to
generate profiles that map to their own workloads. That way, different
storage options can be tested on hopefully more meaningful tests than
benchmarks.
Another challenge is that a NAS server interacts with client systems
that themselves keep a cache of the data. Given that we wanted to
understand the back-end storage, we had to set up the tests to avoid
client side caching as much as possible. So for instance between the
phase of file creation and the phase of actually running the tests we
needed to clear the client caches and at times the server caches as
well. These possibilities are not readily accessible with the
simplest load generators and we had to do this in a rather ad-hoc
fashion. One validation of our runs was to ensure that the amount of
data transferred over the wire, observed with
Analytics, was compatible
with the aggregate throughput measured at the clients.
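That wire-versus-client validation can be sketched in a few lines of Python. This is only an illustration: the function name, the sample figures and the 10% tolerance are assumptions, not part of the tooling we actually used.

```python
def wire_matches_clients(wire_mb_s, client_mb_s, tolerance=0.10):
    """Return True when server-side wire throughput agrees with the
    sum of client-reported throughput, within a relative tolerance.

    A client total well above the wire figure suggests that reads were
    served from client caches rather than by the server.
    """
    total = sum(client_mb_s)
    if total == 0:
        return wire_mb_s == 0
    return abs(wire_mb_s - total) / total <= tolerance


# Hypothetical run: three clients, throughput in MB/s.
clients = [110.0, 105.0, 108.0]
good_run = wire_matches_clients(320.0, clients)    # wire agrees with clients
cached_run = wire_matches_clients(150.0, clients)  # clients hit their caches
```

A run like the second one, where the clients claim far more than went over the wire, would be discarded and rerun after clearing the client caches.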
Still another challenge was that we needed to test a storage system
designed to interact with a large number of clients. Again, load
generators are not readily set up to coordinate multiple clients and
gather global metrics. During the course of the effort filebench did
come up with a clustered mode of operation, but we actually were too
far engaged in our path to take advantage of it.
This coordination of clients is important because the performance
information we want to report is actually the one that is delivered to
the client. Now each client will report its own value for a given
test and our tool will sum up the numbers; but such a sum is only
valid inasmuch as the tests ran on the clients in the same timeframe.
The possibility of skew between tests is something that needs to be
monitored by the person running the investigation.
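As a minimal sketch of that skew check in Python: sum the per-client numbers, but also report how much of the longest run was common to all clients. The `ClientRun` record, the sample timings and the 90% overlap cutoff are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClientRun:
    client: str
    start: float       # seconds since a shared reference clock
    end: float
    mb_per_sec: float  # throughput reported by this client


def aggregate(runs, min_overlap=0.9):
    """Sum client throughput and flag runs that were too skewed.

    Returns (total_mb_per_sec, overlap_fraction, is_valid), where
    overlap_fraction is the time window shared by every client divided
    by the duration of the longest individual run.
    """
    common = max(0.0, min(r.end for r in runs) - max(r.start for r in runs))
    longest = max(r.end - r.start for r in runs)
    overlap = common / longest if longest > 0 else 0.0
    total = sum(r.mb_per_sec for r in runs)
    return total, overlap, overlap >= min_overlap


runs = [
    ClientRun("c1", 0.0, 30.0, 110.0),
    ClientRun("c2", 1.0, 31.0, 105.0),
    ClientRun("c3", 2.0, 32.0, 108.0),
]
total, overlap, ok = aggregate(runs)
```

With a couple of seconds of start skew on 30 second tests the sum is still meaningful; a client starting 20 seconds late would push the overlap well below the cutoff and the sum should not be reported.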
One way that we increased this coordination was that we
divided our tests in 2 categories; those that required precreated
files, and those that created files during the timed portion of the
runs. If not handled properly, file creation would actually cause
important result skew. The option we pursued here was to have a
pre-creation phase of files that was done once. From that point, our
full set of metrics could then be run and repeated many times with
much less human monitoring leading to better reproducibility of
results.
Another goal of this effort was that we wanted to be able to run our
standard set of metrics in a relatively short time. Say less than 1
hour. In the end we got that to about 30 minutes per run to gather 10
metrics. Having a short amount of time here is important because there
are lots of possible ways that such tests can be misrun. Having someone
watch over the runs is critical to the value of the output and to its
reproducibility. So after having run the pre-creation of files
offline, one could run many repeated instances of the tests, validating
the runs with
Analytics and, through general observation of the system,
gaining some insight into the meaning of the output.
At this point we were ready to define our metrics.
Obviously we needed streaming reads and writes. We needed random reads.
We needed small synchronous writes, important to database workloads and
to the NFS protocol. Finally, small file creation and stat operations
completed the mix. For random reading we also needed to distinguish
between operating from disks and from storage side caches, an
important aspect of our architecture.
Now another thing that was on my mind was that, this is not a
benchmark. That means we would not be trying to finetune the metrics
in order to find out just exactly what is the optimal number of
threads and request size that leads to best possible performance from
the server. This is not the way your workload is setup. Your number of
client threads running is not elastic at will. Your workload is what
it is (threading included); the question is how fast is it being
serviced.
So we defined precise
per client workloads with a preset number
of threads running the operations. We came up with this set just as an
illustration of what could be representative loads :
1- 1 thread streaming reads from 20G uncached set, 30 sec.
2- 1 thread streaming reads from same set, 30 sec.
3- 20 threads streaming reads from 20G uncached set, 30 sec.
4- 10 threads streaming reads from same set, 30 sec.
5- 20 threads 8K random read from 20G uncached set, 30 sec.
6- 128 threads 8K random read from same set, 30 sec.
7- 1 thread streaming write, 120 sec
8- 20 threads streaming write, 120 sec
9- 128 threads 8K synchronous writes to 20G set, 120 sec
10- 20 threads metadata (fstat) IOPS from pool of 400k files, 120 sec
11- 8 threads 8K file create IOPS, 120 sec.
For each of the 11 metrics, we could propose mapping these to relevant industries :
1- Backups, Database restoration (source), DataMining , HPC
2- Financial/Risk Analysis, Video editing, HPC
3- Media Streaming, HPC
4- Video Editing
5- DB consolidation, Mailserver, generic fileserving, Software development.
6- DB consolidation, Mailserver, generic fileserving, Software development.
7- User data Restore (destination)
8- Financial/Risk Analysis, backup server
9- Database/OLTP
10- Web 2.0, Mailserver/Mailstore, Software Development
11- Web 2.0, Mailserver/Mailstore, Software Development
We managed to get all these tests running except the fstat (test 10)
due to a technicality in filebench. Filebench insisted on creating
the files up front and this test required thousands of them; moreover
filebench used a method that ended up single threaded to do so, and in
the end the stat information was mostly cached on the client. While
we could have plowed through some of the issues, the conjunction of
all these made us put the fstat test aside for now.
Concerning thread counts, we figured that the single stream read test was
at times critical (for administrative purposes) and an interesting
measure of latency. Tests 1 and 2 were defined this way, with test
1 starting with cold client and server caches and test 2 continuing
the runs after having cleared the client cache (but not the server)
thus showing the boost from server side caching. Tests 3 and 4 are
similarly defined with more threads involved, for instance to mimic a
media server. Tests 5 and 6 did random read tests, again with test 5
starting with a cold server cache and test 6 continuing with some of
the data precached from test 5. Here, we did have to deal with client
caches, trying to ensure that we don't hit in the client cache too much
as the run progressed. Tests 7 and 8 showcased streaming writes for
single and 20 streams (per client). Reproducibility of tests 7 and 8
is more difficult, we believe, because of a client side
fsflush issue. We
found that we could get more stable results tuning fsflush on the
clients. Test 9 is the all important synchronous write case (for
instance a database). This test truly showcases the benefit of our
write side SSD and also shows why tuning the recordsize to match ZFS
records with DB accesses is important. Test 10 was inoperative as
mentioned above, and test 11, filecreate, completes the set.
Given that those were predefined test definitions, we're very happy to
see that our numbers actually came out really well with these tests,
particularly for the mirrored configs with write optimized SSDs.
See for instance results obtained by
Amitabha Banerjee.
I should add that these can now be used to give a ballpark estimate of the
capability of the servers. They were not designed to deliver the
topmost numbers from any one config. The variability of the runs is
at times greater than we'd wish and so your mileage will
vary. Using
Analytics to observe the running system can be quite
informative and a nice way to actually demo that capability. So use
the output with caution and use your own judgment when it comes to
performance issues.
Using ZFS as a Network Attach Controller and the Value of Solid State Devices
So Sun is coming out today with a line of Sun Storage 7000
systems that have ZFS as the integrated volume and filesystem manager
using both read and write optimized SSD. What is this
Hybrid Storage Pool
and why is this a
good performance architecture for storage ?
A write optimized SSD is a custom designed device for
the purpose of
accelerating operations of the ZFS intent log (ZIL). The ZIL is the
part of ZFS that manages the important synchronous operations,
guaranteeing that such writes are acknowledged quickly to applications
while guaranteeing persistence in case of outage. Data stored in the
ZIL is also kept in memory until ZFS issues the next transaction group
(every few seconds).
The ZIL is what stores data urgently (when application is waiting) but
the TXG is what stores data permanently. The ZIL on-disk blocks are
only ever re-read after a failure such as power outage. So the SSDs
that are used to accelerate the ZIL are write-optimized : they need to handle data at
low latency on writes; reads are unimportant.
The TXG is an operation that is asynchronous to applications : apps
are generally not waiting for transaction groups to commit. The
exception here is when data is generated at a rate that exceeds the
TXG rate for a sustained period of time. In this case, we become
throttled by the pool throughput. In a NAS storage this will rarely
happen since network connectivity, even at GB/s, is still much less than
what storage is capable of and so we do not generate the imbalance.
The important thing now is that in a NAS server, the controller is also
running a file level protocol (NFS or CIFS) and so is knowledgeable
about the nature (synchronous or not) of the requested writes. As such
it can use the accelerated path (the SSD) only for the necessary
component of the workloads. Less competition for these devices means
we can deliver both high throughput and low latency together in the
same consolidated server.
But here is where it gets nifty. At times, a NAS server might
receive a huge synchronous request. We've observed this for instance
due to fsflush running on clients, which will turn non-synchronous
writes into a massive synchronous one. I note here that a way to
reduce this effect is to tune up fsflush (to say 600). This is
commonly done to reduce the CPU usage of fsflush but will be welcome
in the case of clients interacting with NAS storage. We can also
disable page flushing entirely by setting dopageflush to 0. But that
is a client issue. From the perspective of the server, we still need
as a NAS to manage large commit requests.
When subject to such a workload, say a 1GB commit, ZFS, being fully aware
of the situation, can now decide to bypass the SSD device and issue
requests straight to disk based pool blocks. It would do so for 2
reasons. One is that the pool of disks in its entirety has more
throughput capability than the few write optimized SSDs and so we will
service this request faster. But more importantly, the value of the
SSD is in its latency reduction aspect. Leaving the SSDs available to
service many low latency synchronous writes is considered valuable
here. Another way to say this is that large writes are generally well
served by regular disk operations (they are throughput bound) whereas
small synchronous writes (latency bound) can and will get help from
the SSDs.
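The decision can be pictured with a small Python sketch. It is not the ZFS code: the function name and the 1MB cutoff are hypothetical, chosen only to show small synchronous commits going to the write optimized SSD and massive ones going to the main pool.

```python
# Hypothetical cutoff; the real ZFS heuristic and its value are not given here.
LARGE_COMMIT_BYTES = 1 << 20


def choose_log_target(commit_bytes, has_log_ssd=True,
                      large_commit_bytes=LARGE_COMMIT_BYTES):
    """Pick where the intent log blocks for one commit are allocated."""
    if has_log_ssd and commit_bytes <= large_commit_bytes:
        return "ssd"   # latency bound: the SSD helps
    return "pool"      # throughput bound: the disks as a whole do better


small = choose_log_target(8 * 1024)  # an 8K database write
huge = choose_log_target(1 << 30)    # a 1GB commit
```

The point of the design is visible in the return values: the SSD only ever sees the requests for which low latency matters, so it stays available for them.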
Caches at work
On the read path we also have custom designed read optimized
SSDs to fit in these OpenStorage platforms. At Sun, we just believe
that many workloads will naturally lend themselves to caching technologies. In a
consolidated storage solution, we can offer up to 128GB of primary
memory based caching and approximately 500GB of SSD based caching.
We also recognized that the latency delta between memory cached
response and disk response was just too steep. By inserting a layer
of SSD between memory and disk, we have this intermediate step
providing lower latency access than disk to a working set which is
now many times greater than memory.
It's important here to understand how and when these read
optimized SSDs will work. The first thing to recognize is that the SSDs
will have to be primed with data. They feed off data being evicted
from the primary caches. So their effect will not be immediately seen at
the start of a benchmark. Second, one of the values of read optimized
SSDs is truly in low latency responses to small requests. Small requests
here means things of the order of 8K in size. Such requests will occur
either when dealing with small files (~8K) or when dealing with larger
files accessed by a fixed record based application, typically a database. For
those applications it is customary to set the recordsize and this will
allow those new SSDs to become more effective.
Our read optimized SSD can service up to 3000 read IOPS (see Brendan's work
on the
L2 ARC)
and this is close to or better than what a JBOD of 24 x 7,200 RPM disks can do. But the
key point is that the low latency response means it can do so using
far fewer threads than would be necessary to reach the same level on
a JBOD. Brendan demonstrated here that the response time of these
devices can be 20 times faster than disks and 8 to 10 times faster
from the client's perspective. So once data is installed in the
SSD, users will see their requests serviced much faster,
which means we are less likely to be subject to queuing delays.
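The thread-count argument follows from Little's law: requests in flight = IOPS x response time. Here is a back-of-the-envelope sketch in Python, where the 8 ms disk response time is an assumed figure for illustration and the 20x ratio is the device-level speedup quoted above.

```python
def threads_needed(target_iops, response_time_s):
    """Little's law: concurrency required to sustain target_iops
    when each request takes response_time_s seconds."""
    return target_iops * response_time_s


DISK_RESPONSE_S = 0.008                # assumed, not a measured value
SSD_RESPONSE_S = DISK_RESPONSE_S / 20  # 20 times faster at the device

on_disks = threads_needed(3000, DISK_RESPONSE_S)  # about 24 requests in flight
on_ssd = threads_needed(3000, SSD_RESPONSE_S)     # about 1.2 requests in flight
```

Under these assumptions a lightly threaded application, with one or two outstanding reads, can reach on the SSD a rate that would take a couple of dozen concurrent requests against the disks.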
The use of read optimized SSD is configurable in the Appliance. Users
should learn to identify the part of their datasets that end up gated
by lightly threaded read response time. For those workloads enabling
the secondary cache is one way to deliver the value of the read optimized
SSD. For those filesystems, if the workload contains small files (such
as 8K) there is no need to tune anything; however, for large files
accessed in small chunks, setting the filesystem recordsize to 8K is
likely to produce the best response time.
Another benefit of these SSDs will be in the $/IOPS case. Some
workloads are just IOPS hungry while not necessarily huge block
consumers. The SSD technology offers great advantages in this space
where a single SSD can deliver the IOPS of a full JBOD at a fraction
of the cost. So with workloads that are more modestly sized but IOPS
hungry, a test drive of the SSD will be very interesting.
It's also important to recognize that these systems are used in
consolidation scenarios. It can be that some part of the applications
will be sped up by read or write optimized SSD, or by the large memory
based caches while other consolidated workloads can exercise other
components.
There is another interesting implication to using SSD in the
storage in regards to clustering. The read optimized ssd acting as
caching layers actually never contain critical data. This means those
SSD can go into disk slots of head nodes since there is no data to be
failed over. On the other hand, write optimized SSD will store data
associated with the critical synchronous writes. But since those are located in
dual-ported backend enclosures, not the head nodes, it implies that,
during clustered operations, storage head nodes do not have to
exchange any user level data.
So by using ZFS and read and write optimized SSDs, we can deliver low
latency writes for applications that rely on them, and good throughput
for synchronous and non synchronous cases using cost effective SATA
drives. Similarly on the read side, the high amount of primary and
secondary caches enables delivering high IOPS at low latency (even if
the workload is not highly threaded) and it can do so using the more
cost and energy efficient SATA drives.
Our architecture allows us to take advantage of the latency
accelerators while never being gated by them.
Delivering Performance Improvements to Sun Storage 7000
I describe here the effort I spearheaded studying the performance
characteristics of the OpenStorage platform and the ways in which our
team of engineers delivered real out of the box improvements to the
product that is shipping today.
One of the joys of working on the OpenStorage NAS appliance was
that solutions we found to performance issues could be immediately
transposed into changes to the appliance without further process.
The first big wins
We initially stumbled on 2 major issues, one for NFS synchronous writes
and one for the CIFS protocol in general. The NFS problem was a subtle
one involving the distinction of O_SYNC vs O_DSYNC writes in the ZFS
intent log and was impacting our threaded synchronous writes test by
up to a 20X factor. Fortunately I had a history of studying that part
of the code and could quickly identify the problem and suggest a
fix. This was tracked as
6683293: concurrent O_DSYNC writes to a
fileset can be much improved over NFS.
The following week, turning to CIFS studies, we were seeing great
scalability limitations in the code. Here again I was fortunate to be
the first one to hit this. The problem was that to manage CIFS requests
the kernel code was using simple kernel allocations that could
accommodate the largest possible request. Such large allocations and
deallocations cause what is known as a storm of TLB shootdown
cross-calls, limiting scalability.
Incredibly though, after implementing the trivial fix, I found that the
rest of the CIFS server was beautifully scalable code with no other
barriers. So in one quick and simple fix (using kmem caches) I could
demonstrate a great scalability improvement to CIFS. This was
tracked as
6686647
: smbsrv scalability impacted by memory
Since those 2 protocol problems were identified early on, I must say
that no serious protocol performance problems have come up. While we
can always find incremental improvements to any given test, our
current implementation has held up to our testing so far.
In the next phase of the project, we did a lot of work on improving
network efficiency at high data rates. In order to deliver the
throughput that the server is capable of, we must use 10Gbps network
interfaces, and the ones available on the NAS platforms are based on the
Neptune networking interface running the nxge driver.
Network Setup
I collaborated on this with
Alan Chiu, who already knew a
lot about this network card and driver tunables, and so we
could quickly hash out the issues. We had to decide on a proper out of the
box setup involving
- how many MSI-X interrupts to use
- whether to use networking soft rings or not
- what bcopy threshold to use in the driver as opposed to
binding DMA
- whether or not to use the new Large Segment Offload (LSO)
technique for transmits.
We knew basically where we wanted to go here. We wanted many interrupts
on the receive side so as to not overload any CPU, and to avoid the use of
layered soft rings, which reduce efficiency. We wanted a low bcopy threshold so
that DMA binding would be used more frequently, as the default value was too
high for this x64 based platform.
And LSO was providing a nice boost to efficiency. That got us to a
proper efficiency level.
However, we noticed that under stress and a high number of connections
our efficiency would drop by 2 or 3X. After much head scratching we
traced this to the use of too many TX DMA channels. It turns out that
with this driver and architecture, using few channels leads to more
stickiness in the scheduling and much, much greater efficiency. We
settled on 2 TX rings as a good compromise. That got us to a level
of 8-10 CPU cycles per byte transferred in network code (more on
Performance
Invariants).
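To put the 8-10 cycles per byte figure in perspective, here is a small arithmetic sketch in Python. It assumes a fully loaded 10 Gb/s link and ignores protocol overhead; the function name is illustrative.

```python
def network_cpu_ghz(link_gbit_s, cycles_per_byte):
    """CPU cycles per second (in GHz) spent in network code to move
    data at the given link rate."""
    bytes_per_sec = link_gbit_s * 1e9 / 8
    return bytes_per_sec * cycles_per_byte / 1e9


low = network_cpu_ghz(10, 8)    # about 10 GHz worth of cycles
high = network_cpu_ghz(10, 10)  # about 12.5 GHz worth of cycles
```

That is several cores' worth of work at line rate, which gives an idea of what a 2 or 3X drop in efficiency would cost.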
Interrupt Blanking
Studying an open source alternative controller, we also found that on 1
of 14 metrics we were slower. That was rooted in the interrupt
blanking parameter that NICs use to gain efficiency. What we found here
was that by reducing our blanking to a small value we could leapfrog
the competition (from 2X worse to 2X better) on this test while
preserving our general network efficiency. We were then on par or
better for every one of the 14 tests.
Media Streaming
When we ran thousands of 1 Mb/s media streams from our systems, we
quickly found that the file level software prefetching was hurting us.
So we initially disabled the code in our lab to run our media studies,
but at the end of the project we had to find an out of the box setup
that could preserve our media result without impairing maximum read
streaming. At some point we realized that what we were hitting was
6469558:
ZFS prefetch needs to be more aware of memory pressure. It turns out
that the internals of the zfetch code are set up to manage 8 concurrent
streams per file and can read ahead up to 256 blocks or records : in
this case 128K. So when we realized that with 1000s of streams we
could read ahead ourselves out of memory, we knew what we needed to
do. We decided on setting up 2 streams per file reading ahead up to 16
blocks, and that seems quite sufficient to retain our media serving
throughput while keeping some prefetching capabilities. I note here also
that NFS client code will itself recognize streaming and issue
its own readahead. The backend code is then reading ahead of client
readahead requests. So we kind of were getting
ahead of
ourselves here. Read more about it @
cdnperf
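The memory pressure is easy to estimate. The Python sketch below assumes the worst case, where every stream reads ahead to its full depth with 128K records, and it assumes the 256 and 16 block limits apply per stream; both are assumptions made for illustration, as is the figure of 1000 files being streamed.

```python
KIB = 1024
GIB = 1024 ** 3


def worst_case_readahead_bytes(files, streams_per_file, blocks_per_stream,
                               block_bytes=128 * KIB):
    """Upper bound on memory held by prefetched data."""
    return files * streams_per_file * blocks_per_stream * block_bytes


default = worst_case_readahead_bytes(1000, 8, 256)  # 250 GiB
tuned = worst_case_readahead_bytes(1000, 2, 16)     # about 3.9 GiB
```

Under these assumptions the tuned settings cut the worst-case readahead footprint by a factor of 64, from far more than the memory of the server to something that fits comfortably.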
To slog or not to slog
One of the innovative aspects of this OpenStorage server is the use of
read and write optimized solid state devices; see for instance
The Value of Solid State Devices.
Those SSDs are beautiful devices designed to help
latency but not throughput. A massive commit is actually better handled by
regular storage, not SSD. It turns out that it was actually dead easy
to instruct the ZIL to recognize massive commits and divert its block
allocation strategy away from the SSD toward the common pool of
disks. We see two benefits here: the massive commits will be sped up
(preventing the SSD from becoming the bottleneck) but, more importantly,
the SSD will now be available as a low latency device to handle
workloads that rely on low latency synchronous operations. One should
note here that the ZIL is a "per filesystem" construct and so while a
filesystem might be working on a large commit, another filesystem from
the same pool might still be running a series of small transactions and
benefit from the write optimized SSD.
In a similar way, when we first tested the read optimized SSDs, we quickly
saw that streamed data would install in this caching layer and that it
could slow down the processing later. Again the beauty of working on
an appliance and closely with developers meant that by the following
build, those problems had been solved.
Transaction Group Time
ZFS operates by issuing regular transaction groups in which
modifications since the last transaction group are recorded on disk and
the uberblock is updated. This used to be done at a 5 second interval,
but with the recent improvement to the
write
throttling code this became a 30 second interval (on light
workloads) which aims to not generate more than 5 seconds of I/O per
transaction group. Using 5 seconds of I/O per txg was meant to
maximize the ratio of data to metadata in each txg, delivering more
application throughput. Now these Storage 7000 servers will typically
have lots of I/O capability on the storage side and the data/metadata
ratio is not as much a concern as for a small JBOD storage. What we found
was that we could reduce the target of 5 seconds of I/O down to 1
while still preserving good throughput. Having this smaller value
smoothed out operation.
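The effect of that target on transaction group size is simple arithmetic. In this Python sketch the 1 GB/s pool throughput is a made-up figure, not a measurement, and the function name is illustrative.

```python
def txg_bytes(pool_bytes_per_sec, target_seconds):
    """Approximate amount of data a transaction group may carry when
    the throttle aims for target_seconds of I/O per txg."""
    return pool_bytes_per_sec * target_seconds


POOL_BYTES_PER_SEC = 10**9  # hypothetical pool throughput

old_txg = txg_bytes(POOL_BYTES_PER_SEC, 5)  # 5 GB per transaction group
new_txg = txg_bytes(POOL_BYTES_PER_SEC, 1)  # 1 GB per transaction group
```

Smaller groups mean shorter bursts of disk activity, which is consistent with the smoother operation we observed.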
IT JUST WORKS
Well, that is certainly the goal. In my group, we spent the
last year performance testing these OpenStorage systems, finding and
fixing bugs, suggesting code improvements, and looking for better
compromises for common tunables. At this point, we're happy with the
state of the systems, particularly for mirrored configurations with
write optimized SSD accelerators. Our code is based on a recent
OpenSolaris (from August) that already has a lot of improvements over
Solaris 10, particularly for ZFS, to which we've added specific
improvements relevant to NAS storage. We think these systems will at
times deliver great performance (see Amitabha's
results
) but almost always shine in the price performance categories.

Tuesday, November 04, 2008
People ask: where are we with ZFS performance ?
The standard answer to any computer performance question is
almost always : "it depends" which is semantically
equivalent to "I don't know". The better answer is to state
the dependencies.
I would certainly like to see every performance issue studied with a
scientific approach. OpenSolaris and DTrace are just incredible
enablers when trying to reach root cause, and finding those causes is
really the best way to work toward delivering improved performance.
More generally though, people use common wisdom or possibly faulty
assumptions to match their symptoms with those of other similar reported
problems. And, as human nature has it, we'll easily blame the
component we're least familiar with for problems. So we often end up
with a lot of reports of ZFS performance problems that, once drilled down,
turn out to be either totally unrelated to ZFS (say HW problems), or
misconfiguration, departure from Best Practices or, at times,
unrealistic expectations.
That does not mean, there are no issues. But it's important
that users can more easily identify known issues, schedule
for fixes, workarounds etc. So anyone deploying ZFS should
really be familiar with those 2 sites :
ZFS Best Practices and
Evil Tuning Guide
That said, what are real commonly encountered performance problems
I've seen and where do we stand ?
Writes overrunning memory
That is a real problem that was fixed last March and is integrated in
the Solaris U6 release. Running out of memory causes many different
types of complaints and erratic system behavior. This can happen
anytime a lot of data is created and streamed at rate greater than
that which can be set into the pool. Solaris U6 will be an important
shift for customers running into this issue. ZFS will still try to
use memory to cache your data (a good thing) but the competition this
creates for memory resources will be much reduced. The way ZFS is
designed to deal with this contention (ARC shrinking) will need a new
evaluation from the community. The lack of throttling was a great
impairment to the ability of the ARC to give back memory under
pressure. In the meantime, lots of people are capping their ARC size
with success as per the Evil Tuning Guide.
For more on this topic check out :
The new ZFS write throttle
Cache flushes on SAN storage
This is a common issue we hit in the enterprise. Although it will
cause ZFS to be totally underwhelming in terms of performance, it's
interestingly not a sign of any defect in ZFS. Sadly this touches
customers that are the most performance minded. The issue is somewhat
related to ZFS and somewhat to the storage. As is well documented
elsewhere, ZFS will, at critical times, issue "cache flush" requests to
the storage elements on which it is layered. This is to take into
account the fact that storage can be layered on top of _volatile_
caches that do need to be set on stable storage for ZFS to reach its
consistency points. Enterprise storage arrays do not use _volatile_
caches to store data and so should ignore the request from ZFS to
"flush caches". The problem is that some arrays don't. This
misunderstanding between ZFS and storage arrays leads to underwhelming
performance. Fortunately we have an easy workaround that can be used
to quickly identify if this is indeed the problem : setting
zfs_nocacheflush (see the Evil Tuning Guide). The best workaround here is
to configure the storage with the setting to indeed ignore "cache
flush". And we also have the option of tuning sd.conf on a per array
basis. Refer again to the Evil Tuning Guide for more detailed
information.
NFS slow over ZFS (Not True)
This is just not generally true and often a side effect of the
previous cache flush problem. People have used storage arrays to
accelerate NFS for a long time but failed to see the expected gains with
ZFS. Many sightings of NFS problems are traced to this.
Other sightings involve common disks with volatile
caches. Here the performance deltas observed are rooted in
the stronger semantics that ZFS offers to this operational
model. See
NFS and ZFS for a more detailed description of the
issue.
While I don't consider ZFS as generally slow serving NFS, we did
identify in recent months a condition that affects high thread counts
of synchronous writes (such as a DB). This issue is fixed in
Solaris 10 Update 6 (CR 6683293).
I would encourage you to be familiar with where we stand regarding ZFS
and NFS because I know of no big gaping ZFS over NFS problems (if
there were one, I think I would know). People just need to be aware
that NFS is a protocol that needs some type of acceleration (such as NVRAM)
in order to deliver a user experience close to what a direct attach
filesystem provides.
ZIL is a problem (Not True)
There is a wide perception that the ZIL is the source of performance
problems. This is just a naive interpretation of the facts. The ZIL
serves as a very fundamental component of the filesystem and does that
admirably well. Disabling the synchronous semantics of a filesystem
will necessarily lead to higher performance in a way that is totally
misleading to the outside observer. So while we are looking at further
ZIL improvements for large scale problems, the ZIL is just not today
the source of common problems. So please don't disable it unless you
know what you're getting into.
Random read from Raid-Z
Raid-Z is a great technology that allows blocks to be stored on top of
common JBOD storage without being subject to raid-5 write hole
corruption (see: http://blogs.sun.com/bonwick/entry/raid_z). However
the performance characteristics of raid-z depart from those of
raid-5 enough to surprise first time users. Raid-Z as currently
implemented spreads blocks to the full width of the raid group and
creates extra IOPS during random reading. At lower loads, the latency
of operations is not impacted but sustained random read loads can
suffer. However, workloads that end up with frequent cache hits will
not be subject to the same penalty as workloads that access vast
amounts of data more uniformly. This is where one truly needs to say,
"it depends".
Interestingly, the same problem does not affect Raid-Z streaming
performance and won't affect workloads that commonly benefit from
caching. That said, both random and streaming performance are
perfectible and we are looking at a number of different ways to improve
on this situation. To better understand Raid-Z, see one of my very
first ZFS entries on this topic:
Raid-Z
CPU consumption, scalability and benchmarking
This is an area where we will need to do more studies. With today's very
capable multicore systems, there are many workloads that won't suffer
from the CPU consumption of ZFS. Most systems do not run 100% CPU
bound (being more generally constrained by disk, networks or
application scalability) and the user visible latency of operations
is not strongly impacted by extra cycles spent in, say, the ZFS
checksumming.
However, this view breaks down when it comes to system benchmarking.
Many benchmarks I encounter (the most crafted ones to boot) end up as
host CPU efficiency benchmarks: how many operations can I do on this
system given large amounts of disk and network resources while
preserving some level X of response time? The answer to this question
is purely the inverse of the cycles spent per operation.
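To make that inverse relationship concrete, here is a minimal Python sketch. The function name and the sample cycle counts are illustrative assumptions, not measurements from any system discussed here, and the model only covers the case where CPU cycles are the sole bottleneck.

```python
def max_ops_per_second(cores: int, clock_hz: float, cycles_per_op: float) -> float:
    """Upper bound on operations/sec when only CPU cycles limit throughput."""
    if cycles_per_op <= 0:
        raise ValueError("cycles_per_op must be positive")
    return cores * clock_hz / cycles_per_op


if __name__ == "__main__":
    # Hypothetical 8-core, 1.2 GHz host; the per-operation costs are made up.
    for cycles in (100_000, 200_000):
        print(cycles, max_ops_per_second(8, 1.2e9, cycles))
```

Doubling the cycles spent per operation halves the ceiling, which is why such benchmarks are so sensitive to filesystem CPU cost.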
This concern is more relevant when the CPU cycles spent in managing
direct attach storage and filesystem are in direct competition with
cycles spent in the application. This is also why database
benchmarking is often associated with using raw devices, a fact much
less encountered in common deployments.
Root causing scalability limits and efficiency problems is just part of the never ending performance optimisation
of filesystems.
Direct I/O
Directio has been a great enabler of database performance in other
filesystems. The problem for me is that Direct I/O is a group of
improvements each with their own contribution to the end result. Some
want the concurrent writes, some want to avoid a copy, some want to
avoid double caching, some don't know but see performance gains when
it is turned on (some also see a degradation). I note that concurrent writes
have never been a problem in ZFS and that the extra copy used when
managing a cache is generally cheap considering common DB rates of
access. Achieving greater CPU efficiency is certainly a valid goal
and we need to look into what is impacting this in common DB
workloads. In the meantime, ZFS in OpenSolaris got a new feature to
manage the cacheability of data in the ZFS ARC. The per filesystem
"primarycache" property will allow users to decide if blocks should
actually linger in the ARC cache or just be transient. This will
allow DBs deployed on ZFS to avoid any form of double caching that
might have occurred in the past.
ZFS performance is and will be a moving target for some time in the
future. Solaris 10 Update 6, with a new write throttle, will be a
significant change and then OpenSolaris offers additional
advantages. But generally just be skeptical of any performance issue that is
not root caused: the problem might not be where you expect it.

mercredi mai 14, 2008
The new ZFS write throttle
A very significant improvement is coming soon to ZFS. A change that
will increase the general quality of service delivered by ZFS.
Interestingly it's a change that might also slow down your
microbenchmark but nevertheless it's a change you should be eager for.
Write throttling
For a filesystem, write throttling designates the act of blocking
an application for some amount of time, as short as possible, waiting for
the proper conditions to allow the write system calls to succeed.
Write throttling is normally required because applications can write
to memory (dirty memory pages) at a rate significantly faster than the
kernel can flush the data to disk. Many workloads dirty memory pages
by writing to the filesystem page cache at near memory copy speed,
possibly using multiple threads issuing high rates of filesystem
writes. Concurrently, the filesystem is doing its best to drain all
that data to the disk subsystem.
Given the constraints, the time to empty the filesystem cache to disk
can be longer than the time required for applications to dirty the
cache. Even if one considers storage with fast NVRAM, under sustained
load, that NVRAM will fill up to a point where it needs to wait for a
slow disk I/O to make room for more data to get in.
When committing data to a filesystem in bursts, it can be quite
desirable to push the data at memory speed and then drain the cache to
disk during the lapses between bursts. But when data is generated at
a sustained high rate, lack of throttling leads to total memory
depletion. We thus need at some point to try and match the application
data rate with that of the I/O subsystem. This is the primary goal of
write throttling.
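The depletion argument can be put into numbers with a small Python sketch. The function name and the sample rates are hypothetical; the only point is that whenever applications dirty memory faster than storage drains it, the time to exhaustion is finite.

```python
from typing import Optional


def seconds_until_memory_full(free_bytes: float,
                              app_write_rate: float,
                              disk_drain_rate: float) -> Optional[float]:
    """Time until free memory is exhausted by dirty data, without throttling.

    Rates are in bytes/sec. Returns None when storage keeps up with writers.
    """
    if app_write_rate <= disk_drain_rate:
        return None
    return free_bytes / (app_write_rate - disk_drain_rate)


if __name__ == "__main__":
    # Made-up example: 8000 MB free, writers at 1000 MB/s, disks at 200 MB/s.
    print(seconds_until_memory_full(8000, 1000, 200))
```

A bursty writer that stays below the drain rate on average never reaches that point, which matches the case where no throttling is desirable.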
A secondary goal of write throttling is to prevent massive data loss.
When applications do not manage I/O synchronization (i.e. don't use
O_DSYNC and fsync), data ends up cached in the filesystem and the
contract is that there is no guarantee that the data will still be
there if a system crash were to occur. So even if the filesystem
cannot be blamed for such data loss, it is still a nice feature to
help prevent such massive losses.
Case in point : UFS Write throttling
For instance UFS would use the fsflush daemon to try to keep data
exposed for no more than 30 seconds (default value of autoup). Also,
UFS would keep track of the amount of I/O outstanding for each
file. Once too much I/O was pending, UFS would throttle writers for
that file. This was controlled through ufs_HW, ufs_LW and their
values were commonly tuned (a bad sign). Eventually the old default
values were updated and seem to work nicely today. UFS write
throttling thus operates on a per file basis. While there are some
merits to this approach, it can be defeated as it does not manage the
imbalance between memory and disks at a system level.
ZFS Previous write throttling
ZFS is designed around the concept of transaction groups (txg).
Normally, every 5 seconds an _open_ txg goes to the quiesced
state. From that state the quiesced txg will go to the syncing state
which sends dirty data to the I/O subsystem. For each pool, there is
at most 1 txg in each of the 3 states: open, quiescing, syncing. Write
throttling used to occur when the 5 second txg clock would fire while
the syncing txg had not yet completed. The open group would wait on
the quiesced one which waits on the syncing one. Application writers
(write system call) would block, possibly a few seconds, waiting for a
txg to open. In other words, if a txg took more than 5 seconds to
sync to disk, we would globally block writers thus matching their
speed with that of the I/O. But if a workload had a bursty write
behavior that could be synced during the allotted 5 seconds,
the application would never be throttled.
The Issue
But ZFS did not sufficiently control the amount of data that could
get in an open txg. As long as the ARC cache was no more than half
dirty, ZFS would accept data. For a large memory machine or one with
weak storage, this was likely to cause long txg sync times. The
downsides were many :
- if we did end up throttled, long sync times meant the system
behavior would be sluggish for seconds at a time.
- long txg sync times also meant that our granularity at which
we could generate snapshots would be impacted.
- we ended up with lots of pending data in the cache all of
which could be lost in the event of a crash.
- the ZFS I/O scheduler which prioritizes operations was also
negatively impacted.
- By not throttling we had the possibility that
sequential writes on large files could displace from the ARC
a very large number of smaller objects. Refilling
that data meant very large number of disk I/Os.
Not throttling can paradoxically end up as very
costly for performance.
- the previous code also could, at times, not be issuing I/Os
to disk for seconds even though the workload was
critically dependent on storage speed.
- And foremost, lack of throttling depleted memory and prevented
ZFS from reacting to memory pressure.
That ZFS is considered a memory hog is most likely the result of
the previous throttling code. Once a proper solution is in place, it will
be interesting to see if we behave better on that front.
The Solutions
The new code keeps track of the amount of data accepted in a TXG and
the time it takes to sync. It dynamically adjusts that amount so that
each TXG sync takes about 5 seconds (txg_time variable). It also
clamps the limit to no more than 1/8th of physical memory.
And to avoid the system wide and seconds long throttle effect, the new
code will detect when we are dangerously close to that situation
(7/8th of the limit) and will insert 1 tick delays for applications
issuing writes. This prevents a write intensive thread from hogging
the available space starving out other threads. This delay should
also generally prevent the system wide throttle.
So the new steady state behavior of write intensive workloads is that,
starting with an empty TXG, all threads will be allowed to dirty
memory at full speed until a first threshold of bytes in the TXG is
reached. At that time, every write system call will be delayed by 1
tick thus significantly slowing down the pace of writes. If the
previous TXG completes its I/Os, then the current TXG will be
allowed to resume at full speed. But in the unlikely event that a
workload, despite the per write 1-tick delay, manages to fill up the
TXG up to the full threshold we will be forced to throttle all writes
in order to allow the storage to catch up.
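The thresholds described above can be sketched in a few lines of Python. This is not the ZFS code: the function names are hypothetical, sizing the limit from the observed sync rate is an assumption of this sketch, and it ignores the rule that writers resume at full speed once the previous TXG completes. Only the 5 second target, the 1/8th memory clamp and the 7/8th delay point come from the text.

```python
TARGET_SYNC_SECONDS = 5.0     # txg_time in the text
MEMORY_FRACTION_CAP = 1 / 8   # limit never exceeds 1/8th of physical memory
DELAY_FRACTION = 7 / 8        # 1-tick delays start at 7/8th of the limit


def next_txg_limit(bytes_synced: float, sync_seconds: float,
                   physical_memory: float) -> float:
    """Pick a dirty-data limit that should sync in about TARGET_SYNC_SECONDS.

    Scaling by the last observed sync rate is an assumption of this sketch.
    """
    observed_rate = bytes_synced / sync_seconds
    return min(observed_rate * TARGET_SYNC_SECONDS,
               physical_memory * MEMORY_FRACTION_CAP)


def write_action(dirty_bytes: float, limit: float) -> str:
    """Decide how a write call is treated given the data already in the TXG."""
    if dirty_bytes >= limit:
        return "throttle"          # system wide: let the storage catch up
    if dirty_bytes >= limit * DELAY_FRACTION:
        return "delay_one_tick"    # slow writers down before the hard limit
    return "full_speed"
```

For example, a pool that took 10 seconds to sync 1 GB would get a limit of about 500 MB for the next TXG, unless 1/8th of physical memory is smaller than that.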
It should make the system much better behaved and generally more
performant under sustained write stress.
If you are the owner of an unlucky workload that ends up slowed by more
throttling, do consider the other benefits that you get from the new
code. If that does not compensate for the loss, get in touch and tell
us what your needs are on that front.

lundi janvier 08, 2007
NFS and ZFS, a fine combination
No doubt there is still a lot to learn about ZFS as an NFS server and
this will not delve deeply into that topic. What I'd like to dispel
here is the notion that ZFS can cause some NFS workloads to exhibit
pathological performance characteristics.
The Sightings
Since there have been a few perceived 'sightings' of such slowdowns, a
little clarification is in order. Large slowdowns would
typically be reported when looking at a single threaded load, probably
doing small file creation such as 'tar xf many_small_files.tar'.
For instance, I've run a small such test over a 72G SAS drive.
tmpfs : 0.077 sec
ufs : 0.25 sec
zfs : 0.12 sec
nfs/zfs : 12 sec
There are a few things to observe here. Local filesystem services
have a huge advantage for this type of load: in the absence of
specific request by the application (e.g. tar), local filesystems can
lose your data and no one will complain. This is data loss, not data
corruption, and this generally accepted data loss will occur in the
event of a system crash. The argument being that if you need a higher
level of integrity, you need to program it in applications using
either O_DSYNC, fsync, etc. Many applications are not that critical and
avoid such a burden.
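For readers who have not had to program that integrity themselves, here is a minimal Python sketch of the fsync route. The helper name durable_write is hypothetical; os.open, os.write and os.fsync are standard library calls. Opening the file with os.O_DSYNC, where the platform exposes it, is the other route mentioned above.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and return only once it has been pushed to storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may accept fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Without this call the data may sit in the filesystem cache
        # and be lost if the system crashes.
        os.fsync(fd)
    finally:
        os.close(fd)
```

An application like tar typically skips this step, which is exactly why local extraction looks so fast.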
NFS and COMMIT
On the other hand, the nature of the NFS protocol is such that the
client _must_ at some specific point request the server to place
previously sent data onto stable storage. This is done through an
NFSv3 or NFSv4 COMMIT operation. The COMMIT operation is a contract
between clients and servers that allows the client to forget about
its previous historical interaction with the file. In the event of a
server crash/reboot, the client is guaranteed that previously committed
data will be returned by the server. Operations since the last COMMIT
can be replayed after a server crash in a way that ensures a coherent
view between everybody involved.
But this all topples over if the COMMIT contract is not honored. If a
local filesystem does not properly commit data when requested to do
so, there is no more guarantee that the client's view of files will be
what it would otherwise normally expect. Despite the fact that the
client has completed the 'tar x' with no errors, it can happen
that some of the files are missing in full or in part.
With local filesystems, a system crash is plainly obvious to users and
requires applications to be restarted. With NFS, a server crash is
not obvious to users of the service (the only sign being a lengthy
pause), and applications are not notified. The fact that files or
parts of files may go missing in the absence of errors can be
considered as plain corruption of the client's side view.
When the underlying filesystem serving NFS ignores COMMIT requests, or
when the storage subsystem acknowledges I/Os before they reach stable
storage, what is potential data loss on the server becomes
corruption from the client's point of view.
It turns out that in NFSv3/NFSv4 the client will request a COMMIT on
close. Moreover, the NFS server itself is required to commit on
meta-data operations; for NFSv3 that is on:
SETATTR, CREATE, MKDIR, SYMLINK, MKNOD,
REMOVE, RMDIR, RENAME, LINK
and a COMMIT may be required on the containing directory.
Expected Performance
Let's imagine we find a way to run our load at 1 COMMIT (on close) per
extracted file. The COMMIT means the client must wait for at least a
full I/O latency and since 'tar x' processes the tar file from a
single thread, that implies that we can run our workload at the
maximum rate (assuming infinitely fast networking) of one extracted file
per I/O latency or about 200 extracted files per second (on modern
disks). If the files to be extracted are 1K in average size, the tar
x will proceed at a pace of 200K/sec. If we are required to issue 2
COMMIT operations per extracted file (for instance due to a
server-side COMMIT on file create), that would further halve that
throughput number.
However, if we had lots of threads extracting individual files
concurrently the performance would scale up nicely with the number of
threads.
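The arithmetic above fits in a few lines of Python. The function name is hypothetical, the 200 I/Os per second figure is the one quoted in the text, and linear scaling with threads is an idealization that assumes the storage can absorb the extra concurrency.

```python
def extracted_files_per_second(ios_per_second: float,
                               commits_per_file: int,
                               threads: int = 1) -> float:
    """Each COMMIT costs one full I/O latency and a thread waits for each one."""
    return threads * ios_per_second / commits_per_file


if __name__ == "__main__":
    avg_file_kb = 1  # 1K files, as in the example above
    for commits in (1, 2):
        rate = extracted_files_per_second(200, commits)
        print(commits, "COMMIT(s) per file:", rate, "files/s,",
              rate * avg_file_kb, "KB/s")
```

With one COMMIT per file this gives 200 files per second (200K/sec for 1K files), and 100 files per second with two.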
But tar is single threaded, so what is actually going on here ? The
need to COMMIT frequently means that our thread must frequently pause
for a full server side I/O latency. Because our single threaded tar is
blocked, nothing is able to process the rest of our workload. If we
allow the server to ignore COMMIT operations, then NFS responses will
be sent earlier allowing the single thread to proceed down the tar
file at greater speed. One must realise that the extra performance is
obtained at the risk of causing corruption from the client's point of
view in the event of a crash.
Whether or not the client or the server needs to COMMIT as often as it
does is a separate issue. The existence of other clients that would be
accessing the files needs to be considered in that discussion. The
point being made here is that this issue is not particular to ZFS, nor
does ZFS necessarily exacerbate the problem. The performance of single
threaded writes to NFS will be throttled as a result of the
NFS-imposed COMMIT semantics.
ZFS Relevant Controls
ZFS has two controls that come into this picture. The disk write
caches and the zil_disable tunable. ZFS is designed to work correctly
whether or not the disk write caches are enabled. This is achieved
through explicit cache flush requests, which are generated (for
example) in response to an NFS COMMIT. Enabling the write caches is
then a performance consideration, and can offer performance gains for
some workloads. This is not the same with UFS which is not aware of
the existence of a disk write cache and is not designed to operate
with such cache enabled. Running UFS on a disk with write cache
enabled can lead to corruption of the client's view in the event of a
system crash.
ZFS also has the zil_disable control. ZFS is not designed to operate
with zil_disable set to 1. Setting this variable (before mounting a
ZFS filesystem) means that O_DSYNC writes, fsync as well as NFS COMMIT
operations are all ignored! We note that, even without a ZIL, ZFS will
always maintain a coherent local view of the on-disk state. But by
ignoring NFS COMMIT operations, it will cause the client's view to
become corrupted (as defined above).
Comparison with UFS
In the original complaint, there was no comparison between a
semantically correct NFS service delivered by ZFS to another
similar NFS service delivered by another filesystem. Let's gather
some more data:
Local and memory based filesystems :
tmpfs : 0.077 sec
ufs : 0.25 sec
zfs : 0.12 sec
NFS service with risk of corruption of client's side view :
nfs/ufs : 7 sec (write cache enable)
nfs/zfs : 4.2 sec (write cache enable,zil_disable=1)
nfs/zfs : 4.7 sec (write cache disable,zil_disable=1)
Semantically correct NFS service :
nfs/ufs : 17 sec (write cache disable)
nfs/zfs : 12 sec (write cache disable,zil_disable=0)
nfs/zfs : 7 sec (write cache enable,zil_disable=0)
We note that with most filesystems we can easily produce an
improper NFS service by enabling the disk write caches. In this
case, a server-side filesystem may think it has committed data to
stable storage but the presence of an enabled disk write cache causes this
assumption to be false. With ZFS, enabling the write caches is not
sufficient to produce an improper service.
Disabling the ZIL (setting zil_disable to 1 using mdb and then
mounting the filesystem) is one way to generate an improper NFS
service. With the ZIL disabled, commit requests are ignored, with
potential client's view corruption.
Intelligent Storage
A different topic is running ZFS on intelligent storage arrays.
One known pathology is that some arrays will _honor_ the ZFS
request to flush the write caches despite the fact that their caches
are qualified as stable storage. In this case, NFS performance will
be much, much worse than otherwise expected. On this topic and ways to
work around this specific issue, see Jason's .Plan: Shenanigans with
ZFS.
Conclusion
In many common circumstances, ZFS offers a fine NFS service that
complies with all NFS semantics even with write caches enabled.
If another filesystem appears much faster, I suggest first making sure
that this other filesystem complies in the same way.
This is not to say that ZFS performance cannot be perfected as clearly
it can. The performance of ZFS is still evolving quite rapidly. In
many situations, ZFS provides the highest throughput of any
filesystem. In others, ZFS performance is highly competitive with
other filesystems. In some cases, ZFS can be slower than other
filesystems -- while in all cases providing end-to-end data integrity,
ease of use and integrated services such as compression, snapshots
etc.
See Also Eric's fine entry on
zil_disable

vendredi septembre 22, 2006
ZFS and OLTP
ZFS and Databases
Given that we started to have enough understanding of the internal
dynamics of ZFS,
I figured it was time to tackle the next hurdle: running a database
management system (DBMS). Now I know very little myself about DBMS,
so I teamed up with people that have tons of experience with it, my
colleagues from Performance Engineering (PAE), Neelakanth (Neel) Nagdir and
Sriram Gummuluru, getting occasional words of wisdom from Jim Mauro as
well.
Note that UFS (with DIO) has been heavily tuned over the years to
provide very good support for DBMS. We are just beginning to explore
the tweaks and tunings necessary to achieve comparable performance
from ZFS in this specialized domain.
We knew that running a DBMS would be a challenge since a database
tickles filesystems in ways that are quite different from other types
of loads. We had 2 goals. Primarily, we needed to understand how
ZFS performs in a DB environment and in what specific area it needs to
improve. Secondly, we figured that whatever would come out of the
work, could be used as blog-material, as well as best practice
recommendations. You're reading the blog material now; also watch this
space for Best Practice updates.
Note that it was not a goal of this exercise to generate data for a world record press-release. (There is always a metric where this can be achieved.)
Workload
The workload we use in PAE to characterize DBMSes is called OLTP/Net.
This benchmark was developed inside Sun for the purpose of
engineering performance into DBMS. Modeled on common transaction
processing benchmarks, it is OLTP-like but with a higher network-to-disk ratio. This makes it more representative of real world applications. Quoting from Neel's prose:
"OLTP/Net, the New-Order transaction involves multi-hops as it
performs Item validation, and inserts a single item per hop as
opposed to block updates "
I hope that means something to you; Neel will be blogging on his own,
if you need more info.
Reference Point
The reference performance point for this work would be UFS (with VxFS
being also an interesting data point, but I'm not tasked with
improving that metric). For DB loads we know that UFS directio (DIO)
provides a significant performance boost and that would be our target
as well.
Platform & Configuration
Our platform was a Niagara T2000 (8 cores @ 1.2 GHz, 4 HW threads or
strands per core) with 130 x 36 GB disks attached in JBOD
fashion. Each disk was partitioned in 2 equal slices, with half of
the surface given to a Solaris Volume Manager (SVM) volume onto which UFS
would be built and the other half given to a ZFS pool.
The benchmark was designed to not fully saturate either the CPU or the disks. While we know that performance varies between inner & outer disk surface, we don't expect the effect to be large enough to require attention here.
Write Cache Enabled (WCE)
ZFS is designed to work safely, whether or not a disk write-cache is enabled (WCE). This stays true if ZFS is operating on a disk slice. However,
when given a full disk, ZFS will turn _ON_ the write cache as part of
the import sequence. That is, it won't enable write cache when given only a
slice. So, to be fair to ZFS capabilities we manually turned ON WCE when
running our test over ZFS.
UFS is not designed to work with WCE and will put data at risk if WCE
is set, so we needed to turn it off for the UFS runs. We needed to do
this to get around the fact that we did not have enough disks to
give each filesystem its own. Therefore the performance we measured is what
would be expected when giving full disks to either filesystem. We note
that, for the FC devices we used, WCE does not provide ZFS a
significant performance boost on this setup.
No Redundancy
For this initial effort we also did not configure any form of
redundancy for either filesystem. ZFS RAID-Z does not really have an
equivalent feature in UFS and so we settled on a simple stripe. We could
eventually configure software mirroring on both filesystems, but we
don't expect that will change our conclusions. Still, this will be
interesting in follow-up work.
DBMS logging
Another thing we know already is that a DBMS's log writer latency is
critical to OLTP performance. So in order to improve on that metric,
it's good practice to set aside a number of disks for the DBMS'
logs. So with this in hand, we managed to run our benchmark and get our
target performance number (in relative terms, higher is better):
UFS/DIO/SVM : 42.5
Separate Data/log volumes
Recordsize
OK, so now we're ready. We load up Solaris 10 Update 2 (S10U2/ZFS),
build a log pool and a
data pool and get going. Note that log writers actually generate a pattern
of sequential I/O of varying sizes. That should map quite well with
ZFS out of the box. But for the DBMS' data pool, we expect a very
random pattern of reads and writes to DB records. A commonly known ZFS
best practice when servicing fixed record access is to match the ZFS
recordsize property to that of the application. We note that UFS, by
chance or by design, also works (at least on sparc) using 8K records.
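To see why the match matters, here is a rough Python sketch of the write amplification when the filesystem record is larger than the database record. The function name is hypothetical, and the 128K figure used in the example is the ZFS default recordsize, which comes from ZFS documentation rather than from this entry. The model assumes aligned updates and that a partially updated record is rewritten in full.

```python
def bytes_rewritten_per_update(record_size: int, update_size: int) -> int:
    """Bytes the filesystem rewrites for one aligned update of update_size bytes."""
    records_touched = -(-update_size // record_size)  # ceiling division
    return records_touched * record_size


if __name__ == "__main__":
    db_block = 8 * 1024
    for recordsize in (128 * 1024, 8 * 1024):
        moved = bytes_rewritten_per_update(recordsize, db_block)
        print(recordsize, "->", moved // db_block, "x the bytes of the update")
```

Under these assumptions an 8K update inside a 128K record moves 16 times the data, and if that record is not cached it must also be read in first.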
2nd run ZFS/S10U2
So for a fair comparison, we set the recordsize to 8K for the data
pool and run our OLTP/Net and....gasp!:
ZFS/S10U2 : 11.0
Data pool (8K record on FS)
Log pool (no tuning)
So that's no good and we have our work cut out for us.
The role of Prefetch in this result
To some extent we already knew of a subsystem that commonly misbehaves (which is being fixed as we speak), the vdev level prefetch code (that I also
refer to as the software track buffer). In this code, whenever ZFS
issues a small read I/O to a device, it will, by default, go and fetch
quite a sizable chunk of data (64K) located at the physical location
being read. In itself, this should not increase the I/O latency which
is dominated by the head-seek and since the data is stored in a small
fixed sized buffer we don't expect this is eating up too much memory
either. However in a heavy-duty environment like we have here, every
extra byte that moves up or down the data channel occupies valuable
space. Moreover, for a large DB, we really don't expect the
speculatively read data to be used very much. So for our next attempt
we'll tune down the prefetch buffer to 8K.
And the role of the vq_max_pending parameter
But we don't expect this to be quite sufficient here. My DBMS savvy
friends would tell me that the I/O latency of reads was quite large in
our runs. Now ZFS prioritizes reads over writes and so we thought we
should be ok. However during a pool transaction group sync, ZFS will
issue quite a number of concurrent writes to each device. This is the
vq_max_pending parameter, which defaults to 35. Clearly during this
phase, reads, even if prioritized, will take a somewhat longer
time to complete.
3rd run, ZFS/S10U2 - tuned
So I wrote up a script to tune those 2 ZFS knobs. We could then run
with a vdev prefetch buffer of 8K and a vq_max_pending of 10. This
boosted our performance almost 2X:
ZFS/S10U2 : 22.0
Data pool (8K record on FS)
Log pool (no tuning)
vq_max_pending : 10
vdev prefetch : 8K
But not quite satisfying yet.
ZFS/S10U2 known bug
We know of something else about ZFS. In the last few builds before
S10U2, a little bug made its way into the code base. The effect of
this bug was that for a full record rewrite, ZFS would actually read in
the old block even though the data is not needed at all.
Shouldn't be too bad, perfectly aligned block rewrites of uncached
data are not that common....except for databases, bummer.
So S10U2 is plagued with this issue affecting DB performance with no
workaround. So our next step was to move on to the latest ZFS bits.
4th run ZFS/Build 44
Build 44 of our next Solaris version has long had this particular
issue fixed. There we topped our past performance with:
ZFS/B44 : 33.0
Data pool (8K record on FS)
Log pool (no tuning)
vq_max_pending : 10
vdev prefetch : 8K
As we compare to umpty-years of super tuned UFS:
UFS/DIO/SVM : 42.5
Separate Data/log volumes
Summary
I think at this stage of ZFS, the results are neither great nor
bad. We have achieved:
UFS/DIO : 100 %
UFS : xx no directio (to be updated)
ZFS Best : 75% best tuned config with latest bits.
ZFS S10U2 : 50% best tuned config.
ZFS S10U2 : 25% simple tuning.
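The rounded percentages in this summary can be rechecked from the raw scores reported in the runs above. The short Python sketch below does just that; the dictionary labels are my own shorthand for the four configurations.

```python
# Relative scores as reported in the runs above (higher is better).
RESULTS = {
    "UFS/DIO/SVM": 42.5,
    "ZFS/B44, tuned": 33.0,
    "ZFS/S10U2, tuned": 22.0,
    "ZFS/S10U2, recordsize only": 11.0,
}


def relative_to_baseline(results: dict, baseline: str) -> dict:
    """Express every score as a percentage of the baseline score."""
    base = results[baseline]
    return {name: round(100 * score / base, 1) for name, score in results.items()}


if __name__ == "__main__":
    for name, pct in relative_to_baseline(RESULTS, "UFS/DIO/SVM").items():
        print(f"{name:28s} {pct:5.1f} %")
```

The exact ratios come out slightly above the rounded 75%, 50% and 25% quoted in the summary.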
To achieve acceptable performance levels:
- The latest ZFS code base. ZFS improves fast these days. We
will need to keep tracking releases for a little while. The
current OpenSolaris release as well as the upcoming Solaris 10
Update 3 (this fall) should perform, for these tests, as well
as the Build 44 results shown here.
- 1 data pool and 1 log pool: it is common practice to partition HW
resources when we want proper isolation. Going forward I think
that we will eventually get to the point where this will not be
necessary but it seems an acceptable constraint for now.
- Tuned vdev prefetch: the code is being worked on. I expect
that in the near future this will not be necessary.
- Tuned vq_max_pending: that may take a little longer. In a DB
workload, latency is key and throughput secondary. There are
a number of ideas that need to be tested which will help ZFS
improve on both average latency as well as latency
fluctuations. This will help both the Intent log (O_DSYNC
write) latency as well as reads.
Parting Words
As those improvements come out, they may well allow ZFS to catch or
surpass our best UFS numbers. When you match that kind of performance
with all the usability and data integrity features of ZFS, that's a
proposition that becomes hard to pass up.

mercredi mai 31, 2006
WHEN TO (AND NOT TO) USE RAID-Z
RAID-Z is the technology used by ZFS to implement a data-protection scheme
which is less costly than mirroring in terms of block
overhead.
Here, I'd like to go over, from a theoretical standpoint, the
performance implication of using RAID-Z. The goal of this technology
is to allow a storage subsystem to be able to deliver the stored data
in the face of one or more disk failures. This is accomplished by
joining multiple disks into an N-way RAID-Z group. Multiple RAID-Z
groups can be dynamically striped to form a larger storage pool.
To store file data onto a RAID-Z group, ZFS will spread a filesystem
(FS) block onto the N devices that make up the group. So for each FS
block, (N - 1) devices will hold file data and 1 device will hold
parity information. This information would eventually be used to
reconstruct (or resilver) data in the face of any device failure. We
thus have 1 / N of the available disk blocks that are used to store
the parity information. A 10-disk RAID-Z group has 9/10th of the
blocks effectively available to applications.
A common alternative for data protection is the use of mirroring. In
this technology, a filesystem block is stored onto 2 (or more) mirror
copies. Here again, the system will survive a single disk failure (or
more with N-way mirroring). So a 2-way mirror actually delivers similar
data-protection at the expense of providing applications access to
only one half of the disk blocks.
Now let's look at this from the performance angle, in particular that
of delivered filesystem blocks per second (FSBPS). An N-way RAID-Z
group achieves its protection by spreading a ZFS block onto the N
underlying devices. That means that a single ZFS block I/O must be
converted to N device I/Os. To be more precise, in order to access a
ZFS block, we need N device I/Os for output and (N - 1) device I/Os for
input, as the parity data need not generally be read in.
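As a rough illustration of that fan-out, here is a minimal Python sketch of the first-approximation model only. It says nothing about how the ZFS allocator actually lays out blocks, and the function name is made up for this example.

```python
def device_ios_per_block(n_devices: int, operation: str) -> int:
    """Device I/Os needed for one ZFS block on an N-way RAID-Z group.

    First-approximation model from the text: a write touches all N
    devices (data + parity); a read can normally skip the parity device.
    """
    if n_devices < 2:
        raise ValueError("a RAID-Z group needs at least 2 devices")
    if operation == "write":
        return n_devices
    if operation == "read":
        return n_devices - 1
    raise ValueError("operation must be 'read' or 'write'")


print(device_ios_per_block(10, "write"))  # 10
print(device_ios_per_block(10, "read"))   # 9
```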
Now, after a request for a ZFS block has been spread this way, the I/O
scheduling code will take control of all the device I/Os that need to
be issued. At this stage, the ZFS code is capable of aggregating
adjacent physical I/Os into fewer ones. Because of the ZFS
Copy-On-Write (COW) design, we actually do expect this reduction in
number of device level I/Os to work extremely well for just about any
write intensive workload. We also expect it to help streaming input
loads significantly. The situation of random inputs is one that needs
special attention when considering RAID-Z.
Effectively, as a first approximation, an N-disk RAID-Z group will
behave as a single device in terms of delivered random input
IOPS. Thus a 10-disk group of devices each capable of 200-IOPS, will
globally act as a 200-IOPS capable RAID-Z group. This is the price to
pay to achieve proper data protection without the 2X block overhead
associated with mirroring.
With 2-way mirroring, each FS block output must be sent to 2 devices.
Half of the available IOPS are thus lost to mirroring. However, for
inputs, each side of a mirror can service read calls independently of
the other since each side holds the full information. Given a
proper software implementation that balances the inputs between sides
of a mirror, the number of FS blocks delivered by a mirrored group is
actually no less than what a simple non-protected RAID-0 stripe would give.
So, looking at a random access input load and the number of FS blocks
per second (FSBPS) delivered: given N devices to be grouped either in
RAID-Z, 2-way mirrored or simply striped (a.k.a. RAID-0, no data
protection!), the equations would be (where dev represents the capacity
of a single device, in terms of blocks or of IOPS):
                                Random
          Blocks Available      FS Blocks / sec
          ----------------      ---------------
RAID-Z    (N - 1) * dev         1 * dev
Mirror    (N / 2) * dev         N * dev
Stripe    N * dev               N * dev
Now let's take 100 disks of 100 GB, each capable of 200 IOPS, and
look at different possible configurations. In the table below the
configuration labeled:
"Z 5 x (19+1)"
refers to a dynamic striping of 5 RAID-Z groups, each group made of 20
disks (19 data disks + 1 parity). M refers to a 2-way mirror and S to a
simple dynamic stripe.
                                    Random
Config          Blocks Available    FS Blocks / sec
------------    ----------------    ---------------
Z 1 x (99+1)    9900 GB             200
Z 2 x (49+1)    9800 GB             400
Z 5 x (19+1)    9500 GB             1000
Z 10 x (9+1)    9000 GB             2000
Z 20 x (4+1)    8000 GB             4000
Z 33 x (2+1)    6600 GB             6600
M 2 x (50)      5000 GB             20000
S 1 x (100)     10000 GB            20000
So RAID-Z gives you at most 2X the number of blocks that mirroring
provides but hits you with far fewer delivered IOPS. That means
that, as the number of devices N in a group increases, the expected
gain over mirroring (disk blocks) is bounded (to at most 2X) but the
expected cost in IOPS is not bounded (a cost in the range of N/2 to N
times fewer IOPS).
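The table can be recomputed from the three equations given earlier. The following Python sketch does just that under the same first-approximation assumptions (100 GB and 200 IOPS per disk, one RAID-Z group behaving as one device for random reads); the function names are only illustrative.

```python
def raidz(groups, data_disks, disk_gb=100, disk_iops=200):
    """Capacity (GB) and random-read FS blocks/sec for striped RAID-Z groups.

    Each group of (data_disks + 1) devices stores data on data_disks of
    them and, to a first approximation, delivers the IOPS of one device.
    """
    return groups * data_disks * disk_gb, groups * disk_iops


def mirror(pairs, disk_gb=100, disk_iops=200):
    """2-way mirrors: half the raw capacity, but every disk can serve reads."""
    return pairs * disk_gb, 2 * pairs * disk_iops


def stripe(disks, disk_gb=100, disk_iops=200):
    """Plain dynamic stripe: all the capacity, all the IOPS, no protection."""
    return disks * disk_gb, disks * disk_iops


rows = [(f"Z {g} x ({d}+1)", raidz(g, d))
        for g, d in [(1, 99), (2, 49), (5, 19), (10, 9), (20, 4), (33, 2)]]
rows.append(("M 2 x (50)", mirror(50)))
rows.append(("S 1 x (100)", stripe(100)))

for label, (gb, fsbps) in rows:
    print(f"{label:<14}{gb:>6} GB {fsbps:>7}")
```

Changing disk_gb and disk_iops makes it easy to see how the trade-off shifts with larger, slower disks.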
Note that for wide RAID-Z configurations, ZFS takes into account the
sector size of devices (typically 512 Bytes) and dynamically adjusts
the effective number of columns in a stripe. So even if you request a
99+1 configuration, the actual data will probably be stored on far
fewer data columns than that. Hopefully this article will contribute
to steering deployments away from those types of configuration.
In conclusion, when preserving IOPS capacity is important, the size of
RAID-Z groups should be restrained to smaller sizes and one must
accept some level of disk block overhead.
When performance matters most, mirroring should be highly favored. If
mirroring is considered too costly but performance is nevertheless
required, one could proceed like this:
Given N devices each capable of X IOPS.
Given a target of delivered Y FS blocks per second
for the storage pool.
Build your storage using (Y / X) dynamically striped RAID-Z
groups, each made of N / (Y / X) devices.
For instance:
Given 50 devices each capable of 200 IOPS.
Given a target of delivered 1000 FS blocks per second
for the storage pool.
Build your storage using (1000 / 200) = 5 dynamically striped
RAID-Z groups of (50 / 5) = 10 devices each.
In that system we then would have 10% block overhead lost to maintain
RAID-Z level parity.
RAID-Z is a great technology not only when disk blocks are your most
precious resource but also when your available IOPS far exceed your
expected needs. But beware that if you build on a smaller number of very
large disks, the IOPS capacity can easily become your most precious
resource. Under those conditions, mirroring should be strongly favored
or, alternatively, a dynamic stripe of RAID-Z groups each made up of a
small number of devices.

Tuesday May 16, 2006
128K Suffice
I argue that a 128K I/O size is sufficient to extract the most out of
a disk, given enough concurrent I/Os.

Wednesday November 16, 2005
ZFS to UFS Performance Comparison on Day 1
With special thanks to Chaoyue Xiong for her help in this work.
In this paper I'd like to review the performance data we have gathered
comparing this initial release of ZFS (Nov 16 2005) with the Solaris
legacy, optimized beyond reason, UFS filesystem. The data we will be
reviewing is based on 14 unit tests that were designed to stress some
specific usage patterns of filesystem operations. Working with these
well-contained usage scenarios greatly facilitates subsequent
performance engineering analysis.
Our focus was to run a fair head-to-head comparison between UFS and
ZFS rather than to produce the biggest, meanest marketing numbers.
Since ZFS is also a Volume Manager, we actually compared ZFS to a
UFS/SVM combination. In cases where ZFS underperforms UFS, we wanted
to figure out why and how to improve ZFS.
We currently also are focusing on data intensive operations. Metadata
intensive tests are being developed and we will report on those in a
later study.
Looking ahead to our results, we find that of our 12 filesystem unit
tests that were successfully run:
- ZFS outpaces UFS in 6 tests by a mean factor of 3.4
- UFS outpaces ZFS in 4 tests by a mean factor of 3.0
- ZFS equals UFS in 2 tests.
In this paper, we will be taking a closer look at the tests where UFS
is ahead and try to propose ways to improve those numbers.
THE SYSTEM UNDER TEST
Our testbed is a hefty V890 with 8 x 1200 MHz US-IV CPUs (16 cores). At
this point we are not yet monitoring the CPU utilization of the
different tests although we plan to do so in the future. The storage
is an insanely large 300-disk array; the disks were rather old
technology, small & slow 9 GB disks. None of the tests currently
stresses the array very much and the idea was mostly to take
the storage configuration out of the equation. Working with old
technology disks, the absolute throughput numbers are not necessarily
of interest; they are presented in an appendix.
Every disk in our configuration is partitioned into 2 slices and a
simple SVM or zpool striped volume is made across all spindles. We
then build a filesystem on top of the volume. All commands are run
with default parameters. Both filesystems are mounted and we can run
our test suite on either one.
Every test is rerun multiple times in succession; the tests are
defined and developed to avoid variability between instances. Some of
the current test definitions require that file data not be present in
the filesystem cache. Since we currently do not have a convenient way
to control this for ZFS, the results for those tests are omitted from
this report.
THE FILESYSTEM UNIT TESTS
Here is the definition of the 14 data intensive tests we have
currently identified. Note that we are very open to new test
definitions; if you know of a data intensive application that uses a
filesystem in a very different pattern, and there must be tons of
them, we would dearly like to hear from you.
Test 1
This is the simplest way to create a file; we open/creat a file then
issue 1MB writes until the filesize reaches 128 MB; we then close the file.
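For readers who want to try this at home, here is a minimal Python sketch of the access pattern. It is not the C test program used for the measurements below, the function name is made up, and the Python interpreter adds overhead of its own, so only the pattern carries over, not the absolute numbers.

```python
import os
import tempfile
import time

MB = 1024 * 1024


def create_file(path, file_size=128 * MB, write_size=1 * MB):
    """open/creat a file, fill it with fixed-size writes, close it.

    Returns the observed throughput in MB/s.
    """
    buf = memoryview(b"\0" * write_size)
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < file_size:
            n = min(write_size, file_size - written)
            written += os.write(fd, buf[:n])
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    return (written / MB) / elapsed


# Scaled-down run (8 MB instead of 128 MB) so the sketch finishes quickly.
with tempfile.TemporaryDirectory() as tmp:
    rate = create_file(os.path.join(tmp, "test1.dat"), file_size=8 * MB)
    print(f"{rate:.1f} MB/s")
```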
Test 2
In this test, we also create a new file, although here we work with a
file opened with the O_DSYNC flag. We work with 128K write system
calls. This maps to some database file creation schemes.
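A minimal sketch of that pattern in Python follows (again, not the original C test). The 5 MB file size is taken from the results table further down; the function name and the O_SYNC fallback are assumptions of this example, and it only runs on Unix-like systems.

```python
import os
import tempfile

KB, MB = 1024, 1024 * 1024

# O_DSYNC is only exposed on Unix; fall back to O_SYNC if it is missing.
SYNC_FLAG = getattr(os, "O_DSYNC", None) or os.O_SYNC


def create_file_dsync(path, file_size=5 * MB, write_size=128 * KB):
    """Allocate a new file with synchronous writes; return bytes written."""
    buf = memoryview(b"\0" * write_size)
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | SYNC_FLAG
    fd = os.open(path, flags, 0o644)
    try:
        written = 0
        while written < file_size:
            n = min(write_size, file_size - written)
            # Each write returns only once the data is on stable storage.
            written += os.write(fd, buf[:n])
    finally:
        os.close(fd)
    return written


with tempfile.TemporaryDirectory() as tmp:
    print(create_file_dsync(os.path.join(tmp, "test2.dat")), "bytes written")
```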
Test 3
This test is also related to file creation but with writes that are
much smaller and of varying sizes. In this test, we create a 50MB file
using writes of size picked randomly in [1K,8K]. The file is opened
with default flags (no O_*SYNC) but every 10 MB of written data we
issue an fsync() call for the whole file. This form of access can be
used for log files that have data integrity requirements.
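The log-file style pattern can be sketched as below. This is a Python approximation, not the original test code; the name write_log_like and the seed parameter are inventions of this example.

```python
import os
import random
import tempfile

KB, MB = 1024, 1024 * 1024


def write_log_like(path, file_size=50 * MB, sync_every=10 * MB, seed=0):
    """Write random-size records in [1K, 8K]; fsync() every `sync_every` bytes.

    Returns the number of fsync() calls issued.
    """
    rng = random.Random(seed)
    payload = memoryview(b"x" * (8 * KB))
    written = since_sync = syncs = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        while written < file_size:
            size = min(rng.randint(1 * KB, 8 * KB), file_size - written)
            n = os.write(fd, payload[:size])
            written += n
            since_sync += n
            if since_sync >= sync_every:
                os.fsync(fd)  # flush everything written so far
                syncs += 1
                since_sync = 0
    finally:
        os.close(fd)
    return syncs


# Scaled-down run: 1 MB file, fsync() every 256 KB.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test3.dat")
    print(write_log_like(path, file_size=1 * MB, sync_every=256 * KB), "fsync calls")
```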
Test 4
Moving now to a read test; we read a 1 GB file (assumed in cache) with
32K read system calls. This is a rather simple test to keep everybody
honest.
Test 5
This is the same test as Test 4 but with the file assumed not present
in the filesystem cache. We currently have no control over this on ZFS
and so we will not be reporting performance numbers for this test.
This is a basic streaming read sequence that should test the readahead
capacity of a filesystem.
Test 6
Our previous write tests were allocating writes. In this test we
verify the ability of a filesystem to rewrite over an existing file.
We look at 32K writes to a file opened with O_DSYNC.
Test 7
Here we also test the ability to rewrite existing files. The sizes are
randomly picked in the [1K,8K] range. No special control over data
integrity (no O_*SYNC, no fsync()).
Test 8
In this test we create a very large file (10 GB) with 1MB writes
followed by 2 full-pass sequential reads. This test is still evolving
but we want to verify the ability of the filesystem to work with files
whose size is close to or larger than available free memory.
Test 9
In this test, we issue 8K writes at random 8K aligned offsets in a 1 GB
file. When 128 MB of data is written we issue an fsync().
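A Python sketch of this random-write pattern follows; it assumes the target file already exists, and the function name is illustrative rather than taken from the test suite.

```python
import os
import random
import tempfile

KB, MB = 1024, 1024 * 1024


def random_aligned_writes(path, total=128 * MB, block=8 * KB, seed=0):
    """Write `total` bytes as `block`-sized writes at random aligned offsets
    of an existing file, then fsync(). Returns the number of writes issued."""
    rng = random.Random(seed)
    nblocks = os.path.getsize(path) // block
    buf = b"y" * block
    fd = os.open(path, os.O_WRONLY)
    try:
        count = total // block
        for _ in range(count):
            os.pwrite(fd, buf, rng.randrange(nblocks) * block)
        os.fsync(fd)  # single fsync() once all the data has been written
    finally:
        os.close(fd)
    return count


# Scaled-down run: 1 MB file and 128 KB of writes instead of 1 GB and 128 MB.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test9.dat")
    with open(path, "wb") as f:
        f.truncate(1 * MB)
    print(random_aligned_writes(path, total=128 * KB), "writes")
```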
Test 10
Here, we issue 2K writes at random (unaligned) offsets to a file
opened O_DSYNC.
Test 11
Same test as 10 but using 4 cooperating threads all working on a
single file.
Test 12
Here we attempt to simulate a mixed read/write pattern. Working with
an existing file, we loop through a pattern of 3 reads at 3 randomly
selected 8K aligned offsets followed by an 8K write to the last read
block.
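The read/read/read/write loop might look like this in Python. It is a sketch of the pattern only: the iteration count is left to the caller, and the function name and return value are choices made for this example.

```python
import os
import random
import tempfile

KB, MB = 1024, 1024 * 1024


def mixed_read_write(path, iterations, block=8 * KB, seed=0):
    """Loop: 3 reads at random aligned offsets, then rewrite the last block read.

    Returns the list of offsets that were written.
    """
    rng = random.Random(seed)
    nblocks = os.path.getsize(path) // block
    written = []
    fd = os.open(path, os.O_RDWR)
    try:
        for _ in range(iterations):
            for _ in range(3):
                offset = rng.randrange(nblocks) * block
                data = os.pread(fd, block, offset)
            # Write back to the block that was read last.
            os.pwrite(fd, data, offset)
            written.append(offset)
    finally:
        os.close(fd)
    return written


# Scaled-down run on a 1 MB file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test12.dat")
    with open(path, "wb") as f:
        f.truncate(1 * MB)
    print(len(mixed_read_write(path, iterations=100)), "writes")
```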
Test 13
In this test we issue 2K pread() calls (to random unaligned
offsets). The file is asserted to not be in the cache. Since we currently
have no such control, we won't report data for this test.
Test 14
We have 4 cooperating threads (working on a single file) issuing 2K
pread() calls to random unaligned offsets. The file is present in the
cache.
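The shared-reader pattern of Test 14 can be approximated with Python threads as below. The 5 MB per thread comes from the results table; the helper names are hypothetical, and Python threads are only a stand-in for the threads of the original C framework.

```python
import os
import random
import tempfile
import threading

KB, MB = 1024, 1024 * 1024


def _reader(fd, file_size, total, read_size, seed, done, slot):
    """One thread: pread() at random unaligned offsets until `total` bytes."""
    rng = random.Random(seed)
    count = 0
    while count < total:
        offset = rng.randrange(file_size - read_size + 1)
        count += len(os.pread(fd, read_size, offset))
    done[slot] = count


def shared_random_reads(path, threads=4, per_thread=5 * MB, read_size=2 * KB):
    """Run `threads` readers against one shared file descriptor.

    Returns the number of bytes read by each thread.
    """
    file_size = os.path.getsize(path)
    done = [0] * threads
    fd = os.open(path, os.O_RDONLY)
    try:
        workers = [
            threading.Thread(
                target=_reader,
                args=(fd, file_size, per_thread, read_size, i, done, i),
            )
            for i in range(threads)
        ]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
    finally:
        os.close(fd)
    return done


# Scaled-down run: 1 MB file, 64 KB read per thread.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test14.dat")
    with open(path, "wb") as f:
        f.truncate(1 * MB)
    print(shared_random_reads(path, per_thread=64 * KB))
```

pread() takes an explicit offset, so the threads can share one descriptor without stepping on a common file position.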
THE RESULTS
We have a common testing framework to generate the performance data.
Each test is written as a simple C program and the framework is
responsible for creating threads, files, timing the runs and
reporting. We currently are discussing merging this test framework
with the Filebench suite. We regret that we cannot easily share the
test code; however, the above descriptions should be sufficiently
precise to allow someone to reproduce our data. In my mind a simple
10 to 20 disk array and any small server should be enough to generate
similar numbers. If anyone finds very different results, I would be
very interested in knowing about it.
Our framework reports all timing results as a throughput
measure. Absolute values of throughput are highly test case dependent.
A 2K O_DSYNC write will not have the same throughput as a 1MB cached
read. Some tests would be better described in terms of operations per
second. However since our focus is a relative ZFS to UFS/SVM
comparison, we will focus here on the delta in throughput between the
2 filesystems (for the curious the full throughput data is posted in
the appendix).
Drumroll....
Task ID  Description                         Winning FS / Performance Delta
-------  ----------------------------------  ------------------------------
1        open() and allocation of a          ZFS / 3.4X
         128.00 MB file with
         write(1024K) then close().
2        open(O_DSYNC) and allocation of a   ZFS / 5.3X
         5.00 MB file with write(128K)
         then close().
3        open() and allocation of a          UFS / 1.8X
         50.00 MB file with write() of
         size picked uniformly in
         [1K,8K] issuing fsync()
         every 10.00 MB
4        Sequential read(32K) of a           ZFS / 1.1X
         1024.00 MB file, cached.
5        Sequential read(32K) of a           no data
         1024.00 MB file, uncached.
6        Sequential rewrite(32K) of a        ZFS / 2.6X
         10.00 MB file, O_DSYNC,
         uncached
7        Sequential rewrite() of a           UFS / 1.3X
         1000.00 MB cached file, size
         picked uniformly in the [1K,8K]
         range, then close().
8        create a file of size 1/2 of        ZFS / 2.3X
         freemem using write(1MB)
         followed by 2 full-pass
         sequential read(1MB). No
         special cache manipulation.
9        128.00 MB worth of random           UFS / 2.3X
         8K-aligned write to a
         1024.00 MB file; followed
         by fsync(); cached.
10       1.00 MB worth of 2K write to        draw (UFS == ZFS)
         100.00 MB file, O_DSYNC,
         random offset, cached.
11       1.00 MB worth of 2K write to        ZFS / 5.8X
         100.00 MB file, O_DSYNC,
         random offset, uncached. 4
         cooperating threads each
         writing 1 MB
12       128.00 MB worth of 8K aligned       draw (UFS == ZFS)
         read&write to 1024.00 MB
         file, pattern of 3 X read,
         then write to last read
         page, random offset,
         cached.
13       5.00 MB worth of pread(2K) per      no data
         thread within a shared
         1024.00 MB file, random
         offset, uncached
14       5.00 MB worth of pread(2K) per      UFS / 6.9X
         thread within a shared
         1024.00 MB file, random
         offset, cached 4 threads.
As stated in the abstract
- ZFS outpaces UFS in 6 tests by a mean factor of 3.4
- UFS outpaces ZFS in 4 tests by a mean factor of 3.0
- ZFS equals UFS in 2 tests.
The performance differences can be sizable; let's have a closer look
at some of them.
PERFORMANCE DEBRIEF
Let's look at each test to try to understand the cause of the
performance differences.
Test 1 (ZFS 3.4X)
open() and allocation of a
128.00 MB file with
write(1024K) then close().
This test is not fully analyzed. We note that in this situation UFS
will regularly kick off some I/O from the context of the write system
call. This would occur whenever a cluster of writes (typically of
size 128K or 1MB) has completed. The initiation of I/O by UFS slows
down the process. On the other hand ZFS can zoom through the test at
a rate much closer to a memcopy. The ZFS I/Os to disks are actually
generated internally by the ZFS transaction group mechanism: every few
seconds a transaction group will come and flush the dirty data to disk
and this occurs without throttling the test.
Test 2 (ZFS 5.3X)
open(O_DSYNC) and
allocation of a
5.00 MB file with
write(128K) then close().
Here ZFS shows an even bigger advantage. Because of its design and
complexity, UFS is actually somewhat limited in its capacity to write
allocate files in O_DSYNC mode. Every new UFS write requires some
disk block allocation, which must occur one block at a time when
O_DSYNC is set. ZFS can easily outperform UFS for this test.
Test 3 (UFS 1.8X)
open() and allocation of a
50.00 MB file with write() of
size picked uniformly in
[1K,8K] issuing fsync()
every 10.00 MB
Here ZFS pays for the advantage it had in Test 1. In this test, we issue
very many writes to a file. Those are cached as the process is racing
along. When the fsync() hits (every 10 MB of outstanding data per the
test definition) the FS must now guarantee that all the data is sent to
stable storage. Since UFS kicks off I/O more regularly, when the
fsync() hits UFS has a smaller amount of data left to sync up. What
saves the day for ZFS is that, for that leftover data, UFS slows down to
a crawl. On the other hand, ZFS has accumulated a large amount of data
in the cache by the time the fsync() hits. Fortunately ZFS is able to
issue much larger I/Os to disk and catches up on some of the lag that has
built up. But the final result shows that UFS wins the horse race
(at least in this specific test); details of the test will influence
the final result here.
However the ZFS team is working on ways to make the fsync() much
better. We actually have 2 possible avenues of improvement. We can
borrow from the UFS behavior and kick off some I/Os when too much
outstanding data is cached. UFS does this at a very regular interval
which does not look right either. But clearly if a file has many MB
of outstanding dirty data, sending them off to disk might be
beneficial. On the other hand, keeping the data in cache is
interesting when the pattern of writing is such that the same file
offsets are written and re-written over and over again. Sending the
data to disk is wasteful if the data is rewritten shortly
after. Basically the FS must place a bet on whether a future fsync()
will occur before a new write to the block. We cannot win this bet
on all tests all the time.
Given that fsync() performance is important, I would like to see us
asynchronously kick off I/O when we reach many MB of outstanding
data to a file. This is nevertheless debatable.
Even if we don't do this, we have another area of improvement that the
ZFS team is looking into. When the fsync() finally hits the fan, even
with a lot of outstanding data, the current implementation does not
issue disk I/Os very efficiently. The proper way to do this is to
kick off all required I/Os and then wait for them all to complete.
Currently, in the intricacies of the code, some I/Os are issued and
waited upon one after the other. This is not yet optimal but we
certainly should see improvements coming in the future and I truly
expect ZFS fsync() performance to be ahead all the time.
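To make the difference between the two issuing strategies concrete, here is a user-level analogy in Python. It is not ZFS code: synchronous pwrite() calls stand in for disk I/Os, a thread pool stands in for asynchronous issue, and all names are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# O_DSYNC makes each pwrite() wait for stable storage (Unix only).
SYNC_FLAG = getattr(os, "O_DSYNC", None) or os.O_SYNC


def flush_one_by_one(fd, chunks):
    """Issue each write and wait for it before issuing the next one."""
    return sum(os.pwrite(fd, data, offset) for offset, data in chunks)


def flush_batched(fd, chunks, workers=8):
    """Kick off the writes first (up to `workers` in flight), then wait."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(os.pwrite, fd, data, offset)
                   for offset, data in chunks]
        # Only now do we wait, for the whole batch; errors re-raise here.
        return sum(f.result() for f in futures)


with tempfile.TemporaryDirectory() as tmp:
    chunks = [(i * 8192, bytes([65 + i]) * 8192) for i in range(16)]
    for flush in (flush_one_by_one, flush_batched):
        path = os.path.join(tmp, flush.__name__)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | SYNC_FLAG, 0o644)
        try:
            print(flush.__name__, flush(fd, chunks), "bytes")
        finally:
            os.close(fd)
```

Both functions leave the same bytes on disk; what changes is how many writes are outstanding at once, which is where the expected gain comes from.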
Test 4 (ZFS 1.1X)
Sequential read(32K) of a 1024.00
MB file, cached.
Rather simple test, mostly close to memcopy speed between the
Filesystem cache and the user buffer. Contest is almost a wash with
ZFS slightly on top. Not yet analyzed.
Test 5 (N/A)
Sequential read(32K) of a 1024.00
MB file, uncached.
No results due to lack of control on the ZFS file level caching.
Test 6 (ZFS 2.6X)
Sequential rewrite(32K) of a
10.00 MB file, O_DSYNC,
uncached
Due to the WAFL (Write Anywhere File Layout) nature of ZFS, a rewrite is
not very different from an initial write and it seems to perform very well
on this test. Presumably UFS performance is hindered by the need to
synchronize the cached data. Result not yet analyzed.
Test 7 (UFS 1.3X)
Sequential rewrite() of a 1000.00
MB cached file, size picked
uniformly in the [1K,8K]
range, then close().
In this test we are not timing any of the disk I/O. This is merely a
test about unrolling the filesystem code for 1K to 8K cached writes.
The UFS codepath wins in simplicity and years of performance tuning.
The ZFS codepath here somewhat suffers from its youth. Understandably
the current ZFS implementation is very well layered and we easily
imagine that the locking strategies of the different layers are
independent of one another. We have found (thanks to DTrace) that a small
ZFS cached write would use about 3 times as many lock acquisitions as
an equivalent UFS call. Mutex rationalization within or between
layers certainly seems to be an area of potential improvement for ZFS
that would help this particular test. We also realised that the very
clean and layered code implementation is causing the callstack to
take very many elevator rides up and down between layers. On a SPARC
CPU, going up and down 6 or 7 layers deep in the callstack causes a
spill/fill trap and one additional trap for every additional floor
travelled. Fortunately there are very many areas where ZFS will be
able to merge different functions into a single one or possibly exploit
the technique of tail calls to regain some of the lost performance.
All in all, we find that the performance difference is small enough to
not be worrisome at this point, especially in view of the possible
improvements we already have identified.
Test 8 (ZFS 2.3X)
create a file of size 1/2 of
freemem using write(1MB)
followed by 2 full-pass
sequential read(1MB). No
special cache manipulation.
This test needs to be analyzed further. We note that UFS will
proactively freebehind read blocks. While this is a very responsible
use of memory (give it back after use), it potentially impacts the
UFS re-read performance. While we're happy to see ZFS performance on
top, some investigation is warranted to make sure that ZFS does not
overconsume memory in some situations.
Test 9 (UFS 2.3X)
128.00 MB worth of random
8K-aligned write to a
1024.00 MB file; followed
by fsync(); cached.
In this test we expect a rationale similar to the one of Test 3 to take
effect. The same cure should also apply.
Test 10 (draw)
1.00 MB worth of 2K write to
100.00 MB file, O_DSYNC,
random offset, cached.
Both FS must issue and wait for a 2K I/O on each write. They both do
this as efficiently as possible.
Test 11 (ZFS 5.8X)
1.00 MB worth of 2K write to
100.00 MB file, O_DSYNC,
random offset, uncached. 4
cooperating threads each
writing 1 MB
This test is similar to the previous one except for the 4 cooperating
threads. ZFS being on top highlights a key feature of ZFS, the lack of
a single writer lock. UFS can only allow a single write thread working
per file. The only exception is when directio is enabled and then
only with rather restrictive conditions. UFS with directio would allow
concurrent writers with the implied restriction that it did not honor
full POSIX semantics regarding write atomicity. ZFS, out of the box,
is able to allow concurrent writers without requiring any special
setup or giving up full POSIX semantics. All great news for
simplicity of deployment and great database performance.
Test 12 (draw)
128.00 MB worth of 8K aligned
read&write to 1024.00 MB
file, pattern of 3 X read,
then write to last read
page, random offset,
cached.
Both filesystems perform appropriately. The test still requires analysis.
Test 13 (N/A)
5.00 MB worth of pread(2K) per
thread within a shared
1024.00 MB file, random
offset, uncached
No results due to lack of control on the ZFS file level caching.
Test 14 (UFS 6.9X)
5.00 MB worth of pread(2K) per
thread within a shared
1024.00 MB file, random
offset, cached 4 threads.
This test inexplicably shows UFS on top. The UFS code can perform
rather well given that the FS cache is stored in the page cache.
Servicing reads from cache can be made very scalable. We are just
starting our analysis of the performance characteristics of ZFS for
this test. We have identified some serialization construct in the
buffer management code where we find that reclaiming the buffers into
which to put the cached data is acting as a serial throttle. This is
truly the only test where the ZFS performance disappoints, although
there is no doubt that we will be finding a cure to this
implementation issue.
THE TAKEAWAY
ZFS is on top in very many of our tests, often by a significant
factor. Where UFS is ahead we have a clear view on how to improve the
ZFS implementation. The case of shared readers to a single file will
be the test that requires special attention.
Given the youth of the ZFS implementation, the performance outline
presented in this paper shows that the ZFS design decisions are totally
validated from a performance perspective.
FUTURE DIRECTIONS
Clearly, we should now expand the unit test coverage. We would like
to study more metadata intensive workloads. We also would like to see
how ZFS features such as compression and RAID-Z perform. Other
interesting studies could focus on CPU consumption and memory
efficiency. We also need to find a solution to running the existing
unit tests that require the files to not be cached in the filesystem.
APPENDIX/ THROUGHPUT MEASURE
Here are the raw throughput measures for each of the 14 unit tests.
Task ID  Description                         ZFS latest+nv25 (MB/s)  UFS+nv25 (MB/s)  Delta
-------  ----------------------------------  ----------------------  ---------------  --------
1        open() and allocation of a          486.01572               145.94098        ZFS 3.4X
         128.00 MB file with
         write(1024K) then close().
2        open(O_DSYNC) and allocation of a   4.5637                  0.86565          ZFS 5.3X
         5.00 MB file with write(128K)
         then close().
3        open() and allocation of a          27.3327                 50.09027         UFS 1.8X
         50.00 MB file with write() of
         size picked uniformly in
         [1K,8K] issuing fsync()
         every 10.00 MB
4        Sequential read(32K) of a           674.77396               612.92737        ZFS 1.1X
         1024.00 MB file, cached.
5        Sequential read(32K) of a           1756.57637              17.53705         N/A
         1024.00 MB file, uncached.
6        Sequential rewrite(32K) of a        2.20641                 0.85497          ZFS 2.6X
         10.00 MB file, O_DSYNC,
         uncached
7        Sequential rewrite() of a           204.31557               257.22829        UFS 1.3X
         1000.00 MB cached file, size
         picked uniformly in the [1K,8K]
         range, then close().
8        create a file of size 1/2 of        698.18182               298.25243        ZFS 2.3X
         freemem using write(1MB)
         followed by 2 full-pass
         sequential read(1MB). No
         special cache manipulation.
9        128.00 MB worth of random           42.75208                100.35258        UFS 2.3X
         8K-aligned write to a
         1024.00 MB file; followed
         by fsync(); cached.
10       1.00 MB worth of 2K write to        0.117925                0.116375         draw
         100.00 MB file, O_DSYNC,
         random offset, cached.
11       1.00 MB worth of 2K write to        0.42673                 0.07391          ZFS 5.8X
         100.00 MB file, O_DSYNC,
         random offset, uncached. 4
         cooperating threads each
         writing 1 MB
12       128.00 MB worth of 8K aligned       264.84151               266.78044        draw
         read&write to 1024.00 MB
         file, pattern of 3 X read,
         then write to last read
         page, random offset,
         cached.
13       5.00 MB worth of pread(2K) per      75.98432                0.11684          N/A
         thread within a shared
         1024.00 MB file, random
         offset, uncached
14       5.00 MB worth of pread(2K) per      56.38486                386.70305        UFS 6.9X
         thread within a shared
         1024.00 MB file, random
         offset, cached 4 threads.
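The deltas appear to be simply the ratio of the faster throughput to the slower one. The Python sketch below recomputes them from the figures in the table, leaving out tests 5 and 13 which have no valid result; the variable and function names are illustrative. Recomputed this way, Test 1 comes out slightly below the quoted 3.4X, while the other ratios agree to one decimal place.

```python
# test id: (ZFS MB/s, UFS MB/s), copied from the appendix table
throughput = {
    1: (486.01572, 145.94098),
    2: (4.5637, 0.86565),
    3: (27.3327, 50.09027),
    4: (674.77396, 612.92737),
    6: (2.20641, 0.85497),
    7: (204.31557, 257.22829),
    8: (698.18182, 298.25243),
    9: (42.75208, 100.35258),
    10: (0.117925, 0.116375),
    11: (0.42673, 0.07391),
    12: (264.84151, 266.78044),
    14: (56.38486, 386.70305),
}


def delta(zfs, ufs):
    """Return (winning filesystem, faster throughput / slower throughput)."""
    winner = "ZFS" if zfs >= ufs else "UFS"
    return winner, max(zfs, ufs) / min(zfs, ufs)


for test_id, (zfs, ufs) in throughput.items():
    winner, ratio = delta(zfs, ufs)
    print(f"{test_id:>2}  {winner}  {ratio:.2f}X")
```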