With special thanks to Chaoyue Xiong for her help in this work.
In this paper I'd like to review the performance data we have gathered comparing this initial release of ZFS (Nov 16 2005) with the Solaris legacy, optimized beyond reason, UFS filesystem. The data we will be reviewing is based on 14 unit tests that were designed to stress specific usage patterns of filesystem operations. Working with these well-contained usage scenarios greatly facilitates subsequent performance engineering analysis.
Our focus was to issue a fair head-to-head comparison between UFS and ZFS, not to produce the biggest, meanest marketing numbers. Since ZFS is also a volume manager, we actually compared ZFS to a UFS/SVM combination. In cases where ZFS underperforms UFS, we wanted to figure out why and how to improve ZFS.
We are currently focusing on data-intensive operations. Metadata-intensive tests are being developed and we will report on those in a later study.
Looking ahead to our results, we find that of the 12 filesystem unit tests that were successfully run:
ZFS outpaces UFS in 6 tests by a mean factor of 3.4
UFS outpaces ZFS in 4 tests by a mean factor of 3.0
ZFS equals UFS in 2 tests.
In this paper, we will take a closer look at the tests where UFS is ahead and try to make propositions toward improving those numbers.
THE SYSTEM UNDER TEST
Our testbed is a hefty V890 with 8 x 1200 MHz US-IV CPUs (16 cores). At this point we are not yet monitoring the CPU utilization of the different tests, although we plan to do so in the future. The storage is an insanely large 300-disk array; the disks are rather old technology, small and slow 9 GB disks. None of the tests currently stresses the array very much; the idea was mostly to take the storage configuration out of the equation. Because we are working with old-technology disks, the absolute throughput numbers are not necessarily of interest; they are presented in an appendix.
Every disk in our configuration is partitioned into 2 slices and a simple SVM or zpool striped volume is made across all spindles. We then build a filesystem on top of the volume. All commands are run with default parameters. Both filesystems are mounted and we can run our test suite on either one.
Every test is rerun multiple times in succession; the tests are defined and developed to avoid variability between instances. Some of the current test definitions require that file data not be present in the filesystem cache. Since we currently do not have a convenient way to control this for ZFS, the results for those tests are omitted from this report.
THE FILESYSTEM UNIT TESTS
Here are the definitions of the 14 data-intensive tests we have currently identified. Note that we are very open to new test definitions; if you know of a data-intensive application that uses a filesystem in a very different pattern, and there must be tons of them, we would dearly like to hear from you.
Test 1
This is the simplest way to create a file; we open/creat a file then issue 1MB writes until the filesize reaches 128 MB; we then close the file.
Test 2
In this test, we also create a new file, although here we work with a file opened with the O_DSYNC flag. We work with 128K write system calls. This maps to some database file creation schemes.
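For readers who want to reproduce this pattern, here is a minimal sketch of Test 2. The original tests are C programs that we cannot share, so this Python version is only an approximation: the function name `create_file_dsync` is hypothetical, and the 5 MB default file size is taken from the results table rather than from the definition above.

```python
import os

KB = 1024
MB = 1024 * KB


def create_file_dsync(path, file_size=5 * MB, write_size=128 * KB):
    """Allocate a new file with fixed-size O_DSYNC writes; return bytes written."""
    buf = b"\0" * write_size
    # O_DSYNC: each write() returns only once its data is on stable storage.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC, 0o644)
    written = 0
    try:
        while written < file_size:
            chunk = min(write_size, file_size - written)
            # os.write() returns the number of bytes actually written.
            written += os.write(fd, buf[:chunk])
    finally:
        os.close(fd)
    return written
```

Dropping `os.O_DSYNC` and using a 1 MB write size with a 128 MB file gives the Test 1 pattern.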
Test 3
This test is also related to file creation but with writes that are much smaller and of varying sizes. In this test, we create a 50 MB file using writes of a size picked randomly in [1K,8K]. The file is opened with default flags (no O_*SYNC) but every 10 MB of written data we issue an fsync() call for the whole file. This form of access can be used for log files that have data integrity requirements.
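The access pattern of Test 3 can be sketched as follows. This is an illustration under assumptions, not the actual test code: the function name, the fixed random seed, and the choice to count an fsync() as soon as at least 10 MB has accumulated are all mine.

```python
import os
import random

KB = 1024
MB = 1024 * KB


def create_file_with_periodic_fsync(path, file_size=50 * MB, fsync_every=10 * MB,
                                    min_write=1 * KB, max_write=8 * KB, seed=0):
    """Create a file with random-sized writes, calling fsync() periodically.

    Returns (bytes_written, number_of_fsync_calls).
    """
    rng = random.Random(seed)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    written = 0
    since_sync = 0
    fsyncs = 0
    try:
        while written < file_size:
            # Write size picked uniformly in [min_write, max_write] bytes,
            # clipped so the file ends at exactly file_size.
            size = min(rng.randint(min_write, max_write), file_size - written)
            n = os.write(fd, b"x" * size)
            written += n
            since_sync += n
            if since_sync >= fsync_every:
                os.fsync(fd)  # all dirty data for the file must reach disk
                fsyncs += 1
                since_sync = 0
    finally:
        os.close(fd)
    return written, fsyncs
```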
Test 4
Moving now to a read test: we read a 1 GB file (assumed in cache) with 32K read system calls. This is a rather simple test to keep everybody honest.
Test 5
This is the same test as Test 4 but the file is assumed not to be present in the filesystem cache. We currently have no control over this on ZFS and so we will not be reporting performance numbers for this test. This is a basic streaming read sequence that should test the readahead capacity of a filesystem.
Test 6
Our previous write tests were allocating writes. In this test we verify the ability of a filesystem to rewrite over an existing file. We look at 32K writes to a file opened with O_DSYNC.
Test 7
Here we also test the ability to rewrite existing files. The write sizes are randomly picked in the [1K,8K] range. There is no special control over data integrity (no O_*SYNC, no fsync()).
Test 8
In this test we create a very large file (10 GB) with 1 MB writes, followed by 2 full-pass sequential reads. This test is still evolving, but we want to verify the ability of the filesystem to work with files whose size is close to or larger than available free memory.
Test 9
In this test, we issue 8K writes at random 8K aligned offsets in a 1 GB file. When 128 MB of data is written we issue an fsync().
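A minimal sketch of the Test 9 pattern, assuming the target file already exists and its size is a multiple of the block size. The function name and the use of `os.pwrite()` for positioned writes are my choices; the real test may well seek and write instead.

```python
import os
import random

KB = 1024
MB = 1024 * KB


def random_aligned_writes(path, total=128 * MB, block=8 * KB, seed=0):
    """Write `total` bytes as block-sized writes at random block-aligned
    offsets of an existing file, then fsync() once. Returns bytes written."""
    rng = random.Random(seed)
    nblocks = os.path.getsize(path) // block
    buf = b"y" * block
    fd = os.open(path, os.O_WRONLY)
    written = 0
    try:
        while written < total:
            offset = rng.randrange(nblocks) * block  # always a multiple of block
            written += os.pwrite(fd, buf, offset)
        os.fsync(fd)  # single fsync() once all the data has been written
    finally:
        os.close(fd)
    return written
```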
Test 10
Here, we issue 2K writes at random (unaligned) offsets to a file opened O_DSYNC.
Test 11
Same test as 10 but using 4 cooperating threads all working on a single file.
Test 12
Here we attempt to simulate a mixed read/write pattern. Working with an existing file, we loop through a pattern of 3 reads at 3 randomly selected 8K aligned offsets followed by an 8K write to the last read block.
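The loop described for Test 12 can be sketched like this. Two details are assumptions on my part: the write puts back the data that was just read (so the file content is unchanged), and both the reads and the write count toward the total amount of data moved.

```python
import os
import random

KB = 1024
MB = 1024 * KB


def mixed_read_write(path, total=128 * MB, block=8 * KB, reads_per_write=3, seed=0):
    """Loop: read `reads_per_write` random aligned blocks, then rewrite the
    last block read. Returns the number of bytes read plus written."""
    rng = random.Random(seed)
    nblocks = os.path.getsize(path) // block
    fd = os.open(path, os.O_RDWR)
    moved = 0
    try:
        while moved < total:
            for _ in range(reads_per_write):
                offset = rng.randrange(nblocks) * block
                data = os.pread(fd, block, offset)
                moved += len(data)
            # Write to the block that was read last.
            moved += os.pwrite(fd, data, offset)
    finally:
        os.close(fd)
    return moved
```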
Test 13
In this test we issue 2K pread() calls at random unaligned offsets. The file is asserted to not be in the cache. Since we currently have no such control, we won't report data for this test.
Test 14
We have 4 cooperating threads (working on a single file) issuing 2K pread() calls at random unaligned offsets. The file is present in the cache.
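Here is a sketch of the Test 14 access pattern: several threads sharing one file descriptor and issuing positioned reads. The names are hypothetical and the 5 MB per-thread volume comes from the results table. Note that this only illustrates the pattern; the Python interpreter lock serializes part of the work, so it is not a substitute for the C test when it comes to measuring scalability.

```python
import os
import random
import threading

KB = 1024
MB = 1024 * KB


def pread_worker(fd, file_size, per_thread, read_size, seed, results, idx):
    """Issue random unaligned pread() calls until per_thread bytes are read."""
    rng = random.Random(seed)
    done = 0
    while done < per_thread:
        # Any byte offset that leaves room for a full read_size read.
        offset = rng.randrange(file_size - read_size + 1)
        done += len(os.pread(fd, read_size, offset))
    results[idx] = done


def shared_random_preads(path, nthreads=4, per_thread=5 * MB, read_size=2 * KB):
    """Run nthreads readers against a single shared file descriptor.
    Returns the list of bytes read by each thread."""
    file_size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    results = [0] * nthreads
    try:
        threads = [
            threading.Thread(target=pread_worker,
                             args=(fd, file_size, per_thread, read_size, i, results, i))
            for i in range(nthreads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        os.close(fd)
    return results
```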
THE RESULTS
We have a common testing framework to generate the performance data. Each test is written as a simple C program and the framework is responsible for creating threads and files, timing the runs, and reporting. We are currently discussing merging this test framework with the Filebench suite. We regret that we cannot easily share the test code; however, the above descriptions should be sufficiently precise to allow someone to reproduce our data. In my mind, a simple 10 to 20 disk array and any small server should be enough to generate similar numbers. If anyone finds very different results, I would be very interested in knowing about it.
Our framework reports all timing results as a throughput measure. Absolute throughput values are highly test-case dependent: a 2K O_DSYNC write will not have the same throughput as a 1 MB cached read. Some tests would be better described in terms of operations per second. However, since our focus is a relative ZFS to UFS/SVM comparison, we will focus here on the delta in throughput between the 2 filesystems (for the curious, the full throughput data is posted in the appendix).
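To make the two measures concrete, here is a small sketch of how a throughput figure and a performance delta could be derived. The function names are hypothetical, and I am assuming 1 MB = 2^20 bytes; the framework's actual convention is not stated here.

```python
def throughput_mb_per_s(nbytes, seconds):
    """Throughput in MB/s for nbytes moved in the given wall-clock time."""
    return nbytes / (1024 * 1024) / seconds


def performance_delta(zfs_mb_per_s, ufs_mb_per_s):
    """Return (winner, factor) where factor is faster / slower throughput."""
    if zfs_mb_per_s >= ufs_mb_per_s:
        return "ZFS", zfs_mb_per_s / ufs_mb_per_s
    return "UFS", ufs_mb_per_s / zfs_mb_per_s
```

Feeding it the Test 2 figures from the appendix, `performance_delta(4.5637, 0.86565)` gives ZFS ahead by a factor that rounds to 5.3.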
Drumroll....
Task ID | Description | Winning FS / Performance Delta
1 | open() and allocation of a 128.00 MB file with write(1024K) then close(). | ZFS / 3.4X
2 | open(O_DSYNC) and allocation of a 5.00 MB file with write(128K) then close(). | ZFS / 5.3X
3 | open() and allocation of a 50.00 MB file with write() of size picked uniformly in [1K,8K] issuing fsync() every 10.00 MB | UFS / 1.8X
4 | Sequential read(32K) of a 1024.00 MB file, cached. | ZFS / 1.1X
5 | Sequential read(32K) of a 1024.00 MB file, uncached. | no data
6 | Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC, uncached | ZFS / 2.6X
7 | Sequential rewrite() of a 1000.00 MB cached file, size picked uniformly in the [1K,8K] range, then close(). | UFS / 1.3X
8 | create a file of size 1/2 of freemem using write(1MB) followed by 2 full-pass sequential read(1MB). No special cache manipulation. | ZFS / 2.3X
9 | 128.00 MB worth of random 8K-aligned write to a 1024.00 MB file; followed by fsync(); cached. | UFS / 2.3X
10 | 1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC, random offset, cached. | draw (UFS == ZFS)
11 | 1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC, random offset, uncached. 4 cooperating threads each writing 1 MB | ZFS / 5.8X
12 | 128.00 MB worth of 8K aligned read&write to 1024.00 MB file, pattern of 3 X read, then write to last read page, random offset, cached. | draw (UFS == ZFS)
13 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, uncached | no data
14 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, cached 4 threads. | UFS / 6.9X
As stated at the outset:
ZFS outpaces UFS in 6 tests by a mean factor of 3.4
UFS outpaces ZFS in 4 tests by a mean factor of 3.0
ZFS equals UFS in 2 tests.
The performance differences can be sizable; let's have a closer look at some of them.
PERFORMANCE DEBRIEF
Let's look at each test to try to understand the cause of the performance differences.
Test 1 (ZFS 3.4X)
open() and allocation of a 128.00 MB file with write(1024K) then close().
This test is not fully analyzed. We note that in this situation UFS will regularly kick off some I/O from the context of the write system call. This would occur whenever a cluster of writes (typically of size 128K or 1MB) has completed. The initiation of I/O by UFS slows down the process. On the other hand ZFS can zoom through the test at a rate much closer to a memcopy. The ZFS I/Os to disks are actually generated internally by the ZFS transaction group mechanism: every few seconds a transaction group will come and flush the dirty data to disk and this occurs without throttling the test.
Test 2 (ZFS 5.3X)
open(O_DSYNC) and allocation of a 5.00 MB file with write(128K) then close().
Here ZFS shows an even bigger advantage. Because of its design and complexity, UFS is actually somewhat limited in its capacity to write-allocate files in O_DSYNC mode. Every new UFS write requires some disk block allocation, which must occur one block at a time when O_DSYNC is set. ZFS can easily outperform UFS for this test.
Test 3 (UFS 1.8X)
open() and allocation of a 50.00 MB file with write() of size picked uniformly in [1K,8K] issuing fsync() every 10.00 MB
Here ZFS pays for the advantage it had in Test 1. In this test, we issue very many writes to a file. Those are cached as the process is racing along. When the fsync() hits (every 10 MB of outstanding data per the test definition), the FS must now guarantee that all the data is sent to stable storage. Since UFS kicks off I/O more regularly, when the fsync() hits UFS has a smaller amount of data left to sync up. What saves the day for ZFS is that, for that leftover data, UFS slows down to a crawl. ZFS, on the other hand, has accumulated a large amount of data in the cache by the time the fsync() hits. Fortunately ZFS is able to issue much larger I/Os to disk and catches up some of the lag that has built up. But the final result shows that UFS wins the horse race (at least in this specific test); details of the test will influence the final result here.
However, the ZFS team is working on ways to make fsync() much better. We actually have 2 possible avenues of improvement. We can borrow from the UFS behavior and kick off some I/Os when too much outstanding data is cached. UFS does this at a very regular interval, which does not look right either. But clearly, if a file has many MB of outstanding dirty data, sending them off to disk might be beneficial. On the other hand, keeping the data in cache is interesting when the pattern of writing is such that the same file offsets are written and rewritten over and over again. Sending the data to disk is wasteful if the data is rewritten shortly after. Basically the FS must place a bet on whether a future fsync() will occur before a new write to the block. We cannot win this bet on all tests all the time.
Given that fsync() performance is important, I would like to see us asynchronously kick off I/O when we reach many MB of outstanding data to a file. This is nevertheless debatable.
Even if we don't do this, we have another area of improvement that the ZFS team is looking into. When the fsync() finally hits the fan, even with a lot of outstanding data, the current implementation does not issue disk I/Os very efficiently. The proper way to do this is to kick off all required I/Os and then wait for them all to complete. Currently, in the intricacies of the code, some I/Os are issued and waited upon one after the other. This is not yet optimal, but we certainly should see improvements coming in the future and I truly expect ZFS fsync() performance to be ahead all the time.
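A back-of-the-envelope model shows why issuing everything first and waiting afterwards matters. This is an idealised sketch, not ZFS code: it assumes every I/O lands on a different spindle with no queueing, and the 5 ms latency used in the example is purely hypothetical.

```python
def serialized_wait_ms(latencies_ms):
    """Each I/O is issued and waited upon before the next one is started."""
    return sum(latencies_ms)


def overlapped_wait_ms(latencies_ms):
    """All I/Os are issued up front; we then wait for the slowest one.
    Assumes fully independent devices and no queueing (idealised)."""
    return max(latencies_ms, default=0)
```

With ten outstanding I/Os of 5 ms each, the serialized path waits 50 ms while the overlapped path waits 5 ms under these assumptions. Real arrays will land somewhere in between.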
Test 4 (ZFS 1.1X)
Sequential read(32K) of a 1024.00 MB file, cached.
Rather simple test, mostly close to memcopy speed between the Filesystem cache and the user buffer. Contest is almost a wash with ZFS slightly on top. Not yet analyzed.
Test 5 (N/A)
Sequential read(32K) of a 1024.00 MB file, uncached.
No results due to the lack of control over ZFS file-level caching.
Test 6 (ZFS 2.6X)
Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC, uncached
Because ZFS uses a write-anywhere layout in the manner of WAFL (Write Anywhere File Layout), a rewrite is not very different from an initial write, and ZFS seems to perform very well on this test. Presumably UFS performance is hindered by the need to synchronize the cached data. Result not yet analyzed.
Test 7 (UFS 1.3X)
Sequential rewrite() of a 1000.00 MB cached file, size picked uniformly in the [1K,8K] range, then close().
In this test we are not timing any of the disk I/O. This is merely a test about unrolling the filesystem code for 1K to 8K cached writes. The UFS codepath wins in simplicity and years of performance tuning. The ZFS codepath here suffers somewhat from its youth. Understandably, the current ZFS implementation is very well layered and we easily imagine that the locking strategies of the different layers are independent of one another. We have found (thanks, dtrace) that a small ZFS cached write uses about 3 times as many lock acquisitions as an equivalent UFS call. Mutex rationalization within or between layers certainly seems to be an area of potential improvement for ZFS that would help this particular test. We also realised that the very clean and layered code implementation causes the callstack to take very many elevator rides up and down between layers. On a SPARC CPU, going up and down 6 or 7 layers deep in the callstack causes a spill/fill trap, and one additional trap for every additional floor travelled. Fortunately there are very many areas where ZFS will be able to merge different functions into a single one, or possibly exploit the technique of tail calls, to regain some of the lost performance. All in all, we find that the performance difference is small enough not to be worrisome at this point, especially in view of the possible improvements we have already identified.
Test 8 (ZFS 2.3X)
create a file of size 1/2 of freemem using write(1MB) followed by 2 full-pass sequential read(1MB). No special cache manipulation.
This test needs to be analyzed further. We note that UFS will proactively freebehind read blocks. While this is a very responsible use of memory (give it back after use), it potentially impacts the UFS re-read performance. While we're happy to see ZFS performance on top, some investigation is warranted to make sure that ZFS does not overconsume memory in some situations.
Test 9 (UFS 2.3X)
128.00 MB worth of random 8K-aligned write to a 1024.00 MB file; followed by fsync(); cached.
In this test we expect a rationale similar to the one for Test 3 to take effect. The same cure should also apply.
Test 10 (draw)
1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC, random offset, cached.
Both FS must issue and wait for a 2K I/O on each write. They both do this as efficiently as possible.
Test 11 (ZFS 5.8X)
1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC, random offset, uncached. 4 cooperating threads each writing 1 MB
This test is similar to the previous one except for the 4 cooperating threads. ZFS being on top highlights a key feature of ZFS: the lack of a single-writer lock. UFS can only allow a single writer thread per file. The only exception is when directio is enabled, and then only under rather restrictive conditions. UFS with directio allows concurrent writers with the implied restriction that it does not honor full POSIX semantics regarding write atomicity. ZFS, out of the box, is able to allow concurrent writers without requiring any special setup or giving up full POSIX semantics. All great news for simplicity of deployment and great database performance.
Test 12 (draw)
128.00 MB worth of 8K aligned read&write to 1024.00 MB file, pattern of 3 X read, then write to last read page, random offset, cached.
Both filesystems perform appropriately. The test still requires analysis.
Test 13 (N/A)
5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, uncached
No results due to the lack of control over ZFS file-level caching.
Test 14 (UFS 6.9X)
5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, cached 4 threads.
This test inexplicably shows UFS on top. The UFS code can perform rather well given that the FS cache is stored in the page cache. Servicing reads from cache can be made very scalable. We are just starting our analysis of the performance characteristics of ZFS for this test. We have identified a serialization construct in the buffer management code, where we find that reclaiming the buffers into which to put the cached data is acting as a serial throttle. This is truly the only test where the ZFS performance disappoints, although there is no doubt that we will find a cure for this implementation issue.
THE TAKEAWAY
ZFS is on top in very many of our tests, often by a significant factor. Where UFS is ahead, we have a clear view of how to improve the ZFS implementation. The case of shared readers of a single file is the test that will require special attention.
Given the youth of the ZFS implementation, the performance outline presented in this paper shows that the ZFS design decisions are totally validated from a performance perspective.
FUTURE DIRECTIONS
Clearly, we should now expand the unit test coverage. We would like to study more metadata-intensive workloads. We would also like to see how ZFS features such as compression and RAID-Z perform. Other interesting studies could focus on CPU consumption and memory efficiency. We also need to find a way to run the existing unit tests that require the files not to be cached in the filesystem.
APPENDIX/ THROUGHPUT MEASURE
Here are the raw throughput measures for each of the 14 unit tests.
Task ID | Description | ZFS latest+nv25 (MB/s) | UFS+nv25 (MB/s) | Winning FS / Delta
1 | open() and allocation of a 128.00 MB file with write(1024K) then close(). | 486.01572 | 145.94098 | ZFS 3.4X
2 | open(O_DSYNC) and allocation of a 5.00 MB file with write(128K) then close(). | 4.5637 | 0.86565 | ZFS 5.3X
3 | open() and allocation of a 50.00 MB file with write() of size picked uniformly in [1K,8K] issuing fsync() every 10.00 MB | 27.3327 | 50.09027 | 1.8X UFS
4 | Sequential read(32K) of a 1024.00 MB file, cached. | 674.77396 | 612.92737 | ZFS 1.1X
5 | Sequential read(32K) of a 1024.00 MB file, uncached. | 1756.57637 | 17.53705 | XXXXXXXXX
6 | Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC, uncached | 2.20641 | 0.85497 | ZFS 2.6X
7 | Sequential rewrite() of a 1000.00 MB cached file, size picked uniformly in the [1K,8K] range, then close(). | 204.31557 | 257.22829 | 1.3X UFS
8 | create a file of size 1/2 of freemem using write(1MB) followed by 2 full-pass sequential read(1MB). No special cache manipulation. | 698.18182 | 298.25243 | ZFS 2.3X
9 | 128.00 MB worth of random 8K-aligned write to a 1024.00 MB file; followed by fsync(); cached. | 42.75208 | 100.35258 | 2.3X UFS
10 | 1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC, random offset, cached. | 0.117925 | 0.116375 | ====
11 | 1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC, random offset, uncached. 4 cooperating threads each writing 1 MB | 0.42673 | 0.07391 | ZFS 5.8X
12 | 128.00 MB worth of 8K aligned read&write to 1024.00 MB file, pattern of 3 X read, then write to last read page, random offset, cached. | 264.84151 | 266.78044 | =====
13 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, uncached | 75.98432 | 0.11684 | XXXXXXXX
14 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, cached 4 threads. | 56.38486 | 386.70305 | 6.9X UFS
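For anyone who wants to check the summary factors against these raw figures, here is a small sketch. The 5% tolerance used to call a draw is my own assumption (it is not a rule stated anywhere in this paper), and tests 5 and 13 are left out since no valid data is reported for them.

```python
# (ZFS MB/s, UFS MB/s) per test ID, copied from the table above.
RAW_THROUGHPUT = {
    1: (486.01572, 145.94098),
    2: (4.5637, 0.86565),
    3: (27.3327, 50.09027),
    4: (674.77396, 612.92737),
    6: (2.20641, 0.85497),
    7: (204.31557, 257.22829),
    8: (698.18182, 298.25243),
    9: (42.75208, 100.35258),
    10: (0.117925, 0.116375),
    11: (0.42673, 0.07391),
    12: (264.84151, 266.78044),
    14: (56.38486, 386.70305),
}

DRAW_TOLERANCE = 0.05  # assumption: within 5% counts as a draw


def summarize(raw, tolerance=DRAW_TOLERANCE):
    """Return (zfs_factors, ufs_factors, draw_ids) from raw throughput pairs."""
    zfs_factors, ufs_factors, draw_ids = [], [], []
    for test_id, (zfs, ufs) in sorted(raw.items()):
        ratio = zfs / ufs
        if abs(ratio - 1.0) <= tolerance:
            draw_ids.append(test_id)
        elif ratio > 1.0:
            zfs_factors.append(ratio)
        else:
            ufs_factors.append(1.0 / ratio)
    return zfs_factors, ufs_factors, draw_ids


def mean(values):
    return sum(values) / len(values)
```

Running `summarize(RAW_THROUGHPUT)` yields 6 ZFS wins, 4 UFS wins and 2 draws, with mean factors of roughly 3.4 and 3.1, in line with the quoted figures up to rounding.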
Nice amount of detail there, and it is much appreciated to see such a comparison.
Given the frequent comments about fsync, I thought you might find the following interesting:
Recently I was benchmarking Qmail on a FreeBSD server. Turns out that it uses fsync for several stages of the email queue, and that turning it off (by commenting them out from the source code and re-compiling) could give a 10-20x performance improvement in those parts. Qmail creates a lot of small files as part of the implementation and it might make an interesting file-system benchmark when injecting emails for local delivery.
Posted by Chris Rijk on November 17, 2005 at 12:20 AM MET
Fantastic post! It was very interesting to see that level of a performance deep dive, and I am glad to hear that the ZFS team is continuing the drive towards even better results. Like Chris Rijk, I am also interested in seeing fast fsync performance, having just run into an fsync issue with vxfs 4.0 (which most visibly manifested itself with large files on machines with large memory configurations). While I understand ZFS won't necessarily win every benchmark, the performance seems very good and the manageability is unbeatable.
Posted by William Hathaway on November 17, 2005 at 01:18 AM MET
I'd also like to see some tests of situations commonly seen on mail servers, that is, millions of small files (usually <4k) and lots of random fsync() writes on them. Currently the only reasonable choice for such usage patterns is reiserfs (3.6, not v4) and their mongo benchmark (http://www.namesys.com/benchmarks/mongo_readme.html) recreates them reasonably well. I'm sure I'm not the only one eagerly awaiting mongo results :)
Posted by Jure Pecar on November 17, 2005 at 05:28 PM MET
Good work, Roch! Thanks for the detailed writeup. However, as a RAS guy, I wonder about the utility of benchmarks run when there is no data protection. Are there plans for re-running these using data protection?
Posted by Richard Elling on November 17, 2005 at 05:59 PM MET
Nice analysis, as everyone has already said!
Just one point: don't get blindsided by the fact that ZFS (might) hold too much stuff in memory. Memory footprint of the FS needs to be somewhat tunable - I run Linux on this laptop, for example! Trading off I/O cycles for memory consumption might be just what you need (sometimes).
You have benchmarked everything using a cost-metric of time - I'd love to see a *power* metric used too. Hmm - could be very interesting indeed - with a laptop set for very aggressive power saving (instant disk spin-down etc).
Posted by Kevin Maciunas on November 18, 2005 at 07:38 AM MET
Nice writeup. No doubt these tests are helpful in tuning zfs. For us who are thinking of using it for NFS servers, will you guys set up a system with decent amount of memory and disks for a spec sfs run?
Posted by Ying Xie on November 20, 2005 at 07:30 AM MET
I have a strong suggestion for another test setup. We have millions of image files written individually, whose sizes form a skewed gaussian, with a minimum of 10KB, peak at 20KB, and max at 150KB. Set up a write test for millions of such files, and then a read test for full-file reads of random files within the set. Use a deep hierarchy of directories to store the files so they're not all clustered in the same directory, with about 100 or so subdirectories at each level and 100 or so leaf files in each bottom-level directory. Writes as well as reads should be scattered, not sequential, across the hierarchy. You might consider this a heavy-metadata test, but ordinary users tend to ignore that aspect and are mostly concerned with the overall data throughput. A useful metric is not only the relative ZFS/UFS performance, but whether the throughput degrades over time as the disk fills up. Then purge some random 20% or so of the files and write some more, taking more measurements as the holes fill up.
Posted by Glenn on November 21, 2005 at 09:59 AM MET
Glenn,
Your test scenario mentioned above is very similar to the Postmark benchmark (NetApp). It measures throughput/transaction rates of write/re-write read/re-read tests of lots and lots of files. Designed for the workload that an email/file server satisfies.
Posted by Nathanael Burton on December 22, 2005 at 06:43 AM MET