Wednesday November 16, 2005
It's been said that developing a new filesystem is an over-constrained problem. The interfaces are fixed (POSIX system calls at the top, block devices at the bottom), correctness in the face of bad hardware is not optional, and it must be fast, or nobody cares. Well, gather 'round, settle back, and listen to (or read) the tale of ZFS vs. "The Benchmark".
Once, not long ago, we had a customer write a benchmark for Solaris that was meant to stress the VM system. The idea was fairly straightforward: mmap(2) a large file, then randomly dirty it, forcing the system to do some paging in the process. Pretty simple, right?
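For a rough idea of what such a benchmark looks like, here is a minimal sketch in Python. It is not the customer's code, which the post does not show; the helper names `make_file` and `dirty_random_pages` are made up for illustration. It maps a file and writes one byte to randomly chosen pages, which is enough to mark each touched page dirty.

```python
import mmap
import os
import random
import tempfile


def make_file(npages):
    """Create a zero-filled temporary file of `npages` pages (hypothetical helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.truncate(npages * mmap.PAGESIZE)
    return path


def dirty_random_pages(path, touches, seed=0):
    """Map `path` and modify the first byte of randomly chosen pages.

    Returns the set of page indexes that were touched.
    """
    page = mmap.PAGESIZE
    size = os.path.getsize(path)
    npages = size // page
    rng = random.Random(seed)
    touched = set()
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), size) as m:
        for _ in range(touches):
            p = rng.randrange(npages)
            # Writing a single byte is enough to dirty the whole page.
            m[p * page] = (m[p * page] + 1) % 256
            touched.add(p)
        m.flush()  # ask the OS to write the dirty pages back to the file
    return touched
```

On a file much larger than physical memory, a loop like this forces the VM system to page continuously, which is the behavior the benchmark was after.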
Our beloved customer then goes to run this on an appropriately large file, stored on a UFS filesystem. The benchmark then proceeds to grind the system to a near standstill. Cursory inspection shows that the single disk containing the UFS filesystem has 24,000 outstanding I/Os to it. This is the same disk that contains the root filesystem. Not a good sign.
As you might well imagine, the system was pretty unresponsive. Not hung, but so slow it was beyond tolerable for even the most patient of us. The benchmark had caused enough memory pressure that most non-dirty pages (like our shell, ls, vmstat, etc.) were evicted from memory. Whenever we wanted to type a command, these pages had to be fetched off of disk. That's right, the same disk with 24,000 outstanding I/Os. The disk was a real champ, cranking out about 400 IOPS, which meant that any read request (for, say, the ls command) took about one minute to work its way to the front of the I/O queue. Like I said, a trying situation for someone attempting to figure out what the system is doing.
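The one-minute figure follows directly from the two numbers above, assuming the disk services its queue roughly in arrival order and a new read lands at the back. A quick check, with a function name chosen only for this example:

```python
def queue_wait_seconds(outstanding_ios, iops):
    """Time for a request at the tail of a FIFO queue to reach the front."""
    return outstanding_ios / iops


# 24,000 queued I/Os drained at about 400 per second
wait = queue_wait_seconds(24_000, 400)
print(wait)  # 60.0 seconds
```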
The score so far? The Benchmark: 1 UFS: 0
At this point in time, ZFS was still in its early stages, but I had just finished writing some fancy I/O scheduling code based on some ideas that had been bouncing around in my head for a couple of years. After I explained what I had just done to Bryan Cantrill, he had the bright idea of running this same benchmark using ZFS. While I was still busy waving my hands and trying to back down on my bold claims, he had already started up the benchmark using ZFS.
The difference in system response was breathtaking. While running the same workload, the system was perfectly responsive and behaved like a machine without a care in the world. Wow. I then promptly went back to making bold claims.
One of the features of our advanced I/O scheduler in ZFS is that each I/O has both a priority and a deadline associated with it. Higher priority I/Os get deadlines that are sooner than lower priority ones. Our I/O issue policy is then to issue the I/O within the oldest deadline group that has the lowest LBA (logical block address). This means that for I/Os that share the same deadline, we do a linear pass across the disk, kind of like half of an elevator algorithm. We don't do a full elevator algorithm since it tends to lead to over-servicing the middle of the disk and starving the outer edges.
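The issue policy described above amounts to ordering pending I/Os by deadline first and LBA second. The sketch below is an illustration of that ordering on a static snapshot of the queue, not the actual ZFS code; the `IO` type and `issue_order` function are hypothetical names, and it assumes deadlines are coarse enough that many I/Os share one.

```python
from typing import NamedTuple


class IO(NamedTuple):
    deadline: int  # coarse-grained, so many I/Os fall into the same group
    lba: int       # logical block address


def issue_order(pending):
    """Oldest deadline group first; within a group, ascending LBA."""
    return sorted(pending, key=lambda io: (io.deadline, io.lba))


pending = [IO(2, 50), IO(1, 900), IO(1, 10), IO(2, 5), IO(1, 400)]
for io in issue_order(pending):
    print(io.deadline, io.lba)
```

Within deadline group 1 the LBAs come out as 10, 400, 900, a single sweep in one direction. The next group starts over from its own lowest LBA rather than sweeping back, which is the "half of an elevator" behavior.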
How does this fancy I/O scheduling help, you ask? In general, a write(2) system call is asynchronous. The kernel buffers the data for the write in memory, returns from the system call, and later sends the data to disk. On the other hand, a read(2) is inherently synchronous. An application is blocked from making forward progress until the read(2) system call returns, which it can't do until the I/O has come back from disk. As a result, we give writes a lower priority than reads. So if we have, say, several thousand writes queued up to a disk, and then a read comes in, the read will effectively cut to the front of the line and get issued right away. This leads to a much more responsive system, even when it's under heavy load. Since we have deadlines associated with each I/O, a constant stream of reads won't permanently starve the outstanding writes.
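Both effects, reads jumping the queue and writes not being starved, can be seen in a toy simulation. Everything here is an assumption made for illustration: the `Scheduler` class, the tick-based clock, one I/O issued per tick, and the deadline offsets (the post does not give the real ZFS values).

```python
import heapq
import itertools

# Hypothetical deadline offsets in ticks: reads are due now, writes later.
DEADLINE_OFFSET = {"read": 0, "write": 100}


class Scheduler:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker so tuples always compare

    def submit(self, now, kind, lba):
        deadline = now + DEADLINE_OFFSET[kind]
        heapq.heappush(self._heap, (deadline, lba, next(self._seq), kind))

    def issue(self):
        """Pop the I/O with the oldest deadline, lowest LBA first."""
        deadline, lba, _, kind = heapq.heappop(self._heap)
        return kind, lba


def simulate(ticks, queued_writes):
    """Queue writes at tick 0, then submit one read and issue one I/O per tick."""
    sched = Scheduler()
    for lba in range(queued_writes):
        sched.submit(0, "write", lba)
    issued = []
    for now in range(ticks):
        sched.submit(now, "read", 1_000_000)
        issued.append(sched.issue()[0])
    return issued
```

With 5,000 writes already queued, `simulate(1, 5000)` issues the newly arrived read first. With a steady stream of reads, `simulate(200, 10)` issues reads for the first 100 ticks, and then the queued writes go out once their deadline is as old as any pending read's, so they are delayed but never starved.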
At this point, the clouds cleared, the sun came out, birds started singing, and all was Goodness and Light. The Benchmark had gone from being a "thrash the disk" test to being an actual VM test, like it was originally intended.
Final score? The Benchmark: 0 ZFS: 1
Posted by Diego on November 16, 2005 at 10:53 AM PST #
So Sun employees can brag about ZFS, instead of getting depressed about their crappy drivers?
Posted by Alex on November 17, 2005 at 08:34 AM PST #
What if the read(2) reads a portion of the file that a pending write(2) would have affected?
Isn't that a Bad Thing?
Posted by Ron on November 17, 2005 at 03:33 PM PST #
I'll ignore the inflammatory tone of your comment and try to answer the question briefly. More will follow in a blog entry.
The whole purpose of an I/O scheduler (especially one with advanced features like deadlines and inheritable priority) is to re-order I/O requests. Without knowledge of the interdependencies, it is impossible to do this without sacrificing correctness. In the read case, you can generally do this, but you still have to be careful so that you're consistent with interleaved writes to the same block.
Posted by Bill Moore on November 17, 2005 at 09:18 PM PST #
We handle that case at a higher layer in ZFS. If a read is issued for a write that is on its way to disk, the read is satisfied by the in-memory copy of that data. It would never make it down to the actual I/O layer, so it isn't a problem.
Posted by Bill Moore on November 17, 2005 at 09:20 PM PST #
Posted by Vladislav Mikhailikov on November 18, 2005 at 03:17 AM PST #
I don't think your statement is correct. This is a perfectly fair comparison. It just may not be the comparison that you're most interested in.
That said, let me assure you that we're working on benchmarking against Reiser4, and so far, we're doing quite well. I'm working to make sure the comparison is as fair as possible, since benchmarking different filesystems on different OSes is a tricky business. It's very easy to conflate the OS differences with the filesystem ones.
Posted by Bill Moore on November 18, 2005 at 09:15 PM PST #
Posted by James on April 12, 2006 at 09:16 AM PDT #
Hopefully the system will behave intelligently and actually write the file to disk instead of flushing the file to the page cache. Now I could see things getting tricky here if you have some big slow data storage and then a fast small disk for your pagecache/root disk. Now copying the file out to the pagecache will be much faster than just writing it.
I'm actually very curious how ZFS handles this. Or more broadly ZFS+Solaris since it is a problem about the interaction of the filesystem and the VM system.
BTW Mr. Moore, are you still planning on updating this blog at some point? I can't complain about getting busy and not posting for long periods of time since I do the same on my blog, but I want to know if I should subscribe to your feed.
Posted by logicnazi on April 30, 2006 at 09:43 AM PDT #
Posted by Ben Bucksch on May 01, 2006 at 09:39 AM PDT #
Ok, so you've posted this "benchmark" showing how ZFS is better than UFS at this task. Highly likely that ZFS is better than UFS at most things. I may even be persuaded to say ZFS is better than most filesystems at most things... but there I see three big problems.
The fact that Sun thinks something like ZFS is the answer to storage problems tells me just how little Sun understands about storage... very sad indeed...
Posted by fdr on November 21, 2006 at 08:35 AM PST #
Posted by Paul Connolly on January 17, 2007 at 04:11 AM PST #
I'm following along here in hopes of finding the bottleneck in our ZFS implementations.
Bulk write performance is just terrible (for example, a 400GB database backup). Writes to tape are 400% faster than to our ZFS filesystems (tested, consistent), and this is to a 6130/40 SAN disk array (with disk sets and virtual disks not shared and allocated specifically to the backup op).
400% faster to tape than to ZFS disks on a fast array?
Makes me wonder.
Posted by Db2dude on October 29, 2007 at 04:20 PM PDT #