
Wednesday July 12, 2006
ZFS AND DIRECTIO
In view of the great performance gains that UFS gets out of the
'Directio' (DIO) feature, it is interesting to ask where exactly
those gains come from and whether ZFS can be tweaked to benefit from
them in the same way.
UFS Directio
UFS Directio is actually a set of things bundled together that
improve the performance of very specific workloads, most notably
databases. Directio is a performance hint to the filesystem
and, apart from relaxing POSIX requirements, does not carry any change
in filesystem semantics. The users of directio assert the
condition at the level of the full filesystem or of an individual file, and the
filesystem code is then given extra freedom to run, or not, the tuned DIO
codepath.
What does that tuned code path get us? A few things:
- output goes directly from application buffer to disk
bypassing the filesystem core memory cache.
- the FS is no longer constrained to strictly obey the POSIX
write ordering. The FS is thus able to allow multiple threads
to concurrently issue I/Os to a single file.
- On input UFS DIO refrains from doing any form of readahead.
In a sense, by taking out the middleman (the filesystem cache),
UFS/DIO causes files to behave a lot like a raw device. Application
reads and writes map one to one onto individual I/Os.
People often consider that the great gains that DIO provides come
from avoiding the CPU cost of the copy into system caches and from
avoiding the double buffering, once in the DB, once in the FS, that one
gets in the non-directio case.
I would argue that while the CPU cost associated with a copy certainly
does exist, the copy will run very quickly compared to the time
the ensuing I/O takes. So the impact of the copy would only appear on
systems that have their CPU quite saturated, notably for industry
standard benchmarks. However, real systems, which are more likely to
be I/O constrained than CPU constrained, should not pay a huge toll to
this effect.
As for double buffering, I note that databases (or applications in
general) are normally set up to consume a given amount of memory and
the FS operates using the remaining portion. Filesystems cache data
in memory for lack of a better use of that memory, and they give up their
hold whenever necessary. So the data is not double buffered but
rather 'free' memory keeps a hold on recently issued I/O. Buffering
data in 2 locations does not look like a performance issue to me.
Anything for ZFS ?
So what does that leave us with? Why is DIO so good?
This tells me that we gain a lot from those 2 mantras:
- don't do any more I/O than requested
- allow multiple concurrent I/Os to a file.
I note that UFS readahead is particularly bad for certain usage: when
UFS sees access to 2 consecutive pages, it will read a full cluster,
and those are typically 1MB in size today. So avoiding UFS readahead
has probably contributed greatly to the success of DIO. As for ZFS,
there are 2 levels of readahead (a.k.a. prefetching): one that is
file based and one that is device based. Both are being reworked at this stage.
I note that the file-based readahead code has not behaved and will not behave like
UFS. On the other hand, device-level prefetching is probably
overly aggressive for DB type loads and should be avoided. While I
have not given up hope that this can be managed automatically, watch
this space for tuning scripts to control the device prefetching
behavior.
DIO for input does not otherwise appear to be an interesting proposition:
if the data is cached, I don't really see the gains in bypassing
the cache (apart from slowing down the reads).
As for writes, ZFS, out of the box, does not suffer from the single
writer lock that UFS needs to implement the POSIX ordering rules. The
transaction groups (TXG) are sufficient for that purpose (see
The Dynamics of ZFS).
This leaves us with the amount of I/O needed by the 2 filesystems when
running many concurrent O_DSYNC writers issuing small writes to random
file offsets.
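To make that workload concrete, here is a minimal sketch of it in Python. It is only an illustration: the block size, file size, thread count and the `writer` / `run_workload` names are all hypothetical choices, and `O_DSYNC` is looked up defensively because not every platform defines it.

```python
import os
import random
import tempfile
import threading

BLOCK = 8 * 1024          # assumed DB block size (8K)
FILE_BLOCKS = 64          # hypothetical file size, in blocks
WRITERS = 4               # hypothetical number of concurrent writers
WRITES_PER_WRITER = 16    # hypothetical number of writes per thread

# O_DSYNC is not defined everywhere; fall back to 0 (no sync flag) if missing.
O_DSYNC = getattr(os, "O_DSYNC", 0)


def writer(path, seed):
    """Issue small synchronous writes at random block-aligned offsets."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_WRONLY | O_DSYNC)
    try:
        for _ in range(WRITES_PER_WRITER):
            offset = rng.randrange(FILE_BLOCKS) * BLOCK
            # With O_DSYNC, pwrite returns only once the data is on stable storage.
            os.pwrite(fd, bytes([seed]) * BLOCK, offset)
    finally:
        os.close(fd)


def run_workload():
    """Run all writers against one preallocated file; return its final size."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, FILE_BLOCKS * BLOCK)
        os.close(fd)
        threads = [threading.Thread(target=writer, args=(path, i + 1))
                   for i in range(WRITERS)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return os.path.getsize(path)
    finally:
        os.unlink(path)
```

Every write stays inside the preallocated range, so the file never grows; what differs between filesystems is how many physical I/Os, and of what size, those writes turn into.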
UFS actually handles this load by overwriting the data in its
preallocated disk locations. Every 8K page is associated with a set
place on the storage, and a write to that location means a disk head
movement and an 8K output I/O. This load should scale well with the
number of disks in the storage and the 'random' IOPS capability of
each drive. If a drive handles 150 random IOPS, then we can handle
about 1MB/s/drive of output.
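The per-drive figure is simple arithmetic, sketched below. The function name is mine, and the model assumes every 8K write costs exactly one random I/O with nothing else competing for the drive.

```python
def random_write_throughput(iops, io_size_bytes, drives=1):
    """Bytes per second sustained when every write costs one random I/O."""
    return iops * io_size_bytes * drives


# 150 random IOPS at 8K per I/O, for a single drive.
per_drive = random_write_throughput(iops=150, io_size_bytes=8 * 1024)
print(f"{per_drive / 1e6:.2f} MB/s per drive")  # prints "1.23 MB/s per drive"
```

That is far below what the same drive can stream sequentially, which is why the number of I/Os matters more here than the number of bytes.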
Now ZFS will behave quite differently. ZFS does not have preallocation
of file blocks and will not, ever, overwrite live data. The handling
of the O_DSYNC writes in ZFS will occur in 2 stages.
The 2 stages of ZFS
First at the ZFS Intent Log (ZIL) level, where we need to write the data
to disk in order to release the application blocked in a write call. Here the
ZIL has the ability to aggregate data from multiple writes and issue
fewer, larger I/Os than UFS would. Given the ZFS strategy of block
allocation, we also expect those I/Os to be able to stream to the disk
at high speed. We don't expect to be restrained by the random IOPS
capabilities of the disks but more by their streaming performance.
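The aggregation idea can be illustrated with a toy model. This is not the actual ZIL code: the `aggregate` function, the greedy in-order packing and the 128K cap per log I/O are all assumptions made purely for the sake of the example.

```python
def aggregate(write_sizes, max_io=128 * 1024):
    """Greedily pack pending writes, in arrival order, into larger I/Os.

    max_io is a hypothetical cap on the size of one aggregated I/O.
    Returns the list of I/O sizes that would be issued.
    """
    ios = []
    current = 0
    for size in write_sizes:
        # Flush the current I/O if adding this write would exceed the cap.
        if current and current + size > max_io:
            ios.append(current)
            current = 0
        current += size
    if current:
        ios.append(current)
    return ios


pending = [8 * 1024] * 40      # 40 blocked 8K synchronous writes
ios = aggregate(pending)
print(len(pending), "writes ->", len(ios), "I/Os")  # prints "40 writes -> 3 I/Os"
```

Under this model the same bytes reach the disk, but in 3 I/Os instead of 40, so the cost is dominated by transfer time rather than head movement.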
Next at the TXG level, we clean up the state of the filesystem and
here again the block allocation should allow a high rate of data
transfer. At this stage there are 2 things we have to care about.
- With the current state of things, we will probably see the data
sent to disk twice, once to the ZIL and once to the pool. While this
appears suboptimal at first, the aggregation and streaming
characteristics of ZFS probably make the current situation already
better than what UFS can achieve. We're also looking to see if we can
make this even better by avoiding the 2 copies while preserving the
full streaming performance characteristics.
- For pool level I/O we must take care not to inflate the amount
of data sent to disk, which could eventually cause early storage
saturation. ZFS works out of the box with 128K records for large
files. However, for DB workloads, we expect this will be tuned such
that the ZFS recordsize matches the DB block size. We also expect the
DB block size to be at least 8K. Matching the ZFS recordsize to
the DB block size is a recommendation that is in line with what UFS DIO
has taught us: don't do any more I/O than necessary.
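A rough way to see why the recordsize matters is to compute how many bytes the pool writes per byte the database updates. The sketch below is a simplification under stated assumptions: DB blocks are aligned to record boundaries, every touched record is rewritten in full, and metadata, compression and the ZIL copy are ignored. The `write_inflation` name is hypothetical.

```python
def write_inflation(db_block, recordsize):
    """Ratio of bytes written by the pool to bytes the DB asked to update.

    Assumes aligned DB blocks and that each touched record is rewritten whole.
    """
    records_touched = -(-db_block // recordsize)  # ceiling division
    return records_touched * recordsize / db_block


K = 1024
print(write_inflation(8 * K, 128 * K))  # default 128K records: prints 16.0
print(write_inflation(8 * K, 8 * K))    # matched recordsize: prints 1.0
```

With the default 128K records, an 8K update costs 16 times the data under this model; with a matched recordsize the ratio drops to 1. On a real system the property is set per dataset, for example with `zfs set recordsize=8k` on the dataset holding the DB files.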
Note also that with ZFS, because we don't overwrite live data, every
block output needs to bubble up into metadata block updates. So
there is some extra I/O that ZFS has to do, and depending on the exact
test conditions the gains of ZFS can be offset by the extra metadata
I/Os.
ZFS Performance and DB
Despite all the advantages of ZFS, the reason that performance data has
been hard to come by is that we have to clear the road and bypass
the few side issues that currently affect performance on large DB
loads. At this stage, we do have to spend some time and apply magic
recipes to get ZFS performance on databases to behave the way it's
intended to.
But when the dust settles, we should be right up there in terms of
performance compared to UFS/DIO, and improvement ideas are still
plentiful. If you have some more, I'm interested.
Slide 31 at the above link seems to suggest that one day there will be a userland interface into the DMU. Am I misreading the slide? It would be nice to see what kind of performance is possible by having the database written directly to the DMU.
Posted by Myron Scott on July 12, 2006 at 10:01 PM MEST #
Bypassing the cache for a read makes sense if your locality of reference is so low that you never get cache hits on that data, but do have some other concurrent workload which can benefit from cached filesystem data. Not just for databases: high speed linear reads through huge files will completely trash the buffer cache quickly enough that useful data can be thrown out.
Example: Linux 2.4's tendency to drop executable pages when heavy file activity was going on. You'd start some heavy file munging and go to get coffee, come back to your machine after five minutes and it would take thirty seconds after the screen unlocked to get all your windows redrawn because the executables were no longer resident in memory.
Posted by Jason Ozolins on July 13, 2006 at 03:02 AM MEST #
Posted by PJ on July 13, 2006 at 04:12 AM MEST #
Posted by James Mansion on July 13, 2006 at 12:02 PM MEST #