No doubt there is still a lot to learn about ZFS as an NFS server, and
this entry will not delve deeply into that topic. What I'd like to
dispel here is the notion that ZFS causes some NFS workloads to exhibit
pathological performance characteristics.
The Sightings
Since there have been a few perceived 'sightings' of such slowdowns, a
little clarification is in order. Large slowdowns are typically
reported for a single-threaded load, most often small-file creation
such as 'tar xf many_small_files.tar'.
For instance, I ran one such small test on a 72G SAS drive:
tmpfs : 0.077 sec
ufs : 0.25 sec
zfs : 0.12 sec
nfs/zfs : 12 sec
There are a few things to observe here. Local filesystem services
have a huge advantage for this type of load: in the absence of a
specific request by the application (e.g. tar), local filesystems can
lose your data and no one will complain. This is data loss, not data
corruption, and this generally accepted data loss will occur in the
event of a system crash. The argument is that if you need a higher
level of integrity, you need to program it into the application,
using O_DSYNC, fsync and the like. Many applications are not that
critical and avoid that burden.
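As a minimal sketch of what "programming it in" looks like, here is an
application-side durable write. Python is used purely for
illustration, and the helper name write_durably is hypothetical.
Whether fsync truly reaches stable storage still depends on the
filesystem and on the disk write cache settings discussed further
down.

```python
import os
import tempfile


def write_durably(path, data):
    """Write data to path and return only after fsync() has completed.

    Hypothetical helper: without the fsync call, a local filesystem is
    free to keep the data in memory and lose it on a system crash.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the OS to push this file's data to stable storage and
        # block until it reports completion.
        os.fsync(fd)
    finally:
        os.close(fd)


# Usage: every file pays one synchronous wait, which is the cost most
# applications (tar included) choose not to pay on a local filesystem.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "small_file")
    write_durably(target, b"important bytes\n")
    with open(target, "rb") as f:
        roundtrip = f.read()
```

Opening the file with O_DSYNC instead would make every write
synchronous rather than deferring the wait to a single fsync call.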
NFS and COMMIT
On the other hand, the nature of the NFS protocol is such that the
client _must_ at some specific point ask the server to place
previously sent data onto stable storage. This is done through an
NFSv3 or NFSv4 COMMIT operation. The COMMIT operation is a contract
between clients and servers that allows the client to forget about
its previous interaction with the file. In the event of a
server crash/reboot, the client is guaranteed that previously committed
data will be returned by the server. Operations since the last COMMIT
can be replayed after a server crash in a way that ensures a coherent
view between everybody involved.
But this all topples over if the COMMIT contract is not honored. If a
local filesystem does not properly commit data when requested to do
so, there is no longer any guarantee that the client's view of files
will be what it would otherwise expect. Even though the client has
completed the 'tar x' with no errors, some of the files may be missing
in full or in part.
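The contract can be sketched with a toy model. This is not real NFS
code: ToyServer, ToyClient and the honor_commit flag are hypothetical
names, and the integer verifier is only loosely modeled on the NFSv3
write verifier, which changes when the server reboots and tells the
client to resend uncommitted data.

```python
class ToyServer:
    """Toy server: writes stay in volatile memory until a COMMIT."""

    def __init__(self, honor_commit=True):
        self.honor_commit = honor_commit
        self.stable = {}     # survives a crash
        self.volatile = {}   # lost on a crash
        self.verifier = 1    # changes on every reboot

    def write(self, name, data):
        self.volatile[name] = data
        return self.verifier

    def commit(self):
        # A server that ignores COMMIT still replies with success.
        if self.honor_commit:
            self.stable.update(self.volatile)
            self.volatile.clear()
        return self.verifier

    def crash_and_reboot(self):
        self.volatile.clear()
        self.verifier += 1

    def read(self, name):
        if name in self.volatile:
            return self.volatile[name]
        return self.stable.get(name)


class ToyClient:
    """Toy client: keeps data until the server has committed it."""

    def __init__(self, server):
        self.server = server
        self.uncommitted = {}  # name -> (data, verifier seen at write)

    def write(self, name, data):
        self.uncommitted[name] = (data, self.server.write(name, data))

    def commit(self):
        verifier = self.server.commit()
        stale = {n: d for n, (d, v) in self.uncommitted.items()
                 if v != verifier}
        if stale:
            # The server rebooted since these writes: replay them.
            for name, data in stale.items():
                self.server.write(name, data)
            self.server.commit()
        # After a successful COMMIT the client forgets its copy.
        self.uncommitted.clear()


def crash_after_commit(honor_commit):
    server = ToyServer(honor_commit)
    client = ToyClient(server)
    client.write("file", b"payload")
    client.commit()
    server.crash_and_reboot()
    return server.read("file")


def crash_before_commit():
    server = ToyServer(honor_commit=True)
    client = ToyClient(server)
    client.write("file", b"payload")
    server.crash_and_reboot()
    client.commit()
    return server.read("file")
```

In this model crash_after_commit(True) and crash_before_commit() both
leave the file intact, while crash_after_commit(False) finds nothing:
the client saw no error, discarded its copy, and the server lost the
data.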
With local filesystems, a system crash is plainly obvious to users and
requires applications to be restarted. With NFS, a server crash is
not obvious to users of the service (the only sign being a lengthy
pause), and applications are not notified. The fact that files or
parts of files may go missing in the absence of errors can be
considered plain corruption of the client's side view.
When the underlying filesystem serving NFS ignores COMMIT requests, or
when the storage subsystem acknowledges I/Os before they reach stable
storage, what is potential data loss on the server becomes
corruption from the client's point of view.
It turns out that in NFSv3/NFSv4 the client will request a COMMIT on
close. Moreover, the NFS server itself is required to commit on
meta-data operations; for NFSv3 that is on:
SETATTR, CREATE, MKDIR, SYMLINK, MKNOD,
REMOVE, RMDIR, RENAME, LINK
and a COMMIT may be required on the containing directory.
Expected Performance
Let's imagine we find a way to run our load at 1 COMMIT (on close) per
extracted file. The COMMIT means the client must wait for at least a
full I/O latency, and since 'tar x' processes the tar file from a
single thread, we can run our workload at a maximum rate (assuming
infinitely fast networking) of one extracted file per I/O latency, or
about 200 extracted files per second on modern disks (an I/O latency
of roughly 5 ms). If the files to be extracted are 1K in size on
average, the tar x will proceed at a pace of 200K/sec. If we are
required to issue 2 COMMIT operations per extracted file (for instance
due to a server-side COMMIT on file create), that would halve the
throughput again.
However, if we had lots of threads extracting individual files
concurrently, the performance would scale up nicely with the number of
threads.
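The arithmetic above can be written down as a small model. This is a
back-of-the-envelope sketch, not a measurement: the function name is
hypothetical, the 5 ms latency is an assumed value matching "about
200 per second", and linear scaling with threads is an idealization
that ignores network and server limits.

```python
def files_per_second(io_latency_s, commits_per_file=1, threads=1):
    """Upper bound on extraction rate when every COMMIT costs one
    full I/O latency and each thread waits for its own COMMITs."""
    return threads / (io_latency_s * commits_per_file)


LATENCY_S = 0.005   # assumed: 5 ms per synchronous I/O
FILE_SIZE_KB = 1    # 1K files, as in the text

for commits, threads in [(1, 1), (2, 1), (1, 8)]:
    rate = files_per_second(LATENCY_S, commits, threads)
    print(f"{commits} COMMIT/file, {threads} thread(s): "
          f"{rate:.0f} files/s, {rate * FILE_SIZE_KB:.0f} KB/s")
```

With these assumed numbers the model gives about 200 files per second
for one COMMIT per file, 100 for two, and 1600 for eight threads at
one COMMIT per file.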
But tar is single threaded, so what is actually going on here? The
need to COMMIT frequently means that our thread must frequently pause
for a full server-side I/O latency. Because our single-threaded tar is
blocked, nothing is able to process the rest of our workload. If we
allow the server to ignore COMMIT operations, then NFS responses will
be sent earlier, allowing the single thread to proceed down the tar
file at greater speed. One must realise that the extra performance is
obtained at the risk of causing corruption from the client's point of
view in the event of a crash.
Whether or not the client or the server needs to COMMIT as often as it
does is a separate issue. The existence of other clients that would be
accessing the files needs to be considered in that discussion. The
point being made here is that this issue is not particular to ZFS, nor
does ZFS necessarily exacerbate the problem. The performance of single
threaded writes to NFS will be throttled as a result of the
NFS-imposed COMMIT semantics.
ZFS Relevant Controls
Two ZFS-related controls come into this picture: the disk write
caches and the zil_disable tunable. ZFS is designed to work correctly
whether or not the disk write caches are enabled. This is achieved
through explicit cache flush requests, which are generated (for
example) in response to an NFS COMMIT. Enabling the write caches is
then purely a performance consideration, and can offer performance
gains for some workloads. This is not the case with UFS, which is not
aware of the existence of a disk write cache and is not designed to
operate with such a cache enabled. Running UFS on a disk with write
cache enabled can lead to corruption of the client's view in the event
of a system crash.
ZFS also has the zil_disable control. ZFS is not designed to operate
with zil_disable set to 1. Setting this variable (before mounting a
ZFS filesystem) means that O_DSYNC writes, fsync as well as NFS COMMIT
operations are all ignored! We note that, even without a ZIL, ZFS will
always maintain a coherent local view of the on-disk state. But by
ignoring NFS COMMIT operations, it will cause the client's view to
become corrupted (as defined above).
Comparison with UFS
In the original complaint, there was no comparison between a
semantically correct NFS service delivered by ZFS and a similar
NFS service delivered by another filesystem. Let's gather
some more data:
Local and memory based filesystems :
    tmpfs   : 0.077 sec
    ufs     : 0.25 sec
    zfs     : 0.12 sec
NFS service with risk of corruption of client's side view :
    nfs/ufs : 7 sec   (write cache enable)
    nfs/zfs : 4.2 sec (write cache enable,zil_disable=1)
    nfs/zfs : 4.7 sec (write cache disable,zil_disable=1)
Semantically correct NFS service :
    nfs/ufs : 17 sec  (write cache disable)
    nfs/zfs : 12 sec  (write cache disable,zil_disable=0)
    nfs/zfs : 7 sec   (write cache enable,zil_disable=0)
We note that with most filesystems we can easily produce an
improper NFS service by enabling the disk write caches. In this
case, a server-side filesystem may think it has committed data to
stable storage, but the presence of an enabled disk write cache makes
this assumption false. With ZFS, enabling the write caches is not
sufficient to produce an improper service.
Disabling the ZIL (setting zil_disable to 1 using mdb and then
mounting the filesystem) is one way to generate an improper NFS
service. With the ZIL disabled, commit requests are ignored, with
potential corruption of the client's view.
Intelligent Storage
A different topic is running ZFS on intelligent storage arrays.
One known pathology is that some arrays will _honor_ the ZFS
request to flush the write caches despite the fact that their caches
are qualified as stable storage. In this case, NFS performance will
be much, much worse than otherwise expected. On this topic and ways to
work around this specific issue, see Jason's .Plan: Shenanigans with
ZFS.
Conclusion
In many common circumstances, ZFS offers a fine NFS service that
complies with all NFS semantics even with write caches enabled.
If another filesystem appears much faster, I suggest first making sure
that this other filesystem complies in the same way.
This is not to say that ZFS performance cannot be improved; clearly
it can. The performance of ZFS is still evolving quite rapidly. In
many situations, ZFS provides the highest throughput of any
filesystem. In others, ZFS performance is highly competitive with
other filesystems. In some cases, ZFS can be slower than other
filesystems -- while in all cases providing end-to-end data integrity,
ease of use and integrated services such as compression, snapshots
etc.
See also Eric's fine entry on zil_disable.
Posted by James Dickens on January 08, 2007 at 02:16 PM MET