A very significant improvement is coming soon to ZFS: a change that
will increase the general quality of service delivered by ZFS.
Interestingly, it's a change that might also slow down your
microbenchmark, but nevertheless it's a change you should be eager for.
Write throttling
For a filesystem, write throttling designates the act of blocking an
application for some amount of time, as short as possible, while waiting
for the proper conditions to allow the write system calls to succeed.
Write throttling is normally required because applications can write
to memory (dirty memory pages) at a rate significantly faster than the
kernel can flush the data to disk. Many workloads dirty memory pages
by writing to the filesystem page cache at near memory copy speed,
possibly using multiple threads issuing high rates of filesystem
writes. Concurrently, the filesystem is doing its best to drain all
that data to the disk subsystem.
Given the constraints, the time to empty the filesystem cache to disk
can be longer than the time required for applications to dirty the
cache. Even if one considers storage with fast NVRAM, under sustained
load, that NVRAM will fill up to a point where it needs to wait for a
slow disk I/O to make room for more data to get in.
When committing data to a filesystem in bursts, it can be quite
desirable to push the data at memory speed and then drain the cache to
disk during the lapses between bursts. But when data is generated at
a sustained high rate, lack of throttling leads to total memory
depletion. We thus need at some point to try and match the application
data rate with that of the I/O subsystem. This is the primary goal of
write throttling.
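
To make the imbalance concrete, here is a minimal back-of-the-envelope
sketch. The function name and all the numbers are hypothetical and only
illustrate the arithmetic: when writers dirty memory faster than the
disks drain it, free memory shrinks at the difference between the two
rates.

```python
def seconds_until_full(free_bytes, dirty_rate, drain_rate):
    """Seconds until `free_bytes` of memory is consumed by dirty data.

    Rates are in bytes per second. Returns None when the disks keep up
    (drain_rate >= dirty_rate), since memory then never fills.
    """
    net_rate = dirty_rate - drain_rate
    if net_rate <= 0:
        return None
    return free_bytes / net_rate


MiB = 1024 ** 2
GiB = 1024 ** 3

# Hypothetical system: 16 GiB free, writers at 1 GiB/s, disks at 200 MiB/s.
sustained = seconds_until_full(16 * GiB, 1 * GiB, 200 * MiB)
print(f"memory exhausted after about {sustained:.0f} seconds")

# A workload the disks can keep up with never fills memory.
print(seconds_until_full(16 * GiB, 100 * MiB, 200 * MiB))
```

With these made-up figures memory is gone in roughly 20 seconds, which
is why a sustained writer eventually has to be slowed to disk speed.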
A secondary goal of write throttling is to prevent massive data loss.
When applications do not manage I/O synchronization (i.e. don't use
O_DSYNC and fsync), data ends up cached in the filesystem and the
contract is that there is no guarantee that the data will still be
there if a system crash were to occur. So even if the filesystem
cannot be blamed for such data loss, it is still a nice feature to
help prevent such massive losses.
Case in point: UFS write throttling
For instance UFS would use the fsflush daemon to try to keep data
exposed for no more than 30 seconds (default value of autoup). Also,
UFS would keep track of the amount of I/O outstanding for each
file. Once too much I/O was pending, UFS would throttle writers for
that file. This was controlled through ufs_HW and ufs_LW, and their
values were commonly tuned (a bad sign). Eventually the old default
values were updated and seem to work nicely today. UFS write
throttling thus operates on a per-file basis. While there are some
merits to this approach, it can be defeated as it does not manage the
imbalance between memory and disks at a system level.
ZFS Previous write throttling
ZFS is designed around the concept of transaction groups (txg).
Normally, every 5 seconds an _open_ txg goes to the quiesced
state. From that state the quiesced txg will go to the syncing state
which sends dirty data to the I/O subsystem. For each pool, there is
at most one txg in each of the three states: open, quiescing, and
syncing. Write
throttling used to occur when the 5 second txg clock would fire while
the syncing txg had not yet completed. The open group would wait on
the quiesced one which waits on the syncing one. Application writers
(write system call) would block, possibly a few seconds, waiting for a
txg to open. In other words, if a txg took more than 5 seconds to
sync to disk, we would globally block writers thus matching their
speed with that of the I/O. But if a workload had a bursty write
behavior that could be synced during the allotted 5 seconds,
the application would never be throttled.
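
As a rough illustration of that rule, here is a deliberately
simplified model. It is not the ZFS code: the function name is
hypothetical, the quiescing stage is ignored, and the stall is taken to
be simply the time by which a sync overruns the 5 second clock.

```python
TXG_INTERVAL = 5.0  # seconds between txg clock ticks, as described above


def old_writer_stall(sync_seconds, interval=TXG_INTERVAL):
    """Seconds writers would block under the simplified old scheme.

    Assumption: writers block from the moment the clock fires until the
    syncing txg completes, and not at all if it completes in time.
    """
    return max(0.0, sync_seconds - interval)


# A burst that syncs within the interval is never throttled...
print(old_writer_stall(3.0))
# ...but a 12 second sync blocks every writer for several seconds.
print(old_writer_stall(12.0))
```

The point of the model is that throttling was all or nothing: writers
either ran at memory speed or were blocked globally.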
The Issue
But ZFS did not sufficiently control the amount of data that could
get into an open txg. As long as the ARC cache was no more than half
dirty, ZFS would accept data. For a large memory machine or one with
weak storage, this was likely to cause long txg sync times. The
downsides were many:
- if we did end up throttled, long sync times meant the system
behavior would be sluggish for seconds at a time.
- long txg sync times also meant that the granularity at which
we could generate snapshots was impacted.
- we ended up with lots of pending data in the cache, all of
which could be lost in the event of a crash.
- the ZFS I/O scheduler, which prioritizes operations, was also
negatively impacted.
- by not throttling, we had the possibility that
sequential writes to large files could displace from the ARC
a very large number of smaller objects. Refilling
that data meant a very large number of disk I/Os.
Not throttling can paradoxically end up being very
costly for performance.
- the previous code could also, at times, issue no I/Os
to disk for seconds even though the workload was
critically dependent on storage speed.
- and foremost, lack of throttling depleted memory and prevented
ZFS from reacting to memory pressure.
That ZFS is considered a memory hog is most likely the result of the
previous throttling code. Once a proper solution is in place, it will
be interesting to see if we behave better on that front.
The Solutions
The new code keeps track of the amount of data accepted in a TXG and
the time it takes to sync. It dynamically adjusts that amount so that
each TXG sync takes about 5 seconds (txg_time variable). It also
clamps the limit to no more than 1/8th of physical memory.
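
One way to picture that adjustment is the sketch below. It is an
illustration under stated assumptions, not the actual ZFS
implementation: the function name and parameters are hypothetical, and
it assumes the limit is scaled in proportion to the throughput observed
during the last sync.

```python
TARGET_SYNC_SECONDS = 5.0  # target sync time (txg_time in the text)


def next_txg_limit(bytes_synced, sync_seconds, physmem_bytes,
                   target=TARGET_SYNC_SECONDS):
    """Estimate how many bytes the next TXG may accept.

    Assumption: the observed throughput (bytes_synced / sync_seconds)
    is extrapolated to `target` seconds, then clamped to 1/8th of
    physical memory as the text describes.
    """
    ceiling = physmem_bytes // 8
    if sync_seconds <= 0:
        # No usable measurement yet: fall back to the memory clamp.
        return ceiling
    estimate = int(bytes_synced * target / sync_seconds)
    return min(estimate, ceiling)


GiB = 1024 ** 3

# Slow storage: 2 GiB took 10 s to sync, so only about half as much
# data fits into a 5 s sync.
print(next_txg_limit(2 * GiB, 10.0, 64 * GiB) / GiB)
# Very fast storage: the estimate is capped at 1/8th of 64 GiB.
print(next_txg_limit(100 * GiB, 5.0, 64 * GiB) / GiB)
```

Whatever the exact formula, the effect is the same: weak storage gets
small TXGs, and no TXG can hold more than an eighth of memory.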
And to avoid the system-wide and seconds-long throttle effect, the new
code will detect when we are dangerously close to that situation
(7/8th of the limit) and will insert 1-tick delays for applications
issuing writes. This prevents a write-intensive thread from hogging
the available space and starving out other threads. This delay should
also generally prevent the system-wide throttle.
So the new steady-state behavior of write-intensive workloads is that,
starting with an empty TXG, all threads will be allowed to dirty
memory at full speed until a first threshold of bytes in the TXG is
reached. At that time, every write system call will be delayed by 1
tick, thus significantly slowing down the pace of writes. If the
previous TXG completes its I/Os, the current TXG will be
allowed to resume at full speed. But in the unlikely event that a
workload, despite the per-write 1-tick delay, manages to fill up the
TXG up to the full threshold, we will be forced to throttle all writes
in order to allow the storage to catch up.
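
The two thresholds can be summarized in a small decision function. This
is a sketch with hypothetical names, not the ZFS source; it assumes the
first threshold is the 7/8th mark mentioned earlier and leaves out the
case where the previous TXG finishes its I/Os and writers resume at
full speed.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed at full speed"
    DELAY_ONE_TICK = "delay the write by one tick"
    BLOCK = "block until the storage catches up"


def write_action(txg_dirty_bytes, txg_limit_bytes):
    """Decide what happens to a write given the TXG's current fill level."""
    if txg_dirty_bytes >= txg_limit_bytes:
        # Full threshold reached: all writers are throttled.
        return Action.BLOCK
    if txg_dirty_bytes * 8 >= txg_limit_bytes * 7:
        # Past 7/8th of the limit: slow each write down by one tick.
        return Action.DELAY_ONE_TICK
    return Action.PROCEED


limit = 800
for dirty in (0, 699, 700, 799, 800):
    print(dirty, write_action(dirty, limit).value)
```

Integer arithmetic is used for the 7/8th comparison so the boundary is
exact and no floating point rounding is involved.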
It should make the system much better behaved and generally more
performant under sustained write stress.
If you are the owner of an unlucky workload that ends up slowed by the
added throttling, do consider the other benefits that you get from the
new code. If that does not compensate for the loss, get in touch and
tell us what your needs are on that front.