A very significant improvement is coming soon to ZFS: a change that
will increase the general quality of service delivered by ZFS.
Interestingly, it is a change that might also slow down your
microbenchmark, but it is nevertheless a change you should be eager for.
Write throttling
For a filesystem, write throttling designates the act of blocking
an application for some amount of time, as short as possible, while waiting for
the proper conditions to allow the write system calls to succeed.
Write throttling is normally required because applications can write
to memory (dirty memory pages) at a rate significantly faster than the
kernel can flush the data to disk. Many workloads dirty memory pages
by writing to the filesystem page cache at near memory copy speed,
possibly using multiple threads issuing high rates of filesystem
writes. Concurrently, the filesystem is doing its best to drain all
that data to the disk subsystem.
Given the constraints, the time to empty the filesystem cache to disk
can be longer than the time required for applications to dirty the
cache. Even if one considers storage with fast NVRAM, under sustained
load, that NVRAM will fill up to a point where it needs to wait for a
slow disk I/O to make room for more data to get in.
When committing data to a filesystem in bursts, it can be quite
desirable to push the data at memory speed and then drain the cache to
disk during the lapses between bursts. But when data is generated at
a sustained high rate, lack of throttling leads to total memory
depletion. We thus need at some point to try and match the application
data rate with that of the I/O subsystem. This is the primary goal of
write throttling.
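To put rough numbers on the depletion argument, here is a minimal sketch. The function name and the figures in the usage lines are made up for illustration; they are not measurements from any system.

```python
def seconds_until_memory_full(free_bytes, dirty_rate, drain_rate):
    """Return the seconds until free memory is exhausted, or None if never.

    free_bytes -- memory available for dirty data (hypothetical figure)
    dirty_rate -- bytes/second at which applications dirty memory
    drain_rate -- bytes/second at which the I/O subsystem absorbs data
    """
    net_rate = dirty_rate - drain_rate
    if net_rate <= 0:
        # The disks keep up, so the cache never fills.
        return None
    return free_bytes / net_rate


GIB = 2 ** 30
# Hypothetical box: 16 GiB free, writers at 2 GiB/s, disks at 0.5 GiB/s.
print(seconds_until_memory_full(16 * GIB, 2 * GIB, 0.5 * GIB))
# Same box with a workload the disks can absorb.
print(seconds_until_memory_full(16 * GIB, 0.25 * GIB, 0.5 * GIB))
```

With a sustained gap between the two rates, memory runs out in seconds rather than minutes, which is why bursts are harmless and sustained load is not.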
A secondary goal of write throttling is to prevent massive data loss.
When applications do not manage I/O synchronization (i.e., don't use
O_DSYNC and fsync), data ends up cached in the filesystem and the
contract is that there is no guarantee that the data will still be
there if a system crash were to occur. So even if the filesystem
cannot be blamed for such data loss, it is still a nice feature to
help prevent such massive losses.
Case in point: UFS write throttling
For instance UFS would use the fsflush daemon to try to keep data
exposed for no more than 30 seconds (default value of autoup). Also,
UFS would keep track of the amount of I/O outstanding for each
file. Once too much I/O was pending, UFS would throttle writers for
that file. This was controlled through ufs_HW and ufs_LW, whose
values were commonly tuned (a bad sign). Eventually the old default
values were updated, and they seem to work nicely today. UFS write
throttling thus operates on a per-file basis. While there are some
merits to this approach, it can be defeated as it does not manage the
imbalance between memory and disks at a system level.
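As a rough illustration of per-file throttling with a high and a low water mark, here is a minimal sketch. It is not the UFS code: the class and method names are invented, and the only thing borrowed from UFS is the idea that ufs_HW and ufs_LW act as the two thresholds.

```python
class FileWriteThrottle:
    """Per-file write throttle with high/low water marks (illustrative only)."""

    def __init__(self, high_water, low_water):
        if low_water >= high_water:
            raise ValueError("low_water must be below high_water")
        self.high_water = high_water  # plays the role of ufs_HW
        self.low_water = low_water    # plays the role of ufs_LW
        self.outstanding = 0          # bytes issued but not yet on disk
        self.throttled = False

    def write_issued(self, nbytes):
        """Account for a new write; return True if writers must now block."""
        self.outstanding += nbytes
        if self.outstanding > self.high_water:
            self.throttled = True
        return self.throttled

    def io_completed(self, nbytes):
        """Account for completed I/O; return True if writers stay blocked."""
        self.outstanding = max(0, self.outstanding - nbytes)
        if self.outstanding <= self.low_water:
            self.throttled = False
        return self.throttled
```

Note that the state lives in one object per file. Nothing here looks at total memory or total disk bandwidth, which is the system-level weakness described above: many files can each stay under their own limit while together they overwhelm the machine.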
ZFS Previous write throttling
ZFS is designed around the concept of transaction groups (txg).
Normally, every 5 seconds an _open_ txg goes to the quiesced
state. From that state the quiesced txg will go to the syncing state
which sends dirty data to the I/O subsystem. For each pool, there is
at most one txg in each of the 3 states: open, quiescing, syncing. Write
throttling used to occur when the 5 second txg clock would fire while
the syncing txg had not yet completed. The open group would wait on
the quiesced one which waits on the syncing one. Application writers
(write system call) would block, possibly a few seconds, waiting for a
txg to open. In other words, if a txg took more than 5 seconds to
sync to disk, we would globally block writers thus matching their
speed with that of the I/O. But if a workload had a bursty write
behavior that could be synced during the allotted 5 seconds,
applications would never be throttled.
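The old behavior can be approximated with a very small model. This is a sketch under a simplifying assumption of ours: quiescing is treated as instantaneous, so writers only wait for the part of a sync that overruns the 5 second clock. The function name is hypothetical.

```python
TXG_INTERVAL_SECONDS = 5.0  # the txg clock period mentioned in the text


def writer_stall_seconds(sync_seconds, interval=TXG_INTERVAL_SECONDS):
    """Return how long writers block waiting for a txg to open.

    Simplified model: the open txg cannot move on until the syncing txg
    is done, so the stall is whatever part of the sync exceeds the clock.
    """
    if sync_seconds < 0:
        raise ValueError("sync_seconds must be non-negative")
    return max(0.0, sync_seconds - interval)


for sync in (2.0, 5.0, 12.0):
    stall = writer_stall_seconds(sync)
    print(f"sync took {sync:4.1f}s -> writers stalled {stall:4.1f}s")
```

The model shows the all-or-nothing nature of the old throttle: a sync that fits in the interval costs writers nothing, and one that does not fit stalls every writer at once.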
The Issue
But ZFS did not sufficiently control the amount of data that could
get in an open txg. As long as the ARC cache was no more than half
dirty, ZFS would accept data. For a large memory machine or one with
weak storage, this was likely to cause long txg sync times. The
downsides were many:
- if we did end up throttled, long sync times meant the system
behavior would be sluggish for seconds at a time.
- long txg sync times also meant that our granularity at which
we could generate snapshots would be impacted.
- we ended up with lots of pending data in the cache all of
which could be lost in the event of a crash.
- the ZFS I/O scheduler which prioritizes operations was also
negatively impacted.
- by not throttling we had the possibility that
sequential writes on large files could displace from the ARC
a very large number of smaller objects. Refilling
that data meant a very large number of disk I/Os.
Not throttling can paradoxically end up being very
costly for performance.
- the previous code could also, at times, issue no I/Os
to disk for seconds even though the workload was
critically dependent on storage speed.
- And foremost, lack of throttling depleted memory and prevented
ZFS from reacting to memory pressure.
That ZFS is considered a memory hog is most likely the result of
the previous throttling code. Once a proper solution is in place, it will
be interesting to see if we behave better on that front.
The Solutions
The new code keeps track of the amount of data accepted in a TXG and
the time it takes to sync. It dynamically adjusts that amount so that
each TXG sync takes about 5 seconds (txg_time variable). It also
clamps the limit to no more than 1/8th of physical memory.
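One way to picture the adjustment is the sketch below. It is an assumption about the shape of the calculation, not the actual ZFS code: it sizes the next TXG from the throughput observed during the last sync, then applies the 1/8th-of-memory clamp. The function and parameter names are ours.

```python
def next_txg_limit(bytes_synced, sync_seconds, physmem_bytes, target_seconds=5.0):
    """Estimate how many bytes the next TXG may accept.

    bytes_synced   -- data written out by the last TXG sync
    sync_seconds   -- how long that sync took
    physmem_bytes  -- physical memory on the machine
    target_seconds -- desired sync time (5 seconds in the text)
    """
    if sync_seconds <= 0:
        raise ValueError("sync_seconds must be positive")
    observed_rate = bytes_synced / sync_seconds    # bytes per second to disk
    wanted = int(observed_rate * target_seconds)   # what fits in the target time
    ceiling = physmem_bytes // 8                   # never more than 1/8th of memory
    return min(wanted, ceiling)


GB = 10 ** 9
# Hypothetical slow storage: 1 GB took 10 s, so aim for about half of that.
print(next_txg_limit(1 * GB, 10.0, 64 * GB))
# Hypothetical fast storage on a small machine: the memory clamp wins.
print(next_txg_limit(4 * GB, 1.0, 8 * GB))
```

A real implementation would likely smooth the estimate over several syncs rather than trusting a single sample.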
And to avoid the system-wide, seconds-long throttle effect, the new
code will detect when we are dangerously close to that situation
(7/8th of the limit) and will insert 1-tick delays for applications
issuing writes. This prevents a write-intensive thread from hogging
the available space and starving out other threads. This delay should
also generally prevent the system-wide throttle.
So the new steady state behavior of write intensive workloads is that,
starting with an empty TXG, all threads will be allowed to dirty
memory at full speed until a first threshold of bytes in the TXG is
reached. At that time, every write system call will be delayed by 1
tick thus significantly slowing down the pace of writes. If the
previous TXG completes its I/Os, then the current TXG will be
allowed to resume at full speed. But in the unlikely event that a
workload, despite the per write 1-tick delay, manages to fill up the
TXG up to the full threshold we will be forced to throttle all writes
in order to allow the storage to catch up.
It should make the system much better behaved and generally more
performant under sustained write stress.
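The three regimes described above can be summarized in a small decision function. This is a sketch, not the ZFS implementation: the names are invented, and what happens at the full limit when no sync is in progress is simplified to a plain block.

```python
from enum import Enum


class WriteAction(Enum):
    FULL_SPEED = "full speed"
    DELAY_ONE_TICK = "delay one tick"
    BLOCK = "block until storage catches up"


def write_action(txg_bytes, limit_bytes, previous_txg_syncing=True):
    """Decide how to treat a write given how full the open TXG is.

    txg_bytes            -- bytes already accepted into the open TXG
    limit_bytes          -- current per-TXG limit
    previous_txg_syncing -- True while the previous TXG is still doing I/O
    """
    if txg_bytes >= limit_bytes:
        # Full threshold reached: all writers wait for the storage.
        return WriteAction.BLOCK
    if previous_txg_syncing and txg_bytes * 8 >= limit_bytes * 7:
        # Past 7/8th of the limit: slow each write down by one tick.
        return WriteAction.DELAY_ONE_TICK
    return WriteAction.FULL_SPEED


limit = 800
for filled in (100, 700, 800):
    print(filled, write_action(filled, limit).value)
# Once the previous TXG has completed its I/Os, writers resume at full speed.
print(700, write_action(700, limit, previous_txg_syncing=False).value)
```

The integer comparison `txg_bytes * 8 >= limit_bytes * 7` avoids floating point rounding at the 7/8th boundary.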
If you are the owner of an unlucky workload that ends up slowed by more
throttling, do consider the other benefits that you get from the new
code. If that does not compensate for the loss, get in touch and tell
us what your needs are on that front.
It sounds great, though I hope that the 1/8 number is tunable.
Posted by Marc on May 14, 2008 at 03:19 PM MEST #
yep. There are tunables for the target sync times,
the ratio of memory that can be used, an override of that value and we can turn off the whole thing if necessary.
Posted by Roch on May 14, 2008 at 03:59 PM MEST #
For our situation we do the following and look at write throughput-
create a little sybase database and pick an i/o size (we use 16k i/o). Then start adding devices to the database- we use 32GB devices which usually blows through any write cache a storage array has... once a dba does an alter database to add the storage and we see how the storage system handles i/o.
Posted by dean ross-smith on May 14, 2008 at 04:01 PM MEST #
Excellent news. I came across this problem in Fall 2006 and have been waiting for a solution. Any idea when we can expect it in the Solaris (OpenSolaris won't do for us) release?
Posted by Anantha on May 14, 2008 at 05:49 PM MEST #
Good news. As high-speed disks and arrays continue to develop quickly, the 1/8 number might need to be tuned. I also hope to see another variable to disable it. On the other hand, a ZFS pool might hold several kinds of storage (from legacy to the latest) in the same pool. Doesn't this throttling hurt ZFS?
Posted by Bonghwan Kim on May 14, 2008 at 06:59 PM MEST #
Which build do you expect this to integrate into Nevada?
Also, is the timer that decides when to sync to disk (as distinct from the timer checking how long a sync takes to complete) now a tunable - the 5 second timer. Or does this now dynamically adjust?
Thanks
Andrew.
Posted by andrewk8 on May 14, 2008 at 09:34 PM MEST #
Bonghwan: my take is that the circumstances where this will hurt will be very, very specific, while the benefits will be felt much more generally. But yes, there will be tunables to control how this all works.
Andrewk8: there is a timer that decides when to sync. And there is code to measure how long a sync takes, which I would not call a timer. At low load, a txg will be cut every 30 seconds (tunable). Under stress, a txg will be cut every time ZFS estimates that the sync time will match the target time of 5 seconds (tunable).
Posted by Roch on May 14, 2008 at 11:11 PM MEST #
Still, write throttling implemented at the file system level
cannot address "the imbalance between memory and disks at
a system level" fully, because there are other sources of dirty
pages, like anonymous pages. It seems that the only way to
handle this correctly is to put generic throttling mechanism in
VM, isn't it?
Posted by nikita on May 15, 2008 at 11:26 AM MEST #
Excellent writeup!
Is there a PSARC case or BugID associated that I can track progress with?
Posted by benr on May 15, 2008 at 12:21 PM MEST #
Nikita: unless I misunderstand what you mean, the VM already has its own throttling mechanism for dealing with too much demand and memory shortfall. A FS would like to avoid being the cause of VM shortfall, and so this is about taking measures to avoid that.
Benr: the bug IDs are 6429205, 6415647
Posted by Roch on May 16, 2008 at 03:30 PM MEST #
Roch, I looked up the bug id's and they don't seem to indicate what build of Nevada these new features will be available in. IHAC with the exact problem you describe using Nevada today and seeking help. Thanks, Tim.
Posted by Tim Thomas on June 12, 2008 at 12:50 PM MEST #
Ref my previous comment. It looks like these changes are in snv b87 and later.
Posted by Tim Thomas on June 12, 2008 at 04:08 PM MEST #
How is the 1 tick delay implemented?
Posted by tildeleb on July 17, 2008 at 12:59 AM MEST #