The new ZFS write throttle feature, which was integrated in Nevada build 87, specifically addresses write-intensive workloads. Today, we take a closer look at the write throttle in action. Our test system is a Sun Fire X4500 running Nevada build 94 with a single ZFS pool of 42 striped disks.
blog@x4500> zpool list
NAME   SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
h     19.0T   620K  19.0T   0%  ONLINE  -
The zfs_write_throttle.d DTrace script is used to observe the write throttle. In the first test, we generate write I/O load using a couple of “dd if=/dev/zero of=/h/<file> bs=1024k” commands. Here's an extract of the script output:
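For readers who want to reproduce the load, here is a minimal sketch of starting a few parallel dd writers. The function name, the file names and the count argument are assumptions made for illustration; the dd commands used in the test had no count and simply ran until stopped.

```shell
#!/bin/sh
# Sketch only: start N parallel sequential writers with dd.
# start_writers, file<i> and the count argument are illustrative names,
# not part of the original test setup.
start_writers() {
    dir=$1      # target directory, e.g. the pool mount point /h
    n=$2        # number of parallel dd processes
    count=$3    # number of 1 MB blocks each writer produces
    i=1
    while [ "$i" -le "$n" ]; do
        dd if=/dev/zero of="$dir/file$i" bs=1024k count="$count" 2>/dev/null &
        i=$((i + 1))
    done
    wait        # block until every writer has finished
}

# Example (commented out so that sourcing this file writes nothing):
# start_writers /h 2 100000
```

Bounding each writer with a count keeps the run finite; leave enough blocks for the DTrace script to capture several sync cycles.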
--- 2008 Jul 28 14:04:17
Sync rate (/s)
  h 1
MB/s
  h 1540
Delays/s
  h 47

h Sync time (ms)
           value  ------------- Distribution ------------- count
              80 |                                         0
             100 |@@@@@@@@@@@                              3
             120 |@@@@                                     1
      ...snip...
             260 |@@@@                                     1
      ...snip...
             580 |@@@@                                     1
      ...snip...
             780 |@@@@                                     1
      ...snip...
            1320 |@@@@@@@                                  2
            1340 |                                         0
            1360 |@@@@                                     1
      ...snip...
            1520 |@@@@                                     1
            1540 |                                         0

h Written (MB)
           value  ------------- Distribution ------------- count
           < 200 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        9
      ...snip...
            3000 |@@@@                                     1
      ...snip...
         >= 4000 |@@@@                                     1

h Write limit (MB)
           value  ------------- Distribution ------------- count
            7750 |                                         0
         >= 8000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 11
The output has been shortened for clarity. With the default settings in place, one can observe that the time for synchronizing data to disk varies widely [range 100 ms to 1540 ms], with several syncs taking well over a second (please refer to the Sync time distribution).
In a second test, we reduce the target time for synchronizing data to disk from five seconds (the default) to one second, using the zfs_txg_synctime variable. Here's another extract of the script output:
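One way to change this variable on a running system is with mdb in kernel write mode. The sketch below is not necessarily how the test was done: it assumes zfs_txg_synctime is a 32-bit integer (hence the /W write format), that mdb is available, and that you are root. The helper name and the SYNCTIME and APPLY variables are made up for this example, and the script only prints the command unless APPLY=1 is set.

```shell
#!/bin/sh
# Sketch only: set the ZFS target sync time on a live kernel via mdb.
# Assumption: zfs_txg_synctime is a 32-bit int, so /W (4-byte write) is used.
SYNCTIME=${SYNCTIME:-1}     # new target sync time in seconds

synctime_expr() {
    # Build the mdb expression; 0t marks the value as decimal.
    printf 'zfs_txg_synctime/W0t%s\n' "$1"
}

if [ "${APPLY:-0}" = "1" ]; then
    # Writes to kernel memory; takes effect immediately, lost at reboot.
    synctime_expr "$SYNCTIME" | mdb -kw
else
    # Dry run: show what would be executed.
    echo "would run: echo $(synctime_expr "$SYNCTIME") | mdb -kw"
fi
```

The current value can be read back with echo zfs_txg_synctime/D | mdb -k. A change made this way does not survive a reboot.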
--- 2008 Jul 28 14:08:27
Sync rate (/s)
  h 1
MB/s
  h 1681
Delays/s
  h 56

h Sync time (ms)
           value  ------------- Distribution ------------- count
             340 |                                         0
             360 |@@@                                      1
      ...snip...
             460 |@@@                                      1
             480 |                                         0
             500 |@@@                                      1
      ...snip...
             600 |@@@                                      1
      ...snip...
             660 |@@@                                      1
      ...snip...
             740 |@@@                                      1
             760 |@@@                                      1
             780 |@@@                                      1
             800 |                                         0
             820 |@@@                                      1
             840 |@@@@@@                                   2
             860 |@@@@@@                                   2
      ...snip...
            1040 |@@@                                      1
            1060 |                                         0

h Written (MB)
           value  ------------- Distribution ------------- count
           < 200 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            10
      ...snip...
            2400 |@@@                                      1
            2600 |                                         0
            2800 |@@@                                      1
      ...snip...
         >= 4000 |@@@@@@                                   2

h Write limit (MB)
           value  ------------- Distribution ------------- count
            2500 |                                         0
            2750 |@@@@@@                                   2
      ...snip...
            4750 |@@@                                      1
            5000 |@@@@@@                                   2
            5250 |                                         0
            5500 |@@@@@@@@@@@                              4
      ...snip...
            6500 |@@@                                      1
            6750 |                                         0
            7000 |@@@@@@@@@                                3
      ...snip...
         >= 8000 |@@@                                      1
Two things can be seen when comparing with the first test:
a) the time for synchronizing data to disk is now more tightly bounded [range 360 ms to 1060 ms], and the longest syncs have dropped from about 1.5 seconds to about one second.
b) the pool “write limit” mark moved around over time (please refer to the Write limit distribution), thus dynamically throttling the incoming application write rate to match the available I/O bandwidth.
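As a rough cross-check of this comparison, the two Sync time histograms above can be tallied with a few lines of awk. The bucket values and counts below are transcribed by hand from the two extracts; the function name is invented for this sketch, and because only bucket lower bounds are known the figures are approximate.

```shell
#!/bin/sh
# Sketch only: tally the Sync time histograms of the first two tests.
# Each line is "<bucket lower bound in ms> <count>", copied from the output above.
test1="100 3
120 1
260 1
580 1
780 1
1320 2
1360 1
1520 1"

test2="360 1
460 1
500 1
600 1
660 1
740 1
760 1
780 1
820 1
840 2
860 2
1040 1"

summarize() {
    # Prints: total syncs, syncs in buckets at or above 1000 ms,
    # and the highest populated bucket (ms).
    awk '{ n += $2; if ($1 >= 1000) slow += $2; if ($1 > max) max = $1 }
         END { printf "%d %d %d\n", n, slow, max }'
}

echo "$test1" | summarize   # prints "11 4 1520"
echo "$test2" | summarize   # prints "14 1 1040"
```

With the default setting, 4 of the 11 sampled syncs land in buckets of one second or more; with the one-second target, only 1 of 14 does.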
More parameters are available for tuning (please see the source code), but as usual, use them with caution. To wrap up, here's one last output extract where the parameter zfs_write_limit_override was set to 800 MB. Setting this parameter forces the write limit to the specified value. This can be beneficial for applications that generate a continuous, well-paced write stream but are sensitive to write delays.
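A sketch of how such an override might be applied with mdb follows. It rests on several assumptions: that the parameter is expressed in bytes, that "800 MB" means 800 x 1024 x 1024 bytes, and that the variable is a 64-bit integer (hence the /Z write format). The helper names and the LIMIT_MB and APPLY variables are invented for this example, and nothing is written to the kernel unless APPLY=1 is set.

```shell
#!/bin/sh
# Sketch only: force the ZFS write limit to a fixed value via mdb.
# Assumptions: the tunable is in bytes and is a 64-bit integer (/Z = 8-byte write).
LIMIT_MB=${LIMIT_MB:-800}

mb_to_bytes() {
    # 800 MB -> 838860800 bytes
    echo $(( $1 * 1024 * 1024 ))
}

override_expr() {
    # Build the mdb expression; 0t marks the value as decimal.
    printf 'zfs_write_limit_override/Z0t%s\n' "$(mb_to_bytes "$1")"
}

if [ "${APPLY:-0}" = "1" ]; then
    # Writes to kernel memory; use with caution.
    override_expr "$LIMIT_MB" | mdb -kw
else
    # Dry run: show what would be executed.
    echo "would run: echo $(override_expr "$LIMIT_MB") | mdb -kw"
fi
```

As far as I can tell from the source, writing 0 back to the variable hands control to the dynamic write limit again, but please verify this against the code for your build.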
--- 2008 Jul 28 14:54:49
Sync rate (/s)
  h 4
MB/s
  h 677
Delays/s
  h 1

h Sync time (ms)
           value  ------------- Distribution ------------- count
             120 |                                         0
             140 |@@@@@@                                   6
             160 |@@@@@@@@@@@@@@@                          15
             180 |@@@@@@@@@@@@@@@                          15
             200 |@@@@@                                    5
             220 |                                         0

h Written (MB)
           value  ------------- Distribution ------------- count
           < 200 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           31
             200 |                                         0
             400 |@@@@@                                    5
             600 |                                         0
             800 |@@@@@                                    5
            1000 |                                         0

h Write limit (MB)
           value  ------------- Distribution ------------- count
            1250 |                                         0
            1500 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 41
            1750 |                                         0
Hopefully, you have enjoyed these little observations!