|
vdev cache and ZFS database performance
A little while back
Neel
did some nice evaluating of DB on ZFS performance. One issue that he correctly noted was: 6437054 vdev_cache wises up: increase DB performance by 16%.
The major issue for DB performance was that the vdev cache would inflate reads under 16KB (zfs_vdev_cache_max) to 64KB (1 << zfs_vdev_cache_bshift).
As you can guess, this inflating would really hurt typical databases as they do lots of record aligned random I/Os - and the random I/Os are typically under 16KB (Oracle and Postgress are usually configured with 8KB, JavaDB with 4KB, etc.). So why do we have this inflation in the first place? Turns out its really important for pre-fetching metadata. One workload that demonstrates this is the mult-stream sequnetial read workload for
FileBench. We can also use the oltp workload of FileBench to test database performance.
What we changed in order to fix 6437054 was to make the vdev cache
only inflate I/Os for *metadata* - not *user* data. You can now see that logic in vdev_cache_read(). This logically makes sense, as we can now rely on zfetch to correctly pre-fetch user data (which depends more on what the application is doing), and the vdev cache to pre-fetch metadata (which depends more on where it was located on disk).
Ok, yeah, theory is nice, but let's see some measurements...
OLTP workload
Below are the results from using this profile (named 'db.prof'). This was on a thumper, non-debug bits, ZFS configured in a 46 disk RAID-0, and the recordsize set to 8KB.
OpenSolaris results without the fix for 6437054 (onnv-gate:2007-07-11)
diskmonster# filebench db
parsing profile for config: large_db_oltp_8k
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s/large_db_oltp_8k/thisrun.f
1698: 0.013: OLTP Version 1.16 2005/06/21 21:18:52 personality successfully loaded
1698: 0.014: Creating/pre-allocating files
1698: 0.014: Fileset logfile: 1 files, avg dir = 1024.0, avg depth = 0.0, mbytes=10
1698: 0.014: Creating fileset logfile...
1698: 0.118: Preallocated 1 of 1 of fileset logfile in 1 seconds
1698: 0.118: Fileset datafiles: 10 files, avg dir = 1024.0, avg depth = 0.3, mbytes=102400
1698: 0.118: Creating fileset datafiles...
1698: 341.433: Preallocated 10 of 10 of fileset datafiles in 342 seconds
1698: 341.434: Starting 200 shadow instances
...
1698: 345.768: Running '/usr/lib/filebench/scripts/fs_flush zfs'
1698: 345.774: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s/large_db_oltp_8k
1698: 345.776: Running...
1698: 466.858: Run took 120 seconds...
1698: 466.913: Per-Operation Breakdown
random-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
shadow-post-dbwr 4237ops/s 0.0mb/s 0.5ms/op 37us/op-cpu
shadow-post-lg 4237ops/s 0.0mb/s 0.0ms/op 7us/op-cpu
shadowhog 4237ops/s 0.0mb/s 0.6ms/op 257us/op-cpu
shadowread 4255ops/s 33.2mb/s 45.7ms/op 67us/op-cpu
dbwr-aiowait 42ops/s 0.0mb/s 0.6ms/op 83us/op-cpu
dbwr-block 42ops/s 0.0mb/s 215.7ms/op 2259us/op-cpu
dbwr-hog 42ops/s 0.0mb/s 0.0ms/op 15us/op-cpu
dbwrite-a 4244ops/s 33.2mb/s 0.1ms/op 9us/op-cpu
lg-block 1ops/s 0.0mb/s 745.3ms/op 6225us/op-cpu
lg-aiowait 1ops/s 0.0mb/s 8.0ms/op 25us/op-cpu
lg-write 1ops/s 0.3mb/s 0.3ms/op 26us/op-cpu
1698: 466.913:
IO Summary: 1034439 ops 8543.3 ops/s, (4255/4245 r/w) 66.7mb/s, 188us cpu/op, 45.7ms latency
1698: 466.913: Stats dump to file 'stats.large_db_oltp_8k.out'
1698: 466.913: in statsdump stats.large_db_oltp_8k.out
1698: 466.922: Shutting down processes
^C 1698: 601.568: Aborting...
1698: 601.568: Deleting ISM...
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s/large_db_oltp_8k/stats.large_db_oltp_8k.out
diskmonster#
OpenSolaris results with the fix for 6437054 (onnv-gate:2007-07-18)
diskmonster# filebench db
parsing profile for config: large_db_oltp_8k
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s/large_db_oltp_8k/thisrun.f
1083: 0.037: OLTP Version 1.16 2005/06/21 21:18:52 personality successfully loaded
1083: 0.037: Creating/pre-allocating files
1083: 0.057: Fileset logfile: 1 files, avg dir = 1024.0, avg depth = 0.0, mbytes=10
1083: 0.057: Creating fileset logfile...
1083: 0.194: Preallocated 1 of 1 of fileset logfile in 1 seconds
1083: 0.194: Fileset datafiles: 10 files, avg dir = 1024.0, avg depth = 0.3, mbytes=102400
1083: 0.194: Creating fileset datafiles...
1083: 335.203: Preallocated 10 of 10 of fileset datafiles in 336 seconds
1083: 335.203: Starting 200 shadow instances
...
1083: 339.484: Creating 221249536 bytes of ISM Shared Memory...
1083: 339.649: Allocated 221249536 bytes of ISM Shared Memory... at fffffd7f8f600000
1083: 339.650: Running '/usr/lib/filebench/scripts/fs_flush zfs'
1083: 339.725: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s/large_db_oltp_8k
1083: 339.729: Running...
1083: 460.683: Run took 120 seconds...
1083: 460.724: Per-Operation Breakdown
random-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
shadow-post-dbwr 5224ops/s 0.0mb/s 0.8ms/op 30us/op-cpu
shadow-post-lg 5224ops/s 0.0mb/s 0.1ms/op 6us/op-cpu
shadowhog 5223ops/s 0.0mb/s 0.7ms/op 255us/op-cpu
shadowread 5240ops/s 40.9mb/s 36.3ms/op 56us/op-cpu
dbwr-aiowait 52ops/s 0.0mb/s 0.9ms/op 86us/op-cpu
dbwr-block 52ops/s 0.0mb/s 167.6ms/op 1605us/op-cpu
dbwr-hog 52ops/s 0.0mb/s 0.0ms/op 15us/op-cpu
dbwrite-a 5226ops/s 40.8mb/s 0.2ms/op 9us/op-cpu
lg-block 2ops/s 0.0mb/s 600.5ms/op 4580us/op-cpu
lg-aiowait 2ops/s 0.0mb/s 4.0ms/op 23us/op-cpu
lg-write 2ops/s 0.4mb/s 1.2ms/op 24us/op-cpu
1083: 460.724:
IO Summary: 1272557 ops 10520.9 ops/s, (5240/5228 r/w) 82.2mb/s, 156us cpu/op, 36.3ms latency
1083: 460.724: Stats dump to file 'stats.large_db_oltp_8k.out'
1083: 460.724: in statsdump stats.large_db_oltp_8k.out
1083: 460.731: Shutting down processes
1083: 1060.645: Deleting ISM...
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s/large_db_oltp_8k/stats.large_db_oltp_8k.out
diskmonster#
10520.9 ops/s vs. 8543.3 ops/s, over 20%! That's a nice out of the box improvement!
Multi-Stream Sequential Read workload
A workaround previously mentioned to get better DB performance was to set 'zfs_vdev_cache_max' to 1B (which essentially disables the vdev cache as the random I/Os are not going to be lower than 1B). The problem with this approach is that it really hurst other workloads, such as the multi-stream sequential workload. Below are the results using the same thumper, non-debug bits, ZFS in a 46 disk RAID-0, checksums turned off, NCQ disabled via 'set sata:sata_func_enable = 0x5' in /etc/system', and using this profile (named 'sqread.prof').
OpenSolaris results with the fix for 6437054 (onnv-gate:2007-07-18), 'zfs_vdev_cache_max' left as its default value
diskmonster# filebench sqread
parsing profile for config: seqread1m
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s/seqread1m/thisrun.f
2135: 0.005: Multi Stream Read Version 1.9 2005/06/21 21:18:52 personality successfully loaded
2135: 0.005: Creating/pre-allocating files
2135: 55.235: Pre-allocated file /bigIO/largefile4
2135: 118.147: Pre-allocated file /bigIO/largefile3
2135: 184.602: Pre-allocated file /bigIO/largefile2
2135: 251.991: Pre-allocated file /bigIO/largefile1
2135: 263.341: Starting 1 seqread instances
2136: 264.348: Starting 1 seqread4 threads
2136: 264.348: Starting 1 seqread3 threads
2136: 264.348: Starting 1 seqread2 threads
2136: 264.348: Starting 1 seqread1 threads
2135: 267.358: Running '/usr/lib/filebench/scripts/fs_flush zfs'
2135: 267.362: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s/seqread1m
2135: 267.362: Running...
2135: 388.128: Run took 120 seconds...
2135: 388.130: Per-Operation Breakdown
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread4 469ops/s 468.7mb/s 2.1ms/op 1391us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread3 454ops/s 454.1mb/s 2.2ms/op 1412us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread2 444ops/s 443.8mb/s 2.2ms/op 1400us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread1 421ops/s 421.0mb/s 2.4ms/op 1414us/op-cpu
2135: 388.130:
IO Summary: 215878 ops 1787.6 ops/s, (1788/0 r/w) 1787.6mb/s, 1638us cpu/op, 2.2ms latency
2135: 388.130: Stats dump to file 'stats.seqread1m.out'
2135: 388.130: in statsdump stats.seqread1m.out
2135: 388.136: Shutting down processes
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s/seqread1m/stats.seqread1m.out
diskmonster#
OpenSolaris results with the fix for 6437054 (onnv-gate:2007-07-18), 'zfs_vdev_cache_max' set to 1 (disabled vdev cache)
diskmonster# ./do_sqread
cannot open 'bigIO': no such pool
parsing profile for config: seqread1m
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s/seqread1m/thisrun.f
4110: 0.005: Multi Stream Read Version 1.9 2005/06/21 21:18:52 personality successfully loaded
4110: 0.005: Creating/pre-allocating files
4110: 55.681: Pre-allocated file /bigIO/largefile4
4110: 119.324: Pre-allocated file /bigIO/largefile3
4110: 182.188: Pre-allocated file /bigIO/largefile2
4110: 245.260: Pre-allocated file /bigIO/largefile1
4110: 255.216: Starting 1 seqread instances
4113: 256.222: Starting 1 seqread4 threads
4113: 256.222: Starting 1 seqread3 threads
4113: 256.222: Starting 1 seqread2 threads
4113: 256.222: Starting 1 seqread1 threads
4110: 259.232: Running '/usr/lib/filebench/scripts/fs_flush zfs'
4110: 259.236: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s/seqread1m
4110: 259.236: Running...
4110: 380.112: Run took 120 seconds...
4110: 380.115: Per-Operation Breakdown
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread4 369ops/s 369.5mb/s 2.7ms/op 1034us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread3 375ops/s 375.2mb/s 2.7ms/op 1047us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread2 369ops/s 369.2mb/s 2.7ms/op 1042us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread1 296ops/s 296.2mb/s 3.4ms/op 1066us/op-cpu
4110: 380.115:
IO Summary: 170443 ops 1410.1 ops/s, (1410/0 r/w) 1410.1mb/s, 1325us cpu/op, 2.8ms latency
4110: 380.115: Stats dump to file 'stats.seqread1m.out'
4110: 380.115: in statsdump stats.seqread1m.out
4110: 380.121: Shutting down processes
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s/seqread1m/stats.seqread1m.out
diskmonster#
So by disabling the vdev cache, the throughput drops from 1787MB/s to 1410MB/s - a 25% regression. So disabling the vdev cache really hurts here. The nice thing is that with the fix for 6437054, we don't have to - and we get great DB performance too. My cake is tasty.
Future Work
Future work to increase DB on ZFS performance includes:
6457709 vdev_knob values should be determined dynamically
6471212 need reserved I/O scheduler slots to improve I/O latency of critical ops
ps: If you want the oltp workload of filebench to run correctly, you'll need this minor fix to 'flowop_library.c':
@@ -600,10 +600,11 @@
aiocb->aio_fildes = filedesc;
aiocb->aio_buf = threadflow->tf_mem + memoffset;
aiocb->aio_nbytes = *flowop->fo_iosize;
aiocb->aio_offset = offset;
+ aiocb->aio_reqprio = 0;
filebench_log(LOG_DEBUG_IMPL,
"aio fd=%d, bytes=%lld, offset=%lld",
filedesc, *flowop->fo_iosize, offset);
happy databasing!
(2007-07-18 17:56:47.0/2007-07-18 17:56:47.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/vdev_cache_improvements_to_help
|