| |
Eric Kustarz's Weblog
e-street
Thursday September 18, 2008 |
|
It's been a great eight years at Sun, but a new opportunity has piqued my interest. So as of today, i'm moving on.
Where to?
Well i'm going to be helping out some friends at a startup:
http://www.lumosity.com/
Best to team ZFS (including those outside of Sun).
eric
(2008-09-18 15:24:06.0/2008-09-18 15:24:06.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/better_things
|
|
|
Wednesday March 19, 2008 |
|
how dedupalicious is your pool?
WIth the putback of 6656655 zdb should be able to display blkptr signatures, we can now get the "signature" of the block pointers in a pool. To see an example, let's first put some content into an empty pool:
heavy# zpool create bigIO c0t0d0 c0t1d0
heavy# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
bigIO 928G 95.5K 928G 0% ONLINE -
heavy# mkfile 1m /bigIO/1m.txt
heavy# echo "dedup me" > /bigIO/ejk.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk2.txt
heavy# echo "no dedup" > /bigIO/nope.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk3.txt
Now lets run zdb with the new option "-S". We pass in "user:all", where "user" tells zdb that we only want user data blocks (as opposed to both user and metadata) and "all" tells zdb to print out all blocks (skipping any checksum algorithm strength comparisons).
heavy# zdb -L -S user:all bigIO
0 131072 1 ZFS plain file fletcher2 uncompressed 0:0:0:0
0 131072 1 ZFS plain file fletcher2 uncompressed 0:0:0:0
0 131072 1 ZFS plain file fletcher2 uncompressed 0:0:0:0
0 131072 1 ZFS plain file fletcher2 uncompressed 0:0:0:0
0 131072 1 ZFS plain file fletcher2 uncompressed 0:0:0:0
0 131072 1 ZFS plain file fletcher2 uncompressed 0:0:0:0
0 131072 1 ZFS plain file fletcher2 uncompressed 0:0:0:0
0 131072 1 ZFS plain file fletcher2 uncompressed 0:0:0:0
0 512 1 ZFS plain file fletcher2 uncompressed 656d207075646564:a:ada40e0eac8cac80:140
0 512 1 ZFS plain file fletcher2 uncompressed 656d207075646564:a:ada40e0eac8cac80:140
0 512 1 ZFS plain file fletcher2 uncompressed 7075646564206f6e:a:eac8cac840dedc0:140
0 512 1 ZFS plain file fletcher2 uncompressed 656d207075646564:a:ada40e0eac8cac80:140
heavy#
This displays the signature of each block pointer - where the columns are level, physical size, number of dvas, object type, checksum type, compression type, and finally the actual checksum of the block.
So this is interesting, but what could we do with this information? Well, one thing we could do, is to figure out how much your pool can take advantage of dedup. Let's assume that the dedup implementation does matching based on the actual checksum and any checksum algorithm is strong enough (in reality, we'd need sha256 or stronger). So starting with the above pool and using a simple perl script 'line_by_line_process.pl' (shown at the end of this blog), we find:
heavy# zdb -L -S user:all bigIO > /tmp/zdb_out.txt
heavy# sort -k 7 -t "`/bin/echo '\t'`" /tmp/zdb_out.txt > /tmp/zdb_out_sorted.txt
heavy# ./line_by_line_process.pl /tmp/zdb_out_sorted.txt
total PSIZE: 0t1050624
total unique PSIZE: 0t132096
total that can be duped: 0t918528
percent that can be duped 87.4269005847953%
heavy#
In our trivial case, we can see that we could get a huge win - 87% of the pool can be dedup'd!. Upon closer examination, we notice that mkfile writes out all zero blocks. If you had compression enabled, there won't be any actual blocks for this file. So let's look at a case where just the "ejk.txt" contents are getting dedup'd:
heavy# zpool destroy bigIO
heavy# zpool create bigIO c0t0d0 c0t1d0
heavy# dd if=/dev/random of=/bigIO/1m.txt bs=1024 count=5
5+0 records in
5+0 records out
heavy# echo "dedup me" > /bigIO/ejk.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk2.txt
heavy# echo "no dedup" > /bigIO/nope.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk3.txt
heavy# zdb -L -S user:all bigIO > /tmp/zdb_out.txt
heavy# sort -k 7 -t "`/bin/echo '\t'`" /tmp/zdb_out.txt > /tmp/zdb_out_sorted.txt
heavy# ./line_by_line_process.pl /tmp/zdb_out_sorted.txt
total PSIZE: 0t7168
total unique PSIZE: 0t6144
total that can be duped: 0t1024
percent that can be duped 14.2857142857143%
heavy#
Ok in this different setup we can see ~14% of capacity can actually be dedup'd - a nice savings on capacity.
So the question becomes - how dedupalicious is your pool?
ps: here is the simple perl script 'line_by_line_process.pl':
#!/usr/bin/perl
# Run this script as:
# % script
# total PSIZE
$totalps = 0;
# total unique PSIZE
$totalups = 0;
$last_cksum = -1;
$path = $ARGV[0];
open(FH, $path) or die "Can't open $!";
while (<>) {
my $line = $_;
($level, $psize, $ndvas, $type, $cksum_alg, $compress, $cksum) = split /\t/, $line, 7;
if ($cksum ne $last_cksum) {
$totalups += $psize;
}
$last_cksum = $cksum;
$totalps += $psize;
}
print "total PSIZE: 0t".$totalps."\n";
print "total unique PSIZE: 0t".$totalups."\n";
print "total that can be duped: 0t".($totalps - $totalups)."\n";
print "percent that can be duped ".($totalps - $totalups) / $totalps * 100 ."%\n";
(2008-03-19 18:32:53.0/2008-03-19 18:32:53.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/how_dedupalicious_is_your_pool
|
|
|
Tuesday September 11, 2007 |
|
NCQ sata_max_queue_depth tunable
Previously, i did some analysis on NCQ in OpenSolaris. It turned out that to get good multi-stream sequential read performance, you had to disable NCQ via the tunable 'sata_func_enable'. Disabling NCQ actually does two things: 1) sets the number of concurrent I/Os to 1 and 2) changes what you send down protocol wise.
Turns out, the first is all we really need to get good performance for the multi-stream sequential read case, and doing the second actually exposes a bug on certain firmware of certain disks. So i highly recommend the newly added 'sata_max_queue_depth' tunable instead of 'sata_func_enable'. As a reminder, put the following in /etc/system and reboot:
set sata:sata_max_queue_depth = 0x1
An admin command to allow you to do this on the fly without rebooting would be another step forward, but no official plans on that just yet.
Thanks Pawel!
(2007-09-24 07:06:50.0/2007-09-11 22:18:32.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/ncq_tunable
|
|
|
Wednesday August 01, 2007 |
|
how to get the spa_history's bonus buffer
Here's a quick debugging note on how to get the offsets of the pool's history log:
> ::spa
ADDR STATE NAME
fffffffec529d5c0 ACTIVE d
> fffffffec529d5c0::print spa_t spa_history
spa_history = 0xd
> ::dbufs -n mos -o 0xd -b bonus
0xffffffff39948990
> 0xffffffff39948990::print dmu_buf_t db_data | ::print spa_history_phys_t
{
sh_pool_create_len = 0x10c
sh_phys_max_off = 0x2000000
sh_bof = 0x10c
sh_eof = 0x30c
sh_records_lost = 0
}
>
(2007-08-01 08:53:09.0/2007-08-01 08:53:09.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/using_mdb_to_get_the
|
|
|
Wednesday July 18, 2007 |
|
vdev cache and ZFS database performance
A little while back
Neel
did some nice evaluating of DB on ZFS performance. One issue that he correctly noted was: 6437054 vdev_cache wises up: increase DB performance by 16%.
The major issue for DB performance was that the vdev cache would inflate reads under 16KB (zfs_vdev_cache_max) to 64KB (1 << zfs_vdev_cache_bshift).
As you can guess, this inflating would really hurt typical databases as they do lots of record aligned random I/Os - and the random I/Os are typically under 16KB (Oracle and Postgress are usually configured with 8KB, JavaDB with 4KB, etc.). So why do we have this inflation in the first place? Turns out its really important for pre-fetching metadata. One workload that demonstrates this is the mult-stream sequnetial read workload for
FileBench. We can also use the oltp workload of FileBench to test database performance.
What we changed in order to fix 6437054 was to make the vdev cache
only inflate I/Os for *metadata* - not *user* data. You can now see that logic in vdev_cache_read(). This logically makes sense, as we can now rely on zfetch to correctly pre-fetch user data (which depends more on what the application is doing), and the vdev cache to pre-fetch metadata (which depends more on where it was located on disk).
Ok, yeah, theory is nice, but let's see some measurements...
OLTP workload
Below are the results from using this profile (named 'db.prof'). This was on a thumper, non-debug bits, ZFS configured in a 46 disk RAID-0, and the recordsize set to 8KB.
OpenSolaris results without the fix for 6437054 (onnv-gate:2007-07-11)
diskmonster# filebench db
parsing profile for config: large_db_oltp_8k
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s/large_db_oltp_8k/thisrun.f
1698: 0.013: OLTP Version 1.16 2005/06/21 21:18:52 personality successfully loaded
1698: 0.014: Creating/pre-allocating files
1698: 0.014: Fileset logfile: 1 files, avg dir = 1024.0, avg depth = 0.0, mbytes=10
1698: 0.014: Creating fileset logfile...
1698: 0.118: Preallocated 1 of 1 of fileset logfile in 1 seconds
1698: 0.118: Fileset datafiles: 10 files, avg dir = 1024.0, avg depth = 0.3, mbytes=102400
1698: 0.118: Creating fileset datafiles...
1698: 341.433: Preallocated 10 of 10 of fileset datafiles in 342 seconds
1698: 341.434: Starting 200 shadow instances
...
1698: 345.768: Running '/usr/lib/filebench/scripts/fs_flush zfs'
1698: 345.774: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s/large_db_oltp_8k
1698: 345.776: Running...
1698: 466.858: Run took 120 seconds...
1698: 466.913: Per-Operation Breakdown
random-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
shadow-post-dbwr 4237ops/s 0.0mb/s 0.5ms/op 37us/op-cpu
shadow-post-lg 4237ops/s 0.0mb/s 0.0ms/op 7us/op-cpu
shadowhog 4237ops/s 0.0mb/s 0.6ms/op 257us/op-cpu
shadowread 4255ops/s 33.2mb/s 45.7ms/op 67us/op-cpu
dbwr-aiowait 42ops/s 0.0mb/s 0.6ms/op 83us/op-cpu
dbwr-block 42ops/s 0.0mb/s 215.7ms/op 2259us/op-cpu
dbwr-hog 42ops/s 0.0mb/s 0.0ms/op 15us/op-cpu
dbwrite-a 4244ops/s 33.2mb/s 0.1ms/op 9us/op-cpu
lg-block 1ops/s 0.0mb/s 745.3ms/op 6225us/op-cpu
lg-aiowait 1ops/s 0.0mb/s 8.0ms/op 25us/op-cpu
lg-write 1ops/s 0.3mb/s 0.3ms/op 26us/op-cpu
1698: 466.913:
IO Summary: 1034439 ops 8543.3 ops/s, (4255/4245 r/w) 66.7mb/s, 188us cpu/op, 45.7ms latency
1698: 466.913: Stats dump to file 'stats.large_db_oltp_8k.out'
1698: 466.913: in statsdump stats.large_db_oltp_8k.out
1698: 466.922: Shutting down processes
^C 1698: 601.568: Aborting...
1698: 601.568: Deleting ISM...
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s/large_db_oltp_8k/stats.large_db_oltp_8k.out
diskmonster#
OpenSolaris results with the fix for 6437054 (onnv-gate:2007-07-18)
diskmonster# filebench db
parsing profile for config: large_db_oltp_8k
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s/large_db_oltp_8k/thisrun.f
1083: 0.037: OLTP Version 1.16 2005/06/21 21:18:52 personality successfully loaded
1083: 0.037: Creating/pre-allocating files
1083: 0.057: Fileset logfile: 1 files, avg dir = 1024.0, avg depth = 0.0, mbytes=10
1083: 0.057: Creating fileset logfile...
1083: 0.194: Preallocated 1 of 1 of fileset logfile in 1 seconds
1083: 0.194: Fileset datafiles: 10 files, avg dir = 1024.0, avg depth = 0.3, mbytes=102400
1083: 0.194: Creating fileset datafiles...
1083: 335.203: Preallocated 10 of 10 of fileset datafiles in 336 seconds
1083: 335.203: Starting 200 shadow instances
...
1083: 339.484: Creating 221249536 bytes of ISM Shared Memory...
1083: 339.649: Allocated 221249536 bytes of ISM Shared Memory... at fffffd7f8f600000
1083: 339.650: Running '/usr/lib/filebench/scripts/fs_flush zfs'
1083: 339.725: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s/large_db_oltp_8k
1083: 339.729: Running...
1083: 460.683: Run took 120 seconds...
1083: 460.724: Per-Operation Breakdown
random-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
shadow-post-dbwr 5224ops/s 0.0mb/s 0.8ms/op 30us/op-cpu
shadow-post-lg 5224ops/s 0.0mb/s 0.1ms/op 6us/op-cpu
shadowhog 5223ops/s 0.0mb/s 0.7ms/op 255us/op-cpu
shadowread 5240ops/s 40.9mb/s 36.3ms/op 56us/op-cpu
dbwr-aiowait 52ops/s 0.0mb/s 0.9ms/op 86us/op-cpu
dbwr-block 52ops/s 0.0mb/s 167.6ms/op 1605us/op-cpu
dbwr-hog 52ops/s 0.0mb/s 0.0ms/op 15us/op-cpu
dbwrite-a 5226ops/s 40.8mb/s 0.2ms/op 9us/op-cpu
lg-block 2ops/s 0.0mb/s 600.5ms/op 4580us/op-cpu
lg-aiowait 2ops/s 0.0mb/s 4.0ms/op 23us/op-cpu
lg-write 2ops/s 0.4mb/s 1.2ms/op 24us/op-cpu
1083: 460.724:
IO Summary: 1272557 ops 10520.9 ops/s, (5240/5228 r/w) 82.2mb/s, 156us cpu/op, 36.3ms latency
1083: 460.724: Stats dump to file 'stats.large_db_oltp_8k.out'
1083: 460.724: in statsdump stats.large_db_oltp_8k.out
1083: 460.731: Shutting down processes
1083: 1060.645: Deleting ISM...
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s/large_db_oltp_8k/stats.large_db_oltp_8k.out
diskmonster#
10520.9 ops/s vs. 8543.3 ops/s, over 20%! That's a nice out of the box improvement!
Multi-Stream Sequential Read workload
A workaround previously mentioned to get better DB performance was to set 'zfs_vdev_cache_max' to 1B (which essentially disables the vdev cache as the random I/Os are not going to be lower than 1B). The problem with this approach is that it really hurst other workloads, such as the multi-stream sequential workload. Below are the results using the same thumper, non-debug bits, ZFS in a 46 disk RAID-0, checksums turned off, NCQ disabled via 'set sata:sata_func_enable = 0x5' in /etc/system', and using this profile (named 'sqread.prof').
OpenSolaris results with the fix for 6437054 (onnv-gate:2007-07-18), 'zfs_vdev_cache_max' left as its default value
diskmonster# filebench sqread
parsing profile for config: seqread1m
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s/seqread1m/thisrun.f
2135: 0.005: Multi Stream Read Version 1.9 2005/06/21 21:18:52 personality successfully loaded
2135: 0.005: Creating/pre-allocating files
2135: 55.235: Pre-allocated file /bigIO/largefile4
2135: 118.147: Pre-allocated file /bigIO/largefile3
2135: 184.602: Pre-allocated file /bigIO/largefile2
2135: 251.991: Pre-allocated file /bigIO/largefile1
2135: 263.341: Starting 1 seqread instances
2136: 264.348: Starting 1 seqread4 threads
2136: 264.348: Starting 1 seqread3 threads
2136: 264.348: Starting 1 seqread2 threads
2136: 264.348: Starting 1 seqread1 threads
2135: 267.358: Running '/usr/lib/filebench/scripts/fs_flush zfs'
2135: 267.362: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s/seqread1m
2135: 267.362: Running...
2135: 388.128: Run took 120 seconds...
2135: 388.130: Per-Operation Breakdown
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread4 469ops/s 468.7mb/s 2.1ms/op 1391us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread3 454ops/s 454.1mb/s 2.2ms/op 1412us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread2 444ops/s 443.8mb/s 2.2ms/op 1400us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread1 421ops/s 421.0mb/s 2.4ms/op 1414us/op-cpu
2135: 388.130:
IO Summary: 215878 ops 1787.6 ops/s, (1788/0 r/w) 1787.6mb/s, 1638us cpu/op, 2.2ms latency
2135: 388.130: Stats dump to file 'stats.seqread1m.out'
2135: 388.130: in statsdump stats.seqread1m.out
2135: 388.136: Shutting down processes
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s/seqread1m/stats.seqread1m.out
diskmonster#
OpenSolaris results with the fix for 6437054 (onnv-gate:2007-07-18), 'zfs_vdev_cache_max' set to 1 (disabled vdev cache)
diskmonster# ./do_sqread
cannot open 'bigIO': no such pool
parsing profile for config: seqread1m
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s/seqread1m/thisrun.f
4110: 0.005: Multi Stream Read Version 1.9 2005/06/21 21:18:52 personality successfully loaded
4110: 0.005: Creating/pre-allocating files
4110: 55.681: Pre-allocated file /bigIO/largefile4
4110: 119.324: Pre-allocated file /bigIO/largefile3
4110: 182.188: Pre-allocated file /bigIO/largefile2
4110: 245.260: Pre-allocated file /bigIO/largefile1
4110: 255.216: Starting 1 seqread instances
4113: 256.222: Starting 1 seqread4 threads
4113: 256.222: Starting 1 seqread3 threads
4113: 256.222: Starting 1 seqread2 threads
4113: 256.222: Starting 1 seqread1 threads
4110: 259.232: Running '/usr/lib/filebench/scripts/fs_flush zfs'
4110: 259.236: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s/seqread1m
4110: 259.236: Running...
4110: 380.112: Run took 120 seconds...
4110: 380.115: Per-Operation Breakdown
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread4 369ops/s 369.5mb/s 2.7ms/op 1034us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread3 375ops/s 375.2mb/s 2.7ms/op 1047us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread2 369ops/s 369.2mb/s 2.7ms/op 1042us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread1 296ops/s 296.2mb/s 3.4ms/op 1066us/op-cpu
4110: 380.115:
IO Summary: 170443 ops 1410.1 ops/s, (1410/0 r/w) 1410.1mb/s, 1325us cpu/op, 2.8ms latency
4110: 380.115: Stats dump to file 'stats.seqread1m.out'
4110: 380.115: in statsdump stats.seqread1m.out
4110: 380.121: Shutting down processes
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s/seqread1m/stats.seqread1m.out
diskmonster#
So by disabling the vdev cache, the throughput drops from 1787MB/s to 1410MB/s - a 25% regression. So disabling the vdev cache really hurts here. The nice thing is that with the fix for 6437054, we don't have to - and we get great DB performance too. My cake is tasty.
Future Work
Future work to increase DB on ZFS performance includes:
6457709 vdev_knob values should be determined dynamically
6471212 need reserved I/O scheduler slots to improve I/O latency of critical ops
ps: If you want the oltp workload of filebench to run correctly, you'll need this minor fix to 'flowop_library.c':
@@ -600,10 +600,11 @@
aiocb->aio_fildes = filedesc;
aiocb->aio_buf = threadflow->tf_mem + memoffset;
aiocb->aio_nbytes = *flowop->fo_iosize;
aiocb->aio_offset = offset;
+ aiocb->aio_reqprio = 0;
filebench_log(LOG_DEBUG_IMPL,
"aio fd=%d, bytes=%lld, offset=%lld",
filedesc, *flowop->fo_iosize, offset);
happy databasing!
(2007-07-18 17:56:47.0/2007-07-18 17:56:47.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/vdev_cache_improvements_to_help
|
|
|
Tuesday June 12, 2007 |
|
Sun is known for servers, not laptops. So a filesystem designed by Sun would surely be too powerful and too "heavy" for laptops, that the features of a "datacenter" filesystem wouldn't fit on a laptop. Right? Actually... not. As it turns out, ZFS is a great match for laptops.
Backup
One of the most important things a user needs to do on a laptop is to back his data up. Copying your data to DVD or an external drive is one way. ZFS snapshots with 'zfs send' and 'zfs recv' is a better way. Due to its architecture, snaphots in ZFS are very fast and only take up as much space as much data has changed. For a typical user, taking a snapshot every day, for example, will only take up a small amount of capacity.
So let's start off with a ZFS pool called 'swim' and two filesystems: 'Music' and 'Pictures':
fsh-mullet# zfs list
NAME USED AVAIL REFER MOUNTPOINT
swim 157K 9.60G 21K /swim
swim/Music 18K 9.60G 18K /swim/Music
swim/Pictures 19K 9.60G 19K /swim/Pictures
fsh-mullet# ls /swim/Pictures
bday.jpg good_times.jpg
Taking a snapshot 'today' of Pictures is this easy:
fsh-mullet# zfs snapshot swim/Pictures@today
And now we can see the contents of snapshot 'today' via the '.zfs/snapshot' directory:
fsh-mullet# ls /swim/Pictures/.zfs/snapshot/today
bday.jpg good_times.jpg
fsh-mullet#
If you want to take a snapshot of all your filesystems, then you can do:
fsh-mullet# zfs snapshot -r swim@today
fsh-mullet# zfs list
NAME USED AVAIL REFER MOUNTPOINT
swim 100M 9.50G 21K /swim
swim@today 0 - 21K -
swim/Music 100M 9.50G 100M /swim/Music
swim/Music@today 0 - 100M -
swim/Pictures 19K 9.50G 19K /swim/Pictures
swim/Pictures@today 0 - 19K -
fsh-mullet#
Now that you have snapshots, you can use the built-in features of 'zfs send' and 'zfs recv' to backup your data - even to another machine.
fsh-mullet# zfs send swim/Pictures@today | ssh host2 zfs recv -d backupswim
After you've sent over the first snapshot via 'zfs send', you can then do incremental 'zfs send's:
fsh-mullet# zfs send -i swim/Pictures@today | ssh host2 zfs recv -d backupswim
Now let's look at the backup ZFS pool 'backupswim' on host 'host2':
host2# zfs list
NAME USED AVAIL REFER MOUNTPOINT
backupswim 100M 9.50G 21K /backupswim
backupswim/Music 100M 9.50G 100M /backupswim/Music
backupswim/Music@today 0 - 100M -
backupswim/Pictures 18K 9.50G 18K /backupswim/Pictures
backupswim/Pictures@today 0 - 18K -
What's really nice about using ZFS's snapshots is that you only need to send over (and store) the differences between snapshots. So if you're doing video editing on your laptop, and have a giant 10GB file, but only change, say, 1KB of data on this day, with ZFS you only have to send over 1KB of data - not the entire 10GB of the file. This also means you don't have to store multiple 10GB versions (one per snapshot) of the file on your backup device.
You can also backup with an external hard drive. Create a backup pool on the second hard drive, and just 'zfs send/recv' your nightly snapshots.
Reliability
Since laptops (typically) only have 1 disk, handling disk errors is very important. Bill introduced ditto blocks to handle partial disk failures. With typical filesystems, if part of the disk is corrupted/failing and that part of the disk stores your metadata, you're screwed. There's no way to access the data associated with the inaccessible metadata without backing up. With ditto blocks, ZFS stores multiple copies of the metadata in the pool. In the single disk case, we strategically store multiple copies of the metadata on different locations on disk (such as at the front and back of the disk). A subtle partial disk failure can make other filesystems useless, whereas ZFS can survive.
Matt took ditto blocks one step further and allowed the user to apply it to any filesystem's data. What this means is that you can make your more important data more reliable by stashing away multiple copies of your precious data (without muddying your namespace). Here's how you store two copies of your pictures:
fsh-mullet# zfs set copies=2 swim/Pictures
fsh-mullet# zfs get copies swim/Pictures
NAME PROPERTY VALUE SOURCE
swim/Pictures copies 2 local
fsh-mullet#
Note, the number of copies property only affects future writes (not existing data). So i recommend you set this at filesystem creation time:
fsh-mullet# zfs create -o copies=2 swim/Music
fsh-mullet# zfs get copies swim/Music
NAME PROPERTY VALUE SOURCE
swim/Music copies 2 local
fsh-mullet#
Built-in Compression
With ZFS, compression comes built-in. The current algorithms are lzjb (based on Lempel-Ziv) and gzip. Now its true that your jpegs and mp4s are already compressed quite nicely, but if you want to save capacity on other filesystems, all you have to do is:
fsh-mullet# zfs set compression=on swim/Documents
fsh-mullet# zfs get compression swim/Documents
NAME PROPERTY VALUE SOURCE
swim/Documents compression on local
fsh-mullet#
The default compression algorithm is lzjb. If you want to use gzip, then do:
fsh-mullet# zfs set compression=gzip swim/Documents
fsh-mullet# zfs get compression swim/Documents
NAME PROPERTY VALUE SOURCE
swim/Documents compression gzip local
fsh-mullet#
That single disk stickiness
A major problem with laptops today is the single point of failure: the single disk. It makes complete sense today that laptops are designed this way given the physical space and power issues. But looking foward, as, say, flash gets cheaper and cheaper as well as more reliable, it becomes more and more of a possibility to replace the single disk in laptops. So now that you save physical space, you can actually fit more than one flash device in the laptop. Wouldn't it be really cool if you could then build RAID ontop of the multiple devices? Introducing some hardware RAID controller doesn't make any sense - but software RAID does.
ZFS allows you to do mirroring as well as RAID-Z (ZFS's unique form of RAID-5) - in software.
Creating a mirrored pool is easy:
diskmonster# zpool create swim mirror c7t0d0 c7t1d0
diskmonster# zpool status
pool: swim
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
swim ONLINE 0 0 0
mirror ONLINE 0 0 0
c7t0d0 ONLINE 0 0 0
c7t1d0 ONLINE 0 0 0
errors: No known data errors
diskmonster#
Similarly, creating a RAID-Z is also easy:
diskmonster# zpool create swim raidz c7t0d0 c7t1d0 c7t2d0 c7t5d0
diskmonster# zpool status
pool: swim
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
swim ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c7t0d0 ONLINE 0 0 0
c7t1d0 ONLINE 0 0 0
c7t2d0 ONLINE 0 0 0
c7t5d0 ONLINE 0 0 0
errors: No known data errors
diskmonster#
With either of these configurations, your laptop can now handle a whole device failure.
ZFS on a laptop - a perfect fit.
(2007-06-12 16:35:37.0/2007-06-12 16:35:37.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/zfs_on_a_laptop
|
|
|
Tuesday May 29, 2007 |
|
NCQ
was enabled in OpenSolaris back in snv_47,
but it was only recently (build snv_63),
that multiple concurrent I/Os were enabled - thereby, letting us take advantage of NCQ. I've been waiting for many months for this to happen, so i was curious as to what performance impact NCQ would really have.
The short and sweet is that NCQ is great for multi-threaded random I/O and horrible for multi-streamed sequential I/O.
Below are some side by side comparisons (you may have to expand that browser a bit) with NCQ disabled vs enabled for two main workloads: random reads and sequential reads. I have some different configs for the random reads workload (1 disk, 2 disk, 4 disk, etc.). I also experimented with a different number of threads/streams for each workload.
filebench was used to generate the workloads. More info on filebench is here.
I tested on 2007-03-16 non-debug bits of solaris on a
thumper. The disks were 500GB Hitachis with AJ0A firmware. The filebench profiles i used for the random read workload is here and the sequential read workload is here. All tests used ZFS with checksumming off.
Random Read Results
1 disk
2 disk RAID-0
4 disk RAID-0
16 disk RAID-0
32 disk RAID-0
Sequential Read Results
46 disk RAID-0
Conclusions
Clearly, NCQ gets you better iops (and hence better overall BW) when the workload is purely random I/O and you have more than 1 thread. We see the exact opposite result if the workload is purely sequential I/O and you have multiple streams.
For the multi-streamed sequential read workload, we were curious where the problem was. Using Dtrace, we found ldi_strategy() was taking ~3x longer with NCQ enabled over disabled. Somewhat interesting, but we still didn't know if the problem was in SATA, marvell, disk firmware, or just the disk hardware itself. So we Dtrace'd the sd, sata, and marvell88sx modules (as well as the ldi_strategy() and bdev_strategy() routines) to see where time was being spent. We found that mv_rw_dma_start() was taking ~20x longer. Ah, ha! So we know the problem is either in marvell, but more likely the disk (firmware).
My take on this is that NCQ is an immature technology, and the disk vendors have some work to do (please note, i only got to test this one version of Hitachi drives). I'd love to test out some other SATA disks... you know, when i have free time.
Oh yeah, currently in OpenSolaris, NCQ is enabled by default. If you would like to turn it off, set this in /etc/system and reboot:
set sata:sata_func_enable = 0x5
And now in snv_74, you can leave NCQ enabled, but effectively disable it by setting the number of multiple concurrent I/Os to 1 by putting the following in /etc/system and rebooting:
set sata:sata_max_queue_depth = 0x1
we're running a complete perf run to see if we should have NCQ enabled or disabled by default.
(2007-09-11 21:52:26.0/2007-05-29 12:28:22.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/ncq_performance_analysis
|
|
|
Tuesday April 10, 2007 |
|
Poor Man's Cluster - end the corruption
The putback of 6282725 hostname/hostid should be stored in the label introduces hostid checking when importing a pool.
If the pool was last accessed by another system, then the import is denied (of course can be overridden with the '-f' flag).
This is especially important to people rolling their own cluster's - the so-called poor man's cluster. What people were finding is:
1) clientA creates the pool (using shared storage)
2) clientA reboots/panics
3) clientB forcibily imports the pool
4) clientA comes back up
5) clientA automatically imports the pool via /etc/zfs/zpool.cache
At this point, both clientA and clientB have the same pool imported and both can write to it - however, ZFS is not designed
to have multiple writers (yet), so both clients will quickly corrupt the pool as both have a different view of the pool's state.
Now that we store the hostid in the label and verify the system importing the pool was the last one that accessed the pool, the
poor man's cluster corruption scenario mentioned above can no longer happen. Below is an example using shared storage over iSCSI.
In the example, clientA is 'fsh-weakfish', clientB is 'fsh-mullet'.
First, let's create the pool on clientA (assume both clients are already setup for iSCSI):
fsh-weakfish# zpool create i c2t01000003BAAAE84F00002A0045F86E49d0
fsh-weakfish# zpool status
pool: i
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
i ONLINE 0 0 0
c2t01000003BAAAE84F00002A0045F86E49d0 ONLINE 0 0 0
errors: No known data errors
fsh-weakfish# zfs create i/wombat
fsh-weakfish# zfs create i/hulio
fsh-weakfish# zfs list
NAME USED AVAIL REFER MOUNTPOINT
i 154K 9.78G 19K /i
i/hulio 18K 9.78G 18K /i/hulio
i/wombat 18K 9.78G 18K /i/wombat
fsh-weakfish#
Note the enhanced information 'zpool import' reports on clientB:
fsh-mullet# zpool import
pool: i
id: 8574825092618243264
state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
the '-f' flag.
see: http://www.sun.com/msg/ZFS-8000-EY
config:
i ONLINE
c2t01000003BAAAE84F00002A0045F86E49d0 ONLINE
fsh-mullet# zpool import i
cannot import 'i': pool may be in use from other system, it was last accessed by
fsh-weakfish (hostid: 0x4ab08c2) on Tue Apr 10 09:33:07 2007
use '-f' to import anyway
fsh-mullet#
Ok, we don't want to forcibly import the pool until clientA is down. So after clientA (fsh-weakfish) has rebooted,
forcibly import the pool on clientB (fsh-mullet):
fsh-weakfish# reboot
....
fsh-mullet# zpool import -f i
fsh-mullet# zpool status
pool: i
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
i ONLINE 0 0 0
c2t01000003BAAAE84F00002A0045F86E49d0 ONLINE 0 0 0
errors: No known data errors
fsh-mullet#
After clientA comes back up, we'll see this message via syslog:
WARNING: pool 'i' could not be loaded as it was last accessed by another system
(host: fsh-mullet hostid: 0x8373b35b). See: http://www.sun.com/msg/ZFS-8000-EY
And just to double check to make sure that pool 'i' is in fact not loaded:
fsh-weakfish# zpool list
no pools available
fsh-weakfish#
And to verify the pool has not been corrupted from clientB's view of the world, we see:
fsh-mullet# zpool scrub i
fsh-mullet# zpool status
pool: i
state: ONLINE
scrub: scrub completed with 0 errors on Tue Apr 10 10:28:03 2007
config:
NAME STATE READ WRITE CKSUM
i ONLINE 0 0 0
c2t01000003BAAAE84F00002A0045F86E49d0 ONLINE 0 0 0
errors: No known data errors
fsh-mullet# zfs list
NAME USED AVAIL REFER MOUNTPOINT
i 156K 9.78G 21K /i
i/hulio 18K 9.78G 18K /i/hulio
i/wombat 18K 9.78G 18K /i/wombat
fsh-mullet#
See you never again poor man's cluster corruption.
One detail i'd like to point out is that you have to be careful on *when* you forcibly import a pool. For instance,
if you forcibly import the pool on clientB *before* you reboot clientA then corruption can still happen. This is because
the command reboot(1M) cleanly takes down the machine, which means it unmounts all filesystems, and unmounting a
filesystem will write a bit of data to the pool.
To see the new information on the label, you can use zdb(1M):
fsh-mullet# zdb -l /dev/dsk/c2t01000003BAAAE84F00002A0045F86E49d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
version=6
name='i'
state=0
txg=665
pool_guid=8574825092618243264
hostid=2205397851
hostname='fsh-mullet'
top_guid=5676430250453749577
guid=5676430250453749577
vdev_tree
type='disk'
id=0
guid=5676430250453749577
path='/dev/dsk/c2t01000003BAAAE84F00002A0045F86E49d0s0'
devid='id1,ssd@x01000003baaae84f00002a0045f86e49/a'
whole_disk=1
metaslab_array=14
metaslab_shift=26
ashift=9
asize=10724048896
DTL=30
--------------------------------------------
LABEL 1
--------------------------------------------
...
(2007-04-18 05:05:43.0/2007-04-10 10:57:47.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/poor_man_s_cluster_end
|
|
|
Friday February 02, 2007 |
|
corrupted files and 'zpool status -v'
If ZFS detects either a checksum error or read I/O failure and is not able to correct it (say by successfully reading from the other side of a mirror), then it will store a log of objects that are damaged permanently (perhaps due to silent corruption).
Previously (that is, before snv_57), the output we gave was only somewhat useful:
# zpool status -v
pool: monkey
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
monkey ONLINE 26 0 0
c1t1d0s7 ONLINE 12 0 0
c1t1d0s6 ONLINE 14 0 0
errors: The following persistent errors have been detected:
DATASET OBJECT RANGE
0x0 0x13 lvl=0 blkid=0
0x5 0x4 lvl=0 blkid=0
0x17 0x4 lvl=0 blkid=0
0x1d 0x4 lvl=0 blkid=0
0x24 0x5 lvl=0 blkid=0
0x2a 0x4 lvl=0 blkid=0
0x2a 0x6 lvl=0 blkid=0
0x30 0x4 lvl=0 blkid=0
0x36 0x0 lvl=0 blkid=2
If you were lucky, the DATASET object number would actually get converted into a dataset name. If it didn't then you would have to use zdb(1M) to figure out what the dataset name/mountpoint was. After that, you would have to use the '-inum' option to find(1) to figure out what the actual file was (see the opensolaris thread on it). While it is really powerful to even have this ability, it would be really nice to have ZFS do all the dirty work for you - we are after all shooting for easy administration.
With the putback of:
6410433 'zpool status -v' would be more useful with filenames, observability has been greatly increased!:
# zpool status -v
pool: monkey
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
monkey ONLINE 24 0 0
c1t1d0s6 ONLINE 10 0 0
c1t1d0s7 ONLINE 14 0 0
errors: Permanent errors have been detected in the following files:
/monkey/a.txt
/monkey/bananas/b.txt
/eric/c.txt
/monkey/sub/dir/d.txt
monkey/ghost:/e.txt
monkey/ghost:/boo/f.txt
monkey/dnode:<0x0>
<metadata>:<0x13>
For the listings above, we attempt to print out the full path to the file. If we successfully find the full path and the dataset is mounted then we print out the full path with a preceding "/" (such as in the "/monkey/a.txt" example above). If we successfully find it, but the dataset is not mounted, then we print out the dataset name (no preceding "/"), followed by the path within the dataset to the file (see the "monkey/ghost:/e.txt" example above).
If we can't successfully translate the object number to a file path (either due to error or the object doesn't have a real file path associated with it as is the case for say a dnode_t), then we print out the dataset name followed by the object's number (as in the "monkey/dnode:<0x0>" case above). If an object in the MOS gets corrupted then we print out the special tag of <metadata>, followed by the object number.
Couple this with background scrubbing and you have very impressive fault management and observability. What other filesystem/storage system can give you this ability?
Note: these changes are in snv_57, will hopefully make s10u4, and perhaps even Leopard :)
If you're stuck on old bits (without the above mentioned changes) and are trying to figure out how to translate object numbers to filenames,
then check out
this thread
(2007-05-15 10:30:07.0/2007-02-02 17:10:53.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/damaged_files_and_zpool_status
|
|
|
Tuesday November 21, 2006 |
|
People are finding that setting 'zil_disable' seems to increase their performance - especially NFS/ZFS performance. But what does setting 'zil_disable' to 1 really do? It completely disables the ZIL. Ok fine, what does that mean?
Disabling the ZIL causes ZFS to not immediatley write synchronous operations to disk/storage. With the ZIL disabled, the synchronous operations (such as fsync(), O_DSYNC, OP_COMMIT for NFS, etc.) will be written to disk, just at the same guarantees as asynchronous operations. That means you can return success to applications/NFS clients before the data has been commited to stable storage. In the event of a server crash, if the data hasn't been written out to the storage, it is lost forever.
With the ZIL disabled, no ZIL log records are written.
Note: disabling the ZIL does NOT compromise filesystem integrity. Disabling the ZIL does NOT cause corruption in ZFS.
Disabling the ZIL is definitely frowned upon and can cause your applications much confusion. Disabling the ZIL can cause corruption for NFS clients in the case where a reply to the client is done before the server crashes, and the server crashes before the data is commited to stable storage. If you can't live with this, then don't turn off the ZIL.
The 'zil_disable' tuneable will go away once 6280630 zil synchronicity is putback.
Hmm, so all of this sounds shady - so why did we add 'zil_disable' to the code base? Not for people to use, but as an easy way to do performance measurements (to isolate areas outside the ZIL).
If you'd like more information on how the ZIL works, check out Neil's blog and Neelakanth's blog.
(2006-11-27 08:49:17.0/2006-11-21 10:47:51.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/zil_disable
|
|
|
Monday October 16, 2006 |
|
ZFS - now with history!
When disaster strikes and you'd really like to know what people have been doing to your pool, what do you do? Until now, there was nothing elegant. Enter 'zpool history'.
# zpool create hewitt c1d0
# zfs create hewitt/jen
# zfs create hewitt/jen/love
# zpool history
History for 'hewitt':
2006-10-16.17:54:02 zpool create hewitt c1d0
2006-10-16.17:54:11 zfs create hewitt/jen
2006-10-16.17:54:15 zfs create hewitt/jen/love
#
All subcommands of zfs(1M) and zpool(1M) that modify the state of the pool get logged persistently to disk. That means no matter where you take your pool or what machine is currently accessing it (such as in the SunCluster failover case), your history follows. Sorta like your permanent record.
Now you have a convenient way of finding out if someone did something bad to your pool...
bad_admin# zfs set checksum=off hewitt
bad_admin# zfs destroy hewitt/jen/love
good_admin# zpool history
History for 'hewitt':
2006-10-16.17:54:02 zpool create hewitt c1d0
2006-10-16.17:54:11 zfs create hewitt/jen
2006-10-16.17:54:15 zfs create hewitt/jen/love
2006-10-16.17:54:35 zfs set checksum=off hewitt
2006-10-16.17:57:29 zfs destroy hewitt/jen/love
#
The history log is implemented using a ring buffer of <packed record length, record nvlist> tuples. More details can be found in spa_history.c, which contains the main kernel code changes for 'zpool history'. The history log's size is 1% of your pool, with a maximum of 32MB and a minimum of 128KB. Note: the original creation of the pool via 'zpool create' is never overwritten.
If you add a new subcommand to zfs(1m) or zpool(1M), all you need to do is call zpool_log_history(). If you build a new consumer of 'zpool history' (such as a GUI), then you need to call zpool_get_history(), and parse the nvlist. A good example of that is in get_history_one().
In the future, we will add the ability to also log uid, hostname, and zonename. We're also looking at adding "internal events" to the log since some subcommands actually take more than one txg, and we'd like to log history every txg (this would be more for developers and debuggers than admins).
These changes are in snv_51, and i would expect s10_u4 (though that schedule hasn't been decided yet).
Enjoy making history.
(2006-10-17 23:01:12.0/2006-10-16 19:11:46.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/zpool_history
|
|
|
Monday August 07, 2006 |
|
As part of the I/O scheduling, ZFS has a field called 'zfs_vdev_max_pending'. This limits the maximum number of I/Os we can send down per leaf vdev. This is NOT the maximum per filesystem or per pool. Currently the default is 35. This is a good number for today's disk drives; however, it is not a good number for storage arrays that are really comprised of many disks but exported to ZFS as a single device.
This limit is a really good thing when you have a heavy I/O load as described in Bill's "ZFS vs. The Benchmark" blog.
But if you've created say a 2 device mirrored pool - where each device is really a 10 disk storage array, and you think that ZFS just
isn't doing enough I/O for you, here's a script to see if that's true:
#!/usr/sbin/dtrace -s
vdev_queue_io_to_issue:return
/arg1 != NULL/
{
@c["issued I/O"] = count();
}
vdev_queue_io_to_issue:return
/arg1 == NULL/
{
@c["didn't issue I/O"] = count();
}
vdev_queue_io_to_issue:entry
{
@avgers["avg pending I/Os"] = avg(args[0]->vq_pending_tree.avl_numnodes);
@lquant["quant pending I/Os"] = quantize(args[0]->vq_pending_tree.avl_numnodes);
@c["total times tried to issue I/O"] = count();
}
vdev_queue_io_to_issue:entry
/args[0]->vq_pending_tree.avl_numnodes > 349/
{
@avgers["avg pending I/Os > 349"] = avg(args[0]->vq_pending_tree.avl_numnodes);
@quant["quant pending I/Os > 349"] = lquantize(args[0]->vq_pending_tree.avl_numnodes, 33, 1000, 1);
@c["total times tried to issue I/O where > 349"] = count();
}
/* bail after 5 minutes */
tick-300sec
{
exit(0);
}
If you see the "avg pending I/Os" hitting your vq_max_pending limit, then raising the limit would be a good thing. The way to do that used to be per vdev, but we now have a single global way to change all vdevs.
heavy# mdb -kw
Loading modules: [ unix genunix specfs dtrace cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs ip hook neti sctp arp usba fctl nca lofs zfs random nfs cpc fcip logindmux ptm sppp ipc ]
> zfs_vdev_max_pending/E
zfs_vdev_max_pending:
zfs_vdev_max_pending: 35
> zfs_vdev_max_pending/W 0t70
zfs_vdev_max_pending: 0x23 = 0x46
> zfs_vdev_max_pending/E
zfs_vdev_max_pending:
zfs_vdev_max_pending: 70
>
The above will change the max # of pending requests to 70, instead of 35.
So having people tune variables is never desireable, and we'd like 'vq_max_pending' (among others) to be dynamically set, see:
6457709 vdev_knob values should be determined dynamically .
(2008-03-03 14:19:42.0/2006-08-07 11:22:50.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/vq_max_pending
|
|
|
Friday March 31, 2006 |
|
dscripts to time ZFS VOPs
So you're running some app, and you're curious where zfs is spending its time... here's some dscripts to figure out how much time
each VOP is taking.
This one (
zfs_get_vop_times.d)
grabs the number of times each VOP was called, the VOP's average time, and the total time spent in each VOP. This
is for all ZFS file systems on the system. It generates output like:
# ./zvop_times.d
dtrace: script './zvop_times.d' matched 66 probes
^C
CPU ID FUNCTION:NAME
17 2 :END
ZFS COUNT
zfs_fsync 61
zfs_write 494
zfs_read 520
ZFS AVG TIME
zfs_read 2737251
zfs_write 6992704
zfs_fsync 73401109
ZFS SUM TIME
zfs_read 1423370640
zfs_write 3454396080
zfs_fsync 4477467680
This one (
zvop_times_fsid.d)
does the same as above but just for one file system - namely the one you specify via passed in FSID ints.
Lastly, this one (
zvop_times_fsid_large.d)
does the same as above (tracking per FSID), but also spits out the stack and quantize information when a zfs VOP
call goes over X time - where X is passed into the script. This makes it easy to see if there's some really slow calls.
It generates output like (skipping the output thats the same from the above examples):
# ./zvop_times_fsid_large.d 0x7929404d 0xb3d52b08 50000000
dtrace: script './zvop_times_fsid_large.d' matched 123 probes
CPU ID FUNCTION:NAME
14 35984 zfs_read:return
genunix`fop_read+0x20
genunix`read+0x29c
unix`syscall_trap32+0x1e8
16 35994 zfs_putpage:return
genunix`fop_putpage+0x1c
nfssrv`rfs3_commit+0x110
nfssrv`common_dispatch+0x588
rpcmod`svc_getreq+0x1ec
rpcmod`svc_run+0x1e8
nfs`nfssys+0x1b8
unix`syscall_trap32+0x1e8
18 35994 zfs_putpage:return
genunix`fop_putpage+0x1c
nfssrv`rfs3_commit+0x110
nfssrv`common_dispatch+0x588
rpcmod`svc_getreq+0x1ec
rpcmod`svc_run+0x1e8
nfs`nfssys+0x1b8
unix`syscall_trap32+0x1e8
12 35972 zfs_fsync:return
genunix`fop_fsync+0x14
nfssrv`rfs4_createfile+0x500
nfssrv`rfs4_do_opennull+0x44
nfssrv`rfs4_op_open+0x380
nfssrv`rfs4_compound+0x208
nfssrv`rfs4_dispatch+0x11c
nfssrv`common_dispatch+0x154
rpcmod`svc_getreq+0x1ec
rpcmod`svc_run+0x1e8
nfs`nfssys+0x1b8
unix`syscall_trap32+0x1e8
^C
zfs_fsync
value ------------- Distribution ------------- count
33554432 | 0
67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
134217728 | 0
zfs_putpage
value ------------- Distribution ------------- count
33554432 | 0
67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
134217728 | 0
zfs_read
value ------------- Distribution ------------- count
67108864 | 0
134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
268435456 | 0
Feel free to play around with these scripts as well as add/subtract.
(2006-03-31 15:29:56.0/2006-03-31 15:29:56.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/dscript_to_time_zfs_vops
|
|
|
Wednesday November 16, 2005 |
|
Now let's run a simple but popular benchmark - Netapp's postmark.
Let's see how long it takes to do 1,000,000 transactions.
First lets try ZFS:
mcp# ./postmark
PostMark v1.5 : 3/27/01
pm>set location=/scsi_zfs
pm>set transactions=1000000
pm>run
Creating files...Done
Performing transactions..........Done
Deleting files...Done
Time:
220 seconds total
214 seconds of transactions (4672 per second)
Files:
519830 created (2362 per second)
Creation alone: 20000 files (4000 per second)
Mixed with transactions: 499830 files (2335 per second)
500124 read (2337 per second)
494776 appended (2312 per second)
519830 deleted (2362 per second)
Deletion alone: 19660 files (19660 per second)
Mixed with transactions: 500170 files (2337 per second)
Data:
3240.97 megabytes read (14.73 megabytes per second)
3365.07 megabytes written (15.30 megabytes per second)
pm>
During the run, i used our good buddy
zpool(1M)
to see how much IO
we were doing:
mcp# zpool iostat 1
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
scsi_zfs 32.5K 68.0G 0 207 0 6.25M
scsi_zfs 32.5K 68.0G 0 821 0 24.1M
scsi_zfs 32.5K 68.0G 0 978 0 28.6M
scsi_zfs 32.5K 68.0G 0 1.04K 0 30.3M
scsi_zfs 32.5K 68.0G 0 1.01K 0 27.6M
scsi_zfs 129M 67.9G 0 797 0 16.2M
scsi_zfs 129M 67.9G 0 832 0 27.4M
Ok, onto UFS:
mcp# ./postmark
PostMark v1.5 : 3/27/01
pm>set location=/export/scsi_ufs
pm>set transactions=1000000
pm>run
Creating files...Done
Performing transactions..........Done
Deleting files...Done
Time:
3450 seconds total
3419 seconds of transactions (292 per second)
Files:
519830 created (150 per second)
Creation alone: 20000 files (909 per second)
Mixed with transactions: 499830 files (146 per second)
500124 read (146 per second)
494776 appended (144 per second)
519830 deleted (150 per second)
Deletion alone: 19660 files (2184 per second)
Mixed with transactions: 500170 files (146 per second)
Data:
3240.97 megabytes read (961.96 kilobytes per second)
3365.07 megabytes written (998.79 kilobytes per second)
pm>
Also, during the run grabbed a little iostat to see how UFS's IO
was doing:
mcp# iostat -Mxnz 1
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 820.9 0.0 3.4 142.5 256.0 173.5 311.9 100 100 c4t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 797.0 0.0 3.1 129.2 256.0 162.1 321.2 100 100 c4t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 777.0 0.0 3.1 128.0 256.0 164.7 329.5 100 100 c4t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 827.1 0.0 4.0 128.8 256.0 155.7 309.5 100 100 c4t1d0
Yikes! so looking at throughput (number of transactions per second)
ZFS is ~16x better than UFS on this benchmark. Ok so ZFS
is not this good on every benchmark when compared to UFS, but we rather
like this one.
This was run on a 2 way opteron box, using the same SCSI disk for both
ZFS and UFS.
(2006-08-08 10:49:56.0/2005-11-16 14:46:20.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/fs_perf_201_postmark
FS perf 102 : Filesystem Bandwith
Now that you can grab the disks's BW, the next question is "How do i see what
BW my local file system can push?". First lets check writes for ZFS:
fsh-mullet# /bin/time sh -c 'lockfs -f .; mkfile 1g 1g.txt; lockfs -f .'
real 17.1
user 0.0
sys 1.1
So that's 1GB/17.1s = ~62MB/s for a 1 gig file. During the
mkfile(1M), you can use iostat(1M) to see how much disk BW is
going on:
fsh-mullet# iostat -Mxnz 1
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 541.0 0.0 67.6 0.0 35.0 0.0 64.7 1 100 c0t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 567.0 0.0 70.3 0.0 33.9 0.0 59.9 1 100 c0t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 254.9 0.0 29.0 0.0 15.7 0.0 61.6 0 64 c0t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 528.1 0.0 66.0 0.0 35.0 0.0 66.2 1 100 c0t1d0
We can also use
zpool(1M)
to show just the IO for zfs:
fsh-mullet# zpool iostat 1
bw_hog 32.5K 33.7G 0 538 0 67.4M
bw_hog 189M 33.6G 0 30 0 459K
bw_hog 189M 33.6G 0 0 0 0
bw_hog 189M 33.6G 0 509 0 63.7M
bw_hog 189M 33.6G 0 544 0 68.1M
bw_hog 189M 33.6G 0 544 0 68.1M
bw_hog 189M 33.6G 0 535 0 67.0M
Now let's look at UFS writes:
fsh-mullet# /bin/time sh -c 'lockfs -f .; mkfile 1g 1g.txt; lockfs -f .'
real 18.7
user 0.1
sys 6.3
So UFS is doing 1GB/18.7s = ~57MB/s. Let's see some of that iostat:
fsh-mullet# iostat -Mxnz 1
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
4.0 70.0 0.0 58.9 0.0 10.8 0.0 145.6 0 99 c0t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
3.0 70.0 0.0 57.8 0.0 10.6 0.0 144.5 0 99 c0t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
4.0 70.0 0.0 59.4 0.0 11.2 0.0 151.3 0 99 c0t1d0
This was done on a 2-way v210 sparc box, using a SCSI disk.
And why the 'lockfs' call you ask? This ensures that all data is flushed to
disk - and measuring how long it takes to do something that doesn't necessarily
get flushed is just not legit in this case. Persistent data is good.
(2005-11-16 10:18:45.0/2005-11-16 10:18:06.0)
Permalink
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/fs_perf_102_filesystem_bw
|
|
|
|
| « November 2009 | | Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|---|
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | | | | | | | | | | | | | | | Today |
Today's Page Hits: 129
|