|
Now let's run a simple but popular benchmark - NetApp's PostMark.
Let's see how long it takes to do 1,000,000 transactions.
First let's try ZFS:
mcp# ./postmark
PostMark v1.5 : 3/27/01
pm>set location=/scsi_zfs
pm>set transactions=1000000
pm>run
Creating files...Done
Performing transactions..........Done
Deleting files...Done
Time:
220 seconds total
214 seconds of transactions (4672 per second)
Files:
519830 created (2362 per second)
Creation alone: 20000 files (4000 per second)
Mixed with transactions: 499830 files (2335 per second)
500124 read (2337 per second)
494776 appended (2312 per second)
519830 deleted (2362 per second)
Deletion alone: 19660 files (19660 per second)
Mixed with transactions: 500170 files (2337 per second)
Data:
3240.97 megabytes read (14.73 megabytes per second)
3365.07 megabytes written (15.30 megabytes per second)
pm>
During the run, I used our good buddy zpool(1M) to see how much IO
we were doing:
mcp# zpool iostat 1
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
scsi_zfs 32.5K 68.0G 0 207 0 6.25M
scsi_zfs 32.5K 68.0G 0 821 0 24.1M
scsi_zfs 32.5K 68.0G 0 978 0 28.6M
scsi_zfs 32.5K 68.0G 0 1.04K 0 30.3M
scsi_zfs 32.5K 68.0G 0 1.01K 0 27.6M
scsi_zfs 129M 67.9G 0 797 0 16.2M
scsi_zfs 129M 67.9G 0 832 0 27.4M
Ok, onto UFS:
mcp# ./postmark
PostMark v1.5 : 3/27/01
pm>set location=/export/scsi_ufs
pm>set transactions=1000000
pm>run
Creating files...Done
Performing transactions..........Done
Deleting files...Done
Time:
3450 seconds total
3419 seconds of transactions (292 per second)
Files:
519830 created (150 per second)
Creation alone: 20000 files (909 per second)
Mixed with transactions: 499830 files (146 per second)
500124 read (146 per second)
494776 appended (144 per second)
519830 deleted (150 per second)
Deletion alone: 19660 files (2184 per second)
Mixed with transactions: 500170 files (146 per second)
Data:
3240.97 megabytes read (961.96 kilobytes per second)
3365.07 megabytes written (998.79 kilobytes per second)
pm>
Also, during the run I grabbed a little iostat to see how UFS's IO
was doing:
mcp# iostat -Mxnz 1
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 820.9 0.0 3.4 142.5 256.0 173.5 311.9 100 100 c4t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 797.0 0.0 3.1 129.2 256.0 162.1 321.2 100 100 c4t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 777.0 0.0 3.1 128.0 256.0 164.7 329.5 100 100 c4t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 827.1 0.0 4.0 128.8 256.0 155.7 309.5 100 100 c4t1d0
Yikes! So looking at throughput (number of transactions per second),
ZFS is ~16x better than UFS on this benchmark (4672 vs. 292). OK, so ZFS
is not this good on every benchmark when compared to UFS, but we rather
like this one.
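For anyone who wants to redo the arithmetic, here is a minimal sketch in C. The function name is made up for illustration; the numbers come straight from the two PostMark runs above.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of two rates (or two times); both must be in the same unit. */
double sketch_ratio(double bigger, double smaller)
{
    assert(smaller > 0.0);
    return bigger / smaller;
}

/* Print both comparisons using the figures reported by PostMark above. */
void sketch_print_postmark_comparison(void)
{
    /* transactions per second: ZFS 4672, UFS 292 */
    printf("throughput ratio: %.1fx\n", sketch_ratio(4672.0, 292.0));
    /* total wall-clock seconds: UFS 3450, ZFS 220 */
    printf("total time ratio: %.1fx\n", sketch_ratio(3450.0, 220.0));
}
```

The total-time ratio comes out a little lower than the transaction ratio because the total includes the file creation and deletion phases.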
This was run on a 2-way Opteron box, using the same SCSI disk for both
ZFS and UFS.
(2006-08-08 10:49:56.0/2005-11-16 14:46:20.0)
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/fs_perf_201_postmark
bigger filehandles for NFSv4 - die NFSv2 die
So what does a filehandle created by the Solaris NFS server look like? If we
take a gander at the fhandle_t struct, we see its layout:
struct svcfh {
    fsid_t fh_fsid; /* filesystem id */
    ushort_t fh_len; /* file number length */
    char fh_data[NFS_FHMAXDATA]; /* and data */
    ushort_t fh_xlen; /* export file number length */
    char fh_xdata[NFS_FHMAXDATA]; /* and data */
};
typedef struct svcfh fhandle_t;
Where fh_len represents the length of valid bytes in fh_data, and
likewise, fh_xlen is the length of valid bytes in fh_xdata. Note, NFS_FHMAXDATA used
to be:
#define NFS_FHMAXDATA ((NFS_FHSIZE - sizeof (struct fhsize) + 8) / 2)
To be less confusing, I removed fhsize and shortened that to:
#define NFS_FHMAXDATA 10
Ok, but where does fh_data come from? It's the FID (via VOP_FID) of the
local file system. fh_data represents the actual file of the filehandle,
and fh_xdata represents the exported file/directory. So for NFSv2 and
NFSv3, the filehandle is basically:
fsid + file FID + exported FID
NFSv4 is pretty much the same thing, except at the end we add two fields, and you
can see the layout in nfs_fh4_fmt_t:
struct nfs_fh4_fmt {
    fhandle_ext_t fh4_i;
    uint32_t fh4_flag;
    uint32_t fh4_volatile_id;
};
The fh4_flag is used to distinguish named attributes from "normal" files,
and fh4_volatile_id is currently only used for testing purposes
- for testing volatile filehandles of course. Since Solaris doesn't have a
local file system that lacks persistent filehandles, we don't need to use
fh4_volatile_id quite yet.
So back to the magical "10" for NFS_FHMAXDATA... what's going on there? Well,
adding those fields up, you get: 8(fsid) + 2(len) + 10(data) + 2(xlen) +
10(xdata) = 32 bytes. Which is the protocol limitation of
NFSv2 - just look for "FHSIZE". So
the Solaris server is currently limiting its filehandles to 10 byte FIDs just to
make NFSv2 happy. Note, this limitation has purposely crept into the local
file systems to make this all work, check out UFS's
ufid:
/*
 * This overlays the fid structure (see vfs.h)
 *
 * LP64 note: we use int32_t instead of ino_t since UFS does not use
 * inode numbers larger than 32-bits and ufid's are passed to NFS
 * which expects them to not grow in size beyond 10 bytes (12 including
 * the length).
 */
struct ufid {
    ushort_t ufid_len;
    ushort_t ufid_flags;
    int32_t ufid_ino;
    int32_t ufid_gen;
};
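To double check the byte counting, here is a small sketch in C that mirrors the two layouts with stand-in types and lets the compiler verify the sizes. Everything prefixed with sketch_ is hypothetical; in particular the 8-byte fsid stand-in and the 2-byte ushort_t are assumptions taken from the arithmetic above, not copied from the Solaris headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NFS_FHSIZE    32  /* NFSv2 filehandle size */
#define SKETCH_NFS_FHMAXDATA 10  /* FID bytes per slot */

/* Stand-in for fsid_t: assumed to be two 32-bit words (8 bytes). */
typedef struct { int32_t val[2]; } sketch_fsid_t;

/* Same field order as struct svcfh above. */
struct sketch_svcfh {
    sketch_fsid_t fh_fsid;
    uint16_t fh_len;
    char fh_data[SKETCH_NFS_FHMAXDATA];
    uint16_t fh_xlen;
    char fh_xdata[SKETCH_NFS_FHMAXDATA];
};

/* Same field order as struct ufid above. */
struct sketch_ufid {
    uint16_t ufid_len;
    uint16_t ufid_flags;
    int32_t ufid_ino;
    int32_t ufid_gen;
};

/* The whole handle must fit the NFSv2 limit exactly. */
static_assert(sizeof(struct sketch_svcfh) == SKETCH_NFS_FHSIZE,
    "filehandle layout must be 32 bytes");

/* 12 bytes including the length field, 10 bytes of FID payload. */
static_assert(sizeof(struct sketch_ufid) - sizeof(uint16_t) ==
    SKETCH_NFS_FHMAXDATA, "ufid payload must be 10 bytes");

/* Sum of the individual fields: 8 + 2 + 10 + 2 + 10. */
size_t sketch_fh_field_total(void)
{
    return sizeof(sketch_fsid_t) + 2 * sizeof(uint16_t) +
        2 * SKETCH_NFS_FHMAXDATA;
}
```

If a file system grew its FID past 10 bytes, the second assertion is the one that would trip.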
Note that NFSv3's protocol limitation is 64 bytes and NFSv4's limitation is 128
bytes. So these two protocol versions could theoretically give out bigger
filehandles, but there are two reasons why they don't for currently existing
data: 1) there's really no need and more importantly 2) the filehandles MUST be
the same on the wire before and after any change is done. If 2) isn't satisfied, then
all clients with active mounts will get STALE errors when the longer filehandles
are introduced. Imagine a server giving out 32 byte filehandles over NFSv3 for
a file, then the server is upgraded and now gives out 64 byte filehandles - even
if all the extra 32 bytes are zeroed out, that's a different filehandle and the
client will think it has a STALE reference. Now a force umount or client reboot
will fix the problem, but it seems pretty harsh to force all active clients to
perform some manual admin action for a simple (and should be harmless) server
upgrade.
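The zero-padding case can be sketched in a few lines of C. This is only an illustration with made-up names: it assumes a client compares filehandles as opaque byte strings (same length and same bytes), which is how the NFS protocols define them, and it is not the Solaris client code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_FH_MAX 128  /* large enough for any NFS version */

/* An opaque filehandle as seen by a client: a length plus bytes. */
struct sketch_fh {
    size_t len;
    unsigned char data[SKETCH_FH_MAX];
};

/* Build a handle from n bytes, zero-padded out to total_len bytes. */
struct sketch_fh sketch_fh_make(const unsigned char *bytes, size_t n,
    size_t total_len)
{
    struct sketch_fh fh;

    assert(n <= total_len && total_len <= SKETCH_FH_MAX);
    memset(&fh, 0, sizeof(fh));
    memcpy(fh.data, bytes, n);
    fh.len = total_len;
    return fh;
}

/* Two handles name the same file only if length and bytes both match. */
bool sketch_fh_equal(const struct sketch_fh *a, const struct sketch_fh *b)
{
    return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}
```

Building a 32 byte handle and a 64 byte handle from the same 32 bytes and comparing them returns false, which is exactly the situation that would leave an active client holding a STALE handle.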
So yeah, my blog title is how I changed filehandles to be bigger - which almost
contradicts the above paragraph. The key point to note is that files that have
never been served up via NFS have never had a filehandle generated for them
(duh), so they can be whatever length the protocol allows and we don't have to
worry about STALE filehandles.
If you're not familiar with ZFS's .zfs/snapshot, there will be a future blog
on it soon. But basically it places a dot file (.zfs) under the "main" file
system at its root, and all snapshots created are then placed namespace-wise
under .zfs/snapshot. Here's an example:
fsh-mullet# zfs snapshot bw_hog@monday
fsh-mullet# zfs snapshot bw_hog@tuesday
fsh-mullet# ls -a /bw_hog
. .. .zfs aces.txt is.txt zfs.txt
fsh-mullet# ls -a /bw_hog/.zfs/snapshot
. .. monday tuesday
fsh-mullet# ls -a /bw_hog/.zfs/snapshot/monday
. .. aces.txt is.txt zfs.txt
fsh-mullet#
With the introduction of .zfs/snapshot, we were faced with an interesting dilemma
for NFS - either only have NFS clients that could do "mirror mounts" have access
to the .zfs directory OR increase ZFS's fid for files under .zfs. "Mirror mounts" would allow us to do the technically correct solution of having a unique
FSID for the "main" file system and each of its snapshots. This requires NFS
clients to cross server mount points. The latter option has one FSID for the
"main" file system and all of its snapshots. This means the same file under
the "main" file system and any of its snapshots will appear to be the same - so
things like "cp" over NFS won't like it.
"Mirror mounts" is our lingo for letting clients cross server file system
boundaries - as dictated by the FSID (file system identifier). This is totally
legit in NFSv4 (see section "7.7. Mount Point Crossing" and section
"5.11.7. mounted_on_fileid" in RFC 3530). NFSv3 doesn't really
allow this functionality (see "3.3.3 Procedure 3: LOOKUP - Lookup filename"
in RFC 1813). Though, with a little
trickery, I'm sure it could be achieved - perhaps via the automounter?
The problem with mirror mounts is that no one has actually implemented them. So
if we went with the more technically correct solution of having a unique FSID
for the "main" local file system and a unique FSID for all its snapshots, only
Solaris Update 2(?) NFSv4 clients would be able to access .zfs upon initial
delivery of ZFS. That seems silly.
If we instead bend a little on the unique FSID, then all NFS clients in existence
today can access .zfs. That seems much more attractive. Oh wait... small
problem. We would rather like at least the filehandles to be different
for files in the "main" file system versus the snapshots - this ensures NFS doesn't
get completely confused. Slight problem is that the filehandles we give out
today are maxed out at the 32 byte NFSv2 protocol limitation (as mentioned
above). If we add any other bit of uniqueness to the filehandles (such as a
snapshot identifier) then v2 just can't handle it.... hmmm...
Well you know what? Tough s*&t v2. Seriously, you are antiquated and really
need to go away. Since the snapshot identifier doesn't need to be added
to the "main" file system, FIDs for non-.zfs snapshot files will remain the same
size and fit within NFSv2's limitations. So we can access ZFS over NFSv2, we just
will be denied .zfs's goodness:
fsh-weakfish# mount -o vers=2 fsh-mullet:/bw_hog /mnt
fsh-weakfish# ls /mnt/.zfs/snapshot/
monday tuesday
fsh-weakfish# ls /mnt/.zfs/snapshot/monday
/mnt/.zfs/snapshot/monday: Object is remote
fsh-weakfish#
So what about v3 and v4? Well, since v4 is the default for Solaris and its code
is simpler, I just changed v4 to handle bigger filehandles for now. NFSv3 is
coming soooon.
So we basically have the same structure as fhandle_t, except we extend it
a bit for NFSv4 via fhandle4_t:
/*
* This is the in-memory structure for an NFSv4 extended filehandle.
*/
typedef struct {
    fsid_t fhx_fsid; /* filesystem id */
    ushort_t fhx_len; /* file number length */
    char fhx_data[NFS_FH4MAXDATA]; /* and data */
    ushort_t fhx_xlen; /* export file number length */
    char fhx_xdata[NFS_FH4MAXDATA]; /* and data */
} fhandle4_t;
So the only difference is that FIDs can be up to 26 bytes instead of 10 bytes.
Why 26? That's what fits in NFSv3's protocol limitation of 64 bytes: 8(fsid) +
2(len) + 26(data) + 2(xlen) + 26(xdata) = 64 bytes. And if we ever need
larger than 64 byte filehandles for NFSv4, it's easy to change - just create a
new struct with the capacity for larger FIDs and use that for NFSv4. Why will
it be easier in the future than it was for this change? Well, part of what I
needed to do to make NFSv4 filehandles backwards compatible is that when
filehandles are actually XDR'd, we need to parse them so that filehandles that
used to be given out with 10 byte FIDs (based on the fhandle_t struct) continue
to be based on 10 byte FIDs, but at the same time VOP_FID()s
that return larger than 10 byte FIDs (such as .zfs) are allowed to do so. So
NFSv4 will return different length filehandles based on the need of the local
file system.
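Where the 10 and the 26 come from can be written down as one small function. This is just a sketch in C with hypothetical names: it assumes the layout described in this post (an 8 byte fsid, two 2 byte length fields, and two equally sized FID slots) and it does not count the two extra NFSv4 fields.

```c
#include <assert.h>
#include <stddef.h>

enum {
    SKETCH_FSID_BYTES = 8,  /* fsid */
    SKETCH_LEN_BYTES  = 2,  /* each of the two length fields */
    SKETCH_V2_FHSIZE  = 32, /* NFSv2 protocol limit */
    SKETCH_V3_FHSIZE  = 64  /* NFSv3 protocol limit */
};

/*
 * Largest FID that fits when a handle of fh_limit bytes holds
 * fsid + len + FID + xlen + exported FID, with both FID slots equal.
 */
size_t sketch_max_fid_bytes(size_t fh_limit)
{
    size_t overhead = SKETCH_FSID_BYTES + 2 * SKETCH_LEN_BYTES;

    assert(fh_limit >= overhead);
    return (fh_limit - overhead) / 2;
}
```

Calling it with SKETCH_V2_FHSIZE gives 10 and with SKETCH_V3_FHSIZE gives 26, matching the two constants discussed above.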
So checking out xdr_nfs_resop4, the old code (knowing that the filehandle was
safe to be a contiguous set of bytes) simply did this:
case OP_GETFH:
    if (!xdr_int(xdrs,
        (int32_t *)&objp->nfs_resop4_u.opgetfh.status))
        return (FALSE);
    if (objp->nfs_resop4_u.opgetfh.status != NFS4_OK)
        return (TRUE);
    return (xdr_bytes(xdrs,
        (char **)&objp->nfs_resop4_u.opgetfh.object.nfs_fh4_val,
        (uint_t *)&objp->nfs_resop4_u.opgetfh.object.nfs_fh4_len,
        NFS4_FHSIZE));
Now, instead of simply doing an xdr_bytes, we use the template
of fhandle_ext_t and internally always have the space for 26 byte FIDs,
but for OTW we skip bytes depending on what fhx_len and fhx_xlen are; see
xdr_encode_nfs_fh4.
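To make the "skip bytes" idea concrete, here is a rough sketch in C. It is not the real xdr_encode_nfs_fh4: all names are hypothetical, byte order and the two extra NFSv4 fields are ignored, and the rule that short FIDs are still emitted as 10 bytes (so older handles keep their 32 byte form) is my assumption about how backwards compatibility could be kept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_OLD_MAXDATA 10  /* FID bytes in the older fixed layout */
#define SKETCH_NEW_MAXDATA 26  /* FID bytes always available in memory */

/* In-memory handle modelled on the extended struct shown earlier. */
struct sketch_fhx {
    unsigned char fsid[8];
    uint16_t len;
    unsigned char data[SKETCH_NEW_MAXDATA];
    uint16_t xlen;
    unsigned char xdata[SKETCH_NEW_MAXDATA];
};

/* Bytes emitted for one FID: never fewer than the old 10 (assumption). */
static size_t sketch_wire_width(uint16_t len)
{
    return len < SKETCH_OLD_MAXDATA ? SKETCH_OLD_MAXDATA : len;
}

/* Copy n bytes from src to out at *off, then advance *off. */
static void sketch_put(unsigned char *out, size_t *off, const void *src,
    size_t n)
{
    memcpy(out + *off, src, n);
    *off += n;
}

/*
 * Pack the handle into out, skipping the unused tail of each FID slot.
 * Returns the number of bytes written, or 0 if the handle is invalid
 * or does not fit in cap bytes.
 */
size_t sketch_pack_fh(const struct sketch_fhx *fh, unsigned char *out,
    size_t cap)
{
    size_t dw, xw, need, off = 0;

    if (fh->len > SKETCH_NEW_MAXDATA || fh->xlen > SKETCH_NEW_MAXDATA)
        return 0;
    dw = sketch_wire_width(fh->len);
    xw = sketch_wire_width(fh->xlen);
    need = sizeof(fh->fsid) + sizeof(fh->len) + dw +
        sizeof(fh->xlen) + xw;
    if (need > cap)
        return 0;

    sketch_put(out, &off, fh->fsid, sizeof(fh->fsid));
    sketch_put(out, &off, &fh->len, sizeof(fh->len));
    sketch_put(out, &off, fh->data, dw);
    sketch_put(out, &off, &fh->xlen, sizeof(fh->xlen));
    sketch_put(out, &off, fh->xdata, xw);
    assert(off == need);
    return off;
}
```

With two 10 byte FIDs this produces 32 bytes, the same size as before the change, and with two 26 byte FIDs it produces 64 bytes.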
Whew, that's enough about filehandles for 2005.
(2007-05-07 11:04:22.0/2005-11-16 10:32:11.0)
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/bigger_filehandles_for_nfsv4_die
FS perf 102 : Filesystem Bandwidth
Now that you can grab the disk's BW, the next question is "How do I see what
BW my local file system can push?". First let's check writes for ZFS:
fsh-mullet# /bin/time sh -c 'lockfs -f .; mkfile 1g 1g.txt; lockfs -f .'
real 17.1
user 0.0
sys 1.1
So that's 1GB/17.1s = ~62MB/s for a 1 gig file. During the
mkfile(1M), you can use iostat(1M) to see how much disk BW is
going on:
fsh-mullet# iostat -Mxnz 1
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 541.0 0.0 67.6 0.0 35.0 0.0 64.7 1 100 c0t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 567.0 0.0 70.3 0.0 33.9 0.0 59.9 1 100 c0t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 254.9 0.0 29.0 0.0 15.7 0.0 61.6 0 64 c0t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 528.1 0.0 66.0 0.0 35.0 0.0 66.2 1 100 c0t1d0
We can also use zpool(1M) to show just the IO for ZFS:
fsh-mullet# zpool iostat 1
bw_hog 32.5K 33.7G 0 538 0 67.4M
bw_hog 189M 33.6G 0 30 0 459K
bw_hog 189M 33.6G 0 0 0 0
bw_hog 189M 33.6G 0 509 0 63.7M
bw_hog 189M 33.6G 0 544 0 68.1M
bw_hog 189M 33.6G 0 544 0 68.1M
bw_hog 189M 33.6G 0 535 0 67.0M
Now let's look at UFS writes:
fsh-mullet# /bin/time sh -c 'lockfs -f .; mkfile 1g 1g.txt; lockfs -f .'
real 18.7
user 0.1
sys 6.3
So UFS is doing 1GB/18.7s = ~57MB/s. Let's see some of that iostat:
fsh-mullet# iostat -Mxnz 1
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
4.0 70.0 0.0 58.9 0.0 10.8 0.0 145.6 0 99 c0t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
3.0 70.0 0.0 57.8 0.0 10.6 0.0 144.5 0 99 c0t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
4.0 70.0 0.0 59.4 0.0 11.2 0.0 151.3 0 99 c0t1d0
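The two MB/s figures quoted for ZFS and UFS can be reproduced with a one-line calculation. The sketch below is C with a made-up function name; it assumes that "mkfile 1g" writes 2^30 bytes and that a MB here means 10^6 bytes, which is what makes the numbers line up.

```c
#include <assert.h>

/* Bytes written by "mkfile 1g", assuming 1g means 2^30 bytes. */
#define SKETCH_ONE_GIB 1073741824.0

/* Average bandwidth in MB/s (10^6 bytes per MB) over an elapsed time. */
double sketch_mb_per_sec(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / 1e6 / seconds;
}
```

Plugging in the "real" times from /bin/time, 17.1 seconds gives roughly 62.8 MB/s for ZFS and 18.7 seconds gives roughly 57.4 MB/s for UFS.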
This was done on a 2-way v210 SPARC box, using a SCSI disk.
And why the 'lockfs' call you ask? This ensures that all data is flushed to
disk - and measuring how long it takes to do something that doesn't necessarily
get flushed is just not legit in this case. Persistent data is good.
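The same "don't stop the clock until the data is on disk" idea can be sketched in C. Note this swaps in a different technique: lockfs -f flushes the whole file system, while the sketch below uses the POSIX fsync() call, which only flushes the one file. The function name and chunk size are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/*
 * Write total_bytes of zeroes to path and return the elapsed seconds,
 * including the time for fsync() to push the data to stable storage.
 * Returns -1.0 on any error.
 */
double sketch_timed_write(const char *path, size_t total_bytes)
{
    static const char chunk[1 << 16];  /* 64 KiB of zeroes */
    struct timespec t0, t1;
    size_t left = total_bytes;
    int fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1.0;
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0) {
        close(fd);
        return -1.0;
    }
    while (left > 0) {
        size_t want = left < sizeof(chunk) ? left : sizeof(chunk);
        ssize_t n = write(fd, chunk, want);

        if (n <= 0) {  /* treat short-circuit errors as failure */
            close(fd);
            return -1.0;
        }
        left -= (size_t)n;
    }
    /* The flush is part of the measurement, not an afterthought. */
    if (fsync(fd) != 0 || clock_gettime(CLOCK_MONOTONIC, &t1) != 0) {
        close(fd);
        return -1.0;
    }
    if (close(fd) != 0)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) +
        (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Dropping the fsync() call would time only how fast the data lands in memory, which is the measurement this post is trying to avoid.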
(2005-11-16 10:18:45.0/2005-11-16 10:18:06.0)
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/fs_perf_102_filesystem_bw
FS perf 101 : Disk Bandwidth
One of the first questions to ask when testing local file system performance
is "what bandwidth can my disk give me?" First let's do writes:
sparc-box# dd if=/dev/zero of=/dev/dsk/c0t1d0s0 bs=1024k count=1000&
sparc-box# iostat -Mxnz 1
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 6462.1 0.0 50.5 128.2 256.0 19.8 39.6 100 100 sd0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 6495.8 0.0 50.7 128.8 256.0 19.8 39.4 100 100 sd0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 6491.1 0.0 50.7 128.7 256.0 19.8 39.4 100 100 sd0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 6494.4 0.0 50.7 128.6 256.0 19.8 39.4 100 100 sd0
Now let's do reads:
sparc-box# dd if=/dev/dsk/c0t1d0s0 of=/dev/null bs=1024k count=1000&
sparc-box# iostat -Mxnz 1
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
548.0 0.0 68.5 0.0 0.0 1.7 0.0 3.1 1 100 c0t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
548.0 0.0 68.5 0.0 0.0 1.7 0.0 3.1 1 100 c0t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
548.1 0.0 68.5 0.0 0.0 1.7 0.0 3.1 1 100 c0t1d0
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
548.0 0.0 68.5 0.0 0.0 1.7 0.0 3.1 1 100 c0t1d0
So this 2-way SPARC v210 using a SCSI disk can handle writes at ~50MB/s, and
reads at ~68MB/s.
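One more thing you can pull out of the iostat columns is the average size of each IO: divide the bandwidth by the operations per second. A minimal sketch in C, with a hypothetical function name and assuming iostat's "M" means 1024 KB:

```c
#include <assert.h>

/* Average KB per IO from iostat's Mr/s (or Mw/s) and r/s (or w/s). */
double sketch_avg_io_kb(double mb_per_sec, double ops_per_sec)
{
    assert(ops_per_sec > 0.0);
    return mb_per_sec * 1024.0 / ops_per_sec;
}
```

For the write run above (50.5 Mw/s at 6462.1 w/s) that works out to about 8 KB per write, while the read run (68.5 Mr/s at 548.0 r/s) works out to 128 KB per read, so the two dd runs hit the disk with very different IO sizes.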
(2005-11-16 10:06:18.0/2005-11-16 10:06:18.0)
Trackback: http://blogs.sun.com/erickustarz/en_US/entry/fs_perf_101_disk_bw
|