Eric Kustarz's Weblog

e-street


20051116 Wednesday November 16, 2005

 FS perf 201 : Postmark

Now let's run a simple but popular benchmark - NetApp's PostMark. Let's see how long it takes to do 1,000,000 transactions.

First let's try ZFS:

mcp# ./postmark                  
PostMark v1.5 : 3/27/01
pm>set location=/scsi_zfs
pm>set transactions=1000000
pm>run
Creating files...Done
Performing transactions..........Done
Deleting files...Done
Time:
        220 seconds total
        214 seconds of transactions (4672 per second)

Files:
        519830 created (2362 per second)
                Creation alone: 20000 files (4000 per second)
                Mixed with transactions: 499830 files (2335 per second)
        500124 read (2337 per second)
        494776 appended (2312 per second)
        519830 deleted (2362 per second)
                Deletion alone: 19660 files (19660 per second)
                Mixed with transactions: 500170 files (2337 per second)

Data:
        3240.97 megabytes read (14.73 megabytes per second)
        3365.07 megabytes written (15.30 megabytes per second)
pm>

During the run, I used our good buddy zpool(1M) to see how much IO we were doing:

mcp# zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
scsi_zfs    32.5K  68.0G      0    207      0  6.25M
scsi_zfs    32.5K  68.0G      0    821      0  24.1M
scsi_zfs    32.5K  68.0G      0    978      0  28.6M
scsi_zfs    32.5K  68.0G      0  1.04K      0  30.3M
scsi_zfs    32.5K  68.0G      0  1.01K      0  27.6M
scsi_zfs     129M  67.9G      0    797      0  16.2M
scsi_zfs     129M  67.9G      0    832      0  27.4M

Ok, onto UFS:

mcp# ./postmark 
PostMark v1.5 : 3/27/01
pm>set location=/export/scsi_ufs
pm>set transactions=1000000
pm>run
Creating files...Done
Performing transactions..........Done
Deleting files...Done
Time:
        3450 seconds total
        3419 seconds of transactions (292 per second)

Files:
        519830 created (150 per second)
                Creation alone: 20000 files (909 per second)
                Mixed with transactions: 499830 files (146 per second)
        500124 read (146 per second)
        494776 appended (144 per second)
        519830 deleted (150 per second)
                Deletion alone: 19660 files (2184 per second)
                Mixed with transactions: 500170 files (146 per second)

Data:
        3240.97 megabytes read (961.96 kilobytes per second)
        3365.07 megabytes written (998.79 kilobytes per second)
pm>

Also, during the run I grabbed a little iostat to see how UFS's IO was doing:

mcp# iostat -Mxnz 1
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  820.9    0.0    3.4 142.5 256.0  173.5  311.9 100 100 c4t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  797.0    0.0    3.1 129.2 256.0  162.1  321.2 100 100 c4t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  777.0    0.0    3.1 128.0 256.0  164.7  329.5 100 100 c4t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  827.1    0.0    4.0 128.8 256.0  155.7  309.5 100 100 c4t1d0

Yikes! So looking at throughput (number of transactions per second), ZFS is ~16x better than UFS on this benchmark (4672 vs. 292 transactions per second). Ok, so ZFS is not this good on every benchmark when compared to UFS, but we rather like this one.

This was run on a 2-way Opteron box, using the same SCSI disk for both ZFS and UFS.



(2006-08-08 10:49:56.0/2005-11-16 14:46:20.0)

 bigger filehandles for NFSv4 - die NFSv2 die

So what does a filehandle created by the Solaris NFS server look like? If we take a gander at the fhandle_t struct, we see its layout:

struct svcfh {
    fsid_t      fh_fsid;                    /* filesystem id */
    ushort_t    fh_len;                     /* file number length */
    char        fh_data[NFS_FHMAXDATA];     /* and data */
    ushort_t    fh_xlen;                    /* export file number length */
    char        fh_xdata[NFS_FHMAXDATA];    /* and data */
};
typedef struct svcfh fhandle_t;

Where fh_len represents the number of valid bytes in fh_data, and likewise, fh_xlen is the number of valid bytes in fh_xdata. Note, NFS_FHMAXDATA used to be:

#define	NFS_FHMAXDATA	((NFS_FHSIZE - sizeof (struct fhsize) + 8) / 2)

To be less confusing, I removed fhsize and shortened that to:

#define NFS_FHMAXDATA    10

Ok, but where does fh_data come from? It's the FID (via VOP_FID) of the local file system. fh_data represents the actual file of the filehandle, and fh_xdata represents the exported file/directory. So for NFSv2 and NFSv3, the filehandle is basically:
fsid + file FID + exported FID

NFSv4 is pretty much the same thing, except at the end we add two fields, and you can see the layout in nfs_fh4_fmt_t:

struct nfs_fh4_fmt {
 	fhandle_ext_t fh4_i;
 	uint32_t      fh4_flag;
 	uint32_t      fh4_volatile_id;
};

The fh4_flag is used to distinguish named attributes from "normal" files, and fh4_volatile_id is currently only used for testing purposes - for testing volatile filehandles of course. Since every local file system in Solaris has persistent filehandles, we don't need to use fh4_volatile_id quite yet.

So back to the magical "10" for NFS_FHMAXDATA... what's going on there? Well, adding those fields up, you get: 8(fsid) + 2(len) + 10(data) + 2(xlen) + 10(xdata) = 32 bytes, which is the protocol limitation of NFSv2 - just look for "FHSIZE". So the Solaris server is currently limiting its filehandles to 10 byte FIDs just to make NFSv2 happy. Note, this limitation has purposely crept into the local file systems to make this all work, check out UFS's ufid:

/*
 * This overlays the fid structure (see vfs.h)
 *
 * LP64 note: we use int32_t instead of ino_t since UFS does not use
 * inode numbers larger than 32-bits and ufid's are passed to NFS
 * which expects them to not grow in size beyond 10 bytes (12 including
 * the length).
 */
struct ufid {
 	ushort_t ufid_len;
 	ushort_t ufid_flags;
	int32_t	ufid_ino;
 	int32_t	ufid_gen;
};

Note that NFSv3's protocol limitation is 64 bytes and NFSv4's limitation is 128 bytes. So these two protocol versions could theoretically give out bigger filehandles, but there are two reasons why they don't for currently existing data: 1) there's really no need and, more importantly, 2) the filehandles MUST be the same on the wire before and after any change. If 2) isn't satisfied, then all clients with active mounts will get STALE errors when the longer filehandles are introduced. Imagine a server giving out 32 byte filehandles over NFSv3 for a file, then the server is upgraded and now gives out 64 byte filehandles - even if all the extra 32 bytes are zeroed out, that's a different filehandle and the client will think it has a STALE reference. Now a forced umount or client reboot will fix the problem, but it seems pretty harsh to force all active clients to perform some manual admin action for a simple (and what should be harmless) server upgrade.

So yeah, my blog title is how I changed filehandles to be bigger - which almost contradicts the above paragraph. The key point to note is that files that have never been served up via NFS have never had a filehandle generated for them (duh), so they can be whatever length the protocol allows and we don't have to worry about STALE filehandles.

If you're not familiar with ZFS's .zfs/snapshot, there will be a future blog on it soon. But basically it places a dot file (.zfs) under the "main" file system at its root, and all snapshots created are then placed namespace-wise under .zfs/snapshot. Here's an example:

fsh-mullet# zfs snapshot bw_hog@monday
fsh-mullet# zfs snapshot bw_hog@tuesday
fsh-mullet# ls -a /bw_hog
.         ..        .zfs      aces.txt  is.txt    zfs.txt
fsh-mullet# ls -a /bw_hog/.zfs/snapshot
.        ..       monday   tuesday
fsh-mullet# ls -a /bw_hog/.zfs/snapshot/monday
.         ..        aces.txt  is.txt    zfs.txt
fsh-mullet#

With the introduction of .zfs/snapshot, we were faced with an interesting dilemma for NFS - either only have NFS clients that could do "mirror mounts" have access to the .zfs directory OR increase ZFS's fid for files under .zfs. "Mirror mounts" would allow us to do the technically correct solution of having a unique FSID for the "main" file system and each of its snapshots. This requires NFS clients to cross server mount points. The latter option has one FSID for the "main" file system and all of its snapshots. This means the same file under the "main" file system and any of its snapshots will appear to be the same - so things like "cp" over NFS won't like it.

"Mirror mounts" is our lingo for letting clients cross server file system boundaries - as dictated by the FSID (file system identifier). This is totally legit in NFSv4 (see section "7.7. Mount Point Crossing" and section "5.11.7. mounted_on_fileid" in RFC 3530). NFSv3 doesn't really allow this functionality (see "3.3.3 Procedure 3: LOOKUP - Lookup filename" in RFC 1813). Though, with a little trickery, I'm sure it could be achieved - perhaps via the automounter?

The problem with mirror mounts is that no one has actually implemented them. So if we went with the more technically correct solution of having a unique FSID for the "main" local file system and a unique FSID for all its snapshots, only Solaris Update 2(?) NFSv4 clients would be able to access .zfs upon initial delivery of ZFS. That seems silly.

If we instead bend a little on the unique FSID, then all NFS clients in existence today can access .zfs. That seems much more attractive. Oh wait... small problem. We would rather like at least the filehandles to be different for files in the "main" file system from the snapshots - this ensures NFS doesn't get completely confused. Slight problem is that the filehandles we give out today are maxed out at the 32 byte NFSv2 protocol limitation (as mentioned above). If we add any other bit of uniqueness to the filehandles (such as a snapshot identifier) then v2 just can't handle it.... hmmm...

Well you know what? Tough s*&t v2. Seriously, you are antiquated and really need to go away. Since the snapshot identifier doesn't need to be added to the "main" file system, FIDs for non-.zfs snapshot files will remain the same size and fit within NFSv2's limitations. So we can access ZFS over NFSv2, it just will be denied .zfs's goodness:

fsh-weakfish# mount -o vers=2 fsh-mullet:/bw_hog /mnt
fsh-weakfish# ls /mnt/.zfs/snapshot/      
monday   tuesday
fsh-weakfish# ls /mnt/.zfs/snapshot/monday
/mnt/.zfs/snapshot/monday: Object is remote
fsh-weakfish# 

So what about v3 and v4? Well, since v4 is the default for Solaris and its code is simpler, I just changed v4 to handle bigger filehandles for now. NFSv3 is coming soooon. So we basically have the same structure as fhandle_t, except we extend it a bit for NFSv4 via fhandle4_t:

/*
 * This is the in-memory structure for an NFSv4 extended filehandle.
 */
typedef struct {
        fsid_t  fhx_fsid;                       /* filesystem id */
        ushort_t fhx_len;                       /* file number length */
        char    fhx_data[NFS_FH4MAXDATA];    /* and data */
        ushort_t fhx_xlen;                      /* export file number length */
        char    fhx_xdata[NFS_FH4MAXDATA];   /* and data */
} fhandle4_t;

So the only difference is that FIDs can be up to 26 bytes instead of 10 bytes. Why 26? That's what fits within NFSv3's protocol limitation of 64 bytes: 8(fsid) + 2(len) + 26(data) + 2(xlen) + 26(xdata) = 64. And if we ever need larger than 64 byte filehandles for NFSv4, it's easy to change - just create a new struct with the capacity for larger FIDs and use that for NFSv4. Why will it be easier in the future than it was for this change? Well, part of what I needed to do to make NFSv4 filehandles backwards compatible is that when filehandles are actually XDR'd, we need to parse them so that filehandles that used to be given out with 10 byte FIDs (based on the fhandle_t struct) continue to be given out based on 10 byte FIDs, but at the same time VOP_FID()s that return larger than 10 byte FIDs (such as .zfs) are allowed to do so. So NFSv4 will return different length filehandles based on the need of the local file system.

So checking out xdr_nfs_resop4, the old code (knowing that the filehandle was safe to be a contiguous set of bytes), simply did this:

case OP_GETFH:
	if (!xdr_int(xdrs,
		     (int32_t *)&objp->nfs_resop4_u.opgetfh.status))
		return (FALSE);
	if (objp->nfs_resop4_u.opgetfh.status != NFS4_OK)
		return (TRUE);
	return (xdr_bytes(xdrs,
	    (char **)&objp->nfs_resop4_u.opgetfh.object.nfs_fh4_val,
	    (uint_t *)&objp->nfs_resop4_u.opgetfh.object.nfs_fh4_len,
	    NFS4_FHSIZE));

Now, instead of simply doing an xdr_bytes, we use the template of fhandle_ext_t and internally always have the space for 26 byte FIDs, but over the wire (OTW) we skip bytes depending on what fhx_len and fhx_xlen are; see xdr_encode_nfs_fh4.

Whew, that's enough about filehandles for 2005.



(2007-05-07 11:04:22.0/2005-11-16 10:32:11.0)

 FS perf 102 : Filesystem Bandwidth

Now that you can grab the disk's BW, the next question is "How do I see what BW my local file system can push?". First let's check writes for ZFS:

fsh-mullet# /bin/time sh -c 'lockfs -f .; mkfile 1g 1g.txt; lockfs -f .'
real       17.1
user        0.0
sys         1.1

So that's 1GB/17.1s = ~62MB/s for a 1 gig file. During the mkfile(1M), you can use iostat(1M) to see how much disk BW is going on:

fsh-mullet# iostat -Mxnz 1
              
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  541.0    0.0   67.6  0.0 35.0    0.0   64.7   1 100 c0t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  567.0    0.0   70.3  0.0 33.9    0.0   59.9   1 100 c0t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  254.9    0.0   29.0  0.0 15.7    0.0   61.6   0  64 c0t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  528.1    0.0   66.0  0.0 35.0    0.0   66.2   1 100 c0t1d0

We can also use zpool(1M) to show just the IO for ZFS:

fsh-mullet# zpool iostat 1
bw_hog      32.5K  33.7G      0    538      0  67.4M
bw_hog       189M  33.6G      0     30      0   459K
bw_hog       189M  33.6G      0      0      0      0
bw_hog       189M  33.6G      0    509      0  63.7M
bw_hog       189M  33.6G      0    544      0  68.1M
bw_hog       189M  33.6G      0    544      0  68.1M
bw_hog       189M  33.6G      0    535      0  67.0M

Now let's look at UFS writes:

fsh-mullet# /bin/time sh -c 'lockfs -f .; mkfile 1g 1g.txt; lockfs -f .'
real       18.7
user        0.1
sys         6.3

So UFS is doing 1GB/18.7s = ~57MB/s. Let's see some of that iostat:

fsh-mullet# iostat -Mxnz 1
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    4.0   70.0    0.0   58.9  0.0 10.8    0.0  145.6   0  99 c0t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    3.0   70.0    0.0   57.8  0.0 10.6    0.0  144.5   0  99 c0t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    4.0   70.0    0.0   59.4  0.0 11.2    0.0  151.3   0  99 c0t1d0

This was done on a 2-way v210 SPARC box, using a SCSI disk.

And why the 'lockfs' call you ask? This ensures that all data is flushed to disk - and measuring how long it takes to do something that doesn't necessarily get flushed is just not legit in this case. Persistent data is good.



(2005-11-16 10:18:45.0/2005-11-16 10:18:06.0)

 FS perf 101 : Disk Bandwidth

One of the first questions to ask when testing local file system performance is "what bandwidth can my disk give me?" First let's do writes:

sparc-box# dd if=/dev/zero of=/dev/dsk/c0t1d0s0 bs=1024k count=1000&
sparc-box# iostat -Mxnz 1
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 6462.1    0.0   50.5 128.2 256.0   19.8   39.6 100 100 sd0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 6495.8    0.0   50.7 128.8 256.0   19.8   39.4 100 100 sd0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 6491.1    0.0   50.7 128.7 256.0   19.8   39.4 100 100 sd0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 6494.4    0.0   50.7 128.6 256.0   19.8   39.4 100 100 sd0

Now let's do reads:

sparc-box# dd if=/dev/dsk/c0t1d0s0 of=/dev/null bs=1024k count=1000&
sparc-box# iostat -Mxnz 1
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  548.0    0.0   68.5    0.0  0.0  1.7    0.0    3.1   1 100 c0t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  548.0    0.0   68.5    0.0  0.0  1.7    0.0    3.1   1 100 c0t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  548.1    0.0   68.5    0.0  0.0  1.7    0.0    3.1   1 100 c0t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  548.0    0.0   68.5    0.0  0.0  1.7    0.0    3.1   1 100 c0t1d0

So this 2-way SPARC v210 using a SCSI disk can handle writes at ~50MB/s, and reads at ~68MB/s.



(2005-11-16 10:06:18.0/2005-11-16 10:06:18.0)
