bigger filehandles for NFSv4 - die NFSv2 die
So what does a filehandle created by the Solaris NFS server look like? If we
take a gander at the
fhandle_t
struct, we see its layout:
struct svcfh {
    fsid_t   fh_fsid;                   /* filesystem id */
    ushort_t fh_len;                    /* file number length */
    char     fh_data[NFS_FHMAXDATA];    /* and data */
    ushort_t fh_xlen;                   /* export file number length */
    char     fh_xdata[NFS_FHMAXDATA];   /* and data */
};
typedef struct svcfh fhandle_t;
Here fh_len is the number of valid bytes in fh_data and,
likewise, fh_xlen is the number of valid bytes in fh_xdata. Note, NFS_FHMAXDATA used
to be:
#define NFS_FHMAXDATA ((NFS_FHSIZE - sizeof (struct fhsize) + 8) / 2)
To be less confusing, I removed fhsize and shortened that to:
#define NFS_FHMAXDATA 10
Ok, but where does fh_data come from? It's the FID (via VOP_FID) of the
local file system. fh_data represents the actual file of the filehandle,
and fh_xdata represents the exported file/directory. So for NFSv2 and
NFSv3, the filehandle is basically:
fsid + file FID + exported FID
NFSv4 is pretty much the same thing, except at the end we add two fields, and you
can see the layout in
nfs_fh4_fmt_t:
struct nfs_fh4_fmt {
    fhandle_ext_t fh4_i;
    uint32_t      fh4_flag;
    uint32_t      fh4_volatile_id;
};
The fh4_flag is used to distinguish named attributes from "normal" files,
and fh4_volatile_id is currently used only for testing purposes
(for testing volatile filehandles, of course). Since every local file system
in Solaris has persistent filehandles, we don't need to use
fh4_volatile_id quite yet.
So back to the magical "10" for NFS_FHMAXDATA... what's going on there? Well,
adding those fields up, you get: 8(fsid) + 2(len) + 10(data) + 2(xlen) +
10(xdata) = 32 bytes, which is the protocol limitation of
NFSv2 - just look for "FHSIZE". So
the Solaris server is currently limiting its filehandles to 10 byte FIDs just to
make NFSv2 happy. Note, this limitation has purposely crept into the local
file systems to make this all work; check out UFS's
ufid:
/*
 * This overlays the fid structure (see vfs.h)
 *
 * LP64 note: we use int32_t instead of ino_t since UFS does not use
 * inode numbers larger than 32-bits and ufid's are passed to NFS
 * which expects them to not grow in size beyond 10 bytes (12 including
 * the length).
 */
struct ufid {
    ushort_t ufid_len;
    ushort_t ufid_flags;
    int32_t  ufid_ino;
    int32_t  ufid_gen;
};
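The byte counts above can be double-checked with a small C sketch. The types here are stand-ins rather than the Solaris definitions: sketch_fsid_t assumes an 8 byte fsid made of two 32-bit values, and every sketch_ name is hypothetical. The sizes also assume the compiler adds no padding, which holds on common ABIs for this field order.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_FHMAXDATA 10     /* mirrors NFS_FHMAXDATA */

/* Assumed 8-byte file system id (two 32-bit values). */
typedef struct { int32_t val[2]; } sketch_fsid_t;

/* Same field order as struct svcfh. */
typedef struct {
    sketch_fsid_t fh_fsid;
    uint16_t      fh_len;
    char          fh_data[SKETCH_FHMAXDATA];
    uint16_t      fh_xlen;
    char          fh_xdata[SKETCH_FHMAXDATA];
} sketch_fhandle_t;

/* Same field order as struct ufid. */
typedef struct {
    uint16_t ufid_len;
    uint16_t ufid_flags;
    int32_t  ufid_ino;
    int32_t  ufid_gen;
} sketch_ufid_t;

/* 8 + 2 + 10 + 2 + 10 = 32, the NFSv2 FHSIZE. */
static_assert(sizeof (sketch_fhandle_t) == 32, "fhandle must be 32 bytes");
/* 2 (length) + 2 (flags) + 4 (inode) + 4 (generation) = 12. */
static_assert(sizeof (sketch_ufid_t) == 12, "ufid must be 12 bytes");
```

If either assertion fails to compile on some platform, that platform pads the structs differently and the stand-in types would need adjusting.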
Note that NFSv3's protocol limitation is 64 bytes and NFSv4's limitation is 128
bytes. So these two could theoretically give out bigger
filehandles, but there are two reasons why they don't for currently existing
data: 1) there's really no need and, more importantly, 2) the filehandles MUST
stay the same on the wire as they were before any change. If 2) isn't satisfied, then
all clients with active mounts will get STALE errors when the longer filehandles
are introduced. Imagine a server giving out 32 byte filehandles over NFSv3 for
a file, then the server is upgraded and now gives out 64 byte filehandles - even
if all the extra 32 bytes are zeroed out, that's a different filehandle and the
client will think it has a STALE reference. Now a forced umount or client reboot
will fix the problem, but it seems pretty harsh to force all active clients to
perform some manual admin action for a simple (and what should be harmless) server
upgrade.
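To make the zero-padding point concrete, here is a minimal sketch of how an opaque, variable-length filehandle gets compared. The helper sketch_fh_equal is hypothetical and not taken from any NFS implementation; it only illustrates that the length is part of the identity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: two filehandles name the same object only if
 * both their lengths and their bytes match.
 */
bool
sketch_fh_equal(const unsigned char *a, size_t alen,
    const unsigned char *b, size_t blen)
{
    return (alen == blen && memcmp(a, b, alen) == 0);
}
```

A 64 byte handle whose first 32 bytes match the old handle and whose remaining bytes are zero still compares as different, so a client holding the old 32 byte handle would no longer be recognized.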
So yeah, my blog title says I changed filehandles to be bigger - which almost
contradicts the above paragraph. The key point to note is that files that have
never been served up via NFS have never had a filehandle generated for them
(duh), so they can be whatever length the protocol allows and we don't have to
worry about STALE filehandles.
If you're not familiar with ZFS's .zfs/snapshot, there will be a future blog
on it soon. But basically it places a dot file (.zfs) under the "main" file
system at its root, and all snapshots created are then placed namespace-wise
under .zfs/snapshot. Here's an example:
fsh-mullet# zfs snapshot bw_hog@monday
fsh-mullet# zfs snapshot bw_hog@tuesday
fsh-mullet# ls -a /bw_hog
. .. .zfs aces.txt is.txt zfs.txt
fsh-mullet# ls -a /bw_hog/.zfs/snapshot
. .. monday tuesday
fsh-mullet# ls -a /bw_hog/.zfs/snapshot/monday
. .. aces.txt is.txt zfs.txt
fsh-mullet#
With the introduction of .zfs/snapshot, we were faced with an interesting dilemma
for NFS: either only NFS clients that can do "mirror mounts" get access
to the .zfs directory, OR we increase ZFS's FID for files under .zfs. "Mirror
mounts" would allow us to do the technically correct solution of having a unique
FSID for the "main" file system and each of its snapshots. This requires NFS
clients to cross server mount points. The latter option has one FSID for the
"main" file system and all of its snapshots. This means the same file under
the "main" file system and any of its snapshots will appear to be the same - so
things like "cp" over NFS won't like it.
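Why would "cp" care? Tools like it typically decide whether two paths are the same file by comparing the device and inode numbers returned by stat(2). The sketch below assumes the NFS client derives st_dev from the FSID and st_ino from the fileid; sketch_same_file is a hypothetical helper, not code from cp.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: treat two stat results as the same file when
 * both the device and the inode number match.
 */
bool
sketch_same_file(const struct stat *a, const struct stat *b)
{
    return (a->st_dev == b->st_dev && a->st_ino == b->st_ino);
}
```

With one shared FSID, a live file and its snapshot copy can report identical device and inode numbers, so a check like this would call them the same file.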
"Mirror mounts" is our lingo for letting clients cross server file system
boundaries - as dictated by the FSID (file system identifier). This is totally
legit in NFSv4 (see section "7.7. Mount Point Crossing" and section
"5.11.7. mounted_on_fileid" in
RFC 3530). NFSv3 doesn't really
allow this functionality (see "3.3.3 Procedure 3: LOOKUP - Lookup filename"
in RFC 1813). Though, with a little
trickery, I'm sure it could be achieved - perhaps via the automounter?
The problem with mirror mounts is that no one has actually implemented them. So
if we went with the more technically correct solution of having a unique FSID
for the "main" local file system and a unique FSID for all its snapshots, only
Solaris Update 2(?) NFSv4 clients would be able to access .zfs upon initial
delivery of ZFS. That seems silly.
If we instead bend a little on the unique FSID, then all NFS clients in existence
today can access .zfs. That seems much more attractive. Oh wait... small
problem. We would at least like the filehandles for files in the "main" file
system to differ from those in the snapshots - this ensures NFS doesn't
get completely confused. Slight problem is that the filehandles we give out
today are maxed out at the 32 byte NFSv2 protocol limitation (as mentioned
above). If we add any other bit of uniqueness to the filehandles (such as a
snapshot identifier) then v2 just can't handle it.... hmmm...
Well you know what? Tough s*&t v2. Seriously, you are antiquated and really
need to go away. Since the snapshot identifier doesn't need to be added
to the "main" file system, FIDs for non-.zfs snapshot files will remain the same
size and fit within NFSv2's limitations. So we can access ZFS over NFSv2, we
just will be denied .zfs's goodness:
fsh-weakfish# mount -o vers=2 fsh-mullet:/bw_hog /mnt
fsh-weakfish# ls /mnt/.zfs/snapshot/
monday tuesday
fsh-weakfish# ls /mnt/.zfs/snapshot/monday
/mnt/.zfs/snapshot/monday: Object is remote
fsh-weakfish#
So what about v3 and v4? Well, since v4 is the default for Solaris and its code
is simpler, I just changed v4 to handle bigger filehandles for now. NFSv3 is
coming soooon.
So we basically have the same structure as fhandle_t, except we extend it
a bit for NFSv4 via
fhandle4_t:
/*
 * This is the in-memory structure for an NFSv4 extended filehandle.
 */
typedef struct {
    fsid_t   fhx_fsid;                    /* filesystem id */
    ushort_t fhx_len;                     /* file number length */
    char     fhx_data[NFS_FH4MAXDATA];    /* and data */
    ushort_t fhx_xlen;                    /* export file number length */
    char     fhx_xdata[NFS_FH4MAXDATA];   /* and data */
} fhandle4_t;
So the only difference is that FIDs can be up to 26 bytes instead of 10 bytes.
Why 26? That's what fits in NFSv3's protocol limitation of 64 bytes:
8 + 2 + 26 + 2 + 26 = 64. And if we ever need
larger than 64 byte filehandles for NFSv4, it's easy to change - just create a
new struct with the capacity for larger FIDs and use that for NFSv4. Why will
it be easier in the future than it was for this change? Well, part of what I
needed to do to make NFSv4 filehandles backwards compatible is that when
filehandles are actually XDR'd, we need to parse them so that filehandles that
used to be given out with 10 byte FIDs (based on the fhandle_t struct) continue
to be given out based on 10 byte FIDs, but at the same time VOP_FID()s
that return larger than 10 byte FIDs (such as .zfs) are allowed to do so. So
NFSv4 will return different length filehandles based on the need of the local
file system.
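The relationship between a protocol's filehandle limit and the largest FID it leaves room for is simple arithmetic. Here is a sketch, assuming the layout described above (an 8 byte fsid, two 2 byte lengths, and two equally sized FID slots); sketch_max_fid_len and SKETCH_FH_OVERHEAD are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed overhead: 8-byte fsid plus two 2-byte length fields. */
#define SKETCH_FH_OVERHEAD (8 + 2 + 2)

/*
 * Largest FID, in bytes, that fits twice (file and export) into a
 * filehandle of fh_limit bytes. Returns 0 if the limit is too small.
 */
size_t
sketch_max_fid_len(size_t fh_limit)
{
    if (fh_limit < SKETCH_FH_OVERHEAD)
        return (0);
    return ((fh_limit - SKETCH_FH_OVERHEAD) / 2);
}
```

Plugging in 32 gives the 10 byte FIDs of NFSv2, and 64 gives the 26 byte FIDs used for the extended filehandle.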
So checking out
xdr_nfs_resop4,
the old code (knowing that the filehandle was safe to be a contiguous set of
bytes), simply did this:
case OP_GETFH:
    if (!xdr_int(xdrs,
        (int32_t *)&objp->nfs_resop4_u.opgetfh.status))
        return (FALSE);
    if (objp->nfs_resop4_u.opgetfh.status != NFS4_OK)
        return (TRUE);
    return (xdr_bytes(xdrs,
        (char **)&objp->nfs_resop4_u.opgetfh.object.nfs_fh4_val,
        (uint_t *)&objp->nfs_resop4_u.opgetfh.object.nfs_fh4_len,
        NFS4_FHSIZE));
Now, instead of simply doing an xdr_bytes, we use the template
of fhandle_ext_t and internally always have the space for 26 byte FIDs,
but over the wire (OTW) we skip bytes depending on what fhx_len and fhx_xlen are; see
xdr_encode_nfs_fh4.
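The idea of skipping unused bytes can be sketched roughly as follows. This is not a transcription of xdr_encode_nfs_fh4: the sketch_ names are hypothetical, the fields are copied raw instead of going through XDR, the trailing fh4_flag and fh4_volatile_id are left out, and padding short FIDs up to the legacy 10 byte slot is an assumption about how old filehandles could keep their 32 byte shape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_FHMAXDATA   10  /* legacy FID slot size (NFS_FHMAXDATA) */
#define SKETCH_FH4MAXDATA  26  /* extended FID slot size (NFS_FH4MAXDATA) */

/* Hypothetical stand-in for the in-memory extended filehandle. */
typedef struct {
    uint8_t  fsid[8];
    uint16_t len;
    uint8_t  data[SKETCH_FH4MAXDATA];
    uint16_t xlen;
    uint8_t  xdata[SKETCH_FH4MAXDATA];
} sketch_fh_t;

/* Copy one length-prefixed FID, padding short FIDs to the legacy slot. */
static size_t
sketch_put_fid(uint8_t *out, uint16_t len, const uint8_t *data)
{
    size_t slot = (len > SKETCH_FHMAXDATA) ? len : SKETCH_FHMAXDATA;

    memcpy(out, &len, sizeof (len));
    memset(out + sizeof (len), 0, slot);
    memcpy(out + sizeof (len), data, len);
    return (sizeof (len) + slot);
}

/*
 * Encode fh into out (which must hold at least 64 bytes) and return
 * the number of bytes written, or 0 if a FID length is out of range.
 */
size_t
sketch_encode_fh(const sketch_fh_t *fh, uint8_t *out)
{
    size_t n = 0;

    if (fh->len > SKETCH_FH4MAXDATA || fh->xlen > SKETCH_FH4MAXDATA)
        return (0);
    memcpy(out, fh->fsid, sizeof (fh->fsid));
    n += sizeof (fh->fsid);
    n += sketch_put_fid(out + n, fh->len, fh->data);
    n += sketch_put_fid(out + n, fh->xlen, fh->xdata);
    return (n);
}
```

Under these assumptions a filehandle built from two 10 byte FIDs still encodes to 32 bytes, while one with 26 byte FIDs grows to 64 bytes, so the wire length follows what the local file system actually returned.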
whew, that's enough about filehandles for 2005.
(2007-05-07 11:04:22.0/2005-11-16 10:32:11.0)
Posted by Julius Rahmandar on January 27, 2006 at 10:49 AM PST #
Ok, so "A" is your server running s10... "B" (s9) and "C" (s10) are your clients.
Is the /ramdisk temporary? What are you trying to "ls -l" when you get "no such file or directory"?
Is the 90 seconds after "A" comes back up or is that including the reboot time?
Note this blog isn't about fixing filehandles, it's about extending them when the local file system needs a larger FID - like ZFS's .snapshot. So what you're seeing has nothing to do with filehandles.
Please follow up with a message to nfs-discuss@opensolaris.org - it's easier to respond there.
Posted by eric kustarz on January 27, 2006 at 10:58 AM PST #