I've got two clients doing some simple file copies, file removes, and directory removes. After a while, I stop both test scripts and unmount the mds. We can see the relationship between files, layouts, and device tables (mpd):
[root@pnfs-17-24 ~]> ./rlays.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_Layout_entry_cache|::print struct rfs4_dbe
refcnt = 0x50
refcnt = 0x99
refcnt = 0x49
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt | wc -l
+ mdb -k
+ echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
303
[root@pnfs-17-24 ~]> ./rmpd.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_mpd_entry_cache|::print struct rfs4_dbe
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
I just fixed a bug where a layout request invalidated the old layout. So we can see here that there are over 300 files with state which are still active, there are 3 corresponding layouts, each with a different usage, and 3 mpds, each with a hold from their respective layout. The 3 layouts correspond to the 3 policies in effect in the spe. There is no round robin scheduling going on, which might see more layouts in use.
I haven't fixed it yet, and this example doesn't show it, but each layout create is going to cause a corresponding mpd create. Which is okay for now, the only difference we could support would be a different stripe unit size.
And we can see that the files have been harvested, but the layouts have not:
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt | wc -l
+ mdb -k
+ echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
0
[root@pnfs-17-24 ~]> ./rlays.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_Layout_entry_cache|::print struct rfs4_dbe
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
[root@pnfs-17-24 ~]> ./rmpd.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_mpd_entry_cache|::print struct rfs4_dbe
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
The layouts are ripe for plucking, with a refcnt of 1, it means that they are in the table, but no one else references them. Let's see if we can quickly use them again:
[root@pnfs-17-24 ~]> ./rlays.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_Layout_entry_cache|::print struct rfs4_dbe
refcnt = 0x5
refcnt = 0xb
refcnt = 0x4
[root@pnfs-17-24 ~]> ./rmpd.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_mpd_entry_cache|::print struct rfs4_dbe
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
Now I don't have proof here that these are the same entries in the db, but what I wanted to point out is that I never invalidate entries in either the layout or mpd tables. Instead, I rely on the fact that they can not go away until all external references are gone. Based on timing, I'd say we did reuse the entries, otherwise I would expect to see 6 mpds and not 3 of them. I.e., they do not get reused and if there were new layout entries, we'd expect to see 6 mpds.
I'm interested in that layout and mpd entries get harvested as they become stale. Because of the prior assumption that there was only ever 1 of each, it never was an issue.
If we recall our mappings, each mpd can belong to several layouts, each layout can belong to several files, etc. So when I discovered that mpds were sticking in memory with a refcnt of 2, I checked the layouts, saw that they had a refcnt of 2, and then checked the files:
[root@pnfs-17-24 ~]> more rlays.sh
#!/bin/sh -x
echo '::walk mds_Layout_entry_cache|::print struct rfs4_dbe' | mdb -k
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
Now, some of these would go away, but in the end, several would stick in memory. I actually found 2 nasty rfs4_file_t rele bugs and fixed them, but the issue stayed.
I thought my last fix did the trick, I turned my attention away for a bit as I worked on my MacBook Pro, and the file references went away. I rebooted and tried again. No such luck, that was when I got the above.
I unmounted the mds from my two clients and went away to study the code, trying to find out how an mds_op_close() did a file rele which would lead to the layout rele.
I was going to reboot, without any luck, and try the case where I created the files and did nothing else - the remove code is suspect to me still. Well, before I did that, I checked the file refcnts:
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k+
echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
My guess is that somehow the clients are causing state to be clung to on the server. It is only on the unmount that a final rele takes place. It has been some time since I did that last check:
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k + echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
[root@pnfs-17-24 ~]> ./rlays.sh
+ mdb -k
+ echo ::walk mds_Layout_entry_cache|::print struct rfs4_dbe
[root@pnfs-17-24 ~]> ./rmpd.sh
+ mdb -k
+ echo ::walk mds_mpd_entry_cache|::print struct rfs4_dbe
{
lock = [
{
_opaque = [ 0 ]
},
]
refcnt = 0x1
Whoops, I expected the mpds to be out of memory as well and after a bit of patience, they are. :->
[root@pnfs-17-24 ~]> ./rmpd.sh | grep refcnt + mdb -k + echo ::walk mds_mpd_entry_cache|::print struct rfs4_dbe
A quick test of creating 4 files sees:
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
refcnt = 0x5
refcnt = 0x5
refcnt = 0x5
refcnt = 0x5
If we then unmount, the snoop is mainly the following:
[root@pnfs-4-02 ~]> grep RETURN\) xxx NFS: Op = 8 (DELEGRETURN) NFS: Op = 8 (DELEGRETURN) NFS: Op = 51 (LAYOUTRETURN) NFS: Op = 51 (LAYOUTRETURN) NFS: Op = 8 (DELEGRETURN) NFS: Op = 8 (DELEGRETURN) NFS: Op = 51 (LAYOUTRETURN) NFS: Op = 51 (LAYOUTRETURN) NFS: Op = 8 (DELEGRETURN) NFS: Op = 8 (DELEGRETURN) NFS: Op = 51 (LAYOUTRETURN) NFS: Op = 51 (LAYOUTRETURN) NFS: Op = 8 (DELEGRETURN) NFS: Op = 8 (DELEGRETURN) NFS: Op = 51 (LAYOUTRETURN) NFS: Op = 51 (LAYOUTRETURN)
And indeed, we see the refcnts get decremented by two for each file:
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
refcnt = 0x3
refcnt = 0x3
refcnt = 0x3
refcnt = 0x3
And after a while:
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt + mdb -k + echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
I'm actually happy that this strange behavior occurred. I wouldn't have taken the time to painstakingly poured over every file rele and I wouldn't have thus found the two bugs. The first was in error handling in the remove case and the second was in file delegation recalls. The first was probably hit or miss to being found. The second was a timebomb.
The spe is a simple policy engine, which dictates the stripe size, the block size, and the set of devices used during file creation - these three items form the layout. The impact of that statement is subtle, but it sets some expectations. The spe code only fires on a file creation, so if a file already exists, then we read in the on disk layout (odl) to get the layout.
So, if files were created under a different set of policies, then there will be no problem accessing the files under the new set of policies.
Here is a file we created before we even enabled spe:
[root@pnfs-17-21 ~]> cp nfs4_vnops.c /pnfs/pnfs-17-24/redo.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
Layout obtained at: Sat Jul 25 17:51:55:18886 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [1]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [2]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [3]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
I've rebooted the client and mds many times, I've added spe policies, but I can check to see that we get the exact same layout. (Note, I've also changed the mount point to be the root of the mds, so expect the path to be different!)
[root@pnfs-4-02 ~]> touch /pnfs/pnfs-17-24/pnfs2/pnfs/redo.c
[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs2/pnfs/redo.c
Layout unacquired
[root@pnfs-4-02 ~]> cat /pnfs/pnfs-17-24/pnfs2/pnfs/redo.c > /dev/null
[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs2/pnfs/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 1
Layout [0]:
Layout obtained at: Mon Jul 27 11:38:40:673328 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [1]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [2]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [3]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
One thing to note right off the bat is that 'touch' does not cause the file layout to be loaded. That is simple, the 'touch' command is a meta-data operation and can be serviced on the mds. It does not need to open the file. I'd have to look at the snoop output to be sure, but we look like we are getting the same layout.
Now, if I do the same copy as before, we see:
[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs2/pnfs/redome.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 116
Layout [0]:
Layout obtained at: Mon Jul 27 11:43:30:691639 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 3968
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [3]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
We just happen to get a stripe count of 4, but notice that the stripe unit is much smaller. We can tell that the extension of '.c' caused the first policy to fire:
[root@pnfs-17-24 ~]> more /etc/policies.spe 1000, 4, 4k, pool17, ext == c 2000, 8, 1k, pool17:pool4, path == /pnfs1 3000, 3, 8k, pool4:pool17, path == /pnfs2
Okay, what if we call it redome.h? That should cause the last policy to match. Note that the test of the path attribute is based on the path of the mds and not the client. Also, if we have a mount point under the root of the mds, we still get the complete path when we do the check.
[root@pnfs-4-02 ~]> cp nfs4_vnops.c /pnfs/pnfs-17-24/pnfs2/pnfs/redome.h
[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs2/pnfs/redome.h
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 64
Layout [0]:
Layout obtained at: Mon Jul 27 11:46:20:549943 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 3, stripe unit: 8000
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
And it does! We can see it in both the stripe count and stripe unit.
And these files will have those same layouts as long as they live.
Well, I need to go figure out some more unit tests to break my code...
Okay, I've forced the ds_dataset_name to be present at creation. And I'm getting pretty hashes. I'm still not generating a mds_sid array.
When we store 'pnfs-4-01:pnfs1/d10':
Jul 27 10:59:45 pnfs-17-24 nfssrv: utf8_hash of |pnfs-4-01:pnfs1/d10|[20] is 55811466
And when we try to search for it:
Jul 27 11:04:51 pnfs-17-24 nfssrv: utf8_hash of |pnfs-4-01:pnfs1/ds10|[21] is 111623246
The lengths are different, which will cause a bit shift. And we see that one is 'd10' and the other is 'ds10'. Well, I'll change '/etc/npools.spe' as I don't have renaming a dataset name <-> mds_sid working yet.
[root@pnfs-4-01 ~]> zfs rename pnfs1/d10 pnfs1/ds10 cannot rename 'pnfs1/d10': dataset is busy
And there could be other problems.
But anyway, fix the dataset name, and we have spe, I repeat, we have spe!
Here we can see a file creation without any policy firing:
[root@pnfs-4-02 ~]> mount -o vers=4 pnfs-17-24:/ /pnfs/pnfs-17-24
[root@pnfs-4-02 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/pnfs1/nfs41/spek4.txt
[root@pnfs-4-02 ~]> nfsstat -l !$
nfsstat -l /pnfs/pnfs-17-24/pnfs1/nfs41/spek4.txt
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 4
Layout [0]:
Layout obtained at: Mon Jul 27 11:04:51:725322 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 10, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
Stripe [3]:
tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
Stripe [4]:
tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
Stripe [5]:
tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
Stripe [6]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [7]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [8]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [9]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
And here we can see the same file copy resulting in not only a different number of stripes, but to a different dataset order!
[root@pnfs-4-02 ~]> umount /pnfs/pnfs-17-24
[root@pnfs-4-02 ~]> mount -o vers=4 pnfs-17-24:/ /pnfs/pnfs-17-24
[root@pnfs-4-02 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/pnfs1/nfs41/spek5.txt
[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs1/nfs41/spek5.txt
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 140
Layout [0]:
Layout obtained at: Mon Jul 27 11:27:06:306959 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 8, stripe unit: 960
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [3]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [4]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [5]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [6]:
tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
Stripe [7]:
tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
So, I started off a test run with the kspe. It allows me to decide which datasets are going to be used in a layout. (see my slides from Oklahoma City OpenSolaris User Group (OKCOSUG) presentation). I set up a simple set of npools and policies:
[root@pnfs-17-24 /etc]> more npools.spe pool17 pnfs-17-22:pnfs1/ds1 pnfs-17-23:pnfs1/ds3 pnfs-17-22:pnfs2/ds2 pnfs-17-23 :pnfs2/ds4 pnfs-17-23:pnfs1/ds7 pnfs-17-22:pnfs1/ds8 pool4 pnfs-4-01:pnfs1/ds4 pnfs-4-01:pnfs2/ds5 pnfs-4-01:pnfs2/d9 pnfs-4-01:pnfs1 /ds10 [root@pnfs-17-24 /etc]> more policies.spe 1000, 4, 4k, pool17, ext == c 2000, 8, 1k, pool17:pool4, path == /pnfs1 3000, 3, 8k, pool4:pool17, path == /pnfs2
Note that a dataset identifier is a combination of host name and the zfs filesystem. I wrote the code and I still struggle with this is not a path name, it is a zfs name. I.e., it is not ':/' as a connector. 'pnfs2/ds2' is the zfs name and not '/pnfs2/ds2'!
And struggled a bit to load them, until I found my kspe implementation notes.
And right off the bat, I see my policies are not matching and I've got a core!
[root@pnfs-4-02 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/pnfs1/nfs41/spe_1234.32k.raw.txt
yields
[root@pnfs-17-24 ~]> rc = 0, eval = 0, id = 1000
rc = 0, eval = 1, id = 2000
WARNING: spe: 1000 8
panic[cpu1]/thread=ffffff01d74bf4e0: BAD TRAP: type=e (#pf Page fault) rp=ffffff0008162d70 addr=0 occurred in module "nfssrv" due to a NULL pointer dereference
...
[1]> $c
kmdb_enter+0xb()
...
nfssrv`mds_layout_hash+0x19(ffffff0008162fe0)
...
[1]> ffffff0008162fe0::print layout_core_t
{
lc_stripe_unit = 0x3e8
lc_stripe_count = 0x8
lc_mds_sids = 0
}
Which tells me that the kspe code did not generate a mds_sid array. Again, no problem, I just wrote that code last week and this is the first test of it!
So the bug was a stupid error:
return (mds_sids ? 0 : ENOENT);
versus
return (*mds_sids ? 0 : ENOENT);
Success was being returned with an empty array of mds_sids. There is still a bug, we should be finding matches, but at least it is not so nasty.
The bug is that these comparisons are failing:
Jul 27 01:37:45 pnfs-17-24 nfssrv: Comparing policy npool |pool17| at 7 to global |pool17| at 7 Jul 27 01:37:45 pnfs-17-24 nfssrv: Comparing policy npool |pool17| at 7 to global |pool4| at 6 Jul 27 01:37:45 pnfs-17-24 nfssrv: Comparing policy npool |pool4| at 6 to global |pool17| at 7 Jul 27 01:37:45 pnfs-17-24 nfssrv: Comparing policy npool |pool4| at 6 to global |pool4| at 6 Jul 27 01:37:45 pnfs-17-24 nfssrv: spe_map_npools_to_mds_sids: No matching npools!
The first and last should match. Ah, they do (as shown by additional debug logic), which shows me barking up a wrong tree!
D'oh! I found it! If we look at this code:
spe_map_npools_to_mds_sids(kspe_state_t *kspe, spe_policy *sp,
...
spe_npool *sn;
spe_npool *np;
/*
* For each npool in the policy, find it in the
* list of npools, and start assigning datasets.
*/
for (sn = sp->sp_npools; sn != NULL; sn = sn->next) {
for (np = kspe->ks_npools; np; np = np->next) {
cmp = utf8_compare(&np->sn_name, &sn->sn_name);
if (cmp == 0) {
/*
* Now we fill in entries in the *mds_sids
* array.
*/
for (ss = sn->sn_dses; ss; ss = ss->next) {
We see I was lazy in assuming that sn and np were the same thing. Note that they are the same type of object, but the sn points to the npools in the policy and np points to the npools in the global list. The point is that a npool can be in multiple policies. So rather than store the list of datasets in the policies (which is a nightmare for updating) or pointers to npools in the policies (which sounds good and I can't remember why not!) we store the datasets in the global list.
So instead of searching a list of datasets in the global list, we search an empty list in the policy. :->
And I'm at the next bug:
Jul 27 02:11:04 pnfs-17-24 nfssrv: mds_ds_path_to_mds_sid returned an error!
We're not finding a match as we search the ds_guid_info database. When I had a search issue a couple nights ago, it turned out to be the hash function. If we look at what we have:
instp->ds_guid_info_dataset_name_idx =
rfs4_index_create(instp->ds_guid_info_tab,
"DS_guid-dataset-name-idx", mds_str_hash,
ds_guid_info_dataset_name_compare, ds_guid_info_dataset_name_mkkey,
FALSE);
We are using mds_str_hash on an utf8_string - which is a length and a string. I think that will be problematic. I'm not too happy with the hash functions here in general, but for now all I care about is that they are consistent.
They aren't consistent in this case and it is getting late...
As it boots:
Jul 27 03:33:41 pnfs-17-24 nfssrv: utf8_hash of empty str
As we search:
Jul 27 03:34:35 pnfs-17-24 nfssrv: utf8_hash of |pnfs-17-22:pnfs1/ds1|[21] is 111612431
We ought to see a non-NULL addition and we ought to see 10 messages, not 1.
Bzzt! That is an expectation and not code. The debug statements are correct. If we look at the entry create routine, which is called just before the hash function, we see:
static bool_t
ds_guid_info_create(rfs4_entry_t u_entry, void *arg)
{
...
pgi->ds_dataset_name.utf8string_val = NULL;
pgi->ds_dataset_name.utf8string_len = 0;
We don't have our hands on that info -- or do we?
We do - I can fix this - but not now. It will wait until tomorrow, because I've got to sleep...
So after a reboot, a diff of a file copied to the server is showing problems. There are several triage steps to take:
I can show the problem thusly:
[root@pnfs-17-21 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw
[root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw | wc -l
0
How do I know that the client is even getting the bits back from the server. If they are still in local buffers, this may not be happening. So, lets try from another client:
[root@pnfs-4-02 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw | wc -l
132
So it isn't the reboot and it probably is not the ODL.
Okay, I just configured a stock community and repeated the same test and there was no loss of data. Clearly either my changes or changes I failed to do have caused the problem. I.e., either I caused a bug or I exposed an existing bug - one that the new feature set enabled. Either way, I have to fix it before I can integrate my changes.
I'm beginning to think it is in code which has yet to be implemented:
[thud@ultralord nfs]> grep dnk_sid * dserv_server.c: dskey.dnk_sid = &fhp->fh.v1.mds_sid; [thud@ultralord nfs]> grep dnk_fid * dserv_server.c: CRC32(rc, key->dnk_fid->val, key->dnk_fid->len, dserv_server.c: NFS_AVL_COMPARE(a->dnk_fid->len, b->dnk_fid->len); dserv_server.c: rc = memcmp(a->dnk_fid->val, b->dnk_fid->val, dserv_server.c: a->dnk_fid->len); dserv_server.c: data->dnd_fid = key->dnk_fid; dserv_server.c: dskey.dnk_fid = &fhp->fh.v1.mds_fid; dserv_server.c: key->dnk_fid = &key->dnk_real_fid;
I enabled the setting of 'dnk_sid', but no one uses it. I bet what is happening is that both 32k chunks of the files are being written on the same dataset...
With some debug printfs, we can see that the DSes are picking different object sets to save the data:
Jul 26 21:28:46 pnfs-17-22 nfssrv: Accessing pnfs2/ds2/EF25FBCF08FA1BD7 for D336969B13CB6C0E000000000000001E Jul 26 21:28:46 pnfs-17-22 nfssrv: Accessing pnfs1/ds1/EF25FBCF08FA1BD7 for 86FE860F0E0B87D4000000000000001E
So I implemented the code for 'dnk_sid' and now I'm seeing:
[root@pnfs-4-02 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/slow | wc -l
0
It also works now across a reboot.
The bad thing here is that I'm making it up on the fly, I don't understand the nnode code. I'm going to have to get a code review just on that chunk.
Actually, strike that. I know there was a bug, missing code to be exact. I know my fix works, and I understand the changes I made. I don't understand how those changes work in the bigger picture of the nnode code. Oh, I understand that when we tried to access the two different datasets on the same DS, we would sometimes get one instead of the other. I'm still going to talk with the author, I want to make sure my fixes are optimal.
Last night, I did the following test of file creation:
[root@pnfs-17-21 ~]> cp nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/Pnfs4_vnops.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
Layout obtained at: Sat Jul 25 02:55:17:818063 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
[root@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/Pnfs4_vnops.c
-rw-r--r-- 1 root root 409540 Jul 25 02:55 /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c
So we have pNFS!
I then went to bed, knowing that the next set of testing to do was what we call on disk layouts or odl. When we create a file, we generate a layout based on all of the currently reporting datasets (ignore kspe, which I will soon enable and start testing). When we generate the create layout, we need to stuff it somewhere such that a client which opens the file either for appending or reading can get at the contents. So we store the layout on disk, hence odl.
The system we had in place made assumptions like all devices would be there, they would report in the same order, etc. Well, with using the mds_sids as keys, I've changed that.
To test things, I've rebooted the client and the servers. I want to make sure that there is nothing in memory. Soon, I'll add a new DS to make sure the number of reporting datasets does not have an impact. But for now, what happens?
[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c2 Binary files nfs4_vnops.c and /pnfs/pnfs-17-24/Pnfs4_vnops.c2 differ [root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/Pnfs4_vnops.c2 Layout unacquired
I don't need snoop, I need to start examining the odl code. And to be honest, I was expecting problems here.
I don't think the layout was written to disk. For now, we plaop it down in '/var':
[root@pnfs-17-24 ~]> cd /var/nfs/v4_state/layouts [root@pnfs-17-24 layouts]> ls -la total 6 drwxr-xr-x 3 daemon daemon 512 Jun 26 17:27 . drwxr-xr-x 5 daemon daemon 512 Jul 23 16:30 .. drwxr-xr-x 2 daemon daemon 512 Jun 26 17:27 2d90005 [root@pnfs-17-24 layouts]> cd 2d90005/ [root@pnfs-17-24 2d90005]> ls -la total 4 drwxr-xr-x 2 daemon daemon 512 Jun 26 17:27 . drwxr-xr-x 3 daemon daemon 512 Jun 26 17:27 ..
Found the problem!
#ifdef PERSIST_LO_ENABLED
And it wasn't enabled!
IF I enable it and try the experiment again:
[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/a13.c | wc -l
21433
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/a13.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 14
Layout [0]:
Layout obtained at: Sat Jul 25 17:49:59:853319 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
I should have gotten the layout from before the write. But it looks wrong. They are out of order, see:
[root@pnfs-17-21 ~]> cp nfs4_vnops.c /pnfs/pnfs-17-24/redo.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
Layout obtained at: Sat Jul 25 17:51:55:18886 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [1]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [2]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [3]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Reboot and we see:
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 14
Layout [0]:
Layout obtained at: Sat Jul 25 17:55:33:205567 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [1]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [2]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [3]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
So the good news is that odl works, the bad news is just not right.
We can actually see this in action. I wrote a simple program to write 32k of 'a', 32k of 'b', 32k of 'c', and 32k of 'd's to a file. I then copied that file over.
[root@pnfs-17-21 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw [root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw [root@pnfs-17-21 ~]> reboot ... [root@pnfs-17-21 ~]> mount -o vers=4 pnfs-17-24:/pnfs2/pnfs /pnfs/pnfs-17-24 [root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw Binary files 1234.32k.raw and /pnfs/pnfs-17-24/b1234.32k.raw differ
The first diff had to have gotten the same layout as before. Now when we reboot, we see a difference. And since the pattern is so simple, we can see what might be going on:
[root@pnfs-17-21 ~]> grep a /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
0
[root@pnfs-17-21 ~]> grep b /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
31
[root@pnfs-17-21 ~]> grep c /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
0
[root@pnfs-17-21 ~]> grep d /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
64
We need to snoop the filehandles from the create and then from the later read.
cre: fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000 odl: fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000 cre: fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000 odl: fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000 cre: fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000 odl: fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000 cre: fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000 odl: fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
And they match. The problem isn't in the filehandles we hand out.
I'm going to end it there for now...
That last bug was actually pretty easy to solve. It turned out to be more code which had never been tested before. If we look at the code in usr/src/uts/common/fs/nfs/dserv_server.c (which will be modified by the time you read this), we see:
711 mutex_enter(&inst->dmi_content_lock);
712 error = find_open_root_objset(inst, sid, &root_objset);
713 if (error) {
714 error = (error == ENOENT) ? EIO : error;
715 goto out;
716 }
717
718 error = find_open_mdsfs_objset(inst, dataset_id, root_objset,
719 &(dnd->dnd_objset));
720 if (error == 0 || error != ENOENT) {
721 if (error == 0)
722 dnd->dnd_flags |= DSERV_NNODE_FLAG_OBJSET;
723 goto out;
724 }
The EAGAIN in Hard day of debugging the hard stuff, but now have multiple datasets on the same DS was returned on 712 by find_open_root_objset(). The new panic was occurring down in the call to find_open_mdsfs_objset() on 718. Before my changes, find_open_root_objset() always returned the first pNFS dataset root. So if you had two, you only ever saw the one.
I decided to skip right to the code and it ended up being easy to spot in find_open_root_objset(), I started there because I hadn't changed find_open_mdsfs_objset() and it was known to work.
541 static int
542 find_open_root_objset(dserv_mds_instance_t *inst, mds_sid mds_sid,
543 open_root_objset_t **root_objset)
544 {
...
603 /*
604 * Find the root pNFS object set.
605 */
606 for (tmp_root = list_head(&inst->dmi_datasets); tmp_root != NULL;
607 tmp_root = list_next(&inst->dmi_datasets, tmp_root)) {
608 if (ds_guid.dg_zpool_guid ==
609 tmp_root->oro_ds_guid.dg_zpool_guid &&
610 ds_guid.dg_objset_guid ==
611 tmp_root->oro_ds_guid.dg_objset_guid) {
612 /*
613 * This is our root pNFS object set!
614 */
615 found_root_objset = 1;
616 break;
617 }
618 }
*root_objset is never set when it is found. So, when we use it on 718, it is garbage. The fix is a simple assignment. Hmm, I could replace tmp_root with *root_objset, which is what the original author seemed to think was going on here.
But in any event, here is the 'pee' in pNFS:
[root@pnfs-17-21 ~]> mount -o vers=4 pnfs-17-24:/pnfs2/pnfs /pnfs/pnfs-17-24
[root@pnfs-17-21 ~]> cp /etc/passwd /pnfs/pnfs-17-24/qwhoei
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/qwhoei
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 1
Layout [0]:
Layout obtained at: Sat Jul 25 02:20:00:343367 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
[root@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/qwhoei
-rw-r--r-- 1 root root 881 Jul 25 02:20 /pnfs/pnfs-17-24/qwhoei
Well, actually, there really isn't any parallel activity going on here. The file is 881 bytes and the default stripe size is 32k. So it all goes to stripe 0. But there is a lot going on in the background which was touched by my changes.
I could either start testing the kspe to get a smaller stripe size or I can use mkfile to get a file of 2*4*32k such that there are two writes to each stripe.
[root@pnfs-17-21 ~]> mkfile 256k chunky
[root@pnfs-17-21 ~]> ls -la chunky
-rw------T 1 root root 262144 Jul 25 02:34 chunky
[root@pnfs-17-21 ~]> cp chunky /pnfs/pnfs-17-24/pChunky
[root@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/pChunky
-rw------- 1 root root 262144 Jul 25 02:34 /pnfs/pnfs-17-24/pChunky
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/pChunky
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 8
Layout [0]:
Layout obtained at: Sat Jul 25 02:34:22:325669 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
But I have no idea if the file is identical. I'll test that later...
I can test that with using a large text file, say 440k:
[root@pnfs-17-21 ~]> cp nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/Pnfs4_vnops.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
Layout obtained at: Sat Jul 25 02:55:17:818063 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
[root@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/Pnfs4_vnops.c
-rw-r--r-- 1 root root 409540 Jul 25 02:55 /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c
So we have pNFS!
The good news:
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/bubble
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 0
Layout [0]:
Layout obtained at: Fri Jul 24 22:33:22:698731 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
As you can see, the first and second stripe go onto the same DS. How can we make sure that these are for different datasets though? Well, snoop should tell us!
NFS: Op = 50 (LAYOUTGET) NFS: Status = 0 (NFS4_OK) NFS: State ID hash = 0721 NFS: len = 12 val = 710000004306000000000000 NFS: State ID Sequence ID = 1 NFS: Layout [0]: NFS: Return layout on close = FALSE NFS: Layout offset = 0 NFS: Layout length = 18446744073709551615 NFS: Layout iomode = LAYOUTIOMODE4_RW NFS: Layout type = LAYOUT4_NFSV4_1_FILES NFS: Device ID hash = [03F9] NFS: (16) 000000000000000000000000C8000000 NFS: Util = 0x8001 NFS: stripe_unit = 512 NFS: UFLG_DENSE NFS: First stripe = 0 NFS: fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000 NFS: fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000 NFS: fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000 NFS: fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000 NFS:
fh[0] and fh[1] differ in the mds_sid portion of the file handle: '10D336969B13CB6C0E' vs '1086FE860F0E0B87D4'. And the 0x10 at the start is probably the length. Amazingly, the mds_sid is also the only difference with fh[2] and fh[3]. That is because the mds_sid is what tells the DS which dataset to act on. The combination of the mds_sid and the ip addresses in the device info are what lets the client send the correct data to the correct DS:
NFS: Op = 47 (GETDEVICEINFO) NFS: Status = 0 (NFS4_OK) NFS: Layout type = LAYOUT4_NFSV4_1_FILES NFS: Device address: NFS: stripe_indices: 0, 1, 2, 3 NFS: multipath_ds_list[0]: [0] tcp:10.1.233.192.183.161 NFS: multipath_ds_list[1]: [0] tcp:10.1.233.192.183.161 NFS: multipath_ds_list[2]: [0] tcp:10.1.233.193.183.161 NFS: multipath_ds_list[3]: [0] tcp:10.1.233.193.183.161 NFS:
Now, I said good news, which implies bad news. And here it is:
[thud@pnfs-17-21 ~]> cp /etc/passwd /pnfs/pnfs-17-24/bubble cp: /pnfs/pnfs-17-24/bubble: write: Permission denied [thud@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/bubble -rw-r--r-- 1 thud staff 0 Jul 24 22:33 /pnfs/pnfs-17-24/bubble
So the create is working, but the data is not appearing. I'm thinking an issue with the client talking to the DSes. And with a new filehandle, I wonder if it is not being decoded correctly?
Well, it is client to DS communication, as a snoop shows us:
NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Compound) NFS: Tag = pnfs write NFS: Status = 10012 (NFS4ERR_LOCKED) NFS: Number of results = 3 NFS: NFS: Op = 53 (SEQUENCE) NFS: Status = 0 (NFS4_OK) NFS: Session ID hash = [3716] NFS: (16) 00000000FFFFFFFFD700000064000000 NFS: Sequence ID = 1 NFS: Slot ID = 0 NFS: Highest slotid = 200 NFS: Target highest slotid = 200 NFS: status = 0 NFS: NFS: Op = 22 (PUTFH) NFS: Status = 0 (NFS4_OK) NFS: NFS: Op = 38 (WRITE) NFS: Status = 10012 (NFS4ERR_LOCKED) NFS:
I just fixed an issue in the encoding of the ds_filehandle, so perhaps I have it in reverse? And no, the decoding was okay. We need to look for NFS4ERR_LOCKED. So either locking code is getting to me or it is EAGAIN - which is bad to track.
Well, not so bad, we know it is seen during the WRITE.
Nothing was panning out, so I went to the webrev (http://cr.opensolaris.org/~tdh/perse/) and I'm suspicious of this code in usr/src/uts/common/fs/nfs/dserv_server.c :
578 /*
579 * Use the MDS SID (from the file handle) to find the real data
580 * server guid (zpool guid + id of the root pNFS object set).
581 */
582 for (sid_map = list_head(&inst->dmi_mds_sids); sid_map != NULL;
583 sid_map = list_next(&inst->dmi_mds_sids, sid_map)) {
584 if ((mds_sid.len == sid_map->msm_mds_storid.len) &&
585 (memcmp(mds_sid.val, sid_map->msm_mds_storid.val,
586 sid_map->msm_mds_storid.len) == 0)) {
587 found_mds_sid = 1;
588 ds_guid = sid_map->msm_ds_guid;
589 break;
590 }
591 }
592
593 /*
594 * If we have no record of the given MDS SID it may mean that
595 * we haven't been able to do the REPORTAVAIL for this particular
596 * resource. Therefore, just tell the client to try again later.
597 * In the future, we will attempt to ask the MDS for this information
598 * via DS_MAP_MDSSID.
599 */
600 if (found_mds_sid != 1)
601 return (EAGAIN);
Until my changeset, the mds_sid was always 0s. And that piece of code was surrounded by an '#ifdef 0'. Great, I have my culprit!
I added some debug statements to show what the DS was seeing as it looped across the sid_map:
Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sid_map is NOT NULL Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: fh[16] 312CE43624C45DB0000000000000001E Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sm[16] 768279474C9ACF441E00000000000000 Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: fh[16] 312CE43624C45DB0000000000000001E Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sm[16] B05DC42436E42C311E00000000000000 Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: no root objset found!
I'm very suspicious that the reply from the DS_REPORTAVAIL (MDS to DS) is xdr encoding the mds_sid and the one in the file handle is not being encoded. I think the fh mds_sid above is the same as the second sid_map one.
The root cause here was that the mds_sid used to be sent across as 0s. As such, when the MDS did an xdr encode and the DS did not do an xdr decode, it really didn't matter. And the mds_sid was actually ignored on the DS.
Since we don't encode the mds_sid when we send it in the ds_filehandle, I decided to not encode it in the DS_REPORTAVAIL result. As such, we will be able to look at network traces and see if they match in both places.
This analysis took several hours. I keep on getting blind-sided by thinking existing code works correctly. I looked down several side paths I didn't mention (like the MDS treating the mds_sid array like a pointer and not an array of mds_sids).
I can also prove that I've fixed the bug, as I have a new one on the DS, which is downstream of the old one:
find_open_root_objset: sid_map is NOT NULL find_open_root_objset: fh[16] D336969B13CB6C0E000000000000001E find_open_root_objset: sm[16] 86FE860F0E0B87D4000000000000001E panic[cpu1]/thread=ffffff01d7880000: BAD TRAP: type=e (#pf Page fault) rp=ffffff0007b3a0e0 addr=100428 occurred in module "genunix" due to an illegal access to a user address ... ffffff0007b3a1e0 genunix:list_head+10 () ffffff0007b3a230 nfssrv:find_open_mdsfs_objset+40 () ffffff0007b3a6e0 nfssrv:dserv_nnode_data_getobjset+158 () ffffff0007b3a750 nfssrv:dserv_nnode_io_prep+85 () ffffff0007b3a7a0 nfssrv:nnop_io_prep+41 () ffffff0007b3a960 nfssrv:mds_op_write+134 () ffffff0007b3a9b0 nfssrv:rfs41_op_dispatch+72 ()
But it is late and I don't know if I'll blog about finding out what this one does...
I'm in the process of finishing off some coding I started before the BAT last month. It started off as a simple observation, my DSes with two datasets were only using the first dataset. It was time to rewrite the device and layouts to use the mds_sids. That is an unique value which identifies the ZFS dataset on a zpool. It should be unique in allowing a mapping from dataset name: 'ds1:ppool/pnfs' for humans and the value used by computers.
Well, doing all that also means it is time for the kspe (kernel simple policy engine) to be integrated. I had all of the above coded and the prototype for the kspe, but had to go to BAT to test other stuff. Well, this week I picked things back up and finished the code off. I'm still not completely happy with it, especially in where the kspe lives, i.e., in the 'nfs' or 'nfssrv' module.
I got my MDS up and running, but I was hitting issues with the above. But that only caused the modules not to load. I was able to use scp to get new modules over. And when I fixed those issues, I ran into the next major one. I had added another index to the layout nfs4 database table and I'd only told the table it had one.
I got in a nasty reboot loop and because of fastboot I couldn't get to to the grub menu. I ended up doing a power cycle from the LOM and that caused the grub menu to come up. I could then boot into single user mode and do:
# echo "setprop boot-file 'kmdb'" >> /a/boot/solaris/bootenv.rc # reboot
That ended up fixing the fastboot issue. Now what I would do in a panic situation was get into single user mode and issue:
# cd / # rm kernel/misc/amd64/nfssrv kernel/misc/nfssrv # reboot
That would bring me up safely in multi-user mode and then I could scp the new kernel modules over.
But I forgot that the DSes use the same initialization routines as the MDSes. If I hadn't already made all of the boxes configured to drop into kmdb on a panic, I would have been ticked.
Hey, I have a DS up and running with one dataset (gotta start small) and a down MDS:
[root@pnfs-17-24 ~]> panic[cpu1]/thread=ffffff01d9ee70c0: assertion failed: e->refcnt > 1, file: ../../common/fs/nfs/nfs4_db.c, line: 133 ffffff00081aa980 genunix:assfail+7e () ffffff00081aa9b0 nfssrv:rfs4_dbe_rele+67 () ffffff00081aaa90 nfssrv:ds_reportavail+4d8 () ffffff00081aab40 nfssrv:nfs_ds_cp_dispatch+9e () ffffff00081aac30 rpcmod:svc_getreq+20d () ffffff00081aaca0 rpcmod:svc_run+197 () ffffff00081aacd0 rpcmod:svc_do_run+81 () ffffff00081abeb0 nfs:nfssys+a0e () ffffff00081abf00 unix:brand_sys_syscall32+292 ()
I'm off to solve it. See ya' later!
I threw together a quick and dirty overview of pNFS, OpenSolaris, and the work I do. It doesn't go into detail, but it is another reference on how to get an OpenSolaris pNFS community up and running.
I need to revisit it with vbox examples. I ran out of time, which I regret.
The crowd was good - I ended up slanting the slides towards a crowd that would be willing to play with beta software. In retrospect, I wish I had drawn out the crowd more to see how they might be interested in deploying pNFS.
One of my themes in the presentation, i.e., I'm not sure it would come out in just reading the slides, is that as developers, we are focused on delivering feature functionality right now and not on management. I'm not talking about data management, we have a great story there with spe, but in pNFS community management. I.e., how does an admin manage the set of machines.
The BUI (and scripting in the CLI) on the Unified Storage 7000 really drives home this point. Taking that approach and extending it such that the BUI on the MDS also configures the DSes would be a major bonus in our implementation.
Oh yeah, the slides are available at okcosug-pnfs-spe.pdf
I just did a branch merge of the nfs41-gate with the onnv_117 tag of the onnv-gate.
And the closed tree had some changes.
I've added a tag of closedv11 to the nfs41-gate and you can download new versions of the closed-bins at http://www.opensolaris.org/os/project/nfsv41/downloads/
There are two ways you can go about installing pNFS on your machines:
We decided to skip branch merges to Nevada builds 113-116. We had issues with the hardware in the lab and vboxes. Plus we wanted a stable environment for the BakeAThon. I'm having a hard time merging with build 117. One of the issues was a panic on the client (which I could patch) due to getting NFS4ERR_BADSESSION from a DS. I was able to see the DS replying on the exchange_id request in a packet trace. There are only a handful of places where the server returns this error code, so I was pretty sure I could reuse a Dtrace script I had laying around:
:nfssrv:mds_lease_chk:return
/args[1] != 0/
{
printf("rc1 = %d", args[1]);
}
Until I came along this code snippet:
*cs->statusp = resp->dsr_status =
NFS4ERR_UNSAFE_COMPOUND;
goto final;
...
final:
DTRACE_NFSV4_2(op__destroy__session__done,
struct compound_state *, cs,
DESTROY_SESSION4res *, resp);
}
Just great, no return value. How could I catch it?
Well, the answer is staring us right in the face, we can use op__destroy__session__done. I had to snoop around for examples in a colleague's home directory (which is kinda why I blog about this stuff, it is easier to find), but ended up with this:
:::op-bind-conn-to-session-done
{
bresp = (BIND_CONN_TO_SESSION4res *)arg1;
printf("rc1 = %d", bresp->bctsr_status);
}
:::op-destroy-session-done
{
dresp = (DESTROY_SESSION4res *)arg1;
printf("rc1 = %d", dresp->dsr_status);
}
Note that I wasn't too worried about being specific as to which module you found these calls in, I was betting on them being very unique.
Oh, and I was able to narrow down where the NFS4ERR_BADSESSION was coming from. And then I had to add debug statements to find out why. :-< I bet I might have been able to do it still with Dtrace. :->
So we had our first BakeAThon hosted by NetApp. I thought it went very well and it enabled the NetApp folks to see how much work goes into hosting the event.
I can't go into details that involves other vendors, because of a NDA, but I can say that we did a lot of testing on new features in both the client and the server. I'm not that familiar with the client side changes, I think we had the compound constructor changes by Bob Mastors and Karen Rochford was testing a rewrite of the layout handling code.
I was more excited about the changes on the server side:
I also brought a major rewrite of the layout handling code (allowing for multiple layouts, etc), device handling (allowing multiple datasets on the same DS) and the integration of the kspe (kernel simple policy engine), but I probably needed another week of development/testing on that code.
The other major development is that we are finally making the switch to using Virtual Boxes for our testing needs. We typically would want 3-4 machines per developer. We can refine this by having a pair of public communities (MDS plus 2 DSes). One as the stable system and the other as a sanity test rig. And then each client developer gets their own machine for a client and each server developer gets a community.
Well, with Virtual Boxes working on a wide range of host OSes, we can have the communities hosted on one or two beefier machines, perhaps sharing room with the build server, and each developer can bring up what they need on their laptops.
I'm going to be giving a presentation on July 7th over pNFS at the Oklahoma City OpenSolaris User Group - OKCOSUG. I'd like to give a live demo using virtual boxes, but I'm not making any promises...
I've hit a nasty little bug which requires an orderly shutdown of the DS as the client is pounding it with traffic:
panic[cpu1]/thread=ffffff01d16d38e0: rw_destroy: lock still active, lp=ffffff01e60c6e08 wwwh=10 thread=ffffff01d16d38e0 ffffff00090a0b50 unix:rw_panic+6f () ffffff00090a0b70 unix:rw_destroy+33 () ffffff00090a0ba0 nfssrv:dserv_mds_instance_destroy+6d () ffffff00090a0bf0 genunix:kmem_cache_free_debug+29c () ffffff00090a0c50 genunix:kmem_cache_free+90 () ffffff00090a0c90 nfssrv:dserv_mds_instance_teardown+2b8 () ffffff00090a0cd0 unix:stubs_common_code+51 () ffffff00090a1eb0 nfs:nfssys+73 () ffffff00090a1f00 unix:brand_sys_syscall32+292 ()
Lock still active, I take that to mean that we are trying to destroy it as another thread has it held.
If we look at dserv_mds_instance_destroy, it only helps as far as getting us in the right source file. Strike that, as there is only one read/write lock, it also tells us which one. The issue is this snippet in dserv_instance_enter:
220 if (inst == NULL)
221 return (ESRCH);
222
223 rw_enter(&inst->dmi_inst_lock, lock_type);
224 /*
225 * dmi_teardown_in_progress is only set in one place,
226 * dserv_mds_teardown_instance() and when doing so the dmi_inst_lock
227 * is held as a WRITER, therefore, it is safe to check it without
228 * holding the dmi_content_lock.
229 */
230 if (inst->dmi_teardown_in_progress == B_TRUE) {
231 rw_exit(&inst->dmi_inst_lock);
232 if (lock_type == RW_READER)
233 return (EIO);
234 else if (lock_type == RW_WRITER)
235 /*
236 * This will protect from receiving multiple teardown
237 * commands happening at once.
238 */
239 return (EBUSY);
240 }
I thought the issue was that if we got past here, we had multiple references held to the lock by threads processing compounds. I.e., I've been dealing with refcounts and all problems look like refcounts will solve them. I had the refcount code halfway implemented and was looking to send a wakeup to try and get the writer going when I realized what the code was really trying to do.
If the instance had been removed from memory totally, then we would bail out on the first check at line 220. If it was in the process of tearing down, then the lock would be held by the WRITER, which would mean that all READERs had exited. I.e., the refcount of READERs had to be 0. No use tracking something that does not matter.
The problem had to be in that window between starting to tear down and removing the instance from the avl tree (look at the code in usr/src/uts/common/fs/nfs/dserv_mds.c for more context). And the most obvious thing is that a READER which comes along to check has to grab the lock to check. So, this code solves that:
216 bool_t grab_lock = FALSE;
...
223 if (inst == NULL)
224 return (ESRCH);
225
226 /*
227 * If dmi_teardown_in_progress is set, then we can't grab the
228 * lock. I.e., we are in the midst of either tearing it
229 * down or we have torn it down.
230 */
231 retry_with_lock:
232 if (grab_lock) {
233 /*
234 * Now we have to grab the lock and make sure that it is not
235 * true!
236 */
237 rw_enter(&inst->dmi_inst_lock, lock_type);
238 }
239
240 /*
241 * dmi_teardown_in_progress is only set in one place,
242 * dserv_mds_teardown_instance() and when doing so the dmi_inst_lock
243 * is held as a WRITER, therefore, it is safe to check it without
244 * holding the dmi_content_lock.
245 */
246 if (inst->dmi_teardown_in_progress == B_TRUE) {
247 if (grab_lock)
248 rw_exit(&inst->dmi_inst_lock);
249 if (lock_type == RW_READER)
250 return (EIO);
251 else if (lock_type == RW_WRITER)
252 /*
253 * This will protect from receiving multiple teardown
254 * commands happening at once.
255 */
256 return (EBUSY);
257 }
258
259 if (!grab_lock) {
260 grab_lock = TRUE;
261 goto retry_with_lock;
262 }
We will try to check inst->dmi_teardown_in_progress twice. The first time we will do it without the lock. The only way it can be 1 is if a tear down is in progress. We aren't going to keep the lock in that case, we are going to exit right away. But if it is 0 here, we have no way of knowing whether another thread just modified it. So in that case we have to grab the lock and check again.
I thought this would solve the problem, but I ended up with the same panic. The new issue can be found in rwlock(9F):
The rw_enter() function acquires the lock, and blocks if necessary. If enter_type is RW_READER, the caller blocks if there is a writer or a thread attempting to enter for writ- ing. If enter_type is RW_WRITER, the caller blocks if any thread holds the lock.
So say a WRITER comes along, it will wait until all of the READERS drain. At that point, we know the refcnt is 0. The problem is that if another READER comes along, it will now block until the WRITER is done. It will keep the lock active:
...
dmi_inst_lock = {
_opaque = [ 0x10 ]
}
...
[1]> ffffff01e60c6dc8::print dserv_mds_instance_t dmi_inst_lock|::rwlock
ADDR OWNER/COUNT FLAGS WAITERS
ffffff01e60c6e08 READERS=2 B000
What we have to do as a READER is try to grab the lock. If we have to block, then in this case only, we know tear-down has started!
[th199096@aus-build-x86 nfs]> grep dserv_instance_enter *c dserv_mds.c:dserv_instance_enter(krw_t lock_type, boolean_t create_instance, dserv_mds.c: * This function frees any of the locks taken by dserv_instance_enter dserv_mds.c: error = dserv_instance_enter(RW_WRITER, B_FALSE, &inst); dserv_mds.c: error = dserv_instance_enter(RW_READER, B_TRUE, &inst); dserv_mds.c: error = dserv_instance_enter(RW_READER, B_TRUE, &inst); dserv_mds.c: error = dserv_instance_enter(RW_READER, B_TRUE, &inst); dserv_mds.c: error = dserv_instance_enter(RW_READER, B_FALSE, &inst); dserv_mds.c: error = dserv_instance_enter(RW_READER, B_FALSE, &inst); dserv_server.c: error = dserv_instance_enter(RW_READER, B_FALSE, &inst);
So we could do this:
238 if (rw_tryenter(&inst->dmi_inst_lock, lock_type) == 0) {
239 if (lock_type == RW_READER)
240 return (EIO);
241 else if (lock_type == RW_WRITER)
242 rw_enter(&inst->dmi_inst_lock, lock_type);
243 }
But I really hate doing this as it makes an assumption about there only ever being one reason to grab as a WRITER and no way to programmatically enforce it. Would a comment suffice?
What we really want to do is sleep and when awoken, retry to grab the lock. We would also have to get the instance pointer fresh.
I think a quick comment and this change will accomplish all I want:
238 if (rw_tryenter(&inst->dmi_inst_lock, lock_type) == 0) {
239 if (lock_type == RW_READER)
240 return (EAGAIN);
241 else if (lock_type == RW_WRITER)
242 rw_enter(&inst->dmi_inst_lock, lock_type);
243 }
I.e., let the caller try again if it wants to!
So code, build, test!