« December 2009
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
  
       
Today
XML

Neat blogs

Navigation

Editing

Powered by Roller Weblogger.

statcounter.com

clustrmaps.com

Locations of visitors to this page

technorati.com

20090728 Tuesday July 28, 2009
Showing relationships between files, layouts, and mpds

I've got two clients doing some simple file copies, file removes, and directory removes. After a while, I stop both test scripts and unmount the mds. We can see the relationship between files, layouts, and device tables (mpd):

[root@pnfs-17-24 ~]> ./rlays.sh | grep refcnt
+ mdb -k 
+ echo ::walk mds_Layout_entry_cache|::print  struct rfs4_dbe 
    refcnt = 0x50
    refcnt = 0x99
    refcnt = 0x49
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt | wc -l
+ mdb -k 
+ echo ::walk mds_File_entry_cache|::print  struct rfs4_dbe 
     303
[root@pnfs-17-24 ~]> ./rmpd.sh | grep refcnt
+ mdb -k 
+ echo ::walk mds_mpd_entry_cache|::print  struct rfs4_dbe  
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2

I just fixed a bug where a layout request invalidated the old layout. So we can see here that there are over 300 files with state which are still active, there are 3 corresponding layouts, each with a different usage, and 3 mpds, each with a hold from their respective layout. The 3 layouts correspond to the 3 policies in effect in the spe. There is no round robin scheduling going on, which might see more layouts in use.

I haven't fixed it yet, and this example doesn't show it, but each layout create is going to cause a corresponding mpd create. Which is okay for now, the only difference we could support would be a different stripe unit size.

And we can see that the files have been harvested, but the layouts have not:

[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt | wc -l
+ mdb -k 
+ echo ::walk mds_File_entry_cache|::print  struct rfs4_dbe 
       0
[root@pnfs-17-24 ~]> ./rlays.sh | grep refcnt
+ mdb -k 
+ echo ::walk mds_Layout_entry_cache|::print  struct rfs4_dbe 
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1
[root@pnfs-17-24 ~]> ./rmpd.sh | grep refcnt
+ mdb -k 
+ echo ::walk mds_mpd_entry_cache|::print  struct rfs4_dbe  
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2

The layouts are ripe for plucking, with a refcnt of 1, it means that they are in the table, but no one else references them. Let's see if we can quickly use them again:

[root@pnfs-17-24 ~]> ./rlays.sh | grep refcnt
+ mdb -k 
+ echo ::walk mds_Layout_entry_cache|::print  struct rfs4_dbe 
    refcnt = 0x5
    refcnt = 0xb
    refcnt = 0x4
[root@pnfs-17-24 ~]> ./rmpd.sh | grep refcnt
+ mdb -k 
+ echo ::walk mds_mpd_entry_cache|::print  struct rfs4_dbe  
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2

Now I don't have proof here that these are the same entries in the db, but what I wanted to point out is that I never invalidate entries in either the layout or mpd tables. Instead, I rely on the fact that they can not go away until all external references are gone. Based on timing, I'd say we did reuse the entries, otherwise I would expect to see 6 mpds and not 3 of them. I.e., they do not get reused and if there were new layout entries, we'd expect to see 6 mpds.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily
Tracking down a nasty refcnt bug with rfs4_file_t, or am I?

I'm interested in that layout and mpd entries get harvested as they become stale. Because of the prior assumption that there was only ever 1 of each, it never was an issue.

If we recall our mappings, each mpd can belong to several layouts, each layout can belong to several files, etc. So when I discovered that mpds were sticking in memory with a refcnt of 2, I checked the layouts, saw that they had a refcnt of 2, and then checked the files:

[root@pnfs-17-24 ~]> more rlays.sh 
#!/bin/sh -x
echo '::walk mds_Layout_entry_cache|::print  struct rfs4_dbe' | mdb -k
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k 
+ echo ::walk mds_File_entry_cache|::print  struct rfs4_dbe 
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2
    refcnt = 0x2

Now, some of these would go away, but in the end, several would stick in memory. I actually found 2 nasty rfs4_file_t rele bugs and fixed them, but the issue stayed.

I thought my last fix did the trick, I turned my attention away for a bit as I worked on my MacBook Pro, and the file references went away. I rebooted and tried again. No such luck, that was when I got the above.

I unmounted the mds from my two clients and went away to study the code, trying to find out how an mds_op_close() did a file rele which would lead to the layout rele.

I was going to reboot, without any luck, and try the case where I created the files and did nothing else - the remove code is suspect to me still. Well, before I did that, I checked the file refcnts:

[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k+  
echo ::walk mds_File_entry_cache|::print  struct rfs4_dbe 
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1
    refcnt = 0x1

My guess is that somehow the clients are causing state to be clung to on the server. It is only on the unmount that a final rele takes place. It has been some time since I did that last check:

[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k + echo ::walk mds_File_entry_cache|::print  struct rfs4_dbe 

[root@pnfs-17-24 ~]> ./rlays.sh 
+ mdb -k 
+ echo ::walk mds_Layout_entry_cache|::print  struct rfs4_dbe 
[root@pnfs-17-24 ~]> ./rmpd.sh 
+ mdb -k 
+ echo ::walk mds_mpd_entry_cache|::print  struct rfs4_dbe  
{
    lock = [
        {
            _opaque = [ 0 ]
        },
    ]
    refcnt = 0x1

Whoops, I expected the mpds to be out of memory as well and after a bit of patience, they are. :->

[root@pnfs-17-24 ~]> ./rmpd.sh | grep refcnt
+ mdb -k 
+ echo ::walk mds_mpd_entry_cache|::print  struct rfs4_dbe  

A quick test of creating 4 files sees:

[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k 
+ echo ::walk mds_File_entry_cache|::print  struct rfs4_dbe 
    refcnt = 0x5
    refcnt = 0x5
    refcnt = 0x5
    refcnt = 0x5

If we then unmount, the snoop is mainly the following:

[root@pnfs-4-02 ~]> grep RETURN\) xxx
NFS:  Op = 8 (DELEGRETURN)
NFS:  Op = 8 (DELEGRETURN)
NFS:  Op = 51 (LAYOUTRETURN)
NFS:  Op = 51 (LAYOUTRETURN)
NFS:  Op = 8 (DELEGRETURN)
NFS:  Op = 8 (DELEGRETURN)
NFS:  Op = 51 (LAYOUTRETURN)
NFS:  Op = 51 (LAYOUTRETURN)
NFS:  Op = 8 (DELEGRETURN)
NFS:  Op = 8 (DELEGRETURN)
NFS:  Op = 51 (LAYOUTRETURN)
NFS:  Op = 51 (LAYOUTRETURN)
NFS:  Op = 8 (DELEGRETURN)
NFS:  Op = 8 (DELEGRETURN)
NFS:  Op = 51 (LAYOUTRETURN)
NFS:  Op = 51 (LAYOUTRETURN)

And indeed, we see the refcnts get decremented by two for each file:

[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k 
+ echo ::walk mds_File_entry_cache|::print  struct rfs4_dbe 
    refcnt = 0x3
    refcnt = 0x3
    refcnt = 0x3
    refcnt = 0x3

And after a while:

[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k 
+ echo ::walk mds_File_entry_cache|::print  struct rfs4_dbe 

I'm actually happy that this strange behavior occurred. I wouldn't have taken the time to painstakingly poured over every file rele and I wouldn't have thus found the two bugs. The first was in error handling in the remove case and the second was in file delegation recalls. The first was probably hit or miss to being found. The second was a timebomb.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090727 Monday July 27, 2009
How spe interacts with file opens

The spe is a simple policy engine, which dictates the stripe size, the block size, and the set of devices used during file creation - these three items form the layout. The impact of that statement is subtle, but it sets some expectations. The spe code only fires on a file creation, so if a file already exists, then we read in the on disk layout (odl) to get the layout.

So, if files were created under a different set of policies, then there will be no problem accessing the files under the new set of policies.

Here is a file we created before we even enabled spe:

[root@pnfs-17-21 ~]> cp nfs4_vnops.c  /pnfs/pnfs-17-24/redo.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
        Layout obtained at: Sat Jul 25 17:51:55:18886 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [1]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [2]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [3]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK

I've rebooted the client and mds many times, I've added spe policies, but I can check to see that we get the exact same layout. (Note, I've also changed the mount point to be the root of the mds, so expect the path to be different!)

[root@pnfs-4-02 ~]> touch /pnfs/pnfs-17-24/pnfs2/pnfs/redo.c
[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs2/pnfs/redo.c
Layout unacquired
[root@pnfs-4-02 ~]> cat /pnfs/pnfs-17-24/pnfs2/pnfs/redo.c > /dev/null
[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs2/pnfs/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 1
Layout [0]:
        Layout obtained at: Mon Jul 27 11:38:40:673328 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [1]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [2]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [3]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK

One thing to note right off the bat is that 'touch' does not cause the file layout to be loaded. That is simple, the 'touch' command is a meta-data operation and can be serviced on the mds. It does not need to open the file. I'd have to look at the snoop output to be sure, but we look like we are getting the same layout.

Now, if I do the same copy as before, we see:

[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs2/pnfs/redome.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 116
Layout [0]:
        Layout obtained at: Mon Jul 27 11:43:30:691639 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 3968
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [3]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK

We just happen to get a stripe count of 4, but notice that the stripe unit is much smaller. We can tell that the extension of '.c' caused the first policy to fire:

[root@pnfs-17-24 ~]> more /etc/policies.spe 
1000, 4, 4k, pool17, ext == c
2000, 8, 1k, pool17:pool4, path == /pnfs1
3000, 3, 8k, pool4:pool17, path == /pnfs2

Okay, what if we call it redome.h? That should cause the last policy to match. Note that the test of the path attribute is based on the path of the mds and not the client. Also, if we have a mount point under the root of the mds, we still get the complete path when we do the check.

[root@pnfs-4-02 ~]> cp nfs4_vnops.c /pnfs/pnfs-17-24/pnfs2/pnfs/redome.h
[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs2/pnfs/redome.h
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 64
Layout [0]:
        Layout obtained at: Mon Jul 27 11:46:20:549943 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 3, stripe unit: 8000
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK

And it does! We can see it in both the stripe count and stripe unit.

And these files will have those same layouts as long as they live.

Well, I need to go figure out some more unit tests to break my code...


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily
More on getting the kspe code to generate an array of mds_sid

Okay, I've forced the ds_dataset_name to be present at creation. And I'm getting pretty hashes. I'm still not generating a mds_sid array.

When we store 'pnfs-4-01:pnfs1/d10':

Jul 27 10:59:45 pnfs-17-24 nfssrv: utf8_hash of |pnfs-4-01:pnfs1/d10|[20] is 55811466

And when we try to search for it:

Jul 27 11:04:51 pnfs-17-24 nfssrv: utf8_hash of |pnfs-4-01:pnfs1/ds10|[21] is 111623246

The lengths are different, which will cause a bit shift. And we see that one is 'd10' and the other is 'ds10'. Well, I'll change '/etc/npools.spe' as I don't have renaming a dataset name <-> mds_sid working yet.

[root@pnfs-4-01 ~]> zfs rename pnfs1/d10 pnfs1/ds10
cannot rename 'pnfs1/d10': dataset is busy

And there could be other problems.

But anyway, fix the dataset name, and we have spe, I repeat, we have spe!

Here we can see a file creation without any policy firing:

[root@pnfs-4-02 ~]> mount -o vers=4 pnfs-17-24:/ /pnfs/pnfs-17-24
[root@pnfs-4-02 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/pnfs1/nfs41/spek4.txt
[root@pnfs-4-02 ~]> nfsstat -l !$
nfsstat -l /pnfs/pnfs-17-24/pnfs1/nfs41/spek4.txt
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 4
Layout [0]:
        Layout obtained at: Mon Jul 27 11:04:51:725322 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 10, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
        Stripe [3]:
                tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
        Stripe [4]:
                tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
        Stripe [5]:
                tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
        Stripe [6]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [7]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [8]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [9]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK

And here we can see the same file copy resulting in not only a different number of stripes, but to a different dataset order!

[root@pnfs-4-02 ~]> umount /pnfs/pnfs-17-24
[root@pnfs-4-02 ~]> mount -o vers=4 pnfs-17-24:/ /pnfs/pnfs-17-24
[root@pnfs-4-02 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/pnfs1/nfs41/spek5.txt
[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs1/nfs41/spek5.txt
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 140
Layout [0]:
        Layout obtained at: Mon Jul 27 11:27:06:306959 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 8, stripe unit: 960
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [3]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [4]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [5]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [6]:
                tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
        Stripe [7]:
                tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK

Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily
First real test of kspe!

So, I started off a test run with the kspe. It allows me to decide which datasets are going to be used in a layout. (see my slides from Oklahoma City OpenSolaris User Group (OKCOSUG) presentation). I set up a simple set of npools and policies:

[root@pnfs-17-24 /etc]> more npools.spe 
pool17 pnfs-17-22:pnfs1/ds1 pnfs-17-23:pnfs1/ds3 pnfs-17-22:pnfs2/ds2 pnfs-17-23
:pnfs2/ds4 pnfs-17-23:pnfs1/ds7 pnfs-17-22:pnfs1/ds8
pool4 pnfs-4-01:pnfs1/ds4 pnfs-4-01:pnfs2/ds5 pnfs-4-01:pnfs2/d9 pnfs-4-01:pnfs1
/ds10
[root@pnfs-17-24 /etc]> more policies.spe 
1000, 4, 4k, pool17, ext == c
2000, 8, 1k, pool17:pool4, path == /pnfs1
3000, 3, 8k, pool4:pool17, path == /pnfs2

Note that a dataset identifier is a combination of host name and the zfs filesystem. I wrote the code and I still struggle with this is not a path name, it is a zfs name. I.e., it is not ':/' as a connector. 'pnfs2/ds2' is the zfs name and not '/pnfs2/ds2'!

And struggled a bit to load them, until I found my kspe implementation notes.

And right off the bat, I see my policies are not matching and I've got a core!

[root@pnfs-4-02 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/pnfs1/nfs41/spe_1234.32k.raw.txt

yields

[root@pnfs-17-24 ~]> rc = 0, eval = 0, id = 1000
rc = 0, eval = 1, id = 2000
WARNING: spe: 1000 8 

panic[cpu1]/thread=ffffff01d74bf4e0: BAD TRAP: type=e (#pf Page fault) rp=ffffff0008162d70 addr=0 occurred in module "nfssrv" due to a NULL pointer dereference
...
[1]> $c
kmdb_enter+0xb()
...
nfssrv`mds_layout_hash+0x19(ffffff0008162fe0)
...
[1]> ffffff0008162fe0::print layout_core_t
{
    lc_stripe_unit = 0x3e8
    lc_stripe_count = 0x8
    lc_mds_sids = 0
}

Which tells me that the kspe code did not generate a mds_sid array. Again, no problem, I just wrote that code last week and this is the first test of it!

So the bug was a stupid error:

                        return (mds_sids ? 0 : ENOENT);
versus
                        return (*mds_sids ? 0 : ENOENT);

Success was being returned with an empty array of mds_sids. There is still a bug, we should be finding matches, but at least it is not so nasty.

The bug is that these comparisons are failing:

Jul 27 01:37:45 pnfs-17-24 nfssrv: Comparing policy npool |pool17| at 7 to global |pool17| at 7
Jul 27 01:37:45 pnfs-17-24 nfssrv: Comparing policy npool |pool17| at 7 to global |pool4| at 6
Jul 27 01:37:45 pnfs-17-24 nfssrv: Comparing policy npool |pool4| at 6 to global |pool17| at 7
Jul 27 01:37:45 pnfs-17-24 nfssrv: Comparing policy npool |pool4| at 6 to global |pool4| at 6
Jul 27 01:37:45 pnfs-17-24 nfssrv: spe_map_npools_to_mds_sids: No matching npools!

The first and last should match. Ah, they do (as shown by additional debug logic), which shows me barking up a wrong tree!

D'oh! I found it! If we look at this code:

spe_map_npools_to_mds_sids(kspe_state_t *kspe, spe_policy *sp,
...
        spe_npool       *sn;
        spe_npool       *np;

        /*
         * For each npool in the policy, find it in the
         * list of npools, and start assigning datasets.
         */
        for (sn = sp->sp_npools; sn != NULL; sn = sn->next) {
                for (np = kspe->ks_npools; np; np = np->next) {
                        cmp = utf8_compare(&np->sn_name, &sn->sn_name);
                        if (cmp == 0) {
                                /*
                                 * Now we fill in entries in the *mds_sids
                                 * array.
                                 */
                                for (ss = sn->sn_dses; ss; ss = ss->next) {

We see I was lazy in assuming that sn and np were the same thing. Note that they are the same type of object, but the sn points to the npools in the policy and np points to the npools in the global list. The point is that a npool can be in multiple policies. So rather than store the list of datasets in the policies (which is a nightmare for updating) or pointers to npools in the policies (which sounds good and I can't remember why not!) we store the datasets in the global list.

So instead of searching a list of datasets in the global list, we search an empty list in the policy. :->

And I'm at the next bug:

Jul 27 02:11:04 pnfs-17-24 nfssrv: mds_ds_path_to_mds_sid returned an error!

We're not finding a match as we search the ds_guid_info database. When I had a search issue a couple nights ago, it turned out to be the hash function. If we look at what we have:

        instp->ds_guid_info_dataset_name_idx =
            rfs4_index_create(instp->ds_guid_info_tab,
            "DS_guid-dataset-name-idx", mds_str_hash,
            ds_guid_info_dataset_name_compare, ds_guid_info_dataset_name_mkkey,
            FALSE);

We are using mds_str_hash on an utf8_string - which is a length and a string. I think that will be problematic. I'm not too happy with the hash functions here in general, but for now all I care about is that they are consistent.

They aren't consistent in this case and it is getting late...

As it boots:

Jul 27 03:33:41 pnfs-17-24 nfssrv: utf8_hash of empty str

As we search:

Jul 27 03:34:35 pnfs-17-24 nfssrv: utf8_hash of |pnfs-17-22:pnfs1/ds1|[21] is 111612431

We ought to see a non-NULL addition and we ought to see 10 messages, not 1.

Bzzt! That is an expectation and not code. The debug statements are correct. If we look at the entry create routine, which is called just before the hash function, we see:

static bool_t
ds_guid_info_create(rfs4_entry_t u_entry, void *arg)
{
...
        pgi->ds_dataset_name.utf8string_val = NULL;
        pgi->ds_dataset_name.utf8string_len = 0;

We don't have our hands on that info -- or do we?

We do - I can fix this - but not now. It will wait until tomorrow, because I've got to sleep...


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090726 Sunday July 26, 2009
File contents not matching

So after a reboot, a diff of a file copied to the server is showing problems. There are several triage steps to take:

  1. See if this happens with the old server - this will isolate whether or not it is an existing bug.
  2. See if it happens across an unmount - this will isolate whether the reboot is doing something.
  3. See if it happens from another client - this will also isolate whether it is the reboot or not.
  4. Try it such that one write goes to each DS and not to each datatset - which will tell us if the writes are wrong, i.e., the FH decode is not working correctly.

I can show the problem thusly:

[root@pnfs-17-21 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw
[root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw | wc -l
       0

How do I know that the client is even getting the bits back from the server. If they are still in local buffers, this may not be happening. So, lets try from another client:

[root@pnfs-4-02 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw | wc -l
     132

So it isn't the reboot and it probably is not the ODL.

Okay, I just configured a stock community and repeated the same test and there was no loss of data. Clearly either my changes or changes I failed to do have caused the problem. I.e., either I caused a bug or I exposed an existing bug - one that the new feature set enabled. Either way, I have to fix it before I can integrate my changes.

I'm beginning to think it is in code which has yet to be implemented:

[thud@ultralord nfs]> grep dnk_sid *
dserv_server.c:	dskey.dnk_sid = &fhp->fh.v1.mds_sid;
[thud@ultralord nfs]> grep dnk_fid *
dserv_server.c:	CRC32(rc, key->dnk_fid->val, key->dnk_fid->len,
dserv_server.c:	NFS_AVL_COMPARE(a->dnk_fid->len, b->dnk_fid->len);
dserv_server.c:	rc = memcmp(a->dnk_fid->val, b->dnk_fid->val,
dserv_server.c:	    a->dnk_fid->len);
dserv_server.c:	data->dnd_fid = key->dnk_fid;
dserv_server.c:	dskey.dnk_fid = &fhp->fh.v1.mds_fid;
dserv_server.c:	key->dnk_fid = &key->dnk_real_fid;

I enabled the setting of 'dnk_sid', but no one uses it. I bet what is happening is that both 32k chunks of the files are being written on the same dataset...

With some debug printfs, we can see that the DSes are picking different object sets to save the data:

Jul 26 21:28:46 pnfs-17-22 nfssrv: Accessing pnfs2/ds2/EF25FBCF08FA1BD7 for D336969B13CB6C0E000000000000001E
Jul 26 21:28:46 pnfs-17-22 nfssrv: Accessing pnfs1/ds1/EF25FBCF08FA1BD7 for 86FE860F0E0B87D4000000000000001E

So I implemented the code for 'dnk_sid' and now I'm seeing:

[root@pnfs-4-02 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/slow | wc -l
       0

It also works now across a reboot.

The bad thing here is that I'm making it up on the fly, I don't understand the nnode code. I'm going to have to get a code review just on that chunk.

Actually, strike that. I know there was a bug, missing code to be exact. I know my fix works, and I understand the changes I made. I don't understand how those changes work in the bigger picture of the nnode code. Oh, I understand that when we tried to access the two different datasets on the same DS, we would sometimes get one instead of the other. I'm still going to talk with the author, I want to make sure my fixes are optimal.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily
Testing On Disk Layouts

Last night, I did the following test of file creation:

[root@pnfs-17-21 ~]>  cp nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/Pnfs4_vnops.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
        Layout obtained at: Sat Jul 25 02:55:17:818063 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [3]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
[root@pnfs-17-21 ~]> ls -la  /pnfs/pnfs-17-24/Pnfs4_vnops.c
-rw-r--r--   1 root     root      409540 Jul 25 02:55 /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c

So we have pNFS!

I then went to bed, knowing that the next set of testing to do was what we call on disk layouts or odl. When we create a file, we generate a layout based on all of the currently reporting datasets (ignore kspe, which I will soon enable and start testing). When we generate the create layout, we need to stuff it somewhere such that a client which opens the file either for appending or reading can get at the contents. So we store the layout on disk, hence odl.

The system we had in place made assumptions like all devices would be there, they would report in the same order, etc. Well, with using the mds_sids as keys, I've changed that.

To test things, I've rebooted the client and the servers. I want to make sure that there is nothing in memory. Soon, I'll add a new DS to make sure the number of reporting datasets does not have an impact. But for now, what happens?

[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c2
Binary files nfs4_vnops.c and /pnfs/pnfs-17-24/Pnfs4_vnops.c2 differ
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/Pnfs4_vnops.c2
Layout unacquired

I don't need snoop, I need to start examining the odl code. And to be honest, I was expecting problems here.

I don't think the layout was written to disk. For now, we plaop it down in '/var':

[root@pnfs-17-24 ~]> cd /var/nfs/v4_state/layouts
[root@pnfs-17-24 layouts]> ls -la
total 6
drwxr-xr-x   3 daemon   daemon       512 Jun 26 17:27 .
drwxr-xr-x   5 daemon   daemon       512 Jul 23 16:30 ..
drwxr-xr-x   2 daemon   daemon       512 Jun 26 17:27 2d90005
[root@pnfs-17-24 layouts]> cd 2d90005/
[root@pnfs-17-24 2d90005]> ls -la
total 4
drwxr-xr-x   2 daemon   daemon       512 Jun 26 17:27 .
drwxr-xr-x   3 daemon   daemon       512 Jun 26 17:27 ..

Found the problem!

#ifdef PERSIST_LO_ENABLED

And it wasn't enabled!

IF I enable it and try the experiment again:


[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/a13.c | wc -l
   21433
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/a13.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 14
Layout [0]:
        Layout obtained at: Sat Jul 25 17:49:59:853319 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [3]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK

I should have gotten the layout from before the write. But it looks wrong. They are out of order, see:

[root@pnfs-17-21 ~]> cp nfs4_vnops.c  /pnfs/pnfs-17-24/redo.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
        Layout obtained at: Sat Jul 25 17:51:55:18886 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [1]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [2]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [3]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK

Reboot and we see:

[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 14
Layout [0]:
        Layout obtained at: Sat Jul 25 17:55:33:205567 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [1]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [2]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [3]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK

So the good news is that odl works, the bad news is just not right.

We can actually see this in action. I wrote a simple program to write 32k of 'a', 32k of 'b', 32k of 'c', and 32k of 'd's to a file. I then copied that file over.

[root@pnfs-17-21 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw
[root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw
[root@pnfs-17-21 ~]> reboot
...
[root@pnfs-17-21 ~]> mount -o vers=4 pnfs-17-24:/pnfs2/pnfs /pnfs/pnfs-17-24
[root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw
Binary files 1234.32k.raw and /pnfs/pnfs-17-24/b1234.32k.raw differ

The first diff had to have gotten the same layout as before. Now when we reboot, we see a difference. And since the pattern is so simple, we can see what might be going on:

[root@pnfs-17-21 ~]> grep a /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
       0
[root@pnfs-17-21 ~]> grep b /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
      31
[root@pnfs-17-21 ~]> grep c /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
       0
[root@pnfs-17-21 ~]> grep d /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
      64

We need to snoop the filehandles from the create and then from the later read.

cre:  fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
odl:  fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
cre:  fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
odl:  fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
cre:  fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
odl:  fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
cre:  fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
odl:  fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000

And they match. The problem isn't in the filehandles we hand out.

I'm going to end it there for now...


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090725 Saturday July 25, 2009
And we have the pee in pNFS!

That last bug was actually pretty easy to solve. It turned out to be more code which had never been tested before. If we look at the code in usr/src/uts/common/fs/nfs/dserv_server.c (which will be modified by the time you read this), we see:

 711         mutex_enter(&inst->dmi_content_lock);
 712         error = find_open_root_objset(inst, sid, &root_objset);
 713         if (error) {
 714                 error = (error == ENOENT) ? EIO : error;
 715                 goto out;
 716         }
 717 
 718         error = find_open_mdsfs_objset(inst, dataset_id, root_objset,
 719             &(dnd->dnd_objset));
 720         if (error == 0 || error != ENOENT) {
 721                 if (error == 0)
 722                         dnd->dnd_flags |= DSERV_NNODE_FLAG_OBJSET;
 723                 goto out;
 724         }

The EAGAIN in Hard day of debugging the hard stuff, but now have multiple datasets on the same DS was returned on 712 by find_open_root_objset(). The new panic was occurring down in the call to find_open_mdsfs_objset() on 718. Before my changes, find_open_root_objset() always returned the first pNFS dataset root. So if you had two, you only ever saw the one.

I decided to skip right to the code and it ended up being easy to spot in find_open_root_objset(), I started there because I hadn't changed find_open_mdsfs_objset() and it was known to work.

 541 static int
 542 find_open_root_objset(dserv_mds_instance_t *inst, mds_sid mds_sid,
 543     open_root_objset_t **root_objset)
 544 {
...
 603         /*
 604          * Find the root pNFS object set.
 605          */
 606         for (tmp_root = list_head(&inst->dmi_datasets); tmp_root != NULL;
 607             tmp_root = list_next(&inst->dmi_datasets, tmp_root)) {
 608                 if (ds_guid.dg_zpool_guid ==
 609                     tmp_root->oro_ds_guid.dg_zpool_guid &&
 610                     ds_guid.dg_objset_guid ==
 611                     tmp_root->oro_ds_guid.dg_objset_guid) {
 612                         /*
 613                          * This is our root pNFS object set!
 614                          */
 615                         found_root_objset = 1;
 616                         break;
 617                 }
 618         }

*root_objset is never set when it is found. So, when we use it on 718, it is garbage. The fix is a simple assignment. Hmm, I could replace tmp_root with *root_objset, which is what the original author seemed to think was going on here.

But in any event, here is the 'pee' in pNFS:

[root@pnfs-17-21 ~]> mount -o vers=4 pnfs-17-24:/pnfs2/pnfs /pnfs/pnfs-17-24
[root@pnfs-17-21 ~]> cp /etc/passwd /pnfs/pnfs-17-24/qwhoei
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/qwhoei
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 1
Layout [0]:
        Layout obtained at: Sat Jul 25 02:20:00:343367 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [3]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
[root@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/qwhoei
-rw-r--r--   1 root     root         881 Jul 25 02:20 /pnfs/pnfs-17-24/qwhoei

Well, actually, there really isn't any parallel activity going on here. The file is 881 bytes and the default stripe size is 32k. So it all goes to stripe 0. But there is a lot going on in the background which was touched by my changes.

I could either start testing the kspe to get a smaller stripe size or I can use mkfile to get a file of 2*4*32k such that there are two writes to each stripe.

[root@pnfs-17-21 ~]> mkfile 256k chunky
[root@pnfs-17-21 ~]> ls -la chunky 
-rw------T   1 root     root      262144 Jul 25 02:34 chunky
[root@pnfs-17-21 ~]> cp chunky /pnfs/pnfs-17-24/pChunky
[root@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/pChunky
-rw-------   1 root     root      262144 Jul 25 02:34 /pnfs/pnfs-17-24/pChunky
[root@pnfs-17-21 ~]>  nfsstat -l /pnfs/pnfs-17-24/pChunky
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 8
Layout [0]:
        Layout obtained at: Sat Jul 25 02:34:22:325669 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [3]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK

But I have no idea if the file is identical. I'll test that later...

I can test that with using a large text file, say 440k:

[root@pnfs-17-21 ~]>  cp nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/Pnfs4_vnops.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
        Layout obtained at: Sat Jul 25 02:55:17:818063 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [3]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
[root@pnfs-17-21 ~]> ls -la  /pnfs/pnfs-17-24/Pnfs4_vnops.c
-rw-r--r--   1 root     root      409540 Jul 25 02:55 /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c

So we have pNFS!


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily
Hard day of debugging the hard stuff, but now have multiple datasets on the same DS

The good news:

[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/bubble
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 0
Layout [0]:
        Layout obtained at: Fri Jul 24 22:33:22:698731 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [3]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK

As you can see, the first and second stripe go onto the same DS. How can we make sure that these are for different datasets though? Well, snoop should tell us!

NFS:  Op = 50 (LAYOUTGET)
NFS:  Status = 0 (NFS4_OK)
NFS:  State ID hash = 0721
NFS:      len = 12    val = 710000004306000000000000
NFS:      State ID Sequence ID = 1
NFS:  Layout [0]:
NFS:  Return layout on close = FALSE
NFS:  Layout offset = 0
NFS:  Layout length = 18446744073709551615
NFS:  Layout iomode = LAYOUTIOMODE4_RW
NFS:  Layout type = LAYOUT4_NFSV4_1_FILES
NFS:  Device ID hash = [03F9]
NFS:      (16) 000000000000000000000000C8000000
NFS:  Util = 0x8001
NFS:      stripe_unit = 512
NFS:      UFLG_DENSE
NFS:  First stripe = 0
NFS:  fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000
NFS:  fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000
NFS:  fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000
NFS:  fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A190000000000B90000000000
NFS:

fh[0] and fh[1] differ in the mds_sid portion of the file handle: '10D336969B13CB6C0E' vs '1086FE860F0E0B87D4'. And the 0x10 at the start is probably the length. Amazingly, the mds_sid is also the only difference with fh[2] and fh[3]. That is because the mds_sid is what tells the DS which dataset to act on. The combination of the mds_sid and the ip addresses in the device info are what lets the client send the correct data to the correct DS:

NFS:  Op = 47 (GETDEVICEINFO)
NFS:  Status = 0 (NFS4_OK)
NFS:  Layout type = LAYOUT4_NFSV4_1_FILES
NFS:  Device address:
NFS:      stripe_indices: 0, 1, 2, 3
NFS:      multipath_ds_list[0]: [0] tcp:10.1.233.192.183.161
NFS:      multipath_ds_list[1]: [0] tcp:10.1.233.192.183.161
NFS:      multipath_ds_list[2]: [0] tcp:10.1.233.193.183.161
NFS:      multipath_ds_list[3]: [0] tcp:10.1.233.193.183.161
NFS:

Now, I said good news, which implies bad news. And here it is:

[thud@pnfs-17-21 ~]> cp /etc/passwd /pnfs/pnfs-17-24/bubble
cp: /pnfs/pnfs-17-24/bubble: write: Permission denied
[thud@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/bubble
-rw-r--r--   1 thud     staff          0 Jul 24 22:33 /pnfs/pnfs-17-24/bubble

So the create is working, but the data is not appearing. I'm thinking an issue with the client talking to the DSes. And with a new filehandle, I wonder if it is not being decoded correctly?

Well, it is client to DS communication, as a snoop shows us:

NFS:  ----- Sun NFS -----
NFS:
NFS:  Proc = 1 (Compound)
NFS:  Tag = pnfs write
NFS:  Status = 10012 (NFS4ERR_LOCKED)
NFS:  Number of results = 3
NFS:
NFS:  Op = 53 (SEQUENCE)
NFS:  Status = 0 (NFS4_OK)
NFS:  Session ID hash = [3716]
NFS:      (16) 00000000FFFFFFFFD700000064000000
NFS:  Sequence ID = 1
NFS:  Slot ID = 0
NFS:  Highest slotid = 200
NFS:  Target highest slotid = 200
NFS:  status = 0
NFS:
NFS:  Op = 22 (PUTFH)
NFS:  Status = 0 (NFS4_OK)
NFS:
NFS:  Op = 38 (WRITE)
NFS:  Status = 10012 (NFS4ERR_LOCKED)
NFS:

I just fixed an issue in the encoding of the ds_filehandle, so perhaps I have it in reverse? And no, the decoding was okay. We need to look for NFS4ERR_LOCKED. So either locking code is getting to me or it is EAGAIN - which is bad to track.

Well, not so bad, we know it is seen during the WRITE.

Nothing was panning out, so I went to the webrev (http://cr.opensolaris.org/~tdh/perse/) and I'm suspicious of this code in usr/src/uts/common/fs/nfs/dserv_server.c :

578         /*
 579          * Use the MDS SID (from the file handle) to find the real data
 580          * server guid (zpool guid + id of the root pNFS object set).
 581          */
 582         for (sid_map = list_head(&inst->dmi_mds_sids); sid_map != NULL;
 583             sid_map = list_next(&inst->dmi_mds_sids, sid_map)) {
 584                 if ((mds_sid.len == sid_map->msm_mds_storid.len) &&
 585                     (memcmp(mds_sid.val, sid_map->msm_mds_storid.val,
 586                     sid_map->msm_mds_storid.len) == 0)) {
 587                         found_mds_sid = 1;
 588                         ds_guid = sid_map->msm_ds_guid;
 589                         break;
 590                 }
 591         }
 592 
 593         /*
 594          * If we have no record of the given MDS SID it may mean that
 595          * we haven't been able to do the REPORTAVAIL for this particular
 596          * resource.  Therefore, just tell the client to try again later.
 597          * In the future, we will attempt to ask the MDS for this information
 598          * via DS_MAP_MDSSID.
 599          */
 600         if (found_mds_sid != 1)
 601                 return (EAGAIN);

Until my changeset, the mds_sid was always 0s. And that piece of code was surrounded by an '#ifdef 0'. Great, I have my culprit!

I added some debug statements to show what the DS was seeing as it looped across the sid_map:

Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sid_map is NOT NULL
Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: fh[16] 312CE43624C45DB0000000000000001E
Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sm[16] 768279474C9ACF441E00000000000000
Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: fh[16] 312CE43624C45DB0000000000000001E
Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: sm[16] B05DC42436E42C311E00000000000000
Jul 25 00:26:35 pnfs-17-23 nfssrv: find_open_root_objset: no root objset found!

I'm very suspicious that the reply from the DS_REPORTAVAIL (MDS to DS) is xdr encoding the mds_sid and the one in the file handle is not being encoded. I think the fh mds_sid above is the same as the second sid_map one.

The root cause here was that the mds_sid used to be sent across as 0s. As such, when the MDS did an xdr encode and the DS did not do an xdr decode, it really didn't matter. And the mds_sid was actually ignored on the DS.

Since we don't encode the mds_sid when we send it in the ds_filehandle, I decided to not encode it in the DS_REPORTAVAIL result. As such, we will be able to look at network traces and see if they match in both places.

This analysis took several hours. I keep on getting blind-sided by thinking existing code works correctly. I looked down several side paths I didn't mention (like the MDS treating the mds_sid array like a pointer and not an array of mds_sids).

I can also prove that I've fixed the bug, as I have a new one on the DS, which is downstream of the old one:

find_open_root_objset: sid_map is NOT NULL
find_open_root_objset: fh[16] D336969B13CB6C0E000000000000001E
find_open_root_objset: sm[16] 86FE860F0E0B87D4000000000000001E

panic[cpu1]/thread=ffffff01d7880000: BAD TRAP: type=e (#pf Page fault) rp=ffffff0007b3a0e0 addr=100428 occurred in module "genunix" due to an illegal access to a user address
...
ffffff0007b3a1e0 genunix:list_head+10 ()
ffffff0007b3a230 nfssrv:find_open_mdsfs_objset+40 ()
ffffff0007b3a6e0 nfssrv:dserv_nnode_data_getobjset+158 ()
ffffff0007b3a750 nfssrv:dserv_nnode_io_prep+85 ()
ffffff0007b3a7a0 nfssrv:nnop_io_prep+41 ()
ffffff0007b3a960 nfssrv:mds_op_write+134 ()
ffffff0007b3a9b0 nfssrv:rfs41_op_dispatch+72 ()

But it is late and I don't know if I'll blog about finding out what this one does...


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090724 Friday July 24, 2009
Starting to test massive layout, device, and kspe changes

I'm in the process of finishing off some coding I started before the BAT last month. It started off as a simple observation, my DSes with two datasets were only using the first dataset. It was time to rewrite the device and layouts to use the mds_sids. That is an unique value which identifies the ZFS dataset on a zpool. It should be unique in allowing a mapping from dataset name: 'ds1:ppool/pnfs' for humans and the value used by computers.

Well, doing all that also means it is time for the kspe (kernel simple policy engine) to be integrated. I had all of the above coded and the prototype for the kspe, but had to go to BAT to test other stuff. Well, this week I picked things back up and finished the code off. I'm still not completely happy with it, especially in where the kspe lives, i.e., in the 'nfs' or 'nfssrv' module.

I got my MDS up and running, but I was hitting issues with the above. But that only caused the modules not to load. I was able to use scp to get new modules over. And when I fixed those issues, I ran into the next major one. I had added another index to the layout nfs4 database table and I'd only told the table it had one.

I got in a nasty reboot loop and because of fastboot I couldn't get to to the grub menu. I ended up doing a power cycle from the LOM and that caused the grub menu to come up. I could then boot into single user mode and do:

# echo "setprop boot-file 'kmdb'" >> /a/boot/solaris/bootenv.rc
# reboot

That ended up fixing the fastboot issue. Now what I would do in a panic situation was get into single user mode and issue:

# cd /
# rm kernel/misc/amd64/nfssrv kernel/misc/nfssrv 
# reboot

That would bring me up safely in multi-user mode and then I could scp the new kernel modules over.

But I forgot that the DSes use the same initialization routines as the MDSes. If I hadn't already made all of the boxes configured to drop into kmdb on a panic, I would have been ticked.

Hey, I have a DS up and running with one dataset (gotta start small) and a down MDS:

[root@pnfs-17-24 ~]> 
panic[cpu1]/thread=ffffff01d9ee70c0: assertion failed: e->refcnt > 1, file: ../../common/fs/nfs/nfs4_db.c, line: 133

ffffff00081aa980 genunix:assfail+7e ()
ffffff00081aa9b0 nfssrv:rfs4_dbe_rele+67 ()
ffffff00081aaa90 nfssrv:ds_reportavail+4d8 ()
ffffff00081aab40 nfssrv:nfs_ds_cp_dispatch+9e ()
ffffff00081aac30 rpcmod:svc_getreq+20d ()
ffffff00081aaca0 rpcmod:svc_run+197 ()
ffffff00081aacd0 rpcmod:svc_do_run+81 ()
ffffff00081abeb0 nfs:nfssys+a0e ()
ffffff00081abf00 unix:brand_sys_syscall32+292 ()

I'm off to solve it. See ya' later!


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090708 Wednesday July 08, 2009
Slide deck from Oklahoma City OpenSolaris User Group (OKCOSUG) presentation

I threw together a quick and dirty overview of pNFS, OpenSolaris, and the work I do. It doesn't go into detail, but it is another reference on how to get an OpenSolaris pNFS community up and running.

I need to revisit it with vbox examples. I ran out of time, which I regret.

The crowd was good - I ended up slanting the slides towards a crowd that would be willing to play with beta software. In retrospect, I wish I had drawn out the crowd more to see how they might be interested in deploying pNFS.

One of my themes in the presentation, i.e., I'm not sure it would come out in just reading the slides, is that as developers, we are focused on delivering feature functionality right now and not on management. I'm not talking about data management, we have a great story there with spe, but in pNFS community management. I.e., how does an admin manage the set of machines.

The BUI (and scripting in the CLI) on the Unified Storage 7000 really drives home this point. Taking that approach and extending it such that the BUI on the MDS also configures the DSes would be a major bonus in our implementation.

Oh yeah, the slides are available at okcosug-pnfs-spe.pdf


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090701 Wednesday July 01, 2009
New BFU bits and closed-bins for closedv11

I just did a branch merge of the nfs41-gate with the onnv_117 tag of the onnv-gate.

And the closed tree had some changes.

I've added a tag of closedv11 to the nfs41-gate and you can download new versions of the closed-bins at http://www.opensolaris.org/os/project/nfsv41/downloads/


There are two ways you can go about installing pNFS on your machines:

  1. Get an install media for the base release (i.e., 117 in this case), install it, and then use the BFU (think precompiled kernel) and SUNWonbld supplied tarballs to install our pNFS image on top of the base. Details for doing that can be found at the bottom of http://wikis.sun.com/display/NFS/July+2008+Release+Notes
  2. Retrieve either a copy of our public Mercurial gate (ssh://anon@hg.opensolaris.org/hg/nfsv41/nfs41-gate) or the source tarball on the Downloads page. Follow the instructions at http://opensolaris.org/os/downloads/on/ on how to build ON. We've provided the source and the closed binaries, you supply the compile cycles.

Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090628 Sunday June 28, 2009
Debugging NFS4ERR_BADSESSION from a DS

We decided to skip branch merges to Nevada builds 113-116. We had issues with the hardware in the lab and vboxes. Plus we wanted a stable environment for the BakeAThon. I'm having a hard time merging with build 117. One of the issues was a panic on the client (which I could patch) due to getting NFS4ERR_BADSESSION from a DS. I was able to see the DS replying on the exchange_id request in a packet trace. There are only a handful of places where the server returns this error code, so I was pretty sure I could reuse a Dtrace script I had laying around:

:nfssrv:mds_lease_chk:return
/args[1] != 0/
{
        printf("rc1 = %d", args[1]);
}

Until I came along this code snippet:

                                *cs->statusp = resp->dsr_status =
                                    NFS4ERR_UNSAFE_COMPOUND;
                                goto final;
...
final:
        DTRACE_NFSV4_2(op__destroy__session__done,
            struct compound_state *, cs,
            DESTROY_SESSION4res *, resp);
}

Just great, no return value. How could I catch it?

Well, the answer is staring us right in the face, we can use op__destroy__session__done. I had to snoop around for examples in a colleague's home directory (which is kinda why I blog about this stuff, it is easier to find), but ended up with this:

:::op-bind-conn-to-session-done
{
        bresp = (BIND_CONN_TO_SESSION4res *)arg1;

        printf("rc1 = %d", bresp->bctsr_status);
}

:::op-destroy-session-done
{
        dresp = (DESTROY_SESSION4res *)arg1;

        printf("rc1 = %d", dresp->dsr_status);
}

Note that I wasn't too worried about being specific as to which module you found these calls in, I was betting on them being very unique.

Oh, and I was able to narrow down where the NFS4ERR_BADSESSION was coming from. And then I had to add debug statements to find out why. :-< I bet I might have been able to do it still with Dtrace. :->


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090622 Monday June 22, 2009
Trip report - BakeAThon at NetApp

So we had our first BakeAThon hosted by NetApp. I thought it went very well and it enabled the NetApp folks to see how much work goes into hosting the event.

I can't go into details that involves other vendors, because of a NDA, but I can say that we did a lot of testing on new features in both the client and the server. I'm not that familiar with the client side changes, I think we had the compound constructor changes by Bob Mastors and Karen Rochford was testing a rewrite of the layout handling code.

I was more excited about the changes on the server side:

  • Rob Thurlow had added Proxy I/O, which allows a nfsv4.1 client to talk to our pNFS client. I.e., you get the sessions code, but not the 'p' -- it isn't parallel. We had tested this internally, but hadn't had a chance to see how other clients fared.
  • Jim Wahlig had made some significant changes to how the server does layout recalls and also put in the first pass at persistent layouts.
  • Piyush Shivam had added a heartbeat between the DS and the MDS. This allowed the DS to detect that the MDS had rebooted and to re-register with it. If the DS reboots itself, it would automatically handle this condition.
  • Lisa Week had implemented the Control Protocol logic for DS_REMOVE. This feature allows a MDS to remove files from a DS.
  • Rick Mesta was testing general sessions code changes, including the sa_cachethis implementation.
  • Jeff Smith had some code cleanup in place to reorganize our kmod footprint. This was the first of two changes and was the most innocent.
  • I had cleaned up the DS_REPORTAVAIL logic and general race cases for when a DS was rebooting whilst under load from the client.
  • Plenty of bug fixes and other minor improvements.

I also brought a major rewrite of the layout handling code (allowing for multiple layouts, etc), device handling (allowing multiple datasets on the same DS) and the integration of the kspe (kernel simple policy engine), but I probably needed another week of development/testing on that code.

The other major development is that we are finally making the switch to using Virtual Boxes for our testing needs. We typically would want 3-4 machines per developer. We can refine this by having a pair of public communities (MDS plus 2 DSes). One as the stable system and the other as a sanity test rig. And then each client developer gets their own machine for a client and each server developer gets a community.

Well, with Virtual Boxes working on a wide range of host OSes, we can have the communities hosted on one or two beefier machines, perhaps sharing room with the build server, and each developer can bring up what they need on their laptops.

I'm going to be giving a presentation on July 7th over pNFS at the Oklahoma City OpenSolaris User Group - OKCOSUG. I'd like to give a live demo using virtual boxes, but I'm not making any promises...


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

20090511 Monday May 11, 2009
Lock bug in DS teardown code

I've hit a nasty little bug which requires an orderly shutdown of the DS as the client is pounding it with traffic:

panic[cpu1]/thread=ffffff01d16d38e0: rw_destroy: lock still active, lp=ffffff01e60c6e08 wwwh=10 thread=ffffff01d16d38e0

ffffff00090a0b50 unix:rw_panic+6f ()
ffffff00090a0b70 unix:rw_destroy+33 ()
ffffff00090a0ba0 nfssrv:dserv_mds_instance_destroy+6d ()
ffffff00090a0bf0 genunix:kmem_cache_free_debug+29c ()
ffffff00090a0c50 genunix:kmem_cache_free+90 ()
ffffff00090a0c90 nfssrv:dserv_mds_instance_teardown+2b8 ()
ffffff00090a0cd0 unix:stubs_common_code+51 ()
ffffff00090a1eb0 nfs:nfssys+73 ()
ffffff00090a1f00 unix:brand_sys_syscall32+292 ()

Lock still active, I take that to mean that we are trying to destroy it as another thread has it held.

If we look at dserv_mds_instance_destroy, it only helps as far as getting us in the right source file. Strike that, as there is only one read/write lock, it also tells us which one. The issue is this snippet in dserv_instance_enter:

 220         if (inst == NULL)
 221                 return (ESRCH);
 222 
 223         rw_enter(&inst->dmi_inst_lock, lock_type);
 224         /*
 225          * dmi_teardown_in_progress is only set in one place,
 226          * dserv_mds_teardown_instance() and when doing so the dmi_inst_lock
 227          * is held as a WRITER, therefore, it is safe to check it without
 228          * holding the dmi_content_lock.
 229          */
 230         if (inst->dmi_teardown_in_progress == B_TRUE) {
 231                 rw_exit(&inst->dmi_inst_lock);
 232                 if (lock_type == RW_READER)
 233                         return (EIO);
 234                 else if (lock_type == RW_WRITER)
 235                         /*
 236                          * This will protect from receiving multiple teardown
 237                          * commands happening at once.
 238                          */
 239                         return (EBUSY);
 240         }

I thought the issue was that if we got past here, we had multiple references held to the lock by threads processing compounds. I.e., I've been dealing with refcounts and all problems look like refcounts will solve them. I had the refcount code halfway implemented and was looking to send a wakeup to try and get the writer going when I realized what the code was really trying to do.

If the instance had been removed from memory totally, then we would bail out on the first check at line 220. If it was in the process of tearing down, then the lock would be held by the WRITER, which would mean that all READERs had exited. I.e., the refcount of READERs had to be 0. No use tracking something that does not matter.

The problem had to be in that window between starting to tear down and removing the instance from the avl tree (look at the code in usr/src/uts/common/fs/nfs/dserv_mds.c for more context). And the most obvious thing is that a READER which comes along to check has to grab the lock to check. So, this code solves that:

 216         bool_t grab_lock = FALSE;
...
 223         if (inst == NULL)
 224                 return (ESRCH);
 225 
 226         /*
 227          * If dmi_teardown_in_progress is set, then we can't grab the
 228          * lock. I.e., we are in the midst of either tearing it
 229          * down or we have torn it down.
 230          */
 231 retry_with_lock:
 232         if (grab_lock) {
 233                 /*
 234                  * Now we have to grab the lock and make sure that it is not
 235                  * true!
 236                  */
 237                 rw_enter(&inst->dmi_inst_lock, lock_type);
 238         }
 239 
 240         /*
 241          * dmi_teardown_in_progress is only set in one place,
 242          * dserv_mds_teardown_instance() and when doing so the dmi_inst_lock
 243          * is held as a WRITER, therefore, it is safe to check it without
 244          * holding the dmi_content_lock.
 245          */
 246         if (inst->dmi_teardown_in_progress == B_TRUE) {
 247                 if (grab_lock)
 248                         rw_exit(&inst->dmi_inst_lock);
 249                 if (lock_type == RW_READER)
 250                         return (EIO);
 251                 else if (lock_type == RW_WRITER)
 252                         /*
 253                          * This will protect from receiving multiple teardown
 254                          * commands happening at once.
 255                          */
 256                         return (EBUSY);
 257         }
 258 
 259         if (!grab_lock) {
 260                 grab_lock = TRUE;
 261                 goto retry_with_lock;
 262         }

We will try to check inst->dmi_teardown_in_progress twice. The first time we will do it without the lock. The only way it can be 1 is if a tear down is in progress. We aren't going to keep the lock in that case, we are going to exit right away. But if it is 0 here, we have no way of knowing whether another thread just modified it. So in that case we have to grab the lock and check again.

I thought this would solve the problem, but I ended up with the same panic. The new issue can be found in rwlock(9F):

The rw_enter() function acquires the lock, and blocks if necessary. If enter_type is RW_READER, the caller blocks if there is a writer or a thread attempting to enter for writ- ing. If enter_type is RW_WRITER, the caller blocks if any thread holds the lock.

So say a WRITER comes along, it will wait until all of the READERS drain. At that point, we know the refcnt is 0. The problem is that if another READER comes along, it will now block until the WRITER is done. It will keep the lock active:

...
    dmi_inst_lock = {
        _opaque = [ 0x10 ]
    }
...
[1]> ffffff01e60c6dc8::print dserv_mds_instance_t dmi_inst_lock|::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
ffffff01e60c6e08        READERS=2  B000

What we have to do as a READER is try to grab the lock. If we have to block, then in this case only, we know tear-down has started!

[th199096@aus-build-x86 nfs]> grep dserv_instance_enter *c
dserv_mds.c:dserv_instance_enter(krw_t lock_type, boolean_t create_instance,
dserv_mds.c: * This function frees any of the locks taken by dserv_instance_enter
dserv_mds.c:    error = dserv_instance_enter(RW_WRITER, B_FALSE, &inst);
dserv_mds.c:    error = dserv_instance_enter(RW_READER, B_TRUE, &inst);
dserv_mds.c:    error = dserv_instance_enter(RW_READER, B_TRUE, &inst);
dserv_mds.c:    error = dserv_instance_enter(RW_READER, B_TRUE, &inst);
dserv_mds.c:    error = dserv_instance_enter(RW_READER, B_FALSE, &inst);
dserv_mds.c:    error = dserv_instance_enter(RW_READER, B_FALSE, &inst);
dserv_server.c:         error = dserv_instance_enter(RW_READER, B_FALSE, &inst);

So we could do this:

 238                 if (rw_tryenter(&inst->dmi_inst_lock, lock_type) == 0) {
 239                         if (lock_type == RW_READER)
 240                                 return (EIO); 
 241                         else if (lock_type == RW_WRITER)
 242                                 rw_enter(&inst->dmi_inst_lock, lock_type);
 243                 }               

But I really hate doing this as it makes an assumption about there only ever being one reason to grab as a WRITER and no way to programmatically enforce it. Would a comment suffice?

What we really want to do is sleep and when awoken, retry to grab the lock. We would also have to get the instance pointer fresh.

I think a quick comment and this change will accomplish all I want:

 238                 if (rw_tryenter(&inst->dmi_inst_lock, lock_type) == 0) {
 239                         if (lock_type == RW_READER)
 240                                 return (EAGAIN); 
 241                         else if (lock_type == RW_WRITER)
 242                                 rw_enter(&inst->dmi_inst_lock, lock_type);
 243                 }               

I.e., let the caller try again if it wants to!

So code, build, test!


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily