Monday July 27, 2009
How spe interacts with file opens

The spe is a simple policy engine, which dictates the stripe size, the block size, and the set of devices used during file creation - these three items form the layout. The impact of that statement is subtle, but it sets some expectations. The spe code only fires on file creation, so if a file already exists, we read in the on-disk layout (odl) to get the layout.

So, if files were created under a different set of policies, then there will be no problem accessing the files under the new set of policies.
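A minimal sketch of that rule, not the actual mds code: the policy engine is consulted only when a file has no stored layout, and every later open just returns what was stored. The types, the function names, and the two stand-in policies are hypothetical; the numbers are borrowed from the nfsstat output in this post.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down layout: stripe count and stripe unit only. */
typedef struct {
    unsigned stripe_count;
    unsigned stripe_unit;   /* bytes */
} layout_t;

/* Hypothetical per-file record on the mds. */
typedef struct {
    bool     has_odl;       /* set once a layout was stored at creation */
    layout_t odl;           /* stand-in for the on-disk layout */
} mds_file_t;

/* Stand-in for "whatever the policy engine says right now". */
typedef layout_t (*spe_eval_fn)(const char *path);

/* Layout handed out before any policy existed (illustrative values). */
layout_t default_layout(const char *path)
{
    (void)path;
    layout_t l = { 4, 32768 };
    return l;
}

/* Layout a later ".c" policy would hand out (illustrative values). */
layout_t c_file_policy(const char *path)
{
    (void)path;
    layout_t l = { 4, 3968 };
    return l;
}

const layout_t *file_open(mds_file_t *f, const char *path, spe_eval_fn current_spe)
{
    if (!f->has_odl) {
        /* Creation: the only time the policy engine fires. */
        f->odl = current_spe(path);
        f->has_odl = true;
    }
    /* Existing file: the stored layout wins, whatever the policies say now. */
    return &f->odl;
}
```

Opening the same record twice with different policies in force returns the layout chosen the first time, which is the behavior the rest of this post checks by hand.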

Here is a file we created before we even enabled spe:

[root@pnfs-17-21 ~]> cp nfs4_vnops.c  /pnfs/pnfs-17-24/redo.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
        Layout obtained at: Sat Jul 25 17:51:55:18886 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [1]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [2]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [3]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK

I've rebooted the client and mds many times and I've added spe policies, but I can check that we still get the exact same layout. (Note: I've also changed the mount point to be the root of the mds, so expect the path to be different!)

[root@pnfs-4-02 ~]> touch /pnfs/pnfs-17-24/pnfs2/pnfs/redo.c
[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs2/pnfs/redo.c
Layout unacquired
[root@pnfs-4-02 ~]> cat /pnfs/pnfs-17-24/pnfs2/pnfs/redo.c > /dev/null
[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs2/pnfs/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 1
Layout [0]:
        Layout obtained at: Mon Jul 27 11:38:40:673328 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [1]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [2]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [3]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK

One thing to note right off the bat is that 'touch' does not cause the file layout to be loaded. The reason is simple: the 'touch' command is a meta-data operation and can be serviced on the mds. It does not need to open the file. I'd have to look at the snoop output to be sure, but it looks like we are getting the same layout.

Now, if I do the same copy as before, this time to a new file named redome.c, we see:

[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs2/pnfs/redome.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 116
Layout [0]:
        Layout obtained at: Mon Jul 27 11:43:30:691639 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 3968
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [3]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK

We just happen to get a stripe count of 4, but notice that the stripe unit is much smaller. We can tell that the extension of '.c' caused the first policy to fire:

[root@pnfs-17-24 ~]> more /etc/policies.spe 
1000, 4, 4k, pool17, ext == c
2000, 8, 1k, pool17:pool4, path == /pnfs1
3000, 3, 8k, pool4:pool17, path == /pnfs2
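The policy says 4k, yet nfsstat reports 3968. The numbers in these posts (1k -> 960, 4k -> 3968, 8k -> 8000, and lc_stripe_unit = 0x3e8 for the 1k policy) are all consistent with 'k' meaning 1000 and the result being rounded down to a multiple of 64 somewhere before it reaches the client. That is an inference from the output, not something read out of the code, and both helper names below are hypothetical.

```c
#include <stdint.h>

/* Assumption: a policy value of "Nk" means N * 1000 bytes (0x3e8 for 1k). */
uint32_t policy_unit_bytes(uint32_t n_k)
{
    return n_k * 1000u;
}

/* Assumption: the reported stripe unit is rounded down to a multiple of 64. */
uint32_t round_down_64(uint32_t bytes)
{
    return bytes - (bytes % 64u);
}
```

With those two assumptions, round_down_64(policy_unit_bytes(4)) is 3968, and the 1k and 8k policies come out as 960 and 8000.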

Okay, what if we call it redome.h? That should cause the last policy to match. Note that the test of the path attribute is based on the path of the mds and not the client. Also, if we have a mount point under the root of the mds, we still get the complete path when we do the check.

[root@pnfs-4-02 ~]> cp nfs4_vnops.c /pnfs/pnfs-17-24/pnfs2/pnfs/redome.h
[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs2/pnfs/redome.h
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 64
Layout [0]:
        Layout obtained at: Mon Jul 27 11:46:20:549943 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 3, stripe unit: 8000
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK

And it does! We can see it in both the stripe count and stripe unit.
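The matching seen so far can be sketched as a first-match walk over the policy list. This is a toy model, not the spe evaluator: the struct, the function names, and especially the reading of 'path ==' as a leading-directory match are assumptions drawn from the results above (redome.c under /pnfs2 hits policy 1000, redome.h hits 3000). Paths are as seen by the mds.

```c
#include <stddef.h>
#include <string.h>

typedef enum { ATTR_EXT, ATTR_PATH } attr_t;

/* Hypothetical in-memory form of one line of /etc/policies.spe. */
typedef struct {
    int         id;
    unsigned    stripe_count;
    const char *stripe_unit;
    const char *npools;
    attr_t      attr;
    const char *value;
} policy_t;

static const policy_t policies[] = {
    { 1000, 4, "4k", "pool17",       ATTR_EXT,  "c"      },
    { 2000, 8, "1k", "pool17:pool4", ATTR_PATH, "/pnfs1" },
    { 3000, 3, "8k", "pool4:pool17", ATTR_PATH, "/pnfs2" },
};

/* Returns nonzero when the policy's attribute test holds for this mds path. */
int policy_matches(const policy_t *p, const char *path)
{
    if (p->attr == ATTR_EXT) {
        const char *slash = strrchr(path, '/');
        const char *base = slash ? slash + 1 : path;
        const char *dot = strrchr(base, '.');
        return dot != NULL && strcmp(dot + 1, p->value) == 0;
    }
    /* Assumption: "path == /pnfs2" matches anything under /pnfs2. */
    size_t n = strlen(p->value);
    return strncmp(path, p->value, n) == 0 &&
        (path[n] == '/' || path[n] == '\0');
}

/* Walks the list in order; the first policy that matches wins. */
const policy_t *policy_find(const char *path)
{
    for (size_t i = 0; i < sizeof (policies) / sizeof (policies[0]); i++) {
        if (policy_matches(&policies[i], path))
            return &policies[i];
    }
    return NULL;    /* no policy fired; the caller falls back to a default */
}
```

Because the walk stops at the first hit, a '.c' file under /pnfs2 never reaches policy 3000, which is why redome.c and redome.h in the same directory end up with different layouts.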

And these files will have those same layouts as long as they live.

Well, I need to go figure out some more unit tests to break my code...


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily
More on getting the kspe code to generate an array of mds_sid

Okay, I've forced the ds_dataset_name to be present at creation. And I'm getting pretty hashes. I'm still not generating a mds_sid array.

When we store 'pnfs-4-01:pnfs1/d10':

Jul 27 10:59:45 pnfs-17-24 nfssrv: utf8_hash of |pnfs-4-01:pnfs1/d10|[20] is 55811466

And when we try to search for it:

Jul 27 11:04:51 pnfs-17-24 nfssrv: utf8_hash of |pnfs-4-01:pnfs1/ds10|[21] is 111623246

The lengths are different, which will cause a bit shift. And we see that one is 'd10' and the other is 'ds10'. Well, I'll change '/etc/npools.spe', as I don't have renaming of the dataset name <-> mds_sid mapping working yet.

[root@pnfs-4-01 ~]> zfs rename pnfs1/d10 pnfs1/ds10
cannot rename 'pnfs1/d10': dataset is busy

And there could be other problems.
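Going back to the two hash lines above: a plain shift-left-by-one-and-add over the visible characters reproduces both logged values, so the extra 's' really does shift everything in front of it by one more bit. The sketch below is reconstructed from the log output only, not copied from utf8_hash; the type and function names are hypothetical. The bracketed lengths in the log are one more than the number of visible characters, so the sketch assumes that count includes a terminator that is not hashed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical counted string: a length plus a pointer to the bytes. */
typedef struct {
    uint32_t    len;    /* visible characters only, no terminator */
    const char *val;
} counted_str_t;

/* Wraps a C string without copying it. */
counted_str_t counted(const char *s)
{
    counted_str_t c = { (uint32_t)strlen(s), s };
    return c;
}

/* Shift the running hash left one bit, then add the next character. */
uint32_t shift_add_hash(const counted_str_t *s)
{
    uint32_t h = 0;
    for (uint32_t i = 0; i < s->len; i++)
        h = (h << 1) + (unsigned char)s->val[i];
    return h;
}
```

Fed 'pnfs-4-01:pnfs1/d10' this yields 55811466, and fed 'pnfs-4-01:pnfs1/ds10' it yields 111623246, matching the two nfssrv messages. An empty string hashes to 0.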

But anyway, fix the dataset name, and we have spe, I repeat, we have spe!

Here we can see a file creation without any policy firing:

[root@pnfs-4-02 ~]> mount -o vers=4 pnfs-17-24:/ /pnfs/pnfs-17-24
[root@pnfs-4-02 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/pnfs1/nfs41/spek4.txt
[root@pnfs-4-02 ~]> nfsstat -l !$
nfsstat -l /pnfs/pnfs-17-24/pnfs1/nfs41/spek4.txt
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 4
Layout [0]:
        Layout obtained at: Mon Jul 27 11:04:51:725322 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 10, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
        Stripe [3]:
                tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
        Stripe [4]:
                tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
        Stripe [5]:
                tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
        Stripe [6]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [7]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [8]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [9]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK

And here we can see the same file copy resulting in not only a different number of stripes, but also a different dataset order!

[root@pnfs-4-02 ~]> umount /pnfs/pnfs-17-24
[root@pnfs-4-02 ~]> mount -o vers=4 pnfs-17-24:/ /pnfs/pnfs-17-24
[root@pnfs-4-02 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/pnfs1/nfs41/spek5.txt
[root@pnfs-4-02 ~]> nfsstat -l /pnfs/pnfs-17-24/pnfs1/nfs41/spek5.txt
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 140
Layout [0]:
        Layout obtained at: Mon Jul 27 11:27:06:306959 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 8, stripe unit: 960
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [3]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [4]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [5]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [6]:
                tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK
        Stripe [7]:
                tcp:pnfs-4-01.Central.Sun.COM:10.1.233.11:47009 OK

First real test of kspe!

So, I started off a test run with the kspe. It allows me to decide which datasets are going to be used in a layout (see my slides from the Oklahoma City OpenSolaris User Group (OKCOSUG) presentation). I set up a simple set of npools and policies:

[root@pnfs-17-24 /etc]> more npools.spe 
pool17 pnfs-17-22:pnfs1/ds1 pnfs-17-23:pnfs1/ds3 pnfs-17-22:pnfs2/ds2 pnfs-17-23:pnfs2/ds4 pnfs-17-23:pnfs1/ds7 pnfs-17-22:pnfs1/ds8
pool4 pnfs-4-01:pnfs1/ds4 pnfs-4-01:pnfs2/ds5 pnfs-4-01:pnfs2/d9 pnfs-4-01:pnfs1/ds10
[root@pnfs-17-24 /etc]> more policies.spe 
1000, 4, 4k, pool17, ext == c
2000, 8, 1k, pool17:pool4, path == /pnfs1
3000, 3, 8k, pool4:pool17, path == /pnfs2

Note that a dataset identifier is a combination of the host name and the zfs filesystem. I wrote the code and I still struggle with the fact that this is not a path name, it is a zfs name. I.e., the connector is ':' and not ':/'. 'pnfs2/ds2' is the zfs name and not '/pnfs2/ds2'!

And I struggled a bit to load them, until I found my kspe implementation notes.

And right off the bat, I see my policies are not matching and I've got a core!

[root@pnfs-4-02 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/pnfs1/nfs41/spe_1234.32k.raw.txt

yields

[root@pnfs-17-24 ~]> rc = 0, eval = 0, id = 1000
rc = 0, eval = 1, id = 2000
WARNING: spe: 1000 8 

panic[cpu1]/thread=ffffff01d74bf4e0: BAD TRAP: type=e (#pf Page fault) rp=ffffff0008162d70 addr=0 occurred in module "nfssrv" due to a NULL pointer dereference
...
[1]> $c
kmdb_enter+0xb()
...
nfssrv`mds_layout_hash+0x19(ffffff0008162fe0)
...
[1]> ffffff0008162fe0::print layout_core_t
{
    lc_stripe_unit = 0x3e8
    lc_stripe_count = 0x8
    lc_mds_sids = 0
}

Which tells me that the kspe code did not generate a mds_sid array. Again, no problem, I just wrote that code last week and this is the first test of it!

So the bug was a stupid error:

                        return (mds_sids ? 0 : ENOENT);
versus
                        return (*mds_sids ? 0 : ENOENT);

Success was being returned with an empty array of mds_sids. There is still a bug, as we should be finding matches, but at least it is not so nasty.
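For anyone who has not been bitten by this one, here is a stripped-down sketch of the mistake, assuming the array is handed back through an out-parameter. The sid_t type and both function names are hypothetical stand-ins, not the kspe code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an mds_sid. */
typedef struct {
    int id;
} sid_t;

/*
 * Buggy: tests the out-parameter itself. The caller passed the address of
 * its own pointer, so this is never NULL and we always report success.
 */
int map_buggy(int found, sid_t **sids)
{
    *sids = found ? calloc(1, sizeof (**sids)) : NULL;
    return (sids ? 0 : ENOENT);
}

/* Fixed: tests what the out-parameter points at, i.e. the array. */
int map_fixed(int found, sid_t **sids)
{
    *sids = found ? calloc(1, sizeof (**sids)) : NULL;
    return (*sids ? 0 : ENOENT);
}
```

With nothing found, map_buggy returns 0 and leaves the caller holding a NULL array, which is the same shape as the page fault in mds_layout_hash; map_fixed returns ENOENT instead.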

The bug is that these comparisons are failing:

Jul 27 01:37:45 pnfs-17-24 nfssrv: Comparing policy npool |pool17| at 7 to global |pool17| at 7
Jul 27 01:37:45 pnfs-17-24 nfssrv: Comparing policy npool |pool17| at 7 to global |pool4| at 6
Jul 27 01:37:45 pnfs-17-24 nfssrv: Comparing policy npool |pool4| at 6 to global |pool17| at 7
Jul 27 01:37:45 pnfs-17-24 nfssrv: Comparing policy npool |pool4| at 6 to global |pool4| at 6
Jul 27 01:37:45 pnfs-17-24 nfssrv: spe_map_npools_to_mds_sids: No matching npools!

The first and last should match. Ah, they do (as shown by additional debug logic), which shows me barking up a wrong tree!

D'oh! I found it! If we look at this code:

spe_map_npools_to_mds_sids(kspe_state_t *kspe, spe_policy *sp,
...
        spe_npool       *sn;
        spe_npool       *np;

        /*
         * For each npool in the policy, find it in the
         * list of npools, and start assigning datasets.
         */
        for (sn = sp->sp_npools; sn != NULL; sn = sn->next) {
                for (np = kspe->ks_npools; np; np = np->next) {
                        cmp = utf8_compare(&np->sn_name, &sn->sn_name);
                        if (cmp == 0) {
                                /*
                                 * Now we fill in entries in the *mds_sids
                                 * array.
                                 */
                                for (ss = sn->sn_dses; ss; ss = ss->next) {

We see I was lazy in assuming that sn and np were the same thing. Note that they are the same type of object, but sn points to the npools in the policy and np points to the npools in the global list. The point is that an npool can be in multiple policies. So rather than store the list of datasets in the policies (which is a nightmare for updating) or pointers to npools in the policies (which sounds good and I can't remember why not!), we store the datasets in the global list.

So instead of searching a list of datasets in the global list, we search an empty list in the policy. :->
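A toy version of the two loops makes the difference easy to see. The structs are hypothetical and trimmed down, strcmp stands in for utf8_compare, and the 'buggy' flag exists only so both behaviors fit in one function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical dataset and npool list nodes. */
typedef struct ds {
    const char *name;
    struct ds  *next;
} ds_t;

typedef struct npool {
    const char   *name;
    ds_t         *dses;     /* only populated on the global list */
    struct npool *next;
} npool_t;

static size_t count_dses(const ds_t *ss)
{
    size_t n = 0;
    for (; ss != NULL; ss = ss->next)
        n++;
    return n;
}

/* Counts the datasets a policy would get to pick from. */
size_t datasets_for_policy(const npool_t *policy_npools,
    const npool_t *global_npools, int buggy)
{
    size_t total = 0;

    for (const npool_t *sn = policy_npools; sn != NULL; sn = sn->next) {
        for (const npool_t *np = global_npools; np != NULL; np = np->next) {
            if (strcmp(np->name, sn->name) != 0)
                continue;
            /*
             * The bug walked sn->dses (the policy's copy, always empty).
             * The datasets hang off np, the entry on the global list.
             */
            total += count_dses(buggy ? sn->dses : np->dses);
        }
    }
    return total;
}
```

The names compare equal either way, which is why the debug output looked fine; only the list walked after the match differs.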

And I'm at the next bug:

Jul 27 02:11:04 pnfs-17-24 nfssrv: mds_ds_path_to_mds_sid returned an error!

We're not finding a match as we search the ds_guid_info database. When I had a search issue a couple nights ago, it turned out to be the hash function. If we look at what we have:

        instp->ds_guid_info_dataset_name_idx =
            rfs4_index_create(instp->ds_guid_info_tab,
            "DS_guid-dataset-name-idx", mds_str_hash,
            ds_guid_info_dataset_name_compare, ds_guid_info_dataset_name_mkkey,
            FALSE);

We are using mds_str_hash on a utf8_string - which is a length and a string. I think that will be problematic. I'm not too happy with the hash functions here in general, but for now all I care about is that they are consistent.

They aren't consistent in this case and it is getting late...

As it boots:

Jul 27 03:33:41 pnfs-17-24 nfssrv: utf8_hash of empty str

As we search:

Jul 27 03:34:35 pnfs-17-24 nfssrv: utf8_hash of |pnfs-17-22:pnfs1/ds1|[21] is 111612431

We ought to see a non-NULL addition and we ought to see 10 messages, not 1.

Bzzt! That is an expectation and not code. The debug statements are correct. If we look at the entry create routine, which is called just before the hash function, we see:

static bool_t
ds_guid_info_create(rfs4_entry_t u_entry, void *arg)
{
...
        pgi->ds_dataset_name.utf8string_val = NULL;
        pgi->ds_dataset_name.utf8string_len = 0;

We don't have our hands on that info -- or do we?

We do - I can fix this - but not now. It will wait until tomorrow, because I've got to sleep...
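For the record, the mismatch can be sketched with a toy hash index: if the key is still empty when the entry is filed, the entry lands in the bucket for the empty string and a later search by the real name looks in a different bucket. This is an illustration only, not the rfs4 database code; the types, the bucket count, and the shift-and-add hash are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBUCKETS 16u

typedef struct entry {
    const char   *name;     /* dataset name, filled in at some point */
    struct entry *next;
} entry_t;

typedef struct {
    entry_t *bucket[NBUCKETS];
} index_t;

/* Assumed hash: shift left one bit and add each character. */
static uint32_t name_hash(const char *s)
{
    uint32_t h = 0;
    while (*s != '\0')
        h = (h << 1) + (unsigned char)*s++;
    return h;
}

/* Files e under the hash of whatever the key looked like at insert time. */
void index_insert(index_t *ix, entry_t *e, const char *key_at_insert)
{
    uint32_t b = name_hash(key_at_insert) % NBUCKETS;
    e->next = ix->bucket[b];
    ix->bucket[b] = e;
}

/* Searches only the bucket that the requested name hashes to. */
entry_t *index_find(const index_t *ix, const char *name)
{
    for (entry_t *e = ix->bucket[name_hash(name) % NBUCKETS]; e != NULL;
        e = e->next) {
        if (strcmp(e->name, name) == 0)
            return e;
    }
    return NULL;
}
```

Inserting with an empty key and then searching for 'pnfs-17-22:pnfs1/ds1' comes back NULL; inserting with the name already set makes the same search succeed, which is the fix waiting for tomorrow.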

