« July 2009 »
SunMonTueWedThuFriSat
   
2
3
4
5
6
7
9
10
11
12
13
14
15
18
19
20
21
22
23
 
       
Today
XML

Neat blogs

Navigation

Editing

Powered by Roller Weblogger.

statcounter.com

clustrmaps.com

Locations of visitors to this page

technorati.com

20090726 Sunday July 26, 2009
File contents not matching

So after a reboot, a diff of a file copied to the server is showing problems. There are several triage steps to take:

  1. See if this happens with the old server - this will isolate whether or not it is an existing bug.
  2. See if it happens across an unmount - this will isolate whether the reboot is doing something.
  3. See if it happens from another client - this will also isolate whether it is the reboot or not.
  4. Try it such that one write goes to each DS and not to each datatset - which will tell us if the writes are wrong, i.e., the FH decode is not working correctly.

I can show the problem thusly:

[root@pnfs-17-21 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw
[root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw | wc -l
       0

How do I know that the client is even getting the bits back from the server. If they are still in local buffers, this may not be happening. So, lets try from another client:

[root@pnfs-4-02 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw | wc -l
     132

So it isn't the reboot and it probably is not the ODL.

Okay, I just configured a stock community and repeated the same test and there was no loss of data. Clearly either my changes or changes I failed to do have caused the problem. I.e., either I caused a bug or I exposed an existing bug - one that the new feature set enabled. Either way, I have to fix it before I can integrate my changes.

I'm beginning to think it is in code which has yet to be implemented:

[thud@ultralord nfs]> grep dnk_sid *
dserv_server.c:	dskey.dnk_sid = &fhp->fh.v1.mds_sid;
[thud@ultralord nfs]> grep dnk_fid *
dserv_server.c:	CRC32(rc, key->dnk_fid->val, key->dnk_fid->len,
dserv_server.c:	NFS_AVL_COMPARE(a->dnk_fid->len, b->dnk_fid->len);
dserv_server.c:	rc = memcmp(a->dnk_fid->val, b->dnk_fid->val,
dserv_server.c:	    a->dnk_fid->len);
dserv_server.c:	data->dnd_fid = key->dnk_fid;
dserv_server.c:	dskey.dnk_fid = &fhp->fh.v1.mds_fid;
dserv_server.c:	key->dnk_fid = &key->dnk_real_fid;

I enabled the setting of 'dnk_sid', but no one uses it. I bet what is happening is that both 32k chunks of the files are being written on the same dataset...

With some debug printfs, we can see that the DSes are picking different object sets to save the data:

Jul 26 21:28:46 pnfs-17-22 nfssrv: Accessing pnfs2/ds2/EF25FBCF08FA1BD7 for D336969B13CB6C0E000000000000001E
Jul 26 21:28:46 pnfs-17-22 nfssrv: Accessing pnfs1/ds1/EF25FBCF08FA1BD7 for 86FE860F0E0B87D4000000000000001E

So I implemented the code for 'dnk_sid' and now I'm seeing:

[root@pnfs-4-02 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/slow | wc -l
       0

It also works now across a reboot.

The bad thing here is that I'm making it up on the fly, I don't understand the nnode code. I'm going to have to get a code review just on that chunk.

Actually, strike that. I know there was a bug, missing code to be exact. I know my fix works, and I understand the changes I made. I don't understand how those changes work in the bigger picture of the nnode code. Oh, I understand that when we tried to access the two different datasets on the same DS, we would sometimes get one instead of the other. I'm still going to talk with the author, I want to make sure my fixes are optimal.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily
Testing On Disk Layouts

Last night, I did the following test of file creation:

[root@pnfs-17-21 ~]>  cp nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/Pnfs4_vnops.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
        Layout obtained at: Sat Jul 25 02:55:17:818063 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [3]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
[root@pnfs-17-21 ~]> ls -la  /pnfs/pnfs-17-24/Pnfs4_vnops.c
-rw-r--r--   1 root     root      409540 Jul 25 02:55 /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c

So we have pNFS!

I then went to bed, knowing that the next set of testing to do was what we call on disk layouts or odl. When we create a file, we generate a layout based on all of the currently reporting datasets (ignore kspe, which I will soon enable and start testing). When we generate the create layout, we need to stuff it somewhere such that a client which opens the file either for appending or reading can get at the contents. So we store the layout on disk, hence odl.

The system we had in place made assumptions like all devices would be there, they would report in the same order, etc. Well, with using the mds_sids as keys, I've changed that.

To test things, I've rebooted the client and the servers. I want to make sure that there is nothing in memory. Soon, I'll add a new DS to make sure the number of reporting datasets does not have an impact. But for now, what happens?

[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c2
Binary files nfs4_vnops.c and /pnfs/pnfs-17-24/Pnfs4_vnops.c2 differ
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/Pnfs4_vnops.c2
Layout unacquired

I don't need snoop, I need to start examining the odl code. And to be honest, I was expecting problems here.

I don't think the layout was written to disk. For now, we plaop it down in '/var':

[root@pnfs-17-24 ~]> cd /var/nfs/v4_state/layouts
[root@pnfs-17-24 layouts]> ls -la
total 6
drwxr-xr-x   3 daemon   daemon       512 Jun 26 17:27 .
drwxr-xr-x   5 daemon   daemon       512 Jul 23 16:30 ..
drwxr-xr-x   2 daemon   daemon       512 Jun 26 17:27 2d90005
[root@pnfs-17-24 layouts]> cd 2d90005/
[root@pnfs-17-24 2d90005]> ls -la
total 4
drwxr-xr-x   2 daemon   daemon       512 Jun 26 17:27 .
drwxr-xr-x   3 daemon   daemon       512 Jun 26 17:27 ..

Found the problem!

#ifdef PERSIST_LO_ENABLED

And it wasn't enabled!

IF I enable it and try the experiment again:


[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/a13.c | wc -l
   21433
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/a13.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 14
Layout [0]:
        Layout obtained at: Sat Jul 25 17:49:59:853319 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [1]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [2]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [3]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK

I should have gotten the layout from before the write. But it looks wrong. They are out of order, see:

[root@pnfs-17-21 ~]> cp nfs4_vnops.c  /pnfs/pnfs-17-24/redo.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
        Layout obtained at: Sat Jul 25 17:51:55:18886 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [1]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [2]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [3]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK

Reboot and we see:

[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 14
Layout [0]:
        Layout obtained at: Sat Jul 25 17:55:33:205567 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 4, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [1]:
                tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
        Stripe [2]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
        Stripe [3]:
                tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK

So the good news is that odl works, the bad news is just not right.

We can actually see this in action. I wrote a simple program to write 32k of 'a', 32k of 'b', 32k of 'c', and 32k of 'd's to a file. I then copied that file over.

[root@pnfs-17-21 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw
[root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw
[root@pnfs-17-21 ~]> reboot
...
[root@pnfs-17-21 ~]> mount -o vers=4 pnfs-17-24:/pnfs2/pnfs /pnfs/pnfs-17-24
[root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw
Binary files 1234.32k.raw and /pnfs/pnfs-17-24/b1234.32k.raw differ

The first diff had to have gotten the same layout as before. Now when we reboot, we see a difference. And since the pattern is so simple, we can see what might be going on:

[root@pnfs-17-21 ~]> grep a /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
       0
[root@pnfs-17-21 ~]> grep b /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
      31
[root@pnfs-17-21 ~]> grep c /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
       0
[root@pnfs-17-21 ~]> grep d /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
      64

We need to snoop the filehandles from the create and then from the later read.

cre:  fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
odl:  fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
cre:  fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
odl:  fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
cre:  fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
odl:  fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
cre:  fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
odl:  fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000

And they match. The problem isn't in the filehandles we hand out.

I'm going to end it there for now...


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily