So after a reboot, a diff of a file copied to the server is showing problems. There are several triage steps to take:
I can show the problem thusly:
[root@pnfs-17-21 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw
[root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw | wc -l
0
How do I know that the client is even getting the bits back from the server. If they are still in local buffers, this may not be happening. So, lets try from another client:
[root@pnfs-4-02 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw | wc -l
132
So it isn't the reboot and it probably is not the ODL.
Okay, I just configured a stock community and repeated the same test and there was no loss of data. Clearly either my changes or changes I failed to do have caused the problem. I.e., either I caused a bug or I exposed an existing bug - one that the new feature set enabled. Either way, I have to fix it before I can integrate my changes.
I'm beginning to think it is in code which has yet to be implemented:
[thud@ultralord nfs]> grep dnk_sid * dserv_server.c: dskey.dnk_sid = &fhp->fh.v1.mds_sid; [thud@ultralord nfs]> grep dnk_fid * dserv_server.c: CRC32(rc, key->dnk_fid->val, key->dnk_fid->len, dserv_server.c: NFS_AVL_COMPARE(a->dnk_fid->len, b->dnk_fid->len); dserv_server.c: rc = memcmp(a->dnk_fid->val, b->dnk_fid->val, dserv_server.c: a->dnk_fid->len); dserv_server.c: data->dnd_fid = key->dnk_fid; dserv_server.c: dskey.dnk_fid = &fhp->fh.v1.mds_fid; dserv_server.c: key->dnk_fid = &key->dnk_real_fid;
I enabled the setting of 'dnk_sid', but no one uses it. I bet what is happening is that both 32k chunks of the files are being written on the same dataset...
With some debug printfs, we can see that the DSes are picking different object sets to save the data:
Jul 26 21:28:46 pnfs-17-22 nfssrv: Accessing pnfs2/ds2/EF25FBCF08FA1BD7 for D336969B13CB6C0E000000000000001E Jul 26 21:28:46 pnfs-17-22 nfssrv: Accessing pnfs1/ds1/EF25FBCF08FA1BD7 for 86FE860F0E0B87D4000000000000001E
So I implemented the code for 'dnk_sid' and now I'm seeing:
[root@pnfs-4-02 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/slow | wc -l
0
It also works now across a reboot.
The bad thing here is that I'm making it up on the fly, I don't understand the nnode code. I'm going to have to get a code review just on that chunk.
Actually, strike that. I know there was a bug, missing code to be exact. I know my fix works, and I understand the changes I made. I don't understand how those changes work in the bigger picture of the nnode code. Oh, I understand that when we tried to access the two different datasets on the same DS, we would sometimes get one instead of the other. I'm still going to talk with the author, I want to make sure my fixes are optimal.
Last night, I did the following test of file creation:
[root@pnfs-17-21 ~]> cp nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/Pnfs4_vnops.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
Layout obtained at: Sat Jul 25 02:55:17:818063 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
[root@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/Pnfs4_vnops.c
-rw-r--r-- 1 root root 409540 Jul 25 02:55 /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c
So we have pNFS!
I then went to bed, knowing that the next set of testing to do was what we call on disk layouts or odl. When we create a file, we generate a layout based on all of the currently reporting datasets (ignore kspe, which I will soon enable and start testing). When we generate the create layout, we need to stuff it somewhere such that a client which opens the file either for appending or reading can get at the contents. So we store the layout on disk, hence odl.
The system we had in place made assumptions like all devices would be there, they would report in the same order, etc. Well, with using the mds_sids as keys, I've changed that.
To test things, I've rebooted the client and the servers. I want to make sure that there is nothing in memory. Soon, I'll add a new DS to make sure the number of reporting datasets does not have an impact. But for now, what happens?
[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c2 Binary files nfs4_vnops.c and /pnfs/pnfs-17-24/Pnfs4_vnops.c2 differ [root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/Pnfs4_vnops.c2 Layout unacquired
I don't need snoop, I need to start examining the odl code. And to be honest, I was expecting problems here.
I don't think the layout was written to disk. For now, we plaop it down in '/var':
[root@pnfs-17-24 ~]> cd /var/nfs/v4_state/layouts [root@pnfs-17-24 layouts]> ls -la total 6 drwxr-xr-x 3 daemon daemon 512 Jun 26 17:27 . drwxr-xr-x 5 daemon daemon 512 Jul 23 16:30 .. drwxr-xr-x 2 daemon daemon 512 Jun 26 17:27 2d90005 [root@pnfs-17-24 layouts]> cd 2d90005/ [root@pnfs-17-24 2d90005]> ls -la total 4 drwxr-xr-x 2 daemon daemon 512 Jun 26 17:27 . drwxr-xr-x 3 daemon daemon 512 Jun 26 17:27 ..
Found the problem!
#ifdef PERSIST_LO_ENABLED
And it wasn't enabled!
IF I enable it and try the experiment again:
[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/a13.c | wc -l
21433
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/a13.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 14
Layout [0]:
Layout obtained at: Sat Jul 25 17:49:59:853319 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
I should have gotten the layout from before the write. But it looks wrong. They are out of order, see:
[root@pnfs-17-21 ~]> cp nfs4_vnops.c /pnfs/pnfs-17-24/redo.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
Layout obtained at: Sat Jul 25 17:51:55:18886 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [1]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [2]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [3]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Reboot and we see:
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 14
Layout [0]:
Layout obtained at: Sat Jul 25 17:55:33:205567 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [1]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [2]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [3]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
So the good news is that odl works, the bad news is just not right.
We can actually see this in action. I wrote a simple program to write 32k of 'a', 32k of 'b', 32k of 'c', and 32k of 'd's to a file. I then copied that file over.
[root@pnfs-17-21 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw [root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw [root@pnfs-17-21 ~]> reboot ... [root@pnfs-17-21 ~]> mount -o vers=4 pnfs-17-24:/pnfs2/pnfs /pnfs/pnfs-17-24 [root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw Binary files 1234.32k.raw and /pnfs/pnfs-17-24/b1234.32k.raw differ
The first diff had to have gotten the same layout as before. Now when we reboot, we see a difference. And since the pattern is so simple, we can see what might be going on:
[root@pnfs-17-21 ~]> grep a /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
0
[root@pnfs-17-21 ~]> grep b /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
31
[root@pnfs-17-21 ~]> grep c /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
0
[root@pnfs-17-21 ~]> grep d /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
64
We need to snoop the filehandles from the create and then from the later read.
cre: fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000 odl: fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000 cre: fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000 odl: fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000 cre: fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000 odl: fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000 cre: fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000 odl: fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
And they match. The problem isn't in the filehandles we hand out.
I'm going to end it there for now...