Sunday July 26, 2009
File contents not matching

After a reboot, a diff of a file copied to the server is showing differences. There are several triage steps to take:

  1. See if this happens with the old server - this will isolate whether or not it is an existing bug.
  2. See if it happens across an unmount - this will isolate whether the reboot is doing something.
  3. See if it happens from another client - this will also isolate whether it is the reboot or not.
  4. Try it such that one write goes to each DS and not to each dataset - which will tell us if the writes are wrong, i.e., the FH decode is not working correctly.

I can show the problem thusly:

[root@pnfs-17-21 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw
[root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw | wc -l
       0

How do I know that the client is even getting the bits back from the server? If they are still in local buffers, the read may never reach the server. So, let's try from another client:

[root@pnfs-4-02 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/dc_1234.32k.raw | wc -l
     132

So it isn't the reboot and it probably is not the ODL.
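A line count from diff says the copies differ, but not where. Since the test file is written in 32k pieces, it can help to know how many stripe-sized chunks are wrong. Here is a minimal sketch of that check, assuming both copies have already been read into equal-length buffers; the function name and the stripe parameter are mine, not anything from the NFS code:

```c
#include <stddef.h>
#include <string.h>

/*
 * Count how many stripe-sized chunks differ between two equal-length
 * buffers. A trailing partial stripe is compared as its own chunk.
 * Hypothetical helper for triage, not part of the server code.
 */
static size_t
count_bad_stripes(const unsigned char *a, const unsigned char *b,
    size_t len, size_t stripe)
{
	size_t bad = 0;

	if (stripe == 0)
		return (0);

	for (size_t off = 0; off < len; off += stripe) {
		/* Clamp the final chunk to whatever is left. */
		size_t n = (len - off < stripe) ? len - off : stripe;

		if (memcmp(a + off, b + off, n) != 0)
			bad++;
	}
	return (bad);
}
```

If only one chunk comes back bad, that would point at a single write landing in the wrong place rather than at general corruption.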

Okay, I just configured a stock community and repeated the same test and there was no loss of data. Clearly either my changes or changes I failed to make have caused the problem. I.e., either I caused a bug or I exposed an existing bug - one that the new feature set enabled. Either way, I have to fix it before I can integrate my changes.

I'm beginning to think it is in code which has yet to be implemented:

[thud@ultralord nfs]> grep dnk_sid *
dserv_server.c:	dskey.dnk_sid = &fhp->fh.v1.mds_sid;
[thud@ultralord nfs]> grep dnk_fid *
dserv_server.c:	CRC32(rc, key->dnk_fid->val, key->dnk_fid->len,
dserv_server.c:	NFS_AVL_COMPARE(a->dnk_fid->len, b->dnk_fid->len);
dserv_server.c:	rc = memcmp(a->dnk_fid->val, b->dnk_fid->val,
dserv_server.c:	    a->dnk_fid->len);
dserv_server.c:	data->dnd_fid = key->dnk_fid;
dserv_server.c:	dskey.dnk_fid = &fhp->fh.v1.mds_fid;
dserv_server.c:	key->dnk_fid = &key->dnk_real_fid;

I enabled the setting of 'dnk_sid', but no one uses it. I bet what is happening is that both 32k chunks of the files are being written on the same dataset...
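To make the suspicion concrete, here is a small sketch of how a lookup key that is compared on the file ID alone can confuse two datasets. This is not the code in dserv_server.c; the types and function names below are hypothetical stand-ins, and the only thing borrowed from the grep output is the idea of comparing by length and then by memcmp:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an opaque ID carried in the file handle. */
typedef struct {
	uint32_t len;		/* bytes used in val, at most 16 */
	uint8_t  val[16];
} sketch_id_t;

/* Hypothetical key: dataset (storage) ID plus file ID. */
typedef struct {
	sketch_id_t sid;
	sketch_id_t fid;
} sketch_key_t;

/* Order two IDs by length first, then by content. Returns -1, 0, or 1. */
static int
sketch_id_cmp(const sketch_id_t *a, const sketch_id_t *b)
{
	int rc;

	if (a->len != b->len)
		return (a->len < b->len ? -1 : 1);
	rc = memcmp(a->val, b->val, a->len);
	return (rc < 0 ? -1 : (rc > 0 ? 1 : 0));
}

/* FID-only comparison: the same file ID in two datasets looks identical. */
static int
sketch_key_cmp_fid_only(const sketch_key_t *a, const sketch_key_t *b)
{
	return (sketch_id_cmp(&a->fid, &b->fid));
}

/* SID-then-FID comparison: the same file ID in two datasets stays distinct. */
static int
sketch_key_cmp(const sketch_key_t *a, const sketch_key_t *b)
{
	int rc = sketch_id_cmp(&a->sid, &b->sid);

	return (rc != 0 ? rc : sketch_id_cmp(&a->fid, &b->fid));
}
```

With two keys that share a file ID but differ in dataset ID, the FID-only comparator returns 0, so whichever entry was cached first would be handed back for both. The second comparator keeps them apart.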

With some debug printfs, we can see that the DSes are picking different object sets to save the data:

Jul 26 21:28:46 pnfs-17-22 nfssrv: Accessing pnfs2/ds2/EF25FBCF08FA1BD7 for D336969B13CB6C0E000000000000001E
Jul 26 21:28:46 pnfs-17-22 nfssrv: Accessing pnfs1/ds1/EF25FBCF08FA1BD7 for 86FE860F0E0B87D4000000000000001E

So I implemented the code for 'dnk_sid' and now I'm seeing:

[root@pnfs-4-02 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/slow | wc -l
       0

It also works now across a reboot.

The bad thing here is that I'm making it up on the fly; I don't understand the nnode code. I'm going to have to get a code review just on that chunk.

Actually, strike that. I know there was a bug, missing code to be exact. I know my fix works, and I understand the changes I made. I don't understand how those changes work in the bigger picture of the nnode code. Oh, I understand that when we tried to access the two different datasets on the same DS, we would sometimes get one instead of the other. I'm still going to talk with the author; I want to make sure my fixes are optimal.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Trackback URL: http://blogs.sun.com/tdh/entry/file_contents_not_matching