Last night, I did the following test of file creation:
[root@pnfs-17-21 ~]> cp nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/Pnfs4_vnops.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
Layout obtained at: Sat Jul 25 02:55:17:818063 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
[root@pnfs-17-21 ~]> ls -la /pnfs/pnfs-17-24/Pnfs4_vnops.c
-rw-r--r-- 1 root root 409540 Jul 25 02:55 /pnfs/pnfs-17-24/Pnfs4_vnops.c
[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c
So we have pNFS!
I then went to bed, knowing that the next set of testing to do was what we call on disk layouts or odl. When we create a file, we generate a layout based on all of the currently reporting datasets (ignore kspe, which I will soon enable and start testing). When we generate the create layout, we need to stuff it somewhere such that a client which opens the file either for appending or reading can get at the contents. So we store the layout on disk, hence odl.
The system we had in place made assumptions: that all devices would be there, that they would report in the same order, and so on. By using the mds_sids as keys, I've changed that.
To test things, I've rebooted the client and the servers. I want to make sure that there is nothing in memory. Soon, I'll add a new DS to make sure the number of reporting datasets does not have an impact. But for now, what happens?
[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/Pnfs4_vnops.c2
Binary files nfs4_vnops.c and /pnfs/pnfs-17-24/Pnfs4_vnops.c2 differ
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/Pnfs4_vnops.c2
Layout unacquired
I don't need snoop, I need to start examining the odl code. And to be honest, I was expecting problems here.
I don't think the layout was written to disk. For now, we plop it down in '/var':
[root@pnfs-17-24 ~]> cd /var/nfs/v4_state/layouts
[root@pnfs-17-24 layouts]> ls -la
total 6
drwxr-xr-x 3 daemon daemon 512 Jun 26 17:27 .
drwxr-xr-x 5 daemon daemon 512 Jul 23 16:30 ..
drwxr-xr-x 2 daemon daemon 512 Jun 26 17:27 2d90005
[root@pnfs-17-24 layouts]> cd 2d90005/
[root@pnfs-17-24 2d90005]> ls -la
total 4
drwxr-xr-x 2 daemon daemon 512 Jun 26 17:27 .
drwxr-xr-x 3 daemon daemon 512 Jun 26 17:27 ..
Found the problem!
#ifdef PERSIST_LO_ENABLED
And it wasn't enabled!
If I enable it and try the experiment again:
[root@pnfs-17-21 ~]> diff nfs4_vnops.c /pnfs/pnfs-17-24/a13.c | wc -l
21433
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/a13.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 14
Layout [0]:
Layout obtained at: Sat Jul 25 17:49:59:853319 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [1]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [2]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [3]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
I should have gotten the layout from before the write. But it looks wrong. They are out of order, see:
[root@pnfs-17-21 ~]> cp nfs4_vnops.c /pnfs/pnfs-17-24/redo.c
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 13
Layout [0]:
Layout obtained at: Sat Jul 25 17:51:55:18886 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [1]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [2]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [3]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Reboot and we see:
[root@pnfs-17-21 ~]> nfsstat -l /pnfs/pnfs-17-24/redo.c
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 14
Layout [0]:
Layout obtained at: Sat Jul 25 17:55:33:205567 2009
status: UNKNOWN, iomode: LAYOUTIOMODE_RW
offset: 0, length: EOF
num stripes: 4, stripe unit: 32768
Stripe [0]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [1]:
tcp:pnfs-17-23.Central.Sun.COM:10.1.233.193:47009 OK
Stripe [2]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
Stripe [3]:
tcp:pnfs-17-22.Central.Sun.COM:10.1.233.192:47009 OK
So the good news is that odl works; the bad news is that it is just not right.
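To see why the order of the stripe list matters, here is a minimal sketch of mapping a file offset to a data server. It assumes plain round-robin striping (block i goes to stripe i mod num stripes), which is an assumption on my part and not something the output above proves. The names stripe_index, server_for, a13_order, and redo_order are made up for illustration; the stripe unit, stripe count, and the two server orders are copied from the nfsstat output above.

```c
#include <stdint.h>

/* Values taken from the nfsstat output above. */
#define	STRIPE_UNIT	32768u
#define	NUM_STRIPES	4u

/*
 * Map a file offset to an index into the layout's stripe list,
 * assuming plain round-robin striping (an assumption).
 */
unsigned
stripe_index(uint64_t offset)
{
	return ((unsigned)((offset / STRIPE_UNIT) % NUM_STRIPES));
}

/* Stripe lists as printed by nfsstat; 22 = pnfs-17-22, 23 = pnfs-17-23. */
static const int a13_order[NUM_STRIPES] = { 23, 22, 23, 22 };	/* a13.c */
static const int redo_order[NUM_STRIPES] = { 23, 23, 22, 22 };	/* redo.c */

/* Return which data server holds the block containing offset. */
int
server_for(const int *order, uint64_t offset)
{
	return (order[stripe_index(offset)]);
}
```

Under that assumption, the second 32k block (offset 32768) lives on pnfs-17-22 with the a13.c ordering but on pnfs-17-23 with the redo.c ordering. A client handed a stripe list in a different order than the one used for the write would go to the wrong server for that block.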
We can actually see this in action. I wrote a simple program to write 32k of 'a', 32k of 'b', 32k of 'c', and 32k of 'd's to a file. I then copied that file over.
[root@pnfs-17-21 ~]> cp 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw
[root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw
[root@pnfs-17-21 ~]> reboot
...
[root@pnfs-17-21 ~]> mount -o vers=4 pnfs-17-24:/pnfs2/pnfs /pnfs/pnfs-17-24
[root@pnfs-17-21 ~]> diff 1234.32k.raw /pnfs/pnfs-17-24/b1234.32k.raw
Binary files 1234.32k.raw and /pnfs/pnfs-17-24/b1234.32k.raw differ
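The pattern file used above could be generated by something like the following. This is a sketch, not the program that was actually run: the function name write_pattern is hypothetical, and it writes raw bytes with no newlines, whereas the original file may well have contained line breaks.

```c
#include <stdio.h>
#include <string.h>

#define	STRIPE_UNIT	32768

/*
 * Write 32k each of 'a', 'b', 'c', and 'd' to path, so that each
 * letter fills exactly one stripe unit.
 * Returns 0 on success, -1 on error.
 */
int
write_pattern(const char *path)
{
	static const char fill[] = { 'a', 'b', 'c', 'd' };
	char buf[STRIPE_UNIT];
	FILE *fp = fopen(path, "wb");
	size_t i;

	if (fp == NULL)
		return (-1);
	for (i = 0; i < sizeof (fill); i++) {
		memset(buf, fill[i], sizeof (buf));
		if (fwrite(buf, 1, sizeof (buf), fp) != sizeof (buf)) {
			(void) fclose(fp);
			return (-1);
		}
	}
	return (fclose(fp) == 0 ? 0 : -1);
}
```

Calling write_pattern("1234.32k.raw") from a small main() yields a 131072-byte file in which each stripe unit carries its own letter, so any misplaced stripe is obvious on read.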
The first diff had to have gotten the same layout as before. Now when we reboot, we see a difference. And since the pattern is so simple, we can see what might be going on:
[root@pnfs-17-21 ~]> grep a /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
0
[root@pnfs-17-21 ~]> grep b /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
31
[root@pnfs-17-21 ~]> grep c /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
0
[root@pnfs-17-21 ~]> grep d /pnfs/pnfs-17-24/b1234.32k.raw | wc -l
64
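Since grep counts matching lines and not bytes, a per-stripe-unit tally would give a sharper picture of which letter ended up where. A minimal sketch, assuming the 32768-byte stripe unit reported by nfsstat; dominant_byte is a hypothetical helper, not part of any NFS tooling:

```c
#include <stdio.h>

#define	STRIPE_UNIT	32768

/*
 * Return the most frequent byte value found in stripe unit `unit`
 * of the file at path, or -1 if that unit cannot be read.
 */
int
dominant_byte(const char *path, long unit)
{
	unsigned char buf[STRIPE_UNIT];
	size_t counts[256] = { 0 };
	size_t n, i;
	int best = 0;
	FILE *fp = fopen(path, "rb");

	if (fp == NULL)
		return (-1);
	if (fseek(fp, unit * (long)STRIPE_UNIT, SEEK_SET) != 0) {
		(void) fclose(fp);
		return (-1);
	}
	n = fread(buf, 1, sizeof (buf), fp);
	(void) fclose(fp);
	if (n == 0)
		return (-1);	/* nothing to read at this unit */

	for (i = 0; i < n; i++)
		counts[buf[i]]++;
	for (i = 1; i < 256; i++) {
		if (counts[i] > counts[best])
			best = (int)i;
	}
	return (best);
}
```

Looping over units 0 through 3 and printing the result for the copy on the pNFS mount would show directly which stripe units came back with the wrong letter.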
We need to snoop the filehandles from the create and then from the later read.
cre: fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
odl: fh[0] = (72) 00000002000000010000000000000000000000002FE9040D00000010312CE43624C45DB0000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
cre: fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
odl: fh[1] = (72) 00000002000000010000000000000000000000002FE9040D0000001044CF9A4C47798276000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
cre: fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
odl: fh[2] = (72) 00000002000000010000000000000000000000002FE9040D00000010D336969B13CB6C0E000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
cre: fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
odl: fh[3] = (72) 00000002000000010000000000000000000000002FE9040D0000001086FE860F0E0B87D4000000000000001E00000008EF25FBCF08FA1BD70000000A2B0000000000E40400000000
And they match. The problem isn't in the filehandles we hand out.
I'm going to end it there for now...
Hmm... I'm trying to understand what you are talking about here but am a little confused. However, I know little about pNFS. Please keep writing though. I'm learning stuff from your posts!
- Alex
Posted by Alex on July 26, 2009 at 07:40 AM CDT #
So do I understand that correctly, in the striped layout, the entire file is automatically written to all four stripes, and when served, it will be served from all four stripes?
What I'm getting at is, if one of the four stripes were to become unavailable, the file or files would still be available, without delay or timeouts?
Posted by UX-admin on July 26, 2009 at 11:56 AM CDT #
Alex, please ask questions if you need to.
UX-admin, no, not with the first generation of pNFS servers. While it helps to think of pNFS like RAID, you won't see parity for some time.
BTW - you asked a really great question! I think people are going to see pNFS as RAID and assume that certain things are just going to work.
The current protocol is designed for speed and not management. So, if we have 4 stripes, that means we can send 4 writes out at the same time. We are basically giving the client 3 more servers to juggle the load.
Things like parity, snapshots, hot spares, etc, are down the pipeline.
I talk about this in a recent slide deck:
http://blogs.sun.com/tdh/resource/pnfs/okcosug-pnfs-spe.pdf
Posted by Tom Haynes on July 26, 2009 at 09:59 PM CDT #
Thank you for taking the time to answer. For what it's worth, that was my confusion with pNFS as well.
It will be great when pNFS will be capable of redundancy, because an NFS cluster will no longer be required, and that should simplify things an order of magnitude.
Posted by UX-admin on July 27, 2009 at 03:17 AM CDT #
Tom,
Can you point me to a good paper that describes pNFS ? Something with more detail than the aforementioned pdf. Or even a video presentation link anyone has done ?
-Alex
Posted by Alex on July 27, 2009 at 09:25 AM CDT #
Alex, it is dated, but you might try this:
http://www.opensolaris.org/os/project/nfsv41/pnfsdemos/basics
You might also look at http://www.citi.umich.edu/projects/asci/pnfs/linux/
The issue is either you get 2 ~400 page RFCs or you get high level overviews like the PDF I presented.
Enjoy,
Tom
Posted by Tom Haynes on July 27, 2009 at 12:06 PM CDT #