I've got two clients doing some simple file copies, file removes, and directory removes. After a while, I stop both test scripts and unmount the mds. We can see the relationship between files, layouts, and device tables (mpd):
[root@pnfs-17-24 ~]> ./rlays.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_Layout_entry_cache|::print struct rfs4_dbe
refcnt = 0x50
refcnt = 0x99
refcnt = 0x49
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt | wc -l
+ mdb -k
+ echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
303
[root@pnfs-17-24 ~]> ./rmpd.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_mpd_entry_cache|::print struct rfs4_dbe
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
I just fixed a bug where a layout request invalidated the old layout. So we can see here that there are over 300 files with state which are still active, 3 corresponding layouts, each with a different usage, and 3 mpds, each with a hold from its respective layout. The 3 layouts correspond to the 3 policies in effect in the spe. There is no round robin scheduling going on, which might put more layouts in use.
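As a sanity check, those numbers appear to add up. A refcnt of 1 means an entry is held only by its table, so if each active file holds exactly one reference on its layout (an assumption, not something the output proves), the layout refcnts minus the three table holds should match the file count. A minimal sketch; external_holds is a hypothetical helper, not one of the scripts used here:

```sh
#!/bin/sh
# external_holds: hypothetical helper, not one of the r*.sh scripts.
# Reads "refcnt = 0xNN" lines on stdin and prints the total number of
# holds beyond the one the table itself keeps on each entry.
# Assumption: every entry carries exactly one table hold.
external_holds() {
    total=0
    while read -r name eq val; do
        [ "$name" = "refcnt" ] || continue
        # $val is a hex constant such as 0x50; shell arithmetic accepts it.
        total=$((total + $val - 1))
    done
    echo "$total"
}

# The three layout refcnts from the run above:
printf 'refcnt = 0x50\nrefcnt = 0x99\nrefcnt = 0x49\n' | external_holds
```

That prints 303 (0x50 + 0x99 + 0x49 is 306, less 3 table holds), the same as the file count, which is at least consistent with one layout hold per file.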
I haven't fixed it yet, and this example doesn't show it, but each layout create is going to cause a corresponding mpd create. That is okay for now; the only difference we could support would be a different stripe unit size.
And we can see that the files have been harvested, but the layouts have not:
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt | wc -l
+ mdb -k
+ echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
0
[root@pnfs-17-24 ~]> ./rlays.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_Layout_entry_cache|::print struct rfs4_dbe
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
[root@pnfs-17-24 ~]> ./rmpd.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_mpd_entry_cache|::print struct rfs4_dbe
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
The layouts are ripe for plucking: a refcnt of 1 means that they are in the table, but no one else references them. Let's see if we can quickly use them again:
[root@pnfs-17-24 ~]> ./rlays.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_Layout_entry_cache|::print struct rfs4_dbe
refcnt = 0x5
refcnt = 0xb
refcnt = 0x4
[root@pnfs-17-24 ~]> ./rmpd.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_mpd_entry_cache|::print struct rfs4_dbe
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
Now I don't have proof here that these are the same entries in the db, but what I wanted to point out is that I never invalidate entries in either the layout or mpd tables. Instead, I rely on the fact that they cannot go away until all external references are gone. Based on timing, I'd say we did reuse the entries; otherwise I would expect to see 6 mpds and not 3. I.e., if the old entries had not been reused, new layout entries would have been created, each with its own mpd, for a total of 6.
What I care about is that layout and mpd entries get harvested as they become stale. Under the prior assumption that there was only ever 1 of each, that was never an issue.
If we recall our mappings, each mpd can belong to several layouts, each layout can belong to several files, etc. So when I discovered that mpds were sticking in memory with a refcnt of 2, I checked the layouts, saw that they had a refcnt of 2, and then checked the files:
[root@pnfs-17-24 ~]> more rlays.sh
#!/bin/sh -x
echo '::walk mds_Layout_entry_cache|::print struct rfs4_dbe' | mdb -k
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
refcnt = 0x2
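Sixteen identical lines are easier to compare across runs as a tally. A minimal sketch; refcnt_tally is a hypothetical helper, not one of the scripts used here:

```sh
#!/bin/sh
# refcnt_tally: hypothetical helper. Reads mdb output on stdin and prints
# one "value count" line per distinct refcnt, sorted by the value string.
refcnt_tally() {
    awk '$1 == "refcnt" { seen[$3]++ }
         END { for (v in seen) print v, seen[v] }' | sort
}

# Sixteen entries at 0x2 collapse to the single line "0x2 16".
i=0
while [ "$i" -lt 16 ]; do
    echo 'refcnt = 0x2'
    i=$((i + 1))
done | refcnt_tally
```

In practice it would sit at the end of the pipeline, as in ./rfps.sh | refcnt_tally. The sh -x trace lines go to stderr, so they do not reach the tally.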
Now, some of these would go away, but in the end, several would stick in memory. I actually found 2 nasty rfs4_file_t rele bugs and fixed them, but the issue stayed.
I thought my last fix did the trick: I turned my attention away for a bit as I worked on my MacBook Pro, and the file references went away. I rebooted and tried again. No such luck; that was when I got the above.
I unmounted the mds from my two clients and went away to study the code, trying to find out how an mds_op_close() did a file rele which would lead to the layout rele.
Without any luck there, I was going to reboot and try the case where I created the files and did nothing else; the remove code is still suspect to me. Well, before I did that, I checked the file refcnts:
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
refcnt = 0x1
My guess is that the clients are somehow causing the server to cling to state, and that it is only on the unmount that a final rele takes place. It has been some time since I did that last check:
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
[root@pnfs-17-24 ~]> ./rlays.sh
+ mdb -k
+ echo ::walk mds_Layout_entry_cache|::print struct rfs4_dbe
[root@pnfs-17-24 ~]> ./rmpd.sh
+ mdb -k
+ echo ::walk mds_mpd_entry_cache|::print struct rfs4_dbe
{
lock = [
{
_opaque = [ 0 ]
},
]
refcnt = 0x1
Whoops, I expected the mpds to be out of memory as well and after a bit of patience, they are. :->
[root@pnfs-17-24 ~]> ./rmpd.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_mpd_entry_cache|::print struct rfs4_dbe
A quick test of creating 4 files sees:
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
refcnt = 0x5
refcnt = 0x5
refcnt = 0x5
refcnt = 0x5
If we then unmount, the snoop is mainly the following:
[root@pnfs-4-02 ~]> grep RETURN\) xxx
NFS: Op = 8 (DELEGRETURN)
NFS: Op = 8 (DELEGRETURN)
NFS: Op = 51 (LAYOUTRETURN)
NFS: Op = 51 (LAYOUTRETURN)
NFS: Op = 8 (DELEGRETURN)
NFS: Op = 8 (DELEGRETURN)
NFS: Op = 51 (LAYOUTRETURN)
NFS: Op = 51 (LAYOUTRETURN)
NFS: Op = 8 (DELEGRETURN)
NFS: Op = 8 (DELEGRETURN)
NFS: Op = 51 (LAYOUTRETURN)
NFS: Op = 51 (LAYOUTRETURN)
NFS: Op = 8 (DELEGRETURN)
NFS: Op = 8 (DELEGRETURN)
NFS: Op = 51 (LAYOUTRETURN)
NFS: Op = 51 (LAYOUTRETURN)
And indeed, we see the refcnts get decremented by two for each file:
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
refcnt = 0x3
refcnt = 0x3
refcnt = 0x3
refcnt = 0x3
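To make the drop explicit, two saved snapshots can be compared entry by entry. A minimal sketch, assuming the walk returns the entries in the same order both times (I have not verified that); refcnt_delta is a hypothetical helper that takes two files containing only the grepped refcnt lines:

```sh
#!/bin/sh
# refcnt_delta: hypothetical helper. Takes two files of saved
# "refcnt = 0xNN" lines and prints (before - after) for each entry.
# Assumption: both files list the same entries in the same order.
refcnt_delta() {
    while read -r n1 e1 before <&3 && read -r n2 e2 after <&4; do
        echo $(($before - $after))
    done 3<"$1" 4<"$2"
}

# The four files before and after the unmount:
dir=$(mktemp -d)
printf 'refcnt = 0x5\nrefcnt = 0x5\nrefcnt = 0x5\nrefcnt = 0x5\n' > "$dir/before"
printf 'refcnt = 0x3\nrefcnt = 0x3\nrefcnt = 0x3\nrefcnt = 0x3\n' > "$dir/after"
refcnt_delta "$dir/before" "$dir/after"
rm -rf "$dir"
```

Each of the four lines printed is 2, which fits one hold being dropped per file for the DELEGRETURN and one for the LAYOUTRETURN.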
And after a while:
[root@pnfs-17-24 ~]> ./rfps.sh | grep refcnt
+ mdb -k
+ echo ::walk mds_File_entry_cache|::print struct rfs4_dbe
I'm actually happy that this strange behavior occurred. Without it, I wouldn't have taken the time to painstakingly pore over every file rele, and thus I wouldn't have found the two bugs. The first was in error handling in the remove case and the second was in file delegation recalls. The first was probably hit or miss as to whether it would ever be found. The second was a timebomb.