Removes are failing and I don't think this is the classic issue we have had with this operation. I know that at the last BAT, the removes were succeeding enough to cause a panic. And we don't seem to be getting to that code even.
Snoop shows the client sending the remove, but the MDS is not sending anything to the DS. Barring the MDS sending to the wrong DS, I've put together a simple DTrace script to see what is going on:
[root@pnfs-17-24 ~]> more remove.d
#!/usr/sbin/dtrace -s
::do_ctl_mds_remove:return
{
printf("rc = %d", args[1]);
}
::ctl_mds_clnt_remove_file:return
{
printf("rc = %d", args[1]);
}
:::nfss-e-vop_fid_pseudo_failed
{
}
:::nfss-i-layout_is_null_cannot_remove
{
}
::mds_op_remove:entry
{
self->resop = (nfs_resop4 *)arg1;
self->resp = self->resop->nfs_resop4_u.opremove;
}
::mds_op_remove:return
{
printf("rc = %d", self->resp.status);
}
I don't like the hoops I have to go through to get the status on mds_op_remove, but I have to do that because that function is of type void. I'm going to change them to return a status. It doesn't cost much and makes simple DTrace scripting easier.
Anyway, here is our result:
[root@pnfs-17-24 ~]> ./remove.d dtrace: script './remove.d' matched 6 probes CPU ID FUNCTION:NAME 0 30503 do_ctl_mds_remove:nfss-e-vop_fid_pseudo_failed 0 62050 do_ctl_mds_remove:return rc = 28 0 62052 mds_op_remove:return rc = 0 ^C
What is that 28?
[thud@ultralord nfssrv]> grep 28 /usr/include/sys/errno.h #define ENOSPC 28 /* No space left on device */
Now I want to know what vop_fid_pseudo() is returning, but only when called in this context. I can use the thread properties like this:
::vop_fid_pseudo:entry
/self->live == 1/
{
self->vp = (vnode_t *)arg0;
self->fidp = (fid_t *)arg1;
}
::vop_fid_pseudo:return
/self->live == 1/
{
printf("fid len = %d\trc = %d",
self->fidp->un._fid.len, args[1]);
}
Only if self->live is set will I see output:
[root@pnfs-17-24 ~]> ./remove.d dtrace: script './remove.d' matched 8 probes CPU ID FUNCTION:NAME 0 62468 vop_fid_pseudo:return fid len = 10 rc = 28 0 30503 do_ctl_mds_remove:nfss-e-vop_fid_pseudo_failed 0 62050 do_ctl_mds_remove:return rc = 28 0 62052 mds_op_remove:return rc = 0 ^C
And now I need to know where that error is set:
49 vop_fid_pseudo(vnode_t *vp, fid_t *fidp)
50 {
51 struct vattr va;
52 int error;
53
54 error = VOP_FID(vp, fidp, NULL);
55
56 /*
57 * XXX nfs4_fid() does nothing and returns EREMOTE.
58 * XXX nfs3_fid()/nfs_fid() returns nfs filehandle as its fid
59 * which has a bigger length than local fid.
60 * NFS_FH4MAXDATA is the size of
61 * fhandle4_t.fh_xdata[NFS_FH4MAXDATA].
62 *
63 * Note: nfs[2,3,4]_fid() only gets called for diskless clients.
64 */
65 if (error == EREMOTE ||
66 (error == 0 && fidp->fid_len > NFS_FH4MAXDATA)) {
67
68 va.va_mask = AT_NODEID;
69 error = VOP_GETATTR(vp, &va, 0, CRED(), NULL);
70 if (error)
71 return (error);
72
73 fidp->fid_len = sizeof (va.va_nodeid);
74 bcopy(&va.va_nodeid, fidp->fid_data, fidp->fid_len);
75 return (0);
76 }
77
78 return (error);
79 }
And that means I need to know what VOP_GETATTR() is returning. I'll pull the same trick as before, but I'll need to be looking at, well, I have to figure that out now, don't I?
The trick is to let the vnode tell me:
int
fop_getattr(
vnode_t *vp,
...
err = (*(vp)->v_op->vop_getattr)(vp, vap, flags, cr, ct);
So I need to find the pointer to the function:
printf("vop_getattr = %p", self->vp->v_op->vop_getattr);
And then I need to find out what that is:
0 62467 vop_fid_pseudo:entry vop_getattr = fffffffff883c180 ... [root@pnfs-17-24 ~]> mdb -k Loading modules: [ unix genunix specfs dtrace mac cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs mpt sd sockfs ip hook neti sctp arp usba stmf fctl nca lofs idm cpc random zfs nfs fcip logindmux ptm sppp ] > fffffffff883c180::dis zfs_getattr: pushq %rbp
Okay, we at least know which filesystem is complaining. I probably could have done this differently, but this way works. The only error in that function comes from zfs_zaccess():
0 53373 zfs_zaccess:return rc = 0 0 53125 zfs_getattr:return rc = 0 0 53125 zfs_getattr:return rc = 0
Which means I was reading the code incorrectly! Note that these are being called by the code, my check forces that to be true, but the error must be elsewhere. It must be the VOP_FID() which is causing the issue!
4323 static int
4324 zfs_fid(vnode_t *vp, fid_t *fidp, caller_context_t *ct)
4325 {
4326 znode_t *zp = VTOZ(vp);
4327 zfsvfs_t *zfsvfs = zp->z_zfsvfs;
4328 uint32_t gen;
4329 uint64_t object = zp->z_id;
4330 zfid_short_t *zfid;
4331 int size, i;
4332
4333 ZFS_ENTER(zfsvfs);
4334 ZFS_VERIFY_ZP(zp);
4335 gen = (uint32_t)zp->z_gen;
4336
4337 size = (zfsvfs->z_parent != zfsvfs) ? LONG_FID_LEN : SHORT_FID_LEN;
4338 if (fidp->fid_len < size) {
4339 fidp->fid_len = size;
4340 ZFS_EXIT(zfsvfs);
4341 return (ENOSPC);
4342 }
Some quick DTrace will tell me I'm correct:
::zfs_fid:entry
/self->live == 1/
{
zp = (znode_t *)(self->vp)->v_data;
zfsvfs = zp->z_zfsvfs;
lfid = sizeof (zfid_long_t) - sizeof (uint16_t);
sfid = sizeof (zfid_short_t) - sizeof (uint16_t);
size = (zfsvfs->z_parent != zfsvfs) ? lfid : sfid;
printf("fidlen %s size! (%d, %d, %d, %d)",
self->fidp->un._fid.len < size ? "<" : "not <",
self->fidp->un._fid.len, size, lfid, sfid);
}
...
0 53162 zfs_fid:entry fidlen < size! (0, 10, 20, 10)
0 53163 zfs_fid:return rc = 28
So is this new? Or something I introduced? I don't see how the code ever worked here. We pass a fid_t off the stack, i.e., uninitialized, into vop_fid_pseudo() and when we call zfs_fid(), that check should trigger every time. The comments on lines 57-61 lead me to think that vop_fid_pseudo() is not intended for types other than NFS. In any event, the other callers of it do:
makefh4(nfs_fh4 *fh, vnode_t *vp, struct exportinfo *exi)
...
fid_t fid;
...
bzero(&fid, sizeof (fid));
fid.fid_len = MAXFIDSZ;
And the file should have been removed!
0 67453 zfs_fid:return rc = 0 0 81693 vop_fid_pseudo:entry vop_getattr = fffffffff8088180 0 67452 zfs_fid:entry fidlen not < size! (64, 10, 20, 10) 0 67453 zfs_fid:return rc = 0 0 81694 vop_fid_pseudo:return fid len = 10 rc = 0
From the snoop we can both see the MDS communicating with the DS and that I need to go teach snoop about a remove!
CTL-MDS: ----- Sun CTL-MDS ----- CTL-MDS: CTL-MDS: Proc = 10 (Remove object(s) or entire fsid at the DS) CTL-MDS:
While I fixed a valid bug (and we are past the code that was panicking at the BAT, I doubt I fixed the problem we had been seeing:
Before a huge copy:
[root@pnfs-17-22 ~]> zfs list NAME USED AVAIL REFER MOUNTPOINT pnfs1 9.35G 6.28G 23K /pnfs1
After a huge copy:
[root@pnfs-17-22 ~]> zfs list NAME USED AVAIL REFER MOUNTPOINT pnfs1 12.5G 3.09G 23K /pnfs1
And after the delete:
[root@pnfs-17-22 ~]> zfs list NAME USED AVAIL REFER MOUNTPOINT pnfs1 12.5G 3.09G 23K /pnfs1
Note, the DS did report that it tried to delete the file:
[root@pnfs-17-22 ~]> ./remove.d dtrace: script './remove.d' matched 2 probes CPU ID FUNCTION:NAME 0 62434 ctl_mds_srv_remove:return status = 0 1 62434 ctl_mds_srv_remove:return status = 0 1 62434 ctl_mds_srv_remove:return status = 0 0 62434 ctl_mds_srv_remove:return status = 0 1 62434 ctl_mds_srv_remove:return status = 0 1 62434 ctl_mds_srv_remove:return status = 0 0 62434 ctl_mds_srv_remove:return status = 0
Amazingly enough, I had forgotten I was running that DTrace script. So the issue is that ctl_mds_srv_remove() reports success, but the file contents are still there.
I'm starting to wonder if I haven't hit my head on a known ZFS bug that is fixed in 118. The pNFS code is based on 117. I'll investigate.