So, my unit testing is all done and it is time to integrate. Bzzt! Wrong answer - we have to do regression testing. In our case, it is called pNFS/miniPIT and was mainly crafted by Helen Chao. And it was bombing on my new code.
First of all, the core was confusing:
[root@pnfs-minipit2-4 ~]> panic[cpu0]/thread=30008a61560: 000000047ad BAD TRAP: type=31 rp=2a100f596d0 addr=14 mmu_fsr=0 occurred in module "nfssrv" due to a NULL pointer dereference nfsd: trap type = 0x31 addr=0x14 pid=100528, pc=0x7afb3d64, sp=0x2a100f58f71, tstate=0x4414001604, context=0x29b g1-g7: 7be42af4, 2, 2000, 0, 0, 29b, 30008a61560
The NULL pointer wasn't anything to do with the function being called. Secondly, I focused on trying to reproduce it inside the test harness - it wasn't until I made a simple test case that I progressed.
Third, I focused on the server being 32 bit instead of the real difference - it was not setup as a MDS - therefore it had no DSes reporting to it. I'll show why that is important:
4463 error = nfs41_spe_allocate(&spe_va, claddr,
4464 vp->v_path, &lc, TRUE);
4465 if (error) {
4466 /*
4467 * XXX: Until we get the SMF code
4468 * in place, we handle all errors by
4469 * using the default layout of the
4470 * old prototype code
4471 *
4472 * At that point, we should return the
4473 * given error.
4474 *
4475 * XXX: Any way *plo is NULL here?
4476 */
4477 *plo = mds_gen_default_layout(cs->instp);
4478
4479 /*
4480 * Record the layout, don't get bent out of shape
4481 * if it fails, we'll try again at checkstate time.
4482 */
4483 (void) mds_put_layout(*plo, vp);
4484
4485 return (NFS4_OK);
4486 }
4487
4488 /*
4489 * XXX: Any way *plo is NULL here?
4490 */
4491 *plo = mds_add_layout(&lc);
The fourth, and major problem, is that I had valid comments and I ignored them. Yes, it was certainly valid for the *plo to be NULL after 4477 but probably not 4491. But they are both easy enough to code for!
The issue is that a NFSv4.1 server, i.e., no pNFS, also goes through this code path. We need to handle not having a layout:
4474 * At that point, we should return the
4475 * given error.
4476 */
4477 *plo = mds_gen_default_layout(cs->instp);
4478 if (*plo == NULL) {
4479 status = NFS4ERR_LAYOUTUNAVAILABLE;
4480 } else {
4481 /*
4482 * Record the layout, don't get
4483 * bent out of shape if it fails,
4484 * we'll try again at checkstate time.
4485 */
4486 (void) mds_put_layout(*plo, vp);
4487 }
4488
4489 return (status);
Which appears okay, except with just this change, files will not be created. We want to return NFS4ERR_LAYOUTUNAVAILABLE, mainly for DTrace probing, but in the caller, we want to do:
4892 } else {
4893 status = mds_createfile_get_layout(req, vp, cs, &ct, plo);
4894
4895 /*
4896 * Allow mds_createfile_get_layout() to be verbose
4897 * in what it presents as a status, but be aware
4898 * that it is permissible to not generate a
4899 * layout.
4900 */
4901 if (status == NFS4ERR_LAYOUTUNAVAILABLE) {
4902 status = NFS4_OK;
4903 }
4904 }