I'm finishing off some coding I started before the BAT last month. It began with a simple observation: my DSes with two datasets were only using the first dataset. It was time to rewrite the device and layouts to use the mds_sids. An mds_sid is a unique value that identifies a ZFS dataset on a zpool. Because it is unique, it allows a mapping between the dataset name that humans use, e.g. 'ds1:ppool/pnfs', and the value used by computers.
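To make the name-to-identifier mapping concrete, here is a minimal userland sketch. It is not the pNFS code: the `sid_t` typedef, the `ds_dataset` struct, both lookup helpers, and the numeric values in the usage note are all hypothetical, and a plain 64-bit integer stands in for the real mds_sid, whose layout isn't shown in this post.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an mds_sid; the real layout is not shown here. */
typedef uint64_t sid_t;

/* One dataset as seen by a DS: a human-readable name plus its unique id. */
struct ds_dataset {
    const char *name;   /* e.g. "ds1:ppool/pnfs" */
    sid_t       sid;    /* unique per dataset */
};

/* Find a dataset by its id; returns NULL if no entry matches. */
const struct ds_dataset *
ds_lookup_by_sid(const struct ds_dataset *tbl, size_t n, sid_t sid)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].sid == sid)
            return &tbl[i];
    }
    return NULL;
}

/* Find a dataset by its human-readable name; returns NULL if absent. */
const struct ds_dataset *
ds_lookup_by_name(const struct ds_dataset *tbl, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(tbl[i].name, name) == 0)
            return &tbl[i];
    }
    return NULL;
}
```

With a two-entry table, looking up the second entry's id returns the second dataset rather than always falling back to the first one, which is the behaviour I was missing.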
Doing all that also means it is time to integrate the kspe (kernel simple policy engine). I had all of the above coded, along with the prototype for the kspe, but had to go to BAT to test other stuff. This week I picked things back up and finished the code off. I'm still not completely happy with it, especially with where the kspe lives, i.e., in the 'nfs' or the 'nfssrv' module.
I got my MDS up and running, but I was hitting issues with the above. Those only caused the modules not to load, so I was able to use scp to get new modules over. Once I fixed those issues, I ran into the next major one: I had added another index to the layout nfs4 database table, but I had only told the table it had one.
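The index mismatch is easy to picture with a small userland model. This is only a sketch under my own assumptions, not the nfs4 database code: `struct table`, `table_init`, `table_add_index`, and `MAX_IDX` are hypothetical names, and the model returns an error where a kernel would more likely trip an assertion or scribble on memory.

```c
#include <stddef.h>

#define MAX_IDX 4   /* hypothetical upper bound for this sketch */

/* A table that is told up front how many indices it will carry. */
struct table {
    unsigned    maxcnt;             /* indices declared at creation */
    unsigned    idxcnt;             /* indices actually attached */
    const char *idx_names[MAX_IDX];
};

/* Create a table that promises to hold `maxcnt` indices. */
int
table_init(struct table *t, unsigned maxcnt)
{
    if (maxcnt > MAX_IDX)
        return -1;
    t->maxcnt = maxcnt;
    t->idxcnt = 0;
    return 0;
}

/* Attach an index; refuse rather than exceed the declared count. */
int
table_add_index(struct table *t, const char *name)
{
    if (t->idxcnt >= t->maxcnt)
        return -1;      /* declared count and actual count disagree */
    t->idx_names[t->idxcnt++] = name;
    return 0;
}
```

Initialising the table with a count of 1 and then attaching two indices makes the second `table_add_index` call fail, which mirrors the mistake: the code creating the indices and the count passed at table creation have to be changed together.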
I got into a nasty reboot loop, and because of fastboot I couldn't get to the grub menu. I ended up doing a power cycle from the LOM, which caused the grub menu to come up. I could then boot into single-user mode and do:
# echo "setprop boot-file 'kmdb'" >> /a/boot/solaris/bootenv.rc
# reboot
That ended up fixing the fastboot issue. From then on, whenever I hit a panic, I would boot into single-user mode and issue:
# cd /
# rm kernel/misc/amd64/nfssrv kernel/misc/nfssrv
# reboot
That would bring me up safely in multi-user mode and then I could scp the new kernel modules over.
But I forgot that the DSes use the same initialization routines as the MDSes. If I hadn't already configured all of the boxes to drop into kmdb on a panic, I would have been ticked.
Hey, I have a DS up and running with one dataset (gotta start small) and a down MDS:
[root@pnfs-17-24 ~]>
panic[cpu1]/thread=ffffff01d9ee70c0: assertion failed: e->refcnt > 1, file: ../../common/fs/nfs/nfs4_db.c, line: 133

ffffff00081aa980 genunix:assfail+7e ()
ffffff00081aa9b0 nfssrv:rfs4_dbe_rele+67 ()
ffffff00081aaa90 nfssrv:ds_reportavail+4d8 ()
ffffff00081aab40 nfssrv:nfs_ds_cp_dispatch+9e ()
ffffff00081aac30 rpcmod:svc_getreq+20d ()
ffffff00081aaca0 rpcmod:svc_run+197 ()
ffffff00081aacd0 rpcmod:svc_do_run+81 ()
ffffff00081abeb0 nfs:nfssys+a0e ()
ffffff00081abf00 unix:brand_sys_syscall32+292 ()
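The assertion in that trace, e->refcnt > 1, fires on a release. A toy model of that kind of check is below. It is a sketch, not the nfs4_db.c code: the `entry_*` names are hypothetical, and I'm assuming the usual scheme in which the owning table keeps one reference of its own, so a caller's release must never be the one that drops the count to zero. The model returns -1 where the kernel would assert.

```c
/* A reference-counted entry; the owning table holds the first reference. */
struct entry {
    unsigned refcnt;
};

/* The table's own reference is taken at creation time. */
void
entry_init(struct entry *e)
{
    e->refcnt = 1;
}

/* A caller takes an additional reference before using the entry. */
void
entry_hold(struct entry *e)
{
    e->refcnt++;
}

/*
 * A caller drops its reference. If only the table's reference is left,
 * the caller is releasing something it never held: report it instead of
 * decrementing (the kernel would panic on the assertion here).
 */
int
entry_rele(struct entry *e)
{
    if (e->refcnt <= 1)
        return -1;
    e->refcnt--;
    return 0;
}
```

In this model a hold followed by a release succeeds, while a second release with no matching hold returns -1. That is the pattern to look for along the ds_reportavail path: either an extra release or a missing hold.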
I'm off to solve it. See ya' later!