I've spent the last several weeks making both the MDS and DS survive whilst the DS reboots itself silly. On the MDS side, that was mainly making sure we didn't orphan ds_addrlist and ds_guid_info database entries. And on the DS, it was mainly making sure that we didn't try to operate on a client's NFSv4.1 requests once we had started to tear down the NFS server.
The ds_owner has two linked lists: one of ds_addrlist entries and one of ds_guid_info entries. The lists allow quick traversal of the entries without having to store the owner id in each entry. At first glance, we might want to store the owner id, but then we would also have to store a boot instance id. I.e., are you a ds_addrlist from before the reboot of the DS or after?
But this incestuous relationship causes the most convoluted usage of the rfs4_dbe code. It ain't pretty, and it ain't always easy to follow.
The way I did unit testing was to reboot a DS and see if the MDS stayed up. If it did, then I would run a shell script to dump some interesting structures via mdb. If it didn't, well, then I was already in kmdb and could start dumping the structures directly.
Once I got that working, I started running 10 back-to-back instances of the cthon test suite. And I would then reboot the DS whenever I was ready.
Eventually, I got to the point where the DS would randomly crash upon reboot. It looked like the server instance was being torn down at the same time it was being used to grab a lock. I added a variable to keep track of whether it was being torn down. But I couldn't reliably trigger the bug.
I created a simple script to drive NFS traffic, one which keeps going even if there is an application-level error:
[root@pnfs-17-21 ~]> more swift.sh
#!/bin/sh
mount -o vers=4 pnfs-17-24:/pnfs2/pnfs /pnfs/pnfs-17-24/
while /usr/bin/true; do
dd if=/root/cleanup@downtime-zsend.bz2 of=/pnfs/pnfs-17-24/zero count=1204 bs=2048
dd if=/pnfs/pnfs-17-24/zero of=/root/zero count=1204 bs=2048
sleep 1
done
I also needed a way to force the DS to reboot itself just after dserv was started. I could have added some code to cause a panic, but I wanted a clean and orderly shutdown. The solution was to use smf(5) to create a service instance which depends on dserv and whose start method calls reboot directly. This worked like a charm.
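For the curious, a manifest along these lines can be sketched as below. This is not the actual file I used. The service name 'network/reboot', the dserv FMRI 'svc:/network/dserv', the full path to reboot, and the timeouts are all assumptions made for illustration. The helper write_reboot_manifest is hypothetical as well; it only writes the XML to a path you give it.

```shell
#!/bin/sh
# Sketch only: the service name, the dserv FMRI (svc:/network/dserv),
# the reboot path, and the timeouts are assumptions, not the real manifest.

# write_reboot_manifest PATH
# Writes a transient SMF service that depends on dserv and whose
# start method reboots the machine.
write_reboot_manifest() {
    cat > "$1" <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='reboot'>
<service name='network/reboot' type='service' version='1'>
    <create_default_instance enabled='true' />
    <single_instance />
    <!-- Do not start until dserv (assumed FMRI) is online. -->
    <dependency name='dserv' grouping='require_all' restart_on='none' type='service'>
        <service_fmri value='svc:/network/dserv' />
    </dependency>
    <!-- The start method is the orderly reboot itself. -->
    <exec_method type='method' name='start' exec='/usr/sbin/reboot' timeout_seconds='60' />
    <exec_method type='method' name='stop' exec=':true' timeout_seconds='60' />
    <!-- Transient: nothing is expected to keep running after start. -->
    <property_group name='startd' type='framework'>
        <propval name='duration' type='astring' value='transient' />
    </property_group>
</service>
</service_bundle>
EOF
}

# Usage (as root on the DS):
#   write_reboot_manifest /var/svc/manifest/network/reboot.xml
#   svccfg import /var/svc/manifest/network/reboot.xml
```

The dependency is what gives the "just after dserv was started" timing: the restarter should not run the start method until dserv is online, and the reboot then goes through the normal shutdown path instead of a panic.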
If I hit a crash or had to refresh the nfssrv modules, I would drop down into maintenance mode and edit the '/a/var/svc/manifest/network/reboot.xml' file to change the 'reboot' to 'true'.
This unit test ended up testing both the MDS state management and the DS race case.
With a solid set of results from the unit testing, I just need to go validate my code with our standard test suite that all integrations need to pass. And that, that will be my task tomorrow...