ElHam is a filesystem testing tool designed to detect corruption, be multiprotocol, and stress a filesystem. It isn't designed to be a benchmark. I'm in the middle of debugging a really nasty NFSv4 bug (read that to mean we don't have a clue as to what is going on or how to reproduce it), and I need to generate sufficient load on a test system.
So I went and got ElHam from SourceForge.net. I wrote it when I was at NetApp as a tool we could use internally to get multiprotocol lock testing, generate metadir traffic, and to hand out to customers for corruption testing. As such, we stuck a BSD license on it and hung it off of SourceForge.net.
It still needs work. For example, I figured out that it wasn't detecting big-endianness. I also have to make a pass through it and make sure that I capture the return value of every function call and check that it is valid. One of the things you need for corruption testing is early detection of problems.
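To make the byte-order point concrete, here is a minimal sketch in Python (not ElHam's own code). It shows one way to detect a big-endian host and, more usefully, to write multi-byte values in a fixed byte order so that instances on different architectures agree on what is on disk. The function names and the 64-bit field width are assumptions made for illustration.

```python
import struct
import sys


def host_is_big_endian() -> bool:
    """Return True when the host stores the most significant byte first."""
    # "=H" packs a 16-bit integer in the host's native byte order.
    return struct.pack("=H", 1) == b"\x00\x01"


def encode_block_number(block_no: int) -> bytes:
    """Encode a block number in a fixed (big-endian) on-disk order.

    Hypothetical helper: an explicit ">" format gives the same bytes
    on every architecture, so no host detection is needed to read it.
    """
    return struct.pack(">Q", block_no)


def decode_block_number(raw: bytes) -> int:
    """Inverse of encode_block_number."""
    return struct.unpack(">Q", raw)[0]


if __name__ == "__main__":
    print("host byte order:", sys.byteorder)
    print("big endian host:", host_is_big_endian())
    print("block 1 on disk:", encode_block_number(1).hex())
```

Pinning the on-disk order is usually less fragile than detecting the host order and swapping, since there is only one code path to get right.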
Sometimes in trying to detect corruption, you can get a false positive because of client-side caching. If your focus is strictly on the server, i.e., you are testing a filer, that is bad. So you might be tempted to turn off client-side caching. ElHam also appears to go faster with caching off, but again, ElHam is not designed to be a benchmark.
The other evil with turning off client side caching is that it effectively negates both locking in general and NFSv4 delegations. ElHam is designed to have multiple readers and writers, both local and remote, changing files in a directory tree. Client side caching issues are something it should have to live with.
Anyway, multiple instances (from different architectures and OSes) are possible because ElHam records what is supposed to be in every data block. So when another instance comes along, it is able to compute what should be in the data block and then it can see if the on-disk image is corrupt. I need to write a small application to inject corruption - this will help me get signatures to show people what ElHam has detected.
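The idea of computing what a block should contain and comparing it with what is on disk can be sketched as follows. This is an illustration only: ElHam's real record format is not described here, so the seed fields (`file_id`, `block_no`, `generation`), the 4096-byte block size, and the SHA-256 based pattern are all assumptions. The `flip_bit` helper stands in for the small corruption injector mentioned above.

```python
import hashlib
from typing import Optional

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def expected_block(file_id: int, block_no: int, generation: int) -> bytes:
    """Deterministically derive a block's contents from its identity.

    Any instance on any host that knows the same three numbers
    computes exactly the same bytes.
    """
    seed = f"{file_id}:{block_no}:{generation}".encode("ascii")
    out = bytearray()
    counter = 0
    while len(out) < BLOCK_SIZE:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:BLOCK_SIZE])


def first_mismatch(actual: bytes, expected: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (a, b) in enumerate(zip(actual, expected)):
        if a != b:
            return offset
    if len(actual) != len(expected):
        # One buffer is a prefix of the other (e.g. a short read).
        return min(len(actual), len(expected))
    return None


def flip_bit(block: bytes, offset: int, bit: int = 0) -> bytes:
    """Return a copy of block with a single bit flipped (corruption injector)."""
    damaged = bytearray(block)
    damaged[offset] ^= 1 << bit
    return bytes(damaged)


if __name__ == "__main__":
    good = expected_block(file_id=7, block_no=3, generation=1)
    bad = flip_bit(good, offset=100)
    print("clean block mismatch:", first_mismatch(good, good))
    print("damaged block mismatch at offset:", first_mismatch(bad, good))
```

Reporting the offset of the first bad byte, rather than just pass or fail, is what gives you a signature to show people.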
The current big issue is that ElHam is designed to push a filesystem to capacity and back off; i.e., reads and writes in the face of a full filesystem are interesting. To aid in that testing, it is best that the 'data', 'meta', and 'history' (see ElHam docs) directories each be on a different filesystem. Well, yesterday I had all three on the same filesystem and it got full. So I'm trying to reproduce that and see what is happening.
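For the full-filesystem case, the interesting part is telling "out of space" apart from every other write failure. Below is a minimal sketch of that classification, assuming a POSIX client where a full filesystem shows up as ENOSPC and an exceeded quota as EDQUOT. The `write_block` name and its string return values are hypothetical, and this is not how ElHam itself is written.

```python
import errno
import os

# Error codes that mean "no room", as opposed to a real I/O problem.
FULL_ERRNOS = (errno.ENOSPC, errno.EDQUOT)


def write_block(fd: int, offset: int, data: bytes) -> str:
    """Write one block at offset and classify the outcome.

    Returns "ok" for a complete write, "short" for a partial write,
    and "full" when the filesystem or quota is exhausted. Any other
    error is re-raised so that it is noticed early.
    """
    try:
        written = os.pwrite(fd, data, offset)
    except OSError as exc:
        if exc.errno in FULL_ERRNOS:
            return "full"
        raise
    return "ok" if written == len(data) else "short"


if __name__ == "__main__":
    fd = os.open("elham-sketch.tmp", os.O_RDWR | os.O_CREAT, 0o600)
    try:
        print(write_block(fd, 0, b"x" * 4096))
    finally:
        os.close(fd)
        os.unlink("elham-sketch.tmp")
```

A caller would back off on "full" (for example by deleting or truncating files) and treat "short" as a block that must be rewritten before it can be verified.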
A really neat way to do this is to use ZFS to create different filesystems and then set quotas to control how much space each filesystem is allowed:
    # zfs create zoo/elham
    # zfs set sharenfs=on zoo/elham
    # zfs create zoo/elham/data
    # zfs create zoo/elham/meta
    # zfs create zoo/elham/history
    # zfs list zoo/elham/*
    NAME                USED  AVAIL  REFER  MOUNTPOINT
    zoo/elham/data     36.7K   654G  36.7K  /zoo/elham/data
    zoo/elham/history  36.7K   654G  36.7K  /zoo/elham/history
    zoo/elham/meta     36.7K   654G  36.7K  /zoo/elham/meta
    # zfs set quota=2G zoo/elham/data
    # zfs set quota=20G zoo/elham/meta
    # zfs set quota=20G zoo/elham/history
    # zfs list zoo/elham/*
    NAME                USED  AVAIL  REFER  MOUNTPOINT
    zoo/elham/data     36.7K  2.00G  36.7K  /zoo/elham/data
    zoo/elham/history  36.7K  20.0G  36.7K  /zoo/elham/history
    zoo/elham/meta     36.7K  20.0G  36.7K  /zoo/elham/meta
Note that I give the 'history' and 'meta' filesystems a much larger quota than 'data'. I don't want to run out of space on them.
I'm going to kick off several instances of ElHam and see if I can fill this puppy up.