Cache and Volume Stripe Sizing
Here's an interesting problem I ran across not too long ago. A customer had a lot of A5200s deployed because of their high spindle count (22 FC-AL drives in a 4U box). Traditional database thinking puts the emphasis on spindle count to improve throughput. These days, though, storage caches are larger, controllers have smarter algorithms to stage, destage, and stream data at optimal rates, and multiplexed HBAs scale throughput linearly across fibre channel fabrics. As a result, tuning usually comes back to the mix of applications hitting the storage.
Anyhow, I've been working with storage arrays that have built-in RAID controllers and cache for so long that I'd forgotten what it was like to handle all of this in the operating system with a software volume manager. In this case the customer was running Solaris 8, Sun Volume Manager, and an A5200. Watching iostat for individual drives in the A5200, they saw rebuild and throughput rates of around 2.5MB/s (abnormally slow) along with relatively high I/O wait times. They had striped volumes mirrored front to back within a single A5200 (OK, let's split the loop on the A5200), built from a boatload of older 72GB Seagate Cheetah 10K RPM drives with mixed firmware. Fcode aside (by the way, Seagate's update utilities are all Win32 and Linux based; Sun repackages the fcode and update utility for the drives it OEMs), what I failed to catch right away was the combination of basic physics and Sun Volume Manager's default stripe size of 16KB. With poorly aligned I/O, each drive will probably manage about one write per revolution, so:
(10,000 rev/min) / (60 s/min) = 166.67 rev/s, i.e. about 166.67 IOPS per drive
166.67 IOPS/drive × 16 KB/IO ≈ 2,667 KB/s ≈ 2.6 MB/s per drive
which means each drive can transfer at most about 2.6MB/s, closely matching the ~2.5MB/s observed with iostat. Would 15K RPM drives help? Sure, but only by a factor of 1.5 (15,000/10,000). What we really need is better-aligned I/O with larger transfer sizes, which a well-designed array controller and cache normally take care of. In this case, increasing the stripe size should be the major improvement; from previous experience, ~384KB would be closer to optimal if we could deliver it in parallel (by the same one-I/O-per-revolution arithmetic, on the order of 60MB/s per drive).
Tuning maxphys, [s]sd_max_xfer_size, and bufhwm can also prove fruitful, provided there's enough memory to go around for the I/O subsystem. This is one thing I love about Solaris: its design can handle large I/O transfer sizes (maxphys goes up to MAXINT). Contrast this with Linux and AIX, which use a 4KB I/O transfer size. Of course, you'll also need a filesystem that can handle the throughput, but that's a much larger (though closely related) discussion.
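The back-of-envelope model is easy to script if you want to play with other drive speeds and transfer sizes. This is a minimal Python sketch, assuming (as in the arithmetic above) that each drive completes exactly one I/O per revolution and nothing in between coalesces requests; the function names are just illustrative. It compares the 16KB default against 15K RPM drives and against ~384KB transfers.

```python
# Back-of-envelope model: assume each drive completes one I/O per platter
# revolution (poorly aligned I/O, no controller cache to coalesce or
# reorder requests). Throughput ceiling = IOPS * I/O size.


def iops_per_drive(rpm: float) -> float:
    """One I/O per revolution, so IOPS equals revolutions per second."""
    return rpm / 60.0


def ceiling_mb_per_s(rpm: float, io_kb: float) -> float:
    """Per-drive throughput ceiling in MB/s, using 1 MB = 1024 KB."""
    return iops_per_drive(rpm) * io_kb / 1024.0


if __name__ == "__main__":
    cases = [
        (10_000, 16),   # 10K RPM Cheetah, 16KB default stripe size
        (15_000, 16),   # 15K RPM drive, same stripe size
        (10_000, 384),  # 10K RPM drive, ~384KB transfers
    ]
    for rpm, io_kb in cases:
        print(f"{rpm:>6} RPM, {io_kb:>4} KB/IO: "
              f"{iops_per_drive(rpm):7.2f} IOPS, "
              f"{ceiling_mb_per_s(rpm, io_kb):6.2f} MB/s")
```

Run as-is, it reports about 2.60, 3.91, and 62.50 MB/s for the three cases: the faster spindle buys exactly 1.5x, while the larger transfer buys 24x. Treat these as ceilings from the model rather than predictions, since seek time and the drive's sustained media rate aren't in it.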

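To see why the stripe size ends up being the per-drive I/O size, here's a minimal sketch of how a logical request maps onto a simple stripe. It assumes a plain round-robin RAID-0 layout and ignores mirroring (each submirror would see the same pattern); `split_request` and the 6-column volume are hypothetical illustrations, not Sun Volume Manager internals.

```python
from typing import List, Tuple

# (column index, byte offset on that column, length in bytes)
Chunk = Tuple[int, int, int]


def split_request(offset: int, length: int, interlace: int, columns: int) -> List[Chunk]:
    """Split a logical request into per-column pieces on a simple stripe.

    Assumes consecutive interlace-sized units go to columns 0..columns-1
    and then wrap around (plain round-robin RAID-0).
    """
    chunks: List[Chunk] = []
    end = offset + length
    pos = offset
    while pos < end:
        unit = pos // interlace                  # which interlace-sized unit
        within = pos % interlace                 # offset inside that unit
        column = unit % columns                  # drive holding the unit
        row = unit // columns                    # stripe row on that drive
        n = min(interlace - within, end - pos)   # stop at the unit boundary
        chunks.append((column, row * interlace + within, n))
        pos += n
    return chunks


KB = 1024
# Aligned 384KB request, 16KB stripe, 6 columns: 24 pieces of 16KB,
# i.e. four separate 16KB I/Os on every drive.
small = split_request(0, 384 * KB, 16 * KB, 6)
# Same request with a 384KB stripe: one 384KB I/O on a single drive.
large = split_request(0, 384 * KB, 384 * KB, 6)
# Unaligned 16KB request starting 8KB into a unit: two 8KB I/Os on two drives.
straddle = split_request(8 * KB, 16 * KB, 16 * KB, 6)
```

With the 16KB default, each of those small pieces pays its own revolution in the one-I/O-per-revolution model, so large sequential requests (rebuilds included) get chopped into the slowest possible unit. With a 384KB stripe each request is one large transfer, and parallelism has to come from concurrent requests landing on different columns. The last case is the alignment problem in miniature: a single 16KB request that isn't aligned to the stripe costs two I/Os on two drives.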