Data Processing
Valdis's Weblog
Thursday May 22, 2008
Tuning for filesystem performance, specifically QFS

The holy grail of storage performance, here goes (this question comes up every week).

To make I/O performance perfect, a block of data needs to be transferred unhindered and unaltered, with as few disassemblies and reassemblies as possible, as it travels from the CPU to the physical disks. I have explained this many times and tuned for it for over 20 years, and strangely enough the basic rules do not change. Neither do Moore's Law or Amdahl's Law, but they do get misquoted.

So if your application writes in 16K blocks, make sure that all components in the I/O path for this application work in 16K units or larger. But not too much larger, as you will be wasting resources.

-- Excerpt from a discussion a couple of weeks ago, when an app was writing data in 128KB blocks and we were using a shared HPC SAN filesystem called QFS; may be useful to someone ---

Suppliers (array manufacturers), industry etc mix up segment size and stripe width. This is what I do:

Understand your disk arrays and how they transfer data.

Segment size is the size of the block written to an individual disk (in your case 128KB).
Stripe width is the amount that the array controller writes to a RAID vol/grp/unit; this is number of disks x segment size (in your case 128KB x 4 = 512KB, as this person was using 4 disks).

Now the DAU (Disk Allocation Unit) is what QFS uses to write a block of data. By most best practices it should match the stripe width to avoid read/write misses: what we want is that one QFS read or write results in only one "RAID group" read/write. But you can specify the DAU to be whatever you want, within reason.

Your application is writing data in blocks of 0.5MB, so yes, your DAU should be 512KB.

So you can have 4 disks of 128KB seg size, or 8 disks of 64KB seg size, etc. 8 disks will give more performance than 4, and if you have an 8D+1P RAID 5 group this just happens to fit nicely. NB: 1 disk is for RAID parity, so you need to add this to the 8 disks for data.
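
The arithmetic above can be sketched in a few lines of Python. The function name and the choice of KB as the unit are assumptions made for illustration; nothing here is a QFS or array API.

```python
def stripe_width_kb(data_disks: int, segment_kb: int) -> int:
    """Stripe width = number of data disks x segment size.

    Parity disks are not counted: an 8D+1P group has 8 data disks.
    """
    if data_disks <= 0 or segment_kb <= 0:
        raise ValueError("data_disks and segment_kb must be positive")
    return data_disks * segment_kb


# Both layouts mentioned in the text give the same stripe width.
for disks, seg in [(4, 128), (8, 64)]:
    print(f"{disks} data disks x {seg}KB = {stripe_width_kb(disks, seg)}KB")
```

Either layout matches a 512KB DAU; the 8-disk one simply spreads the same stripe over more spindles.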

Remember, no matter how good a disk array's cache system is, with the sizes of databases etc. that we have nowadays the cache can get overwhelmed very quickly if you do not have enough spindles, or disks as we call them. In the end performance is determined by the number of IOPS (I/Os per second) of the backend disks. Try a database load or an import/export of a table and watch your disk array performance deteriorate as the cache just cannot keep up.

Now IOPS: a very approximate rule is that the faster the disk spins, the higher its performance will be. However, if you can get the average seek time and rotational latency from a disk manufacturer's data sheet, then you can work out IOPS. IOPS can be calculated using the following formula:

IOPS = 1000ms/(average seek time + rotational latency)
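
As a minimal sketch of that formula in Python: the function name is made up for illustration, and the 4 ms seek and 2 ms latency figures are hypothetical, not taken from any real data sheet.

```python
def estimate_iops(avg_seek_ms: float, rotational_latency_ms: float) -> float:
    """Rough per-disk IOPS: 1000 ms divided by the time one random I/O takes."""
    service_ms = avg_seek_ms + rotational_latency_ms
    if service_ms <= 0:
        raise ValueError("seek time plus rotational latency must be positive")
    return 1000.0 / service_ms


# Hypothetical disk: 4 ms average seek, 2 ms rotational latency.
print(round(estimate_iops(4.0, 2.0)))  # 1000 / 6, rounded to 167
```

Multiply the per-disk figure by the number of data spindles for a very rough backend number; it ignores cache, transfer time and queuing.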

Now QFS stripe options can also help here, but that is an even bigger story. QFS can do round robin writes and stripe across many disk array RAID groups/sets.

The trick is that the DAU is (most of the time; I am sure there will be exceptions) the same block size (currency) that the app uses, e.g. if the app writes 950KB the DAU should be 1024KB. Most apps write in the usual power-of-2 KB sizes (8, 16, 32, 64, 128, 256, 512, 1024, 2048), so you should get a close match, as DAUs can be set to the same sizes.
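
The rounding rule in the 950KB example can be sketched like this. The function name is hypothetical, and the code only rounds up to the next power of two; whether a given value is a valid DAU for your QFS device type still has to be checked against the QFS documentation.

```python
def dau_for_block_kb(app_block_kb: int) -> int:
    """Smallest power-of-two size in KB that holds one application block."""
    if app_block_kb <= 0:
        raise ValueError("app_block_kb must be positive")
    dau = 1
    while dau < app_block_kb:
        dau *= 2
    return dau


print(dau_for_block_kb(950))  # 1024
print(dau_for_block_kb(512))  # 512, already a power of two
```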

What we try to do most of the time is to configure the system so that all the "gates" from the app to the disk RAID group are the same size. The "truck", i.e. the block of data, fits all the way from the app to the QFS filesystem to the RAID group without having to do 2 or more writes/reads for one block requested by the application. The nightmare scenario is that for an app writing one block the array does many writes, e.g. the app writes 128KB and stripe width = 32KB, thus every time the app does a single read or write the controller has to ask (read/write) 4 times. This is a serious I/O performance overhead, and one that I can make lots of money fixing.
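
The nightmare scenario is just a ceiling division. A small sketch, with a made-up function name and assuming the application block starts on a stripe boundary (a misaligned block can cost one extra I/O):

```python
def array_ios_per_app_io(app_block_kb: int, stripe_width_kb: int) -> int:
    """How many stripe-sized I/Os the array needs for one application block."""
    if app_block_kb <= 0 or stripe_width_kb <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: a partly filled stripe still costs a whole I/O.
    return -(-app_block_kb // stripe_width_kb)


print(array_ios_per_app_io(128, 32))   # 4, the nightmare scenario
print(array_ios_per_app_io(128, 128))  # 1, all the gates match
```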

Make sure that your block is not disassembled or reassembled on its journey from the app to the disk. OK, the PCI bridges and HBAs may do this, but we cannot change that. PCI lanes are getting into deep, heavy tech stuff.

So I normally work this way. Find the app blocksize, then make the DAU the same, then make the stripe width the same as the DAU, then decide how many disks we want to use to get the IOPS, and then divide the stripe width by the number of data spindles in the physical array's RAID group to get the segment size. Now the segment sizes are mainly fixed on the arrays that we use, at 8, 16, 32, 64, 128 or 256KB. So we sometimes do not get a round number that matches the app blocksize, DAU etc. However, I always make this "magic gate number for the blocks/trucks" larger than the DAU, so as to avoid 2 physical reads/writes for each application write/read, which is the crux of all application and I/O tuning.
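
The steps above can be sketched as a small planning function. This is an illustration under stated assumptions, not a QFS or array tool: the names are made up, and the list of supported segment sizes is assumed to be the powers of two from 8KB to 256KB, which will differ between arrays.

```python
SUPPORTED_SEGMENTS_KB = (8, 16, 32, 64, 128, 256)  # assumed; check your array


def plan_layout(app_block_kb: int, data_disks: int,
                supported=SUPPORTED_SEGMENTS_KB) -> dict:
    """Walk the steps: app block -> DAU -> stripe width -> segment size."""
    if app_block_kb <= 0 or data_disks <= 0:
        raise ValueError("app_block_kb and data_disks must be positive")
    dau_kb = app_block_kb                   # DAU matches the app blocksize
    needed_segment = dau_kb / data_disks    # stripe width / data spindles
    # No exact fit? Take the next size up, so the stripe is never
    # smaller than the DAU and one app I/O never needs two stripe I/Os.
    fits = [s for s in sorted(supported) if s >= needed_segment]
    if not fits:
        raise ValueError("no supported segment size is large enough")
    segment_kb = fits[0]
    return {
        "dau_kb": dau_kb,
        "segment_kb": segment_kb,
        "stripe_width_kb": segment_kb * data_disks,
    }


print(plan_layout(512, 4))  # exact fit: 128KB segments, 512KB stripe
print(plan_layout(512, 6))  # no exact fit: 128KB segments, 768KB stripe
```

With 6 data disks the exact answer would be about 85KB per disk, which is not on offer, so the sketch rounds up rather than down.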

Storage heaven is where we have full stripe writes and reads, which is what you get when the application block size and DAU fit the stripe width.

You can check this with various tools, for example by using vdbench (storage performance saviour) to do 10 writes or reads of a specific blocksize. If the array does not do the same number of I/Os (e.g. 10 writes/reads) then you are not hitting the G spot (array Group size). So if you do 10 writes and the array did 20, your seg size is quite likely half of your DAU or app blocksize. Remember filesystems do strange things to application writes and can mutilate them in more ways than we can dream up, so we have to know and understand filesystems. A good old Unix test is the "dd" command: if you have an array with a certain number of disks in a RAID group, run dd against the actual raw LUN to see what it can do. Your filesystem layout, which you use later, should get close to this number if correctly tuned. If you get more then you are a candidate for a Nobel prize. If widely different then something between the app and the disk is messing things up. No chance of a Nobel prize, maybe a Darwin prize.
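
The 10-versus-20 rule of thumb can be turned into a rough estimate. The function name is hypothetical, and the I/O counts are whatever your benchmark tool and your array's statistics report; this only does the division.

```python
def likely_array_unit_kb(app_block_kb: float, app_ios: int, array_ios: int) -> float:
    """Guess the size the array is really working in, from I/O counts."""
    if app_ios <= 0 or array_ios <= 0:
        raise ValueError("I/O counts must be positive")
    return app_block_kb * app_ios / array_ios


# 10 application writes of 128KB showing up as 20 array writes.
print(likely_array_unit_kb(128, 10, 20))  # 64.0, half the app blocksize
```

A result equal to the app blocksize means one array I/O per application I/O, which is what we are after.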

Think of a truck going down a road where all the tunnels and lanes are the exact size or bigger, so the truck's journey is never hindered and the driver does not have to unload/disassemble and load/assemble the truck (block of data) to get it through the tunnels, lanes and toll gates.

Now can the QFS community guys check this, as I have been known to write faster than I think. But I have got close to max specified speeds on 6140 and 6540 arrays using this technique, plus some old heritage and legacy arrays.

Now, have I put the whole storage consultancy business out of a job? Not really. Take this example. A woman calls a mechanic (call him Jerry) to fix her car as the engine does not work. Jerry takes a look at the engine, gets a hammer and hits a specific part of the engine, and the engine starts to work. Jerry says, "That will cost you $500." The woman says, "You must be joking, you just hit it with a $10 hammer." "Well," says Jerry, "the bill is $10 for the hammer and $490 for knowing where to hit the engine." You pay for the knowledge, not the muscle.

Posted at 07:05PM May 22, 2008 by Valdis Filks in Technical  |  Comments[2]

Comments:

Great blog post! One question: where can I download vdbench from ?

Thanks

-Alex

Posted by alex on May 23, 2008 at 12:44 PM CEST #

If you are a customer or partner you can get it from the SDLC (Sun Download Center). I can see it internally but am not sure how partners see it. As it is s/w and Sun's strategy is to open source all software, I believe that you will soon see it as Open Source.

vdbench runs on Java and you can use it on most OSes, so you can do performance comparisons running the same load on Win, AIX, Solaris, HPUX and others.

Posted by Valdis on May 23, 2008 at 02:23 PM CEST #
