Paul van den Bogaard his weblog

Thursday Jun 19, 2008

Running PostgreSQL on ZFS file system. Some considerations

Up till now I compared ZFS to UFS when deploying the PostgreSQL database files. Only these database processes were running on the system, and only all the PostgreSQL related files were on the filesystem being compared.


The file system was created on either a stripe set of nine disks using Solaris Volume Manager or a zpool using these same nine disks. The last blog entry reported ZFS outperforming UFS.


I did some extra tests to see if performance (read database throughput) could be increased by fiddling with some elements that are known to influence the IO part when using ZFS.


The ZFS Intent Log is known to consume bandwith. The effect of this element can be seen by disabling it altogether. Of course this is definitely NOT advised for production environments. (See http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ZIL for some extra information.)


Did some tests with the ZIL disabled. Using 5, 15, 50 and 85% for the completion_target of PostgreSQL. The following graph shows the database transaction throughput:




tpmNoZIL.png


Recalling the baseline I presented in an earlier blog using seven disk arrays with caches, that showed a throughput around 90K TPM this one comes very close. Only nine internal disks this time...


The previous throughput graph (... the saga continues) showed a rate of ~ 50K TPM: transactional integrity does not come for free.


Did some more tests where ZIL was enable, but new using a dedicated disk array (with 1GByte of write cache) for the intent log. The throughput result of this setup, the disabled-ZIL and the 'all on nine disks' tests are in the next graph (all have a completion_target setting of 15%):


tpmDifferentZIL.png


The ZIL element obviously takes time. Time during which the throughput is cleary throttled while a database checkpoint is in progress. Using a dedicated fast device pays of to increase throughput and make it more constant over time. The IO (write part) was measured for both the ZIL device as for the data on the "nine disk" ZFS file system . Here is the result:




zlogvsdataIO.png


The checkpoint at 1800 seconds resulted in a 77 MByte write/sec for the ZIL only. As already suggested by Mikael Gueck I would surely like to do some extra tests with SSD disks. Guess I need to hold my breath a while, since right now I "just' do not have them.


However right now I feel ZFS is already a realistic alternative for a PostgreSQL setup.


Will do some more tests where I'll play with some of the other features of ZFS to get a feeling for their effect on my PostgreSQL environment.

Wednesday Jun 18, 2008

UFS versus ZFS with PostgreSQL, the saga continues

The previous posts described some tests I did using a setup that was "tuned" to focus on CPU consumption. The idea was to minimize IO as much as possibly by caching all data needed as much as possible and using plenty of disk arrays with write caches to minimize write latency.


In my current tests I am interested in the IO components and how they influence throughput when using two different file systems. In order to create a significant IO component the amount of disks was limited to nine internal ones: no array; no write caches. In addition the load generator was changed to significantly increase the working set data. Indeed my 16GB of shared buffers are not enough to buffer this working set.


The initial configuration was "tuned" to once in 15 minutes perform a checkpoint. Here I will describe the results when and using a smaller checkpoint interval and influencing the time "available" to write out all data due to a checkpoint. The PostgreSQL parameters in effect here are:



checkpoint_timeout
checkpoint_completion_target

The checkpoint_timeout was set to do a checkpoint every 300 seconds. The checkpoint_completion_target used to be 50%. This influences PostgreSQL to write out all the dirty buffers in a period of time that is 50% of the timeout value. Indeed the old graphs show this nicely: first halve of each checkpoint period a lot of IO can be seen that is not present during the second halve of this period.


The idea behind the tests described here was to spread the IO over time. This was done by reducing the timeout from 15 to 5 minutes and increasing the completion_target to 85%. This turned out not to be optimal. Probably too many dirty blocks were re-dirtied again so a continuous writing of the same blocks was done over and over.


A smaller checkpoint_timeout seems fine for the time being. Increasing this parameter ensures that rewriting of blocks is minimized but the amount of blocks to be written increases resulting in a much more extreme IO workload once the checkpoint starts. With that same reduced checkpoint_timeout setting the completion_target was set to 15% to see what happens. The results of the 85% and 15% tests can be seen in the graphs below. First the throughput graph.




tpm.png


The nice thing here is that ZFS shows the best throughput (on average about 50K TPM.) And indeed for a 15% completion_target. Same target for UFS causes huge fluctuations in the throughput graph. Now lets look at the IOPS graph.




writeio.png


Although the pattern is clear lets focus on a smaller time interval to see what happens during one checkpoint interval




writeio2.png


The 15% setting seems too much for the UFS case: it needs to continue writing data for the whole period. ZFS on the other hand, puts out a burst of IO's in a 50 second period and from then on the IO settle to a constant amount of "background" IO for the rest of the checkpoint interval (most likely this has a high content of WAL related output.) 15% of 5 minutes is 45 seconds. Indeed ZFS seems to be able to cope with this setting.


The above shows the write IO's per second. How much data is actual written is shown in the following graph.




writeoutp2.png


ZFS is clearly pushing out much more data. During those initial 50 seconds up to 60MB/sec, after that a linear reduction of this amount can be seen. Although the IOPS are rather constant the amount is not. ZFS seems to reduce its write size. UFS pushes significantly less data out to the disks. However it uses so many IO calls that this reduces the overall TPM throughput.


Since ZFS is a logging file system the data written contains both the application (PostgreSQL) data and the logging data. I'll do some extra tests to see how this effects the TPM results. Need to "sacrifice" a 1GB cache disk array for this to emulate a fast SSD-like disk. Will make sure that this one is dedicated for ZFS its so called intent log.

Monday Jun 16, 2008

UFS vs ZFS: commandline editing

My previous post on UFS vs ZFS showed too much of a difference. I looked at all my tests and they all showed this same difference. Still I wanted to make sure...


So I redid the test and .... gone were does 85MB ZFS writes. Doing these tests I rely on command line editing and command line history. Try to use the same command line as before. So I walked through my history and did not spot anything. On the first pass that is.


Than I "just" spotted it. Once more the devil is in the details. The command line used was


#zfs set recordsize=8k pg


And that # character was not the prompt but a comment marker... Well this explains a lot. The record size that in fact was used was the deault setting, 128K. Each 8KB write was in fact a 128KByte write.


So I redid the ZFS based tests and here are the results for the 'all on one volume of nine disks.'


First the Writes/sec:


writeio.png


And the corresponding IO load:




writekb.png


Finally here is the TPM throughput graph for the 'all on nine' tests.




tpm2.png


Once more ZFS writes out more data. Looking at the TPM numbers the ZFS tests pushes a little less transactions through the system. Therefore the extra writing must be related to ZFS "overhead." The "good" thing is ZFS uses less but much larger IOs.


Although the maximum throughput is a little better for UFS, the ZFS behaviour is more constant. Overall the average looks very similar. No clear winner. But than it was not a contest, just a comparison. And surely I would like to understand those too noticable high-lows in throughput. Of course the writes are related to PostgreSQL its checkpoint configuration. This configuration was "tuned" for a certain set of tests were focus was on CPU consumption. In order to do this the checkpoints were moved to once per 900 seconds. Also PostgreSQL attempt to finish all checkpoint related IO in half of the checkpoint time. Another test is called for that causes a more constant and less extreme IO load.


And btw, the UFS tests were done without directio mounting: the ramp up is almost as fast as with ZFS.

Friday Jun 13, 2008

PostgreSQL on UFS versus ZFS

Introduction


Up till now I had the good fortune to do my PostgreSQL tests using a Sun Fire V890, 16 cores SPARC IV @1350MHz. Attached were seven disk arrays. Four of these with 256MB of write through cache; the other three with 1GB of write through cache. All spindels set up as a stripe.


Always used UFS, where WAL was located on a stripe of the four smaller arrays; all application data related tables on one array; all application indexes on one array; and all other PostgreSQL related data on the third array. Both file systems that held application specific data were mounted with directio option in place. As was the WAL holding file system.


Always used an internal toolkit to generate an OLTP-ish load with 128 concurrent users that were constantly sending their queries and DML (no think times), set up in such a way that there would be quite some IO (a rather large working set).


Next to these seven arrays I had nine internal disks free. This blog entry describes the findings when just using these internal disks in different ways. These internal disks were configure in different ways to hold the data mentioned above. Next to the (array based) baseline different configurations using both UFS and ZFS were tested. Below the findings are presented.


Results


The load generation tool reports its results through a TPM metric. The next graph shows this throughput for the baseline, a single stripe over the nine internal disks holding an UFS file system and a single pool (of these same internal nine disks) on which a ZFS file system was configure.


The /etc/system settings used for all these tests are:



set rlim_fd_max=260000
set rlim_fd_cur=260000
set maxphys=1048576
set ip:ip_squeue_fanout=1
set ip:ip_squeue_bind=0
set tune_t_fsflushr=1
set autoup=900
set ufs:ufs_WRITES=0
set segmap_percent=25

My system has 48GByte internal memory. PostgreSQL shared buffers take 16GByte. So I do not want the filesystems to use more than 25% of my memory for caching. Default UfS uses 12%. However ZFS is a little more greedy by default. This setting at least ensures the database processes would have an adequate amount of memory.


The ZFS file system was created with an 8KByte record size and compression turned off. No further settings were changed.





fs_tpm.png



Before each test a backup of the freshly created database was restored. The caching behavior of ZFS shows quite nicely. Both UFS tests need their time to get started (remember that directio mounted file systems do not cache, so in the beginning the data needs to be read from disks, instead from file system cache). The dips in throughput correspond nicely with checkpoint activity.

Checkpoints are definitely handled way better when using disk arrays with memory backed up cache. At about 2000 seconds the throughput seems become stable in the array case, while in both other cases the checkpoint load strongly effects the throughput.

Comparing the UFS and ZFS tests done using internal disks it seems that during throughput peek times they are quite comparable, although ZFS shows a little less throughput. The data above "just" presents the TPM. Lets look at the IO graphs. To make things "easier" I'll only show UFS vs ZFS using the internal disks. The above graphs come from a test where all database files are in the same file system. Therefore iostat cannot discriminate between WAL, data nor index based IOs.

writerate.png

Yes indeed, there is a difference. Read rate is not depicted here. ZFS shows no reads during the interval. Indeed the file system cache works perfectly. For UFS there is an initial rate of 13MByte/sec. This one reduces rapidly. At 1500 seconds there is 200KByte/sec left. After 2500 seconds (from the start) this has been further reduced to less than 40 KB/sec.

However the difference in write IO is significant. Using ZFS the periods of write IO is longer (from ~200 sec with UFS, to ~380 sec with ZFS) and the amount of write IO is up to five times as big (18MB/sec vs 80MB/sec). Now looking at the actual IOPS the following graph shows.



writeiorate.png



This one shows ZFS is handling the disk device much better. The average number of IO's is down by a factor of 3 when using ZFS.

Finally looking at CPU consumption it shows the ZFS test uses a little less of this resource.





Next step



Some extra tests were done where two pools are used. One for WAL and the other for the rest of the database files. This to check if the behavior could reside from the way the database uses its storage. The WAL pool has four disks, the other pool has five.

Since a picture tells us more than a thousand words, here are the results. First the WAL output.



walwrite.png



Although they seem to differ, it looks more or less equal when considering amounts. Still the UFS is "faster," meaning the frequency of those spikes (checkpoint) is higher. Once more, since everything is in the ZFS file system cache, the ZFS test starts at once, while the UFS one needs a warm up period. However the huge differences seen in the previous graphs do not show here. Now lets look at the filesystem holding the other PostgreSQL data.



allwrite.png

There it is! There is a striking difference between UFS and ZFS file systems when doing non wall related write IO with PostgreSQL. And by the way, the difference is bigger compared to the 'all on one file system' tests: approximately 7MB for UFS versus 85MB for ZFS. The ZFS numbers are obtained through zpool iostat command. I double checked the iostat numbers, and indeed the 85 MB number is correct. No hidden undocumented features that could "explain" this difference. Need to dig deeper.





Conclusion



Although the actual performance from a client perspective is just a little less when comparing ZFS with UFS, the amount of IO (KB/sec) is too different to be ignored. Although ZFS seems to optimize by large sequential writers the overall effect in an OLTP environment seems not to be optimal. I hope to be able to do some further tests that could help shed some light on this remarkable finding.

Calendar

Feeds

Search

Links

Navigation

Referrers