Paul van den Bogaard his weblog

Tuesday Jul 22, 2008

Waving Commit Delay

In the previous entries many graphs were shown. The once showing throughput by the amount of commands handled had this typical sinusoidal pattern. I now think I found the cause of it. Although I definitely do not understand it (yet).

Since, as usual, a picture tells us more than a thousand words here is one with that pattern. The vertical axes shows the cumulative time by all PostgreSQL processes in a certain state:


mail_commitdelay.png

Ok, what is shown here? The red graph depicts the spent while sleeping as a direct result of the commit_delay setting. Currently the average number of processes with active transactions when a flush is called is around 85. Therefore that part of the equation is true, so indeed it is the commit_delay setting that determines if the sleep is done: all other prerequisites are true (read the source if you do not understand :-) .)

Now when setting commit_delay to zero, the result becomes:


mail_nocommitdelay.png

The time for XLogFlush now fully determines the time spent in the routine that handle the RecordTransactionCommit. It is also approximately 60% of the time spend handling commands.

The downside of not disabling commit_delay is total throughput. A little bit of delay reduced the amount of (WAL related) IO significantly. This was already shown in an earlier blog entry. However the behavior is much more constant without the delay. would it be beneficial if the same constancy could be realized in the commit_delay case?

To answer that one, some extra investigations are called for.

Sunday Jul 20, 2008

Througput spikes explained

In previous posts a typical pattern in handling commands was seen. The handling versus reading commands cumulative time consumption showed a pattern as can be seen in the next graph:

mediumheap.png

I had a hunch that this could be related to the client (load generator) behaviour. The client program is written in Java. It runs 128 application threads that continuously communicate with the database. Lots of rows inserted and retrieved. The number of objects created should be considerable. Garbage collection to clean up the Java heap might be the driving force for the above mentioned pattern.

My standard configuration runs with a 2 Giga byte heap. I decided to increase this to 3.2GB, and also increase the Eden size to 1.5GB. For those a little familiar with Java, this is the space in which the objects are created. The test resulted in the following graph:


largeheap.png

Indeed it is the client who is responsible for these spikes. Using a larger heap means more objects can be created in the Java heap before a garbage collection is forced. Also a larger heap holds many more objects translating in longer GC times. Indeed the time spent reading commands has increased. Please note, garbage collections in Eden are "stop the world", meaning the client does not respond to the database process: during a GC the readcommand does take significant more time.

Wednesday Jul 16, 2008

Finding the throttle

In my previous blog entry I showed a behavioral pattern in PostgreSQL that needs attention. In this entry I will try to explain what I did to find the culprit.

The only statistics discussed were the ones in tcop/postgres.c These gave an impression of the global activity. During a checkpoint lots of time is spend handling commands. Although this time seems to be zero, it cummulates to figures as high as 1600 seconds. By looking at the other statistics during those checkpointing intervals some of these too show this accumulation hopefully too. This would make it easy to find the root cause. At 1218 seconds this pattern shows nicely. The next table shows the top time-consumers during that interval:

NAME COUNT TOTAL AVERAGE MAX
SLRUPhysicalWritePage::Part6 16 8368 523 8368
SLRUPhysicalWritePage::call 16 8370 523 8368
SLRUselectLRUPage::WritePage 1 8370 8370 8370
SLRUwritePage::call 16 8372 523 8370
postgresqlMain::readchar::B 9799 19100 1 4297
LWLock::Acquire::08::semaphoreTime 1711 21303 12 411
xlogflush::go::acq 1724 21349 12 411
SLRUselectLRUPage::Return3 768 344920 449 10574
SLRUselectLRUPage::Return1 2162 879268 406 10568
LWLock::Acquire::59::semaphoreTime 676 1059722 1567 8368
SimpleLRUWaitIO::bufferlock::Acquire 768 1060120 1380 8368
SLRUselectLRUPage::WaitIO 622 1061593 1706 8381
SimpleLRUWaitIO::call 747 1061932 1421 8381
SLRUselectLRUPage::Part3 623 1069964 1717 8381
SLRUselectLRUPage::call 2931 1224187 417 10574
SLRUreadPageReadOnly::SimpleReadPageCalls 1022 1224899 1198 10575
SLRUreadPage::call 2799 1224903 437 10575
SLRUreadPageReadOnly::call 1525 1225559 803 10576
postgresqlMain::handleCommand::E 9824 1267569 129 10577

Ok, what are these? These are the statistics. I try to use a #8220convention:#8221 A name consists of one or more tags separated by a colon pair. First part is the name of the function in which the statistic is used. Next part should indicate the part in the function that it means to time, and if present the final part tells more about the details being measured. Most statistics measure wall clock time, however it can be that another metric is used (number of elements in a list for example). Rows in the table are sorted by total time. The postgresqlMain::handleCommand::E is the statistic used to count time needed to handle an Execute command. During the last second it was called 9824 times, and the total time it took to do this was a little more than 1267 seconds. On average it took 129 msecs. However there once was a maximum handling time of a little over 10 seconds. (if time is the metric, the resolution is msec.) The functions in my copy of backend/access/transam/slru.c are now heavily instrumented. Statistic names starting with SLRU belong there. It looks like most of the commandHandling time comes from SLRUreadPageReadOnly: SLRUreadPageReadOnly::call depicts the total time spent in this function. However from the count value it is obvious that not all handleCommand visit this function. 1022 of these #8230ReadOnly function calls (SLRUreadPageReadOnly::SimpleReadPageCalls) call the SimpleLruReadPage function. Almost all of those 1267 seconds are spend in there. This function has 6 statistics build in. None of the others popup besides that one for the time consumption by the SLRUselectLRUPage::call. Since it was not obvious to me at that time what was going on in that function. Therefore I split this routine in three parts and measured time spend in each of them. Only part3 is visible in this Top-N table. Looking at the code I see this part contains an if-else construct in which either SimpleLruWaitIO or SimpleLruWritePage are called. The first one takes 1069 seconds for 747 calls, the second one takes only 8 seconds. However that is for just one call! This is a single write that takes forever. Why? And is this related to those 1069 seconds spent in the WaitIO part. Names used do point in that direction.The SimpleLruWaitIO routing too has 6 statistics of his own. One of them is: SimpleLRUWaitIO::bufferlock::Acquire.
The bufferlocks are assigned dynamically. This then, could be an explanation for the accumulation of time in LWLock 59. As you might recall in this one all statistics for all dynamic locks in the system are stored. The numbers do match. And indeed the maximum waiting time seen for LWLock::59:semaphoreTime is very similar to that WriteIO. It looks like all my client related processes are waiting for this single write that takes so long, since 128 * 8 seconds translates to 1024 seconds total. Since this maximum time is for one single occurrence this also means that single block (did not actually look to the block size, so it is an assumption really) is used by all processes. If this is the case than there is something out there behaving as a classic scalability issue.


Time consuming write

Now why is that single write so terribly slow? Can I get more information out of the statistics I have? Before the SlruPhyscialWritePage routine is called one of the buffer locks is acquired in EXCLUSIVE mode. After this routine returns and some extra housekeeping is done this lock is released again. The PhyscialWritePage routine, which of course also has a fair amount of statistics, is the one where time is consumed. Meaning that that lock being held in EXCLUSIVE mode is our scalability issue! The PhyscialWritePage was also #8220magic#8221 to me, so simple separation by adding #8220PartX#8221s here and there learns me that Part6 is the time consumer. Since I know by traversing the code, that the last parameter passed to this routine has a value of null it must be the pg_fsync(fd) in this routine that causes the long waiting. This raises the next question. What is being synced? It must be data since we are in a checkpoint.


DTrace to the rescue
The little script I used for this basically starts on a fdsync system call. Before this happens it knows the file descriptors and file names that belong to each other. On each fdsync that takes longer than 2 seconds the filename that belongs to the file descriptor is printed.The script also counts all writes against the file descriptors. Basically the assumption is that all these write are out there waiting to be sync-ed. During the test three different filenames were shown as follows:
fsync 24637: Took 8 secs to write 122588 entries to "pg_tblspc/16384/16386/16387.1"
fsync 24637: Took 8 secs to write 128169 entries to "pg_tblspc/16384/16386/16387"fsync 24698: Took 7 secs to write 1 entries to "pg_clog/000C"
Here 24637 is the process id of the bgwriter. It is flushing single gigabyte portions at each flush. To all files of interest. (the ones above are application specific tablespaces/files). However writing a single block to clog also takes a very long time. Mmm that one is easy to resolve. Just wondering if it would make a difference. Moved pg_clog (including its content) to yet another disk. Created a softlink, and redid the test.


The CLOG solution
The next two graphs tell the story. These are the same graph as shown in the previous blog entry. The first one present the overall picture, the second one zooms in for a little more detail.
handlingcommandsclog.png handlingcommandsclogzoom.png

This one looks far better. However that pattern of "increased readcommand time / decreased handle time" is interesting too. And of course the questions:
  • CLOG is flushed by the fdsync. However I do not think this originates from the code pathes inspected so far. Just happened to be a side effect of the DTrace script and relatively ease to remove. So where did it come from?
  • Is it just the huge load of IOs on a ZFS file system that causes these waits, or can this be seen on UFS too for example?
  • Still there is a huge burst of writes (1GB) to each tablespace. Would it be worthwhile to sync in between checkpoints to spread the IO load over time? A single disk can sustain only so many IOs/sec. Spreading should indeed be advantaguous.

This screams for further inspection.


Checkpoint throttled throughput

In previous entries the checkpoint effect on throughput was mentioned but not explained. While looking at the IO behaviour comparing UFS to ZFS when running a PostgreSQL database instance, I see huge bursts of IO that also seem to throttle throughput in a significant way.

The original tests used a load generation tool that reported its measured throughput each 100 seconds. Since we all know that averaging removes the interesting information I redid a test with a 5 second reporting interval. Indeed this showed the effect of throughput much better: 0 TPM while checkpointing. Complete stand still of the application. Wouw, this needs further attention.

Instrumentation effect

Initially I used DTrace probes to instrument the code. However I found that for high frequency probes (firing over 1000 times a second on many processes) the instrumentation effect was not neglectable. Magne Mahre wrote a little library we started using to instrument "a light weight way". In my situation LWLock 25 was used to handle some synchronization issues. Every process counted locally and during commit time the global pool of counters was synced. This works rather well. However it is so unfortunate that dbwriter and logwriter commit at a very low rate. I had to call a "Copy local to global" routine to ensure the information gathered in these processes did also show in the final reports.

Unfortunately the final reports did show a new number one contender: lock 25. This method too did not turn out to be the light weight instrumentation method looked for. Although I must admit that in my attempt to understand I enter quite an amount of counters in the machine (currently I have over 550 new counters). Lots of these turn out not to increment.

I wrote a quick library that uses the atomic_xxx routines provided in the libc librarty that comes with Solaris. The now complete mutex and lock free way of instrumenting seems to provide me with the information I was looking for. Although I am still grasping to understand it completely. This library creates statistic objects in a new segment of shared memory. The advantage seems to be that any other process can attach to it, to display a real time view of the counters in there. I currently sample with a frequency of 1 Hz. And this adequately shows me the information.

A single statistic holds a running total, a counter of "times updated", the minimum value so far, and the maximum value so far used to update that statistic. Maybe a future version will also provide a histogram facility so a better understanding of the dynamics of certain statistic could be possible. Right now it is not implemented.

Setup

I currently using one of the earlier PostgreSQL 8.4 sources. All is build 64bit in an optimized way with Sun Studio 12 compilers and tools. Most tests run for an hour, after which I start mining the data. Current focus is on activities, or lack there of, during checkpoints. Checkpoints that are parameterized to happen every 300 seconds. And should finish in 15% of the time. All PostgreSQL related files reside on two ZFS pools. A WAL pool holds 4 internal disks. The other pool consists of 5 internal disks and holds all the other files. Each pool has its own memory backed up cache array for the ZFS Intent Log (see earlier post on UFS and ZFS).

Since the checkpoints are "tuned" to happen as fast as possible the concentration of IO seems to be high. However the IO subsystem can cope with this.

The next graph shows the throughput. A clear pattern is visible. I had a hunch that it might be related to garbage collection in the client. Not seeing throughput can also mean less activity on the load generator. Since this is a Java program GC needed to investigated. But ok, using jstat -gc it was very obvious this application behaves as it should be. No prolonged periods of GC-takes-over-the-world here. Once more the database was "prime suspect."

All tests are done with 128 clients.m

typicaltpm.png

Mapping this to the IO happening is clearly shows in the following graph. Please note these TPM numbers are for 100 second interval and are clearly shown after that interval. Also the actual point "Time=zero seconds" is different on the load generating machine and the system under test.


combinedinfo.png

And the IO graphs do not include the traffic to the ZIL holding devices. This is database related IO only. Seems obvious that the sudden burst in IO is related to the next dip in throughput. A dip that is about 20-25%.

From this graph my idea resulted to do some finer grained measurements. This to see if during those high IO peaks all throughput suffered. Redid the test, but now with a 5 second interval for reporting TPM..

Same information as the previous graph, but now a 5 second interval for TPM. Same overall pattern as the one above but now focussing on a smaller, arbitrary selected, interval.


combinedfastsampling.png

Indeed the checkpoint seems to be the driving force for this sudden, steep, not long lasting, surge in throughput. However mapping the results from two applications running on different machines is a little tricky. As the saying goes: 'Use the source, Luke'

Using the source

In order to be able to exclude the data from the client I implemented some counters in tcop/postgres.c In the routine PostgresMain there is a "for ( ; ; )" loop in which data is read from the client, and acted upon. Which these I am now able to tell how much time (WAL clock) has elapsed handling data in different contexts (execute, parse, bind, sync, define) and how much the database process is waiting for the client. And of course it tells me the number of times this has been done. The counters are cumulating over time. However perl is quite happy to translate this into deltas per second.

Besides these counters I enthusiastically overloaded many source modules with too many statistics. In storage/lmgr/lwlock.c I placed multiple per non dynamic lock, and one where I cumulate all the information for the dynamic locks. In my source this one is numbered 59. The next table is a snippet where the tcop/postgres.c placed stats are shown together with those from the lwlock.c source module (only the most time consuming lwlocks during the time a checkpoint is happening. BTW the tests are now reduced to 1800 or so seconds since these patterns seem not to be affected by warm up phases while mimicking real life..

Here is the graph. I put the ceiling on 400M to make things a little more visible. The actual peaks are around 1600M. micro seconds. The third graph depicts the total time spend reading and handling commands.


cummtimeelapsed.png

Zooming in a little show a bit more detail:

cummtimeelapsedzoom.png

Those short interruptions directly followed by a peak of the same magnitude are artifacts of measurements: if a timer stops in the beginning of the next interval, there is an extra cumulation in that next interval instead of the current one. However the peak at 910 and 1210 seconds are way beyond these artifacts. Total time is quit nice: 128 seconds. Recall we have 128 processes all measured each single second! The extra peaks/dips in reading/handling every 22 seconds or so is not yet explained. This one might be due to short garbage collection interruptions. It is indeed more time waiting for a command to arrive. Need to look into that one too some day soon...

Looking at the blue 'Totals' graph show a couple of seconds of "nothingness" Not reading, but also not executing! Followed by an extreme peak of handling time. BTW if a statistic timer starts in interval A, and ends in interval A+5 seconds, this adds zero's to A, A+1, ..., A+4, and 5 to interval A+5. There is indeed a repeating pattern of total zero throughput ( I am sure, I read the numbers.)

Calendar

Feeds

Search

Links

Navigation

Referrers