Paul van den Bogaard his weblog
Light Weight Lock Contention
During earlier tests it was discovered that PostgreSQL it's ProcArrayLock was the major source of contention when executing test with over 500 concurrent users. To put this into context some additional probes were added in the lwlock.c source and in postgres.c. In the latter the main routine(PostgresMain), that reads the user commands and executes them, determines the next thing that needs to be done like bind, parse, define and execute. By using these events it is possible to look at the LWLock contention in context of the current requests.
Four major statistics are measured with respect to these leight weight locks: the time to acquire the lock ; the time to wait for the lock before it is available; the time it takes to release it; and finally the time the lock was being held. The acquire time includes the wait time.
All times are reported in milliseconds and accummulated.
The acquire time --after correction for implicit wait time-- and release time are rather constant and small. The wait and held times are the interesting ones.
First, lets look at the numbers for a 64 user run and a 1280 user run. All lockid's uptill but not including 25 are predefined locks. All numbers starting at 25 are dynamically assigned. Locks with ID 25 or higher are cummulated and presented as lock with ID 25.
Here is some data about the consumption pattern of LWLocks. While doing a 3600 minute run there were 10 cycles of DTrace activity to collect this data. A single cycle consists of 1 minute of instrumentation followed by 4 minutes of no instrumentation. This will be refered to as a "partial (instrumentation) run."
All data from those 10 single minutes intervals are combined. For those locks that were acquired for more than 100000 times the data is shown. Data is sorted on the wait time column.
First column depicts the lockID, the second is a code of two letters. The first letter is either S (shared mode acquire) or E (exclusive mode acquire). The second letter is the context: P(arsing), B(inding), E(xecuting), S(ync). For a complete list check the source ;^)
After these there are four groups. Each group consists of three columns: number of times, cummulative wait time in milliseconds and average wait time in milliseconds (cummulative divided by number). The groups are "Time Held", "Acquire Time (including wait time)", "Wait Time", and "Acquire Time without Wait Time." Tables are sorted for the column "WaitingTime."
64 users
Holding Acquiring Waiting Acquire without Waiting
ID Count Time Avg Count Time Avg Count Time Avg Count Time Avg
5 SE 694364 16195 0.023 694364 8186 0.012 27 0 0.000 694364 8186 0.012
5 SP 433670 10883 0.025 433670 5967 0.014 7 0 0.000 433670 5967 0.014
5 SS 561073 12809 0.023 561073 6537 0.012 17 0 0.000 561073 6537 0.012
5 SB 855804 20103 0.023 855804 10074 0.012 39 1 0.026 855804 10073 0.012
25 SE 17651780 419107 0.024 17651787 211544 0.012 186 6 0.032 17651787 211538 0.012
4 SP 105640 3299 0.031 105640 1503 0.014 308 10 0.032 105640 1493 0.014
11 SE 133554 3405 0.025 133554 1731 0.013 283 13 0.046 133554 1718 0.013
11 EE 161634 3945 0.024 161634 2049 0.013 761 27 0.035 161634 2022 0.013
4 SB 380304 12601 0.033 380304 5056 0.013 714 34 0.048 380304 5022 0.013
4 SE 180781 5213 0.029 180781 2371 0.013 967 38 0.039 180781 2333 0.013
4 EE 161423 3527 0.022 161423 2069 0.013 3455 223 0.065 161423 1846 0.011
3 EE 161445 3797 0.024 161445 2442 0.015 8374 440 0.053 161445 2002 0.012
25 EP 224292 6479 0.029 224292 3993 0.018 9959 682 0.068 224292 3311 0.015
25 EE 1813561 87351 0.048 1813562 24029 0.013 15611 2341 0.150 1813562 21688 0.012
7 EE 635664 16594 0.026 635664 11731 0.018 50265 3138 0.062 635664 8593 0.014
25 EB 852120 37442 0.044 852120 35556 0.042 95684 9798 0.102 852120 25758 0.030
8 EE 161221 108131 0.671 161222 183921 1.141 63073 181181 2.873 161222 2740 0.017
1280 users
5 SS 1095699 29526 0.027 1095700 16210 0.015 239 483 2.021 1095700 15727 0.014
5 SE 1357406 38023 0.028 1357407 22026 0.016 349 647 1.854 1357407 21379 0.016
5 SP 851657 27240 0.032 851657 16652 0.020 325 649 1.997 851657 16003 0.019
5 SB 1678278 49172 0.029 1678279 29385 0.018 684 1682 2.459 1678279 27703 0.017
3 EE 315330 8923 0.028 315330 13394 0.042 14738 8524 0.578 315330 4870 0.015
11 SE 250017 7282 0.029 250017 19220 0.077 8783 15257 1.737 250017 3963 0.016
25 SE 34187469 988775 0.029 34187488 490078 0.014 398 19655 49.384 34187488 470423 0.014
11 EE 322850 11541 0.036 322850 43695 0.135 25076 36419 1.452 322850 7276 0.023
25 EP 445981 16483 0.037 445982 74782 0.168 23477 65533 2.791 445982 9249 0.021
25 EE 3460287 1258692 0.364 3460295 960810 0.278 176807 902399 5.104 3460295 58411 0.017
25 EB 1663721 74921 0.045 1663723 1308005 0.786 183328 1260363 6.875 1663723 47642 0.029
7 EE 1138770 39522 0.035 1138771 1798318 1.579 169920 1772398 10.431 1138771 25920 0.023
4 SP 206469 38776 0.188 206470 5182624 25.101 8794 5181440 589.202 206470 1184 0.006
8 EE 230775 54803 0.237 230777 8466229 36.686 210054 8454877 40.251 230777 11352 0.049
4 SE 384603 66350 0.173 384608 9997294 25.993 16936 9993060 590.048 384608 4234 0.011
4 SB 742002 138343 0.186 742004 14315047 19.292 24277 14311237 589.498 742004 3810 0.005
4 EE 305800 12427 0.041 305800 196006021 640.961 341076 200376943 587.485 305800 -4370922 -14.293
First some remarks. The negative numbers for 1280 users the ProcArrayLock (ID: 4) in exclusive mode in execute context (4 EE), are probably due to the way things are handled within DTrace with respect to timing and buffer management. I will try to find out what this can be. It seems to only happen (also for some other runs with higher number of users count) when the number of waits are bigger than the number of acquire requests. This occurs when too many processes are "fighting" for this lock. It basically means "the lock was released" but someone else was succesfull acquiring so I need to try again."
Looking at the actual time it takes to acquire a lock (last group) the time it takes is nicely constant when comparing a system under high load to a lightly loaded system. These numbers also map nicely into a single time slot of the Solaris scheduler (I always use the FX class when so many processes are running.) This means that fiddling with time slices to gain throughput will not show much effect. Also the time during which a lock is being held seems quite nice too. Other inspection with DTrace showed that (for locks with ID < 25) there were barely any system calls made. The locks are used for work that is highly dependend on CPU speed. Once more playing with time slices will not bring significant improvement.
When looking at the number of times a lock was held there is definitely no lineair relation comparing the 64 with the 1280 user test. Counts for 25 SE and 25 EE are about twice as high for the 1280 user case. Furher inspection shows that this is also true for all other locks.
From the above it looks ok to only look at the wait count and time characteristics while doing further comparisions when changing the load. This would at least make reading the data a little easier ;^)
And before I forget, here is the DTrace script used to get this data:
self int handleCommand;
self int readyForQuery;
self char commandType;
self int commandTypeInitialized;
BEGIN { started = timestamp; }
/* First the extra probes in postgres.c are used to place context. Since, when DTrace is started it can be
there is not context! For this the charecter '-' is used
The commands are only timed after their start has been seen
*/
postgresql*:postgres::ready_for_query_start
{
self->readyForQuery = timestamp;
}
postgresql*:postgres::ready_for_query_end
/self->readyForQuery /
{
@count["readyForQuery", '-'] = count();
@elapsed["readyForQuery", '-'] = sum(timestamp - self->readyForQuery);
self->readyForQuery = 0;
}
postgresql*:postgres::ignore_till_sync
{
@count["ignore_till_sync", '-'] = count();
}
postgresql*:postgres::handle_command_start
{
self->commandTypeInitialized = 1;
}
postgresql*:postgres::handle_command_start
/arg0 == 0/
{
self->handleCommand = 0;
@count["zero",'-'] = count();
}
postgresql*:postgres::handle_command_start
/arg0 != 0/
{
self->handleCommand = timestamp;
self->commandType = (char)arg0;
}
postgresql*:postgres::handle_command_end
/self->handleCommand/
{
@count["handleCommand", arg0] = count();
@elapsed["handleCommand", arg0] = sum(timestamp - self->handleCommand);
self->handleCommand = 0;
self->commandType = '-';
}
/* Below the LWLock related probes are used to gather information
All LWLocks are divided: up till ID 25 (not inclusive) are
locks with a well known ID. Locks starting with ID 25 and up
are dynamic. Values are combined into grand totals for 25 and
up.
*/
postgresql*:postgres:LWLockAcquire:lwlock-acquire
{
self->ts[arg0,arg1]=timestamp;
self->mode[arg0] = arg1;
}
postgresql*:postgres:LWLockRelease:lwlock-release
/self->commandTypeInitialized && arg0 < 25 && self->ts[arg0,self->mode[arg0] ] /
{
this->mode = self->mode[arg0];
@lockheldCount[arg0,this->mode,self->commandType] = count();
@lockheldTime[arg0,this->mode,self->commandType] = sum (timestamp - self->ts[arg0,this->mode]);
self->ts[arg0,this->mode] = 0;
}
postgresql*:postgres:LWLockRelease:lwlock-release
/self->commandTypeInitialized && arg0 >= 25 && self->ts[arg0,self->mode[arg0]] /
{
this->mode = self->mode[arg0];
@lockheldCount[25,this->mode,self->commandType] = count();
@lockheldTime[25,this->mode,self->commandType] = sum (timestamp - self->ts[arg0,this->mode]);
self->ts[arg0,this->mode] = 0;
}
postgresql*:postgres:LWLockAcquire:lwlock-startwait
/arg0 < 25/
{
self->tsw[arg0,arg1] = timestamp;
@queueLengthStartwaitMax[ arg0,arg1 ] = max(arg2);
@queueLengthStartwaitAvg[ arg0,arg1 ] = avg(arg2);
@queueLengt2StartwaitMax[ arg0,arg1 ] = max(arg3);
@queueLengt2StartwaitAvg[ arg0,arg1 ] = avg(arg3);
}
postgresql*:postgres:LWLockAcquire:lwlock-startwait
/arg0 >= 25/
{
self->tsw[arg0,arg1] = timestamp;
@queueLengthStartwaitMax[25,arg1 ] = max(arg2);
@queueLengthStartwaitAvg[25,arg1 ] = avg(arg2);
@queueLengt2StartwaitMax[25,arg1 ] = max(arg3);
@queueLengt2StartwaitAvg[25,arg1 ] = avg(arg3);
}
postgresql*:postgres:LWLockAcquire:lwlock-endwait
/self->commandTypeInitialized && self->tsw[arg0,arg1] && (arg0 < 25)/
{
@lockwaitCount[arg0,arg1,self->commandType] = count();
@lockwaitTime[arg0,arg1,self->commandType] = sum(timestamp - self->tsw[arg0,arg1]);
@cycleCountMax[arg0, arg1] = max(arg2);
@cycleCountAvg[arg0, arg1] = avg(arg2);
self->tsw[arg0,arg1] = 0;
}
postgresql*:postgres:LWLockAcquire:lwlock-endwait
/self->commandTypeInitialized && self->tsw[arg0,arg1] && (arg0 >= 25)/
{
@lockwaitCount[25,arg1,self->commandType] = count();
@lockwaitTime[25,arg1,self->commandType] = sum(timestamp - self->tsw[arg0,arg1]);
self->tsw[arg0,arg1] = 0;
@cycleCountMax[25, arg1] = max(arg2);
@cycleCountAvg[25, arg1] = avg(arg2);
}
postgresql*:postgres:LWLockAcquire:lwlock-acquire-enter
{
self->acqtime[arg0,arg1] = timestamp;
}
postgresql*:postgres:LWLockAcquire:lwlock-acquire-return
/self->commandTypeInitialized && arg0 < 25 && self->acqtime[arg0,arg1]/
{
@lockAcquireCount[ arg0,arg1,self->commandType ] = count();
@lockAcquireTime[arg0,arg1,self->commandType ] = sum( timestamp - self->acqtime[arg0,arg1]);
self->acqtime[arg0,arg1] = 0;
}
postgresql*:postgres:LWLockAcquire:lwlock-acquire-return
/self->commandTypeInitialized && arg0 >= 25 && self->acqtime[arg0,arg1]/
{
@lockAcquireCount[ 25,arg1,self->commandType ] = count();
@lockAcquireTime[25,arg1,self->commandType ] = sum( timestamp - self->acqtime[arg0,arg1]);
self->acqtime[arg0,arg1] = 0;
}
postgresql*:postgres:LWLockRelease:lwlock-release-enter
/arg0 < 25/
{
self->acqtime[arg0,self->mode[arg0] ] = timestamp;
}
postgresql*:postgres:LWLockRelease:lwlock-release-enter
/arg0 >= 25/
{
self->acqtime[arg0,self->mode[arg0] ] = timestamp;
}
postgresql*:postgres:LWLockRelease:lwlock-release-return
/self->commandTypeInitialized && arg0 < 25 && self->acqtime[arg0, self->mode[arg0] ]/
{
this->mode = self->mode[arg0];
@lockReleaseCount[ arg0, this->mode,self->commandType ] = count();
@lockReleaseTime[arg0, this->mode,self->commandType ] = sum( timestamp - self->acqtime[arg0, this->mode]);
self->acqtime[arg0, this->mode] = 0;
self->mode[arg0] = 0;
}
postgresql*:postgres:LWLockRelease:lwlock-release-return
/self->commandTypeInitialized && arg0 >= 25 && self->acqtime[arg0, self->mode[arg0]]/
{
this->mode = self->mode[arg0];
@lockReleaseCount[25, this->mode,self->commandType] = count();
@lockReleaseTime[25, this->mode,self->commandType ] = sum( timestamp - self->acqtime[arg0, this->mode]);
self->acqtime[arg0, this->mode] = 0;
self->mode[arg0] = 0;
}
:::tick-60s { exit(0); }
:::END {
printf("WALL: %d millisec\n", ((timestamp - started)/1000000) );
printa("CH %2d %1d %2c %@15d\n",@lockheldCount);
normalize(@lockheldTime,1000000); printa("TH %2d %1d %2c %@15d\n",@lockheldTime);
printa("CA %2d %1d %2c %@15d\n", @lockAcquireCount);
normalize(@lockAcquireTime, 1000000); printa("TA %2d %1d %2c %@15d\n", @lockAcquireTime);
printa("CR %2d %1d %2c %@15d\n", @lockReleaseCount);
normalize(@lockReleaseTime, 1000000); printa("TR %2d %1d %2c %@15d\n", @lockReleaseTime);
printa("CW %2d %1d %2c %@15d\n",@lockwaitCount);
normalize(@lockwaitTime,1000000); printa("TW %2d %1d %2c %@15d\n",@lockwaitTime);
printa("QM %2d %1d %@10d\n", @queueLengthStartwaitMax);
printa("QA %2d %1d %@10d\n", @queueLengthStartwaitAvg);
printa("QM2 %2d %1d %@10d\n", @queueLengt2StartwaitMax);
printa("QA2 %2d %1d %@10d\n", @queueLengt2StartwaitAvg);
printa("CM %2d %1d %@10d\n", @cycleCountMax);
printa("CV %2d %1d %@10d\n", @cycleCountAvg);
normalize(@elapsed, 1000000);
printa("C %10s %2c %@10d\n", @count);
printa("T %10s %2c %@10d\n", @elapsed);
}
Posted at 09:02AM Mar 10, 2008 by paulvandenbogaard in PostgreSQL | Comments[0]