Monday April 27, 2009
pstime - a mash-up of ps(1) and ptime(1)
I have done some testing in the past where I needed to know the amount of CPU consumed by a process more accurately than I can get from the standard set of operating system utilities.
Recently I hit the same issue - I wanted to collect CPU consumption of mysqld.
To capture process CPU utilization over an interval on Solaris, about the best I can get is the output from a plain "prstat" command, which might look like:
    mashie ) prstat -c -p `pgrep mysqld` 5 2
    Please wait...
       PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
      7141 mysql     278M  208M cpu0    39    0   0:38:13  40% mysqld/45
    Total: 1 processes, 45 lwps, load averages: 0.63, 0.33, 0.18
       PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
      7141 mysql     278M  208M cpu1    32    0   0:38:18  41% mysqld/45
    Total: 1 processes, 45 lwps, load averages: 0.68, 0.34, 0.18
I am after data from the second sample only (I am still not sure exactly how prstat gets data for the first sample, which comes out almost instantaneously), so you can guess I will need some sed/perl that is a little more complicated than I would prefer.
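To give a feel for the post-processing involved, here is a minimal Python sketch that pulls the CPU figure from the last sample in `prstat -c` output. The function name `last_sample_cpu` is made up for this illustration, and the sketch assumes the column layout shown above (CPU in the ninth column, with a trailing percent sign).

```python
def last_sample_cpu(prstat_output: str, pid: int) -> float:
    """Return the CPU percentage from the final prstat sample for pid."""
    cpu = None
    for line in prstat_output.splitlines():
        fields = line.split()
        # Data rows start with the numeric PID; headers and totals do not.
        if fields and fields[0] == str(pid):
            # Assumed columns:
            # PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
            cpu = float(fields[8].rstrip("%"))  # later samples overwrite earlier ones
    if cpu is None:
        raise ValueError(f"no rows found for PID {pid}")
    return cpu


sample = """\
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  7141 mysql     278M  208M cpu0    39    0   0:38:13  40% mysqld/45
Total: 1 processes, 45 lwps, load averages: 0.63, 0.33, 0.18
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  7141 mysql     278M  208M cpu1    32    0   0:38:18  41% mysqld/45
Total: 1 processes, 45 lwps, load averages: 0.68, 0.34, 0.18
"""

print(last_sample_cpu(sample, 7141))  # 41.0
```

Even with the parsing done, the result is only as precise as the whole-percent figure prstat prints.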
pstime reads PROCFS (i.e., the virtualized file-system mounted on /proc) and captures CPU utilization figures for processes. It will report the %USR and %SYS either for a specific list of processes, or for every process running on the system (i.e., running at both sample points). The start sample time is recorded in high resolution at the time a process' data is captured, and then again after N seconds, where N is the first parameter supplied to pstime.
The default output of pstime is expressed as either a percentage of whole system CPU, or CPU seconds, with four significant digits. Solaris itself records the original figures in nanosecond resolution, although we do not expect today's hardware to be that accurate.
Here is an example:
    mashie ) pstime 10 `pgrep sysbench\|mysqld`
    UID     PID    %USR   %SYS   COMMAND
    mysql   7141   44.17  3.391  /u/dist/mysql60-debug/bin/mysqld --defaults-file=/et
    mysql   19870  2.517  2.490  sysbench --test=oltp --oltp-read-only=on --max-time=
    mysql   19869  0.000  0.000  /bin/sh -p ./run-sysbench
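The arithmetic behind figures like these can be sketched in a few lines of Python. This is not pstime's source: the `Sample` fields and the `cpu_percent` function are hypothetical names, and dividing by the CPU count is an assumption about what "percentage of whole system CPU" means. The inputs are two snapshots of a process's cumulative user and system CPU time, each stamped with the high-resolution time it was taken.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Sample:
    """One hypothetical snapshot of a process's cumulative CPU time."""
    timestamp_ns: int  # high-resolution time the snapshot was taken
    user_ns: int       # cumulative user-mode CPU time
    sys_ns: int        # cumulative system-mode CPU time


def cpu_percent(start: Sample, end: Sample, ncpus: int) -> Tuple[float, float]:
    """Return (%USR, %SYS) as a share of whole-system CPU over the interval."""
    elapsed = end.timestamp_ns - start.timestamp_ns
    if elapsed <= 0 or ncpus <= 0:
        raise ValueError("need a positive interval and CPU count")
    capacity = elapsed * ncpus  # CPU-nanoseconds available system-wide
    usr = 100.0 * (end.user_ns - start.user_ns) / capacity
    sys_ = 100.0 * (end.sys_ns - start.sys_ns) / capacity
    return usr, sys_


# Illustrative numbers only: 8 s user and 1 s system over 10 s on 2 CPUs.
first = Sample(timestamp_ns=0, user_ns=0, sys_ns=0)
second = Sample(timestamp_ns=10_000_000_000,
                user_ns=8_000_000_000, sys_ns=1_000_000_000)
print(cpu_percent(first, second, ncpus=2))  # (40.0, 5.0)
```

Using each process's own timestamps, rather than the nominal N seconds, keeps the percentages honest when collecting the samples itself takes measurable time.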
Posted at 01:25PM Apr 27, 2009 by timc in Performance | Comments[3]
Tuesday April 21, 2009
Expanding Google's InnoDB Synchronization Improvements to Solaris
There is much excitement today at the launch of MySQL 5.4, so I will relate my story about a project I contributed to this new version.
When we started looking at performance improvements for MySQL, we were interested in "low hanging fruit", or fixes and changes that could reap measurable benefits for users in the short term.
An obvious candidate at that time was the now well-known Google SMP patch. I had seen Mark Callaghan present on this at the MySQL User Conference in 2008, and was interested to investigate.
I was pretty new to InnoDB at that time, and soon discovered that InnoDB might be suffering poor scalability around its mutexes and read-write locks. InnoDB had a private implementation of adaptive mutexes and read-write locks, and this was probably not the best implementation on all, or even most, of the platforms MySQL is available on.
Now InnoDB's "private" mutexes and rw-locks were a good way to get spin-locks on all platforms, which may be a win in many cases, but as the Google team had demonstrated, it could be improved on. Indeed, I knew that adaptive spin-locks are available on Solaris, and they offer an extra advantage - if the holder of a lock is found to be off CPU, we don't bother spinning, but instead put the thread wanting the lock straight to sleep.
So, I decided to undertake a couple of performance studies of InnoDB's locking, being:
The second step turned out to be quite complicated. I could not even change all of InnoDB's RW-locks to POSIX ones, as the InnoDB synchronization objects offer functionality not available via POSIX. It also meant we would be diverging more significantly from the InnoDB in 5.1, so this option - although looking promising - was shelved.
This left the Google SMP patch. It also looked promising. It was a less dramatic change, and offered scaling benefits in all the testing I did.
There was one last snag though - the mutex and RW-lock improvements in the Google SMP patch would only be applied if you were building on x86/x64 with GCC 4.1 or later, as they relied on GCC's atomic built-ins.
You can consider that we have a two-dimensional matrix of platforms that MySQL supports, with the compiler on one axis and the operating system on the other. To make a feature portable across this matrix, you need to find a portable API, write code that is portable, or write code that uses a choice of different portable APIs depending on what is available.
Now we definitely wanted to get a similar benefit for InnoDB on SPARC, and not necessarily just with GCC. In any case, GCC did not offer all of the built-in atomics for SPARC at the time. Happily, there are atomic functions available in Solaris that fit the job fine. MySQL 5.4 uses the functions if you build on Solaris without a version of GCC that supports built-in atomics.
Just so you understand though, here is (a simplified version of) what happens when you build MySQL 5.4 on your chosen platform with your chosen compiler:
use GCC built-in atomics
use atomic functions
use traditional InnoDB synchronization objects, based on pthread_mutex*.
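The order of preference described above can be written as a small decision function. This is only a sketch in Python: the real choice is made when MySQL is built, not at run time, and the function name, its parameters, and the returned strings are invented for illustration.

```python
def choose_sync_primitives(has_gcc_atomic_builtins: bool, os_name: str) -> str:
    """Pick a synchronization strategy in the order of preference described."""
    if has_gcc_atomic_builtins:
        # A GCC that supports the needed built-in atomics wins on any OS.
        return "GCC built-in atomics"
    if os_name == "Solaris":
        # Otherwise, on Solaris, fall back to the OS-provided atomic functions.
        return "Solaris atomic functions"
    # Anywhere else, keep the traditional objects based on pthread_mutex*.
    return "traditional InnoDB synchronization objects"


print(choose_sync_primitives(False, "Solaris"))  # Solaris atomic functions
```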
As Neel points out in his blog, it was an exercise we learnt something from, even if we did develop functionality that will not be used. The important thing is we know we have improved the performance of MySQL, by extending the Google SMP improvements to all Solaris users, regardless of chosen compiler.
Posted at 09:15AM Apr 21, 2009 by timc in MySQL |
Thursday April 09, 2009
Testing the New Pool-of-Threads Scheduler in MySQL 6.0, Part 2
In my last blog, I introduced my investigation of the "Pool-of-Threads" scheduler in MySQL 6.0. Read on to see where I went next.
I now want to take a different approach to comparing the two schedulers. It is one thing to compare how the schedulers work "flat out" - with a transaction request rate that is limited only by the maximum throughput of the system under test. I would like to instead look at how the two schedulers compare when I drive mysqld at a consistent transaction rate, then vary only the number of connections over which the transaction requests are arriving. I will aim to come up with a transaction rate that sees CPU utilization somewhere in the 40-60% range.
This is more like how real businesses use MySQL every day, as opposed to the type of benchmarking that computer companies usually engage in. This will also allow me to look at how the schedulers run at much higher connection counts - which is where the pool-of-threads scheduler is supposed to shine.
Now, I will let you all know that I first conducted my experiments with mysqld and the load generator (sysbench) on the same system. I was again not sure this would be the best methodology, primarily because I would end up having one operating system instance scheduling, in some cases, a very large number of sysbench threads along with the mysqld threads.
It turned out the results from this mode threw up some issues (like not being able to get my desired throughput with 2048 connections in pool-of-threads mode), so I repeated my experiments - the second set of results have the load generation coming from two remote systems, each with a dedicated 1 Gbit ethernet link to the DB server.
The CPU utilization I have captured was just the %USR plus %SYS for the mysqld process. This makes the two sets of metrics comparable.
Here are my results. First for experiments where sysbench ran on the same host as mysqld:
Then for experiments where sysbench ran on two remote hosts, each with a dedicated Gigabit Ethernet link to the database server:
As you can see, the pool-of-threads model does incur an overhead, both in terms of CPU consumption and response time, at low connection counts. As hoped though, the advantage swings in pool-of-threads' favour at higher connection counts. This is particularly noticeable in the case where our clients are remote. It is arguable that an architecture involving many hundreds or thousands of client connections is more likely to have those clients located remote from the DB server.
Now, the first issue I have is that while pool-of-threads starts to win on response time, the response time is still increasing in a similar fashion to thread-per-connection's response time (note - the scale is logarithmic). This is not what I expected, so we have a scalability problem in there somewhere.
The second issue is where I have to confess - I only got one "lucky" run where my target transaction rate was achieved for pool-of-threads at 2048 connections. For many other runs, the target rate could not be achieved, as these raw numbers show:
| connections | tps | mysqld %usr | mysqld %sys | mysqld %cpu | avg-resp | 95%-resp |
|---|---|---|---|---|---|---|
| 2048 | 962.22 | 25.23 | 14.93 | 40.16 | 1943.78 | 2368.78 |
| 2048 | 1197.00 | 30.59 | 11.20 | 41.79 | 317.98 | 435.19 |
| 2048 | 836.50 | 21.98 | 11.09 | 33.07 | 2259.36 | 2287.03 |
| 2048 | 963.00 | 26.49 | 12.07 | 38.56 | 1333.67 | 1128.93 |
| 2048 | 992.25 | 25.81 | 15.08 | 40.89 | 1851.17 | 2280.50 |
| 2048 | 915.71 | 24.16 | 15.05 | 39.21 | 2220.45 | 2342.06 |
| 2048 | 919.54 | 24.25 | 15.05 | 39.30 | 2210.95 | 2331.45 |
| 2048 | 917.09 | 24.15 | 15.05 | 39.20 | 2217.86 | 2321.40 |
| 2048 | 875.09 | 23.20 | 13.29 | 36.49 | 2188.69 | 2344.91 |
| 2048 | 1180.62 | 31.35 | 14.57 | 45.92 | 1439.96 | 1772.86 |
| 2048 | 1185.80 | 30.74 | 14.24 | 44.98 | 1185.71 | 1814.24 |
| 2048 | 1146.90 | 30.34 | 15.23 | 45.57 | 1602.85 | 1842.14 |
| 2048 | 1141.47 | 30.20 | 15.22 | 45.42 | 1612.34 | 1873.95 |
| 2048 | 1158.74 | 30.47 | 12.99 | 43.46 | 999.76 | 1870.35 |
| 2048 | 1177.59 | 30.67 | 14.97 | 45.64 | 1403.22 | 1838.84 |
This indicates we have some sort of bottleneck right at or around the 2048 thread point. This is not what we want with pool-of-threads, so I will continue my investigation.
Posted at 10:37AM Apr 09, 2009 by timc in MySQL | Comments[3]
Wednesday April 08, 2009
Testing the New Pool-of-Threads Scheduler in MySQL 6.0
I have recently been investigating a new feature of MySQL 6.0 - the "Pool-of-Threads" scheduler. This feature is a fairly significant change to the way MySQL completes tasks given to it by database clients.
To begin with, be advised that the MySQL database is implemented as a single multi-threaded process. The conventional threading model is that there are a number of "internal" threads doing administrative work (including accepting connections from clients wanting to connect to the database), then one thread for each database connection. That thread is responsible for all communication with that database client connection, and performs the bulk of database operations on behalf of the client.
This architecture exists in other RDBMS implementations. Another common implementation is a collection of processes all cooperating via a region of shared memory, usually with semaphores or other synchronization objects located in that shared memory.
The creation and management of threads can be said to be cheap, in a relative sense - it is usually significantly cheaper to create or destroy a thread rather than a process. However these overheads do not come for free. Also, the operations involved in scheduling a thread as opposed to a process are not significantly different. A single operating system instance scheduling several thousand threads on and off the CPUs is not much less work than one scheduling several thousand processes doing the same work.
The theory behind the Pool-of-Threads scheduler is to provide an operating mode which supports a large number of clients that will be maintaining their connections to the database, but will not be sending a constant stream of requests to the database. To support this, the database will maintain a (relatively) small pool of worker threads that take a single request from a client, complete the request, return the results, then return to the pool and wait for another request, which can come from any client. The database's internal threads still exist and operate in the same manner.
In theory, this should mean less work for the operating system to schedule threads that want CPU. On the other hand, it should mean some more overhead for the database, as each worker thread needs to restore the context of a database connection prior to working on each client request.
A smaller pool of threads should also consume less memory, as each thread requires a minimum amount of memory for a thread stack, before we add what is needed to store things like a connection context, or working space to process a request.
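The pool-of-threads idea described above can be illustrated with a toy Python sketch. It is not MySQL's scheduler: `run_pool`, the shape of the per-connection context, and the stand-in "work" (upper-casing the request text) are all invented for this example. What it does show is a fixed number of workers serving requests from any connection, each restoring that connection's context before doing the work.

```python
import queue
import threading


def run_pool(requests, num_workers):
    """Serve (connection_id, payload) requests with a fixed pool of workers."""
    work = queue.Queue()
    contexts = {}            # hypothetical per-connection state
    results = []
    lock = threading.Lock()  # protects contexts and results

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: shut this worker down
                break
            conn_id, payload = item
            with lock:
                # Restore (or create) the connection's context before working.
                ctx = contexts.setdefault(conn_id, {"requests_served": 0})
                ctx["requests_served"] += 1
            reply = payload.upper()  # stand-in for executing the request
            with lock:
                results.append((conn_id, reply))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for req in requests:
        work.put(req)
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return contexts, results


# 100 connections, two requests each, served by only 4 worker threads.
reqs = [(conn, "select 1") for conn in range(100)] * 2
contexts, results = run_pool(reqs, num_workers=4)
print(len(contexts), len(results))  # 100 200
```

The thread-per-connection equivalent would start 100 threads here, one per connection, each holding its context for the life of the connection.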
You can read more about the different threading models in the MySQL 6.0 Reference Manual.
Mark Callaghan of Google has recently had a look at whether this theory holds true. He has published his results under "No new global mutexes! (and how to make the thread/connection pool work)". Mark has identified (via this bug he logged) that the overhead for using Pool-of-Threads seems quite large - up to 63 percent.
So, my first task is to see if I get the same results. I will note here that I am using Solaris, whereas Mark was no doubt using a Linux distro. We probably have different hardware as well (although both are Intel x86).
Here is what I found when running sysbench read-only (with the sysbench clients on the same host). The "conventional" scheduler inside MySQL is known as the "Thread-per-Connection" scheduler, by the way.

This is in contrast to Mark's results - I am only seeing a loss in throughput of up to 30%.
These results do show there is a definite reduction in maximum throughput if you use the pool-of-threads scheduler.
I believe it is worth looking at the bigger picture however. To do this, I am going to add in two more test cases:
What I want to see is what sort of impact the pool-of-threads scheduler has for a workload that I expect is still the more common one - where our database server is on a dedicated host, accessed via a network.


As you can see, the impact on throughput is far less significant when the client and server are separated by a network. This is because we have introduced network latency as a component of each transaction and increased the amount of work the server and client need to do - they now need to perform ethernet driver, IP and TCP tasks.
This reduces the relative overhead - in CPU consumed and latency - introduced by pool-of-threads.
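A little arithmetic makes the point. The numbers below are invented purely for illustration, not measurements from these tests: suppose a transaction costs 1.0 ms with thread-per-connection, pool-of-threads adds a fixed 0.3 ms, and a network round trip adds a further 2.0 ms in the remote case.

```python
def relative_overhead(base_ms: float, overhead_ms: float,
                      network_ms: float = 0.0) -> float:
    """Fraction by which a fixed per-transaction overhead lengthens a transaction."""
    return overhead_ms / (base_ms + network_ms)


# Hypothetical figures, chosen only to show the effect.
local = relative_overhead(base_ms=1.0, overhead_ms=0.3)
remote = relative_overhead(base_ms=1.0, overhead_ms=0.3, network_ms=2.0)
print(f"local: {local:.0%}, remote: {remote:.0%}")  # local: 30%, remote: 10%
```

The overhead itself is unchanged. It simply becomes a smaller share of a longer transaction.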
This is a reminder that if you are conducting performance tests on a system prior to implementing or modifying your architecture, you would do well to choose a test architecture and workload that is as close as possible to the one you intend to deploy. The same is true if you are trying to extrapolate performance testing someone else has done to your own architecture.
On the other hand, if you are a developer or performance engineer conducting testing in order to test a specific feature or code change, a micro-benchmark or simplified test is more likely to be what you need. Indeed, Mark's use of the "blackhole" storage engine is a good idea to eliminate that processing from each transaction.
In this scenario, if you fail to make the portion of the software you have modified a significant part of the work being done, you run the risk of seeing performance results that are not significantly different, which may lead you to assume your change has negligible impact.
In my next posting, I will compare the two schedulers using a different perspective.
Posted at 10:26AM Apr 08, 2009 by timc in MySQL | Comments[2]
Monday April 06, 2009
New Feature for Sysbench - Generate Transactions at a Steady Rate
Perhaps I am becoming a regular patcher of sysbench...
I have developed a new feature for sysbench - the ability to generate transactions at a steady rate determined by the user.
This mode is enabled using the following two new options:

My need for these options is simple - I want to generate a steady load for my MySQL database. It is one thing to measure the maximum achievable throughput as you change your database configuration, hardware, or num-threads. I am also interested in how the system (or just mysqld's) utilization changes, at the same transaction rate, when I change other variables.
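One common way to generate load at a steady rate is to pace each transaction against an absolute deadline, as in the Python sketch below. This is not the sysbench patch itself: the function and parameter names are hypothetical, and the injectable `clock` and `sleep` arguments exist only to make the pacing easy to test.

```python
import time


def run_at_steady_rate(transaction, rate_per_sec, count,
                       clock=time.monotonic, sleep=time.sleep):
    """Call transaction() count times, paced at rate_per_sec starts per second."""
    if rate_per_sec <= 0:
        raise ValueError("rate_per_sec must be positive")
    interval = 1.0 / rate_per_sec
    start = clock()
    for n in range(count):
        # Schedule against absolute deadlines so small delays do not accumulate.
        deadline = start + n * interval
        delay = deadline - clock()
        if delay > 0:
            sleep(delay)
        # If we are already late, run immediately and try to catch up.
        transaction()


# Ten no-op "transactions" at 200 per second (about 45 ms in total).
run_at_steady_rate(lambda: None, rate_per_sec=200, count=10)
```

If a transaction takes longer than the interval, this loop falls behind the target rate, which matches the situation where a target rate cannot be achieved by the system under test.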
An upcoming post will demonstrate a use of sysbench in this mode.
For the moment my new feature can be added to sysbench 0.4.12 (and probably many earlier versions) via this patch. These changes are tested on Solaris, but I did choose only APIs that are documented as also available on Linux. I have also posted my patch on sourceforge as a sysbench feature enhancement request.
Posted at 03:42PM Apr 06, 2009 by timc in Performance | Comments[2]