Performance Analysis Richard Smith's Ruminations

Wednesday Apr 22, 2009

Professor Julius Sumner Miller on his popular TV show became known for his stock phrase, "Why is it so?"  Quite a few of the emails on the aliases I subscribe to end up asking that question, albeit about computer system performance rather than Physics.

A couple of weeks ago I was asked to investigate a possible MySQL scalability issue with the just-announced 5.4 release, as a result of feedback from a customer who had tried it out and compared it with 5.1.32. Their workload simulated the concurrent creation of customers in a database. This was done via a stored procedure that generated random data for the set of records that represented a customer in their database. The stored procedure was invoked by N threads concurrently, and they were interested in throughput as a function of N on a 64-thread Sun SPARC Enterprise T5220 Server.

In the results that I saw, which was based on the peak 1 second from a run lasting roughly 1 hour, the customer got slightly better throughput from 5.1.32 than 5.4 Beta when N=800. Now inserting records is close to a cpu-bound activity, but not perfectly so. Depending on mysqld options used, there is a non-trivial amount of serialization involved. When I ran their workload in the range of 1 to 32 threads, it soon stopped scaling. Within this range though, in every case, I got better throughput with 5.4 than 5.1.33 (the version I used for comparison).

On the whole, it didn't seem sensible to focus on the case N=800, which goes way beyond the number of virtual cpus available [64 threads across 8 cores]. Overcommiting a cpu runs the risk of threads getting preempted while holding a lock, or thrashing a shared resource such as a cache or common data structure. Solaris does have an API for providing hints to the scheduler about critical periods when preemption would be a bad idea [details can be found in the schedctl_init(3SCHED) manpage], but MySQL currently doesn't use it.

Having established, at least to my satisfaction, that 5.4 was scaling better than 5.1.33 for N in the range 1 to 32, I was still keen to identify what the bottlenecks were. That's a useful first step when trying to improve scalability! Returning to the customer workload, it also came with a my.cnf file containing a set of configuration variables which I used as the basis for my testing. The config file didn't specify innodb_thread_concurrency, which had its default changed with 5.4. This wasn't so important for an insert-intensive workload, but clearly could be a bottleneck for a read-intensive workload. Several parameters are known to have significant performance implications (such as log-bin, innodb_support_xa, innodb_flush_log_at_trx_commit). The choices one might make for a replication environment, where each slave has a single thread applying updates, are quite different to those required for an update-intensive central database. In what follows, I changed the values of some parameters from what the customer had specified, to improve scalability.

My real motivation though in writing this blog entry is to discuss the use of a technique that I have found very useful when investigating scalability, such as the situation described above, which involves the combination of Sun Studio Performance Analyzer and Solaris' microstate accounting.

A natural question to ask about a single thread is where does it spend its time? The Solaris kernel conveniently records for each thread the amount of time it spends in a set of microstates, as documented in the header file /usr/include/sys/msacct.h.

/* LWP microstates */
#define LMS_USER	0	/* running in user mode */
#define LMS_SYSTEM	1	/* running in sys call or page fault */
#define LMS_TRAP	2	/* running in other trap */
#define LMS_TFAULT	3	/* asleep in user text page fault */
#define LMS_DFAULT	4	/* asleep in user data page fault */
#define LMS_KFAULT	5	/* asleep in kernel page fault */
#define LMS_USER_LOCK	6	/* asleep waiting for user-mode lock */
#define LMS_SLEEP	7	/* asleep for any other reason */
#define LMS_WAIT_CPU	8	/* waiting for CPU (latency) */
#define LMS_STOPPED	9	/* stopped (/proc, jobcontrol, lwp_suspend) */

In addition to user cpu time, which Linux users will be familiar with it, it provides separate per-thread buckets for time spent waiting to be dispatched, time spent asleep waiting for a lock or on a condition variable, system call execution time, and several others. This means that all of a thread's time can be accounted for.

Via the /proc pseudo-filesystem, snapshots of resource usage can be taken, which is what underpins a particularly useful command, prstat -mL, for looking at which microstates threads are spending their time in. pstack(1) can be used to print a hex+symbolic stack trace for  a process or set of LWPs within a process to get snapshots of which functions threads are spending their time. Neither of these though give a complete picture, as you really want to combine the two types of information, and ultimately map the resources back to the source code. This is why I make extensive use of Performance Analyzer.

Using Performance Analyzer involves a few steps. Firstly you get much better information if the executable has been compiled with debugging information, for example -g. When investigating the customer scalability issue mentioned above, I downloaded the official 5.1.33 binary, but also compiled my own build from source, with debugging information to assist analysis.

Then a profiling experiment needs to be run, to collect profiling data. When profiling mysqld, I generally start it via the collect(1) command and use the -y USR2 option to  start it with profiling suspended but able to be toggled on and off via USR2 signal. This provides an opportunity to run something to warm up caches if desired before starting to profile.

collect -y USR2 $MYSQL_HOME/libexec/mysqld ...

In my workload generation script I toggle profiling on and off at the start and end:

mpid=`pgrep -x mysqld`
kill -USR2 $mpid
workload
kill -USR2 $mpid

This should result in an experiment directory being created containing all the profiling data. The final step is to analyze the profile data. For this, either the analyzer(1) GUI or er_print(1) command can be used. I'm often working remotely from benchmark machines and tend to use er_print, but do occasionally fire up analyzer via vnc. The GUI makes various aspects of the analysis less tedious and I would encourage its use for people less experienced with the tools.

mysqld is a multithreaded program that creates one thread for each user connection, as well as having a number of background threads. Several of the background threads spend most of their time asleep on condition variables, waiting for work. I'm mostly interested in where the foreground threads that actually process SQL statements spend their time, so I filter out the background threads (they generally have the lowest thread ids):

$ er_print test.9.er
(er_print) threads
Objects sorted by metric: Exclusive User CPU Time

Excl.      Name  
User CPU         
   sec.     
228.860    
 29.000    Thread 27
 28.970    Thread 28
 28.610    Thread 30
 28.370    Thread 29
 28.160    Thread 33
 28.090    Thread 31
 27.910    Thread 34
 27.689    Thread 32
  1.081    Thread 22
  0.801    Thread 21
...

(er_print) thread_sel 27-34
Exp Sel   Total 
=== ===== ===== 
  1 27-34    34 

Serial bottlenecks tend to be associated with synchronization primitives like mutexes and condition variables. If a thread makes a number of attempts to get a resource [spin] but fails each time, eventually it gets put to sleep [wait]. The time spent asleep is recorded as User Lock time. The list of metrics that er_print displays can be expanded from the default of User CPU time via its metrics command:

(er_print) metrics e.user:i.user:e.lock:i.lock
Current metrics: e.user:i.user:e.lock:i.lock:name
Current Sort Metric: Exclusive User CPU Time ( e.user )

A summary of resources by function can be obtained, here sorted by descending Exclusive User Lock time, which includes a couple of pseudo-functions. <Total> aggregates resources from all functions, which the [potentially long] list of real functions below should add up to. Exclusive resources are those used by a function itself, whereas Inclusive resources are those used by a function and all the children beneath it in the call stack.

(er_print) sort e.lock
Current Sort Metric: Exclusive User Lock Time ( e.lock )
(er_print) limit 30
(er_print) functions
Functions sorted by metric: Exclusive User Lock Time

Excl.     Incl.     Excl.      Incl.       Name  
User CPU  User CPU  User Lock  User Lock         
   sec.      sec.      sec.       sec.      
226.799   226.799   533.123    533.123     <Total>
  2.001     2.001   533.123    533.123     __lwp_park
  4.713     4.713     0.         0.        <Unknown>
...

Extremely few functions are ever likely to have Exclusive User Lock time, because that is handled at a very low level underneath the implementation of synchronization primitives provided by libc.so.1. The application functions associated with synchronization primitive bottlenecks are higher up the stack. They can be found by following the chain of function calls up the stack via csingle command. Experience shows that these are on or around the callers of functions like pthread_mutex_lock and pthread_cond_wait. In the extract below, the function of interest is highlighted with an asterisk [*], surrounded by the immediate parents and children of that function in the call tree.

(er_print) csingle pthread_cond_wait
Callers and callees sorted by metric: Attributed User CPU Time

Attr.     Excl.     Incl.     Attr.      Excl.      Incl.       Name  
User CPU  User CPU  User CPU  User Lock  User Lock  User Lock         
 sec.      sec.      sec.        sec.     sec.         sec.      
3.663     0.190     4.173     533.123    0.         533.123     os_event_wait_low
0.190     0.190     3.663       0.       0.         533.123    *pthread_cond_wait
3.472     0.160     3.472     533.123    0.         533.123     cond_wait

(er_print) csingle os_event_wait_low
Callers and callees sorted by metric: Attributed User CPU Time

Attr.     Excl.     Incl.     Attr.      Excl.      Incl.       Name  
User CPU  User CPU  User CPU  User Lock  User Lock  User Lock         
 sec.      sec.       sec.       sec.     sec.         sec.      
2.552     0.420     30.191    249.314    0.         249.545     log_write_up_to
1.561     0.500     37.897    273.201    0.         527.889     trx_commit_off_kernel
0.060     0.         0.060     10.607    0.          10.607     sync_array_wait_event
0.190     0.190      4.173      0.       0.         533.123    *os_event_wait_low
3.663     0.190      3.663    533.123    0.         533.123     pthread_cond_wait
0.170     2.172      3.763      0.       0.           0.        mutex_unlock
0.110     2.952      3.252      0.       0.           0.        mutex_lock_impl
0.040     0.280      0.280      0.       0.           0.        mutex_lock

In this case an application function log_write_up_to makes use of an Innodb utility function os_event_wait_low to wait on a condition variable,  which appears to be a bottleneck. A source listing of log_write_up_to identifies the statement responsible, highlighted by a double hash [##].

Source file: .../storage/innobase/log/log0log.c
Object file: test.9.er/archives/mysqld_4f2PsJ6tDrc
Load Object: test.9.er/archives/mysqld_4f2PsJ6tDrc

   Excl.     Incl.     Excl.      Incl.
   User CPU  User CPU  User Lock  User Lock
    sec.      sec.      sec.         sec.
...
## 0.020     2.572     0.         249.314         1370.   os_event_wait(log_sys->no_flush_event);

With a little practice, following these steps is actually a fairly quick process--it takes much longer to write it all up! And to be fair, the job is only part-done. To improve scalability, it would be necessary to identify where in the code the corresponding signal is issued, determine what takes time between the wait and corresponding signal, and see if that critical path can be shortened. In this particular case, its related to the batching of log entries and flushing them to the Innodb log file. Using the fastest possible storage for logs is a partial answer.

One issue to be aware of when profiling is that under some circumstances, Performance Analyzer may not be able to map back from instruction addresses to source lines. Some gcc compilers I've tried for example appear not to generate usable PC->linenumber mappings in .debug_line sections. For those with sunsolve accounts, CRs 6726134, 6776962, and 6831881 address some of these issues, although its not clear to me whether the problems are in the generation of the mappings, or their interpretation.

Innodb makes extensive use of the preprocessor and macros, which when combined with the inlining of functions, can make the mapping back from instructions to source code quite tricky. On occasion, particularly when dealing with an unfamiliar part of the source tree, I've found it useful to create a less-optimized version, with inlining suppressed, when trying to understand the control flow.

Although not illustrated here, for everything I did with the command-line tool er_print, there is an equivalent in analyzer GUI. Being able to point and click in the GUI makes moving up and down the callstack less tedious. It also has a graphical timeline view that assists the human eye in detecting interesting or unusual patterns in callstacks over time. More information about er_print and analyzer can be found in Sun Studio 12: Performance Analyzer manual.

Finally, perhaps as a teaser, here's an example timeline from a different mysqld profile--one that shows an immediate and obvious pattern. It shows a bunch of threads, but by no means all, periodically waiting 400ms for commands from remote clients. This is for a workload where once a command finishes, the next one should immediately be sent to mysqld. This clearly calls for further investigation.

Mysql_pauses_timeline.png

Wednesday Apr 15, 2009

Ever more powerful hardware can stress software and lead to bottlenecks not previously seen. I was working recently on Olio scalability where the challenge was to see how many active users our configuration could support while meeting specified response time targets. In our setup, an active user on average did one operation every 5 seconds, and an operation required on average 7 web transactions. Supporting 10000 users across two X2270 web servers therefore involved a transaction rate of about 7200 transactions per second per server. The web tier contained PHP application code that in turn accessed memcached, NFS-mounted filestore, and backend replicated MySQL database.

Needless to say, the initial configuration failed to support 10000 users. The history of the exercise then became one of iterating through a process of identifying bottlenecks, eliminating or working around them, and moving on to discover the next bottleneck. There will always be at least one bottleneck of course, but artificial bottlenecks in software that prevent the underlying hardware from being fully utilized are especially to be avoided.

One of the things we did in the course of a run is use Solaris' lockstat(1M) command to collect information about hot locks in the kernel. Solaris uses several different kinds of lock: locks where it is acceptable to be put to sleep, locks where you are only permitted to spin if an attempt to get the lock fails, Read-Write locks that can allow multiple concurrent readers but only one writer at  a time, etc. In our case, contention for Read-Write locks dominated. Here's an example:

R/W writer blocked by writer: 604708 events in 518.141 seconds (1167 events/sec)

Count  indv cuml rcnt     nsec Lock                      Caller
-------------------------------------------------------------------------------
604575 100% 100% 0.00  1376373 0xffffff0e9e1893c8        ufs_rwlock+0xf6
    54   0% 100% 0.00    26305 0xffffff0d38a73630        tmp_remove+0x70
    51   0% 100% 0.00    32393 0xffffff0d38a73630        tmp_create+0x126
    15   0% 100% 0.00    31474 0xffffff0d38a73630        tmp_rename+0x1b4

The output shows a major bottleneck on a particular ufs rwlock. But what file is that lock associated with? Via mdb(1), the Lock address can be used to find the filename. The Caller gives a strong clue that the lock is ufs-related, and mdb's ::whatis dcmd can be used to identify it:

> 0xffffff0e9e1893c8::whatis
ffffff0e9e1893c8 is ffffff0e9e1892f0+d8, allocated from ufs_inode_cache

Again we have a strong clue that the lock is associated with an inode. Proceeding on the assumption that we're dealing with the i_rwlock field of an inode, a sanity check can be performed to verify that the offset of i_rwlock in inode_t is 0xd8, which matches the offset in the ::whatis output. [I admit, I did also look at kernel source to get these fieldnames.]

> ::offsetof inode_t i_rwlock
offsetof (inode_t, i_rwlock) = 0xd8

The next step is to get the vnode corresponding to the inode, and from that get the path of the file it represents. The inode's base address was infact printed by the ::whatis dcmd, but it could also be computed by subtracting i_rwlock offset from the lock address. Either works. From there, just print the value of i_vnode, then use the ::vnode2path dcmd.

> 0xffffff0e9e1893c8-0xd8::print inode_t i_vnode
i_vnode = 0xffffff0df3646040
> ffffff0e9e1892f0::print inode_t i_vnode
i_vnode = 0xffffff0df3646040
> 0xffffff0df3646040::vnode2path
/var/apache2/2.2/logs/access_log

Aha! Apache's access_log was a bottleneck. This probably should not be surprising given the rate at which transactions  were being logged from 5000+ processes on a single server. Although by this point we had reached our target of 10000 users, we did test what happens when the log was put on /tmp, which was to improve response times further. It does however point to the need for a new, more scalable, logging implementation for Apache. At a minimum, high-activity logs need to go on the fastest available storage.

Speaking of /tmp, the original lock output shows some rwlock activity associated with /tmp as well (as implied by the Callers). And yes, even before we got to the ufs lock issue there was a bigger bottleneck on /tmp, one that was stopping us reaching our target. While the mechanics differ in detail (different structure, fieldname, offset), mdb can be used to identify the files responsible for /tmp contention too:

> 0xffffff0d38a73630::whatis
ffffff0d38a73630 is ffffff0d38a73570+c0, allocated from kmem_alloc_224
> ::offsetof struct tmpnode tn_rwlock
offsetof (struct tmpnode, tn_rwlock) = 0xc0
> ffffff0d38a73570::print struct tmpnode tn_vnode
tn_vnode = 0xffffff0d38c9e480
> 0xffffff0d38c9e480::vnode2path
/tmp

The PHP application code was saving session state for each session in a file in /tmp, creating a bottleneck on the rwlock associated with the /tmp directory itself. The workaround for this was to edit php.ini and tell PHP to distribute its session state files across multiple subdirectories. This is done via session.save_path:

< session.save_path = "/tmp"
---
> session.save_path = "1;/tmp/http_sessions"

With the /tmp change in place and other tuning of caching layer and MySQL, Apache scaled to the level we needed to support 10000 active Olio users across two X2270 servers. Access_log is important for Service Level Management amongst other things, and while our response times were being met with it on JBOD, rwlock contention indicates a more scalable solution will be required in the future. /tmp can provide some relief albeit at the risk of being on volatile storage.


           

Thursday Nov 06, 2008

A colleague recently asked for some assistance with analyzing and tuning the performance of Olio on a 1.4GHz T5220 CMT server. The main symptoms were that its cpu consumption varied considerably over the course of a run, and that response time targets weren't being met, even at only moderate load.

Olio is a web2.0 toolkit designed to help evāžaluate the suitability, functionality and performance of various web technologies. The deployment I was looking at involved several components:

  • Faban driver on a separate server, generating web requests on average every 5 second per simulated user
  • Apache 2.2.9 from CoolStack 1.3.1
  • PHP 5.2.6 also from CoolStack
  • MySQL, initially 5.1.25 from CoolStack, later switched to 5.1.30 that I compiled myself

Apache/PHP and MySQL were both running on the same T5220, and a processor set had been created for MySQL.

Since Apache was involved, one of the things that I did is edit httpd.conf to create a CustomLog format for its access log.

    LogFormat "%h %l %u %t \"%r\" %>s %b %{pid}P %X %D" rscommon
    CustomLog logs/access_log rscommon

This added process ID (pid), KeepAlive status, and response time (in microseconds) to each log record. Running an experiment for a very small number of users then showed which web queries were most common and which were inherently much slower than the rest. Temporarily for this purpose I set certain KeepAlive parameters to make it easier to follow the sequence of requests from a particular user, because they'd be serviced by the same process (when using mpm_worker and configuring sufficent processes).

KeepAlive on
KeepAliveTimeout 60
MaxKeepAliveRequests 0

For 10 users over 360 seconds, Faban reported 700 Operations, or 1.944 ops/sec. However the access log showed 5579 web requests, so clearly one or more Operations involved many web requests. The access_log confirmed that there were several inherently slow request types which were also quite common.

 210 GET /events.php?socialEventID=...
 286 GET /taggedEvents.php?tag=...
 286 POST /taggedEvents.php
 316 GET /index.php
3396 GET /fileService.php?cache=false&file=...

I probably wouldn't have chosen a quasi-open queuing system workload myself for performance tuning: a closed queuing system with zero think time makes it easier to identify bottlenecks. Still, having a list of candidate web queries meant that I could quickly create a simpler workload that exercised just Apache or Apache+PHP. For that I used the Apache utility ab, and created two simple pages, one static and one dynamic via phpinfo().

The motivation for this is to verify what throughput can be achieved for relatively lightweight requests. As will be seen below, there is a trap that involves a nasty interaction between tcp_time_wait_interval, KeepAlive, and the characteristics of this workload.

If KeepAlive is not used then each request creates a new connection, which gets closed at the end of the request. TCP requires a delay, which on Solaris is specified by tcp_time_wait_interval, before the port number can be reused by another connection. There are only 64k different port numbers, so once they've all been used, then a new connection has to wait for a port to become available due to expiry of tcp_time_wait_interval for a previous connection. The default on Solaris is 60000ms, which implies a sustainable connection rate of about 1000/sec. In a LAN environment, MSL (Maximum Segment Lifetime) should be less than 1 second, in which case the default can be safely reduced. This is necessary for very high connection rates.

# ndd /dev/tcp tcp_time_wait_interval
60000
# ndd -set /dev/tcp tcp_time_wait_interval 2000

Alternatively if KeepAlive is used, then each httpd process potentially can sit around idle for KeepAliveTimeout seconds before becoming available to service requests from another user. With a large user population, this means having a lot of apache processes, each with their own connection to mysqld at the backend. Careful attention need to be paid to whether there is sufficient real memory for so many httpd processes. Experience also suggests large numbers of active mysql processes can reduce efficiency.

For 1000 users, each combination failed to deliver the required performance (throughput and/or response time targets).

 KeepAlive KeepAliveTimeout
 MaxClients  Result
 on 60
 2048  fail: response times > 5 sec
 on  1  400  fail: 90th percentile > target
 off
 400  fail: unstable response times

The instability with KeepAlive off is particularly interesting as for majority of the run response times and throughput were looking good, when suddenly things deteroriated badly. This is reflected in the graphs generated by Faban of cpu utilisation and response time.

cpu_11h.png

resptm_11h.png

Before pursuing this though, I decided to investigate more deeply the inherently slow transactions. One way of doing this is to select a transaction type and run it repeatedly in a single thread. Using a single thread makes it more practical to instrument things with DTrace's pid provider, as only a single apache process need be started up. Mysqld itself is a single process, but multithreaded. Each apache process establishes its own connection to myqld. I wrote a DTrace script commandx.d to capture SQL text, response time and cpu time from mysql, which I found very useful. [Caveat: it needs to be run using the same data model as the process its observing, either -32 or -64.] DTrace scripts tend to grow organically, as you hack, modify, adapt, enhance, and expand them according to need. This one needs further work to capture stmt ids returned from PREPARE commands.

There are a few other techniques I have found useful for MySQL. One is to capture microstate accounting and other resource information from /proc pseudo-filesystem at start and end of an experiment and output the difference via a small utility I wrote, pidusage. Mysqld maintains a set of global variables about its internal activity that can be queried. To compute the differences between two samples I use the nawk script mysqldiff. Putting it all together the resulting microbenchmark script looks something like this:

mpid=`pgrep -x mysqld`
mysql -B -e 'show global status;' >sample1
pidusage -s -p$mpid
ab -c 1 -n 100 "http://localhost:8080/events.php?socialEventID=..."
pidusage -g -p$mpid
mysql -B -e 'show global status;' >sample2
mysqldiff sample1 sample2

For DTracing mysqld:

commandx.d -64 -p $mpid -o commandx.log 

And for starting a standalone single apache process:

httpd -k start -X

Combining all this generates a wealth of information, but I'll quote selectively here. 100 queries took 15.908s, and according to ab the mean was 159.077ms, so that's consistent. From microstate accounting, mysqld used 13.084s cpu time so the workload is cpu-bound (the database fits in memory). Statistics from mysqld give a clue about the cpu time: 100 web queries resulted in 500 select statements, which in total read 439800 rows.

Com_admin_commands                              100
Com_begin                                       100
Com_commit                                      100
Com_select                                      501
Com_update                                      100
Handler_read_first                              100
Handler_read_key                             418800
Handler_read_next                             22600
Handler_read_rnd_next                        417170
Innodb_buffer_pool_read_requests             934231
Innodb_rows_read                             439800
Questions                                       803

And this is what commandx captured at mysqld level for a typical events.php query (formatted to make the SQL text more readable). Note that all times are in microseconds (with Time referring to elapsed time since the script started logging).

      Time  Tid   RespTM    CpuTM Type          Len SQL
  10070946 6487       51       36 14:PING         0
  10071118 6487   123550   123500 03:QUERY      194 select tag, refcount as count
                                                    from   SOCIALEVENTTAG
                                                    where  socialeventtagid in
                                                         ( select socialeventtagid
                                                           from   SOCIALEVENTTAG_SOCIALEVENT
                                                           where  socialeventid = '2086')
                                                    and    refcount > 0
                                                    order  by tag ASC
  10195907 6487     1005      974 03:QUERY       84 select count(username) as count from PERSON_SOCIALEVENT where socialeventid = '2086'
  10197179 6487      747      706 03:QUERY      275 select title,description,submitterUserName,imagethumburl,literatureurl,telephone,timezone,
                                                           eventtimestamp,street1,street2,city,state,zip,country,latitude,longitude,summary
                                                    from   SOCIALEVENT as se, ADDRESS as a
                                                    where  se.socialeventid = '2086' and se.ADDRESS_a...
  10199070 6487      746      724 03:QUERY       68 select username from PERSON_SOCIALEVENT where socialeventid = '2086'
  10209447 6487       97       73 03:QUERY       17 start transaction
  10209761 6487      594      568 03:QUERY      110 update COMMENTS_RATING set comments='', ratings=''
                                                    where  username='' and socialeventid='2086' and commentid=''
  10210475 6487     1356     1329 03:QUERY      113 select commentid,username,comments,ratings,created_at,updated_at
                                                    from   COMMENTS_RATING where socialeventid = '2086'
  10221075 6487       81       65 03:QUERY        6 commit 

So a single query dominates. Prior to MySQL 6.0.x, there is a known issue around subqueries not being processed optimally. In this case there is a workaround, which is to use a join instead of a subquery:

select  tag, refcount as count
from    SOCIALEVENTTAG a, SOCIALEVENTTAG_SOCIALEVENT b
where   a.socialeventtagid = b.socialeventtagid
and     b.socialeventid = '2086'
and     a.refcount > 0
order   by tag ASC

I chose to use super-smack to test the two versions in isolation, before making changes to the PHP source code. Super-smack provides a convenient facility for plugging keys selected randomly from a dictionary into SQL commands. This makes for a more realistic workload. Since I had already turned off mysqld's Query Cache, this wasn't strictly necessary, but I chose to do it anyway. [Aside: I made a few changes to the super-smack source that I downloaded to clean up output formatting, ensured it used an MT-hot malloc, and replaced calls to random() with a thread-safe random number generator. This last change required passing an additional parameter for per-thread state around the callstack.]

$ super-smack -D. soc002.smack 1 100
Query Barrel Report for client smacker
connect: max=0ms  min=-1ms avg= 42ms from 1 clients 
Query_type   num_queries    max_time    min_time     q_per_s
soc002               100         124         124        8.05

$ super-smack -D. soc002a.smack 1 100
Query Barrel Report for client smacker
connect: max=0ms  min=-1ms avg= 41ms from 1 clients 
Query_type   num_queries    max_time    min_time     q_per_s
soc002a              100           0           0      681.73

The next problem query came from index.php, and took about 99ms.

select count(*) as count From SOCIALEVENT where eventtimestamp >= CURRENT_TIMESTAMP

The problem here is that almost all 18000 records in SOCIALEVENT have an eventtimestamp in the future. This part of the application looks a bit dodgy since it appears to be involved in pagination, in which links are provided for the first 10 pages. My suggestion here is to rely on count(*) of all records being a special case, and subtract the few records older than CURRENT_TIMESTAMP. Its still expensive relative to most queries though.

select count(*) as count From SOCIALEVENT
select count(*) as count From SOCIALEVENT where eventtimestamp < CURRENT_TIMESTAMP

$ super-smack -D. cnt001.smack 1 100 Query Barrel Report for client smacker connect: max=0ms min=-1ms avg= 47ms from 1 clients Query_type num_queries max_time min_time q_per_s cnt001 100 100 100 9.95 $ super-smack -D. cnt002.smack 1 100 Query Barrel Report for client smacker connect: max=0ms min=-1ms avg= 47ms from 1 clients Query_type num_queries max_time min_time q_per_s cnt002a 100 39 39 24.03 cnt002b 100 0 0 24.03

The third change is a bit subtle, but here's the original query:

select socialeventid,title,summary,imagethumburl,createdtimestamp,eventdate,submitterusername
from   SOCIALEVENT
where  eventtimestamp >= CURRENT_TIMESTAMP
order  by eventdate asc
limit  0,10

The issue here is that the order clause sorts by eventdate, not eventtimestamp. For this application, eventdate can actually be derived from eventtimestamp (via DATE function), and eventtimestamp has an index that facilitates getting records in the right order. It makes more sense to order by eventtimestamp.

Ideally I would have rerun the original experiment at this point, but my changes had been applied to a version of the application just prior to transaction start/commit being inserted. So temporarily I just used the modified older version, but note that it doesn't have the explicit transaction logic. This should be a relatively minor thing, as by default each SQL statement is a transaction when using InnoDB engine in MySQL. Before simulating 500+ users, I ran a smaller test to verify from the apache access_log that several of the common requests were now inherently faster. Making the mysql component more efficient also shifts the relative cpu consumption between mysql and apache. Rather than manually adjusting the number of cpus in a processor set, I deleted them and let the OS work it out. Another decision was to run with KeepAlive off, as performance was less stable with it enabled. [Maybe in the future it would be worth hacking the apache source to allow subsecond KeepAliveTimeout ?]

Cranking up the number of users, its now passing all performance criteria at 1800 users. There are a few signs of it approaching instability with the run queue occasionally jumping. At its busiest the system overall is approaching 90% busy, so if throughput is to be increased much further, further attacks will need to be made on the components consuming the majority of the cpu time. Some of the requests are still quite slow (~1 second) due to image processing but I'll leave that to another blog entry.

UIDriver metric: 344.169 ops/sec
Driver start: Thu Nov 06 03:35:52 PST 2008
Driver end: Thu Nov 06 03:43:03 PST 2008
Total number of operations: 123901
Pass/Fail: PASSED

Response Times (seconds)

Type Avg Max 90th% Reqd. 90th% Pass/Fail
HomePage 0.317 4.511 0.500 1.000 PASSED
Login 0.292 2.429 0.400 1.000 PASSED
TagSearch 0.222 5.393 0.500 2.000 PASSED
EventDetail 0.108 4.104 0.200 2.000 PASSED
PersonDetail 0.169 3.640 0.380 2.000 PASSED
AddPerson 1.390 6.814 2.040 3.000 PASSED
AddEvent 1.468 7.117 2.140 4.000 PASSED