Performance Analysis Richard Smith's Ruminations

Wednesday Apr 22, 2009

Professor Julius Sumner Miller on his popular TV show became known for his stock phrase, "Why is it so?"  Quite a few of the emails on the aliases I subscribe to end up asking that question, albeit about computer system performance rather than Physics.

A couple of weeks ago I was asked to investigate a possible MySQL scalability issue with the just-announced 5.4 release, as a result of feedback from a customer who had tried it out and compared it with 5.1.32. Their workload simulated the concurrent creation of customers in a database. This was done via a stored procedure that generated random data for the set of records that represented a customer in their database. The stored procedure was invoked by N threads concurrently, and they were interested in throughput as a function of N on a 64-thread Sun SPARC Enterprise T5220 Server.

In the results that I saw, which was based on the peak 1 second from a run lasting roughly 1 hour, the customer got slightly better throughput from 5.1.32 than 5.4 Beta when N=800. Now inserting records is close to a cpu-bound activity, but not perfectly so. Depending on mysqld options used, there is a non-trivial amount of serialization involved. When I ran their workload in the range of 1 to 32 threads, it soon stopped scaling. Within this range though, in every case, I got better throughput with 5.4 than 5.1.33 (the version I used for comparison).

On the whole, it didn't seem sensible to focus on the case N=800, which goes way beyond the number of virtual cpus available [64 threads across 8 cores]. Overcommiting a cpu runs the risk of threads getting preempted while holding a lock, or thrashing a shared resource such as a cache or common data structure. Solaris does have an API for providing hints to the scheduler about critical periods when preemption would be a bad idea [details can be found in the schedctl_init(3SCHED) manpage], but MySQL currently doesn't use it.

Having established, at least to my satisfaction, that 5.4 was scaling better than 5.1.33 for N in the range 1 to 32, I was still keen to identify what the bottlenecks were. That's a useful first step when trying to improve scalability! Returning to the customer workload, it also came with a my.cnf file containing a set of configuration variables which I used as the basis for my testing. The config file didn't specify innodb_thread_concurrency, which had its default changed with 5.4. This wasn't so important for an insert-intensive workload, but clearly could be a bottleneck for a read-intensive workload. Several parameters are known to have significant performance implications (such as log-bin, innodb_support_xa, innodb_flush_log_at_trx_commit). The choices one might make for a replication environment, where each slave has a single thread applying updates, are quite different to those required for an update-intensive central database. In what follows, I changed the values of some parameters from what the customer had specified, to improve scalability.

My real motivation though in writing this blog entry is to discuss the use of a technique that I have found very useful when investigating scalability, such as the situation described above, which involves the combination of Sun Studio Performance Analyzer and Solaris' microstate accounting.

A natural question to ask about a single thread is where does it spend its time? The Solaris kernel conveniently records for each thread the amount of time it spends in a set of microstates, as documented in the header file /usr/include/sys/msacct.h.

/* LWP microstates */
#define LMS_USER	0	/* running in user mode */
#define LMS_SYSTEM	1	/* running in sys call or page fault */
#define LMS_TRAP	2	/* running in other trap */
#define LMS_TFAULT	3	/* asleep in user text page fault */
#define LMS_DFAULT	4	/* asleep in user data page fault */
#define LMS_KFAULT	5	/* asleep in kernel page fault */
#define LMS_USER_LOCK	6	/* asleep waiting for user-mode lock */
#define LMS_SLEEP	7	/* asleep for any other reason */
#define LMS_WAIT_CPU	8	/* waiting for CPU (latency) */
#define LMS_STOPPED	9	/* stopped (/proc, jobcontrol, lwp_suspend) */

In addition to user cpu time, which Linux users will be familiar with it, it provides separate per-thread buckets for time spent waiting to be dispatched, time spent asleep waiting for a lock or on a condition variable, system call execution time, and several others. This means that all of a thread's time can be accounted for.

Via the /proc pseudo-filesystem, snapshots of resource usage can be taken, which is what underpins a particularly useful command, prstat -mL, for looking at which microstates threads are spending their time in. pstack(1) can be used to print a hex+symbolic stack trace for  a process or set of LWPs within a process to get snapshots of which functions threads are spending their time. Neither of these though give a complete picture, as you really want to combine the two types of information, and ultimately map the resources back to the source code. This is why I make extensive use of Performance Analyzer.

Using Performance Analyzer involves a few steps. Firstly you get much better information if the executable has been compiled with debugging information, for example -g. When investigating the customer scalability issue mentioned above, I downloaded the official 5.1.33 binary, but also compiled my own build from source, with debugging information to assist analysis.

Then a profiling experiment needs to be run, to collect profiling data. When profiling mysqld, I generally start it via the collect(1) command and use the -y USR2 option to  start it with profiling suspended but able to be toggled on and off via USR2 signal. This provides an opportunity to run something to warm up caches if desired before starting to profile.

collect -y USR2 $MYSQL_HOME/libexec/mysqld ...

In my workload generation script I toggle profiling on and off at the start and end:

mpid=`pgrep -x mysqld`
kill -USR2 $mpid
workload
kill -USR2 $mpid

This should result in an experiment directory being created containing all the profiling data. The final step is to analyze the profile data. For this, either the analyzer(1) GUI or er_print(1) command can be used. I'm often working remotely from benchmark machines and tend to use er_print, but do occasionally fire up analyzer via vnc. The GUI makes various aspects of the analysis less tedious and I would encourage its use for people less experienced with the tools.

mysqld is a multithreaded program that creates one thread for each user connection, as well as having a number of background threads. Several of the background threads spend most of their time asleep on condition variables, waiting for work. I'm mostly interested in where the foreground threads that actually process SQL statements spend their time, so I filter out the background threads (they generally have the lowest thread ids):

$ er_print test.9.er
(er_print) threads
Objects sorted by metric: Exclusive User CPU Time

Excl.      Name  
User CPU         
   sec.     
228.860    
 29.000    Thread 27
 28.970    Thread 28
 28.610    Thread 30
 28.370    Thread 29
 28.160    Thread 33
 28.090    Thread 31
 27.910    Thread 34
 27.689    Thread 32
  1.081    Thread 22
  0.801    Thread 21
...

(er_print) thread_sel 27-34
Exp Sel   Total 
=== ===== ===== 
  1 27-34    34 

Serial bottlenecks tend to be associated with synchronization primitives like mutexes and condition variables. If a thread makes a number of attempts to get a resource [spin] but fails each time, eventually it gets put to sleep [wait]. The time spent asleep is recorded as User Lock time. The list of metrics that er_print displays can be expanded from the default of User CPU time via its metrics command:

(er_print) metrics e.user:i.user:e.lock:i.lock
Current metrics: e.user:i.user:e.lock:i.lock:name
Current Sort Metric: Exclusive User CPU Time ( e.user )

A summary of resources by function can be obtained, here sorted by descending Exclusive User Lock time, which includes a couple of pseudo-functions. <Total> aggregates resources from all functions, which the [potentially long] list of real functions below should add up to. Exclusive resources are those used by a function itself, whereas Inclusive resources are those used by a function and all the children beneath it in the call stack.

(er_print) sort e.lock
Current Sort Metric: Exclusive User Lock Time ( e.lock )
(er_print) limit 30
(er_print) functions
Functions sorted by metric: Exclusive User Lock Time

Excl.     Incl.     Excl.      Incl.       Name  
User CPU  User CPU  User Lock  User Lock         
   sec.      sec.      sec.       sec.      
226.799   226.799   533.123    533.123     <Total>
  2.001     2.001   533.123    533.123     __lwp_park
  4.713     4.713     0.         0.        <Unknown>
...

Extremely few functions are ever likely to have Exclusive User Lock time, because that is handled at a very low level underneath the implementation of synchronization primitives provided by libc.so.1. The application functions associated with synchronization primitive bottlenecks are higher up the stack. They can be found by following the chain of function calls up the stack via csingle command. Experience shows that these are on or around the callers of functions like pthread_mutex_lock and pthread_cond_wait. In the extract below, the function of interest is highlighted with an asterisk [*], surrounded by the immediate parents and children of that function in the call tree.

(er_print) csingle pthread_cond_wait
Callers and callees sorted by metric: Attributed User CPU Time

Attr.     Excl.     Incl.     Attr.      Excl.      Incl.       Name  
User CPU  User CPU  User CPU  User Lock  User Lock  User Lock         
 sec.      sec.      sec.        sec.     sec.         sec.      
3.663     0.190     4.173     533.123    0.         533.123     os_event_wait_low
0.190     0.190     3.663       0.       0.         533.123    *pthread_cond_wait
3.472     0.160     3.472     533.123    0.         533.123     cond_wait

(er_print) csingle os_event_wait_low
Callers and callees sorted by metric: Attributed User CPU Time

Attr.     Excl.     Incl.     Attr.      Excl.      Incl.       Name  
User CPU  User CPU  User CPU  User Lock  User Lock  User Lock         
 sec.      sec.       sec.       sec.     sec.         sec.      
2.552     0.420     30.191    249.314    0.         249.545     log_write_up_to
1.561     0.500     37.897    273.201    0.         527.889     trx_commit_off_kernel
0.060     0.         0.060     10.607    0.          10.607     sync_array_wait_event
0.190     0.190      4.173      0.       0.         533.123    *os_event_wait_low
3.663     0.190      3.663    533.123    0.         533.123     pthread_cond_wait
0.170     2.172      3.763      0.       0.           0.        mutex_unlock
0.110     2.952      3.252      0.       0.           0.        mutex_lock_impl
0.040     0.280      0.280      0.       0.           0.        mutex_lock

In this case an application function log_write_up_to makes use of an Innodb utility function os_event_wait_low to wait on a condition variable,  which appears to be a bottleneck. A source listing of log_write_up_to identifies the statement responsible, highlighted by a double hash [##].

Source file: .../storage/innobase/log/log0log.c
Object file: test.9.er/archives/mysqld_4f2PsJ6tDrc
Load Object: test.9.er/archives/mysqld_4f2PsJ6tDrc

   Excl.     Incl.     Excl.      Incl.
   User CPU  User CPU  User Lock  User Lock
    sec.      sec.      sec.         sec.
...
## 0.020     2.572     0.         249.314         1370.   os_event_wait(log_sys->no_flush_event);

With a little practice, following these steps is actually a fairly quick process--it takes much longer to write it all up! And to be fair, the job is only part-done. To improve scalability, it would be necessary to identify where in the code the corresponding signal is issued, determine what takes time between the wait and corresponding signal, and see if that critical path can be shortened. In this particular case, its related to the batching of log entries and flushing them to the Innodb log file. Using the fastest possible storage for logs is a partial answer.

One issue to be aware of when profiling is that under some circumstances, Performance Analyzer may not be able to map back from instruction addresses to source lines. Some gcc compilers I've tried for example appear not to generate usable PC->linenumber mappings in .debug_line sections. For those with sunsolve accounts, CRs 6726134, 6776962, and 6831881 address some of these issues, although its not clear to me whether the problems are in the generation of the mappings, or their interpretation.

Innodb makes extensive use of the preprocessor and macros, which when combined with the inlining of functions, can make the mapping back from instructions to source code quite tricky. On occasion, particularly when dealing with an unfamiliar part of the source tree, I've found it useful to create a less-optimized version, with inlining suppressed, when trying to understand the control flow.

Although not illustrated here, for everything I did with the command-line tool er_print, there is an equivalent in analyzer GUI. Being able to point and click in the GUI makes moving up and down the callstack less tedious. It also has a graphical timeline view that assists the human eye in detecting interesting or unusual patterns in callstacks over time. More information about er_print and analyzer can be found in Sun Studio 12: Performance Analyzer manual.

Finally, perhaps as a teaser, here's an example timeline from a different mysqld profile--one that shows an immediate and obvious pattern. It shows a bunch of threads, but by no means all, periodically waiting 400ms for commands from remote clients. This is for a workload where once a command finishes, the next one should immediately be sent to mysqld. This clearly calls for further investigation.

Mysql_pauses_timeline.png

Wednesday Apr 15, 2009

Ever more powerful hardware can stress software and lead to bottlenecks not previously seen. I was working recently on Olio scalability where the challenge was to see how many active users our configuration could support while meeting specified response time targets. In our setup, an active user on average did one operation every 5 seconds, and an operation required on average 7 web transactions. Supporting 10000 users across two X2270 web servers therefore involved a transaction rate of about 7200 transactions per second per server. The web tier contained PHP application code that in turn accessed memcached, NFS-mounted filestore, and backend replicated MySQL database.

Needless to say, the initial configuration failed to support 10000 users. The history of the exercise then became one of iterating through a process of identifying bottlenecks, eliminating or working around them, and moving on to discover the next bottleneck. There will always be at least one bottleneck of course, but artificial bottlenecks in software that prevent the underlying hardware from being fully utilized are especially to be avoided.

One of the things we did in the course of a run is use Solaris' lockstat(1M) command to collect information about hot locks in the kernel. Solaris uses several different kinds of lock: locks where it is acceptable to be put to sleep, locks where you are only permitted to spin if an attempt to get the lock fails, Read-Write locks that can allow multiple concurrent readers but only one writer at  a time, etc. In our case, contention for Read-Write locks dominated. Here's an example:

R/W writer blocked by writer: 604708 events in 518.141 seconds (1167 events/sec)

Count  indv cuml rcnt     nsec Lock                      Caller
-------------------------------------------------------------------------------
604575 100% 100% 0.00  1376373 0xffffff0e9e1893c8        ufs_rwlock+0xf6
    54   0% 100% 0.00    26305 0xffffff0d38a73630        tmp_remove+0x70
    51   0% 100% 0.00    32393 0xffffff0d38a73630        tmp_create+0x126
    15   0% 100% 0.00    31474 0xffffff0d38a73630        tmp_rename+0x1b4

The output shows a major bottleneck on a particular ufs rwlock. But what file is that lock associated with? Via mdb(1), the Lock address can be used to find the filename. The Caller gives a strong clue that the lock is ufs-related, and mdb's ::whatis dcmd can be used to identify it:

> 0xffffff0e9e1893c8::whatis
ffffff0e9e1893c8 is ffffff0e9e1892f0+d8, allocated from ufs_inode_cache

Again we have a strong clue that the lock is associated with an inode. Proceeding on the assumption that we're dealing with the i_rwlock field of an inode, a sanity check can be performed to verify that the offset of i_rwlock in inode_t is 0xd8, which matches the offset in the ::whatis output. [I admit, I did also look at kernel source to get these fieldnames.]

> ::offsetof inode_t i_rwlock
offsetof (inode_t, i_rwlock) = 0xd8

The next step is to get the vnode corresponding to the inode, and from that get the path of the file it represents. The inode's base address was infact printed by the ::whatis dcmd, but it could also be computed by subtracting i_rwlock offset from the lock address. Either works. From there, just print the value of i_vnode, then use the ::vnode2path dcmd.

> 0xffffff0e9e1893c8-0xd8::print inode_t i_vnode
i_vnode = 0xffffff0df3646040
> ffffff0e9e1892f0::print inode_t i_vnode
i_vnode = 0xffffff0df3646040
> 0xffffff0df3646040::vnode2path
/var/apache2/2.2/logs/access_log

Aha! Apache's access_log was a bottleneck. This probably should not be surprising given the rate at which transactions  were being logged from 5000+ processes on a single server. Although by this point we had reached our target of 10000 users, we did test what happens when the log was put on /tmp, which was to improve response times further. It does however point to the need for a new, more scalable, logging implementation for Apache. At a minimum, high-activity logs need to go on the fastest available storage.

Speaking of /tmp, the original lock output shows some rwlock activity associated with /tmp as well (as implied by the Callers). And yes, even before we got to the ufs lock issue there was a bigger bottleneck on /tmp, one that was stopping us reaching our target. While the mechanics differ in detail (different structure, fieldname, offset), mdb can be used to identify the files responsible for /tmp contention too:

> 0xffffff0d38a73630::whatis
ffffff0d38a73630 is ffffff0d38a73570+c0, allocated from kmem_alloc_224
> ::offsetof struct tmpnode tn_rwlock
offsetof (struct tmpnode, tn_rwlock) = 0xc0
> ffffff0d38a73570::print struct tmpnode tn_vnode
tn_vnode = 0xffffff0d38c9e480
> 0xffffff0d38c9e480::vnode2path
/tmp

The PHP application code was saving session state for each session in a file in /tmp, creating a bottleneck on the rwlock associated with the /tmp directory itself. The workaround for this was to edit php.ini and tell PHP to distribute its session state files across multiple subdirectories. This is done via session.save_path:

< session.save_path = "/tmp"
---
> session.save_path = "1;/tmp/http_sessions"

With the /tmp change in place and other tuning of caching layer and MySQL, Apache scaled to the level we needed to support 10000 active Olio users across two X2270 servers. Access_log is important for Service Level Management amongst other things, and while our response times were being met with it on JBOD, rwlock contention indicates a more scalable solution will be required in the future. /tmp can provide some relief albeit at the risk of being on volatile storage.