Professor Julius Sumner Miller on his popular TV show became known for his stock phrase, "Why is it so?" Quite a few of the emails on the aliases I subscribe to end up asking that question, albeit about computer system performance rather than Physics.
A couple of weeks ago I was asked to investigate a possible MySQL scalability issue with the just-announced 5.4 release, as a result of feedback from a customer who had tried it out and compared it with 5.1.32. Their workload simulated the concurrent creation of customers in a database. This was done via a stored procedure that generated random data for the set of records that represented a customer in their database. The stored procedure was invoked by N threads concurrently, and they were interested in throughput as a function of N on a 64-thread Sun SPARC Enterprise T5220 Server.
In the results that I saw, which was based on the peak 1 second from a run lasting roughly 1 hour, the customer got slightly better throughput from 5.1.32 than 5.4 Beta when N=800. Now inserting records is close to a cpu-bound activity, but not perfectly so. Depending on mysqld options used, there is a non-trivial amount of serialization involved. When I ran their workload in the range of 1 to 32 threads, it soon stopped scaling. Within this range though, in every case, I got better throughput with 5.4 than 5.1.33 (the version I used for comparison).
On the whole, it didn't seem sensible to focus on the case N=800, which goes way beyond the number of virtual cpus available [64 threads across 8 cores]. Overcommiting a cpu runs the risk of threads getting preempted while holding a lock, or thrashing a shared resource such as a cache or common data structure. Solaris does have an API for providing hints to the scheduler about critical periods when preemption would be a bad idea [details can be found in the schedctl_init(3SCHED) manpage], but MySQL currently doesn't use it.
Having established, at least to my satisfaction, that 5.4 was scaling better than 5.1.33 for N in the range 1 to 32, I was still keen to identify what the bottlenecks were. That's a useful first step when trying to improve scalability! Returning to the customer workload, it also came with a my.cnf file containing a set of configuration variables which I used as the basis for my testing. The config file didn't specify innodb_thread_concurrency, which had its default changed with 5.4. This wasn't so important for an insert-intensive workload, but clearly could be a bottleneck for a read-intensive workload. Several parameters are known to have significant performance implications (such as log-bin, innodb_support_xa, innodb_flush_log_at_trx_commit). The choices one might make for a replication environment, where each slave has a single thread applying updates, are quite different to those required for an update-intensive central database. In what follows, I changed the values of some parameters from what the customer had specified, to improve scalability.
My real motivation though in writing this blog entry is to discuss the use of a technique that I have found very useful when investigating scalability, such as the situation described above, which involves the combination of Sun Studio Performance Analyzer and Solaris' microstate accounting.
A natural question to ask about a single thread is where does it spend its time? The Solaris kernel conveniently records for each thread the amount of time it spends in a set of microstates, as documented in the header file /usr/include/sys/msacct.h.
/* LWP microstates */ #define LMS_USER 0 /* running in user mode */ #define LMS_SYSTEM 1 /* running in sys call or page fault */ #define LMS_TRAP 2 /* running in other trap */ #define LMS_TFAULT 3 /* asleep in user text page fault */ #define LMS_DFAULT 4 /* asleep in user data page fault */ #define LMS_KFAULT 5 /* asleep in kernel page fault */ #define LMS_USER_LOCK 6 /* asleep waiting for user-mode lock */ #define LMS_SLEEP 7 /* asleep for any other reason */ #define LMS_WAIT_CPU 8 /* waiting for CPU (latency) */ #define LMS_STOPPED 9 /* stopped (/proc, jobcontrol, lwp_suspend) */
In addition to user cpu time, which Linux users will be familiar with it, it provides separate per-thread buckets for time spent waiting to be dispatched, time spent asleep waiting for a lock or on a condition variable, system call execution time, and several others. This means that all of a thread's time can be accounted for.
Via the /proc pseudo-filesystem, snapshots of resource usage can be taken, which is what underpins a particularly useful command, prstat -mL, for looking at which microstates threads are spending their time in. pstack(1) can be used to print a hex+symbolic stack trace for a process or set of LWPs within a process to get snapshots of which functions threads are spending their time. Neither of these though give a complete picture, as you really want to combine the two types of information, and ultimately map the resources back to the source code. This is why I make extensive use of Performance Analyzer.
Using Performance Analyzer involves a few steps. Firstly you get much better information if the executable has been compiled with debugging information, for example -g. When investigating the customer scalability issue mentioned above, I downloaded the official 5.1.33 binary, but also compiled my own build from source, with debugging information to assist analysis.
Then a profiling experiment needs to be run, to collect profiling data. When profiling mysqld, I generally start it via the collect(1) command and use the -y USR2 option to start it with profiling suspended but able to be toggled on and off via USR2 signal. This provides an opportunity to run something to warm up caches if desired before starting to profile.
collect -y USR2 $MYSQL_HOME/libexec/mysqld ...
In my workload generation script I toggle profiling on and off at the start and end:
mpid=`pgrep -x mysqld` kill -USR2 $mpid workload kill -USR2 $mpid
This should result in an experiment directory being created containing all the profiling data. The final step is to analyze the profile data. For this, either the analyzer(1) GUI or er_print(1) command can be used. I'm often working remotely from benchmark machines and tend to use er_print, but do occasionally fire up analyzer via vnc. The GUI makes various aspects of the analysis less tedious and I would encourage its use for people less experienced with the tools.
mysqld is a multithreaded program that creates one thread for each user connection, as well as having a number of background threads. Several of the background threads spend most of their time asleep on condition variables, waiting for work. I'm mostly interested in where the foreground threads that actually process SQL statements spend their time, so I filter out the background threads (they generally have the lowest thread ids):
$ er_print test.9.er (er_print) threads Objects sorted by metric: Exclusive User CPU Time Excl. Name User CPU sec. 228.86029.000 Thread 27 28.970 Thread 28 28.610 Thread 30 28.370 Thread 29 28.160 Thread 33 28.090 Thread 31 27.910 Thread 34 27.689 Thread 32 1.081 Thread 22 0.801 Thread 21 ... (er_print) thread_sel 27-34 Exp Sel Total === ===== ===== 1 27-34 34
Serial bottlenecks tend to be associated with synchronization primitives like mutexes and condition variables. If a thread makes a number of attempts to get a resource [spin] but fails each time, eventually it gets put to sleep [wait]. The time spent asleep is recorded as User Lock time. The list of metrics that er_print displays can be expanded from the default of User CPU time via its metrics command:
(er_print) metrics e.user:i.user:e.lock:i.lock Current metrics: e.user:i.user:e.lock:i.lock:name Current Sort Metric: Exclusive User CPU Time ( e.user )
A summary of resources by function can be obtained, here sorted by descending Exclusive User Lock time, which includes a couple of pseudo-functions. <Total> aggregates resources from all functions, which the [potentially long] list of real functions below should add up to. Exclusive resources are those used by a function itself, whereas Inclusive resources are those used by a function and all the children beneath it in the call stack.
(er_print) sort e.lock Current Sort Metric: Exclusive User Lock Time ( e.lock ) (er_print) limit 30 (er_print) functions Functions sorted by metric: Exclusive User Lock Time Excl. Incl. Excl. Incl. Name User CPU User CPU User Lock User Lock sec. sec. sec. sec. 226.799 226.799 533.123 533.123 <Total>2.001 2.001 533.123 533.123 __lwp_park 4.713 4.713 0. 0. <Unknown> ...
Extremely few functions are ever likely to have Exclusive User Lock time, because that is handled at a very low level underneath the implementation of synchronization primitives provided by libc.so.1. The application functions associated with synchronization primitive bottlenecks are higher up the stack. They can be found by following the chain of function calls up the stack via csingle command. Experience shows that these are on or around the callers of functions like pthread_mutex_lock and pthread_cond_wait. In the extract below, the function of interest is highlighted with an asterisk [*], surrounded by the immediate parents and children of that function in the call tree.
(er_print) csingle pthread_cond_wait Callers and callees sorted by metric: Attributed User CPU Time Attr. Excl. Incl. Attr. Excl. Incl. Name User CPU User CPU User CPU User Lock User Lock User Lock sec. sec. sec. sec. sec. sec. 3.663 0.190 4.173 533.123 0. 533.123 os_event_wait_low 0.190 0.190 3.663 0. 0. 533.123 *pthread_cond_wait 3.472 0.160 3.472 533.123 0. 533.123 cond_wait (er_print) csingle os_event_wait_low Callers and callees sorted by metric: Attributed User CPU Time Attr. Excl. Incl. Attr. Excl. Incl. Name User CPU User CPU User CPU User Lock User Lock User Lock sec. sec. sec. sec. sec. sec. 2.552 0.420 30.191 249.314 0. 249.545 log_write_up_to 1.561 0.500 37.897 273.201 0. 527.889 trx_commit_off_kernel 0.060 0. 0.060 10.607 0. 10.607 sync_array_wait_event 0.190 0.190 4.173 0. 0. 533.123 *os_event_wait_low 3.663 0.190 3.663 533.123 0. 533.123 pthread_cond_wait 0.170 2.172 3.763 0. 0. 0. mutex_unlock 0.110 2.952 3.252 0. 0. 0. mutex_lock_impl 0.040 0.280 0.280 0. 0. 0. mutex_lock
In this case an application function log_write_up_to makes use of an Innodb utility function os_event_wait_low to wait on a condition variable, which appears to be a bottleneck. A source listing of log_write_up_to identifies the statement responsible, highlighted by a double hash [##].
Source file: .../storage/innobase/log/log0log.c
Object file: test.9.er/archives/mysqld_4f2PsJ6tDrc
Load Object: test.9.er/archives/mysqld_4f2PsJ6tDrc
Excl. Incl. Excl. Incl.
User CPU User CPU User Lock User Lock
sec. sec. sec. sec.
...
## 0.020 2.572 0. 249.314 1370. os_event_wait(log_sys->no_flush_event);
With a little practice, following these steps is actually a fairly quick process--it takes much longer to write it all up! And to be fair, the job is only part-done. To improve scalability, it would be necessary to identify where in the code the corresponding signal is issued, determine what takes time between the wait and corresponding signal, and see if that critical path can be shortened. In this particular case, its related to the batching of log entries and flushing them to the Innodb log file. Using the fastest possible storage for logs is a partial answer.
One issue to be aware of when profiling is that under some circumstances, Performance Analyzer may not be able to map back from instruction addresses to source lines. Some gcc compilers I've tried for example appear not to generate usable PC->linenumber mappings in .debug_line sections. For those with sunsolve accounts, CRs 6726134, 6776962, and 6831881 address some of these issues, although its not clear to me whether the problems are in the generation of the mappings, or their interpretation.
Innodb makes extensive use of the preprocessor and macros, which when combined with the inlining of functions, can make the mapping back from instructions to source code quite tricky. On occasion, particularly when dealing with an unfamiliar part of the source tree, I've found it useful to create a less-optimized version, with inlining suppressed, when trying to understand the control flow.
Although not illustrated here, for everything I did with the command-line tool er_print, there is an equivalent in analyzer GUI. Being able to point and click in the GUI makes moving up and down the callstack less tedious. It also has a graphical timeline view that assists the human eye in detecting interesting or unusual patterns in callstacks over time. More information about er_print and analyzer can be found in Sun Studio 12: Performance Analyzer manual.
Finally, perhaps as a teaser, here's an example timeline from a different mysqld profile--one that shows an immediate and obvious pattern. It shows a bunch of threads, but by no means all, periodically waiting 400ms for commands from remote clients. This is for a workload where once a command finishes, the next one should immediately be sent to mysqld. This clearly calls for further investigation.


