Tuesday Sep 23, 2008
Tuesday Sep 23, 2008
One example is Bug#32149 Long semaphore wait for adaptive hash latch. The bug synopsis could very well have been Why are global locks in a storage engine such a BAD idea. There is lots of useful information about this and the workaround MySQL server uses to work with storage engines that do.. Interesting!
Do you know of any other interesting bugs that document MySQL internals?
Tuesday Sep 16, 2008
Percona recently released a patch which includes performance fixes developed by Yasufumi Kinoshita from NTT Comware. This helps diskbound applications quite significantly. Details at Bug #29413 Maximum performance of OLTP benchmark is not so scalable on multi-cpu. It looks like the bulk of performance improvements come from breaking up the lock guarding the buffer pool structures, and improvements in the IO code path. The "buf_pool->mutex" also gets quite hot when concurrency is not limited (via innodb_thread_concurrency) and you have multiple threads executing on multi-core systems.
Google recently released an updated patch for improving Innodb scalability. My experience with this patch shows a significant scalability improvement for many workloads. Great work Mark, Ben and rest of the gang!
This is excellent news! This clearly demonstrates that we (as a community) are making great progress in making MySQL scale. This also underlines the fact that the choice of the storage engine will influence your scalability experience quite a bit. This also sets the bar higher for Falcon and Maria
(A good thing in my opinion)
Thursday Aug 21, 2008
Along the way we're planning to blog about how we
identified and eliminated performance bottlenecks and provide you with
all the gory details about the process and technique. This will
hopefully entice you all to contribute to fixing performance issues
We can reach our goal of a highly scalable MySQL much faster!. If you
any other ideas about how we can scale the MySQL scalability effort
(pun intended), or you would like to bust some locks, please feel free to
let us know.
Fixing locks is sometimes very tricky as sometimes eliminating a really hot lock can only give a modest improvement in performance, or sometimes, no improvements at all. For example, sometimes fixing a hot lock might only move the bottleneck to the next lock. We have to address all the scalability issues before we actually see a dramatic improvement. A recent example illustrates this point.
If you have been following MySQL scalability issues, you must be well aware of the LOCK_plugin lock. LOCK_plugin is used to protect the list of plugins used by MySQL. These plugins can be storage engine plugins, udf plugins, full text parser plugins, etc. LOCK_plugin got extremely hot for several benchmarks. Analysis of the contention showed that LOCK_plugin was being acquired and released at the end of every query execute as part of sending results to the clients. A closer look at the usage reveals that this lock is held while providing a hint to Innodb to release any shared locks and release concurrency tickets. The technique used by the MySQL server was to iterate over the plugin list and call ha_release_temporary_latches() for all storage engine plugins. The method used to do this (plugin_for_each_with_mask() grabs LOCK_plugin while iterating over the plugins list. For more details, checkout Bug#34409 LOCK_plugin contention via ha_release_temporary_latches/plugin_foreach
Sergei Golubchik recently fixed it in the most elegant way possible -- do not acquire a lock over the plugins list while releasing temporary latches. Looking at Sergei's fix and searching through the source, we see ha_release_temporary_latches() is called by MySQL server at multiple places without holding the LOCK_plugin lock! We are unnecessarily holding LOCK_plugin! Since holding LOCK_plugin is the side effect of iterating the plugins list, do not use the plugins list to iterate over storage engines; just iterate over the storage engines that are used in this transaction. What I really liked about the fix is that Sergei identified that there is no need for a lock, and got rid of it, rather than trying to break up the lock, or other things!. Kudos to Sergei for fixing this in such an efficient manner. If you are using 5.1.28 or 6.0.7-alpha, you already have this fix.
With this bugfix, we see around a 5% improvement in sysbench read-only tests. The reason we see only a 5% improvement and not more is that eliminating this lock made other locks more contented, and they are now the bottleneck. We are one step closer to a more scalable MySQL. Back to lock hunting for now.
Wednesday Jul 23, 2008
While comparing sysbench runs using two different versions of MySQL 5.1, I noticed a big difference in the system utilization. One version had much more idle time than the other. This difference was much more apparent with high thread counts on machines with lots of CPU. A look at the system calls showed a large number of lwp_park system calls. That is, threads were being put to sleep.
bash # $ dtrace -qn 'syscall:::entry{@c[probefunc]=count()}' -n tick-5s'{trunc(@c, 10);exit(0)}'
p_online 1535
ioctl 2255
fcntl 6134
priocntlsys 6462
write 6492
read 12775
yield 19065
gtime 313927
pollsys 321074
lwp_park 951749
Investigating why threads are being parked showed a very interesting
stacktrace
libc.so.1`__lwp_park+0x10
libc.so.1`cond_wait_queue+0x28
libc.so.1`cond_wait+0x10
libc.so.1`pthread_cond_wait+0x8
mysqld`os_event_wait_low+0x3c
mysqld`sync_array_wait_event+0x1c0
mysqld`mutex_spin_wait+0x1c4
mysqld`mutex_enter_func+0x1c
mysqld`rw_lock_x_lock_func+0x3b4
mysqld`mtr_x_lock_func+0x10
mysqld`fsp_get_available_space_in_free_extents+0x30
mysqld`int ha_innobase::info(unsigned)+0x164
mysqld`bool make_join_statistics(JOIN*,TABLE.. +0x180
mysqld`int JOIN::optimize()+0x670
mysqld`bool mysql_select(THD*,Item***,TABLE_.. +0x224
mysqld`bool handle_select(THD*,st_lex*,selec.. +0xc4
mysqld`bool execute_sqlcom_select(THD*,TABLE.. +0x1c8
mysqld`int mysql_execute_command(THD*)+0x380
mysqld`bool Prepared_statement::execute(Stri.. +0x29c
mysqld`bool Prepared_statement::execute_loop.. +0xbc
Looking at make_join_statistics() we see that it calls
ha_innobase::info() with HA_STATUS_VARIABLE |
HA_STATUS_NO_LOCK. A quick search for the definition of
HA_STATUS_NO_LOCK shows
/* assuming the table keeps shared actual copy of the 'info' and local, possibly outdated copy, the following flag means that it should not try to get the actual data (locking the shared structure) slightly outdated version will suffice */ #define HA_STATUS_NO_LOCK 2So the mysql server is requesting ha_innobase::info() to not hold a lock, and it is being ignored by ha_innobase::info()!.
I compared against MySQL 5.0 and saw that this particular lock was not held when ha_innobase::info() was called. Searching through the commit logs I found that this was introduced by Bug#32440.
Quickly hacking the code to revert to the old behaviour gave a big boost in performance. Hence I filed Bug #38185 ha_innobase::info can hold locks even when called with HA_STATUS_NO_LOCK. Luckily its a very small change and a fix is already in progress.
Moral of the story? A simple contended lock can bring down your performance by quite a lot.
Monday Jul 07, 2008
I can think of 2 easy ways to answer this question. If you know of any additional methods please feel free to let me know.
If you know of any other methods please let me know!
Wednesday Jun 25, 2008
Filesort, as the name implies, is used to sort records when there is an ORDER BY clause in the query. The reason it has the prefix "file" is because temporary files may be used to store intermediate results. filesort() uses a per thread sort buffer of size sort_buffer_size to sort the table. filesort() is implemented in sql/filesort.cc.
filesort may not know how many records need to be sorted. It asks the storage engine for an estimate of the number of records in the table via estimate_rows_upper_bound(). If the number of records that fit in the sort buffer is less than the estimate, filesort will use temporary files to store intermediate results. The flow is as follows (or atleast the important steps to understand the bug mentioned above)
Initializing the sort buffer can be very expensive when the estimate for the number of rows is off. For example, in my benchmark, the table has 210 Million rows. Innodb returns 210 Million for estimate_rows_upper_bound(). The actual number of rows that fulfill the WHERE clause of my query is 3!. For a default sort buffer size of 2MB, the number of rows that fit in the sort buffer is 70,000. So filesort() unnecessarily initializes space for 70,000 rows, and does the sort. You might think initializing space for 70,000 rows is not a big deal, but when it constitutes 10% of the query execution time, it IS a big deal.
So how can we improve this?
Tuesday May 06, 2008
Often we hear customers complain about not getting good results with their application even though the results with micro-benchmarks were quite good. This is mainly because well known micro-benchmarks do very specific things like measuring bulk throughput or latency and do not really model the customer's application. Using micro-benchmarks does help standardize performance results, but the customer cannot really draw assumptions, based on these standard benchmarks, about how their application will perform. The safest alternative is to run their application to determine the performance. This maybe very time consuming and may require a lot of setup. uperf allows the customer to specify a description(profile) of what their application does and stress the system using this description. Using a profile has some additional benefits like you could modify the profile slightly to determine the effects of using a different protocol, etc..
Customers rarely want to share their application or it is just too time consuming to get their application and measure their performance in our labs, We are often left with a description of what the application is doing. uperf can take [a modified version of] this description and run whatever the real application does. In addition to running the customer's application, uperf can provide you with additional features like
So please feel free to check out uperf and contribute code, ideas, suggestions etc..
Wednesday Apr 02, 2008
Mysql uses mysql_execute_command() to execute queries. Looking at the callstack you can see very clearly that mysql_execute_command() calls open_and_lock_tables which then tries to open tables via open_table(). The code path gets interesting here. As you can see in the image above, (and in the svg callstack), there is tight for loop. Looking closely (moving your mouse over the area in the svg file), you will see that the code is basically traversing a hash list! Not the best thing to do with a hashtable. Looking at the code, it is originating at line number 2781 in sql/sql_base.cc. Can it be improved? Most probably. Will it help if we improve it? Definitely for some cases, maybe not for others.
I found a few more interesting places in the code, but I will save them for my future blogs
Tuesday Feb 26, 2008
Sysbench is a popular open source benchmark used to measure performance of various parts of the operating systems and (one) application (database). Since we are talking about MySQL, I will concentrate on the database part of the test.
The oltp test in the sysbench benchmark creates a single table and runs a set of queries against it. Each row in the table is around 250 bytes and by default it creates 1000 rows. For our experiment we used 10 million rows. Allan has blogged about the details regarding the experiments; I will present an alternate view to those experiments.
The trick to getting good numbers with Sysbench and MySQL is very simple
MySQL uses the innodb_buffer_size variable to determines the size of the cache it uses. If you have sufficient memory (more than 2.5GB in our case), you could cache the whole table. An alternative strategy (especially if you are using 32bit MySQL) is to choose a smaller value for the innodb_buffer_cache and let the table be cached in the filesystem buffer cache; but I suspect it is more efficient to cache the data at the innodb level. This should eliminate or considerably reduce all the reads.
To eliminate writes, you can either choose to do a read-only test, or use a cache-enabled disks for the writes. The cache can either be NVRAM, or the write cache on the disk. Note: Using cache that is not battery backed is very risky. Do it at your own risk.
If you are still seeing delays due to IO(which in turn translates to idle time on the system), you can try adding more threads to soak up the available CPU. You have to be very careful to find the right balance.
To recap, We got the best numbers using
Thats it! Here is the my.cnf that gives me the best numbers.
[mysqld] datadir=/mysqldata innodb_data_home_dir = /mysqldata innodb_data_file_path = ibdata1:2000M;ibdata2:10M:autoextend innodb_log_group_home_dir = /mysqldata innodb_buffer_pool_size = 4096M innodb_additional_mem_pool_size = 20M innodb_log_file_size = 400M innodb_log_buffer_size = 64M innodb_thread_concurrency = 0
Friday Jan 18, 2008
Monday Jan 07, 2008


Thursday Nov 01, 2007
I wrote a simple ruby script to post process the allfrom.d output and generate a SVG image. The advantage of using SVG is that you can use javascript to provide added functionality. For example, you can hover your mouse over any block to see the name of the function and its elapsed time. Similarly, you could add support for Zoom and Pan
Unfortunately, I am having problems with blogs.sun.com serving svg files with the right mime type. So I have included a png image below. You can save the svg files somewhere on your computer and view them using Firefox.
A sample for the connect(3socket) is shown below. The input file used to generate it is available here. The width of each box indicates how long the function took. Y axis indicates call depth.

If you are interested, you can also check out
Thursday Oct 18, 2007
Rendering /tmp/neel/jruby-1.0.1/lib/ruby/gems/1.8/gems/actionpack-1.13.5/lib/ action_controller/templates/rescues/layout.rhtml (500 Internal Error) no marshal_dump is defined for class Java::JavaObject /tmp/neel/jruby-1.0.1/lib/ruby/1.8/pstore.rb:349:in `dump' /tmp/neel/jruby-1.0.1/lib/ruby/1.8/pstore.rb:327:in `transaction' /tmp/neel/jruby-1.0.1/lib/ruby/1.8/cgi/session/pstore.rb:81:in `update' /tmp/neel/jruby-1.0.1/lib/ruby/1.8/cgi/session/pstore.rb:88:in `close' /tmp/neel/jruby-1.0.1/lib/ruby/1.8/cgi/session.rb:324:in `close'The problem is that Rails is trying to store the Java object in the session store. The default session store is the filesystem, and it fails since it cannot marshal a Java Object. Setting the session store to use memory instead of the filesystem solves the problem.
In config/environment.rb set
config.action_controller.session_store = :memory_store
Sunday Oct 14, 2007
Tuesday Sep 18, 2007
I wrote a simple perl script to print out the ARC statistics. Sample output is shown below. You can either print the set of default fields, or specify what fields you want printed (see arcstat -v for details)
Download arcstat.pl
In later blogs, I will describe how to make sense of these numbers. Sample output below
Cached random read
thumper1[bash]$ ~/arcstat.pl
Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
18:15:21 985M 100M 10 47M 14 53M 8 1M 3 11G 11G
18:15:22 137K 0 0 0 0 0 0 0 0 11G 11G
18:15:23 138K 0 0 0 0 0 0 0 0 11G 11G
18:15:24 138K 0 0 0 0 0 0 0 0 11G 11G
18:15:25 138K 0 0 0 0 0 0 0 0 11G 11G
18:15:26 138K 0 0 0 0 0 0 0 0 11G 11G
18:15:27 139K 0 0 0 0 0 0 0 0 11G 11G
18:15:28 140K 0 0 0 0 0 0 0 0 11G 11G
18:15:29 139K 0 0 0 0 0 0 0 0 11G 11G
18:15:30 140K 0 0 0 0 0 0 0 0 11G 11G
18:15:31 139K 0 0 0 0 0 0 0 0 11G 11G
18:15:32 33K 0 0 0 0 0 0 0 0 11G 11G
^C
Uncached sequential write
Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
18:17:48 0 0 0 0 0 0 0 0 0 11G 11G
18:17:49 0 0 0 0 0 0 0 0 0 11G 11G
18:17:50 664 372 56 7 2 365 100 5 2 11G 11G
18:17:51 1K 534 36 5 0 529 100 5 0 11G 11G
18:17:52 2K 774 33 6 0 768 100 6 0 10G 10G
18:17:53 2K 645 31 5 0 640 100 5 0 10G 10G
18:17:54 1K 645 35 5 0 640 100 5 0 10G 10G
18:17:55 2K 645 31 5 0 640 100 5 0 10G 10G
18:17:56 1K 646 35 6 0 640 100 6 0 10G 10G
18:17:57 2K 645 31 5 0 640 100 5 0 10G 10G
18:17:58 2K 774 33 6 0 768 100 6 0 10G 10G