Wednesday Sep 16, 2009
Wednesday Sep 16, 2009
I will continue blogging on my personal site at http://realneel.blogspot.com. Hope to see you there.
Wednesday Jun 10, 2009
MySQL clients uses COM_* commands to communicate with the MySQL server. MySQL show status breaks these commands into many categories and displays how many times each of these commands were executed. It, however, does not display the time taken to processes those commands as well as how many times the command execution resulted in an error. Enter Dtrace
I wrote a small DTrace script to figure out what commands(COM_*) are being executed on the server and summarize them. If you have used truss -c on Solaris before, you must liked its concise summary; I have tried to present the output in a similar fashion.
For Sysbench read-write (10 queries per transaction) test with 1 thread, each executing 1 transaction, you see
# ./cmdtruss
Sampling... Hit Ctrl-C to end.
^C
Command seconds calls errors
Query 0.042 2
Quit 0.000 2
CloseStmt 0.000 10
Prepare 0.000 11 1
Execute 0.001 20
------- ----- ------
total: 0.043 45 1
As you can see from above, each query is COM_STMT_PREPARE, then for each query sysbench executes COM_STMT_EXECUTE 20 times. and then each query is closed via COM_STMT_CLOSE. Each thread connects and disconnects from the server two times. There is one COM_STMT_PREPARE that results in an error. You will also notice that COM_CONNECT is not being captured. This is because the MYSQL_COMMAND_START probe does not capture it.
Here is an example where I am deliberately executing a query that results in an error. # while true; do mysql -e "show foobar"; done . cmdtruss shows
# ~/cmdtruss
Sampling... Hit Ctrl-C to end.
^C
Command seconds calls errors
Quit 0.000 80
Query 0.000 160 80
------- ----- ------
total: 0 240 80
As you can see only half the queries result in an error. Every invocation of mysql program results in select @@version_comment limit 1 being executed (and this succeeds). The other query show foobar results in an error.
This script uses the static probes defined in MySQL 5.4beta, so you cannot use it with earlier versions. Please feel free to use it and give me feedback. I think this is a quick and easy way to figure out what's happening in the MySQL server.
You can download the script here
Thursday May 28, 2009
#./inniostat -h
Usage: inniostat [-h] [-d] [-p pid] [interval]
-h : Print this message
-p : MySQL PID
-d : Dump dtrace script being used
# ./inniostat
__physical__ ___Innodb___ ____read____ ______write______
r/s w/s r/s w/s data pre log dblbuf dflush Time
24 121 24 50 24 0 50 0 0 16:00:57
26 130 26 51 26 0 51 0 0 16:00:58
18 134 18 54 18 0 54 0 0 16:00:59
25 129 25 51 25 0 51 0 0 16:01:00
29 116 46 47 17 29 47 0 0 16:01:01
10 140 10 132 10 0 52 0 80 16:01:02
29 129 35 53 14 21 53 0 0 16:01:03
| Col | Description |
|---|---|
| r/s | Physical (Actual) writes per sec |
| w/s | Physical (Actual) writes per sec |
| r/s | Reads issued by Innodb per sec |
| w/s | Writes issued by Innodb per sec |
| data | Regular Reads issued by Innodb to the datafiles per sec |
| pre | Prefetch Reads issued by Innodb to the datafiles per sec |
| log | Log writes issued by Innodb per sec |
| dblbuf | Double buffer writes per sec |
| dflush | Writes due to flushing of Innodb buffers to disk. |
For the above example you can see that writes to the Innodb double buffer (dblbuf) is zero. This is because I had turned it off explicitly. You will also notice that there are more physical writes than what Innodb is requesting. This can be due to a variety of factors (other engines are doing writes, other applications are doing writes, recordsize mismatch between filesystem and mysql, filesystem logging, etc.). You will also notice a few prefetch requests are being issued.
Consider the sample output below (contributed by Dimitri)
__physical__ ___Innodb___ ____read____ ______write______
r/s w/s r/s w/s data pre log dblbuf dflush Time
0 1681 0 5259 0 0 2780 0 2479 14:58:13
0 224 0 6111 0 0 6111 0 0 14:58:18
0 761 0 4300 0 0 3368 0 932 14:58:23
0 503 0 4232 0 0 3546 0 686 14:58:28
0 904 0 4024 0 0 2923 0 1101 14:58:33
0 1231 0 4046 0 0 2470 0 1575 14:58:38
0 502 0 4192 0 0 3640 0 552 14:58:43
0 928 0 4669 0 0 3331 0 1338 14:58:48
You can quickly notice that this is a cached workload (r/s is zero).
There is a significant difference
between log writes issued by Innodb and physical writes. This is because this particular setup had
innodb_flush_log_at_trx_commit set to 2 (i.e flush log writes every second). If want to flush the log after every commit, you need a disk[s] that can handle around 3000 writes per second.
You will also notice a lot of Innodb buffers being flushed (dflush). This maybe because modified buffers in the Innodb buffer pool are being flushed, and/or due to caching in the filesystem instead of in the InnoDB buffer pool.
Some (but not all) of these statistics are available via "show status" command. Be sure to understand what they are measuring before using them. Few of them can be little puzzling. For ex, Innodb_buffer_pool_read_ahead_seq variable counts the number of times a sequential read ahead was triggered. It does not tell you, how many pages were fetched as a result of the read ahead. Similarly, Innodb_log_write_requests tells you how many times a write happened to the log, but its is not really useful as this is incremented when someone writes to the in-memory log block, it does not tell you when the log was written to disk.
Innodb issues reads to either the data files or log files (during recovery). This script does not differentiate between them. Writes are either to the data files, log files, or double write buffer. If you view the source of the script (or use the -d option), you will see that I am interposing on the fil_io function and interpreting its arguments to differentiate IO types. I am using the io:::start probe to figure out the physical IO. Note that this script tracks Innodb IO requests. The actual IO may be carried out by a Innodb background thread some time later. This script also makes assumptions regarding several Innodb constants and hence may not be accurate for future versions.
You can download it here. Since it uses DTrace, it only works on Solaris, MacOS and FreeBSD. Please feel free to use it and let me know of any feedback/comments.
Tuesday May 26, 2009
One of the cool things about talking about MySQL performance with
ZFS is that there is not much tuning to be done
Tuning with ZFS
is considered
evil,
but a necessity at times. In this blog I will describe some of the
tunings that you can apply to get better performance with ZFS as
well as point out performance bugs which when fixed will nullify the
need for some of these tunings.
For the impatient, here is the summary. See below for the reasoning behind these recommendations and some gotchas.
Lets look at all of them in detail.
| What | Match ZFS recordsize with Innodb page size (16KB for Datafiles, and 128KB for Innodb log files). |
| How | zfs set recordsize=16k tank/db |
| Why |
The biggest boost in performance can be obtained by matching the ZFS record size with the size of the IO. Since a Innodb Page is 16KB in size, most read IO is of size 16KB (except for some prefetch IO which can get coalesced). The default recordsize for ZFS is 128KB. The mismatch between the read size and the ZFS recordsize can result in severely inflated IO. If you issue a 16KB read and the data is not already there in the ARC, you have to read 128KB of data to get it. ZFS cannot do a small read because the checksum is calculated for the whole block and you have to read it all to verify data integrity. The other reason to match the IO size and the ZFS recordsize is the read-modify-write penalty. With a ZFS recordsize of 128KB, When Innodb modifies a page, if the zfs record is not already in memory, it needs to be read in from the disk and modified before writing to disk. This increases the IO latency significantly. Luckily matching the ZFS recordsize with the IO size removes all the problems mentioned above. For Innodb log file, the writes are usually sequential and varying in size. By using ZFS recordsize of 128KB you amortize the cost of read-modify-write. |
| Note |
You need to set the recordsize before creating the database files. If you have already created the files, you need to copy the files to get the new recordsize. You can use the stat(2) command to check the recordsize (look for IO Block:) |
| What | If you have a write heavy workload, use a seperate intent log (slog). |
| How | zpool add log c4t0d0 c4t1d0 |
| Why |
Write latency is extremely critical for many MySQL workloads. Typically, a query will read some data, do some calculations, update some data and then commit the transaction. To commit, the Innodb log has to be updated. Many transactions can be committing at the same time. It is very important that this "wait" for commit be fast. Luckily in ZFS, synchronous writes can be accelerated up by using the Seperate Intent Log. In our tests with Sysbench read-write, we have seen around 10-20% improvement with the slog. |
| Note |
|
| What | L2ARC (or Level 2 ARC) |
| How | zpool add cache c4t0d0 |
| Why |
If your database does not fit in memory, every time you miss the database cache, you have to read a block from disk. This cost is quite high with regular disks. You can minimize the database cache miss latency by using a (or multiple) SSDs as a level-2 cache or L2ARC. Depending on your database working set size, memory and L2ARC size you may see several orders of magnitude improvement in performance. |
| Note |
| What | When it is safe, turn off ZFS cache flush |
| How |
The ZFS Evil tuning guide has more information about setting this tunable. Refer to it for the best way to achieve this. |
| Why |
ZFS is designed to work reliably with disks with caches. Everytime it needs data to be stored persistantly on disk, it issues a cache flush command to the disk. Disks with a battery backed caches need not do anything (i.e the cache flush command is a nop). Many storage devices interpret this correctly and do the right thing when they receive a cache flush command. However, there are still a few storage systems which do not interpret the cache flush command correctly. For such storage systems, preventing ZFS from sending the cache flush command results in a big reduction in IO latency. In our tests with Sysbench read-write test we saw a 30% improvement in performance. |
| Note |
|
| What | Prefer to cache within MySQL/Innodb over the ARC. |
| How | Via my.cnf and by limiting the ARC size |
| Why |
You have multiple levels of caching when you are using MySQL/Innodb with ZFS. Innodb has its own buffer pool and ZFS has the ARC. Both of them make independent decisions on what to cache and what to flush. It is possible for both of them to cache the same data. By caching inside Innodb, you get a much shorter (and faster) code path to the data. Moreover, when the Innodb buffer cache is full, a miss in the Innodb buffer cache can lead to flushing of a dirty buffer, even if the data was cached in the ARC. This leads to unnecessary writes. Even though the ARC dynamically shrinks and expands relative to memory pressure, it is more efficient to just limit it.In our tests, we have found that it is better (7-200%) to cache inside Innodb rather than ZFS. |
| Note |
The ARC can be tuned to cache everything, just metadata or nothing on a per filesystem basis. See below for tuning advise about this. |
| What | Disable ZFS Prefetch. |
| How | In /etc/system: set zfs:zfs_prefetch_disable = 1 |
| Why |
Most filesystems implement some kind of prefetch. ZFS prefetch detects linear (increasing and decreasing), strided, multiblock strided IO streams and issues prefetch IO when it will help performance. These prefetch IO have a lower priority than regular reads and are generally very beneficial. ZFS also has a lower level prefetch (commonly called vdev prefetch) to help with spatial locality of data. In Innodb, rows are stored in order of primary index. Innodb issues two kinds of prefetch requests; one is triggered while accessing sequential pages and other is triggered via random access in an extent. While issuing prefetch IO, Innodb assumes that file is laid out in the order of the primary key. This is not true for ZFS. We are yet to investigate the impact of Innodb prefetch. It is well known that OLTP workloads access data in a random order and hence do not benefit from prefetch. Thus we recommend that you turn off ZFS prefetch. |
| Note |
|
| What | Disable Innodb Double write buffer. |
| How | skip-innodb_doublewrite in my.cnf |
| Why |
Innodb uses a double write buffer for safely updating pages in a tablespace. Innodb first writes the changes to the double write buffer before updating the data page. This is to prevent partial writes. Since ZFS does not allow partial writes, you can safely turn off the double write buffer. In our tests with Sysbench read-write, we say a 5% improvement in performance. |
| Note |
Wednesday Apr 22, 2009
Like I said during the talk, I will blog about ZFS comparisons with EXT3 when I have a chance to do the tests. Thanks for attending!
Tuesday Apr 21, 2009
Allan managed to get
slightly higher Sysbench Read-Only numbers than mine using the latest MySQL 5.4. Interestingly Solaris does better than Linux. Probably a bug since many of the optimizations in MySQL 5.4 are OS independent. But then a lot can happen in 12 months
Today Sun announces MySQL 5.4. This is a great day for customers as
they can use systems with many cores much more efficiently. Its a great day for
the MySQL community and the MySQL performance team because we made it happen.
MySQL 5.4 includes a lot of community contributed fixes as well as many
fixes from our team. Mikael
and Allan are blogging about all the cool
new features and the great scalability of MySQL 5.4. I thought I will take
this opportunity to blog about some of the things we tried, and rejected.
Sometimes there are a lot of things to be learnt from things that
do not work
Early on during our performance investigation, we were trying to see if we can reduce some of the contention in Innodb locks. If you are not familiar with Innodb locks, I suggest you read Tim Cook's excellent presentation to MySQL University on this very topic.
In a nutshell, Innodb has 4 kinds of lock modes (shared, exclusive, intention shared, and intention exclusive). Since POSIX synchronization methods support maximum of 2 modes (reader or writer), Innodb implements its own set of locking primitives using a condition variable and a regular mutex. Innodb also implements its own spin followed by a block. When it gets a mutex in an not contended case, it is very fast. However, when spinning for a lock fails, it gets interesting.
A failed spin for a lock means it has to block for the lock to be available. Currently Innodb uses the wait array interface to keep track of who is waiting on what mutex. Unfortunately there is a global mutex protecting the wait array. This global mutex (sync_primary_wait_array) has shown to be hot in some high thread count experiments.
The following callstack illustrates this perfectly.
nsec ---- Time Distribution --- count Stack
16384 | | 8 mysqld`os_mutex_enter+0x4
32768 | | 11 mysqld`sync_array_free_cell+0x28
65536 | | 7 mysqld`mutex_spin_wait+0x194
131072 | | 9 mysqld`trx_undo_assign_undo+0x30
262144 |@ | 34 mysqld`trx_undo_report_row_operation+0x168
524288 |@@@@ | 122 mysqld`btr_cur_update_in_place+0x160
1048576 |@@@@@ | 157 mysqld`btr_cur_optimistic_update+0xbc
2097152 |@@@@@@ | 186 mysqld`row_upd_clust_rec+0x50
4194304 |@@ | 86 mysqld`row_upd_clust_step+0x5f0
As you can see, mutex_spin fails, it has to wait, it looks for a free cell in the sync_array to block. Before it can find a free cell, it has to lock the sync_array and search for a free cell. It blocks (using OS primitives, not Innodb locking) on the sync_primary_wait_array.
I implemented a quick prototype to measure the improvement in performance before asking the Innodb folks to take a closer look. My performance tests showed very minimal improvement in performance.
An important thing to note is that this contention happened when it failed to get a mutex. A thread fails to get a mutex if another thread has already acquired the mutex. So it does not really matter how long (within reasonable limits), the thread that failed to acquire the mutex took to sleep on the lock.
Richard Smith had a great idea of totally bypassing the sync_array interface. He prototyped it and got some good gains, but I will let him talk about it.
This just illustrates how hard performance work can be at times. Sometimes the
number of ideas rejected is more than what got accepted
Tuesday Apr 14, 2009
Today Sun announced multiple servers based on Intel's Nehalem Processor. I had early access to a Sun Fire X4270 server (2 socket) for a couple of weeks. I used this opportunity to test some of latest MySQL performance and scalability enhancements. For someone unfamiliar with this system, this is a 2 socket 2U server with support for a max of 144GB of memory. With hyperthreading turned on, the operating system sees 16 CPUs.
Before I share the results of my findings, lets get clear on the terminology. Socket refers to physical sockets on the motherboard. CPU refers to the number of processors seen by the operating system. Core refers to the physical processing unit. A Nehalem socket has 4 cores. Thread refers to the hyperthreading threads. One Nehalem core has 2 threads. Using this terminology, the Sun Fire X4270 has 16 CPUs (2 sockets, 4 cores per socket, 2 threads per core).
I used the ever popular Sysbench benchmark. I used an internal version based off version of MySQL 5.1 running on OpenSolaris. Since the goal of this experiment was showcase MySQL (and Innodb) scalability, (and the X4270 system), I used a cached workload. You should be able to see similar speedups for regular applications, provided there are no IO bottlenecks and no known MySQL scalability issues are being exercised. The X4270 supports 16x2.5" disk drives (SATA, SAS or SSD) so IO should not be a problem for most workloads. I used the tunings mentioned in my earlier blog Maximizing Sysbench OLTP performance for MySQL.
Nehalem incorporates Hyperthreading technology. Hyperthreading allows a core to run an additional software thread. along with the original thread. Since there is very little dedicated chip resources for the second thread, you cannot expect to see 2x boost in performance.
There two ways you can disable
hyperthreading on Solaris.
| Experiment | Sockets | CPUs seen by Solaris |
Read only TPS |
Read Write TPS |
|---|---|---|---|---|
| Hyperthreading ON | 2 | 16 | 6310 | 4652 |
| Hyptherthreading OFF | 2 | 8 | 4648 | 3584 |
| 35% | 29.7% | |||
There are two ways to study system
scaling.
To study system scaling across sockets, we typically fully populate each core/socket before moving on to the next core/socket. For example with the X4270, we use
| 1 core using both threads (1 CPUs) |
| 2 cores using both threads in each core (4 CPUs) |
| 4 cores using all threads (1 full socket) (8 CPUs) |
| 2 full sockets(16 CPUs) |
By fully allocating CPUs per core one socket at a time, we are basically showing what would happen if you only had the number of CPUs shown. This approach shows the best scalability and is also the most realistic approach.
| Sockets | CPUs seen by Solaris |
Read Only TPS |
ReadWrite TPS |
|---|---|---|---|
| 1 Socket | 8 | 3364 | 2616 |
| 2 Sockets | 16 | 6310 | 4652 |
| 1.87x | 1.77x | ||
As you can see from above, going from 1 socket to 2 socket, we see a
87% improvement in ReadOnly test and 77% improvement in
Read-Write performance.
![]() |
|
Wednesday Apr 01, 2009
There has a been lot of buzz regarding SSDs lately. SSDs change the dynamics of the IO subsystem. You are no longer limited by rotational latency and vibration effects. For a performance engineer this has many implications. Since performance engineers care mostly about performance, the first thought that comes to mind is "Are we going to see a big impact in benchmarks?".
The answer is really easy for IO bound benchmarks. How about CPU bound benchmarks? Many database benchmarks are CPU limited. Does a faster disk really change anything?
SSD's have a huge random IO capability. During a recent experiment with a SSD, I got around 12,000 random IO operations per second! I have seen SSDs where you can get more. If you have ever worked with rotating disks, this is HUGE. However, performance engineers have rarely been limited by IOPS. We can always use multiple disks to get the same IOPS requirements. Since we are talking benchmarks, and many benchmarks do not have a $/txn metric, we dont worry about cost
More importantly, many of the benchmarks (TPC-C, TPC-E, etc) are all CPU limited. So having a faster disk does not really change too many things.
Performance engineers have long known that it is easy to compensate for the IO latency by using multiple threads of execution. You can always hide IO latency by using multiple threads of execution (provided you have a scalable workload). This is a very common technique and used in many benchmarks.
SSDs have a huge advantage in terms of cost, power and density. If you factor $/txn or Watt/txn, then SSDs clearly have a big advantage; and no wonder customers are really pumped up about SSDs. SSDs are really good for random read IO. For sequential IO, you are most likely going to be limited by your bus bandwidth and a few regular disks should be sufficient to max out the pipe. For writes, using a controller that has a write cache should provide the same latency for most database workloads. It is the random reads where the SSD's true value can be seen.
For CPU bound workloads, the benefit of SSDs may be realized in the fact that you need fewer threads of execution to max out your CPUs. Using fewer threads of execution means there will less contention for resources, and it is possible for some workloads to benefit by reduced contention. This improvement is application specific. Scalable workloads will see less benefit than non-scalable workloads.
For mysql, this reduced contention would probably benefit a lot. It is probably the same effect as reducing innodb_thread_concurrency but with greater throughput
For IO bound applications or non-scalable applications (for ex, applications that hold a lock while doing IO), the decreased latency with SSDs will cause a huge improvement in their performance. SSDs kinda level the playing field. Most database vendors have spent years optimizing the IO codepath. Things like multiblock read, or IO sorting, etc are moot with SSDs. Copy-on-write is the same as 'inplace' modification of database blocks.
Probably the most important benefit of using SSDs with databases is that it reduces the penalty for missing the database buffer cache. A common question customers usually face is how much is the performance improvement will I get if I buy more memory? The answer is indirectly related to the penalty for missing the cache. A database buffer cache miss has to be fulfilled by an IO. For a regular disk this is atleast 6ms. For a SSD, this is less than 1ms!. It is straightforward to calculate avg time for IO for different hit rates. Depending on how many misses you have per transaction, the benefits of using SSD can be huge. I will blog with some actual numbers in a later blog.
So in conclusion, most customers will benefit a lot by using SSDs. Performance engineers are not so lucky
Friday Feb 13, 2009
One of the most important features of MySQL is the support for pluggable storage engines. Since users use MySQL in different ways, one storage engine may not fit everyone 's needs. There are lots of engines for a user/application to choose from. Applications (and users) access MySQL using a uniform interface irrespective of what kind of storage engine is being used. As with any kind of software layering, it is possible to lose optimization opportunities as we cross layer boundaries. In this blog I will discuss one such lost opportunity.
Internally MySQL uses the PSAPI (public storage engine api) to communicate with the storage engines. The MySQL server uses a fixed row format. Storage engines are free to use whatever row format they choose (for ex, InnoDB uses a different row format). The advantage of using a different row format is that specialized storage engines could store rows optimally. For example, a storage engine for DSS could store data in column based format. The disadvantage of having a different row format is that you need to convert data to and from the MySQL row format every time data traverses the storage engine api. This copying does not come free.
Let me illustrate this by an example. I have a table with 1000 records. I execute a query (select id from sbtest1 where length(c)>0;) that does a full table scan and count how many times a function is executed. Note this is a count, and does not include how much time the function actually took. Looking at function counts, we see
130 rec_copy_prefix_to_buf 161 mtr_memo_slot_release 1000 Item_field::val_str(String*) 1000 ha_innobase::unlock_row() 1000 ha_innobase::general_fetch(unsigned char*, unsigned, unsigned) 1000 Field_string::val_str(String*, String*) 1000 Item_func_gt::val_int() 1000 Arg_comparator::compare_int_signed() 1000 Item_func_length::val_int() 1000 lock_clust_rec_cons_read_sees 1000 my_lengthsp_8bit 1000 row_sel_store_mysql_rec 1001 rr_sequential(st_read_record*) 1001 evaluate_join_record(JOIN*, st_join_table*, int) 1001 ha_innobase::rnd_next(unsigned char*) 1001 Item_int::val_int() 1001 rec_get_offsets_func 1001 row_search_for_mysql 1004 handler::ha_statistic_increment(unsigned long system_status_var::*) const 1010 thd_ha_data 2152 memcpy
As expected, we see quite a few functions being called 1000 times. It is interesting to see that memcpy is being called two times the row count. Looking closer at what calls memcpy, we see
8 net_write_buff(st_net*, const unsigned char*, unsigned long) 9 alloc_root 125 mtr_memo_slot_release 130 rec_copy_prefix_to_buf 871 row_search_for_mysql 1000 row_sel_store_mysql_rec
row_sel_store_mysql_rec is used to convert a row from the InnoDB format to the MySQL format. It uses memcpy to copy every field that is required (i.e part of the select or condition). For integers, conversion is done using a for loop. For everything else (except blobs), memcpy is used (once per field). If I change my query to two columns in the where clause, I expect to see twice the number of calls to memcpy. Innodb also does an additional memcpy for each query (called from row_search_for_mysql). I need to take a closer look at why its done.
So why am I picking on memcpy? During a recent investigation of CPU consumers for a sysbench read-only test, memcpy was the top consumer of CPU. The cost of memcpy can get very high if there is a lot of data being transferred between MySQL and the storage engine. You can always reduce this penalty by minimizing the data transfer between the MySQL server and storage engine by reducing full table scans, or pushing the query condition down to the storage engine etc..
Remember, software layering does not come free, but sometimes it might be worth it!
Monday Feb 09, 2009
UFS has been the main filesystem on Solaris until the arrival of ZFS in 2004. UFS has been around since the SunOS 4.x days and is still quite widely used on Solaris. Although OpenSolaris uses ZFS by default, many database users have mostly used UFS.
UFS is a buffered filesystem because by default it uses free memory to cache data. While buffering is useful for a large number of applications, database users have generally stayed away from it by using DirectIO. If you are using UFS with MySQL, you have the option of using UFS buffered or UFS DirectIO. In this blog I will try to describe some of the reasoning behind why UFS DirectIO should be used with MySQL.
UFS buffers filesystem data using 8K pages (4K on x86) in the buffer cache. Innodb stores records in 16K size pages and issues reads of size 16k. This corresponds to two UFS page reads of 8K. The current read-ahead logic in UFS is triggered when it sees a sequential read of 2 UFS pages. The read ahead then issues large (1MB) reads. Since MySQL (like other databases) does random IO, this large read is unnecessary. In many cases , this large read will overwhelm the storage system leading to large service times, and ultimately bad performance.
To get around this issue, we can use DirectIO. Directio is an overloaded term that basically means
One of the main benefits of using UFS DirectIO is the avoidance of the single writer lock. POSIX dictates that writes are atomic i.e multiple writers to a same block need to be queued1. When using DirectIO this constraint is relaxed and it is up to the databases to ensure atomicity. Additionally, the copying of data directly to/from the applications memory saves CPU cycles, making transactions more efficient. However, using DirectIO means also that storage engines like MyISAM, that relies on the filesystem buffer cache, will be negatively affected. Storage engines like Innodb, will not be able to benefit from write coalescing.
For writes, Innodb provides an option of using regular writes followed by a fsync() at transaction commit time (the default), or using O_SYNC writes. Many customer use "regular writes followed by fsync()" because there is a possibility of better performance because of write coalescing. InnoDB also provides an unsafe option where, it does not flush the writes at the end of the transaction, but rather flushes it every second. The benefit of write coalescing is much higher here, but it comes at the cost of data integrity -- a crash can leave your database in an inconsistent state. For customers relying on this feature, using DirectIO will negate the effect of the write coalescing.
Luckily Innodb provides an option (innodb_flush_method=O_DIRECT) where directio is only enabled for the datafiles. Log files still use the default flush method. Using this option gives you the best of both worlds -- Use DirectIO, but still benefit from write coalescing for the logs. You still lose write coalescing for the datafiles, but since those writes happen asynchronously (the query does not wait for the datafile write to complete), it is less critical for performance. If you want to use DirectIO for the logs also, you can mount the filesystem with the forcedirectio option.
In our tests with Sysbench, we have observed that the penalty introduced by UFS read ahead is much more than the benefit of write coalescing.
In case you are wondering about ZFS, it does not suffer from the main limitations of UFS, and as a result it has less need for DirectIO (which is not yet supported on ZFS). ZFS does not suffer from the main reasons behind why DirectIO is so beneficial with UFS as itdoes not have the single writer lock problem. ZFS uses range locks to ensure multiple writes are able to update different parts of the file simultaneously. It also has a highly scalable cache (called the ARC). I will blog about ZFS and MySQL some other time.
Wednesday Nov 05, 2008
shell>dtrace -qn io:::start'/args[0]->b_flags & B_READ/{@[tid]=count()}tick-5s{exit(0)}'
709 39
711 42
710 43
712 51
As you can see above, all 4 threads are issuing reads.
Since innodb_thread_concurrency limits how many threads are executing inside Innodb, it acts as a throttle on the number of reads that can be issued in parallel.
Optionally you can use dtrace to look at the callstacks for reads and verify that the reads originate while reading pages of the Innodb datafiles.
So why are reads triggering fdsync() ?
In case the Innodb buffer gets full, it needs to evict data from the buffer pool to make space for the new record that is being read. The eviction function (buf_flush_free_margin()) also flushes possible buffered writes from the double write memory buffer. This flush is done using fdsync(). You can try increasing the buffer pool size incase the fdsync() bothers you. It is also possible to turn off the Innodb double write buffer, but it is generally not recommended (except for ZFS) as you might lose data in case of a power outage.
So in conclusion,
Monday Oct 13, 2008
pfexec pkg install ss-dev
export CC=/opt/SunStudioExpress/bin/cc export CXX=/opt/SunStudioExpress/bin/CC export CFLAGS="-xO3" #Your fav compiler flags go here export CXXFLAGS="-xO3" ./configure --with-plugins=innobase,myisam --with-mysqld-libs=-lmtmalloc gmake
Yes, it is that simple. Of course you could just download the binaries and use them too; they work just fine.
Tuesday Sep 23, 2008
One example is Bug#32149 Long semaphore wait for adaptive hash latch. The bug synopsis could very well have been Why are global locks in a storage engine such a BAD idea. There is lots of useful information about this and the workaround MySQL server uses to work with storage engines that do.. Interesting!
Do you know of any other interesting bugs that document MySQL internals?
Tuesday Sep 16, 2008
Percona recently released a patch which includes performance fixes developed by Yasufumi Kinoshita from NTT Comware. This helps diskbound applications quite significantly. Details at Bug #29413 Maximum performance of OLTP benchmark is not so scalable on multi-cpu. It looks like the bulk of performance improvements come from breaking up the lock guarding the buffer pool structures, and improvements in the IO code path. The "buf_pool->mutex" also gets quite hot when concurrency is not limited (via innodb_thread_concurrency) and you have multiple threads executing on multi-core systems.
Google recently released an updated patch for improving Innodb scalability. My experience with this patch shows a significant scalability improvement for many workloads. Great work Mark, Ben and rest of the gang!
This is excellent news! This clearly demonstrates that we (as a community) are making great progress in making MySQL scale. This also underlines the fact that the choice of the storage engine will influence your scalability experience quite a bit. This also sets the bar higher for Falcon and Maria
(A good thing in my opinion)