Wednesday Apr 22, 2009
Wednesday Apr 22, 2009
Like I said during the talk, I will blog about ZFS comparisons with EXT3 when I have a chance to do the tests. Thanks for attending!
Tuesday Apr 21, 2009
Allan managed to get
slightly higher Sysbench Read-Only numbers than mine using the latest MySQL 5.4. Interestingly Solaris does better than Linux. Probably a bug since many of the optimizations in MySQL 5.4 are OS independent. But then a lot can happen in 12 months
Today Sun announces MySQL 5.4. This is a great day for customers as
they can use systems with many cores much more efficiently. Its a great day for
the MySQL community and the MySQL performance team because we made it happen.
MySQL 5.4 includes a lot of community contributed fixes as well as many
fixes from our team. Mikael
and Allan are blogging about all the cool
new features and the great scalability of MySQL 5.4. I thought I will take
this opportunity to blog about some of the things we tried, and rejected.
Sometimes there are a lot of things to be learnt from things that
do not work
Early on during our performance investigation, we were trying to see if we can reduce some of the contention in Innodb locks. If you are not familiar with Innodb locks, I suggest you read Tim Cook's excellent presentation to MySQL University on this very topic.
In a nutshell, Innodb has 4 kinds of lock modes (shared, exclusive, intention shared, and intention exclusive). Since POSIX synchronization methods support maximum of 2 modes (reader or writer), Innodb implements its own set of locking primitives using a condition variable and a regular mutex. Innodb also implements its own spin followed by a block. When it gets a mutex in an not contended case, it is very fast. However, when spinning for a lock fails, it gets interesting.
A failed spin for a lock means it has to block for the lock to be available. Currently Innodb uses the wait array interface to keep track of who is waiting on what mutex. Unfortunately there is a global mutex protecting the wait array. This global mutex (sync_primary_wait_array) has shown to be hot in some high thread count experiments.
The following callstack illustrates this perfectly.
nsec ---- Time Distribution --- count Stack
16384 | | 8 mysqld`os_mutex_enter+0x4
32768 | | 11 mysqld`sync_array_free_cell+0x28
65536 | | 7 mysqld`mutex_spin_wait+0x194
131072 | | 9 mysqld`trx_undo_assign_undo+0x30
262144 |@ | 34 mysqld`trx_undo_report_row_operation+0x168
524288 |@@@@ | 122 mysqld`btr_cur_update_in_place+0x160
1048576 |@@@@@ | 157 mysqld`btr_cur_optimistic_update+0xbc
2097152 |@@@@@@ | 186 mysqld`row_upd_clust_rec+0x50
4194304 |@@ | 86 mysqld`row_upd_clust_step+0x5f0
As you can see, mutex_spin fails, it has to wait, it looks for a free cell in the sync_array to block. Before it can find a free cell, it has to lock the sync_array and search for a free cell. It blocks (using OS primitives, not Innodb locking) on the sync_primary_wait_array.
I implemented a quick prototype to measure the improvement in performance before asking the Innodb folks to take a closer look. My performance tests showed very minimal improvement in performance.
An important thing to note is that this contention happened when it failed to get a mutex. A thread fails to get a mutex if another thread has already acquired the mutex. So it does not really matter how long (within reasonable limits), the thread that failed to acquire the mutex took to sleep on the lock.
Richard Smith had a great idea of totally bypassing the sync_array interface. He prototyped it and got some good gains, but I will let him talk about it.
This just illustrates how hard performance work can be at times. Sometimes the
number of ideas rejected is more than what got accepted
Tuesday Apr 14, 2009
Today Sun announced multiple servers based on Intel's Nehalem Processor. I had early access to a Sun Fire X4270 server (2 socket) for a couple of weeks. I used this opportunity to test some of latest MySQL performance and scalability enhancements. For someone unfamiliar with this system, this is a 2 socket 2U server with support for a max of 144GB of memory. With hyperthreading turned on, the operating system sees 16 CPUs.
Before I share the results of my findings, lets get clear on the terminology. Socket refers to physical sockets on the motherboard. CPU refers to the number of processors seen by the operating system. Core refers to the physical processing unit. A Nehalem socket has 4 cores. Thread refers to the hyperthreading threads. One Nehalem core has 2 threads. Using this terminology, the Sun Fire X4270 has 16 CPUs (2 sockets, 4 cores per socket, 2 threads per core).
I used the ever popular Sysbench benchmark. I used an internal version based off version of MySQL 5.1 running on OpenSolaris. Since the goal of this experiment was showcase MySQL (and Innodb) scalability, (and the X4270 system), I used a cached workload. You should be able to see similar speedups for regular applications, provided there are no IO bottlenecks and no known MySQL scalability issues are being exercised. The X4270 supports 16x2.5" disk drives (SATA, SAS or SSD) so IO should not be a problem for most workloads. I used the tunings mentioned in my earlier blog Maximizing Sysbench OLTP performance for MySQL.
Nehalem incorporates Hyperthreading technology. Hyperthreading allows a core to run an additional software thread. along with the original thread. Since there is very little dedicated chip resources for the second thread, you cannot expect to see 2x boost in performance.
There two ways you can disable
hyperthreading on Solaris.
| Experiment | Sockets | CPUs seen by Solaris |
Read only TPS |
Read Write TPS |
|---|---|---|---|---|
| Hyperthreading ON | 2 | 16 | 6310 | 4652 |
| Hyptherthreading OFF | 2 | 8 | 4648 | 3584 |
| 35% | 29.7% | |||
There are two ways to study system
scaling.
To study system scaling across sockets, we typically fully populate each core/socket before moving on to the next core/socket. For example with the X4270, we use
| 1 core using both threads (1 CPUs) |
| 2 cores using both threads in each core (4 CPUs) |
| 4 cores using all threads (1 full socket) (8 CPUs) |
| 2 full sockets(16 CPUs) |
By fully allocating CPUs per core one socket at a time, we are basically showing what would happen if you only had the number of CPUs shown. This approach shows the best scalability and is also the most realistic approach.
| Sockets | CPUs seen by Solaris |
Read Only TPS |
ReadWrite TPS |
|---|---|---|---|
| 1 Socket | 8 | 3364 | 2616 |
| 2 Sockets | 16 | 6310 | 4652 |
| 1.87x | 1.77x | ||
As you can see from above, going from 1 socket to 2 socket, we see a
87% improvement in ReadOnly test and 77% improvement in
Read-Write performance.
![]() |
|
Wednesday Apr 01, 2009
There has a been lot of buzz regarding SSDs lately. SSDs change the dynamics of the IO subsystem. You are no longer limited by rotational latency and vibration effects. For a performance engineer this has many implications. Since performance engineers care mostly about performance, the first thought that comes to mind is "Are we going to see a big impact in benchmarks?".
The answer is really easy for IO bound benchmarks. How about CPU bound benchmarks? Many database benchmarks are CPU limited. Does a faster disk really change anything?
SSD's have a huge random IO capability. During a recent experiment with a SSD, I got around 12,000 random IO operations per second! I have seen SSDs where you can get more. If you have ever worked with rotating disks, this is HUGE. However, performance engineers have rarely been limited by IOPS. We can always use multiple disks to get the same IOPS requirements. Since we are talking benchmarks, and many benchmarks do not have a $/txn metric, we dont worry about cost
More importantly, many of the benchmarks (TPC-C, TPC-E, etc) are all CPU limited. So having a faster disk does not really change too many things.
Performance engineers have long known that it is easy to compensate for the IO latency by using multiple threads of execution. You can always hide IO latency by using multiple threads of execution (provided you have a scalable workload). This is a very common technique and used in many benchmarks.
SSDs have a huge advantage in terms of cost, power and density. If you factor $/txn or Watt/txn, then SSDs clearly have a big advantage; and no wonder customers are really pumped up about SSDs. SSDs are really good for random read IO. For sequential IO, you are most likely going to be limited by your bus bandwidth and a few regular disks should be sufficient to max out the pipe. For writes, using a controller that has a write cache should provide the same latency for most database workloads. It is the random reads where the SSD's true value can be seen.
For CPU bound workloads, the benefit of SSDs may be realized in the fact that you need fewer threads of execution to max out your CPUs. Using fewer threads of execution means there will less contention for resources, and it is possible for some workloads to benefit by reduced contention. This improvement is application specific. Scalable workloads will see less benefit than non-scalable workloads.
For mysql, this reduced contention would probably benefit a lot. It is probably the same effect as reducing innodb_thread_concurrency but with greater throughput
For IO bound applications or non-scalable applications (for ex, applications that hold a lock while doing IO), the decreased latency with SSDs will cause a huge improvement in their performance. SSDs kinda level the playing field. Most database vendors have spent years optimizing the IO codepath. Things like multiblock read, or IO sorting, etc are moot with SSDs. Copy-on-write is the same as 'inplace' modification of database blocks.
Probably the most important benefit of using SSDs with databases is that it reduces the penalty for missing the database buffer cache. A common question customers usually face is how much is the performance improvement will I get if I buy more memory? The answer is indirectly related to the penalty for missing the cache. A database buffer cache miss has to be fulfilled by an IO. For a regular disk this is atleast 6ms. For a SSD, this is less than 1ms!. It is straightforward to calculate avg time for IO for different hit rates. Depending on how many misses you have per transaction, the benefits of using SSD can be huge. I will blog with some actual numbers in a later blog.
So in conclusion, most customers will benefit a lot by using SSDs. Performance engineers are not so lucky