Neelakanth Nadgir's blog
« Previous month (Mar 2009) | Main | Next month (May 2009) »
Wednesday Apr 22, 2009
Optimizing MySQL Performance with ZFS - Slides available
Today Allan and I presented "Optimizing MySQL Performance with ZFS" talk at MySQL Conference 2009. You can get the slides here (I have uploaded it to the conference website also)

Like I said during the talk, I will blog about ZFS comparisons with EXT3 when I have a chance to do the tests. Thanks for attending!

Posted at 09:35PM Apr 22, 2009 by Neelakanth Nadgir in MySQL  | 

Tuesday Apr 21, 2009
MySQL 5.4 on 2 Socket Nehalem system (Sun Fire X4270)
Now that MySQL 5.4 (internally code named performance version or summit) is officially released, I can tell you that I used MySQL 5.4 alpha for my Nehalem scaling studies in my earlier blog - MySQL Scalability on Nehalem systems (Sun Fire X4270). I am waiting to get hold of a 4 socket Nehalem system to see we scale; but that will have to wait for the MySQL conference to get over.

Allan managed to get slightly higher Sysbench Read-Only numbers than mine using the latest MySQL 5.4. Interestingly Solaris does better than Linux. Probably a bug since many of the optimizations in MySQL 5.4 are OS independent. But then a lot can happen in 12 months :-)

Posted at 04:57PM Apr 21, 2009 by Neelakanth Nadgir in MySQL  | 

Reducing Innodb mutex contention

Today Sun announces MySQL 5.4. This is a great day for customers as they can use systems with many cores much more efficiently. Its a great day for the MySQL community and the MySQL performance team because we made it happen. MySQL 5.4 includes a lot of community contributed fixes as well as many fixes from our team. Mikael and Allan are blogging about all the cool new features and the great scalability of MySQL 5.4. I thought I will take this opportunity to blog about some of the things we tried, and rejected. Sometimes there are a lot of things to be learnt from things that do not work :-)

Early on during our performance investigation, we were trying to see if we can reduce some of the contention in Innodb locks. If you are not familiar with Innodb locks, I suggest you read Tim Cook's excellent presentation to MySQL University on this very topic.

Background

In a nutshell, Innodb has 4 kinds of lock modes (shared, exclusive, intention shared, and intention exclusive). Since POSIX synchronization methods support maximum of 2 modes (reader or writer), Innodb implements its own set of locking primitives using a condition variable and a regular mutex. Innodb also implements its own spin followed by a block. When it gets a mutex in an not contended case, it is very fast. However, when spinning for a lock fails, it gets interesting.

The problem

A failed spin for a lock means it has to block for the lock to be available. Currently Innodb uses the wait array interface to keep track of who is waiting on what mutex. Unfortunately there is a global mutex protecting the wait array. This global mutex (sync_primary_wait_array) has shown to be hot in some high thread count experiments.

The following callstack illustrates this perfectly.
    nsec ---- Time Distribution --- count Stack
   16384 |                        |     8 mysqld`os_mutex_enter+0x4
   32768 |                        |    11 mysqld`sync_array_free_cell+0x28
   65536 |                        |     7 mysqld`mutex_spin_wait+0x194
  131072 |                        |     9 mysqld`trx_undo_assign_undo+0x30
  262144 |@                       |    34 mysqld`trx_undo_report_row_operation+0x168
  524288 |@@@@                    |   122 mysqld`btr_cur_update_in_place+0x160
 1048576 |@@@@@                   |   157 mysqld`btr_cur_optimistic_update+0xbc
 2097152 |@@@@@@                  |   186 mysqld`row_upd_clust_rec+0x50
 4194304 |@@                      |    86 mysqld`row_upd_clust_step+0x5f0

As you can see, mutex_spin fails, it has to wait, it looks for a free cell in the sync_array to block. Before it can find a free cell, it has to lock the sync_array and search for a free cell. It blocks (using OS primitives, not Innodb locking) on the sync_primary_wait_array.

Proposed Solution

Locking the whole sync_array to pick a free cell does not sound too scalable. I thought of following simple ways to fix this
  1. Use a mutex per cell instead of a global mutex.
  2. Use atomic ops to mark cells free/busy instead of grabbing the global mutex and checking.
  3. The search for a free cell always starts from 0. This is suboptimal as busy cells will tend to accumulate at the beginning. I propose starting at the previously found free cell and circling back after hitting the end.

I implemented a quick prototype to measure the improvement in performance before asking the Innodb folks to take a closer look. My performance tests showed very minimal improvement in performance.

Why did it not give a big boost in performance

An important thing to note is that this contention happened when it failed to get a mutex. A thread fails to get a mutex if another thread has already acquired the mutex. So it does not really matter how long (within reasonable limits), the thread that failed to acquire the mutex took to sleep on the lock.

Richard Smith had a great idea of totally bypassing the sync_array interface. He prototyped it and got some good gains, but I will let him talk about it.

Did we add this to 5.4?

No, since it did not improve performance, it is not there in 5.4. I doubt we will put it in unless we see a decent improvement in performance, or a drop in CPU utilization as a result of fixing it. There are other implementation issues (there is some deadlock detection code that depends on entire sync_array being locked, and others) that make it risky to fix without good justification.

This just illustrates how hard performance work can be at times. Sometimes the number of ideas rejected is more than what got accepted :-)

Posted at 09:13AM Apr 21, 2009 by Neelakanth Nadgir in MySQL  | 

Tuesday Apr 14, 2009
MySQL Scalability on Nehalem systems

MySQL Scalability on Nehalem systems (Sun Fire X4270)

Today Sun announced multiple servers based on Intel's Nehalem Processor. I had early access to a Sun Fire X4270 server (2 socket) for a couple of weeks. I used this opportunity to test some of latest MySQL performance and scalability enhancements. For someone unfamiliar with this system, this is a 2 socket 2U server with support for a max of 144GB of memory. With hyperthreading turned on, the operating system sees 16 CPUs.

Before I share the results of my findings, lets get clear on the terminology. Socket refers to physical sockets on the motherboard. CPU refers to the number of processors seen by the operating system. Core refers to the physical processing unit. A Nehalem socket has 4 cores. Thread refers to the hyperthreading threads. One Nehalem core has 2 threads. Using this terminology, the Sun Fire X4270 has 16 CPUs (2 sockets, 4 cores per socket, 2 threads per core).

I used the ever popular Sysbench benchmark. I used an internal version based off version of MySQL 5.1 running on OpenSolaris. Since the goal of this experiment was showcase MySQL (and Innodb) scalability, (and the X4270 system), I used a cached workload. You should be able to see similar speedups for regular applications, provided there are no IO bottlenecks and no known MySQL scalability issues are being exercised. The X4270 supports 16x2.5" disk drives (SATA, SAS or SSD) so IO should not be a problem for most workloads. I used the tunings mentioned in my earlier blog Maximizing Sysbench OLTP performance for MySQL.

Nehalem Hyperthreading

Nehalem incorporates Hyperthreading technology. Hyperthreading allows a core to run an additional software thread. along with the original thread. Since there is very little dedicated chip resources for the second thread, you cannot expect to see 2x boost in performance.

There two ways you can disable hyperthreading on Solaris.

  1. BIOS setting. During bootup, enter setup and disable Hyperthreading.
  2. Turn off 2nd thread of each core. In Solaris you can use the psrinfo -pv command to identify with CPUs correspond to the same cores, and then use the psradm -f command to turn them off.
Our tests indicated that both the methods are pretty similar in performance. Numbers below are TPS or transactions-per-transaction. Higher number indicates better performance.

Experiment Sockets CPUs seen
by Solaris
Read only
TPS
Read Write
TPS
Hyperthreading ON 2 16 6310 4652
Hyptherthreading OFF 2 8 4648 3584
  35% 29.7%

As you can see from the above tests, Hyperthreading gives you between 30-35% boost in performance.

System Scaling

There are two ways to study system scaling.

  1. One method is to study performance increase as we add more sockets to a system. This is useful for customers who want to evaluate the performance benefits of upgrading from a 1 socket system to a 2 socket system.
  2. The other way is see how a system behaves as we increase load. Scaling this way also showcases how the operating system schedules threads onto cores. In the case of Nehalem, the second thread of the core is not really a full blown core, so you will not see twice the horsepower of a single thread.
Lets look at both of them in detail.

Scaling from one Nehalem socket to two Nehalem sockets

To study system scaling across sockets, we typically fully populate each core/socket before moving on to the next core/socket. For example with the X4270, we use
1 core using both threads (1 CPUs)
2 cores using both threads in each core (4 CPUs)
4 cores using all threads (1 full socket) (8 CPUs)
2 full sockets(16 CPUs)

By fully allocating CPUs per core one socket at a time, we are basically showing what would happen if you only had the number of CPUs shown. This approach shows the best scalability and is also the most realistic approach.

Sockets CPUs seen
by Solaris
Read Only
TPS
ReadWrite
TPS
1 Socket 8 3364 2616
2 Sockets 16 6310 4652
  1.87x 1.77x

As you can see from above, going from 1 socket to 2 socket, we see a 87% improvement in ReadOnly test and 77% improvement in Read-Write performance.

Scaling with increasing load

For this experiment, we enable hyperthreading and increase the load on the MySQL server. For Sysbench, increasing load means that we increase the number of connections to the database. Since MySQL uses a thread per connection, Solaris must be able to spread these threads out in an optimal manner. i.e if we have 8 threads, Solaris should ideally use one thread per core. This means that for 8 threads, we should see performance close to what we achieved with Hyperthreading off.
Sysbench scaling with increasing load
Threads Read only
TPS
Read Write
TPS
1 569.06
452.12
2
1412.23
1066.06
4
2636.04
2080.23
8
4139.46
3268.45
16
6310.89
4652.63
32
6152.06
4286.59
48
6015.61
3944.35

As you can see, we got 4139 transactions per second at 8 threads for the read-only test and 3268 transactions per second for the read-write test. This is around 90% of what we get when we have hyperthreading disabled. Solaris does a great job of scheduling threads in an optimal way.

Conclusion

As you can see from the above benchmarks, many MySQL workloads will see very good scaling from one socket to two sockets. You can get 30-35% boost in performance by using Hyperthreading. When hyperthreading is enabled, Solaris does a great job of scheduling threads in an optimal way.
Posted at 08:19AM Apr 14, 2009 by Neelakanth Nadgir in MySQL  | 

Wednesday Apr 01, 2009
SSDs for Performance Engineers
Why should a performance engineer care about SSDs

There has a been lot of buzz regarding SSDs lately. SSDs change the dynamics of the IO subsystem. You are no longer limited by rotational latency and vibration effects. For a performance engineer this has many implications. Since performance engineers care mostly about performance, the first thought that comes to mind is "Are we going to see a big impact in benchmarks?".

The answer is really easy for IO bound benchmarks. How about CPU bound benchmarks? Many database benchmarks are CPU limited. Does a faster disk really change anything?

So what does an SSD really give you?

Faster IOPS

SSD's have a huge random IO capability. During a recent experiment with a SSD, I got around 12,000 random IO operations per second! I have seen SSDs where you can get more. If you have ever worked with rotating disks, this is HUGE. However, performance engineers have rarely been limited by IOPS. We can always use multiple disks to get the same IOPS requirements. Since we are talking benchmarks, and many benchmarks do not have a $/txn metric, we dont worry about cost :-) More importantly, many of the benchmarks (TPC-C, TPC-E, etc) are all CPU limited. So having a faster disk does not really change too many things.

Faster Latency

Performance engineers have long known that it is easy to compensate for the IO latency by using multiple threads of execution. You can always hide IO latency by using multiple threads of execution (provided you have a scalable workload). This is a very common technique and used in many benchmarks.

True value of a SSD

SSDs have a huge advantage in terms of cost, power and density. If you factor $/txn or Watt/txn, then SSDs clearly have a big advantage; and no wonder customers are really pumped up about SSDs. SSDs are really good for random read IO. For sequential IO, you are most likely going to be limited by your bus bandwidth and a few regular disks should be sufficient to max out the pipe. For writes, using a controller that has a write cache should provide the same latency for most database workloads. It is the random reads where the SSD's true value can be seen.

So where does SSDs really help?

For CPU bound workloads, the benefit of SSDs may be realized in the fact that you need fewer threads of execution to max out your CPUs. Using fewer threads of execution means there will less contention for resources, and it is possible for some workloads to benefit by reduced contention. This improvement is application specific. Scalable workloads will see less benefit than non-scalable workloads.

For mysql, this reduced contention would probably benefit a lot. It is probably the same effect as reducing innodb_thread_concurrency but with greater throughput

For IO bound applications or non-scalable applications (for ex, applications that hold a lock while doing IO), the decreased latency with SSDs will cause a huge improvement in their performance. SSDs kinda level the playing field. Most database vendors have spent years optimizing the IO codepath. Things like multiblock read, or IO sorting, etc are moot with SSDs. Copy-on-write is the same as 'inplace' modification of database blocks.

SSDs as a cheaper RAM

Probably the most important benefit of using SSDs with databases is that it reduces the penalty for missing the database buffer cache. A common question customers usually face is how much is the performance improvement will I get if I buy more memory? The answer is indirectly related to the penalty for missing the cache. A database buffer cache miss has to be fulfilled by an IO. For a regular disk this is atleast 6ms. For a SSD, this is less than 1ms!. It is straightforward to calculate avg time for IO for different hit rates. Depending on how many misses you have per transaction, the benefits of using SSD can be huge. I will blog with some actual numbers in a later blog.

So in conclusion, most customers will benefit a lot by using SSDs. Performance engineers are not so lucky :-)

Posted at 08:06PM Apr 01, 2009 by Neelakanth Nadgir in MySQL  |