Monday July 06, 2009
Scale beyond 8 cores?
If you try to benchmark a memcached server, you will see that it scales relatively good up to 4 threads, and if you go beyond 8 threads the throughput will start to drop. People have been talking about the scalability problems of the memcached server on the mailing lists, IRC and on the hackathons with various solutions to the problems. If you look at the implementation of the memcached server, you will see that it use only a limited set of locks:
| Lock | Purpose |
|---|---|
slabs_lock | The internal memory management of memcached use one single lock to protect access to the internal memory allocator. People tend to want to split this mutex into a separate mutex per slab class, but I don't think that this will give you any measurable performance wins. Why? well the _only_ time we will try to access this lock is during an operation that will modify an entry in the cache (add, set, replace, append, prepend, incr, decr). If this lock really is a performance bottleneck, your GET load is lower than your SET load and that is not the typical use pattern of memcached (and I would like to optimize for the common case...) |
stats_lock | In 1.2.x all modifications of statistical information was protected by this single lock. plockstat revealed that we had mutex contention on this mutex, so in 1.4 most of the statistics are collected per thread, and aggregated when you call the stats command. |
conn_lock | Access to the list of connection structures is guarded by this lock. If this lock comes up when you try to benchmark your memcached server, you would be way better off by reusing your connections to the memcached server. |
cache_lock | All access to the internal hash table is protected by this lock, and this is the lock I'm going to talk a bit more about! |
If you look at the implementation of the memcached server, it stores all of the items in one large hash map. Every time we need to insert, delete or search the hash table, we will try to get exclusive access to the entire hash table. And this is pretty much all that memcached does ;-) (You send a request, it looks up the item, and sends it back to you).
So how can we easily fix this problem without rewriting everyting? Well we could partition up the hash into multiple partitions and only lock down a subset of the hash instead of the complete hash. To avoid extra locking to update the LRU (if it spans two partitions) list I decided to let each partition have it's own LRU list. This sounded like a pretty easy fix to implement, so I just grabbed my laptop and implemented it while watching C.S.I on TV with my girlfriend one night :-)
So how did it scale? Zoran Radovic wrote the following blog entry: Scaling Memcached: 500,000+ Operations/Second with a Single-Socket UltraSPARC T2. If you are interested in the code, you can get it from http://github.com/trondn/memcached/tree/partition.
Posted at 01:10PM Jul 06, 2009 by trond in Memcached | Comments[5]
| « November 2009 | ||||||
| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 | 7 |
8 | 9 | 10 | 11 | 12 | 13 | 14 |
15 | 16 | 17 | 18 | 20 | 21 | |
22 | 23 | 24 | 25 | 26 | 27 | 28 |
29 | 30 | |||||
| Today | ||||||
Partitioning cache_lock is a big win. But I think there's a lot more than can be done.
Locks should be per data instead of per operation. There's no reason for using the same lock for items and hash table.
If you do that separation, data structures that are mostly read (like the hash table) can be protected by a rwlock instead of a mutex.
Have you taken a look at the patch I sent to the list?
If you are interested I have also another version with some more improvements like using lockless atomic operation (an thus no mutex and no contention) for refcounts and some stats.
Posted by Jaime on July 07, 2009 at 06:24 PM CEST #
did you thing in a trivial multi read and one write pattern acces to hash table ?
Bye
Posted by pau on July 07, 2009 at 06:38 PM CEST #
"did you thing in a trivial multi read and one write pattern acces to hash table ?"
That's just what I said: using a rwlock.
Posted by Jaime on July 07, 2009 at 06:42 PM CEST #
Jaime: My main goal was to improve the scalability without changing too much, and this patch removes the contention on the cache locks (at least it isn't the hot mutex anymore). I haven't looked closely to your patch, but we ran it with the same setup as the patritioned hash and it turned out that it had a slightly lower throughput. I don't have the numbers available right now but we could look more into that later on. I would like to get some numbers on where the hot spot is before I'm picking the next thing to change :-)
Posted by Trond Norbye on July 07, 2009 at 06:51 PM CEST #
I didnt read ;)
Posted by pau on July 07, 2009 at 06:51 PM CEST #