A software-based distributed caching system such as memcached is an important piece of today's largest Internet sites that support millions of concurrent users and deliver user-friendly response times. The distributed nature of memcached design transforms 1000s of servers into one large caching pool with gigabytes of memory per node. This blog entry explores single-instance memcached scalability for a few usage patterns.
Table below shows out-of-the-box (no custom OS rewrites or networking tuning required) performance with 10G networking hardware and one single-socket UltraSPARC T2-based server with 8 cores and 8 threads per core (64 threads on a chip). All runs are done with a single memcached instance and 40 worker threads so that about 3 cores (24 threads) are used for the critical networking stack that is also heavily parallelized. 40+24 threads is a nice balance for this particular scenario. memcached-1.3.1 with scalability patches described below, and libevent-1.4.9 (used by memcached) are both compiled for a 64-bit environment to enable large memory support of each caching node.
|
Average Object Size |
Operations/Second |
Bandwidth |
|
100 bytes |
530,000 |
1.2 Gb/s |
|
2048 bytes |
370,000 |
6.9 Gb/s |
|
4096 bytes |
255,000 |
9.2 Gb/s |
An average object size of 4 KBytes easily saturates one 10G link and we get close to the theoretical peak bandwidth of 10Gb/s. This data clearly shows that memcached planning work is very important to find a system balance that is a perfect fit for each application space. For some applications 1G interface could be sufficient, but for more general usage patterns with even larger objects, nothing else but 10G or multiple 10G links will provide enough network bandwidth. It is also worth mentioning that response times for all three test cases are in the sub-millisecond range during heavy loads.
Scalability Improvements: If we take a closer look at memcached-1.2.6 when 16 or 32 threads are used, lock-contention performance tools such as lockstat (part of all recent versions of Solaris) report that there are two very contended locks that limit our scalability. One such lock is used for internal memcached bookkeeping and statistics (stats_lock), and the other one is a global data lock called cache_lock that becomes a bottleneck when a lot of threads are competing for the globally shared memory within each node. Trond Norbye was able to make a scalability patch that removes contention from memcached and makes it very scalable. Please subscribe to Trond's weblog or memcached mailing lists if you are interested in implementation details.
Test Setup: The setup consists of two dual-socket UltraSPARC T2 Plus clients that are directly connected to the server with one 10G link each. The memcached server is using 64GB physical memory and is running at 1.4 GHz. Back-to-back networking is configured in a way that we can use two dedicated 10G links and only one memcached instance that snoops all the traffic from both links. In a more realistic setup, this would be done with a dedicated 10G switch, which is one of our top priorities and a work in progress.
Micro-benchmark Details: The memcached benchmark is extracted from the Apache Olio project. Before we start measurements at steady-state, the benchmark infrastructure initializes an empty memcached instance with a desired mix of very small (4-100 bytes), small (1024-2048 bytes), and medium objects (5120-20480 bytes). If the desired average object size is 2048 bytes, then the algorithm will create a mixture of 76%, 9.5%, and 14.5% respectively. We also avoid evictions by starting memcached with a larger allocated memory size than the desired total cache size (e.g., 48GB). All experiments here are using 90% GET operations and 10% SET operations, which is a fairly common usage pattern.
Next Steps: As a rule of thumb, optimization work never ends. For performance and scalability fans, this is good news because there are still several areas that will benefit from additional software improvements that are currently work in progress (both OpenSolaris and memcached code bases). Make sure to subscribe to this blog if interested in following this work. We will start looking at dual-socket UltraSPARC T2 Plus servers that can support 128 threads in a single box with even more networking, and that can achieve near-linear architectural scalability for a range of large and memory intensive workloads.
Acknowledgments: This work would not be possible without memcached scalability rewrites by Trond, networking and driver work by the OpenSolaris (Crossbow) community, and a Faban-based memcached micro-benchmark development and support.