Tuesday May 19, 2009

A software-based distributed caching system such as memcached is an important piece of today's largest Internet sites that support millions of concurrent users and deliver user-friendly response times. The distributed nature of memcached design transforms 1000s of servers into one large caching pool with gigabytes of memory per node. This blog entry explores single-instance memcached scalability for a few usage patterns.

Table below shows out-of-the-box (no custom OS rewrites or networking tuning required) performance with 10G networking hardware and one single-socket UltraSPARC T2-based server with 8 cores and 8 threads per core (64 threads on a chip). All runs are done with a single memcached instance and 40 worker threads so that about 3 cores (24 threads) are used for the critical networking stack that is also heavily parallelized. 40+24 threads is a nice balance for this particular scenario. memcached-1.3.1 with scalability patches described below, and libevent-1.4.9 (used by memcached) are both compiled for a 64-bit environment to enable large memory support of each caching node.

Average Object Size

Operations/Second

Bandwidth

100 bytes

530,000

1.2 Gb/s

2048 bytes

370,000

6.9 Gb/s

4096 bytes

255,000

9.2 Gb/s

An average object size of 4 KBytes easily saturates one 10G link and we get close to the theoretical peak bandwidth of 10Gb/s. This data clearly shows that memcached planning work is very important to find a system balance that is a perfect fit for each application space. For some applications 1G interface could be sufficient, but for more general usage patterns with even larger objects, nothing else but 10G or multiple 10G links will provide enough network bandwidth. It is also worth mentioning that response times for all three test cases are in the sub-millisecond range during heavy loads.

Scalability Improvements: If we take a closer look at memcached-1.2.6 when 16 or 32 threads are used, lock-contention performance tools such as lockstat (part of all recent versions of Solaris) report that there are two very contended locks that limit our scalability. One such lock is used for internal memcached bookkeeping and statistics (stats_lock), and the other one is a global data lock called cache_lock that becomes a bottleneck when a lot of threads are competing for the globally shared memory within each node. Trond Norbye was able to make a scalability patch that removes contention from memcached and makes it very scalable. Please subscribe to Trond's weblog or memcached mailing lists if you are interested in implementation details.

Test Setup: The setup consists of two dual-socket UltraSPARC T2 Plus clients that are directly connected to the server with one 10G link each. The memcached server is using 64GB physical memory and is running at 1.4 GHz. Back-to-back networking is configured in a way that we can use two dedicated 10G links and only one memcached instance that snoops all the traffic from both links. In a more realistic setup, this would be done with a dedicated 10G switch, which is one of our top priorities and a work in progress.

Micro-benchmark Details: The memcached benchmark is extracted from the Apache Olio project. Before we start measurements at steady-state, the benchmark infrastructure initializes an empty memcached instance with a desired mix of very small (4-100 bytes), small (1024-2048 bytes), and medium objects (5120-20480 bytes). If the desired average object size is 2048 bytes, then the algorithm will create a mixture of 76%, 9.5%, and 14.5% respectively. We also avoid evictions by starting memcached with a larger allocated memory size than the desired total cache size (e.g., 48GB). All experiments here are using 90% GET operations and 10% SET operations, which is a fairly common usage pattern.

Next Steps: As a rule of thumb, optimization work never ends. For performance and scalability fans, this is good news because there are still several areas that will benefit from additional software improvements that are currently work in progress (both OpenSolaris and memcached code bases). Make sure to subscribe to this blog if interested in following this work. We will start looking at dual-socket UltraSPARC T2 Plus servers that can support 128 threads in a single box with even more networking, and that can achieve near-linear architectural scalability for a range of large and memory intensive workloads.

Acknowledgments: This work would not be possible without memcached scalability rewrites by Trond, networking and driver work by the OpenSolaris (Crossbow) community, and a Faban-based memcached micro-benchmark development and support.

Friday May 15, 2009

Response times of any web application are very critical for the end-user experience. Steve Souders takes a detailed look at several large Web sites and concludes that 80-90% of the end-user response time is spent on the frontend, i.e., program code that is running inside your Web browser.

Traditional parallelization techniques and caching are without a doubt very effective in the design of scalable Web servers, databases, operating systems and other mission-critical software and hardware components. Assume that all these components are perfectly parallel and optimized, Amdhal's law still suggests that response time improvements will be very modest, or barely measurable.

Thursday May 14, 2009

One interesting and useful paper on real-world concurrency by Bryan Cantrill and Jeff Bonwick.

Abstract: In this look at how concurrency affects practitioners in the real world, Cantrill and Bonwick argue that much of the anxiety over concurrency is unwarranted. Most developers who build typical MVC systems can leverage parallelism by combining pieces of already concurrent software such as database and operating systems (i.e., concurrency through architecture), rather than by writing multithreaded code themselves. And for those who actually must deal with threads and locks, the authors include a helpful list of best practices to help minimize the pain.

This blog copyright 2009 by Zoran Radovic