OpenSolaris Beats Linux on memcached Sun Fire X2270
memcached is the de facto distributed caching server used to scale many Web 2.0 sites today. As sites grow and must support very large numbers of users, memcached aids scalability by cutting down on MySQL traffic and improving response times.
- memcached is a very light-weight server but is known not to scale beyond 4-6 threads. Some scalability improvements have gone into the 1.3 release (still in beta).
- As customers move to the newer, more powerful Intel Nehalem-based systems, it is important that they can use these systems efficiently with the appropriate software and hardware components.
Performance Landscape
memcached performance results: ops/sec (bigger is better)
| System | C/C/T | Processor GHz | Processor Type | Memory | Operating System | Performance (ops/sec) |
|---|---|---|---|---|---|---|
| Sun Fire X2270 | 2/8/16 | 2.93 | Intel X5570 QC | 48GB | OpenSolaris 2009.06 | 352K |
| Sun Fire X2270 | 2/8/16 | 2.93 | Intel X5570 QC | 48GB | Red Hat Enterprise Linux 5 (kernel 2.6.29) | 281K |
C/C/T: Chips, Cores, Threads
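As a quick check on the table, the OpenSolaris result works out to roughly 25% higher than the Linux result on identical hardware. A minimal Python sketch of that arithmetic, using only the two published numbers (the per-core figure simply divides by the 8 physical cores and ignores hyper-threading):

```python
# Published results from the performance table (ops/sec).
opensolaris_ops = 352_000
linux_ops = 281_000

# Ratio and percentage advantage of OpenSolaris over Linux on the same hardware.
ratio = opensolaris_ops / linux_ops
advantage_pct = (ratio - 1) * 100

# Per-core throughput on the 2-chip, 8-core system (hardware threads not counted).
cores = 8
print(f"OpenSolaris/Linux ratio: {ratio:.3f}")
print(f"OpenSolaris advantage: {advantage_pct:.1f}%")
print(f"Per core: {opensolaris_ops / cores:,.0f} vs {linux_ops / cores:,.0f} ops/sec")
```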
Results and Configuration Summary
Sun's results used the following hardware and software components.
Hardware:
- Sun Fire X2270
- 2 x Intel X5570 QC 2.93 GHz
- 48GB of memory
- 10GbE Intel Oplin card
Software:
- OpenSolaris 2009.06
- Red Hat Enterprise Linux 5 (kernel 2.6.29)
Benchmark Description
memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. The memcached benchmark was based on Apache Olio, a Web 2.0 workload.
The benchmark initially populates the server cache with objects of different sizes to simulate the types of data that real sites typically store in memcached:
- small objects (4-100 bytes) to represent locks and query results
- medium objects (1-2 KBytes) to represent thumbnails, database rows, resultsets
- large objects (5-20 KBytes) to represent whole or partially generated pages
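The three object classes can be pictured as a simple value generator. This is a minimal Python sketch under stated assumptions: the size ranges come from the list above, but the class labels, the uniform choice among classes, and the uniform size within each range are hypothetical, since the benchmark's actual distribution is not given here. A plain dict stands in for the memcached server, and 1 KByte is taken as 1024 bytes.

```python
import random

# Size ranges in bytes from the benchmark description; the class labels are hypothetical.
SIZE_CLASSES = {
    "small": (4, 100),               # locks, query results
    "medium": (1 * 1024, 2 * 1024),  # thumbnails, database rows, result sets
    "large": (5 * 1024, 20 * 1024),  # whole or partially generated pages
}

def make_value(rng, size_class):
    """Return a value whose length is uniform within the class's range (an assumption)."""
    low, high = SIZE_CLASSES[size_class]
    return b"x" * rng.randint(low, high)

def populate(rng, n_keys):
    """Fill a stand-in cache (a plain dict) with n_keys objects, picking each class uniformly."""
    cache = {}
    for i in range(n_keys):
        size_class = rng.choice(list(SIZE_CLASSES))
        cache[f"{size_class}:{i}"] = make_value(rng, size_class)
    return cache

cache = populate(random.Random(42), 1000)
print(sum(len(v) for v in cache.values()), "bytes cached in", len(cache), "objects")
```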
The benchmark then runs a mixture of operations (90% gets, 10% sets) and measures the throughput and response times when the system reaches steady state. The workload is implemented using Faban, an open-source benchmark development framework. Faban not only speeds up benchmark development; its harness is also a convenient way to queue, monitor, and archive runs for analysis.
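The 90/10 operation mix can be sketched as a simple loop. This is not Faban's driver API and does not talk to memcached over the network: it is a minimal, single-threaded Python illustration that runs against an in-process dict, with hypothetical function names, no warm-up or steady-state detection, and a hypothetical fixed-size value for sets.

```python
import random
import time

def run_mix(cache, n_ops, get_ratio=0.9, seed=1):
    """Issue n_ops operations against a stand-in cache: get_ratio gets, the rest sets."""
    rng = random.Random(seed)
    keys = list(cache)
    counts = {"get": 0, "set": 0, "hit": 0}
    start = time.perf_counter()
    for _ in range(n_ops):
        key = rng.choice(keys)
        if rng.random() < get_ratio:
            counts["get"] += 1
            if cache.get(key) is not None:
                counts["hit"] += 1
        else:
            counts["set"] += 1
            cache[key] = b"x" * 100  # hypothetical fixed-size overwrite
    elapsed = time.perf_counter() - start
    counts["ops_per_sec"] = n_ops / elapsed if elapsed > 0 else float("inf")
    return counts

cache = {f"key:{i}": b"v" * 64 for i in range(1000)}
result = run_mix(cache, 100_000)
print(result)
```

A real driver would run many concurrent clients over TCP and record per-operation response times; the loop only shows how the get/set ratio is applied.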
Key Points and Best Practices
OpenSolaris Tuning
The following /etc/system settings were used to set the number of MSI-X interrupts:
- set ddi_msix_alloc_limit=4
- set pcplusmp:apic_intr_policy=1
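In /etc/system, a `set module:variable=value` line sets a variable inside the named kernel module (here `pcplusmp`), while a line without a module prefix sets a variable in the core kernel; lines starting with `*` are comments, and changes take effect at the next reboot. The minimal Python sketch below parses only that `set` form to make the module/variable split explicit; the function name is hypothetical and other /etc/system directives are deliberately ignored.

```python
import re

# Only the "set [module:]variable=value" form is handled; other directives are skipped.
SET_LINE = re.compile(r"^set\s+(?:(?P<module>\w+):)?(?P<var>\w+)\s*=\s*(?P<value>\S+)$")

def parse_etc_system(text):
    """Map (module, variable) -> value for each 'set' line; module is None for the core kernel."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):  # blank lines and '*' comments
            continue
        match = SET_LINE.match(line)
        if match:
            settings[(match["module"], match["var"])] = match["value"]
    return settings

tuning = """\
* MSI-X tuning used for the memcached runs
set ddi_msix_alloc_limit=4
set pcplusmp:apic_intr_policy=1
"""
print(parse_etc_system(tuning))
```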
For the ixgbe interface, 4 transmit and 4 receive rings gave the best performance:
- tx_queue_number=4, rx_queue_number=4
The Crossbow threads were bound to CPUs 12-15:
dladm set-linkprop -p cpus=12,13,14,15 ixgbe0
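When scripting the binding, the `dladm` command can be assembled from a CPU list. The minimal Python sketch below builds the same command shown above; the helper name is hypothetical, it only executes the command when `dry_run=False` (which needs OpenSolaris and sufficient privileges), and it assumes the X2270's 16 hardware threads are numbered 0-15, so CPUs 12-15 are the last four. The resulting setting can be inspected with `dladm show-linkprop -p cpus ixgbe0`.

```python
import subprocess

def bind_link_cpus(link, cpus, dry_run=True):
    """Build (and optionally run) the dladm command that binds a link's Crossbow threads to CPUs."""
    argv = ["dladm", "set-linkprop", "-p", "cpus=" + ",".join(str(c) for c in cpus), link]
    if not dry_run:
        # Raises CalledProcessError if dladm reports a failure.
        subprocess.run(argv, check=True)
    return argv

# Four CPUs, the same count as the transmit and receive rings configured for ixgbe.
cmd = bind_link_cpus("ixgbe0", list(range(12, 16)))
print(" ".join(cmd))  # dladm set-linkprop -p cpus=12,13,14,15 ixgbe0
```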
Linux Tuning
Linux was more complicated to tune; the following Linux tunables were changed to try to get the best performance:
- net.ipv4.tcp_timestamps = 0
- net.core.wmem_default = 67108864
- net.core.wmem_max = 67108864
- net.core.optmem_max = 67108864
- net.ipv4.tcp_dsack = 0
- net.ipv4.tcp_sack = 0
- net.ipv4.tcp_window_scaling = 0
- net.core.netdev_max_backlog = 300000
- net.ipv4.tcp_max_syn_backlog = 200000
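These tunables are ordinary sysctl keys: each one corresponds to a file under /proc/sys (dots become slashes), and a file of `key = value` lines such as /etc/sysctl.conf can be loaded with `sysctl -p`. The minimal Python sketch below renders the values above in that format and shows the /proc/sys mapping; the helper names are hypothetical, and it writes nothing to the system. Note that 67108864 bytes is 64 MiB.

```python
from pathlib import PurePosixPath

# Values from the Linux tuning list (buffer sizes in bytes; 67108864 = 64 MiB).
LINUX_TUNABLES = {
    "net.ipv4.tcp_timestamps": 0,
    "net.core.wmem_default": 67108864,
    "net.core.wmem_max": 67108864,
    "net.core.optmem_max": 67108864,
    "net.ipv4.tcp_dsack": 0,
    "net.ipv4.tcp_sack": 0,
    "net.ipv4.tcp_window_scaling": 0,
    "net.core.netdev_max_backlog": 300000,
    "net.ipv4.tcp_max_syn_backlog": 200000,
}

def proc_sys_path(key):
    """sysctl keys map onto /proc/sys by replacing dots with slashes."""
    return PurePosixPath("/proc/sys") / key.replace(".", "/")

def render_sysctl_conf(tunables):
    """Render 'key = value' lines in the format read by `sysctl -p`."""
    return "".join(f"{key} = {value}\n" for key, value in tunables.items())

print(render_sysctl_conf(LINUX_TUNABLES))
print(proc_sys_path("net.core.wmem_max"))  # /proc/sys/net/core/wmem_max
```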
Here are the ixgbe-specific settings that were used (2 transmit and 2 receive rings):
- RSS=2,2 InterruptThrottleRate=1600,1600
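These are load-time parameters of the Intel ixgbe driver, where each comma-separated value is assumed to apply to one port (two ports here). The source does not say how they were passed; a common approach is an `options ixgbe ...` line in the modprobe configuration (for example /etc/modprobe.conf on Red Hat Enterprise Linux 5), which is an assumption. The minimal Python sketch below renders that line with a hypothetical helper and converts the throttle rate, which the Intel driver documents as interrupts per second per vector, into a minimum interval between interrupts.

```python
def ixgbe_options_line(per_port):
    """Render a modprobe 'options' line; each parameter gets one comma-separated value per port."""
    params = " ".join(f"{name}={','.join(str(v) for v in values)}" for name, values in per_port.items())
    return f"options ixgbe {params}"

settings = {"RSS": [2, 2], "InterruptThrottleRate": [1600, 1600]}
line = ixgbe_options_line(settings)
print(line)  # options ixgbe RSS=2,2 InterruptThrottleRate=1600,1600

# At most 1600 interrupts/sec per vector means at least 1e6 / 1600 = 625 microseconds apart.
min_interval_us = 1_000_000 / 1600
print(f"minimum interrupt interval: {min_interval_us:.0f} us")
```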
Linux Issues
The 10GbE Intel Oplin card on Linux required the following driver and kernel rebuilds:
- With the default ixgbe driver from the Red Hat distribution (version 1.3.30-k2 on kernel 2.6.18), the interface simply hung during the benchmark test.
- This led us to download the driver from the Intel site (1.3.56.11-2-NAPI) and recompile it. This version works, and we got a maximum throughput of 232K operations/sec on the same Linux kernel (2.6.18). However, this version of the kernel does not support multiple TX rings.
- Kernel version 2.6.29 includes support for multiple TX rings but does not include ixgbe driver version 1.3.56.11-2-NAPI, so we downloaded, built, and installed that kernel and driver. This combination worked well, giving a maximum throughput of 281K ops/sec with some tuning.
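The Linux bring-up can be summarized numerically: moving from kernel 2.6.18 to 2.6.29 with the same Intel driver raised the best result from 232K to 281K ops/sec, about 21%. Because the 2.6.29 figure was obtained "with some tuning", this gain cannot be attributed to multiple TX rings alone. A minimal Python sketch of the arithmetic, using only the figures reported in the list above:

```python
# Best throughput at each step of the Linux bring-up (ops/sec).
# None marks the stock-driver run, which hung before producing a result.
linux_steps = {
    "RHEL ixgbe 1.3.30-k2, kernel 2.6.18": None,
    "Intel ixgbe 1.3.56.11-2-NAPI, kernel 2.6.18": 232_000,
    "Intel ixgbe 1.3.56.11-2-NAPI, kernel 2.6.29": 281_000,
}

# Dicts keep insertion order, so the completed runs stay in chronological order.
completed = [ops for ops in linux_steps.values() if ops is not None]
gain_pct = (completed[-1] / completed[0] - 1) * 100
print(f"kernel 2.6.29 vs 2.6.18: +{gain_pct:.1f}%")
```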
See Also
- Shanti's Blog: http://blogs.sun.com/shanti/entry/opensolaris_beats_linux_on_memcached
- memcached website
- Apache Olio web2.0 workload
- Faban benchmark framework
Disclosure Statement
Sun Fire X2270 server with OpenSolaris 352K ops/sec. Sun Fire X2270 server with RedHat Linux 281K ops/sec. For memcached information, visit http://www.danga.com/memcached. Results as of June 8, 2009.
