/var/adm/blog

Monday Dec 22, 2008

Memcached Java Client Performance on OpenSolaris

Over the last month I have spent some time looking at how the two main Java-based Memcached clients (Whalin and Spy) perform on OpenSolaris. The goal of this exercise was to generate meaningful data that we could use to understand how the two clients behave when compared with each other.


I should point out that no attempt was made to optimize either client or the environment in which the clients were run. The idea was to simulate, as much as possible, an 'out-of-the-box' experience. So the data below should not be taken as the optimal performance for either client, as I am sure there are ways to squeeze more performance out of both. I also want to state that the results of this comparison should not be considered a repudiation or endorsement of either client.


So, with all these caveats out of the way, let's get to it ....


THE EXECUTION ENVIRONMENT


The environment used to measure performance consisted of four dual-core 2.2 GHz Sun Fire X2200 M2 servers, each with 4 GB of memory, running OpenSolaris build 90 (snv_90). All of the X2200s were on the same subnet and each had GigE NICs. Three of the X2200s each hosted a Memcached server instance running version 1.2.5 with 1 GB of memory allocated. The fourth X2200 hosted the client under test.



THE BENCHMARK


I used the Faban benchmarking framework to run these benchmarks. The really nice thing about this framework is that it allows you to focus on creating the workload logic, while it takes care of data collection and process/service management.


The workload pattern consisted of the following steps:

  1. Start the Memcached 1.2.5 server on the three server hosts.
  2. Preload 1,000,000 objects into the cache. Object sizes varied from 768 bytes to 1,280 bytes, with an average size of 1,024 bytes. The entries were distributed evenly, so each server node held approximately 367 MB of cache data.
  3. Start the client load. The client load came from a single Java VM (1.6.0_6 in 64-bit mode) running on a single host. The number of client threads varied from 1 to 50.
  4. Apply load for 330 seconds, with a ratio of 90% get operations to 10% set operations. Gets were performed in bulk, with each get requesting 100 cache entries.
  5. Shut down the servers and collect the data.
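The workload parameters above can be sketched as follows. This is a minimal illustration, not the Faban driver that was actually used: the uniform size distribution (which yields the stated 1,024-byte mean), the hypothetical "key:<n>" key format and the class name are all assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of the benchmark's workload parameters; not the actual Faban driver.
final class WorkloadSketch {
    static final int TOTAL_OBJECTS = 1_000_000; // objects preloaded in step 2
    static final int MIN_SIZE = 768;            // smallest value, bytes
    static final int MAX_SIZE = 1_280;          // largest value, bytes
    static final int BULK_SIZE = 100;           // keys requested per bulk get
    static final double GET_RATIO = 0.90;       // 90% gets, 10% sets
    static final int SERVERS = 3;

    // Assumption: sizes are uniform over [768, 1280], giving a 1,024-byte mean.
    static int nextValueSize(Random rnd) {
        return MIN_SIZE + rnd.nextInt(MAX_SIZE - MIN_SIZE + 1);
    }

    // True means "issue a bulk get", false means "issue a set".
    static boolean nextIsGet(Random rnd) {
        return rnd.nextDouble() < GET_RATIO;
    }

    // Picks BULK_SIZE distinct keys using the hypothetical "key:<n>" format.
    static List<String> nextBulkKeys(Random rnd) {
        Set<String> keys = new LinkedHashSet<>();
        while (keys.size() < BULK_SIZE) {
            keys.add("key:" + rnd.nextInt(TOTAL_OBJECTS));
        }
        return new ArrayList<>(keys);
    }

    // Raw value bytes per server for an even spread (keys and item overhead ignored).
    static long payloadBytesPerServer() {
        return (long) TOTAL_OBJECTS * 1_024L / SERVERS;
    }
}
```

Under these assumptions, `payloadBytesPerServer()` gives 341,333,333 bytes of raw values per node (about 341 MB). The roughly 367 MB reported above is higher, presumably because keys and per-item overhead also occupy memory; the sketch does not model those.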

It should be noted that the Spy client supports non-blocking I/O while Whalin does not. To provide a more "apples-to-apples" comparison, the Spy client was used in a blocking manner. The Spy client was also configured to use the WhalinTranscoder.
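The post does not show the driver code, so here is a minimal sketch of what "using a non-blocking client in a blocking manner" means, assuming a client whose operations return `java.util.concurrent.Future` objects (as Spy does for its asynchronous calls). The `AsyncCache` interface, the `InMemoryAsyncCache` stand-in and the `BlockingCache` wrapper are hypothetical names, not part of either client.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical future-returning cache interface; not Spy's actual API.
interface AsyncCache {
    Future<Boolean> set(String key, Object value);
    Future<Map<String, Object>> getBulk(Collection<String> keys);
}

// In-memory stand-in so the sketch runs without a Memcached server.
final class InMemoryAsyncCache implements AsyncCache {
    private final Map<String, Object> store = new ConcurrentHashMap<>();

    @Override
    public Future<Boolean> set(String key, Object value) {
        return CompletableFuture.supplyAsync(() -> {
            store.put(key, value);
            return Boolean.TRUE;
        });
    }

    @Override
    public Future<Map<String, Object>> getBulk(Collection<String> keys) {
        return CompletableFuture.supplyAsync(() -> {
            Map<String, Object> hits = new HashMap<>();
            for (String k : keys) {
                Object v = store.get(k);
                if (v != null) {
                    hits.put(k, v);
                }
            }
            return hits;
        });
    }
}

// Blocking facade: every call waits on its future before returning, so each
// worker thread has at most one request outstanding at a time.
final class BlockingCache {
    private final AsyncCache async;
    private final long timeoutMillis;

    BlockingCache(AsyncCache async, long timeoutMillis) {
        this.async = async;
        this.timeoutMillis = timeoutMillis;
    }

    boolean set(String key, Object value)
            throws InterruptedException, ExecutionException, TimeoutException {
        return async.set(key, value).get(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    Map<String, Object> getBulk(Collection<String> keys)
            throws InterruptedException, ExecutionException, TimeoutException {
        return async.getBulk(keys).get(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

The design point is that waiting on each future immediately throws away the pipelining a non-blocking client could offer, which is what puts it on the same footing as a blocking client like Whalin.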


The following data was collected while executing this benchmark:

  • Operations-per-second (Ops/Sec): Total number of operations completed divided by the number of seconds the benchmark ran.
  • meanSetLatency (setLatency): Arithmetic mean, in milliseconds, of the time individual set operations took to complete.
  • meanBulkGetLatency (getLatency): Arithmetic mean, in milliseconds, of the time individual getBulk operations took to complete.
  • CPU utilization (cpu % busy): Percentage of time the CPU was busy, as measured by the vmstat utility.
  • NIC Saturation: Network card saturation, as captured by the nicstat utility.
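The two latency metrics and Ops/Sec can be computed from per-operation timings roughly as follows. This is a minimal sketch under the assumption that each operation is timed individually; `OpStats` and `BenchmarkMetrics` are hypothetical names, not Faban's own statistics classes.

```java
import java.util.function.Supplier;

// Hypothetical per-operation accumulator; not Faban's own statistics code.
final class OpStats {
    private long count;
    private long totalNanos;

    // Records one completed operation's elapsed time.
    synchronized void record(long elapsedNanos) {
        count++;
        totalNanos += elapsedNanos;
    }

    synchronized long count() {
        return count;
    }

    // Arithmetic mean latency in milliseconds (0 when nothing was recorded).
    synchronized double meanMillis() {
        return count == 0 ? 0.0 : (totalNanos / (double) count) / 1_000_000.0;
    }
}

final class BenchmarkMetrics {
    // Ops/Sec: every completed operation (sets and bulk gets) over the run length.
    static double opsPerSecond(long completedOps, double runSeconds) {
        return completedOps / runSeconds;
    }

    // Runs one operation and records its wall-clock latency, even if it throws.
    static <T> T timed(OpStats stats, Supplier<T> op) {
        long start = System.nanoTime();
        try {
            return op.get();
        } finally {
            stats.record(System.nanoTime() - start);
        }
    }
}
```

A driver would keep one `OpStats` for sets (setLatency) and one for bulk gets (getLatency), and divide the sum of their counts by the 330-second run length to get Ops/Sec.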

 

THE RESULTS

Whalin 2.0.1 versus Spy 2.2 (latencies in milliseconds):

        | Whalin 2.0.1                                               | Spy 2.2
Threads | Ops/Sec  setLatency  getLatency  cpu % busy  NIC Saturation | Ops/Sec  setLatency  getLatency  cpu % busy  NIC Saturation
      1 |     272         0.5           4          29              20 |     581         0.4           2          34              45
     10 |     548           5          19          92              45 |     764           7          13          48              60
     20 |     544           7          40          92              45 |     759          13          28          48              60
     30 |     538           8          61          92              45 |     763          18          42          49              60
     40 |     534           9          83          92              45 |     702          25          61          45              58
     50 |     541          11         102          92              45 |     746          28          72          48              60

FINDINGS/OBSERVATIONS/NEXT STEPS

The data above indicates that the Spy client achieves higher throughput than Whalin while using less CPU, and that Spy drives the NIC closer to saturation than Whalin does. For example, at 10 threads Spy completed 764 ops/sec at 48% CPU busy, versus 548 ops/sec at 92% CPU busy for Whalin. With this in mind, it is worth noting again that no attempt was made to optimize either client, and it is entirely conceivable that each client could be configured to perform better.
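One way to put the throughput and CPU columns together is throughput per percent of CPU busy. The sketch below does that arithmetic using only the numbers in the results table; the class name is hypothetical, and since cpu % busy is a whole-host vmstat figure, the ratio is a rough indicator rather than a per-client cost.

```java
// Throughput per unit of CPU, using only numbers from the results table.
final class Efficiency {
    static final int[] THREADS    = {1, 10, 20, 30, 40, 50};
    static final int[] WHALIN_OPS = {272, 548, 544, 538, 534, 541};
    static final int[] WHALIN_CPU = {29, 92, 92, 92, 92, 92};
    static final int[] SPY_OPS    = {581, 764, 759, 763, 702, 746};
    static final int[] SPY_CPU    = {34, 48, 48, 49, 45, 48};

    // Ops/Sec divided by cpu % busy (whole-host figure from vmstat).
    static double opsPerCpuPercent(int opsPerSec, int cpuBusyPercent) {
        return opsPerSec / (double) cpuBusyPercent;
    }

    public static void main(String[] args) {
        for (int i = 0; i < THREADS.length; i++) {
            System.out.printf("%2d threads: Whalin %.1f, Spy %.1f ops/sec per %% busy%n",
                    THREADS[i],
                    opsPerCpuPercent(WHALIN_OPS[i], WHALIN_CPU[i]),
                    opsPerCpuPercent(SPY_OPS[i], SPY_CPU[i]));
        }
    }
}
```

At 10 threads this works out to roughly 6 ops/sec per percent busy for Whalin against roughly 16 for Spy, and Spy's ratio is higher at every thread count in the table.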

Both clients appear to reach a plateau in throughput and resource utilization somewhere between 1 and 10 threads. Further benchmarking (not included in the data above) has shown that the plateau occurs between 1 and 5 threads.
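Little's law offers one way to read the plateau. Assuming a closed loop in which each thread issues its next request as soon as the previous one returns (no think time), and counting each getBulk as one operation, throughput is X = N / R, where N is the thread count and R is the mean latency weighted 90/10 between gets and sets. Once X stops growing, extra threads can only raise R. The class name is hypothetical and the no-think-time assumption is not stated in the post.

```java
// Little's law for a closed loop with no think time: N = X * R, hence X = N / R.
final class LittlesLaw {
    // Mean latency (ms) of one operation under a get/set mix.
    static double mixedLatencyMs(double getFraction, double getMs, double setMs) {
        return getFraction * getMs + (1.0 - getFraction) * setMs;
    }

    // Predicted Ops/Sec when `threads` callers each issue requests back to back.
    static double predictedOpsPerSec(int threads, double latencyMs) {
        return threads / (latencyMs / 1_000.0);
    }

    public static void main(String[] args) {
        // Whalin 2.0.1 rows from the results table.
        int[] threads = {1, 10, 20, 30, 40, 50};
        double[] getMs = {4, 19, 40, 61, 83, 102};
        double[] setMs = {0.5, 5, 7, 8, 9, 11};
        int[] measured = {272, 548, 544, 538, 534, 541};
        for (int i = 0; i < threads.length; i++) {
            double r = mixedLatencyMs(0.90, getMs[i], setMs[i]);
            System.out.printf("%2d threads: predicted %.0f ops/sec, measured %d%n",
                    threads[i], predictedOpsPerSec(threads[i], r), measured[i]);
        }
    }
}
```

For Whalin the predictions land within about 4 percent of the measured Ops/Sec at every thread count, and within about 7 percent for Spy, which is consistent with the latencies simply growing in step with the thread count once throughput has flattened.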

The lower setLatency averages achieved by the Whalin client at 10 or more threads suggest that in a more set-intensive environment, Whalin may outperform Spy.

Whalin's high CPU utilization (92% busy from 10 threads upward) is fertile ground for investigation. Can it be addressed through a change in the runtime environment or configuration?

The benchmark used small cache entries (768 to 1,280 bytes) that fell below both clients' compression thresholds. What will the results look like when the cache entries exceed the compression threshold?
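For the compression question, the sketch below shows the general idea of a size threshold: values larger than the threshold are GZIP-compressed before being stored, and smaller ones are sent unchanged. The class name and the threshold passed to it are hypothetical and are not either client's defaults; clients typically also set a flag on the stored item so the reader knows to decompress, which this sketch omits.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical threshold-based compression; illustrates the idea only.
final class ThresholdCompressor {
    private final int thresholdBytes;

    ThresholdCompressor(int thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    boolean shouldCompress(byte[] value) {
        return value.length > thresholdBytes;
    }

    // Returns GZIP-compressed bytes above the threshold, otherwise the same array.
    byte[] encode(byte[] value) {
        if (!shouldCompress(value)) {
            return value;
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer before we read the buffer.
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

Entries in this benchmark never cross such a threshold, so a follow-up run would add compression CPU cost on the client in exchange for fewer bytes on the wire.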

The benchmarking was performed using the ASCII protocol. How will things change if we run against a 1.3.x server with the binary protocol enabled?
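To make the ASCII-versus-binary question concrete, the sketch below encodes the same multi-key get both ways. The ASCII form is the text protocol's single `get <key>*` line; the binary form uses one common batching pattern, quiet GETKQ requests (opcode 0x0D) terminated by a NOOP (0x0A), each behind the 24-byte request header. Whether a given client batches binary gets this way is an assumption, and the class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical encoder contrasting ASCII and binary multi-get requests.
final class MultiGetEncoding {
    static final byte MAGIC_REQUEST = (byte) 0x80;
    static final byte OP_GETKQ = 0x0D;
    static final byte OP_NOOP = 0x0A;
    static final int HEADER_BYTES = 24;

    // ASCII protocol: one text line, "get <key1> <key2> ...\r\n".
    static byte[] asciiMultiGet(List<String> keys) {
        return ("get " + String.join(" ", keys) + "\r\n").getBytes(StandardCharsets.US_ASCII);
    }

    // Binary protocol: one quiet GETKQ per key, then a NOOP to mark the end of the batch.
    static byte[] binaryMultiGet(List<String> keys) {
        int size = HEADER_BYTES; // the trailing NOOP
        for (String k : keys) {
            size += HEADER_BYTES + k.getBytes(StandardCharsets.US_ASCII).length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size); // big-endian by default
        int opaque = 0;
        for (String k : keys) {
            byte[] key = k.getBytes(StandardCharsets.US_ASCII);
            writeHeader(buf, OP_GETKQ, key.length, opaque++);
            buf.put(key);
        }
        writeHeader(buf, OP_NOOP, 0, opaque);
        return buf.array();
    }

    private static void writeHeader(ByteBuffer buf, byte opcode, int keyLength, int opaque) {
        buf.put(MAGIC_REQUEST);
        buf.put(opcode);
        buf.putShort((short) keyLength);
        buf.put((byte) 0);       // extras length
        buf.put((byte) 0);       // data type
        buf.putShort((short) 0); // reserved
        buf.putInt(keyLength);   // total body length (key only, no extras or value)
        buf.putInt(opaque);      // echoed back in the matching response
        buf.putLong(0L);         // CAS
    }
}
```

For short keys the binary request is larger on the wire (24 header bytes per key versus one separator byte in ASCII), so any gain would presumably have to come from cheaper parsing; whether that matters next to the roughly 100 KB of values returned per bulk get is what a 1.3.x run would need to show.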

