/var/adm/blog
@deprecated
So I am "deprecating" this blog, which, I guess, is my not-so-clever way of saying I won't be updating this blog anymore. Instead, I will be setting up a new blog at http://www.ricoconbico.com/blog in the very near future, and all of my future updates will be made there... adios :-)
Posted at 10:54PM Nov 04, 2009 by Eric Lambert in Drizzle | Comments[0]
The Gearman Replicator For Drizzle (Episode IV: A New Hope)
THE GEARMAN REPLICATOR FOR DRIZZLE
EPISODE IV: A NEW HOPE
Recently, Jay Pipes has been blogging about the new Drizzle replication system. Jay has done a great job of describing how the system works and why it behaves the way it does. While Jay has been doing a bang-up job putting together the internals of the Drizzle replication system, I’ve been working on the “Gearman Replicator for Drizzle”, which, as the name suggests, is a Gearman-based Drizzle replicator. It dovetails very nicely with Jay’s work: it is not only a proof of concept for the replication system and the principles it is based on, but also an example of how Drizzle can leverage the tools and environment around it, in this case Gearman.
Describing the entire “Gearman Replicator for Drizzle” would probably make for a lengthy blog post that I suspect many people would not see all the way through, so I’ve decided to follow in the footsteps of Mr. George Lucas and make this a trilogy (and yes, I am aware that the Star Wars saga eventually had six episodes; don't tempt me, I may go that far too):
- The first episode (A New Hope), which you are reading now, will provide an overview of the system, the components that make it up, and how those components interact with each other.
- The second episode (The Applier Strikes Back) will focus mainly on the behavior in the master database and go over the role the Gearman Replication Applier Drizzle plugin plays.
- The third episode (Return of the Gearman Job Result) will look at the “slave side” behavior and focus on describing how the Java based Gearman Replicator receives and applies transactions to the slave database.
WHAT IS GEARMAN
As may be obvious by now, the Gearman framework plays a central role in the “Gearman Replicator for Drizzle”. For those of you who don’t know, Gearman is essentially a distributed, scalable job-scheduling framework that allows any number of clients to submit jobs to any number of workers, where a worker is a process capable of executing one or more particular kinds of jobs. Architecturally, clients hand their jobs to one or more Gearman Job Servers, which pass each job along to a worker that has registered the requested function.
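To make the client / job server / worker relationship concrete, here is a deliberately tiny, in-process Java model of the routing idea. It is not Gearman's API: the class ToyJobServer and its methods are hypothetical, and a real job server also queues jobs, talks over the network, and spreads work across many workers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical, in-process model of Gearman's routing idea: workers register
// named functions, clients submit jobs by function name, and the "job server"
// matches each job to a worker that can run it. No networking, no queuing.
class ToyJobServer {
    private final Map<String, Function<byte[], byte[]>> registry = new ConcurrentHashMap<>();

    // A worker announces that it can execute functionName.
    void register(String functionName, Function<byte[], byte[]> worker) {
        registry.put(functionName, worker);
    }

    // A client submits a job; the server routes it to the registered worker
    // and hands back that worker's result bytes.
    byte[] submit(String functionName, byte[] payload) {
        Function<byte[], byte[]> worker = registry.get(functionName);
        if (worker == null) {
            throw new IllegalStateException("no worker registered for " + functionName);
        }
        return worker.apply(payload);
    }
}
```

The client never learns which worker ran its job; it only knows the function name, which is what lets clients and workers be written independently.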
THE GEARMAN REPLICATOR OVERVIEW
From 10,000 feet (or 3,048 meters for the rest of the world), the Gearman Replicator for Drizzle system looks a little like this.
As you can see, the system consists of four major pieces:
- The Drizzle “Master” Database: This is the database that is being replicated. As transactions are applied to the “Master”, they are placed into the replication stream and eventually applied to the “Slave” database.
- The Gearman Job Server: The Gearman Job Server(s) act as a traffic cop or matchmaker: they accept job requests from Gearman Clients (in this case the Gearman Replication Applier), match each request with an appropriate worker (in this case, a worker that has the Applier Function), and then submit the request to that worker. In our scenario, the job request is a wrapper around a Transaction that has been placed into the replication stream by the Master database. Note that the diagram above shows two job servers purely for demonstration purposes; the number of job servers is up to the user and can range from one to many.
- The Gearman Worker: The Gearman Worker, as its name suggests, is a process that executes requests on behalf of a Gearman Client. The worker registers the functions it can execute with the Gearman Job Server(s) and then executes any Gearman Job Requests passed on to it by a Job Server. In the case of the Gearman Replicator for Drizzle, the worker registers an applier function that can take a Drizzle Transaction and apply it to a target database.
- The Drizzle “Slave” Database: This database is the target database in our replication workflow. Transactions that are applied to the “Master” database will be applied to the “Slave” database as well. For the time being, the Slave is also a Drizzle database, but in theory with very little change it should be possible to use non-Drizzle databases (MySQL, PostgreSQL, etc.) as a “Slave” database.
While the diagram above does provide a good overview of the pieces of the system and how data and messages flow among them, it does leave open a few questions, including:
1) What types of messages are being passed around the system?
2) What does it mean that the Gearman Replication Applier and the Applier Function are encased in an external entity (the Master Database and the Gearman Worker, respectively)?
MESSAGES IN GEARMAN REPLICATOR FOR DRIZZLE
Regarding the types of messages passed between the pieces of the system, the diagram does provide a hint about what is going on. The arrows between the entities represent messages going to and from those entities. You may have noticed that some of the arrows are light blue while others are red, and one is even grey. The color coding here has meaning.
The light blue arrows represent a Gearman Job Request being sent. A full description of a Gearman Job Request can be found in the Gearman documentation, but suffice it to say that the request contains some metadata about the job (a handle, the name of the function to be executed, etc.) and a payload to be used by the Gearman Function. In this case, the payload consists of a Google Protocol Buffers message that contains the Transaction to be replicated into the slave database (for more details on the Transaction message, see Jay’s blog). As it turns out, encapsulating the Transaction in a Protocol Buffers message pays big dividends here. One of the nice advantages of Gearman is that Workers and Functions can be written in a variety of languages (C, Java, Python, Perl, etc.). Since Google Protocol Buffers provides bindings for several different languages, having the Transaction as a Protocol Buffers message means that worker implementations need not invent their own way of parsing the Transaction message; that functionality is provided by the Protocol Buffers library. Letting the Gearman Workers/Functions use the library to deconstruct the Transaction message results in cleaner, less buggy workers/functions and also makes their implementations more robust against version changes in the Transaction message.
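For a feel of what travels on a light blue arrow, the sketch below frames a job submission the way I understand the Gearman binary protocol: a 12-byte header (the magic bytes \0REQ, a big-endian packet type, and a big-endian data length) followed by NUL-separated arguments. SUBMIT_JOB is packet type 7 and carries the function name, a client-chosen unique ID, and the opaque payload; the job handle is assigned by the job server and only appears in later packets. The function name and payload used in the test are hypothetical stand-ins for the replicator's real function name and serialized Transaction message.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class GearmanPackets {
    private static final byte[] REQ_MAGIC = {0, 'R', 'E', 'Q'};
    static final int SUBMIT_JOB = 7;

    private GearmanPackets() {}

    // Frames a SUBMIT_JOB request: 12-byte header (magic, type, data size; the
    // integers are big-endian) followed by NUL-separated arguments: function
    // name, unique ID, then the opaque payload (e.g. a serialized Transaction).
    static byte[] submitJob(String functionName, String uniqueId, byte[] payload) {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        byte[] fn = functionName.getBytes(StandardCharsets.UTF_8);
        byte[] uid = uniqueId.getBytes(StandardCharsets.UTF_8);
        data.write(fn, 0, fn.length);
        data.write(0);
        data.write(uid, 0, uid.length);
        data.write(0);
        data.write(payload, 0, payload.length);  // last argument is not NUL-terminated
        byte[] body = data.toByteArray();

        ByteBuffer packet = ByteBuffer.allocate(12 + body.length);  // big-endian by default
        packet.put(REQ_MAGIC).putInt(SUBMIT_JOB).putInt(body.length).put(body);
        return packet.array();
    }
}
```

Because the payload is just bytes, the framing does not care what it contains; interpreting it is left to the worker and its Protocol Buffers bindings.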
The red arrows represent a Gearman Job Result message. Again, the exact structure of a Gearman Job Result message can be found in the Gearman documentation, but it essentially consists of a job status, result data, and any error messages generated by the function as it executed the job.
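On the way back, a worker reports the outcome with a result packet. The sketch below decodes the two simplest cases as I understand the Gearman protocol: WORK_COMPLETE (type 13, carrying the job handle and the result data) and WORK_FAIL (type 14, carrying only the handle). Other packet types, including those that carry error details, are deliberately rejected; JobResult is a hypothetical class, not part of any Gearman library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical result holder: the status, the job handle, and any result data.
final class JobResult {
    enum Status { COMPLETE, FAIL }

    static final int WORK_COMPLETE = 13;
    static final int WORK_FAIL = 14;

    final Status status;
    final String handle;
    final byte[] data;

    JobResult(Status status, String handle, byte[] data) {
        this.status = status;
        this.handle = handle;
        this.data = data;
    }

    // Decodes a response packet: "\0RES" magic, big-endian type and size, then arguments.
    static JobResult parse(byte[] packet) {
        ByteBuffer buf = ByteBuffer.wrap(packet);
        byte[] magic = new byte[4];
        buf.get(magic);
        if (!Arrays.equals(magic, new byte[] {0, 'R', 'E', 'S'})) {
            throw new IllegalArgumentException("not a response packet");
        }
        int type = buf.getInt();
        byte[] body = new byte[buf.getInt()];
        buf.get(body);
        switch (type) {
            case WORK_COMPLETE: {
                int nul = indexOf(body, (byte) 0);  // handle, NUL, result data
                String handle = new String(body, 0, nul, StandardCharsets.UTF_8);
                return new JobResult(Status.COMPLETE, handle, Arrays.copyOfRange(body, nul + 1, body.length));
            }
            case WORK_FAIL:  // the only argument is the job handle
                return new JobResult(Status.FAIL, new String(body, StandardCharsets.UTF_8), new byte[0]);
            default:
                throw new IllegalArgumentException("unhandled packet type " + type);
        }
    }

    private static int indexOf(byte[] bytes, byte b) {
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == b) {
                return i;
            }
        }
        throw new IllegalArgumentException("missing NUL separator");
    }
}
```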
The grey arrow represents a data transformation message used to bring the Slave database ‘in sync’ with the Master database. The exact format of that message depends on the implementation of the Applier Function, but in the case of the Gearman Replicator for Drizzle, this is a JDBC call containing the statements that make up the transaction being replicated.
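A minimal sketch of the grey arrow, assuming the Applier Function has already turned the replicated Transaction into a list of SQL statement strings (that conversion is not shown, and JdbcApplier is a hypothetical name). It uses only standard java.sql calls and applies the statements as one slave-side transaction, so a failure partway through rolls everything back instead of leaving the slave half-updated.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

final class JdbcApplier {
    private JdbcApplier() {}

    // Applies the statements of one replicated transaction as a single
    // slave-side transaction: either all of them commit or none do.
    static void apply(Connection slave, List<String> statements) throws SQLException {
        boolean previousAutoCommit = slave.getAutoCommit();
        slave.setAutoCommit(false);
        try (Statement stmt = slave.createStatement()) {
            for (String sql : statements) {
                stmt.execute(sql);
            }
            slave.commit();
        } catch (SQLException e) {
            slave.rollback();  // undo any statements that already ran
            throw e;           // let the caller report a failed job result
        } finally {
            slave.setAutoCommit(previousAutoCommit);
        }
    }
}
```

Rethrowing the SQLException leaves it to the surrounding Applier Function to turn the failure into the appropriate Gearman Job Result.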
THE GEARMAN REPLICATION APPLIER AND APPLIER FUNCTION
One of the main components of the system is the Gearman Replication Applier. A detailed description of the Gearman Replication Applier will follow in a future post, but it is worth noting that in the diagram above this entity is completely contained within the Master database. This is intentional: the Gearman Replication Applier is a Drizzle applier plugin and as such actually runs within the Drizzle process. The job of the Gearman Replication Applier is to “listen” to the replication stream of the Master database, wrap each Transaction into a Gearman Job Request, and pass it on to the Gearman Job Server.
Like the Gearman Replication Applier, the Applier Function also exists within the scope of another entity: it resides within a Gearman Worker process. While the Gearman Worker acts as a conduit between the Gearman Job Server and the Applier Function, it is the Applier Function that performs all the real work. It is the responsibility of the Applier Function to validate the Transaction it has received, apply the Transaction to the slave database (if you look carefully at the diagram, you’ll note that the grey arrow representing the application of the transaction to the slave database originates from the Applier Function), and then generate the appropriate Gearman Job Result.
CONCLUSION
So far we have gone over the major pieces of the Gearman Replicator for Drizzle system and described how these pieces communicate with each other. In upcoming posts I will go into more detail about the individual pieces and even show some code. For the time being, if you want to see some code, look in the following places:
- The Gearman Replication Applier will eventually make its way into the Drizzle trunk, but for the time being it can be found at https://code.launchpad.net/~elambert/drizzle/gearman_replication_applier.
- The GearmanReplicator project is a Java implementation of the Applier Function piece. It can be found at https://code.launchpad.net/~elambert/drizzle/GearmanReplicator. For the time being, this branch is under the Drizzle umbrella; I suspect at some point we will move it into a standalone project.
Also, if you have any comments or questions about this, feel free to make a comment below or drop me a line (eric.d.lambert@gmail.com).
Posted at 05:33PM Oct 30, 2009 by Eric Lambert in Drizzle | Comments[0]
gearman-java 0.03 released
gearman-java 0.03 was released today. You can grab the bits at https://launchpad.net/gearman-java/trunk/0.03 .
The release essentially consists of the following:
- Cleanup findbugs, pmd, and checkstyle warnings as well as misc. build improvements.
- Changed the signature of the addServer method in GearmanClient and GearmanWorker (as well as their implementations) to return a boolean indicating the success or failure of the attempt. Attempts to add a server that cannot be contacted will now return false instead of throwing a runtime exception.
- ClientImpl driveRequestTillState now drives IO on all sessions that are selected for IO, instead of driving IO only for the session to which the request belongs.
- Allow gearman functions to control the name that will be used to register the function with the server when using the default function factory (the factory was ignoring the name of the function and always registering the function under its class name).
- Fixed bug #417004 (ReverseClient example shows improper use of client).
- Fixed bug #417214 (Worker performs slow on linux client). The connection between the worker and the job server did not have TCP_NODELAY set, causing performance problems on Linux. Changed the connection settings to mirror the settings in libgearmand.
- Fixed bug #417208 (AbstractGearmanFunction does not correctly handle failing or misbehaving functions). The fix changed the signature of the GearmanFunction interface: GearmanFunction now extends Callable<GearmanJobResult>, so clients of this interface will need to be changed to reflect this.
- Fixed bug #418927. We can now send and receive payloads larger than the default buffer sizes.
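The sketch below illustrates two of these changes together: an addServer() that reports an unreachable server by returning false rather than throwing, and connections that have TCP_NODELAY enabled. JobServerConnections is a hypothetical class written for illustration, not gearman-java's actual GearmanClient or GearmanWorker code; only the java.net calls are real.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Hypothetical connection registry (not gearman-java's classes) showing two of
// the 0.03 changes: addServer() reports failure with a boolean instead of
// throwing, and each accepted connection has TCP_NODELAY enabled.
final class JobServerConnections {
    private final List<Socket> connections = new ArrayList<>();

    // Returns true if the server was reached and added, false otherwise.
    boolean addServer(String host, int port, int timeoutMillis) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setTcpNoDelay(true);  // disable Nagle: send small packets immediately
        } catch (IOException e) {
            closeQuietly(socket);
            return false;                // unreachable server: report it, don't throw
        }
        connections.add(socket);
        return true;
    }

    int size() {
        return connections.size();
    }

    Socket get(int index) {
        return connections.get(index);
    }

    private static void closeQuietly(Socket socket) {
        try {
            socket.close();
        } catch (IOException ignored) {
            // nothing useful to do if closing a failed socket also fails
        }
    }
}
```

TCP_NODELAY turns off Nagle's algorithm, which can hold back small writes while earlier data is still unacknowledged; since Gearman traffic is made up of many small packets, leaving Nagle's algorithm enabled plausibly explains the slowdown described in bug #417214.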
Posted at 11:21AM Sep 16, 2009 by Eric Lambert in Gearman | Comments[0]
Dell Dvd Store for Drizzle
A while back, the good folks at Dell created an e-commerce benchmarking application called the "DVD Store" that emulates, of all things, a web-based DVD store. The test application contains a backend database component, a web application layer, and a series of driver programs to drive load against the application. DS2, as it is known, is built in such a way that the database backend can be implemented with a variety of databases, including Oracle, SQL Server, and MySQL. Last week, I spent some time porting the application to work with Drizzle.
You can find the fruits of my labor at http://launchpad.net/ds2drizzle/trunk/0.01/+download/ds2-drizzle-0.01.tar.gz. This port not only includes a Drizzle-based backend (which currently uses the default InnoDB storage engine) but also supports the web-based DS2 driver, which can be run against either the JSP- or PHP-based implementation of the DVD Store web application. This port does not include the ASP-based web application, nor has the direct (non-web-based) driver been ported to work with Drizzle.
The database schema used by this port looks very similar to the MySQL DS2 schema, with one major exception. The original MySQL schema contained FULLTEXT indices. Since the InnoDB storage engine does not support FULLTEXT indices, these indices have been removed from the schema. As a result, the queries that relied on the FULLTEXT indices needed to be reworked. For the time being, these queries have been changed from MATCH queries that took advantage of the FULLTEXT index to LIKE queries. This is obviously not an optimal solution and should be considered a hack (which I chose solely out of expedience). A better solution would have Drizzle work in conjunction with a full-text search engine such as Lucene or Sphinx.
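As a rough sketch of what such a LIKE rewrite involves on the Java side, the helper below turns a search term into a "contains" pattern, escaping LIKE's wildcard characters so they match literally (assuming the default backslash escape character). The query string, including the PRODUCTS table and TITLE column names, is illustrative only and not taken from the DS2 sources.

```java
final class LikePatterns {
    // Illustrative query shape only; table and column names are hypothetical.
    static final String TITLE_SEARCH_SQL =
        "SELECT PROD_ID, TITLE FROM PRODUCTS WHERE TITLE LIKE ? LIMIT 100";

    private LikePatterns() {}

    // Builds a "contains" pattern for SQL LIKE, escaping the wildcards % and _
    // (and the default escape character \) so user input matches literally.
    static String contains(String term) {
        StringBuilder sb = new StringBuilder("%");
        for (char c : term.toCharArray()) {
            if (c == '\\' || c == '%' || c == '_') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.append('%').toString();
    }
}
```

With a PreparedStatement, the pattern would be bound via setString(1, LikePatterns.contains(term)). A pattern with a leading % cannot use an ordinary B-tree index, so each search scans the table, which is part of why this counts as a hack.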
Posted at 08:53PM Sep 15, 2009 by Eric Lambert in Drizzle | Comments[0]
Drizzle in the Snow (how to build Drizzle on OS X 10.6 , aka Snow Leopard)
These days I do most of my development on my MacBook Pro, and for the most part it works just fine. In fact, things have been so smooth that I've lulled myself into a false sense of complacency that things will "just work". That is, until this morning, when I pulled down a fresh version of the Drizzle trunk and tried to build it. Not more than a few seconds after kicking off the build, I noticed the cursor blinking at me below an error indicating that my build had failed. At this point it dawned on me that I had installed Snow Leopard (OS X 10.6) on the machine over the weekend, and that this was most likely the culprit.
As it turned out, there were some issues building Drizzle on OS X 10.6, but nothing too difficult to overcome.
ISSUE #1: FDATASYNC
This issue manifests itself with the following build failure:
libtool: compile: /usr/bin/g++-4.2 -DHAVE_CONFIG_H -I. -I. -isystem ./gnulib -isystem ./gnulib -ggdb3 -I/Users/elambert/dev/drizzle/include -D_THREAD_SAFE -pipe -O3 -Werror -pedantic -Wall -Wextra -Wundef -Wshadow -fdiagnostics-show-option -fvisibility=hidden -Wformat -fno-strict-aliasing -Wno-strict-aliasing -Woverloaded-virtual -Wnon-virtual-dtor -Wctor-dtor-privacy -Wno-long-long -Wno-redundant-decls -std=gnu++98 -MT mysys/my_sync.lo -MD -MP -MF mysys/.deps/my_sync.Tpo -c mysys/my_sync.cc -fno-common -DPIC -o mysys/.libs/my_sync.o
mysys/my_sync.cc: In function ‘int my_sync(File, myf)’:
mysys/my_sync.cc:59: error: ‘fdatasync’ was not declared in this scope
make[2]: *** [mysys/my_sync.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
The problem here is that 'configure' is being fooled into thinking that the fdatasync() system call is available on the system, when in reality the build should be using fsync() instead. Unfortunately, the fix for this problem requires changes to the build system. Fortunately, those changes should already be in the Drizzle trunk by the time you read this, so if you are seeing this error, do a fresh pull from the trunk. If, for some reason, the changes have not made it to the trunk yet or pulling from the trunk is not an option for you, just apply the diff listed at the bottom of this post.
ISSUE #2: READLINE 'INCOMPATIBILITY'?
This issue manifests itself with the following build failure:
g++ -DHAVE_CONFIG_H -I. -I. -isystem ./gnulib -isystem ./gnulib -ggdb3 -I/Users/elambert/dev/drizzle/include -D_THREAD_SAFE -pipe -O3 -Werror -pedantic -Wall -Wextra -Wundef -Wshadow -fdiagnostics-show-option -fvisibility=hidden -Wformat -fno-strict-aliasing -Wno-strict-aliasing -Woverloaded-virtual -Wnon-virtual-dtor -Wctor-dtor-privacy -Wno-long-long -Wno-redundant-decls -std=gnu++98 -MT client/drizzle.o -MD -MP -MF $depbase.Tpo -c -o client/drizzle.o client/drizzle.cc &&\
mv -f $depbase.Tpo $depbase.Po
client/drizzle.cc:109: error: conflicting declaration ‘typedef int (rl_compentry_func_t)(const char*, int)’
/usr/include/readline/readline.h:44: error: ‘rl_compentry_func_t’ has a previous declaration as ‘typedef char* (rl_compentry_func_t)(const char*, int)’
client/drizzle.cc: In function ‘void initialize_readline(char*)’:
client/drizzle.cc:2348: error: invalid conversion from ‘char* (*)(const char*, int)’ to ‘int (*)(const char*, int)’
make[2]: *** [client/drizzle.o] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
The problem here appears to be that the readline implementation that ships with Snow Leopard (which I believe is just a wrapper around the editline library) is 'incompatible' with Drizzle. The readline.h file in /usr/include/readline defines some of the functions expected by Drizzle but not all of them, and, as the errors above show, it declares rl_compentry_func_t with a different return type than Drizzle does, making it difficult to use the header file. To work around this, you will need to build your own version of readline and compile Drizzle against it. The steps to do so are listed below:
1 - Download the readline 6.0 source tarball (other versions of readline may work, but I've only tested 6.0)
$ wget ftp://ftp.cwru.edu/pub/bash/readline-6.0.tar.gz
2 - Unarchive the tarball
$ gtar xfz readline-6.0.tar.gz
3 - Configure the readline build scripts. Note that I decided not to install the new version of readline in the default location, but instead in a location of my choosing, because I did not want to upset any dependencies other parts of my system may have on the current version.
$ cd readline-6.0
$ ./configure --prefix=/Users/elambert/dev/readline-6.0
4 - Build and install readline. This step is fairly straightforward.
$ make all && make install
5 - Configure Drizzle. If you are reading this blog, you should be fairly familiar with configuring Drizzle by now, but of note here is the fact that I specify where to find readline with the --with-lib-prefix option. If you installed readline into the default location, you do not need to include this flag.
$ cd <DRIZZLE_HOME>
$ ./config/autorun.sh
$ ./configure --with-lib-prefix=/Users/elambert/dev/readline-6.0 \
--with-libdrizzle-prefix=/Users/elambert/dev/drizzle \
--prefix=/Users/elambert/dev/drizzle
6 - Build Drizzle
$ make -j2
7 - Test it
$ cd test
$ ./test-run
And that's it... see, not so bad, was it?
DIFF FOR FDATASYNC FIX
=== modified file 'configure.ac'
--- configure.ac 2009-08-20 16:14:47 +0000
+++ configure.ac 2009-09-02 02:44:43 +0000
@@ -451,11 +451,27 @@
AC_MSG_ERROR("Drizzle requires fcntl.")
fi
+AC_CACHE_CHECK([working fdatasync],[ac_cv_func_fdatasync],[
+ AC_LANG_PUSH(C++)
+ AC_RUN_IFELSE([AC_LANG_PROGRAM([[
+#include <unistd.h>
+ ]],[[
+fdatasync(4);
+ ]])],
+ [ac_cv_func_fdatasync=yes],
+ [ac_cv_func_fdatasync=no])
+ AC_LANG_POP()
+])
+AS_IF([test "x${ac_cv_func_fdatasync}" = "xyes"],
+ [AC_DEFINE([HAVE_FDATASYNC],[1],[If the system has a working fdatasync])])
+
+
+
AC_CONFIG_LIBOBJ_DIR([gnulib])
AC_CHECK_FUNCS( \
cuserid fchmod \
- fdatasync fpresetsticky fpsetmask fsync \
+ fpresetsticky fpsetmask fsync \
getpassphrase getpwnam \
getpwuid getrlimit getrusage index initgroups isnan \
localtime_r log log2 gethrtime gmtime_r \
=== modified file 'm4/pandora_ensure_gcc_version.m4'
--- m4/pandora_ensure_gcc_version.m4 2009-07-08 07:09:13 +0000
+++ m4/pandora_ensure_gcc_version.m4 2009-09-01 21:21:13 +0000
@@ -8,12 +8,15 @@
AC_DEFUN([PANDORA_MAC_GCC42],
[AS_IF([test "$GCC" = "yes"],[
AS_IF([test "$host_vendor" = "apple" -a "x${ac_cv_env_CC_set}" = "x"],[
- AS_IF([test -f /usr/bin/gcc-4.2],
+ host_os_version=`echo ${host_os} | perl -ple 's/^\D+//g;s,\..*,,'`
+ AS_IF([test "$host_os_version" -lt 10],[
+ AS_IF([test -f /usr/bin/gcc-4.2],
[
CPP="/usr/bin/gcc-4.2 -E"
CC=/usr/bin/gcc-4.2
CXX=/usr/bin/g++-4.2
])
Posted at 09:08PM Sep 01, 2009 by Eric Lambert in Drizzle | Comments[1]
gearman-java 0.02 released
Today I am happy to announce the 0.02 release of the gearman-java library which contains the following changes:
- Significantly improved worker and client performance (running gearman blob_slap benchmark with 0.02 java worker sees ~100X increase over 0.01).
- Fixed bug #400466. (Client leaks memory when submitting attached jobs).
- Added build support for pmd, findbugs, checkstyle, and code coverage (emma).
The applicable binaries and source files can be retrieved from the gearman-java Launchpad portal.
Posted at 05:22PM Aug 06, 2009 by Eric Lambert in Gearman | Comments[0]
gearman-java 0.01 released
Today we released the 0.01 version of the gearman-java project ... woo hoo!
== WHAT IS THE GEARMAN-JAVA PROJECT ==
The purpose of the gearman-java project is to provide a pure Java Gearman implementation that can be easily used by Java applications to define and submit Gearman jobs, as well as to allow Java-based workers to interact with a Gearman system. The gearman-java project will provide a 'pure Java' implementation of the Gearman Client and Gearman Worker but does not include an implementation of the Gearman Job Server.
While the gearman-java library provides Java implementations of the Gearman Worker and Client abstractions, there is no requirement that a Java Client use a Java Worker (or vice versa). The Workers and Clients make no assumptions about their counterparts, and it is legitimate for a Java client to execute a C-based Worker or for a Perl client to execute a Java worker. One of the advantages of the Gearman framework is that the implementation details are irrelevant.
As you can see from the 0.01 version number, this is the first release of the project, and as such, you can expect that there will be bugs and I would expect that the API will most likely go through some revisions, and there certainly will be a lot of improvements coming in the near future, so you probably should not hook it up immediately to your production system.
But we would really like to encourage people to play with it and to give us feedback on how this implementation does or does not meet your needs. If you like it, great, let us know what works for you. If you don't like it, well, believe it or not, that's great too, as long as you can tell us what you don't like about it. We don't get offended easily and will use your feedback to make the system better. If you find a bug, let us know, we'll fix it. Bonus points if you can provide us with a test case. Double bonus points if you can provide us with a test case and a fix.
== WHERE CAN I GET IT ==
You can download the jar/source/javadocs from the URL below:
https://launchpad.net/gearman-java/trunk/0.01
== REQUIREMENTS ==
- JDK5 or higher.
- A Gearman Job Server. The gearman-java Library has been tested using the 0.8 C based gearmand Gearman Job Server, but in theory should work with any server that understands the Gearman Protocol.
- If you plan on building the project, as opposed to just using the pre-built jar, you will need ant on your path. The build has been tested with ant 1.7 but probably works with earlier versions.
- To execute the regression tests in the source tree, JUnit 4.6 will need to be available on your system.
== KNOWN ISSUES==
- Performance. Performance is significantly slower than the C library. We will be addressing this soon.
- Memory leak in GearmanClientImpl. When submitting an attached job, if you do not call selectUpdateJobsEvents() method on a semi-regular basis, the client will leak memory. To work around this, call selectUpdateJobEvents() after a job is reported as done.
- Client and Worker will not attempt to re-establish connection to server if the connection drops.
== EXAMPLES ==
There are some examples that ship with the source tar ball, located in src/org/gearman/examples, which provide some real world examples on how to use the gearman-java library. Below are some quick snippets for those of you too lazy to go look.
Gearman Worker Implementation
The easiest way to create a worker is to create a class that extends org.gearman.worker.AbstractGearmanFunction and implement the executeFunction() method. Below is a class that simply reverses a String.
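The core of such a function is small. Below is a minimal, hypothetical sketch of the reversal logic: it takes the job's payload bytes and returns the reversed text as the result bytes. It intentionally does not extend org.gearman.worker.AbstractGearmanFunction, whose exact signatures are not shown in this post; ReverseFunctionSketch, its execute() method, and the "reverse" function name are illustrative only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical worker-side logic: treat the job payload as UTF-8 text and
// return the reversed text as the job's result bytes.
final class ReverseFunctionSketch {
    static final String FUNCTION_NAME = "reverse";  // illustrative registration name

    byte[] execute(byte[] payload) {
        String text = new String(payload, StandardCharsets.UTF_8);
        // StringBuilder.reverse keeps surrogate pairs intact
        String reversed = new StringBuilder(text).reverse().toString();
        return reversed.getBytes(StandardCharsets.UTF_8);
    }
}
```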
Gearman Worker Registration
The easiest way to instantiate a worker and have it registered with a Gearman Job Server is to use the example WorkerRunner, also found in org.gearman.example.
The example below will load the ReverseFunction for execution and then register it with a Gearman Job Server running on localhost using the default port. The example assumes that the class org.gearman.examples.ReverseFunction exists and is on the classpath.
java -cp <PATH_TO_GEARMAN_JAVA_LIBRARY_JAR>:${CLASSPATH} org.gearman.example.WorkerRunner org.gearman.examples.ReverseFunction
Gearman Job Submission
The snippet below demonstrates how to create a job, submit it for execution, and to then retrieve its results.
== HOW DO I CONTRIBUTE ==
The top 10 ways you can contribute are listed below in RDL (Reverse David Letterman) order.
1) Use it! This is an early release and we are itchin' at the bit to get some feedback on what works and what doesn't.
2) File a bug. We can't fix it if we don't know about it.
3) Submit a test case. A product is only as good as its test suite. Want a bug fixed? We'll get to it that much quicker if we don't have to figure out how to reproduce it.
4) Improve a test case. You see a bad test case or find one that does not adequately cover a feature, well, make it better.
5) Submit a feature request. You have an idea for how we can do things better, tell us.
6) Implement a feature. Take a look at the project blueprints, find something that looks fun to you, sign up for it and do it.
7) Help us document. If you see some documentation that is unclear, wrong, or missing, we hereby grant you the power to fix it.
8) Write a worker.
9) Blog about the gearman-java library. Blogging... it's your ticket to stardom.
10) Become a member. Subscribe to the Gearman Discussion list and actively participate.
== FEEDBACK ==
If you have any feedback that you want to give, you can use any of the following means to reach us.
- Mailing List: Gearman group on Google Groups
- Mailing List: gearman-java-discuss on Launchpad
- IRC Channel: #gearman on irc.freenode.net
== CONTRIBUTORS ==
The following individuals contributed to this release:
Posted at 04:31PM Jul 16, 2009 by Eric Lambert in Gearman | Comments[0]
Two wrongs do not make a right ... but two 'set's can make a 'delete'
I've spent some time the past couple of days looking through the memcached server code, and I noticed something interesting in how we process set commands when the server is configured not to evict items when low on memory (-M). For a split second, the behavior I was seeing in the code seemed wrong to me, but after giving it a bit of thought I realized that the behavior made perfect sense. Without getting into the details upfront, let me show you an example.
Let's start a memcached server. The behavior in question happens when the server is not able to fulfill a set request due to memory constraints, so to make this condition easier to reach, let's limit our cache size to 1MB.
smacky:memcached elambert$ ./memcached -m 1 -M
Now that we have our server up and running, I want to add some data. On my system, I have a file called 'afile', a 524,288-byte file that consists of nothing but the letter A. Let's store the contents of this file and associate it with the key 'key1'.
smacky:~ elambert$ ls -ld afile
-rw-r--r-- 1 elambert staff 524288 Mar 25 14:58 afile
smacky:~ elambert$ val=`cat ./afile`; printf "set key1 0 0 524288\r\n$val\r\n" | nc 127.0.0.1 11211
STORED
smacky:~ elambert$
Now, some time passes, and I want to insert the content of file 'bfile' (a 524,288 byte file filled with the letter B) into the cache, again under the key 'key1' (essentially replacing the old entry for 'key1').
smacky:~ elambert$ ls -ld bfile
-rw-r--r-- 1 elambert staff 524288 Mar 25 14:59 bfile
smacky:~ elambert$ val=`cat ./bfile`; printf "set key1 0 0 524288\r\n$val\r\n" | nc 127.0.0.1 11211
SERVER_ERROR out of memory storing object
smacky:~ elambert$
Well, it looks like we don't have enough room in the cache, and since we told the server not to evict entries, we get an out-of-memory error. So what does this mean for our 'key1' entry? Well, let's retrieve it and find out.
smacky:~ elambert$ printf "get key1\r\n" | nc 127.0.0.1 11211
END
Wow, that's interesting. The cache says it does not have an entry for 'key1', yet we were able to store the 'afile' contents under the key 'key1', and we told the server not to evict any items. Shouldn't the original entry still be in the cache? Well, actually, no. The server did exactly what it should have, which is delete the entry when the second set operation failed.
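The behavior can be captured in a small model. The sketch below is a simplified Java illustration, not memcached's actual C code: a fixed byte budget, no eviction, and a set that needs room for the new value while the old one is still stored. When such a set fails, the existing entry for that key is removed, just as 'key1' disappeared above. NoEvictCache and its sizes are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model (not memcached's code) of a no-evict cache with a fixed
// byte budget. A set that cannot fit fails AND removes any existing value for
// that key, so a later get misses instead of returning stale data.
final class NoEvictCache {
    private final Map<String, byte[]> items = new HashMap<>();
    private final long capacityBytes;
    private long usedBytes;

    NoEvictCache(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    // Space for the new value is needed while the old value is still stored.
    boolean set(String key, byte[] value) {
        if (usedBytes + value.length > capacityBytes) {
            byte[] stale = items.remove(key);  // failed set: drop the now-suspect entry
            if (stale != null) {
                usedBytes -= stale.length;
            }
            return false;                      // like "SERVER_ERROR out of memory storing object"
        }
        byte[] old = items.put(key, value);
        usedBytes += value.length;
        if (old != null) {
            usedBytes -= old.length;
        }
        return true;
    }

    // null models a cache miss (memcached answers a miss with just "END")
    byte[] get(String key) {
        return items.get(key);
    }
}
```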
For some of you this may be obvious, but I suspect others are finding this behavior a little counter-intuitive and are wondering why a failed set would result in a delete, especially when we tell the server not to evict items.
But to understand this behavior, we need to realize that memcached is... well... a cache. It is not designed to be the sole source of truth in your system. Memcached is just an 'optimization' that aids you in scaling your system, and it should always operate in tandem with a reliable store (might I suggest MySQL for this purpose :-) ). It is that store which acts as 'truth' for your system, and in a well-designed system, updates must first be applied to this backend store before being applied to the cache. So even though the second set operation failed, the mere fact that an attempt was made to modify the value tells the server that the entry for 'key1' is most likely no longer in sync with our backend (otherwise, why else would we attempt to change the value?). Armed with this knowledge, the worst thing the server could do is leave the entry for 'key1' as is. If it were to leave the original entry, then any subsequent get request would get a stale value. A better strategy is to delete the item. With the entry deleted, clients that request the key will incur a cache miss and simply go to the backend for the correct value.
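The read/write discipline described here is commonly called cache-aside. The sketch below is a minimal, single-threaded illustration using hypothetical names (CacheAside, with plain functions standing in for the backend store): writes go to the store before the cache, and a cache miss falls through to the store and repopulates the cache. It ignores concurrency and expiration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical cache-aside helper: the backing store is the source of truth
// and the cache only holds copies of its values.
final class CacheAside<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> loadFromStore;
    private final BiConsumer<K, V> writeToStore;

    CacheAside(Function<K, V> loadFromStore, BiConsumer<K, V> writeToStore) {
        this.loadFromStore = loadFromStore;
        this.writeToStore = writeToStore;
    }

    V get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        V fresh = loadFromStore.apply(key);  // cache miss: ask the source of truth
        if (fresh != null) {
            cache.put(key, fresh);
        }
        return fresh;
    }

    void put(K key, V value) {
        writeToStore.accept(key, value);  // 1. update the source of truth first
        cache.put(key, value);            // 2. then refresh the cached copy
    }

    // Stands in for the cache dropping an entry, e.g. after a failed set.
    void evict(K key) {
        cache.remove(key);
    }
}
```

If step 2 fails and the entry is dropped, as memcached does above, the next get simply misses and reloads the correct value from the store.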
Another point of confusion is the presence of the -M (don't evict) flag. Some people may interpret this flag to instruct the server to never evict an item. If you read the description of the flag, you'll see that is not the case. This flag only has meaning when the cache has exhausted its memory. If the server is out of space and we have specified the -M flag, the server will not evict an entry to make room, but it is still free to evict entries for many other reasons, including the example discussed above.
Posted at 04:32PM Mar 25, 2009 by Eric Lambert in Memcached | Comments[0]
basic UDP support in libmemcached
Yesterday I submitted a patch to libmemcached, the popular Memcached client library. The patch provides basic, though limited, support for using the User Datagram Protocol (UDP) when communicating with a Memcached server. For those of you interested, you can get the source and documentation at http://hg.tangent.org/libmemcached.
In this post, I will describe how this patch behaves, some of the rationale for how it was implemented, and how it should be used.
UDP SUPPORT IN MEMCACHED SERVER
The Memcached Protocol Specification (http://code.sixapart.com/svn/memcached/trunk/server/doc/protocol.txt) does allow for UDP communication with the Memcached server. The Memcached UDP specification is a superset of the TCP specification: it follows the same format as the Memcached server TCP protocol, with the following additions:
- Each request must be entirely contained within a single UDP datagram.
- Each datagram request must include an 8-byte frame header (see the specification for details). Components of this header provide basic session management as well as a means to ensure that the client is not violating the Memcached server UDP protocol, for example by sending multi-datagram requests.
- When responding to a request, the server is not limited to a single datagram; the response may span multiple datagrams.
UDP SUPPORT IN LIBMEMCACHED
Limited Operations
The constraints placed upon UDP communication with the Memcached server (see above), combined with the stateless, connectionless nature of UDP, had an impact on how UDP support was implemented in libmemcached.
The most significant impact was the decision to limit the types of operations that can be executed when running in UDP mode. There exists a set of operations in libmemcached that issue a request to the server and then require a response from the server, for example memcached_get(). Because these operations require a response, they pose a difficulty: UDP gives us no assurance that either the request or the response will be delivered. That the server may respond to a request with a multi-datagram message further complicates matters, since UDP does not guarantee that received datagrams arrive in order. Since delivery and ordering guarantees are the purview of TCP, and we did not want to reimplement TCP within libmemcached, we decided simply not to support these operations in UDP mode. As such, the following functions are not available when using UDP communication:
memcached_version(), memcached_stat(), memcached_get(), memcached_get_by_key(), memcached_mget(), memcached_mget_by_key(), memcached_fetch(), memcached_fetch_result(), memcached_value_fetch(). All other operations are supported.
Fire-and-Forget
As hinted at in the section above, the UDP implementation does not attempt to handle responses from the server; the rationale is that since we cannot be assured of receiving the response, we should assume that we won't receive it. Because of this assumption, all supported operations are executed in a "fire-and-forget" mode: once the client has executed the operation, no attempt is made to ensure that the server received or executed it. And since no attempt will be made to handle the server's response, in UDP mode the 'noreply' option is sent with every operation that supports it.
Limited Size
In UDP mode, the current implementation limits the size of cache entries that can be sent to the server to 1KB. Technically, the limit on user-supplied data is above 1300 bytes; the exact limit depends on whether the binary or ASCII protocol is being used, and, if the ASCII protocol is being used, on which command is being executed (set, replace, cas, etc.). For simplicity's sake the user should consider the limit to be 1KB, although the client will allow operations above 1KB as long as the entire message, including the UDP header and command overhead, does not exceed 1400 bytes. This limit is a consequence of the fact that the server does not support multi-datagram requests, which implies that no cache entry may be larger than a single datagram. Currently, the server defines the maximum datagram size as 1400 bytes, and to be consistent with the server, our implementation abides by this limit. (It should be noted that the server currently only uses this limit when crafting UDP responses, which we ignore, so it is conceivable that we could raise the client-side size limit to a larger value.)
No Mixing
The primary data structure in the libmemcached library is memcached_st, which acts as a handle for the client. The typical use pattern is to create an 'instance' of this structure (by calling memcached_create()) and then add a server 'instance' to it for each Memcached server we wish to communicate with (via memcached_server_add()). Our UDP implementation does not allow UDP and TCP servers to be added to the same memcached_st client instance. When we consider the fact that UDP server instances will not be able to support all operations (such as get, see above), we can see how mixing UDP and TCP servers in the same client would be problematic. As an example, let us assume we have added ServerA and ServerB to the same client instance. If ServerA were using UDP while ServerB used TCP, then any get operation whose key hashed to ServerA would fail, since UDP get is not supported, while those that hash to ServerB would pass (assuming the key exists in the cache). This inconsistent behavior is undesirable, and as such the mixing of UDP and non-UDP servers in the same client 'instance' is not allowed. Should an application need to use both UDP and TCP to communicate with its Memcached servers, this can be accomplished by using two separate memcached_st client 'instances', one for each transport protocol (see examples below).
EXAMPLES
The example below shows how to set up a libmemcached client that communicates with the server via UDP:
```c
#include <stdio.h>
#include <string.h>
#include <libmemcached/memcached.h>
int main(void)
{
  memcached_st *memc;
  //creates handle for client instance
  memc= memcached_create(NULL);
  //turns on udp behavior for the memc client handle (must come before adding udp servers)
  memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_USE_UDP, 1);
  //add the server 127.0.0.1:11211 to the client
  memcached_server_add_udp(memc, "127.0.0.1", 11211);
  //store some data (fire-and-forget: no reply is read back)
  const char *key= "foo";
  const char *value= "when we sanitize";
  memcached_return rc= memcached_set(memc, key, strlen(key), value, strlen(value), (time_t)0, (uint32_t)0);
  printf("set: %s\n", memcached_strerror(memc, rc));
  memcached_free(memc);
  return 0;
}
```
This example shows how to set up two clients, one using UDP and one using TCP:
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libmemcached/memcached.h>

int main(void)
{
  memcached_st *udp_client;
  memcached_st *tcp_client;
  //creates udp handle for client instance
  udp_client= memcached_create(NULL);
  //creates tcp handle for client instance
  tcp_client= memcached_create(NULL);
  //turns on udp behavior for the udp client handle
  memcached_behavior_set(udp_client, MEMCACHED_BEHAVIOR_USE_UDP, 1);
  //add the server 127.0.0.1:11211 to the clients
  memcached_server_add_udp(udp_client, "127.0.0.1", 11211);
  memcached_server_add(tcp_client, "127.0.0.1", 11211);
  //store some data via udp (fire-and-forget, so delivery is not guaranteed)
  const char *key= "foo";
  const char *value= "when we sanitize";
  memcached_return rc= memcached_set(udp_client, key, strlen(key), value, strlen(value), (time_t)0, (uint32_t)0);
  //get some data; note we use the tcp client, since get is not supported over udp
  size_t vlen;
  uint32_t flags;
  char *fetched= memcached_get(tcp_client, key, strlen(key), &vlen, &flags, &rc);
  if (fetched != NULL)
  {
    printf("%.*s\n", (int)vlen, fetched);
    free(fetched); //the caller owns the returned value
  }
  memcached_free(udp_client);
  memcached_free(tcp_client);
  return 0;
}
```
FUTURE
We recognize this is a fairly basic implementation of UDP support in libmemcached; the idea was to start small and use feedback from this implementation to determine what a more fully formed implementation should look like or if one is even needed at all. So if you are interested, please download the bits, kick the tires and tell us what ya think.
Posted at 09:14PM Mar 10, 2009 by Eric Lambert in Memcached | Comments[0]
Memcached Java Client Performance on OpenSolaris: Part II
Happy New Year! I hope 2009 will be a great year for all of you!
As for me, I am starting out 2009 the same way I finished 2008 ... looking at the performance of the Spy and Whalin Memcached clients on OpenSolaris. Today I re-ran nearly the same scenario that I discussed in my Dec 22nd blog entry (you'll have to read that entry for all the details), but with one major difference: I increased the size of the cache entries being placed into the cache. In my earlier tests, the entries I was storing into the cache were pretty small, varying from 768 bytes to 1,280 bytes. This time, I increased the size of the cache entries to 30,800 bytes. While a 30KB entry is not exactly large, it is large enough to ensure that both clients will compress the data before sending it to the Memcached server.
The data for this new scenario is captured in the table below. The workload used in this scenario consisted of 90% get operations and 10% set operations. All get operations were 'bulkGet' operations that each requested 100 entries from the cache.
Clients: Whalin 2.0.1 and Spy 2.2.

| Threads | Whalin Ops/Sec | Whalin setLatency (ms) | Whalin getLatency (ms) | Whalin cpu % busy | Whalin NIC Saturation | Spy Ops/Sec | Spy setLatency (ms) | Spy getLatency (ms) | Spy cpu % busy | Spy NIC Saturation |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 43 | 1.3 | 25.6 | 26 | 0.4% | 45 | 1.6 | 24.0 | 26 | 0.4% |
| 10 | 91 | 6.1 | 121.2 | 99 | 0.9% | 115 | 9.5 | 95.3 | 98 | 1.2% |
| 20 | 91 | 12.3 | 242.9 | 98 | 0.9% | 114 | 18.9 | 192.8 | 98 | 1.2% |
| 30 | 89 | 26.3 | 371.9 | 99 | 0.9% | 101 | 44.4 | 321.2 | 90 | 0.9% |
| 40 | 87 | 37.8 | 511.9 | 98 | 0.8% | 14 | 140.7 | 1913.4 | 14 | 0.1% |
| 50 | 86 | 56.2 | 645.0 | 98 | 0.8% | 9 | 307.0 | 4235.3 | 9 | 0.5% |
As we can see, in this scenario the performance of the Whalin client is much closer to that of Spy than in the small-data, no-compression scenario. When using small cache entries, the Whalin client performed at approximately 75% of the rate of the Spy client, whereas in this scenario the clients are fairly evenly matched in terms of ops/second as well as resource utilization, and somewhere between 30 and 40 threads the Whalin client overtakes the Spy client (although, since both clients appear to plateau around 20 threads, I'm not sure there is much to be gained by running more than 30 threads).
Posted at 10:05PM Jan 06, 2009 by Eric Lambert in Memcached | Comments[0]
Memcached Java Client Performance on OpenSolaris
So I have spent some time in the last month taking a look at how the two main Java-based Memcached clients (Whalin and Spy) perform when run on OpenSolaris. The point of this exercise was to generate some meaningful and useful data that we could use to understand the behavior of these two clients when compared with each other.
I should point out that no attempt was made to optimize either client or the environment in which the clients were run. The idea was to simulate, as much as possible, an 'out-of-the-box' experience. So clearly, the data below should not be taken as the optimal performance for each of the clients, as I am sure that there are means to squeeze more performance out of them. I also want to state that the results of this comparison should not be considered a repudiation or endorsement of a particular client.
So, with all these caveats out of the way, let's get to it ....
THE EXECUTION ENVIRONMENT
The environment used to measure the performance consisted of four dual-core 2.2 GHz Sun Fire X2200 M2 servers, each with 4 GB of memory and running OpenSolaris build 90 (snv_90). All of the X2200s were on the same subnet and each had GigE NICs. Three of the X2200s each hosted a Memcached server instance; each instance ran version 1.2.5 and was allocated 1 GB of memory. The fourth X2200 hosted the client under test.
THE BENCHMARK
I used the Faban benchmarking framework to run these benchmarks. The really nice thing about this framework is that it allows you to focus on creating the workload logic, while it takes care of data collection and process/service management.
The workload pattern consisted of the following steps:
- Start the 1.2.5 Memcached server on the 3 server hosts.
- Preload 1,000,000 objects into the cache. The size of the objects in the cache varied from 768 bytes to 1,280 bytes, with an average size of 1,024 bytes. The entries were distributed equally so that each server node was holding approximately 367 MB of cache data.
- Start the client load. The client load came from a single Java VM (1.6.0_6 in 64-bit mode) running on a single host. The number of client threads varied from 1 to 50.
- Apply load for 330 seconds, with a ratio of 90% get operations to 10% set operations. Gets were performed in a bulk fashion, with each get asking for 100 cache entries.
- Shut down the servers and collect the data.
It should be noted that the Spy Client provides the user with the ability to perform non-blocking I/O while Whalin does not. In order to provide a more "apples-to-apples" comparison, the Spy Client was used in a blocking manner. The Spy client was also configured to use the WhalinTranscoder.
The following data was collected while executing this benchmark
- Operations-per-second (Ops/Sec): Total number of operations completed divided by number of seconds during which the benchmark ran.
- meanSetLatency (setLatency): Arithmetic mean for how long, in milliseconds, individual set operations took to complete.
- meanBulkGetLatency (getLatency): Arithmetic mean for how long, in milliseconds, individual getBulk operations took to complete.
- CPU utilization (cpu % busy): Percentage of time the CPU was busy as measured by the vmstat utility.
- NIC Saturation: network card saturation, as captured by the nicstat utility.
THE RESULTS
Clients: Whalin 2.0.1 and Spy 2.2.

| Threads | Whalin Ops/Sec | Whalin setLatency (ms) | Whalin getLatency (ms) | Whalin cpu % busy | Whalin NIC Saturation (%) | Spy Ops/Sec | Spy setLatency (ms) | Spy getLatency (ms) | Spy cpu % busy | Spy NIC Saturation (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 272 | 0.5 | 4 | 29 | 20 | 581 | 0.4 | 2 | 34 | 45 |
| 10 | 548 | 5 | 19 | 92 | 45 | 764 | 7 | 13 | 48 | 60 |
| 20 | 544 | 7 | 40 | 92 | 45 | 759 | 13 | 28 | 48 | 60 |
| 30 | 538 | 8 | 61 | 92 | 45 | 763 | 18 | 42 | 49 | 60 |
| 40 | 534 | 9 | 83 | 92 | 45 | 702 | 25 | 61 | 45 | 58 |
| 50 | 541 | 11 | 102 | 92 | 45 | 746 | 28 | 72 | 48 | 60 |
FINDINGS/OBSERVATIONS/NEXT STEPS
The data above indicates that the Spy client can achieve a higher throughput while using less CPU than Whalin and that Spy achieves a greater saturation of the NIC than Whalin. With this in mind, it is worth again noting that no attempt was made to optimize either client and that it is entirely conceivable that each client could be configured to achieve greater performance.
Both clients appear to reach a performance/resource utilization plateau between 1 and 10 threads. Further benchmarking (not included in the data above) has shown the plateau occurs between 1 and 5 threads.
The lower setLatency averages achieved by the Whalin client indicate that in a more 'set-intensive' environment, Whalin may achieve higher performance than Spy.
The fact that the Whalin client has such a high CPU utilization rate is fertile ground for investigation. Is this something that can be addressed via a change in the runtime environment/configuration?
The benchmark used very small cache entries that were below both clients' compression threshold. What will the results look like when the cache entries exceed the compression threshold?
The benchmarking was performed using the ASCII protocol; how will things change if we run against a 1.3.x server and enable the binary protocol?
Posted at 11:16PM Dec 22, 2008 by Eric Lambert in Memcached | Comments[0]
Hello World
Hi There!
It seems appropriate that I should title my first blog entry "Hello World". Sure ... it's not the most creative of titles, but oh well. Hopefully you'll forgive me for my lack of originality.
So, now that I've sucked you in, let me introduce myself. My name is Eric and I am a software engineer here at Sun Microsystems. I live in the San Francisco Bay Area and have been here most of my adult life. Other places I've lived include Minneapolis, San Luis Obispo, and Zweibrucken (Germany).
For the past eight years I've been working on-and-off at Sun (more 'on' than 'off', but who is counting) during which I've been involved in a variety of roles and projects, from a tools developer for the Java Conformance and Compatibility team, to a developer on the ST5800 (also known as Honeycomb to those of us who know and love her), to my current position as an engineer in the Database Group. My new role has me focusing a lot on caching solutions, specifically Memcached.
One of the first things I'll be tackling is a performance evaluation of the available Java based Memcached clients --I will primarily be looking at the Whalin and Spy clients but if someone knows of another Java based client I should consider, speak up. As I make progress on this, I'll be sure to post my findings to this blog.
Well that's all for now .... nice talking to you.
Posted at 12:34PM Nov 17, 2008 by Eric Lambert in Sun | Comments[0]