Tuesday October 20, 2009
Testing libmemcached on EC2
Someone pinged me yesterday about a problem he was seeing when he tried to run the test suite on Ubuntu Jaunty. The tests failed almost immediately on the following assertion:
value= memcached_behavior_get(memc, MEMCACHED_BEHAVIOR_SOCKET_SEND_SIZE);
assert(value > 0);
I guess I'm an old-school developer, because I want to use a debugger to hunt down bugs. For some reason the default setting in the shell disallowed the creation of core files, so I had to execute the following command to allow them to be written:
$ ulimit -c unlimited
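If you would rather not depend on the shell settings, the same limit can be raised from inside a test program. The following is only a sketch: enable_core_dumps is a hypothetical helper name, not something in libmemcached, and it raises the soft limit to whatever the hard limit allows, which is only "unlimited" if the hard limit is.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the hard
   limit. Returns 0 on success, -1 on failure (errno is set by
   getrlimit/setrlimit). */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;

    /* An unprivileged process may always raise its soft limit up to the
       hard limit, but not beyond it. */
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim);
}
```

Calling this early in main() has roughly the effect of ulimit -c for that process and its children, without changing anything for the rest of the shell session.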
Now that I was able to generate core dumps, I wanted to create a "debug build" of libmemcached, because the optimizer may remove local variables and so on. If I'm not able to reproduce the bug with a debug build, well, then we have to debug the optimized binary, but why make life harder than it already is ;-) To create a debug build, simply invoke:
$ ./configure --with-debug
This didn't work, however :-( Disabling the optimization (-O3) caused the compiler to spit out some new warnings, and we treat warnings as errors in libmemcached. It turned out that the gcc version installed on the machine was the old gcc 4.3.3, rather than one from the more recent 4.4 series. I've been struggling with different problems with gcc lately (mostly that it generates bogus warnings on C99 struct initializers), so I cannot say I was too happy about "yet another compiler problem". The code it complained about was:
unlikely (ptr->flags & MEM_USE_UDP)
With the following warning:
error: conversion to ‘long int’ from ‘uint32_t’ may change the sign of the result [-Wsign-conversion]
MEM_USE_UDP is an enumeration constant, which has type int according to C99 (see section 6.4.4.3), and flags is defined as a uint32_t. So yes, we are doing a bitwise AND on an unsigned and a signed 32-bit word. But we are only testing whether the value is 0 or not, so the sign doesn't matter at all!!! Just for the fun of it I decided to replace unlikely with a normal if (you might have had fun with the broken ntohX macros on Linux generating warnings all of the time, so I guessed this could be a similar problem), and guess what: the warning was gone :-) So I went ahead and replaced all occurrences of unlikely with if... Not the thing you would like to do at 1:30AM :(
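A possible explanation, assuming unlikely is the usual wrapper around GCC's __builtin_expect (the actual macro definition isn't shown here): __builtin_expect takes long arguments, so the uint32_t result of the bitwise AND gets converted to long int at the call, which matches the types named in the warning. Under that assumption, a gentler workaround than removing the hint is to normalize the expression to 0 or 1 before handing it over. In the sketch below the macro definition, the value of MEM_USE_UDP, and struct demo_st are all made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definition, not copied from libmemcached. The !! turns any
   non-zero value into the int 1, so no unsigned value is ever converted
   to the builtin's long parameter. Requires GCC or a compatible compiler. */
#define unlikely(x) __builtin_expect(!!(x), 0)

enum { MEM_USE_UDP = 1 << 5 };      /* hypothetical flag value */

struct demo_st { uint32_t flags; }; /* hypothetical stand-in for the real struct */

/* Returns 1 if the UDP flag is set, 0 otherwise. */
int uses_udp(const struct demo_st *ptr)
{
    if (unlikely(ptr->flags & MEM_USE_UDP))
        return 1;
    return 0;
}
```

The branch hint is kept and the test still only asks "zero or not", which is all the original code cared about.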
With the debug build available I could return to the original problem. I had been looking at the code, and my guess was that it was failing in getsockopt in the following snippet:
int sock_size;
socklen_t sock_length= sizeof(int);

/* REFACTOR */
/* We just try the first host, and if it is down we return zero */
if ((memcached_connect(&ptr->hosts[0])) != MEMCACHED_SUCCESS)
  return 0;

if (getsockopt(ptr->hosts[0].fd, SOL_SOCKET,
               SO_SNDBUF, &sock_size, &sock_length))
  return 0; /* Zero means error */

return (uint64_t) sock_size;
I set a breakpoint on the line containing the second return 0; (so that I could look at errno) and ran the program, but guess what: it didn't fail there! So it had to be a problem with memcached_connect. It turned out that this is a race condition in the test suite, because the test program just starts up the memcached servers and starts using them immediately. The memcached servers aren't done initializing themselves (and binding to the specified port) yet, so the test fails to connect to the servers.
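One common way to close this kind of race is to have the test harness poll the port until the server accepts a connection before any test runs. The sketch below shows that idea only; it is not the fix that went into the libmemcached test framework, and wait_for_server with its parameters is a hypothetical helper that assumes the servers listen on the loopback address.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: try to connect to 127.0.0.1:port up to `attempts`
   times, sleeping `delay_ms` milliseconds after each failed try.
   Returns 0 as soon as one connect succeeds, -1 otherwise. */
int wait_for_server(uint16_t port, int attempts, int delay_ms)
{
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    for (int i = 0; i < attempts; i++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        int rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
        close(fd); /* we only probe; the tests open their own connections */
        if (rc == 0)
            return 0;

        poll(NULL, 0, delay_ms); /* poll with no descriptors acts as a sleep */
    }
    return -1;
}
```

A harness could call something like wait_for_server(port, 50, 100) for each server it started and abort with a clear message if it returns -1, instead of failing later in an unrelated assertion.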
There are a number of small bugs I am going to fix in the test framework as a result of this:
I guess it's no secret that I really prefer doing software development on Solaris (with all of the great tools there), so if you are planning to do development on your EC2 image I would suggest that you start off with an OpenSolaris image instead (check out http://blogs.sun.com/ec2/). That will give you easy access to a lot of great tools I cannot live without (dtrace, dbx, cc etc.), and using the right tool for the task saves a lot of time!!! As an extra bonus you can use the DTrace probes I added to memcached to collect more information on what your memcached server is doing. Matt Ingenthron took this a step further in a demo by using the output from DTrace as an input feed to a browser. I don't remember the link, but you should be able to Google it :-)
Posted at 10:21AM Oct 20, 2009 by trond in Memcached