Trond Norbye's Weblog

« Previous month (Jan 2009) | Main | Next month (Mar 2009) »

http://blogs.sun.com/trond/date/20090328 Saturday March 28, 2009

How well is your hash table working for you?

If you look in the internals of memcached you will find a large hash table where all of the items are stored in (we hash to a bucket before we do a linary search in a linked list to try to locate the item). Memcached is supposed to grow the hash table automatically to avoid having too long lists in each bucket (that would kill the performance), but wouldn't it be cool to know how well the hash table works for you?? Well that's really easy to find out with dtrace!

So let's look at the user-supplied-bugs in libmemcached as an example. I already have memcached running on my server, so I just open up a new terminal and start the following dtrace one-liner:

trond@opensolaris> pfexec dtrace -n ':::assoc-find { @a = quantize(arg2);}'
dtrace: description ':::assoc-find ' matched 1 probe

In another terminal I just type the following commands:

trond@opensolaris> export MEMCACHED_SERVERS=localhost:11211
trond@opensolaris> ./testapp user
servers localhost:11211
        localhost : 11211


user

Testing user_supplied_bug1                                       0.993 [ ok ]
Testing user_supplied_bug2                                       0.009 [ ok ]
Testing user_supplied_bug3                                       0.041 [ ok ]
Testing user_supplied_bug4                                       0.000 [ ok ]
Testing user_supplied_bug5                                       0.192 [ ok ]
Testing user_supplied_bug6                                       0.053 [ ok ]
Testing user_supplied_bug7                                       0.031 [ ok ]
Testing user_supplied_bug8                                       0.000 [ ok ]
Testing user_supplied_bug9                                       0.004 [ ok ]
Testing user_supplied_bug10                                      2.572 [ ok ]
Testing user_supplied_bug11                                      2.555 [ ok ]
Testing user_supplied_bug12                                      0.000 [ ok ]
Testing user_supplied_bug13                                      0.008 [ ok ]
Testing user_supplied_bug14                                      2.607 [ ok ]
Testing user_supplied_bug15                                      0.000 [ ok ]
Testing user_supplied_bug16                                      0.008 [ ok ]
Testing user_supplied_bug18                                      0.003 [ ok ]
Testing user_supplied_bug19                                      0.000 [ ok ]
Testing user_supplied_bug20                                      0.010 [ ok ]

All tests completed successfully

So let's terminate dtrace and look at the results:

^C


           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 245648   
               1 |                                         589      
               2 |                                         35       
               4 |                                         0

So what does this tell us? In almost all of our requests the interesting item is the first item in the bucket :-)

http://blogs.sun.com/trond/date/20090327 Friday March 27, 2009

Debugging memcached UDF for MySQL

I got an email yesterday about a user experiencing problems when using memcached UDF for MySQL, so today I spent some time trying to recreate the problem. It turned out that the bug was caused by using uninitialized memory, so I guessed it could be a good blog documenting how I found it...

The first thing I did was to compile and install libmemcached and the udf:

trond@opensolaris> ./config/bootstrap
trond@opensolaris> ./configure --prefix=/opt/memcached --enable-dependency-tracking --enable-debug --without-memcached
trond@opensolaris> gmake all install
trond@opensolaris> ./config/bootstrap
trond@opensolaris> ./configure --prefix=/opt/memcached --with-mysql=/usr/mysql/bin/mysql_config --with-libmemcached=/opt/memcached CFLAGS=-g
trond@opensolaris> gmake all install

The next thing to do is to instruct MySQL to look in /opt/memcached/lib for dynamic libraries by adding the following line in my.cfg (trond@opensolaris> pfexec vi /etc/mysql/my.cfg):

plugin_dir=/opt/memcached/lib

I wanted to use libumem.so to do memory checking in MySQL/libmemcached/UDFs, and the easiest way to do this is to just replace the mysql binary with the following script:

trond@opensolaris> cd /usr/mysql/bin
trond@opensolaris> pfexec mv mysqld mysqld.bin
trond@opensolaris> cat /tmp/mysqld
#! /bin/ksh
export UMEM_DEBUG=default
export UMEM_LOGGING=transaction
export LD_PRELOAD=libumem.so
exec /usr/mysql/bin/mysqld.bin --skip-stack-trace "$@"
trond@opensolaris> mv /tmp/mysqld .
trond@opensolaris> chmod +x mysqld

So let's start MySQL and verify that it use libumem and locate the current directory (so we now where to look for the corefiles).

trond@opensolaris> svcadm enable mysql
trond@opensolaris> pfexec pldd `pgrep -x mysqld.bin`
8790:   /usr/mysql/bin/mysqld.bin --skip-stack-trace --user=mysql --datadir=/v
/lib/libumem.so.1
/usr/lib/libmtmalloc.so.1
/usr/lib/libCrun.so.1
/lib/librt.so.1
/lib/libz.so.1
/lib/libdl.so.1
/lib/libpthread.so.1
/lib/libthread.so.1
/lib/libgen.so.1
/lib/libsocket.so.1
/lib/libnsl.so.1
/lib/libm.so.2
/usr/lib/libCstd.so.1
/usr/lib/libc/libc_hwcap1.so.1
trond@opensolaris> pfexec pwdx `pgrep -x mysqld.bin`
8790:   /var/mysql/5.0/data

So let's install the UDFs:

trond@opensolaris> /usr/mysql/bin/mysql -u root < install_functions.sql
trond@opensolaris> /usr/mysql/bin/mysql -u root
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.0.67 Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> select memc_servers_set('localhost:11211');
mysql> select memc_get("foobar");
ERROR 2006 (HY000): MySQL server has gone away

So the server went away? I guessed it dumped core, so let's look for a corefile:

trond@opensolaris> pfexec su -
root@razor:/var/mysql/data# ls -l 
total 81452
-rw------- 1 mysql mysql 62261434 2009-03-27 20:44 core.mysqld.bin.8790
-rw-rw---- 1 mysql mysql 10485760 2009-03-27 20:38 ibdata1
-rw-rw---- 1 mysql mysql  5242880 2009-03-27 20:44 ib_logfile0
-rw-rw---- 1 mysql mysql  5242880 2009-02-16 10:58 ib_logfile1
drwx------ 2 mysql mysql       53 2009-02-16 10:58 mysql
-rw-rw---- 1 mysql mysql        5 2009-03-27 20:44 razor.pid
drwx------ 2 mysql mysql      125 2009-02-16 11:24 test

So let's start debugging the corefile to try to figure out what happened:

root@razor:/var/mysql/data# dbx - core.mysqld.bin.8790 
Corefile specified executable: "/usr/mysql/5.0/bin/mysqld.bin"
For information about new features see `help changes'
Reading mysqld.bin
core file header read successfully
Reading ld.so.1
Reading libumem.so.1
Reading libmtmalloc.so.1
Reading libCrun.so.1
Reading librt.so.1
Reading libz.so.1
Reading libdl.so.1
Reading libpthread.so.1
Reading libthread.so.1
Reading libgen.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading libm.so.2
Reading libCstd.so.1
Reading libc.so.1
Reading libmemcached_functions_mysql.so.0.0.0
Reading libmemcached.so.2.0.0
Reading libscf.so.1
Reading libuutil.so.1
Reading libmd.so.1
Reading libmp.so.2
t@13 (l@13) terminated by signal SEGV (no mapping at the fault address)
Current function is memcached_quit_server
   14     if (ptr->fd != -1)

It seems that the pointer is invalid, so let's take a look at it:

(dbx) print ptr
ptr = 0xbaddcafe

If you look at the documentation for libumem, it will write the pattern 0xbaddcafe when you allocate memory, and 0xdeadbeef when you free memory. To me this sounds like we are using uninitialized memory. Let's take a look at the callstack:

(dbx) where
current thread: t@13
=>[1] memcached_quit_server(ptr = 0xbaddcafe, io_death = '\0'), line 14 in "memcached_quit.c"
  [2] memcached_quit(ptr = 0x9e2d450), line 65 in "memcached_quit.c"
  [3] memcached_free(ptr = 0x9e2d450), line 41 in "memcached.c"
  [4] memc_get_deinit(initid = 0x8963edc), line 82 in "get.c"
  [5] Item_udf_func::cleanup(0x8963e40, 0x1, 0x871f2b9, 0x822958e), at 0x81bbfa4 
  [6] THD::cleanup_after_query(0x9e34008, 0x9e34008, 0x8963d18, 0x19), at 0x82295f3 
  [7] dispatch_command(0x3, 0x9e34008, 0x9e41589, 0x1a), at 0x825ab32 
  [8] handle_one_connection(0x9e34008, 0xce76f000, 0xce4fefec, 0xce6dca5e), at 0x8256f96 
  [9] _thrp_setup(0xce804a00), at 0xce6dca96 
  [10] _lwp_start(0xcdc8eaff, 0xce6d540a, 0x40, 0x64, 0x40, 0xce804a00), at 0xce6dcd20 

I think we should start to look at frame 4 (frame 1, 2, 3 is inside libmemcached, but frame 4 is the first "external" call to libmemcached):

(dbx) frame 4
Current function is memc_get_deinit
   82     memcached_free(&container->memc);
(dbx) print container
container = 0x9e2d448
(dbx) examine container/10
0x09e2d448:      0xbaddcafe 0xbaddcafe 0xbaddcafe 0xbaddcafe
0x09e2d458:      0xbaddcafe 0xbaddcafe 0xbaddcafe 0xbaddcafe
0x09e2d468:      0xbaddcafe 0xbaddcafe

It looks like the complete container-structure isn't initialized at all. Luckily we can find out where it was allocated:

root@razor:/var/mysql/data# mdb core.mysqld.bin.8790 
Loading modules: [ libumem.so.1 libuutil.so.1 ld.so.1 ]
> $G
C++ symbol demangling enabled
> ::umalog

T-0.000000000  addr=9e2d440  umem_alloc_896
         libumem.so.1`umem_cache_alloc_debug+0x144
         libumem.so.1`umem_cache_alloc+0x19a
         libumem.so.1`umem_alloc+0xcd
         libumem.so.1`malloc+0x2a
         libmemcached_functions_mysql.so.0.0.0`memc_get_init+0x73
         bool udf_handler::fix_fields+0x66c
         bool Item_udf_func::fix_fields+0x2a
         bool setup_fields+0xe8
         int JOIN::prepare+0x1e0
         bool mysql_select+0x33e
         bool handle_select+0xf7
         int mysql_execute_command+0x4ac3
         bool dispatch_command+0x2b9b
         handle_one_connection+0x516
         libc_hwcap1.so.1`_thrp_setup+0x7e
[ ... cut ... ]

Luckily for us this was the most recent memory allocation (you might have to search a loooong list to find the allocation you looked for), and we can see that the allocation came from memc_get_init. It was pretty easy to spot the allocation, since there is only one in the function:

  [... cut ...]
  container= (memc_function_st *)malloc(sizeof(memc_function_st));
  rc= memc_get_servers(&container->memc);
  memcached_result_create(&container->memc, &container->results);
  initid->ptr= (char *)container;
  return 0;
}

Here we spot bug number 1.. we don't check the return value from memc_get_server, and if you read on you will see that it is what happens. So let's look at the function memc_get_servers:

int memc_get_servers(memcached_st *clone)
{
  int retval;
  memcached_st *test;

  pthread_mutex_lock(&memc_servers_mutex);
  test= memcached_clone(clone, master_memc);
  pthread_mutex_unlock(&memc_servers_mutex);
  retval= test ? 1 : 0;

  return retval ;
}

This seems correct enough, but let's look at the memcached_clone function?

memcached_st *memcached_clone(memcached_st *clone, memcached_st *source)
{
  memcached_return rc= MEMCACHED_SUCCESS;
  memcached_st *new_clone;

  if (source == NULL)
    return memcached_create(clone);

  if (clone && clone->is_allocated)
  {
    return NULL;
  }
[ ... cut ... ]

Now here is something interesting. We check the member is_allocated in the clone if we should abort the cloning process. Remember that we pass in a memory chunk we allocated with malloc, so we don't know the value of this (==> undefined behavior). We do return the "error" to the caller, but the caller doesn't check the return code resulting in the fact that we have a structure of uninitialized memory (and all use of it will be "undefined").

The manual section for memcached_clone isn't clear on the above fact, so I am going to update the documentation. It is not difficult to fix the source, just replace the call to (memc_function_st *)malloc(sizeof(memc_function_st)) with calloc(1, sizeof(memc_function_st)) (no need for the cast there).. I'll be pushing a patch to the author of the UDFs .

http://blogs.sun.com/trond/date/20090325 Wednesday March 25, 2009

Drizzle Developer Day 2009

The Drizzle Developer Day 2009 is scheduled the day after the MySQL users conference. I am really looking forward to talk to you, so please sign up today :-)

Bazaar plugin for Hudson

I have been using Hudson to build various software projects I am working on for some time and I really like it, so one of the first things I did when I started compiling Drizzle on OpenSolaris was to configure a new build target on my server.

Google pointed me to a Bazaar plugin, but unfortunately it didn't work well in the "master-slave" configuration. I tried to look at the source in the plugin to try to make it work, but it soon found out that it was a lot faster to just clone the Mercurial plugin and adapt it to Bazaar.

I pushed the plugin earlier today, so you may browse the source code and compile it yourself if you like (or wait for it to be listed in the available plugins for Hudson).

We are using this plugin on Drizzle to monitor the builds on one server while compiling Drizzle on other machines.

http://blogs.sun.com/trond/date/20090305 Thursday March 05, 2009

Adding debugging functions to your dbx session

Have you ever been sitting in a debugging session thinking: "arg, why don't I have a function doing xyz"? I know I have! Luckily for us we don't have to terminate the debugging session, build a new version of the application with the function available and try to recreate the debugging session. There is an easy way to extend our debugging session with new functions :-)

Talking is one thing, but developers don't believe anything before they see the code. So let's go ahead and create an example.

trond@opensolaris> nl -ba main.c
     1	#include <stdio.h>
     2	
     3	struct item {
     4	   /* the interesting data */
     5	   struct item* next;
     6	};
     7	
     8	int main(int argc, char** argv) {
     9	   struct item items[10];
    10	   for (int ii = 0; ii < 10; ++ii) {
    11	      items[ii].next = items + ii + 1;
    12	   }
    13	   /* terminate the list */
    14	   items[9].next = NULL;
    15	
    16	   /* let's create a loop */
    17	   items[9].next = &items[7];
    18	
    19	   return 0;
    20	}
trond@opensolaris> cc -o testprogram -g main.c -ldl

So let's start a debugging session, and set a breakpoint at line 17 in main.c:

trond@opensolaris> dbx testprogram
Reading testprogram
Reading ld.so.1
Reading libdl.so.1
Reading libc.so.1
(dbx) stop at 17
(2) stop at "main.c":17
(dbx) run
Running: testprogram 
(process id 17090)
stopped in main at line 17 in file "main.c"
   17      items[9].next = &items[7];
(dbx)

So how do we verify that we don't have a loop in this list??? The first thing we need to do is to create a small C-function to do loop detection and compile it into a shared object:

trond@opensolaris> nl -ba looptest.c
     1	#include <stdio.h>
     2	
     3	struct item {
     4	   /* the interesting data */
     5	   struct item* next;
     6	};
     7	
     8	int looptest(struct item* root) {
     9	   struct item* lookahead = root;
    10	
    11	   while (root != NULL) {
    12	      if (lookahead != NULL && lookahead->next != NULL) {
    13	         lookahead = lookahead->next->next;
    14	      } else {
    15	         lookahead = NULL;
    16	      }
    17	      if (root == lookahead) {
    18	         /* loop detected */
    19	         return 1;
    20	      } else {
    21	         root = root->next;
    22	      }
    23	   }
    24	
    25	   /* no loop */
    26	   return 0;
    27	}
trond@opensolaris> cc -o looptest.so -G -KPIC -g looptest.c

The trick is that we can use the call command in dbx to call a function from within the process we are debugging, and the function we want to call is dlopen. Why? dlopen will load the functions in the shared object into the address space of the process so that we can call them. Let's jump back to the debugging session:

(dbx) call dlopen("./looptest.so", 0x102)
Reading looptest.so
stopped in main at line 17 in file "main.c"
   17      items[9].next = &items[7];

So what is 0x102? Well that is result of RTLD_NOW | RTLD_GLOBAL (check the dlopen manual page for more info). Now we can call the looptest function from our debugging session:

(dbx) print looptest(items)
looptest(items) = 0

So let's continue the debugging and execute the next line that creates a loop in the list:

(dbx) next
stopped in main at line 19 in file "main.c"
   19      return 0;
(dbx) print looptest(items)
looptest(items) = 1

One small caveat is that you have to link your application with -ldl for this to work....

http://blogs.sun.com/trond/date/20090304 Wednesday March 04, 2009

Use Wireshark to look at your Memcached traffic!

A couple of days ago I got an IM from Stig Bjørlykke (a co-worker from when I worked at Thales Norway) asking about the traffic sent on port 11211. He was looking over the port numbers assigned by IANA and spotted my name there (I applied for port 11211 for memcached traffic some time ago :-).

I pointed Stig to the wiki, and I promised him that I should send him a little test program so that he could look at the traffic.. It took me some days before I sent the email, and a couple of days ago I got a new IM from Stig: "Hey, do you want to test the memcached dissector?".

Unfortunately for me their build farm doesn't build OpenSolaris packages, so I had to test it on my Mac book. I don't use the mac for development, and instead of spending time creating a setup there, I just ran my test program on my OpenSolaris boxes and captured the network traffic with snooop:

trond@opensolaris> pfexec snoop -o /tmp/snoop -d iprb0

I must admit that I was excited when I started Wireshark and loaded the capture file, and I think the result looks great:

If you like to try it yourself, you should download the latest build from http://www.wireshark.org/download/automated/ (or check out the source and build it yourself).

http://blogs.sun.com/trond/date/20090301 Sunday March 01, 2009

Socket connection timeout

I have been struggling to get a patch sent to the libmemcached mailing list to work as expected on my system. The first thing I normally do when I receive a new patch is to apply it and run the test suite, and if that doesn't work I normally notifies the author of the patch without digging too deep into the patch. This time I didn't get a test failure caused by one of the assert() statements in the patch, but the test program terminated unexpectedly. In order to give the author some more information on where it failed, I had to at least figure out how the program terminated (calling exit()?? a signal??? etc). truss is a good tool for that:

trond@opensolaris> truss -o /tmp/truss -E ./tests/testapp consistent_not
[... cut ...]
trond@opensolaris> tail -10 /tmp/truss
 0.0000 so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, "", SOV_DEFAULT) = 9
 0.0000 setsockopt(9, SOL_SOCKET, SO_LINGER, 0xFFBFF570, 8, SOV_DEFAULT) = 0
 0.0000 fcntl(9, F_GETFL)                               = 2
 0.0000 fcntl(9, F_SETFL, FWRITE|FNONBLOCK)             = 0
 0.0001 connect(9, 0x0004D3F0, 16, SOV_DEFAULT)         Err#150 EINPROGRESS
 0.0001 pollsys(0xFFBFF5E8, 1, 0xFFBFF578, 0x00000000)  = 1
 0.0000 fcntl(9, F_GETFL)                               = 130
 0.0000 fcntl(9, F_SETFL, FWRITE)                       = 0
 0.0000 write(9, " g e t   t H m Z u B 7 T".., 107)     Err#32 EPIPE
 0.0000     Received signal #13, SIGPIPE [default]

Hmm.. So we see that the program terminates when it receives the SIGPIPE signal, and we don't have a signal handler for this. If we look in the man page for write, you will see that write will generate SIGPIPE if the stream isn't connected to the peer.

It took me some time to figure out the bug here (I created a scaled down example and posted to a mailing list, and a friendly soul out there pointed me to the error!!). If you look closely in the truss output above, I only call connect once. The libmemcached took it for granted that if poll returned with a POLLOUT event, the connection was established. The correct thing would however be to call connect once more and check the return value (and possibly errno). The connect man page could have spent a paragraph describing this behavior ;-)

To end this blog, I'll create a new version of the connect call taking a timeout value as a parameter:

int my_connect(int s, const struct sockaddr *name, int namelen, int timeout)
{
   /* We may need to toggle on non-blocking mode */
   int flags = fcntl(s, F_GETFL, 0);
   if (flags == -1) {
      flags = 0;
   } else if (!(flags & O_NONBLOCK)) {
      (void)fcntl(s, F_SETFL, flags | O_NONBLOCK);
   }
   
   int ret;
   while (connect(s, name, namelen) < 0) {
      if (errno == EISCONN) {
         ret = 0;
         break;
      } else if (errno == EINPROGRESS || errno == EALREADY) {
         struct pollfd fds[1] = {[0].fd = s,
                                 [0].events = POLLOUT};         
         if (poll(fds, 1, timeout) == 0) {
            /* poll timed out, so let's set that in errno ;-) */
            errno = ETIMEDOUT;
            ret = -1;
            break;
         } else if (fds[0].revents & POLLERR) {
            ret = -1;
            break;
         }
      } else if (errno != EINTR) {
         ret = 1;
         break;
      }
   }
   
   /* Restore the old setting on the socket if we updated it above */
   int error = errno;
   if (!(flags & O_NONBLOCK)) {
      (void)fcntl(s, F_SETFL, flags);
   }
   errno = error;
   return ret;
}


Valid HTML! Valid CSS!

This is a personal weblog, I do not speak for my employer.