Trond Norbye's Weblog

http://blogs.sun.com/trond/date/20090626 Friday June 26, 2009

My disks crashed!

Yesterday I found out the hard way that you are not 100% safe even if you have a mirror. Both of my disks started to fail, so I wasn't able to boot the computer anymore. I am pretty sure that someone with more admin experience would have been able to get the machine back up and running (I must admit that I haven't paid much attention to how grub etc. works, so my knowledge about system startup is a bit outdated ;-)). Because I have a big gap in my knowledge here, I booted a live CD and imported the ZFS mirror. Luckily for me there were no problems on the filesystems where I had all of my work, so I was able to copy all of it to another disk (I don't have more than 4 SATA connectors in my machine, so I had to disconnect the SSD cache and plug in an extra disk there). With all my work safe on another hard drive on my desk, I decided to do the easy thing: just buy two new disks and reinstall OpenSolaris on those. My original plan was to get two 650GB disks, but when I arrived at the store they didn't have them anymore. Luckily for me the salesman gave me two 1TB disks for the same price, so now I've got plenty of space for ZFS snapshots :-)

Personally I just love ZFS. During the installation of OpenSolaris, I just opened a terminal and typed in:

jack@opensolaris> pfexec zpool attach -f rpool c7d0s0 c9d0s0

And my mirror was up and running :-) I have been upgrading the old machine for a long time, so I have a lot of old "zombies" lying around. Instead of restoring all of them, I decided to just restore my data (/home), and recreate all of the configuration.

Hopefully I'll be done with everything tonight :-) I'm just so glad that I didn't lose a single bit of my data :-)

http://blogs.sun.com/trond/date/20090625 Thursday June 25, 2009

Replicate your keys to multiple memcached servers.

If you look at how the (community version of) memcached works, all servers are completely isolated from each other. They don't know (or care) about the existence of other servers, and all advanced logic is implemented by the clients. This removes a lot of complexity from the server, resulting in a small, clean source base with few bugs. You will also find this simple design in the client-server protocol, which limits what you can try to implement in the server.

If you scan the mailing lists you will find that requests for replication seem to pop up at regular intervals, so I decided to give it a shot. Personally I am not too interested in a fully replicated scenario (where you have all of your keys stored on multiple machines), because I think you would be wasting too much space. I think a mixed mode is more interesting, where you store only a few of the items on multiple servers; and this is what I implemented.

If you look at the design for the replication from 1000 ft, it is dead simple. When we store a key on a server, we also store it on the next n servers in the server list. If we encounter a problem when we try to send the GET request to the server, we try to fetch the replica instead. We will however not try to fetch the replica if:

  • The server crashes before sending the response back. This will result in a cache miss (because we don't keep any state within libmemcached that would let us recreate the GET request and send it to the next replica server).
  • The server doesn't have the item. A cache miss will be returned to the caller immediately, because trying the replica servers would cause long delays for real cache misses.
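For those who prefer code, here is a minimal sketch of the placement rule and the fallback rule described above. It is not the libmemcached implementation: the hash is a stand-in (32-bit FNV-1a), all function and enum names are hypothetical, and wrapping around at the end of the server list is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in hash (32-bit FNV-1a). This is an assumption for the sketch;
 * the hash libmemcached really uses depends on its configuration. */
static uint32_t sketch_hash(const char *key, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)key[i];
        h *= 16777619u;
    }
    return h;
}

/* The master is the server the key hashes to. */
static size_t master_index(const char *key, size_t len, size_t server_count)
{
    assert(server_count > 0);
    return sketch_hash(key, len) % server_count;
}

/* Copy 0 is the master; copy i (1..n) lives on the i'th next server in
 * the list. Wrapping around at the end of the list is assumed here. */
static size_t copy_index(size_t master, size_t copy, size_t server_count)
{
    assert(server_count > 0);
    return (master + copy) % server_count;
}

/* Possible outcomes of a GET against one server (hypothetical names). */
enum get_outcome {
    GET_HIT,          /* the server returned the item */
    GET_SEND_FAILED,  /* we could not send the request */
    GET_NO_RESPONSE,  /* the server died before answering */
    GET_NOT_FOUND     /* the server answered: no such item */
};

/* Only a failure to send the request makes us move on to a replica;
 * the two other failure cases are reported as a cache miss. */
static int should_try_replica(enum get_outcome outcome)
{
    return outcome == GET_SEND_FAILED;
}
```

With four servers and two replicas, a key whose master is server 3 would get its copies on servers 0 and 1 under these assumptions.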

If you want to try it out you need to grab at least revision 539, but you should be aware of some design choices / limitations:

  • It is only supported with the binary protocol, so you cannot use a memcached server from the 1.2 series (you need the 1.4 branch).

    Why? Well, the replication code uses the "noreply" mode to store the replicas, and the "noreply" mode in the ASCII protocol is just one big hack ;-)

  • SET is the only command that will store multiple replicas.

    The replication code does not implement any kind of transactions / consistency, so I wanted to expose this fact to the user. Allowing ADD or REPLACE could confuse the users and introduce strange bugs in their application. INCR and DECR raise the same inconsistency problems. If you have an atomic counter (at least if it doesn't get evicted from the cache) you don't want it to behave strangely because of race conditions updating the replicas.

  • The CAS identifier is generated on the server, so the master item and all replicas will have different CAS identifiers. If you enable replication you can't use CAS.
  • We don't detect (or handle) network partitions.
  • If you run several memcached instances on the same machine, you don't want to list them next to each other in the server list. The replication works in such a way that it will hash the key to locate the server the object belongs on, and it will store it on the n next servers in the list. If you list memcached instances on the same machine next to each other, you might end up having the master and all of the replicas on the same machine.
  • If you use consistent hashing you can grow your pool without invalidating the whole cache.
  • The replication is configured per memcached_st instance, so the API stays the same (and it adds no extra cost if you don't use it).
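The point about the order of the server list is easy to check with a few lines of code. The sketch below is only an illustration: the function name and the host names are made up, and it assumes that the replicas go to the next entries in the list, wrapping around at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the master at index `master` and its `replicas` next
 * servers (wrapping around, an assumption of this sketch) include two
 * instances on the same host, 0 otherwise. hosts[i] is the host name
 * of the i'th entry in the server list. */
static int copies_share_host(const char *const hosts[], size_t server_count,
                             size_t master, size_t replicas)
{
    assert(server_count > 0 && master < server_count);
    for (size_t i = 0; i <= replicas; i++) {
        for (size_t j = i + 1; j <= replicas; j++) {
            const char *a = hosts[(master + i) % server_count];
            const char *b = hosts[(master + j) % server_count];
            if (strcmp(a, b) == 0)
                return 1;
        }
    }
    return 0;
}

/* Two instances per host, listed next to each other. */
static const char *const grouped[] = { "hostA", "hostA", "hostB", "hostB" };
/* The same four instances, interleaved by host. */
static const char *const interleaved[] = { "hostA", "hostB", "hostA", "hostB" };
```

With one replica, copies_share_host(grouped, 4, 0, 1) reports a collision (master and replica both end up on hostA), while the interleaved list never puts the two copies on the same host.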

Well, I guess a lot of you don't like reading text that doesn't end each statement with a semicolon, so I should probably add some code. First you should locate the code where you create your memcached_st handle. You probably have something like this (I removed the error checking to keep the example small, but you don't want to do that in your code!):

   memcached_st *memc = memcached_create(NULL);
   memcached_server_st *servers = memcached_servers_parse(server_list);
   memcached_server_push(memc, servers);
   memcached_server_list_free(servers);

The first thing we need to do is to enable the binary protocol:

   memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_BINARY_PROTOCOL, 1);

As I mentioned above, I don't think you really want to replicate all of your keys, so let's create a new memcached_st instance and enable replication there (num_replicas contains the number of replicas I want):

   memcached_st *repl = memcached_clone(NULL, memc);
   memcached_behavior_set(repl, MEMCACHED_BEHAVIOR_NUMBER_OF_REPLICAS, num_replicas);

And that's all you need to do! If you want to store a key with multiple replicas, you would go ahead and store it using the repl instance. For "normal" items, you would use the memc instance:

  size_t vlen;
  uint32_t flags;
  memcached_return rc;
  char *value;

  /* Store a key with replicas: */
  memcached_set(repl, "replicated", 10, "foo", 3, 0, 0);
  /* Try to get the item (or the replicas if we have problems talking to the master) */
  value = memcached_get(repl, "replicated", 10, &vlen, &flags, &rc);
  free(value); /* the caller owns the returned buffer */
  /* Store a key without replicas */
  memcached_set(memc, "single", 6, "foo", 3, 0, 0);
  /* Try to get the item */
  value = memcached_get(memc, "single", 6, &vlen, &flags, &rc);
  free(value);
  /* We can also get the master of a replicated item: */
  value = memcached_get(memc, "replicated", 10, &vlen, &flags, &rc);
  free(value);

http://blogs.sun.com/trond/date/20090607 Sunday June 07, 2009

Compiling Drizzle on OpenSolaris 2009.06

With the release of OpenSolaris 2009.06, I thought a new and updated blog post on how to compile Drizzle would be appropriate. To make the blog more copy'n'paste friendly I have removed the prompt from all of the commands I am displaying :-)

The first thing we need to do is to install a compiler, and all of the common tools used to build open source projects. Drizzle also requires libevent and gperf, and there are precompiled packages for them. So let's go ahead and install the software with the following command:

   pfexec pkg install ss-dev SUNWlibevent SUNWgnu-gperf

I like to put the software I compile in separate ZFS filesystems, so let's go ahead and create:

  • /opt/dscm - To hold the scm systems
  • /opt/drizzle - This is where we want our Drizzle installation
  • /opt/gearman - This is where we want our Gearman installation
  • /opt/google - This is where we want our Google Protocol Buffers installation

"Why not just put everything in /usr/local?" you may ask. Well, I don't like that because then I have a hard time figuring out which files to remove when I want to uninstall a package. "This must turn into a long and complex path?" would probably be your next question. The answer is no. Just create the appropriate symbolic links and you are good to go :-)

So let's go ahead and create the ZFS filesystems:

for f in dscm drizzle gearman google
do
   pfexec zfs create -o mountpoint=/opt/$f rpool/$f
   pfexec chown `/usr/bin/id -u`:`/usr/bin/id -g` /opt/$f
done
  

Drizzle, Gearman and libmemcached all use Bazaar for development, and there isn't a package available for OpenSolaris, so we need to install it ourselves. The Bazaar team is really active and uses the "release early, release often" model, so I want an easy way to keep up with the versions. Instead of having zombie files / versions lying around, I ended up with a model where I install each version into its own directory, and I have a symbolic link to the version I want to use. Because we install in a "nonstandard" location, we need to create a startup script so that Python can find the modules. So let's go ahead and install Bazaar (1.15 is the latest stable version right now):

wget --no-check-certificate http://launchpad.net/bzr/1.15/1.15final/+download/bzr-1.15.tar.gz
gtar xfz bzr-1.15.tar.gz
cd bzr-1.15
python setup.py install --prefix=/opt/dscm/bazaar-1.15
mkdir /opt/dscm/bin
cat > /opt/dscm/bin/bzr <<EOF
#! /bin/ksh
export PYTHONPATH=/opt/dscm/bazaar/lib/python2.4/site-packages
exec /opt/dscm/bazaar/bin/bzr "\$@"
EOF
chmod a+x /opt/dscm/bin/bzr
ln -s bazaar-1.15 /opt/dscm/bazaar
cd ..
rm -rf bzr-1.15.tar.gz bzr-1.15

The next time you want to upgrade Bazaar, all you need to do is to move the symbolic link /opt/dscm/bazaar to point to the new version. You can now either put /opt/dscm/bin into your path, or you can create something like /opt/local/bin and create a symbolic link to /opt/dscm/bin/bzr from there (and then put /opt/local/bin in your path). To avoid path problems, I'll keep on referring to bzr by its absolute path throughout the example.

For some reason OpenSolaris doesn't contain a prebuilt 64-bit version of GNU readline, so we need to compile it ourselves (it is scheduled for an upcoming build AFAIK). To keep the example simple, I'll just install the readline library into /opt/drizzle. So just execute the following commands to download, build and install:

wget http://ftp.gnu.org/gnu/readline/readline-6.0.tar.gz
gtar xfz readline-6.0.tar.gz
cd readline-6.0
./configure --disable-static --prefix=/opt/drizzle 
gmake all install
gmake clean
./configure --disable-static --prefix=/opt/drizzle --libdir=/opt/drizzle/lib/`isainfo -k` CFLAGS="-m64"
gmake all install
ln -s `isainfo -k` /opt/drizzle/lib/64
ln -s . /opt/drizzle/lib/32
cd ..
rm -rf readline-6.0.tar.gz readline-6.0

"Stop! Why do you build it two times?" If you look at the options, I compile one version with "-m64", and that option creates a 64-bit binary. Most people would probably not care about the 32-bit binary, but I like to build both versions when I build a library (so that I don't have problems later on if I want to build a 32-bit or 64-bit binary using the library). The reason for the two symbolic links I create at the end is explained in the chapter "32-bit and 64-bit Libraries".

Drizzle uses Google Protocol Buffers in the communication protocol, so let's go ahead and compile them. I don't use the latest version, because there is a compilation error in that version (and I haven't had the time to look at that yet):

wget http://protobuf.googlecode.com/files/protobuf-2.0.3.tar.gz
gtar xfz protobuf-2.0.3.tar.gz
cd protobuf-2.0.3
./configure --disable-static --with-zlib --prefix=/opt/google CPPFLAGS="-fast -m32" LDFLAGS="-fast" \
            --bindir=/opt/google/bin/i86
gmake all install
gmake clean
./configure --disable-static --with-zlib --prefix=/opt/google CPPFLAGS="-fast -m64" LDFLAGS="-fast -m64" \
            --libdir=/opt/google/lib/`isainfo -k` --bindir=/opt/google/bin/`isainfo -k`
gmake all install
cd ..
ln -s `isainfo -k` /opt/google/lib/64
ln -s . /opt/google/lib/32
cp /usr/lib/isaexec /opt/google/bin/protoc
rm -rf protobuf-2.0.3.tar.gz protobuf-2.0.3

With all the dependencies installed, we can go ahead and grab the source for libmemcached, libdrizzle, Gearman and Drizzle:

for f in libdrizzle gearmand libmemcached drizzle 
do
   /opt/dscm/bin/bzr branch lp:$f
done

So let's go ahead and start building them. libdrizzle is first up:

cd libdrizzle
./config/autorun.sh
./configure --disable-static --prefix=/opt/drizzle CFLAGS="-fast -m32" LDFLAGS="-fast"
gmake all install
./configure --disable-static --prefix=/opt/drizzle --libdir=/opt/drizzle/lib/`isainfo -k` CFLAGS="-fast -m64" LDFLAGS="-fast -m64"
gmake clean
gmake all install
cd ..

The next one on the list is libmemcached:

cd libmemcached
./config/bootstrap
PATH=$PATH:/usr/perl5/bin ./configure --disable-static --prefix=/opt/drizzle CFLAGS="-fast -m32" LDFLAGS="-fast" \
    --without-memcached --bindir=/opt/drizzle/bin/i86
gmake all install
PATH=$PATH:/usr/perl5/bin ./configure --enable-64bit --disable-static --prefix=/opt/drizzle \
    --libdir=/opt/drizzle/lib/`isainfo -k` CFLAGS="-fast" LDFLAGS="-fast" --without-memcached --bindir=/opt/drizzle/bin/`isainfo -k`
gmake clean
gmake all install
for f in memcat memrm memcp memerror memflush memslap memstat
do
cp /usr/lib/isaexec /opt/drizzle/bin/$f
done
cd ..

There is a problem with the configure script for Gearman: it is not able to create a 32-bit binary on a machine capable of running in 64-bit mode. From now on we will therefore only create 64-bit binaries (I will work on a patch for this):

cd gearmand
./config/bootstrap
./configure --prefix=/opt/gearman --disable-static --sbindir=/opt/gearman/sbin/`isainfo -k` --libdir=/opt/gearman/lib/`isainfo -k` \
            --bindir=/opt/gearman/bin/`isainfo -k` CFLAGS="-fast -I/opt/drizzle/include -m64" \
            LDFLAGS="-L/opt/drizzle/lib/64 -R/opt/drizzle/lib/64"
gmake clean
gmake all install
cd ..
cp /usr/lib/isaexec /opt/gearman/sbin/gearmand
cp /usr/lib/isaexec /opt/gearman/bin/gearman

Before we can start compiling Drizzle we need to make sure that Drizzle can detect our PCRE installation. OpenSolaris ships with a version that is too new for the Drizzle configure script, so we need to create a symbolic link to make sure it is detected properly:

pfexec ln -s pcre/pcre.h /usr/include/pcre.h

Now all is set for compiling Drizzle:

cd drizzle
PATH=$PATH:/opt/dscm/bin ./config/autorun.sh
PATH=$PATH:/opt/google/bin ./configure CPPFLAGS="-I/opt/google/include -I/opt/gearman/include -I/opt/drizzle/include" \
   LDFLAGS="-L/opt/google/lib/64 -L/opt/gearman/lib/64 -L/opt/drizzle/lib/64 -R/opt/drizzle/lib/64:/opt/gearman/lib/64:/opt/google/lib/64" \
   --prefix=/opt/drizzle --libdir=/opt/drizzle/lib/`isainfo -k` 
PATH=$PATH:/opt/google/bin gmake all install

Now you should have Drizzle installed in /opt/drizzle. If you look in some of my previous blog posts you should be able to find out how to install it as an SMF service :-)

Cheers


This is a personal weblog, I do not speak for my employer.