Collected thoughts and musings George's Sun Blog

Wednesday Jul 29, 2009

One of the project that I've been working on recently has been to add metadata support to the Apache Avro project, and I wanted to post an update of that progress.

Avro is a multi-language serialization/deserialization library and RPC middleware framework that is a Hadoop subproject.  Avro will eventually replace Hadoop's internal RPC protocols and on-disk data structure representations.  A key advantage of using Avro will be that different language clients can interact with Hadoop, and the schemas for protocols and on-disk data structures can change and evolve over time and still be supported.

The specification for Avro now includes metadata support (https://issues.apache.org/jira/browse/AVRO-67).  This metadata takes the form of a Map of bytes (with Utf8 strings as keys).  In Java, this looks like:

Map<Utf8,ByteBuffer> meta;

There is a metadata map sent along the RPC connection handshake, and one sent on each individual RPC call.  The map is cleared out before sending along each call.  To set, query, or manipulate the metadata, there is going to be a "plugin" architecture where you can write a plugin that will have access to this map.  Note that all the plugins share a single map per request/handshake.

What can metadata be used for?

  • Authentication credentials and setup
  • Authorization of clients
  • Encryption
  • Compression
  • Tracing (e.g., X-Trace)
  • Accounting

Currently, I'm working on the implementation of this API (see https://issues.apache.org/jira/browse/AVRO-76).  You can see the slow but steady evolution of the API at that link.

What kind of functionality do you need from a metadata layer in RPC?  Interested in implementing any of the above functionalities?  I'd love to hear from you.

Monday Jun 15, 2009

I'm "Live Twittering" HotCloud09.  Come check it out at http://twitter.com/SunLabsBigData

Thursday Jun 11, 2009

I've set up a Twitter account for the BigData Project at Sun Labs.  Check it out!

http://twitter.com/SunLabsBigData

Thursday Jan 29, 2009

Last night was the first meeting of the San Diego Hadoop Users Group (SD-HUG), held at Sun's SD Campus.  About a dozen people were in attendance, including people with general interest in hadoop, UCSD students, Jon from Streamy, and a great group from Veoh.  We had some great conversations about Hadoop, HBase, and related technologies.  Ryan Lynch of Veoh.com spent some time briefing us on how they use Hadoop as part of their video hosting site.  Chris Wensel (all the way down from the Bay Area) gave a presentation on Cascading.org, including some new features that let you build up smaller units of computation into larger and larger orchestrations of work.  Really cool stuff.  Thanks Ryan and Chris!


(Ryan presenting on the Hadoop goings-on at Veoh.com)

Ryan, Leor, and Chris

Chris and Jon

Friday Jan 16, 2009

In September, 2008 we released a pre-configured, self-contained Hadoop environment bundled onto an OpenSolaris live cd (see http://opensolaris.org/os/project/livehadoop/).  This effort was designed to provide end users, developers, and students with a very easy-to-use Hadoop environment with minimal configuration and setup.  So far that effort has been well received, and over 11,000 downloads of the CD have been made to date!

We're busy working on the second version of OpenSolaris Live Hadoop, and I wanted to put some of the new features out there for what to expect

New updates in the works:

  • Hadoop 0.19
  • OpenSolaris 2008.11 (with vastly improved laptop support)
  • The ability to boot off of a CD, DVD, or USB stick
  • Inclusion of JDK 6 with the Java compiler and related tools

Cool new features

  • The new build process we are using to make the live cd should make it much easier to include your own custom packages and data onto the cd.  This means that you should be able to create live images for your particular environment with your own courseware, sample data, documentation, or assignments on it.
  • Inclusion of trace support (based on X-Trace and DTrace).  This trace support should make understanding the internals of Hadoop easier and provide improved visibility into what is happening.  Expect this support to improve over time.
  • HBase, an open-source BigTable implementation
If you have any requests or recommendations, please let me know.  As for time frame, we're looking at February, assuming the new build process (based on OpenSolaris 2008.11) works as we expect.  Stay tuned!

Monday Nov 17, 2008

Because of the enormous developer and user activity around Hadoop, it has grown into a formidable piece of software.  Finding bugs in your Map/Reduce programs can be hard, since there are so many different software components involved in carrying out your job.  However, Hadoop can be configured to run in a single-process, single-JVM debug mode that makes this process a lot easier.  This blog post describes how to setup your enviornment to support this mode.  I've put together a small download that includes scripts needed to configure Eclipse, as well as a sample program that you can modify at will to use as the base for your programs.

Prerequisites:

  1. Java JDK 1.6 or higher (that includes the Java compiler 'javac')
  2. Hadoop 0.18.2 source code (http://hadoop.apache.org/core/releases.html)
  3. SingleProcessHadoop starter files (http://blogs.sun.com/george/resource/SingleProcessHadoop/SingleProcessHadoop.tar.gz)
First, download Hadoop 0.18.2, if you do not already have it, from the link above.  Unpack it into a directory, for example ~/src/hadoop-0.18.2.  Next, download the 'SingleProcessHadoop' starter files from the above link and unpack them into a second directory (for example, ~/src/SingleProcessHadoop).
  1. Go to the SingleProcessHadoop directory (cd ~/src/SingleProcessHadoop)
  2. Generate the Eclipse files by running the 'generate-eclipse-files.sh' script.  This script takes as its argument the location of your Hadoop directory.  In my case, I type (./generate-eclipse-files.sh ~/src/hadoop-0.18.2)
  3. Start Eclipse
  4. Import the SingleProcessHadoop project by invoking (File -> Import...).  Under the 'General' tab, click "Existing Projects into Workspace".  Under "Select root directory", click browse.  Use the file browser to navigate to the SingleProcessHadoop directory and import it.

You should now have the SingleProcessHadoop package set up in Eclipse:


The only Java file in the project contains the Wordcount demo taken from the Eclipse example code.  Let's run it and examine the output:

  1. Double-click on the SingleProcessWordCount.java file in the Project Explorer.  This file contains the Map and Reduce functions at the top.  The input to wordcount is on line 93:
    1. String input = "The quick brown fox\nhas many silly\nred fox sox\n";
  2. From the menu bar, select (Run -> Run As -> Java Application)
  3. The bottom of the screen should fill up with Red diagnostic and logging text.  After a minute or so, this should complete, and you can scroll up until you see the output of your job, which should be in black:


You can now debug your Map/Reduce applications by setting various breakpoints in Eclipse and selecting (Run -> Debug As -> Java Application) from the menu bar.

Thursday Oct 09, 2008

In part 1 of this series, we created an OpenSolaris zone.  In part 2, we cloned this zone to create two more, and set up networking and ssh host keys so that we can

  • Log into each zone from the global zone without using our password
  • Log into the global zone from the local zones without using our password
The stage is now set for us to setup and run Hadoop, which is what we cover in Part 3 of this tutorial.[Read More]

In Part 1 of my series on building a virtual Hadoop cluster with OpenSolaris, we built and started a zone.  In part 2, we will:

  1. Set up ssh keys so that you can log into the zone without using your password
  2. Clone the original zone into node2
  3. Clone the original zone into node3
  4. Set up /etc/hosts
  5. Initialize the known_hosts file
[Read More]

Experimenting with Hadoop can be a lot of fun, but if you don't have easy access to your own cluster, your options are somewhat limited.  You can run map/reduce jobs locally on a single node, using the "Pseudo" distributed configuration described in the Hadoop Quick Start Guide.  While this is a good way to get started initially, it is a somewhat limited environment to play around with, since you only have one DataNode, and one TaskTracker.

We can take advantage of OpenSolaris's Zone capability to create a fully distributed version of Hadoop, in which each Hadoop node runs in its own zone.  Zones are virtual machines that look and act like independent machines.  Zones are different than VMs like Xen or VMWare in that they have a much smaller memory requirement.  Less memory overhead means more memory for your Hadoop jobs.

Creating a fully distributed Hadoop virtual cluster on a single node takes about half an hour or so, depending on the speed of your network.  This blog post describes the first part of this process: building a Zone.  In Part 2, we will clone that zone to build up our cluster, and in Part 3, we will install and configure Hadoop.[Read More]