Collected thoughts and musings George's Sun Blog

Wednesday Jul 29, 2009

One of the projects that I've been working on recently is adding metadata support to the Apache Avro project, and I wanted to post an update on that progress.

Avro is a multi-language serialization/deserialization library and RPC middleware framework that is a Hadoop subproject.  Avro will eventually replace Hadoop's internal RPC protocols and on-disk data structure representations.  A key advantage of using Avro will be that different language clients can interact with Hadoop, and the schemas for protocols and on-disk data structures can change and evolve over time and still be supported.

The specification for Avro now includes metadata support (https://issues.apache.org/jira/browse/AVRO-67).  This metadata takes the form of a map from Utf8 string keys to byte values.  In Java, this looks like:

Map<Utf8,ByteBuffer> meta;

There is one metadata map sent along with the RPC connection handshake, and one sent with each individual RPC call.  The map is cleared out before each call is sent.  To set, query, or manipulate the metadata, there is going to be a "plugin" architecture in which you can write a plugin that has access to this map.  Note that all the plugins share a single map per request/handshake.
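Here is a minimal sketch of how several plugins might share one per-call map.  The MetadataPlugin interface, its beforeCall method, and the key names are hypothetical, since the real API is still being worked out, and String stands in for Avro's Utf8 so that the example compiles without Avro on the classpath.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MetadataSketch {
    /** Hypothetical plugin hook; this is NOT the actual Avro API. */
    interface MetadataPlugin {
        void beforeCall(Map<String, ByteBuffer> meta);
    }

    /** Encodes a string as UTF-8 bytes, the value type used in the map. */
    static ByteBuffer bytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    /** Decodes a value back to a string without disturbing the buffer's position. */
    static String text(ByteBuffer b) {
        return StandardCharsets.UTF_8.decode(b.duplicate()).toString();
    }

    /** Starts from an empty map for each call, then lets every plugin write to it. */
    static Map<String, ByteBuffer> buildCallMetadata(List<MetadataPlugin> plugins) {
        Map<String, ByteBuffer> meta = new HashMap<String, ByteBuffer>();
        for (MetadataPlugin plugin : plugins) {
            plugin.beforeCall(meta); // all plugins see the same map
        }
        return meta;
    }

    public static void main(String[] args) {
        // Two illustrative plugins; the keys "auth.token" and "trace.id" are made up.
        MetadataPlugin auth = meta -> meta.put("auth.token", bytes("secret"));
        MetadataPlugin trace = meta -> meta.put("trace.id", bytes("abc123"));

        Map<String, ByteBuffer> meta = buildCallMetadata(Arrays.asList(auth, trace));
        // prints "2 entries; trace.id=abc123"
        System.out.println(meta.size() + " entries; trace.id=" + text(meta.get("trace.id")));
    }
}
```

Because the map is shared, plugins in a design like this would need to agree on key names (for example, by prefixing them) to avoid overwriting each other's entries.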

What can metadata be used for?

  • Authentication credentials and setup
  • Authorization of clients
  • Encryption
  • Compression
  • Tracing (e.g., X-Trace)
  • Accounting

Currently, I'm working on the implementation of this API (see https://issues.apache.org/jira/browse/AVRO-76).  You can see the slow but steady evolution of the API at that link.

What kind of functionality do you need from a metadata layer in RPC?  Interested in implementing any of the above functionalities?  I'd love to hear from you.

Monday Jun 15, 2009

I'm "Live Twittering" HotCloud09.  Come check it out at http://twitter.com/SunLabsBigData

Thursday Jun 11, 2009

I've set up a Twitter account for the BigData Project at Sun Labs.  Check it out!

http://twitter.com/SunLabsBigData

Monday Apr 13, 2009

If you've read the technology section of the newspaper recently, been to any major research conferences, or looked at press releases from the past few months, you may be thinking that the entire field of Computing has suddenly decided to study Meteorology.  This is understandable, given the huge interest in Cloud Computing.  Cloud Computing is "hot", and understandably so.  We're finally beginning to see real adoption of rent-by-the-unit computation and storage, the democratization of software distribution, and the need for scale and elastic resource utilization by software developers.

There is obviously a scramble in the field for defining the right levels of abstraction for end users, application developers, and network operators.  Speaking about cloud computing, Greg Papadopoulos stressed the need for interoperability on his blog:

We should tear out a page from the internet playbook and work towards and open set of interoperable standards and all contribute to a software commons of open implementations.  Particularly important are standards for virtual machine representation (e.g. OVF), data in/exgest (e.g.  WebDAV), code ingest and provisioning (e.g. Eucalyptus), distributed/parallel data access (pNFS, MogileFS, HDFS), orchestration and messaging (OpenESB, ActiveMQ)  accounting, and identity/security ( SAML2, OpenID, OpenSSO).

I would add to this the need for an open set of interoperable standards for tracing and monitoring.  The ability to build new applications in the Cloud by dynamically invoking other local and remote services with interoperable protocols is very powerful.  However, that same dynamism, coupled with the multiple layers of abstraction that characterize datacenters, leads to systems that are very difficult to reason about in terms of performance and reliability.  After all, if your new Cloud application isn't scaling linearly, or if you're seeing a "long tail" of users that see bad performance, where in the myriad of subsystems, virtualization layers, and distributed storage systems does the problem reside?  Worse, does the problem emerge from the combination of underlying components, rather than any single malfunctioning piece?

Just as open standards and protocols make the Internet possible, open standards and protocols will make the "Inter-cloud" possible as well.  Adding introspection capabilities and tracing support to those standards will ensure that we will be able to reason about the resulting systems.  This in turn will lead to more reliable and dependable software.

Thursday Jan 29, 2009

Last night was the first meeting of the San Diego Hadoop Users Group (SD-HUG), held at Sun's SD Campus.  About a dozen people were in attendance, including people with a general interest in Hadoop, UCSD students, Jon from Streamy, and a great group from Veoh.  We had some great conversations about Hadoop, HBase, and related technologies.  Ryan Lynch of Veoh.com spent some time briefing us on how they use Hadoop as part of their video hosting site.  Chris Wensel (all the way down from the Bay Area) gave a presentation on Cascading.org, including some new features that let you build up smaller units of computation into larger and larger orchestrations of work.  Really cool stuff.  Thanks Ryan and Chris!


(Ryan presenting on the Hadoop goings-on at Veoh.com)

Ryan, Leor, and Chris

Chris and Jon

Friday Jan 16, 2009

In September 2008 we released a pre-configured, self-contained Hadoop environment bundled onto an OpenSolaris Live CD (see http://opensolaris.org/os/project/livehadoop/).  This effort was designed to provide end users, developers, and students with a very easy-to-use Hadoop environment with minimal configuration and setup.  So far that effort has been well received, and the CD has been downloaded over 11,000 times to date!

We're busy working on the second version of OpenSolaris Live Hadoop, and I wanted to share some of the new features you can expect.

New updates in the works:

  • Hadoop 0.19
  • OpenSolaris 2008.11 (with vastly improved laptop support)
  • The ability to boot off of a CD, DVD, or USB stick
  • Inclusion of JDK 6 with the Java compiler and related tools

Cool new features

  • The new build process we are using to make the Live CD should make it much easier to include your own custom packages and data on the CD.  This means that you should be able to create live images for your particular environment with your own courseware, sample data, documentation, or assignments on them.
  • Inclusion of trace support (based on X-Trace and DTrace).  This trace support should make understanding the internals of Hadoop easier and provide improved visibility into what is happening.  Expect this support to improve over time.
  • HBase, an open-source BigTable implementation

If you have any requests or recommendations, please let me know.  As for time frame, we're looking at February, assuming the new build process (based on OpenSolaris 2008.11) works as we expect.  Stay tuned!

Monday Nov 17, 2008

Because of the enormous developer and user activity around Hadoop, it has grown into a formidable piece of software.  Finding bugs in your Map/Reduce programs can be hard, since there are so many different software components involved in carrying out your job.  However, Hadoop can be configured to run in a single-process, single-JVM debug mode that makes this process a lot easier.  This blog post describes how to set up your environment to support this mode.  I've put together a small download that includes the scripts needed to configure Eclipse, as well as a sample program that you can modify at will to use as the base for your programs.

Prerequisites:

  1. Java JDK 1.6 or higher (that includes the Java compiler 'javac')
  2. Hadoop 0.18.2 source code (http://hadoop.apache.org/core/releases.html)
  3. SingleProcessHadoop starter files (http://blogs.sun.com/george/resource/SingleProcessHadoop/SingleProcessHadoop.tar.gz)

First, download Hadoop 0.18.2, if you do not already have it, from the link above.  Unpack it into a directory, for example ~/src/hadoop-0.18.2.  Next, download the 'SingleProcessHadoop' starter files from the above link and unpack them into a second directory (for example, ~/src/SingleProcessHadoop).

  1. Go to the SingleProcessHadoop directory (cd ~/src/SingleProcessHadoop)
  2. Generate the Eclipse files by running the 'generate-eclipse-files.sh' script.  This script takes as its argument the location of your Hadoop directory.  In my case, I type (./generate-eclipse-files.sh ~/src/hadoop-0.18.2)
  3. Start Eclipse
  4. Import the SingleProcessHadoop project by invoking (File -> Import...).  Under the 'General' tab, click "Existing Projects into Workspace".  Under "Select root directory", click browse.  Use the file browser to navigate to the SingleProcessHadoop directory and import it.

You should now have the SingleProcessHadoop package set up in Eclipse:


The only Java file in the project contains the Wordcount demo taken from the Hadoop example code.  Let's run it and examine the output:

  1. Double-click on the SingleProcessWordCount.java file in the Project Explorer.  This file contains the Map and Reduce functions at the top.  The input to wordcount is on line 93:
       String input = "The quick brown fox\nhas many silly\nred fox sox\n";
  2. From the menu bar, select (Run -> Run As -> Java Application)
  3. The bottom of the screen should fill up with red diagnostic and logging text.  After a minute or so, this should complete, and you can scroll up until you see the output of your job, which should be in black:


You can now debug your Map/Reduce applications by setting various breakpoints in Eclipse and selecting (Run -> Debug As -> Java Application) from the menu bar.
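If you want something to compare the job's output against, the same counts can be worked out in a few lines of plain Java, with no Hadoop involved.  This is only a sketch: the class name WordCountCheck is made up, and it assumes the sample splits words on whitespace and treats upper and lower case as different, the way the stock Hadoop WordCount example does.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountCheck {
    /** Counts whitespace-separated tokens; the TreeMap keeps the words sorted. */
    static Map<String, Integer> count(String input) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        StringTokenizer tokens = new StringTokenizer(input);
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken();
            Integer seen = counts.get(word);
            counts.put(word, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same input string as the sample program above.
        String input = "The quick brown fox\nhas many silly\nred fox sox\n";
        for (Map.Entry<String, Integer> entry : count(input).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

For this input, "fox" is the only word that appears twice; each of the other eight words appears once.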

Monday Oct 27, 2008

I'm very excited to be attending OSDI 2008 this year in San Diego, CA (Dec 8-10).

There are some really interesting papers on tap--hope to see you there!


I'm also going to attend the First USENIX Workshop on the Analysis of System Logs (WASL '08) the day before, on Sunday.  There are a huge number of surprisingly interesting issues that emerge related to generating, storing, analyzing, and processing log files (not the least of which is that you can generate terabytes per day without even trying.  Now where did that needle go in this haystack again?)

Thursday Oct 09, 2008

In Part 1 of this series, we created an OpenSolaris zone.  In Part 2, we cloned this zone to create two more, and set up networking and ssh host keys so that we can:

  • Log into each zone from the global zone without using our password
  • Log into the global zone from the local zones without using our password

The stage is now set for us to set up and run Hadoop, which is what we cover in Part 3 of this tutorial.

In Part 1 of my series on building a virtual Hadoop cluster with OpenSolaris, we built and started a zone.  In Part 2, we will:

  1. Set up ssh keys so that you can log into the zone without using your password
  2. Clone the original zone into node2
  3. Clone the original zone into node3
  4. Set up /etc/hosts
  5. Initialize the known_hosts file

Experimenting with Hadoop can be a lot of fun, but if you don't have easy access to your own cluster, your options are somewhat limited.  You can run map/reduce jobs locally on a single node, using the "Pseudo" distributed configuration described in the Hadoop Quick Start Guide.  While this is a good way to get started initially, it is a somewhat limited environment to play around with, since you only have one DataNode, and one TaskTracker.

We can take advantage of OpenSolaris's Zone capability to create a fully distributed version of Hadoop, in which each Hadoop node runs in its own zone.  Zones are virtual machines that look and act like independent machines.  Zones differ from VMs like Xen or VMware in that they have a much smaller memory requirement.  Less memory overhead means more memory for your Hadoop jobs.

Creating a fully distributed Hadoop virtual cluster on a single node takes about half an hour or so, depending on the speed of your network.  This blog post describes the first part of this process: building a Zone.  In Part 2, we will clone that zone to build up our cluster, and in Part 3, we will install and configure Hadoop.

Tuesday Sep 02, 2008

We're announcing the availability of an open-source "Live CD" that gives users new to Hadoop a fully functional, pre-configured Hadoop cluster.  It is easy to start up and use, and it lets people get a quick look at what Hadoop offers in terms of power and ease of use.  By lowering the barrier to getting Hadoop up and running, we hope more people can try it out and explore its features.

The CD image gives users an environment emulating a fully distributed, three-node virtual Hadoop cluster.  One of the reasons we used OpenSolaris is its ability to emulate a multinode cluster environment in a very small memory footprint.  A three-node Map/Reduce cluster can be brought up on a machine with as little as 800 MB of memory.  Each additional virtual cluster node requires only about 40 MB of memory beyond what Hadoop itself uses.  This means that people can take Hadoop for a spin, even on their laptop.

The CD is also "live", meaning that it does not modify the contents of the user's computer. This makes it ideal for those wishing to try out Hadoop without having to install any software.  For example, students wishing to use Hadoop in a classroom lab environment can work entirely off of the CD.

Included in this release is Hadoop 0.17.1 running on OpenSolaris.  At http://opensolaris.org/os/project/livehadoop/ you can join the OpenSolaris Hadoop community online, as well as download the CD image, documentation, and other resources.  If you have any requests or suggestions for improvements to this distribution of Hadoop, please let us know through the community site, or join the discussion by sending an email to edu-discuss-subscribe@opensolaris.org