Collected thoughts and musings George's Sun Blog

Monday Nov 17, 2008

Because of the enormous developer and user activity around Hadoop, it has grown into a formidable piece of software.  Finding bugs in your Map/Reduce programs can be hard, since so many different software components are involved in carrying out your job.  However, Hadoop can be configured to run in a single-process, single-JVM debug mode that makes debugging a lot easier.  This blog post describes how to set up your environment to support this mode.  I've put together a small download that includes the scripts needed to configure Eclipse, as well as a sample program that you can modify at will and use as the base for your own programs.

Prerequisites:

  1. Java JDK 1.6 or higher (that includes the Java compiler 'javac')
  2. Hadoop 0.18.2 source code (http://hadoop.apache.org/core/releases.html)
  3. SingleProcessHadoop starter files (http://blogs.sun.com/george/resource/SingleProcessHadoop/SingleProcessHadoop.tar.gz)

First, download Hadoop 0.18.2, if you do not already have it, from the link above.  Unpack it into a directory, for example ~/src/hadoop-0.18.2.  Next, download the 'SingleProcessHadoop' starter files from the above link and unpack them into a second directory (for example, ~/src/SingleProcessHadoop).

  1. Go to the SingleProcessHadoop directory (cd ~/src/SingleProcessHadoop)
  2. Generate the Eclipse files by running the 'generate-eclipse-files.sh' script.  This script takes as its argument the location of your Hadoop directory.  In my case, I type (./generate-eclipse-files.sh ~/src/hadoop-0.18.2)
  3. Start Eclipse
  4. Import the SingleProcessHadoop project by invoking (File -> Import...).  Under 'General', select "Existing Projects into Workspace".  Next to "Select root directory", click "Browse".  Use the file browser to navigate to the SingleProcessHadoop directory and import it.

You should now have the SingleProcessHadoop package set up in Eclipse.


The only Java file in the project contains the WordCount demo taken from the Hadoop example code.  Let's run it and examine the output:

  1. Double-click on the SingleProcessWordCount.java file in the Project Explorer.  This file contains the Map and Reduce functions at the top.  The input to wordcount is on line 93:
    1. String input = "The quick brown fox\nhas many silly\nred fox sox\n";
  2. From the menu bar, select (Run -> Run As -> Java Application)
  3. The bottom of the screen should fill up with red diagnostic and logging text.  After a minute or so, the job should complete, and you can scroll up until you see the output of your job, which should appear in black.


You can now debug your Map/Reduce applications by setting various breakpoints in Eclipse and selecting (Run -> Debug As -> Java Application) from the menu bar.
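
When choosing where to put breakpoints, it helps to know what the demo computes. The following is a minimal plain-Java sketch of the word-count logic applied to the sample input string from the post. It does not use the Hadoop API and is not the code in SingleProcessWordCount.java; the class name WordCountSketch and its method names are made up for illustration, and the grouping step only stands in for what the framework does between map and reduce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-in for the WordCount demo: plain Java, no Hadoop classes.
class WordCountSketch {

    // "Map" step: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // "Reduce" step: sum all the counts that were emitted for one word.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Runs map on every line, groups the pairs by word, then reduces each group.
    static Map<String, Integer> countWords(String input) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input.split("\n")) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Same sample input as the demo.
        String input = "The quick brown fox\nhas many silly\nred fox sox\n";
        for (Map.Entry<String, Integer> entry : countWords(input).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

With this input, "fox" is counted twice and every other word once. In the real job, a breakpoint inside the Map function would be hit once per input line, and one inside the Reduce function once per distinct word.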

Comments:

Hmm... SingleProcessWordCount.java uses MiniMRCluster, wherein the JobTracker and TaskTrackers are run as threads in the same process. However, it still forks separate JVMs for the actual map/reduce tasks. Thus it has the unfortunate effect of letting one debug the framework itself, as opposed to the application (map/reduce tasks).

To truly debug the application, it will need a new feature that allows tasks to be run as threads of the TaskTracker, rather than as separate processes... something which has come up before: http://issues.apache.org/jira/browse/HADOOP-3675. Of course, that approach has other implications (security, inasmuch as it means running _client_ code inside the framework), but it is still a valid use case for precisely this reason: debugging.

Posted by Arun C Murthy on November 20, 2008 at 10:09 AM PST #

I hasten to add that you may want to use the LocalJobRunner (set the mapred.job.tracker config to "local"), wherein the entire MR application runs as a single process (all maps are executed sequentially on the client, as opposed to being run on the _cluster_, and then the one-and-only-one reducer is executed in-process). The current limitation of the LocalJobRunner, a single reducer, might be painful when debugging lots of applications, but it is something that can be fixed with some effort.

There is also built-in support for profiling/debugging MR applications on the cluster:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Profiling

Support for post-processing certain kinds of applications (e.g. Hadoop Pipes' applications):
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Debugging

Some more info: http://wiki.apache.org/hadoop/HadoopPresentations -> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/HadoopMapReduce_TuningAndDebugging.pdf

Posted by Arun C Murthy on November 20, 2008 at 10:20 AM PST #
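
As a rough illustration of the LocalJobRunner suggestion in the comment above, the setting could be written as a site configuration fragment like the one below. The property name and value come from the comment; placing it in conf/hadoop-site.xml is an assumption based on the Hadoop 0.18 layout, so check it against your own installation.

```xml
<!-- Assumed location: conf/hadoop-site.xml in a Hadoop 0.18.x install. -->
<configuration>
  <property>
    <!-- "local" selects the in-process LocalJobRunner instead of a JobTracker. -->
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>
```

The same property can also be set on the job's configuration object in code, which keeps the debugging setup inside the project and leaves the shared configuration files untouched.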

Do you have something that works with hadoop.0.20?

Posted by Linh on January 27, 2010 at 09:04 AM PST #
