Collected thoughts and musings George's Sun Blog

Thursday Oct 09, 2008

In part 1 of this series, we created an OpenSolaris zone.  In part 2, we cloned this zone to create two more, and set up networking and ssh host keys so that we can

  • Log into each zone from the global zone without using our password
  • Log into the global zone from the local zones without using our password

The stage is now set for us to setup and run Hadoop, which is what we cover in Part 3 of this tutorial.  We'll install Hadoop by proceeding through these steps:

  1. Download and unpack the Hadoop distribution
  2. Configure Hadoop to work with our virtual cluster
  3. Start it up
  4. Test a simple map/reduce job

We start by making sure that our zones are installed and running:

$ pfexec zoneadm list -cv

  ID NAME             STATUS     PATH                           BRAND    IP    
   0 global           running    /                              native   shared
   9 node3            running    /export/zones/node3            ipkg     shared
  10 node1            running    /export/zones/node1            ipkg     shared
  11 node2            running    /export/zones/node2            ipkg     shared

1. Download and unpack the Hadoop distribution

The Hadoop distribution is located here:  http://hadoop.apache.org/core/releases.html

Download the most recent release, using your browser and unpack it in the /opt directory.  Change into that directory:

$ cd /opt/hadoop-0.18.1

2. Configure Hadoop

Now edit the conf/masters file, and replace 'localhost' with 'master'.

Edit conf/slaves, and replace 'localhost' with:

node1
node2
node3

Now edit conf/hadoop-env.sh.  We are going to change some of the defaults.  Uncomment the following lines, and set them to the values given:

export JAVA_HOME=/usr/java
export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR -i ~/.ssh
/hadoop"
export HADOOP_LOG_DIR=/tmp

Now we create a conf/hadoop-site.xml file, which should have this content:

<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>fs.default.name</name> <value>hdfs://master:9000/</value> <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description> </property> <property> <name>dfs.replication</name> <value>3</value> <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. </description> </property> <property> <name>mapred.job.tracker</name> <value>master:9001</value> <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description> </property> </configuration>

Great!  Now we're ready to start up Hadoop

3. Starting Hadoop

We begin by formatting the namenode:

$ cd /opt/hadoop-0.18.1

$ bin/hadoop namenode -format

Now we can start up the daemons

$ bin/start-all.sh

4. Test a simple map/reduce job

If everything went well, you should have a running 4-node virtual Hadoop deployment based around OpenSolaris zones.

Let's run a simple test:

$ bin/hadoop fs -mkdir   /myinput

$ bin/hadoop fs -copyFromLocal conf/hadoop-default.xml   /myinput

$ bin/hadoop jar hadoop-*-examples.jar wordcount  /myinput  /myoutput

If everything is operational, you should see output similar to this:

08/10/09 16:14:16 INFO mapred.FileInputFormat: Total input paths to process : 1
08/10/09 16:14:16 INFO mapred.FileInputFormat: Total input paths to process : 1
08/10/09 16:14:16 INFO mapred.JobClient: Running job: job_200810091613_0001
08/10/09 16:14:17 INFO mapred.JobClient:  map 0% reduce 0%
08/10/09 16:14:24 INFO mapred.JobClient:  map 100% reduce 0%
08/10/09 16:14:30 INFO mapred.JobClient: Job complete: job_200810091613_0001
08/10/09 16:14:30 INFO mapred.JobClient: Counters: 16
08/10/09 16:14:30 INFO mapred.JobClient:   File Systems
08/10/09 16:14:30 INFO mapred.JobClient:     HDFS bytes read=39848
08/10/09 16:14:30 INFO mapred.JobClient:     HDFS bytes written=19551
08/10/09 16:14:30 INFO mapred.JobClient:     Local bytes read=24587
08/10/09 16:14:30 INFO mapred.JobClient:     Local bytes written=51980
08/10/09 16:14:30 INFO mapred.JobClient:   Job Counters 
08/10/09 16:14:30 INFO mapred.JobClient:     Launched reduce tasks=1
08/10/09 16:14:30 INFO mapred.JobClient:     Launched map tasks=2
08/10/09 16:14:30 INFO mapred.JobClient:     Data-local map tasks=2
08/10/09 16:14:30 INFO mapred.JobClient:   Map-Reduce Framework
08/10/09 16:14:30 INFO mapred.JobClient:     Reduce input groups=1225
08/10/09 16:14:30 INFO mapred.JobClient:     Combine output records=2651
08/10/09 16:14:30 INFO mapred.JobClient:     Map input records=1256
08/10/09 16:14:30 INFO mapred.JobClient:     Reduce output records=1225
08/10/09 16:14:30 INFO mapred.JobClient:     Map output bytes=52018
08/10/09 16:14:30 INFO mapred.JobClient:     Map input bytes=38733
08/10/09 16:14:30 INFO mapred.JobClient:     Combine input records=5353
08/10/09 16:14:30 INFO mapred.JobClient:     Map output records=3927
08/10/09 16:14:30 INFO mapred.JobClient:     Reduce input records=1225

Now you can examine the output of the wordcount:

$ bin/hadoop fs -cat /myoutput/part-00000

"_logs/history/"    1
"all".</description>    1
"block"(trace    1
"dir"(trac    1
"false",    1
"local",    1
"local".    2
"none".    1
"package.FactoryClassName".    1
"true"    1
...

Summary

Setting up a multi-node, distributed virtual Hadoop cluster is conceptually very simple with OpenSolaris.  Best of all, you can develop your Map/Reduce jobs from the comfort of your own home, or perhaps poolside?

If this tutorial was helpful to you, or if you have any questions or problems, please let me know at George.Porter <at> Sun.com

Comments:

In this configuration, I'd think the only reason to set the replication factor to 3 is if you want to play with failure scenarios. You likely won't get much benefit out of having that many replicas reading from the same physical disk/pool.

Posted by aw on October 14, 2008 at 03:22 AM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed