In part 1 of this series, we created an OpenSolaris zone. In part 2, we cloned this zone to create two more, and set up networking and ssh host keys so that we can
- Log into each zone from the global zone without using our password
- Log into the global zone from the local zones without using our password
The stage is now set for us to setup and run Hadoop, which is what we cover in Part 3 of this tutorial. We'll install Hadoop by proceeding through these steps:
- Download and unpack the Hadoop distribution
- Configure Hadoop to work with our virtual cluster
- Start it up
- Test a simple map/reduce job
We start by making sure that our zones are installed and running:
$ pfexec zoneadm list -cv
ID NAME STATUS PATH BRAND IP 0 global running / native shared 9 node3 running /export/zones/node3 ipkg shared 10 node1 running /export/zones/node1 ipkg shared 11 node2 running /export/zones/node2 ipkg shared
1. Download and unpack the Hadoop distribution
The Hadoop distribution is located here: http://hadoop.apache.org/core/releases.html
Download the most recent release, using your browser and unpack it in the /opt directory. Change into that directory:
$ cd /opt/hadoop-0.18.1
2. Configure Hadoop
Now edit the conf/masters file, and replace 'localhost' with 'master'.
Edit conf/slaves, and replace 'localhost' with:
node1
node2
node3
Now edit conf/hadoop-env.sh. We are going to change some of the defaults. Uncomment the following lines, and set them to the values given:
export JAVA_HOME=/usr/java
export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR -i ~/.ssh /hadoop"
export HADOOP_LOG_DIR=/tmpNow we create a conf/hadoop-site.xml file, which should have this content:
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>fs.default.name</name> <value>hdfs://master:9000/</value> <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description> </property> <property> <name>dfs.replication</name> <value>3</value> <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. </description> </property> <property> <name>mapred.job.tracker</name> <value>master:9001</value> <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description> </property> </configuration>
Great! Now we're ready to start up Hadoop
3. Starting Hadoop
We begin by formatting the namenode:
$ cd /opt/hadoop-0.18.1
$ bin/hadoop namenode -format
Now we can start up the daemons
$ bin/start-all.sh
4. Test a simple map/reduce job
If everything went well, you should have a running 4-node virtual Hadoop deployment based around OpenSolaris zones.
Let's run a simple test:
$ bin/hadoop fs -mkdir /myinput
$ bin/hadoop fs -copyFromLocal conf/hadoop-default.xml /myinput
$ bin/hadoop jar hadoop-*-examples.jar wordcount /myinput /myoutput
If everything is operational, you should see output similar to this:
08/10/09 16:14:16 INFO mapred.FileInputFormat: Total input paths to process : 1 08/10/09 16:14:16 INFO mapred.FileInputFormat: Total input paths to process : 1 08/10/09 16:14:16 INFO mapred.JobClient: Running job: job_200810091613_0001 08/10/09 16:14:17 INFO mapred.JobClient: map 0% reduce 0% 08/10/09 16:14:24 INFO mapred.JobClient: map 100% reduce 0% 08/10/09 16:14:30 INFO mapred.JobClient: Job complete: job_200810091613_0001 08/10/09 16:14:30 INFO mapred.JobClient: Counters: 16 08/10/09 16:14:30 INFO mapred.JobClient: File Systems 08/10/09 16:14:30 INFO mapred.JobClient: HDFS bytes read=39848 08/10/09 16:14:30 INFO mapred.JobClient: HDFS bytes written=19551 08/10/09 16:14:30 INFO mapred.JobClient: Local bytes read=24587 08/10/09 16:14:30 INFO mapred.JobClient: Local bytes written=51980 08/10/09 16:14:30 INFO mapred.JobClient: Job Counters 08/10/09 16:14:30 INFO mapred.JobClient: Launched reduce tasks=1 08/10/09 16:14:30 INFO mapred.JobClient: Launched map tasks=2 08/10/09 16:14:30 INFO mapred.JobClient: Data-local map tasks=2 08/10/09 16:14:30 INFO mapred.JobClient: Map-Reduce Framework 08/10/09 16:14:30 INFO mapred.JobClient: Reduce input groups=1225 08/10/09 16:14:30 INFO mapred.JobClient: Combine output records=2651 08/10/09 16:14:30 INFO mapred.JobClient: Map input records=1256 08/10/09 16:14:30 INFO mapred.JobClient: Reduce output records=1225 08/10/09 16:14:30 INFO mapred.JobClient: Map output bytes=52018 08/10/09 16:14:30 INFO mapred.JobClient: Map input bytes=38733 08/10/09 16:14:30 INFO mapred.JobClient: Combine input records=5353 08/10/09 16:14:30 INFO mapred.JobClient: Map output records=3927 08/10/09 16:14:30 INFO mapred.JobClient: Reduce input records=1225
Now you can examine the output of the wordcount:
$ bin/hadoop fs -cat /myoutput/part-00000
"_logs/history/" 1 "all".</description> 1 "block"(trace 1 "dir"(trac 1 "false", 1 "local", 1 "local". 2 "none". 1 "package.FactoryClassName". 1 "true" 1 ...
Summary
Setting up a multi-node, distributed virtual Hadoop cluster is conceptually very simple with OpenSolaris. Best of all, you can develop your Map/Reduce jobs from the comfort of your own home, or perhaps poolside?
If this tutorial was helpful to you, or if you have any questions or problems, please let me know at George.Porter <at> Sun.com
In this configuration, I'd think the only reason to set the replication factor to 3 is if you want to play with failure scenarios. You likely won't get much benefit out of having that many replicas reading from the same physical disk/pool.
Posted by aw on October 14, 2008 at 03:22 AM PDT #