How to set up Hadoop two node cluster and run MapReduce jobs

February 25, 2019

This write-up walks through setting up a two node Hadoop v3.1.1 cluster, and running a couple of sample MapReduce jobs.

Prerequisites:

Two machines set up with RHEL 7. You could use another distribution, but the commands may vary.
Perl, wget, and other required packages downloaded using yum
Disable the firewall, or open up connectivity between the two machines. Since we are setting this up as a lab instance, we will go ahead and disable the firewall:

systemctl stop firewalld 
systemctl disable firewalld

hadoop1 will be the master node, and hadoop2 will be the datanode.
Add entries for hadoop1 and hadoop2 to /etc/hosts on both machines, so that each hostname resolves to the right IP address. We will also need a JDK installation (on both machines):
```
yum install java-1.8.0-openjdk -y
```
You can validate that java is installed by querying for the installed version
```
java -version
```
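Before going further, it can help to confirm that the /etc/hosts entries mentioned earlier are really in place on both machines. The following is a minimal sketch, not part of the original procedure; the function name check_hosts is our own, and it only inspects a hosts-format file rather than testing real connectivity.

```shell
# Report whether each hostname has an entry in a hosts-format file.
# Usage: check_hosts FILE NAME...   (returns non-zero if any name is missing)
check_hosts() {
  hosts_file="$1"; shift
  missing=0
  for name in "$@"; do
    # Look for the name as a whole field (after the address) on a non-comment line.
    if awk -v n="$name" '
         /^[[:space:]]*#/ { next }
         { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
         END { exit !found }' "$hosts_file"; then
      echo "ok: $name"
    else
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# On either machine:
#   check_hosts /etc/hosts hadoop1 hadoop2
```

A "missing" line for either name means the corresponding /etc/hosts entry still needs to be added on that machine.
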
Create a separate directory under ‘/’ path where we will download the bits for hadoop (on both machines)
```
mkdir /hadoop
cd /hadoop/ 
wget http://mirror.cc.columbia.edu/pub/software/apache/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz 
tar -xzf hadoop-3.1.1.tar.gz
```
In order to point hadoop to the correct java installation, we will need to capture the full path of the java binary:
```
readlink -f $(which java)
```
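Export the directory that contains bin/java (the output above, minus the trailing bin/java) as the JAVA_HOME environment variable (on both machines). The exact path depends on the installed OpenJDK build: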
```
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64/jre/
```
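Rather than copying the path by hand, JAVA_HOME can be derived from the readlink output. This is a small sketch under the assumption that the resolved binary path ends in /bin/java, as it does in the example above; the function name java_home_from_path is our own.

```shell
# Strip the trailing /bin/java from a resolved java binary path to get JAVA_HOME.
java_home_from_path() {
  printf '%s\n' "${1%/bin/java}"
}

# On the cluster machines this could be used as:
#   export JAVA_HOME="$(java_home_from_path "$(readlink -f "$(which java)")")"

# Demonstration with the path from this write-up:
java_home_from_path /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64/jre/bin/java
```

Deriving the value this way avoids a stale JAVA_HOME if yum installs a different OpenJDK build than the one shown here.
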
We will modify the ~/.bashrc file so that all the required environment variables are available whenever we log in to the machine console. Make this change on both machines:
```
vi ~/.bashrc
```
Add the following lines to the file

export HDFS_NAMENODE_USER="root"

export HDFS_DATANODE_USER="root" 
export HDFS_SECONDARYNAMENODE_USER="root" 
export YARN_RESOURCEMANAGER_USER="root" 
export YARN_NODEMANAGER_USER="root" 
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64/jre/ 
export PATH=$PATH:$JAVA_HOME/bin
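
If you would rather script this step than edit the file in vi, the lines can be appended non-interactively. The sketch below uses a helper of our own, append_once, which adds a line only when that exact line is not already in the file, so it is safe to run more than once.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
  line="$1"; file="$2"
  touch "$file"
  # -x matches whole lines, -F treats the line as a fixed string.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example for one of the variables above; repeat for the others:
#   append_once 'export HDFS_NAMENODE_USER="root"' ~/.bashrc
```

After updating ~/.bashrc, run `source ~/.bashrc` or log in again so that the current shell picks up the variables.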

Update the core-site file (on both machines, since the datanode on hadoop2 also reads fs.defaultFS to locate the NameNode)

vi /hadoop/hadoop-3.1.1/etc/hadoop/core-site.xml

Modify the <configuration> section as per below:

<configuration> 
 <property> 
  <name>fs.defaultFS</name> 
  <value>hdfs://hadoop1:9000</value> 
 </property> 
</configuration>
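
As an alternative to editing in vi, a single-property site file like this one can be generated from the shell. This is only a sketch: write_site_xml is a helper name of our own, it overwrites the target file (including the license comment that ships in it), and it does not XML-escape the values it is given.

```shell
# Write a Hadoop *-site.xml file containing a single property.
# Usage: write_site_xml FILE NAME VALUE   (overwrites FILE)
write_site_xml() {
  cat > "$1" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
 <property>
  <name>$2</name>
  <value>$3</value>
 </property>
</configuration>
EOF
}

# Example, matching the configuration above:
#   write_site_xml /hadoop/hadoop-3.1.1/etc/hadoop/core-site.xml fs.defaultFS hdfs://hadoop1:9000
```

The same helper would work for any other site file that needs exactly one property.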

Update the hdfs-site file (on the master node)

vi /hadoop/hadoop-3.1.1/etc/hadoop/hdfs-site.xml

Modify the <configuration> section as per below:

<configuration> 
 <property> 
  <name>dfs.replication</name> 
  <value>1</value> 
 </property> 
</configuration>

Set up the machines for passwordless SSH access (on both machines):

 ssh-keygen 
 ssh-copy-id -i ~/.ssh/id_rsa.pub root@hadoop1 
 ssh-copy-id -i ~/.ssh/id_rsa.pub root@hadoop2
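
The start scripts rely on SSH logins that never prompt, so it is worth verifying the key setup before starting any services. A minimal sketch follows; check_passwordless is our own helper name, and the SSH variable exists only so the ssh command can be swapped out.

```shell
# Check that each host accepts key-based login without prompting.
# SSH can be overridden (for example with a stub) when testing.
SSH="${SSH:-ssh}"

check_passwordless() {
  failed=0
  for host in "$@"; do
    # BatchMode=yes makes ssh fail instead of asking for a password.
    if "$SSH" -o BatchMode=yes -o ConnectTimeout=5 "root@$host" true </dev/null; then
      echo "ok: $host"
    else
      echo "failed: $host"
      failed=1
    fi
  done
  return "$failed"
}

# On each machine:
#   check_passwordless hadoop1 hadoop2
```

A "failed" line usually means ssh-copy-id has not been run for that host yet, or that the host key has not been accepted.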

On the master node, update the workers file to list the worker (datanode) hosts

vi /hadoop/hadoop-3.1.1/etc/hadoop/workers

Add the entry

hadoop2

And then on the master node, format the hdfs file system:

/hadoop/hadoop-3.1.1/bin/hdfs namenode -format

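No format step is needed on the datanode: the hdfs datanode command has no -format option, and the datanode initializes its own storage directory the first time it starts and registers with the NameNode.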

On the master node, start the dfs service:

/hadoop/hadoop-3.1.1/sbin/start-dfs.sh

On the master node, run the dfsadmin report, to validate the availability of datanodes

/hadoop/hadoop-3.1.1/bin/hdfs dfsadmin -report

The output of this command should list every live datanode. If the default localhost entry was left in the workers file alongside hadoop2, there will be two entries: one for hadoop1 and one for hadoop2.
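
For a quick scripted check, the number of live datanodes can be pulled out of the report. This sketch assumes the report contains a line of the form "Live datanodes (N):", which is what we have seen in Hadoop 3.x output; live_datanodes is our own helper name.

```shell
# Extract N from a "Live datanodes (N):" line read on standard input.
live_datanodes() {
  sed -n 's/^Live datanodes (\([0-9][0-9]*\)):.*$/\1/p' | head -n 1
}

# On the master node:
#   /hadoop/hadoop-3.1.1/bin/hdfs dfsadmin -report | live_datanodes
```

If the helper prints nothing, either no datanodes are live or the report format differs from the one assumed here.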

The nodes are now set up to handle MapReduce jobs. We will look at two examples, using the sample jobs from the hadoop-mapreduce-examples-3.1.1.jar file under the share folder. There are also many open-source Java projects available that run various kinds of MapReduce jobs.

We will run these exercises on the master node.

Exercise 1: We will solve a sudoku puzzle using the sudoku example. First we will need to create a sudoku directory under the root folder of the HDFS file system.

/hadoop/hadoop-3.1.1/bin/hdfs dfs -mkdir /sudoku

Then create an input file with the sudoku puzzle, under your current directory:

vi solve_this.txt

Update the file with the below text. Each entry on the same line is separated by a space.

? 9 7 ? ? ? ? ? 5
? 6 3 ? 4 ? 2 ? ?
? ? ? 9 ? ? ? 8 ?
? ? 9 ? ? ? ? 7 ?
? ? ? 1 ? 6 ? ? ?
2 5 4 8 3 ? ? ? 1
? 7 ? ? ? 1 8 ? ?
? 8 ? ? 7 ? 6 ? 4
5 ? ? ? ? 2 ? 9 ?
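
A stray character or a missing cell is easy to introduce when typing the grid by hand. The sketch below checks the layout described above (9 rows of 9 space-separated cells, each a digit 1-9 or a question mark); check_puzzle is our own helper, and the format rule is taken from this write-up rather than from the example's source code.

```shell
# Check that a puzzle file has 9 rows of 9 space-separated cells, each 1-9 or "?".
check_puzzle() {
  awk '
    NF != 9 { bad = 1 }
    { for (i = 1; i <= NF; i++) if ($i !~ /^[1-9?]$/) bad = 1 }
    END { exit (bad || NR != 9) }' "$1"
}

# In the directory that holds the file:
#   check_puzzle solve_this.txt && echo "layout looks fine"
```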

Now copy (put) the file from your current directory into the HDFS folder (/sudoku) that we created earlier.

/hadoop/hadoop-3.1.1/bin/hdfs dfs -put solve_this.txt /sudoku/solve_this.txt

To make sure that the file was copied:

/hadoop/hadoop-3.1.1/bin/hdfs dfs -ls /sudoku
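
Running -put a second time fails if the target already exists, so a scripted version of this step may want to test for the file first. The sketch below uses hdfs dfs -test -e for that; put_if_missing is our own helper name, and the HDFS variable is there only so the binary path can be overridden.

```shell
# Path to the hdfs binary; can be overridden (for example with a stub) when testing.
HDFS="${HDFS:-/hadoop/hadoop-3.1.1/bin/hdfs}"

# Upload LOCAL to REMOTE unless REMOTE already exists in HDFS.
# Usage: put_if_missing LOCAL REMOTE
put_if_missing() {
  if "$HDFS" dfs -test -e "$2"; then
    echo "exists: $2"
  else
    "$HDFS" dfs -put "$1" "$2" && echo "uploaded: $2"
  fi
}

# On the master node:
#   put_if_missing solve_this.txt /sudoku/solve_this.txt
```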

Run the sudoku example to solve the puzzle. Note that the argument here is the local copy of the file in your current directory:

/hadoop/hadoop-3.1.1/bin/hadoop jar /hadoop/hadoop-3.1.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar sudoku solve_this.txt
Solving solve_this.txt
1 9 7 6 2 8 4 3 5
8 6 3 7 4 5 2 1 9
4 2 5 9 1 3 7 8 6
6 1 9 2 5 4 3 7 8
7 3 8 1 9 6 5 4 2
2 5 4 8 3 7 9 6 1
9 7 2 4 6 1 8 5 3
3 8 1 5 7 9 6 2 4
5 4 6 3 8 2 1 9 7

Found 1 solutions
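
To double-check a solution, the nine grid lines can be saved to a file and verified independently. This is a sketch of the usual sudoku rule check (each digit once per row, column, and 3x3 box); check_solution is our own helper, and it expects only the grid, without the "Solving" and "Found" lines.

```shell
# Verify a solved 9x9 grid: every digit 1-9 appears once per row, column, and 3x3 box.
check_solution() {
  awk '
    NF != 9 { bad = 1 }
    {
      for (c = 1; c <= NF; c++) {
        v = $c
        if (v !~ /^[1-9]$/) bad = 1
        box = int((NR - 1) / 3) "," int((c - 1) / 3)
        # A repeated (unit, value) pair means a duplicate in that row, column, or box.
        if (seen["r" NR ":" v]++ || seen["c" c ":" v]++ || seen["b" box ":" v]++) bad = 1
      }
    }
    END { exit (bad || NR != 9) }' "$1"
}

# With the grid saved as solved.txt (a hypothetical file name):
#   check_solution solved.txt && echo "valid solution"
```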

Exercise 2: We will run the wordcount example on the sudoku puzzle file that we copied into HDFS, and have the output stored in the /sudoku/wcount_result folder.

/hadoop/hadoop-3.1.1/bin/hadoop jar /hadoop/hadoop-3.1.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount /sudoku/solve_this.txt /sudoku/wcount_result

The job prints lengthy output, including its progress and a summary of job counters.

We will cat the result files:

/hadoop/hadoop-3.1.1/bin/hdfs dfs -cat /sudoku/wcount_result/*
1 3
2 3
3 2
4 3
5 3
6 3
7 4
8 4
9 4
? 52

The above output captures the number of times each digit appears in the unsolved puzzle file, along with the number of ? placeholders. The counts add up to 81, one per cell.
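
For a file this small, the same counts can be reproduced locally as a sanity check on the job output. The sketch below is a plain shell pipeline, not a MapReduce job; local_wordcount is our own helper, and it sorts in byte order so that the digits come before the question mark, as in the output above.

```shell
# Count whitespace-separated tokens in a file, printing "token<TAB>count" in byte order.
local_wordcount() {
  tr -s ' \t' '\n\n' < "$1" | grep -v '^$' | LC_ALL=C sort | uniq -c |
    awk '{ print $2 "\t" $1 }'
}

# In the directory that holds the puzzle file:
#   local_wordcount solve_this.txt
```

The split-sort-count shape of this pipeline loosely mirrors the map, shuffle, and reduce phases of the wordcount job.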

To see all the different sample programs available in hadoop-mapreduce-examples-3.1.1.jar, run the jar without any arguments:

/hadoop/hadoop-3.1.1/bin/hadoop jar /hadoop/hadoop-3.1.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar

If you have any questions about the steps documented here, would like more information on the installation procedure, or have any feedback or requests, please let us know at info@keyvatech.com.

Anuj joined Keyva from Tech Data where he was the Director of Automation Solutions. In this role, he specializes in developing and delivering vendor-agnostic solutions that avoid the “rip-and-replace” of existing IT investments. Tuli has worked on Cloud Automation, DevOps, Cloud Readiness Assessments and Migrations projects for healthcare, banking, ISP, telecommunications, government and other sectors.

During his previous years at Avnet, Seamless Technologies, and other organizations, he held multiple roles in the Cloud and Automation areas. Most recently, he led the development and management of Cloud Automation IP (intellectual property) and related professional services. He holds certifications for AWS, VMware, HPE, BMC and ITIL, and offers a hands-on perspective on these technologies.

Like what you read? Follow Anuj on LinkedIn at https://www.linkedin.com/in/anujtuli/
