This write-up walks through setting up a two-node Hadoop 3.1.1 cluster and running a couple of sample MapReduce jobs.
Prerequisites:
- Two machines set up with RHEL 7. You could use another distribution, but the commands may vary.
- Perl, wget, and other required packages installed using yum
- Disable the firewall, or open up connectivity between the two machines (a port-based alternative is sketched after this list). Since we are setting it up as a lab instance, we will go ahead and disable the firewall (on both machines):
systemctl stop firewalld
systemctl disable firewalld
- hadoop1 will be the master node, and hadoop2 will be the datanode.
- Add entries for hadoop1 and hadoop2 under /etc/hosts on both machines (a sample is shown after this list).
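Here is a hypothetical example of the /etc/hosts entries; substitute the actual IP addresses of your machines:

192.168.1.11 hadoop1
192.168.1.12 hadoop2

If you would rather keep the firewall enabled, you could instead open the ports the cluster uses on both machines. As a minimal sketch, this opens the NameNode RPC port configured later in core-site.xml (9000) along with the default Hadoop 3.x NameNode web UI and DataNode ports; your port list may differ depending on your configuration:

firewall-cmd --permanent --add-port=9000/tcp
firewall-cmd --permanent --add-port=9870/tcp
firewall-cmd --permanent --add-port=9864/tcp
firewall-cmd --permanent --add-port=9866-9867/tcp
firewall-cmd --reload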
We will need a JDK installation (on both machines):
yum install java-1.8.0-openjdk -y
You can validate that Java is installed by querying for the installed version:
java -version
Create a separate directory under the '/' path where we will download the Hadoop bits (on both machines):
mkdir /hadoop
cd /hadoop/
wget http://mirror.cc.columbia.edu/pub/software/apache/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar -xzf hadoop-3.1.1.tar.gz
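You can confirm the archive unpacked correctly by listing the extracted directory:

ls /hadoop/hadoop-3.1.1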
In order to point Hadoop to the correct Java installation, we will need to capture the full path of the Java install:
readlink -f $(which java)
Export the path as an environment variable (on both machines); substitute the path returned by the previous command if it differs:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64/jre/
We will modify the .bashrc profile file to make sure that all the required environment variables are available when we log in to the machine console. Make this change on both machines:
vi ~/.bashrc
Add the following lines to the file
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64/jre/
export PATH=$PATH:$JAVA_HOME/bin
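After saving the file, reload the profile so the variables take effect in the current session (on both machines):

source ~/.bashrc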
Update the core-site.xml file (on the master node):
vi /hadoop/hadoop-3.1.1/etc/hadoop/core-site.xml
Modify the <configuration> section as shown below:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop1:9000</value>
  </property>
</configuration>
Update the hdfs-site.xml file (on the master node):
vi /hadoop/hadoop-3.1.1/etc/hadoop/hdfs-site.xml
Modify the <configuration> section as shown below:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
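Optionally, you could also pin where HDFS stores its metadata and block data by adding properties such as the following to the same <configuration> section; the /hadoop/data paths here are illustrative assumptions rather than part of the original walkthrough:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/hadoop/data/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/hadoop/data/datanode</value>
</property>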
Set up the machines for passwordless SSH access (on both machines):
ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub root@hadoop1
ssh-copy-id -i ~/.ssh/id_rsa.pub root@hadoop2
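To confirm that passwordless SSH is working, run a remote command from the master; it should print the worker's hostname without prompting for a password:

ssh root@hadoop2 hostname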
On the master node, update the workers file to list the worker (slave) nodes:
vi /hadoop/hadoop-3.1.1/etc/hadoop/workers
Add the entry
hadoop2
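Note that start-dfs.sh starts a DataNode on every host listed in this file. If you would like the master to double as a datanode as well (so that the dfsadmin report later shows an entry for each machine), you could optionally add a line for it too:

hadoop1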
And then on the master node, format the hdfs file system:
/hadoop/hadoop-3.1.1/bin/hdfs namenode -format
On the datanode, format the hdfs file system:
/hadoop/hadoop-3.1.1/bin/hdfs datanode -format
On the master node, start the dfs service:
/hadoop/hadoop-3.1.1/sbin/start-dfs.sh
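Once the script completes, you can verify the daemons with the jps utility that ships with the JDK; the master should show NameNode (and SecondaryNameNode), and hadoop2 should show DataNode. You can also browse the NameNode web UI, which listens on port 9870 by default in Hadoop 3.x:

jps
ssh root@hadoop2 jps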
On the master node, run the dfsadmin report to validate the availability of datanodes:
/hadoop/hadoop-3.1.1/bin/hdfs dfsadmin -report
The output of this command should show an entry for each datanode listed in the workers file: hadoop2 in this setup, plus hadoop1 if you also added it to the workers file.
The nodes are now set up to handle MapReduce jobs. We will look at two examples, using the sample jobs from the hadoop-mapreduce-examples-3.1.1.jar file under the share folder. There are also a large number of open-source Java projects available that run various kinds of MapReduce jobs.
We will run these exercises on the master node.
Exercise 1: We will solve a sudoku puzzle using MapReduce. First, we will need to create a sudoku directory under the root folder of the HDFS file system.
/hadoop/hadoop-3.1.1/bin/hdfs dfs -mkdir /sudoku
Then create an input file with the sudoku puzzle, under your current directory:
vi solve_this.txt
Update the file with the text below: nine rows of nine entries, with the entries on each row separated by spaces.
? 9 7 ? ? ? ? ? 5
? 6 3 ? 4 ? 2 ? ?
? ? ? 9 ? ? ? 8 ?
? ? 9 ? ? ? ? 7 ?
? ? ? 1 ? 6 ? ? ?
2 5 4 8 3 ? ? ? 1
? 7 ? ? ? 1 8 ? ?
? 8 ? ? 7 ? 6 ? 4
5 ? ? ? ? 2 ? 9 ?
Now move (put) the file from your current directory into the HDFS folder (/sudoku) that we created earlier:
/hadoop/hadoop-3.1.1/bin/hdfs dfs -put solve_this.txt /sudoku/solve_this.txt
To make sure that the file was copied:
/hadoop/hadoop-3.1.1/bin/hdfs dfs -ls /sudoku
Run the MapReduce job to solve the puzzle:
/hadoop/hadoop-3.1.1/bin/hadoop jar /hadoop/hadoop-3.1.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar sudoku solve_this.txt
Solving solve_this.txt
1 9 7 6 2 8 4 3 5
8 6 3 7 4 5 2 1 9
4 2 5 9 1 3 7 8 6
6 1 9 2 5 4 3 7 8
7 3 8 1 9 6 5 4 2
2 5 4 8 3 7 9 6 1
9 7 2 4 6 1 8 5 3
3 8 1 5 7 9 6 2 4
5 4 6 3 8 2 1 9 7
Found 1 solutions
Exercise 2: We will run a wordcount method on the sudoku puzzle file.
Run the wordcount method on the sudoku puzzle file, and have the output stored in the wcount_result folder:
/hadoop/hadoop-3.1.1/bin/hadoop jar /hadoop/hadoop-3.1.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount /sudoku/solve_this.txt /sudoku/wcount_result
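Note that MapReduce will not overwrite an existing output directory; if you want to re-run the job, remove the previous results first:

/hadoop/hadoop-3.1.1/bin/hdfs dfs -rm -r /sudoku/wcount_result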
The lengthy console output lists the job's progress and counters. Once the job completes, we will cat the result files to see the actual counts:
/hadoop/hadoop-3.1.1/bin/hdfs dfs -cat /sudoku/wcount_result/*
1 3
2 3
3 2
4 3
5 3
6 3
7 4
8 4
9 4
? 52
The above output captures the number of times each digit (and the '?' placeholder) appears in the input puzzle file.
To see all the different sample methods available under hadoop-mapreduce-examples-3.1.1.jar, run the following command:
/hadoop/hadoop-3.1.1/bin/hadoop jar /hadoop/hadoop-3.1.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar
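For example, one of the listed programs is pi, which estimates the value of pi with a quasi-Monte Carlo method. A quick run with 10 map tasks and 100 samples per map (argument values chosen here purely for illustration) would look like this:

/hadoop/hadoop-3.1.1/bin/hadoop jar /hadoop/hadoop-3.1.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar pi 10 100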
If you have any questions about the steps documented here, would like more information on the installation procedure, or have any feedback or requests, please let us know at [email protected].
Anuj joined Keyva from Tech Data where he was the Director of Automation Solutions. In this role, he specializes in developing and delivering vendor-agnostic solutions that avoid the “rip-and-replace” of existing IT investments. Tuli has worked on Cloud Automation, DevOps, Cloud Readiness Assessments and Migrations projects for healthcare, banking, ISP, telecommunications, government and other sectors.
During his previous years at Avnet, Seamless Technologies, and other organizations, he held multiple roles in the Cloud and Automation areas. Most recently, he led the development and management of Cloud Automation IP (intellectual property) and related professional services. He holds certifications for AWS, VMware, HPE, BMC and ITIL, and offers a hands-on perspective on these technologies.
Like what you read? Follow Anuj on LinkedIn at https://www.linkedin.com/in/anujtuli/