CPS610 Assignment 6


There will be 5 marks if you implement MapReduce with Hadoop as the underlying
platform for the word count problem. You can use the instructions in the following
guide, which uses VirtualBox, Python and a word count example:
Step By Step guide for Hadoop installation on Ubuntu 20.04.1 with
MapReduce example using Streaming
1. Download VirtualBox from: https://www.virtualbox.org/wiki/Downloads
2. Download Ubuntu 20.04.1 LTS (desktop version amd64) from:
https://www.ubuntu.com/download/desktop
Downloaded file : ubuntu-20.04.1-desktop-amd64.iso
3. Create a VM with the Ubuntu 20.04 image.
4. After installing Ubuntu, log in to the VM and follow the instructions given in
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
Here I give step-by-step details for the installation.
5. First, we will update the system's local repository and then install Java (the
OpenJDK 8 JDK). Run the commands below in the terminal.
sudo apt-get update
sudo apt install openjdk-8-jdk -y
6. Now we will install OpenSSH on Ubuntu using the following command.
sudo apt install openssh-server openssh-client -y
6.1 Create a new user and switch to it (replace <username> with your own username)
sudo adduser <username>
su - <username>
7. Now we will set up passwordless SSH for Hadoop. First check whether passwordless
SSH authentication is already set up; on a new Ubuntu installation it most likely
is not. If passwordless SSH authentication is not set up, follow the
next step; otherwise skip it.
8. Run the commands below:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost
8. Use the mirror link and download the Hadoop package with the wget command:
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
9. Once the download is complete, extract the files to begin the Hadoop installation:
tar xzf hadoop-3.2.1.tar.gz
10. List the directory contents (for example, with ls) to confirm that the hadoop-3.2.1 folder was extracted.
11. Configure Hadoop Environment Variables
Edit the .bashrc shell configuration file using a text editor of your choice:
nano ~/.bashrc
Define the Hadoop environment variables by adding the following content to the
end of the .bashrc file.
Replace <username> with your own username.
export HADOOP_HOME=/home/<username>/hadoop-3.2.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Save the file, then apply the changes to the current terminal session:
source ~/.bashrc
12. Now find the Java path. Run the following commands in your terminal window:
which javac
readlink -f /usr/bin/javac
The section of the path just before /bin/javac needs to be assigned
to the JAVA_HOME variable in the hadoop-env.sh file.
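As a small illustration of which part of the path is meant, the sketch below strips the trailing /bin/javac from a resolved javac path. The function name is hypothetical, and the example path is an assumption based on the OpenJDK 8 package installed earlier; your readlink output may differ.

```python
import posixpath

def java_home_from_javac(resolved_javac_path):
    """Return the JDK root: the part of the path before /bin/javac."""
    bin_dir = posixpath.dirname(resolved_javac_path)  # .../bin
    return posixpath.dirname(bin_dir)                 # JDK root directory

# Assumed example output of `readlink -f /usr/bin/javac`.
example = "/usr/lib/jvm/java-8-openjdk-amd64/bin/javac"
print(java_home_from_javac(example))
```

The printed directory is the value to use for JAVA_HOME.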
13. Edit the hadoop-env.sh file
Change directory to the extracted folder and edit the hadoop-env.sh file to update
the Java home path. Use the following commands:
cd hadoop-3.2.1
nano etc/hadoop/hadoop-env.sh
Uncomment the JAVA_HOME variable and add the following line in hadoop-env.sh file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
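14. Now we will update some configuration files for pseudo-distributed operation. First
we will edit the etc/hadoop/core-site.xml file as below.
nano etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/<username>/tmpdata</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://127.0.0.1:9000</value>
  </property>
</configuration>

Replace <username> with your own username.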
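Hadoop site files wrap each setting in a property element with a name and a value, all inside a single configuration element. The following is a minimal sketch that generates that layout for the two settings named in this step; the helper name and the user name hduser are hypothetical, and typing the XML by hand works just as well.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

def build_hadoop_config(properties):
    """Build a <configuration> element from a dict of property name -> value."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root

username = "hduser"  # hypothetical; replace with your own username
core_site = build_hadoop_config({
    "hadoop.tmp.dir": f"/home/{username}/tmpdata",
    "fs.default.name": "hdfs://127.0.0.1:9000",
})

# Pretty-print via minidom so the result is readable in an editor.
print(minidom.parseString(ET.tostring(core_site)).toprettyxml(indent="  "))
```

The same helper could produce hdfs-site.xml by passing that file's property names instead.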
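15. Edit the hdfs-site.xml file
Create directories for NameNode and DataNode storage in your home directory:
mkdir -p ~/dfsdata/namenode
mkdir -p ~/dfsdata/datanode
nano etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/<username>/dfsdata/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/<username>/dfsdata/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Replace <username> with your own username.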
16. Now we will start the NameNode and DataNode, but before that we will format the
HDFS file system.
hdfs namenode -format
Now navigate to the hadoop-3.2.1/sbin directory and execute the following
commands to start the NameNode and DataNode:
cd sbin
./start-dfs.sh
./start-yarn.sh
Type this simple command to check if all the daemons are active and running as
Java processes:
jps
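After start-dfs.sh and start-yarn.sh on a single node, the jps listing would normally include NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager. The sketch below checks a jps listing for those names; the function name and the sample listing (including its process IDs) are made up for illustration.

```python
EXPECTED_DAEMONS = {
    "NameNode", "DataNode", "SecondaryNameNode",
    "ResourceManager", "NodeManager",
}

def missing_daemons(jps_output):
    """Return the expected Hadoop daemons that do not appear in the jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:          # each line is "<pid> <process name>"
            running.add(parts[1])
    return EXPECTED_DAEMONS - running

# Hypothetical listing in which NodeManager failed to start.
sample = """2145 NameNode
2290 DataNode
2501 SecondaryNameNode
2764 ResourceManager
3312 Jps"""
print(sorted(missing_daemons(sample)))  # ['NodeManager']
```

On the VM, the real listing could be captured with subprocess.run(["jps"], capture_output=True, text=True).stdout and passed to the same function.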
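17. Now we can access the web interface for the NameNode at http://localhost:9870/
18. Now let's create some directories in the HDFS file system. The later steps use the
path /user/demo, which can be created with:
hdfs dfs -mkdir -p /user/demo
19. Let's download one HTML page, http://hadoop.apache.org, and upload it into the HDFS file
system.
wget http://hadoop.apache.org -O hadoop_home_page.html
hdfs dfs -put hadoop_home_page.html /user/demo/
Please note that the HDFS file system is not the same as the root file system.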
Grep example:
20. For this example we use the hadoop-mapreduce-examples-3.2.1.jar file, which
comes with Hadoop. We count the total number of occurrences of the word
'https' in the given file. First we run the Hadoop job, then copy the
results from HDFS to the local file system. (The count depends on the page content at
download time; you may get 3 occurrences of https.)
Command:
hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep
/user/demo/hadoop_home_page.html /user/demo/hadoop_home_page.html_OUTPUT_2 'https'
In our run there were 2 occurrences of https in the given file; the same count can be
validated against the local copy downloaded with wget.
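One way to cross-check the job result locally is to count the regex matches in the downloaded file with Python. This is only a sketch: the function name and sample string are hypothetical, and Python's regex dialect differs slightly from the Java one Hadoop uses, although a plain word such as https behaves the same in both.

```python
import re
from collections import Counter

def count_matches(text, pattern):
    """Count each distinct string matched by the regex in the text."""
    return Counter(m.group(0) for m in re.finditer(pattern, text))

# Hypothetical sample; on the VM, read hadoop_home_page.html instead.
sample = '<a href="https://hadoop.apache.org">Hadoop</a> <a href="http://example.org">x</a> https'
print(count_matches(sample, "https"))  # Counter({'https': 2})
```

To check the real page, pass in the contents of the local file, for example open("hadoop_home_page.html", encoding="utf-8", errors="replace").read().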
Wordcount example:
21. For the wordcount example we also use the hadoop-mapreduce-examples-3.2.1.jar
file. The wordcount example returns the count of each word in the given documents.
Command:
hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount
/user/demo/hadoop_home_page.html /user/demo/hadoop_home_page.html_OUTPUT_1
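The wordcount job writes one line per word, with the word and its count separated by a tab. Assuming the result file has been copied to the local file system, a sketch like the following could load it and list the most frequent words; the function names and the sample lines are hypothetical.

```python
def parse_wordcount_output(lines):
    """Parse 'word<TAB>count' lines into a dict of word -> count."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, _, count = line.rpartition("\t")  # split on the last tab
        counts[word] = int(count)
    return counts

def top_words(counts, n):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Hypothetical output lines, not taken from a real run.
sample_lines = ["Apache\t12\n", "Hadoop\t30\n", "the\t25\n"]
print(top_words(parse_wordcount_output(sample_lines), 2))  # [('Hadoop', 30), ('the', 25)]
```

With a real result file, pass an open file object in place of sample_lines.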
Another three commands are in the following screen shot.
Wordcount using Hadoop streaming
(python)
22. Here are the mapper and reducer programs for wordcount.
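The original listings are not reproduced in this text, so the following is a minimal sketch of what a streaming word count usually looks like, not the guide's own code. The mapper emits a word, a tab and 1 for every token; the reducer relies on Hadoop sorting the mapper output by key and sums consecutive lines with the same word. Both parts are shown in one block with a local simulation of map, sort and reduce; the function names are hypothetical.

```python
from itertools import groupby

def map_words(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_counts(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share the same word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Local simulation: sorted() stands in for Hadoop's shuffle-and-sort phase.
    text = ["hadoop streaming with python", "python with hadoop"]
    for out in reduce_counts(sorted(map_words(text))):
        print(out)
```

For the streaming job, the two functions would be split into two executable scripts (wordcount_map.py and wordcount_red.py in the command used in this guide), each starting with a Python shebang line, reading from sys.stdin and printing its results to standard output.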
23. We run the program as below and then copy the result to the local file system.
Command:
hadoop jar ../share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar -mapper ./wordcount_map.py
-reducer ./wordcount_red.py -input /user/demo/hadoop_home_page.html -output
/user/demo/hadoop_home_page.html_OUTPUT_COUNT
Please note that if you power off the virtual machine and are not sure how to start the
NameNode without formatting it, you will need to do the assignment all over
again in order to demo it. If you save the state of the machine instead of powering it
off, you should not need to redo everything, although this is only a temporary
solution.