Learn Hadoop Administration

Apache Hadoop Pseudo-Distributed Installation on Ubuntu

Apache Hadoop Single-Node Installation on Ubuntu

In this guide we are going to install Apache Hadoop in:

1.     Standalone/Local Mode

2.     Pseudo-Distributed Mode

 

Prerequisites:

1.     A physical machine or a virtual machine

2.     Any operating system

 

In this setup:

1.     Oracle VM VirtualBox virtualization software

2.     A VM created with the Ubuntu Server operating system

 

In Local Mode or Pseudo-Distributed Mode, there are three stages to installing Hadoop. They are:

1.     Pre-Installation steps

2.     Installation steps

3.     Post-Installation steps.

 

Once Pre-Installation and Installation are completed, we can say the "Standalone/Local"

installation is completed.

           

Once Pre-Installation + Installation + Post-Installation are completed, we can say the "Pseudo-Distributed" installation is completed.

           

Note: -

--> Make sure some virtualization software and an operating system are already

    installed.

--> In this case I have already installed "Oracle VM VirtualBox".

    CentOS 7 (CUI, GUI) -- CUI is used here.

         

Pre-Installation steps: -

1.     Java 8 / JDK 1.8 (recommended) or later

2.     Set up passwordless SSH (this is for Pseudo-Distributed Mode)

 

Note: -

To perform the Pre-Installation, Installation, or Post-Installation

steps, we need either the "root" user or any user with "sudo" permissions.

 

In my VM, I have two users:

1. Main Admin User

  un: root

  pwd: cfamily

 

2. Hadoop Admin User

  un: cfamily (sudoer)

  pwd: cfamily  

 

The recommended option is any user with "sudo" permissions.

 

Step1:  Java 8 Installation on the Linux Operating System (Ubuntu)

$sudo apt update

$java -version

Command not found

 

Steps to install Java 8:

1: Download Java 8

2: Install Java 8

3: Set JAVA_HOME and PATH variable

4: Verify that Java is installed

 

Download Java 8:

a) Download the tar file on Windows and then upload it to the Ubuntu server

          or

b) Download it directly on the Ubuntu server

$cd /opt 

$sudo wget https://download.oracle.com/otn/java/jdk/8u281-b09/89d678f2be164786b292527658ca1605/jdk-8u281-linux-x64.tar.gz
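
Note: the Oracle download link above may require an Oracle account login. As an alternative (not part of the original steps, and it installs to a different path), Ubuntu's repositories provide OpenJDK 8:

$sudo apt install openjdk-8-jdk

$readlink -f $(which java)     # shows the install path, e.g. /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

If you go the OpenJDK route, point JAVA_HOME at that install directory instead of /opt/java in the steps below.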

 

Install Java 8: -

a)  Extract the tar file

$sudo tar -xvzf jdk-8u281-linux-x64.tar.gz

b) Remove the tar.gz file

$ sudo rm -rf jdk-8u281-linux-x64.tar.gz

c) Rename the extracted directory

$ ls

jdk1.8.0_281

cfamily@cfamily:/opt$ sudo mv jdk1.8.0_281 java

cfamily@cfamily:/opt$ ls

java

 

Set JAVA_HOME and PATH variable: -

$sudo nano ~/.bashrc 

#JAVA_HOME

export JAVA_HOME=/opt/java

export PATH=$PATH:$JAVA_HOME/bin

Ctrl+X, then Y to save

--> reload .bashrc

$ source ~/.bashrc

 

Verify that Java is installed: -

$java 

  or 

$java -version 

  or

$javac 
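
If JAVA_HOME and PATH are set correctly, the version check prints output roughly like the following (the exact build number will differ):

$ java -version

java version "1.8.0_281"

Java(TM) SE Runtime Environment (build 1.8.0_281-b09)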

 

Step2: Passwordless SSH

--> This is for Pseudo-Distributed Mode.

--> In Pseudo-Distributed Mode, all daemons think they are running on

    different machines.

--> So daemon-to-daemon communication uses the SSH protocol.

--> SSH -- Secure Shell

    It requires a username and password.

E.g:-

$ ssh localhost

The authenticity of host 'localhost (::1)' can't be established.

 

cfamily@localhost's password:

Last login: Sat May 30 11:55:03 2020 from 192.168.0.7

 

[cfamily@cfamily ~]$ ssh localhost

cfamily@localhost's password:

Last login: Sat May 30 12:22:24 2020 from localhost

[cfamily@cfamily ~]$ exit

logout

Connection to localhost closed.

[cfamily@cfamily ~] $ exit

logout

Connection to localhost closed.

 

--> To let all daemons communicate with each other without a password, we

    set up passwordless SSH.

Set up passwordless SSH: -

$ssh-keygen -t rsa

$cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

$chmod og-wx ~/.ssh/authorized_keys

 

Then Verify: -

$ ssh localhost

Now it will not ask for a password.
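
If SSH still prompts for a password, a common cause (not covered in the steps above, so treat this as a troubleshooting hint) is that the ~/.ssh directory or key files are too permissive:

$chmod 700 ~/.ssh

$chmod 600 ~/.ssh/id_rsa

$ssh localhost     # try again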



II) Installation Steps

Step1: Download Latest Apache Hadoop

Step2: Install Apache Hadoop

Step3: Set HADOOP_HOME and PATH Variable

Step4: Verify that Hadoop is installed

 

Step1: Download Latest Apache Hadoop

Hadoop is available in three major versions:

Hadoop 1.x – outdated

Hadoop 2.x – still widely used in industry

Hadoop 3.x – used for new applications

https://mirrors.estointernet.in/apache/hadoop/common/

$cd /opt

$sudo wget https://mirrors.estointernet.in/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
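
If the mirror above is no longer serving the file, the Apache archive keeps all releases; the equivalent download (same file name, different host) would be:

$sudo wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz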

 

Step2: Install Apache Hadoop

a)      Extract tar file

b)      Remove tar file

c)      Rename Hadoop

$ ls

hadoop-3.3.0.tar.gz  java

cfamily@cfamily:/opt$ sudo tar -xvzf hadoop-3.3.0.tar.gz

$ ls

hadoop-3.3.0  hadoop-3.3.0.tar.gz  java

cfamily@cfamily:/opt$ sudo rm -rf *.gz

cfamily@cfamily:/opt$ ls

hadoop-3.3.0  java

cfamily@cfamily:/opt$ sudo mv hadoop-3.3.0 hadoop

cfamily@cfamily:/opt$ ls

hadoop  java

 

Step3: Set HADOOP_HOME and PATH Variable

a) Understanding Hadoop Installation

cfamily@cfamily:/opt$ cd hadoop/

cfamily@cfamily:/opt/hadoop$ pwd

/opt/hadoop

cfamily@cfamily:/opt/hadoop$ ls

bin      lib             licenses-binary  NOTICE.txt  share

etc      libexec         LICENSE.txt      README.txt

include  LICENSE-binary  NOTICE-binary    sbin

cfamily@cfamily:/opt/hadoop$ cd bin

cfamily@cfamily:/opt/hadoop/bin$ ls

(all the Hadoop command binaries: hadoop, hdfs, mapred, yarn, etc.)

cfamily@cfamily:/opt/hadoop/bin$ cd ..

cfamily@cfamily:/opt/hadoop$ cd sbin/

cfamily@cfamily:/opt/hadoop/sbin$ ls

(the start/stop scripts for the daemons: start-all.sh, stop-dfs.sh, etc.)

cfamily@cfamily:/opt/hadoop/sbin$ cd ..

cfamily@cfamily:/opt/hadoop$ cd etc/

cfamily@cfamily:/opt/hadoop/etc$ ls

hadoop

cfamily@cfamily:/opt/hadoop/etc$ cd hadoop/

(all the configuration files)

 

--> In Linux, every user has profile files such as .bashrc, .profile, etc.

--> These files are executed automatically when the user logs in.

--> We are going to configure the HADOOP_HOME and PATH variables in the .bashrc file.

$nano ~/.bashrc

#HADOOP

export HADOOP_HOME=/opt/hadoop

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Ctrl+X, then Y to save

 

reload .bashrc

$ source ~/.bashrc

 

 

Step4: Verify that Hadoop is installed

$ hadoop version

Hadoop 3.3.0

 

Note: - At this point, the Local/Standalone Mode installation has completed successfully.

 

III) Post-Installation steps

Step 1) In Post-Installation we will configure Hadoop

cfamily@cfamily:/opt/hadoop/etc/hadoop$ pwd

/opt/hadoop/etc/hadoop

cfamily@cfamily:/opt/hadoop/etc/hadoop$ ls

(all the configuration files are available here)

 

We will configure 4 important files. They are:

1.      core-site.xml

2.      hdfs-site.xml

3.      mapred-site.xml

4.      yarn-site.xml

 

 

core-site.xml: -

--> we specify the "filesystem type"; the default is the local filesystem.

--> we specify the "namenode" host and its RPC port.

$sudo nano core-site.xml

<configuration>

        <property>

            <name>fs.default.name</name>

            <value>hdfs://localhost:9000</value>

        </property>

</configuration>
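
Note: fs.default.name still works but is a deprecated name; on Hadoop 2.x/3.x the current property name is fs.defaultFS, so the same setting can also be written as:

        <property>

            <name>fs.defaultFS</name>

            <value>hdfs://localhost:9000</value>

        </property>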

 

hdfs-site.xml

--> In this configuration file we provide/configure:

    block size -- the default block size was 64 MB until Hadoop 1.x and is

            128 MB from Hadoop 2.x onwards.

    replication factor -- the default replication factor is "3".

    etc.

$sudo nano hdfs-site.xml

<configuration>

     <property>

            <name>dfs.replication</name>

            <value>1</value>

    </property>

</configuration>

Why am I setting the replication factor to "1"?

Because we have only one machine.
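
The block size mentioned above can also be overridden in hdfs-site.xml if ever needed; this property is purely illustrative and not required for this single-node setup:

     <property>

            <name>dfs.blocksize</name>

            <value>134217728</value>  <!-- 128 MB, the 2.x/3.x default -->

    </property>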

 

mapred-site.xml

--> Until Hadoop 1.x: MapReduce Architecture 1 --

    JobTracker and TaskTracker

--> From Hadoop 2.x: MapReduce Architecture 2 (the YARN architecture) --

    ResourceManager and NodeManager

 

$sudo nano mapred-site.xml

<configuration>

    <property>

            <name>mapreduce.framework.name</name>

            <value>yarn</value>

    </property>

</configuration>
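
The Apache Hadoop 3.x single-node guide adds one more property to mapred-site.xml so that MapReduce jobs can find their jars at runtime; it is not needed just to start the daemons, but is worth adding if you plan to run jobs (taken from the Apache documentation, not from the original notes):

    <property>

            <name>mapreduce.application.classpath</name>

            <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>

    </property>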

 

yarn-site.xml: -

--> Here we specify the YARN configuration.

 

$sudo nano yarn-site.xml

 

<configuration>

    <property>

            <name>yarn.acl.enable</name>

            <value>0</value>

    </property>

 

    <property>

            <name>yarn.resourcemanager.hostname</name>

            <value>localhost</value>

    </property>

 

    <property>

            <name>yarn.nodemanager.aux-services</name>

            <value>mapreduce_shuffle</value>

    </property>

</configuration>

 

Step 2: Format NameNode

Why are we formatting?

--> Once the configuration is complete, we need to format the NameNode so it

   picks up the latest configuration.

Note: - If you format the NameNode, you lose the old metadata. So take a backup of the

metadata first.

 

$hdfs namenode -format
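
If formatting succeeds, the output ends with lines similar to the following (the storage path depends on your configuration; /tmp/hadoop-<user>/dfs/name is the default):

... INFO common.Storage: Storage directory /tmp/hadoop-cfamily/dfs/name has been successfully formatted.

... INFO namenode.NameNode: SHUTDOWN_MSG: Shutting down NameNode at cfamily/127.0.1.1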

 

Accessing Hadoop

--> In the previous session, we successfully installed Apache Hadoop on a single

    node -- Pseudo-Distributed Mode.

1. Starting and Stopping Hadoop Daemons

2. Enable required ports

3. Understanding NameNode Web UI          

 

Starting and Stopping Hadoop Daemons:-

HADOOP_HOME/bin

     hadoop,hdfs,mapred,yarn etc..

             

HADOOP_HOME/sbin

            start-all.sh,stop-all.sh, start-dfs.sh, stop-dfs.sh,start-yarn.sh,

            stop-yarn.sh, etc..

           

These directories contain Hadoop commands for both Hadoop admins and normal users.

 

Define the daemon users expected by the Hadoop 3.x start scripts:

$nano ~/.bashrc

export HDFS_NAMENODE_USER="cfamily"

export HDFS_DATANODE_USER="cfamily"

export HDFS_SECONDARYNAMENODE_USER="cfamily"

export YARN_RESOURCEMANAGER_USER="cfamily"

export YARN_NODEMANAGER_USER="cfamily"

 

Reload the .bashrc file: -

---------------------

$ source ~/.bashrc

 

Note: Configure JAVA_HOME in hadoop-env.sh
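
The Hadoop scripts do not always pick up JAVA_HOME from .bashrc, so set it explicitly in the Hadoop environment file; with the layout used in this guide that is:

$ sudo nano /opt/hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/opt/java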

 

start all daemons at once: -

$start-all.sh

 

To check whether the daemons have started: -

$ jps

3986 ResourceManager

4457 Jps

3707 SecondaryNameNode

3532 DataNode

4092 NodeManager

3437 NameNode

 

To Stop All Daemons at once: -

$stop-all.sh

$jps

 

Understanding NameNode Web UI

--> start all hadoop daemons

$start-all.sh

--> open any web browser and go to the URL:

 http://ipaddress:9870  --> hadoop 3.x

 http://ipaddress:50070 --> hadoop 2.x

 

Note:- Linux is secure by default; instead of stopping the firewall, we can simply enable the required ports.

 

Enable required ports: -

$ sudo ufw allow 9870
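
If you also want to reach the YARN ResourceManager Web UI (it listens on port 8088 by default), open that port as well (assuming ufw is the firewall in use, as above):

$ sudo ufw allow 8088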
 
NameNode Web UI: -
http://192.168.0.4:9870
    Utilities
        |-- Browse the file system
/ -- the root directory "/" is the top of the HDFS filesystem