Setting up a Hadoop Cluster

This article explains the installation of an Apache Hadoop 2.7.3 cluster. The servers used are a namenode, a secondary namenode and three datanodes. The nodes are virtual machines running on VMware Workstation 10.

Requirements
Name        | Description
Java SE 1.8 | Java version 1.8 should be installed on every server. Get Java from Oracle.
node 1      | to be used for the namenode
node 2      | to be used for the secondary namenode
node 3      | to be used for datanode 1
node 4      | to be used for datanode 2
node 5      | to be used for datanode 3
Default Ports

Ensure that the below ports are not blocked by a firewall and that they are not in use by any other programs. A quick way to check for conflicting listeners is shown after the table.

Port  | Parameter                           | Conf File     | Used for
8020  | fs.defaultFS                        | core-site.xml | NameNode URI
50070 | dfs.http.address, dfs.https.port    | hdfs-site.xml | HTTP port of the web interface provided by Apache Hadoop; the URI is http://namenode:50070
1004  | dfs.datanode.address                | hdfs-site.xml | Datanode address
1006  | dfs.datanode.http.address           | hdfs-site.xml | Datanode HTTP address
8025  | dfs.datanode.ipc.address            | hdfs-site.xml | Datanode IPC port
50090 | dfs.secondary.http.address          | hdfs-site.xml | Secondary namenode HTTP port
50490 | dfs.secondary.https.port            | hdfs-site.xml | Secondary namenode HTTPS port
8088  | yarn.resourcemanager.webapp.address | yarn-site.xml | ResourceManager web UI
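As a quick sanity check on each node, a sketch like the below lists any existing listeners on these ports; empty output means they are free. The ss utility is assumed to be available (netstat -lnt works on older systems).

# Print existing listeners on the above ports; no output means the ports are free
ss -lnt | grep -E ':(8020|50070|1004|1006|8025|50090|50490|8088) '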
Apache Hadoop Distribution

Download the latest stable release of Apache Hadoop from an Apache download mirror. The software is provided both as a source tarball and as a binary tarball; I've used the binary here, since downloading the source requires you to compile it.

Copy the software to all nodes in the cluster

Copy the software hadoop-x.y.z.tar.gz (e.g. hadoop-2.7.3.tar.gz) to the namenode, the secondary namenode and all datanodes in the cluster.
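For example, assuming the hosts are reachable as orcl1 through orcl5 (hypothetical names; substitute your own) and the target directory is /opt/hadoop, a copy loop could look like the below:

# Copy the tarball to every node in the cluster
for host in orcl1 orcl2 orcl3 orcl4 orcl5; do
  scp hadoop-2.7.3.tar.gz ${host}:/opt/hadoop/
done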

Unpack the binary file
tar xvfz hadoop-x.y.z.tar.gz
Create appropriate directories for NameNode & Secondary NameNode
cd /opt/hadoop
mkdir -p hdfsdata/namedir1
mkdir hdfsdata/namedir2
mkdir hdfsdata/edits1
mkdir hdfsdata/edits2
mkdir hdfsdata/data1
mkdir hdfsdata/data2
mkdir hdfsdata/nodemgrdata1
mkdir hdfsdata/nodemgrdata2
mkdir hdfsdata/nodemgrlog1
mkdir hdfsdata/nodemgrlog2

The above directories are used as the values of the parameters shown in the below table.

Directory                                    | Parameter                   | Config file
hdfsdata/namedir1, hdfsdata/namedir2         | dfs.namenode.name.dir       | hdfs-site.xml
hdfsdata/edits1, hdfsdata/edits2             | dfs.namenode.edits.dir      | hdfs-site.xml
hdfsdata/data1, hdfsdata/data2               | dfs.datanode.data.dir       | hdfs-site.xml
hdfsdata/nodemgrdata1, hdfsdata/nodemgrdata2 | yarn.nodemanager.local-dirs | yarn-site.xml
hdfsdata/nodemgrlog1, hdfsdata/nodemgrlog2   | yarn.nodemanager.log-dirs   | yarn-site.xml
Create appropriate directories on all DataNodes
cd /opt/hadoop
mkdir -p hdfsdata/data1
mkdir hdfsdata/data2
mkdir hdfsdata/nodemgrdata1
mkdir hdfsdata/nodemgrdata2
mkdir hdfsdata/nodemgrlog1
mkdir hdfsdata/nodemgrlog2
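If the daemons run under a dedicated account (a hypothetical hadoop user is assumed below), make sure that user owns the data directories on every node:

# Give the Hadoop user ownership of all data directories
chown -R hadoop:hadoop /opt/hadoop/hdfsdata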
Environment Setting
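Set the below variables on every node in the cluster, for example in the login profile (such as ~/.bash_profile) of the user that runs the Hadoop daemons.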
# Set Java Home
JAVA_HOME=<path to Java>
export JAVA_HOME

# Hadoop Home
HADOOP_HOME=/opt/hadoop/hadoop-x.y.z
export HADOOP_HOME

# Hadoop Log Directory
HADOOP_LOG_DIR=/opt/hadoop/hdfsdata/log
export HADOOP_LOG_DIR

HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_PREFIX  

HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_CONF_DIR  
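After reloading the profile, the below commands verify that the variables resolve correctly; hadoop version should print the release number.

. ~/.bash_profile
$HADOOP_HOME/bin/hadoop version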
DataNode Include File

The full path of this file is the value of the parameter dfs.hosts in the configuration file hdfs-site.xml. It contains the fully qualified hostnames of all the datanodes (the datanodes only, not the namenode or the secondary namenode). Each hostname should appear on its own line, as seen below. The file can have any name; here it is named dfs.include and resides in the /opt/hadoop directory.

cat /opt/hadoop/dfs.include
orcl3
orcl4
orcl5
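To wire the file in, point the dfs.hosts property of hdfs-site.xml at its full path, along the lines of the below excerpt:

<property>
  <name>dfs.hosts</name>
  <value>/opt/hadoop/dfs.include</value>
</property>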
Configuration files

Configuring the cluster nodes is based on three XML files: core-site.xml, hdfs-site.xml and yarn-site.xml. These files are present in the etc/hadoop directory under the main installation directory.
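As a minimal sketch (not the complete files), the below excerpts show how the parameters and directories from the earlier tables could be wired together, in addition to the dfs.hosts property shown above. The hostname orcl1 for the namenode and ResourceManager is an assumption based on the node list; substitute your own.

core-site.xml:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://orcl1:8020</value>
</property>

hdfs-site.xml:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/opt/hadoop/hdfsdata/namedir1,/opt/hadoop/hdfsdata/namedir2</value>
</property>
<property>
  <name>dfs.namenode.edits.dir</name>
  <value>/opt/hadoop/hdfsdata/edits1,/opt/hadoop/hdfsdata/edits2</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/opt/hadoop/hdfsdata/data1,/opt/hadoop/hdfsdata/data2</value>
</property>

yarn-site.xml:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>orcl1</value>
</property>
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/opt/hadoop/hdfsdata/nodemgrdata1,/opt/hadoop/hdfsdata/nodemgrdata2</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/opt/hadoop/hdfsdata/nodemgrlog1,/opt/hadoop/hdfsdata/nodemgrlog2</value>
</property>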

Formatting the HDFS filesystem

After you have set up the software, the directories and the configuration files, the first thing to be done is formatting the filesystem. Formatting is done by logging in to the namenode and running the below command; do this on the namenode only. <cluster name> is the name you give to the cluster. Although the -clusterid parameter is optional, it is better to set it yourself; otherwise you will have to deal with a long string that the system generates for you.

cd $HADOOP_HOME/bin
./hdfs namenode -format -clusterid <cluster name>
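For example, with a hypothetical cluster name of mycluster; a successful format writes a VERSION file (containing the cluster ID) under each name directory:

./hdfs namenode -format -clusterid mycluster
cat /opt/hadoop/hdfsdata/namedir1/current/VERSION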
Start NameNode daemon

To start the namenode daemon, execute the below command on the namenode server.

cd $HADOOP_HOME/sbin
./hadoop-daemon.sh start namenode

Alternatively, on Hadoop 3.x and later, you can use the below command as well; the --daemon option is not available in the 2.x hdfs script.

cd $HADOOP_HOME/bin
./hdfs --daemon start namenode
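Either way, the JDK's jps utility offers a quick check that the process is up, and the web interface from the ports table should then answer at http://namenode:50070.

# jps should list a NameNode process, e.g. a line like "12345 NameNode"
jps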
Start SecondaryNameNode daemon

To start the secondary namenode daemon, execute the below command on the secondary namenode server.

cd $HADOOP_HOME/sbin
./hadoop-daemon.sh start secondarynamenode

On Hadoop 3.x and later, the alternative command is below.

cd $HADOOP_HOME/bin
./hdfs --daemon start secondarynamenode
Start DataNode daemon

To start the datanode daemon, execute the below command on each datanode server.

cd $HADOOP_HOME/sbin
./hadoop-daemon.sh start datanode

On Hadoop 3.x and later, the alternative command is below.

cd $HADOOP_HOME/bin
./hdfs --daemon start datanode
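With the namenode and the datanodes running, the below command is a quick health check; it prints the live datanodes along with their capacity and usage.

cd $HADOOP_HOME/bin
./hdfs dfsadmin -report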
Start YARN ResourceManager daemon

To start the ResourceManager daemon, execute the below command on the server that runs the ResourceManager.

cd $HADOOP_HOME/sbin
./yarn-daemon.sh start resourcemanager

On Hadoop 3.x and later, you can instead run ./yarn --daemon start resourcemanager from $HADOOP_HOME/bin.
Start YARN NodeManager daemon

To start the NodeManager daemon, execute the below command on each datanode server.

cd $HADOOP_HOME/sbin
./yarn-daemon.sh start nodemanager

On Hadoop 3.x and later, you can instead run ./yarn --daemon start nodemanager from $HADOOP_HOME/bin.
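Once the ResourceManager and the NodeManagers are up, the below command lists the nodes registered with YARN.

cd $HADOOP_HOME/bin
./yarn node -list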
Stop NameNode daemon

To stop the namenode daemon, execute the below command on the namenode server.

cd $HADOOP_HOME/sbin
./hadoop-daemon.sh stop namenode

Alternatively, on Hadoop 3.x and later, you can use the below command as well.

cd $HADOOP_HOME/bin
./hdfs --daemon stop namenode
Stop Secondary NameNode daemon

To stop the secondary namenode daemon, execute the below command on the secondary namenode server.

cd $HADOOP_HOME/sbin
./hadoop-daemon.sh stop secondarynamenode

Alternatively, on Hadoop 3.x and later, you can use the below command as well.

cd $HADOOP_HOME/bin
./hdfs --daemon stop secondarynamenode
Stop DataNode daemon

To stop the datanode daemon, execute the below command on the datanode server; to stop all datanodes, run it on every one of them.

cd $HADOOP_HOME/sbin
./hadoop-daemon.sh stop datanode

Alternatively, on Hadoop 3.x and later, you can use the below command as well.

cd $HADOOP_HOME/bin
./hdfs --daemon stop datanode
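Rather than stopping daemon by daemon, the distribution also ships cluster-wide helper scripts in sbin. The sketch below assumes passwordless SSH from the namenode to all other nodes and that etc/hadoop/slaves lists the datanode hostnames.

cd $HADOOP_HOME/sbin
./stop-yarn.sh
./stop-dfs.sh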