Multi-Node Cluster Install
Setting up a multi-node Hadoop cluster involves configuring several nodes (a master and worker nodes) to work together. In this guide, we'll build a three-node cluster: one node acts as the master, running the NameNode and the YARN ResourceManager, and the other two act as workers, each running a DataNode and a NodeManager. We'll use a Linux-based OS (Ubuntu/CentOS) for this installation.
Pre-requisites
Java 8 or higher installed on all nodes.
SSH passwordless setup between all nodes.
All nodes should have static IP addresses.
Network configuration:
Ensure that all nodes can communicate with each other via their IP addresses.
All nodes should be able to resolve each other's hostnames (each hostname should resolve to an IP).
Step 1: Configure SSH on All Nodes
Generate SSH key on the master node:
ssh-keygen -t rsa
Copy the public key to all nodes (master and workers):
ssh-copy-id user@worker-node-ip
ssh-copy-id user@master-node-ip
This will enable passwordless SSH access between the nodes.
Test SSH between the nodes:
ssh user@worker-node-ip
ssh user@master-node-ip
Set up a hostname for each node in /etc/hosts (on all nodes): open /etc/hosts and add entries like:
192.168.1.1 master-node
192.168.1.2 worker-node-1
192.168.1.3 worker-node-2
This allows each node to resolve the names of the others.
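Before moving on, it can help to confirm from the master that every hostname resolves and that passwordless SSH really works. A minimal check, assuming the hostnames from the /etc/hosts entries above and an account named user:
for host in master-node worker-node-1 worker-node-2; do
  getent hosts $host          # should print the IP configured in /etc/hosts
  ssh user@$host hostname     # should return the remote hostname without a password prompt
done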
Step 2: Install Java on All Nodes
Install Java 8 or higher (if not already installed) on all nodes:
On Ubuntu:
sudo apt update
sudo apt install openjdk-8-jdk
On CentOS:
sudo yum install java-1.8.0-openjdk-devel
Set JAVA_HOME on all nodes:
Add the following to ~/.bashrc or ~/.profile (adjust the path if your distribution installs Java elsewhere, e.g. under /usr/lib/jvm/java-1.8.0-openjdk on CentOS):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
Reload the profile:
source ~/.bashrc
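As a quick sanity check (not part of the original steps), confirm on each node that Java runs and that JAVA_HOME points at a real JDK:
java -version                  # should report version 1.8.x or higher
echo $JAVA_HOME                # should print the path you exported above
$JAVA_HOME/bin/java -version   # fails if JAVA_HOME is wrong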
Step 3: Install Hadoop on All Nodes
Download Hadoop and extract it on all nodes:
On the master node:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
tar -xzvf hadoop-3.3.0.tar.gz
sudo mv hadoop-3.3.0 /usr/local/hadoop
On each worker node: Download Hadoop and extract it in the same way as on the master node:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
tar -xzvf hadoop-3.3.0.tar.gz
sudo mv hadoop-3.3.0 /usr/local/hadoop
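If you prefer not to log in to each worker, the same download and extraction can be pushed from the master over SSH. This is only a sketch: it assumes the worker hostnames from Step 1, an account named user on every node, and uses ssh -t so that sudo can prompt for a password if it needs to:
for host in worker-node-1 worker-node-2; do
  ssh -t user@$host "wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz && \
    tar -xzvf hadoop-3.3.0.tar.gz && \
    sudo mv hadoop-3.3.0 /usr/local/hadoop"
done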
Set Hadoop environment variables on all nodes: Add the following to the end of ~/.bashrc:
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Reload ~/.bashrc:
source ~/.bashrc
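To verify the environment variables took effect on each node (a quick check, not part of the original steps):
hadoop version        # prints the Hadoop release if HADOOP_HOME and PATH are set correctly
which hdfs            # should resolve to /usr/local/hadoop/bin/hdfs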
Step 4: Configure Hadoop on All Nodes
1. Configure hadoop-env.sh
Edit hadoop-env.sh on all nodes (/usr/local/hadoop/etc/hadoop/hadoop-env.sh):
nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Add the following line to set JAVA_HOME:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
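If you would rather not open an editor on every node, the same line can be appended from the master over SSH. A hypothetical one-liner, assuming the hostnames and user account from Step 1:
for host in master-node worker-node-1 worker-node-2; do
  ssh user@$host \
    "echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh"
done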
2. Configure core-site.xml
Edit core-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/core-site.xml):
nano /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following configuration:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master-node:9000</value>
  </property>
</configuration>
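To confirm Hadoop picks the value up, the standard getconf tool can read it back (a quick check, assuming the PATH from Step 3):
hdfs getconf -confKey fs.defaultFS    # should print hdfs://master-node:9000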
3. Configure hdfs-site.xml
Edit hdfs-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/hdfs-site.xml):
nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add the following configuration (dfs.replication is set to 2 so that it matches the two DataNodes in this cluster):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>
4. Configure mapred-site.xml
Edit mapred-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/mapred-site.xml). In Hadoop 3.x this file already exists in the configuration directory, so there is no .template file to copy:
nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
Add the following configuration:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
5. Configure yarn-site.xml
Edit yarn-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/yarn-site.xml):
nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
Add the following configuration:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master-node:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master-node:8030</value>
  </property>
</configuration>
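One file the steps above do not mention is the workers file. In Hadoop 3.x, start-dfs.sh and start-yarn.sh read $HADOOP_HOME/etc/hadoop/workers on the master and start a DataNode and NodeManager on every host listed there, so without it only the master's daemons come up. A minimal sketch, assuming the hostnames from Step 1; the rsync loop is an optional convenience for keeping the worker configuration in sync with the master:
# On the master node: list the worker hostnames, one per line
cat > /usr/local/hadoop/etc/hadoop/workers <<EOF
worker-node-1
worker-node-2
EOF

# Optional: push the finished configuration directory from the master to the workers
for host in worker-node-1 worker-node-2; do
  rsync -av /usr/local/hadoop/etc/hadoop/ user@$host:/usr/local/hadoop/etc/hadoop/
done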
Step 5: Set Up Directories for HDFS
On the master node: Create the directories for HDFS and give them to the account that will run Hadoop (replace user:user with that account):
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
sudo chown -R user:user /usr/local/hadoop/hadoop_data
On each worker node: Create the directories for HDFS:
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
sudo chown -R user:user /usr/local/hadoop/hadoop_data
Step 6: Format the Hadoop Filesystem (Only on Master Node)
Run the following command on the master node to format the NameNode:
hdfs namenode -format
Step 7: Start the Hadoop Cluster
Start HDFS: On the master node:
start-dfs.sh
Start YARN: On the master node:
start-yarn.sh
Check the status of your Hadoop cluster in a browser:
NameNode web UI: http://master-node:9870
ResourceManager web UI: http://master-node:8088
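Beyond the web UIs, the command line can confirm that the workers have actually joined the cluster (standard Hadoop commands, shown here as a quick check):
hdfs dfsadmin -report    # should list both DataNodes as live
yarn node -list          # should list a NodeManager for each worker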
Step 8: Stop the Hadoop Cluster
To stop Hadoop, use the following commands:
Stop HDFS:
stop-dfs.sh
Stop YARN:
stop-yarn.sh
Troubleshooting Tips
SSH issues: Make sure passwordless SSH works correctly between nodes.
Permissions: Ensure that the Hadoop user has appropriate permissions for directories.
Firewall settings: Check firewall settings if the nodes cannot communicate.
Logs: Look into the Hadoop log files (located in $HADOOP_HOME/logs) for detailed error messages if something goes wrong.
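For example, a quick way to see which daemons are (or are not) running on a node and to read the latest log output; the log file name pattern below is an assumption and may differ slightly on your system:
jps                                            # lists running Java daemons (NameNode, DataNode, ResourceManager, ...)
ls $HADOOP_HOME/logs/                          # one .log file per daemon on this node
tail -n 50 $HADOOP_HOME/logs/*namenode*.log    # last lines of the NameNode log (on the master)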
This multi-node setup allows Hadoop to operate efficiently across multiple machines, enabling scalable and distributed data processing.
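As a final smoke test (not part of the original steps), write a file into HDFS and run the MapReduce example that ships with Hadoop 3.3.0. If the pi job fails with an MRAppMaster class-not-found error, mapred-site.xml and yarn-site.xml likely also need the HADOOP_MAPRED_HOME/classpath settings described in the official Hadoop cluster setup documentation:
hdfs dfs -mkdir -p /user/$(whoami)
hdfs dfs -put /etc/hosts /user/$(whoami)/hosts-test
hdfs dfs -ls /user/$(whoami)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 2 10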