Setting up a multi-node Hadoop cluster involves configuring several nodes (master and workers) to work together. In this guide, we'll focus on a three-node cluster, where one node acts as the NameNode (master) and the other two act as DataNodes (workers). We'll use a Linux-based OS (Ubuntu or CentOS) for this installation.
Prerequisites
Java 8 or higher installed on all nodes.
SSH passwordless setup between all nodes.
All nodes should have static IP addresses.
Network configuration:
Ensure that all nodes can communicate with each other via their IP addresses.
All nodes should be able to resolve hostnames (hostname should resolve to an IP).
Step 1: Set Up Passwordless SSH and Hostnames
Generate an SSH key on the master node:
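For example, generating an RSA key pair (press Enter to accept the default location, and leave the passphrase empty for passwordless login):

```bash
ssh-keygen -t rsa -b 4096
```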
Copy the public key to all nodes (master and workers):
```bash
ssh-copy-id user@worker-node-ip
ssh-copy-id user@master-node-ip
```

This will enable passwordless SSH access between the nodes.
Test SSH between the nodes:
```bash
ssh user@worker-node-ip
ssh user@master-node-ip
```

Set up a hostname for each node in /etc/hosts (on all nodes):
Open /etc/hosts and add entries like:
```
192.168.1.1   master-node
192.168.1.2   worker-node-1
192.168.1.3   worker-node-2
```

This allows each node to resolve the names of the others.
Step 2: Install Java on All Nodes
Install Java 8 or higher (if not already installed) on all nodes:
On Ubuntu:
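For example, installing OpenJDK 8 from the standard repositories:

```bash
sudo apt update
sudo apt install -y openjdk-8-jdk
```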
On CentOS:
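For example, installing OpenJDK 8 from the base repositories:

```bash
sudo yum install -y java-1.8.0-openjdk-devel
```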
Set JAVA_HOME on all nodes:
Add the following to ~/.bashrc or ~/.profile:
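For example, with OpenJDK 8 on Ubuntu (adjust the path to match your Java installation; `readlink -f $(which java)` shows where it lives):

```bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
```

Then run source ~/.bashrc to apply the change.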
Step 3: Install Hadoop on All Nodes
Download Hadoop and extract it on all nodes:
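For example, with Hadoop 3.3.6 (substitute the current stable release from the Apache download page):

```bash
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop
```

Extracting to /usr/local/hadoop matches the configuration paths used throughout this guide.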
On each worker node: download and extract Hadoop in the same way as on the master node.
Set the Hadoop environment variables on all nodes by adding the following to the end of ~/.bashrc:
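Assuming Hadoop lives in /usr/local/hadoop as above:

```bash
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

Run source ~/.bashrc afterwards so the hadoop and hdfs commands and the start/stop scripts are on your PATH.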
Step 4: Configure Hadoop on All Nodes
1. Configure hadoop-env.sh
Edit hadoop-env.sh on all nodes (/usr/local/hadoop/etc/hadoop/hadoop-env.sh):
Add the following line to set JAVA_HOME:
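For example, using the same path as in Step 2:

```bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```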
2. Configure core-site.xml
Edit core-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/core-site.xml):
Add the following configuration:
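A minimal example, assuming the master's hostname is master-node (as set in /etc/hosts) and the default HDFS port 9000:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master-node:9000</value>
  </property>
</configuration>
```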
3. Configure hdfs-site.xml
Edit hdfs-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/hdfs-site.xml):
Add the following configuration:
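A minimal example, assuming a replication factor of 2 (it must not exceed the number of DataNodes) and the HDFS directories created in Step 5 below:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hdfs/datanode</value>
  </property>
</configuration>
```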
4. Configure mapred-site.xml
Edit mapred-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/mapred-site.xml):
Add the following configuration:
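A minimal example that runs MapReduce jobs on YARN:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```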
5. Configure yarn-site.xml
Edit yarn-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/yarn-site.xml):
Add the following configuration:
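A minimal example, assuming the ResourceManager runs on master-node:

```xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master-node</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```

Finally, on the master node, list the worker hostnames (one per line) in /usr/local/hadoop/etc/hadoop/workers so the start scripts know where to launch DataNodes and NodeManagers (in Hadoop 2.x this file is named slaves instead):

```
worker-node-1
worker-node-2
```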
Step 5: Set Up Directories for HDFS
On the master node: create the directories for HDFS and give them appropriate permissions:
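For example, matching the paths configured in hdfs-site.xml (replace hadoop:hadoop with the user and group that will run Hadoop):

```bash
sudo mkdir -p /usr/local/hadoop/hdfs/namenode
sudo chown -R hadoop:hadoop /usr/local/hadoop/hdfs
```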
On each worker node: create the directories for HDFS:
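Again assuming the hdfs-site.xml paths and a hadoop user:

```bash
sudo mkdir -p /usr/local/hadoop/hdfs/datanode
sudo chown -R hadoop:hadoop /usr/local/hadoop/hdfs
```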
Step 6: Format the NameNode
Run the following command on the master node to format the NameNode:
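```bash
hdfs namenode -format
```

Format the NameNode only once; reformatting an existing cluster erases all HDFS metadata.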
Step 7: Start the Hadoop Cluster
Start HDFS: on the master node, run:
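```bash
start-dfs.sh
```

This launches the NameNode (and SecondaryNameNode) on the master and a DataNode on each host listed in the workers file.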
Start YARN: on the master node, run:
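```bash
start-yarn.sh
```

This launches the ResourceManager on the master and a NodeManager on each worker.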
Check the status of your Hadoop cluster via the web UIs (or with jps, shown after the list):
NameNode: http://master-node:9870
ResourceManager: http://master-node:8088
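You can also run jps on each node to confirm which daemons are up; typically the master shows NameNode, SecondaryNameNode, and ResourceManager, while each worker shows DataNode and NodeManager:

```bash
jps   # lists the running Java daemons on this node
```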
Step 8: Stop the Hadoop Cluster
To stop Hadoop, use the following commands:
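On the master node:

```bash
stop-yarn.sh
stop-dfs.sh
```

Stopping YARN before HDFS reverses the start-up order.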
Troubleshooting Tips
SSH issues: Make sure passwordless SSH works correctly between all nodes.
Permissions: Ensure that the Hadoop user has appropriate permissions on the HDFS directories.
Firewall settings: Check firewall settings if the nodes cannot communicate.
Logs: Look in the Hadoop log files (located in $HADOOP_HOME/logs) for detailed error messages if something goes wrong.
This multi-node setup allows Hadoop to operate efficiently across multiple machines, enabling scalable and distributed data processing.