Multi-Node Cluster Install
Setting up a multi-node Hadoop cluster involves configuring several machines (a master and several workers) to work together. In this guide, we'll build a three-node cluster in which one node acts as the NameNode (master) and the other two act as DataNodes (workers). We'll use a Linux-based OS (Ubuntu/CentOS) for this installation.
Prerequisites
Java 8 or higher installed on all nodes.
SSH passwordless setup between all nodes.
All nodes should have static IP addresses.
Network configuration:
Ensure that all nodes can communicate with each other via their IP addresses.
All nodes should be able to resolve each other's hostnames (each hostname should resolve to an IP address).
Step 1: Configure SSH on All Nodes
Generate SSH key on the master node:

ssh-keygen -t rsa

Copy the public key to all nodes (master and workers):

ssh-copy-id user@worker-node-ip
ssh-copy-id user@master-node-ip

This will enable passwordless SSH access between the nodes.

Test SSH between the nodes:

ssh user@worker-node-ip
ssh user@master-node-ip

Set up a hostname for each node in /etc/hosts (on all nodes): open /etc/hosts and add entries like:

192.168.1.1 master-node
192.168.1.2 worker-node-1
192.168.1.3 worker-node-2

This allows each node to resolve the names of the others.
Step 2: Install Java on All Nodes
Install Java 8 or higher (if not already installed) on all nodes:
On Ubuntu: install the OpenJDK package with apt.
On CentOS: install the OpenJDK package with yum.
Set JAVA_HOME on all nodes:
Add the following to ~/.bashrc or ~/.profile, then reload the profile. A combined sketch of the install commands and the JAVA_HOME setup is shown below.
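A minimal sketch, assuming the distribution's OpenJDK 8 packages and the typical Ubuntu install path for JAVA_HOME (verify the actual path on your nodes, for example with readlink -f $(which java)):

```bash
# On Ubuntu: install OpenJDK 8
sudo apt update
sudo apt install -y openjdk-8-jdk

# On CentOS: install OpenJDK 8
sudo yum install -y java-1.8.0-openjdk-devel

# Add to ~/.bashrc or ~/.profile on every node (path shown is typical for Ubuntu)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin

# Reload the profile so the variables take effect
source ~/.bashrc
```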
Step 3: Install Hadoop on All Nodes
Download Hadoop and extract it on all nodes:
On the master node:
On each worker node: Download Hadoop and extract it in the same way as on the master node:
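A sketch of the download and extraction, assuming Hadoop 3.3.6 (substitute whichever 3.x release you are installing) and /usr/local/hadoop as the install location; run it on every node:

```bash
# Download and unpack Hadoop, then move it to /usr/local/hadoop
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop

# Give the user that will run Hadoop ownership of the installation
sudo chown -R $USER:$USER /usr/local/hadoop
```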
Set Hadoop environment variables on all nodes: add the following to the end of ~/.bashrc, then reload ~/.bashrc.
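A minimal sketch, assuming Hadoop was extracted to /usr/local/hadoop as above:

```bash
# Append to ~/.bashrc on every node
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Reload ~/.bashrc so the variables take effect
source ~/.bashrc
```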
Step 4: Configure Hadoop on All Nodes
1. Configure hadoop-env.sh
Edit hadoop-env.sh on all nodes (/usr/local/hadoop/etc/hadoop/hadoop-env.sh) and add the following line to set JAVA_HOME:
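A minimal sketch, assuming the Ubuntu OpenJDK 8 path used earlier (adjust to wherever Java is installed on your nodes):

```bash
# In /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```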
2. Configure core-site.xml
Edit core-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/core-site.xml) and add the following configuration:
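A minimal sketch; it assumes the master's hostname is master-node (as in the /etc/hosts example above) and that the NameNode RPC port is 9000:

```xml
<configuration>
  <!-- Default filesystem: all nodes talk to the NameNode running on the master -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master-node:9000</value>
  </property>
</configuration>
```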
3. Configure hdfs-site.xml
Edit hdfs-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/hdfs-site.xml) and add the following configuration:
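A minimal sketch; a replication factor of 2 suits the two DataNodes in this cluster, and the storage paths are assumptions that should match the directories created in Step 5:

```xml
<configuration>
  <!-- Keep two copies of each block (one per worker) -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- Where the NameNode stores its metadata (used on the master) -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
  <!-- Where DataNodes store HDFS blocks (used on the workers) -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>
```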
4. Configure mapred-site.xml
Edit mapred-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/mapred-site.xml) and add the following configuration:
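A minimal sketch that runs MapReduce on YARN; the HADOOP_MAPRED_HOME entries are typically needed on Hadoop 3.x so that MapReduce tasks can find the Hadoop installation (assumed here to be /usr/local/hadoop):

```xml
<configuration>
  <!-- Run MapReduce jobs on YARN instead of the local runner -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- Hadoop 3.x: tell the MapReduce application master and tasks where Hadoop lives -->
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
  </property>
</configuration>
```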
5. Configure yarn-site.xml
Edit yarn-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/yarn-site.xml) and add the following configuration:
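A minimal sketch, assuming the ResourceManager runs on the master node:

```xml
<configuration>
  <!-- The ResourceManager runs on the master -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master-node</value>
  </property>
  <!-- Auxiliary shuffle service required by MapReduce on every NodeManager -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```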
Step 5: Set Up Directories for HDFS
On the master node: create the directory for the NameNode metadata and give it appropriate permissions.
On each worker node: create the directory for DataNode block storage.
A sketch of the commands is shown below.
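The paths below are assumptions and should match dfs.namenode.name.dir and dfs.datanode.data.dir in hdfs-site.xml; ownership is given to the user that runs Hadoop:

```bash
# On the master node (NameNode metadata directory)
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
sudo chown -R $USER:$USER /usr/local/hadoop/hadoop_data

# On each worker node (DataNode block storage)
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
sudo chown -R $USER:$USER /usr/local/hadoop/hadoop_data
```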
Step 6: Format the Hadoop Filesystem (Only on Master Node)
Run the following command on the master node to format the NameNode:
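For example (run once as the Hadoop user; reformatting an existing NameNode erases HDFS metadata):

```bash
hdfs namenode -format
```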
Step 7: Start the Hadoop Cluster
Start HDFS: On the master node:
Start YARN: On the master node:
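A sketch of the start commands, assuming $HADOOP_HOME/sbin is on the PATH. Note that start-dfs.sh and start-yarn.sh launch the worker daemons on the hosts listed in $HADOOP_HOME/etc/hadoop/workers (named slaves in Hadoop 2.x), so make sure that file contains your worker hostnames:

```bash
# On the master node: start the NameNode and the DataNodes on all workers
start-dfs.sh

# On the master node: start the ResourceManager and the NodeManagers on all workers
start-yarn.sh

# Verify which daemons are running on each node
jps
```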
Check the status of your Hadoop cluster:
NameNode: http://master-node:9870
ResourceManager: http://master-node:8088
Step 8: Stop the Hadoop Cluster
To stop Hadoop, use the following commands:
Stop HDFS:
Stop YARN:
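A sketch, mirroring the start scripts (run on the master node):

```bash
# Stop HDFS (NameNode and DataNodes)
stop-dfs.sh

# Stop YARN (ResourceManager and NodeManagers)
stop-yarn.sh
```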
Troubleshooting Tips
SSH issues: Make sure passwordless SSH works correctly between nodes.
Permissions: Ensure that the Hadoop user has appropriate permissions for directories.
Firewall settings: Check firewall settings if the nodes cannot communicate.
Logs: Look into the Hadoop log files (located in $HADOOP_HOME/logs) for detailed error messages if something goes wrong.
This multi-node setup allows Hadoop to operate efficiently across multiple machines, enabling scalable and distributed data processing.