Multi-Node Cluster Install

Setting up a multi-node Hadoop cluster involves configuring several nodes (a master and worker nodes) to work together. In this guide, we'll build a three-node cluster: one node acts as the master, running the HDFS NameNode and the YARN ResourceManager, while the other two act as workers, each running a DataNode and a NodeManager. We'll use a Linux-based OS (Ubuntu/CentOS) for this installation.

Pre-requisites

  1. Java 8 or higher installed on all nodes.

  2. Passwordless SSH from the master node to all nodes (configured in Step 1).

  3. All nodes should have static IP addresses.

  4. Network configuration:

    • Ensure that all nodes can communicate with each other via their IP addresses.

    • All nodes should be able to resolve hostnames (hostname should resolve to an IP).
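
A quick way to sanity-check these prerequisites on each node before starting (the address below is one of the example IPs used later in this guide; substitute your own):

    # Java should report version 1.8 or newer
    java -version

    # Confirm this node's hostname and IP address
    hostname
    hostname -I

    # Confirm the other nodes are reachable
    ping -c 2 192.168.1.2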


Step 1: Configure SSH on All Nodes

  1. Generate SSH key on the master node:

    ssh-keygen -t rsa
  2. Copy the public key from the master to every node, including the master itself (replace user and the IPs with your own):

    ssh-copy-id user@master-node-ip
    ssh-copy-id user@worker-node-1-ip
    ssh-copy-id user@worker-node-2-ip

    This enables passwordless SSH from the master node to every node in the cluster, which the Hadoop start scripts rely on.

  3. Test SSH from the master to each node (no password prompt should appear):

    ssh user@master-node-ip
    ssh user@worker-node-1-ip
    ssh user@worker-node-2-ip
  4. Set up a hostname for each node in /etc/hosts (on all nodes):

    Open /etc/hosts and add entries like:

    192.168.1.1    master-node
    192.168.1.2    worker-node-1
    192.168.1.3    worker-node-2

    This allows each node to resolve the names of the others.
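
    After updating /etc/hosts, a quick check from the master node confirms both name resolution and passwordless SSH by hostname:

    getent hosts master-node worker-node-1 worker-node-2
    ssh user@worker-node-1 hostname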


Step 2: Install Java on All Nodes

  1. Install Java 8 or higher (if not already installed) on all nodes:

    On Ubuntu:

    sudo apt update
    sudo apt install openjdk-8-jdk

    On CentOS:

    sudo yum install java-1.8.0-openjdk-devel
  2. Set JAVA_HOME on all nodes:

    • Add the following to ~/.bashrc or ~/.profile (the path below is the typical Ubuntu location; on CentOS it is usually /usr/lib/jvm/java-1.8.0-openjdk):

      export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
      export PATH=$PATH:$JAVA_HOME/bin
    • Reload the profile:

      source ~/.bashrc
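    • Verify the setup; java -version should report 1.8 or newer and echo should print the path set above:

      java -version
      echo $JAVA_HOME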

Step 3: Install Hadoop on All Nodes

  1. Download Hadoop and extract it on all nodes (run the same commands on the master and on each worker). If this mirror no longer carries 3.3.0, older releases are available from the Apache archive at https://archive.apache.org/dist/hadoop/common/:

    wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
    tar -xzvf hadoop-3.3.0.tar.gz
    sudo mv hadoop-3.3.0 /usr/local/hadoop
  2. Set Hadoop environment variables on all nodes: Add the following to the end of ~/.bashrc:

    export HADOOP_HOME=/usr/local/hadoop
    export HADOOP_INSTALL=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
  3. Reload ~/.bashrc:

    source ~/.bashrc
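  4. Verify that the Hadoop binaries are on the PATH on each node:

    hadoop version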

Step 4: Configure Hadoop on All Nodes

1. Configure hadoop-env.sh

  1. Edit hadoop-env.sh on all nodes (/usr/local/hadoop/etc/hadoop/hadoop-env.sh):

    nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

    Add (or uncomment and edit) the line that sets JAVA_HOME, using the same path as in Step 2 (Ubuntu path shown):

    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

2. Configure core-site.xml

  1. Edit core-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/core-site.xml):

    nano /usr/local/hadoop/etc/hadoop/core-site.xml

    Add the following configuration:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master-node:9000</value>
      </property>
    </configuration>
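
    With the environment variables from Step 3 loaded, you can confirm that Hadoop picks up this setting:

    hdfs getconf -confKey fs.defaultFS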

3. Configure hdfs-site.xml

  1. Edit hdfs-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/hdfs-site.xml):

    nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

    Add the following configuration. dfs.replication is set to 2 to match the two DataNodes in this cluster; with the default of 3, every block would be flagged as under-replicated:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
      </property>
    </configuration>

4. Configure mapred-site.xml

  1. Edit mapred-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/mapred-site.xml). In Hadoop 3.x this file already exists, so there is no .template file to copy:

    nano /usr/local/hadoop/etc/hadoop/mapred-site.xml

    Add the following configuration:

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>
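
    Note: on Hadoop 3.x, YARN containers sometimes cannot locate the MapReduce classes with only the property above, and jobs fail because the MapReduce application master class is not found. A commonly added property (inside the same <configuration> block; the paths match the /usr/local/hadoop install used in this guide) is:

    <property>
      <name>mapreduce.application.classpath</name>
      <value>/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*</value>
    </property>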

5. Configure yarn-site.xml

  1. Edit yarn-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/yarn-site.xml):

    nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

    Add the following configuration:

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>master-node:8032</value>
      </property>
      <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master-node:8030</value>
      </property>
    </configuration>
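
6. Configure the workers file

  1. The start-dfs.sh and start-yarn.sh scripts launch worker daemons on the hosts listed in /usr/local/hadoop/etc/hadoop/workers (this file replaced the slaves file used in Hadoop 2.x). Edit it on the master node:

    nano /usr/local/hadoop/etc/hadoop/workers

    Replace its contents with the worker hostnames, one per line:

    worker-node-1
    worker-node-2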

Step 5: Set Up Directories for HDFS

  1. On the master node: Create the directory for the NameNode metadata and give your Hadoop user ownership of it (replace user:user with the account and group you run Hadoop as on every node):

    sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
    sudo chown -R user:user /usr/local/hadoop/hadoop_data
  2. On each worker node: Create the directory for DataNode block storage and give it the same ownership:

    sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
    sudo chown -R user:user /usr/local/hadoop/hadoop_data

Step 6: Format the Hadoop Filesystem (Only on Master Node)

Run the following command on the master node to format the NameNode:

hdfs namenode -format
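
A successful format prints a message that the storage directory has been successfully formatted; you can confirm that the metadata directory was initialized with:

    ls /usr/local/hadoop/hadoop_data/hdfs/namenode/current

Avoid re-running the format on a cluster that already holds data: it assigns a new cluster ID, and DataNodes formatted under the old ID will refuse to register until their data directories are cleared.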

Step 7: Start the Hadoop Cluster

  1. Start HDFS: On the master node (this starts the NameNode locally and a DataNode on each host listed in the workers file):

    start-dfs.sh
  2. Start YARN: On the master node (this starts the ResourceManager locally and a NodeManager on each worker):

    start-yarn.sh
  3. Check the status of your Hadoop cluster:

    • NameNode: http://master-node:9870

    • ResourceManager: http://master-node:8088
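
You can also verify the daemons from the command line. On the master node, jps should show NameNode, SecondaryNameNode, and ResourceManager; on each worker, DataNode and NodeManager. A quick check followed by a small sample job (the examples jar path matches the 3.3.0 tarball installed above):

    jps
    hdfs dfsadmin -report
    yarn node -list
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 2 10

hdfs dfsadmin -report should list both DataNodes, and yarn node -list should show both NodeManagers in the RUNNING state.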


Step 8: Stop the Hadoop Cluster

To stop Hadoop, use the following commands:

  1. Stop HDFS:

    stop-dfs.sh
  2. Stop YARN:

    stop-yarn.sh

Troubleshooting Tips

  • SSH issues: Make sure passwordless SSH works correctly between nodes.

  • Permissions: Ensure that the Hadoop user has appropriate permissions for directories.

  • Firewall settings: Check firewall settings if the nodes cannot communicate.

  • Logs: Look into Hadoop log files (located in $HADOOP_HOME/logs) for detailed error messages if something goes wrong.
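
For the firewall point above, a minimal sketch of allowing traffic between the cluster nodes (the 192.168.1.0/24 subnet matches the example addresses in this guide; adjust it to your network):

    # Ubuntu (ufw)
    sudo ufw allow from 192.168.1.0/24

    # CentOS (firewalld)
    sudo firewall-cmd --permanent --zone=trusted --add-source=192.168.1.0/24
    sudo firewall-cmd --reload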

This multi-node setup allows Hadoop to operate efficiently across multiple machines, enabling scalable and distributed data processing.
