Multi-Node Cluster Install

Setting up a multi-node Hadoop cluster involves configuring several nodes (a master and worker nodes) to work together. In this guide, we'll build a three-node cluster: one node acts as the master, running the HDFS NameNode and the YARN ResourceManager, while the other two act as workers, each running a DataNode and a NodeManager. We'll use a Linux-based OS (Ubuntu/CentOS) for this installation.

Pre-requisites

  1. Java 8 or higher installed on all nodes.

  2. Passwordless SSH from the master node to all nodes (configured in Step 1).

  3. All nodes should have static IP addresses.

  4. Network configuration:

    • Ensure that all nodes can communicate with each other via their IP addresses.

    • All nodes should be able to resolve hostnames (hostname should resolve to an IP).
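
A quick way to sanity-check these prerequisites on each node before starting (the address below is one of the example IPs used later in this guide; substitute your own):

    # Java should report version 1.8 or newer
    java -version

    # Confirm this node's hostname and IP address
    hostname
    hostname -I

    # Confirm the other nodes are reachable
    ping -c 2 192.168.1.2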


Step 1: Configure SSH on All Nodes

  1. Generate SSH key on the master node:

    ssh-keygen -t rsa
  2. Copy the public key from the master to every node, including the master itself (replace user and the IPs with your own):

    ssh-copy-id user@master-node-ip
    ssh-copy-id user@worker-node-1-ip
    ssh-copy-id user@worker-node-2-ip

    This enables passwordless SSH from the master node to every node in the cluster, which the Hadoop start scripts rely on.

  3. Test SSH from the master to each node (no password prompt should appear):

    ssh user@master-node-ip
    ssh user@worker-node-1-ip
    ssh user@worker-node-2-ip
  4. Set up a hostname for each node in /etc/hosts (on all nodes):

    Open /etc/hosts and add entries like:

    192.168.1.1    master-node
    192.168.1.2    worker-node-1
    192.168.1.3    worker-node-2

    This allows each node to resolve the names of the others.
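
    After updating /etc/hosts, a quick check from the master node confirms both name resolution and passwordless SSH by hostname:

    getent hosts master-node worker-node-1 worker-node-2
    ssh user@worker-node-1 hostname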


Step 2: Install Java on All Nodes

  1. Install Java 8 or higher (if not already installed) on all nodes:

    On Ubuntu:

    sudo apt update
    sudo apt install openjdk-8-jdk

    On CentOS:

    sudo yum install java-1.8.0-openjdk-devel
  2. Set JAVA_HOME on all nodes:

    • Add the following to ~/.bashrc or ~/.profile (the path below is the typical Ubuntu location; on CentOS it is usually /usr/lib/jvm/java-1.8.0-openjdk):

      export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
      export PATH=$PATH:$JAVA_HOME/bin
    • Reload the profile:

      source ~/.bashrc
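    • Verify the setup; java -version should report 1.8 or newer and echo should print the path set above:

      java -version
      echo $JAVA_HOME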

Step 3: Install Hadoop on All Nodes

  1. Download Hadoop and extract it on all nodes (run the same commands on the master and on each worker). If this mirror no longer carries 3.3.0, older releases are available from the Apache archive at https://archive.apache.org/dist/hadoop/common/:

    wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
    tar -xzvf hadoop-3.3.0.tar.gz
    sudo mv hadoop-3.3.0 /usr/local/hadoop
  2. Set Hadoop environment variables on all nodes: Add the following to the end of ~/.bashrc:

    export HADOOP_HOME=/usr/local/hadoop
    export HADOOP_INSTALL=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
  3. Reload ~/.bashrc:

    source ~/.bashrc
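  4. Verify that the Hadoop binaries are on the PATH on each node:

    hadoop version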

Step 4: Configure Hadoop on All Nodes

1. Configure hadoop-env.sh

  1. Edit hadoop-env.sh on all nodes (/usr/local/hadoop/etc/hadoop/hadoop-env.sh):

    nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

    Add (or uncomment and edit) the line that sets JAVA_HOME, using the same path as in Step 2 (Ubuntu path shown):

    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

2. Configure core-site.xml

  1. Edit core-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/core-site.xml):

    nano /usr/local/hadoop/etc/hadoop/core-site.xml

    Add the following configuration:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master-node:9000</value>
      </property>
    </configuration>
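
    With the environment variables from Step 3 loaded, you can confirm that Hadoop picks up this setting:

    hdfs getconf -confKey fs.defaultFS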

3. Configure hdfs-site.xml

  1. Edit hdfs-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/hdfs-site.xml):

    nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

    Add the following configuration. dfs.replication is set to 2 to match the two DataNodes in this cluster; with the default of 3, every block would be flagged as under-replicated:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
      </property>
    </configuration>

4. Configure mapred-site.xml

  1. Edit mapred-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/mapred-site.xml). In Hadoop 3.x this file already exists, so there is no .template file to copy:

    nano /usr/local/hadoop/etc/hadoop/mapred-site.xml

    Add the following configuration:

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>
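
    Note: on Hadoop 3.x, YARN containers sometimes cannot locate the MapReduce classes with only the property above, and jobs fail because the MapReduce application master class is not found. A commonly added property (inside the same <configuration> block; the paths match the /usr/local/hadoop install used in this guide) is:

    <property>
      <name>mapreduce.application.classpath</name>
      <value>/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*</value>
    </property>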

5. Configure yarn-site.xml

  1. Edit yarn-site.xml on all nodes (/usr/local/hadoop/etc/hadoop/yarn-site.xml):

    nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

    Add the following configuration:

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>master-node:8032</value>
      </property>
      <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master-node:8030</value>
      </property>
    </configuration>
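
6. Configure the workers file

  1. The start-dfs.sh and start-yarn.sh scripts launch worker daemons on the hosts listed in /usr/local/hadoop/etc/hadoop/workers (this file replaced the slaves file used in Hadoop 2.x). Edit it on the master node:

    nano /usr/local/hadoop/etc/hadoop/workers

    Replace its contents with the worker hostnames, one per line:

    worker-node-1
    worker-node-2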

Step 5: Set Up Directories for HDFS

  1. On the master node: Create the directory for the NameNode metadata and give your Hadoop user ownership of it (replace user:user with the account and group you run Hadoop as on every node):

    sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
    sudo chown -R user:user /usr/local/hadoop/hadoop_data
  2. On each worker node: Create the directory for DataNode block storage and give it the same ownership:

    sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
    sudo chown -R user:user /usr/local/hadoop/hadoop_data

Step 6: Format the Hadoop Filesystem (Only on Master Node)

Run the following command on the master node to format the NameNode:

hdfs namenode -format
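
A successful format prints a message that the storage directory has been successfully formatted; you can confirm that the metadata directory was initialized with:

    ls /usr/local/hadoop/hadoop_data/hdfs/namenode/current

Avoid re-running the format on a cluster that already holds data: it assigns a new cluster ID, and DataNodes formatted under the old ID will refuse to register until their data directories are cleared.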

Step 7: Start the Hadoop Cluster

  1. Start HDFS: On the master node (this starts the NameNode locally and a DataNode on each host listed in the workers file):

    start-dfs.sh
  2. Start YARN: On the master node (this starts the ResourceManager locally and a NodeManager on each worker):

    start-yarn.sh
  3. Check the status of your Hadoop cluster:

    • NameNode: http://master-node:9870

    • ResourceManager: http://master-node:8088
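
You can also verify the daemons from the command line. On the master node, jps should show NameNode, SecondaryNameNode, and ResourceManager; on each worker, DataNode and NodeManager. A quick check followed by a small sample job (the examples jar path matches the 3.3.0 tarball installed above):

    jps
    hdfs dfsadmin -report
    yarn node -list
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 2 10

hdfs dfsadmin -report should list both DataNodes, and yarn node -list should show both NodeManagers in the RUNNING state.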


Step 8: Stop the Hadoop Cluster

To stop Hadoop, use the following commands:

  1. Stop HDFS:

    stop-dfs.sh
  2. Stop YARN:

    stop-yarn.sh

Troubleshooting Tips

  • SSH issues: Make sure passwordless SSH works correctly between nodes.

  • Permissions: Ensure that the Hadoop user has appropriate permissions for directories.

  • Firewall settings: Check firewall settings if the nodes cannot communicate.

  • Logs: Look into Hadoop log files (located in $HADOOP_HOME/logs) for detailed error messages if something goes wrong.
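
For the firewall point above, a minimal sketch of allowing traffic between the cluster nodes (the 192.168.1.0/24 subnet matches the example addresses in this guide; adjust it to your network):

    # Ubuntu (ufw)
    sudo ufw allow from 192.168.1.0/24

    # CentOS (firewalld)
    sudo firewall-cmd --permanent --zone=trusted --add-source=192.168.1.0/24
    sudo firewall-cmd --reload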

This multi-node setup allows Hadoop to operate efficiently across multiple machines, enabling scalable and distributed data processing.
