Single Node Install

Here is a step-by-step guide for installing Hadoop on a single-node setup on a Linux-based operating system (e.g., Ubuntu or CentOS), as Hadoop runs best on Linux.

Pre-requisites

  1. Java should be installed on the machine (Java 8 is the safest choice for Hadoop 3.x; Hadoop 3.3 and later also support Java 11 at runtime).

  2. SSH should be configured for passwordless login to localhost (required even on a single node); a minimal setup is sketched after this list.

  3. Linux distribution (Ubuntu, CentOS, etc.)
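
A minimal way to set up passwordless SSH to localhost, assuming OpenSSH is installed and you run Hadoop as your current login user:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost   # should log in without prompting for a password

If ssh localhost still asks for a password, check that the sshd service is running and that ~/.ssh has 700 permissions.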

Step 1: Install Java

Hadoop requires Java to be installed. To install Java, follow these steps:

  1. Check Java version:

    java -version

    If Java is already installed, you’ll see the version. If not, proceed with installation.

  2. Install Java (OpenJDK 8):

    • On Ubuntu:

      sudo apt update
      sudo apt install openjdk-8-jdk
    • On CentOS:

      sudo yum install java-1.8.0-openjdk-devel
  3. Set the JAVA_HOME environment variable: Find the Java path using:

    update-alternatives --config java

    Copy the installation directory from the path shown (drop the trailing /jre/bin/java or /bin/java part, e.g. /usr/lib/jvm/java-8-openjdk-amd64), and add the following to your ~/.bashrc or ~/.profile file:

    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
    export PATH=$PATH:$JAVA_HOME/bin
  4. Reload the bash profile:

    source ~/.bashrc
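
As a quick sanity check after reloading the profile (assuming the exports above went into the file you sourced), both the variable and the Java binary should now resolve:

echo $JAVA_HOME
java -version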

Step 2: Download Hadoop

  1. Go to the official Apache Hadoop releases page and download a stable version. This example uses Hadoop 3.3.0.

    Alternatively, use the wget command to download it directly (releases that have dropped off downloads.apache.org are kept at archive.apache.org):

    wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
  2. Extract the Hadoop archive:

    tar -xzvf hadoop-3.3.0.tar.gz
  3. Move the extracted folder to /usr/local/hadoop:

    sudo mv hadoop-3.3.0 /usr/local/hadoop
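
Because the archive was extracted by your user but moved with sudo, it is worth confirming that your user (not root) still owns the installation; the daemons in this guide run as a regular user. A hedged fix if anything under /usr/local/hadoop ended up owned by root:

sudo chown -R $USER:$USER /usr/local/hadoop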

Step 3: Set Hadoop Environment Variables

You need to set environment variables to tell the system where Hadoop is located.

  1. Open ~/.bashrc (or ~/.profile depending on your system):

    nano ~/.bashrc
  2. Add the following lines at the end of the file:

    # Set Hadoop-related environment variables
    export HADOOP_HOME=/usr/local/hadoop
    export HADOOP_INSTALL=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
  3. Apply the changes:

    source ~/.bashrc
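
With the new variables loaded, the hadoop command should resolve from any directory; a quick check:

hadoop version

If the command is not found, open a new terminal or re-run source ~/.bashrc.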

Step 4: Configure Hadoop

  1. Navigate to the Hadoop configuration directory:

    cd /usr/local/hadoop/etc/hadoop
  2. Edit hadoop-env.sh: Open hadoop-env.sh for editing:

    nano hadoop-env.sh

    Add (or uncomment and edit) the following line so Hadoop can find your Java installation:

    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  3. Edit core-site.xml: Open core-site.xml for editing:

    nano core-site.xml

    Add the following configuration:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
  4. Edit hdfs-site.xml: Open hdfs-site.xml for editing:

    nano hdfs-site.xml

    Add the following configuration:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
      </property>
    </configuration>
  5. Edit mapred-site.xml: In Hadoop 3.x this file already exists in the configuration directory; only older 2.x releases ship just a template, which must be copied first (cp mapred-site.xml.template mapred-site.xml). Open it for editing:

    nano mapred-site.xml

    Add the following configuration:

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>
  6. Edit yarn-site.xml: Open yarn-site.xml for editing:

    nano yarn-site.xml

    Add the following configuration:

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>localhost:8025</value>
      </property>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>localhost:8032</value>
      </property>
      <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>localhost:8030</value>
      </property>
    </configuration>
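
Note: when you later run MapReduce jobs on YARN with Hadoop 3.x, you may also need the MapReduce classpath and an environment whitelist configured, as shown in the official single-node setup guide. The snippet below is a sketch of those two extra properties (the first goes inside the <configuration> block of mapred-site.xml, the second inside yarn-site.xml); treat it as optional until a job fails with class-not-found errors.

<!-- mapred-site.xml -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>

<!-- yarn-site.xml -->
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_MAPRED_HOME,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ</value>
</property>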

Step 5: Format HDFS

  1. Create the directories where Hadoop will store its data, and make sure your user owns them (they are created as root, but the Hadoop daemons run as your user):

    sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
    sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
    sudo chown -R $USER:$USER /usr/local/hadoop/hadoop_data
  2. Format the Hadoop filesystem (do this only once on a fresh install; reformatting wipes the NameNode metadata):

    hdfs namenode -format
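
If the format succeeded, the NameNode directory configured in hdfs-site.xml should now contain a current/ subdirectory with a VERSION file:

ls /usr/local/hadoop/hadoop_data/hdfs/namenode/current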

Step 6: Start Hadoop

  1. Start the Hadoop Distributed File System (HDFS):

    start-dfs.sh
  2. Start YARN:

    start-yarn.sh
  3. Verify the installation: Run jps and confirm that NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager are all running. You can also check the web UIs: the HDFS NameNode at http://localhost:9870 and the YARN ResourceManager at http://localhost:8088 (the default ports in Hadoop 3.x).
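
Once the daemons are up, a small smoke test confirms that HDFS accepts writes (the /user/$USER directory used here is just an illustrative choice):

hdfs dfs -mkdir -p /user/$USER
hdfs dfs -put /usr/local/hadoop/etc/hadoop/core-site.xml /user/$USER/
hdfs dfs -ls /user/$USER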

Step 7: Stop Hadoop

To stop the Hadoop services, run:

stop-dfs.sh
stop-yarn.sh

Troubleshooting

  • Java version issues: Ensure that you're using Java 8 (or Java 11 with Hadoop 3.3+); newer Java versions are not supported by Hadoop 3.x and commonly cause startup failures.

  • Permissions: Ensure that the user running Hadoop owns the installation and data directories so the daemons can read and write to them.

  • Firewall: If running Hadoop on a cluster or multiple nodes, ensure the necessary ports are open.

This setup assumes a single-node Hadoop cluster. For a multi-node cluster, you would additionally need to list the worker hosts in the workers file (called slaves in Hadoop 2.x), point the workers at the master's NameNode and ResourceManager, and configure the other services accordingly.
