# Overview

**YARN** (Yet Another Resource Negotiator) is a key component of the **Apache Hadoop** ecosystem. It improves Hadoop's scalability, resource management, and job scheduling. By decoupling resource management and job scheduling from the MapReduce framework, YARN provides a more flexible and scalable architecture that lets diverse processing frameworks, such as Apache Spark and Apache Tez, run on top of Hadoop's distributed storage.

#### **Key Components of YARN**

YARN has a distributed and modular architecture with the following key components:

**1. ResourceManager (RM)**

* **Role**: The ResourceManager is the master daemon responsible for managing resources (memory, CPU) across the cluster and for job scheduling. It ensures that each application gets the resources it needs.
* **Components**:
  * **Scheduler**: Allocates resources to running applications according to the configured policy (e.g., capacity scheduling or fair scheduling). It is a pure scheduler: it does **not** monitor or track the status of applications.
  * **ApplicationsManager**: Accepts job submissions, negotiates the first container for each application's ApplicationMaster, and restarts the ApplicationMaster on failure.

**2. NodeManager (NM)**

* **Role**: The NodeManager is a per-node daemon that runs on each worker node in the Hadoop cluster. It monitors the resource usage (memory, CPU) of containers running on the node and reports this back to the ResourceManager.
* **Responsibilities**:
  * Launch and monitor containers (where tasks are executed).
  * Report the resource status and health of the node to the ResourceManager.
  * Manage the lifecycle of containers (creation, execution, and termination).

**3. ApplicationMaster (AM)**

* **Role**: Each application (MapReduce job, Spark job, etc.) running on the cluster has its own instance of the **ApplicationMaster**. The AM is responsible for negotiating resources with the ResourceManager, working with the NodeManager to execute tasks, and ensuring that tasks are completed successfully.
* **Responsibilities**:
  * Request resources from the ResourceManager (RM).
  * Monitor the progress of tasks running in containers.
  * Handle failures (e.g., reschedule tasks if necessary).

**4. Container**

* **Role**: A container is the unit of resource allocation in YARN. It encapsulates the resources (memory, CPU, etc.) granted on a specific node to execute a task.
* **Characteristics**:
  * Runs a specific task (a MapReduce task, a Spark executor, etc.) as part of an application.
  * Is launched and managed by the NodeManager on its worker node.
  * Is sized according to the resource requirements requested by the application.
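The containers of a running application can be inspected with the `yarn` CLI. This is a sketch; the application attempt ID below is a placeholder you would obtain from your own cluster:

```
# List the containers (ID, state, host) of one application attempt
# (the attempt ID is a placeholder from your own cluster)
yarn container -list appattempt_1526473911506_0001_000001
```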

#### **YARN Architecture**

The YARN architecture consists of several key interactions between the components:

1. The **ResourceManager** (RM) manages all the cluster resources and coordinates the allocation of these resources for running applications.
2. The **NodeManager** (NM) is responsible for maintaining the health and status of resources on each individual node. It manages the execution of containers on worker nodes.
3. Each application has its own **ApplicationMaster** (AM) that works with the RM and NodeManager to request resources and manage job execution.
4. **Containers** are launched by the NodeManager on the worker nodes and run tasks as part of an application.

When a job is submitted, the flow works as follows:

1. The **Client** submits the application to the **ResourceManager**.
2. The **ResourceManager** allocates a container on a **NodeManager** (based on available resources and policies) and launches the application's **ApplicationMaster** in it.
3. The **ApplicationMaster** registers with the ResourceManager and requests additional containers for the application's tasks.
4. The **NodeManagers** launch those containers and execute the tasks assigned by the **ApplicationMaster**, which monitors progress and handles failures or retries.
5. Once the application is complete, the **ApplicationMaster** deregisters and shuts down, and the job is marked as finished.

#### **YARN Resource Management**

YARN manages resources through two primary scheduling mechanisms:

1. **CapacityScheduler**: The CapacityScheduler is designed for multi-tenant clusters. It allows for resource allocation based on the configured capacity of each user or application. This ensures that no single user or application consumes all resources, and it divides resources into queues for different teams or jobs.
2. **FairScheduler**: The FairScheduler shares resources so that, over time, all running jobs or users receive an equal share of the cluster. When resources sit idle, the scheduler assigns them to waiting jobs to keep cluster utilization high and fair.
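As an illustration, CapacityScheduler queues are defined in `capacity-scheduler.xml`. The queue names and percentages below are hypothetical; the capacities under one parent queue must sum to 100:

```
<!-- Hypothetical example: two queues under root -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>dev,prod</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
```

After editing the file, the queues can be reloaded without restarting the cluster via `yarn rmadmin -refreshQueues`.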

#### **Advantages of YARN**

1. **Resource Scalability**: YARN can scale to thousands of nodes and applications, enabling large Hadoop clusters to efficiently run different applications in parallel.
2. **Improved Resource Utilization**: By separating resource management from job scheduling, YARN allows different frameworks (such as MapReduce, Spark, Tez) to share resources in a more efficient and flexible manner.
3. **Multi-framework Support**: YARN can run a variety of processing frameworks beyond just MapReduce, such as Apache Spark, Apache Tez, and others.
4. **Fault Tolerance**: YARN helps to handle failures at the application or task level. If a task or container fails, the system can retry the task or reschedule it on another node.
5. **Multi-Tenancy**: YARN supports multi-tenant environments, allowing multiple users and applications to share the same cluster while providing isolation and fair sharing of resources.
6. **Centralized Resource Management**: By centralizing the management of resources across a cluster, YARN provides better visibility and control over the resource allocation for all applications.


**YARN**: In Hadoop 2.x, YARN separates job scheduling and resource management. The ResourceManager handles resource management, and each application has its own ApplicationMaster. This separation improves scalability and flexibility, allowing Hadoop to run various other frameworks besides MapReduce.

#### **YARN in Action (Workflow)**

1. A **client** submits a job to the **ResourceManager**.
2. The **ResourceManager** allocates resources (based on the available cluster capacity and policies) on worker nodes, where they are managed by the **NodeManager**.
3. The **ApplicationMaster** negotiates resources with the ResourceManager and monitors the progress of the application.
4. **NodeManager** launches containers on worker nodes to run tasks as assigned by the ApplicationMaster.
5. Once tasks are complete, the **ApplicationMaster** signals completion and releases the allocated resources.

### Run YARN

HDFS is a distributed storage system; by itself it does not provide any services for running or scheduling tasks within the cluster. That is YARN's job. The following section describes starting, monitoring, and submitting jobs to YARN.

**Start and Stop YARN**

* Start YARN by running the following script on **node-master**:

```
start-yarn.sh
```

* Check that everything is running with the `jps` command. In addition to the HDFS daemons, you should see a **ResourceManager** on **node-master**, and a **NodeManager** on **node1** and **node2**.
* To stop YARN, run the following command on **node-master**:

```
stop-yarn.sh
```
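**Submit a Job to YARN**

A simple way to check that YARN can schedule work is to run one of the example jobs shipped with Hadoop. This is a sketch; the JAR path and version depend on your installation:

```
# Estimate pi with the bundled MapReduce examples JAR
# (adjust the path and version to match your Hadoop installation)
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 16 1000
```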

**Monitor YARN**

The `yarn` command provides utilities to manage the YARN cluster. Print a report of running nodes with the command:

```
yarn node -list
```

List running applications with:

```
yarn application -list
```
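Beyond listing nodes and applications, the `yarn` CLI can drill into a single application. The application ID below is a placeholder; take a real one from the `yarn application -list` output:

```
# Show the state, queue, and tracking URL of one application
# (the application ID is a placeholder from your own cluster)
yarn application -status application_1526473911506_0001

# Fetch the aggregated logs of a finished application
yarn logs -applicationId application_1526473911506_0001
```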

For additional details, visit the [YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) page on the Apache Hadoop website.

