Overview
YARN (Yet Another Resource Negotiator) is a key component of the Apache Hadoop ecosystem. It improves the scalability, resource management, and job scheduling capabilities of Hadoop. YARN decouples the resource management and job scheduling functionalities from the MapReduce framework, allowing Hadoop to support more diverse processing frameworks, such as Apache Spark, Tez, and others.
YARN separates resource management and job scheduling concerns, providing a more flexible and scalable architecture for running various applications on top of Hadoop's distributed storage.
Key Components of YARN
YARN has a distributed and modular architecture with the following key components:
1. ResourceManager (RM)
Role: The ResourceManager is the master daemon responsible for managing resources (memory, CPU) across the cluster and for job scheduling. It ensures that each application gets the resources it needs.
The ResourceManager has two main components:
Scheduler: Allocates resources to running applications according to configured policies (e.g., the CapacityScheduler or FairScheduler). It is a pure scheduler: it performs no monitoring or tracking of application status and offers no guarantees about restarting failed tasks.
ApplicationsManager: Accepts job submissions, negotiates the first container in which each application's ApplicationMaster runs, and restarts the ApplicationMaster container if it fails.
2. NodeManager (NM)
Role: The NodeManager is a per-node daemon that runs on each worker node in the Hadoop cluster. It monitors the resource usage (memory, CPU) of containers running on the node and reports this back to the ResourceManager.
Responsibilities:
Launch and monitor containers (where tasks are executed).
Report the resource status and health of the node to the ResourceManager.
Manage the lifecycle of containers (creation, execution, and termination).
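As a sketch of how a NodeManager advertises its resources, the following yarn-site.xml fragment sets the memory and vcores the daemon offers to the ResourceManager for running containers; the values shown (8 GB, 4 vcores) are illustrative and should match the worker node's actual hardware.

```xml
<!-- yarn-site.xml on each worker node: resources the NodeManager offers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- illustrative: total memory (MB) available for containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value> <!-- illustrative: total virtual cores available for containers -->
</property>
```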
3. ApplicationMaster (AM)
Role: Each application (MapReduce job, Spark job, etc.) running on the cluster has its own instance of the ApplicationMaster. The AM is responsible for negotiating resources with the ResourceManager, working with the NodeManager to execute tasks, and ensuring that tasks are completed successfully.
Responsibilities:
Request resources from the ResourceManager (RM).
Monitor the progress of tasks running in containers.
Handle failures (e.g., reschedule tasks if necessary).
4. Container
Role: A container is a unit of resource allocation in YARN. A container encapsulates the resources (memory, CPU, etc.) needed to execute a task.
Responsibilities:
Run a specific task (MapReduce task, Spark job, etc.) as part of an application.
Managed by NodeManager on each worker node.
Containers are allocated based on the resource requirements of the application.
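Container sizes are bounded by the scheduler's minimum and maximum allocation settings in yarn-site.xml; a resource request is rounded up to a multiple of the minimum allocation. The values below are illustrative:

```xml
<!-- yarn-site.xml: bounds on the resources a single container may be granted -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value> <!-- illustrative: smallest container the RM will allocate -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value> <!-- illustrative: largest container the RM will allocate -->
</property>
```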
YARN Architecture
The YARN architecture consists of several key interactions between the components:
The ResourceManager (RM) manages all the cluster resources and coordinates the allocation of these resources for running applications.
The NodeManager (NM) is responsible for maintaining the health and status of resources on each individual node. It manages the execution of containers on worker nodes.
Each application has its own ApplicationMaster (AM) that works with the RM and NodeManager to request resources and manage job execution.
Containers are launched by the NodeManager on the worker nodes and run tasks as part of an application.
When a job is submitted, the flow works as follows:
The client submits the job to the ResourceManager.
The ResourceManager allocates a container on one of the NodeManagers and launches the application's ApplicationMaster in it.
The ApplicationMaster negotiates further resources with the ResourceManager, based on available capacity and scheduling policies, and handles failures or retries.
The NodeManagers launch the granted containers on their nodes and execute the tasks assigned by the ApplicationMaster.
Once the application is complete, the ApplicationMaster shuts down, and the job is marked as finished.
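To watch this flow end to end, you can submit one of the example MapReduce jobs that ships with Hadoop. This assumes HADOOP_HOME points at your installation; the examples jar version in the path varies by release.

```shell
# Submit the bundled Pi-estimation example; YARN launches an ApplicationMaster,
# which then requests containers for the map and reduce tasks.
yarn jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 16 1000
```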
YARN Resource Management
YARN manages resources through two primary scheduling mechanisms:
CapacityScheduler: The CapacityScheduler is designed for multi-tenant clusters. It allows for resource allocation based on the configured capacity of each user or application. This ensures that no single user or application consumes all resources, and it divides resources into queues for different teams or jobs.
FairScheduler: The FairScheduler aims to give all running applications, on average, an equal share of cluster resources over time. A single job can use the entire cluster when it runs alone; as other jobs are submitted, resources that free up are assigned to them so that each job receives roughly the same share.
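The active scheduler is selected in yarn-site.xml via the yarn.resourcemanager.scheduler.class property. For example, to select the CapacityScheduler:

```xml
<!-- yarn-site.xml: choose the scheduler implementation -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<!-- For the FairScheduler, use
     org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler -->
```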
Advantages of YARN
Resource Scalability: YARN can scale to thousands of nodes and applications, enabling large Hadoop clusters to efficiently run different applications in parallel.
Improved Resource Utilization: By separating resource management from job scheduling, YARN allows different frameworks (such as MapReduce, Spark, Tez) to share resources in a more efficient and flexible manner.
Multi-framework Support: YARN can run a variety of processing frameworks beyond just MapReduce, such as Apache Spark, Apache Tez, and others.
Fault Tolerance: YARN helps to handle failures at the application or task level. If a task or container fails, the system can retry the task or reschedule it on another node.
Multi-Tenancy: YARN supports multi-tenant environments, allowing multiple users and applications to share the same cluster while providing isolation and fair sharing of resources.
Centralized Resource Management: By centralizing the management of resources across a cluster, YARN provides better visibility and control over the resource allocation for all applications.
In summary: since Hadoop 2.x, YARN separates job scheduling from resource management. The ResourceManager handles cluster-wide resource management, while each application runs its own ApplicationMaster. This separation improves scalability and flexibility and lets Hadoop run frameworks beyond MapReduce.
YARN in Action (Workflow)
A client submits a job to the ResourceManager.
The ResourceManager allocates resources (based on the available cluster capacity and policies) and assigns them to the NodeManager on worker nodes.
The ApplicationMaster negotiates resources with the ResourceManager and monitors the progress of the application.
NodeManager launches containers on worker nodes to run tasks as assigned by the ApplicationMaster.
Once tasks are complete, the ApplicationMaster signals completion and releases the allocated resources.
Run YARN
HDFS is a distributed storage system; by itself it provides no services for running or scheduling tasks within the cluster. That role falls to YARN. The following section describes starting and stopping YARN, monitoring it, and submitting jobs to it.
Start and Stop YARN
Start YARN with the script:
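Assuming the standard Hadoop sbin scripts are on node-master's PATH, the usual command is:

```shell
# Starts the ResourceManager on this node and a NodeManager on each worker
start-yarn.sh
```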
Check that everything is running with the jps command. In addition to the HDFS daemons started previously, you should see a ResourceManager on node-master, and a NodeManager on node1 and node2.
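Run this on each node (it assumes the JDK's jps tool is on the PATH):

```shell
# List the running Java daemons on this node
jps
```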
To stop YARN, run the following command on node-master:
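As with startup, this assumes the Hadoop sbin scripts are on the PATH:

```shell
# Stops the NodeManagers on the workers and the ResourceManager on this node
stop-yarn.sh
```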
Monitor YARN
The yarn command provides utilities to manage the YARN cluster. Print a report of running nodes with the command:
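On a running cluster, this prints one line per NodeManager with its state and container count:

```shell
# Show the status of every NodeManager registered with the ResourceManager
yarn node -list
```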
List running applications
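Applications are listed with the yarn application subcommand; the -appStates flag filters by application state:

```shell
# List applications currently running on the cluster
yarn application -list
# Include finished and failed applications as well
yarn application -list -appStates ALL
```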