Concepts

Apache Spark can run on different types of clusters. The applications run as independent sets of processes on a cluster and are coordinated by the SparkContext object in the driver program.

Cluster Manager Types

We support the following cluster managers:

Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
Hadoop YARN – Resource manager in Hadoop 2.
Kubernetes – System for automating deploying, and managing of containerized applications.

Spark Cluster concepts

Description

Term

A user program built on Spark. This consists of a driver program and executors on the cluster.

Application

The application jar contains the user's Spark application. This can be a "fat" or "uber" jar containing all the dependencies.

Application jar

The process which launches the application and creates the SparkContext

Driver program

An external service for acquiring resources on the cluster (e.g. Kubernetes, YARN)

Cluster manager

Used to determine where the driver process runs. For example, in "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.

Deploy mode

Any node that can run application code in the cluster

Worker node

A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.

Executor

A unit of work that will be sent to one executor

Task

A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action

Job

Each job is divided into smaller sets of tasks called stages which depend on each other. Similar to MapReduce

Stage

PreviousSpark DataHub NextCluster Setup

Last updated 3 years ago