Concepts
Apache Spark can run on several types of clusters. Applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program.
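As a minimal sketch of this arrangement (the application name and dataset here are illustrative), a driver program creates the SparkContext, which connects to the cluster manager and schedules work on the executors it acquires:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MinimalDriver {
  def main(args: Array[String]): Unit = {
    // The driver program: creating the SparkContext connects to the
    // cluster manager and acquires executors for this application.
    val conf = new SparkConf().setAppName("minimal-driver")
    val sc = new SparkContext(conf)

    // Work expressed against the SparkContext is split into tasks
    // that run on the executors.
    val total = sc.parallelize(1 to 100).sum()
    println(s"sum = $total")

    sc.stop() // releases the application's executors
  }
}
```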
Cluster Manager Types
We support the following cluster managers, each selected through a master URL (see the sketch after this list):
Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
Hadoop YARN – the resource manager in Hadoop 2 and later.
Kubernetes – an open-source system for automating the deployment, scaling, and management of containerized applications.
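In practice the master URL is usually supplied via spark-submit's --master flag rather than hard-coded; the sketch below shows the URL format for each manager, with placeholder host names and ports:

```scala
import org.apache.spark.SparkConf

// Master URL formats for each cluster manager type; the hosts and
// ports below are placeholders.
val standalone = new SparkConf().setMaster("spark://master-host:7077")
val yarn       = new SparkConf().setMaster("yarn")
val kubernetes = new SparkConf().setMaster("k8s://https://k8s-apiserver:6443")

// Where the driver itself runs is a separate choice (see "Deploy mode"
// in the glossary below), configurable alongside the master.
val clusterMode = new SparkConf()
  .setMaster("yarn")
  .set("spark.submit.deployMode", "cluster")
```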
Spark Cluster Concepts
Application – A user program built on Spark, consisting of a driver program and executors on the cluster.
Application jar – A jar containing the user's Spark application. This can be a "fat" or "uber" jar containing all the dependencies.
Driver program – The process that launches the application and creates the SparkContext.
Cluster manager – An external service for acquiring resources on the cluster (e.g. Kubernetes, YARN).
Deploy mode – Determines where the driver process runs. In "cluster" mode, the framework launches the driver inside the cluster; in "client" mode, the submitter launches the driver outside the cluster.
Worker node – Any node that can run application code in the cluster.
Executor – A process launched for an application on a worker node that runs tasks and keeps data in memory or on disk across them. Each application has its own executors.
Task – A unit of work that is sent to one executor.
Job – A parallel computation consisting of multiple tasks, spawned in response to a Spark action (e.g. save, collect).
Stage – Each job is divided into smaller sets of tasks called stages that depend on each other, similar to the map and reduce stages in MapReduce. The sketch after this list shows how jobs, stages, and tasks arise in code.
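To make the last few terms concrete, here is a hypothetical word count (the input path is a placeholder) annotated with where jobs, stages, and tasks come from:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("glossary-demo"))

val counts = sc.textFile("hdfs:///logs/app.log") // placeholder path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _) // the shuffle here marks a stage boundary

// count() is an action: it spawns one job, which is divided into two
// stages (before and after the shuffle); each stage runs one task per
// partition on the executors.
val n = counts.count()
```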