Cluster Setup
Installing Spark Standalone on a Cluster
Download the Spark tarball from the Apache Spark downloads site
Add the Spark binaries to the user's PATH
To install Spark in Standalone mode, place the Spark artifacts on each node of the cluster.
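As a sketch, the steps above might look like the following on each node. The version number and install path are assumptions; substitute the release that matches your cluster:

```shell
# Assumed version and install prefix -- adjust to your environment.
SPARK_VERSION=3.5.1
INSTALL_DIR=/opt

# Download and unpack the prebuilt tarball from the Apache download site.
curl -O https://downloads.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz
tar -xzf spark-${SPARK_VERSION}-bin-hadoop3.tgz -C ${INSTALL_DIR}

# Add the Spark binaries to the user's PATH.
export SPARK_HOME=${INSTALL_DIR}/spark-${SPARK_VERSION}-bin-hadoop3
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
```

Repeat (or automate) these steps on every node so the same Spark layout is available cluster-wide.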
Understand Client and Cluster Mode
Spark jobs can be run on YARN in two modes: cluster mode and client mode. Understanding the difference between the two is important for choosing the correct memory allocation configuration and for submitting jobs as expected.
A Spark job essentially consists of two parts: the Spark Executors, which run the actual tasks, and a Spark Driver, which schedules tasks on the Executors.
Cluster mode: Everything runs inside the cluster. The job can be started from the edge node and continues running even if you log out. In this mode, the Spark Driver is encapsulated inside the YARN Application Master.
Client mode: The Spark Driver runs on a client machine, such as your local computer. If the client is shut down, the job fails. The Spark Executors still run on the cluster, and a small YARN Application Master is created to schedule them.
Client mode is suitable for interactive jobs, but the application fails if the client stops. For long-running jobs, use cluster mode.
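To make the distinction concrete, the deploy mode is selected with the `--deploy-mode` flag of `spark-submit`. The application jar and class names below are placeholders:

```shell
# Cluster mode: the Driver runs inside the YARN Application Master,
# so the job survives the submitting session logging out.
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp myapp.jar

# Client mode: the Driver runs in this shell; if this shell or
# machine goes away, the job fails. This is the mode used for
# interactive sessions such as spark-shell.
spark-submit --master yarn --deploy-mode client \
  --class com.example.MyApp myapp.jar
```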
Starting the Cluster
Start a standalone master server by executing:
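Spark ships the launch script in its `sbin` directory:

```shell
# Start the standalone master on this machine. It logs a
# spark://HOST:PORT URL that workers use to connect, also visible
# on the master's web UI (port 8080 by default).
./sbin/start-master.sh
```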
Next, start one or more workers and connect them to the master via:
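On each worker node, pass the master's URL to the worker launch script (in Spark releases before 3.1 this script was named `start-slave.sh`):

```shell
# Start a worker and register it with the running master.
# Replace <master-host>:7077 with the URL logged by the master.
./sbin/start-worker.sh spark://<master-host>:7077
```

Once a worker registers, it appears in the master's web UI along with its available cores and memory.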