Cluster Setup

Installing Spark Standalone to a Cluster

Download the Spark tarball from the downloads site, then extract it under /opt/invariant and rename the directory:

cd /opt/invariant 
tar -xvzf spark-datahub-3.2.0.tgz
mv spark-datahub-3.2.0 spark

Add the Spark binaries to the user's PATH.
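One way to do this, as a minimal sketch: it assumes Spark was unpacked to /opt/invariant/spark as in the step above, and the SPARK_HOME variable name is a common convention rather than something this page prescribes. Adding the lines to the user's shell profile (for example ~/.bashrc) would make them persistent.

```shell
# Assumption: Spark lives in /opt/invariant/spark (see the extraction step).
export SPARK_HOME=/opt/invariant/spark

# bin/ holds spark-submit and spark-shell; sbin/ holds the cluster start/stop scripts.
export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"
```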


To install Spark in Standalone mode, place a copy of the Spark artifacts on each node of the cluster.
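Copying the artifacts to every node could be sketched as follows. The host names are hypothetical, rsync over SSH is only one possible transport, and the loop prints the commands instead of running them so that it can be reviewed first.

```shell
# Assumption: the same install path is used on every node.
SPARK_DIR=/opt/invariant/spark

# Hypothetical host names; replace with the nodes of your cluster.
NODES="node1 node2 node3"

# Print the rsync command for one node (dry run, nothing is copied).
sync_cmd() {
  echo "rsync -az ${SPARK_DIR}/ $1:${SPARK_DIR}/"
}

for node in $NODES; do
  sync_cmd "$node"
done
```

Once the printed commands look right, they can be run directly or the echo can be removed.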

Understand Client and Cluster Mode

Spark jobs can be run on YARN in two modes: cluster mode and client mode. It is important to understand the difference between the two in order to choose the correct memory allocation settings and to submit jobs as intended.

A Spark job essentially consists of two parts: the Spark Executors, which run the actual tasks, and the Spark Driver, which schedules work on the Executors.

  • Cluster mode: Everything runs inside the cluster. You can start the job from an edge node, and it keeps running even if you log out. In this mode, the Spark Driver is encapsulated inside the YARN Application Master.

  • Client mode: The Spark driver runs on a client, such as your local computer. In this case, if the client is shut down, the job fails. Spark Executors still run on the cluster, and to schedule everything, a small YARN Application Master is created.

Client mode is suitable for interactive jobs, but applications fail if the client stops. For long-running jobs, use cluster mode.
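The mode is selected at submission time with the --deploy-mode option of spark-submit. The sketch below only prints the command it would run; the helper function names, com.example.Main and app.jar are hypothetical placeholders, not part of this setup.

```shell
# Hypothetical helper: map a job type to a deploy mode.
deploy_mode_for() {
  case "$1" in
    interactive) echo "client" ;;   # driver stays on the submitting machine
    *)           echo "cluster" ;;  # driver runs inside the YARN Application Master
  esac
}

# Print (rather than run) the spark-submit command for a job type.
# com.example.Main and app.jar are placeholders for your own application.
submit_cmd_for() {
  echo "spark-submit --master yarn --deploy-mode $(deploy_mode_for "$1") --class com.example.Main app.jar"
}

submit_cmd_for batch
submit_cmd_for interactive
```

Treating everything that is not interactive as a cluster-mode job is a design choice of this sketch; it mirrors the guidance above that long-running jobs should not depend on the client staying up.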

Starting the Cluster

Start a standalone master server by executing:


Once started, the master prints a spark://HOST:PORT URL for itself, which you can use to connect workers to it or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.
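The master start-up can be sketched as follows. Stock Spark distributions ship the launch script as sbin/start-master.sh, and 7077 is the default master port. The install path and the way the host name is derived here are assumptions, and the script is only invoked if it exists at that path; the URL printed in the master's own log remains the authoritative one.

```shell
# Assumption: Spark is installed under /opt/invariant/spark unless SPARK_HOME is set.
SPARK_HOME="${SPARK_HOME:-/opt/invariant/spark}"

# 7077 is the default standalone master port.
MASTER_HOST="$(hostname 2>/dev/null || echo localhost)"
MASTER_URL="spark://${MASTER_HOST}:7077"

# Start the master only if the launch script is actually present.
if [ -x "$SPARK_HOME/sbin/start-master.sh" ]; then
  "$SPARK_HOME/sbin/start-master.sh"
fi

echo "Expected master URL: $MASTER_URL"
```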

Next, start one or more workers and connect them to the master via:

./sbin/ <master-spark-URL>

Once you have started a worker, open the master’s web UI (http://localhost:8080). You should see the new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).
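A sketch of the worker step, under stated assumptions: recent Spark releases name the worker launch script sbin/start-worker.sh (older releases used a different name), the master URL below is a placeholder, and worker_mem_mb is a hypothetical helper that only illustrates the one-gigabyte rule described above.

```shell
# Assumption: Spark is installed under /opt/invariant/spark unless SPARK_HOME is set.
SPARK_HOME="${SPARK_HOME:-/opt/invariant/spark}"

# Placeholder: use the spark://HOST:PORT URL printed by your master.
MASTER_URL="${MASTER_URL:-spark://localhost:7077}"

# Hypothetical helper: memory (in MB) a worker would offer on a node with
# the given total memory, leaving 1024 MB for the OS.
worker_mem_mb() {
  echo $(( $1 - 1024 ))
}

echo "A 16384 MB node would offer $(worker_mem_mb 16384) MB to Spark"

# Start the worker only if the launch script is actually present.
if [ -x "$SPARK_HOME/sbin/start-worker.sh" ]; then
  "$SPARK_HOME/sbin/start-worker.sh" "$MASTER_URL"
fi
```

The helper is plain subtraction and does not model how Spark treats nodes with very little memory.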
