
Spark with YARN

For Spark to work with the YARN ResourceManager, it needs to be aware of the Hadoop cluster configuration. This is done by setting the HADOOP_CONF_DIR environment variable.

Make sure that the HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.

Edit the user profile and add the following environment variables:

export HADOOP_CONF_DIR=/opt/invariant/hadoop/etc/hadoop
export SPARK_HOME=/opt/invariant/spark
export LD_LIBRARY_PATH=/opt/invariant/hadoop/lib/native:$LD_LIBRARY_PATH
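
After editing the profile, reload it and confirm that the directory actually contains the client-side Hadoop configuration files. This is a quick sanity check; it assumes the profile is ~/.bashrc:

source ~/.bashrc
ls $HADOOP_CONF_DIR/core-site.xml $HADOOP_CONF_DIR/yarn-site.xml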

Rename the Spark defaults template configuration file:

mv $SPARK_HOME/conf/spark-defaults.conf.template \
   $SPARK_HOME/conf/spark-defaults.conf

Next, update $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn:

spark.master    yarn
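
Other YARN-related defaults can be set in the same file. The keys below are standard Spark configuration properties, but the values are illustrative and should be sized to the cluster:

spark.driver.memory     2g
spark.executor.memory   4g
spark.executor.cores    2
spark.yarn.queue        default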

Spark is now ready to work with the YARN cluster.
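
To verify the setup end to end, submit the SparkPi example that ships with Spark. The examples JAR name varies with the Spark and Scala versions, hence the wildcard:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    $SPARK_HOME/examples/jars/spark-examples_*.jar 10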

Launching Applications with YARN

To run a Spark application in cluster mode:

$ ./bin/spark-submit --class path.to.your.Class --master yarn \
     --deploy-mode cluster [options] <app jar> [app options]

To run in client mode, change --deploy-mode to client, as shown below.
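
$ ./bin/spark-submit --class path.to.your.Class --master yarn \
     --deploy-mode client [options] <app jar> [app options]

In client mode the driver runs in the submitting process, so application output appears directly in the terminal, which is convenient for interactive work and debugging.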

Adding Additional JARs

When an application runs in cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar does not work with files that are local to the client. To make JARs on the client available to the application, include them with the --jars option as part of the launch command.

$ ./bin/spark-submit --class io.invariant.sparkhub.TestApp \
    --master yarn \
    --deploy-mode cluster \
    --jars invariant-pipeline.jar,elastic-spark.jar \
    invariant-spark-test.jar \
    test_arg1 more_arg2
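
In cluster mode the driver's output goes to YARN container logs rather than the client terminal. After the application finishes, the aggregated logs can be fetched with the YARN CLI, using the application ID that spark-submit prints on submission:

$ yarn logs -applicationId <application ID>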

Reference

Additional details can be found in the Apache Spark documentation on running on YARN:

https://spark.apache.org/docs/latest/running-on-yarn.html