Spark with YARN
For Spark to work with the YARN ResourceManager, it needs to be aware of the Hadoop cluster configuration. This is done by setting the HADOOP_CONF_DIR environment variable.
Make sure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
Edit the user profile
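One common way to do this is to export the variables from the user's shell profile. The Hadoop path below is an assumption; adjust it to match your installation:

```shell
# Append to ~/.bashrc (or ~/.profile).
# /opt/hadoop/etc/hadoop is an assumed install path; change it for your cluster.
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export YARN_CONF_DIR=$HADOOP_CONF_DIR
```

After editing, reload the profile (for example with `source ~/.bashrc`) so the variables take effect in the current shell.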
Rename the spark default template config file
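Spark ships this file as a template. One way to create the active config, assuming SPARK_HOME points at your Spark installation, is to copy the template:

```shell
# Assumes SPARK_HOME is set to your Spark installation directory.
cp "$SPARK_HOME/conf/spark-defaults.conf.template" \
   "$SPARK_HOME/conf/spark-defaults.conf"
```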
Next, update $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn:
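The relevant entry in spark-defaults.conf is a key and a value separated by whitespace; for YARN only this one line is required:

```
spark.master    yarn
```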
Spark is now ready to work with the YARN cluster.
Launching Applications with YARN
To run a Spark application in cluster mode
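For example, the bundled SparkPi example can be submitted in cluster mode like this (the jar path uses a glob because the exact file name depends on your Spark version):

```shell
# Submits the driver to run inside the YARN cluster.
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  "$SPARK_HOME"/examples/jars/spark-examples_*.jar \
  1000
```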
To run in client mode, change the --deploy-mode to client
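The same submission in client mode, where the driver runs in the launching process rather than on the cluster, would look like:

```shell
# The driver runs locally; only the executors run on YARN.
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  "$SPARK_HOME"/examples/jars/spark-examples_*.jar \
  1000
```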
Adding Additional JARs
When the application is run in cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar does not work with files that are local to the client. To make jars on the client available to the application, include them with the --jars option as part of the launch command.
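A sketch of such a launch command; the application class and all jar paths are hypothetical placeholders:

```shell
# com.example.MyApp and the jar paths below are placeholders.
# --jars takes a comma-separated list; these jars are shipped to the cluster
# and placed on the classpath of the driver and executors.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --jars /path/to/dep1.jar,/path/to/dep2.jar \
  /path/to/my-app.jar
```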
Reference
Additional details can be found on the Apache Spark website.