Spark with YARN
For Spark to work with the YARN ResourceManager, it needs to be aware of the Hadoop cluster configuration. This is done by setting the HADOOP_CONF_DIR environment variable.
Make sure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
Edit the user profile and add the following environment variables:
export HADOOP_CONF_DIR=/opt/invariant/hadoop/etc/hadoop
export SPARK_HOME=/opt/invariant/spark
export LD_LIBRARY_PATH=/opt/invariant/hadoop/lib/native:$LD_LIBRARY_PATH
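After saving the profile, reload it and verify that the Hadoop client tools can reach the cluster. The profile path below (~/.bashrc) is an assumption; source whichever file your shell actually uses.
source ~/.bashrc
# Should list the active NodeManagers if the YARN client configuration is correct
yarn node -list
# Should list the HDFS root if the HDFS client configuration is correct
hdfs dfs -ls /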
Rename the Spark defaults template config file:
mv $SPARK_HOME/conf/spark-defaults.conf.template \
$SPARK_HOME/conf/spark-defaults.conf
Next, update $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn:
spark.master yarn
Spark is now ready to work with the YARN cluster.
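As an optional sanity check, you can start an interactive Spark shell against the cluster; it runs in client mode, and its startup banner should report yarn as the master.
$ ./bin/spark-shell --master yarn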
Launching Applications with YARN
To run a Spark application in cluster mode:
$ ./bin/spark-submit --class path.to.your.Class --master yarn \
--deploy-mode cluster [options] <app jar> [app options]
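For example, the SparkPi application that ships with the Spark distribution can be submitted in cluster mode as follows (the exact name of the examples jar depends on your Spark and Scala versions):
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
$SPARK_HOME/examples/jars/spark-examples_*.jar 10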
To run in client mode, change the --deploy-mode option to client.
Adding Additional JARs
When the application is run in cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar does not work with files that are local to the client. To make the jars on the client available to the application, include them with the --jars option as part of the launch command.
$ ./bin/spark-submit --class io.invariant.sparkhub.TestApp \
--master yarn \
--deploy-mode cluster \
--jars invariant-pipeline.jar,elastic-spark.jar \
invariant-spark-test.jar \
test_arg1 more_arg2
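The --jars option takes a comma-separated list of paths. Besides local files, which spark-submit uploads to the cluster for you, jars can also be referenced by URI such as hdfs:// so they do not need to be shipped from the client on every submit; the HDFS paths below are purely illustrative.
$ ./bin/spark-submit --class io.invariant.sparkhub.TestApp \
--master yarn \
--deploy-mode cluster \
--jars hdfs:///libs/invariant-pipeline.jar,hdfs:///libs/elastic-spark.jar \
invariant-spark-test.jar \
test_arg1 more_arg2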
Reference
Additional details can be found on the Apache Spark website.