
Spark with YARN

For Spark to work with the YARN ResourceManager, it needs to be aware of the Hadoop cluster configuration. This is done by setting the HADOOP_CONF_DIR environment variable.

Make sure that the HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.

Edit the user profile and add the following environment variables:

export HADOOP_CONF_DIR=/opt/invariant/hadoop/etc/hadoop
export SPARK_HOME=/opt/invariant/spark
export LD_LIBRARY_PATH=/opt/invariant/hadoop/lib/native:$LD_LIBRARY_PATH
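
After editing the profile, reload it and confirm that the directory actually contains the client-side Hadoop configuration files. This is a quick sanity check; it assumes the profile is ~/.bashrc:

source ~/.bashrc
ls $HADOOP_CONF_DIR/core-site.xml $HADOOP_CONF_DIR/yarn-site.xml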

Rename the Spark defaults template configuration file:

mv $SPARK_HOME/conf/spark-defaults.conf.template \
   $SPARK_HOME/conf/spark-defaults.conf

Next, update $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn:

spark.master    yarn
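
Other YARN-related defaults can be set in the same file. The keys below are standard Spark configuration properties, but the values are illustrative and should be sized to the cluster:

spark.driver.memory     2g
spark.executor.memory   4g
spark.executor.cores    2
spark.yarn.queue        default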

Spark is now ready to work with the YARN cluster.
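
To verify the setup end to end, submit the SparkPi example that ships with Spark. The examples JAR name varies with the Spark and Scala versions, hence the wildcard:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    $SPARK_HOME/examples/jars/spark-examples_*.jar 10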

Launching Applications with YARN

To run a Spark application in cluster mode:

$ ./bin/spark-submit --class path.to.your.Class --master yarn \
     --deploy-mode cluster [options] <app jar> [app options]

To run in client mode, change --deploy-mode to client, as shown below.
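
$ ./bin/spark-submit --class path.to.your.Class --master yarn \
     --deploy-mode client [options] <app jar> [app options]

In client mode the driver runs in the submitting process, so application output appears directly in the terminal, which is convenient for interactive work and debugging.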

Adding Additional JARs

When an application runs in cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar does not work with files that are local to the client. To make JARs on the client available to the application, include them with the --jars option as part of the launch command.

$ ./bin/spark-submit --class io.invariant.sparkhub.TestApp \
    --master yarn \
    --deploy-mode cluster \
    --jars invariant-pipeline.jar,elastic-spark.jar \
    invariant-spark-test.jar \
    test_arg1 more_arg2
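
In cluster mode the driver's output goes to YARN container logs rather than the client terminal. After the application finishes, the aggregated logs can be fetched with the YARN CLI, using the application ID that spark-submit prints on submission:

$ yarn logs -applicationId <application ID>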

Reference

Additional details can be found in the Apache Spark documentation on running on YARN:

https://spark.apache.org/docs/latest/running-on-yarn.html