Cluster Setup

Installing Spark Standalone to a Cluster

Download the Spark DataHub tarball from the downloads site:

cd /opt/invariant 
tar -xvzf spark-datahub-3.2.0.tgz
mv spark-datahub-3.2.0 spark

Add the Spark binaries to the user's PATH:

export PATH=$PATH:/opt/invariant/spark/bin
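
If the binaries are on the PATH, a quick sanity check is to print the bundled Spark version (spark-submit ships with the distribution):

spark-submit --version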

To run Spark in standalone mode, place the Spark artifacts on each node of the cluster.
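
As a rough sketch of distributing the artifacts, assuming passwordless SSH is set up and using hypothetical worker hostnames node1 and node2:

# /opt/invariant must already exist on each node
for host in node1 node2; do
  rsync -az /opt/invariant/spark/ "${host}":/opt/invariant/spark/
done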

Understand Client and Cluster Mode

Spark jobs can be run on YARN in two modes: cluster mode and client mode. It is important to understand the difference between the two in order to choose the correct memory allocation configuration and to submit jobs as expected.

A Spark job essentially consists of two parts: the Spark Executors, which run the actual tasks, and the Spark Driver, which schedules the Executors.

  • Cluster mode: Everything runs inside the cluster. The job can be started from the edge node and continues running even if you log out. In this mode, the Spark Driver is encapsulated inside the YARN Application Master.

  • Client mode: The Spark Driver runs on a client, such as your local computer. In this case, if the client is shut down, the job fails. The Spark Executors still run on the cluster, and a small YARN Application Master is created to schedule them.

Client mode is suitable for interactive jobs, but the application fails if the client stops. For long-running jobs, cluster mode should be used (see the sketch below).
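
For illustration, the same example job can be submitted in either mode with spark-submit; the SparkPi example jar ships with Spark, though the exact jar name and path depend on the build:

# Cluster mode: the Driver runs inside the YARN Application Master
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /opt/invariant/spark/examples/jars/spark-examples_*.jar 100

# Client mode: the Driver runs on the machine that submits the job
spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  /opt/invariant/spark/examples/jars/spark-examples_*.jar 100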

Starting the Cluster

Start a standalone master server by executing:

./sbin/start-master.sh

Once the master starts, it will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.

Next, start one or more workers and connect them to the master via:

./sbin/start-worker.sh <master-spark-URL>

Once a worker has started, navigate to the master’s web UI (http://localhost:8080 by default). You should see the new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).
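
To confirm the standalone cluster accepts work, you can point an interactive shell at the URL the master printed (HOST is a placeholder; 7077 is the default master port):

spark-shell --master spark://HOST:7077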