Reference

The Apache Spark website is the best reference for getting started with programming, deploying, and running Spark applications:

https://spark.apache.org/docs/latest/index.html

Programming Guides:

  • Quick Start: a quick introduction to the Spark API; start here!

  • RDD Programming Guide: overview of Spark basics: RDDs (the core but older API), accumulators, and broadcast variables

  • Spark SQL, Datasets, and DataFrames: processing structured data with relational queries (newer API than RDDs)

  • Structured Streaming: processing structured data streams with relational queries (using Datasets and DataFrames, newer API than DStreams)

  • MLlib: applying machine learning algorithms

  • GraphX: processing graphs

  • PySpark: processing data with Spark in Python

API Docs:

Operations Guide:

  • Configuration: customize Spark via its configuration system

  • Monitoring: track the behavior of your applications

  • Tuning Guide: best practices to optimize performance and memory use

  • Job Scheduling: scheduling resources across and within Spark applications
