DataFrame API

This is a brief overview of the PySpark DataFrame API.

Init Session

PySpark applications start by initializing a SparkSession. When running in the PySpark shell via the pyspark executable, the shell automatically creates the session in the variable spark for users.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Create DataFrame

Create a DataFrame using pyspark.sql.SparkSession.createDataFrame, which takes an optional schema argument to specify the schema of the DataFrame. If it is not provided, PySpark infers the schema by sampling the data.

from datetime import datetime, date
import pandas as pd
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(a=16, b=2., c='jane', d=date(2010, 1, 1), e=datetime(2021, 1, 1, 12, 0)),
    Row(a=16, b=3., c='john', d=date(2010, 2, 1), e=datetime(2021, 1, 2, 12, 0)),
    Row(a=32, b=5., c='alex', d=date(2010, 3, 1), e=datetime(2022, 1, 3, 12, 0))
])
df

The DataFrame contents and schema can be displayed using the commands shown below.

df.show()
df.printSchema()
+---+---+----+----------+-------------------+
|  a|  b|   c|         d|                  e|
+---+---+----+----------+-------------------+
| 16|2.0|jane|2010-01-01|2021-01-01 12:00:00|
| 16|3.0|john|2010-02-01|2021-01-02 12:00:00|
| 32|5.0|alex|2010-03-01|2022-01-03 12:00:00|
+---+---+----+----------+-------------------+

root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: date (nullable = true)
 |-- e: timestamp (nullable = true)

Viewing Data

The rows of a DataFrame can be displayed using DataFrame.show().

df.show(1)
+---+---+----+----------+-------------------+
|  a|  b|   c|         d|                  e|
+---+---+----+----------+-------------------+
| 16|2.0|jane|2010-01-01|2021-01-01 12:00:00|
+---+---+----+----------+-------------------+
only showing top 1 row
