DataFrame API
This is a brief overview of the PySpark DataFrame API.
Init Session
PySpark applications start by initializing a SparkSession. When running in the PySpark shell (the pyspark executable), the shell automatically creates the session for you in the variable spark.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Create DataFrame
Create a DataFrame using pyspark.sql.SparkSession.createDataFrame, which takes a schema argument to specify the schema of the DataFrame. If the schema is not provided, PySpark infers it by sampling the data.
from datetime import datetime, date
import pandas as pd
from pyspark.sql import Row
df = spark.createDataFrame([
    Row(a=16, b=2., c='jane', d=date(2010, 1, 1), e=datetime(2021, 1, 1, 12, 0)),
    Row(a=16, b=3., c='john', d=date(2010, 2, 1), e=datetime(2021, 1, 2, 12, 0)),
    Row(a=32, b=5., c='alex', d=date(2010, 3, 1), e=datetime(2022, 1, 3, 12, 0)),
])
df
The DataFrame contents and schema can be displayed using the commands shown below.
df.show()
df.printSchema()
+---+---+----+----------+-------------------+
|  a|  b|   c|         d|                  e|
+---+---+----+----------+-------------------+
| 16|2.0|jane|2010-01-01|2021-01-01 12:00:00|
| 16|3.0|john|2010-02-01|2021-01-02 12:00:00|
| 32|5.0|alex|2010-03-01|2022-01-03 12:00:00|
+---+---+----+----------+-------------------+
root
|-- a: long (nullable = true)
|-- b: double (nullable = true)
|-- c: string (nullable = true)
|-- d: date (nullable = true)
|-- e: timestamp (nullable = true)
Viewing Data
The rows of a DataFrame can be displayed using DataFrame.show().
df.show(1)
+---+---+----+----------+-------------------+
|  a|  b|   c|         d|                  e|
+---+---+----+----------+-------------------+
| 16|2.0|jane|2010-01-01|2021-01-01 12:00:00|
+---+---+----+----------+-------------------+
only showing top 1 row