DataFrame API
This is a brief overview of the PySpark DataFrame API.
Init Session
PySpark applications start by initializing a SparkSession. When running in the PySpark shell via the pyspark executable, the shell automatically creates the session and exposes it as the variable spark.
Create DataFrame
Create a DataFrame using pyspark.sql.SparkSession.createDataFrame, which takes the schema argument to specify the schema of the DataFrame. If schema is not provided, createDataFrame infers it by sampling the data.
The DataFrame's contents and schema can be displayed using DataFrame.show() and DataFrame.printSchema().
Viewing Data
The rows of a DataFrame can be displayed using DataFrame.show().