PySpark DataFrames
Learn PySpark DataFrame API and its basic operations.
A PySpark DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a DataFrame in R/Python. PySpark DataFrames are an abstraction on top of RDDs and provide a more concise and efficient way to handle structured data. Not only are they easy to understand, but their operations are also more optimized than equivalent RDD operations, because Spark plans DataFrame queries with its built-in Catalyst optimizer. DataFrames are immutable, which means that any transformation operation on a DataFrame creates a new DataFrame.
PySpark DataFrames support a wide range of operations, such as filtering, grouping, joining, and aggregation, making it easier to perform complex data processing. They support both SQL queries and expression-based methods. PySpark DataFrames are implemented in the pyspark.sql module, which provides the DataFrame class.
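As a minimal sketch of the two styles, assuming a SparkSession named spark already exists (the orders DataFrame and its columns are made up purely for illustration), the same grouping can be written with DataFrame methods or with a SQL query:

# A minimal sketch, assuming a SparkSession named spark already exists.
# The "orders" DataFrame and its columns are hypothetical.
orders = spark.createDataFrame(
    [("Alice", 10.0), ("Bob", 20.0), ("Alice", 5.0)],
    ["customer", "amount"])

# Expression-method style: filter, group, and aggregate with DataFrame methods.
orders.filter(orders.amount > 5).groupBy("customer").sum("amount").show()

# SQL style: register a temporary view and query it with Spark SQL.
orders.createOrReplaceTempView("orders")
spark.sql(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "WHERE amount > 5 GROUP BY customer").show()

Both forms produce the same result; Spark compiles them into the same optimized execution plan.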
Creating PySpark DataFrames
To use PySpark DataFrames, we first need to create a SparkSession object, which is the entry point to PySpark SQL. Once we create a SparkSession, it's available in the PySpark shell as spark. There are three methods available for creating PySpark DataFrames:
From existing RDDs
To create a PySpark DataFrame from an existing RDD, we can use the createDataFrame() method provided by the SparkSession object. This method allows us to pass an RDD along with the schema (column names) to create the DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pyspark_sql").getOrCreate()

print("Create a sample RDD")
rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob"), (3, "Charlie")])

print("Create a PySpark DataFrame from RDD")
df = spark.createDataFrame(rdd, ["id", "name"])

print("Print the contents of the DataFrame")
df.show()
Let’s understand the code above:
- Line 1: Import the SparkSession class from the pyspark.sql module.
- Line 2: Create a SparkSession with the name "pyspark_sql" using the builder pattern and the getOrCreate() method.
- Line 5: Use the parallelize() method of the sparkContext to create an RDD from a list of (id, name) tuples.