...

PySpark DataFrames

Learn PySpark DataFrame API and its basic operations.

PySpark DataFrames

A PySpark DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a DataFrame in R or pandas. PySpark DataFrames are an abstraction on top of RDDs and provide a more concise and efficient way to handle structured data. Not only are they easy to understand, but their operations are also optimized compared to RDDs, because Spark's built-in query optimizer (Catalyst) plans their execution. DataFrames are immutable, which means that any transformation on a DataFrame creates a new DataFrame rather than modifying the original.
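
As a minimal sketch of this immutability, the snippet below applies a filter() transformation and shows that the original DataFrame is left unchanged. It assumes a SparkSession is already available as spark (as in the PySpark shell); the people DataFrame and its contents are hypothetical examples.

people = spark.createDataFrame([(1, "Alice", 34), (2, "Bob", 45)], ["id", "name", "age"])
# filter() is a transformation: it returns a new DataFrame and does not modify `people`
adults = people.filter(people.age > 40)
print(people.count())  # still 2 rows
print(adults.count())  # 1 row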

PySpark DataFrames support a wide range of operations, such as filtering, grouping, joining, and aggregation, making it easier to perform complex data manipulations. They can be queried both with SQL and with DataFrame expression methods. PySpark DataFrames are implemented in the pyspark.sql module, which provides the DataFrame class.
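
The snippet below is a rough sketch of the two querying styles: it first uses expression methods (filter(), groupBy(), agg()) and then runs an equivalent SQL query through a temporary view. It again assumes a spark session is available, and the people data is a hypothetical example.

from pyspark.sql import functions as F

people = spark.createDataFrame(
    [(1, "Alice", 34), (2, "Bob", 45), (3, "Charlie", 29)],
    ["id", "name", "age"])

# Expression-method style: column expressions and DataFrame methods
people.filter(F.col("age") > 30).groupBy("name").agg(F.avg("age").alias("avg_age")).show()

# SQL style: register the DataFrame as a temporary view and query it with spark.sql()
people.createOrReplaceTempView("people")
spark.sql("SELECT name, AVG(age) AS avg_age FROM people WHERE age > 30 GROUP BY name").show()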

Creating PySpark DataFrames

To use PySpark DataFrames, we first need to create a SparkSession object, which is the entry point to PySpark SQL. Once we create a SparkSession, it’s available in the PySpark shell as spark. There are three methods available for creating PySpark DataFrames:

From existing RDDs

To create a PySpark DataFrame from an existing RDD, we can use the createDataFrame() method provided by the SparkSession object. This method allows us to pass an RDD along with the schema (column names) to create the DataFrame.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pyspark_sql").getOrCreate()
print("Create a sample RDD")
rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob"), (3, "Charlie")])
print("Create a PySpark DataFrame from RDD")
df = spark.createDataFrame(rdd, ["id", "name"])
print("Print the contents of the DataFrame")
df.show()

Let’s understand the code above:

  • Line 1: Import the SparkSession class from the pyspark.sql module.
  • Line 2: Create a SparkSession with the name “pyspark_sql” using the builder pattern and the getOrCreate() method.
  • Line 5: Use the parallelize() method of the sparkContext
...