...

Solution: PySpark Data Structures

Learn about the solution to the problem from the previous lesson.

Let’s look at the solutions to the quiz problems from the previous lesson, which tested our understanding of PySpark data structures, specifically PySpark DataFrames.

  1. Create a PySpark DataFrame named df, as shown below, with the following provided data:

data = [('1', 'Alice', '25', 'NY'), ('2', 'Bob', '30', 'Chicago'), ('3', 'Charlie', '35', 'San Diego')]

ID  Name     Age  City
1   Alice    25   NY
2   Bob      30   Chicago
3   Charlie  35   San Diego

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Copy the input data to here
data = [('1', 'Alice', '25', 'NY'), ('2', 'Bob', '30', 'Chicago'), ('3', 'Charlie', '35', 'San Diego')]

# Create an RDD first
rdd = spark.sparkContext.parallelize(data)

# Create a PySpark DataFrame named df
df = spark.createDataFrame(rdd, schema=["ID", "Name", "Age", "City"])

# Print the contents of the df
df.show()

Let’s understand the above solution now:

  • Line 1: Import the SparkSession class from the pyspark.sql module.
  • Line 2: Create a SparkSession using the builder pattern and the getOrCreate() method.
  • Line 5: Copy the input data from the question.
  • Line 8: Create an RDD first by using spark.sparkContext.parallelize(data).
  • Line 11: Use the createDataFrame() method of the SparkSession to create a PySpark DataFrame named df from the created RDD rdd. The schema parameter is provided to specify the column names.
  • Line 14: Use the show() method of the DataFrame to display the contents of the DataFrame df.
  2. Show the first three rows of the df DataFrame and print the schema.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Copy the input data to here
data = [('1', 'Alice', '25', 'NY'), ('2', 'Bob', '30', 'Chicago'), ('3', 'Charlie', '35', 'San Diego')]

# Create an RDD first
rdd = spark.sparkContext.parallelize(data)

# Create a PySpark DataFrame named df
df = spark.createDataFrame(rdd, schema=["ID", "Name", "Age", "City"])

# Show the first three rows of the `df` DataFrame
for row in df.take(3):
    print(row)

# Print the schema of the `df` DataFrame
df.printSchema()

Let’s understand the above solution:

  • Line 1: Import the SparkSession class from the pyspark.sql module.
...