Solution: PySpark Data Structures
Learn about the solution to the problem from the previous lesson.
Let’s look at solutions to the problems related to our understanding of PySpark Data structures, specifically focusing on PySpark DataFrames in this quiz.
- Create a PySpark DataFrame named `df` from the provided data so that its contents match the table below:
```python
data = [('1', 'Alice', '25', 'NY'), ('2', 'Bob', '30', 'Chicago'), ('3', 'Charlie', '35', 'San Diego')]
```

| ID | Name    | Age | City      |
|----|---------|-----|-----------|
| 1  | Alice   | 25  | NY        |
| 2  | Bob     | 30  | Chicago   |
| 3  | Charlie | 35  | San Diego |
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Copy the input data to here
data = [('1', 'Alice', '25', 'NY'), ('2', 'Bob', '30', 'Chicago'), ('3', 'Charlie', '35', 'San Diego')]

# Create an RDD first
rdd = spark.sparkContext.parallelize(data)

# Create a PySpark DataFrame named df
df = spark.createDataFrame(rdd, schema=["Id", "name", "age", "city"])

# Print the contents of the df
df.show()
```
Let's understand the above solution:

- Line 1: Import the `SparkSession` class from the `pyspark.sql` module.
- Line 2: Create a `SparkSession` using the `builder` pattern and the `getOrCreate()` method.
- Line 5: Copy the input data from the question.
- Line 8: Create an RDD from the data using `spark.sparkContext.parallelize(data)`.
- Line 11: Use the `createDataFrame()` method of the `SparkSession` to create a PySpark DataFrame named `df` from the RDD `rdd`. The `schema` parameter specifies the column names.
- Line 14: Use the `show()` method to display the contents of the DataFrame `df`.
- Show the first three rows of the `df` DataFrame and print the schema.
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Copy the input data to here
data = [('1', 'Alice', '25', 'NY'), ('2', 'Bob', '30', 'Chicago'), ('3', 'Charlie', '35', 'San Diego')]

# Create an RDD first
rdd = spark.sparkContext.parallelize(data)

# Create a PySpark DataFrame named df
df = spark.createDataFrame(rdd, schema=["Id", "name", "age", "city"])

# Show the first three rows of the `df` DataFrame
for row in df.take(3):
    print(row)

# Print the schema of the `df` DataFrame
df.printSchema()
```
Let’s understand the above solution:
- Line 1: Import the `SparkSession`