Data Exploration and Analysis with PySpark SQL
Get familiar with essential skills for exploring data using PySpark SQL.
Exploratory data analysis (EDA)
Exploratory data analysis (EDA) is a critical step in the data analysis process that involves understanding the dataset, identifying patterns, and gaining insights from the data. In this lesson, we’ll learn how to perform EDA using PySpark SQL with the obesity dataset, which provides information about the classification of individuals based on obesity. The dataset incorporates data from various sources, including medical records, surveys, and self-reported information.
Understanding the dataset
The first step in EDA is to gain a clear understanding of the dataset. This includes loading the dataset, inspecting its structure, examining the schema of the DataFrame, and previewing the data using DataFrame operations. Let’s see how we can achieve these tasks using PySpark SQL.
Previewing the data
After loading the data into a DataFrame, we can use various PySpark DataFrame operations to get a preview of the data. These operations allow us to inspect the data, perform basic transformations, and extract relevant information. Some common DataFrame operations for data preview are shown below:
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pyspark_sql").getOrCreate()

# Import "obesity.csv" into a PySpark DataFrame
df = spark.read.csv("obesity.csv", header=True, inferSchema=True)

# Create a temporary view
df.createOrReplaceTempView("people")

# Select and display the first 5 rows and all columns from the temp view
spark.sql("SELECT * FROM people").show(5)

# Count the total number of rows in the DataFrame
total_counts = spark.sql("SELECT COUNT(*) FROM people").first()[0]
print(f"Total counts: {total_counts}")

# Select and display all data points related to people with obesity
spark.sql("SELECT * FROM people WHERE Label = 'Obese'").show()

# Group people by label and display the results
spark.sql("SELECT Label, COUNT(*) AS count FROM people GROUP BY Label").show()
```
In the above code:

- Line 1: Import the `SparkSession` class from the `pyspark.sql` module.
- Line 2: Create a `SparkSession` using the `builder` pattern and the `appName()` method to set the application name to "pyspark_sql."
- Line 5: Use the `read.csv()` method of the `SparkSession` to read a CSV file named "obesity.csv." The `header` parameter is set to `True` to treat the first row as the header, and the `inferSchema` parameter is set to `True` to infer the schema from the data. The resulting DataFrame is stored in the `df` variable.