What is Exploratory Data Analysis?

Data analysis is older than computers! Computers have revolutionized data analysis, but the fundamental concepts have been around for centuries. John Graunt’s work on mortality rates in 17th-century London is considered one of the earliest examples of data analysis.   

Exploratory Data Analysis (EDA) in data science is a way to investigate datasets to find preliminary insights and uncover underlying patterns in the data. In this manner, instead of making assumptions, data can be processed systematically to gain insights and make informed decisions.

Key takeaways:

  • Exploratory Data Analysis (EDA) is a helpful tool for better understanding your data by uncovering patterns and insights that may not be obvious at first.

  • EDA helps you identify and fix issues in your data, such as missing values and outliers.

  • Data cleaning is critical to ensure your results are accurate and reliable.

  • Transforming your data with techniques like normalization, encoding, and feature engineering prepares it for more effective analysis and modeling.

  • Using visualizations makes complex data easier to interpret.

  • Each visualization type serves a specific purpose and can highlight different aspects of the data.

  • EDA is an ongoing process, so revisit steps as new insights emerge during your analysis.

  • The insights gained from EDA can guide your feature selection for model building.

Why Exploratory Data Analysis?

Some key advantages of Exploratory Data Analysis include:

  1. It improves the understanding of variables by extracting summary statistics such as the mean, minimum, and maximum values.

  2. EDA helps discover errors, outliers, and missing values in the data.

  3. It identifies patterns by visualizing data in graphs such as box plots, scatter plots, and histograms.

Hence, the main goal is to understand the data better and use tools effectively to gain valuable insights or draw conclusions.

The role of EDA

Types of Exploratory Data Analysis (EDA)

EDA techniques can be broadly categorized into four types, each serving a unique purpose. Below, we have listed these types and their examples:

  1. Univariate non-graphical analysis: This is the simplest form of data analysis, examining a single variable without the need for graphical tools. This method allows analysts to summarize data patterns, such as mean, median, and mode.
    Example: Analyzing customer ages to find the average, giving insights into the most common age range of your audience.

  2. Univariate graphical analysis: Graphical methods give a visual perspective on data, providing an immediate understanding of distributions. This type of EDA uses histograms, stem-and-leaf plots, or box plots to depict a single variable graphically.
    Example: A histogram showing the distribution of monthly customer purchases.

  3. Multivariate non-graphical analysis: This non-graphical method helps reveal correlations or dependencies when analyzing relationships between two or more variables. Cross-tabulations or correlation tables are common here.
    Example: Observing the relationship between customer age and frequency of product purchases.

  4. Multivariate graphical analysis: For complex datasets with multiple variables, graphical methods are invaluable for visualizing interactions. Scatter plots, grouped bar charts, or pair plots are commonly used.
    Example: A scatter plot showing the relationship between a marketing budget and sales growth across different regions.

The following table summarizes these types of EDA along with examples of each:

| Type | Non-Graphical | Graphical |
| --- | --- | --- |
| Univariate | Analyzes a single variable using basic statistical summaries. Examples: Mean, median, mode, range | Visualizes distribution and patterns of a single variable. Examples: Histograms, box plots, stem-and-leaf plots |
| Multivariate | Examines relationships between two or more variables with numeric summaries and tables. Examples: Cross-tabulations, correlation statistics | Uses visuals to illustrate relationships between multiple variables, helping reveal patterns and trends. Examples: Grouped bar plots, scatter plots, heatmaps |

By using these four approaches, you’ll be well-equipped to gain a deeper understanding of your dataset, which is crucial for making informed decisions in your analysis.
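To make the multivariate non-graphical type concrete, here is a minimal sketch of a correlation table in pandas. The customer data below is hypothetical, invented purely for illustration:

```python
import pandas as pd

# Hypothetical customer data (invented for illustration)
df = pd.DataFrame({
    "age": [22, 35, 41, 28, 53, 46],
    "purchases": [3, 6, 8, 4, 9, 7],
})

# A correlation table is a non-graphical multivariate summary:
# values near +1 or -1 indicate a strong linear relationship
corr = df.corr()
print(corr)
```

In this made-up sample, the strong positive correlation between age and purchases suggests the two variables move together, which is exactly the kind of preliminary insight EDA is after.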

Understand EDA with Python code examples

Let’s take a look at some Python code examples to understand how EDA works. In this Answer, we’ll use Fisher’s Iris dataset to demonstrate key EDA tasks, with each step shown in the following code blocks. The dataset contains 150 records under five attributes:

  • sepal length (cm)

  • sepal width (cm)

  • petal length (cm)

  • petal width (cm)

  • class (represents the flower species)

Exploratory Data Analysis (EDA) involves examining the dataset from multiple perspectives. Rather than following a strict, linear sequence, it includes several key aspects that help us understand, clean, and transform the data for further analysis. Following are the aspects of EDA that are typically revisited as more insights arise during the process:

1. Understand the dataset

Now, to understand the data better, let’s print the first few rows of the dataset using the following code:

# Importing libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Loading data for analysis
iris_data = load_iris()
# Creating a dataframe
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target
print(iris_dataframe.head())

Here the head() method of the pandas library gives a peek into the DataFrame.
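Beyond head(), two other common first-look calls are shape and info(); the sketch below recreates the same iris_dataframe built above so the snippet runs on its own:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Recreating the DataFrame from the previous step
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

# shape reports (rows, columns) -- here (150, 5)
print(iris_dataframe.shape)
# info() lists each column's name, dtype, and non-null count
iris_dataframe.info()
```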

2. Observe data statistics

Another important aspect of exploratory data analysis is observing the statistical values of the data to decide if it needs to be preprocessed to make it more consistent.

The describe() method of the pandas library gives us important statistics of the data like min, max, mean, standard deviation, and quartiles. For example, if we want to verify the minimum and maximum values in our DataFrame, it can be done by invoking the describe() method as follows:

# Summary of numerical variables
print(iris_dataframe.describe())

3. Data cleaning

Data cleaning ensures the dataset is accurate and consistent by addressing issues like missing values, outliers, and duplicates that could adversely affect the analysis. In this phase, we prepare the data for further analysis by identifying and handling common issues such as null values, outliers, and duplicate entries.

To count the nulls in each column, we can chain the isnull() and sum() methods on the pandas DataFrame. If null values are found within a column, they can be replaced with the column mean using the fillna() method:

# Retrieving number of nulls in each column
print("Number of nulls in each column:")
print(iris_dataframe.isnull().sum())
# Filling null values in a column with the column mean
# (assigning the result back avoids pandas' chained-assignment warning)
iris_dataframe['sepal length (cm)'] = iris_dataframe['sepal length (cm)'].fillna(iris_dataframe['sepal length (cm)'].mean())

4. Data transformation

Data transformation modifies data to prepare it for analysis or modeling. This includes techniques like scaling, normalization, encoding, and feature creation, which can reveal hidden patterns and improve model accuracy.

  • Scaling and normalization: These techniques adjust numerical data to a common scale, preventing models from being biased toward larger values. For example, normalization scales data to a range of 0–1, while standardization adjusts it to a standard scale.

  • Encoding categorical variables: Since machine learning models work with numerical data, categorical variables (like Yes/No) are converted into numbers using methods like one-hot encoding or label encoding. This helps models understand categorical information.

  • Feature creation: New features can be created from existing data, such as deriving a “total sales” feature from “quantity” and “price.” These new features provide relevant information that enhances analysis and model performance.

Data transformation cleans and standardizes data, revealing valuable patterns that improve analysis and model accuracy. Here’s a Python code example demonstrating scaling:

# Display original data
print("Original Dataset:")
print(iris_dataframe.head())

# Scaling and normalization (manual min-max scaling)
# Define a function to scale a column to the range [0, 1]
def min_max_scale(column):
    return (column - column.min()) / (column.max() - column.min())

# Apply min-max scaling to each feature column
for column in iris_dataframe.columns[:-1]:  # Exclude the 'class' column
    iris_dataframe[column] = min_max_scale(iris_dataframe[column])

print("\nDataset after Min-Max Scaling (0-1):")
print(iris_dataframe.head())
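The encoding step described above can be sketched with pd.get_dummies. The species column created below is an illustrative addition (the dataset stores classes as the numeric codes 0, 1, and 2):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()
df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
# Map the numeric class codes (0, 1, 2) to species names for readability
df['species'] = pd.Series(iris_data.target).map(dict(enumerate(iris_data.target_names)))

# One-hot encoding: one 0/1 indicator column per species
encoded = pd.get_dummies(df, columns=['species'])
print(encoded.columns.tolist())
```

Label encoding (a single integer column) is the more compact alternative, but one-hot encoding avoids implying an order between categories.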

5. Data visualizations

It is difficult for humans to visualize statistical values. However, data visualizations can help us better understand the data and detect patterns. We can visualize our data using histograms, box plots, scatter plots, etc., each giving different information.

Let’s learn the role of each plot type in EDA and how to create them in Python.

Histogram

We will use a histogram to plot the frequency of the sepal width and sepal length of the flowers within our dataset. A histogram displays the frequency distribution of a dataset feature, helping us visualize the underlying distribution and identify patterns:

# Histogram for sepal length and sepal width
fig = plt.figure(figsize= (10,5))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('sepal length (cm)')
ax1.set_ylabel('Count')
iris_dataframe['sepal length (cm)'].hist(ax=ax1)
ax2 = fig.add_subplot(122)
ax2.set_xlabel('sepal width (cm)')
ax2.set_ylabel('Count')
iris_dataframe['sepal width (cm)'].hist(ax=ax2)
plt.show()

Box plot

Using a box plot, we can look for outliers in the sepal width feature of our dataset and then decide whether or not to remove them. A box plot visualizes the distribution of data, making it ideal for detecting outliers:

# Creating a box plot
iris_dataframe.boxplot(column='sepal width (cm)', by='class')
plt.title('sepal width (cm) by class')
plt.suptitle('')
plt.ylabel('sepal width (cm)')
plt.show()

Scatter plot

For each class of flowers within our dataset, we can use a scatter plot to judge how petal width and petal length are related to each other. A scatter plot shows the relationship between two variables, making it useful for identifying trends, correlations, or clusters:

# Scatter plot of petal length and petal width for different classes
colors = ['red' if l == 0 else 'blue' if l == 1 else 'green' for l in iris_data.target]
plt.scatter(iris_dataframe['petal length (cm)'], iris_dataframe['petal width (cm)'], color=colors)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.show()
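The heatmap listed in the table of EDA types earlier is another multivariate graphical option. Here is a minimal matplotlib sketch over the four numeric Iris features (seaborn's heatmap function is a common alternative, but this version sticks to the libraries already used above):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris_data = load_iris()
df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# Pairwise correlations between the four numeric features
corr = df.corr()

# Render the correlation matrix as a color-coded grid
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)), labels=corr.columns, rotation=45, ha='right')
ax.set_yticks(range(len(corr)), labels=corr.columns)
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
```

Cells near +1 (or -1) flag strongly related feature pairs at a glance, such as petal length and petal width in this dataset.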

Conclusion

Exploratory Data Analysis (EDA) is a foundational step in data science, enabling us to transform raw data into meaningful insights. Through data cleaning, statistical summaries, and visualizations, EDA helps us uncover patterns that guide further analysis. Along the way, it’s essential to keep a few common pitfalls in mind: avoid making premature assumptions about causation, overlooking variable relationships, or failing to thoroughly check for data quality issues, as these can lead to inaccurate conclusions.

EDA is often iterative, but knowing when to stop is equally important. When the main data patterns are well-understood, data quality issues are addressed, and you have enough clarity to confidently proceed with modeling or analysis, you’re ready to move forward. By understanding and applying EDA mindfully, you can build a strong foundation for informed, data-driven decisions and meaningful analyses.

Become a data analyst with our comprehensive learning path!

If you’re ready to kickstart your career as a data analyst, then our “Become a Data Analyst” path is designed to take you from your first line of code to landing your first job.

Whether you’re a beginner or looking to transition into a data-driven career, this step-by-step journey will equip you with the skills to turn raw data into actionable insights. Develop expertise in data cleaning, analysis, and storytelling to make informed decisions and drive business success. With our AI mentor by your side, you’ll tackle challenges with personalized guidance. Start your data analytics career today and make your mark in the world of data!

Frequently asked questions



What are the four types of Exploratory Data Analysis?

The four types of EDA are as follows:

  • Univariate non-graphical analysis: Analyzing a single variable using summary statistics like the mean and median
  • Univariate graphical analysis: Visualizing a single variable’s distribution through plots like histograms and box plots
  • Multivariate non-graphical analysis: Exploring relationships among two or more variables using tables and correlations
  • Multivariate graphical analysis: Visualizing relationships between multiple variables with tools like scatter plots and grouped bar charts

What are the steps of Exploratory Data Analysis?

Exploratory data analysis (EDA) often involves steps like:

  • Data collection: Gathering and loading data
  • Data cleaning: Correcting inaccuracies and handling missing values
  • Data transformation: Modifying data to suit analysis needs
  • Data visualization: Using graphs to identify patterns and relationships

These steps may vary based on the user’s goals and the specifics of the dataset, allowing for flexibility in approach.


When to use Exploratory Data Analysis?

Use Exploratory Data Analysis whenever you start working with a new dataset to understand its structure, check for errors, and uncover patterns that can guide further analysis or model building.

