What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is a way to investigate datasets in order to extract preliminary insights and uncover underlying patterns. Instead of relying on assumptions, the data is examined systematically so that conclusions and decisions are grounded in what it actually shows.

Why Exploratory Data Analysis?

Some advantages of Exploratory Data Analysis include:

  1. Improve understanding of variables by extracting summary statistics such as the mean, minimum, and maximum values.
  2. Discover errors, outliers, and missing values in the data.
  3. Identify patterns by visualizing data in graphs such as box plots, scatter plots, and histograms.

Hence, the main goal is to understand the data better and use tools effectively to gain valuable insights or draw conclusions.

Example in Python

Fisher's Iris dataset is used to demonstrate EDA tasks in the following code blocks.

The dataset contains 150 records with five attributes: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), and class (the flower species, encoded as an integer).

# Importing libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Loading data for analysis
iris_data = load_iris()
# Creating a dataframe
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target
print(iris_dataframe.head())
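Before computing any statistics, it can help to confirm the size and column types of the data frame and to see how many records fall into each class. The following is a minimal sketch of that check; the species series is an illustrative addition (not part of the original walkthrough) that maps the integer class codes to the names stored in the dataset's target_names attribute.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame so this snippet runs on its own
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

# Number of rows and columns
print(iris_dataframe.shape)

# Data type of each column
print(iris_dataframe.dtypes)

# Illustrative: translate integer class codes into species names
species = pd.Series(iris_data.target_names[iris_data.target], name='species')

# Number of records per species
print(species.value_counts())
```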

Statistics

The first step in data analysis is to inspect summary statistics of the data to decide whether it needs preprocessing to make it more consistent.

Describe

The describe() method of a pandas DataFrame reports key statistics for each numeric column: count, mean, standard deviation, minimum, maximum, and quartiles.

For example, to verify the minimum and maximum values in our data, we can invoke the describe() method:

# Summary of numerical variables
print(iris_dataframe.describe())
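Because describe() returns an ordinary DataFrame, individual rows of the summary can be selected by label. The sketch below pulls out only the minimum and maximum rows and, as an optional extra step not covered in the original text, computes the mean of each feature per class with groupby().

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame so this snippet runs on its own
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

# Summarize the four measurements only (the class column is a label, not a measurement)
summary = iris_dataframe.drop(columns='class').describe()

# Keep just the rows holding the minimum and maximum of each column
min_max = summary.loc[['min', 'max']]
print(min_max)

# Optional: mean of each feature within each class
class_means = iris_dataframe.groupby('class').mean()
print(class_means)
```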

Data cleaning

Removing nulls

To count the nulls in each column, we can call the isnull() method on the pandas DataFrame and sum the result column by column.

If null values are found within a column, they can be replaced with the column mean using the fillna() method. (The Iris dataset itself contains no missing values, so this step leaves it unchanged.)

# Retrieving number of nulls in each column
print("Number of nulls in each column:")
print(iris_dataframe.isnull().sum())
# Filling null values with the mean of the column; assigning the result back
# avoids calling fillna(inplace=True) on a selected column, which may not update the DataFrame
sepal_length_mean = iris_dataframe['sepal length (cm)'].mean()
iris_dataframe['sepal length (cm)'] = iris_dataframe['sepal length (cm)'].fillna(sepal_length_mean)
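Since the Iris data has no missing values, the effect of mean imputation is easier to see on a copy with a few values deliberately blanked out. The sketch below does that; the demo copy and the choice of rows 0, 10, and 20 are arbitrary assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame so this snippet runs on its own
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# Work on a copy and blank out three values to simulate missing data
demo = iris_dataframe.copy()
demo.loc[[0, 10, 20], 'sepal length (cm)'] = np.nan

# Count nulls per column before imputation
nulls_before = demo.isnull().sum()
print(nulls_before)

# mean() skips NaN by default, so this is the mean of the remaining values
column_mean = demo['sepal length (cm)'].mean()

# Assign the filled column back instead of using inplace=True
demo['sepal length (cm)'] = demo['sepal length (cm)'].fillna(column_mean)

# Count nulls per column after imputation
nulls_after = demo.isnull().sum()
print(nulls_after)
```

Mean imputation keeps the column mean unchanged but reduces its variance, so whether it is appropriate depends on how many values are missing and why.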

Data visualizations

Patterns are hard to spot in tables of statistical values alone. Visualizations make it easier to understand the data and detect patterns.

Here, we visualize our data using histograms, a box plot, and a scatter plot.

Histogram

We will plot the frequency of sepal width and sepal length of the flowers within our dataset. This helps us to understand the underlying distribution:

# Histogram for sepal length and sepal width
fig = plt.figure(figsize=(10, 5))
# Left panel: sepal length
ax1 = fig.add_subplot(121)
ax1.set_xlabel('sepal length (cm)')
ax1.set_ylabel('Count')
iris_dataframe['sepal length (cm)'].hist(ax=ax1)
# Right panel: sepal width
ax2 = fig.add_subplot(122)
ax2.set_xlabel('sepal width (cm)')
ax2.set_ylabel('Count')
iris_dataframe['sepal width (cm)'].hist(ax=ax2)
plt.show()
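The counts behind a histogram can also be inspected numerically. The sketch below uses numpy.histogram with 10 bins, which matches the default bin count of the pandas hist() method; printing the bins as text is an illustrative addition, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame so this snippet runs on its own
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# Compute bin counts and bin edges for sepal length
counts, bin_edges = np.histogram(iris_dataframe['sepal length (cm)'], bins=10)

# Print each bin's range alongside the number of flowers that fall in it
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {count}")
```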

Box plot

We can look for outliers in the sepal width feature of our dataset and then decide whether or not to remove them:

# Creating a box plot
iris_dataframe.boxplot(column='sepal width (cm)', by='class')
title_boxplot = 'sepal width (cm) by class'
plt.title(title_boxplot)
# Remove the automatic "Boxplot grouped by class" super-title
plt.suptitle('')
plt.ylabel('sepal width (cm)')
plt.show()
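Matplotlib's box plot draws its whiskers at 1.5 times the interquartile range (IQR) by default and marks points beyond them as outliers. The same rule can be applied numerically, which gives the actual values to review before deciding whether to remove them. The following is a minimal sketch; iqr_outliers is a hypothetical helper written for this example, and its results should roughly correspond to the points drawn beyond the whiskers.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame so this snippet runs on its own
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target


def iqr_outliers(values):
    """Return the entries of a Series lying more than 1.5 * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]


# Apply the rule separately to each class, mirroring the grouped box plot
outliers_by_class = {}
for label, group in iris_dataframe.groupby('class'):
    outliers_by_class[label] = iqr_outliers(group['sepal width (cm)'])
    print(label, outliers_by_class[label].tolist())
```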

Scatter plot

For each class of flowers in our dataset, we can examine how petal width and petal length are related:

# Scatter plot of petal length and petal width for different classes
# One color per record: class 0 is red, class 1 is blue, class 2 is green
colors = ['red' if label == 0 else 'blue' if label == 1 else 'green' for label in iris_data.target]
plt.scatter(iris_dataframe['petal length (cm)'], iris_dataframe['petal width (cm)'], c=colors)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.show()
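The relationship seen in a scatter plot can be summarized with a correlation coefficient. The sketch below computes the Pearson correlation between petal length and petal width, first over all records and then within each class; this numeric summary is an addition for illustration and is not part of the original walkthrough.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame so this snippet runs on its own
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

# Pearson correlation over the whole dataset (Series.corr defaults to Pearson)
overall_corr = iris_dataframe['petal length (cm)'].corr(iris_dataframe['petal width (cm)'])
print(f"overall: {overall_corr:.3f}")

# Pearson correlation within each class
per_class_corr = {}
for label, group in iris_dataframe.groupby('class'):
    per_class_corr[label] = group['petal length (cm)'].corr(group['petal width (cm)'])
    print(f"class {label}: {per_class_corr[label]:.3f}")
```

Comparing the overall value with the per-class values shows how much of the apparent relationship comes from differences between species rather than from variation within a species.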