Exploratory Data Analysis (EDA) is a way to investigate datasets to find preliminary information, gain insights, and uncover underlying patterns in the data. Instead of relying on assumptions, we process the data systematically to gain insights and make informed decisions.
Some advantages of Exploratory Data Analysis include:
- Improve understanding of variables by extracting summary statistics such as the mean, minimum, and maximum values.
- Discover errors, outliers, and missing values in the data.
- Identify patterns by visualizing data in graphs such as box plots, scatter plots, and histograms.

Hence, the main goal is to understand the data better and use tools effectively to gain valuable insights or draw conclusions.
Fisher's Iris dataset is used to demonstrate EDA tasks in the following code blocks.
The resulting data frame contains 150 records with five attributes: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), and class (an integer code that represents the flower species).
# Importing libraries
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Loading data for analysis
iris_data = load_iris()

# Creating a dataframe
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target
print(iris_dataframe.head())
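Because the class column stores integer codes (0, 1, 2), it can help to attach readable species names before exploring further. The sketch below is one way to do this using the target_names array that load_iris() returns; the species column name is an assumption introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

# Map each integer class code to its species name.
# 'species' is a hypothetical column name, not part of the original data frame.
iris_dataframe['species'] = iris_data.target_names[iris_data.target]

# Count how many records belong to each species
species_counts = iris_dataframe['species'].value_counts()
print(species_counts)
```

Keeping the numeric class column alongside the names leaves the rest of the walkthrough unaffected.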
The first step in data analysis is to inspect summary statistics of the data to decide whether it needs to be preprocessed to make it more consistent.
The describe() method of a pandas data frame gives us important statistics of the data like min, max, mean, standard deviation, and quartiles.
For example, we want to verify the minimum and maximum values in our data. This can be done by invoking the describe() method:
# Summary of numerical variables
print(iris_dataframe.describe())
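Since describe() returns a data frame indexed by statistic name, the minimum and maximum rows can be pulled out directly rather than read off the full table. The following is a minimal sketch of that idea; the variable names min_max and value_range are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# describe() returns a data frame whose index holds the statistic names,
# so individual rows can be selected with .loc
summary = iris_dataframe.describe()
min_max = summary.loc[['min', 'max']]
print(min_max)

# Range (max - min) of each feature
value_range = summary.loc['max'] - summary.loc['min']
print(value_range)
```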
In order to identify the number of nulls within each column, we can invoke the isnull() method on each column of the pandas data frame.
If null values are found within a column, they can be replaced with the column mean using the fillna() method:
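# Retrieving number of nulls in each column
print("Number of nulls in each column:")
print(iris_dataframe.apply(lambda x: x.isnull().sum(), axis=0))

# Filling null values with the mean of the column
# (the result is assigned back instead of using inplace=True on a column selection)
sepal_length_mean = iris_dataframe['sepal length (cm)'].mean()
iris_dataframe['sepal length (cm)'] = iris_dataframe['sepal length (cm)'].fillna(sepal_length_mean)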
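The Iris data contains no missing values, so the fillna() call above leaves it unchanged. To see mean imputation take effect, the following minimal sketch uses a small made-up data frame with one missing entry; the values and the name toy_dataframe are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing sepal length
toy_dataframe = pd.DataFrame({
    'sepal length (cm)': [5.0, np.nan, 6.0, 7.0],
    'sepal width (cm)': [3.5, 3.0, 2.9, 3.2],
})

# Count nulls per column before imputation
nulls_before = toy_dataframe.isnull().sum()
print(nulls_before)

# mean() skips NaN by default, so the column mean is (5.0 + 6.0 + 7.0) / 3 = 6.0
column_mean = toy_dataframe['sepal length (cm)'].mean()
toy_dataframe['sepal length (cm)'] = toy_dataframe['sepal length (cm)'].fillna(column_mean)

# Count nulls per column after imputation
nulls_after = toy_dataframe.isnull().sum()
print(toy_dataframe)
```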
Raw statistical values are difficult for people to interpret at a glance. Visualizations offer an alternative that makes it easier to understand the data and detect patterns.
Here, we can visualize our data using histograms, box plots, and scatter plots.
We will plot the frequency of sepal width and sepal length of the flowers within our dataset. This helps us to understand the underlying distribution:
# Histogram for sepal length and sepal width
fig = plt.figure(figsize=(10, 5))

# Left panel: sepal length
ax1 = fig.add_subplot(121)
ax1.set_xlabel('sepal length (cm)')
ax1.set_ylabel('Count')
iris_dataframe['sepal length (cm)'].hist(ax=ax1)

# Right panel: sepal width
ax2 = fig.add_subplot(122)
ax2.set_xlabel('sepal width (cm)')
ax2.set_ylabel('Count')
iris_dataframe['sepal width (cm)'].hist(ax=ax2)

plt.show()
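The counts behind a histogram can also be inspected numerically. As a sketch, NumPy's histogram function returns the bin counts and bin edges; using 10 bins here mirrors the default of the pandas hist() method.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# Compute bin counts and bin edges for sepal length (10 equal-width bins)
counts, bin_edges = np.histogram(iris_dataframe['sepal length (cm)'], bins=10)

# Print each bin's range alongside the number of flowers that fall in it
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.2f} to {right:.2f} cm: {count}")
```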
We can look for outliers in the sepal width feature of our dataset and then decide whether or not to remove them:
# Creating a box plot
iris_dataframe.boxplot(column='sepal width (cm)', by='class')
title_boxplot = 'sepal width (cm) by class'
plt.title(title_boxplot)
plt.suptitle('')  # remove the automatic "Boxplot grouped by class" super-title
plt.ylabel('sepal width (cm)')
plt.show()
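Box plots conventionally mark points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles as outliers. The sketch below applies the same rule numerically within each class, so the flagged rows can be inspected before deciding what to do with them. The helper name iqr_outlier_mask and the factor k are assumptions introduced for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

# Apply the rule within each class, matching the per-class box plot
outlier_mask = pd.Series(False, index=iris_dataframe.index)
for class_label, group in iris_dataframe.groupby('class'):
    outlier_mask.loc[group.index] = iqr_outlier_mask(group['sepal width (cm)'])

# Show the flagged rows; whether to drop them remains a judgment call
outliers = iris_dataframe.loc[outlier_mask, ['sepal width (cm)', 'class']]
print(outliers)
```

Flagging rows with a mask, rather than dropping them immediately, keeps the original data intact while the decision is made.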
For each class of flowers within our dataset, we can judge how petal width and petal length are related to each other:
# Scatter plot of petal length and petal width for different classes
# Class 0 is drawn in red, class 1 in blue, and class 2 in green
colors = ['red' if label == 0 else 'blue' if label == 1 else 'green' for label in iris_data.target]
plt.scatter(iris_dataframe['petal length (cm)'], iris_dataframe['petal width (cm)'], color=colors)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.show()
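The strength of the relationship seen in the scatter plot can also be quantified. As a minimal sketch, the Pearson correlation coefficient between the two petal measurements can be computed for the whole dataset and for each class separately, using the pandas corr() method. The names overall_corr and per_class_corr are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

# Pearson correlation over all 150 records
overall_corr = iris_dataframe['petal length (cm)'].corr(iris_dataframe['petal width (cm)'])
print(f"Overall correlation: {overall_corr:.3f}")

# Correlation within each class
per_class_corr = {}
for class_label, group in iris_dataframe.groupby('class'):
    per_class_corr[class_label] = group['petal length (cm)'].corr(group['petal width (cm)'])
    print(f"Class {class_label}: {per_class_corr[class_label]:.3f}")
```

Comparing the overall value with the per-class values shows how much of the apparent relationship comes from differences between species rather than from variation within a species.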