Pearson's correlation is a statistical measure of linear correlation between two variables. It is denoted by the symbol "r", and its value ranges from -1 to 1:

If r = 1, the two variables have a perfect positive linear correlation.
If r = -1, the two variables have a perfect negative linear correlation.
If r = 0, the two variables have no linear correlation.
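For reference, for two variables x and y with n paired observations, the coefficient is defined as:

$$
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the two variables.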
Feature selection in machine learning refers to selecting a subset of relevant features from a larger set of available features. This process can improve the performance of a machine learning model by reducing the dimensionality of the input space and eliminating redundant features.
Note: Check out dimensionality reduction.
Pearson's correlation is a feature selection method for continuous input and output data. It is a statistical filter method that operates independently of the specific machine learning model being used.
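As a minimal sketch of how such a filter works (the arrays below are made-up toy data, not the housing dataset), we can compute the coefficient for each candidate feature with `scipy.stats.pearsonr` and keep only the strongly correlated ones:

```python
import numpy as np
from scipy.stats import pearsonr

# Toy data: two candidate features and a continuous target (made-up values)
rng = np.random.default_rng(0)
target = rng.normal(size=100)
feature_a = 2 * target + rng.normal(scale=0.5, size=100)  # strongly related to the target
feature_b = rng.normal(size=100)                          # unrelated noise

for name, feature in [('feature_a', feature_a), ('feature_b', feature_b)]:
    r, _ = pearsonr(feature, target)  # Pearson's r and its p-value
    print(f'{name}: r = {r:.2f}')     # expect |r| near 1 for feature_a, near 0 for feature_b
```

Because this score is computed from the data alone, the same selected subset can then be fed to any downstream model.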
In this Answer, we will use the Boston housing dataset, available through scikit-learn. Here's an overview of the dataset's first four entries:
The `MEDV` column is our target variable, whereas the other columns are the features. The following code calculates the correlation of every feature with `MEDV` (the target column):
```python
# Imports
import seaborn as sns
from sklearn.datasets import load_boston
import pandas as pd
import matplotlib.pyplot as plt

# Load the housing dataset
X, y = load_boston(return_X_y=True)
data = pd.DataFrame(X, columns=load_boston().feature_names)
data['target'] = y

# Compute Pearson's correlation coefficient
target = data.corr()['target'].abs().sort_values(ascending=False)

# Plotting the correlations using a bar plot with values on top
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=target.values, y=target.index, palette='Reds_r')

# Add values on top of the bars
for i, v in enumerate(target.values):
    ax.text(v + 0.01, i, f'{v:.2f}', color='black', ha='left')

plt.xlabel('Correlation')
plt.ylabel('Features')
plt.title('Correlation between Features and Target (MEDV)')
plt.show()
```
Note: Switch to the "Output" tab in the code above to view the plot.
Lines 2–5: Importing necessary libraries.
Lines 7–10: Loading the Boston housing data and assigning the target value.
Line 13: Computing the correlation of each feature with the target value.
Lines 16–17: Plotting the correlations as a bar graph.
Lines 20–21: Adding the correlation values on top of the bars.
Lines 23–26: Labeling the plot and displaying it.
We can select features using a threshold. Here, we'll keep the features whose Pearson's coefficient has an absolute value greater than 0.5 and discard the others. We can see that `PTRATIO`, `RM`, and `LSTAT` have absolute Pearson's coefficients greater than 0.5, so we will pick these features and discard the rest.
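In code, this thresholding is a one-liner on the correlation Series computed earlier. Here's a small sketch that generalizes it (the helper name `select_by_correlation` is our own, not a library function), assuming the `data` DataFrame from the snippet above:

```python
def select_by_correlation(data, target_col, threshold=0.5):
    """Return feature names whose absolute Pearson's correlation
    with the target column exceeds the threshold."""
    corr = data.corr()[target_col].abs()
    corr = corr.drop(target_col)  # don't select the target itself
    return corr[corr > threshold].index.tolist()

# Expected to yield ['RM', 'PTRATIO', 'LSTAT'] on the Boston housing data
selected = select_by_correlation(data, 'target', threshold=0.5)
print(selected)
```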
If there's still a possibility of reducing dimensions, we need to analyze the selected features further. For that, we'll plot their correlation with each other as follows:
```python
# Imports
import seaborn as sns
from sklearn.datasets import load_boston
import pandas as pd
import matplotlib.pyplot as plt

X, y = load_boston(return_X_y=True)
features = load_boston().feature_names
data = pd.DataFrame(X, columns=features)
data['MEDV'] = y

# Compute Pearson's correlation coefficient
target = data.corr()['MEDV'].abs().sort_values(ascending=False)

# Selecting the features most correlated with the target value
selected = target[target > 0.5].drop('MEDV').index

# Plotting the selected features' pairwise correlations as a heatmap
sns.heatmap(data.corr().loc[selected, selected], annot=True, cmap=plt.cm.Reds)
plt.show()
```
Note: Switch to the "Output" tab in the code above to view the plot.
The code remains the same as the previous one with the following changes:
Line 16: Filtering out the features having an absolute Pearson's coefficient greater than 0.5.
Line 19: Plotting those features' pairwise correlations as a heatmap.
We can see that `LSTAT` and `RM` are closely related to each other. Hence, they are redundant features, meaning dropping one of them will help reduce dimensionality. Suppose we drop `LSTAT`. Now, we have two features: `RM` and `PTRATIO`. The `RM` feature is more strongly correlated with the target value than `PTRATIO` is. Hence, we can proceed with `RM` alone.
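To close the loop, here's a minimal sketch of training on the reduced feature set, assuming the `data` DataFrame from the snippets above (the train/test split and `LinearRegression` baseline are our own illustrative choices, not part of the original workflow):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Keep only the single selected feature
X_reduced = data[['RM']]
y = data['MEDV']

X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.2, random_state=42)

# Fit a simple baseline model on the one remaining feature
model = LinearRegression().fit(X_train, y_train)
print(f'R^2 on held-out data: {model.score(X_test, y_test):.2f}')
```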
Using Pearson's correlation, we simplified the dataset by focusing on a single important factor. This reduction from 13 dimensions to 1 will make our machine learning model faster and less complicated.
Note: Read about linear discriminant analysis.