Pearson's correlation

Pearson's correlation is a statistical measure of linear correlation between two continuous variables, i.e., variables that can take on any numerical value within a specific range of a continuous scale. It is also known as Pearson's correlation coefficient.

It is denoted by the symbol "r" and takes values between -1 and 1. The value of r indicates the degree of linear dependence between the variables:

  • If r = 1, it indicates a perfect positive linear relationship, meaning that as one variable increases, the other variable increases proportionally.

  • If r = -1, it indicates a perfect negative linear relationship, meaning that the other variable decreases proportionally as one variable increases.

  • If r = 0, it indicates no linear relationship between the variables.
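
Formally, r is the covariance of the two variables divided by the product of their standard deviations. As a quick illustration, here's a minimal sketch (using NumPy, with made-up sample data) that computes r for two short arrays:

import numpy as np

# Hypothetical sample data: y grows roughly linearly with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")  # close to 1: a strong positive linear relationship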

Feature selection

Feature selection in machine learning refers to selecting a subset of relevant features from a larger set of available features. This process improves the performance of a machine learning model by eliminating redundant features, thereby reducing the dimensionality of the input space.

Note: Check out dimensionality reduction.

Pearson's correlation is a feature selection method for continuous input and output data. It is a statistical filter method that operates independently of the specific machine learning model being used.
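
As a side note, scikit-learn ships this filter as a ready-made routine: r_regression (available in scikit-learn 1.0 and later) computes Pearson's r between each feature column and a continuous target. Here's a minimal sketch with synthetic data; the coefficients and the 0.5 threshold are made up for illustration:

import numpy as np
from sklearn.feature_selection import r_regression

# Synthetic data: 100 samples, 3 features; only the first two drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Pearson's r between each feature and the target
r = r_regression(X, y)
print(r)  # roughly [0.8, -0.6, ~0]: strong positive, moderate negative, none

# Keep only the features whose absolute correlation exceeds a threshold
mask = np.abs(r) > 0.5
print(mask)  # [ True  True False]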

Procedure

In this Answer, we will use the Boston housing dataset, which contains 13 continuous features describing Boston-area neighborhoods along with MEDV, the median home value.

The MEDV column is our target variable, whereas the other columns are the features. The following code calculates the correlation of every feature with MEDV (the target column):

# Imports
import seaborn as sns
from sklearn.datasets import load_boston
import pandas as pd
import matplotlib.pyplot as plt

# Load the housing dataset (note: load_boston was removed in scikit-learn 1.2)
X, y = load_boston(return_X_y=True)
data = pd.DataFrame(X, columns=load_boston().feature_names)
data['MEDV'] = y

# Compute each feature's absolute Pearson correlation with MEDV
target = data.corr()['MEDV'].abs().sort_values(ascending=False).drop('MEDV')

# Plotting the correlations using a bar plot with values on top
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=target.values, y=target.index, palette='Reds_r')

# Add values on top of the bars
for i, v in enumerate(target.values):
    ax.text(v + 0.01, i, f'{v:.2f}', color='black', ha='left')

plt.xlabel('Correlation')
plt.ylabel('Features')
plt.title('Correlation between Features and Target (MEDV)')
plt.show()
Python code for plotting the most correlated features with the target value

Note: Switch to the "Output" tab in the code above to view the plot.

Code explanation

  • Lines 2–5: Importing necessary libraries.

  • Lines 7–10: Loading the Boston housing data and assigning the target value.

  • Line 13: Computing the absolute correlation of each feature with the target value.

  • Lines 16–17: Creating the figure and drawing the correlations as a bar plot.

  • Lines 20–21: Adding the correlation values on top of the bars.

  • Lines 23–26: Labeling and displaying the plot.

Analysis

We can select features using a threshold. Here, we'll keep the features whose absolute Pearson coefficient is greater than 0.5 and discard the others. We can see that PTRATIO, RM, and LSTAT have absolute Pearson coefficients greater than 0.5, so we pick these three features and discard the rest.

We still need to analyze whether there's a possibility of reducing dimensions further. For that, we'll plot the selected features' correlations with each other as follows:

# Imports
import seaborn as sns
from sklearn.datasets import load_boston
import pandas as pd
import matplotlib.pyplot as plt

X, y = load_boston(return_X_y=True)
features = load_boston().feature_names
data = pd.DataFrame(X, columns=features)
data['MEDV'] = y

# Compute Pearson's correlation coefficient
target = data.corr()['MEDV'].abs().sort_values(ascending=False)

# Selecting the features most correlated with the target value
print(target[target > 0.5])

# Plotting the selected features' pairwise correlations as a heatmap
sns.heatmap(data.corr().loc[['RM', 'PTRATIO', 'LSTAT'], ['RM', 'PTRATIO', 'LSTAT']], annot=True, cmap=plt.cm.Reds)
plt.show()

Note: Switch to the "Output" tab in the code above to view the plot.

The code remains the same as the previous one with the following changes:

  • Line 16: Selecting and printing the features whose absolute Pearson coefficient is greater than 0.5.

  • Line 19: Plotting those features' pairwise correlations as a heatmap.

We can see that LSTAT and RM are strongly correlated with each other. Hence, one of them is redundant, and dropping it will help reduce dimensionality. Suppose we drop LSTAT. Now, we have two features: RM and PTRATIO. Since RM is more strongly correlated with the target value than PTRATIO, we can proceed with RM alone.
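
To complete the selection, here's a minimal sketch (reusing the data DataFrame built in the code above) that keeps RM as the only feature; the linear model is just an illustrative choice, not part of the filter method itself:

from sklearn.linear_model import LinearRegression

# Keep only RM after dropping LSTAT (redundant) and PTRATIO (weaker)
X_selected = data[['RM']]
y = data['MEDV']

# Fit an illustrative linear model on the single selected feature
model = LinearRegression().fit(X_selected, y)
print(model.score(X_selected, y))  # R^2 achieved using RM alone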

Conclusion

Using Pearson's correlation, we simplified the dataset by focusing on a single important feature. This reduction from 13 dimensions to 1 will make our machine learning model faster and less complicated.

Note: Read about linear discriminant analysis.
