Case Study: Explore Feature Impact with Partial Dependence Plots

Learn how to apply partial dependence plots to explore the impact of features on target variables.

So far, we’ve explored the relative importance of different features. In this lesson, we’ll go a step further and discover how a specific feature interacts with the target variable.

More specifically, we’ll study the partial dependence plot (PDP), a powerful visual tool in machine learning that reveals the influence of a particular feature on the model’s predictions by averaging out the effects of all other features. By examining the isolated impact of a single variable across a range of values, PDPs help us understand the complex inner workings of the model.

PDPs provide a global perspective, focusing on the average effect of a feature rather than specific instances. This technique offers a range of benefits:

  • It’s easy to compute and explain in simple terms, making it accessible to everyone.
  • It helps us uncover the relationship between a feature (or a combination of features) and the target variable.
  • Unlike many other techniques, a PDP has a causal interpretation with respect to the model: it shows how the model’s predictions change as the feature changes (though this does not necessarily reflect causality in the real world).
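To make the notion of an “average effect” precise, the partial dependence of a fitted model $\hat{f}$ on a feature subset $S$ is defined (this is the standard formulation, also used by scikit-learn) as:

$$\hat{f}_S(x_S) = \frac{1}{n}\sum_{i=1}^{n} \hat{f}\big(x_S,\, x_C^{(i)}\big)$$

Here, $x_C^{(i)}$ denotes the observed values of the remaining (complement) features $C$ for the $i$-th training instance, and $n$ is the number of instances. A PDP simply plots $\hat{f}_S$ over a grid of values of $x_S$.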

In this lesson, we’ll analyze a loan dataset and apply partial dependence plots to better understand and explain the model’s behavior.

Data ingestion and exploratory analysis

We’ll start with some basic exploratory data analysis. Our main focus, however, will be on building partial dependence plots (PDPs) with the sklearn framework.

main.py
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('loan_approval.csv')

# Transform categorical variables into numeric values
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
obj = (data.dtypes == 'object')
for col in list(obj[obj].index):
    data[col] = label_encoder.fit_transform(data[col])

# Missing values before imputation
print('Missing Values before imputation\n')
print(data.isna().sum())

for col in data.columns:
    # Imputing missing values with the column mean
    data[col] = data[col].fillna(data[col].mean())

print('Missing Values after imputation\n')
print(data.isna().sum())

# Drop target variable from feature set
X = data.drop(['Loan_Status'], axis=1)
Y = data['Loan_Status']
print(X.shape, Y.shape)

# Split data into train and test samples
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=1)
print('Data shape after splitting into training and test sets:\n')
print('X_train:', X_train.shape, '\nX_test:', X_test.shape, '\nY_train:', Y_train.shape, '\nY_test:', Y_test.shape)

The above script reads the loan data, preprocesses it, and splits it into train and test samples for model-building purposes.

  • Lines 1–5: We load the Python libraries to be used for analysis purposes.

  • Lines 7–8: We load the input data that will be used for model training and evaluation.

  • Lines 10–15: We transform the categorical variables into numeric format, a step required because scikit-learn estimators accept only numeric inputs.

  • Lines 17–26: We count the missing values, impute each column with its mean, and confirm that no missing values remain.

  • Lines 28–31: We separate the feature set X from the target variable Loan_Status.

  • Lines 33–36: We split the data into training (60%) and test (40%) samples; a sketch of the next step follows this list.
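The script above stops at the train/test split. As a preview of what comes next, here is a minimal sketch of how a PDP can be drawn with scikit-learn’s inspection module. The RandomForestClassifier model and the 'ApplicantIncome' feature are illustrative assumptions, not choices fixed by the lesson, and the sketch assumes scikit-learn >= 1.0 along with the X_train and Y_train variables created above.

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# Fit any scikit-learn estimator; a random forest is an illustrative choice
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, Y_train)

# Plot the average effect of one feature on the predicted probability;
# 'ApplicantIncome' is a hypothetical column name in this loan dataset
PartialDependenceDisplay.from_estimator(model, X_train, features=['ApplicantIncome'])
plt.show()

For a classifier, the y-axis shows the partial dependence of the predicted probability for the positive class, so a flat curve suggests the feature has little average effect on the model’s output.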