Case Study: Explore Feature Impact with Partial Dependence Plots
Learn how to apply partial dependence plots to explore the impact of features on target variables.
So far, we’ve explored the relative importance of different features. In this lesson, we’ll go a step further and discover how a specific feature relates to the target variable.
More specifically, we’ll study the partial dependence plot (PDP), a powerful visual tool in machine learning that reveals the influence of a particular feature on the model’s predictions by averaging out the effects of all other features. By examining the isolated impact of a single variable across a range of values, PDPs help us understand the complex inner workings of the model.
PDPs provide a global perspective, focusing on the average effect of a feature rather than specific instances. This technique offers a range of benefits:
- It’s easy to compute and explain in simple terms, making it accessible to everyone.
- It helps us uncover the relationship between a feature (or a combination of features) and the target variable.
- It offers a causal interpretation of the relationship between a feature and the model’s predictions. Note that this describes the model’s behavior, not necessarily a real-world causal effect.
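To make the averaging idea concrete, here is a minimal sketch of how a one-feature PDP can be computed by hand. The names `model` and `X` are hypothetical placeholders for a fitted estimator and its feature matrix; this illustrates the technique itself, not the lesson’s code:

```python
# A minimal, hand-rolled sketch of the PDP computation.
# `model` and `X` are hypothetical: any fitted estimator with a
# .predict() method and the DataFrame it was trained on will do.
import numpy as np

def partial_dependence_1d(model, X, feature, grid_points=20):
    """Average the model's predictions over a grid of values for
    `feature`, marginalizing over the observed values of all others."""
    grid = np.linspace(X[feature].min(), X[feature].max(), grid_points)
    averaged_predictions = []
    for value in grid:
        X_modified = X.copy()
        X_modified[feature] = value  # set every row's feature to this value
        averaged_predictions.append(model.predict(X_modified).mean())
    return grid, np.array(averaged_predictions)
```

Plotting `grid` against the averaged predictions yields exactly the curve a PDP displays: the model’s average prediction as the chosen feature sweeps across its observed range.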
In this lesson, we’ll analyze a loan dataset and apply the partial dependence plot to gain a deeper understanding of the model’s explainability.
Data ingestion and exploratory analysis
We’ll start with some basic exploratory data analysis and preprocessing. Our main focus, however, will be on building partial dependence plots (PDPs) with the sklearn framework.
```python
#Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

#Read the data
data = pd.read_csv('loan_approval.csv')

# Transform categorical variables into numeric values
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
obj = (data.dtypes == 'object')
for col in list(obj[obj].index):
    data[col] = label_encoder.fit_transform(data[col])

# Missing values before imputation
print('Missing Values before imputation\n')
print(data.isna().sum())

for col in data.columns:
    # Imputing missing values with mean
    data[col] = data[col].fillna(data[col].mean())

print('Missing Values after imputation\n')
print(data.isna().sum())

# Drop target variable from feature set
X = data.drop(['Loan_Status'], axis=1)
Y = data['Loan_Status']
print(X.shape, Y.shape)

#Split data into train and test samples
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=1)
print('Data shape after splitting into training and test sets:\n')
print('X_train:', X_train.shape, '\nX_test:', X_test.shape, '\nY_train:', Y_train.shape, '\nY_test:', Y_test.shape)
```
The above script reads the loan data, preprocesses it, and splits it into train and test samples for model-building purposes.
- Lines 1–5: We load the Python libraries to be used for analysis purposes.
- Lines 7–8: We load the input data that will be used for model training and evaluation.
- Lines 10–15: We transform the categorical variables into numeric format, a necessary step because scikit-learn estimators require numeric inputs.
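With the data split in place, the natural next step is to fit a model and draw the PDPs with scikit-learn. As a hedged preview, a sketch using `PartialDependenceDisplay` might look as follows; the `RandomForestClassifier` and the feature names `ApplicantIncome` and `LoanAmount` are illustrative assumptions, not necessarily the lesson’s exact choices:

```python
# Sketch only: the classifier and feature names are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay

# Fit a classifier on the training split produced above
model = RandomForestClassifier(random_state=1)
model.fit(X_train, Y_train)

# One PDP per listed feature, averaged over the training sample
PartialDependenceDisplay.from_estimator(
    model, X_train, features=['ApplicantIncome', 'LoanAmount']
)
plt.show()
```

For a binary classifier like this one, scikit-learn plots the partial dependence of the predicted probability of the positive class, so each curve shows how the average predicted approval probability moves as the feature varies.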