Case Study: Explain Model Decisions with the SHAP Framework

Learn to explain global and local model decisions using the SHAP framework and gain familiarity with different techniques to visualize the results.

In this lesson, let’s get familiar with the SHAP (SHapley Additive exPlanations) framework and use it to explain the decisions of a regression model.
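To build some intuition for the "additive" part of the name, here is a minimal sketch that computes exact Shapley values for a tiny hand-written model. Everything in it is an assumption made for illustration: the function `toy_model`, the instance, the all-zero baseline, and the simplification of replacing "missing" features with baseline values. The SHAP library itself relies on background data and much faster algorithms, so this is not how the library is implemented.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    # Hypothetical model: two linear terms plus one interaction term
    return 2 * x[0] + 3 * x[1] + x[0] * x[2]


def shapley_values(model, instance, baseline):
    """Exact Shapley values, treating absent features as baseline values."""
    n = len(instance)

    def value(subset):
        # Features in `subset` take the instance value, the rest the baseline
        x = [instance[j] if j in subset else baseline[j] for j in range(n)]
        return model(x)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print("Shapley values:", phi)
print("Sum of contributions:", sum(phi))
print("Prediction minus baseline prediction:",
      toy_model(instance) - toy_model(baseline))
```

The per-feature contributions add up to the difference between the prediction for the instance and the prediction for the baseline. In this toy case, the interaction term is split evenly between the two features involved.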

We’ll study a housing dataset, where the challenge is to predict house prices from measured features. We’ll train a regression model on this data and then use the SHAP framework to explain its predictions on individual examples.

Train a regression model to predict house prices

We’ll do some basic exploratory data analysis and then train a regression model to predict house prices. Our main focus will be on using the SHAP framework to explain the global and local decisions of a model.

# Data handling and plotting libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Modeling and evaluation utilities
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Read data
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
data = pd.read_csv('housing.csv', header=None, delimiter=r"\s+", names=column_names)
print('Shape of Dataset:', data.shape)
print('Sample Records:\n', data.head(3))

# Data preparation and split
target_variable = "MEDV"
Y = data.loc[:, target_variable]
X = data.drop(columns=[target_variable])
X = pd.get_dummies(X)
# Data split
random_seed = 445
np.random.seed(random_seed)
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.1, shuffle=True, random_state=random_seed)
print('Data shape after splitting into training and test sets:\n')
print('X_train:', X_train.shape, '\nX_test:', X_test.shape, '\nY_train:', y_train.shape, '\nY_test:', y_test.shape)

# Create an instance of the random forest regressor
model = RandomForestRegressor(n_estimators=300, random_state=123)
# Train the regressor
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Model performance metric: root mean squared error (square root of the MSE)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print('Root Mean Squared Error:', rmse)

The code explanation is provided below:

  • Lines 1–9: We import essential libraries to equip ourselves for data analysis.

  • Lines 11–15: We read the input data into our analysis and look at sample records.

  • Lines 17–28: We prepare the data for modeling and split the data into training and testing sets to evaluate our model.

  • Lines 30–38: We train a regressor model on training data, generate predictions on test data, and measure the model’s performance to assess its effectiveness.

We observe that the trained model scores about 2.7 on the performance metric. Because the code takes the square root of the mean squared error (MSE), this value is the root mean squared error (RMSE). The MSE is the average of the squared differences between predicted and actual values; the RMSE is its square root, which gives a single number in the same units as the target variable MEDV.
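To make the metric concrete, the short sketch below computes the MSE and its square root (the RMSE) by hand with NumPy. The arrays `y_true` and `y_hat` are made-up values for illustration only and are not taken from the housing data.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_hat = np.array([22.0, 23.6, 31.7, 34.4])

errors = y_true - y_hat          # per-example prediction errors
mse = np.mean(errors ** 2)       # average of the squared errors
rmse = np.sqrt(mse)              # square root brings it back to target units

print("MSE:", mse)
print("RMSE:", rmse)
```

Squaring penalizes large errors more heavily than small ones, and taking the square root afterward makes the number easier to compare against typical values of the target.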

The model performance is reasonable and could be improved further through hyperparameter tuning.
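As a rough sketch of what such tuning could look like, the example below runs a small cross-validated grid search over a random forest with scikit-learn's `GridSearchCV`. The synthetic data from `make_regression` and the parameter grid are assumptions chosen only to keep the example self-contained and quick. In the lesson's setting, the training split would take the place of `X_demo` and `y_demo`, and the grid would need to be chosen with more care.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); swap in the real training split
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hypothetical, deliberately small grid of candidate settings
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=123),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated RMSE:", -search.best_score_)
```

The score is negated because scikit-learn maximizes scoring functions, so flipping the sign of `best_score_` recovers the RMSE.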

Because our interest here is in explaining the model’s global and individual results, we’ll focus our efforts on the SHAP framework rather than on tuning.

Using SHAP for global

...