Case Study: Explain Model Decisions with the SHAP Framework
Learn to explain global and local model decisions using the SHAP framework and gain familiarity with different techniques to visualize the results.
In this lesson, we’ll get familiar with the SHAP (SHapley Additive exPlanations) framework and use it to explain the decisions of a regression model.
We’ll study the housing data case, where the challenge is to predict house prices from measured features. We’ll train a regression model on the housing dataset and then explain the model’s predictions on individual examples using the SHAP framework.
Train a regression model to predict house prices
We’ll do some basic exploratory data analysis and then train a regression model to predict house prices. Our main focus will be on using the SHAP framework to explain the global and local decisions of a model.
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Read data
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
data = pd.read_csv('housing.csv', header=None, delimiter=r"\s+", names=column_names)
print('Shape of Dataset:', data.shape)
print('Sample Records:\n', data.head(3))

# Data preparation and split
target_variable = "MEDV"
Y = data.loc[:, target_variable]
X = data.drop(columns=[target_variable])
X = pd.get_dummies(X)

# Data split
random_seed = 445
np.random.seed(random_seed)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, shuffle=True, random_state=random_seed)
print('Data shape after Splitting into Training and Test sets:\n')
print('X_Train:', X_train.shape, '\nX_test:', X_test.shape, '\nY_train:', y_train.shape, '\nY_test', y_test.shape)

# Create an instance of the random forest regressor
model = RandomForestRegressor(n_estimators=300, random_state=123)
# Train the regressor
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Model performance metric: the square root of the MSE, i.e., the RMSE
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print('Root Mean Squared Error:', rmse)
The code explanation is provided below:
- Lines 1–9: We import the libraries needed for data handling, plotting, and modeling.
- Lines 11–15: We read the input data and look at a few sample records.
- Lines 17–28: We prepare the data for modeling and split it into training and test sets so that we can evaluate the model.
- Lines 30–38: We train a regression model on the training data, generate predictions on the test data, and measure the model’s performance.
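The data preparation step calls `pd.get_dummies`, which one-hot encodes categorical (string-typed) columns and leaves numeric columns as they are. The sketch below illustrates that behavior on a tiny made-up frame; the `ZONE` column and all values are hypothetical and are not part of the housing data.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical column
toy = pd.DataFrame({
    "RM": [6.5, 5.9, 7.1],
    "ZONE": ["A", "B", "A"],  # made-up categorical feature for illustration
})

# Categorical columns are expanded into one indicator column per category
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# A frame with only numeric columns keeps the same columns
numeric_only = toy[["RM"]]
encoded_numeric = pd.get_dummies(numeric_only)
print(list(encoded_numeric.columns))
```

If every feature in the dataset is already numeric, the call leaves the feature columns unchanged, so it acts only as a safeguard.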
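We observe that the trained model reports an error of about 2.7 on the test set. Because the code takes the square root of the mean squared error (MSE), this value is the root mean squared error (RMSE) rather than the MSE itself. The MSE is the average of the squared differences between predicted and actual values. The RMSE is its square root, which puts the error back in the same units as the target variable, MEDV.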
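To make the distinction between the two metrics concrete, here is a minimal sketch that computes the MSE and its square root by hand with NumPy. The numbers are made up for illustration and are not taken from the housing data.

```python
import numpy as np

# Made-up actual and predicted values (illustration only)
y_true = np.array([20.0, 22.0, 30.0, 35.0])
y_hat = np.array([21.0, 21.0, 28.0, 37.0])

# Errors are 1, -1, -2, 2
errors = y_hat - y_true

# MSE: average of the squared errors, (1 + 1 + 4 + 4) / 4
mse_value = np.mean(errors ** 2)

# RMSE: square root of the MSE, in the same units as the target
rmse_value = np.sqrt(mse_value)

print("MSE:", mse_value)
print("RMSE:", rmse_value)
```

Squaring penalizes large errors more heavily, and taking the square root afterward makes the final number easier to compare with the target values.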
The model performance is reasonable and could be improved further through hyperparameter tuning.
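As a rough sketch of what such tuning could look like, the snippet below runs a small cross-validated grid search over a random forest with scikit-learn's `GridSearchCV`. It uses synthetic data from `make_regression` as a stand-in for the housing data, and the parameter grid is an arbitrary assumption chosen to keep the example fast, not a recommended setting.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: replaces X_train / y_train from the lesson)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Small, arbitrary grid for illustration only
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [None, 5],
}

# 3-fold cross-validation, scored by negative RMSE (higher is better)
search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=123),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign to get a positive RMSE
print("Best parameters:", best_params)
print("Cross-validated RMSE:", best_rmse)
```

The fitted `search.best_estimator_` can then be evaluated on the held-out test set in the same way as the original model.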
Because our interest here is in explaining the model’s global and individual predictions, we’ll focus on the SHAP framework rather than on tuning.