Model Evaluation
Evaluate the quality of the model using a handful of evaluation metrics.
A residual value measures how much a regression line vertically misses a data point. A regression line is the line of best fit for a set of data; we can think of it as an average, where a few data points fall on the line and the rest miss it by some amount. A residual plot has the residual values on the vertical axis, while the horizontal axis typically displays the independent variable (or, as in the plot below, the model predictions).
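As a minimal illustration (using made-up numbers rather than the project data), each residual is simply the observed value minus the value predicted by the line:

# Minimal sketch with made-up numbers: residual = observed - predicted
import numpy as np

y_observed = np.array([3.0, 5.0, 7.5])    # actual values
y_predicted = np.array([2.5, 5.5, 7.0])   # values on the fitted regression line
residuals = y_observed - y_predicted
print(residuals)          # positive = line under-predicts, negative = line over-predicts
print(residuals.mean())   # average vertical miss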
Residual scatterplot
The scatterplot below (residual values against the y_hat predictions) is helpful for checking whether the residuals are scattered randomly around zero.
# Residual scatterplot
plt.figure(figsize=(14, 6))
print("Red dotted line shows the mean value '{}' of residuals (y_test - predictions). "
      "\nTHIS IS NOT THE MODEL LINE!".format(round(np.mean(y_test - predictions), 4)))
plt.axhline(y=np.mean(y_test - predictions), color='r', ls='--', linewidth=3)
plt.scatter(predictions, y_test - predictions)
plt.xlabel('Model Predictions (y_hat)')
plt.ylabel('Residual Values')
plt.title('\nResidual scatter plot vs model predictions.');
We are interested in the following.
Scedasticity: We want a consistent variance between our low and high predictions (homoscedasticity). If we spot the opposite (heteroscedasticity), the target is likely not normally distributed. The remedy is to run the target vector through a power transformation, such as Box-Cox or Yeo-Johnson, to make it more Gaussian-like (a code sketch follows below the note on scedasticity).
Outliers: If the loss function involves squaring the residuals (for example, MSE or RMSE; R-squared is also based on squared residuals), then outliers will have a lot of leverage over the model. We recommend removing the worst outliers or offenders from the training data, as sketched below.
In statistics, scedasticity is the distribution of error terms, and they can either be distributed randomly and with constant variance (homoscedasticity) or with some pattern (heteroscedasticity).
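Below is a rough sketch of both remedies mentioned above: a Yeo-Johnson power transform of the target and a simple 1.5*IQR rule for dropping the worst target outliers. The variable names (X_train, y_train) follow the ones used in this notebook, and the 1.5*IQR cutoff is just one common convention, not the only option.

# Sketch only: make the target more Gaussian-like and trim extreme target outliers
import numpy as np
from sklearn.preprocessing import PowerTransformer

# 1) Power-transform the target (Yeo-Johnson also handles zero/negative values)
pt = PowerTransformer(method='yeo-johnson')
y_train_transformed = pt.fit_transform(np.asarray(y_train).reshape(-1, 1)).ravel()
# After predicting on the transformed scale, map back with:
# y_pred_original = pt.inverse_transform(y_pred.reshape(-1, 1)).ravel()

# 2) Drop the worst target outliers using a 1.5*IQR rule
q1, q3 = np.percentile(y_train, [25, 75])
iqr = q3 - q1
keep = (y_train >= q1 - 1.5 * iqr) & (y_train <= q3 + 1.5 * iqr)
X_train_trimmed, y_train_trimmed = X_train[keep], y_train[keep]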
Residual histogram
The residual histogram is also helpful, as it can tell us how much the predicted value differs from the actual value in y_test. We can do the subtraction y_test - predictions for this plot. Before we move on, let's try to understand the skewness in the distribution plots. The code below generates data and creates the plots for learning.
fig, ax = plt.subplots(nrows=3, figsize=(18, 14))

# Generating data
np.random.seed(41)
size, scale = 100000, 10
PositiveSkew = pd.Series(np.random.gamma(scale, size=size)**2.5)
NormalDistribution = pd.Series(np.random.randn(size))
NegativeSkew = PositiveSkew[::-1]

# Positive skew / right skewed
sns.histplot(data=PositiveSkew, ax=ax[0], kde=True, bins=200, color='r', line_kws={'lw': 3})
ax[0].text(1000, 4000, 'Positive Skew / Right Skewed', fontsize=18, color='r')
ax[0].axvline(np.mean(PositiveSkew), lw=5, c='DarkRed')
ax[0].text(390, 1500, 'Mean', fontsize=18, color='DarkRed', weight='bold')
ax[0].axvline(np.median(PositiveSkew), lw=5, c='gold', ls='--')
ax[0].text(160, 1500, 'Median', fontsize=18, color='gold', weight='bold')
ax[0].set_xlim(0, 1500)
ax[0].set(xticklabels=[], yticklabels=[], ylabel=None)

# Normal distribution - no bias
sns.histplot(data=NormalDistribution, ax=ax[1], kde=True, bins=40, color='g', line_kws={'lw': 3})
ax[1].text(1.6, 6050, 'Normal Distribution, No Bias - OK', fontsize=18, color='g')
ax[1].axvline(np.mean(NormalDistribution), lw=10, c='DarkRed')
ax[1].text(0.1, 4000, 'Mean', fontsize=18, color='DarkRed', weight='bold')
ax[1].axvline(np.median(NormalDistribution), lw=3, c='gold', ls='--')
ax[1].text(0.1, 3000, 'Median', fontsize=18, color='gold', weight='bold')
ax[1].set(xticklabels=[], yticklabels=[], ylabel=None)

# Negative skew / left skewed
sns.histplot(data=NegativeSkew, ax=ax[2], kde=True, bins=200, color='r', line_kws={'lw': 3})
ax[2].text(1000, 4000, 'Negative Skew / Left Skewed', fontsize=18, color='r')
ax[2].axvline(np.mean(NegativeSkew), lw=5, c='DarkRed')
ax[2].text(470, 1500, 'Mean', fontsize=18, color='DarkRed', weight='bold')
ax[2].axvline(np.median(NegativeSkew), lw=5, c='gold', ls='--')
ax[2].text(280, 1500, 'Median', fontsize=18, color='gold', weight='bold')
ax[2].set_xlim(1500, 0)
ax[2].set(xticklabels=[], yticklabels=[], ylabel=None)

print("NOTE: The x-limits are enforced in these histograms; to see the full plots, remove the limits!")
Note: The mean and median are the same for normal distributions. However, they become different numbers if the distribution is skewed. The mean will be on the right side of the median in a right-skewed (positive) distribution. In a left-skewed (negative) distribution, the mean will be on the left of the median.
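A quick way to check this relationship on the generated data above is to compare the mean, the median, and pandas' skew() (reusing the PositiveSkew and NormalDistribution series from the previous cell):

# Compare mean, median and skewness for the generated series
for name, series in [('PositiveSkew', PositiveSkew), ('NormalDistribution', NormalDistribution)]:
    print(f"{name}: mean={series.mean():.3f}, median={series.median():.3f}, skew={series.skew():.3f}")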
For the best-fitted model, the mean and the sum of the residuals are always zero for the training data (X_train, y_train); this is a property of an ordinary least-squares fit that includes an intercept. Let's see how the residual histogram looks for the testing data (X_test, y_test). We don't expect the mean and the sum of the residuals to be zero in this plot.
pred_train = lm.predict(X_train)
print("TRAINING DATA (X_train, y_train)")
print("Mean of the residual:", abs(round(np.mean(y_train - pred_train), 4)))
print("Sum of the residual:", abs(round(np.sum(y_train - pred_train), 4)))

# Residual histograms
fig1, ax1 = plt.subplots(nrows=2, figsize=(16, 10))

# Training data
sns.histplot(data=y_train - pred_train, ax=ax1[0], kde=True, bins=100)
ax1[0].axvline(x=np.mean(y_train - pred_train), color='r', ls='--', linewidth=2)
ax1[0].set_title("Residual histogram for TRAINING DATA (X_train, y_train)", fontsize=16)
ax1[0].set_xlabel('Residuals (y_train - pred_train)', fontsize=14)
ax1[0].set_ylabel('Relative Frequency', fontsize=14)
ax1[0].set_xlim(-20, 30)
# putting some text on the plot!
ax1[0].text(x=5, y=18, bbox=dict(facecolor='green', alpha=0.3), fontsize=12,
            s="For the best fitted model, the mean and the sum of residuals always\nequal zero for the TRAINING DATA.")
ax1[0].text(x=5, y=13, bbox=dict(facecolor='red', alpha=0.3), fontsize=12,
            s="Compare this plot with the above example histograms.\n\nWhat do you think?")

# Test data
print("\nTEST DATA (X_test, y_test)")
print("Mean of the residual:", round(np.mean(y_test - predictions), 4))
print("Sum of the residual:", round(np.sum(y_test - predictions), 4))
sns.histplot(data=y_test - predictions, ax=ax1[1], kde=True, bins=100)
ax1[1].axvline(x=np.mean(y_test - predictions), color='r', ls='--', linewidth=2)
ax1[1].set_xlabel('Residuals (y_test - predictions)', fontsize=14)
ax1[1].set_ylabel('Relative Frequency', fontsize=14)
ax1[1].set_title('Residual histogram for TEST DATA (X_test, y_test)', fontsize=16)
ax1[1].set_xlim(-20, 30)
# putting some text on the plot!
ax1[1].text(x=5, y=6, bbox=dict(facecolor='red', alpha=0.3), fontsize=12,
            s="We don't expect the mean and sum of residuals to be zero in this plot.\nThis is unseen/test data for the model!")

fig1.tight_layout(pad=2.0)
We have trained our model lm, and the residual plot looks good.
R-squared and accuracy of the fit
Let's see what the accuracy of our model is. We can call the score function on our trained model for this purpose.
# Calling score on the trained model "lm"
print("R^2 (train) - The accuracy score of our model on the train part is: ", lm.score(X=X_train, y=y_train))
print("R^2 (test) - The accuracy score of our model on the test part is: ", lm.score(X=X_test, y=y_test))
Adjusted R-squared is generally preferred for multivariate (multiple) linear regression, where we have more than one predictor variable. It penalizes R-squared for the number of predictors: adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), where n is the number of observations and p is the number of predictors.
# Adjusted R^2 (see explanation "R^2 vs R^2_adj" at the end)
r_2_train = lm.score(X=X_train, y=y_train)
r_2_test = lm.score(X=X_test, y=y_test)
p = len(X_train.columns)  # n_features, same for train & test
n_train = len(X_train)    # n_data_points, train
n_test = len(X_test)      # n_data_points, test
print("R^2 (adjusted) for (X_train, y_train) = ", 1 - ((1 - r_2_train) * (n_train - 1)) / (n_train - p - 1))
print("R^2 (adjusted) for (X_test, y_test) = ", 1 - ((1 - r_2_test) * (n_test - 1)) / (n_test - p - 1))
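The same formula can be wrapped in a small helper function; adjusted_r2 below is just a convenience sketch of ours, not a scikit-learn function:

# Helper sketch: adjusted R^2 from R^2, n samples and p predictors
def adjusted_r2(r_squared, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print("R^2 (adjusted, train):", adjusted_r2(r_2_train, n_train, p))
print("R^2 (adjusted, test): ", adjusted_r2(r_2_test, n_test, p))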
Alternatively, we can use the r2_score function from sklearn.metrics.
# Using r2_score -- same result as above, but from sklearn.metrics
# We need to import metrics to access r2_score
from sklearn import metrics

print("The accuracy score of our model on the test data is: ")
print(metrics.r2_score(y_true=y_test, y_pred=predictions))
# The (y_true, y_pred) order is a must if we are not passing them by name
R-squared, the coefficient of determination, is a regression score function. It is a statistical measure representing the proportion of the variance in the dependent variable that is explained by the independent variable or variables in a regression model. The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse).
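In formula terms, R-squared = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of the observed values. As a quick sanity check (reusing y_test and predictions from above), the manual computation should match r2_score:

# Manual R^2: should match metrics.r2_score(y_test, predictions)
ss_res = np.sum((y_test - predictions) ** 2)        # residual sum of squares
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)    # total sum of squares
print("R^2 (manual):", 1 - ss_res / ss_tot)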
Note: A constant model that always predicts the expected value of y, disregarding the input features, would get an R-squared score of 0.0. The lesson "R-square and Goodness of the Fit" could be a great read to get more details on R^2 vs. R^2_adj.
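To see why a constant model scores around zero, we can score a baseline that always predicts the mean of the training target. This is only a sketch: predicting the mean of y_test itself would give exactly 0.0, while using the training mean typically gives a value close to zero (and possibly slightly negative).

# Baseline sketch: a "model" that always predicts the training-set mean
baseline_pred = np.full(shape=len(y_test), fill_value=np.mean(y_train))
print("R^2 of constant-mean baseline:", metrics.r2_score(y_true=y_test, y_pred=baseline_pred))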