Evaluating Regression Models
This lesson focuses on ways to evaluate the performance of regression models.
In the previous lessons, we learned how to make and fit linear regression models in Python. But we did not discuss ways to judge the performance of the models. In this lesson, we will focus on techniques used to evaluate the performance of linear regression models.
We will be using the same model that we used in the last lesson where we tried to predict house prices using the USA Housing Dataset.
Losses
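We can evaluate the model's performance by looking at different loss metrics. We have already looked at the mean squared error (MSE). Here we also compute the mean absolute error (MAE) and the median absolute error.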
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error
import numpy as np

df = pd.read_csv('USA_Housing.csv')

# Split dataframe into Xs and Y
X = df.drop(columns=['Address', 'Price'])
Y = df[['Price']]

# Linear Regression model fitting
lr = LinearRegression()
lr.fit(X, Y)

# Loss and predictions
predictions = lr.predict(X)

df['Predictions'] = predictions
print(df[['Price', 'Predictions']].head())

mse_loss = mean_squared_error(y_true=Y, y_pred=predictions)
mae_loss = mean_absolute_error(y_true=Y, y_pred=predictions)
print('MSE loss = ', mse_loss)
print('MAE loss = ', mae_loss)

median_abs_loss = median_absolute_error(y_true=Y, y_pred=predictions)
print('Median abs loss = ', median_abs_loss)
In lines 3 and 4, we import the LinearRegression class and the metric functions mean_squared_error, mean_absolute_error, and median_absolute_error. We read the data into a dataframe in line 7. In line 10, we drop Address, because we will not be using any non-numeric variables for prediction, and Price, because it is the value we want to predict. The result is a new dataframe X, which has all the variables that we can use in prediction. In line 11, we separate the actual values of Price into a dataframe called Y.
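The split into X and Y can be tried on a tiny frame first. This is only a sketch: the Rooms column and all the values are made up for illustration and are not taken from the USA Housing Dataset. It also shows why Y is a dataframe: double brackets return a one-column DataFrame, while single brackets return a Series.

```python
import pandas as pd

# Tiny made-up frame (hypothetical 'Rooms' column, illustrative values)
toy = pd.DataFrame({
    'Rooms': [5, 6, 7],
    'Address': ['A St', 'B St', 'C St'],
    'Price': [100.0, 150.0, 200.0],
})

X_toy = toy.drop(columns=['Address', 'Price'])  # features only
Y_toy = toy[['Price']]    # double brackets: one-column DataFrame
y_series = toy['Price']   # single brackets: Series

print(list(X_toy.columns))
print(type(Y_toy).__name__, Y_toy.shape)
print(type(y_series).__name__, y_series.shape)
```

Note that drop returns a new dataframe and leaves the original untouched, so toy still contains all three columns afterwards.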
In line 14, we create an instance of the LinearRegression class and call the object lr. We then use the fit function to fit our model in the next line. The fit function finds the best-fitting linear model in the least-squares sense and stores the model parameters internally.
Now we get predictions using our fitted model in line 18 using the predict function. Then we add a column Predictions in the dataframe in line 20. The next line will show us the actual values and predicted values of the top 5 rows side by side.
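To make the fit and predict steps more concrete, here is a minimal sketch on made-up data that follows y = 2x + 1 exactly. The toy arrays and variable names are assumptions for illustration. After fitting, scikit-learn exposes the stored parameters as the coef_ and intercept_ attributes, and predict applies them to the inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2*x + 1 exactly (illustration only)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

toy_lr = LinearRegression()
toy_lr.fit(X_toy, y_toy)

# The fitted parameters are stored on the model object after fit()
slope = toy_lr.coef_[0]
intercept = toy_lr.intercept_
print('slope =', slope)
print('intercept =', intercept)

# predict() applies the stored parameters: X @ coef_ + intercept_
manual = X_toy @ toy_lr.coef_ + toy_lr.intercept_
print(np.allclose(manual, toy_lr.predict(X_toy)))
```

Because the toy data lies exactly on a line, the recovered slope and intercept should be very close to 2 and 1.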
In line 23, we take the mean squared error using the mean_squared_error function and save it as mse_loss. In the next line, we take the mean absolute error using the mean_absolute_error function and save it as mae_loss. Both functions expect the same arguments: actual values (y_true) and predicted values (y_pred). We print these losses in the next two lines.
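As a rough check of what these two functions compute, the sketch below calculates both losses by hand with NumPy and compares them with the scikit-learn results. The four actual and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Made-up actual and predicted values (illustration only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_true - y_pred

mse_manual = np.mean(errors ** 2)      # mean of the squared errors
mae_manual = np.mean(np.abs(errors))   # mean of the absolute errors

mse_sklearn = mean_squared_error(y_true=y_true, y_pred=y_pred)
mae_sklearn = mean_absolute_error(y_true=y_true, y_pred=y_pred)

print('MSE:', mse_manual, mse_sklearn)
print('MAE:', mae_manual, mae_sklearn)
```

The MSE is in squared units of the target, which is why it tends to look much larger than the MAE for the same predictions.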
Interpreting losses
Losses are a good indication of the performance of the model. Looking at the losses printed by the code above, we can see that the model does not perform especially well. The mean absolute error is the more intuitive of the two because it is expressed in the same units as Price, and its value does not suggest very good performance on average.
However, mean-based loss metrics such as MSE and MAE can be strongly affected by outliers. For instance, a few outliers in the data might push the mean loss value up. Therefore, we also calculate the median absolute error in line 28. The median absolute error is noticeably lower than the mean absolute error. This shows that we cannot always rely on a single loss value to evaluate performance, so we might need some other ways to look at the model's performance.
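The effect of an outlier on the mean and on the median can be seen in a small sketch. The absolute errors below are made up for illustration: four are small and one is very large.

```python
import numpy as np

# Made-up absolute errors: most are small, one is a large outlier
abs_errors = np.array([10.0, 12.0, 9.0, 11.0, 500.0])

mean_abs = np.mean(abs_errors)      # pulled up by the single outlier
median_abs = np.median(abs_errors)  # stays near the typical error

print('Mean absolute error   =', mean_abs)
print('Median absolute error =', median_abs)
```

Here the single large error pulls the mean far above the typical error, while the median stays at 11.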
Plotting absolute error percentages
To check how well our model performed, let's plot the absolute percentage error of each prediction:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

df = pd.read_csv('USA_Housing.csv')

# Split dataframe into Xs and Y
X = df.drop(columns=['Address', 'Price'])
Y = df[['Price']]

# Linear Regression model fitting
lr = LinearRegression()
lr.fit(X, Y)

# Loss and predictions
predictions = lr.predict(X)

# Plot error %
errors = np.abs((Y - predictions) / Y) * 100
plt.plot(range(errors.shape[0]), errors)
After performing the regression, we compute absolute percentage errors in line 21. We use the following formula:
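Read off the expression on line 21 of the code, the computation can be written as below. The symbols $y_i$ (actual price of row $i$) and $\hat{y}_i$ (predicted price of row $i$) are notation chosen for this sketch rather than taken from the lesson.

```latex
\text{error}_i = \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100
```

For example, an actual price of 200 and a prediction of 180 give $|200 - 180| / 200 \times 100 = 10$, that is, an absolute percentage error of 10%.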
...