The Motivation for Regularization

Learn how overfitting and underfitting are related to the bias-variance trade-off.

What is regularization?

Regularization is a technique used in machine learning to prevent overfitting and improve the generalization of the trained model. The main idea behind regularization is to add a penalty term to the cost function of the model that discourages the model from learning complex or redundant features that are specific to the training data and might not generalize well to new, unseen data.
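To make the idea of a penalty term concrete, here is a minimal sketch of a penalized cost for logistic regression. It assumes an L2 (sum of squared coefficients) penalty and a hypothetical strength parameter named lam; the function names and the tiny data set are illustrative only and are not taken from any library.

```python
import numpy as np

def mean_log_loss(w, X, y):
    """Average logistic (cross-entropy) loss for coefficients w, labels y in {0, 1}."""
    z = X @ w
    # log(1 + exp(z)) - y * z is the log loss written in a numerically stable form
    return np.mean(np.logaddexp(0.0, z) - y * z)

def penalized_cost(w, X, y, lam):
    """Log loss plus an L2 penalty; lam is an assumed name for the penalty strength."""
    return mean_log_loss(w, X, y) + lam * np.sum(w ** 2)

# Tiny made-up data set, for illustration only
X_demo = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]])
y_demo = np.array([1.0, 0.0, 1.0])
w_demo = np.array([1.0, -2.0])

print("unpenalized cost:", mean_log_loss(w_demo, X_demo, y_demo))
print("penalized cost:  ", penalized_cost(w_demo, X_demo, y_demo, lam=0.1))
```

Because the penalty grows with the size of the coefficients, minimizing the penalized cost favors smaller coefficients than minimizing the log loss alone.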

The bias-variance trade-off

We can extend the basic logistic regression model that we have learned about by using regularization, also called shrinkage. In fact, every logistic regression that you have fit so far in scikit-learn has used some amount of regularization. That is because it is a default option in the logistic regression model object. However, until now, we have ignored it.
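One way to see the default regularization is to inspect the parameters of a freshly created model object. The sketch below assumes scikit-learn is installed; in scikit-learn, C is the inverse of the regularization strength, so smaller values mean stronger regularization. Other regularization-related defaults may differ between scikit-learn versions.

```python
from sklearn.linear_model import LogisticRegression

# A model created with no arguments still carries a regularization setting
default_model = LogisticRegression()
default_C = default_model.get_params()["C"]
print("default C:", default_C)

# C is the inverse of regularization strength:
# a larger C weakens the penalty, a smaller C strengthens it
weaker_regularization = LogisticRegression(C=100.0)
stronger_regularization = LogisticRegression(C=0.01)
```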

As you learn about these concepts in greater depth, you will also become familiar with a few foundational concepts in machine learning: overfitting, underfitting, and the bias-variance trade-off. A model is said to overfit the training data if the performance of the model on the training data (for example, the ROC AUC) is substantially better than the performance on a held-out test set. In other words, good performance on the training set does not generalize to the unseen test set. We started to discuss these concepts in the chapter “Introduction to Scikit-Learn and Model Evaluation,” when we distinguished between model training and test scores.

When a model is overfitted to the training data, it is said to have high variance. In other words, the model has learned the variability in the training data very well; in fact, too well. This will be reflected in a high model training score. However, when such a model is used to make predictions on new and unseen data, its performance is lower. Overfitting is more likely in the following circumstances:

  • There are a large number of features in relation to the number of samples. In particular, there may be so many possible features that it is cumbersome to inspect all of them directly, as we were able to do with the case study data.

  • A complex model, that is, more complex than logistic regression, is used. These include models such as gradient boosting ensembles or neural networks.
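The first circumstance can be illustrated with a small experiment. The sketch below uses hypothetical synthetic data (not the case study data): 300 pure-noise features, 200 samples, and randomly assigned labels, so there is nothing real to learn. A logistic regression with very weak regularization (a large C) is then compared on the training and test sets using the ROC AUC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: more features than samples, labels unrelated to the features
X = rng.normal(size=(200, 300))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

# A large C means very little regularization
model = LogisticRegression(C=1000.0, max_iter=5000)
model.fit(X_train, y_train)

train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"train ROC AUC: {train_auc:.3f}, test ROC AUC: {test_auc:.3f}")
```

The training score should be far higher than the test score here, which is the gap that defines overfitting. The exact values depend on the random seed and the scikit-learn version.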

The risk of overfitting in complex models

Under these circumstances, the model has an opportunity to develop more complex hypotheses about the relationships between features and the response variable in the training data during model fitting, making overfitting more likely.

In contrast, if a model is not fitting the training data very well, this is known as underfitting, and the model is said to have high bias.

We can examine the differences between underfitting, overfitting, and the ideal that sits in between by fitting polynomial models on some hypothetical data:
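A minimal sketch of such an experiment follows. The data are an assumption made for illustration: noisy samples from a parabola, fit with polynomials of degree 1, 2, and 12 using NumPy. The degrees, noise level, and sample sizes are arbitrary choices, not values from the original lesson.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed underlying relationship: a simple parabola
    return -x**2 + 10 * x

def sample(n):
    # Draw n noisy observations of the assumed relationship
    x = rng.uniform(0, 10, size=n)
    y = true_fn(x) + rng.normal(scale=2.0, size=n)
    return x, y

x_train, y_train = sample(20)
x_test, y_test = sample(200)

def mse(model, x, y):
    # Mean squared error of a fitted polynomial on the given data
    return float(np.mean((model(x) - y) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 2, 12):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse[degree] = mse(model, x_train, y_train)
    test_mse[degree] = mse(model, x_test, y_test)
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:8.2f}, "
          f"test MSE = {test_mse[degree]:8.2f}")
```

The degree-1 fit should show high error on both sets (underfitting, high bias), while the degree-12 fit should have the lowest training error. Its test error is often noticeably worse than that of the degree-2 fit (overfitting, high variance), although the exact numbers depend on the random draw.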
