What is Double Descent?

Overview

In the classical view of machine learning, performance on unseen data improves only up to a point: if we keep training until the model overfits the training data, test performance degrades. When training a model, we therefore balance overfitting against underfitting. We call this balance the bias-variance trade-off.

In the case of bigger models, however, the test performance starts to improve again after many iterations. We call this phenomenon double descent. It shows up most prominently in models with a very large number of parameters, typically larger than the number of training examples.

When we train a model, we can't be sure how it will perform on real-world data, but we can measure its performance on a known dataset. The average loss measured on such a dataset is called the empirical risk: computed on the training set it is the training error, and computed on a held-out test set it estimates how well the model generalizes.
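Written out, the empirical risk is simply the average loss over the n examples in the dataset; a minimal sketch of the standard definition (the loss ℓ is whatever the task uses, e.g., squared error or cross-entropy):

```latex
% Empirical risk of a model f on a dataset of n labeled examples (x_i, y_i),
% for a task-specific loss function \ell.
\[
  \widehat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i),\, y_i\bigr)
\]
```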

Now we can formally define double descent as:

For a model with high capacity (a large number of trainable parameters), the risk measured on the test data first increases as the model begins to overfit, but then decreases again as the capacity (or the amount of training) keeps growing.

Classical approach

In theory, we stop model training when it starts to perform poorly on the test data. In other words, we stop training when the risk measured on the test set increases. This increase implies that the model has started to overfit, that is, it is beginning to interpolate (fit exactly) the training data and can no longer generalize properly.

Figure: The classical approach

As shown in the figure above, the training error approaches zero, but after a specific point the test error deteriorates. In theory, that point is a good place to stop training, which means we stop before the training error has reached its minimum.
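To make this classical stopping rule concrete, here is a minimal, hypothetical sketch (plain NumPy gradient descent on synthetic data; the patience value and tolerance are arbitrary illustrative choices) that keeps training only while the held-out loss improves:

```python
# Illustrative early stopping: train while the held-out loss improves,
# stop once it has failed to improve for a few consecutive steps.
import numpy as np

rng = np.random.default_rng(0)

# Noisy linear data; few training examples relative to features, so the
# model can overfit if trained for too long.
X = rng.standard_normal((200, 50))
true_w = rng.standard_normal(50)
y = X @ true_w + 2.0 * rng.standard_normal(200)
X_train, y_train = X[:30], y[:30]
X_val, y_val = X[30:], y[30:]        # held-out data drives the stopping rule

w = np.zeros(50)
lr = 0.01
best_val, best_w = np.inf, w.copy()
patience, bad_steps = 5, 0

for step in range(5_000):
    grad = 2.0 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= lr * grad
    val_loss = np.mean((X_val @ w - y_val) ** 2)
    if val_loss < best_val - 1e-6:   # held-out loss still improving
        best_val, best_w, bad_steps = val_loss, w.copy(), 0
    else:                            # no improvement: count it against patience
        bad_steps += 1
        if bad_steps >= patience:
            break

print(f"stopped after {step + 1} steps; best held-out MSE = {best_val:.3f}")
```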

However, in practice, big models usually attain close-to-zero error on training data and still generalize well on the test dataset.

Modern approach

Even though modern models reach a close-to-zero training error, they still give accurate predictions on unseen data. These models have a number of learnable parameters that is even larger than the total number of training examples. The parameter count also needs to grow when multiple outputs (multiclass classification) are required. For example, the ImageNet dataset has roughly 10^6 training examples and roughly 10^3 classes, so a model may require around 10^9 parameters.
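As a rough worked version of that estimate (a back-of-the-envelope heuristic that scales the parameter count with the number of training examples times the number of classes):

```latex
% Order-of-magnitude estimate for the ImageNet example above.
\[
  \underbrace{10^{6}}_{\text{training examples}} \times \underbrace{10^{3}}_{\text{classes}} \approx 10^{9} \ \text{parameters}
\]
```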

We can observe the double descent phenomenon through the double descent curve, another performance-monitoring curve, which plots the training and testing loss against the number of iterations (or against the model capacity).

Figure: The modern approach

The point where the model again starts to perform well on the testing data is called the interpolation threshold. The region to the left of the interpolation threshold is called the under-parameterized region, and the region to the right is called the over-parameterized region. In the over-parameterized region, increasing the number of iterations (or the model size) keeps decreasing the testing loss.
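To see this curve emerge from a concrete model, here is a minimal, self-contained sketch (an illustrative choice on our part: plain NumPy, minimum-norm least squares on random cosine features, with the feature count standing in for model capacity). The interpolation threshold sits near the point where the number of features matches the 40 training examples:

```python
# Sketch of a model-wise double descent curve: sweep the number of random
# features (model capacity) and fit minimum-norm least squares each time.
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic regression task: a smooth target plus label noise.
n_train, n_test = 40, 1000
x_train = rng.uniform(-1, 1, (n_train, 1))
x_test = rng.uniform(-1, 1, (n_test, 1))
y_train = np.sin(4 * x_train).ravel() + 0.3 * rng.standard_normal(n_train)
y_test = np.sin(4 * x_test).ravel()

def features(x, n_feat, seed=1):
    """Map inputs to n_feat random cosine features (fixed by the seed)."""
    frng = np.random.default_rng(seed)
    w = frng.standard_normal((1, n_feat)) * 4.0
    b = frng.uniform(0, 2 * np.pi, n_feat)
    return np.cos(x @ w + b)

print(f"{'features':>9} {'train MSE':>10} {'test MSE':>10}")
for n_feat in [2, 5, 10, 20, 35, 40, 45, 60, 100, 300, 1000]:
    phi_train = features(x_train, n_feat)
    phi_test = features(x_test, n_feat)
    # lstsq returns the minimum-norm solution when there are more features
    # than examples, which is what produces the second descent.
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    train_mse = np.mean((phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((phi_test @ coef - y_test) ** 2)
    print(f"{n_feat:>9} {train_mse:>10.4f} {test_mse:>10.4f}")
```

The training-error column drops to (numerically) zero once the feature count reaches the number of training examples, while the test error typically spikes near that point and then falls again as the model becomes more over-parameterized.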
