What is Double Descent?

Overview

In the classical view of machine learning, performance on unseen data improves only up to a point: if we keep training until the model overfits the training data, test performance degrades. When training a model, we therefore balance overfitting against underfitting. We call this balance the bias-variance trade-off.

In the case of bigger models, however, the test performance starts to improve again after many iterations. We call this phenomenon double descent. It shows up most prominently in models with a very large number of parameters, typically larger than the number of training examples.

When we train a model, we can't be sure how it will perform on real-world data, but we can measure its performance on a known dataset. The average loss measured on such a dataset is called the empirical risk: computed on the training set it is the training error, and computed on a held-out test set it estimates how well the model generalizes.
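Written out, the empirical risk is simply the average loss over the n examples in the dataset; a minimal sketch of the standard definition (the loss ℓ is whatever the task uses, e.g., squared error or cross-entropy):

```latex
% Empirical risk of a model f on a dataset of n labeled examples (x_i, y_i),
% for a task-specific loss function \ell.
\[
  \widehat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i),\, y_i\bigr)
\]
```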

Now we can formally define double descent as:

For a model with high capacity (a large number of trainable parameters), the risk measured on the test data first increases as the model begins to overfit, but then decreases again as the capacity (or the amount of training) keeps growing.

Classical approach

In theory, we stop model training when it starts to perform poorly on the test data. In other words, we stop training when the risk measured on the test set increases. This increase implies that the model has started to overfit, that is, it is beginning to interpolate (fit exactly) the training data and can no longer generalize properly.

Figure: The classical approach

As shown in the figure above, the training error approaches zero, but after a specific point the test error deteriorates. In theory, that point is a good place to stop training, which means we stop before the training error has reached its minimum.
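To make this classical stopping rule concrete, here is a minimal, hypothetical sketch (plain NumPy gradient descent on synthetic data; the patience value and tolerance are arbitrary illustrative choices) that keeps training only while the held-out loss improves:

```python
# Illustrative early stopping: train while the held-out loss improves,
# stop once it has failed to improve for a few consecutive steps.
import numpy as np

rng = np.random.default_rng(0)

# Noisy linear data; few training examples relative to features, so the
# model can overfit if trained for too long.
X = rng.standard_normal((200, 50))
true_w = rng.standard_normal(50)
y = X @ true_w + 2.0 * rng.standard_normal(200)
X_train, y_train = X[:30], y[:30]
X_val, y_val = X[30:], y[30:]        # held-out data drives the stopping rule

w = np.zeros(50)
lr = 0.01
best_val, best_w = np.inf, w.copy()
patience, bad_steps = 5, 0

for step in range(5_000):
    grad = 2.0 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= lr * grad
    val_loss = np.mean((X_val @ w - y_val) ** 2)
    if val_loss < best_val - 1e-6:   # held-out loss still improving
        best_val, best_w, bad_steps = val_loss, w.copy(), 0
    else:                            # no improvement: count it against patience
        bad_steps += 1
        if bad_steps >= patience:
            break

print(f"stopped after {step + 1} steps; best held-out MSE = {best_val:.3f}")
```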

However, in practice, big models usually attain close-to-zero error on training data and still generalize well on the test dataset.

Modern approach

Even though modern models reach a close-to-zero training error, they still give accurate predictions on unseen data. These models have a number of learnable parameters that is even larger than the total number of training examples. The parameter count also needs to grow when multiple outputs (multiclass classification) are required. For example, the ImageNet dataset has roughly 10^6 training examples and roughly 10^3 classes, so a model may require around 10^9 parameters.
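As a rough worked version of that estimate (a back-of-the-envelope heuristic that scales the parameter count with the number of training examples times the number of classes):

```latex
% Order-of-magnitude estimate for the ImageNet example above.
\[
  \underbrace{10^{6}}_{\text{training examples}} \times \underbrace{10^{3}}_{\text{classes}} \approx 10^{9} \ \text{parameters}
\]
```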

We can observe the double descent phenomenon through the double descent curve, another performance-monitoring curve, which plots the training and testing loss against the number of iterations (or against the model capacity).

Figure: The modern approach

The point where the model again starts to perform well on the testing data is called the interpolation threshold. The region to the left of the interpolation threshold is called the under-parameterized region, and the region to the right is called the over-parameterized region. In the over-parameterized region, increasing the number of iterations (or the model size) keeps decreasing the testing loss.
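To see this curve emerge from a concrete model, here is a minimal, self-contained sketch (an illustrative choice on our part: plain NumPy, minimum-norm least squares on random cosine features, with the feature count standing in for model capacity). The interpolation threshold sits near the point where the number of features matches the 40 training examples:

```python
# Sketch of a model-wise double descent curve: sweep the number of random
# features (model capacity) and fit minimum-norm least squares each time.
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic regression task: a smooth target plus label noise.
n_train, n_test = 40, 1000
x_train = rng.uniform(-1, 1, (n_train, 1))
x_test = rng.uniform(-1, 1, (n_test, 1))
y_train = np.sin(4 * x_train).ravel() + 0.3 * rng.standard_normal(n_train)
y_test = np.sin(4 * x_test).ravel()

def features(x, n_feat, seed=1):
    """Map inputs to n_feat random cosine features (fixed by the seed)."""
    frng = np.random.default_rng(seed)
    w = frng.standard_normal((1, n_feat)) * 4.0
    b = frng.uniform(0, 2 * np.pi, n_feat)
    return np.cos(x @ w + b)

print(f"{'features':>9} {'train MSE':>10} {'test MSE':>10}")
for n_feat in [2, 5, 10, 20, 35, 40, 45, 60, 100, 300, 1000]:
    phi_train = features(x_train, n_feat)
    phi_test = features(x_test, n_feat)
    # lstsq returns the minimum-norm solution when there are more features
    # than examples, which is what produces the second descent.
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    train_mse = np.mean((phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((phi_test @ coef - y_test) ** 2)
    print(f"{n_feat:>9} {train_mse:>10.4f} {test_mse:>10.4f}")
```

The training-error column drops to (numerically) zero once the feature count reaches the number of training examples, while the test error typically spikes near that point and then falls again as the model becomes more over-parameterized.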
