Overfitting and Underfitting

Learn about overfitting, underfitting, and how to avoid them.

What is overfitting?

If the goal is only to adapt to the training data, we can choose a model with many parameters (a large parameter space) and fit the training data almost exactly. A model with many parameters is more flexible in adapting to complex patterns. For example, a degree-10 polynomial is more flexible than a degree-2 polynomial. When the training data contains accidental irregularities, often termed noise, a more flexible model chases the noise to fit the data exactly and often misses the underlying pattern in the data.

Overfitting is a modeling error where the model aligns too closely with the training data and fails to generalize well to unseen data. This means that the model performs exceptionally well on the training data but poorly on other data. In real life, we can relate this to rote learning. If a student memorizes the solution to a specific problem, they'll undoubtedly perform well on that problem. However, if the problem is changed, the student won't perform well.

In machine learning, we try to limit overfitting as much as we can.

Note: We want our model to perform well on the training data but also avoid overfitting.

Try it yourself

Let’s try to understand overfitting and model flexibility by implementing a toy example.

Import packages

First, we import the necessary packages: numpy, which helps with array-wise computations; matplotlib, which provides support for visualizing graphs; and finally sklearn, which handles the machine learning.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression as LR
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

In this example, we’ll approximate x sin(x) with a polynomial using regression. For this purpose, we’ll use LinearRegression and PolynomialFeatures from the well-known machine learning package sklearn. Finally, make_pipeline will help compose different transformations into a single estimator, and train_test_split will help split the data into training and testing sets.
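To see how these pieces fit together, here is a minimal sketch of such a pipeline on hypothetical noisy samples of x sin(x). The data, the degree, and the split ratio are illustrative choices, not the lesson's exact setup:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Hypothetical toy data: noisy samples of x * sin(x)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = x * np.sin(x) + rng.normal(0, 0.5, 50)
X = x.reshape(-1, 1)  # sklearn expects a 2D feature matrix

# Hold out part of the data to check generalization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Compose polynomial feature expansion and linear regression
# into a single estimator
model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
model.fit(X_train, y_train)

# R^2 on training vs. held-out data; a large gap hints at overfitting
print("train R^2:", model.score(X_train, y_train))
print("test  R^2:", model.score(X_test, y_test))
```

Comparing the two scores is the point of the held-out split: a flexible model can score near-perfectly on the data it was fit to while doing noticeably worse on data it has never seen.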

Function to be approximated

...