Overfitting and Underfitting
Learn about overfitting, underfitting, and how to avoid them.
What is overfitting?
If the goal is only to adapt to the training data, we can choose a model with many parameters (a large parameter space) and fit the training data almost exactly. A model with many parameters is more flexible in adapting to complex patterns. For example, a 10-degree polynomial is more flexible than a 2-degree polynomial. When the training data has accidental irregularities, often termed noise, a more flexible model chases the noise to fit the data exactly and often misses the underlying pattern in the data. Overfitting is a modeling error where the model aligns too closely with the training data and might not generalize well to unseen data. This means that the model performs exceptionally well on the training data but is unsuitable for other data. In real life, we can relate this to rote learning. If a student memorizes the solution to a specific problem, they'll undoubtedly perform well on that problem. However, if the problem is changed, the student won't perform well.
In machine learning, we try to limit overfitting as much as we can.
Note: We want our model to perform well on the training data but also avoid overfitting.
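As a quick preview of the idea, here is a minimal sketch using numpy's polyfit. The sample data, noise level, and polynomial degrees below are arbitrary choices made purely for illustration; they are not the lesson's toy example. Typically, the more flexible degree-10 fit tracks the noisy training points much more closely than it tracks the underlying noise-free curve, which is exactly the rote-learning behavior described above.

import numpy as np

# A minimal, illustrative sketch (not the lesson's toy example):
# a flexible model can chase noise in the training samples.
rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 12))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy training samples

x_dense = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_dense)  # the underlying pattern, without noise

for degree in (2, 10):
    coeffs = np.polyfit(x, y, degree)                      # fit a polynomial of this degree
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)  # error on the training points
    true_mse = np.mean((np.polyval(coeffs, x_dense) - y_true) ** 2)  # error vs. the true pattern
    print(f"degree {degree:2d}: training MSE = {train_mse:.4f}, MSE vs. true curve = {true_mse:.4f}")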
Try it yourself
Let's try to understand overfitting and model flexibility by implementing a toy example.
Import packages
First, we import the necessary packages: numpy, which helps with array-wise computations; matplotlib, which provides support for visualizing graphs; and finally, sklearn, which handles the machine learning.
import numpy as np                                     # array-wise computations
import matplotlib.pyplot as plt                        # plotting and visualization
from sklearn.linear_model import LinearRegression as LR
from sklearn.preprocessing import PolynomialFeatures   # polynomial feature expansion
from sklearn.pipeline import make_pipeline             # compose steps into one estimator
from sklearn.model_selection import train_test_split   # train/test data splits
In this example, we'll approximate a function with a polynomial using regression. For this purpose, we'll use LinearRegression and PolynomialFeatures from the well-known machine learning package sklearn. Finally, make_pipeline will help compose several transformation steps into a single transformation, and train_test_split will help split the data into training and testing sets.
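As a rough illustration of how these pieces fit together (the data below is a placeholder, not the dataset we'll build in the toy example), make_pipeline chains the polynomial feature expansion and the regression step into a single estimator, and train_test_split holds out part of the data for evaluation:

import numpy as np
from sklearn.linear_model import LinearRegression as LR
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Placeholder data purely for illustration: 40 noisy samples of a quadratic.
rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, size=(40, 1))
y = 0.5 * X.ravel() ** 2 - X.ravel() + rng.normal(scale=0.5, size=40)

# Hold out 25% of the samples as a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# The pipeline behaves like a single estimator: fit() expands the
# polynomial features and then fits the linear regression in one step.
model = make_pipeline(PolynomialFeatures(degree=2), LR())
model.fit(X_train, y_train)

print("train R^2:", model.score(X_train, y_train))
print("test  R^2:", model.score(X_test, y_test))

Because the pipeline acts as one estimator, the training and evaluation code stays short, and we can change the polynomial degree in a single place when we explore more or less flexible models.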