What is regularization in machine learning?

Regularization is a technique frequently employed in machine learning and artificial intelligence to ensure that trained models generalize well beyond the specific data they were trained on. By penalizing unnecessary complexity, it helps prevent overfitting so that the model performs effectively on new, unseen data. The two most commonly used types of regularization are L1 and L2 regularization, which are discussed in subsequent sections.

Visualization of under-fitting, over-fitting, and normal fitting.

When a model is overfitted, it performs exceptionally well on the data it was trained on, but its accuracy drops significantly when presented with unseen data. Regularization techniques come into play to help achieve the right balance in the model.

Process of regularization

Regularization is accomplished by either reducing the importance of certain features or removing them from the model entirely during training.

Original function

$f(x_i) = w_0 + w_1x_1 + w_2x_2^2 + w_3x_3^3 + w_4x_4^4$

Regularized function

$f(x_i) = w_0 + w_1x_1 + w_2x_2^2$

The regularized function is simplified and less prone to overfitting, as illustrated in this basic example. This demonstrates the effectiveness of regularization in reducing complexity and improving generalization.
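
To make this concrete, here is a minimal sketch, assuming synthetic data and scikit-learn (none of this appears in the original example): it fits a degree-4 model with and without an L2 penalty on the weights, so the effect of shrinking the higher-order terms can be checked on held-out data.

```python
# Illustrative sketch: a degree-4 fit with and without regularization.
# The data is synthetic; the true relationship is linear, so the extra
# polynomial terms mostly chase noise unless they are penalized.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = 0.5 * X.ravel() + rng.normal(scale=1.0, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unregularized degree-4 polynomial regression.
plain = make_pipeline(PolynomialFeatures(degree=4), LinearRegression()).fit(X_train, y_train)
# Same features, but with an L2 penalty shrinking the weights.
ridge = make_pipeline(PolynomialFeatures(degree=4), Ridge(alpha=10.0)).fit(X_train, y_train)

print("Test MSE without regularization:", mean_squared_error(y_test, plain.predict(X_test)))
print("Test MSE with regularization:   ", mean_squared_error(y_test, ridge.predict(X_test)))
```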

Regularization parameter

To update the weights while training the model, we minimize the cost function associated with the regression model. A cost function measures how well a machine learning model fits a dataset. In regularization, we add another parameter, $\lambda$, which determines how strongly to penalize the weights.

  1. When $\lambda$ is zero, the regularization term becomes zero, and we are back to the original regression loss function.

  2. When $\lambda$ is very large, we penalize the weights so heavily that they approach zero, which leads to an under-fitted model (both effects are illustrated in the sketch below).
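
A minimal NumPy sketch of this idea (the function name, data, and values of `lam` are illustrative assumptions, not from the article) shows where $\lambda$ enters the cost:

```python
# Illustrative sketch of a regularized cost: mean squared error plus a
# lambda-weighted penalty on the weights.
import numpy as np

def regularized_cost(w, X, y, lam, penalty="l2"):
    """MSE plus lam * (sum of squared weights) or lam * (sum of absolute weights)."""
    mse = np.mean((X @ w - y) ** 2)
    if penalty == "l1":
        reg = lam * np.sum(np.abs(w))
    else:
        reg = lam * np.sum(w ** 2)
    return mse + reg

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
y = X @ np.array([1.0, 0.0, 2.0])
w = np.array([0.5, -2.0, 3.0])

# lam = 0 reduces to the plain regression loss; a very large lam makes any
# non-zero weight expensive, pushing the optimum toward all-zero weights.
print(regularized_cost(w, X, y, lam=0.0))
print(regularized_cost(w, X, y, lam=100.0))
```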

Example

Example of Regularization

In the given diagram, four features are represented: swimmers, temp, stock_price, and watched_jaws. At the top, the total number of features remaining in the model is depicted, gradually decreasing from 4 to 0. As the regularization parameter $\lambda$ increases, the coefficients (weights) associated with each feature decrease and eventually become zero. This process allows the regularization technique to shrink or eliminate less relevant features, helping to simplify the model and improve its generalization capability.
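
The behavior in the diagram can be reproduced with a small, assumed example: the data below is synthetic, the feature names are simply borrowed from the figure, and scikit-learn's `alpha` plays the role of $\lambda$.

```python
# Illustrative sketch: as alpha (the regularization strength) grows, Lasso
# drives more coefficients exactly to zero, removing features from the model.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
# Only the first two features actually influence the target in this toy data.
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
features = ["swimmers", "temp", "stock_price", "watched_jaws"]

for alpha in [0.01, 0.1, 1.0, 10.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    kept = [name for name, w in zip(features, coef) if abs(w) > 1e-8]
    print(f"alpha={alpha:>5}: remaining features = {kept}")
```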

L1 regularization

L1, or Lasso, regression simplifies a model by ultimately shrinking some of its parameters to zero. It does this by adding a new term, weighted by the regularization parameter $\lambda$, to the cost function, which also makes L1 useful for feature selection. Written in a common form (assuming a mean squared error base loss over $m$ training examples and $n$ weights), the updated cost function with L1 is:
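
$J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(f(x_i) - y_i\right)^2 + \lambda \sum_{j=1}^{n} |w_j|$

where $y_i$ is the true target for example $x_i$.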

L1 enables feature selection by assigning zero weight to unimportant input features and non-zero weight to valuable ones. The result is a sparse solution in which most features have zero weight.

L2 regularization

L2, or Ridge, regression is used to lessen the impact of a feature during model training: it makes the weights small but not zero. In the same form as above, the updated cost function with L2 regularization is:
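
$J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(f(x_i) - y_i\right)^2 + \lambda \sum_{j=1}^{n} w_j^2$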

L2 works better when all the input features have a strong impact on the output, and the weights assigned to them are approximately of the same magnitude.
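
A brief, assumed comparison on synthetic data (parameter values chosen only for illustration) makes the difference visible: Ridge shrinks every weight but leaves them non-zero, while Lasso zeroes some out entirely.

```python
# Illustrative sketch contrasting L2 (Ridge) and L1 (Lasso) on the same data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, 1.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
print("Ridge weights:", np.round(ridge.coef_, 3))  # small, but none exactly zero
print("Lasso weights:", np.round(lasso.coef_, 3))  # some exactly zero
```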

Differences between L1 and L2

Both L1 and L2 regularization are used when training ML models; the choice depends on the use case. L1 is more helpful when dealing with high-dimensional data, whereas L2 is more useful when we want every feature to contribute to the output, each with a varying degree of importance. They can also be used in combination, which gives another type of regularization called Elastic Net regularization.
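
A minimal sketch of Elastic Net, again on assumed synthetic data, uses scikit-learn's `ElasticNet`, where `l1_ratio` controls the mix between the L1 and L2 penalties:

```python
# Illustrative Elastic Net sketch: alpha sets the overall penalty strength and
# l1_ratio blends the L1 and L2 terms (1.0 = pure Lasso, 0.0 = pure Ridge).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, 1.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print("Elastic Net weights:", np.round(enet.coef_, 3))
```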

The differences between L1 and L2 are described below in the table.

| L1 or Lasso Regression | L2 or Ridge Regression |
| --- | --- |
| Penalizes the sum of the absolute values of the weights | Penalizes the sum of the squared weights |
| Sparse solution | Non-sparse solution |
| Robust to outliers | Not robust to outliers |
| Cannot learn complex patterns | Can learn complex patterns |
| Built-in feature selection | No feature selection |
| Reduces noise | Unable to reduce noise |

Conclusion

Regularization is used to enhance the predictive power of a model by preventing over-fitting. It adds a penalty term, scaled by $\lambda$, to the cost function, which reduces the magnitude of the weights associated with the features. As a result, regularization helps find the right balance between model complexity and accuracy, ultimately creating robust and reliable models.
