What is regularization in machine learning?

Regularization is a technique frequently employed in machine learning and artificial intelligence to ensure that trained models generalize well beyond the specific data they were trained on. By penalizing unnecessary complexity, it helps prevent overfitting so that the model performs effectively on new, unseen data. The two most commonly used types of regularization are L1 and L2 regularization, which are discussed in subsequent sections.

Visualization of under-fitting, over-fitting, and normal fitting.

When a model is overfitted, it performs exceptionally well on the data it was trained on, but its accuracy drops significantly when presented with unseen data. Regularization techniques come into play to help achieve the right balance in the model.

Process of regularization

Regularization is accomplished by either reducing the importance of certain features or removing them from the model entirely during training.

Original function

$f(x_i) = w_0 + w_1x_1 + w_2x_2^2 + w_3x_3^3 + w_4x_4^4$

Regularized function

$f(x_i) = w_0 + w_1x_1 + w_2x_2^2$

The regularized function is simplified and less prone to overfitting, as illustrated in this basic example. This demonstrates the effectiveness of regularization in reducing complexity and improving generalization.
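
To make this concrete, here is a minimal sketch, assuming synthetic data and scikit-learn (none of this appears in the original example): it fits a degree-4 model with and without an L2 penalty on the weights, so the effect of shrinking the higher-order terms can be checked on held-out data.

```python
# Illustrative sketch: a degree-4 fit with and without regularization.
# The data is synthetic; the true relationship is linear, so the extra
# polynomial terms mostly chase noise unless they are penalized.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = 0.5 * X.ravel() + rng.normal(scale=1.0, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unregularized degree-4 polynomial regression.
plain = make_pipeline(PolynomialFeatures(degree=4), LinearRegression()).fit(X_train, y_train)
# Same features, but with an L2 penalty shrinking the weights.
ridge = make_pipeline(PolynomialFeatures(degree=4), Ridge(alpha=10.0)).fit(X_train, y_train)

print("Test MSE without regularization:", mean_squared_error(y_test, plain.predict(X_test)))
print("Test MSE with regularization:   ", mean_squared_error(y_test, ridge.predict(X_test)))
```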

Regularization parameter

To update the weights while training the model, we minimize the cost function associated with the regression model. A cost function measures how well a machine learning model fits a dataset. In regularization, we add another parameter, $\lambda$, which determines how strongly to penalize the weights.

  1. When $\lambda$ is zero, the regularization term becomes zero, and we are back to the original regression loss function.

  2. When $\lambda$ is very large, we penalize the weights so heavily that they approach zero, which leads to an under-fitted model (both effects are illustrated in the sketch below).
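
A minimal NumPy sketch of this idea (the function name, data, and values of `lam` are illustrative assumptions, not from the article) shows where $\lambda$ enters the cost:

```python
# Illustrative sketch of a regularized cost: mean squared error plus a
# lambda-weighted penalty on the weights.
import numpy as np

def regularized_cost(w, X, y, lam, penalty="l2"):
    """MSE plus lam * (sum of squared weights) or lam * (sum of absolute weights)."""
    mse = np.mean((X @ w - y) ** 2)
    if penalty == "l1":
        reg = lam * np.sum(np.abs(w))
    else:
        reg = lam * np.sum(w ** 2)
    return mse + reg

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
y = X @ np.array([1.0, 0.0, 2.0])
w = np.array([0.5, -2.0, 3.0])

# lam = 0 reduces to the plain regression loss; a very large lam makes any
# non-zero weight expensive, pushing the optimum toward all-zero weights.
print(regularized_cost(w, X, y, lam=0.0))
print(regularized_cost(w, X, y, lam=100.0))
```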

Example

Example of Regularization

In the given diagram, four features are represented: swimmers, temp, stock_price, and watched_jaws. At the top, the total number of features remaining in the model is depicted, gradually decreasing from 4 to 0. As the regularization parameter $\lambda$ increases, the coefficients (weights) associated with each feature decrease and eventually become zero. This process allows the regularization technique to shrink or eliminate less relevant features, helping to simplify the model and improve its generalization capability.
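
The behavior in the diagram can be reproduced with a small, assumed example: the data below is synthetic, the feature names are simply borrowed from the figure, and scikit-learn's `alpha` plays the role of $\lambda$.

```python
# Illustrative sketch: as alpha (the regularization strength) grows, Lasso
# drives more coefficients exactly to zero, removing features from the model.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
# Only the first two features actually influence the target in this toy data.
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
features = ["swimmers", "temp", "stock_price", "watched_jaws"]

for alpha in [0.01, 0.1, 1.0, 10.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    kept = [name for name, w in zip(features, coef) if abs(w) > 1e-8]
    print(f"alpha={alpha:>5}: remaining features = {kept}")
```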

L1 regularization

L1, or Lasso, regression simplifies a model by ultimately shrinking some of its parameters to zero. It does this by adding a new term, weighted by the regularization parameter $\lambda$, to the cost function, which also makes L1 useful for feature selection. Written in a common form (assuming a mean squared error base loss over $m$ training examples and $n$ weights), the updated cost function with L1 is:
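
$J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(f(x_i) - y_i\right)^2 + \lambda \sum_{j=1}^{n} |w_j|$

where $y_i$ is the true target for example $x_i$.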

L1 enables feature selection by assigning zero weight to unimportant input features and non-zero weight to valuable ones. The result is a sparse solution in which most features have zero weight.

L2 regularization

L2, or Ridge, regression is used to lessen the impact of a feature during model training: it makes the weights small but not zero. In the same form as above, the updated cost function with L2 regularization is:
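
$J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(f(x_i) - y_i\right)^2 + \lambda \sum_{j=1}^{n} w_j^2$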

L2 works better when all the input features have a strong impact on the output, and the weights assigned to them are approximately of the same magnitude.
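
A brief, assumed comparison on synthetic data (parameter values chosen only for illustration) makes the difference visible: Ridge shrinks every weight but leaves them non-zero, while Lasso zeroes some out entirely.

```python
# Illustrative sketch contrasting L2 (Ridge) and L1 (Lasso) on the same data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, 1.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
print("Ridge weights:", np.round(ridge.coef_, 3))  # small, but none exactly zero
print("Lasso weights:", np.round(lasso.coef_, 3))  # some exactly zero
```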

Differences between L1 and L2

Both L1 and L2 regularization are used when training ML models; the choice depends on the use case. L1 is more helpful when dealing with high-dimensional data, whereas L2 is more useful when we want every feature to contribute to the output, each with a varying degree of importance. They can also be used in combination, which gives another type of regularization called Elastic Net regularization.
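
A minimal sketch of Elastic Net, again on assumed synthetic data, uses scikit-learn's `ElasticNet`, where `l1_ratio` controls the mix between the L1 and L2 penalties:

```python
# Illustrative Elastic Net sketch: alpha sets the overall penalty strength and
# l1_ratio blends the L1 and L2 terms (1.0 = pure Lasso, 0.0 = pure Ridge).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, 1.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print("Elastic Net weights:", np.round(enet.coef_, 3))
```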

The differences between L1 and L2 are described below in the table.

| L1 or Lasso Regression | L2 or Ridge Regression |
| --- | --- |
| Penalizes the sum of the absolute values of the weights | Penalizes the sum of the squared weights |
| Sparse solution | Non-sparse solution |
| Robust to outliers | Not robust to outliers |
| Cannot learn complex patterns | Can learn complex patterns |
| Built-in feature selection | No feature selection |
| Reduces noise | Unable to reduce noise |

Conclusion

Regularization is used to enhance the predictive power of a model by preventing over-fitting. It adds a penalty term, scaled by $\lambda$, to the cost function, which reduces the magnitude of the weights associated with the features. As a result, regularization helps find the right balance between model complexity and accuracy, ultimately creating robust and reliable models.
