Regularize the Model

Learn about two regularization techniques: L1 and L2.

L1 and L2 regularization

L1 and L2 are two of the most common methods to regularize a decision boundary. Regularization is the technical name for the operation that we informally called smoothing out. L1 and L2 work similarly, and they have mostly similar effects. Once we get into advanced ML territory, we may want to look deeper into their relative merits. For our purposes in this course, we follow a simple rule: either pick randomly between L1 and L2, or try both and see which one works better.

Let’s see how L1 and L2 work.

How L1 and L2 work

L1 and L2 rely on the same idea. They add a regularization term to the neural network’s loss. For example, here’s the loss augmented by L1 regularization:

$$L_{\text{regularized}} = L_{\text{non-regularized}} + \lambda \sum{|w|}$$

In the case of our neural network, the non-regularized loss is the cross-entropy loss. To that original loss, L1 adds the sum of the absolute values of all the weights in the network, multiplied by a constant called lambda (or $\lambda$ in symbols).

Lambda is a new hyperparameter that we can use to tune the amount of regularization in the network. The higher the value of lambda, the higher the impact of the regularization term. If lambda is 0, the entire regularization term becomes 0, and we fall back to a non-regularized neural network.
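As a rough illustration of the formula and of the role of lambda, here is a minimal sketch in plain Python. The function name `l1_regularized_loss`, the weight matrices `w1` and `w2`, and the cross-entropy value are all hypothetical, made up for this example rather than taken from the course's network.

```python
def l1_regularized_loss(base_loss, weight_matrices, lambda_):
    """Add the L1 term (lambda times the sum of |w|) to a non-regularized loss."""
    # Sum the absolute values of every weight in every matrix.
    penalty = sum(abs(w)
                  for matrix in weight_matrices
                  for row in matrix
                  for w in row)
    return base_loss + lambda_ * penalty


# Hypothetical weights for a tiny two-layer network (illustration only).
w1 = [[0.5, -1.0], [2.0, 0.0]]
w2 = [[-0.25], [0.75]]
cross_entropy = 0.4  # assumed value of the non-regularized loss

# A higher lambda gives the regularization term more impact on the loss.
for lambda_ in (0.0, 0.01, 0.1):
    print(lambda_, l1_regularized_loss(cross_entropy, [w1, w2], lambda_))
```

With `lambda_` set to 0, the function returns the cross-entropy loss unchanged, which matches the fallback to a non-regularized network described above.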

To understand what the regularization term does to the network, remember that the entire point of training is to minimize the loss. Now that we have added that term, the sum of the absolute values of the weights has become part of the loss. That means that the gradient descent algorithm will automatically try to keep the weights small so that the loss can also stay small.
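To see why the weights shrink, note that the gradient of the term $\lambda \sum{|w|}$ with respect to a single weight is $\lambda$ times the sign of that weight (taken as 0 when the weight is exactly 0, a common convention). The sketch below applies a few gradient descent steps under the simplifying assumption that the cross-entropy gradient is zero, so that only the effect of the L1 term is visible. All names and values here are hypothetical.

```python
def sign(x):
    """Return 1 for positive x, -1 for negative x, and 0 for zero."""
    return (x > 0) - (x < 0)


def gradient_descent_step(weights, loss_gradient, lambda_, lr):
    """One gradient descent step on the L1-regularized loss."""
    # The gradient of lambda * |w| with respect to w is lambda * sign(w).
    return [w - lr * (g + lambda_ * sign(w))
            for w, g in zip(weights, loss_gradient)]


weights = [0.8, -0.5, 0.0]
# Assumption for illustration: the non-regularized loss contributes no gradient.
no_loss_gradient = [0.0, 0.0, 0.0]

for _ in range(10):
    weights = gradient_descent_step(weights, no_loss_gradient,
                                    lambda_=0.1, lr=0.5)

print(weights)  # nonzero weights have moved toward 0
```

In real training, the cross-entropy gradient would be added to the L1 gradient at every step, so the weights settle where the pull toward a good fit and the pull toward small values balance out.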
