A Regularization Toolbox

Discover the benefits of regularization techniques.

Combat overfitting through regularization

Just like tuning hyperparameters, reducing overfitting is more art than science. Besides L1 and L2, we can use many other regularization methods. Here is an overview of some of them:

Small network size: The most fundamental regularization technique is to make the overfitting network smaller. It is also the most efficient one. After all, overfitting happens because the system is too smart for the data it’s learning, and smaller networks are not as smart as big ones. We can try reducing the number of hidden nodes or removing a few layers. We’ll use this approach in the chapter’s closing exercise.
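
As a rough sketch of what that looks like in practice, here is a Keras comparison between a large network and a slimmed-down one. The library choice, layer sizes, and 20-variable input width are illustrative assumptions, not the chapter’s exact code.

```python
from tensorflow import keras
from tensorflow.keras import layers

# An overfitting-prone network: wide and deep for the data it sees.
big_model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(1024, activation="relu"),
    layers.Dense(1024, activation="relu"),
    layers.Dense(1),
])

# A slimmer alternative: fewer hidden nodes and one less hidden layer.
small_model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),
])
```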

Reduce input variables: Instead of simplifying the model, we can also reduce overfitting by simplifying the data. We can remove a few input variables. Let’s say we predict a boiler’s consumption from a set of 20 input variables. An overfitting network strives to fit the details of that dataset, noise included. We can drop a few variables that are less likely to impact consumption (like the day of the week) in favor of the ones that seem more relevant (like the outside temperature). The idea is that the fewer features we have, the less noise we inject into the system.
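
Here is a minimal sketch of that idea using pandas. The tiny DataFrame and its column names (outside_temp, day_of_week, and so on) are made up for illustration.

```python
import pandas as pd

# A made-up stand-in for the boiler readings.
data = pd.DataFrame({
    "outside_temp":       [3.1, 7.4, -1.0, 12.5],
    "day_of_week":        [1, 2, 3, 4],
    "hour_of_day":        [6, 12, 18, 23],
    "boiler_consumption": [41.0, 30.2, 55.7, 22.8],
})

# Drop the variable least likely to drive consumption...
X = data.drop(columns=["day_of_week", "boiler_consumption"])
# ...and keep the target on its own.
y = data["boiler_consumption"]
```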

Curtail the training: Another way to reduce overfitting is to cut the network’s training short. This idea is not as weird as it sounds. If we look at the history of the network’s loss during training, we can see the system moving from underfitting to overfitting as it learns the noise in the training data. Once overfitting kicks in, the validation loss flattens and then diverges from the training loss. If we stop training at that point, we’ll get a network that has not yet learned enough to overfit the data. This technique is called early stopping.
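
In Keras, early stopping is available as a callback that watches the validation loss. The sketch below is a self-contained example built on synthetic data and an arbitrary network; the patience of 5 epochs is an assumption, not a recommendation.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in data, just to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=1000)

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss...
    patience=5,                  # ...tolerate 5 epochs without improvement
    restore_best_weights=True,   # roll back to the best epoch seen
)

model.fit(X, y, validation_split=0.2, epochs=200, callbacks=[early_stop])
```

With restore_best_weights=True, the weights we keep come from the epoch with the lowest validation loss rather than from the last epoch trained.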

Increase learning rate: Finally, and perhaps surprisingly, we can sometimes reduce overfitting by increasing a neural network’s learning rate. To understand why, remember that the learning rate measures the size of each GD step. With a bigger learning rate, GD takes bolder, coarser steps. As a result, the trained model is likely to be less detailed, which might help reduce overfitting.
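
For example, with a Keras optimizer the learning rate is a single argument set at compile time. Both rates below are arbitrary examples, and `model` stands for any network like the ones sketched earlier.

```python
from tensorflow import keras

# A cautious learning rate that follows the loss surface in fine detail...
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.001), loss="mse")

# ...versus a bolder one that takes coarser GD steps, which can act
# as a mild regularizer.
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1), loss="mse")
```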

We went through a few regularization techniques, and we’ll learn about a couple more in the next chapter. Each of these approaches might or might not work for a specific network and dataset. Here, as in many other aspects of ML, our mileage may vary. Be ready to experiment with different approaches, either alone or in combination, and learn by experience which approaches work best in which circumstances.
