Tune Learning Rate and Batch Size
Learn what happens when we tweak the learning rate and batch size while training a neural network.
Tune the learning rate
We’ll use our old hyperparameter called lr. This hyperparameter has been with us since almost the beginning of this course. Chances are, we already tuned it, maybe by trying a few random values. It’s time to be more precise about tuning lr.
To understand the trade-off of different learning rates, let’s go back to the basics and visualize gradient descent. The following diagrams show a few steps of GD along a one-dimensional loss curve, with three different values of lr. The red cross marks the starting point, and the green cross marks the minimum:
Let’s remember what lr does: the bigger it is, the larger each step of GD is. The first diagram uses a small lr, so the algorithm takes tiny steps towards the minimum. The second example uses a larger lr, which results in bolder steps and a faster descent.
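To make the trade-off concrete, here is a minimal sketch of the three scenarios in the diagrams, assuming a simple quadratic loss L(w) = w² with a starting point at w = 5. The function names and the specific lr values are illustrative, not taken from the course code:

```python
# A minimal sketch of GD on a one-dimensional loss curve, assuming
# the quadratic loss L(w) = w**2 (so its gradient is 2 * w).
# The lr values below are illustrative, not from the course code.

def loss(w):
    return w ** 2          # one-dimensional loss curve


def gradient(w):
    return 2 * w           # derivative of w**2


def gradient_descent(lr, w=5.0, steps=10):
    """Take a few GD steps from the starting point w and return the path."""
    path = [w]
    for _ in range(steps):
        w = w - lr * gradient(w)
        path.append(w)
    return path


for lr in (0.01, 0.3, 1.1):   # tiny, reasonable, and too-large learning rates
    final_w = gradient_descent(lr)[-1]
    print(f"lr={lr:<5} final w={final_w:10.3f}  final loss={loss(final_w):12.3f}")
```

Running this shows the same behavior as the diagrams: with lr = 0.01 the parameter barely moves toward the minimum at w = 0, with lr = 0.3 it gets close to the minimum in a few steps, and with lr = 1.1 it overshoots farther on every step, so the loss grows instead of shrinking.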
However, we cannot just set a very large lr and blaze towards the minimum at ludicrous speed, as the third diagram proves. In this case, lr is so large that each step of gradient descent lands farther away from the goal than it started. Not only does this training process fail to find the minimum, but it ...