Building a Better Optimiser
Learn about the optimization procedure and how to apply it to neural networks.
The optimization procedure is used to minimize the error function (in examples like the ones we have discussed so far), effectively “learning” the parameters of the network by selecting those that yield the lowest error. Referring to our discussion of backpropagation, this problem has two components:
How to initialize the weights: Historically, many applications simply drew random weights within some range and relied on backpropagation to reach at least a local minimum of the loss function from that random starting point.
How to find a local minimum of the loss: In basic backpropagation, we used gradient descent with a fixed learning rate and first-derivative updates to traverse the potential solution space of weight matrices; however, there is good reason to believe there might be more efficient ways to find a local minimum.
In fact, both of these have turned out to be key considerations in the progress of deep learning research; the sketch below shows the baseline approach they both start from.
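As a rough illustration (not code from this lesson), the following sketch combines the two ingredients above: random weight initialization and fixed-learning-rate, first-derivative gradient descent. The toy regression data, layer size, and variable names are illustrative assumptions only.

```python
# A minimal sketch: random initialization + fixed-learning-rate gradient descent.
import numpy as np

rng = np.random.default_rng(0)

# 1. Initialize the weights randomly within a small range.
W = rng.uniform(-0.5, 0.5, size=(3, 1))   # 3 inputs -> 1 output
b = np.zeros(1)

# Toy regression data (illustrative only).
X = rng.normal(size=(100, 3))
y = X @ np.array([[1.0], [-2.0], [0.5]]) + 0.1 * rng.normal(size=(100, 1))

# 2. Traverse the weight space using a fixed learning rate and
#    first-derivative (gradient) updates.
learning_rate = 0.1
for step in range(200):
    error = X @ W + b - y                  # forward pass and residual
    loss = np.mean(error ** 2)             # mean-squared error
    grad_W = 2 * X.T @ error / len(X)      # gradient of the loss w.r.t. W
    grad_b = 2 * error.mean(axis=0)        # gradient of the loss w.r.t. b
    W -= learning_rate * grad_W            # fixed-step gradient descent
    b -= learning_rate * grad_b
```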
Gradient descent to ADAM
The original version of gradient descent proposed in 1986 for training neural networks averaged the loss over the entire dataset before taking the gradient and updating the weights. Obviously, this is quite slow and makes distributing the model difficult, since we can’t split the input data across model replicas; if we use replicas, each one needs access to the whole dataset.
In contrast, stochastic gradient descent (SGD) computes gradient updates after seeing only a single sample or a small mini-batch, making each individual update far cheaper and allowing the weights to improve long before a full pass over the dataset is complete.
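To make the contrast concrete, here is a sketch of the two update schedules on the same single-layer, mean-squared-error setup used above; the data, the `gradients` helper, and the batch size of 10 are illustrative choices rather than details from the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([[1.0], [-2.0], [0.5]])
W, b = rng.uniform(-0.5, 0.5, size=(3, 1)), np.zeros(1)
learning_rate = 0.1

def gradients(W, b, X_batch, y_batch):
    """Mean-squared-error gradients for a single linear layer."""
    error = X_batch @ W + b - y_batch
    return 2 * X_batch.T @ error / len(X_batch), 2 * error.mean(axis=0)

# Full-batch gradient descent: the loss (and gradient) is averaged over the
# entire dataset, so one pass over the data produces a single update.
grad_W, grad_b = gradients(W, b, X, y)
W -= learning_rate * grad_W
b -= learning_rate * grad_b

# Stochastic (mini-batch) gradient descent: many cheaper updates per pass,
# each computed from a small slice of the data.
batch_size = 10
for start in range(0, len(X), batch_size):
    grad_W, grad_b = gradients(W, b,
                               X[start:start + batch_size],
                               y[start:start + batch_size])
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b
```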
However, SGD can be slow to converge, leading researchers to propose alternatives that accelerate the search for a minimum. As seen in the original backpropagation algorithm, one idea is to use a form of exponentially weighted momentum that remembers prior steps and keeps moving in promising directions. Variants have been proposed, such as Nesterov momentum, which adds a term to increase this acceleration by evaluating the gradient at the point the accumulated momentum is about to carry the weights to, rather than at the current weights.
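The following sketch writes the two momentum ideas as simple update rules on a parameter vector; the `grad_fn` callback, the decay factor `beta`, and the learning rate are illustrative placeholders, not values from the lesson.

```python
import numpy as np

def momentum_step(w, velocity, grad_fn, learning_rate=0.01, beta=0.9):
    """Classical momentum: keep an exponentially weighted average of past
    gradient steps and continue moving in that accumulated direction."""
    velocity = beta * velocity - learning_rate * grad_fn(w)
    return w + velocity, velocity

def nesterov_step(w, velocity, grad_fn, learning_rate=0.01, beta=0.9):
    """Nesterov momentum: evaluate the gradient at the "look-ahead" point
    the momentum step is about to reach, then apply the correction."""
    lookahead = w + beta * velocity
    velocity = beta * velocity - learning_rate * grad_fn(lookahead)
    return w + velocity, velocity

# Example: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([5.0, -3.0])
velocity = np.zeros_like(w)
for _ in range(100):
    w, velocity = nesterov_step(w, velocity, lambda w: 2 * w)
print(w)  # close to the minimum at the origin
```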