The Nesterov Momentum
Learn how the Nesterov momentum can be used to escape local optima in non-convex optimization.
Need for momentum
As shown in the figure below, imagine a ball rolling down a valley. If its momentum (mass times velocity) is large enough, the ball can roll through small dips and over small bumps instead of stopping at the first one it encounters.
When applied to non-convex optimization, gradient descent is not guaranteed to converge to the global optimum. It often gets stuck at a local optimum because the gradient vanishes there, so no further updates can be made.
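As a small illustration of this behavior (the function, step size, and starting point below are my own choices, not part of the lesson), plain gradient descent started to the right of a local minimum settles there and never reaches the better minimum further left:

```python
def f(x):
    # Illustrative 1-D non-convex function with a local minimum near
    # x ≈ 0.93 and a deeper, global minimum near x ≈ -1.06.
    return x**4 - 2 * x**2 + 0.5 * x

def grad_f(x):
    return 4 * x**3 - 4 * x + 0.5

x = 2.0      # start to the right of the local minimum
lr = 0.01    # learning rate
for _ in range(1000):
    x -= lr * grad_f(x)

# x is now stuck near the local minimum (~0.93); the gradient there is
# essentially zero, so further iterations change nothing.
```

The iterate stops where the gradient vanishes, even though a lower point exists on the other side of the barrier.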
Similar to the “ball falling down a valley” situation above, we also need a sense of momentum in non-convex optimization to escape a local optimum. The Nesterov momentum is a popular technique that mimics this behavior by maintaining a velocity vector that is an exponential moving average of negative gradients.
How does the Nesterov momentum work?
At every step, Nesterov momentum performs an update in the direction of the velocity vector. In simple terms, the velocity vector is an average direction of recent gradients, so it can still drive updates even when the current gradient is zero.
The Nesterov momentum update at a time step t first looks ahead along the current velocity, evaluates the gradient at that look-ahead point, and then updates the velocity and the parameters:

v(t+1) = mu * v(t) - epsilon * grad f(theta(t) + mu * v(t))
theta(t+1) = theta(t) + v(t+1)

Here, mu in [0, 1) is the momentum coefficient that controls how quickly old gradients decay from the moving average, and epsilon is the learning rate.