The Nesterov Momentum

Learn how the Nesterov momentum can be used to escape local optima in non-convex optimization.

Need for momentum

As shown in the figure below, imagine a ball falling down a valley. If its momentum (mass × velocity) is large enough at the bottom of the valley, there is a chance that the ball can escape it and find a deeper valley. However, if the momentum is small, the ball will just move to and fro within the same valley until it settles down at its bottom.

Figure: a ball with small momentum settles in the first valley, while a ball with large momentum escapes it.

When applied to non-convex optimization, we cannot guarantee the convergence of gradient descent to the globally optimal solution. It often gets stuck at a local optimum because the gradient vanishes at that point, so further updates make no progress.

Similar to the “ball falling down a valley” situation above, we also need a sense of momentum in non-convex optimization to escape a local optimum. The Nesterov momentum is a popular technique that mimics this behavior by maintaining a velocity vector that is an exponential moving average of negative gradients; unlike classical momentum, it evaluates the gradient at a lookahead point along the current velocity.

How does the Nesterov momentum work?

At every step, the optimizer performs an update in the direction of the velocity vector. In simple terms, the velocity vector is an average direction of recent descent, so it can keep driving updates even when the current gradient is zero.
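The idea above can be sketched in a few lines of code. This is a minimal, hypothetical implementation for a scalar parameter (the function name `nesterov_step` and the toy objective are illustrative, not from the original text): the gradient is evaluated at a lookahead point, the velocity is updated as a decaying average of negative gradients, and the parameter moves in the direction of the velocity.

```python
def nesterov_step(w, v, grad_fn, lr=0.1, mu=0.9):
    """One Nesterov momentum update for a scalar parameter.

    w: current parameter, v: current velocity,
    grad_fn: gradient of the objective, lr: learning rate,
    mu: momentum coefficient (decay of the moving average).
    """
    g = grad_fn(w + mu * v)   # gradient at the lookahead point w + mu * v
    v = mu * v - lr * g       # velocity: decaying average of negative gradients
    return w + v, v           # step in the direction of the velocity

# Toy example (assumed for illustration): minimize f(w) = (w - 3)^2,
# whose gradient is 2 * (w - 3); the minimizer is w = 3.
grad = lambda w: 2.0 * (w - 3.0)
w, v = 0.0, 0.0
for _ in range(100):
    w, v = nesterov_step(w, v, grad, lr=0.1, mu=0.9)
```

On this convex toy problem the iterate approaches w = 3; the same mechanism lets the iterate coast through flat or locally optimal regions of a non-convex objective, because the velocity stays nonzero even where the instantaneous gradient vanishes.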

The Nesterov momentum update at a time t ...