The Nesterov Momentum

Learn how the Nesterov momentum can be used to escape local optima in non-convex optimization.

Need for momentum

As shown in the figure below, imagine a ball falling down a valley. If its momentum (mass × velocity) is large enough at the bottom of the valley, there is a chance that the ball can escape it and find a deeper valley. However, if the momentum is small, the ball will just move to and fro within the same valley until it settles down at its bottom.

Figure: a ball with small momentum settles in the first valley, while a ball with large momentum escapes it.

When applied to non-convex optimization, we cannot guarantee the convergence of gradient descent to the globally optimal solution. It often gets stuck at a local optimum because the gradient vanishes at that point, so further updates make no progress.

Similar to the “ball falling down a valley” situation above, we also need a sense of momentum in non-convex optimization to escape a local optimum. The Nesterov momentum is a popular technique that mimics this behavior by maintaining a velocity vector that is an exponential moving average of negative gradients; unlike classical momentum, it evaluates the gradient at a lookahead point along the current velocity.

How does the Nesterov momentum work?

At every step, the optimizer performs an update in the direction of the velocity vector. In simple terms, the velocity vector is an average direction of recent descent, so it can keep driving updates even when the current gradient is zero.
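The idea above can be sketched in a few lines of code. This is a minimal, hypothetical implementation for a scalar parameter (the function name `nesterov_step` and the toy objective are illustrative, not from the original text): the gradient is evaluated at a lookahead point, the velocity is updated as a decaying average of negative gradients, and the parameter moves in the direction of the velocity.

```python
def nesterov_step(w, v, grad_fn, lr=0.1, mu=0.9):
    """One Nesterov momentum update for a scalar parameter.

    w: current parameter, v: current velocity,
    grad_fn: gradient of the objective, lr: learning rate,
    mu: momentum coefficient (decay of the moving average).
    """
    g = grad_fn(w + mu * v)   # gradient at the lookahead point w + mu * v
    v = mu * v - lr * g       # velocity: decaying average of negative gradients
    return w + v, v           # step in the direction of the velocity

# Toy example (assumed for illustration): minimize f(w) = (w - 3)^2,
# whose gradient is 2 * (w - 3); the minimizer is w = 3.
grad = lambda w: 2.0 * (w - 3.0)
w, v = 0.0, 0.0
for _ in range(100):
    w, v = nesterov_step(w, v, grad, lr=0.1, mu=0.9)
```

On this convex toy problem the iterate approaches w = 3; the same mechanism lets the iterate coast through flat or locally optimal regions of a non-convex objective, because the velocity stays nonzero even where the instantaneous gradient vanishes.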

The Nesterov momentum update at a time t ...