Adaptive Gradient Descent

Learn how the Adaptive Gradient Algorithm (AdaGrad) adapts the step size of gradient descent based on historical updates.

The stability of gradient descent depends heavily on the step size of the algorithm. Choosing a large step size can lead to oscillations or overshooting of parameters, whereas using one that is too small can lead to slow convergence or increase the chances of getting stuck in local minima. The Adaptive Gradient Algorithm (AdaGrad) is an optimization algorithm that adapts the learning rate for each parameter based on its past gradients. With AdaGrad, we avoid the need to manually tune the learning rate during the optimization process.

What is AdaGrad?

The main idea of AdaGrad is to scale the gradient for each parameter by the inverse square root of the sum of squares of its past gradients. This means that parameters with historically large gradients receive smaller updates, while parameters with historically small gradients receive larger updates.

The update rule of AdaGrad at time step $t$ is given as follows:
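
$$
G_t = G_{t-1} + g_t \odot g_t, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t
$$

Here, $g_t$ is the gradient of the loss with respect to the parameters $\theta_t$ at step $t$, $G_t$ is the element-wise accumulation of squared past gradients, $\eta$ is the global learning rate, and $\epsilon$ is a small constant added for numerical stability (some formulations place $\epsilon$ inside the square root instead).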

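To make the update rule concrete, here is a minimal NumPy sketch of AdaGrad applied to a simple quadratic objective. The objective, learning rate, epsilon, and step count are illustrative choices for this sketch, not values prescribed by the lesson.

```python
import numpy as np

def adagrad(grad_fn, theta0, lr=0.1, eps=1e-8, n_steps=500):
    """Minimal AdaGrad sketch: per-parameter step sizes from accumulated squared gradients."""
    theta = np.asarray(theta0, dtype=float)
    G = np.zeros_like(theta)                   # running sum of squared gradients
    for _ in range(n_steps):
        g = grad_fn(theta)                     # gradient at the current parameters
        G += g ** 2                            # accumulate element-wise squared gradients
        theta -= lr * g / (np.sqrt(G) + eps)   # scale each parameter's update by 1/sqrt(G)
    return theta

# Example: minimize f(theta) = 0.5 * (theta_1^2 + 10 * theta_2^2),
# whose gradient is (theta_1, 10 * theta_2).
grad = lambda th: np.array([1.0, 10.0]) * th
print(adagrad(grad, theta0=[3.0, 3.0]))        # parameters shrink toward the minimum at [0, 0]
```

Note that because `G` only grows, the effective step size for every parameter decays over time; this is the reason AdaGrad needs no manual learning-rate schedule, but it can also make later updates very small on long training runs.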