Adaptive Gradient Descent

Learn how the Adaptive Gradient Algorithm (AdaGrad) adapts the step size of gradient descent based on past gradients.

The stability of gradient descent depends heavily on the algorithm's step size. A step size that is too large can cause the parameters to oscillate or overshoot the minimum, whereas one that is too small can lead to slow convergence or increase the chances of getting stuck in a local minimum. The Adaptive Gradient Algorithm (AdaGrad) is an optimization algorithm that adapts the learning rate for each parameter based on its past gradients. With AdaGrad, we avoid the need to manually tune the learning rate during optimization.

What is AdaGrad?

The main idea of AdaGrad is to scale the update for each parameter by the inverse square root of the sum of its squared past gradients. As a result, parameters that have accumulated large gradients receive smaller updates, while parameters with small or infrequent gradients receive larger updates.

The update rule of AdaGrad at time $t$ is given as follows:
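$$
G_t = G_{t-1} + g_t \odot g_t
$$

$$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t
$$

Here, $g_t = \nabla_\theta J(\theta_t)$ is the gradient of the objective at time $t$, $G_t$ is the running element-wise sum of squared gradients (with $G_0 = 0$), the square root, division, and $\odot$ are applied element-wise, $\eta$ is the base learning rate, and $\epsilon$ is a small constant (for example, $10^{-8}$) that prevents division by zero.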

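As a concrete illustration, here is a minimal NumPy sketch of the update rule above, applied to a hypothetical two-parameter quadratic whose gradient scales differ by a factor of 100. The function and variable names (adagrad_update, accum, and so on) are illustrative choices, not part of any particular library.

```python
import numpy as np

def adagrad_update(theta, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad step: scale each parameter's update by the inverse
    square root of its accumulated squared gradients."""
    accum += grad ** 2                            # G_t = G_{t-1} + g_t * g_t (element-wise)
    theta -= lr * grad / np.sqrt(accum + eps)     # adaptive per-parameter step
    return theta, accum

# Hypothetical toy objective: f(theta) = 0.5 * (1 * theta[0]**2 + 100 * theta[1]**2),
# whose gradient components differ in scale by a factor of 100.
scales = np.array([1.0, 100.0])
theta = np.array([1.0, 1.0])
accum = np.zeros_like(theta)                      # running sum of squared gradients

for step in range(100):
    grad = scales * theta                         # gradient of the quadratic objective
    theta, accum = adagrad_update(theta, grad, accum)

print(theta)  # both coordinates shrink toward 0 despite very different gradient scales
```

Because each coordinate is divided by the square root of its own accumulated squared gradients, the steeply scaled parameter is automatically damped while the shallow one keeps making progress, without any manual per-parameter tuning of the learning rate.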