...

Popular Optimization Algorithms

Discover the most frequently used alternatives to gradient descent and the intuition behind them.

Concerns with SGD

This basic version of SGD comes with some limitations that can negatively affect training.

  1. If the loss function changes quickly in one direction and slowly in another, the gradients can oscillate strongly, making training progress very slow.

  2. If the loss function has a local minimum or a saddle point, it is very likely that SGD will get stuck there, unable to “jump out” and continue toward a better minimum. This happens because the gradient becomes zero, so there is no update to the weights whatsoever.

A saddle point is a point on the graph of a function where the slopes (derivatives) are all zero but which is not a local extremum (neither a maximum nor a minimum), e.g., the origin of $f(x, y) = x^2 - y^2$.

  3. The gradients are still noisy because we estimate them based only on a small sample of our dataset. The noisy updates might not correlate well with the true gradient of the loss function.

  4. Choosing a good learning rate is tricky and requires time-consuming experimentation with different hyperparameters.

  5. The same learning rate is applied to all of our parameters, which can become problematic for features with different frequencies or significance.

To overcome some of these problems, many improvements have been proposed over the years.

Adding momentum

One of the basic improvements over SGD comes from adding the notion of momentum. Borrowing the principle of momentum from physics, we encourage SGD to keep moving in the same direction as in previous timesteps. To accomplish this, we introduce two new variables: velocity and friction.

  • Velocity $v$ is computed as a running mean of the gradients up until the current point in time and indicates the direction in which the weights should keep moving.

  • Friction $\rho$ is a constant that aims to decay the velocity over time, so past gradients gradually lose their influence.

At every time step, we update the velocity by decaying the previous velocity by a factor of $\rho$ and adding the gradient of the weights at the current time step. Then, we update the weights in the direction of the velocity vector:

$$v_{t+1} = \rho v_t + \nabla_w C(x, y, W)$$

$$w = w - \text{learning\_rate} \cdot v_{t+1}$$

v = 0  # initialize the velocity
for t in range(steps):
    dw = gradient(loss, w)      # gradient of the loss w.r.t. the weights
    v = rho * v + dw            # decay the previous velocity, add the current gradient
    w = w - learning_rate * v   # update the weights along the velocity vector
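In practice, the friction $\rho$ is commonly set to a value around 0.9, so roughly 90% of the previous velocity is carried over into each new step.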

But what do we gain with momentum?

  • We can now escape local minima or saddle points because we keep moving downwards even though the gradient of the mini-batch might be zero.

  • Momentum also helps reduce the oscillation of the gradients because the velocity vectors can smooth out these rapidly changing landscapes.

  • Finally, it reduces the noise of the gradients (stochasticity) and follows a more direct path down the loss landscape.

Adaptive learning rate

The other big idea in optimization algorithms is the adaptive learning rate. The intuition is that we would like to perform smaller updates for frequent features and bigger ones for infrequent features. This allows us to overcome some of the problems of SGD mentioned before.

Adagrad

Adagrad keeps a running sum of the squares of the gradients in each dimension, and in each update we scale the learning rate based on this sum. That way, we achieve a different learning rate for each parameter (an adaptive learning rate). Moreover, by using the square root of the summed squared gradients, we take into account only the magnitude of the gradients and not their sign.

$$w = w - \frac{l}{\sqrt{G} + e} \odot \nabla_w C(x, y, W)$$

where $l$ is the learning rate, $G$ is the running sum of squared gradients, and $e$ is a small constant that prevents division by zero.
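To make this update rule concrete, here is a minimal runnable sketch of Adagrad in the style of the momentum snippet above. The toy quadratic loss, its gradient, and all hyperparameter values are illustrative assumptions rather than part of the original text.

    import numpy as np

    # Toy quadratic loss C(w) = 0.5 * ||w||^2, so the gradient is simply w;
    # chosen only to make the sketch self-contained.
    def gradient(w):
        return w

    w = np.array([1.0, -2.0])     # initial weights
    G = np.zeros_like(w)          # running sum of squared gradients
    learning_rate, e = 0.1, 1e-8  # step size l and small constant e

    for t in range(100):
        dw = gradient(w)
        G += dw * dw                                   # accumulate squared gradients per dimension
        w = w - learning_rate * dw / (np.sqrt(G) + e)  # per-parameter effective learning rate

Note how parameters with a history of large gradients (large $G$) receive smaller updates, while rarely updated parameters keep a comparatively large effective learning rate.

...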