Minibatch Gradient Descent

Learn how minibatch gradient descent makes the otherwise intractable gradient descent optimization practical on large datasets.

Stochastic gradient descent (SGD)

Recall that to compute the gradient $\nabla_\theta J(\theta)$ of an objective $J(\theta)$, we need to aggregate the gradients $\nabla_\theta \mathcal{L}(f_\theta(x_i), y_i)$ over the whole dataset $D = \{ (x_i, y_i) \}_{i=1}^N$ to perform just one update. This makes vanilla gradient descent very slow and intractable for large datasets that do not fit in memory, such as ImageNet with about one million images.
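For reference, assuming the objective is the average per-example loss (a standard formulation, not specific to this lesson), one full-batch update aggregates all $N$ per-example gradients before taking a single step:

$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f_\theta(x_i), y_i), \qquad \theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta) = \theta - \frac{\eta}{N} \sum_{i=1}^{N} \nabla_\theta \mathcal{L}(f_\theta(x_i), y_i),$$

where $\eta$ is the learning rate.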

SGD, in contrast, performs the update based on one example at a time as follows:
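In its simplest form, written here with the same notation as above, each step uses the gradient of the loss on a single example $(x_i, y_i)$, typically drawn by iterating over a shuffled dataset:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(f_\theta(x_i), y_i).$$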

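As a minimal sketch, not the lesson's reference implementation, the following NumPy snippet applies per-example SGD updates to a toy linear-regression objective; the data, the `grad_example` helper, and all variable names are illustrative assumptions.

```python
import numpy as np

# Toy data for a linear model f_theta(x) = x @ theta (illustrative only).
rng = np.random.default_rng(0)
N, d = 1_000, 5
X = rng.normal(size=(N, d))
true_theta = rng.normal(size=d)
y = X @ true_theta + 0.1 * rng.normal(size=N)

def grad_example(theta, x_i, y_i):
    # Gradient of the squared loss 0.5 * (x_i @ theta - y_i)^2 w.r.t. theta.
    return (x_i @ theta - y_i) * x_i

eta = 0.01
theta = np.zeros(d)

# Stochastic gradient descent: one parameter update per example.
for epoch in range(5):
    for i in rng.permutation(N):  # shuffle the example order each epoch
        theta -= eta * grad_example(theta, X[i], y[i])

print("distance to true parameters:", np.linalg.norm(theta - true_theta))
```

Each pass over the shuffled data performs $N$ cheap updates instead of one expensive full-batch update, which is what makes SGD usable when the dataset cannot be processed at once.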