Understand Batches

Explore the effect of batch size on the training and loss of neural networks.

Mini-batch GD feels counterintuitive. Why would smaller batches result in faster training? The answer is that they do not: mini-batch GD is generally slower than batch GD at processing the whole training set because it calculates a gradient and updates the weights once per batch, rather than once for all the examples.
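To make the extra work concrete, here is a minimal sketch that counts how many gradient calculations one pass over the training set requires. The helper name updates_per_epoch and the dataset size of 60,000 examples are assumptions chosen for illustration; they do not come from this lesson.

```python
import math


def updates_per_epoch(n_examples, batch_size):
    """Number of gradient calculations needed to see every example once."""
    return math.ceil(n_examples / batch_size)


n_examples = 60_000  # hypothetical training set size

# A batch size equal to the training set size is plain batch GD.
for batch_size in (60_000, 256, 32):
    print(batch_size, updates_per_epoch(n_examples, batch_size))
```

With these assumed numbers, batch GD calculates one gradient per pass, while a batch size of 32 calculates 1,875 of them.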

Even though mini-batch GD is slower at processing the training set, it tends to converge faster during the first iterations of training. In other words, each pass over the data takes longer, but the loss moves quickly toward the target, giving us the fast feedback we need. Let’s see how.
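Before the visual explanation, a small numerical experiment can hint at the same effect. The sketch below is not this course's neural network: it assumes a synthetic linear regression problem, a fixed learning rate of 0.01, and hypothetical helper names (loss, gradient, train). It trains the same model twice, once with the whole training set as a single batch and once with batches of 50 examples, and records the loss after each pass over the data.

```python
import numpy as np


def loss(X, Y, w):
    # Mean squared error over the given examples.
    return np.average((X @ w - Y) ** 2)


def gradient(X, Y, w):
    # Gradient of the mean squared error with respect to w.
    return 2 * X.T @ (X @ w - Y) / X.shape[0]


def train(X, Y, batch_size, epochs, lr):
    w = np.zeros((X.shape[1], 1))
    history = [loss(X, Y, w)]  # loss on the full set before training
    for _ in range(epochs):
        # One weight update per batch; a batch_size equal to the
        # number of examples reduces this to plain batch GD.
        for start in range(0, X.shape[0], batch_size):
            x_batch = X[start:start + batch_size]
            y_batch = Y[start:start + batch_size]
            w -= lr * gradient(x_batch, y_batch, w)
        history.append(loss(X, Y, w))  # loss after each full pass
    return w, history


# Synthetic data (an assumption for illustration): 1000 examples, 2 inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
true_w = np.array([[2.0], [-3.0]])
Y = X @ true_w + rng.normal(scale=0.1, size=(1000, 1))

_, batch_hist = train(X, Y, batch_size=1000, epochs=5, lr=0.01)
_, mini_hist = train(X, Y, batch_size=50, epochs=5, lr=0.01)

print("batch GD loss after 5 passes:     ", batch_hist[-1])
print("mini-batch GD loss after 5 passes:", mini_hist[-1])
```

Under these assumptions, the mini-batch run ends with a much lower loss after the same number of passes, because it updates the weights 20 times per pass instead of once. Results on a real neural network depend on the data, the learning rate, and the batch size.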

Twist that path

To see why mini-batches converge faster, let’s visualize gradient descent on a small two-dimensional training set.

The diagram below illustrates the path of plain-vanilla batch GD on the loss surface during the first few dozen iterations.
