Understand Batches

Explore the effect of batch size over training and loss of neural networks.

Mini-batch GD feels counter-intuitive. Why do smaller batches result in faster training? The answer is that they do not: mini-batch GD is generally slower than batch GD at processing the whole training set because it calculates the gradient for each batch rather than once for all the examples.
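To make that cost concrete, here is a minimal sketch that counts how many gradient calculations one pass over the training set requires for a given batch size. The function name `gradient_steps_per_epoch` is hypothetical, and the batch size of 256 is an arbitrary value chosen for illustration. The figure of 60,000 is the size of the MNIST training set.

```python
import math

def gradient_steps_per_epoch(n_examples, batch_size):
    """Return how many gradient calculations one pass over the data needs."""
    # The last batch may be smaller than the others, hence the ceiling.
    return math.ceil(n_examples / batch_size)

n_examples = 60000  # size of the MNIST training set

# Batch GD (one batch with everything), mini-batch GD, and a batch size of 1.
for batch_size in (n_examples, 256, 1):
    steps = gradient_steps_per_epoch(n_examples, batch_size)
    print(f"batch size {batch_size}: {steps} gradient calculations per epoch")
```

Each of those calculations carries some fixed overhead, which is one reason many small batches can take longer to process than one large batch.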

Even though mini-batch GD is slower at processing the training set, it tends to converge faster during the first iterations of training. In other words, it takes longer to get through all the examples, but it moves quickly toward the target, giving us the fast feedback we need. Let’s see how.

Twist that path

To see why mini-batches converge faster, let’s visualize gradient descent on a small two-dimensional training set.

The diagram below illustrates the path of plain-vanilla batch GD on the loss surface during the first few dozen iterations.

We saw a similar diagram back when we were introduced to gradient descent. At each iteration, the system calculates the gradient of the loss over the entire training set and takes a step in the opposite direction. The resulting path rolls steadily toward the minimum.
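The batch GD loop can be sketched in a few lines. This is an illustration under stated assumptions, not the course's network: it uses a synthetic linear regression problem with two weights and a mean squared error loss, and the names `loss`, `gradient`, and `batch_gd`, the learning rate, and the data are all made up for the example.

```python
import numpy as np

def loss(X, Y, w):
    # Mean squared error over all the examples in X.
    return np.mean((X @ w - Y) ** 2)

def gradient(X, Y, w):
    # Gradient of the mean squared error with respect to w.
    return 2 * X.T @ (X @ w - Y) / X.shape[0]

def batch_gd(X, Y, iterations, lr):
    w = np.zeros((X.shape[1], 1))
    path = [w.copy()]  # record every position on the loss surface
    for _ in range(iterations):
        # One step per iteration, using the gradient over the whole set.
        w = w - lr * gradient(X, Y, w)
        path.append(w.copy())
    return w, path

# Synthetic two-dimensional training set (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([[2.0], [-3.0]])
Y = X @ true_w + 0.1 * rng.normal(size=(100, 1))

w, path = batch_gd(X, Y, iterations=50, lr=0.1)
losses = [loss(X, Y, p) for p in path]
print("final weights:", w.ravel())
print("first and last loss:", losses[0], losses[-1])
```

Because every step uses the exact gradient of the full loss, the recorded losses in this sketch never increase from one iteration to the next, which matches the steady path described above.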

Now let’s repeat the training with batches, using the smallest possible batch size: 1. This extreme variant of mini-batch GD is often called stochastic gradient descent, where “stochastic” is the statistical term for “random.” The idea of stochastic gradient descent is that we select one random example per iteration and take a step of GD based on that one example. In our case, we do not even need to select a random example at each iteration because the MNIST dataset has already been shuffled. We can pick the examples in order, one at a time.
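As a rough sketch of that idea, the loop below takes one GD step per example and walks through the examples in order. It assumes a synthetic, already randomly ordered linear regression problem with two weights rather than the MNIST network, and the names `gradient` and `stochastic_gd`, the learning rate, and the data are hypothetical.

```python
import numpy as np

def gradient(X, Y, w):
    # Gradient of the mean squared error with respect to w.
    return 2 * X.T @ (X @ w - Y) / X.shape[0]

def stochastic_gd(X, Y, epochs, lr):
    w = np.zeros((X.shape[1], 1))
    path = [w.copy()]  # record every position on the loss surface
    for _ in range(epochs):
        # The data is assumed to be shuffled already, so we can
        # pick the examples in order, one at a time.
        for i in range(X.shape[0]):
            x_i = X[i:i + 1]  # a "batch" that holds a single example
            y_i = Y[i:i + 1]
            w = w - lr * gradient(x_i, y_i, w)
            path.append(w.copy())
    return w, path

# Synthetic two-dimensional training set (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([[2.0], [-3.0]])
Y = X @ true_w + 0.1 * rng.normal(size=(100, 1))

w, path = stochastic_gd(X, Y, epochs=1, lr=0.05)
print("steps taken in one epoch:", len(path) - 1)
print("final weights:", w.ravel())
```

Each step here follows the gradient of a single example, so an individual step can point away from the minimum of the full loss even though the weights still drift toward it overall.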

The following diagram illustrates the result of training the neural network with stochastic GD:
