Gradient Descent

This lesson will focus on the intuition behind the gradient descent algorithm.

In the last lesson, we minimized a loss function to find the best model to predict the tip paid by customers. But that approach had a drawback: we manually entered values of the model parameter θ and compared the resulting losses. Manually choosing the values of the model parameters is not scalable because:

  • It works only on predetermined values of θ.

  • Most models have many parameters and complex prediction functions, so choosing parameters manually would take far too long.

  • We may never try the best set of parameter values, in which case we will never arrive at the best model.

We need an approach that chooses the model parameters automatically and arrives at the best model.
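To make the drawback concrete, here is a minimal sketch of the manual approach, assuming the one-parameter model tip ≈ θ · bill with mean squared error as the loss; the data values are hypothetical, not the lesson’s actual dataset:

```python
import numpy as np

# Hypothetical sample of bills and tips (not the lesson's actual data).
bills = np.array([16.99, 10.34, 21.01, 23.68, 24.59])
tips = np.array([1.01, 1.66, 3.50, 3.31, 3.61])

def mse_loss(theta):
    """Mean squared error of the model tip = theta * bill."""
    return np.mean((tips - theta * bills) ** 2)

# The manual approach: evaluate a handful of predetermined values of
# theta and keep the one with the smallest loss.
candidates = [0.05, 0.10, 0.15, 0.20]
losses = {theta: mse_loss(theta) for theta in candidates}
print(losses)
print("best of the predetermined values:", min(losses, key=losses.get))
```

If the true minimizer lies between two of the predetermined values, this procedure can never find it, which is exactly the limitation described above.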

Intuition

Since we need a method that does not rely on predetermined values of θ, let’s start by picking a random value of θ and computing the loss there. We then need to decide whether to increase or decrease the current value of θ, and by how much. Let’s look at the direction and the amount separately.
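As a minimal sketch of this starting step (reusing the hypothetical data and loss from the sketch above):

```python
import numpy as np

bills = np.array([16.99, 10.34, 21.01, 23.68, 24.59])
tips = np.array([1.01, 1.66, 3.50, 3.31, 3.61])

def mse_loss(theta):
    return np.mean((tips - theta * bills) ** 2)

# Pick a random starting value of theta and compute its loss.
rng = np.random.default_rng(seed=0)
theta = rng.uniform(0.0, 0.3)
print(f"start: theta = {theta:.3f}, loss = {mse_loss(theta):.3f}")
# The next step is to decide the direction and size of the update.
```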

Direction of change in θ

Look at the error surface below for the example from the previous lesson.

[Figure: error surface of the loss as a function of θ, with two points highlighted]

We have highlighted two points on this curve. The red point A marks the loss at θ = 0.10. If we choose this as our starting value of θ, we need to move to a new value of θ that is closer to the minimum of the curve. Let’s look at the slope of the curve at this point.

[Figure: the tangent line at point A, showing its negative slope]

The slope at point A is negative, which means that if we increase θ from this point, the loss will decrease. Therefore, we need to increase the value of θ from here to reach the minimum of the curve.
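We can check this numerically. The sketch below estimates the slope at θ = 0.10 with a central finite difference, again using the hypothetical data from the earlier sketches; a negative estimate tells us to increase θ:

```python
import numpy as np

bills = np.array([16.99, 10.34, 21.01, 23.68, 24.59])
tips = np.array([1.01, 1.66, 3.50, 3.31, 3.61])

def mse_loss(theta):
    return np.mean((tips - theta * bills) ** 2)

def slope(theta, eps=1e-6):
    """Central-difference estimate of d(loss)/d(theta)."""
    return (mse_loss(theta + eps) - mse_loss(theta - eps)) / (2 * eps)

s = slope(0.10)  # point A in the figure
print(f"slope at theta = 0.10: {s:.2f}")
if s < 0:
    print("negative slope -> increase theta to reduce the loss")
else:
    print("positive slope -> decrease theta to reduce the loss")
```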

Now let’s consider another situation where we start at θ = 0.18 ...