The success of a machine learning model depends on minimizing the error between the true and the predicted values. This error, or cost, is measured using special functions called cost functions.
Many machine learning algorithms use gradient descent or one of its variants to minimize their cost functions. The intuition behind this technique is to iteratively update the weights until they reach a minimum of the cost function.
Gradient descent starts from a random point on the cost function and iteratively moves that point in the direction of the negative gradient of the function until it reaches a local minimum.
As a mathematical formula, the weight update looks like this:

$$w_j^{(t+1)} = w_j^{(t)} - \alpha \, \frac{\partial J\big(w^{(t)}\big)}{\partial w_j}$$

Here, $w_j^{(t)}$ is the $j$-th component of the weight vector at the $t$-th iteration, $J$ is the cost function, and $\alpha$ is the learning rate. The gradient of any function points along the direction of maximum increase (i.e., moving along the direction of the gradient increases the value of the function), but since we want to minimize the value of the cost function, we move in the direction opposite to the gradient, as indicated by the negative sign and scaled by the learning rate $\alpha$. We continue updating the weights at each iteration until we reach the minimum value of the cost function, at which point the model is said to have converged.
Note: In the formula above, the negative gradient term is multiplied by $\alpha$, the selected learning rate.
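To make the update rule concrete, here is a minimal sketch in plain Python/NumPy. The quadratic cost, its gradient, the starting point, and the hyperparameter values are all illustrative assumptions rather than part of any particular library.

```python
import numpy as np

def gradient_descent(grad_fn, w_init, alpha=0.1, n_iters=100):
    """Repeatedly apply the update w <- w - alpha * grad J(w)."""
    w = np.asarray(w_init, dtype=float)
    for _ in range(n_iters):
        w = w - alpha * grad_fn(w)  # step in the negative gradient direction
    return w

# Example cost J(w) = ||w||^2, whose gradient is 2w; its minimum is at w = 0.
grad_J = lambda w: 2 * w
print(gradient_descent(grad_J, w_init=[3.0, -4.0]))  # values very close to [0, 0]
```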
Determining the value of the learning rate ($\alpha$) is a very important part of gradient descent. The learning rate is the hyperparameter that controls how large a jump each weight update takes. Setting the value too high makes the weight updates overshoot and oscillate around the minimum rather than settle into it. On the other hand, a very small value of $\alpha$ can eventually reach an optimal, low-cost solution, but it makes the algorithm converge very slowly. Commonly used starting values for $\alpha$ are 0.1, 0.01, or 0.001, although the best value varies from model to model.
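As a rough illustration of how the choice of $\alpha$ changes behavior, the sketch below minimizes the simple cost $J(w) = w^2$ with a large, a moderate, and a small learning rate; the specific values are arbitrary and chosen only to show divergence, fast convergence, and slow convergence.

```python
# Minimize J(w) = w^2 (gradient 2w) starting from w = 1.0 with different
# learning rates; the alpha values below are illustrative, not recommendations.
for alpha in (1.1, 0.1, 0.001):
    w = 1.0
    for _ in range(50):
        w = w - alpha * 2 * w
    print(f"alpha={alpha}: w after 50 steps = {w:.6f}")
# alpha=1.1 overshoots and diverges, alpha=0.1 converges quickly,
# alpha=0.001 still moves toward 0 but very slowly.
```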
There are three main types of gradient descent algorithms used depending on model requirements. Let’s explore all three:
1. Batch gradient descent
Batch gradient descent calculates the error for every point in the training dataset and updates the model parameters only after all examples have been evaluated. This process is repeated in each epoch. Batch gradient descent produces a stable convergence, and averaging the gradient over the full dataset reduces the noise (standard error) of the gradient estimate, but it is computationally intensive because every update requires a pass over the whole dataset. It is also memory heavy, since the whole dataset typically resides in memory while the algorithm runs.
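A minimal sketch of batch gradient descent for linear regression with a mean-squared-error cost might look like the following; the synthetic data, learning rate, and epoch count are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # 100 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
alpha, n_epochs = 0.1, 200
for _ in range(n_epochs):
    # Gradient of the MSE cost over the *entire* training set, then one update.
    grad = (2 / len(X)) * X.T @ (X @ w - y)
    w -= alpha * grad
print(w)  # close to true_w
```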
2. Stochastic gradient descent
In stochastic gradient descent, parameters are updated based on the gradient with respect to a single training example at a time. Each individual update is much cheaper than a full pass over the data, which makes stochastic gradient descent a faster option than batch gradient descent on large datasets. The frequent loss updates also provide a detailed picture of the rate of improvement. However, computing and applying an update for every single example adds per-update overhead, and because each gradient comes from only one example, the gradients are noisy. This noise helps the model avoid getting stuck in shallow local minima and improves its chances of reaching the global minimum.
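The sketch below adapts the same kind of linear-regression setup to stochastic gradient descent, updating the weights once per (shuffled) training example; again, the data and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
alpha, n_epochs = 0.01, 50
for _ in range(n_epochs):
    for i in rng.permutation(len(X)):        # visit examples in random order
        grad = 2 * X[i] * (X[i] @ w - y[i])  # gradient from a single example
        w -= alpha * grad                    # one update per example
print(w)  # noisy path, but ends near true_w
```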
3. Mini-batch gradient descent
Mini-batch gradient descent is a combination of batch gradient descent and stochastic gradient descent. It splits the training dataset into small batches and updates the parameters based on the gradient of each batch. This makes it faster than batch gradient descent and less noisy than stochastic gradient descent.
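Finally, here is a sketch of mini-batch gradient descent on the same kind of synthetic problem, where the data is reshuffled each epoch and split into small batches; the batch size of 16 and the other hyperparameters are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
alpha, n_epochs, batch_size = 0.05, 100, 16
for _ in range(n_epochs):
    idx = rng.permutation(len(X))            # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        X_b, y_b = X[batch], y[batch]
        grad = (2 / len(X_b)) * X_b.T @ (X_b @ w - y_b)  # gradient over one mini-batch
        w -= alpha * grad
print(w)  # close to true_w
```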