The success of a machine learning model depends on minimizing the error between the true and the predicted values. This error, or cost, is measured using special functions called cost functions.
Many machine learning algorithms use gradient descent or one of its variants to minimize their cost functions. The intuition behind this technique is to iteratively update the weights until they reach a minimum of the cost function.
Gradient descent starts from a random point on the cost function and iteratively moves that point in the direction of the negative gradient of the function until it reaches a local minimum.
As a mathematical formula, the weight update looks like this:

$$w_j^{(t+1)} = w_j^{(t)} - \alpha \, \frac{\partial J\big(w^{(t)}\big)}{\partial w_j}$$

Here, $w_j^{(t)}$ is the $j$-th component of the weight vector at the $t$-th iteration, $J$ is the cost function, and $\alpha$ is the learning rate. The gradient of any function points along the direction of maximum increase (i.e., moving along the direction of the gradient increases the value of the function), but since we want to minimize the value of the cost function, we move in the direction opposite to the gradient, as indicated by the negative sign and scaled by the learning rate $\alpha$. We continue updating the weights at each iteration until we reach the minimum value of the cost function, at which point the model is said to have converged.
Note: In the formula above, the negative gradient term is multiplied by $\alpha$, the selected learning rate.
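To make the update rule concrete, here is a minimal sketch in plain Python/NumPy. The quadratic cost, its gradient, the starting point, and the hyperparameter values are all illustrative assumptions rather than part of any particular library.

```python
import numpy as np

def gradient_descent(grad_fn, w_init, alpha=0.1, n_iters=100):
    """Repeatedly apply the update w <- w - alpha * grad J(w)."""
    w = np.asarray(w_init, dtype=float)
    for _ in range(n_iters):
        w = w - alpha * grad_fn(w)  # step in the negative gradient direction
    return w

# Example cost J(w) = ||w||^2, whose gradient is 2w; its minimum is at w = 0.
grad_J = lambda w: 2 * w
print(gradient_descent(grad_J, w_init=[3.0, -4.0]))  # values very close to [0, 0]
```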
Determining the value of the learning rate ($\alpha$) is a very important part of gradient descent. The learning rate is the hyperparameter that controls how large a jump each weight update takes. Setting the value too high makes the weight updates overshoot and oscillate around the minimum rather than settle into it. On the other hand, a very small value of $\alpha$ can eventually reach an optimal, low-cost solution, but it makes the algorithm converge very slowly. Commonly used starting values for $\alpha$ are 0.1, 0.01, or 0.001, although the best value varies from model to model.
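As a rough illustration of how the choice of $\alpha$ changes behavior, the sketch below minimizes the simple cost $J(w) = w^2$ with a large, a moderate, and a small learning rate; the specific values are arbitrary and chosen only to show divergence, fast convergence, and slow convergence.

```python
# Minimize J(w) = w^2 (gradient 2w) starting from w = 1.0 with different
# learning rates; the alpha values below are illustrative, not recommendations.
for alpha in (1.1, 0.1, 0.001):
    w = 1.0
    for _ in range(50):
        w = w - alpha * 2 * w
    print(f"alpha={alpha}: w after 50 steps = {w:.6f}")
# alpha=1.1 overshoots and diverges, alpha=0.1 converges quickly,
# alpha=0.001 still moves toward 0 but very slowly.
```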
There are three main types of gradient descent algorithms used depending on model requirements. Let’s explore all three:
1. Batch gradient descent
Batch gradient descent calculates the error for every point in the training dataset and updates the model parameters only after all examples have been evaluated. This process is repeated in each epoch. Batch gradient descent produces a stable convergence, and averaging the gradient over the full dataset reduces the noise (standard error) of the gradient estimate, but it is computationally intensive because every update requires a pass over the whole dataset. It is also memory heavy, since the whole dataset typically resides in memory while the algorithm runs.
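A minimal sketch of batch gradient descent for linear regression with a mean-squared-error cost might look like the following; the synthetic data, learning rate, and epoch count are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # 100 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
alpha, n_epochs = 0.1, 200
for _ in range(n_epochs):
    # Gradient of the MSE cost over the *entire* training set, then one update.
    grad = (2 / len(X)) * X.T @ (X @ w - y)
    w -= alpha * grad
print(w)  # close to true_w
```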
2. Stochastic gradient descent
In stochastic gradient descent, parameters are updated based on the gradient with respect to a single training example at a time. Each individual update is much cheaper than a full pass over the data, which makes stochastic gradient descent a faster option than batch gradient descent on large datasets. The frequent loss updates also provide a detailed picture of the rate of improvement. However, computing and applying an update for every single example adds per-update overhead, and because each gradient comes from only one example, the gradients are noisy. This noise helps the model avoid getting stuck in shallow local minima and improves its chances of reaching the global minimum.
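The sketch below adapts the same kind of linear-regression setup to stochastic gradient descent, updating the weights once per (shuffled) training example; again, the data and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
alpha, n_epochs = 0.01, 50
for _ in range(n_epochs):
    for i in rng.permutation(len(X)):        # visit examples in random order
        grad = 2 * X[i] * (X[i] @ w - y[i])  # gradient from a single example
        w -= alpha * grad                    # one update per example
print(w)  # noisy path, but ends near true_w
```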
3. Mini-batch gradient descent
Mini-batch gradient descent is a combination of batch gradient descent and stochastic gradient descent. It splits the training dataset into small batches and updates the parameters based on the gradient of each batch. This makes it faster than batch gradient descent and less noisy than stochastic gradient descent.
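Finally, here is a sketch of mini-batch gradient descent on the same kind of synthetic problem, where the data is reshuffled each epoch and split into small batches; the batch size of 16 and the other hyperparameters are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
alpha, n_epochs, batch_size = 0.05, 100, 16
for _ in range(n_epochs):
    idx = rng.permutation(len(X))            # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        X_b, y_b = X[batch], y[batch]
        grad = (2 / len(X_b)) * X_b.T @ (X_b @ w - y_b)  # gradient over one mini-batch
        w -= alpha * grad
print(w)  # close to true_w
```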