In machine learning, gradient descent is an optimization technique used to find the model parameters that minimize the cost function.
Depending on how much of the dataset is used to compute each gradient, there are three types of gradient descent algorithms.
Batch gradient descent
Stochastic gradient descent
Mini-Batch gradient descent
In this shot, we’ll be focusing on Stochastic gradient descent.
In the standard (batch) gradient descent approach, we use the entire dataset to calculate the gradient at every iteration. The downside of this approach becomes apparent as the dataset grows: every sample is processed on each iteration until a minimum is found, making the algorithm inefficient and resource-intensive.
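To make this concrete, here is a minimal sketch of batch gradient descent, assuming a linear regression model with a mean squared error cost. The function name, the data matrices X and y, and the hyperparameters are all illustrative choices, not part of any specific library:

import numpy as np

def batch_gradient_descent(X, y, theta, learning_rate, no_iterations):
    # X: (n_samples, n_features) feature matrix, y: (n_samples,) targets
    n = len(y)
    for _ in range(no_iterations):
        # The gradient of the MSE cost is computed over ALL n samples
        # on every single iteration, which is costly for large datasets
        predictions = X @ theta
        gradient = (2 / n) * X.T @ (predictions - y)
        theta = theta - learning_rate * gradient
    return theta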
Therefore, a better approach is Stochastic Gradient Descent (SGD), in which only a small number of dataset items, often a single sample, is used for each iteration. The sample is drawn randomly after shuffling the dataset.
Due to the randomization involved in SGD, it takes more iterations to reach the minimum, and the path taken to get there is noisier.
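As a sketch of this sampling idea, reusing the same linear regression setup as above (again with illustrative names), each pass over the data shuffles the dataset and then updates the parameters one randomly ordered sample at a time:

import numpy as np

def stochastic_gradient_descent(X, y, theta, learning_rate, no_epochs):
    n = len(y)
    rng = np.random.default_rng()
    for _ in range(no_epochs):
        # Shuffle the dataset, then update using one sample at a time
        for i in rng.permutation(n):
            x_i, y_i = X[i], y[i]
            # Gradient of the squared error for this single sample
            gradient = 2 * x_i * (x_i @ theta - y_i)
            theta = theta - learning_rate * gradient
    return theta

Because each update is based on a single sample, the gradient is only a noisy estimate of the true gradient, which is exactly why the convergence path wanders more than in the batch version.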
Let’s look at an example implementation of SGD in Python.
def SGD(theta0, learning_rate, no_iterations):
    theta = theta0
    for i in range(no_iterations):
        # predict is assumed to return the current cost and gradient
        cost, gradient = predict(theta)
        # Step in the direction opposite to the gradient
        theta = theta - (learning_rate * gradient)
    return theta
In the code above, theta0 is the initial point from which SGD starts, learning_rate is the learning rate of the algorithm, and no_iterations is the total number of iterations for which the SGD process will run.
We define a function that takes in these three parameters. We assume a predict function has been implemented that returns the cost and the gradient we’ll optimize. Once the iterations are exhausted, the function returns the final theta.
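To show how the pieces fit together, here is one way the assumed predict function might look for the linear regression setup used earlier. The data, the predict helper, and the chosen hyperparameters are all hypothetical and only meant for illustration:

import numpy as np

# Hypothetical data for illustration: y = 3x plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.1, size=100)

def predict(theta):
    # Draw one random data point, as SGD prescribes
    i = rng.integers(len(y))
    x_i, y_i = X[i], y[i]
    error = x_i @ theta - y_i
    cost = error ** 2              # squared error on this one sample
    gradient = 2 * x_i * error     # its gradient with respect to theta
    return cost, gradient

theta = SGD(theta0=np.zeros(1), learning_rate=0.1, no_iterations=1000)
print(theta)  # should approach [3.]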