...


Optimization and Gradient Descent

Learn about the fundamental algorithm behind machine learning training: gradient descent.

In our 2D example, the loss function can be thought of as a parabola-shaped surface that reaches its minimum at a certain pair of $w_1$ and $w_2$. Visually, we have:

To find these weights, the core idea is simply to follow the slope of the curve. Although we don't know the actual shape of the loss, we can calculate the slope at a point and then move in the downhill direction.

You can think of the loss function as a mountain: from our current position, we only know the local slope, not the overall shape.

But what is the slope?

Slope: the derivative of the loss function

In calculus, the slope is the derivative of the function at that point, denoted $\frac{\partial C}{\partial w}$ for the loss $C$ with respect to a weight $w$. The ultimate goal would be to find the global minimum. Minima, whether local or global, have a (nearly) zero derivative, which indicates that we are located at the bottom of the curve.

For now, suppose that we want to minimize the loss function $C$. By calculating the derivative, we take small steps along the slope in an iterative fashion. In this way, we gradually reach the minimum of the curve.
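
To make the iterative procedure concrete, here is a minimal sketch of gradient descent on a toy one-dimensional loss. The specific loss $C(w) = (w - 3)^2$, the learning rate, and the number of steps are illustrative assumptions, not values from the text.

```python
# Minimal sketch: gradient descent on a toy 1D loss C(w) = (w - 3)^2,
# which has its minimum at w = 3. All values below are illustrative.

def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    # Analytic derivative of the toy loss: dC/dw = 2 * (w - 3)
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting point
learning_rate = 0.1  # size of each small step along the slope

for step in range(50):
    w -= learning_rate * gradient(w)  # move in the downhill direction

print(w, loss(w))    # w ends up close to 3, where the loss is (nearly) zero
```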

The same principle can be extended to many dimensions $N$. Although this is very difficult to visualize, the maths is here to help us.
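
In vector form, the same idea is usually written as an update rule on the whole weight vector. As a sketch, with $\lambda$ denoting the step size (the learning rate, a symbol not used in the text above) and $\nabla_{\mathbf{W}} C$ the gradient of the loss:

$$
\mathbf{W}_{t+1} = \mathbf{W}_t - \lambda \, \nabla_{\mathbf{W}} C(\mathbf{W}_t)
$$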

Keep in mind that the minimum we reach this way is not always the global minimum; gradient descent can settle in a local minimum.

Computing the gradient of a loss function

The question is: how do we compute the derivative (or gradient) of the loss with respect to the weights? In simple cases, such as the two-dimensional one, we can compute the analytical form with calculus.

Since our loss function is $C = (f(x_i, \mathbf{W}) - y_i)^2$, where the classifier $f$ is $f = w_1 x + w_2$ ...
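
As a sketch of where the calculus leads, applying the chain rule to the loss and classifier as stated above, for a single example $(x_i, y_i)$, gives:

$$
\frac{\partial C}{\partial w_1} = 2\,\big(f(x_i, \mathbf{W}) - y_i\big)\, x_i,
\qquad
\frac{\partial C}{\partial w_2} = 2\,\big(f(x_i, \mathbf{W}) - y_i\big)
$$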