Newton's Method
Learn about one of the most famous second-order optimization algorithms.
Second-order optimization methods use the second derivatives of a function in each iteration. That's the only difference from first-order iterative methods. These methods require the target function to be not just differentiable but twice differentiable. That is, the target function should have both its first and second derivatives well defined.
In exchange for this stronger requirement, when they're applicable, second-order methods converge faster than first-order methods. We'll learn about the classic version of Newton's method. It's probably the most famous second-order optimization algorithm and the seed of many variants designed to overcome some of its limitations.
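To make "using second derivatives in every iteration" concrete, here's a minimal sketch of a single classic Newton step. It's purely illustrative (the method itself is developed later in the lesson), and it assumes we can already evaluate the gradient and the Hessian at the current point:

```python
import numpy as np

def newton_step(x, grad, hess):
    # One classic Newton update: solve H d = -g for the step d, then move to x + d.
    # grad and hess are assumed to be the gradient vector and Hessian matrix at x.
    step = np.linalg.solve(hess, -grad)
    return x + step

# Example on f(x, y) = x**2 + 2 * y**2, starting from (3, 2):
x0 = np.array([3.0, 2.0])
grad0 = np.array([2 * x0[0], 4 * x0[1]])       # gradient of f at x0
hess0 = np.array([[2.0, 0.0], [0.0, 4.0]])     # constant Hessian of f
print(newton_step(x0, grad0, hess0))           # lands on the minimum (0, 0)
```

Because the example function is quadratic, a single Newton step jumps straight to its minimum; that's the kind of fast convergence second-order information buys.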
But first, let’s talk about second-order derivatives and their meaning.
Note: Newton’s method is also known as the Newton-Raphson method.
Interpretation of second-order derivatives
It's time to give these derivatives more attention. We've talked a little about the Hessian, but we've mostly ignored it. That's a little unfair, because second-order derivatives measure an important property of functions.
In the same way that the first derivative and the gradient tell us about where a function is increasing or decreasing, the second derivative and the Hessian tell us about how the function is curved.
A univariate function is convex wherever its second derivative is positive, and concave wherever its second derivative is negative.
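For intuition, here's a small numerical sketch (the helper name and the choice of function are just illustrative) that estimates the second derivative with a central difference and reads off the local curvature:

```python
import numpy as np

def second_derivative(f, x, h=1e-4):
    # Central-difference estimate of f''(x).
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

f = np.sin  # concave on (0, pi), convex on (pi, 2*pi): the curvature changes sign

for x in (1.0, 2.0, 4.0, 5.0):
    d2 = second_derivative(f, x)
    shape = "convex" if d2 > 0 else "concave"
    print(f"x = {x:.1f}, f''(x) = {d2:+.3f} -> locally {shape}")
```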
There's a saddle point where the curvature changes, and Newton's method can get stuck at that point. The first derivative is zero at a saddle point, and yet it's neither a minimum nor a maximum.
We already know about saddle points: the first derivative is zero there, yet they're neither minima nor maxima. The problem with these points is that the curvature of the function changes at them. The function goes from convex to concave, or vice versa. The second derivative therefore changes sign and is zero right at the saddle point.
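As a concrete univariate example (the cubic is just an illustrative choice), f(x) = x**3 has a saddle point at x = 0, where the first derivative vanishes and the second derivative changes sign:

```python
# f(x) = x**3: f'(x) = 3x**2 and f''(x) = 6x, so both vanish at x = 0
# while the curvature flips from concave (x < 0) to convex (x > 0).
fprime = lambda x: 3 * x**2
fsecond = lambda x: 6 * x

for x in (-0.5, 0.0, 0.5):
    print(f"x = {x:+.1f}, f'(x) = {fprime(x):+.2f}, f''(x) = {fsecond(x):+.2f}")
```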
Analogously, for multidimensional functions, if the Hessian is positive definite (the test we did to check for minima in the previous section), then the function is convex at that point; if the Hessian is negative definite (the test we did to check for maxima in the previous section), then the function is concave at that point.
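Here's a sketch of that multidimensional test (the helper name and the two example Hessians are assumptions for illustration): we check the signs of the Hessian's eigenvalues at a point.

```python
import numpy as np

def classify_curvature(hessian):
    # Eigenvalue test for a symmetric Hessian evaluated at a single point.
    eigenvalues = np.linalg.eigvalsh(hessian)
    if np.all(eigenvalues > 0):
        return "positive definite -> locally convex"
    if np.all(eigenvalues < 0):
        return "negative definite -> locally concave"
    return "mixed signs -> saddle-like curvature"

# Hessian of f(x, y) = x**2 + y**2 (convex everywhere):
print(classify_curvature(np.array([[2.0, 0.0], [0.0, 2.0]])))
# Hessian of f(x, y) = x**2 - y**2 (saddle at the origin):
print(classify_curvature(np.array([[2.0, 0.0], [0.0, -2.0]])))
```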