What is optimization?

Let’s assume a machine learning problem where we have a model $f_\theta(\cdot)$ parameterized by the weights $\theta$. Given a dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ with $N$ training examples, where $x_i$ represents the input and $y_i$ represents the corresponding ground-truth label, we want to minimize the following loss function between the model prediction $f_\theta(x_i)$ and the ground-truth label $y_i$:

$$\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f_\theta(x_i), y_i\big)$$

Here, $\mathcal{L}$ is an arbitrary loss function, such as cross-entropy, that measures the discrepancy between the predicted and the ground-truth value.
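To make this concrete, here is a minimal sketch (not from the original text) of minimizing a mean-squared-error loss for a one-parameter linear model $f_\theta(x) = \theta x$ with plain gradient descent; the data and learning rate are made up for illustration:

```python
import numpy as np

# Toy dataset: y is roughly 3*x plus noise (made-up numbers for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 3.2, 5.9, 9.1, 11.8])

theta = 0.0   # initial weight
lr = 0.01     # learning rate (assumed)

for _ in range(500):
    pred = theta * x                      # model prediction f_theta(x)
    grad = 2.0 * np.mean((pred - y) * x)  # derivative of the mean squared error w.r.t. theta
    theta -= lr * grad                    # gradient descent step

print(theta)  # converges close to the least-squares slope (about 3)
```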

The process of finding the optimal parameter $\theta^*$ is known as optimization. In general, an optimization problem can be formulated as follows:

$$\min_{x \in \mathcal{X}} \ g(x) \quad \text{subject to} \quad h_i(x) \leq 0, \quad i = 0, 1, \ldots, m-1$$

where $g(x)$ is our objective function, $\mathcal{X}$ is our search space, and $\{h_i(x)\}_{i=0}^{m-1}$ are the $m$ constraints.

A real-world example

Let’s consider a real-world example where we want to design a cylindrical soda can that holds a volume of 500 ml (1 ml = 1 cubic cm) using the least amount of material possible. The material cost is $0.01 \text{ cents}/\text{cm}^2$, so the total cost is proportional to the surface area of the can. As a result, we want to minimize the cost of material for the can, subject to the constraint that its volume is 500 ml.

Mathematically, we can represent the problem above as the following optimization problem:

$$\min_{r,\,h} \ 0.01 \times (2\pi r^2 + 2\pi r h) \quad \text{subject to} \quad \pi r^2 h - 500 = 0$$

where $r$ is the radius of the can and $h$ is the height of the can. Here, our objective function $g(r, h) = 0.01 \times (2\pi r^2 + 2\pi rh)$ is the cost of material for the can, and $h(r, h) = \pi r^2 h - 500 = 0$ is the volume constraint.

To solve the problem above, we can substitute $h = \frac{500}{\pi r^2}$ and rewrite our optimization problem as follows:

$$\min_{r > 0} \ g(r) = 0.01 \times \left(2\pi r^2 + \frac{1000}{r}\right)$$

The figure given below shows the graph of the objective $g(r)$. As can be seen, the minimum lies at $r \approx 4.3$ cm (setting $g'(r) = 0$ gives $r = (250/\pi)^{1/3} \approx 4.30$ cm), which gives us a cost of about $3.48$ cents per soda can.

Objective of the soda can problem
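As a quick numerical cross-check, here is a small SciPy sketch (my own addition; the upper bound of 20 cm on the radius is an arbitrary assumption) that minimizes the same objective:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Cost of the can as a function of radius, after substituting h = 500 / (pi * r^2)
def cost(r):
    return 0.01 * (2 * np.pi * r**2 + 1000.0 / r)

# Search over a reasonable range of radii (upper bound of 20 cm is assumed)
res = minimize_scalar(cost, bounds=(0.1, 20.0), method="bounded")
print(res.x, res.fun)  # roughly r = 4.30 cm and a cost of about 3.5 cents
```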

The effect of constraints on an optimal solution

Let’s take an example where $g(x) = (x+1)(x-3)$ and $m = 0$, i.e., no constraints are present. As can be seen in the unconstrained optimization figure below, the optimal solution lies at $x = 1$. However, as shown in the constrained optimization figure, if we add a constraint $h(x): x \leq 0$, the optimal solution changes to $x = 0$.

Unconstrained optimization
Constrained optimization (white region is the allowed region)

Note: The points $(1, -4)$ and $(0, -3)$ mark the unconstrained and constrained optima of $g(x) = (x+1)(x-3)$, i.e., $x = 1$ and $x = 0$, respectively.
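The same comparison can be reproduced numerically. The sketch below is my own addition, using SciPy’s bounded scalar minimizer; the search interval $[-10, 10]$ is chosen arbitrarily, and the $x \leq 0$ constraint is imposed simply by shrinking the interval:

```python
from scipy.optimize import minimize_scalar

def g(x):
    return (x + 1) * (x - 3)

# Unconstrained (searched over an arbitrary wide interval)
unconstrained = minimize_scalar(g, bounds=(-10.0, 10.0), method="bounded")

# Constrained to x <= 0: restrict the upper end of the search interval
constrained = minimize_scalar(g, bounds=(-10.0, 0.0), method="bounded")

print(unconstrained.x, unconstrained.fun)  # approximately x = 1, g = -4
print(constrained.x, constrained.fun)      # approximately x = 0, g = -3
```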

Local vs. global optimal solution

The solution to an optimization problem—by default, we assume that it is a minimization problem—can be of the following two types:

  • Local optimum: A solution $x^*$ is said to be a local optimum if $g(x^* - h) \geq g(x^*) \leq g(x^* + h)$ for all sufficiently small $h > 0$. In other words, the value of the objective function increases (or stays the same) in all directions around $x^*$.

  • Global optimum: A solution $x^*$ is said to be a global optimum if it is a local optimum and $g(x) \geq g(x^*)$ for all $x$.

Let’s consider the graph $f(x)$ below that represents a house search, where the vertical axis shows house prices and the horizontal axis represents houses in various neighborhoods.

Local vs. global optimal solution

The local optima—red—are houses with competitive prices within their immediate neighborhoods but aren’t the cheapest citywide. However, the global optimum—green—is the absolute best deal across all neighborhoods, offering the lowest price while meeting essential criteria.

Types of optimization problems

When $m > 0$, the optimization problem is referred to as constrained optimization. Otherwise, it is referred to as unconstrained optimization. Furthermore, based on the nature of the objective function $g(x)$, the search space $\mathcal{X}$, and the constraints $h_i(x)$, optimization problems can be divided into multiple types, as shown in the following image:

Types of optimization problems

Linear programming (LP)

When both the objective function and the constraints are linear, the optimization problem is said to be a linear programming (LP) problem. The general form of LP is given as follows:

$$\min_{x \in \R^d} \ c^T x \quad \text{subject to} \quad Ax \leq b$$

where $c \in \R^d$ is the cost vector, $A \in \R^{m \times d}$ is the constraint matrix, and $b \in \R^m$ is the constraint vector.

For example, let’s consider a diet problem where the task is to find the optimal combination of foods that satisfy the nutritional requirements of a person at a minimum cost. Let’s say we have the following four foods to choose from:

| Food  | Cost per Unit ($) | Calories per Unit | Protein per Unit (g) | Fat per Unit (g) |
|-------|-------------------|-------------------|----------------------|------------------|
| Rice  | 0.12              | 130               | 2.7                  | 0.3              |
| Beans | 0.18              | 120               | 7.5                  | 0.5              |
| Milk  | 0.23              | 150               | 8                    | 8                |
| Bread | 0.05              | 75                | 2                    | 1                |

We want to satisfy the following nutritional requirements per day:

  • Calories: At least 2000, i.e., $130x_1 + 120x_2 + 150x_3 + 75x_4 \geq 2000$.

  • Protein: At least 55 g, i.e., $2.7x_1 + 7.5x_2 + 8x_3 + 2x_4 \geq 55$.

  • Fat: At most 70 g, i.e., $0.3x_1 + 0.5x_2 + 8x_3 + x_4 \leq 70$.

In these relations, $x_1$ is the amount of rice, $x_2$ is the amount of beans, $x_3$ is the amount of milk, and $x_4$ is the amount of bread.

The objective function is the total cost of the foods, which is given by the following:

$$\min_x \ c^T x = 0.12x_1 + 0.18x_2 + 0.23x_3 + 0.05x_4$$

Here, $c^T = \begin{bmatrix} 0.12 & 0.18 & 0.23 & 0.05 \end{bmatrix}$ is the cost vector and $x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}$ is the vector of food amounts.

The constraints can also be written in matrix form as follows:

$$Ax \leq b$$

where $A = \begin{bmatrix} -130 & -120 & -150 & -75 \\ -2.7 & -7.5 & -8 & -2 \\ 0.3 & 0.5 & 8 & 1 \end{bmatrix}$ is the constraint matrix and $b = \begin{bmatrix} -2000 \\ -55 \\ 70 \end{bmatrix}$ is the constraint vector. Other trivial constraints include the nonnegativity restriction, showing that the amount of each food must be nonnegative, i.e.:

$$x_1, x_2, x_3, x_4 \geq 0$$

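This diet LP can be handed directly to an off-the-shelf solver. The sketch below is my own addition, using SciPy’s linprog, and simply re-encodes the cost vector and constraints above:

```python
import numpy as np
from scipy.optimize import linprog

# Cost per unit of rice, beans, milk, bread
c = np.array([0.12, 0.18, 0.23, 0.05])

# Inequality constraints in the form A x <= b
# (the calorie and protein minimums are negated to become <= constraints)
A = np.array([
    [-130.0, -120.0, -150.0, -75.0],  # calories >= 2000
    [-2.7,   -7.5,   -8.0,   -2.0],   # protein  >= 55 g
    [0.3,     0.5,    8.0,    1.0],   # fat      <= 70 g
])
b = np.array([-2000.0, -55.0, 70.0])

# Nonnegativity of the food amounts via the default bounds (0, None)
res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None))
print(res.x, res.fun)  # optimal food amounts and the minimum daily cost
```
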
Quadratic programming (QP)

The optimization problem is a quadratic programming (QP) problem when the objective function is quadratic, but the constraints are linear. The general form of QP is given as follows:

$$\min_{x \in \R^d} \ \frac{1}{2} x^T Q x + c^T x \quad \text{subject to} \quad Ax \leq b$$

where $c \in \R^d$, $A \in \R^{m \times d}$, $b \in \R^m$, and $Q \in \R^{d \times d}$ is a positive definite square matrix.

As an example, let’s consider a portfolio optimization problem where the task is to find the best allocation of assets—such as stocks, bonds, or cash—that maximizes the expected return and minimizes the risk of a portfolio. Let’s say we have the following three types of assets to choose from:

  • Stocks that give an expected return of $15\%$ but have a high risk component (or variance) of $20$

  • Bonds that give an expected return of $7\%$ and have a medium risk component (or variance) of $10$

  • Cash that gives an expected return of $2\%$ and has a low risk component (or variance) of $5$

Considering that we want to assign the fractions $x_1, x_2, x_3$ of the portfolio to stocks, bonds, and cash, respectively, the asset weights should sum to one. At the same time, we would like to generate a collective return of at least $10\%$ on the whole portfolio. The constraints can then be written as follows:

$$x_1 + x_2 + x_3 \leq 1, \qquad x_1 + x_2 + x_3 \geq 1, \qquad 15x_1 + 7x_2 + 2x_3 \geq 10$$
With these constraints, we need to find the asset allocation that yields the minimum risk $R(x)$. The risk is proportional to the variance of the portfolio return, which will be a quadratic function of the asset weights.

Note: $x_1 + x_2 + x_3 \leq 1$ and $x_1 + x_2 + x_3 \geq 1$ together imply that $x_1 + x_2 + x_3 = 1$.

This can be written as a QP problem as follows:

$$\min_x \ \frac{1}{2} x^T Q x + c^T x$$

subject to

$$Ax \leq b$$

where $x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$, $Q = \begin{bmatrix} 40 & 0 & 0 \\ 0 & 20 & 0 \\ 0 & 0 & 10 \end{bmatrix}$, $c = 0$, $A = \begin{bmatrix} 1 & 1 & 1 \\ -1 & -1 & -1 \\ -15 & -7 & -2 \end{bmatrix}$, and $b = \begin{bmatrix} 1 \\ -1 \\ -10 \end{bmatrix}$. Other trivial constraints include the nonnegativity restriction that the asset weights must be nonnegative, i.e.:

$$x_1, x_2, x_3 \geq 0$$

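A quick way to solve this small QP numerically is SciPy’s general-purpose SLSQP solver; the sketch below is my own addition and simply re-encodes the matrices above:

```python
import numpy as np
from scipy.optimize import minimize

# Risk matrix Q (twice the asset variances on the diagonal) from the QP above
Q = np.diag([40.0, 20.0, 10.0])

def risk(x):
    # Objective: (1/2) x^T Q x
    return 0.5 * x @ Q @ x

constraints = [
    # Asset weights sum to one
    {"type": "eq", "fun": lambda x: np.sum(x) - 1.0},
    # Expected portfolio return of at least 10%
    {"type": "ineq", "fun": lambda x: 15.0 * x[0] + 7.0 * x[1] + 2.0 * x[2] - 10.0},
]
bounds = [(0.0, 1.0)] * 3              # nonnegative weights
x0 = np.array([1 / 3, 1 / 3, 1 / 3])   # arbitrary starting allocation

res = minimize(risk, x0, method="SLSQP", bounds=bounds, constraints=constraints)
print(res.x, res.fun)  # optimal weights for stocks, bonds, cash and the minimum risk
```
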
Integer linear programming (ILP)

The optimization problem is said to be an integer linear programming (ILP) problem when both the objective function and the constraints are linear, but the search space is restricted to the integers $\Z$. The general form of ILP is given as follows:

$$\min_{x \in \Z^d} \ c^T x \quad \text{subject to} \quad Ax \leq b$$

where $c \in \R^d$, $A \in \R^{m \times d}$, and $b \in \R^m$.

As an example, let’s consider the knapsack problem, where the task is to pack a set of $N$ items with different weights and values into a sack with a limited capacity $C$ so that the total value of the packed items is maximized. The objective for the knapsack problem can be written as follows:

$$\max_{x} \ \sum_{i=1}^{N} v_i x_i \quad \text{subject to} \quad \sum_{i=1}^{N} w_i x_i \leq C$$

where $x_i \in \{0, 1\}$ is a binary variable representing whether the $i^{th}$ item is selected or not, and $v_i$ and $w_i$ represent the value and the weight of the $i^{th}$ item, respectively. The optimization above can be written in the form of an ILP problem (as a maximization, or equivalently by minimizing $-c^T x$) with $c^T = \begin{bmatrix} v_1 & v_2 & \ldots & v_N \end{bmatrix}$, $x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}$, $A = \begin{bmatrix} w_1 & w_2 & \ldots & w_N \end{bmatrix}$, and $b = \begin{bmatrix} C \end{bmatrix}$.
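Because the search space is the finite set $\{0,1\}^N$, a tiny instance can even be solved by brute-force enumeration. The sketch below is my own addition with made-up item values, weights, and capacity; it is only practical for small $N$:

```python
from itertools import product

# Made-up item values and weights, and an assumed capacity of 10
values = [10, 7, 4, 6]
weights = [5, 4, 2, 3]
C = 10

best_value, best_choice = 0, None

# Enumerate every binary assignment x in {0, 1}^N (feasible only for small N)
for x in product([0, 1], repeat=len(values)):
    weight = sum(w * xi for w, xi in zip(weights, x))
    value = sum(v * xi for v, xi in zip(values, x))
    if weight <= C and value > best_value:
        best_value, best_choice = value, x

print(best_choice, best_value)  # selected items and the maximum total value
```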

Convex and non-convex optimization

When the objective function and the constraints are convex, the problem is said to be a convex optimization problem. We will discuss convex functions in detail later in the course, but for now, we can think of a convex function as having a bowl-like shape, as shown in the figure below. In a convex optimization problem, any local optimum is also a global optimum. Both LP and QP are special cases of convex optimization problems.

When either the objective function or the constraints are non-convex, the problem is said to be a non-convex optimization problem. Deep neural network training is a non-convex optimization problem because of nonlinear activation functions such as sigmoid and ReLU.

Convex function (bowl-like)
Non-convex function
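The practical difference shows up when a local optimizer is started from different points. The sketch below is my own addition, and the non-convex example function is chosen arbitrarily: the convex function reaches the same minimum from every start, while the non-convex one lands in a different local minimum depending on the start.

```python
import numpy as np
from scipy.optimize import minimize

convex = lambda x: (x[0] - 1.0) ** 2                     # bowl-like, single minimum at x = 1
nonconvex = lambda x: x[0] ** 4 - 3 * x[0] ** 2 + x[0]   # has two distinct local minima

for f, name in [(convex, "convex"), (nonconvex, "non-convex")]:
    minima = [minimize(f, x0=np.array([start])).x[0] for start in (-2.0, 2.0)]
    print(name, minima)
# The convex function converges to x = 1 from both starts;
# the non-convex one converges to a different local minimum from each start.
```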

Types of optimization techniques 

Based on the nature of the algorithm, optimization techniques can be broadly divided into the following types:

Types of optimization techniques

  • Gradient-free, brute-force, or heuristics: These techniques rely on iterating through the search space in a random or heuristic fashion to find the optimal point that minimizes the objective function while satisfying all the constraints, for example, random search, grid search, and the Nelder-Mead algorithm (a small random-search sketch is given after this list).

  • First-order gradient techniques: These techniques utilize gradients of the objective function to move in the direction where the objective function keeps decreasing.

  • Second-order (Newtonian) techniques: These techniques also utilize second-order derivatives, known as Hessians, which provide additional information about the curvature of the objective and help move in the direction where the objective function keeps decreasing.

  • Genetic algorithms: Genetic algorithms (GAs) are adaptive heuristic search algorithms that belong to the broader class of evolutionary algorithms. They are based on the ideas of natural selection and genetics.
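To illustrate the gradient-free family from the first bullet, here is a minimal random-search sketch (my own addition; the sampling range and number of samples are arbitrary) applied to the soda can objective from earlier in this lesson:

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(r):
    # Soda can cost from the earlier example, with h substituted out
    return 0.01 * (2 * np.pi * r**2 + 1000.0 / r)

# Random search: sample candidate radii uniformly and keep the best one
candidates = rng.uniform(0.5, 20.0, size=10_000)
best = min(candidates, key=cost)
print(best, cost(best))  # should land near r = 4.3 cm
```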