What is optimization?

Let’s assume a machine learning problem where we have a model $f_\theta(\cdot)$ parameterized by the weights $\theta$. Given a dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ with $N$ training examples, where $x_i$ represents the input and $y_i$ represents the corresponding ground-truth label, we want to minimize the following loss function between the model prediction $f_\theta(x_i)$ and the ground-truth label $y_i$:

$$\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f_\theta(x_i), y_i\big)$$

Here, $\mathcal{L}$ is an arbitrary loss function, such as cross-entropy, that measures the discrepancy between the predicted and the ground-truth value.
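To make this concrete, here is a minimal sketch (not from the original text) of minimizing a mean-squared-error loss for a one-parameter linear model $f_\theta(x) = \theta x$ with plain gradient descent; the data and learning rate are made up for illustration:

```python
import numpy as np

# Toy dataset: y is roughly 3*x plus noise (made-up numbers for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 3.2, 5.9, 9.1, 11.8])

theta = 0.0   # initial weight
lr = 0.01     # learning rate (assumed)

for _ in range(500):
    pred = theta * x                      # model prediction f_theta(x)
    grad = 2.0 * np.mean((pred - y) * x)  # derivative of the mean squared error w.r.t. theta
    theta -= lr * grad                    # gradient descent step

print(theta)  # converges close to the least-squares slope (about 3)
```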

The process of finding the optimal parameter $\theta^*$ is known as optimization. In general, an optimization problem can be formulated as follows:

$$\min_{x \in \mathcal{X}} \ g(x) \quad \text{subject to} \quad h_i(x) \leq 0, \quad i = 0, 1, \ldots, m-1$$

where $g(x)$ is our objective function, $\mathcal{X}$ is our search space, and $\{h_i(x)\}_{i=0}^{m-1}$ are the $m$ constraints.

A real-world example

Let’s consider a real-world example where we want to design a cylindrical soda can that holds a volume of 500 ml (1 ml = 1 cubic cm) using the least amount of material possible. The material cost is $0.01 \text{ cents}/\text{cm}^2$, so the total cost is proportional to the surface area of the can. As a result, we want to minimize the cost of material for the can, subject to the constraint that its volume is 500 ml.

Mathematically, we can represent the problem above as the following optimization problem:

$$\min_{r,\,h} \ 0.01 \times (2\pi r^2 + 2\pi r h) \quad \text{subject to} \quad \pi r^2 h - 500 = 0$$

where $r$ is the radius of the can and $h$ is the height of the can. Here, our objective function $g(r, h) = 0.01 \times (2\pi r^2 + 2\pi rh)$ is the cost of material for the can, and $h(r, h) = \pi r^2 h - 500 = 0$ is the volume constraint.

To solve the problem above, we can substitute $h = \frac{500}{\pi r^2}$ and rewrite our optimization problem as follows:

$$\min_{r > 0} \ g(r) = 0.01 \times \left(2\pi r^2 + \frac{1000}{r}\right)$$

The figure given below shows the graph of the objective $g(r)$. As can be seen, the minimum lies at $r \approx 4.3$ cm (setting $g'(r) = 0$ gives $r = (250/\pi)^{1/3} \approx 4.30$ cm), which gives us a cost of about $3.48$ cents per soda can.

Objective of the soda can problem
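As a quick numerical cross-check, here is a small SciPy sketch (my own addition; the upper bound of 20 cm on the radius is an arbitrary assumption) that minimizes the same objective:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Cost of the can as a function of radius, after substituting h = 500 / (pi * r^2)
def cost(r):
    return 0.01 * (2 * np.pi * r**2 + 1000.0 / r)

# Search over a reasonable range of radii (upper bound of 20 cm is assumed)
res = minimize_scalar(cost, bounds=(0.1, 20.0), method="bounded")
print(res.x, res.fun)  # roughly r = 4.30 cm and a cost of about 3.5 cents
```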

The effect of constraints on an optimal solution

Let’s take an example where $g(x) = (x+1)(x-3)$ and $m = 0$, i.e., no constraints are present. As can be seen in the unconstrained optimization figure below, the optimal solution lies at $x = 1$. However, as shown in the constrained optimization figure, if we add a constraint $h(x): x \leq 0$, the optimal solution changes to $x = 0$.

Unconstrained optimization
Constrained optimization (white region is the allowed region)

Note: The points $(1, -4)$ and $(0, -3)$ mark the unconstrained and constrained optima of $g(x) = (x+1)(x-3)$, i.e., $x = 1$ and $x = 0$, respectively.
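The same comparison can be reproduced numerically. The sketch below is my own addition, using SciPy’s bounded scalar minimizer; the search interval $[-10, 10]$ is chosen arbitrarily, and the $x \leq 0$ constraint is imposed simply by shrinking the interval:

```python
from scipy.optimize import minimize_scalar

def g(x):
    return (x + 1) * (x - 3)

# Unconstrained (searched over an arbitrary wide interval)
unconstrained = minimize_scalar(g, bounds=(-10.0, 10.0), method="bounded")

# Constrained to x <= 0: restrict the upper end of the search interval
constrained = minimize_scalar(g, bounds=(-10.0, 0.0), method="bounded")

print(unconstrained.x, unconstrained.fun)  # approximately x = 1, g = -4
print(constrained.x, constrained.fun)      # approximately x = 0, g = -3
```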

Local vs. global optimal solution

The solution to an optimization problem—by default, we assume that it is a minimization problem—can be of the following two types:

  • Local optimum: A solution $x^*$ is said to be a local optimum if $g(x^* - h) \geq g(x^*) \leq g(x^* + h)$ for all sufficiently small $h > 0$. In other words, the value of the objective function increases (or stays the same) in all directions around $x^*$.

  • Global optimum: A solution $x^*$ is said to be a global optimum if it is a local optimum and $g(x) \geq g(x^*)$ for all $x$.

Let’s consider the graph $f(x)$ below that represents a house search, where the vertical axis shows house prices and the horizontal axis represents houses in various neighborhoods.

Local vs. global optimal solution

The local optima—red—are houses with competitive prices within their immediate neighborhoods but aren’t the cheapest citywide. However, the global optimum—green—is the absolute best deal across all neighborhoods, offering the lowest price while meeting essential criteria.

Types of optimization problems

When $m > 0$, the optimization problem is referred to as constrained optimization. Otherwise, it is referred to as unconstrained optimization. Furthermore, based on the nature of the objective function $g(x)$, the search space $\mathcal{X}$, and the constraints $h_i(x)$, optimization problems can be divided into multiple types, as shown in the following image:

Types of optimization problems

Linear programming (LP)

When both the objective function and the constraints are linear, the optimization problem is said to be a linear programming (LP) problem. The general form of LP is given as follows:

$$\min_{x \in \R^d} \ c^T x \quad \text{subject to} \quad Ax \leq b$$

where $c \in \R^d$ is the cost vector, $A \in \R^{m \times d}$ is the constraint matrix, and $b \in \R^m$ is the constraint vector.

For example, let’s consider a diet problem where the task is to find the optimal combination of foods that satisfy the nutritional requirements of a person at a minimum cost. Let’s say we have the following four foods to choose from:

| Food  | Cost per Unit ($) | Calories per Unit | Protein per Unit (g) | Fat per Unit (g) |
|-------|-------------------|-------------------|----------------------|------------------|
| Rice  | 0.12              | 130               | 2.7                  | 0.3              |
| Beans | 0.18              | 120               | 7.5                  | 0.5              |
| Milk  | 0.23              | 150               | 8                    | 8                |
| Bread | 0.05              | 75                | 2                    | 1                |

We want to satisfy the following nutritional requirements per day:

  • Calories: At least 2000, i.e., $130x_1 + 120x_2 + 150x_3 + 75x_4 \geq 2000$.

  • Protein: At least 55 g, i.e., $2.7x_1 + 7.5x_2 + 8x_3 + 2x_4 \geq 55$.

  • Fat: At most 70 g, i.e., $0.3x_1 + 0.5x_2 + 8x_3 + x_4 \leq 70$.

In these relations, $x_1$ is the amount of rice, $x_2$ is the amount of beans, $x_3$ is the amount of milk, and $x_4$ is the amount of bread.

The objective function is the total cost of the foods, which is given by the following:

$$\min_x \ c^T x = 0.12x_1 + 0.18x_2 + 0.23x_3 + 0.05x_4$$

Here, $c^T = \begin{bmatrix} 0.12 & 0.18 & 0.23 & 0.05 \end{bmatrix}$ is the cost vector and $x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}$ is the vector of food amounts.

The constraints can also be written in matrix form as follows:

$$Ax \leq b$$

where $A = \begin{bmatrix} -130 & -120 & -150 & -75 \\ -2.7 & -7.5 & -8 & -2 \\ 0.3 & 0.5 & 8 & 1 \end{bmatrix}$ is the constraint matrix and $b = \begin{bmatrix} -2000 \\ -55 \\ 70 \end{bmatrix}$ is the constraint vector. Other trivial constraints include the nonnegativity restriction, showing that the amount of each food must be nonnegative, i.e.:

$$x_1, x_2, x_3, x_4 \geq 0$$

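This diet LP can be handed directly to an off-the-shelf solver. The sketch below is my own addition, using SciPy’s linprog, and simply re-encodes the cost vector and constraints above:

```python
import numpy as np
from scipy.optimize import linprog

# Cost per unit of rice, beans, milk, bread
c = np.array([0.12, 0.18, 0.23, 0.05])

# Inequality constraints in the form A x <= b
# (the calorie and protein minimums are negated to become <= constraints)
A = np.array([
    [-130.0, -120.0, -150.0, -75.0],  # calories >= 2000
    [-2.7,   -7.5,   -8.0,   -2.0],   # protein  >= 55 g
    [0.3,     0.5,    8.0,    1.0],   # fat      <= 70 g
])
b = np.array([-2000.0, -55.0, 70.0])

# Nonnegativity of the food amounts via the default bounds (0, None)
res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None))
print(res.x, res.fun)  # optimal food amounts and the minimum daily cost
```
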
Quadratic programming (QP)

The optimization problem is a quadratic programming (QP) problem when the objective function is quadratic, but the constraints are linear. The general form of QP is given as follows:

$$\min_{x \in \R^d} \ \frac{1}{2} x^T Q x + c^T x \quad \text{subject to} \quad Ax \leq b$$

where $c \in \R^d$, $A \in \R^{m \times d}$, $b \in \R^m$, and $Q \in \R^{d \times d}$ is a positive definite square matrix.

As an example, let’s consider a portfolio optimization problem where the task is to find the best allocation of assets—such as stocks, bonds, or cash—that maximizes the expected return and minimizes the risk of a portfolio. Let’s say we have the following three types of assets to choose from:

  • Stocks that give an expected return of $15\%$ but have a high risk component (or variance) of $20$

  • Bonds that give an expected return of $7\%$ and have a medium risk component (or variance) of $10$

  • Cash that gives an expected return of $2\%$ and has a low risk component (or variance) of $5$

Considering that we want to assign the fractions $x_1, x_2, x_3$ of the portfolio to stocks, bonds, and cash, respectively, the asset weights should sum to one. At the same time, we would like to generate a collective return of at least $10\%$ on the whole portfolio. The constraints can then be written as follows:

$$x_1 + x_2 + x_3 \leq 1, \qquad x_1 + x_2 + x_3 \geq 1, \qquad 15x_1 + 7x_2 + 2x_3 \geq 10$$
With these constraints, we need to find the asset allocation that yields the minimum risk $R(x)$. The risk is proportional to the variance of the portfolio return, which will be a quadratic function of the asset weights.

Note: $x_1 + x_2 + x_3 \leq 1$ and $x_1 + x_2 + x_3 \geq 1$ together imply that $x_1 + x_2 + x_3 = 1$.

This can be written as a QP problem as follows:

$$\min_x \ \frac{1}{2} x^T Q x + c^T x$$

subject to

$$Ax \leq b$$

where $x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$, $Q = \begin{bmatrix} 40 & 0 & 0 \\ 0 & 20 & 0 \\ 0 & 0 & 10 \end{bmatrix}$, $c = 0$, $A = \begin{bmatrix} 1 & 1 & 1 \\ -1 & -1 & -1 \\ -15 & -7 & -2 \end{bmatrix}$, and $b = \begin{bmatrix} 1 \\ -1 \\ -10 \end{bmatrix}$. Other trivial constraints include the nonnegativity restriction that the asset weights must be nonnegative, i.e.:

$$x_1, x_2, x_3 \geq 0$$

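A quick way to solve this small QP numerically is SciPy’s general-purpose SLSQP solver; the sketch below is my own addition and simply re-encodes the matrices above:

```python
import numpy as np
from scipy.optimize import minimize

# Risk matrix Q (twice the asset variances on the diagonal) from the QP above
Q = np.diag([40.0, 20.0, 10.0])

def risk(x):
    # Objective: (1/2) x^T Q x
    return 0.5 * x @ Q @ x

constraints = [
    # Asset weights sum to one
    {"type": "eq", "fun": lambda x: np.sum(x) - 1.0},
    # Expected portfolio return of at least 10%
    {"type": "ineq", "fun": lambda x: 15.0 * x[0] + 7.0 * x[1] + 2.0 * x[2] - 10.0},
]
bounds = [(0.0, 1.0)] * 3              # nonnegative weights
x0 = np.array([1 / 3, 1 / 3, 1 / 3])   # arbitrary starting allocation

res = minimize(risk, x0, method="SLSQP", bounds=bounds, constraints=constraints)
print(res.x, res.fun)  # optimal weights for stocks, bonds, cash and the minimum risk
```
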
Integer linear programming (ILP)

The optimization problem is said to be an integer linear programming (ILP) problem when both the objective function and the constraints are linear, but the search space is restricted to the integers $\Z$. The general form of ILP is given as follows:

$$\min_{x \in \Z^d} \ c^T x \quad \text{subject to} \quad Ax \leq b$$

where $c \in \R^d$, $A \in \R^{m \times d}$, and $b \in \R^m$.

As an example, let’s consider the knapsack problem, where the task is to pack a set of $N$ items with different weights and values into a sack with a limited capacity $C$ so that the total value of the packed items is maximized. The objective for the knapsack problem can be written as follows:

$$\max_{x} \ \sum_{i=1}^{N} v_i x_i \quad \text{subject to} \quad \sum_{i=1}^{N} w_i x_i \leq C$$

where $x_i \in \{0, 1\}$ is a binary variable representing whether the $i^{th}$ item is selected or not, and $v_i$ and $w_i$ represent the value and the weight of the $i^{th}$ item, respectively. The optimization above can be written in the form of an ILP problem (as a maximization, or equivalently by minimizing $-c^T x$) with $c^T = \begin{bmatrix} v_1 & v_2 & \ldots & v_N \end{bmatrix}$, $x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}$, $A = \begin{bmatrix} w_1 & w_2 & \ldots & w_N \end{bmatrix}$, and $b = \begin{bmatrix} C \end{bmatrix}$.
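Because the search space is the finite set $\{0,1\}^N$, a tiny instance can even be solved by brute-force enumeration. The sketch below is my own addition with made-up item values, weights, and capacity; it is only practical for small $N$:

```python
from itertools import product

# Made-up item values and weights, and an assumed capacity of 10
values = [10, 7, 4, 6]
weights = [5, 4, 2, 3]
C = 10

best_value, best_choice = 0, None

# Enumerate every binary assignment x in {0, 1}^N (feasible only for small N)
for x in product([0, 1], repeat=len(values)):
    weight = sum(w * xi for w, xi in zip(weights, x))
    value = sum(v * xi for v, xi in zip(values, x))
    if weight <= C and value > best_value:
        best_value, best_choice = value, x

print(best_choice, best_value)  # selected items and the maximum total value
```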

Convex and non-convex optimization

When the objective function and the constraints are convex, the problem is said to be a convex optimization problem. We will discuss convex functions in detail later in the course, but for now, we can think of a convex function as having a bowl-like shape, as shown in the figure below. In a convex optimization problem, any local optimum is also a global optimum. Both LP and QP are special cases of convex optimization problems.

When either the objective function or the constraints are non-convex, the problem is said to be a non-convex optimization problem. Deep neural network training is a non-convex optimization problem because of nonlinear activation functions such as sigmoid and ReLU.

Convex function (bowl-like)
Non-convex function
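The practical difference shows up when a local optimizer is started from different points. The sketch below is my own addition, and the non-convex example function is chosen arbitrarily: the convex function reaches the same minimum from every start, while the non-convex one lands in a different local minimum depending on the start.

```python
import numpy as np
from scipy.optimize import minimize

convex = lambda x: (x[0] - 1.0) ** 2                     # bowl-like, single minimum at x = 1
nonconvex = lambda x: x[0] ** 4 - 3 * x[0] ** 2 + x[0]   # has two distinct local minima

for f, name in [(convex, "convex"), (nonconvex, "non-convex")]:
    minima = [minimize(f, x0=np.array([start])).x[0] for start in (-2.0, 2.0)]
    print(name, minima)
# The convex function converges to x = 1 from both starts;
# the non-convex one converges to a different local minimum from each start.
```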

Types of optimization techniques 

Based on the nature of the algorithm, optimization techniques can be broadly divided into the following types:

Types of optimization techniques

  • Gradient-free, brute-force, or heuristics: These techniques rely on iterating through the search space in a random or heuristic fashion to find the optimal point that minimizes the objective function while satisfying all the constraints, for example, random search, grid search, and the Nelder-Mead algorithm (a small random-search sketch is given after this list).

  • First-order gradient techniques: These techniques utilize gradients of the objective function to move in the direction where the objective function keeps decreasing.

  • Second-order (Newtonian) techniques: These techniques also utilize second-order derivatives, known as Hessians, which provide additional information about the curvature of the objective and help move in the direction where the objective function keeps decreasing.

  • Genetic algorithms: Genetic algorithms (GAs) are adaptive heuristic search algorithms that belong to the broader class of evolutionary algorithms. They are based on the ideas of natural selection and genetics.
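To illustrate the gradient-free family from the first bullet, here is a minimal random-search sketch (my own addition; the sampling range and number of samples are arbitrary) applied to the soda can objective from earlier in this lesson:

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(r):
    # Soda can cost from the earlier example, with h substituted out
    return 0.01 * (2 * np.pi * r**2 + 1000.0 / r)

# Random search: sample candidate radii uniformly and keep the best one
candidates = rng.uniform(0.5, 20.0, size=10_000)
best = min(candidates, key=cost)
print(best, cost(best))  # should land near r = 4.3 cm
```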