Linear Regression

Learn to fit a function to the available data through linear regression.

Function approximation

Approximating a function means estimating the values of its parameters. Consider the sum of squared errors (SSE) function we discussed in the previous lesson:

$$SSE(\bold{\hat w})=(A\bold{\hat w}-\bold{b})^T(A\bold{\hat w}-\bold{b})$$

Approximating with SSE means estimating the vector, $\bold{w}$, that nearly satisfies the linear system $A\bold{w}=\bold{b}$. This estimate is also called the linear least squares solution.
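As a rough illustration of the idea, the sketch below evaluates the SSE for a small system and estimates the least squares vector with NumPy's `np.linalg.lstsq`. The matrix `A`, the vector `b`, and the helper `sse` are made-up examples for this sketch and do not come from the lesson.

```python
import numpy as np

# Hypothetical overdetermined system: 3 equations, 2 unknowns,
# so no w satisfies A w = b exactly.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

def sse(w_hat):
    """Sum of squared errors: (A w_hat - b)^T (A w_hat - b)."""
    residual = A @ w_hat - b
    return float(residual @ residual)

# Least squares estimate of w (the minimizer of the SSE).
w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

print("w_hat:", w_hat)
print("SSE at w_hat:", sse(w_hat))
print("SSE at a perturbed w:", sse(w_hat + 0.1))
```

Any other choice of the vector, such as the perturbed one in the last line, gives an SSE at least as large as the one at `w_hat`.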

Formal definition

Consider a data set, $D=\{(\bold{x_1},\bold{y_1}),(\bold{x_2},\bold{y_2}),\ldots,(\bold{x_n},\bold{y_n})\}$, where each entry is a pair of objects, $\bold{x_i}$ and $\bold{y_i}$ (scalars, vectors, matrices, and so on). Function approximation seeks a function, $f_\bold{w}$, such that:

$$f_\bold{w}(\bold{x_i})\approx\bold{y_i}$$

Example

Let $D=\{(4,1),(-3,9)\}$. A linear function $f_\bold{w}$ represents a line passing through the two data points in the $xy$-plane. However, infinitely many non-linear curves pass through the same data points. A few of these curves are shown in the figure below.

In the figure, the green points are from the data set. A red line and two different curves, one black and one blue, pass through the data points. As we can see, all three functions (red line, blue curve, and black curve) fit the data exactly. In this case, we can find functions that fit the data exactly. When an exact fit is hard to find, we may rely on an approximate fit, that is, a curve that passes near the data points.

Curves fitted through the data points (shown in green)
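import matplotlib.pyplot as plt

# Data points from D, drawn in green
plt.scatter([-3, 4], [9, 1], c='g', linewidths=10)
# Red line through the two data points
plt.plot([-3, 4], [9, 1], 'r')
# Blue curve: piecewise-linear path that also passes through (-3, 9) and (4, 1)
plt.plot([-5, -3, -2, 1, 2, 3, 3.5, 4, 7], [11, 9, 4, 5, -1, 5, 4, 1, 6], 'b')
# Black curve: a different path through the same two data points
plt.plot([-5, -3, -2, 1, 2, 3, 3.5, 4, 7], [2, 9, -4, 5, 2, 10, -4, 1, -1], 'k')
plt.show()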
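The red line in the example can also be recovered numerically. The following is a minimal sketch, assuming the line has the form slope times x plus intercept; the names `slope`, `intercept`, and `f_w` are chosen here for illustration. With two points and two unknowns, the linear system has an exact solution.

```python
import numpy as np

# The two data points from D = {(4, 1), (-3, 9)}
xs = np.array([4.0, -3.0])
ys = np.array([1.0, 9.0])

# Each row is [x_i, 1], so A @ [slope, intercept] = ys
A = np.column_stack([xs, np.ones_like(xs)])
slope, intercept = np.linalg.solve(A, ys)

def f_w(x):
    """Line through the two data points."""
    return slope * x + intercept

print("slope:", slope, "intercept:", intercept)
print("f_w at the data points:", f_w(xs))
```

Because the fit is exact, evaluating `f_w` at the two inputs reproduces the two outputs up to floating-point rounding.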

Note: The term “approximation” becomes more relevant with several data points, because a single line can then rarely pass through all of them exactly!
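To make the note concrete, here is a small sketch with five made-up data points (hypothetical values, not from the lesson) that do not lie on one line. It uses `np.polyfit` with degree 1 to find the least squares line, so the fit is approximate and the SSE is greater than zero.

```python
import numpy as np

# Hypothetical data: five points that are not collinear
xs = np.array([-3.0, -1.0, 0.0, 2.0, 4.0])
ys = np.array([9.0, 6.0, 5.0, 4.0, 1.0])

# Degree-1 least squares fit; coefficients come highest power first
slope, intercept = np.polyfit(xs, ys, deg=1)

# Residuals between the fitted line and the data
predictions = slope * xs + intercept
residuals = predictions - ys
sse = float(residuals @ residuals)

print("slope:", slope, "intercept:", intercept)
print("SSE:", sse)
```

Unlike the two-point example, the residuals here are not all zero: the line passes near the points rather than through every one of them.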

Regression vs. classification

In the dataset ...