
Implementation of Linear Regression

This lesson will provide an overview of linear regression and the steps involved in its implementation.

Quick overview of linear regression

Linear regression, as you may know, fits a straight line or, more generally, a hyperplane that predicts the target value for new inputs by modeling the dependence of the dependent variable (y) on one or more independent variables (X). In a p-dimensional space, a hyperplane is a flat subspace of dimension p−1.

Thus, in a two-dimensional space, a hyperplane is a one-dimensional subspace: a flat line. In three-dimensional space, a hyperplane is a two-dimensional plane. Although a hyperplane becomes difficult to visualize in four or more dimensions, the notion of a (p−1)-dimensional hyperplane still applies.

The goal is to place the hyperplane through the known data points with minimal total error between itself and each data point. In ordinary least squares, this means minimizing the sum of the squared vertical distances (residuals) between each observed data point and the hyperplane's prediction at that point: the fitted line is the one for which this sum is smaller than for any other candidate hyperplane.
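To make "minimal total error" concrete, here is a minimal sketch of computing the sum of squared vertical residuals for a candidate line; the data values and the candidate line y = 2x are illustrative, not from the lesson.

```python
# Illustrative data points (hypothetical, chosen to lie near y = 2x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def sum_squared_residuals(a, b, xs, ys):
    # Residual = observed y minus the candidate line's prediction a + b*x.
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Error for the candidate line y = 0 + 2x (approximately 0.10 here).
print(sum_squared_residuals(0.0, 2.0, xs, ys))
```

Least-squares fitting searches for the (a, b) pair that makes this quantity as small as possible.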

In preparation for building a linear regression model, you first need to remove or fill in missing values and confirm that the chosen independent variables are those most strongly correlated with the dependent variable. Those same independent variables, however, should not be strongly correlated with each other.
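These preparation steps can be sketched with pandas; the dataset and column names below are hypothetical placeholders, and filling with the median is just one common choice.

```python
import pandas as pd

# Hypothetical housing data; column names are placeholders.
df = pd.DataFrame({
    "sqft":  [1400, 1600, None, 1850, 1100],
    "age":   [10, 15, 8, None, 30],
    "price": [245000, 312000, 279000, 308000, 199000],
})

# Fill missing values (here with each column's median);
# df.dropna() would instead remove the incomplete rows.
df = df.fillna(df.median())

# Check how strongly each independent variable correlates with the target.
print(df.corr()["price"])
```

Variables with weak correlation to the target are candidates for removal before fitting.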

If a strong linear correlation exists between two or more independent variables, we encounter a problem called collinearity (between two variables) or multicollinearity (among more than two correlated variables), where the individual variables no longer contribute unique information.

While this doesn’t harm the model’s overall predictive accuracy, it does distort the estimates and interpretation of the individual coefficients. You can still reliably predict the output (dependent variable) using collinear variables; it just becomes difficult to say which variables are influential and which are redundant in determining the model’s outputs.
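One simple way to screen for this is to flag pairs of independent variables whose absolute correlation exceeds a threshold. The data below is hypothetical, with sqft and rooms deliberately made near-collinear; the 0.9 cutoff is a common rule of thumb, not a fixed standard.

```python
import pandas as pd

# Hypothetical predictors; sqft and rooms move almost in lockstep.
X = pd.DataFrame({
    "sqft":  [1400, 1600, 1700, 1850, 1100],
    "rooms": [3, 4, 4, 5, 2],
    "age":   [10, 15, 8, 22, 30],
})

corr = X.corr().abs()
threshold = 0.9
pairs = [(a, b)
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if corr.loc[a, b] > threshold]
print(pairs)  # flags ('sqft', 'rooms')
```

For a more rigorous check, the variance inflation factor (VIF) considers each variable against all the others jointly rather than pairwise.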

The linear regression equation

The equation is: Y = a + bX, where Y is the dependent variable, X is the independent variable, b is the slope of the line, and a is the y-intercept.

The slope b is computed from the data as:

b = \frac{n\left(\sum_{i=1}^n x_i y_i\right) - \left(\sum_{i=1}^n x_i\right)\left(\sum_{i=1}^n y_i\right)}{n\left(\sum_{i=1}^n x_i^2\right) - \left(\sum_{i=1}^n x_i\right)^2}

...
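These closed-form least-squares formulas translate directly into code. The sketch below computes the slope from the sums above and then recovers the intercept from a = ȳ − b·x̄; the sample points are hypothetical and lie exactly on y = 1 + 2x.

```python
def fit_line(xs, ys):
    """Least-squares slope b and intercept a for y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope formula above
    a = (sy - b * sx) / n                          # intercept: a = ybar - b*xbar
    return a, b

# Points on y = 1 + 2x should recover a = 1, b = 2 exactly.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```

For real work you would typically reach for numpy.polyfit or scikit-learn's LinearRegression, but writing the formulas out once makes clear what those libraries compute.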