Linear Regression
In this lesson, you’ll learn what linear regression is and how to use it.
What is Linear Regression?
Linear regression may be one of the most commonly used models in the real world. It is a linear approach to modeling the relationship between a scalar response (the dependent variable) and one or more explanatory variables (the independent variables). Unlike a classification task, whose result is the probability of a category, a regression task produces an output value with a real-world physical meaning, such as product sales, employee engagement, a future house price, a health metric, or a GDP forecast.
Linear regression is about learning the relationship between the dependent and independent variables from a pile of historical data. Take the house price prediction task as an example. You receive data on housing prices in various parts of the city. Each sample contains information about how many bedrooms there are in the house, how far it is from the city center, how far it is from the airport, whether there is a hospital nearby, and so on. The next time you encounter a house that is not in your database but comes with the same information, you can give an accurate price forecast using your model.
From a mathematical point of view, linear regression is about fitting the data so that the sum of the residuals between each point and the predicted value is minimized. As shown in the figure below, the red line is the fitted model, the blue points are the original data, and the distance between a point and the red line is its residual. Our goal is to minimize the sum of these residuals.
Below is the model form. $\mathbf{w}$ (together with the bias term $b$) is the parameter you want to learn from the data, and $\mathbf{x}$ is the input data, or feature vector. The second equation is the vector representation of the first:

$$\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_m x_m + b$$

$$\hat{y} = \mathbf{w}^{\top}\mathbf{x} + b$$
Like other tasks, we need to define a loss function, or objective function, for linear regression. Below is the most commonly used loss function, the mean squared error (MSE). Sometimes you can use the root-mean-square error (RMSE), which is very similar to the MSE:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

$y_i$ is the $i$-th sample’s real value, and $\hat{y}_i$ is the prediction for the $i$-th sample. The square of the difference between them is the loss of the $i$-th sample, and $n$ is the total number of samples. Adding and averaging all the per-sample losses gives the loss of the whole dataset. Our goal is to minimize this value.
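To make the formula concrete, here is a minimal NumPy sketch of the MSE computation (the numbers are made up purely for illustration):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # real values y_i
y_pred = np.array([2.8, 5.4, 2.9, 6.1])  # predicted values
mse = np.mean((y_true - y_pred) ** 2)    # average of the squared residuals
print(mse)  # 0.2925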
There are many ways to solve this equation and get the optimal parameters. However, we’re not going to get into the mathematical details and principles. Here we just focus on how to use scikit-learn to accomplish our task.
Let’s start coding.
Modeling on house price
Let’s skip the data loading and splitting, as the complete code will be shown later. For ordinary linear regression, you can create a LinearRegression object from the linear_model module. There aren’t too many parameters, but notice that there is one parameter, normalize. The goal of this parameter is to scale the features without distorting the differences in their ranges of values. Most of the time, you don’t need this parameter. When the ranges of the features are very different, you need to set normalize=True.
For example, consider a dataset with two features, people’s height and income. The range of height is between 140 and 200 cm, while the range of income is between $20,000 and $80,000. These two features therefore influence the result on very different scales: income will influence the result more simply because of its larger values. However, this doesn’t mean that income is more important.
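Note that in recent scikit-learn releases (1.2 and later), the normalize parameter has been removed from LinearRegression. If you are on a newer version, a common replacement (a minimal sketch, not part of the original demo) is to scale the features yourself, for example with StandardScaler in a pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Standardize each feature (zero mean, unit variance) before fitting, so that
# features with large ranges do not dominate purely because of their scale.
model = make_pipeline(StandardScaler(), LinearRegression())
# model.fit(train_x, train_y)  # train_x/train_y are the training data used below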
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# train_x and train_y are training data and labels.
lr.fit(train_x, train_y)
pred_y = lr.predict(test_x)
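After fitting, you can also inspect the learned parameters and get a quick quality measure. A small sketch, assuming the lr model and the data splits from above:

# coef_ holds one learned weight per feature; intercept_ is the bias term
print(lr.coef_)
print(lr.intercept_)
# score() returns the coefficient of determination (R^2) on the given data
print(lr.score(test_x, test_y))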
After training the model, it’s time to evaluate its performance. For the regression task, the most commonly used metrics are MSE and RMSE. In our demo, we use the MSE to evaluate the model.
import sklearn.metrics as metrics
# pred_y is the prediction result
mse = metrics.mean_squared_error(test_y, pred_y)
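If you prefer the RMSE mentioned above, it is simply the square root of the MSE, for example:

import numpy as np

rmse = np.sqrt(mse)  # root-mean-square error, in the same units as the target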
Here is the complete code so you can try it yourself.
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import sklearn.metrics as metrics

house = datasets.load_boston()
print("The data shape of house is {}".format(house.data.shape))
print("The number of feature in this data set is {}".format(house.data.shape[1]))

train_x, test_x, train_y, test_y = train_test_split(house.data,
                                                    house.target,
                                                    test_size=0.2,
                                                    random_state=42)
print("The first five samples {}".format(train_x[:5]))
print("The first five targets {}".format(train_y[:5]))
print("The number of samples in train set is {}".format(train_x.shape[0]))
print("The number of samples in test set is {}".format(test_x.shape[0]))

lr = LinearRegression()
lr.fit(train_x, train_y)
pred_y = lr.predict(test_x)
print("The first five prediction {}".format(pred_y[:5]))
print("The real first five labels {}".format(test_y[:5]))

mse = metrics.mean_squared_error(test_y, pred_y)
print("Mean Squared Error {}".format(mse))
- First, we load the dataset by calling load_boston.
- Then, we split the dataset into two parts, the train set and the test set, with train_test_split. The training set accounts for 80% of the data.
- A linear regression model is created and trained with fit (in sklearn, train is equal to fit).
- Finally, mean_squared_error is called to evaluate the performance of this model.
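One caveat: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the demo above only runs on older versions. If you want to reproduce it on a current release, one option (a sketch, not part of the original lesson) is to swap in the California housing dataset, which exposes the same data/target layout:

from sklearn.datasets import fetch_california_housing

house = fetch_california_housing()  # downloaded on first use
print(house.data.shape)    # (20640, 8)
print(house.target.shape)  # (20640,)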
Modeling on generated data
In this demo, we want to draw the fitted line and the error band.

Our data only has one feature, so our model is a line. Since we can get the coefficient and bias terms from our model, we can compute the estimated values from these parameters ourselves.

Of course, we could also use predict() to get the predictions directly, which is much more convenient. The reason we don’t use the built-in function here is that we want a little bit of insight into the internal mechanism. Below is how we get those parameters from the model and compute the prediction values and the error.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 10, 11)
y = [3.9, 4.4, 10.8, 10.3, 11.2, 13.1, 14.1, 9.9, 13.9, 15.1, 12.5]

lr2 = LinearRegression()
# reshape the 1-D feature array to a 2-D array, since fit() expects one row per sample
lr2.fit(x.reshape(-1, 1), y)

# y_est is the prediction value. Normally, you get this value by calling predict()
y_est = x * lr2.coef_ + lr2.intercept_
# y_err is the half-width of the error band drawn around the fitted line
y_err = x.std() * np.sqrt(1/len(x) +
                          (x - x.mean())**2 / np.sum((x - x.mean())**2))
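As a quick sanity check (a small sketch using the lr2, x, and y_est variables defined above), the manually computed estimates should match what predict() returns:

import numpy as np

# For a single feature, x * coef_ + intercept_ is exactly what predict() computes
print(np.allclose(y_est, lr2.predict(x.reshape(-1, 1))))  # True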
The chart below shows the original data, the fitted line, and the error band.
Here is the complete code so you can try it yourself.
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 11)
y = [3.9, 4.4, 10.8, 10.3, 11.2, 13.1, 14.1, 9.9, 13.9, 15.1, 12.5]

lr2 = LinearRegression()
lr2.fit(x.reshape(-1, 1), y)
print('Coefficients: {}'.format(lr2.coef_))
print('Bias term: {}'.format(lr2.intercept_))

y_est = x * lr2.coef_ + lr2.intercept_
y_err = x.std() * np.sqrt(1 / len(x) +
                          (x - x.mean())**2 / np.sum((x - x.mean())**2))

fig, ax = plt.subplots()
ax.plot(x, y_est, '-')
ax.fill_between(x, y_est - y_err, y_est + y_err, alpha=0.2)
ax.plot(x, y, 'o')
fig.savefig("output/img.png", dpi=300)
plt.close(fig)
- x and y are our training data. x is the feature (there is only one feature, and each data point is an instance/sample) and y is the target/label.
- A linear regression model is created and then trained with fit.
- In this example, we don’t use the built-in predict() function. We use the learned parameters to compute the prediction values following the model equation, and we also compute the error band.
- Finally, we plot the original data, the fitted line, and the error band with matplotlib.