Linear Regression
In this lesson, you’ll learn what linear regression is and how to use it.
What is Linear Regression?
Linear regression may be one of the most commonly used models in the real world. It is a linear approach to modeling the relationship between a scalar response (the dependent variable) and one or more explanatory variables (the independent variables). Unlike a classification task, whose result is the probability of a category, a regression task produces an output value with a real-world physical meaning, such as product sales, employee engagement, a future house price, a health metric, or a GDP forecast.
Linear regression is about learning the relationship between the dependent and independent variables from a pile of historical data. Take the house price prediction task as an example. You receive data on housing prices in various parts of the city. Each sample contains information about how many bedrooms there are in the house, how far it is from the city center, how far it is from the airport, whether there is a hospital nearby, and so on. The next time you encounter a house that is not in your database but comes with the same information, you can give an accurate price forecast using your model.
From a mathematical point of view, linear regression is about fitting the data so that the sum of the residuals between each point and the predicted value is minimized. As shown in the figure below, the red line is the fitted model, the blue points are the original data, and the distance between a point and the red line is its residual. Our goal is to minimize the sum of these residuals.
Below is the model form. $\mathbf{w}$ (together with the bias term $b$) is the parameter you want to learn from the data, and $\mathbf{x}$ is the input data, or feature vector. The second equation is the vector representation of the first:

$$\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_m x_m + b$$

$$\hat{y} = \mathbf{w}^{\top}\mathbf{x} + b$$
Like other tasks, we need to define a loss function, or objective function, for linear regression. Below is the most commonly used loss function, the mean squared error (MSE). Sometimes you can use the root-mean-square error (RMSE), which is very similar to the MSE:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

$y_i$ is the $i$-th sample’s real value, and $\hat{y}_i$ is the prediction for the $i$-th sample. The square of the difference between them is the loss of the $i$-th sample, and $n$ is the total number of samples. Adding and averaging all the per-sample losses gives the loss of the whole dataset. Our goal is to minimize this value.
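To make the formula concrete, here is a minimal NumPy sketch of the MSE computation (the numbers are made up purely for illustration):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # real values y_i
y_pred = np.array([2.8, 5.4, 2.9, 6.1])  # predicted values
mse = np.mean((y_true - y_pred) ** 2)    # average of the squared residuals
print(mse)  # 0.2925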
There are many ways to solve this equation and get the optimal parameters. However, we’re not going to get into the mathematical details and principles. Here we just focus on how to use scikit-learn to accomplish our task.
Let’s start coding.
Modeling on house price
Let’s skip the data loading and splitting, as the complete code will be shown later. For ordinary linear regression, you can create a LinearRegression object from the linear_model module. There aren’t too many parameters, but notice that there is one parameter, normalize. The goal of this parameter is to scale the features without distorting the differences in their ranges of values. Most of the time, you don’t need this parameter. When the ranges of the features are very different, you need to set normalize=True.
For example, consider a dataset with two features, people’s height and income. The range of height is between 140 and 200 cm, while the range of income is between $20,000 and $80,000. These two features therefore influence the result on very different scales: income will influence the result more simply because of its larger values. However, this doesn’t mean that income is more important.
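Note that in recent scikit-learn releases (1.2 and later), the normalize parameter has been removed from LinearRegression. If you are on a newer version, a common replacement (a minimal sketch, not part of the original demo) is to scale the features yourself, for example with StandardScaler in a pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Standardize each feature (zero mean, unit variance) before fitting, so that
# features with large ranges do not dominate purely because of their scale.
model = make_pipeline(StandardScaler(), LinearRegression())
# model.fit(train_x, train_y)  # train_x/train_y are the training data used below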
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# train_x and train_y are training data and labels.
lr.fit(train_x, train_y)
pred_y = lr.predict(test_x)
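After fitting, you can also inspect the learned parameters and get a quick quality measure. A small sketch, assuming the lr model and the data splits from above:

# coef_ holds one learned weight per feature; intercept_ is the bias term
print(lr.coef_)
print(lr.intercept_)
# score() returns the coefficient of determination (R^2) on the given data
print(lr.score(test_x, test_y))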
After training the model, it’s time to evaluate its performance. For the regression task, the most commonly used metrics are MSE and RMSE. In our demo, we use the MSE to evaluate the model.
import sklearn.metrics as metrics
# pred_y is the prediction result
mse = metrics.mean_squared_error(test_y, pred_y)
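If you prefer the RMSE mentioned above, it is simply the square root of the MSE, for example:

import numpy as np

rmse = np.sqrt(mse)  # root-mean-square error, in the same units as the target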
Here is the complete code so you can try it yourself.
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import sklearn.metrics as metrics

house = datasets.load_boston()
print("The data shape of house is {}".format(house.data.shape))
print("The number of feature in this data set is {}".format(house.data.shape[1]))

train_x, test_x, train_y, test_y = train_test_split(house.data,
                                                    house.target,
                                                    test_size=0.2,
                                                    random_state=42)
print("The first five samples {}".format(train_x[:5]))
print("The first five targets {}".format(train_y[:5]))
print("The number of samples in train set is {}".format(train_x.shape[0]))
print("The number of samples in test set is {}".format(test_x.shape[0]))

lr = LinearRegression()
lr.fit(train_x, train_y)
pred_y = lr.predict(test_x)
print("The first five prediction {}".format(pred_y[:5]))
print("The real first five labels {}".format(test_y[:5]))

mse = metrics.mean_squared_error(test_y, pred_y)
print("Mean Squared Error {}".format(mse))
- First, we load the dataset by calling load_boston.
- Then, we split the dataset into two parts, the train set and the test set, with train_test_split. The training set accounts for 80% of the data.
- A linear regression model is created and trained with fit (in sklearn, train is equal to fit).
- Finally, mean_squared_error is called to evaluate the performance of this model.
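One caveat: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the demo above only runs on older versions. If you want to reproduce it on a current release, one option (a sketch, not part of the original lesson) is to swap in the California housing dataset, which exposes the same data/target layout:

from sklearn.datasets import fetch_california_housing

house = fetch_california_housing()  # downloaded on first use
print(house.data.shape)    # (20640, 8)
print(house.target.shape)  # (20640,)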
Modeling on generated data
In this demo, we want to draw the fitted line and the error band.

Our data only has one feature, so our model is a line. Since we can get the coefficient and bias terms from our model, we can compute the estimated values from these parameters ourselves.

Of course, we could also use predict() to get the predictions directly, which is much more convenient. The reason we don’t use the built-in function here is that we want a little bit of insight into the internal mechanism. Below is how we get those parameters from the model and compute the prediction values and the error.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 10, 11)
y = [3.9, 4.4, 10.8, 10.3, 11.2, 13.1, 14.1, 9.9, 13.9, 15.1, 12.5]

lr2 = LinearRegression()
# reshape the 1-D feature array to a 2-D array, since fit() expects one row per sample
lr2.fit(x.reshape(-1, 1), y)

# y_est is the prediction value. Normally, you get this value by calling predict()
y_est = x * lr2.coef_ + lr2.intercept_
# y_err is the half-width of the error band drawn around the fitted line
y_err = x.std() * np.sqrt(1/len(x) +
                          (x - x.mean())**2 / np.sum((x - x.mean())**2))
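As a quick sanity check (a small sketch using the lr2, x, and y_est variables defined above), the manually computed estimates should match what predict() returns:

import numpy as np

# For a single feature, x * coef_ + intercept_ is exactly what predict() computes
print(np.allclose(y_est, lr2.predict(x.reshape(-1, 1))))  # True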
The chart below shows the original data, the fitted line, and the error band.
Here is the complete code so you can try it yourself.
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 11)
y = [3.9, 4.4, 10.8, 10.3, 11.2, 13.1, 14.1, 9.9, 13.9, 15.1, 12.5]

lr2 = LinearRegression()
lr2.fit(x.reshape(-1, 1), y)
print('Coefficients: {}'.format(lr2.coef_))
print('Bias term: {}'.format(lr2.intercept_))

y_est = x * lr2.coef_ + lr2.intercept_
y_err = x.std() * np.sqrt(1 / len(x) +
                          (x - x.mean())**2 / np.sum((x - x.mean())**2))

fig, ax = plt.subplots()
ax.plot(x, y_est, '-')
ax.fill_between(x, y_est - y_err, y_est + y_err, alpha=0.2)
ax.plot(x, y, 'o')
fig.savefig("output/img.png", dpi=300)
plt.close(fig)
- x and y are our training data. x is the feature (there is only one feature, and each data point is an instance/sample) and y is the target/label.
- A linear regression model is created and then trained with fit.
- In this example, we don’t use the built-in predict() function. We use the learned parameters to compute the prediction values following the model equation, and we also compute the error band.
- Finally, we plot the original data, the fitted line, and the error band with matplotlib.