Exercise: Linear Regression in scikit-learn
Learn how to implement linear regression in scikit-learn using synthetic data.
Linear regression with scikit-learn
In this exercise, we will take the synthetic data we just generated and determine a line of best fit, or linear regression, using scikit-learn. The first step is to import a linear regression model class from scikit-learn and create an object from it. The import is similar to the `LogisticRegression` class we worked with previously. As with any model class, you should observe what all the default options are. Notice that for linear regression, there are not many options to specify; you will use the defaults for this exercise. The default settings include `fit_intercept=True`, meaning the regression model will include an intercept term. This is certainly appropriate, because we added an intercept to the synthetic data. Perform the following steps to complete the exercise, noting that the code from the preceding lesson that creates the data for linear regression (found at the end of that lesson) must be run first in the same notebook.
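For reference, here is a minimal sketch of what that data-creation code could look like. The seed, sample count, feature range, noise scale, and the true slope of 0.25 and intercept of -1.25 are all assumptions for illustration (chosen to be consistent with the fitted values shown below); the preceding lesson's exact values may differ:

```python
import numpy as np
import matplotlib.pyplot as plt  # needed for the plotting step later

np.random.seed(1)  # assumed seed, for reproducibility
# Assumed true parameters: slope = 0.25, intercept = -1.25, Gaussian noise
X = np.random.uniform(low=0.0, high=10.0, size=(1000,))
y = 0.25 * X - 1.25 + np.random.normal(loc=0.0, scale=1.0, size=(1000,))
```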
- Execute this code to import the linear regression model class and instantiate it with all the default options:

```python
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None)
lin_reg
```
You should see the following output:

```
LinearRegression()
```
No options are displayed because we used all the defaults. Now we can fit the model using our synthetic data, remembering to reshape the feature array (as we did earlier) so that the samples lie along the first dimension. After fitting the linear regression model, we examine `lin_reg.intercept_`, which contains the intercept of the fitted model, as well as `lin_reg.coef_`, which contains the slope.
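As an aside, if you want to observe all of the default options mentioned above, every scikit-learn estimator exposes them through its `get_params` method (the exact set of parameters you see depends on your scikit-learn version):

```python
# Print a dict of all hyperparameters and their current (default) values
print(lin_reg.get_params())
```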
- Run this code to fit the model and examine the coefficients:

```python
lin_reg.fit(X.reshape(-1,1), y)
print(lin_reg.intercept_)
print(lin_reg.coef_)
```
You should see this output for the intercept and slope:

```
-1.2522197212675905
[0.25711689]
```
We again see that actually fitting a model in scikit-learn, once the data is prepared and the model options are decided, is a trivial process. This is because all the algorithmic work of determining the model parameters is abstracted away from the user. We will return to this process later, for the logistic regression model we'll use on the case study data.
What about the slope and intercept of our fitted model?

These numbers are fairly close to the slope and intercept we specified when creating the synthetic data. However, because of the random noise, they are only approximations.
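To put a number on how well the line fits, you can also compute the coefficient of determination (R^2) on the training data using the model's `score` method; this quick check is not part of the original exercise:

```python
# R^2 of the fitted model on its training data; 1.0 would be a perfect fit
print(lin_reg.score(X.reshape(-1,1), y))
```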
Finally, we can use the model to make predictions on feature values. Here, we do this using the same data used to fit the model: the array of features, `X`. We capture the output as a variable, `y_pred`. This is very similar to the example shown in the figure below, only here we are making predictions on the same data used to fit the model (previously, we made predictions on different data), and we put the output of the `predict` method into a variable.
- Run this code to make predictions:

```python
y_pred = lin_reg.predict(X.reshape(-1,1))
```
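Under the hood, `predict` for a fitted linear regression simply applies the learned line to each feature value. A quick sanity check (not part of the original exercise) confirms this:

```python
import numpy as np

# predict applies the learned line: intercept + slope * X
manual_pred = lin_reg.intercept_ + lin_reg.coef_[0] * X
print(np.allclose(y_pred, manual_pred))  # should print True
```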
We can plot the predictions, `y_pred`, against the feature `X` as a line plot over the scatter plot of the feature and response data, like the figure we made in the previous lesson, "Plot the noisy linear relationship". Here, we make the addition of `plt.plot`, which produces a line plot by default, to plot the feature and the model-predicted response values for the model training data. Notice that we follow the `X` and `y_pred` data with `'r'` in our call to `plt.plot`. This format string causes the line to be red and is part of a shorthand syntax for plot formatting.
- This code can be used to plot the raw data, as well as the fitted model predictions on this data:

```python
plt.scatter(X, y, s=1)
plt.plot(X, y_pred, 'r')
plt.xlabel('X')
plt.ylabel('y')
```
After executing this cell, you should see something like this:

[Figure: scatter plot of the noisy data with the fitted regression line overlaid in red]
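As a final aside, the `'r'` shorthand generalizes: a matplotlib format string can combine a color, a marker, and a line style in a single token. A small sketch, reusing the same `X` and `y_pred` from above:

```python
# 'r'   -> solid red line (solid is the default line style)
# 'r--' -> dashed red line
# 'g.'  -> green dot markers with no connecting line
plt.plot(X, y_pred, 'r--')
```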