Linear regression using scikit-learn

Key takeaways:

  • Linear regression is a supervised machine learning method used to predict continuous values.

  • Linear regression estimates the best-fit line through the data points by minimizing the error between the actual and predicted values.

  • Scikit-learn simplifies machine learning tasks with built-in algorithms, such as linear regression and clustering, as well as data-handling utilities.

  • Mean Squared Error (MSE) evaluates model performance, indicating how far off predictions are from actual values.

  • Linear regression gives better results when certain conditions hold, such as a strong linear relationship between the features and the target, the absence of significant outliers, low multicollinearity, and appropriately scaled features.

Linear regression and scikit-learn

In linear regression, we fit a linear function to the points in the dataset by minimizing the error between the actual values and the values the function predicts.

Estimating the line using linear regression
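As a small illustration of this idea, here's a minimal sketch that fits a straight line to a handful of toy points using NumPy's least-squares polynomial fit (the data values are made up for demonstration):

import numpy as np

# Toy data: five points that roughly follow y = 2x + 1 (values are made up)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Fit a degree-1 polynomial (a line) by minimizing the squared error
slope, intercept = np.polyfit(x, y, deg=1)
print(f"Estimated line: y = {slope:.2f}x + {intercept:.2f}")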

Scikit-learn is a Python machine learning library that implements many common algorithms, including k-means, k-nearest neighbors, and a range of other regression and clustering methods, and it provides data-handling utilities as well. Here, we'll discuss how to install scikit-learn and apply its linear regression implementation, working through the relevant functions with code.

Installation

We can install scikit-learn with the following command:

pip install scikit-learn
Command to install scikit-learn
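Since the dataset loader used in Step 2 depends on your scikit-learn version, it's worth confirming which version you have installed:

python -c "import sklearn; print(sklearn.__version__)"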

Build a linear regression model with scikit-learn

Let’s explore how we can perform linear regression with the help of scikit-learn on a real-world use case.

Step 1: Import the libraries

First of all, to work with the scikit-learn library, we need to import everything the code below relies on. We include numpy for numerical operations and matplotlib for plotting, plus the mean_squared_error metric that we'll use in Step 6.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

Step 2: Load the dataset

To demonstrate linear regression, we'll use the Boston housing dataset, a classic regression benchmark in which the task is to predict house prices from features describing each house and its neighborhood. In the following code, X holds the feature data, and Y holds the target prices.

boston = load_boston()
X = boston.data
Y = boston.target
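Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the call above fails on recent versions. If you're on scikit-learn 1.2 or later, one option is to substitute the California housing dataset, which works as a drop-in replacement for the rest of this walkthrough (keep in mind that its target is the median house value in units of $100,000, so the error values will be on a different scale than the Boston figures discussed below):

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()  # downloads a small file on first use
X = housing.data    # 8 numeric features per district
Y = housing.target  # median house value, in units of $100,000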

Step 3: Split the data

Scikit-learn provides a handy function, train_test_split, for splitting the data into training and testing sets. We define the split ratio with test_size=0.2, which reserves 80% of the data for training the model and 20% for testing it. The training data can be split further to carve out a validation set.

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
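As a quick sanity check, you can print the shapes of the resulting arrays. For the 506-row, 13-feature Boston dataset, an 80/20 split gives approximately 404 training rows and 102 test rows:

print(X_train.shape, X_test.shape)  # (404, 13) (102, 13)
print(y_train.shape, y_test.shape)  # (404,) (102,)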

Step 4: Train the model

We define the regression model with the LinearRegression class. After defining the model, we pass the training data X_train and targets y_train to model.fit(), which trains the model.

model = LinearRegression()
model.fit(X_train, y_train)
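After fitting, the learned parameters are stored on the model itself. Inspecting them shows the weight assigned to each feature and the intercept of the fitted hyperplane:

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)  # one weight per feature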

Step 5: Make predictions

Next, we check how well the trained model generalizes by making predictions on data it hasn't seen. We pass the held-out features X_test to model.predict().

y_pred = model.predict(X_test)
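To get a feel for the output before computing any metric, you can compare the first few predictions against the true values (the exact numbers will vary with the dataset and split):

for actual, predicted in zip(y_test[:5], y_pred[:5]):
    print(f"actual: {actual:.1f}, predicted: {predicted:.1f}")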

Step 6: Calculate the mean square error

After the model has produced its estimates, we measure the error with the mean_squared_error() function, which we imported from sklearn.metrics in Step 1.

mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Step 7: Plot the results

Finally, we plot the predicted values against the actual values for the testing data; points near the dashed diagonal indicate accurate predictions.

plt.scatter(y_test, y_pred)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2) # Identity line: points on it are perfect predictions
plt.xlabel("True Values")
plt.ylabel("Predicted Values")
plt.title("True vs. Predicted Values")
plt.savefig('./output/plot.png') # assumes an ./output directory already exists
plt.show()

Complete Python code to demonstrate linear regression using scikit-learn

The complete Python code to demonstrate linear regression using scikit-learn is given below:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
boston = load_boston()
X = boston.data
Y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
plt.scatter(y_test, y_pred)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2) # Identity line: points on it are perfect predictions
plt.xlabel("True Values")
plt.ylabel("Predicted Values")
plt.title("True vs. Predicted Values")
plt.savefig('./output/plot.png') # assumes an ./output directory already exists
plt.show()

The Mean Squared Error (MSE) of about 24 corresponds to a root mean squared error of roughly 4.9; since Boston prices are recorded in thousands of dollars, predictions are off by about $4,900 on average. One likely reason is that a single linear function is too simple for some of the nonlinear relationships in the data. Note that rescaling the features does not change the predictions of plain least-squares regression, but scaling does matter once you move to regularized variants such as ridge regression, as sketched below.
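Here is a minimal sketch of standardizing the features and fitting a ridge regression inside a scikit-learn Pipeline. The alpha value is an arbitrary illustrative choice, not a tuned one:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Standardize each feature to zero mean and unit variance, then fit ridge regression
scaled_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scaled_model.fit(X_train, y_train)

scaled_mse = mean_squared_error(y_test, scaled_model.predict(X_test))
print("Ridge (scaled) MSE:", scaled_mse)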

Key requirements for using linear regression

To achieve reliable and accurate results with linear regression, several important conditions must be met:

  • Strong linear relationship between target and features: The dependent variable (target) should have a clear and measurable relationship with the independent variables (features), and this relationship must follow a straight-line pattern. If the data shows a non-linear relationship, linear regression may not be appropriate.

  • Minimal outliers: Outliers can disproportionately affect the model, leading to skewed predictions. It’s important to clean the data and remove any outliers to avoid this.

  • Minimal multicollinearity: The independent variables should not be highly correlated with each other. High multicollinearity can make the model’s coefficients unstable and hard to interpret. A quick way to screen for it is sketched after this list.

  • Feature scaling: When features have widely different ranges (e.g., age in years vs. income in thousands), bringing them onto a comparable scale makes the coefficients easier to compare and is important for regularized variants (such as ridge or lasso) and for gradient-based training, even though plain least squares itself is unaffected by rescaling.
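As a quick screen for the multicollinearity point above, you can compute the pairwise correlation matrix of the features and flag highly correlated pairs. The 0.9 threshold here is a common rule of thumb, not a hard rule:

import numpy as np

corr = np.corrcoef(X_train, rowvar=False)  # feature-by-feature correlation matrix
n_features = corr.shape[0]
for i in range(n_features):
    for j in range(i + 1, n_features):
        if abs(corr[i, j]) > 0.9:
            print(f"Features {i} and {j} are highly correlated: {corr[i, j]:.2f}")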

Conclusion

This Answer covered the concept of linear regression using the scikit-learn library. We walked through defining a model, training it, and evaluating its predictions. Scikit-learn provides everything a basic linear regression workflow needs, from data splitting to evaluation metrics.

To see practical applications of these concepts in various contexts, consider attempting real-world applications of linear regression, like predicting traffic volume using machine learning or predicting house prices. And if you’re curious about applying regression techniques in a different programming language, this project on price prediction with regression analysis in R is a good place to start. Exploring these valuable projects will equip you with the knowledge to implement linear regression techniques confidently in your own projects.

Q: What is the purpose of the train_test_split function in scikit-learn?

A) It splits your code into separate files.
B) It creates a visualization of your dataset.
C) It splits the dataset into training and testing sets.
D) It preprocesses the data before training a model.

Frequently asked questions



Why is it called linear regression?

It’s called “linear” because it models the relationship between variables as a straight line. The term “regression” refers to the process of estimating that relationship.


When to use linear regression?

You should use linear regression when you want to predict a continuous value, and there’s a clear, straight-line relationship between the variables. It works best when the data doesn’t have outliers or highly correlated features.


When to use linear vs. logistic regression?

Linear regression is used for regression problems, where your target variable is a continuous number, like house prices, temperature, or stock prices (for example, predicting the price of a house based on its square footage). Logistic regression is used for classification problems, where your target variable is categorical, such as “yes” or “no,” “spam” or “not spam,” or “positive” or “negative” (for example, predicting whether an email is spam based on its content).

For more details, look at “What is the difference between linear and logistic regression?”

