Linear regression using scikit-learn

Key takeaways:

  • Linear regression is a supervised machine learning method used to predict continuous values.

  • Linear regression estimates the best-fit line through the data points by minimizing the error between the actual and predicted values.

  • Scikit-learn simplifies machine learning tasks with built-in algorithms, such as linear regression and clustering, as well as data-handling utilities.

  • Mean Squared Error (MSE) evaluates model performance, indicating how far off predictions are from actual values.

  • Linear regression gives better results when certain conditions hold, such as a strong linear relationship between the features and the target, the absence of significant outliers, low multicollinearity, and appropriately scaled features.

Linear regression and scikit-learn

In linear regression, we fit a linear function to the points in the dataset by minimizing the error between the actual values and the values the function predicts.

Estimating the line using linear regression
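As a small illustration of this idea, here's a minimal sketch that fits a straight line to a handful of toy points using NumPy's least-squares polynomial fit (the data values are made up for demonstration):

import numpy as np

# Toy data: five points that roughly follow y = 2x + 1 (values are made up)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Fit a degree-1 polynomial (a line) by minimizing the squared error
slope, intercept = np.polyfit(x, y, deg=1)
print(f"Estimated line: y = {slope:.2f}x + {intercept:.2f}")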

Scikit-learn is a Python machine learning library that implements many common algorithms, including k-means, k-nearest neighbors, and a range of other regression and clustering methods, and it provides data-handling utilities as well. Here, we'll discuss how to install scikit-learn and apply its linear regression implementation, working through the relevant functions with code.

Installation

We can install scikit-learn with the following command:

pip install scikit-learn
Command to install scikit-learn
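Since the dataset loader used in Step 2 depends on your scikit-learn version, it's worth confirming which version you have installed:

python -c "import sklearn; print(sklearn.__version__)"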

Build a linear regression model with scikit-learn

Let’s explore how we can perform linear regression with the help of scikit-learn on a real-world use case.

Step 1: Import the libraries

First of all, to work with the scikit-learn library, we need to import everything the code below relies on. We include numpy for numerical operations and matplotlib for plotting, plus the mean_squared_error metric that we'll use in Step 6.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

Step 2: Load the dataset

To demonstrate linear regression, we'll use the Boston housing dataset, a classic regression benchmark in which the task is to predict house prices from features describing each house and its neighborhood. In the following code, X holds the feature data, and Y holds the target prices.

boston = load_boston()
X = boston.data
Y = boston.target
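Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the call above fails on recent versions. If you're on scikit-learn 1.2 or later, one option is to substitute the California housing dataset, which works as a drop-in replacement for the rest of this walkthrough (keep in mind that its target is the median house value in units of $100,000, so the error values will be on a different scale than the Boston figures discussed below):

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()  # downloads a small file on first use
X = housing.data    # 8 numeric features per district
Y = housing.target  # median house value, in units of $100,000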

Step 3: Split the data

Scikit-learn provides a handy function, train_test_split, for splitting the data into training and testing sets. We define the split ratio with test_size=0.2, which reserves 80% of the data for training the model and 20% for testing it. The training data can be split further to carve out a validation set.

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
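As a quick sanity check, you can print the shapes of the resulting arrays. For the 506-row, 13-feature Boston dataset, an 80/20 split gives approximately 404 training rows and 102 test rows:

print(X_train.shape, X_test.shape)  # (404, 13) (102, 13)
print(y_train.shape, y_test.shape)  # (404,) (102,)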

Step 4: Train the model

We define the regression model with the LinearRegression class. After defining the model, we pass the training data X_train and targets y_train to model.fit(), which trains the model.

model = LinearRegression()
model.fit(X_train, y_train)
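After fitting, the learned parameters are stored on the model itself. Inspecting them shows the weight assigned to each feature and the intercept of the fitted hyperplane:

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)  # one weight per feature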

Step 5: Make predictions

Next, we check how well the trained model generalizes by making predictions on data it hasn't seen. We pass the held-out features X_test to model.predict().

y_pred = model.predict(X_test)
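To get a feel for the output before computing any metric, you can compare the first few predictions against the true values (the exact numbers will vary with the dataset and split):

for actual, predicted in zip(y_test[:5], y_pred[:5]):
    print(f"actual: {actual:.1f}, predicted: {predicted:.1f}")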

Step 6: Calculate the mean square error

After the model has produced its estimates, we measure the error with the mean_squared_error() function, which we imported from sklearn.metrics in Step 1.

mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Step 7: Plot the results

Finally, we plot the predicted values against the actual values for the testing data; points near the dashed diagonal indicate accurate predictions.

plt.scatter(y_test, y_pred)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2) # Identity line: points on it are perfect predictions
plt.xlabel("True Values")
plt.ylabel("Predicted Values")
plt.title("True vs. Predicted Values")
plt.savefig('./output/plot.png') # assumes an ./output directory already exists
plt.show()

Complete Python code to demonstrate linear regression using scikit-learn

The complete Python code to demonstrate linear regression using scikit-learn is given below:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
boston = load_boston()
X = boston.data
Y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
plt.scatter(y_test, y_pred)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2) # Identity line: points on it are perfect predictions
plt.xlabel("True Values")
plt.ylabel("Predicted Values")
plt.title("True vs. Predicted Values")
plt.savefig('./output/plot.png') # assumes an ./output directory already exists
plt.show()

The Mean Squared Error (MSE) of about 24 corresponds to a root mean squared error of roughly 4.9; since Boston prices are recorded in thousands of dollars, predictions are off by about $4,900 on average. One likely reason is that a single linear function is too simple for some of the nonlinear relationships in the data. Note that rescaling the features does not change the predictions of plain least-squares regression, but scaling does matter once you move to regularized variants such as ridge regression, as sketched below.
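Here is a minimal sketch of standardizing the features and fitting a ridge regression inside a scikit-learn Pipeline. The alpha value is an arbitrary illustrative choice, not a tuned one:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Standardize each feature to zero mean and unit variance, then fit ridge regression
scaled_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scaled_model.fit(X_train, y_train)

scaled_mse = mean_squared_error(y_test, scaled_model.predict(X_test))
print("Ridge (scaled) MSE:", scaled_mse)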

Key requirements for using linear regression

To achieve reliable and accurate results with linear regression, several important conditions must be met:

  • Strong linear relationship between target and features: The dependent variable (target) should have a clear and measurable relationship with the independent variables (features), and this relationship must follow a straight-line pattern. If the data shows a non-linear relationship, linear regression may not be appropriate.

  • Minimal outliers: Outliers can disproportionately affect the model, leading to skewed predictions. It’s important to clean the data and remove any outliers to avoid this.

  • Minimal multicollinearity: The independent variables should not be highly correlated with each other. High multicollinearity can make the model’s coefficients unstable and hard to interpret. A quick way to screen for it is sketched after this list.

  • Feature scaling: When features have widely different ranges (e.g., age in years vs. income in thousands), bringing them onto a comparable scale makes the coefficients easier to compare and is important for regularized variants (such as ridge or lasso) and for gradient-based training, even though plain least squares itself is unaffected by rescaling.
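As a quick screen for the multicollinearity point above, you can compute the pairwise correlation matrix of the features and flag highly correlated pairs. The 0.9 threshold here is a common rule of thumb, not a hard rule:

import numpy as np

corr = np.corrcoef(X_train, rowvar=False)  # feature-by-feature correlation matrix
n_features = corr.shape[0]
for i in range(n_features):
    for j in range(i + 1, n_features):
        if abs(corr[i, j]) > 0.9:
            print(f"Features {i} and {j} are highly correlated: {corr[i, j]:.2f}")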

Conclusion

This Answer covered the concept of linear regression using the scikit-learn library. We walked through defining a model, training it, and evaluating its predictions. Scikit-learn provides everything a basic linear regression workflow needs, from data splitting to evaluation metrics.

To see practical applications of these concepts in various contexts, consider attempting real-world applications of linear regression, like predicting traffic volume using machine learning or predicting house prices. And if you’re curious about applying regression techniques in a different programming language, this project on price prediction with regression analysis in R is a good place to start. Exploring these valuable projects will equip you with the knowledge to implement linear regression techniques confidently in your own projects.

Q: What is the purpose of the train_test_split function in scikit-learn?

A) It splits your code into separate files.
B) It creates a visualization of your dataset.
C) It splits the dataset into training and testing sets.
D) It preprocesses the data before training a model.

Frequently asked questions



Why is it called linear regression?

It’s called “linear” because it models the relationship between variables as a straight line. The term “regression” refers to the process of estimating that relationship.


When to use linear regression?

You should use linear regression when you want to predict a continuous value, and there’s a clear, straight-line relationship between the variables. It works best when the data doesn’t have outliers or highly correlated features.


When to use linear vs. logistic regression?

Linear regression is used for regression problems, where your target variable is a continuous number, like house prices, temperature, or stock prices (for example, predicting the price of a house based on its square footage). Logistic regression is used for classification problems, where your target variable is categorical, such as “yes” or “no,” “spam” or “not spam,” or “positive” or “negative” (for example, predicting whether an email is spam based on its content).

For more details, look at “What is the difference between linear and logistic regression?”

