It’s called “linear” because it models the relationship between variables as a straight line. The term “regression” refers to the process of estimating that relationship.
Key takeaways:
Linear regression is a supervised machine learning method used to predict continuous values.
Linear regression estimates the best-fit line through the data points by minimizing the error between the actual and predicted values.
Scikit-learn simplifies machine learning tasks with built-in algorithms like linear regression, clustering, and data handling functions.
Mean Squared Error (MSE) evaluates model performance, indicating how far off predictions are from actual values.
Linear regression works best when certain conditions are met, such as a strong linear relationship between features and target, minimal outliers, low multicollinearity, and appropriately scaled features.
In linear regression, we derive a linear function using some points in the dataset while minimizing the error between the actual and the predicted value given by the function.
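As a quick illustration of this idea, here is a minimal sketch with made-up toy data that uses NumPy’s polyfit to find the slope and intercept minimizing the squared error:

import numpy as np

# Toy data: y is roughly 2*x + 1 with some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# polyfit with deg=1 returns the slope and intercept that minimize
# the sum of squared errors between y and (slope * x + intercept)
slope, intercept = np.polyfit(x, y, deg=1)
print(f"y = {slope:.2f}x + {intercept:.2f}")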
Scikit-learn is a machine learning library that provides a wide range of algorithms, such as k-means, k-nearest neighbors, and various other regression and clustering methods. The library provides data-handling functions as well. Here, we’ll discuss how to install scikit-learn and apply the linear regression method. Let’s try to understand the functions with code.
We can install scikit-learn with the help of the following command:
pip install scikit-learn
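Once installed, we can verify that the library is importable and check its version:

import sklearn
print(sklearn.__version__)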
Let’s explore how we can perform linear regression with the help of scikit-learn on a real-world use case.
First, we import the libraries required to run the code. We have included the numpy and matplotlib libraries as well.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston
To understand linear regression, we use the Boston housing dataset. This dataset is mainly used for regression tasks, in which we have to predict the price of a house given its features. In the following code, X represents the data, and Y represents the target.
boston = load_boston()
X = boston.data
Y = boston.target
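Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the snippet above requires an older version of the library. On newer versions, the California housing dataset is a convenient drop-in alternative for this walkthrough (the exact MSE discussed below applies to the Boston dataset):

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()  # downloads the data on first use
X = housing.data    # eight numeric features per district
Y = housing.target  # median house value, in units of $100,000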
The handy train_test_split() function provided by the scikit-learn library splits the data into training and testing sets. We have to define the split ratio, which in this case is test_size=0.2. This means that 80% of the data is reserved for training the model, while 20% is reserved for testing it. We can further split the training data into training and validation sets, as sketched after the next code snippet.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
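For example, if we wanted the validation set mentioned above, a second call to train_test_split can carve it out of the training portion; the 0.25 ratio and random_state below are arbitrary illustrative choices:

# 0.25 of the 80% training portion, i.e., 20% of the full dataset
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)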
We can define the regression model with the help of the LinearRegression class. After defining the model, we’ll pass the data X_train and target y_train to the model.fit() function. This function will train the model.
model = LinearRegression()
model.fit(X_train, y_train)
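Once fitted, the model exposes the parameters of the learned line, which we can inspect directly:

# One learned weight per feature, plus an intercept (bias) term
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)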
As usual, we check how well the model has been trained by generating predictions on the unseen data X_test.
y_pred = model.predict(X_test)
After the model has predicted the estimated values, we’ll check the error with the mean_squared_error() function.
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
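To make the metric concrete, the same number can be reproduced by hand: MSE is simply the average of the squared differences between the actual and predicted values.

import numpy as np

manual_mse = np.mean((y_test - y_pred) ** 2)  # matches mean_squared_error
print("Manual MSE:", manual_mse)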
Finally, we plot the actual values against the predicted values for the testing data.
plt.scatter(y_test, y_pred)
# Identity line: points on it would be perfect predictions
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)],
         linestyle='--', color='red', linewidth=2)
plt.xlabel("True Values")
plt.ylabel("Predicted Values")
plt.title("True vs. Predicted Values")
plt.savefig('./output/plot.png')
plt.show()
The complete Python code to demonstrate linear regression using scikit-learn is given below:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

# Load the dataset (requires scikit-learn < 1.2; see note above)
boston = load_boston()
X = boston.data
Y = boston.target

# Split into 80% training and 20% testing data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Define and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the unseen test data and measure the error
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Plot true vs. predicted values
plt.scatter(y_test, y_pred)
# Identity line: points on it would be perfect predictions
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)],
         linestyle='--', color='red', linewidth=2)
plt.xlabel("True Values")
plt.ylabel("Predicted Values")
plt.title("True vs. Predicted Values")
plt.savefig('./output/plot.png')
plt.show()
The Mean Squared Error (MSE) of about 24 shows that the model doesn’t fit the data especially well: since Boston house prices are recorded in thousands of dollars, predictions are off by roughly √24 ≈ 4.9, or about $4,900, on average. Possible causes include non-linear relationships between the features and the target and outliers in the data. The widely different feature scales are often cited as well, although plain least squares is scale-invariant, so scaling mainly helps with interpreting coefficients and with regularized variants of the model.
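As a sketch of how scaling slots into the workflow, scikit-learn’s StandardScaler can be chained with the model in a pipeline. With plain LinearRegression the predictions (and the MSE) stay essentially the same, but this setup becomes important when switching to a regularized model such as Ridge:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features (zero mean, unit variance) before fitting
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_model.fit(X_train, y_train)
print("MSE with scaling:", mean_squared_error(y_test, scaled_model.predict(X_test)))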
To achieve reliable and accurate results with linear regression, several important conditions must be met:
Strong linear relationship between target and features: The dependent variable (target) should have a clear and measurable relationship with the independent variables (features), and this relationship must follow a straight-line pattern. If the data shows a non-linear relationship, linear regression may not be appropriate.
Minimal outliers: Outliers can disproportionately affect the model, leading to skewed predictions. It’s important to clean the data and remove any outliers to avoid this.
Minimal multicollinearity: The independent variables should not be highly correlated with each other. High multicollinearity can distort the model’s coefficients and lead to inaccurate predictions (a quick correlation check is sketched after this list).
Feature scaling: When features have widely different ranges (e.g., age in years vs. income in thousands), standardizing them makes the coefficients directly comparable and is important for gradient-based solvers and regularized variants of linear regression, even though plain least squares predictions are unaffected by scale.
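One quick way to screen for multicollinearity is to inspect pairwise feature correlations; the 0.9 threshold below is an arbitrary illustrative cutoff, and X is assumed to be the feature matrix loaded earlier:

import numpy as np

# Columns of X are treated as variables; values near +/-1 between two
# different features suggest multicollinearity
corr = np.corrcoef(X, rowvar=False)
pairs = np.argwhere(np.triu(np.abs(corr) > 0.9, k=1))
print("Highly correlated feature pairs (|r| > 0.9):", pairs)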
This Answer covered the concept of linear regression using the scikit-learn library. We looked at defining a model, training it, and then testing its accuracy. Scikit-learn provides all the functions needed for a complete linear regression workflow.
To see practical applications of these concepts in various contexts, consider attempting real-world applications of linear regression, like predicting traffic volume using machine learning or predicting house prices. And if you’re curious about applying regression techniques in a different programming language, this project on price prediction with regression analysis in R is a good place to start. Exploring these valuable projects will equip you with the knowledge to implement linear regression techniques confidently in your own projects.
What is the purpose of the train_test_split function in scikit-learn?
It splits your code into separate files.
It creates a visualization of your dataset.
It splits the dataset into training and testing sets.
It preprocesses the data before training a model.