
What are Regularization Techniques in Regression

Najeeb Ul Hassan
Jan 22, 2024
8 min read

Machine learning empowers computers to learn and make accurate predictions based on patterns in the data. We start by preparing the data and splitting it into training and test sets. We then select an appropriate model that best describes our problem. With the model in place, we adjust its parameters using the training data such that the model fits well. We then assess the performance of the trained model using the test data and appropriate evaluation metrics. Based on the evaluation results, we fine-tune the model’s hyperparameters to optimize performance. Finally, we deploy the trained and optimized model to generate predictions on new, unseen data.

Machine learning workflow

This process aims to fine-tune the model to perform well on new unseen data and make accurate predictions.

Regression

Regression is a statistical tool that describes the relationship between a dependent variable and one or more independent variables. This helps us understand how changes in one variable lead to changes in another.

Let’s take an example of linear regression, where we wish to model a dependent variable $y$ based on an independent variable $x$. The independent variables are the dataset’s features used to predict the dependent variable. Mathematically, we can write a linear regression model as follows:

y = w \cdot x + w_0

Here, $w_0$ is the y-intercept and $w$ is the coefficient that represents the change in $y$ for a one-unit change in $x$. During the training phase, we find the optimal values of $w_0$ and $w$ such that the regression equation fits the data. This process is called optimization, and it minimizes a specified objective function. The objective function guides the optimization process by providing a quantitative measure of how well the model is performing. In linear regression, this objective function is the mean squared error (MSE). The MSE is the average of the squared differences between the predicted values $\hat{y}_i$ and the actual values $y_i$. It can be written as follows:

L = \frac{1}{N} \sum_{i=1}^N (\hat{y}_i - y_i)^2
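For instance, here is a minimal sketch of how the MSE above can be computed with NumPy; the actual and predicted values are made-up numbers used purely for illustration:

import numpy as np

# Hypothetical actual and predicted values for illustration
y_actual = np.array([3.0, 5.0, 7.5, 9.0])
y_predicted = np.array([2.8, 5.4, 7.0, 9.5])

# Mean squared error: average of the squared differences
mse = np.mean((y_predicted - y_actual) ** 2)
print(mse)  # 0.175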

Regularization

Regression models can sometimes suffer from overfitting: when a model becomes too complex, it tries to fit every detail in the available features during the training phase and, as a result, also captures the noise in the training data. When we apply the same model to new, unseen data, it fails to generalize well.

Overfitting: the model fits too closely to the training dataset

Take the example of a fitness tracker that monitors the progress of an athlete training for a marathon. To predict the marathon completion time, the fitness tracker records quantities like sleep hours, calorie intake, and running distance, and employs a regression model to predict the completion time from these features. However, the tracker might also capture irrelevant or noisy data caused by GPS inaccuracies or occasional outliers in calorie intake. These inaccuracies and outliers negatively impact the performance of the regression model. As a result, the model performs well during training but fails to provide accurate results in the evaluation phase, when new, unseen terrain is presented to it on the actual marathon day.

One way to address overfitting is to apply regularization techniques. Regularization controls the model’s complexity by adding a penalty term to the model’s loss function, thereby discouraging overly complex representations. It keeps the focus on the most relevant features and prevents the model from getting distracted by irrelevant or noisy details.

In simple terms, regularization is like having a knowledgeable guide that helps keep the focus on the important features and provides more accurate predictions for unseen data.

Let’s look at the commonly used regularization techniques.

Ridge regression

Let’s assume we wish to build a linear regression model using multiple features, also known as multiple linear regression. We aim to find the best line that minimizes the difference between the observed and predicted value of the dependent variable based on multiple input features.

Now, assume that some of the input features are highly correlated, a phenomenon commonly known as multicollinearity. Recalling our example of predicting the marathon completion time, calorie intake and running distance might be highly correlated. This makes it difficult for the model to determine the individual effect of each correlated feature on the target variable, which can result in inaccurate regression coefficients.

The marathon completion time based on the input features
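One quick way to check for such multicollinearity is to compute the correlation between candidate features. The following sketch uses made-up calorie and distance values, not real tracker data, purely to illustrate the check:

import numpy as np

# Hypothetical feature columns for five training days (illustrative values only)
calorie_intake = np.array([2500, 2800, 3100, 3400, 3700])
running_distance = np.array([10.0, 12.5, 14.0, 16.5, 18.0])

# A correlation coefficient close to 1 (or -1) signals multicollinearity
print(np.corrcoef(calorie_intake, running_distance)[0, 1])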

Ridge regression introduces an L2 regularization term on the coefficients of the regression model. This penalty term is based on the squared values of the coefficients $w_j$.

In ridge regression, the model aims to minimize the sum of squared coefficients of the features in addition to minimizing the errors between the actual and the predicted output. The sum of squared coefficients is scaled by a regularization parameter α\alpha that controls the strength of regularization. Mathematically, we can write the objective function for ridge regression as follows:

L_{\text{Ridge}} = L + \alpha \sum_{j=1}^p w_j^2

Here, $p$ denotes the number of features, or predictors, in our model.

The penalty term $\alpha \sum_{j=1}^p w_j^2$ prevents the model from assigning excessively large weights to any specific predictor, which reduces the model’s sensitivity to noisy or unimportant variables.
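To make the penalty concrete, here is a small sketch that evaluates the Ridge objective for a hypothetical coefficient vector and MSE value; the numbers are illustrative and not taken from any dataset:

import numpy as np

alpha = 1.0                      # regularization strength
w = np.array([0.5, -2.0, 3.0])   # hypothetical coefficients w_1..w_p
mse = 4.2                        # hypothetical MSE term L

# Ridge objective: L + alpha * sum of squared coefficients
l2_penalty = alpha * np.sum(w ** 2)
ridge_objective = mse + l2_penalty
print(l2_penalty, ridge_objective)  # 13.25 17.45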

Lasso regression

In addition to multicollinearity, a large number of features can also make the model complex. To deal with this, Lasso regression adds an L1 penalty term to the linear regression objective function based on the sum of the absolute values of the coefficients. This penalty promotes sparsity in the coefficients, effectively performing feature selection by driving some of them to exactly zero.

L_{\text{Lasso}} = L + \alpha \sum_{j=1}^p |w_j|

Lasso regression differs from Ridge regression in terms of the penalty term. In Lasso regression, the penalty tends to identify the less important features and shrink the corresponding coefficients to exactly zero.

This “zeroing out” of less relevant features in Lasso regression simplifies the model and helps the model focus only on the most influential features. This enhances the model’s accuracy and ability to generalize to new unseen data.
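This sparsity is easy to observe on synthetic data. The sketch below, with illustrative dataset sizes and regularization strength, fits a Lasso model to data in which only a few features carry signal and counts how many coefficients end up exactly zero:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, but only 5 carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

# Count how many coefficients were shrunk to exactly zero
print(np.sum(lasso.coef_ == 0), "of", lasso.coef_.size, "coefficients are zero")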

Elastic Net regression

Elastic Net regression combines both L1 and L2 penalties in the objective function as follows:

L_{\text{Elastic}} = L + \alpha \left[ L1_{\text{ratio}} \sum_{j=1}^p |w_j| + \frac{1}{2}(1 - L1_{\text{ratio}}) \sum_{j=1}^p w_j^2 \right]

Here, $L1_{\text{ratio}}$ balances the strengths of the Lasso and Ridge penalties. Setting $L1_{\text{ratio}} = 1$ makes Elastic Net regression the same as Lasso regression, while $L1_{\text{ratio}} = 0$ gives all the weight to the Ridge regression penalty term. Elastic Net regression performs exceptionally well on datasets with high-dimensional and correlated features.
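As a quick numerical check of this blending, the sketch below evaluates the bracketed penalty term for a hypothetical coefficient vector at the two extreme values of $L1_{\text{ratio}}$:

import numpy as np

w = np.array([0.5, -2.0, 3.0])  # hypothetical coefficients
alpha = 1.0

def elastic_penalty(w, alpha, l1_ratio):
    # alpha * [ l1_ratio * sum|w_j| + 0.5 * (1 - l1_ratio) * sum w_j^2 ]
    return alpha * (l1_ratio * np.sum(np.abs(w))
                    + 0.5 * (1 - l1_ratio) * np.sum(w ** 2))

print(elastic_penalty(w, alpha, l1_ratio=1.0))  # 5.5   -> pure L1 (Lasso) penalty
print(elastic_penalty(w, alpha, l1_ratio=0.0))  # 6.625 -> pure L2 (Ridge) penalty, scaled by 1/2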

Implementation in Python

Now that we have seen how regularization works, let’s implement a regression model to predict the price of a house based on its size, number of rooms, and location.

We will start by importing the necessary libraries and methods. We will then load the dataset and divide it into training and testing data sets. Let’s see how it can be done in Python.

from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split

# Load the Boston Housing dataset
boston = load_boston()
X = boston.data
y = boston.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  • Line 1: We import the Boston housing dataset.
  • Line 2: We import the methods for linear, Ridge, Lasso, and Elastic Net regressions.
  • Line 3: We import the train_test_split function from the sklearn library to split the data.
  • Lines 6–8: We load the Boston housing dataset into features X and the dependent variable y.
  • Line 11: We split the dataset using test_size=0.2 to select 80% of the data for training and the remaining 20% for testing purposes (a quick shape check follows below).

Note: The load_boston loader was deprecated in scikit-learn 1.0 and removed in version 1.2, so running this code requires an older scikit-learn release (or swapping in a comparable dataset, such as fetch_california_housing).
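Continuing the snippet above, a quick optional check (not part of the original code) is to print the shapes of the resulting arrays to confirm the 80/20 split:

# For the 506-sample Boston dataset, this should show around
# 404 training rows and 102 test rows
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)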

Ridge regression

Now, let’s apply the linear and Ridge regression models on our dataset to predict the price of the house.

# Fit a Linear Regression model
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

# Fit a Ridge Regression model
# The value of alpha gives the regularization strength
ridge_reg = Ridge(alpha=1)
ridge_reg.fit(X_train, y_train)

print("The coefficient matrix for Linear regression")
print(linear_reg.coef_)
print("The coefficient matrix for Ridge regression")
print(ridge_reg.coef_)

Here, we apply linear and Ridge regression models in lines 2–3 and lines 7–8, respectively. We also print the resulting coefficient matrix of the two models. The coefficient matrix consists of the coefficients $w_j$ that define the relationship between the $j^{\text{th}}$ independent variable and our dependent variable.
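Continuing the snippet above, one optional way to quantify the shrinkage is to compare the overall size of the two coefficient vectors; a smaller norm for Ridge indicates stronger shrinkage:

import numpy as np

# Compare the L2 norms of the two coefficient vectors
print("L2 norm of linear regression coefficients:", np.linalg.norm(linear_reg.coef_))
print("L2 norm of Ridge regression coefficients: ", np.linalg.norm(ridge_reg.coef_))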

Note that the coefficients of the Ridge regression model are very similar to those of the linear regression model. This suggests that the features are not highly correlated.

Lasso regression

Now, let’s apply the Lasso regression model to predict the house price and compare the coefficients with linear regression.

# Fit a Linear Regression model
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

# Fit a Lasso Regression model
# The value of alpha gives the regularization strength
lasso_reg = Lasso(alpha=1.0)
lasso_reg.fit(X_train, y_train)

print("The coefficient matrix for Linear regression")
print(linear_reg.coef_)
print("The coefficient matrix for Lasso regression")
print(lasso_reg.coef_)

Note that the Lasso regression model zeros out some of the coefficients, effectively performing feature selection. This eliminates the less effective features and reduces the risk of overfitting.
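Continuing the snippet above, an optional check makes this feature selection explicit by counting how many Lasso coefficients are exactly zero:

import numpy as np

# Features with a zero coefficient are effectively dropped from the model
print("Zeroed-out coefficients:", np.sum(lasso_reg.coef_ == 0), "of", lasso_reg.coef_.size)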

Elastic Net regression

Now, let’s also see the effect of applying Elastic Net regression on the data.

# Fit a Linear Regression model
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

# Fit an Elastic Net Regression model
# The value of alpha gives the regularization strength and l1_ratio gives the balance between Ridge and Lasso
elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic_net.fit(X_train, y_train)

print("The coefficient matrix for Linear regression")
print(linear_reg.coef_)
print("The coefficient matrix for Elastic Net regression")
print(elastic_net.coef_)

Here, $L1_{\text{ratio}}$ is set to $0.5$, giving an equal balance of the strengths of both the Lasso and Ridge penalties. Similar to the Lasso regression, some of the coefficients are zero, eliminating their effect in calculating the output dependent variable.

Note: Go ahead and see how the coefficients change by changing the value of $L1_{\text{ratio}}$. Remember, a value of $L1_{\text{ratio}}$ closer to $1$ makes the Elastic Net regression behave more like the Lasso regression, and an $L1_{\text{ratio}}$ closer to $0$ gives more weight to the Ridge regression penalty term.
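As a starting point for that experiment, here is a small sketch (the specific $L1_{\text{ratio}}$ values are illustrative) that refits the model for several values of l1_ratio and reports how many coefficients are zeroed out each time:

import numpy as np
from sklearn.linear_model import ElasticNet

# Closer to 1 -> more Lasso-like (more zeros); closer to 0 -> more Ridge-like (fewer zeros)
for l1_ratio in [0.1, 0.5, 0.9]:
    model = ElasticNet(alpha=1.0, l1_ratio=l1_ratio)
    model.fit(X_train, y_train)
    zeros = np.sum(model.coef_ == 0)
    print(f"l1_ratio={l1_ratio}: {zeros} zero coefficients")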

Further reading

This blog has briefly introduced commonly used regularization techniques and their implementation in Python. Regularization plays a crucial role in preventing overfitting and improves the generalization of models. We encourage you to explore these techniques further to gain a deeper understanding. Additionally, you can check out the following courses on Educative:

Mastering Machine Learning Theory and Practice

The machine learning field is rapidly advancing today due to the availability of large datasets and the ability to process big data efficiently. Moreover, several new techniques have produced groundbreaking results for standard machine learning problems. This course provides a detailed description of different machine learning algorithms and techniques, including regression, deep learning, reinforcement learning, Bayes nets, support vector machines (SVMs), and decision trees. The course also offers sufficient mathematical details for a deeper understanding of how different techniques work. An overview of the Python programming language and the fundamental theoretical aspects of ML, including probability theory and optimization, is also included. The course contains several practical coding exercises as well. By the end of the course, you will have a deep understanding of different machine-learning methods and the ability to choose the right method for different applications.

36hrs
Beginner
109 Playgrounds
10 Quizzes

Become a Machine Learning Engineer

Start your journey to becoming a machine learning engineer by mastering the fundamentals of coding with Python. Learn machine learning techniques, data manipulation, and visualization. As you progress, you'll explore object-oriented programming and the machine learning process, gaining hands-on experience with machine learning algorithms and tools like scikit-learn. Tackle practical projects, including predicting auto insurance payments and customer segmentation using K-means clustering. Finally, explore the deep learning models with convolutional neural networks and apply your skills to an AI-powered image colorization project.

105hrs
Beginner
17 Challenges
11 Quizzes

Data Science Interview Handbook

This course will increase your skills to crack the data science or machine learning interview. You will cover all the most common data science and ML concepts coupled with relevant interview questions. You will start by covering Python basics as well as the most widely used algorithms and data structures. From there, you will move on to more advanced topics like feature engineering, unsupervised learning, as well as neural networks and deep learning. This course takes a non-traditional approach to interview prep, in that it focuses on data science fundamentals instead of open-ended questions. In all, this course will get you ready for data science interviews. By the time you finish this course, you will have reviewed all the major concepts in data science and will have a good idea of what interview questions you can expect.

9hrs
Intermediate
140 Playgrounds
128 Quizzes

Frequently Asked Questions

What is regularization in linear regression?

Regularization manages the complexity of a linear regression model by imposing penalties on coefficients that are non-essential or irrelevant to predictive accuracy. Through regularization, the variance of the model is effectively reduced, safeguarding against overfitting and improving the model’s resilience to noise and outliers.


  
