Sklearn is a powerful open-source machine learning library that provides efficient tools for data analysis, classification, regression, and more. In this Answer, we will look into applying Ridge regression using sklearn.
Ridge regression is a commonly used regularization technique. It is a model tuning method that is used to perform analysis on data that has multicollinearity. Multicollinearity is a concept involved in statistics in which multiple independent columns are correlated with one another. It causes a large
We will be using the fetch_california_housing
dataset from Sklearn and converting it into a pandas DataFrame. The major information that the fetch_california_housing
object returns is:
data
: Each row contains the data corresponding to the eight features. (rows = 20640, columns = 8)
target
: Each row contains the average house value in units of 100000. (rows = 20640, columns = 1)
feature_names
: The array of the feature names. Its length is 8.
The code to import the dataset and create a data frame is given below:
from sklearn.datasets import fetch_california_housing import pandas as pd housing_data = fetch_california_housing() housing_df = pd.DataFrame(housing_data["data"] , columns=housing_data["feature_names"]) housing_df["target"]=housing_data["target"] print(housing_df.head())
Line 1: We import the fetch_california_housing
dataset from the sklearn.datasets
.
Line 4: We create an instance of the dataset.
Line 6: We create a DataFrame
by passing housing_data["data"]
and housing_data["feature_names"]
. housing_data["data"]
contains the array of data of all columns, and housing_data["feature_names"]
contains an array of column names.
Line 7: We create a target
column in our data frame by passing the housing_data["target"]
. The housing_data["target"]
is the output feature of our data set, which is not included in housing_data["data"]
.
To perform the Ridge regression, we have to split our data into training and testing data. The code to perform the train-test split is given below:
from sklearn.datasets import fetch_california_housing import pandas as pd from sklearn.model_selection import train_test_split import numpy as np housing = fetch_california_housing() housing_df = pd.DataFrame(housing["data"] , columns=housing["feature_names"]) housing_df["target"]=housing["target"] print(housing_df.head()) np.random.seed(42) X = housing_df.drop('target' , axis = 1) Y = housing_df["target"] x_train, x_test, y_train, y_test = train_test_split(X , Y , test_size = 0.2)
Line 13: We create the X
label (features) by removing the target
column.
Line 14: We create the Y
label (output) by considering the target
column only.
Line 16: We divide our X
and Y
labels into train and test data using the train_test_split
function. We set the test data size to 0.2 (20% data of the whole dataset).
Now that we have split our data into train and test data, we can apply the Ridge regression model to predict the house values. The code is given below:
from sklearn.datasets import fetch_california_housing import pandas as pd from sklearn.model_selection import train_test_split from sklearn.linear_model import Ridge import numpy as np housing = fetch_california_housing() housing_df = pd.DataFrame(housing["data"] , columns=housing["feature_names"]) housing_df["target"]=housing["target"] print(housing_df.head()) np.random.seed(42) X = housing_df.drop('target' , axis = 1) Y = housing_df["target"] x_train, x_test, y_train, y_test = train_test_split(X , Y , test_size = 0.2) model = Ridge() model.fit(x_train , y_train) score = model.score(x_test , y_test) print("Accuracy of the model is: " + str(score*100) + "%")
Line 4: We import the Ridge
model from sklearn.linear_model
.
Line 19: We create an instance of the Ridge
model.
Line 20: We train the model on the training data using the fit
function.
Line 21: After training the model, we predict the house value by passing the test data. The score
function performs prediction and calculates the accuracy of the model.
Ridge regression is a powerful technique to regularize linear regression models when dealing with multicollinearity. Sklearn allows to implement Ridge regression simply and concisely. By following the steps discussed in this Answer, we can easily incorporate Ridge regression into our machine-learning projects.
Free Resources