Regression with PyCaret
Let’s learn how to import necessary libraries and datasets for regression with PyCaret.
The linear regression model
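A fundamental task in supervised machine learning is regression, where the goal is to predict a continuous value. This is achieved by modeling the relationship between the target variable and the feature variables in a given dataset. One of the most basic regression models is linear regression. It is defined in the following equation:
$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i$$
The equivalent vectorized form of the equation is also provided, where the inner product of the transposed vector $\mathbf{x}_i^T$ and $\boldsymbol{\beta}$ is calculated, with $\mathbf{x}_i = (1, x_{i1}, \dots, x_{ip})^T$ and $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)^T$:
$$y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \varepsilon_i$$
- $y_i$ is the target variable for the $i$th instance of the given dataset.
- $x_{i1}$ to $x_{ip}$ are the feature variables.
- $\beta_0$ is the intercept term.
- $\beta_1$ to $\beta_p$ are the coefficients of the feature variables.
- $\varepsilon_i$ is the error variable.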
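To make the two forms of the linear model concrete, here is a minimal NumPy sketch. The data and coefficient values are made up for illustration, and the helper name `predict_linear` is hypothetical. It evaluates the noise-free part of the model once as an intercept plus a weighted sum of features, and once as a single matrix product after a column of ones is prepended to the features.

```python
import numpy as np


def predict_linear(X, beta0, beta):
    """Return beta0 + X @ beta for every row of X (the model without the error term)."""
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    return beta0 + X @ beta


# Hypothetical toy data: 3 instances, 2 feature variables each
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
beta0 = 0.5                    # intercept term (illustrative value)
beta = np.array([2.0, -1.0])   # feature coefficients (illustrative values)

# Summation form: intercept plus weighted sum of the features
y_sum = predict_linear(X, beta0, beta)

# Vectorized form: prepend a column of ones so the intercept
# becomes the first entry of the coefficient vector
X_aug = np.column_stack([np.ones(len(X)), X])
beta_aug = np.concatenate([[beta0], beta])
y_vec = X_aug @ beta_aug

print(y_sum)
print(y_vec)
```

Both computations give the same predictions, which is why the vectorized form is usually preferred in practice: it reduces to one matrix product.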
Regression methods in PyCaret
Besides linear regression, there are other regression models such as lasso, random forest, support vector machines, and gradient boosting. In the remaining lessons, we’ll see how PyCaret can help us choose and train the optimal regression model for a specific dataset. We’ll also learn about exploratory data analysis (EDA).
Importing the necessary libraries
First, we import the Python libraries that are necessary for our project.
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from pycaret.datasets import get_data
from pycaret.regression import *
mpl.rcParams['figure.dpi'] = 300
Some standard machine learning libraries are included, such as pandas, Matplotlib, and Seaborn. We also import all PyCaret functions that are related to regression. The last line specifies that Matplotlib figures will have a 300 DPI resolution, but we can omit that if we wish.
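As a small illustration of the DPI setting, the sketch below assumes the non-interactive `Agg` backend (chosen here only so that it runs without a display). It shows that figures created after the assignment pick up the new resolution, and that `mpl.rc_context` can override it temporarily if 300 DPI is not wanted for every figure.

```python
import matplotlib as mpl

mpl.use("Agg")  # assumption: headless backend so no display is needed
import matplotlib.pyplot as plt

# Global setting, as in the lesson's import block
mpl.rcParams['figure.dpi'] = 300

fig_high, _ = plt.subplots()
print(fig_high.dpi)  # figures created now use the global setting

# Temporary override for a single figure; the global value is restored afterwards
with mpl.rc_context({'figure.dpi': 100}):
    fig_low, _ = plt.subplots()

print(fig_low.dpi)
print(mpl.rcParams['figure.dpi'])

plt.close("all")
```

A higher DPI produces sharper figures at the cost of larger images, so it is a matter of preference rather than a requirement.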
Loading the dataset
Machine learning projects can only succeed if the appropriate data is available, so PyCaret includes a variety of datasets that can be used to test its features. In this chapter, we’ll use insurance.csv, a dataset that originates from the book Machine Learning with R by Brett Lantz. This is a health insurance dataset, where the features are various attributes including age, sex, body mass index (BMI), whether the person is a smoker or not, number of children, and US region. The dataset’s target variable is the billed charges for each individual. Real-world data is usually more complex, but working with so-called toy datasets will help us grasp the concepts and techniques before dealing with more difficult cases.
We use the PyCaret get_data() function to load the dataset into a pandas dataframe.
# Loading/Importing dataset
data = get_data('insurance')
As we can see, the output is equivalent to that of the pandas head() function, which returns the first five rows of the dataset. This gives us a first glimpse of the data we are working with.
We use the pandas info() function to examine some basic information about the dataset.
# Getting dataset info
data.info()
As we can see in the output, info() reports the number of rows and shows that none of the columns have null values. Furthermore, the data type of each column has been automatically inferred by the pandas library.
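The same inspection steps can be tried without downloading anything. The sketch below builds a tiny dataframe by hand; its rows are invented for illustration, and its column names are an assumption based on the features described above, not a copy of the real dataset. It then applies head(), info(), and an explicit null count.

```python
import pandas as pd

# Hypothetical sample: values are invented, column names are assumed
sample = pd.DataFrame({
    "age": [19, 45, 33, 52, 27, 61],
    "sex": ["female", "male", "male", "female", "male", "female"],
    "bmi": [27.9, 30.1, 22.7, 35.2, 24.0, 29.4],
    "children": [0, 2, 1, 3, 0, 1],
    "smoker": ["yes", "no", "no", "no", "yes", "no"],
    "region": ["southwest", "southeast", "northwest",
               "northeast", "southeast", "northwest"],
    "charges": [16000.0, 8200.0, 4500.0, 11900.0, 17500.0, 13100.0],
})

first_rows = sample.head()         # first five rows by default
n_rows, n_cols = sample.shape      # (rows, columns)
null_counts = sample.isna().sum()  # missing values per column

sample.info()                      # row count, non-null counts, inferred dtypes
print(null_counts)
```

Calling isna().sum() complements info() by giving the null count per column directly, which is convenient when a dataset has many columns.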