Regression with H2O
Learn about regression analysis and its implementation using the H2O package.
What is regression analysis?
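Regression analysis is a statistical method used to model the relationship between a dependent variable and one or more independent variables. In its linear form, the model can be written as:

$$y = \beta_0 + X\beta + \varepsilon$$

Here, $y$ is the target variable, $X$ is the set of input features, $\beta$ is the set of coefficients for each feature, $\beta_0$ is the intercept term, and $\varepsilon$ is the error term.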
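To make the roles of the intercept, coefficients, and error term concrete, here is a minimal sketch that fits a linear regression by ordinary least squares with NumPy. The data is synthetic, and the "true" values (an intercept of 10.0 and coefficients of 3.0 and -1.5) are assumptions chosen purely for illustration. This isn't how H2O fits its models.

```python
import numpy as np

# Synthetic data (hypothetical): 200 samples with 2 input features.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 2))
true_coefs = np.array([3.0, -1.5])        # assumed coefficients
true_intercept = 10.0                     # assumed intercept
noise = rng.normal(scale=0.1, size=200)   # plays the role of the error term
y = true_intercept + X @ true_coefs + noise

# Prepend a column of ones so the intercept is estimated with the coefficients.
X_design = np.column_stack([np.ones(len(X)), X])
estimates, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept_hat, coefs_hat = estimates[0], estimates[1:]

# Residuals are the part of y that the fitted model does not explain.
residuals = y - X_design @ estimates

print("Estimated intercept:", round(float(intercept_hat), 2))
print("Estimated coefficients:", np.round(coefs_hat, 2))
print("Residual standard deviation:", round(float(residuals.std()), 3))
```

The estimates should land close to the assumed values because the noise is small relative to the signal. With real data, the true coefficients are unknown, and the residuals are our only view of the error term.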
There are various types of regression models, such as linear regression, logistic regression, and support vector machines. In recent years, advances in machine learning and big data have led to the development of more sophisticated regression models like decision trees, random forests, and gradient boosting machines. We choose a specific type of regression model to fit the problem and the nature of the data to be analyzed.
H2O-supported regression models
The H2O module is well suited to supervised regression tasks. In addition to generalized linear models, it provides algorithms such as random forests, gradient boosting machines, and deep neural networks (DNNs). We choose an algorithm based on the task at hand and the business requirements. Some algorithms are preferred because they offer better predictive power or a faster runtime, or because they are easier to explain.
H2O also supports automated machine learning (AutoML), which is useful for automating the end-to-end machine learning workflow. AutoML automatically trains and tunes numerous models, using various algorithms, within a user-specified time limit.
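As a rough preview of what this looks like in code, the sketch below wraps an AutoML run in a function. The function names, the 80/20 split, and the default 60-second time limit are assumptions made for illustration, not the course's own setup. It relies on the `h2o` package and a Java runtime being available, so the imports are deferred until the function is called.

```python
def feature_columns(columns, target):
    """Return every column name except the target (hypothetical helper)."""
    return [name for name in columns if name != target]


def run_automl_regression(df, target, max_runtime_secs=60, seed=42):
    """Train H2O AutoML on a pandas DataFrame and return the AutoML object."""
    # Imported lazily because h2o starts (or connects to) a local cluster.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    frame = h2o.H2OFrame(df)  # convert the pandas DataFrame to an H2OFrame

    # Hold out 20% of the rows to rank models on the leaderboard (assumed split).
    train, holdout = frame.split_frame(ratios=[0.8], seed=seed)

    # A numeric target column makes AutoML treat the task as regression.
    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=seed, sort_metric="RMSE")
    aml.train(
        x=feature_columns(frame.columns, target),
        y=target,
        training_frame=train,
        leaderboard_frame=holdout,
    )
    return aml
```

A call such as `aml = run_automl_regression(data, target="price")` (where `"price"` is a placeholder for the actual target column name) would return an object whose `aml.leaderboard` ranks the trained models and whose `aml.leader` is the best one by the chosen metric.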
We’ll now dive into the details and see how H2O AutoML can help us choose the best regression model. We’ll also perform exploratory data analysis (EDA) to better understand the dataset and get statistical insights before training the model.
Import libraries and load data
Let’s import the necessary libraries to make ourselves familiar with the data. Here, we’re going to import pandas, NumPy, and Matplotlib. We have preloaded the dataset in the course content, and we’ll use it throughout the course.
import pandas as pd # for dataframe operations
import numpy as np # for vector operations
import matplotlib as mpl # for creating visualizations
import matplotlib.pyplot as plt # for plotting
Data quality and quantity have a direct impact on the accuracy and reliability of the results obtained from any machine learning model. Having a relevant and correct dataset is crucial to building a robust machine learning model that generalizes well. The data needs to be representative, large enough, of high quality, relevant to the problem, and, for linear regression models, made up of independent observations.
For our regression task, we’ll use the airline fares dataset to build a model that predicts airfare based on departure and destination city, flight duration, the number of days until departure, and some additional features.
# Reading dataset and checking some samples
data = pd.read_csv(filepath+'Airline_Fares.csv.zip', compression='zip')
data.head()
Here, we’re loading the airline fares dataset using the read_csv function from the pandas library on line 2. To get a quick glimpse of the data, line 3 prints the first five rows of the dataset.
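Because the course dataset is preloaded, the following sketch shows the same read_csv pattern on a tiny, made-up file so that it can run anywhere. The filename, column names, and values are all hypothetical. Note that pandas expects a ZIP archive passed to read_csv to contain a single data file.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Toy CSV content; columns and values are invented for illustration.
csv_text = "origin,destination,fare\nA,B,100\nB,C,150\nC,A,120\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "toy_fares.csv.zip")

    # Write a ZIP archive that holds exactly one CSV file.
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("toy_fares.csv", csv_text)

    # Read the CSV straight from the archive, as in the lesson's call.
    toy = pd.read_csv(zip_path, compression="zip")

print(toy.head())
```

Passing compression='zip' explicitly is optional when the path ends in .zip, since pandas infers the compression from the file extension by default.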
Now, we’ll use another pandas method, info(), to check basic information about the data types and missing values in our dataset.
# Checking dataset info
data.info()
Our dataset has 300,153 rows and 11 columns, and there are no missing values. Also, we can see that the pandas library has already inferred the data type of each feature.
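The same facts that info() reports can also be checked programmatically, which is handy when we want to assert them rather than read them off the screen. The sketch below uses a small, made-up DataFrame (with one deliberately missing value) rather than the airline fares data, so the column names and values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame; the single NaN is planted to show how missing values surface.
toy = pd.DataFrame(
    {
        "origin": ["A", "B", "C", "A"],
        "stops": [0, 1, 1, 2],
        "fare": [100.0, 150.0, np.nan, 120.0],
    }
)

n_rows, n_cols = toy.shape             # number of rows and columns
missing_per_column = toy.isna().sum()  # count of missing values per column
dtypes = toy.dtypes                    # data type inferred for each column

print(f"{n_rows} rows, {n_cols} columns")
print(missing_per_column)
print(dtypes)
```

On the course dataset, the equivalent check `data.isna().sum().sum() == 0` would confirm in a single expression that there are no missing values.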