Regression with H2O

Learn about regression analysis and its implementation using the H2O package.

What is regression analysis?

Regression analysis is a statistical method used to model the relationship between a dependent variable $y$ (the target variable) and one or more independent variables $x$ (the input features). The regression model finds the mathematical relationship that best predicts the dependent variable. This relationship is most commonly represented in the form of a linear equation:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$$

  • $y$: the target variable

  • $x_1, \dots, x_n$: the set of input features

  • $\beta_1, \dots, \beta_n$: the set of coefficients, one for each feature

  • $\beta_0$: the intercept term

  • $\epsilon$: the error term

[Figure: Linear regression]
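
As a minimal sketch of how the coefficients of such a linear equation can be estimated, the snippet below fits an ordinary least squares model with NumPy. The synthetic data, the "true" coefficients, and all variable names are assumptions made up for illustration; they are not taken from the airline fares dataset used later.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2 + 3*x1 - 1.5*x2 + noise
rng = np.random.default_rng(seed=0)
n_samples = 200
X = rng.normal(size=(n_samples, 2))             # input features x1, x2
true_beta = np.array([2.0, 3.0, -1.5])          # beta_0 (intercept), beta_1, beta_2
noise = rng.normal(scale=0.1, size=n_samples)   # error term epsilon
y = true_beta[0] + X @ true_beta[1:] + noise

# Prepend a column of ones so the intercept is estimated with the other coefficients
X_design = np.column_stack([np.ones(n_samples), X])

# Ordinary least squares: find beta minimizing ||X_design @ beta - y||^2
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("Estimated coefficients:", beta_hat)
```

Because the noise is small relative to the signal, the estimated coefficients should land close to the values used to generate the data.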

There are various types of regression models, such as linear regression, logistic regression, and support vector machines. Advances in machine learning and big data have also popularized more flexible models, such as decision trees, random forests, and gradient boosting machines. We choose the model type that fits the problem and the nature of the data to be analyzed.

H2O-supported regression models

H2O is well suited to supervised regression tasks. In addition to generalized linear models, it supports many other algorithms, such as random forests, gradient boosting machines, and deep neural networks (DNNs). We choose an algorithm based on the task at hand and the business requirements: some algorithms are preferred because they offer better predictive power, others because they run faster or are easier to explain.

H2O also supports automated machine learning (AutoML), which automates much of the end-to-end machine learning workflow. It automatically trains and tunes numerous models using various algorithms within a user-specified time limit.
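
The AutoML workflow described above could be sketched roughly as follows. This is an illustrative outline, not the course's own code: `run_automl` is a hypothetical helper, the 80/20 split and the default time limit are arbitrary choices, and running it requires the `h2o` package and a Java runtime. H2O treats the task as regression when the target column is numeric.

```python
def run_automl(df, target, max_runtime_secs=60, seed=1):
    """Train H2O AutoML on a pandas DataFrame; return the leaderboard and best model.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so that defining the helper doesn't require H2O to be installed
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()                      # start or connect to a local H2O cluster
    frame = h2o.H2OFrame(df)        # convert the pandas DataFrame to an H2OFrame
    features = [col for col in frame.columns if col != target]

    # Hold out 20% of the rows to evaluate the best model (arbitrary split)
    train, test = frame.split_frame(ratios=[0.8], seed=seed)

    # Train and tune models until the time limit is reached
    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=seed)
    aml.train(x=features, y=target, training_frame=train)

    print(aml.leaderboard.head())                 # models ranked by the default metric
    print(aml.leader.model_performance(test))     # hold-out metrics of the best model
    return aml.leaderboard, aml.leader
```

Calling the helper with a DataFrame and the name of its numeric target column would return the ranked leaderboard together with the top model.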

We’ll now see in detail how H2O AutoML can help us choose the best regression model. We’ll also perform exploratory data analysis (EDA) to better understand the dataset and get statistical insights before training the model.

Import libraries and load data

Let’s import the necessary libraries to make ourselves familiar with the data. Here, we’re going to import pandas, NumPy, and Matplotlib. We have preloaded the dataset in the course content, and we’ll use it throughout the course.

import pandas as pd # for dataframe operations
import numpy as np # for vector operations
import matplotlib as mpl # for creating visualizations
import matplotlib.pyplot as plt # for plotting

Data quality and quantity have a direct impact on the accuracy and reliability of the results obtained from any machine learning model. Having a relevant and correct dataset is crucial to building a robust machine learning model that generalizes well. The data should be representative, sufficiently large, of high quality, and relevant to the problem. For linear regression models, the observations should also be independent of one another.

For our regression task, we’ll use the airline fares dataset to build a model that predicts airfare based on departure and destination city, flight duration, the number of days until departure, and some additional features.

# Reading dataset and checking some samples
data = pd.read_csv(filepath+'Airline_Fares.csv.zip', compression='zip')
data.head()

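Here, line 2 loads the airline fares dataset using the read_csv function from the pandas library. The compression='zip' argument tells pandas that the CSV file is stored inside a zip archive. To get a quick glimpse of the data, line 3 calls head(), which returns the first five rows of the dataset.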

Now, we’ll use another pandas method, info(), to check basic information about the data types and missing values in our dataset.

# Checking dataset info
data.info()

Our dataset has 300,153 rows and 11 columns, and there are no missing values. Also, we can see that the pandas library has already inferred the data type of each feature.
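
The facts that info() reports (shape, missing values, and inferred data types) can also be checked programmatically. The sketch below does this on a tiny made-up DataFrame; its column names and values are assumptions for illustration and are not the real columns of the airline fares dataset.

```python
import numpy as np
import pandas as pd

# Tiny made-up DataFrame (illustrative columns, not the real dataset's)
sample = pd.DataFrame({
    "source_city": ["Delhi", "Mumbai", "Delhi", None],
    "duration": [2.17, 2.33, np.nan, 2.25],
    "days_left": [1, 1, 2, 3],
    "price": [5953, 5956, 5955, 5960],
})

n_rows, n_cols = sample.shape              # number of rows and columns
missing_per_column = sample.isna().sum()   # count of missing values per column
dtypes = sample.dtypes                     # data type pandas inferred per column

print(f"{n_rows} rows, {n_cols} columns")
print(missing_per_column)
print(dtypes)
```

On the real dataset, the same three checks can be applied to data directly. A column whose missing-value count is nonzero would need attention before training.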