Regression with H2O

Learn about regression analysis and its implementation using the H2O package.

We'll cover the following...

What is regression analysis?
H2O-supported regression models
Import libraries and load data

What is regression analysis?

There are various types of regression models, such as linear regression, logistic regression (which, despite its name, is typically used for classification), and support vector regression. In recent years, advances in machine learning and big data have popularized more sophisticated tree-based regression models, such as random forests and gradient boosting machines. We choose the model type that fits the problem and the nature of the data to be analyzed.

H2O-supported regression models

The H2O library is well suited to supervised regression tasks. In addition to generalized linear models (GLMs), it provides algorithms such as random forests, gradient boosting machines, and deep neural networks (DNNs). We choose an algorithm based on the task at hand and the business requirements: some algorithms are preferred because they offer better predictive power, others because they train faster or are easier to explain.
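As a rough sketch of how these algorithm families map onto H2O's Python API, the snippet below creates one estimator per family. The estimator classes come from the `h2o.estimators` module; the helper name `build_regressors` and the hyperparameter values are illustrative assumptions, not settings taken from this lesson.

```python
def build_regressors(seed=42):
    """Return one untrained H2O estimator per algorithm family.

    The import is done lazily so this file can be loaded even when the
    h2o package (which also needs a Java runtime) is not installed.
    """
    from h2o.estimators import (
        H2ODeepLearningEstimator,
        H2OGeneralizedLinearEstimator,
        H2OGradientBoostingEstimator,
        H2ORandomForestEstimator,
    )

    # Hyperparameter values below are placeholders chosen for illustration.
    return {
        "glm": H2OGeneralizedLinearEstimator(family="gaussian"),
        "random_forest": H2ORandomForestEstimator(ntrees=50, seed=seed),
        "gbm": H2OGradientBoostingEstimator(ntrees=50, seed=seed),
        "dnn": H2ODeepLearningEstimator(hidden=[32, 32], epochs=10, seed=seed),
    }
```

Each estimator would then be fitted with `model.train(x=predictors, y=target, training_frame=train)`, where `train` is an `H2OFrame` and `predictors` and `target` are column names.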

H2O also supports automated machine learning (AutoML), which automates the end-to-end machine learning workflow. AutoML automatically trains and tunes numerous models, using various algorithms, within a user-specified time limit.
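A minimal sketch of such an AutoML run is shown here, assuming the data is available as a pandas DataFrame with a numeric target column. The function name `run_automl_regression`, the 80/20 split, the 60-second budget, and the choice of RMSE as the ranking metric are assumptions made for illustration.

```python
def run_automl_regression(df, target, max_runtime_secs=60, seed=42):
    """Train H2O AutoML on a pandas DataFrame; return (leader, leaderboard)."""
    # Imported lazily: requires the h2o package and a Java runtime.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # start or connect to a local H2O cluster
    frame = h2o.H2OFrame(df)

    # Hold out 20% of the rows to evaluate the best model afterwards.
    train, test = frame.split_frame(ratios=[0.8], seed=seed)
    predictors = [col for col in frame.columns if col != target]

    # With a numeric target, AutoML treats the task as regression.
    aml = H2OAutoML(
        max_runtime_secs=max_runtime_secs,  # user-specified time limit
        seed=seed,
        sort_metric="RMSE",  # rank the leaderboard by RMSE
    )
    aml.train(x=predictors, y=target, training_frame=train)

    performance = aml.leader.model_performance(test)
    print(aml.leaderboard.head())
    print("Hold-out RMSE:", performance.rmse())
    return aml.leader, aml.leaderboard
```

The leaderboard lists every model AutoML trained, ranked by the chosen metric, and `aml.leader` is the top-ranked model.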

We’ll now dive into the details and see how H2O AutoML can help us choose the best regression model. We’ll also perform exploratory data analysis (EDA) to better understand the dataset and gain statistical insights before training the model.

Import libraries and load data

Let’s import the necessary libraries to make ourselves familiar with the data. Here, we’re going to import pandas, NumPy, and Matplotlib. We have preloaded the dataset in the course content, and we’ll use it throughout the course.
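The import and loading step might look roughly like the following. Since the course's preloaded file is not shown here, this sketch reads a tiny in-memory CSV instead; the column names and all values in `SAMPLE_CSV` are made up for illustration and are not the real dataset.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for the course dataset: hypothetical columns, invented rows.
SAMPLE_CSV = """origin,destination,duration_hours,days_left,fare
NYC,LAX,6.1,30,320.0
NYC,CHI,2.5,7,210.5
SFO,SEA,2.1,14,150.0
BOS,MIA,3.4,2,410.0
"""


def load_fares(source):
    """Read a CSV file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


df = load_fares(io.StringIO(SAMPLE_CSV))
print(df.head())    # first rows
print(df.dtypes)    # column types inferred by pandas

# A log transform is a common way to inspect a skewed target.
log_fare = np.log1p(df["fare"])

# Quick look at the target distribution.
fig, ax = plt.subplots()
ax.hist(df["fare"], bins=5)
ax.set_xlabel("fare")
ax.set_ylabel("count")
plt.close(fig)
```

With the real dataset, `load_fares` would be called with the path to the CSV file provided in the course environment.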

Data quality and quantity have a direct impact on the accuracy and reliability of the results obtained from any machine learning model. A relevant and correct dataset is crucial to building a robust model that generalizes well. The data needs to be representative, sufficiently large, of high quality, and relevant to the problem; for linear regression models, the observations should also be independent.
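A few of these properties can be checked mechanically before training. The helper below is a small sketch of such a check using pandas; the function name `quality_report`, the selected indicators, and the toy DataFrame are assumptions for illustration, and they cover only basic hygiene (size, duplicates, missing values, constant columns), not representativeness or independence.

```python
import numpy as np
import pandas as pd


def quality_report(df):
    """Summarize basic data-quality indicators for a DataFrame."""
    return {
        "n_rows": len(df),
        "n_duplicate_rows": int(df.duplicated().sum()),
        "missing_per_column": df.isna().sum().to_dict(),
        # Columns with at most one distinct value carry no signal.
        "constant_columns": [
            col for col in df.columns if df[col].nunique(dropna=True) <= 1
        ],
    }


# Invented rows: one missing value, one duplicated row, one constant column.
toy = pd.DataFrame(
    {
        "duration_hours": [6.1, 2.5, np.nan, 2.5],
        "days_left": [30, 7, 14, 7],
        "fare": [320.0, 210.5, 150.0, 210.5],
        "currency": ["USD"] * 4,
    }
)

report = quality_report(toy)
print(report)
```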

For our regression task, we’ll use the airline fares dataset to build a model that predicts airfare based on departure and destination city, flight duration, the number of days until departure, and some additional features.
