The first steps of machine learning

Once a machine learning project starts, the first step is to split the data into training and test sets. Splitting the dataset is discussed later in the course.

The second step is to perform exploratory data analysis (EDA) only on the training data.
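The two steps above can be sketched in base R. This is a minimal illustration, not the course's own code: the built-in iris data set stands in for project data, and the 80/20 ratio, the seed, and the names train_idx, train_data, and test_data are all assumptions made for the example.

```r
# Hypothetical stand-in for project data: the built-in iris data set.
set.seed(123)  # fix the seed so the random split is reproducible

n <- nrow(iris)

# Step 1: split first. An 80/20 ratio is assumed here for illustration.
train_idx  <- sample(seq_len(n), size = round(0.8 * n))
train_data <- iris[train_idx, ]
test_data  <- iris[-train_idx, ]

# Step 2: explore the training rows only. test_data is never inspected.
summary(train_data)
table(train_data$Species)
```

Because the test rows are set aside before any summaries are computed, nothing learned during exploration can depend on them. In a tidymodels workflow, rsample::initial_split() together with training() and testing() would likely play the same role as the manual indexing shown here.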

A frequent error among new machine learning practitioners is to perform EDA on all the data before making the training / test split. Modeling decisions based on EDA of all the data can lead the model to make low-quality predictions on new / unseen data, which is to say that the model overfits.

This type of overfitting is caused by information leakage. Information leakage happens when test set information influences the training of a model. Strict adherence to machine learning best practices can prevent information leakage. Information leakage is covered later in the course.
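One common way leakage can creep in is through preprocessing statistics. The sketch below uses simulated data and hypothetical names (x, x_train, x_test), and contrasts a mean computed from every row with one computed from the training rows only. It is an illustrative assumption about where leakage might occur, not an example taken from the course.

```r
set.seed(42)  # reproducible simulated data

# Hypothetical numeric feature with 100 observations.
x <- rnorm(100, mean = 50, sd = 10)

# Split before computing any statistics (80 training rows assumed).
train_idx <- sample(seq_along(x), size = 80)
x_train   <- x[train_idx]
x_test    <- x[-train_idx]

# Leaky: this mean is computed from all rows, so it includes test information.
leaky_mean <- mean(x)

# Leak-free: statistics are estimated from the training rows only...
train_mean <- mean(x_train)
train_sd   <- sd(x_train)

# ...and then applied unchanged to both sets.
x_train_scaled <- (x_train - train_mean) / train_sd
x_test_scaled  <- (x_test - train_mean) / train_sd
```

The scaled test values will generally not have a mean of exactly zero, and that is expected: the test set is treated as new data that played no part in estimating the scaling parameters.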

EDA for machine learning

The purpose of EDA with the training data is threefold:

...
