Preparing the Test Dataset
Learn how to split data into training and test datasets, and then prepare the test dataset for predictions.
Splitting the data
The first step of any machine learning project is splitting the data into training and test datasets. The training dataset is used throughout model development, including exploratory data analysis (EDA), feature engineering, training, and tuning. The test dataset is held back until the end of the project, where it serves as the final check of a machine learning model’s prediction quality.
The rsample package offers the initial_split(), training(), and testing() functions for splitting data. The following code demonstrates their use with the Adult Census Income dataset:
#================================================================================================
# Load libraries - suppress messages
#
suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))

#================================================================================================
# Load the Adult Census Income dataset
#
adult_census <- read_csv("adult_census.csv", show_col_types = FALSE)

#================================================================================================
# Split the Adult Census Income dataset into training and test datasets
#
# It is best practice to set the seed for split reproducibility
set.seed(498798)
adult_split <- initial_split(adult_census, prop = 0.8, strata = "income")

# Create the training and test data frames
adult_train <- training(adult_split)
adult_test <- testing(adult_split)

# Inspect the structure of both data frames
str(adult_train)
str(adult_test)
The initial_split() function randomly splits data into training and test datasets. As this data split is only performed once at the beginning of the project, it’s best practice to set the random seed via the set.seed() function to allow for reproducibility.
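As a quick way to see what reproducibility means here, the following minimal sketch repeats a split with the same seed and compares the results. It makes two assumptions that are not part of the lesson: the built-in iris data frame stands in for the Adult Census Income data so the code runs without a CSV file, and make_split() is a hypothetical helper written only for this illustration.

```r
# Minimal sketch: the same seed should produce the same split.
# Assumption: iris stands in for adult_census so the example is self-contained.
suppressMessages(library(rsample))

# Hypothetical helper: set the seed, then perform a stratified 80/20 split
make_split <- function(seed) {
  set.seed(seed)
  initial_split(iris, prop = 0.8, strata = Species)
}

split_a <- make_split(498798)
split_b <- make_split(498798)

# Extract the training and test data frames from each split
train_a <- training(split_a)
train_b <- training(split_b)
test_a  <- testing(split_a)

# TRUE when both runs placed exactly the same rows in the training dataset
same_rows <- identical(train_a, train_b)
print(same_rows)

# Every row lands in exactly one of the two datasets
print(nrow(train_a) + nrow(test_a) == nrow(iris))
```

Calling set.seed() with a different value, or not calling it at all, would generally place different rows in the training dataset, which is why the seed is fixed before the one-time split.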
In the call to initial_split(), the prop parameter is set to use 80 ...