Making Predictions
Apply what you've learned about the random forest algorithm to make predictions on a test dataset.
Training a model
Using a sample of the Adult Census Income dataset, the following code trains a random forest on the training split to predict the income label from a collection of features:
#================================================================================================
# Load libraries - suppress messages
#
suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))

#================================================================================================
# Load the Adult Census Income dataset
#
adult_census <- read_csv("adult_census.csv", show_col_types = FALSE)

#================================================================================================
# Split the data into training and test sets
#
# It is best practice to set the seed for split reproducibility
set.seed(498798)
adult_split <- initial_split(adult_census, prop = 0.8, strata = "income")

# Create the training and test data frames
adult_train <- training(adult_split)
adult_test <- testing(adult_split)

#================================================================================================
# Use the entire dataset to define factor levels
#
# NOTE - The levels are not sorted as would normally happen when creating R factors
#
work_class_levels <- unique(adult_census$work_class)
education_levels <- unique(adult_census$education)
marital_status_levels <- unique(adult_census$marital_status)
occupation_levels <- unique(adult_census$occupation)
relationship_levels <- unique(adult_census$relationship)
race_levels <- unique(adult_census$race)
native_country_levels <- unique(adult_census$native_country)

#================================================================================================
# It is best practice to create character-based factors outside of a recipe
#
adult_train <- adult_train %>%
  mutate(work_class = factor(work_class, levels = work_class_levels),
         education = factor(education, levels = education_levels),
         marital_status = factor(marital_status, levels = marital_status_levels),
         occupation = factor(occupation, levels = occupation_levels),
         relationship = factor(relationship, levels = relationship_levels),
         race = factor(race, levels = race_levels),
         sex = factor(sex),
         native_country = factor(native_country, levels = native_country_levels),
         income = factor(income))
str(adult_train)

#================================================================================================
# Craft the recipe - recipes package
#
# The use of "~ ." tells tidymodels to use all other features to predict income
adult_recipe <- recipe(income ~ ., data = adult_train)

#================================================================================================
# Specify the algorithm - parsnip package
#
adult_model <- rand_forest(trees = 250) %>%
  set_engine("randomForest") %>%
  set_mode("classification")

#================================================================================================
# Set up the workflow - workflows package
#
adult_workflow <- workflow() %>%
  add_recipe(adult_recipe) %>%
  add_model(adult_model)

#================================================================================================
# Fit the model - parsnip package
#
# Setting seed for reproducibility
set.seed(54321)
adult_fit <- fit(adult_workflow, adult_train)

#================================================================================================
# Display the model's summarized output
#
adult_forest <- extract_fit_parsnip(adult_fit)
adult_forest
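One practical detail before scoring the held-out data: the factor conversions above are applied only to adult_train. The test split would typically receive the same conversions so its column types and factor levels match what the model saw during training, which is why the *_levels vectors are built from the full dataset. The snippet below is a minimal sketch under that assumption and is not part of the lesson's displayed code:

# Sketch - apply the same factor conversions to the test split, reusing the
# *_levels vectors and adult_test objects created above
adult_test <- adult_test %>%
  mutate(work_class = factor(work_class, levels = work_class_levels),
         education = factor(education, levels = education_levels),
         marital_status = factor(marital_status, levels = marital_status_levels),
         occupation = factor(occupation, levels = occupation_levels),
         relationship = factor(relationship, levels = relationship_levels),
         race = factor(race, levels = race_levels),
         sex = factor(sex),
         native_country = factor(native_country, levels = native_country_levels),
         income = factor(income))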
The training code above limits the forest to 250 trees to conserve computation time and memory, and it sets the random seed for reproducibility. The resulting random forest ensemble has the following predictive performance: ...
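With the workflow fitted, scoring the test split is a single predict() call; for a classification workflow it returns a tibble with a .pred_class column. The sketch below pairs those predictions with the true labels and computes a confusion matrix and accuracy using yardstick. The object name adult_preds is introduced here purely for illustration and does not appear in the lesson code:

# Sketch - predict income for the test split and attach the true labels
adult_preds <- predict(adult_fit, new_data = adult_test) %>%
  bind_cols(adult_test %>% select(income))

# Confusion matrix and overall accuracy on the held-out data (yardstick package)
conf_mat(adult_preds, truth = income, estimate = .pred_class)
accuracy(adult_preds, truth = income, estimate = .pred_class)

Because the income factor in the test split uses the same level order as in training, the truth and estimate columns line up correctly for the metric functions.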