Testing the Titanic Dataset
Apply what you've learned about the random forest algorithm to the Titanic test dataset.
Profiling the Titanic test dataset
It’s best practice to profile the test dataset before making predictions for final testing. The only goal of this profiling is to uncover potential problems in preparing the test dataset for predictions. Any time the test dataset is examined, there is a risk of information leakage, so we have to be careful that nothing we see here feeds back into training or tuning the model.
The following code uses the skimr package to profile the Titanic test dataset:
#================================================================================================
# Load libraries - suppress messages
#
suppressMessages(library(tidyverse))
library(skimr)

#================================================================================================
# Load the Titanic test data
#
titanic_test <- read_csv("titanic_test.csv", show_col_types = FALSE)

#================================================================================================
# Use the skimr package to get a first pass of the data
#
skim(titanic_test)
When profiling the test dataset for potential preparation issues, focus on the features that are used in the predictive model and look for the following:
Are there any missing features?
Are there any missing feature values?
Do the levels match the training data for features transformed to be factors?
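The three checks above can also be sketched in base R. This is a minimal illustration, not the lesson's own code: `check_test_data` is a hypothetical helper, and the toy data frames and the feature names in them are assumptions standing in for the real training and test data.

```r
# Hypothetical helper: compares a test data frame against the training data
# for the three issues listed above. Uses base R only.
check_test_data <- function(train, test, features) {
  # 1. Features the model needs that are absent from the test data
  missing_features <- setdiff(features, names(test))
  present <- intersect(features, names(test))

  # 2. Number of missing (NA) values for each feature that is present
  na_counts <- vapply(present, function(f) sum(is.na(test[[f]])), integer(1))

  # 3. Test values that are not levels of the corresponding training factor
  is_train_factor <- vapply(present, function(f) is.factor(train[[f]]), logical(1))
  factor_feats <- present[is_train_factor]
  unseen_levels <- lapply(
    setNames(factor_feats, factor_feats),
    function(f) setdiff(unique(na.omit(as.character(test[[f]]))), levels(train[[f]]))
  )

  list(missing_features = missing_features,
       na_counts = na_counts,
       unseen_levels = unseen_levels)
}

# Toy data (assumed for illustration only, not the real Titanic files)
train <- data.frame(
  Pclass   = factor(c("1", "2", "3")),
  Sex      = factor(c("male", "female", "male")),
  Embarked = factor(c("S", "C", "S")),
  Fare     = c(7.25, 71.28, 8.05)
)
test <- data.frame(
  Pclass   = c("3", "1"),
  Sex      = c("female", NA),   # one missing value
  Embarked = c("Q", "S"),       # "Q" never appears in the toy training data
  stringsAsFactors = FALSE      # note: no Fare column at all
)

result <- check_test_data(train, test, c("Pclass", "Sex", "Embarked", "Fare"))
print(result)
```

With the real data, `train` and `test` would be the prepared training and test data frames, and `features` the columns the random forest was trained on. Any non-empty entry in the result points to a preparation step that needs fixing before predicting.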
Missing features are typically the result of an error in the code that creates the test dataset—for example, forgetting to select a particular ...