Using XGBoost with tidymodels
Build on your knowledge of tidymodels to use the XGBoost algorithm in your machine learning code.
Data preparation
The XGBoost algorithm only supports numeric data. For example, the R xgboost package doesn’t recognize R factors, including the ordering of factor levels. When using the recipes package to prepare data for use with xgboost, we have to follow these steps:
1. Prepare the training data according to best practices using dplyr (e.g., the mutate() function) and recipes functions (e.g., step_num2factor()).
2. Transform categorical predictive features into numeric representations using data preparation functions from the recipes package.
Note: This applies to the predictive features only.
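The two steps above can be sketched as follows. This is a minimal illustration, not the course's own code: the train data frame, its column names, and the level names passed to step_num2factor() are all made up for the example, and step_dummy() with one_hot = TRUE is one possible choice for the numeric transformation.

```r
library(dplyr)
library(recipes)

# Hypothetical training data (made-up values for illustration only).
train <- data.frame(
  age            = c(39, 50, 38, 53),
  education_code = c(1L, 2L, 3L, 2L),  # integer codes standing in for categories
  sex            = c("Male", "Female", "Male", "Female"),
  income         = c("<=50K", ">50K", "<=50K", ">50K")
) |>
  # Step 1 (dplyr): represent categorical columns, including the label, as factors.
  mutate(sex = factor(sex), income = factor(income))

xgb_recipe <- recipe(income ~ ., data = train) |>
  # Step 1 (recipes): turn the integer codes into a factor.
  # The level names are assumptions chosen for this sketch.
  step_num2factor(education_code,
                  levels = c("HighSchool", "Bachelors", "Masters")) |>
  # Step 2: convert categorical predictors (not the label) to 0/1 columns.
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

# Estimate the recipe on the training data and return the processed result.
baked <- xgb_recipe |> prep() |> bake(new_data = NULL)
```

After baking, every predictor column is numeric, while the income label is still a factor, which matches the note above.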
When performing classification, ensure the label is a factor. Each label value is automatically converted into a numeric representation. In the case of binary classification, the event_level parameter of the set_engine() function can be set to "first" or "second" to indicate which factor level is the positive outcome.
It's best practice to perform training data preparation independently of which machine learning algorithm might be used. For example, using R factors and data preparation functions from the recipes package allows for quickly moving between machine learning algorithms like CART, random forest, and XGBoost.
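One way to picture this is a single recipe shared across several model specifications. The sketch below assumes the workflows package and a made-up train data frame. It only builds the workflows, since fitting them would also require the rpart, ranger, and xgboost packages to be installed.

```r
library(recipes)
library(parsnip)
library(workflows)

# Hypothetical training data (made-up values for illustration only).
train <- data.frame(
  age    = c(39, 50, 38, 53),
  sex    = factor(c("Male", "Female", "Male", "Female")),
  income = factor(c("<=50K", ">50K", "<=50K", ">50K"))
)

# One recipe, written once, reused by every algorithm.
shared_recipe <- recipe(income ~ ., data = train) |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

# Three interchangeable model specifications.
cart_spec <- decision_tree(mode = "classification") |> set_engine("rpart")
rf_spec   <- rand_forest(mode = "classification") |> set_engine("ranger")
xgb_spec  <- boost_tree(mode = "classification") |> set_engine("xgboost")

# Swap the model while keeping the data preparation unchanged.
cart_wf <- workflow() |> add_recipe(shared_recipe) |> add_model(cart_spec)
rf_wf   <- cart_wf |> update_model(rf_spec)
xgb_wf  <- cart_wf |> update_model(xgb_spec)
```

Only the model specification changes between the three workflows, so the preparation logic lives in one place.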
Transforming categories to numbers
One-hot encoding is a data preparation technique for transforming categorical data into numeric representations. One-hot encoding works by creating a new feature for every distinct category and using a binary indicator to denote the category for the observation.
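To make the mechanism concrete, here is a small base R sketch that builds the indicator columns by hand. The workclass values are made up for illustration and are not the course's data sample. In a recipe, step_dummy() with one_hot = TRUE produces the same kind of columns.

```r
# Hypothetical categorical feature with three distinct categories.
workclass <- factor(c("Private", "State-gov", "Private", "Self-emp"))

# One new column per distinct category; each cell is 1 when the observation
# belongs to that category and 0 otherwise.
one_hot <- sapply(levels(workclass), function(lvl) as.integer(workclass == lvl))

one_hot
```

The result is a matrix with one row per observation and one column per category, and each row contains exactly one 1.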
Take, for example, the following sample of Adult Census Income data: