Using XGBoost with tidymodels

Build on your knowledge of tidymodels to use the XGBoost algorithm in your machine learning code.

Data preparation

The XGBoost algorithm only supports numeric data. For example, the R xgboost package doesn't recognize R factors, including the ordering of factor levels. When using the recipes package to prepare data for use with xgboost, we have to follow these steps:

  1. Prepare the training data according to best practices using dplyr functions (e.g., mutate()) and recipes functions (e.g., step_num2factor()).

  2. Transform categorical predictive features into numeric representations using data preparation functions from the recipes package.

Note: This applies to the predictive features only.
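The two steps above might be sketched as follows. This is an illustrative example, not code from the original lesson: the column names, factor levels, and data values are made up.

```r
library(dplyr)
library(recipes)

# Illustrative training data: a numeric feature, an integer-coded
# categorical feature, and a factor label
train <- tibble(
  age       = c(25, 38, 52),
  workclass = c(1, 2, 1),    # integer codes for a categorical feature
  income    = factor(c("<=50K", ">50K", "<=50K"))
)

xgb_recipe <- recipe(income ~ ., data = train) %>%
  # Step 1: convert the integer codes into a labeled factor
  step_num2factor(workclass, levels = c("Private", "Self-emp")) %>%
  # Step 2: transform categorical predictors into numeric 0/1 columns
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

# prep() estimates the steps from the training data; bake() applies them
bake(prep(xgb_recipe), new_data = NULL)
```

Note that the label (income) is left as a factor; only the predictive features are one-hot encoded.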

When performing classification, ensure the label is a factor; each label value is automatically converted into a numeric representation. In the case of binary classification, the event_level parameter of the set_engine() function can be set to first or second to indicate which factor level represents the positive outcome.
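A binary classification specification using event_level might look like this sketch (the trees value is arbitrary, chosen only for illustration):

```r
library(parsnip)

# Treat the second factor level of the label as the positive outcome
xgb_spec <- boost_tree(trees = 100) %>%
  set_engine("xgboost", event_level = "second") %>%
  set_mode("classification")
```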

It's best practice to perform training data preparation independently of which machine learning algorithm might be used. For example, using R factors and data preparation functions from the recipes package allows for quickly moving between machine learning algorithms like CART, random forest, and XGBoost.
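For instance, moving between algorithms can be as simple as swapping the model specification while reusing the same recipe. A sketch, where my_recipe stands in for whatever recipe was built during data preparation:

```r
library(parsnip)

# Two interchangeable model specifications
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

xgb_spec <- boost_tree(trees = 500) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# The same recipe plugs into either specification via a workflow, e.g.:
# workflow() %>% add_recipe(my_recipe) %>% add_model(xgb_spec)
```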

Transforming categories to numbers

One-hot encoding is a data preparation technique for transforming categorical data into numeric representations. One-hot encoding works by creating a new feature for every distinct category and using a binary indicator to denote the category for the observation.
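A minimal illustration of one-hot encoding with recipes; the color column and its values are invented for the example:

```r
library(recipes)
library(tibble)

df <- tibble(color = factor(c("red", "green", "blue")))

rec <- recipe(~ color, data = df) %>%
  # one_hot = TRUE keeps a column for every level, not n - 1
  step_dummy(color, one_hot = TRUE)

bake(prep(rec), new_data = NULL)
# One new column per distinct category (color_blue, color_green,
# color_red), with a binary 1 marking each row's category
```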

Take, for example, the following sample of Adult Census Income data:
