Controlling Complexity
Learn to control the complexity of CART decision trees via hyperparameters.
The CART hyperparameters in R
The classification and regression tree (CART) algorithm is available in R via the `rpart` package. CART trees can be specified in tidymodels by passing the value `"rpart"` to the `set_engine()` function.
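For example, a minimal sketch of such a specification (assuming the tidymodels meta-package is installed) looks like this:

```r
library(tidymodels)

# Specify a CART decision tree for classification,
# using rpart as the underlying engine.
tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_spec
```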
The `rpart` package supports many hyperparameters for controlling the complexity (i.e., tuning) of decision tree models as they are being built. Of these hyperparameters, the following are the most useful in practice:

- `minsplit`: the minimum number of observations that must exist in a node for a split to be attempted
- `minbucket`: the minimum number of observations allowed in any leaf node
The `rpart` package maintains a relationship between the above hyperparameters:

- If only a value for `minsplit` is provided, then `minbucket` is set to `minsplit / 3`.
- If only a value for `minbucket` is provided, then `minsplit` is set to `minbucket * 3`.
Given this relationship, it's common to tune only `minsplit` and allow `minbucket` to be set automatically, as in the sketch below.
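In tidymodels, a hedged sketch of this approach uses parsnip's `decision_tree()` arguments: `min_n` maps to rpart's `minsplit` when the engine is `"rpart"`, and marking it with `tune()` defers its value to a later tuning step (e.g., `tune_grid()`):

```r
library(tidymodels)

# min_n maps to rpart's minsplit when the engine is rpart.
# Tuning only min_n means minbucket is set automatically
# to minsplit / 3.
tree_spec <- decision_tree(min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")
```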
Controlling complexity
Conceptually, decision trees with more nodes are more complex than those with fewer nodes. Consider the nature of root and internal nodes within decision trees. These nodes represent rules the CART algorithm has learned from the training data, so a tree with more of these nodes has learned more patterns from the data and is therefore more complex.
Now consider the `minsplit` hyperparameter, which controls how many nodes can be built into the tree. In general, larger values of `minsplit` produce smaller, less complex trees; smaller values do the opposite, producing larger, more complex trees. The sketch below illustrates this effect.
Let’s make these abstract ideas concrete with an example. Take the following data sample from the Adult Census Income dataset:
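As a hedged sketch of pulling such a sample, assuming the data is available locally as `adult.csv` (an assumed file name, not something the lesson provides):

```r
library(readr)
library(dplyr)

# Assumed local copy of the Adult Census Income data.
adult <- read_csv("adult.csv")

# Peek at a small random sample of rows.
adult %>%
  slice_sample(n = 10)
```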