Engineering Features for Decision Trees
Using your knowledge of decision trees, learn the fundamentals of engineering the best features for your models.
The best features for decision trees
Feature engineering is an iterative, creative process. The best features result from combining business domain knowledge with technical knowledge of the decision tree algorithm. From the algorithmic perspective, the following are critical for engineering the best decision tree features:
The best categorical features produce the purest splits of the data (i.e., the cleanest decision boundaries). This is especially true when multiple categorical features are used together.
Avoid categorical features with many categories, or levels (e.g., more than 30). Because such features offer many candidate splits, the algorithm tends to prefer them, which often leads to overfitting.
A particular class of many-level categorical feature contains "unique-like" data, where nearly every row has its own level. Examples include database ID columns, government identifiers, and timestamps. Avoid these features, or engineer the uniqueness out of the data.
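The danger of unique-like features can be demonstrated directly. The following is a minimal sketch in Python with scikit-learn (the lesson itself works in R); the data is invented for illustration. A tree fit on a row-unique ID column achieves perfectly pure splits on the training data while learning nothing that generalizes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 200

# Hypothetical dataset: a unique-like ID column and labels with no real signal.
ids = np.arange(n).reshape(-1, 1)   # one level per row, like a database ID
y = rng.integers(0, 2, size=n)      # random 0/1 labels

tree = DecisionTreeClassifier().fit(ids, y)
print(tree.score(ids, y))           # 1.0 -- every training split is perfectly "pure"

# On IDs never seen in training, the memorized splits are meaningless:
new_ids = np.arange(n, 2 * n).reshape(-1, 1)
print(tree.score(new_ids, rng.integers(0, 2, size=n)))  # roughly chance level
```

The perfect training score is exactly the trap: purity achieved by memorizing row identities, not by capturing a real pattern.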
In general, numeric features don’t require preprocessing when used with decision trees. Because splits depend only on the ordering of values, the CART algorithm is unaffected by monotonic transformations (e.g., taking the square root of a numeric feature).
Engineering new numeric features derived from two or more numeric features (e.g., creating a ratio of two numeric features) can help the algorithm find more effective decision boundaries.
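To see why a ratio feature can help, consider a class boundary that follows a diagonal line. Decision trees split one feature at a time, so axis-aligned splits on the raw features only approximate the diagonal, while a single split on the ratio captures it exactly. The following is an illustrative sketch in Python with scikit-learn on synthetic data (the feature names and thresholds are assumptions, not from the lesson):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(1, 10, size=(n, 2))
y = (X[:, 0] / X[:, 1] > 1).astype(int)   # true boundary is the diagonal x1 = x2

# Shallow tree on the raw features: axis-aligned splits fit the diagonal poorly.
shallow_raw = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Same-depth budget, but on an engineered ratio feature.
ratio = (X[:, 0] / X[:, 1]).reshape(-1, 1)
shallow_ratio = DecisionTreeClassifier(max_depth=1, random_state=0).fit(ratio, y)

print(shallow_raw.score(X, y))          # imperfect -- staircase approximation
print(shallow_ratio.score(ratio, y))    # 1.0 -- one split on the ratio is exact
```

The engineered feature lets the tree express the true boundary with a single split rather than a deep staircase of rectangles.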
Irrespective of the algorithm used, feature engineering is a process of experimentation. Features are brainstormed based on business domain knowledge and then vetted to see if they are likely to improve the model’s decision boundaries. In the case of the CART algorithm, data visualization is a very effective technique for vetting features.
Vetting features with data visualization
The decision boundaries produced by the decision tree algorithm have a rectangular geometry. This rectangular geometry makes the ggplot2 package uniquely suited for vetting the quality of features using data visualization. For example, the following code uses domain knowledge of the Titanic dataset to vet the combination of the Pclass and Sex features:
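The lesson's ggplot2 listing is not reproduced here. The same vetting idea can be sketched in Python with pandas: compute the survival rate within each Pclass-by-Sex cell, and look for cells that are nearly all survivors or all non-survivors, since pure cells suggest the feature combination will produce pure splits. The mini-DataFrame below is invented for illustration and does not use the real Titanic counts:

```python
import pandas as pd

# Hypothetical mini-sample shaped like the Titanic data (values are made up).
df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0],
    "Pclass":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2],
    "Sex":      ["female", "male", "male", "female", "female", "male",
                 "male", "male", "female", "female", "male", "female"],
})

# Survival rate within each Pclass x Sex cell.
rates = df.pivot_table(index="Pclass", columns="Sex",
                       values="Survived", aggfunc="mean")
print(rates)
```

In a plotting workflow, the same table would typically be drawn as a bar chart faceted by one of the two features (in ggplot2 terms, a geom_bar with a facet per level), so near-pure cells stand out visually.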