Decision Tree Intuition
Build an intuitive understanding of the CART classification decision tree algorithm.
Trees are rules
A CART classification tree embodies rules that assign (predict) labels. The algorithm learns these rules from the combination of features and labels provided in the training data. Consider the following sample data from the Adult Census Income dataset:
Adult Census Income Data Sample
Occupation | Relationship | Income |
Adm-clerical | Not-in-family | <=50K |
Exec-managerial | Husband | <=50K |
Handlers-cleaners | Not-in-family | <=50K |
Handlers-cleaners | Husband | <=50K |
Prof-speciality | Wife | <=50K |
Exec-managerial | Wife | <=50K |
Other-service | Not-in-family | <=50K |
Exec-managerial | Husband | >50K |
Prof-speciality | Not-in-family | >50K |
Exec-managerial | Husband | >50K |
In this sample data, the occupation and relationship columns are the features, and the income column is the label.
Imagine a CART classification tree trained on this data. While it would be silly to do so in real life because there are only ten observations, it’s a valuable thought exercise to gain an intuitive understanding of the algorithm.
Using the data given above, the algorithm could learn the following rules expressed in R code:
if (occupation %in% c("Adm-clerical", "Handlers-cleaners", "Other-service")) {
  income <- "<=50K"
} else if (relationship == "Wife") {
  income <- "<=50K"
} else {
  income <- ">50K"
}
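To see how these rules behave on the ten sample observations, the following is a minimal sketch in base R. The data frame name adult_sample and the helper predict_income are hypothetical names introduced for illustration. The helper is a vectorized restatement of the rules above, not output from an actual CART implementation.

```r
# The ten observations from the sample table, typed in by hand.
adult_sample <- data.frame(
  occupation = c("Adm-clerical", "Exec-managerial", "Handlers-cleaners",
                 "Handlers-cleaners", "Prof-speciality", "Exec-managerial",
                 "Other-service", "Exec-managerial", "Prof-speciality",
                 "Exec-managerial"),
  relationship = c("Not-in-family", "Husband", "Not-in-family", "Husband",
                   "Wife", "Wife", "Not-in-family", "Husband",
                   "Not-in-family", "Husband"),
  income = c("<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K",
             ">50K", ">50K", ">50K"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the if / else if / else rules, applied to whole columns.
predict_income <- function(occupation, relationship) {
  ifelse(
    occupation %in% c("Adm-clerical", "Handlers-cleaners", "Other-service"),
    "<=50K",
    ifelse(relationship == "Wife", "<=50K", ">50K")
  )
}

adult_sample$predicted <- predict_income(adult_sample$occupation,
                                         adult_sample$relationship)

# Share of observations where the rules agree with the income label.
accuracy <- mean(adult_sample$predicted == adult_sample$income)
print(accuracy)

# Observations the rules get wrong.
print(adult_sample[adult_sample$predicted != adult_sample$income, ])
```

On this sample, the rules match nine of the ten labels. The one miss is the second observation (Exec-managerial, Husband, <=50K), which has the same feature values as two observations labeled >50K, so no rule built from these two features alone could separate them.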
How does the CART classification tree algorithm arrive at these rules?
Minimizing impurity
Machine learning algorithms work by trying to achieve an objective. In the case of CART classification trees, the objective is to minimize impurity. The Gini impurity calculation used by the algorithm is taught later in the course. For now, building intuition about impurity is sufficient.
While CART classification trees seek to minimize impurity, it’s more intuitive to think about the opposite: maximizing purity. Consider the Adult Census Income dataset. Each observation in the dataset has an income label that can take one of two possible values.
The CART ...
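As a rough way to build intuition about purity, the sketch below compares the ten-observation sample as a whole with the three groups produced by the rules from the first section. It rests on assumptions. The purity helper is hypothetical and measures purity informally as the share of the most common label in a group, which is a stand-in for intuition and not the Gini impurity calculation the algorithm uses. The group names "leaf 1" to "leaf 3" are also made up for illustration.

```r
# The ten observations from the sample table, typed in by hand.
adult_sample <- data.frame(
  occupation = c("Adm-clerical", "Exec-managerial", "Handlers-cleaners",
                 "Handlers-cleaners", "Prof-speciality", "Exec-managerial",
                 "Other-service", "Exec-managerial", "Prof-speciality",
                 "Exec-managerial"),
  relationship = c("Not-in-family", "Husband", "Not-in-family", "Husband",
                   "Wife", "Wife", "Not-in-family", "Husband",
                   "Not-in-family", "Husband"),
  income = c("<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K",
             ">50K", ">50K", ">50K"),
  stringsAsFactors = FALSE
)

# Informal purity (an assumption, not Gini): share of the most common label.
purity <- function(labels) {
  max(table(labels)) / length(labels)
}

# Assign each observation to the group its rule would place it in.
leaf <- ifelse(
  adult_sample$occupation %in%
    c("Adm-clerical", "Handlers-cleaners", "Other-service"),
  "leaf 1",
  ifelse(adult_sample$relationship == "Wife", "leaf 2", "leaf 3")
)

# Purity of the whole sample before any rule is applied.
overall_purity <- purity(adult_sample$income)
print(overall_purity)

# Purity of each group after the rules are applied.
leaf_purity <- tapply(adult_sample$income, leaf, purity)
print(leaf_purity)
```

Under this informal measure, the whole sample has a purity of 0.7 (seven of ten observations share the <=50K label). The first two groups contain a single label each, and the third group is three-quarters >50K, so every group is at least as pure as the sample it came from.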