Classification Tree Training Example
Learn how Gini impurity and Gini change are used to train a decision tree from a dataset.
The dataset
We’ll use the following hypothetical dataset for this lesson. It is modeled on the Adult Census Income dataset but is much simpler. The goal is to train a classification decision tree that accurately predicts income levels.
The dataset consists of three binary categorical features (college, union, and manager) and a binary label (income):
Hypothetical Data Sample
College | Union | Manager | Income |
no | yes | no | >50K |
no | yes | no | >50K |
no | no | no | <=50K |
no | no | no | <=50K |
no | no | no | <=50K |
yes | no | yes | >50K |
yes | no | yes | >50K |
yes | no | yes | >50K |
yes | yes | no | <=50K |
yes | yes | no | <=50K |
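For readers who want to follow along in code, here is a minimal sketch of the same ten observations as plain Python data. The names `COLUMNS`, `ROWS`, `data`, and `label_counts` are illustrative choices, not part of the lesson.

```python
from collections import Counter

# Column names and rows copied from the hypothetical data sample above.
COLUMNS = ("college", "union", "manager", "income")
ROWS = [
    ("no", "yes", "no", ">50K"),
    ("no", "yes", "no", ">50K"),
    ("no", "no", "no", "<=50K"),
    ("no", "no", "no", "<=50K"),
    ("no", "no", "no", "<=50K"),
    ("yes", "no", "yes", ">50K"),
    ("yes", "no", "yes", ">50K"),
    ("yes", "no", "yes", ">50K"),
    ("yes", "yes", "no", "<=50K"),
    ("yes", "yes", "no", "<=50K"),
]

# One dictionary per observation, keyed by column name.
data = [dict(zip(COLUMNS, row)) for row in ROWS]

# Count how many observations carry each income label.
label_counts = Counter(row["income"] for row in data)
print(dict(label_counts))
```

The label counts come out balanced, with five observations for each income level.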
The algorithm
The classification decision tree algorithm has the following steps:
1. Calculate the Gini impurity of the parent node.
2. For each available feature, calculate the Gini change.
3. Choose to split the tree on the feature with the highest Gini change from step 2.
4. While features remain, repeat steps 1–3 for each split.
This algorithm is relatively simple because the data is designed to eliminate complexities. In the following two lessons, you’ll learn how the CART classification tree algorithm handles real-world data complexities.
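The steps above can be sketched as a short recursive function. This is a simplified illustration under stated assumptions, not the CART implementation: Gini change is taken to be the parent impurity minus the size-weighted impurity of the child nodes, ties go to the first feature in the list, and a node that is already pure becomes a leaf early. The names `gini_impurity`, `gini_change`, and `build_tree` are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """One minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_change(rows, feature, label="income"):
    """Parent impurity minus the size-weighted impurity of the children."""
    parent = gini_impurity([r[label] for r in rows])
    weighted = 0.0
    for value in {r[feature] for r in rows}:
        child = [r[label] for r in rows if r[feature] == value]
        weighted += len(child) / len(rows) * gini_impurity(child)
    return parent - weighted

def build_tree(rows, features, label="income"):
    """Return a nested dict of splits, with majority labels at the leaves."""
    labels = [r[label] for r in rows]
    # Assumption: stop when no features remain or the node is already pure.
    if not features or gini_impurity(labels) == 0.0:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1-3: pick the feature with the highest Gini change.
    best = max(features, key=lambda f: gini_change(rows, f, label))
    remaining = [f for f in features if f != best]
    # Step 4: repeat on each side of the split.
    return {
        best: {
            value: build_tree([r for r in rows if r[best] == value], remaining, label)
            for value in sorted({r[best] for r in rows})
        }
    }

# The hypothetical data sample from this lesson.
COLUMNS = ("college", "union", "manager", "income")
ROWS = (
    [("no", "yes", "no", ">50K")] * 2
    + [("no", "no", "no", "<=50K")] * 3
    + [("yes", "no", "yes", ">50K")] * 3
    + [("yes", "yes", "no", "<=50K")] * 2
)
data = [dict(zip(COLUMNS, row)) for row in ROWS]

tree = build_tree(data, ["college", "union", "manager"])
print(tree)
```

The nested dictionary is only one convenient way to show the result; each key is a feature, each inner key is one of its values, and each string is a predicted label.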
Gini change of the college feature
The root node of the tree represents the first data split. According to the preceding algorithm, the first step is to calculate the Gini impurity of the parent node. In this case, the root node has all the observations. For the hypothetical data sample, there are five of each label.
Here’s the Gini impurity of all the data, with each label making up 5/10 of the observations:
Gini = 1 − (5/10)² − (5/10)² = 0.5
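A minimal Python sketch of this step, assuming the usual definition of Gini impurity (one minus the sum of squared class proportions). The names `gini_impurity` and `root_labels` are illustrative.

```python
from collections import Counter

def gini_impurity(labels):
    """One minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# The root node holds all ten observations: five of each label.
root_labels = [">50K"] * 5 + ["<=50K"] * 5

root_gini = gini_impurity(root_labels)
print(root_gini)  # prints 0.5
```

For two classes, 0.5 is the largest value Gini impurity can take, which reflects the evenly split labels at the root.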
Next, the algorithm calculates the Gini change for each feature. It proceeds through the features from left to right, starting with the college feature.
Here’s the subset of the original data applicable to the Gini change calculation for the college feature:
Data Subset
College | Income |
no | >50K |
no | >50K |
no | <=50K |
no | <=50K |
no | <=50K |
yes | >50K |
yes | >50K |
yes | >50K |
yes | <=50K |
yes | <=50K |
Here’s the Gini change calculation:
Because college is a binary categorical feature, its Gini change is calculated from the yes and no values.
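The calculation can also be reproduced in a few lines of Python. This sketch assumes Gini change is the parent impurity minus the proportion-weighted impurity of the yes and no groups; the variable names are illustrative.

```python
from collections import Counter

def gini_impurity(labels):
    """One minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# (college, income) pairs copied from the data subset above.
subset = [
    ("no", ">50K"), ("no", ">50K"),
    ("no", "<=50K"), ("no", "<=50K"), ("no", "<=50K"),
    ("yes", ">50K"), ("yes", ">50K"), ("yes", ">50K"),
    ("yes", "<=50K"), ("yes", "<=50K"),
]

# Impurity of the parent node (all ten observations).
parent_gini = gini_impurity([income for _, income in subset])

# Impurity of each college group, weighted by the group's share of the rows.
weighted_child_gini = 0.0
for value in ("no", "yes"):
    child = [income for college, income in subset if college == value]
    proportion = len(child) / len(subset)
    weighted_child_gini += proportion * gini_impurity(child)

college_gini_change = parent_gini - weighted_child_gini
print(f"{college_gini_change:.2f}")  # prints 0.02
```

Each college group holds half of the rows and has a 3-to-2 label split, so splitting on college reduces impurity only slightly.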
First, the proportions of college values are calculated as follows:
...
...