Classification Tree Training Example

Learn how Gini impurity and Gini change are used to train a decision tree from a dataset.

The dataset

We’ll use the following hypothetical dataset for this lesson. The dataset is modeled on the Adult Census Income dataset, but the data is much simpler. The goal is to train a classification decision tree that accurately predicts income levels.

The dataset consists of three binary categorical features (college, union, and manager) and a binary label (income):

Hypothetical Data Sample

| College | Union | Manager | Income |
|---------|-------|---------|--------|
| no      | yes   | no      | >50K   |
| no      | yes   | no      | >50K   |
| no      | no    | no      | <=50K  |
| no      | no    | no      | <=50K  |
| no      | no    | no      | <=50K  |
| yes     | no    | yes     | >50K   |
| yes     | no    | yes     | >50K   |
| yes     | no    | yes     | >50K   |
| yes     | yes   | no      | <=50K  |
| yes     | yes   | no      | <=50K  |

The algorithm

The classification decision tree algorithm has the following steps:

  1. Calculate the Gini impurity of the parent node.

  2. For each available feature, calculate the Gini change.

  3. Split the tree on the feature with the highest Gini change from step 2.

  4. While features remain, repeat steps 1–3 for each split.

This algorithm is relatively simple because the data is designed to eliminate complexities. In the following two lessons, you’ll learn how the CART classification tree algorithm handles real-world data complexities.
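The four steps above can be sketched in a few lines of Python. The row encoding and the helper names (`gini`, `gini_change`, `best_split`) are illustrative assumptions, not part of the lesson:

```python
from collections import Counter

def gini(labels):
    # Step 1: Gini impurity = 1 minus the sum of squared class proportions
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_change(rows, feature, label="income"):
    # Step 2: parent impurity minus the weighted impurity of each child split
    parent = [r[label] for r in rows]
    change = gini(parent)
    for value in {r[feature] for r in rows}:
        child = [r[label] for r in rows if r[feature] == value]
        change -= len(child) / len(rows) * gini(child)
    return change

def best_split(rows, features):
    # Step 3: choose the feature with the highest Gini change
    return max(features, key=lambda f: gini_change(rows, f))

# The ten rows from the hypothetical data sample
rows = [
    {"college": "no",  "union": "yes", "manager": "no",  "income": ">50K"},
    {"college": "no",  "union": "yes", "manager": "no",  "income": ">50K"},
    {"college": "no",  "union": "no",  "manager": "no",  "income": "<=50K"},
    {"college": "no",  "union": "no",  "manager": "no",  "income": "<=50K"},
    {"college": "no",  "union": "no",  "manager": "no",  "income": "<=50K"},
    {"college": "yes", "union": "no",  "manager": "yes", "income": ">50K"},
    {"college": "yes", "union": "no",  "manager": "yes", "income": ">50K"},
    {"college": "yes", "union": "no",  "manager": "yes", "income": ">50K"},
    {"college": "yes", "union": "yes", "manager": "no",  "income": "<=50K"},
    {"college": "yes", "union": "yes", "manager": "no",  "income": "<=50K"},
]

print(best_split(rows, ["college", "union", "manager"]))  # manager
```

On this data the sketch selects manager, because splitting on it yields one perfectly pure child node (every manager earns >50K).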

Gini change of the college feature

The root node of the tree represents the first data split. According to the preceding algorithm, the first step is to calculate the Gini impurity of the parent node. In this case, the root node has all the observations. For the hypothetical data sample, there are five of each label.

Here’s the Gini impurity of all the data:

Gini(t) = 1 - \left(\frac{5}{10}\right)^2 - \left(\frac{5}{10}\right)^2 = 0.5
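This impurity can be computed directly in Python. A minimal sketch, where the `gini` helper name is an assumption:

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared label proportions
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# Root node: five >50K labels and five <=50K labels
root = [">50K"] * 5 + ["<=50K"] * 5
print(gini(root))  # 0.5
```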

Next, the algorithm requires the Gini change to be calculated for each feature. The algorithm is simple and proceeds through the features from left to right, starting with the college feature.

Here’s the subset of the original data applicable to the Gini change calculation for the college feature:

Data Subset

| College | Income |
|---------|--------|
| no      | >50K   |
| no      | >50K   |
| no      | <=50K  |
| no      | <=50K  |
| no      | <=50K  |
| yes     | >50K   |
| yes     | >50K   |
| yes     | >50K   |
| yes     | <=50K  |
| yes     | <=50K  |

Here’s the Gini change calculation:

Because college is a binary categorical feature, its Gini change is calculated from the yes and no values.

First, the proportions of college values are calculated as follows:

Proportion_{yes} = \frac{N(t_{yes})}{N} = \frac{5}{10} = 0.5

Proportion_{no} = \frac{N(t_{no})}{N} = \frac{5}{10} = 0.5
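These proportions weight the child impurities when the Gini change is computed. A minimal Python sketch of the full calculation for college, with the label lists transcribed from the subset table (variable names are illustrative):

```python
def gini(labels):
    # Gini impurity: 1 minus the sum of squared label proportions
    n = len(labels)
    return 1 - sum((labels.count(v) / n) ** 2 for v in set(labels))

# Income labels from the subset, split by the college value
no_college  = [">50K", ">50K", "<=50K", "<=50K", "<=50K"]  # college = no
yes_college = [">50K", ">50K", ">50K", "<=50K", "<=50K"]   # college = yes
parent = no_college + yes_college

# Parent impurity minus the proportion-weighted child impurities
change = gini(parent) - 0.5 * gini(no_college) - 0.5 * gini(yes_college)
print(round(change, 2))  # 0.02
```

Both child nodes have an impurity of 0.48, so splitting on college reduces the impurity only slightly, from 0.5 to 0.48.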