Gini Impurity

Learn the math used by CART classification trees to define purity vs. impurity.

Impurity intuition

Like all machine learning algorithms, CART classification trees use math to learn from data. Before working through the calculations used by CART classification trees, it's helpful to build an intuition for what those calculations measure.

To keep things simple, consider the Adult Census Income dataset. Its label has two possible values: <=50K and >50K. A problem with exactly two label values is known as binary classification.

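CART classification trees attempt to split the data into groups whose labels are as pure as possible. A group is pure when all of its records share the same label and impure when its labels are mixed. Purity / impurity is a spectrum, as illustrated below: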

Adult Census Income purity / impurity spectrum

While the purity / impurity spectrum is quite intuitive to humans, the CART classification tree algorithm needs a calculation that expresses this spectrum as a number in a standardized way. Several such calculations are available. This course uses Gini impurity, the most common measure and the one used by default in R.
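To make the spectrum concrete, here is a minimal sketch that scores three groups of labels. It assumes the standard textbook definition of Gini impurity (1 minus the sum of the squared label proportions). The function name `gini_impurity` and the three groups are hypothetical, made up for illustration, and are not drawn from the actual Adult Census Income data. The sketch is written in Python purely to illustrate the idea.

```python
from collections import Counter


def gini_impurity(labels):
    """Standard Gini impurity: 1 minus the sum of squared label proportions."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Hypothetical groups of 10 records each (illustrative, not real census data).
pure = ["<=50K"] * 10                        # every record has the same label
mostly_pure = ["<=50K"] * 9 + [">50K"] * 1   # one record differs
mixed = ["<=50K"] * 5 + [">50K"] * 5         # an even split of both labels

for name, group in [("pure", pure), ("mostly pure", mostly_pure), ("mixed", mixed)]:
    print(f"{name}: {gini_impurity(group):.2f}")
# pure: 0.00
# mostly pure: 0.18
# mixed: 0.50
```

Under this definition, a score of 0 marks the pure end of the spectrum, and for two labels an even split gives the maximum score of 0.5.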

Gini impurity (in this course)

The Gini impurity calculation ...