Gini Impurity
Learn the math used by CART classification trees to define purity vs. impurity.
Impurity intuition
Like all machine learning algorithms, CART classification trees use math to learn from data. Before looking at the calculations used by CART classification trees, it’s helpful to understand the mathematics intuitively.
To keep things simple, consider the Adult Census Income dataset. This dataset presents a classification scenario with two possible label values: <=50K and >50K. A scenario with exactly two label values is known as binary classification.
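CART classification trees attempt to split the data into groups whose labels are as pure as possible. Purity and impurity form a spectrum, from a group that contains a single label value (pure) to a group with an even mix of label values (most impure), as illustrated below: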
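The spectrum can also be sketched in a few lines of code. The following is a minimal Python illustration rather than anything taken from the course materials: the three groups and their compositions are made up, and the helper name `label_proportions` is hypothetical. It shows how the share of each label value shifts as a group moves from pure to evenly mixed.

```python
from collections import Counter

# Hypothetical groups of labels (not real rows from the Adult Census
# Income dataset), ordered from pure to evenly mixed.
groups = {
    "pure": ["<=50K"] * 10,
    "mostly pure": ["<=50K"] * 8 + [">50K"] * 2,
    "evenly mixed": ["<=50K"] * 5 + [">50K"] * 5,
}


def label_proportions(labels):
    """Return the share of each label value within one group."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


for name, labels in groups.items():
    print(name, label_proportions(labels))
```

A group in which one label value has a share of 1.0 sits at the pure end of the spectrum. The closer the shares get to an even split, the more impure the group is.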
While the purity/impurity spectrum is intuitive to humans, the CART classification tree algorithm needs a calculation that expresses it in a standardized way. Several such calculations exist; this course uses the most common one, which is also the default in R: Gini impurity.
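As a preview of how a single number can stand in for the spectrum, here is a minimal Python sketch that assumes the standard textbook definition of Gini impurity: one minus the sum of the squared class proportions. The function name `gini_impurity` and the example proportions are hypothetical. R may arrange the computation differently internally, so treat this as an illustration rather than the course's implementation.

```python
def gini_impurity(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in proportions)


# Hypothetical class proportions as (<=50K, >50K) pairs,
# ordered from pure to evenly mixed.
spectrum = [(1.0, 0.0), (0.9, 0.1), (0.7, 0.3), (0.5, 0.5)]

for proportions in spectrum:
    print(proportions, round(gini_impurity(proportions), 2))
```

Under this definition, a binary group scores 0 when it is perfectly pure and 0.5 when the two label values are evenly mixed, so lower values mean purer groups.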
Gini impurity (in this course)
The Gini impurity calculation ...