One of the key concepts in decision trees is the calculation of impurity to determine the heterogeneity (mixedness) of a dataset. One common impurity measure is Gini impurity, which measures the probability of incorrectly classifying a randomly chosen element from the dataset if it were labeled according to the dataset's class distribution. Its range is from $0$ (a perfectly pure subset) to $1 - \frac{1}{K}$ for $K$ classes ($0.5$ for a two-class dataset). Lower Gini impurity values indicate a purer subset. The formula for Gini impurity is given below:

$$G = 1 - \sum_{i=1}^{K} p_i^2$$

Here, $G$ is the Gini impurity, $K$ is the number of classes in the dataset, and $p_i$ is the probability of an element in the given dataset belonging to class $i$.
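For example, for a dataset in which 40% of the elements belong to class 0 and 60% belong to class 1 (the same mix as the y_impure array used later in this lesson), the Gini impurity works out to:

$$G = 1 - (0.4^2 + 0.6^2) = 1 - (0.16 + 0.36) = 0.48$$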
Gini impurity is a crucial splitting criterion for decision tree algorithms for several reasons:
Simplicity: Gini impurity is easy to understand and compute, making it a popular choice in decision tree algorithms.
Efficiency: Calculating Gini impurity is computationally efficient, especially compared to alternatives like entropy, because it avoids per-class logarithm computations (a short comparison sketch follows this list).
Robustness: Gini impurity is comparatively insensitive to class distribution imbalances, which makes it a reasonable choice for datasets with unequal class frequencies.
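To illustrate the efficiency point, here is a minimal sketch (illustrative code, not part of the lesson's snippet below) that computes both Gini impurity and entropy from the same class probabilities; entropy needs a logarithm for every nonzero class probability, while Gini only needs squares and a sum:

```python
import numpy as np

def gini(prob):
    # Gini: 1 minus the sum of squared class probabilities
    return 1 - np.sum(prob**2)

def entropy(prob):
    # Entropy: requires a logarithm for every nonzero class probability
    prob = prob[prob > 0]  # avoid log(0)
    return -np.sum(prob * np.log2(prob))

prob = np.array([0.4, 0.6])
print(f'gini    = {gini(prob):.4f}')     # 0.4800
print(f'entropy = {entropy(prob):.4f}')  # 0.9710
```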
Let’s define the Gini impurity function and calculate the impurity for two different arrays.
```python
import numpy as np

def gini_impurity(y):
    classes, counts = np.unique(y, return_counts=True)  # unique labels and their counts
    prob = counts / len(y)                              # relative frequency of each class
    impurity = 1 - np.sum(prob**2)                      # Gini formula: 1 - sum(p_i^2)
    return impurity

y_pure = [0, 0, 0, 0, 0]
y_impure = [0, 0, 1, 1, 1]

print(f'gini impurity of a dataset with single class = {gini_impurity(y_pure)}')
print(f'gini impurity of a dataset with a mix of two classes = {gini_impurity(y_impure)}')
```
Line 3: We define a function called gini_impurity that takes one argument, y, which is a list of labels or classes. This function calculates the Gini impurity of a dataset.
Line 4: Inside the function, we use NumPy's np.unique function to get the unique classes in the y array and their corresponding counts. classes contains the unique class labels, and counts contains the number of occurrences of each class.
Line 5: We calculate the probability of each class by dividing the counts by the total number of elements in y. This gives us the relative frequency of each class in the dataset.
Line 6: Finally, the Gini impurity formula is applied to calculate the impurity of the dataset.
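Running the code above should print a Gini impurity of 0.0 for y_pure, since every element belongs to a single class, and 0.48 for y_impure, matching the hand calculation of $1 - (0.4^2 + 0.6^2)$ shown earlier.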
Note: In decision trees, Gini impurity is used recursively to determine how to split a dataset into subsets. The goal is to minimize Gini impurity after each split until certain stopping criteria are met, such as a maximum depth or a minimum number of samples in a leaf node.
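As a rough illustration of how a split is scored, the sketch below weights each child subset's Gini impurity by its share of the samples; the weighted_gini helper is hypothetical, introduced here only for illustration, and reuses the gini_impurity function defined earlier:

```python
import numpy as np

def gini_impurity(y):
    _, counts = np.unique(y, return_counts=True)
    prob = counts / len(y)
    return 1 - np.sum(prob**2)

def weighted_gini(y_left, y_right):
    # Weight each child's impurity by the fraction of samples it receives
    n = len(y_left) + len(y_right)
    return ((len(y_left) / n) * gini_impurity(y_left)
            + (len(y_right) / n) * gini_impurity(y_right))

y = np.array([0, 0, 1, 1, 1])

# A split that separates the classes perfectly drives the weighted impurity to 0
print(weighted_gini(y[:2], y[2:]))  # 0.0

# A split that leaves one child mixed scores worse (about 0.267)
print(weighted_gini(y[:3], y[3:]))
```

A decision tree learner would evaluate many candidate splits this way and keep the one with the lowest weighted impurity.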