What is Gini impurity?

One of the key concepts in decision trees is the calculation of impurity, which measures how heterogeneous (mixed) a dataset is. One common impurity measure is Gini impurity: the probability of incorrectly classifying a randomly chosen element from the dataset if it were labeled randomly according to the class distribution. Its range is from 0 to 1, and lower Gini impurity values indicate a purer subset. The formula for Gini impurity is given below:

\begin{align*}
G = 1 - \sum_{i=1}^{k} p_i^2
\end{align*}

Here, G is the Gini impurity, k is the number of classes in the dataset, and p_i is the probability that an element in the given dataset belongs to class i.
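For instance, for a dataset in which 40% of the elements belong to one class and 60% to another:

\begin{align*}
G = 1 - (0.4^2 + 0.6^2) = 1 - 0.52 = 0.48
\end{align*}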

Why use Gini impurity?

Gini impurity is a widely used splitting criterion in decision tree algorithms for several reasons:

  • Simplicity: Gini impurity is easy to understand and compute, making it a popular choice in decision tree algorithms.

  • Efficiency: Calculating Gini impurity is computationally cheap; unlike entropy, it requires no logarithms, only squares and a sum (see the sketch after this list).

  • Robustness: Gini impurity depends only on the class proportions within a node, which makes it straightforward to apply to datasets with unequal class frequencies.
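
To make the efficiency point concrete, here is a minimal sketch comparing the two criteria on the same class probabilities. It assumes NumPy (as in the implementation below); the entropy function is the standard textbook definition, not part of the original snippet:

import numpy as np

def gini(prob):
    # Only squares and a sum: cheap to evaluate
    return 1 - np.sum(prob**2)

def entropy(prob):
    # Requires one logarithm per class
    prob = prob[prob > 0]  # skip empty classes to avoid log(0)
    return -np.sum(prob * np.log2(prob))

p = np.array([0.4, 0.6])
print(gini(p))     # ~0.48
print(entropy(p))  # ~0.97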

Implementation

Let’s define the Gini impurity function and calculate the impurity for two different arrays.

import numpy as np

def gini_impurity(y):
    classes, counts = np.unique(y, return_counts=True)  # unique labels and their counts
    prob = counts / len(y)                              # relative frequency of each class
    impurity = 1 - np.sum(prob**2)                      # G = 1 - sum(p_i^2)
    return impurity

y_pure = [0, 0, 0, 0, 0]
y_impure = [0, 0, 1, 1, 1]

print(f'gini impurity of a dataset with single class = {gini_impurity(y_pure)}')
print(f'gini impurity of a dataset with a mix of two classes = {gini_impurity(y_impure)}')
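
With these inputs, the first print statement reports 0.0 (a dataset with a single class is perfectly pure), and the second reports roughly 0.48, matching the hand calculation above.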

Code explanation

  • Line 3: We define a function called gini_impurity that takes one argument, y, a list of labels or classes. This function calculates the Gini impurity of a dataset.

  • Line 4: Inside the function, we use np.unique to get the unique classes in the y array and their corresponding counts. classes contains the unique class labels, and counts contains the number of occurrences of each class.

  • Line 5: We calculate the probability of each class by dividing the counts by the total number of elements in y. This gives us the relative frequency of each class in the dataset.

  • Line 6: Finally, we apply the Gini impurity formula, one minus the sum of the squared class probabilities, to calculate the impurity of the dataset, which is returned on line 7.

Note: In decision trees, Gini impurity is used recursively to decide how to split a dataset into subsets. The goal is to minimize the weighted Gini impurity of the resulting subsets after each split, until a stopping criterion is met, such as a maximum depth or a minimum number of samples in a leaf node.
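
As a minimal sketch of how this works at a single split (reusing the gini_impurity function defined above; the helper name weighted_gini is hypothetical, not from the snippet), a candidate split is scored by the size-weighted average of its children's impurities:

def weighted_gini(y_left, y_right):
    # Size-weighted average of the children's Gini impurities
    n = len(y_left) + len(y_right)
    return ((len(y_left) / n) * gini_impurity(y_left)
            + (len(y_right) / n) * gini_impurity(y_right))

# A split that separates the classes perfectly scores 0.0,
# so the tree prefers it over any split that leaves the children mixed.
print(weighted_gini([0, 0], [1, 1, 1]))  # 0.0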
