One of the key concepts in decision trees is the calculation of impurity to determine the heterogeneity (mixedness) of a dataset. One common impurity measure is Gini impurity, which measures the probability of incorrectly classifying a randomly chosen element from the dataset if it were labeled according to the dataset's class distribution. Its range is from $0$ (a perfectly pure subset) to $1 - \frac{1}{K}$ for $K$ classes ($0.5$ for a two-class dataset). Lower Gini impurity values indicate a purer subset. The formula for Gini impurity is given below:

$$G = 1 - \sum_{i=1}^{K} p_i^2$$

Here, $G$ is the Gini impurity, $K$ is the number of classes in the dataset, and $p_i$ is the probability of an element in the given dataset belonging to class $i$.
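For example, for a dataset in which 40% of the elements belong to class 0 and 60% belong to class 1 (the same mix as the y_impure array used later in this lesson), the Gini impurity works out to:

$$G = 1 - (0.4^2 + 0.6^2) = 1 - (0.16 + 0.36) = 0.48$$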
Gini impurity is a crucial splitting criterion for decision tree algorithms for several reasons:
Simplicity: Gini impurity is easy to understand and compute, making it a popular choice in decision tree algorithms.
Efficiency: Calculating Gini impurity is computationally efficient, especially compared to alternatives like entropy, because it avoids per-class logarithm computations (a short comparison sketch follows this list).
Robustness: Gini impurity is comparatively insensitive to class distribution imbalances, which makes it a reasonable choice for datasets with unequal class frequencies.
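To illustrate the efficiency point, here is a minimal sketch (illustrative code, not part of the lesson's snippet below) that computes both Gini impurity and entropy from the same class probabilities; entropy needs a logarithm for every nonzero class probability, while Gini only needs squares and a sum:

```python
import numpy as np

def gini(prob):
    # Gini: 1 minus the sum of squared class probabilities
    return 1 - np.sum(prob**2)

def entropy(prob):
    # Entropy: requires a logarithm for every nonzero class probability
    prob = prob[prob > 0]  # avoid log(0)
    return -np.sum(prob * np.log2(prob))

prob = np.array([0.4, 0.6])
print(f'gini    = {gini(prob):.4f}')     # 0.4800
print(f'entropy = {entropy(prob):.4f}')  # 0.9710
```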
Let’s define the Gini impurity function and calculate the impurity for two different arrays.
```python
import numpy as np

def gini_impurity(y):
    classes, counts = np.unique(y, return_counts=True)  # unique labels and their counts
    prob = counts / len(y)                              # relative frequency of each class
    impurity = 1 - np.sum(prob**2)                      # Gini formula: 1 - sum(p_i^2)
    return impurity

y_pure = [0, 0, 0, 0, 0]
y_impure = [0, 0, 1, 1, 1]

print(f'gini impurity of a dataset with single class = {gini_impurity(y_pure)}')
print(f'gini impurity of a dataset with a mix of two classes = {gini_impurity(y_impure)}')
```
Line 3: We define a function called gini_impurity that takes one argument, y, which is a list of labels or classes. This function calculates the Gini impurity of a dataset.
Line 4: Inside the function, we use NumPy's np.unique function to get the unique classes in the y array and their corresponding counts. classes contains the unique class labels, and counts contains the number of occurrences of each class.
Line 5: We calculate the probability of each class by dividing the counts by the total number of elements in y. This gives us the relative frequency of each class in the dataset.
Line 6: Finally, the Gini impurity formula is applied to calculate the impurity of the dataset.
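Running the code above should print a Gini impurity of 0.0 for y_pure, since every element belongs to a single class, and 0.48 for y_impure, matching the hand calculation of $1 - (0.4^2 + 0.6^2)$ shown earlier.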
Note: In decision trees, Gini impurity is used recursively to determine how to split a dataset into subsets. The goal is to minimize Gini impurity after each split until certain stopping criteria are met, such as a maximum depth or a minimum number of samples in a leaf node.
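As a rough illustration of how a split is scored, the sketch below weights each child subset's Gini impurity by its share of the samples; the weighted_gini helper is hypothetical, introduced here only for illustration, and reuses the gini_impurity function defined earlier:

```python
import numpy as np

def gini_impurity(y):
    _, counts = np.unique(y, return_counts=True)
    prob = counts / len(y)
    return 1 - np.sum(prob**2)

def weighted_gini(y_left, y_right):
    # Weight each child's impurity by the fraction of samples it receives
    n = len(y_left) + len(y_right)
    return ((len(y_left) / n) * gini_impurity(y_left)
            + (len(y_right) / n) * gini_impurity(y_right))

y = np.array([0, 0, 1, 1, 1])

# A split that separates the classes perfectly drives the weighted impurity to 0
print(weighted_gini(y[:2], y[2:]))  # 0.0

# A split that leaves one child mixed scores worse (about 0.267)
print(weighted_gini(y[:3], y[3:]))
```

A decision tree learner would evaluate many candidate splits this way and keep the one with the lowest weighted impurity.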