Decision Trees
Learn about decision trees, and understand their working and implementation.
Decision trees
Decision trees are widely used machine learning algorithms for both classification and regression tasks. They're simple to understand and interpret, making them valuable tools for data analysis and decision-making. Apart from their standalone usage, they often serve as the default choice for a weak learner in ensemble learning. In this lesson, we'll explore the concept of decision trees, understand how they work, and implement a basic decision tree from scratch using numpy in Python. We'll also present its implementation using sklearn.
How do decision trees work?
Decision trees are tree-like structures where each internal node represents a decision based on a specific feature, and each leaf node represents the outcome (class label for classification or predicted value for regression). They work by recursively splitting the dataset into subsets based on the most significant attribute (feature) at each step, with the goal of creating leaf nodes that represent the final predicted class or value.
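To make the structure above concrete, here is a minimal sketch of a tree node and the prediction walk from root to leaf. The `Node` class and `predict_one` function are illustrative names for this lesson, not part of any library:

```python
class Node:
    """One node of a decision tree (illustrative sketch)."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # index of the feature this internal node splits on
        self.threshold = threshold  # split threshold for that feature
        self.left = left            # subtree for samples with feature <= threshold
        self.right = right          # subtree for samples with feature > threshold
        self.value = value          # predicted class label (set only at leaf nodes)

def predict_one(node, x):
    # Walk from the root to a leaf, taking the branch chosen by each comparison.
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

# Tiny hand-built tree: split on feature 0 at threshold 2.5.
tree = Node(feature=0, threshold=2.5, left=Node(value="A"), right=Node(value="B"))
print(predict_one(tree, [1.0]))  # → A
print(predict_one(tree, [3.0]))  # → B
```

A fitted tree is just such a structure built automatically from data, which is what the rest of the lesson works toward.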
Gini impurity
One of the key concepts in decision trees is the calculation of impurity to determine how heterogeneous (mixed) a dataset is. One common impurity measure is the Gini impurity, which measures the probability of incorrectly classifying a randomly chosen element from the dataset. Lower Gini impurity values indicate purer subsets. The formula for Gini impurity is given below:
$$G = 1 - \sum_{i=1}^{C} p_i^2$$

Here, $G$ is the Gini impurity, $C$ is the number of classes in the dataset, and $p_i$ is the probability of an element in the given dataset belonging to class $i$.
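The formula above translates directly into a few lines of numpy. This is a small sketch (the function name is ours, not from a library):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: G = 1 - sum over classes of p_i squared."""
    _, counts = np.unique(labels, return_counts=True)  # class frequencies
    p = counts / counts.sum()                          # class probabilities p_i
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; an even 50/50 binary split gives the maximum 0.5.
print(gini_impurity(np.array([0, 0, 0, 0])))  # → 0.0
print(gini_impurity(np.array([0, 0, 1, 1])))  # → 0.5
```

As the examples show, lower values mean purer subsets, which is exactly what the splitting procedure tries to achieve.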
Note: Apart from Gini impurity, we can also use entropy, misclassification error, information gain, gain ratio, chi-square statistic, and MSE to create decision trees.
Optimization
The goal is to find the feature and the corresponding threshold that minimize the Gini impurity of the subsets produced by splitting the dataset on that feature at that threshold. We repeat this recursively until the leaf nodes are pure, i.e., their Gini impurity is zero. For each level of the tree, the optimization problem is as follows:
...
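This per-split optimization can be sketched as an exhaustive search over every feature and every candidate threshold, scoring each split by the size-weighted Gini impurity of the two children. The function names below are illustrative, and this simplified search assumes numeric features:

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Search all (feature, threshold) pairs and return the one that
    minimizes the size-weighted Gini impurity of the two child subsets."""
    n, d = X.shape
    best = (None, None, np.inf)  # (feature, threshold, weighted impurity)
    for f in range(d):
        for t in np.unique(X[:, f]):
            mask = X[:, f] <= t
            if mask.all() or (~mask).all():
                continue  # skip splits that leave one child empty
            score = (mask.sum() * gini(y[mask]) +
                     (~mask).sum() * gini(y[~mask])) / n
            if score < best[2]:
                best = (f, t, score)
    return best

# One feature, labels change between 2.0 and 3.0, so the best threshold is 2.0.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
f, t, score = best_split(X, y)
print(f, t, score)  # feature 0, threshold 2.0, weighted impurity 0.0
```

Applying this search at every node, on the subset of data that reaches it, is what grows the full tree.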