What is a pruning algorithm?

Introduction

In data mining and machine learning, classification reduces the data set into smaller subsets until all we are left with are decision nodes and leaf nodes. Pruning essentially optimizes the classification process by removing unnecessary nodes and making the decision tree smaller.

A visual for pruning's outcome

Pruning helps remove unwanted data and helps avoid overfitting, which is a very common problem in decision trees. Overfitting occurs when the model trains too well and starts learning noise other than the required characteristics. Pruning may help overcome this problem but can also lead to underfitting. Therefore, finding the right threshold of learning varies from problem to problem.

Methods of pruning

The most commonly used pruning algorithm on decision trees is the Alpha-Beta pruning. You can have a quick look at it over here.

There are multiple ways to prune your decision tree. Some of which are:

  • Pruning by information gain
  • Pruning by classification performance on the validation set

Pruning by information gain makes use of the information initially available when the tree is built from the training data.

Pruning by classification performance on the validation set makes use of the validation dataset and prunes the decision tree according to the best classification on the validation dataset.

Pruning algorithm

Pruning by information gain

The algorithm is as follows:

  • Catalog all twigsnodes whose children are all leaves.
  • Keep a total count of all the leaves in the tree.
  • Keep a threshold of the number of leaves in the tree needed.
  • Loop until the number of leaves in the tree exceeds the set threshold.
  • Find the twig which gives the least information gain.
  • Take the twig and remove its children.
  • We remove the children because we aren’t gaining enough information from the node, and hence the node can be declared irrelevant.
  • Now relabel the twig to be a leaf.
  • Change the leaf count.
Differentiating in low information gain and high information gain

Pruning by classification performance on the validation set

The algorithm is just the same as the one provided above. The difference here is that:

Each instance of the validation data is passed recursively until we find the node that provides the least information for the validation data set.

Cataloging twigs
1 of 10

The illustration above shows how for a single validation dataset, the tree is pruned. This process is repeated multiple times for different datasets.

Free Resources

Copyright ©2025 Educative, Inc. All rights reserved