In data mining and machine learning, a decision tree classifier repeatedly splits the data set into smaller subsets until all we are left with are decision nodes and leaf nodes. Pruning optimizes the classifier by removing nodes that add little predictive value, making the decision tree smaller.
Pruning helps remove unwanted structure and avoid overfitting, which is a very common problem in decision trees. Overfitting occurs when the model fits the training data too closely and starts learning the noise rather than the underlying pattern. Pruning can help overcome this problem, but pruning too aggressively can lead to underfitting. Therefore, the right amount of pruning varies from problem to problem.
Note that alpha-beta pruning, despite the similar name, is a different technique: it prunes game trees during minimax search, not trained decision tree classifiers. Commonly used decision tree pruning methods include reduced error pruning and cost-complexity pruning.
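As a quick, concrete example, scikit-learn exposes cost-complexity pruning through the `ccp_alpha` parameter of `DecisionTreeClassifier`. The sketch below (using the bundled iris data set; the alpha values are arbitrary choices for illustration) shows how stronger pruning shrinks the tree and typically narrows the gap between training and test accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Larger ccp_alpha values prune more aggressively, yielding smaller trees.
for alpha in [0.0, 0.01, 0.05]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    tree.fit(X_train, y_train)
    print(f"alpha={alpha}: {tree.tree_.node_count} nodes, "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```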
There are multiple ways to prune your decision tree. Two common approaches are:
Pruning by information gain makes use of the information gain already computed for each split when the tree is built from the training data.
Pruning by classification performance on the validation set holds out a validation dataset and prunes the decision tree so as to keep the best classification accuracy on that held-out data.
The information gain algorithm works bottom-up: starting from the deepest decision nodes, a node's subtree is replaced with a leaf (predicting the majority class) whenever the information gain of its split falls below a chosen threshold.
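A minimal sketch of this idea, assuming a small dictionary-based tree in which every decision node stores the information gain of its split and the majority class of its training instances (this representation and the threshold are illustrative assumptions, not a fixed API):

```python
def prune_by_information_gain(node, threshold):
    """Bottom-up pruning: collapse splits whose information gain
    falls below `threshold` into majority-class leaves."""
    if node["kind"] == "leaf":
        return node
    # Prune the children first so low-gain subtrees collapse from the bottom up.
    node["left"] = prune_by_information_gain(node["left"], threshold)
    node["right"] = prune_by_information_gain(node["right"], threshold)
    if node["gain"] < threshold:
        # Replace the weak split with a leaf predicting the majority class.
        return {"kind": "leaf", "label": node["majority_label"]}
    return node
```

Calling `prune_by_information_gain(root, threshold=0.01)` would return the pruned tree, with all low-gain subtrees collapsed into leaves.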
The validation-set algorithm is much the same as the one above. The difference here is that each instance of the validation data is passed through the tree recursively, and we prune the nodes whose subtrees provide the least benefit to classification on the validation data set.
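This variant is often called reduced error pruning. It can be sketched under the same assumed tree representation; the `predict` helper, the validation arrays, and the `feature`/`threshold` fields are likewise illustrative assumptions:

```python
def predict(node, x):
    """Walk the assumed dictionary tree for a single instance x."""
    while node["kind"] != "leaf":
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

def accuracy(root, X_val, y_val):
    """Fraction of validation instances the tree classifies correctly."""
    return sum(predict(root, x) == y for x, y in zip(X_val, y_val)) / len(y_val)

def prune_by_validation(node, root, X_val, y_val):
    """Bottom-up: replace a subtree with a majority-class leaf whenever
    doing so does not reduce accuracy on the validation set."""
    if node["kind"] == "leaf":
        return node
    prune_by_validation(node["left"], root, X_val, y_val)
    prune_by_validation(node["right"], root, X_val, y_val)
    before = accuracy(root, X_val, y_val)
    backup = dict(node)  # remember the split so it can be restored
    node.clear()
    node.update({"kind": "leaf", "label": backup["majority_label"]})
    if accuracy(root, X_val, y_val) >= before:
        return node      # pruning kept or improved validation accuracy
    node.clear()
    node.update(backup)  # otherwise restore the original split
    return node
```

The pruned tree is obtained with `prune_by_validation(root, root, X_val, y_val)`; passing the root twice lets each candidate prune be scored on the whole tree's validation accuracy.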
The sketch above prunes the tree for a single validation dataset. This process can be repeated multiple times for different validation splits.