Bootstrap, Bagging, and Random Forests
Learn about the Bootstrap, bagging, and random forests.
Let's cover the theoretical background and the key concepts behind decision tree learning and random forests.
Bootstrap
Bootstrap is a widely applicable and powerful statistical tool that can quantify the uncertainty associated with a given estimator or statistical learning method. Suppose we have a dataset with 100 values and we want to estimate the sample’s mean. The straightforward estimate is simply the sum of the 100 values divided by the sample size.
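As a minimal sketch of this calculation (the array name data and the randomly generated values below are illustrative assumptions, not part of the original example):

```python
import numpy as np

# A hypothetical dataset of 100 values (generated at random purely for illustration).
rng = np.random.default_rng(seed=0)
data = rng.normal(loc=4.0, scale=1.5, size=100)

# The plain sample mean: the sum of the values divided by the sample size.
sample_mean = data.sum() / len(data)
print(f"Sample mean: {sample_mean:.3f}")
```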
For such a small sample, we expect some error in this estimate of the mean. However, using the Bootstrap procedure, we can improve the estimate of our mean with the following steps:
Create many random samples (say 500) of our dataset, drawn with replacement, so the same value can be selected more than once.
Calculate the mean of each sample.
Calculate the average of the collected means and use that as our estimated mean for the data.
For example, suppose five samples gave mean values of 2.5, 3.5, 5.5, 4.3, and 2.9. The estimated mean in this case would be 3.74. A short code sketch of the whole procedure follows.
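Here is a minimal sketch of the three steps above, assuming NumPy; the function name bootstrap_mean, the 500 resamples, and the synthetic data are illustrative choices rather than a fixed recipe:

```python
import numpy as np

def bootstrap_mean(data, n_resamples=500, seed=0):
    """Bootstrap estimate of the mean: average the means of many resamples."""
    rng = np.random.default_rng(seed)
    resample_means = []
    for _ in range(n_resamples):
        # Step 1: draw a random resample of the same size, with replacement.
        resample = rng.choice(data, size=len(data), replace=True)
        # Step 2: calculate and record the mean of this resample.
        resample_means.append(resample.mean())
    # Step 3: average the collected means to get the bootstrap estimate.
    return float(np.mean(resample_means))

rng = np.random.default_rng(seed=1)
data = rng.normal(loc=4.0, scale=1.5, size=100)
print(f"Bootstrap estimate of the mean: {bootstrap_mean(data):.3f}")
```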
Bagging
Bagging, or Bootstrap aggregating, is an application of the Bootstrap. It is a general-purpose procedure for reducing the variance of high-variance machine learning algorithms, most commonly decision trees. This compelling and straightforward ensemble method combines the predictions from multiple models to make more accurate predictions than any individual model. Let’s say we have a sample with 5,000 instances or values, and we want to use the decision tree (CART) algorithm. The bagging procedure works as follows:
Create many random samples (say 500) of our dataset, drawn with replacement.
Train a model on each sample.
For new data, combine the predictions from all the models and output the average (or, for classification, the most common class).
Note: For example, if we have five bagged decision trees making the class predictions G, G, G, B, and B, the most frequent class (the mode), G, will be their final prediction, as in the sketch below.
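To make these steps concrete, here is a sketch of bagging implemented by hand on a synthetic binary classification problem; the dataset, the 500-tree count, and all variable names are illustrative assumptions, with scikit-learn's DecisionTreeClassifier standing in for the CART learner:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset standing in for the 5,000-instance sample in the text.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 500  # number of bootstrap resamples / trees
trees = []
for _ in range(n_trees):
    # Step 1: draw a bootstrap resample of the training data (with replacement).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train one decision tree (CART) on the resample.
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Step 3: for new data, take a majority vote (the mode) across all trees.
all_preds = np.stack([tree.predict(X_test) for tree in trees])  # shape: (n_trees, n_test)
majority_vote = (all_preds.mean(axis=0) > 0.5).astype(int)      # valid for the two classes 0/1
print(f"Bagged accuracy on held-out data: {(majority_vote == y_test).mean():.3f}")
```

In practice, scikit-learn's BaggingClassifier wraps this same procedure; the manual loop above simply makes the three steps explicit.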
Decision trees are greedy. They ...