Bagging

Learn how bagging is the first technique used by the random forest algorithm to produce valuable ensembles.

Randomizing observations

To build a valuable machine learning ensemble, the models within the ensemble should produce predictions with low correlation. In other words, the ensemble models should be different from each other. The random forest algorithm takes advantage of the high variance of the CART algorithm to manufacture diversity across the ensemble models.

The random forest algorithm uses bagging (i.e., bootstrap aggregation) to randomize the observations used to train the individual CART decision trees. CART decision trees trained on randomized observations exhibit a diversity of predictions.
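To make this diversity concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available. The synthetic dataset, seeds, and tree settings are illustrative assumptions, not part of this lesson: it trains a few CART trees (scikit-learn's `DecisionTreeClassifier`) on different bootstrap samples and measures how often two of them disagree.

```python
# A minimal sketch (illustrative, not this lesson's code): CART trees
# fit to different bootstrap samples produce different predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A synthetic dataset, used here only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)

predictions = []
for _ in range(3):
    # Draw a bootstrap sample: same size as X, sampled with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    predictions.append(tree.predict(X))

# Trees trained on different bags disagree on some observations.
disagreement = np.mean(predictions[0] != predictions[1])
print(f"Fraction of observations where tree 0 and tree 1 differ: {disagreement:.2f}")
```

Because each tree sees a different bag, the disagreement fraction is typically nonzero, which is exactly the low correlation the ensemble needs.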

Bagging uses random sampling with replacement to create many diverse training datasets from a single starting training dataset. Each training dataset created by bagging has the same number of observations as the starting training dataset.
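The sampling step itself is short. Below is a small sketch, assuming NumPy is available, with an illustrative dataset size and seed: it draws one bag of indices from a ten-observation training dataset, showing that the bag keeps the original size while some observations repeat and others are left out entirely.

```python
# A small sketch of how one bag is drawn: sample observation indices
# with replacement, keeping the size equal to the starting dataset.
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed so the draw is repeatable
n_observations = 10
starting_indices = np.arange(n_observations)

# The bag has the same number of observations as the starting dataset,
# but because sampling is with replacement, duplicates are expected.
bag_indices = rng.choice(starting_indices, size=n_observations, replace=True)
print("Bag:", bag_indices)
print("Left out of this bag:", np.setdiff1d(starting_indices, bag_indices))
```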

Random sampling with replacement

Let’s understand how bagging works through an example. Imagine you have a jar in which you place four balls, each with a distinct color: blue, green, red, and orange. In this example, each ball represents an observation in the training dataset.

After shaking the jar, you retrieve a ball randomly (i.e., you can’t see into the jar). Let’s say the ball you retrieved is the blue ball. You write down the word “blue” on a piece of paper and place the ball back into the jar. You then repeat this process three more times.

Let’s say the paper reads blue, orange, orange, green. This represents a randomized training dataset created from the starting training dataset. Randomized training datasets created this way are known as bags.
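You can reproduce the jar experiment in a few lines of Python. This sketch uses the standard library's `random.choices`, which samples with replacement; the seed is arbitrary, and any four-ball bag is a valid outcome.

```python
# A quick simulation of the jar example: four colored balls,
# four draws, each ball returned to the jar before the next draw.
import random

random.seed(7)  # arbitrary seed so the draw is repeatable
jar = ["blue", "green", "red", "orange"]

# Draw four balls with replacement to form one bag.
bag = random.choices(jar, k=len(jar))
print(bag)  # one possible bag, e.g. ['blue', 'orange', 'orange', 'green']
```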
