Feature Importance

Learn how the random forest algorithm determines the most important features for making accurate predictions.

Finding the features that matter

When using machine learning, it’s natural to ask, “Which features are the most important for making accurate predictions?” The random forest algorithm uses permutation importance to help answer this question. Permutation importance works by randomly shuffling (permuting) the values of one feature at a time and measuring how much the shuffling degrades the quality of predictions.

Here’s the intuition behind permutation importance:

  • If you permute the values of highly predictive features, tree accuracy should decrease a lot.

  • If you permute the values of features that aren’t predictive, tree accuracy shouldn’t decrease much.

Imagine the worst possible feature: a set of completely random values. In theory, permuting the values of such a feature would not decrease tree accuracy at all.
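To make the intuition concrete, here is a minimal sketch in R. It is an illustration only, not the random forest's own implementation: it assumes a single rpart decision tree as a stand-in for the forest, a synthetic dataset with one predictive feature (signal) and one completely random feature (noise), and a hypothetical helper named permutation_drop().

```r
library(rpart)  # single decision trees; a recommended package in standard R installs

set.seed(42)

# Synthetic data (an assumption for illustration): `signal` drives the label,
# while `noise` is a set of completely random values.
n <- 600
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
toy$label <- factor(ifelse(toy$signal + rnorm(n, sd = 0.3) > 0, "yes", "no"))

train <- toy[1:400, ]
holdout <- toy[401:600, ]

tree <- rpart(label ~ signal + noise, data = train, method = "class")

# Fraction of rows in `data` that the model classifies correctly.
accuracy <- function(model, data) {
  mean(predict(model, data, type = "class") == data$label)
}

# Hypothetical helper: drop in accuracy after shuffling one feature column.
permutation_drop <- function(model, data, feature) {
  shuffled <- data
  shuffled[[feature]] <- sample(shuffled[[feature]])
  accuracy(model, data) - accuracy(model, shuffled)
}

drops <- sapply(c("signal", "noise"),
                function(f) permutation_drop(tree, holdout, f))
print(round(drops, 3))
```

Shuffling signal should cost the tree a large share of its accuracy, while shuffling noise should leave accuracy roughly where it was.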

Another use of OOB data

In addition to using out-of-bag (OOB) data to estimate the generalization error, the random forest algorithm uses OOB data to implement permutation importance. The two most common ways of ranking features by importance are:

  • Ranking features by mean decrease in accuracy. This is the permutation-based measure computed on OOB data.

  • Ranking features by mean decrease in node impurity. This measure comes from the splits made during training rather than from permutation.

This course focuses on the mean decrease in accuracy, as it’s the preferred method of determining feature importance. The randomForest package’s varImpPlot() function can plot the features in descending order of importance by mean decrease in accuracy.
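The sketch below shows one way the mean decrease in accuracy could be computed from OOB data. It is a simplified illustration under stated assumptions, not the randomForest package's code: it bags plain rpart trees (without the per-split feature sampling a real random forest uses), works on a synthetic dataset, and reports the raw average drop in accuracy per feature.

```r
library(rpart)

set.seed(7)

# Synthetic data (an assumption): `signal` is predictive, `noise` is random.
n <- 400
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
toy$label <- factor(ifelse(toy$signal + rnorm(n, sd = 0.3) > 0, "yes", "no"))
features <- c("signal", "noise")

n_trees <- 50
# One row per tree, one column per feature: drop in OOB accuracy after permuting.
drops <- matrix(NA_real_, nrow = n_trees, ncol = length(features),
                dimnames = list(NULL, features))

for (b in seq_len(n_trees)) {
  in_bag <- sample(n, replace = TRUE)   # bootstrap sample for this tree
  oob <- toy[-unique(in_bag), ]         # rows this tree never saw
  tree <- rpart(label ~ signal + noise, data = toy[in_bag, ], method = "class")

  base_acc <- mean(predict(tree, oob, type = "class") == oob$label)
  for (f in features) {
    permuted <- oob
    permuted[[f]] <- sample(permuted[[f]])  # shuffle one feature within the OOB rows
    perm_acc <- mean(predict(tree, permuted, type = "class") == permuted$label)
    drops[b, f] <- base_acc - perm_acc
  }
}

# Average the per-tree drops and rank the features.
mean_decrease <- sort(colMeans(drops), decreasing = TRUE)
print(round(mean_decrease, 3))
```

Because each tree is scored only on rows left out of its bootstrap sample, no separate validation set is needed to rank the features.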

Run the following code to see an example using the Titanic training dataset.
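As a minimal sketch of such an example, the code below assumes the randomForest package is installed. It uses R's built-in Titanic table, expanded to one row per passenger, as a stand-in for the training data (an assumption; the course's own training file may have different columns). The seed and the ntree value are also illustrative choices.

```r
library(randomForest)

set.seed(123)

# Expand R's built-in Titanic contingency table into one row per passenger.
# Assumption: this stands in for the course's Titanic training dataset.
counts <- as.data.frame(Titanic)
passengers <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                     c("Class", "Sex", "Age", "Survived")]

# importance = TRUE asks randomForest to compute permutation importance
# on the OOB data while the forest is trained.
rf <- randomForest(Survived ~ Class + Sex + Age, data = passengers,
                   ntree = 500, importance = TRUE)

# type = 1 selects mean decrease in accuracy;
# type = 2 would select mean decrease in node impurity.
imp <- importance(rf, type = 1)
print(imp)

# Dot chart of the features, sorted from most to least important.
varImpPlot(rf, type = 1, main = "Mean decrease in accuracy")
```

Features nearer the top of the plot are the ones whose permutation hurts OOB accuracy the most.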
