Exercise: Fitting a Random Forest
Learn how to fit a random forest model with cross-validation on the training data from the case study.
Extending decision trees with random forests
In this exercise, we will extend our efforts with decision trees by using the random forest model with cross-validation on the training data from the case study. We will observe the effect of increasing the number of trees in the forest and examine the feature importance that can be calculated using a random forest model. Perform the following steps to complete the exercise:
- Import the random forest classifier model class as follows:
from sklearn.ensemble import RandomForestClassifier
- Instantiate the class using these options:

# max_features='sqrt' replaces the deprecated 'auto', which was equivalent
# for classifiers and has been removed in newer scikit-learn versions
rf = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=3,
                            min_samples_split=2, min_samples_leaf=1,
                            min_weight_fraction_leaf=0.0, max_features='sqrt',
                            max_leaf_nodes=None, min_impurity_decrease=0.0,
                            bootstrap=True, oob_score=False, n_jobs=None,
                            random_state=4, verbose=0, warm_start=False,
                            class_weight=None)
For this exercise, we'll use mainly the default options. However, note that we will set max_depth=3. Here, we are only going to explore the effect of using different numbers of trees, which we will illustrate with relatively shallow trees for the sake of shorter runtimes. To find the best model performance, we'd typically try more trees and deeper trees. We also set random_state for consistent results across runs.
- Create a parameter grid for this exercise in order to search over the number of trees, ranging from 10 to 100 in steps of 10:
rf_params_ex = {'n_estimators':list(range(10,110,10))}
We use Python's range() function to generate the integer values we want, and then convert them to a list using list(), as shown in the quick check below.
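As a quick check, the expression evaluates to the ten tree counts we want to search:

print(list(range(10, 110, 10)))
# [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]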
- Instantiate a grid search cross-validation object for the random forest model using the parameter grid from the previous step. Otherwise, you can use the same options that were used for the cross-validation of the decision tree, along the lines of the sketch below.
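Here is a minimal sketch of this step. It assumes that ROC AUC scoring and 4-fold cross-validation were used for the decision tree, and that the case study's training data is held in variables named X_train and y_train; adjust these assumptions to match your earlier setup.

from sklearn.model_selection import GridSearchCV

# Assumed options: scoring='roc_auc' and cv=4 mirror the earlier decision
# tree cross-validation in this sketch; change them if yours differed
cv_rf_ex = GridSearchCV(rf, param_grid=rf_params_ex, scoring='roc_auc',
                        cv=4, refit=True, verbose=1,
                        return_train_score=True)

# X_train and y_train are assumed names for the case study's training
# features and labels
cv_rf_ex.fit(X_train, y_train)

# The refit best model exposes the impurity-based feature importances
# mentioned at the start of the exercise
feat_imps = cv_rf_ex.best_estimator_.feature_importances_

Because refit=True, best_estimator_ is the winning random forest retrained on all the training data, which is what we will use when we examine feature importance.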