Exercise: Find Optimal Hyperparameters for a Decision Tree
Learn to find the optimal maximum depth hyperparameter for a decision tree by using the grid search method.
Using GridSearchCV to tune hyperparameters
In this exercise, we will use `GridSearchCV` to tune the hyperparameters of a decision tree model. You will learn a convenient way of searching over different hyperparameter values with scikit-learn. Perform the following steps to complete the exercise:
- Import the `GridSearchCV` class with this code:

  ```python
  from sklearn.model_selection import GridSearchCV
  ```
The next step is to define the hyperparameters that we want to search over using cross-validation. We will find the best maximum depth of the tree, using the `max_depth` parameter. Deeper trees have more node splits, which partition the training set into smaller and smaller subspaces using the features. While we don't know the best maximum depth ahead of time, it is helpful to consider some limiting cases when choosing the range of parameters for the grid search.

We know that one is the minimum depth, corresponding to a tree with just a single split. As for the largest depth, you can consider how many samples you have in your training data or, more appropriately in this case, how many samples will be in the training fold for each split of the cross-validation. We will perform a 4-fold cross-validation as we did in the previous section. So, how many samples will be in each training fold, and how does this relate to the depth of the tree?
- Find the number of samples in the training data using this code:

  ```python
  X_train.shape
  ```

  The output should be as follows:

  ```
  (21331, 17)
  ```
With 21,331 training samples and 4-fold cross-validation, each training fold will contain three-fourths of the data, or about 16,000 samples.
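As a quick sanity check, we can compute the fold size and relate it to tree depth directly. Since each split is binary, a tree of depth `d` has at most `2**d` leaf nodes, so a depth of roughly `log2` of the fold size would already be enough to give every training sample its own leaf:

```python
import math

n_train = 21331            # samples in the full training set, from X_train.shape
n_fold = n_train * 3 // 4  # samples in each training fold of 4-fold CV

# A binary tree of depth d has at most 2**d leaves, so a depth of about
# log2(n_fold) could in principle isolate every sample in its own leaf.
depth_to_isolate = math.ceil(math.log2(n_fold))
print(n_fold, depth_to_isolate)  # 15998 14
```

This suggests that depths well below 14 are where useful generalization is likely to happen; beyond that, the tree is mostly memorizing individual samples.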
What does this mean for how deep we may wish to grow our ...
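Anticipating where this reasoning leads, here is a minimal, self-contained sketch of how the grid search could be set up. The depth range (1 through 12), the synthetic stand-in data, and the `roc_auc` scoring choice are illustrative assumptions, not taken from the exercise text:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the case-study data (X_train, y_train); the real
# exercise uses its own dataset with 17 features.
X_train, y_train = make_classification(
    n_samples=1000, n_features=17, random_state=1
)

# Illustrative search space: depths from a single split up to 12.
params = {'max_depth': list(range(1, 13))}

# 4-fold cross-validation, as in the previous section.
cv = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid=params,
    cv=4,
    scoring='roc_auc',
)
cv.fit(X_train, y_train)

print(cv.best_params_)  # the depth with the highest mean CV score
```

After fitting, `cv.best_params_` and `cv.cv_results_` expose the winning depth and the full table of per-fold scores, which you can plot to see how performance varies with depth.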