Exercise: Find Optimal Hyperparameters for a Decision Tree
Learn to find the optimal maximum depth hyperparameter for a decision tree by using the grid search method.
Using GridSearchCV to tune hyperparameters
In this exercise, we will use GridSearchCV to tune the hyperparameters of a decision tree model. You will learn a convenient way of searching over different hyperparameters with scikit-learn. Perform the following steps to complete the exercise:
- Import the GridSearchCV class with this code:
from sklearn.model_selection import GridSearchCV
The next step is to define the hyperparameters that we want to search using cross-validation. We will find the best maximum depth of the tree, using the max_depth parameter. Deeper trees have more node splits, which partition the training set into smaller and smaller subspaces using the features. While we don't know the best maximum depth ahead of time, it is helpful to consider some limiting cases when choosing the range of parameters to use for the grid search. We know that one is the minimum depth, consisting of a tree with just one split. As for the largest depth, you can consider how many samples you have in your training data, or, more appropriately in this case, how many samples will be in the training fold for each split of the cross-validation. We will perform 4-fold cross-validation as we did in the previous section. So, how many samples will be in each training fold, and how does this relate to the depth of the tree?
- Find the number of samples in the training data using this code:
X_train.shape
The output should be as follows:
(21331, 17)
With 21,331 training samples and 4-fold cross-validation, each training fold will contain three-fourths of the samples, or about 16,000 samples.
What does this mean for how deep we may wish to grow our tree?
A theoretical limitation is that we need at least one sample in each leaf. From our discussion regarding how the depth of the tree relates to the number of leaves, we know that a tree of depth n that splits at every node before the last level has 2^n leaf nodes. Therefore, a tree with L leaf nodes has a depth of approximately log2(L). In the limiting case, if we grow the tree deep enough so that every leaf node has one training sample for a given fold, the depth will be log2(16,000) ≈ 14. So, 14 is the theoretical limit to the depth of a tree that we could grow in this case.
Practically speaking, you will probably not want to grow a tree this deep, as the rules used to generate the decision tree will be very specific to the training data and the model is likely to be overfit. However, this gives you an idea of the range of values we may wish to consider for the max_depth hyperparameter. We will explore a range of depths from 1 up to 12; a quick numerical check of the depth limit is sketched after this step.
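To make the arithmetic above concrete, here is a minimal sketch (assuming NumPy is available as np; the exact fold sizes produced by scikit-learn may differ by a sample or two) that computes the approximate number of samples per training fold and the corresponding theoretical depth limit:
import numpy as np

n_train = 21331            # samples in the full training set
n_fold = n_train * 3 / 4   # approximate samples per training fold with 4-fold CV
print(round(n_fold))       # about 16,000
print(np.log2(n_fold))     # about 14, the theoretical maximum useful depth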
- Define a dictionary with the key being the hyperparameter name and the value being the list of values of this hyperparameter that we want to search in cross-validation:
params = {'max_depth':[1, 2, 4, 6, 8, 10, 12]}
In this case, we are only searching over one hyperparameter. However, you could define a dictionary with multiple key-value pairs to search over several hyperparameters simultaneously; a hypothetical example is sketched after this step.
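As an illustration only (these values and the second hyperparameter are not part of this exercise), a grid over two hyperparameters might look like the following; GridSearchCV would then try every combination, 3 x 2 = 6 candidates in this sketch:
# Hypothetical grid over two decision tree hyperparameters
params_multi = {'max_depth': [2, 4, 6],
                'min_samples_leaf': [1, 10]}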
- If you are running all the exercises for this section in a single notebook, you can reuse the decision tree object, dt, from earlier. If not, you need to create a decision tree object for the hyperparameter search:
dt = tree.DecisionTreeClassifier()
Now we want to instantiate the GridSearchCV class.
- Instantiate the GridSearchCV class using these options:
cv = GridSearchCV(dt, param_grid=params, scoring='roc_auc',
                  n_jobs=None, refit=True, cv=4, verbose=1,
                  pre_dispatch=None, error_score=np.nan,
                  return_train_score=True)
Note here that we use the ROC AUC metric (scoring='roc_auc'), that we do 4-fold cross-validation (cv=4), and that we calculate training scores (return_train_score=True) to assess the bias-variance trade-off.
Once the cross-validation object is defined, we can simply use the .fit method on it as we would with a model object. This encapsulates essentially all the functionality of the cross-validation loop.
- Perform 4-fold cross-validation to search for the optimal maximum depth using this code:
cv.fit(X_train, y_train)
The output should be as follows:
Fitting 4 folds for each of 7 candidates, totalling 28 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 28 out of 28 | elapsed: 3.2s finished
GridSearchCV(cv=4, estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': [1, 2, 4, 6, 8, 10, 12]},
             pre_dispatch=None, return_train_score=True,
             scoring='roc_auc', verbose=1)
All the options that we specified are printed as output. Additionally, there is some information about how many cross-validation fits were performed. We had 4 folds and 7 hyperparameter values, meaning 4 x 7 = 28 fits are performed. The amount of time this took is also displayed. You can control how much output you get from this procedure with the verbose keyword argument; larger numbers mean more output.
Now it's time to examine the results of the cross-validation procedure. Among the attributes available on the fitted GridSearchCV object is .cv_results_. This is a dictionary containing the names of results as keys and the results themselves as values. For example, the mean_test_score key holds the average testing score across the folds for each of the seven hyperparameter values. You could examine this output directly by running cv.cv_results_ in a code cell, but it is not easy to read. Dictionaries with this structure can be passed directly to the pandas DataFrame constructor, which makes looking at the results a little easier.
- Run the following code to create and examine a pandas DataFrame of cross-validation results:
cv_results_df = pd.DataFrame(cv.cv_results_)
cv_results_df
The output should be a DataFrame of the cross-validation results, with one row for each of the seven max_depth values that were tried.
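If you want to focus on the quantities most relevant to the bias-variance trade-off, one option (a sketch assuming the cv_results_df DataFrame created above and scikit-learn's standard cv_results_ column names) is to select just a few columns:
# Keep only the columns needed to compare training and testing performance per depth
cv_results_df[['param_max_depth', 'mean_train_score', 'mean_test_score']]
Comparing mean_train_score with mean_test_score at each depth shows where the tree starts to overfit.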