Exercise: Find Optimal Hyperparameters for a Decision Tree

Learn to find the optimal maximum depth hyperparameter for a decision tree by using the grid search method.

Using GridSearchCV to tune hyperparameters

In this exercise, we will use GridSearchCV to tune the hyperparameters for a decision tree model. You will learn about a convenient way of searching different hyperparameters with scikit-learn. Perform the following steps to complete the exercise:

  1. Import the GridSearchCV class with this code:

    from sklearn.model_selection import GridSearchCV
    

    The next step is to define the hyperparameters that we want to search using cross-validation. We will find the best maximum depth of the tree, using the max_depth parameter. Deeper trees have more node splits, which partition the training set into smaller and smaller subspaces using the features. While we don’t know the best maximum depth ahead of time, it helps to think through some limiting cases when choosing the range of values for the grid search.

    We know that one is the minimum depth, consisting of a tree with just one split. As for the largest depth, you can consider how many samples you have in your training data, or, more appropriately in this case, how many samples will be in the training fold for each split of the cross-validation. We will perform a 4-fold cross-validation like we did in the previous section. So, how many samples will be in each training fold, and how does this relate to the depth of the tree?

  2. Find the number of samples in the training data using this code:

    X_train.shape
    

    The output should be as follows:

    (21331, 17)
    

    With 21,331 training samples and 4-fold cross-validation, there will be three-fourths of the samples, or about 16,000 samples, in each training fold.
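The fold arithmetic above can be checked with a few lines of plain Python. This is only a sketch: it assumes the folds are made as equal as possible, with the leftover rows spread over the first folds. The stratified splitter that scikit-learn uses for classifiers may give sizes that differ by a sample or two.

```python
n_samples = 21331  # rows in X_train, from the shape shown above
n_folds = 4        # 4-fold cross-validation

# Split the rows into n_folds validation folds that are as equal as possible.
base, extra = divmod(n_samples, n_folds)
validation_sizes = [base + 1 if i < extra else base for i in range(n_folds)]

# For each split, the training fold is everything outside the validation fold.
training_sizes = [n_samples - v for v in validation_sizes]

print(validation_sizes)
print(training_sizes)
```

Each training fold comes out just under 16,000 samples, which is the figure used in the depth estimate.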

    What does this mean for how deep we may wish to grow our tree?

    A theoretical limitation is that we need at least one sample in each leaf. From our discussion regarding how the depth of the tree relates to the number of leaves, we know a tree that splits at every node before the last level, with n levels, has 2^n leaf nodes. Therefore, a tree with L leaf nodes has a depth of approximately log2(L). In the limiting case, if we grow the tree deep enough so that every leaf node has one training sample for a given fold, the depth will be log2(16,000) ≈ 14. So, 14 is the theoretical limit to the depth of a tree that we could grow in this case.
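As a quick check of this estimate, the logarithm can be computed directly. The sketch below assumes the idealized case described above: a tree that splits at every node, with one sample per leaf.

```python
import math

samples_per_training_fold = 16_000  # approximate size from the fold arithmetic

# A fully split tree of depth n has 2**n leaves, so one sample per leaf
# corresponds to a depth of log2(number of samples).
depth_limit = math.log2(samples_per_training_fold)

print(round(depth_limit, 2))  # 13.97
print(2 ** 14)                # 16384 leaves at depth 14, slightly more than 16,000
```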

    Practically speaking, you will probably not want to grow a tree this deep, as the rules used to generate the decision tree will be very specific to the training data and the model is likely to be overfit. However, this gives you an idea of the range of values we may wish to consider for the max_depth hyperparameter. We will explore a range of depths from 1 up to 12.

  3. Define a dictionary with the key being the hyperparameter name and the value being the list of values of this hyperparameter that we want to search in cross-validation:

    params = {'max_depth':[1, 2, 4, 6, 8, 10, 12]}
    

    In this case, we are only searching one hyperparameter. However, you could define a dictionary with multiple key-value pairs to search over multiple hyperparameters simultaneously.
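To illustrate a multi-parameter search space, here is a minimal sketch using scikit-learn's ParameterGrid to count the combinations a grid would produce. The second key, min_samples_leaf, and its values are hypothetical choices for illustration and are not part of this exercise.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical two-parameter grid; only max_depth is used in this exercise.
multi_params = {
    'max_depth': [1, 2, 4, 6, 8, 10, 12],
    'min_samples_leaf': [1, 10, 100],  # assumed values, for illustration only
}

# ParameterGrid enumerates every combination of the listed values.
grid = ParameterGrid(multi_params)
print(len(grid))  # 7 * 3 = 21 candidate settings

# Each element is a dictionary holding one combination.
first_candidate = list(grid)[0]
print(first_candidate)
```

The number of candidates is the product of the list lengths, so the cost of a grid search grows quickly as hyperparameters are added.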

  4. If you are running all the exercises for this section in a single notebook, you can reuse the decision tree object, dt, from earlier. If not, you need to create a decision tree object for the hyperparameter search:

    from sklearn import tree  # needed only if tree has not been imported yet
    dt = tree.DecisionTreeClassifier()
    

    Now we want to instantiate the GridSearchCV class.

  5. Instantiate the GridSearchCV class using these options:

    cv = GridSearchCV(dt, param_grid=params, scoring='roc_auc', n_jobs=None,
                      refit=True, cv=4, verbose=1, pre_dispatch=None,
                      error_score=np.nan, return_train_score=True)
    

    Note here that we use the ROC AUC metric (scoring='roc_auc'), that we do 4-fold cross-validation (cv=4), and that we calculate training scores (return_train_score=True) to assess the bias-variance trade-off.

    Once the cross-validation object is defined, we can simply use the .fit method on it as we would with a model object. This encapsulates essentially all the functionality of the cross-validation loop.
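To make concrete what the .fit call bundles together, here is a rough sketch of an equivalent manual loop. It is not the internal implementation of GridSearchCV. It uses a small synthetic dataset (X_demo, y_demo, generated with make_classification) as a stand-in for the exercise's X_train and y_train.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the exercise itself uses X_train and y_train.
X_demo, y_demo = make_classification(n_samples=400, n_features=17,
                                     random_state=0)

manual_results = {}
for depth in [1, 2, 4, 6, 8, 10, 12]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # 4-fold cross-validation with ROC AUC, keeping training scores as well.
    scores = cross_validate(model, X_demo, y_demo, scoring='roc_auc',
                            cv=4, return_train_score=True)
    manual_results[depth] = (scores['train_score'].mean(),
                             scores['test_score'].mean())

# Pick the depth with the highest mean validation score.
best_depth = max(manual_results, key=lambda d: manual_results[d][1])
print(best_depth)
```

GridSearchCV performs this bookkeeping for us. With refit=True, it also retrains the model on all the training data using the best setting.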

  6. Perform 4-fold cross-validation to search for the optimal maximum depth using this code:

    cv.fit(X_train, y_train)
    

    The output should be as follows:

    Fitting 4 folds for each of 7 candidates, totalling 28 fits 
    [Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers. 
    [Parallel(n_jobs=1)]: Done  28 out of  28 | elapsed:    3.2s finished 
    GridSearchCV(cv=4, estimator=DecisionTreeClassifier(), 
    param_grid={'max_depth': [1, 2, 4, 6, 8, 10, 12]}, 
    pre_dispatch=None, return_train_score=True, scoring='roc_auc', verbose=1)
    

    All the options that we specified are printed as output. Additionally, there is some output information regarding how many cross-validation fits were performed. We had 4 folds and 7 candidate values of the hyperparameter, meaning 4 x 7 = 28 fits are performed. The amount of time this took is also displayed. You can control how much output you get from this procedure with the verbose keyword argument; larger numbers mean more output.

    Now it’s time to examine the results of the cross-validation procedure. Among the attributes that are available on the fitted GridSearchCV object is .cv_results_. This is a dictionary containing the names of results as keys and the results themselves as values. For example, the mean_test_score key holds the average testing score across the folds for each of the seven hyperparameter values. You could directly examine this output by running cv.cv_results_ in a code cell. However, this is not easy to read. Dictionaries with this kind of structure can be used immediately in the creation of a pandas DataFrame, which makes looking at the results a little easier.
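The keys described above can also be read straight from the dictionary. The following sketch fits a small grid search on synthetic stand-in data (demo_cv, X_demo, and y_demo are illustrative names, not objects from the exercise) and prints the mean training and testing scores for each depth.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the exercise itself uses X_train and y_train.
X_demo, y_demo = make_classification(n_samples=400, n_features=17,
                                     random_state=0)

demo_cv = GridSearchCV(DecisionTreeClassifier(random_state=0),
                       param_grid={'max_depth': [1, 2, 4, 6, 8, 10, 12]},
                       scoring='roc_auc', cv=4, return_train_score=True)
demo_cv.fit(X_demo, y_demo)

results = demo_cv.cv_results_
# Each key maps to an array with one entry per candidate value.
for depth, train, test in zip(results['param_max_depth'],
                              results['mean_train_score'],
                              results['mean_test_score']):
    print(f"max_depth={depth}: train={train:.3f}, test={test:.3f}")

# The best candidate according to the mean testing score.
print(demo_cv.best_params_)
```

Comparing the training and testing columns side by side is one way to see where deeper trees start to overfit.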

  7. Run the following code to create and examine a pandas DataFrame of cross-validation results:

    cv_results_df = pd.DataFrame(cv.cv_results_) 
    cv_results_df
    

    The output should look like this:
