A More Convenient Approach to Cross-Validation
Learn to use the GridSearchCV in scikit-learn for hyperparameter tuning.
Advantages of using GridSearchCV
In “The Bias-Variance Trade-Off” chapter, we gained a deep understanding of cross-validation by writing our own function to do it, using the KFold class to generate the training and testing indices. This was helpful for getting a thorough understanding of how the process works. However, scikit-learn offers a convenient class that can do more of the heavy lifting for us: GridSearchCV. The GridSearchCV class can take as input a model that we want to find optimal hyperparameters for, such as a decision tree or a logistic regression, and a “grid” of hyperparameters that we want to perform cross-validation over. For example, in a logistic regression, we may want to get the average cross-validation score over all the folds for different values of the regularization parameter, C. With decision trees, we may want to explore different depths of trees.
You can also search over multiple hyperparameters at once, for example, trying different depths of trees together with different numbers of max_features to consider at each node split, as in the sketch below.
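As a rough sketch of what this setup looks like, the following is hypothetical code (not the case study code from the next lesson); the feature matrix X and labels y are placeholders, and the grid values are illustrative.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Model whose hyperparameters we want to tune
dt = DecisionTreeClassifier()

# The "grid": every combination of these values will be cross-validated
param_grid = {
    'max_depth': [2, 4, 6, 8],
    'max_features': [2, 4, 6],
}

# 4-fold cross-validation over the grid, scored by ROC AUC
cv = GridSearchCV(dt, param_grid, scoring='roc_auc', cv=4)
# cv.fit(X, y)  # X and y stand in for your features and labels
```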
GridSearchCV does what is called an exhaustive grid search over all the possible combinations of parameters that we supply. This means that if we supplied five different values for each of two hyperparameters, the cross-validation procedure would be run 5 x 5 = 25 times. If you are searching many values of many hyperparameters, the number of cross-validation runs can grow very quickly. In these cases, you may wish to use RandomizedSearchCV, which searches a random subset of hyperparameter combinations from the universe of all possibilities in the grid you supply.
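To make the combinatorial growth concrete, here is a hedged sketch contrasting the two classes; the hyperparameter values are illustrative only, not taken from the case study.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {
    'max_depth': [2, 4, 6, 8, 10],     # 5 values
    'max_features': [1, 2, 3, 4, 5],   # 5 values
}

# Exhaustive search: all 5 x 5 = 25 combinations are cross-validated
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=4)

# Randomized search: only n_iter randomly sampled combinations are tried
random_search = RandomizedSearchCV(
    DecisionTreeClassifier(), param_grid, n_iter=10, cv=4, random_state=1
)
```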
GridSearchCV can speed up your work by streamlining the cross-validation process. You should be familiar with the concepts of cross-validation from the previous chapter, so we proceed directly to listing all the options available for GridSearchCV.
Options for GridSearchCV
In the next lesson, we will get hands-on practice using GridSearchCV with the case study data in order to search hyperparameters for a decision tree classifier. Here are the options for GridSearchCV:
The Options for GridSearchCV
Parameter | Possible values | Notes |
estimator | A scikit-learn model object, e.g., a decision tree or logistic regression instance | This is a model object that you have instantiated from a model class. The hyperparameters will be updated as GridSearchCV does its work. |
param_grid | Dictionary, or list of dictionaries, mapping hyperparameter names to lists of values | The "grid" of hyperparameter values to search over; GridSearchCV runs cross-validation for every combination. |
scoring | String, callable, list, or dictionary | This represents the model assessment metric(s) you want to use to measure training and testing performance across the folds, for example, 'roc_auc'. |
n_jobs | Integer, or None | The number of processing jobs to run in parallel. It may speed up cross-validation to run parallel jobs, but it is good to experiment to be sure. |
pre_dispatch | Integer, or a string formula such as '2*n_jobs' | The number of jobs, or a formula for the number of jobs, to dispatch. Relevant for parallel processing using n_jobs. |
cv | Integer, cross-validation splitter, or iterable | If supplying an integer, this is the number of folds to use for cross-validation. |
refit | Boolean, string, or callable | After doing the cross-validation, the "best" hyperparameters according to the metric specified in scoring can be used directly with the fitted GridSearchCV object. |
verbose | Integer | Controls how much output you will see from the cross-validation procedure. |
error_score | 'raise', or a numeric value | What to do if an error happens during model fitting. |
return_train_score | Boolean | Whether or not to compute and return training scores on the folds. It is not required for selecting the best hyperparameters based on testing fold scores, and for some datasets and models, this can take substantially more time. However, it does give insights into possible overfitting. |
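To illustrate how a few of these options fit together, here is a hedged, self-contained sketch on synthetic data; it is not the case study workflow, just an indication of how the fitted GridSearchCV object can be used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the case study data
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

cv = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid={'max_depth': [2, 4, 6, 8]},
    scoring='roc_auc',
    cv=4,
    return_train_score=True,  # also keep training fold scores, to check for overfitting
)
cv.fit(X_train, y_train)

print(cv.best_params_)  # hyperparameters with the highest mean testing score
print(cv.best_score_)   # the corresponding mean cross-validation score

# Because refit=True by default, the best model is refit on all the training
# data, so the fitted GridSearchCV can be used directly for prediction.
y_pred = cv.predict(X_test)

# cv.cv_results_ holds the per-fold scores, useful for examining variability.
```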
In the next lesson, we’ll make use of the standard error of the mean to create error bars. We’ll average the model performance metric across the testing folds, and the error bars will help us visualize how variable model performance is across the folds.
The standard error of the mean is also known as the standard deviation of the sampling distribution of the sample mean. That is a long name, but the concept isn’t too complicated. The idea is that the set of model performance metrics we wish to make error bars for represents one possible sample from a theoretical, larger population of similar samples, for example, if more data were available and we used it to create more testing folds. If we could take repeated samples from that larger population, each sampling event would result in a slightly different mean (the sample mean).
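As a minimal sketch of the calculation, assuming a small array of illustrative per-fold testing scores (not actual results from the case study), the standard error of the mean is the sample standard deviation divided by the square root of the number of folds.

```python
import numpy as np

fold_scores = np.array([0.78, 0.81, 0.76, 0.80])  # illustrative testing scores from 4 folds

mean_score = fold_scores.mean()
# Standard error of the mean: sample standard deviation divided by sqrt(n)
sem = fold_scores.std(ddof=1) / np.sqrt(len(fold_scores))

print(f'Mean score: {mean_score:.3f} +/- {sem:.3f}')
# matplotlib's errorbar(x, y, yerr=sem) could then draw these as error bars.
```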
Constructing ...