Choosing the regularization parameter

By now, you may suspect that we could use regularization in order to decrease the overfitting we observed when we tried to model the synthetic data in Exercise: Generating and Modeling Synthetic Classification Data. The question is, how do we choose the regularization parameter C? C is an example of a model hyperparameter. Hyperparameters are different from the parameters that are estimated when a model is trained, such as the coefficients and the intercept of a logistic regression. Rather than being estimated by an automated procedure like the parameters are, hyperparameters are input directly by the user as keyword arguments, typically when instantiating the model class. So, how do we know what values to choose?

Hyperparameters are more difficult to estimate than parameters. This is because it is up to the data scientist to determine what the best value is, as opposed to letting an optimization algorithm find it. However, it is possible to programmatically choose hyperparameter values, which could be viewed as an optimization procedure in its own right. Practically speaking, in the case of the regularization parameter C, this is most commonly done by fitting the model on one set of data with a particular value of C, determining model training performance, and then assessing the out-of-sample performance on another set of data.
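For instance, here is a minimal sketch of that procedure, assuming we already have a training set (X_train, y_train) and a separate evaluation set (X_val, y_val); these variable names, the choice of C, and the use of the ROC AUC metric are for illustration only:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# C is a hyperparameter we supply; the coefficients and intercept are
# parameters estimated by .fit()
lr = LogisticRegression(C=0.1, solver='liblinear')
lr.fit(X_train, y_train)

# Compare performance on the data used for fitting versus held-out data
train_auc = roc_auc_score(y_train, lr.predict_proba(X_train)[:, 1])
val_auc = roc_auc_score(y_val, lr.predict_proba(X_val)[:, 1])
print(f'C = 0.1: training ROC AUC = {train_auc:.3f}, out-of-sample ROC AUC = {val_auc:.3f}')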

We are already familiar with the concept of using model training and test sets. However, there is a key difference here; for instance, what would happen if we were to use the test set multiple times in order to see the effect of different values of C?

It may occur to you that after the first time you use the unseen test set to assess the out-of-sample performance for a particular value of C, it is no longer an “unseen” test set. While only the training data was used for estimating the model parameters (that is, the coefficients and the intercept), now the test data is being used to estimate the hyperparameter C. Effectively, the test data has now become additional training data in the sense that it is being used to find a good value for the hyperparameter. For this reason, it is common to divide the data into three parts: a training set, a test set, and a validation set. The validation set serves multiple purposes; let’s discuss them.

Estimating hyperparameters

The validation set can be used repeatedly to assess out-of-sample performance with different hyperparameter values, in order to select the best ones.

Comparing different models

In addition to finding hyperparameter values for a model, the validation set can be used to estimate the out-of-sample performance of different models; for example, if we wanted to compare logistic regression to a random forest.

Data management best practices

As a data scientist, it's up to you to figure out how to divide up your data for different predictive modeling tasks. In the ideal case, you should reserve a portion of your data for the very end of the process, after you've already selected model hyperparameters and also selected the best model. This unseen test set is reserved for the last step, when it can be used to assess the endpoint of your model-building efforts, to see how the final model generalizes to new unseen data. When reserving the test set, it is good practice to make sure that the features and responses have similar characteristics to the rest of the data. In other words, the class fraction should be the same, and the distribution of features should be similar. This way, the test data should be representative of the data you built the model with.
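As a rough sketch of how such a three-way split might be made, assuming the features and response live in arrays called X and y (names used here only for illustration), we could call train_test_split twice with the stratify option to preserve the class fraction in each part:

from sklearn.model_selection import train_test_split

# Reserve 20% of the data as the unseen test set, keeping the class fraction similar
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Split the remainder into training and validation sets, again stratified
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=1)

# Result: roughly 60% training, 20% validation, 20% test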

While model validation is a good practice, it raises the question of whether the particular split we choose for the training, validation, and test data has any effect on the outcomes that we are tracking. For example, perhaps the relationship between the features and the response variable is slightly different in the unseen test set that we have reserved, or in the validation set, versus the training set. It is likely impossible to eliminate all such variability, but we can use the method of cross-validation to avoid placing too much faith in one particular split of the data.

Scikit-learn cross-validation functions

Scikit-learn provides convenient functions to facilitate cross-validation analyses. These functions play a similar role to train_test_split, which we have already been using, although the default behavior is somewhat different. Let’s get familiar with them now. First, import these two classes:

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold

Similar to train_test_split, we need to specify what proportion of the dataset we would like to use for training versus testing. However, with cross-validation (specifically, the k-fold cross-validation implemented in the classes we just imported), rather than specifying a proportion directly, we simply indicate how many folds we would like (the “k” in k-fold). The idea here is that the data will be divided into k equal portions. For example, if we specify 4 folds, then each fold will contain 25% of the data. Each fold serves as the test data in one of four separate rounds of model training, while the remaining 75% of the data is used to train the model in that round. In this procedure, each data point gets used as training data a total of k - 1 times, and as test data only once.
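As a quick illustration of this arithmetic, here is a small sketch using a made-up array of 20 rows (not our actual data): with 4 folds, each test set should contain 5 rows (25%), each training set 15 rows (75%), and every row should appear in a test set exactly once.

import numpy as np
from sklearn.model_selection import KFold

toy_data = np.arange(20).reshape(-1, 1)  # 20 rows of made-up data
for fold_num, (train_ix, test_ix) in enumerate(KFold(n_splits=4).split(toy_data)):
    # Each fold: 15 training rows (75%) and 5 test rows (25%)
    print(f'Fold {fold_num}: {len(train_ix)} training rows, {len(test_ix)} test rows')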

When instantiating the class, we indicate the number of folds, whether or not to shuffle the data before splitting, and a random seed if we want repeatable results across different runs:

n_folds = 4
k_folds = KFold(n_splits=n_folds, shuffle=False)

Here, we’ve instantiated an object with four folds and no shuffling. We use the returned object, which we’ve called k_folds, by passing the features and response data we wish to use for cross-validation to its .split method. This returns an iterator, which we can loop through to get the different splits of training and test data. If we took the training data from our synthetic classification problem, X_syn_train and y_syn_train, we could loop through the splits like this:

for train_index, test_index in k_folds.split(X_syn_train, y_syn_train):
    # use train_index and test_index to select the rows for this fold

The iterator will return the row indices of X_syn_train and y_syn_train, which we can use to index the data. Inside this for loop, we can write code to use these indices to select data for repeatedly training and testing a model object with different subsets of the data. In this way, we can get a robust indication of the out-of-sample performance when using one particular hyperparameter value, and then repeat the whole process using another hyperparameter value. Consequently, the cross-validation loop may sit nested inside an outer loop over different hyperparameter values. We’ll illustrate this in the following exercise.
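Here is a hedged sketch of what that nested structure might look like, assuming X_syn_train and y_syn_train are NumPy arrays and using ROC AUC as the performance metric; the candidate values of C and the choice of metric are made only for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

for c_value in [100, 10, 1, 0.1, 0.01]:  # outer loop: candidate hyperparameter values
    fold_aucs = []
    for train_index, test_index in k_folds.split(X_syn_train, y_syn_train):
        # Inner loop: fit on this fold's training rows, evaluate on its test rows
        lr = LogisticRegression(C=c_value, solver='liblinear')
        lr.fit(X_syn_train[train_index], y_syn_train[train_index])
        test_probs = lr.predict_proba(X_syn_train[test_index])[:, 1]
        fold_aucs.append(roc_auc_score(y_syn_train[test_index], test_probs))
    print(f'C = {c_value}: mean cross-validation ROC AUC = {np.mean(fold_aucs):.3f}')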

First though, what do these splits look like? We can get a sense by plotting the indices from train_index and test_index in different colors for each fold.
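One possible way to produce such a picture, assuming matplotlib is available and reusing the k_folds object and synthetic training data from above, is to draw each fold as a row of points, colored by whether each index falls in the training or test portion:

import matplotlib.pyplot as plt

for fold_num, (train_index, test_index) in enumerate(
        k_folds.split(X_syn_train, y_syn_train)):
    # One horizontal row of points per fold: training indices in gray, test indices in black
    plt.scatter(train_index, [fold_num] * len(train_index), c='gray', marker='.',
                label='Training' if fold_num == 0 else None)
    plt.scatter(test_index, [fold_num] * len(test_index), c='black', marker='.',
                label='Test' if fold_num == 0 else None)
plt.xlabel('Index of row in the data')
plt.ylabel('Fold number')
plt.legend()
plt.show()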
