Exercise: The Synthetic Data Classification Problem

Learn to address overfitting in a synthetic classification problem using L1 regularization.

Reducing overfitting on the synthetic data classification problem

This exercise is a continuation of Exercise: Generating and Modeling Synthetic Classification Data. Here, we will use a cross-validation procedure to find a good value for the hyperparameter C. We will do this using only the training data, reserving the test data for after model building is complete. Be prepared: this is a long exercise, but it illustrates a general procedure that you will be able to use with many different kinds of machine learning models, so it is worth the time spent here. Perform the following steps to complete the exercise:

  1. Vary the value of the regularization parameter, C, to range from C = 1000 to C = 0.001. You can use the following snippets to do this.

    First, define exponents, which will be powers of 10, as follows:

    C_val_exponents = np.linspace(3,-3,13) 
    C_val_exponents
    

    Here is the output of the preceding code:

    array([ 3. , 2.5, 2. , 1.5, 1. , 0.5, 0. , -0.5, -1. , -1.5, -2. , -2.5, -3. ])
    

    Now, vary C by powers of 10, as follows:

    C_vals = np.float64(10)**C_val_exponents
    C_vals
    

    Here is the output of the preceding code:

    array([1.00000000e+03, 3.16227766e+02, 1.00000000e+02, 3.16227766e+01, 1.00000000e+01, 3.16227766e+00, 1.00000000e+00, 3.16227766e-01, 1.00000000e-01, 3.16227766e-02, 1.00000000e-02, 3.16227766e-03, 1.00000000e-03])
    

    It's generally a good idea to vary the regularization parameter by powers of 10, or by using a similar strategy, as training models can take a substantial amount of time, especially when using k-fold cross-validation. This gives you a good idea of how a wide range of C values impacts the bias-variance trade-off, without needing to train a very large number of models. In addition to the integer powers of 10, we also include points on the log10 scale that are about halfway between. If it seems like there is some interesting behavior in between these relatively widely spaced values, you can add more granular values for C in a smaller part of the range of possible values, as sketched below.
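    For instance, if there appeared to be interesting behavior between C = 1 and C = 0.1, a finer grid of exponents over just that sub-range could be defined in the same way (the sub-range and step size here are hypothetical, chosen only for illustration):

    import numpy as np  # likely already imported in the previous exercise

    # A finer grid: 11 points between 10**0 = 1 and 10**-1 = 0.1
    C_val_exponents_fine = np.linspace(0, -1, 11)
    C_vals_fine = np.float64(10)**C_val_exponents_fine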

  2. Import the roc_curve function:

    from sklearn.metrics import roc_curve
    

    We'll continue to use the ROC AUC score for assessing training and testing performance. Now that we have several values of C to try, and several folds (in this case, four) for the cross-validation, we will want to store the training and test scores for each fold and for each value of C.

  3. Define a function that takes the k_folds cross-validation splitter, the array of C values (C_vals), the model object (model), and the features and response variable (X and Y, respectively) as inputs, to explore different amounts of regularization with k-fold cross-validation. Use the following code:

    def cross_val_C_search(k_folds, C_vals, model, X, Y):
    

    Note: The function we started in this step will return the ROC AUCs and ROC curve data. The return block will be written during a later step in the exercise. For now, you can simply write the preceding code as is, because we will be defining k_folds, C_vals, model, X, and Y as we progress in the exercise.

  4. Within this function block, create a NumPy array to hold model performance data, with dimensions n_folds by len(C_vals):

    n_folds = k_folds.n_splits 
    cv_train_roc_auc = np.empty((n_folds, len(C_vals))) 
    cv_test_roc_auc = np.empty((n_folds, len(C_vals)))
    

    Next, we’ll store the arrays of true and false positive rates and thresholds that go along with each of the test ROC AUC scores in a list of lists.

    Note: This is a convenient way to store all this model performance information, as a list in Python can contain any kind of data, including another list. Here, each item of the inner lists in the list of lists will be a tuple holding the arrays of FPRs, TPRs, and thresholds (in the order returned by roc_curve) for each of the folds, for each of the C values. Tuples are an ordered collection data type in Python, similar to lists, but unlike lists they are immutable: the items in a tuple can't be changed after the tuple is created. When a function returns multiple values, like the roc_curve function of scikit-learn, these values can be output to a single variable, which will be a tuple of those values. This way of storing results should be more obvious when we access these arrays later in order to examine them.
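    As a small, standalone illustration of this behavior (the toy labels and scores below are hypothetical and serve only to show the tuple handling):

    import numpy as np
    from sklearn.metrics import roc_curve

    # Toy labels and predicted scores, used only to demonstrate the return value
    y_true_example = np.array([0, 0, 1, 1])
    y_score_example = np.array([0.1, 0.4, 0.35, 0.8])

    # Capturing the multiple return values in one variable yields a tuple...
    roc_tuple = roc_curve(y_true_example, y_score_example)
    print(type(roc_tuple))  # <class 'tuple'>

    # ...which can also be unpacked into its components: FPR, TPR, and thresholds
    fpr, tpr, thresholds = roc_tuple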

  5. Create a list containing one empty list for each value of C, using a list comprehension as follows:

    cv_test_roc = [[] for c in range(len(C_vals))]
    

    Each inner list will hold the tuples of metrics (FPR, TPR, thresholds) for the folds at that value of C. Note that [[]]*len(C_vals) would not work here: it would create several references to a single shared inner list, so every append would end up in the same place.

    We have learned how to loop through the different folds for cross-validation in the preceding section. What we need to do now is write an outer loop in which we will nest the cross-validation loop.

  6. Create an outer loop for training and testing each of the k-folds for each value of CC:

    for c_val_counter in range(len(C_vals)): 
        #Set the C value for the model object 
        model.C = C_vals[c_val_counter] 
        #Count folds for each value of C 
        fold_counter = 0 
    

    We can reuse the same model object that we already have, and simply set a new value of C within each run of the loop. Inside the loop over C values, we run the cross-validation loop. We begin by obtaining the training and test data row indices that the splitter yields for each split.

  7. Obtain the training and test indices for each fold:

    for train_index, test_index in k_folds.split(X, Y):
    
  8. Index the features and response variable to obtain the training and test data for this fold using the following code:

    X_cv_train, X_cv_test = X[train_index], X[test_index] 
    y_cv_train, y_cv_test = Y[train_index], Y[test_index] 
    

    The training data for the current fold is then used to train the model.

  9. Fit the model on the training data, as follows:

    model.fit(X_cv_train, y_cv_train)
    

    This will effectively “reset” the model from whatever the previous coefficients and intercept were to reflect the training on this new data.
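    To see this behavior in isolation, here is a small, self-contained sketch (the datasets and model settings below are hypothetical, used only to show that refitting overwrites the previously learned parameters):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Two different toy datasets
    X_a, y_a = make_classification(n_samples=200, random_state=0)
    X_b, y_b = make_classification(n_samples=200, random_state=1)

    lr = LogisticRegression(solver='liblinear')
    lr.fit(X_a, y_a)
    coef_first = lr.coef_.copy()

    # Fitting again replaces the earlier coefficients and intercept
    lr.fit(X_b, y_b)
    print(np.allclose(coef_first, lr.coef_))  # typically False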

    The training and test ROC AUC scores are then obtained, as well as the arrays of FPRs, TPRs, and thresholds that go along with the test data.

  10. Obtain the training ROC AUC score:

    y_cv_train_predict_proba = model.predict_proba(X_cv_train) 
    cv_train_roc_auc[fold_counter, c_val_counter] = roc_auc_score(y_cv_train,\
    y_cv_train_predict_proba[:,1]) 
    
  11. Obtain the test ROC AUC score:

    y_cv_test_predict_proba = model.predict_proba(X_cv_test) 
    cv_test_roc_auc[fold_counter, c_val_counter] = roc_auc_score(y_cv_test, y_cv_test_predict_proba[:,1])
    
  12. Obtain the test ROC curves for each fold using the following code:

    this_fold_roc = roc_curve(y_cv_test, y_cv_test_predict_proba[:,1])
    cv_test_roc[c_val_counter].append(this_fold_roc)
    

    We will use a fold counter to keep track of which fold we are on, incrementing it at the end of each pass through the inner loop. Once outside the cross-validation loop, we print a status update to standard output. Whenever performing long computational procedures, it's a good idea to periodically print the status of the job so that you can monitor its progress and confirm that things are still working correctly (a timestamped variation is sketched after step 14). This cross-validation procedure will likely take only a few seconds on your laptop, but for longer jobs this kind of message can be especially reassuring.

  13. Increment the fold counter using the following code:

    fold_counter += 1
    
  14. Write the following code to indicate the progress of execution for each value of C:

    print('Done with C = {}'.format(model.C))
    
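    If you expect a longer run, a timestamped variation of this status message (a hypothetical tweak using Python's standard library) can help you gauge how long the remaining values of C will take:

    from datetime import datetime

    # Same status message, with the current time appended
    print('Done with C = {} at {}'.format(model.C, datetime.now()))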
  15. Write the code to return the ROC AUCs and ROC curve data and finish the function:

    return cv_train_roc_auc, cv_test_roc_auc, cv_test_roc 
    

    Note that we will continue to use the split into four folds that we illustrated previously, but you are encouraged to try this procedure with different numbers of folds to compare the effect.
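    For example, a splitter with a different number of folds could be created with scikit-learn's StratifiedKFold (whether the previous exercise used this exact splitter class, and these shuffle and random_state settings, is an assumption here):

    from sklearn.model_selection import StratifiedKFold

    # Six stratified folds instead of four
    k_folds_6 = StratifiedKFold(n_splits=6, shuffle=True, random_state=1)

    # This could then be passed to the function in place of k_folds, for example:
    # cross_val_C_search(k_folds_6, C_vals, lr_syn, X_syn_train, y_syn_train)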

    We have covered a lot of material in the preceding steps. You may want to take a few moments to review this in order to make sure that you understand each part. Running the function is comparatively simple. That is the beauty of a well-designed function—all the complicated parts get abstracted away, allowing you to concentrate on usage.
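    To help with that review, here is one way the pieces from steps 3 to 15 fit together as a single function. This is simply the code shown above, assembled with its nesting made explicit (it assumes numpy as np, roc_auc_score, and roc_curve are already imported, as in the previous steps and exercise):

    def cross_val_C_search(k_folds, C_vals, model, X, Y):
        n_folds = k_folds.n_splits
        cv_train_roc_auc = np.empty((n_folds, len(C_vals)))
        cv_test_roc_auc = np.empty((n_folds, len(C_vals)))
        # One empty inner list per value of C
        cv_test_roc = [[] for c in range(len(C_vals))]
        for c_val_counter in range(len(C_vals)):
            # Set the C value for the model object
            model.C = C_vals[c_val_counter]
            # Count folds for each value of C
            fold_counter = 0
            # Get the training and testing indices for each fold
            for train_index, test_index in k_folds.split(X, Y):
                X_cv_train, X_cv_test = X[train_index], X[test_index]
                y_cv_train, y_cv_test = Y[train_index], Y[test_index]
                # Fit the model on the training data for this fold
                model.fit(X_cv_train, y_cv_train)
                # Training ROC AUC
                y_cv_train_predict_proba = model.predict_proba(X_cv_train)
                cv_train_roc_auc[fold_counter, c_val_counter] = \
                    roc_auc_score(y_cv_train, y_cv_train_predict_proba[:,1])
                # Testing ROC AUC
                y_cv_test_predict_proba = model.predict_proba(X_cv_test)
                cv_test_roc_auc[fold_counter, c_val_counter] = \
                    roc_auc_score(y_cv_test, y_cv_test_predict_proba[:,1])
                # Testing ROC curve (FPR, TPR, thresholds) for this fold
                this_fold_roc = roc_curve(y_cv_test, y_cv_test_predict_proba[:,1])
                cv_test_roc[c_val_counter].append(this_fold_roc)
                fold_counter += 1
            print('Done with C = {}'.format(model.C))
        return cv_train_roc_auc, cv_test_roc_auc, cv_test_roc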

  16. Run the function we've designed to examine cross-validation performance, with the C values that we previously defined, using the model and data we were working with in the previous exercise. Use the following code:

    cv_train_roc_auc, cv_test_roc_auc, cv_test_roc = cross_val_C_search(k_folds, C_vals, lr_syn,\
    X_syn_train, y_syn_train)
    

    When you run this code, you should see the following output populate below the code cell as the cross-validation is completed for each value of C:

    Done with C = 1000.0 
    Done with C = 316.22776601683796 
    Done with C = 100.0 
    Done with C = 31.622776601683793 
    Done with C = 10.0 
    Done with C = 3.1622776601683795 
    Done with C = 1.0 
    Done with C = 0.31622776601683794
    Done with C = 0.1 
    Done with C = 0.03162277660168379 
    Done with C = 0.01 
    Done with C = 0.0031622776601683794 
    Done with C = 0.001
    

    So, what do the results of the cross-validation look like? There are a few ways to examine this. It is useful to look at the performance of each fold individually, so that you can see how variable the results are.

    This tells you how different subsets of your data perform as test sets, leading to a general idea of the range of performance you can expect from the unseen test set. What we're interested in here is whether or not we are able to use regularization to alleviate the overfitting that we saw. We know that using C = 1,000 led to overfitting; we know this from comparing the training and test scores. But what about the other C values that we've tried? A good way to visualize this will be to plot the training and test scores on the y-axis and the values of C on the x-axis.

  17. Loop over each of the folds to view their results individually by using the following code:

    for this_fold in range(k_folds.n_splits): 
        plt.plot(C_val_exponents, cv_train_roc_auc[this_fold], '-o', color=cmap(this_fold),\
        label='Training fold {}'.format(this_fold+1)) 
        plt.plot(C_val_exponents, cv_test_roc_auc[this_fold], '-x', color=cmap(this_fold),\
        label='Testing fold {}'.format(this_fold+1)) 
    plt.ylabel('ROC AUC') 
    plt.xlabel('log$_{10}$(C)') 
    plt.legend(loc = [1.1, 0.2]) 
    plt.title('Cross validation scores for each fold')
    

    You will obtain a plot of the training and testing ROC AUC scores for each of the four folds, across the range of log10(C) values (the output figure is not reproduced here).
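    Note that cmap, used to give each fold its own color, was defined in the previous exercise. If you need to recreate it, any qualitative matplotlib colormap will do; for example (this exact choice is an assumption, not necessarily what the previous exercise used):

    import matplotlib.pyplot as plt

    # A qualitative colormap; calling it with integers 0, 1, 2, ... returns distinct colors
    cmap = plt.cm.tab10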
