Exercise: The Synthetic Data Classification Problem

Learn to address the overfitting problem in a synthetic classification problem using L1 regularization.

Reducing overfitting on the synthetic data classification problem

This exercise is a continuation of Exercise: Generating and Modeling Synthetic Classification Data. Here, we will use a cross-validation procedure to find a good value for the hyperparameter C, using only the training data and reserving the test data until after model building is complete. Be prepared: this is a long exercise. However, it illustrates a general procedure that you will be able to use with many different kinds of machine learning models, so it is worth the time spent here. Perform the following steps to complete the exercise:

  1. Vary the value of the regularization parameter, C, to range from C = 1000 to C = 0.001. You can use the following snippets to do this.

    First, define exponents, which will be powers of 10, as follows:

    import numpy as np

    C_val_exponents = np.linspace(3, -3, 13)
    C_val_exponents
    

    Here is the output of the preceding code:

    array([ 3. , 2.5, 2. , 1.5, 1. , 0.5, 0. , -0.5, -1. , -1.5, -2. , -2.5, -3. ])
    

    Now, vary C by powers of 10, as follows:

    C_vals = 10.0**C_val_exponents
    C_vals
    

    Here is the output of the preceding code:

    array([1.00000000e+03, 3.16227766e+02, 1.00000000e+02, 3.16227766e+01, 1.00000000e+01, 3.16227766e+00, 1.00000000e+00, 3.16227766e-01, 1.00000000e-01, 3.16227766e-02, 1.00000000e-02, 3.16227766e-03, 1.00000000e-03])
    

    It’s generally a good idea to vary the regularization parameter by powers of 10, or by using a similar strategy, since training models can take a substantial amount of time, especially when using k-fold cross-validation. Sweeping a wide range this way gives you a good idea of how C affects the bias-variance trade-off, without needing to train a very large number of models. In addition to the integer powers of 10, we also include points that are about halfway between them on the log10 scale. If there appears to be interesting behavior between these relatively widely spaced values, you can add more granular values of C in a smaller part of the range of possible values, as sketched below.
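
    For instance, here is a minimal sketch of that zooming-in idea, assuming a coarse sweep had hinted at interesting behavior between C = 0.1 and C = 0.01 (the endpoints here are illustrative only, not from the exercise):

    import numpy as np

    # Illustrative only: finer spacing between 10**-1 and 10**-2
    fine_exponents = np.linspace(-1, -2, 11)
    fine_C_vals = 10.0**fine_exponents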

  2. Import the roc_curve function:

    from sklearn.metrics import roc_curve
    

    We’ll continue to use the ROC AUC score for assessing training and testing performance. Now that we have several values of C to try and several folds (in this case, four) for the cross-validation, we will want to store the training and testing scores for each fold and for each value of C ...
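
    As a rough sketch of that bookkeeping, assuming X_train and y_train are NumPy arrays holding the synthetic training data from the previous exercise (these names, and the choice of StratifiedKFold as the splitter, are assumptions, not from this excerpt), you could store one ROC AUC per fold and per value of C in two 2D arrays:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    n_folds = 4
    k_folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=1)

    # One row per fold, one column per candidate value of C
    cv_train_roc_auc = np.empty((n_folds, len(C_vals)))
    cv_test_roc_auc = np.empty((n_folds, len(C_vals)))

    for c_index, c_val in enumerate(C_vals):
        # L1-regularized logistic regression; liblinear supports the L1 penalty
        model = LogisticRegression(penalty='l1', C=c_val, solver='liblinear')
        for fold_index, (tr_ix, te_ix) in enumerate(k_folds.split(X_train, y_train)):
            model.fit(X_train[tr_ix], y_train[tr_ix])
            # Score the positive-class probabilities with ROC AUC
            train_probs = model.predict_proba(X_train[tr_ix])[:, 1]
            test_probs = model.predict_proba(X_train[te_ix])[:, 1]
            cv_train_roc_auc[fold_index, c_index] = roc_auc_score(y_train[tr_ix], train_probs)
            cv_test_roc_auc[fold_index, c_index] = roc_auc_score(y_train[te_ix], test_probs)

    Averaging each column of cv_test_roc_auc across folds then gives one cross-validated score per value of C, which you can compare across C_val_exponents to pick a good setting.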
