Exercise: The Synthetic Data Classification Problem
Learn to address overfitting in a synthetic classification problem using L1 regularization.
Reducing overfitting on the synthetic data classification problem
This exercise is a continuation of Exercise: Generating and Modeling Synthetic Classification Data. Here, we will use a cross-validation procedure to find a good value for the hyperparameter C. We will do this using only the training data, reserving the test data for after model building is complete. Be prepared: this is a long exercise, but it illustrates a general procedure that you will be able to use with many different kinds of machine learning models, so it is worth the time spent here. Perform the following steps to complete the exercise:
- Vary the value of the regularization parameter, C, to range from 10^3 to 10^-3. You can use the following snippets to do this.
First, define exponents, which will be powers of 10, as follows:
import numpy as np

# Exponents from 3 down to -3 in steps of 0.5
C_val_exponents = np.linspace(3, -3, 13)
C_val_exponents
Here is the output of the preceding code:
array([ 3. , 2.5, 2. , 1.5, 1. , 0.5, 0. , -0.5, -1. , -1.5, -2. , -2.5, -3. ])
Now, vary C by these powers of 10, as follows:
# Use np.float64 here; the np.float alias was removed in newer NumPy versions
C_vals = np.float64(10)**C_val_exponents
C_vals
Here is the output of the preceding code:
array([1.00000000e+03, 3.16227766e+02, 1.00000000e+02, 3.16227766e+01, 1.00000000e+01, 3.16227766e+00, 1.00000000e+00, 3.16227766e-01, 1.00000000e-01, 3.16227766e-02, 1.00000000e-02, 3.16227766e-03, 1.00000000e-03])
It’s generally a good idea to vary the regularization parameter by powers of 10, or by using a similar strategy, because training models can take a substantial amount of time, especially when using k-fold cross-validation. Spacing the values this way gives you a good idea of how a wide range of values impacts the bias-variance trade-off, without needing to train a very large number of models. In addition to the integer powers of 10, we also include points on the log scale that are about halfway between them. If there appears to be interesting behavior in between these relatively widely spaced values, you can add more granular values for C in a smaller part of the range of possible values, as sketched below.
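For instance, if the cross-validation results suggested something interesting happening between C = 1 and C = 0.1, you could generate a finer grid over just that part of the log scale. The following is a minimal sketch of one way to do this; the endpoints, step size, and variable names are illustrative assumptions, not values prescribed by the exercise:

import numpy as np

# Hypothetical zoom-in: exponents from 0 down to -1 in steps of 0.1,
# giving C values between 1 and 0.1 on a finer log-spaced grid
fine_C_val_exponents = np.linspace(0, -1, 11)
fine_C_vals = np.float64(10)**fine_C_val_exponents
fine_C_vals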
- Import the roc_curve function:

from sklearn.metrics import roc_curve
We’ll continue to use the ROC AUC score for assessing training and testing performance. Now that we have several values of C to try and several folds (in this case four) for the cross-validation, we will want to store the training and test scores for each fold and for each value of C...
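The exercise continues from this point. As a preview of the bookkeeping involved, here is a minimal sketch of one way to store the ROC AUC for each fold and each value of C in two (folds x C values) arrays. It assumes the training features and labels are available as NumPy arrays with the hypothetical names X_train and y_train, and that C_vals is the array defined above. For simplicity, it computes the AUC directly with roc_auc_score rather than building full ROC curves with roc_curve; it illustrates the general pattern, not the exercise's exact code:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

n_folds = 4
k_folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=1)

# One row per fold, one column per candidate value of C
cv_train_roc_auc = np.zeros((n_folds, len(C_vals)))
cv_test_roc_auc = np.zeros((n_folds, len(C_vals)))

for c_idx, c_val in enumerate(C_vals):
    # L1-regularized logistic regression with the current value of C
    model = LogisticRegression(penalty='l1', solver='liblinear', C=c_val)
    for fold_idx, (train_ix, val_ix) in enumerate(k_folds.split(X_train, y_train)):
        model.fit(X_train[train_ix], y_train[train_ix])
        # Score with predicted probabilities of the positive class
        train_probs = model.predict_proba(X_train[train_ix])[:, 1]
        val_probs = model.predict_proba(X_train[val_ix])[:, 1]
        cv_train_roc_auc[fold_idx, c_idx] = roc_auc_score(y_train[train_ix], train_probs)
        cv_test_roc_auc[fold_idx, c_idx] = roc_auc_score(y_train[val_ix], val_probs)

Averaging cv_test_roc_auc over its rows (one mean per value of C) would then show how out-of-fold performance varies with the strength of regularization, which is what we use to choose C before touching the reserved test data.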