Exercise: The Synthetic Data Classification Problem

Learn to address the overfitting problem in a synthetic classification problem using L1 regularization.

Reducing overfitting on the synthetic data classification problem

This exercise is a continuation of Exercise: Generating and Modeling Synthetic Classification Data. Here, we will use a cross-validation procedure to find a good value for the hyperparameter C, using only the training data and reserving the test data until after model building is complete. Be prepared: this is a long exercise. However, it illustrates a general procedure that you will be able to use with many different kinds of machine learning models, so it is worth the time spent here. Perform the following steps to complete the exercise:

  1. Vary the value of the regularization parameter, C, to range from C = 1000 to C = 0.001. You can use the following snippets to do this.

    First, define exponents, which will be powers of 10, as follows:

    import numpy as np

    C_val_exponents = np.linspace(3, -3, 13)
    C_val_exponents
    

    Here is the output of the preceding code:

    array([ 3. , 2.5, 2. , 1.5, 1. , 0.5, 0. , -0.5, -1. , -1.5, -2. , -2.5, -3. ])
    

    Now, vary C by powers of 10, as follows:

    C_vals = 10.0**C_val_exponents
    C_vals
    

    Here is the output of the preceding code:

    array([1.00000000e+03, 3.16227766e+02, 1.00000000e+02, 3.16227766e+01, 1.00000000e+01, 3.16227766e+00, 1.00000000e+00, 3.16227766e-01, 1.00000000e-01, 3.16227766e-02, 1.00000000e-02, 3.16227766e-03, 1.00000000e-03])
    

    It’s generally a good idea to vary the regularization parameter by powers of 10, or by using a similar strategy, since training models can take a substantial amount of time, especially when using k-fold cross-validation. Sweeping a wide range this way gives you a good idea of how C affects the bias-variance trade-off, without needing to train a very large number of models. In addition to the integer powers of 10, we also include points that are about halfway between them on the log10 scale. If there appears to be interesting behavior between these relatively widely spaced values, you can add more granular values of C in a smaller part of the range of possible values, as sketched below.
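
    For instance, here is a minimal sketch of that zooming-in idea, assuming a coarse sweep had hinted at interesting behavior between C = 0.1 and C = 0.01 (the endpoints here are illustrative only, not from the exercise):

    import numpy as np

    # Illustrative only: finer spacing between 10**-1 and 10**-2
    fine_exponents = np.linspace(-1, -2, 11)
    fine_C_vals = 10.0**fine_exponents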

  2. Import the roc_curve function:

    from sklearn.metrics import roc_curve
    

    We’ll continue to use the ROC AUC score for assessing training and testing performance. Now that we have several values of C to try and several folds (in this case, four) for the cross-validation, we will want to store the training and testing scores for each fold and for each value of C ...
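
    As a rough sketch of that bookkeeping, assuming X_train and y_train are NumPy arrays holding the synthetic training data from the previous exercise (these names, and the choice of StratifiedKFold as the splitter, are assumptions, not from this excerpt), you could store one ROC AUC per fold and per value of C in two 2D arrays:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    n_folds = 4
    k_folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=1)

    # One row per fold, one column per candidate value of C
    cv_train_roc_auc = np.empty((n_folds, len(C_vals)))
    cv_test_roc_auc = np.empty((n_folds, len(C_vals)))

    for c_index, c_val in enumerate(C_vals):
        # L1-regularized logistic regression; liblinear supports the L1 penalty
        model = LogisticRegression(penalty='l1', C=c_val, solver='liblinear')
        for fold_index, (tr_ix, te_ix) in enumerate(k_folds.split(X_train, y_train)):
            model.fit(X_train[tr_ix], y_train[tr_ix])
            # Score the positive-class probabilities with ROC AUC
            train_probs = model.predict_proba(X_train[tr_ix])[:, 1]
            test_probs = model.predict_proba(X_train[te_ix])[:, 1]
            cv_train_roc_auc[fold_index, c_index] = roc_auc_score(y_train[tr_ix], train_probs)
            cv_test_roc_auc[fold_index, c_index] = roc_auc_score(y_train[te_ix], test_probs)

    Averaging each column of cv_test_roc_auc across folds then gives one cross-validated score per value of C, which you can compare across C_val_exponents to pick a good setting.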
