
Exercise: Generating and Modeling Synthetic Classification Data

Explore how to create synthetic datasets with many features for binary classification and use logistic regression with L1 regularization to address overfitting. Learn to split data, train models, and evaluate performance with ROC AUC scores, gaining hands-on experience with managing high-dimensional data and balancing bias and variance.

Overfitting in binary classification

Imagine you are given a binary classification dataset with many candidate features (200) and no time to examine each one individually. Some of these features may be highly correlated or otherwise related, but with this many variables, it can be difficult to explore all of them effectively. Additionally, the dataset has relatively few samples: only 1,000. We are going to generate this challenging dataset using a feature of scikit-learn that allows you to create synthetic datasets for conceptual explorations such as this. Perform the following steps to complete the exercise:

  1. Import make_classification, train_test_split, LogisticRegression, and roc_auc_score from scikit-learn using the following code:

    from sklearn.datasets import make_classification 
    from sklearn.model_selection import train_test_split 
    from sklearn.linear_model import LogisticRegression 
    from sklearn.metrics import roc_auc_score
    

    Notice that we’ve imported several familiar classes and functions from scikit-learn, in addition to a new one that we haven’t seen before: make_classification. This function does just what its name indicates: it makes data for a classification problem. Using the various keyword arguments, you can specify how many samples and features to include, and how many classes the response variable will have. There is also a range of other options that effectively control how “easy” the problem will be to solve.

    Note: For more information, refer to the ...
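Putting the imports to work, the scenario above can be sketched end to end: generate 1,000 samples with 200 features, split off a test set, fit an L1-regularized logistic regression, and score it with ROC AUC. The specific keyword values below (`test_size`, `C`, `random_state`) are illustrative choices, not ones prescribed by the exercise:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Generate the synthetic dataset described above: 1,000 samples,
# 200 candidate features, binary response (the default n_classes=2)
X, y = make_classification(n_samples=1000, n_features=200, random_state=42)

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# L1 regularization encourages sparse coefficients, which helps when
# many of the 200 features are uninformative; the 'liblinear' solver
# supports the L1 penalty
model = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
model.fit(X_train, y_train)

# ROC AUC needs predicted probabilities for the positive class
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f'Test ROC AUC: {auc:.3f}')
```

Varying `C` (the inverse regularization strength) trades off bias and variance: smaller values shrink more coefficients to zero, which can reduce overfitting on this high-dimensional, low-sample dataset.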