Exercise: Generating and Modeling Synthetic Classification Data

Learn how overfitting happens by using a synthetic dataset with many candidate features and relatively few samples.

Overfitting in binary classification

Imagine that you are given a binary classification dataset with many candidate features (200) and no time to look through each of them individually. Some of these features may be highly correlated or related in some other way, but with this many variables it is difficult to explore them all effectively. The dataset also has relatively few samples: only 1,000. We are going to generate this challenging dataset with a scikit-learn utility that creates synthetic datasets for conceptual explorations such as this one. Perform the following steps to complete the exercise:

  1. Import the make_classification, train_test_split, and roc_auc_score functions and the LogisticRegression class using the following code:

    from sklearn.datasets import make_classification 
    from sklearn.model_selection import train_test_split 
    from sklearn.linear_model import LogisticRegression 
    from sklearn.metrics import roc_auc_score
    

    Notice that we’ve imported several familiar tools from scikit-learn, in addition to a new one that we haven’t seen before: make_classification. This function does just what its name ...
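
To make the setup concrete, here is a minimal sketch of how the four imports might fit together. Only the 1,000 samples and 200 features come from the description above. Every other value (`n_informative`, `n_redundant`, `random_state`, the 80/20 split, the `liblinear` solver, and the large `C`) is an assumption chosen for illustration, not necessarily what the exercise itself uses.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# 1,000 samples and 200 features match the exercise description.
# All remaining arguments are illustrative assumptions.
X, y = make_classification(
    n_samples=1000,
    n_features=200,
    n_informative=3,   # assumed: only a few features carry signal
    n_redundant=10,    # assumed: some features are linear combinations
    random_state=24,   # assumed seed, for reproducibility only
)

# Hold out 20% of the samples for testing (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=24
)

# A large C means weak regularization, which leaves the model free
# to fit noise in the many uninformative features.
model = LogisticRegression(C=1000.0, solver="liblinear", max_iter=1000)
model.fit(X_train, y_train)

# Compare ROC AUC on the data the model saw and on the held-out data.
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Training ROC AUC: {train_auc:.3f}")
print(f"Test ROC AUC:     {test_auc:.3f}")
```

A training score noticeably higher than the test score would be the signature of overfitting. The exact numbers depend on the assumed parameters above.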
