
Exercise: Generating and Modeling Synthetic Classification Data

Explore how to create synthetic datasets with many features for binary classification and use logistic regression with L1 regularization to address overfitting. Learn to split data, train models, and evaluate performance with ROC AUC scores, gaining hands-on experience with managing high-dimensional data and balancing bias and variance.

Overfitting in binary classification

Imagine you are given a binary classification dataset with many candidate features (200) and no time to examine each one individually. Some of these features may be highly correlated or otherwise related, but with this many variables, it can be difficult to explore all of them effectively. Additionally, the dataset has relatively few samples: only 1,000. We are going to generate this challenging dataset using a feature of scikit-learn that allows you to create synthetic datasets for conceptual explorations such as this. Perform the following steps to complete the exercise:

  1. Import make_classification, train_test_split, LogisticRegression, and roc_auc_score from scikit-learn using the following code:

    from sklearn.datasets import make_classification 
    from sklearn.model_selection import train_test_split 
    from sklearn.linear_model import LogisticRegression 
    from sklearn.metrics import roc_auc_score
    

    Notice that we’ve imported several familiar classes and functions from scikit-learn, in addition to a new one that we haven’t seen before: make_classification. This function does just what its name indicates: it makes data for a classification problem. Using the various keyword arguments, you can specify how many samples and features to include, and how many classes the response variable will have. There is also a range of other options that effectively control how “easy” the problem will be to solve.

    Note: For more information, refer to the ...
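Putting the imports to work, the scenario above can be sketched end to end: generate 1,000 samples with 200 features, split off a test set, fit an L1-regularized logistic regression, and score it with ROC AUC. The specific keyword values below (`test_size`, `C`, `random_state`) are illustrative choices, not ones prescribed by the exercise:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Generate the synthetic dataset described above: 1,000 samples,
# 200 candidate features, binary response (the default n_classes=2)
X, y = make_classification(n_samples=1000, n_features=200, random_state=42)

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# L1 regularization encourages sparse coefficients, which helps when
# many of the 200 features are uninformative; the 'liblinear' solver
# supports the L1 penalty
model = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
model.fit(X_train, y_train)

# ROC AUC needs predicted probabilities for the positive class
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f'Test ROC AUC: {auc:.3f}')
```

Varying `C` (the inverse regularization strength) trades off bias and variance: smaller values shrink more coefficients to zero, which can reduce overfitting on this high-dimensional, low-sample dataset.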