Exercise: Generating and Modeling Synthetic Classification Data
Learn how overfitting happens by using a synthetic dataset with many candidate features and relatively few samples.
Overfitting in binary classification
Imagine you are given a binary classification dataset with many candidate features (200) and no time to examine each one individually. Some of these features may be highly correlated or related in some other way, but with this many variables it is difficult to explore them all effectively. The dataset also has relatively few samples: only 1,000. We'll generate this challenging dataset with a scikit-learn utility that creates synthetic datasets for conceptual explorations like this one. Perform the following steps to complete the exercise:
- Import make_classification, train_test_split, LogisticRegression, and roc_auc_score using the following code:
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
Notice that we’ve imported several familiar tools from scikit-learn, in addition to a new one that we haven’t seen before: make_classification. This function does just what its name ...
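To make the setup concrete, here is a minimal sketch of how the four imports could fit together on a dataset of the size described above (1,000 samples, 200 features). Only those two numbers come from the text. The values chosen for n_informative, n_redundant, random_state, test_size, C, and max_iter are illustrative assumptions, not the exercise's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Sample and feature counts come from the text; every other argument
# here is an assumption made for illustration.
X, y = make_classification(
    n_samples=1000,
    n_features=200,
    n_informative=3,   # assumed: only a few features carry signal
    n_redundant=10,    # assumed: some features are linear combinations
    random_state=24,   # assumed seed, for reproducibility
)

# Hold out 20% of the samples for testing (assumed split size).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=24
)

# A large C means very weak regularization, which leaves the model
# free to fit noise in the many uninformative features.
model = LogisticRegression(C=1000.0, max_iter=5000)
model.fit(X_train, y_train)

# Compare ROC AUC on the training data and the held-out data.
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Training ROC AUC: {train_auc:.3f}")
print(f"Testing ROC AUC:  {test_auc:.3f}")
```

A training score that is noticeably higher than the testing score would be a sign of overfitting. The exact values depend on the assumed parameters and the random seed.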