Splitting the Data: Training and Test Sets

Learn how to split data into training and test sets for model evaluation using scikit-learn.

In the lesson Introduction: Scikit-Learn and Model Evaluation, we introduced the concept of using a trained model to make predictions on new data that the model had never “seen” before. This turns out to be a foundational concept in predictive modeling. To create a model with predictive capabilities, we need some measure of how well it can make predictions on data that was not used to fit it. The reason is that fitting “specializes” the model: it learns the relationship between features and response on the specific set of labeled data used for fitting. Good performance on that data is useful, but what we ultimately want is accurate predictions on new, unseen data, for which we don’t know the true labels.

Evaluating binary classification with a train/test split

In our case study, once we deliver the trained model to our client, they will generate a new dataset of features like those we have now, except that instead of spanning April to September, it will span May to October. Our client will then use the model with these features to predict whether accounts will default in November.

In order to know how well we can expect our model to predict which accounts will actually default in November (which won’t be known until December), we can take our current dataset and reserve some of the data we have, with known labels, from the model training process. This data is referred to as test data and may also be called out-of-sample data because it consists of samples that were not used in training the model. Those samples used to train the model are called training data. The practice of holding out a set of test data gives us an idea of how the model will perform when it is used for its intended purpose, to make predictions on samples that were not included during model training. In this section, we’ll create an example train/test split to illustrate different binary classification metrics.

Train/test split in scikit-learn

We will use the convenient train_test_split functionality of scikit-learn to split the data so that 80% will be used for training, holding 20% back for testing. These percentages are a common way to make such a split; in general, you want enough training data to allow the algorithm to adequately “learn” from a representative sample of data. However, these percentages are not set in stone. If you have a very large number of samples, you may not need as large a percentage of training data, because you will be able to achieve a pretty large, representative training set with a lower percentage. We encourage you to experiment with different sizes and see the effect. Also, be aware that every problem is different with respect to how much data is needed to effectively train a model. There is no hard and fast rule for sizing your training and test sets.
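One way to experiment with different split sizes is sketched below. It is only an illustration: the data is synthetic (generated with make_classification, not the case study dataset), and LogisticRegression is an arbitrary stand-in model chosen for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the case study dataset)
X, y = make_classification(n_samples=2000, n_features=5, random_state=0)

results = {}
for test_size in (0.1, 0.2, 0.5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=24)
    model = LogisticRegression().fit(X_train, y_train)
    # Accuracy on held-out samples the model did not see during fitting
    results[test_size] = (len(X_train), len(X_test),
                          model.score(X_test, y_test))

for test_size, (n_train, n_test, accuracy) in results.items():
    print(f"test_size={test_size}: train={n_train}, test={n_test}, "
          f"test accuracy={accuracy:.3f}")
```

On real data, the scores for different split sizes may vary more or less than they do here, which is part of why there is no single correct ratio.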

For our 80/20 split, we can use the code shown in the following snippet:

from sklearn.model_selection import train_test_split
# Hold out 20% of the samples for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    df['EDUCATION'].values.reshape(-1, 1),
    df['default payment next month'].values,
    test_size=0.2, random_state=24)
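One detail in this call that is easy to overlook is random_state=24. The split shuffles the samples before dividing them, and fixing the seed makes that shuffle repeatable. The following minimal sketch illustrates this on toy arrays; the data and variable names are made up for illustration and are not part of the case study.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 10 samples, one feature
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 1] * 5)

# Two calls with the same random_state produce the same shuffle
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=24)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=24)

same_split = (np.array_equal(X_test_a, X_test_b)
              and np.array_equal(y_test_a, y_test_b))
print(same_split)
print(X_train_a.shape, X_test_a.shape)
```

Without a fixed seed, each run would generally select a different test set, which makes results harder to compare from one run to the next.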

Notice that we’ve set test_size to 0.2, or 20%. The size of the training data will be automatically set to the remainder, 80%. Let’s examine the shapes of our training and test data.

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

The following output shows that the shapes are as expected, with roughly 80% of the samples in the training set and 20% in the test set:

# (21331, 1)
# (5333, 1)
# (21331,)
# (5333,)
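These counts can be checked with a little arithmetic. The sketch below assumes that a fractional test_size is rounded up to a whole number of test samples and that the remainder goes to training, which is consistent with the shapes printed here. The total number of samples is inferred from those shapes rather than read from the dataset.

```python
import math

n_samples = 21331 + 5333  # total rows implied by the shapes above
test_size = 0.2

# Assumption: the fractional test size is rounded up to a whole sample
n_test = math.ceil(test_size * n_samples)
# The remainder of the samples goes to the training set
n_train = n_samples - n_test

print(n_samples, n_train, n_test)  # 26664 21331 5333
```

Because 20% of 26,664 is 5,332.8, the split cannot be exactly 80/20, so the test set gets 5,333 samples and the training set gets the remaining 21,331.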
