...

Data Splits Using the Slicing API

Use the slicing API from the TF framework to split a given dataset into training, test, and validation sets.

DL algorithms require large datasets to train models. Once a model is trained, we must evaluate its performance on unseen examples to assess its generalization ability. To this end, we split the dataset into separate partitions. This lesson presents the common dataset partitions and uses TensorFlow Datasets (TFDS) to demonstrate dataset splits with the slicing API of the TF framework.
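As a first look at the slicing API, the sketch below loads a dataset with TFDS, once using its predefined named splits and once using slices expressed inside the split string. This is a minimal sketch, assuming tensorflow_datasets is installed; mnist is only an illustrative choice of dataset.

```python
import tensorflow_datasets as tfds

# Load the predefined 'train' and 'test' splits of the dataset.
train_ds = tfds.load('mnist', split='train')
test_ds = tfds.load('mnist', split='test')

# The slicing API also accepts ranges inside the split string,
# expressed as percentages or as absolute example indices.
first_half = tfds.load('mnist', split='train[:50%]')
first_1000 = tfds.load('mnist', split='train[:1000]')
```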

Common dataset splits

It’s common practice to split a dataset into three partitions for training, validating, and testing a DL model. The following figure presents the three partitions of a full dataset. The greater width of the training set in the figure indicates that it contains more examples than the other two partitions.

Figure: Division of a dataset into training, validation, and test sets
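The figure's three partitions can be produced directly with the slicing API. The sketch below carves the single 'train' split of mnist into training, validation, and test subsets; the 80/10/10 ratios are an illustrative assumption, not values prescribed by the lesson.

```python
import tensorflow_datasets as tfds

# Carve one source split into three partitions:
# 80% training, 10% validation, and 10% test.
train_ds, val_ds, test_ds = tfds.load(
    'mnist',
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
)
```

Passing a list of slice strings returns one dataset object per slice, in the same order.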

These splits are discussed below.

Training set

The DL model uses examples from the training set to learn the network parameters. The larger the training set, the better the chances that the model discovers the patterns in the data. This partition holds the largest share of the original dataset, usually ranging from 60% to more than 90% of the examples, although the exact percentage depends on the size of the available dataset.

If a machine’s memory cannot hold a large training set, we divide the set into multiple batches. The TF framework loads one batch at a time into main memory while training a DL model, so training requires far less memory than loading the entire training set at once.
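As a sketch of this idea, the training split can be grouped into batches through the tf.data pipeline that TFDS datasets expose; the batch size of 32 and the shuffle buffer size are arbitrary assumptions.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Load the training split as (image, label) pairs, then group the
# examples into batches of 32 so that only one batch at a time has
# to reside in main memory during training.
train_ds = tfds.load('mnist', split='train', as_supervised=True)
train_ds = train_ds.shuffle(10_000).batch(32).prefetch(tf.data.AUTOTUNE)
```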

For instance, a DL model for recognizing human faces uses a training set of images and the associated labels to learn salient face features that describe a particular person.

Validation set

Model hyperparameters, such as the number of network parameters to learn and the number of training iterations, are settings that control the learning process. The values of the hyperparameters affect the performance of the trained model. With a validation set, we can tune these hyperparameters during the training process, as the sketch below illustrates.
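The sketch below shows one way to put this into practice: the last 10% of the 'train' split is held out with the slicing API and passed to Keras as validation data, so validation metrics are reported after every epoch and can guide hyperparameter tuning. The tiny model, the 90/10 ratio, and the preprocessing are illustrative assumptions, not part of the lesson's prescribed setup.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Hold out the last 10% of the 'train' split as a validation set.
train_ds, val_ds = tfds.load(
    'mnist',
    split=['train[:90%]', 'train[90%:]'],
    as_supervised=True,
)

def preprocess(ds):
    # Scale pixel values to [0, 1] and batch the examples.
    return ds.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y)).batch(64)

# A deliberately small illustrative classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

# validation_data makes Keras evaluate the held-out examples after
# each epoch; these metrics are the signal used for tuning.
model.fit(preprocess(train_ds), validation_data=preprocess(val_ds), epochs=3)
```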

To tune hyperparameters and monitor the training process, we use the validation set, which is a small portion held out from the training data. Depending on the problem the DL model is intended to solve and the size of the original dataset, the validation set can range from as little as 0.5% to about 10% of the dataset. The validation set evaluates network ...