...

Data Splits Using the Slicing API

Use the slicing API from the TF framework to split a given dataset into training, test, and validation sets.

DL algorithms require large datasets to train models. Once a model is trained, we must evaluate its performance on unseen examples to assess its generalization ability. To this end, we split the dataset into separate partitions. This lesson presents the common dataset partitions and uses TensorFlow Datasets (TFDS) to demonstrate dataset splits with the slicing API of the TF framework.
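As a first look at the slicing API, the sketch below loads a dataset with TFDS, once using its predefined named splits and once using slices expressed inside the split string. This is a minimal sketch, assuming tensorflow_datasets is installed; mnist is only an illustrative choice of dataset.

```python
import tensorflow_datasets as tfds

# Load the predefined 'train' and 'test' splits of the dataset.
train_ds = tfds.load('mnist', split='train')
test_ds = tfds.load('mnist', split='test')

# The slicing API also accepts ranges inside the split string,
# expressed as percentages or as absolute example indices.
first_half = tfds.load('mnist', split='train[:50%]')
first_1000 = tfds.load('mnist', split='train[:1000]')
```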

Common dataset splits

It’s common practice to split a dataset into three partitions for training, validating, and testing a DL model. The following figure presents the three partitions of a full dataset. The greater width of the training set in the figure indicates that it contains more examples than the other two partitions.

Figure: Division of a dataset into training, validation, and test sets
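The figure's three partitions can be produced directly with the slicing API. The sketch below carves the single 'train' split of mnist into training, validation, and test subsets; the 80/10/10 ratios are an illustrative assumption, not values prescribed by the lesson.

```python
import tensorflow_datasets as tfds

# Carve one source split into three partitions:
# 80% training, 10% validation, and 10% test.
train_ds, val_ds, test_ds = tfds.load(
    'mnist',
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
)
```

Passing a list of slice strings returns one dataset object per slice, in the same order.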

These splits are discussed below.

Training set

The DL model uses examples from the training set to learn the network parameters. The larger the training set, the better the chances that the model discovers the patterns in the data. This partition holds the largest share of the original dataset, usually ranging from 60% to more than 90% of the examples, although the exact percentage depends on the size of the available dataset.

If a machine’s memory cannot hold a large training set, we divide the set into multiple batches. The TF framework loads one batch at a time into main memory while training a DL model, so training requires far less memory than loading the entire training set at once.
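As a sketch of this idea, the training split can be grouped into batches through the tf.data pipeline that TFDS datasets expose; the batch size of 32 and the shuffle buffer size are arbitrary assumptions.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Load the training split as (image, label) pairs, then group the
# examples into batches of 32 so that only one batch at a time has
# to reside in main memory during training.
train_ds = tfds.load('mnist', split='train', as_supervised=True)
train_ds = train_ds.shuffle(10_000).batch(32).prefetch(tf.data.AUTOTUNE)
```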

For instance, a DL model for recognizing human faces uses a training set of images and the associated labels to learn salient face features that describe a particular person.

Validation set

Model hyperparameters, such as the number of network parameters to learn and the number of training iterations, are settings that control the learning process. The values of the hyperparameters affect the performance of the trained model. With a validation set, we can tune these hyperparameters during the training process, as the sketch below illustrates.
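The sketch below shows one way to put this into practice: the last 10% of the 'train' split is held out with the slicing API and passed to Keras as validation data, so validation metrics are reported after every epoch and can guide hyperparameter tuning. The tiny model, the 90/10 ratio, and the preprocessing are illustrative assumptions, not part of the lesson's prescribed setup.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Hold out the last 10% of the 'train' split as a validation set.
train_ds, val_ds = tfds.load(
    'mnist',
    split=['train[:90%]', 'train[90%:]'],
    as_supervised=True,
)

def preprocess(ds):
    # Scale pixel values to [0, 1] and batch the examples.
    return ds.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y)).batch(64)

# A deliberately small illustrative classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

# validation_data makes Keras evaluate the held-out examples after
# each epoch; these metrics are the signal used for tuning.
model.fit(preprocess(train_ds), validation_data=preprocess(val_ds), epochs=3)
```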

To tune hyperparameters and monitor the training process, we use the validation set, which is a small portion held out from the training data. Depending on the problem the DL model is intended to solve and the size of the original dataset, the validation set can range from as little as 0.5% to about 10% of the dataset. The validation set evaluates network ...