Data Preparation

Learn how to prepare data for training and about the different subset categories of a dataset.

Splitting datasets

After selecting a public dataset or creating our own custom one, we first have to split it into three subset categories:

  • Train dataset
  • Validation dataset
  • Test dataset
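
As a minimal sketch of such a split, assuming the dataset fits in a Python list of (sample, label) pairs: the helper name `split_dataset`, the 80/10/10 proportions, and the file names are illustrative assumptions, not values prescribed by this lesson.

```python
import random


def split_dataset(samples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle the samples and split them into train, validation, and test lists."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")

    shuffled = list(samples)               # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]      # everything left over is used for training
    return train, val, test


# Hypothetical dataset: 100 image file names with alternating binary labels.
samples = [(f"image_{i}.png", i % 2) for i in range(100)]
train_set, val_set, test_set = split_dataset(samples)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Shuffling before slicing matters when the samples are stored in some order (for example, grouped by class); otherwise one subset could end up with only a few of the classes.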

Train dataset

This dataset is used during training: the model is shown the data together with its labels, and its weights are updated accordingly. At the end of each epoch, the same dataset is also used to calculate the accuracy and loss, this time without updating the weights, by simply running inference to check the model’s performance. We call these results the train loss and train accuracy.

Validation dataset

We calculate the train loss and accuracy on data that the model has already seen. But what loss and accuracy would our model achieve on data it has never seen before? This is an essential question, considering that we train our models to use them later on real-world, never-before-seen data. The answer is the validation dataset! It is used only at the end of each epoch to calculate the loss and accuracy. As with the train loss and train accuracy, these results don’t update any weights. We call them the validation loss and validation accuracy.

Now we can evaluate our model’s performance more reliably.
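
The per-epoch routine described above can be sketched as follows. This is a toy example under stated assumptions: a one-feature logistic regression trained with plain SGD on synthetic data stands in for a real model, and the function names (`predict`, `evaluate`, `train_one_epoch`) are hypothetical. The point is only that `train_one_epoch` changes the weights, while `evaluate` runs inference and reports loss and accuracy.

```python
import math
import random


def predict(w, b, x):
    """Probability that x belongs to class 1 (sigmoid of a linear score)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def evaluate(w, b, dataset):
    """Inference only: return (mean loss, accuracy) without changing w or b."""
    eps = 1e-12  # avoids log(0)
    total_loss, correct = 0.0, 0
    for x, y in dataset:
        p = predict(w, b, x)
        total_loss += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        correct += int((p >= 0.5) == (y == 1))
    return total_loss / len(dataset), correct / len(dataset)


def train_one_epoch(w, b, dataset, lr):
    """One pass over the train dataset; this is the only place weights change."""
    for x, y in dataset:
        error = predict(w, b, x) - y  # gradient of the log loss w.r.t. the score
        w -= lr * error * x
        b -= lr * error
    return w, b


# Synthetic data: the label is 1 when x is positive, 0 otherwise.
rng = random.Random(0)
data = [(x, 1 if x > 0 else 0) for x in (rng.uniform(-1, 1) for _ in range(200))]
train_set, val_set = data[:160], data[160:]  # samples are already in random order

w, b = 0.0, 0.0
for epoch in range(5):
    w, b = train_one_epoch(w, b, train_set, lr=0.5)
    train_loss, train_acc = evaluate(w, b, train_set)  # data the model has seen
    val_loss, val_acc = evaluate(w, b, val_set)        # data the model has not seen
    print(f"epoch {epoch + 1}: "
          f"train loss {train_loss:.3f}, train acc {train_acc:.2f}, "
          f"val loss {val_loss:.3f}, val acc {val_acc:.2f}")
```

A validation loss that stays well above the train loss is a common sign that the model fits the train dataset better than it generalizes.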

Test dataset

Imagine that training is done and our model is ready to go, but we want to check how well it performs. The validation dataset could be used for this task. However, even though it is not used directly to update the weights, the validation dataset can still affect training by driving hyperparameter updates.

Remember: The image classification architectures decrease the learning rate by a factor of 10 if the validation loss doesn’t improve for a few epochs. This is an easy-to-understand example of how the validation dataset implicitly affects our model training.
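
To make this rule concrete, here is a minimal sketch of such a schedule. The class name `ReduceOnPlateau`, the patience of 3 epochs, and the list of validation losses are assumptions made up for illustration; they are not taken from any particular architecture.

```python
class ReduceOnPlateau:
    """Hypothetical helper: divide the learning rate by `factor` when the
    validation loss has not improved for `patience` consecutive epochs."""

    def __init__(self, lr, factor=10.0, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best_loss = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Call once per epoch with the new validation loss; returns the learning rate."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr /= self.factor
                self.stale_epochs = 0  # start counting again after a reduction
        return self.lr


# Made-up validation losses: improvement stalls after the third epoch.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.65, 0.60]
scheduler = ReduceOnPlateau(lr=0.1, factor=10.0, patience=3)
history = [scheduler.step(loss) for loss in val_losses]
print(history)
```

The learning rate changes only because of what the validation loss did, which is exactly how the validation dataset leaks into training. Deep learning frameworks ship comparable schedulers, for example `torch.optim.lr_scheduler.ReduceLROnPlateau` in PyTorch and the `ReduceLROnPlateau` callback in Keras, which multiply the learning rate by a factor such as 0.1 instead of dividing it.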

Therefore, we still need a subset that never contributed to the training process to get more trustworthy results before we put our model to actual use. We call this one the test dataset! It is a subset kept separate from the training process and sent through the final model via simple inference to obtain the test loss and test accuracy.
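
The following toy sketch illustrates the idea under loud assumptions: a one-parameter threshold classifier stands in for a trained model, its threshold plays the role of a hyperparameter, and the data is synthetic. The training step itself is left out, so only the validation and test subsets appear. The threshold is chosen with the validation dataset, and the test dataset is used exactly once, at the very end.

```python
import random


def accuracy(threshold, dataset):
    """Fraction of samples for which 'x > threshold' matches the label."""
    correct = sum(int((x > threshold) == (y == 1)) for x, y in dataset)
    return correct / len(dataset)


rng = random.Random(0)


def make_sample():
    """Synthetic sample: the label depends on x plus some noise."""
    x = rng.uniform(-1, 1)
    y = 1 if x + rng.gauss(0, 0.3) > 0 else 0
    return x, y


data = [make_sample() for _ in range(200)]
val_set, test_set = data[:100], data[100:]

# Hyperparameter selection: the validation dataset decides which threshold wins.
candidates = [-0.4, -0.2, 0.0, 0.2, 0.4]
best_threshold = max(candidates, key=lambda t: accuracy(t, val_set))
val_acc = accuracy(best_threshold, val_set)

# Final check: the test dataset played no part in any decision above.
test_acc = accuracy(best_threshold, test_set)
print(f"chosen threshold {best_threshold}: "
      f"val acc {val_acc:.2f}, test acc {test_acc:.2f}")
```

Because the threshold was picked to maximize the validation accuracy, that number tends to be slightly optimistic. The test accuracy is not affected by this selection, which is why it is the more trustworthy estimate.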
