Validation

Get introduced to the importance of data splits and the process of cross-validation.

Since regularization is a method to fine-tune the subject model by introducing an additional penalty in the error function, we need to validate its impact. Several hyperparameters need to be set before optimizing the objective function. The hyperparameters include the model $f_{\mathbf{w}}$, the loss function $L$, the regularization function $R$, and the scale of regularization $\lambda$. Validation is the process of testing the accuracy of the trained model, which also measures the validity of the hyperparameters.
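For reference, a regularized objective of the kind described above is commonly written as shown below. The exact form is an assumption here rather than something stated in this lesson, and the training pairs $(\mathbf{x}_i, y_i)$ for $i = 1, \dots, n$ are introduced only for illustration.

$$
\min_{\mathbf{w}} \; \sum_{i=1}^{n} L\big(f_{\mathbf{w}}(\mathbf{x}_i),\, y_i\big) + \lambda\, R(\mathbf{w})
$$

In this form, the weights $\mathbf{w}$ are found by optimization, while the model family, $L$, $R$, and $\lambda$ are fixed beforehand, which is why they count as hyperparameters.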

Note: An accurate indicator of generalization is the performance of the trained model on unseen data, that is, data that isn’t used in the training process.

Data splits

Where do we get the unseen data for validation? One way is to hold out a percentage of the available data and use the rest for training. Once the training is complete, validation can be carried out on the held-out subset, known as the hold-out set.

Note: The more popular term for the hold-out set is the test set.

Figure: Train-test split

How large should the test set be? To assess generalization reliably, we need the test set to be large. But we also need the training set to be large to avoid overfitting. There’s no exact way around this trade-off. A rule of thumb, however, is to use an 80/20 split: 80 percent of the data for training and 20 percent for testing.
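The 80/20 rule of thumb can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation: the helper name `train_test_split`, the `test_fraction` and `seed` parameters, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out `test_fraction` of them as the test set."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    indices = rng.permutation(n)            # random order, so row order does not bias the split
    n_test = int(round(n * test_fraction))  # number of held-out examples
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Synthetic data, for illustration only: 100 examples with 2 features each
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.arange(100, dtype=float)

X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape)
```

Shuffling before splitting matters when the rows are stored in some meaningful order (for example, sorted by label), since an unshuffled split could leave the test set unrepresentative.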

To improve the performance after validation, the hyperparameters can be tuned. The validation and tuning cycle continues until the desired performance is achieved.
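The validation and tuning cycle can be sketched as a loop over candidate values of the regularization scale. The example below assumes ridge regression (squared loss with an L2 penalty) as the model, which is a choice made for this sketch and not something the lesson prescribes. The data is synthetic, and the names `fit_ridge`, `mse`, and `candidate_lams` are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.5, size=100)

# 80/20 split: the first 80 rows train the model, the last 20 are held out
X_train, y_train = X[:80], y[:80]
X_hold, y_hold = X[80:], y[80:]

# Validation cycle: train with each candidate lambda, score on the hold-out set
candidate_lams = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
results = {}
for lam in candidate_lams:
    w = fit_ridge(X_train, y_train, lam)
    results[lam] = mse(X_hold, y_hold, w)

# Keep the lambda with the lowest hold-out error
best_lam = min(results, key=results.get)
print(best_lam, results[best_lam])
```

Each pass through the loop is one round of the cycle: set the hyperparameter, train, and validate. In practice the cycle might also vary the model or the regularization function, not only the scale.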

Figure: Validation cycle

Validation set

After validation on the test set, if the hyperparameters are tuned and the training is carried out again, the test set is no longer unseen. It has influenced the training process indirectly, through the choice of hyperparameters, even though it was never part of the training set.

If the test set must remain unseen, how can the hyperparameters be tuned?

A compromise in this situation is to make another split of the data called the validation set ...