Basic Cross-Validation

Learn the basic concepts of cross-validation.

Cross-validation is one of the most important concepts in machine learning. So far, we’ve been training and assessing different models on the same data for purely pedagogical reasons: we wanted to focus on learning the models, not on correctly assessing their performance.

However, in practice, it’s crucial that we use different data for training and for evaluation in order to get a reliable estimate of how the model would perform in the real world.

Cross-validation allows us to assess the performance and generalization ability of a predictive model by splitting the dataset into separate subsets for training and testing. Because the model is evaluated on data it has never seen, we can detect overfitting and obtain a more accurate estimate of its real-world performance, which in turn gives us a more reliable basis for model selection and hyperparameter tuning. There are multiple techniques for cross-validation, such as splitting the data into three subsets (train, validation, and test), k-fold, and leave-one-out. Let’s start with the first one.

Train, test, and validation

There are multiple ways of doing this, but the most basic approach is to randomly split the original data into three different sets, as shown in the sketch after this list:

  • Train: This data will be used to train the models (around 70% of the observations).

  • Validation: This data will be used to fine-tune the trained models and help choose the best hyperparameters (around 15% of the observations).

  • Test: This data will be used in the end to assess the model’s performance (around 15% of the observations).
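As a concrete illustration, here is a minimal sketch of this workflow in Python with scikit-learn. The synthetic dataset, the logistic regression model, and the candidate values of the regularization strength C are illustrative assumptions, not prescriptions; the point is simply that train_test_split is applied twice to produce the 70/15/15 split, the validation set drives the hyperparameter choice, and the test set is touched only once at the end.

```python
# A minimal sketch of the train/validation/test workflow described above.
# The dataset, model, and hyperparameter candidates are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification data, just to keep the example self-contained.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First split: keep 70% for training, hold out 30% for validation + test.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Second split: divide the held-out 30% evenly, so validation and test
# each get roughly 15% of the original observations.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=42, stratify=y_hold
)

# Use the validation set to choose a hyperparameter (here, the
# regularization strength C of a logistic regression).
best_model, best_score = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_model, best_score = model, score

# Only now do we touch the test set, for a final, unbiased estimate.
print(f"Validation accuracy of chosen model: {best_score:.3f}")
print(f"Test accuracy: {best_model.score(X_test, y_test):.3f}")
```

Passing stratify keeps the class proportions similar across the three subsets, which is usually desirable for classification problems; for regression, a plain random split is the common default.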
