Overview

We have already learned about overfitting, underfitting, and the bias-variance trade-off, and we are always looking for an optimal point between over- and underfitting. So far, we have used a train-test split, where we divided our data into training (X_train, y_train) and test (X_test, y_test) sets in some ratio. We trained our regression model on the training part and tested/validated it on the test part. Both the train-test split and cross-validation help avoid overfitting more than underfitting.
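The sketch below illustrates this workflow with scikit-learn. The synthetic data from make_regression and the 80/20 ratio are assumptions for illustration only, not part of the original lesson's dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a real dataset (illustrative only).
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)

# Hold out 20% of the rows as the test set; rows are shuffled before splitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)           # train on the training part
print(model.score(X_test, y_test))    # R^2 score on the held-out test part
```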

However, the train-test split does have its dangers:

  • What if the split we make is not random?

  • What if one subset (train or test) of our data contains only one type of data point and is not a true representative of our complete dataset? In the simplest example, suppose our data is ordered by the number of rooms; the test set could then end up containing only the houses with the most rooms.

This will result in overfitting, and we don't want that. This is where cross-validation plays its role. Let's move on and learn about cross-validation now. It's a straightforward concept, somewhat similar to a train-test split. The most commonly used form is k-fold cross-validation.

K-fold cross-validation

In this approach, we split our data into k subsets (also called folds). We use k-1 subsets to train the model and hold out the remaining fold as the validation data. We repeat this process k times, so that each fold serves once as the validation set, and then average the evaluation scores across the folds to finalize our model. After that, we test it against the test set.
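A minimal sketch of k-fold cross-validation with scikit-learn follows. The synthetic data, the choice of k=5, and the use of a plain linear regression model are assumptions for illustration; in practice, you would run this on the training portion of your own split.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for a real dataset (illustrative only).
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)

model = LinearRegression()

# 5 folds: each fold serves once as the validation set while the
# remaining 4 folds are used for training.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)  # one R^2 score per fold

print(scores)         # score for each of the k folds
print(scores.mean())  # averaged score used to judge the model
```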
