Cross Validation

Cross-validation is a technique for building robust models. In this lesson, you'll discover how it works.

Train, test, and validation datasets

We divide the dataset at hand into a training dataset and a test dataset.

  • We train the model on the training dataset.

  • We evaluate the model on the test dataset (which the model has never seen during training) and report that performance.

  • Scikit-learn provides train_test_split, which returns the training and test datasets. The code snippets below are adapted from the scikit-learn documentation itself.

from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the Iris dataset: input columns in X, output column in y
X, y = datasets.load_iris(return_X_y=True)
print("Original shape of the input and output columns")
print(X.shape)
print(y.shape)

# Hold out 40% of the rows as the test dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
print("Shape of the training dataset's input and output columns")
print(X_train.shape)
print(y_train.shape)
print("Shape of the test dataset's input and output columns")
print(X_test.shape)
print(y_test.shape)
  • datasets.load_iris loads the Iris dataset, saving the input columns in X and the output column in y. The first print calls show the shape of the full dataset.

  • train_test_split splits the dataset into the training and test datasets. test_size specifies the fraction of instances to keep in the test dataset; here, 40% of the rows go to the test dataset.

  • Finally, we print the shapes of the newly formed datasets.
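
Having split the data, the evaluation step itself looks like this. Below is a minimal sketch in the spirit of the scikit-learn documentation, which pairs this split with a support-vector classifier; the kernel and C values are illustrative, not prescriptive.

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Recreate the 60/40 split from the snippet above
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Train only on the training dataset
clf = svm.SVC(kernel="linear", C=1).fit(X_train, y_train)

# Score on the test dataset -- rows the model has never seen
print("Test accuracy:", clf.score(X_test, y_test))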

Validation dataset

When evaluating different settings ("hyperparameters") for a model, such as the α (regularization strength) that must be set manually for Ridge Regression, there is still a risk of overfitting on the test set, because the hyperparameters can be tweaked until the model performs optimally. This way, knowledge about the test set can "leak" into the model, and the evaluation metrics no longer report on its generalization performance.
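
To reduce this risk, yet another part of the data can be held out as a validation dataset: we tune hyperparameters against the validation set and touch the test set only once, at the very end. Below is a minimal sketch of that three-way split, assuming the Diabetes regression dataset (a natural fit for Ridge Regression) and an illustrative grid of α values.

from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)

# First split: hold out 20% of the rows as the final test dataset
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Second split: carve a validation dataset out of the remaining rows
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

# Tune alpha (the regularization strength) against the validation set only
best_alpha, best_score = None, -float("inf")
for alpha in [0.01, 0.1, 1.0, 10.0]:  # illustrative grid, not prescriptive
    score = Ridge(alpha=alpha).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

# Refit on train + validation, then touch the test dataset exactly once
final_model = Ridge(alpha=best_alpha).fit(X_trainval, y_trainval)
print("Best alpha:", best_alpha)
print("Test R^2:", final_model.score(X_test, y_test))

The drawback of partitioning the data three ways is that it further shrinks the number of rows available for training, which is exactly the problem cross-validation sets out to solve.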