Cross Validation

Cross validation is a technique for building robust models by repeatedly evaluating them on data held out from training. You'll discover how it works in this lesson.

Train, test and validation Datasets

We divide the dataset at hand into training and test datasets.

  • We train the model on the training dataset and evaluate its performance.

  • We evaluate the model’s performance on the test dataset (on which the model was not trained) and report that performance.

  • scikit-learn provides train_test_split, which splits the data into training and test datasets. These code snippets have been taken from the scikit-learn documentation itself.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

X, y = datasets.load_iris(return_X_y=True)
print("Original Shape of input and output columns")
print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

print("Shape of the training dataset's input and output columns")
print(X_train.shape)
print(y_train.shape)
print("Shape of the test dataset's input and output columns")
print(X_test.shape)
print(y_test.shape)
  • Line 6 loads the Iris dataset with load_iris and saves the input columns in X and the output column in y. Lines 8 and 9 print the shapes of X and y.

  • Line 11 splits the dataset into the training and test datasets with train_test_split. test_size specifies the fraction of instances to be kept in the test dataset. In the current case, 40% of the rows are kept in the test ...
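
The train-then-evaluate steps listed at the start of this lesson can be sketched on the same split. This is a minimal illustration, not part of the original snippet: the choice of svm.SVC with a linear kernel and C=1 is an assumption made for the example, as are the variable names clf, train_accuracy, and test_accuracy.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Load the Iris data and hold out 40% of the rows for testing.
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Assumed model for illustration: a linear support vector classifier,
# fitted on the training rows only.
clf = svm.SVC(kernel="linear", C=1).fit(X_train, y_train)

# Accuracy on the rows used for training versus the held-out rows.
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(f"Training accuracy: {train_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The test accuracy is the figure to report, because it is measured on rows the model never saw while fitting.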