Cross Validation
Cross-validation is a technique for building robust models. In this lesson, you'll discover how it works.
Train, test, and validation datasets
- We divide the dataset at hand into a training dataset and a test dataset.
- We train the model on the training dataset.
- We evaluate the model's performance on the test dataset (on which the model was not trained) and report it.
Scikit Learn provides train_test_split, which gives us the training and test datasets. The code snippets below are taken from the Scikit Learn documentation itself.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

X, y = datasets.load_iris(return_X_y=True)
print("Original Shape of input and output columns")
print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
print("Shape of the training dataset's input and output columns")
print(X_train.shape)
print(y_train.shape)
print("Shape of the test dataset's input and output columns")
print(X_test.shape)
print(y_test.shape)
- Line 6 loads the Iris dataset and saves the input columns in X and the output column in y. Lines 8 and 9 print the shapes of these arrays.
- Line 11 splits the dataset into the training and test datasets. test_size specifies the fraction of instances to be kept in the test dataset; in the current case, 40% of the rows are kept in the test dataset.
- Then we print the shapes of the newly formed datasets, as shown below.
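Since Iris has 150 rows and 4 input columns, and random_state is fixed, the snippet prints deterministic shapes: with test_size=0.4, 90 rows go to training and 60 to test.

Original Shape of input and output columns
(150, 4)
(150,)
Shape of the training dataset's input and output columns
(90, 4)
(90,)
Shape of the test dataset's input and output columns
(60, 4)
(60,)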
Validation dataset
When evaluating different settings ("hyperparameters") for a model, such as the regularization strength alpha that must be set manually for Ridge Regression, there is still a risk of overfitting on the test set, because the parameters can be tweaked until the model performs optimally. This way, knowledge about the test set can "leak" into the model, and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a validation dataset: we train on the training dataset, tune hyperparameters against the validation dataset, and report the final performance on the test dataset.
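To make this concrete, here is a minimal sketch of a three-way split on the same Iris data. The 60/20/20 proportions and the second call to train_test_split are illustrative choices, not something fixed by the lesson:

from sklearn.model_selection import train_test_split
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)

# Hold out 20% of the rows as the test dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Carve a validation dataset out of the remaining rows:
# 0.25 of the remaining 80% is 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

print(X_train.shape)  # (90, 4) -> 60% of the rows
print(X_val.shape)    # (30, 4) -> 20% of the rows
print(X_test.shape)   # (30, 4) -> 20% of the rows

The drawback of holding out a validation dataset this way is that those rows are never used for training, which is wasteful when data is scarce; this is exactly the problem that cross-validation addresses.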