Machine learning is a branch of artificial intelligence (AI) that enables computers to learn patterns in data without being explicitly programmed to do so.
In machine learning, data is commonly split into three sets:
The training set is the data from which the model learns patterns.
The validation set evaluates the model’s performance on unseen data and is useful when tuning the model’s hyperparameters.
The testing set evaluates how well the tuned model can make predictions on unseen data.
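Data scientists and machine learning engineers do not always use a validation set, but it is worth considering: if hyperparameters are tuned against the testing set, the final test score no longer reflects performance on truly unseen data. The following code splits a dataset into training, validation, and testing sets: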
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
# loading the Boston dataset (load_boston was removed in scikit-learn 1.2)
X, y = load_boston(return_X_y=True)
print('shape of data: ', X.shape)

# splitting the data: 80% for training, 20% left for further splitting
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.8)

# splitting the remaining data equally into
# validation and test sets
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5)

print('X_train', X_train.shape)
print('y_train', y_train.shape)
print('X_valid', X_valid.shape)
print('y_valid', y_valid.shape)
print('X_test', X_test.shape)
print('y_test', y_test.shape)
Lines 1 and 2: We import load_boston for loading the Boston dataset and the train_test_split function for splitting our data.
Line 4: We load the features (X) and the target (y) from the dataset.
Line 8: We split the data in two. One set (80% of the data) is for training, and the other (20%) is left for splitting further.
Line 12: Then, we split the remaining 20% equally into validation and test sets, so each holds about 10% of the original data.
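To show where the validation set actually gets used, here is a minimal sketch of hyperparameter tuning on the three splits. It is an illustration under assumptions, not part of the original example: load_diabetes stands in for the Boston data, Ridge is an arbitrary choice of model, and the candidate_alphas values are hypothetical.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: load_diabetes stands in for the Boston data, which newer
# scikit-learn releases no longer ship.
X, y = load_diabetes(return_X_y=True)

# Same 80/10/10 split as in the walkthrough; random_state makes it repeatable.
X_train, X_rem, y_train, y_rem = train_test_split(
    X, y, train_size=0.8, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rem, y_rem, test_size=0.5, random_state=0)

# Hypothetical candidate values for Ridge's regularization strength.
candidate_alphas = [0.01, 0.1, 1.0, 10.0]

# Fit one model per candidate on the training set and score it on the
# validation set; the test set is not touched during this loop.
valid_errors = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    valid_errors[alpha] = mean_squared_error(y_valid, model.predict(X_valid))

# Keep the candidate with the lowest validation error.
best_alpha = min(valid_errors, key=valid_errors.get)
print('validation MSE per alpha:', valid_errors)
print('best alpha:', best_alpha)

# Evaluate the chosen model once on the held-out test set.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print('test MSE:', test_mse)
```

Because the test set plays no part in choosing alpha, the final test MSE remains an estimate of performance on unseen data.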
Note: If your model does not need hyperparameter tuning, or if its hyperparameters are hard to tune, you might consider using only the training and testing sets.
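For that simpler case, a two-way split might look like the sketch below. The dataset (load_diabetes) and the model (LinearRegression with its default settings, so nothing is tuned) are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: load_diabetes is used purely as a small example dataset.
X, y = load_diabetes(return_X_y=True)

# A single split: 80% for training, 20% for testing, no validation set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

# Nothing is tuned here, so the model is fit once and then evaluated
# once on the test set.
model = LinearRegression().fit(X_train, y_train)
test_mse = mean_squared_error(y_test, model.predict(X_test))

print('X_train', X_train.shape)
print('X_test', X_test.shape)
print('test MSE:', test_mse)
```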