The train_test_split function of the sklearn.model_selection package in Python splits arrays or matrices into random train and test subsets.
To use the train_test_split function, we import it into our program as shown below:
from sklearn.model_selection import train_test_split
The syntax of the train_test_split function is as follows:
sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
The train_test_split function accepts the following parameter values:
*arrays: These are the arrays or matrices that need to be split. All of them must have the same length.
test_size: This is the size of the test subset. If this parameter is an int, it represents the number of values to place in the test subset. If this parameter is a float, it represents the proportion of the dataset to place in the test subset.
train_size: This is the size of the train subset. Similar to the test_size parameter, the train_size parameter can be either a float or an int.
random_state: This parameter value seeds the shuffling that is applied to the data before it is split. Passing the same int produces the same split on every call.
shuffle: This parameter value determines whether or not the data is shuffled before being split.
stratify: This parameter value takes class labels so that the data can be split in a stratified fashion.
Note: A comprehensive description of the aforementioned parameters can be found in the scikit-learn documentation.
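The short sketch below illustrates how the test_size, random_state, and shuffle parameters described above behave. The values list is illustrative data that does not come from the article, and the chosen sizes and seed are arbitrary.

```python
from sklearn.model_selection import train_test_split

# Illustrative data (an assumption for this sketch, not from the article).
values = list(range(8))

# A float test_size is a proportion: 25% of 8 values gives 2 test values.
train_a, test_a = train_test_split(values, test_size=0.25, random_state=0)

# An int test_size is an absolute number of test values.
train_b, test_b = train_test_split(values, test_size=3, random_state=0)

# Reusing the same random_state reproduces the same shuffle and split.
train_c, test_c = train_test_split(values, test_size=0.25, random_state=0)

# With shuffle=False the original order is kept and the tail becomes the test subset.
train_d, test_d = train_test_split(values, test_size=0.25, shuffle=False)

print("Float test_size:", len(train_a), "train,", len(test_a), "test")
print("Int test_size:", len(train_b), "train,", len(test_b), "test")
print("Same seed, same split:", test_a == test_c)
print("Unshuffled split:", train_d, test_d)
```

Fixing random_state is mainly useful when a split has to be repeatable, for example when comparing models on the same test subset.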
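The train_test_split function returns a list that contains the train-test splits of the inputs. Each input contributes a train subset followed by a test subset, so the list is twice as long as the number of inputs.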
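As a minimal sketch of the return value, the example below passes three inputs and unpacks the six subsets that come back. The names features, targets, and ids and their contents are hypothetical and used only for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical inputs of equal length (assumptions for this sketch).
features = [[1, 10], [2, 20], [3, 30], [4, 40]]
targets = [0, 1, 0, 1]
ids = ["a", "b", "c", "d"]

result = train_test_split(features, targets, ids, test_size=0.5, random_state=42)

# Three inputs produce six outputs, ordered as a train/test pair per input.
print("Number of returned subsets:", len(result))

X_train, X_test, y_train, y_test, id_train, id_test = result

# The same shuffle is applied to every input, so corresponding entries stay aligned.
for identifier, row, target in zip(id_train, X_train, y_train):
    print(identifier, row, target)
```

Because every input is split with the same indices, a row in X_train still matches its entry in y_train and id_train.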
The code below shows us how to use the train_test_split function in Python.
from sklearn.model_selection import train_test_split

# declare an array of values
data = [20, 4, 12, 9, 0, 10]

# declare labels associated with each value
labels = ["A", "B", "B", "A", "C", "A"]

# split the data into train-test subsets of equal sizes
train, test = train_test_split(data, test_size=0.5)
print("Splitting into equal parts:")
print("Train Split:", train)
print("Test Split:", test)

# split the dataset into train-test subsets of different sizes
train, test = train_test_split(data, test_size=0.2)
print("\nSplitting into different parts:")
print("Train Split:", train)
print("Test Split:", test)

# split multiple lists
train_data, test_data, train_labels, test_labels = train_test_split(data, labels)
print("\nSplitting multiple lists:")
print("Train Data:", train_data)
print("Test Data:", test_data)
print("Train Labels:", train_labels)
print("Test Labels:", test_labels)
In the code above:
We import the train_test_split function from the sklearn.model_selection package.
We declare the data array and the labels array associated with it.
We split the data array into equally-sized train and test subsets using the train_test_split function. The lists returned by the function are output accordingly.
We split the data array into differently-sized train and test subsets using the train_test_split function, with the test subset containing 20% of the values. Because the test count is rounded up, 20% of six values yields two test values. The lists returned by the function are output accordingly.
We split the data and labels arrays together to get all the train and test subsets. The lists returned by the function are output accordingly.
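When labels are split alongside the data, the stratify parameter can keep the class proportions similar in both subsets. The sketch below uses illustrative samples and classes lists rather than the article's labels, since stratification needs at least two members in every class and the label "C" above appears only once.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Illustrative data (an assumption for this sketch): 8 "A" labels and 4 "B" labels.
samples = list(range(12))
classes = ["A"] * 8 + ["B"] * 4

# Passing the labels as stratify asks for the 2:1 class ratio to be kept in each subset.
s_train, s_test, c_train, c_test = train_test_split(
    samples, classes, test_size=0.25, stratify=classes, random_state=0
)

print("Train class counts:", dict(Counter(c_train)))
print("Test class counts:", dict(Counter(c_test)))
```

Stratified splitting is most helpful when some classes are rare, because a purely random split could otherwise leave a class out of the test subset entirely.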