What is the train_test_split function in Sklearn?

The train_test_split function of the sklearn.model_selection module in Python splits arrays or matrices into random train and test subsets.

To use the train_test_split function, we’ll import it into our program as shown below:

from sklearn.model_selection import train_test_split

Syntax

The syntax of the train_test_split function is as follows:

sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

Parameter values

The train_test_split function accepts the following parameter values:

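  • *arrays: These are the arrays or matrices that need to be split. They can be lists, NumPy arrays, SciPy sparse matrices, or pandas DataFrames, and they must all have the same length.
  • test_size: This is the size of the test subset. If this parameter is an int, it represents the absolute number of samples to place in the test subset. If this parameter is a float, it must be between 0.0 and 1.0 and represents the proportion of the dataset to place in the test subset. If it is None, it is set to the complement of train_size. If train_size is also None, it defaults to 0.25.
  • train_size: This is the size of the train subset. Like the test_size parameter, it can be either a float (a proportion) or an int (an absolute number of samples). If it is None, it is set to the complement of the test size.
  • random_state: This parameter controls the shuffling applied to the data before the split. Passing an int makes the split reproducible across function calls.
  • shuffle: This parameter determines whether or not the data is shuffled before being split. It defaults to True. If it is False, stratify must be None.
  • stratify: This parameter takes an array of class labels. If it is not None, the data is split in a stratified fashion, so that the train and test subsets keep roughly the same class proportions as the labels.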

Note: A comprehensive description of these parameters can be found in the official scikit-learn documentation for train_test_split.
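To make the random_state, shuffle, and stratify parameters more concrete, here is a minimal sketch. The values and classes lists are hypothetical toy data chosen for illustration, and the seeds 42 and 0 are arbitrary.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: eight values with two balanced classes
values = [0, 1, 2, 3, 4, 5, 6, 7]
classes = ["A", "A", "A", "A", "B", "B", "B", "B"]

# random_state: the same integer seed gives the same shuffle on every call
first = train_test_split(values, test_size=0.25, random_state=42)
second = train_test_split(values, test_size=0.25, random_state=42)
print("Reproducible:", first == second)

# shuffle=False: the original order is kept, so the test subset
# is simply the last 25% of the list
ordered_train, ordered_test = train_test_split(
    values, test_size=0.25, shuffle=False
)
print("Ordered train:", ordered_train)
print("Ordered test:", ordered_test)

# stratify: the class proportions of `classes` are preserved in both
# subsets, so the two test samples contain one "A" and one "B"
strat_train, strat_test, strat_train_labels, strat_test_labels = train_test_split(
    values, classes, test_size=0.25, stratify=classes, random_state=0
)
print("Stratified test labels:", sorted(strat_test_labels))
```

Note that stratification needs at least two samples per class, so very small or heavily imbalanced label lists may raise an error.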

Return value

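The train_test_split function returns a list that contains the train-test split of each input. The list has twice as many elements as the number of arrays passed in, ordered as the train subset followed by the test subset for each array in turn.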
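The shape of the return value can be checked with a short sketch. The features, targets, and weights lists below are hypothetical, and the seed is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Hypothetical inputs: three parallel lists with four samples each
features = [[1, 2], [3, 4], [5, 6], [7, 8]]
targets = [10, 20, 30, 40]
weights = [0.1, 0.2, 0.3, 0.4]

# Three input arrays produce a list of six subsets
splits = train_test_split(
    features, targets, weights, test_size=0.5, random_state=0
)
print(type(splits).__name__, len(splits))

# The usual pattern is to unpack the list directly
X_train, X_test, y_train, y_test, w_train, w_test = splits

# Every array is split with the same indices, so each feature row
# stays aligned with its original target
original_pairs = set(zip(map(tuple, features), targets))
train_pairs = set(zip(map(tuple, X_train), y_train))
print("Rows stay aligned:", train_pairs <= original_pairs)
```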

Example

The code below shows us how to use the train_test_split function in Python.

from sklearn.model_selection import train_test_split
# declare an array of values
data = [20, 4, 12, 9, 0, 10]
# declare labels associated with each value
labels = ["A", "B", "B", "A", "C", "A"]
# split the data into train-test subsets of equal sizes
train, test = train_test_split(data, test_size=0.5)
print("Splitting into equal parts:")
print("Train Split:", train)
print("Test Split:", test)
# split the dataset into train-test subsets of different sizes
train, test = train_test_split(data, test_size=0.2)
print("\nSplitting into different parts:")
print("Train Split:", train)
print("Test Split:", test)
# split multiple lists
train_data, test_data, train_labels, test_labels = train_test_split(data, labels)
print("\nSplitting multiple lists:")
print("Train Data:", train_data)
print("Test Data:", test_data)
print("Train Labels:", train_labels)
print("Test Labels:", test_labels)

Explanation

  • Line 1: We import the train_test_split function from the sklearn.model_selection module.
  • Line 3: We initialize a list of values to serve as the data.
  • Line 5: We initialize a list of labels that correspond to each value in the data list.
  • Line 7: We split the data list into equally sized train and test subsets using the train_test_split function. The lists returned by the function are then printed.
  • Line 12: We split the data list into differently sized train and test subsets using the train_test_split function with test_size=0.2. Because the list has only six values, the test subset is rounded up to two values and the train subset gets the remaining four. The lists returned by the function are then printed.
  • Line 17: We split both the data and labels lists in a single call to get all the train and test subsets. No test_size is given, so the default of 0.25 applies. The lists returned by the function are then printed.
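Since the contents of each subset change from run to run unless random_state is set, it can help to look at the subset sizes, which are deterministic. The sketch below reuses the data list from the example. The sizes dictionary and the seed are additions for illustration only.

```python
from sklearn.model_selection import train_test_split

data = [20, 4, 12, 9, 0, 10]

# Record (train length, test length) for several test_size settings
sizes = {}
for test_size in (0.5, 0.2, None, 2):
    train, test = train_test_split(data, test_size=test_size, random_state=0)
    sizes[str(test_size)] = (len(train), len(test))
    print("test_size =", test_size, "->", len(train), "train,", len(test), "test")
```

With six values, a float test_size is rounded up to a whole number of test samples, so 0.2 and the default of 0.25 both give two test values, the same as passing the int 2.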