Most machine learning algorithms cannot work directly with categorical data; the categorical values must first be converted to numerical data. One-hot encoding is one of the techniques used to perform this conversion. It is especially common when deep learning techniques are applied to sequential classification problems.
One-hot encoding is the representation of categorical variables as binary vectors. The categorical values are first mapped to integer values. Each integer value is then represented as a binary vector that is all 0s, except at the index of that integer, which is marked with a 1.
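For example, if the possible categories are "red", "green", and "blue", then "green" maps to the integer 1 and is encoded as the vector [0, 1, 0].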
Have a look at the example below, which manually converts a categorical list of colors to a numerical list using one-hot encoding:
import numpy as np

### Categorical data to be converted to numeric data
colors = ["red", "green", "yellow", "red", "blue"]

### Universal list of colors
total_colors = ["red", "green", "blue", "black", "yellow"]

### Map each color to an integer
mapping = {}
for x in range(len(total_colors)):
    mapping[total_colors[x]] = x

### Build a one-hot vector for each color in the list
one_hot_encode = []
for c in colors:
    arr = list(np.zeros(len(total_colors), dtype=int))
    arr[mapping[c]] = 1
    one_hot_encode.append(arr)

print(one_hot_encode)
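Since "red" occupies index 0 of total_colors, "green" index 1, and so on, each color becomes a five-element vector with a single 1. The printed result contains the vectors (exact formatting of the list elements may vary with the NumPy version):

[[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [0, 0, 1, 0, 0]]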
scikit-learn
Take a look at the example below. It uses the scikit-learn library to perform one-hot encoding:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

### Categorical data to be converted to numeric data
colors = ["red", "green", "yellow", "red", "blue"]

### Integer mapping using LabelEncoder
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(colors)
print(integer_encoded)

### Reshape to a column vector, since OneHotEncoder expects a 2D array
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)

### One-hot encoding (on scikit-learn versions before 1.2, use sparse=False instead of sparse_output=False)
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
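LabelEncoder assigns integers to the labels in sorted order, so "blue" becomes 0, "green" 1, "red" 2, and "yellow" 3. Unlike the manual example above, the encoder only sees the four colors present in colors, so each one-hot vector has four positions. The two print statements should output roughly:

[2 1 3 2 0]
[[0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]]

In recent scikit-learn versions, the LabelEncoder step is optional because OneHotEncoder can encode string categories directly. A minimal sketch, assuming scikit-learn 1.2 or newer for the sparse_output parameter:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

### Categorical data as a single column (2D array)
colors = np.array(["red", "green", "yellow", "red", "blue"]).reshape(-1, 1)

### OneHotEncoder works on the string categories directly
encoder = OneHotEncoder(sparse_output=False)
print(encoder.fit_transform(colors))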