One-hot encoding of text data in natural language processing

In natural language processing (NLP), one of the fundamental tasks is to convert textual data, such as words or labels, into a format that machine learning algorithms can process effectively, i.e., numeric vectors. One common technique for this purpose is one-hot encoding.

What is one-hot encoding?

One-hot encoding is a method used to represent categorical data, including textual data, in a numerical format. In the context of NLP, it is primarily used to convert textual information into a format that machine learning models can understand and process. Each category is transformed into a binary vector where all elements are 0, except for one element, which is set to 1. This single 1 in the vector corresponds to the category being encoded. The length of the vector is equal to the number of distinct categories in the data.

Figure: One-hot encoding
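
For example, suppose the data contains just three labels, "cat", "dog", and "bird" (an illustrative toy set of categories). Each label maps to a three-element vector with a single 1:

# Three example categories (illustrative only)
categories = ["cat", "dog", "bird"]

# "cat"  -> [1, 0, 0]
# "dog"  -> [0, 1, 0]
# "bird" -> [0, 0, 1]
for i, category in enumerate(categories):
    vector = [0] * len(categories)
    vector[i] = 1
    print(f"{category}: {vector}")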

How one-hot encoding works

The process of one-hot encoding involves converting each word in a given text into a unique numeric vector. Here’s how it works:

  • Vocabulary creation: The first step in one-hot encoding is to create a vocabulary consisting of all unique words in the entire text corpus. Each word is indexed with a unique integer value.

  • Vector representation: Once the vocabulary is created, each word is represented as a vector of 0s and 1s. The vector's length equals the size of the vocabulary, and each position in the vector corresponds to a specific word. The position assigned to the word being encoded is set to 1, and all other positions are 0, so each word is uniquely represented by a binary vector with exactly one nonzero element (see the sketch after this list).
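
A minimal sketch of these two steps, using a word-to-index dictionary (the sample sentence is illustrative):

# Sample corpus
corpus = "the cat sat on the mat"

# Vocabulary creation: index each unique word, preserving first occurrence
vocabulary = {word: index for index, word in enumerate(dict.fromkeys(corpus.split()))}

# Vector representation: one binary vector per word in the vocabulary
for word, index in vocabulary.items():
    vector = [0] * len(vocabulary)
    vector[index] = 1
    print(f"{word}: {vector}")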

When to use one-hot encoding

One-hot encoding is used in various NLP and machine learning scenarios, including but not limited to:

  • Text classification: We use one-hot encoding when we want to classify text documents into categories (e.g., spam detection, sentiment analysis, topic classification).

  • Feature engineering: We use one-hot encoding when we have categorical features that must be included in machine learning models.

  • Embedding layers: We use one-hot encoding when preparing data for deep learning models like neural networks. Embedding layers can follow one-hot encoding to represent categorical data more efficiently.

  • Encoding labels: We use one-hot encoding when working with labels or target variables that are categorical and need to be transformed for model training, e.g., in classification tasks (a label-encoding sketch follows this list).
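
As an illustration of the label-encoding case, scikit-learn's OneHotEncoder produces the same kind of vectors. This is a minimal sketch with made-up labels; the sparse_output parameter assumes scikit-learn 1.2 or later (older versions use sparse=False instead):

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Hypothetical classification labels (one column, one label per row)
labels = np.array([["spam"], ["ham"], ["spam"], ["ham"]])

# sparse_output=False returns a dense array (scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)
one_hot = encoder.fit_transform(labels)

print(encoder.categories_)  # learned categories, sorted: ['ham', 'spam']
print(one_hot)              # each row is a one-hot vector over those categories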

Implementation

Let’s see the implementation of one-hot encoding for text data using Python:

# Sample text
text = "Hello! Welcome to Educative. Happy learning."

# Tokenize the text into words
words = text.split()

# Create a set to get unique words (vocabulary)
vocabulary = set(words)

# Generate one-hot encoded vectors for each word in the vocabulary
one_hot_encoded = []
for word in vocabulary:
    # Create a list of zeros with the length of the vocabulary
    encoding = [0] * len(vocabulary)

    # Get the index of the word in the vocabulary
    index = list(vocabulary).index(word)

    # Set the value at the index to 1 to indicate word presence
    encoding[index] = 1
    one_hot_encoded.append((word, encoding))

# Print the one-hot encoded vectors
for word, encoding in one_hot_encoded:
    print(f"{word}: {encoding}")
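
Because vocabulary is a set, the order in which words receive indices is arbitrary and can change between runs. The output will therefore look like the following, although the row order and the positions of the 1s may differ:

Hello!: [1, 0, 0, 0, 0, 0]
Welcome: [0, 1, 0, 0, 0, 0]
to: [0, 0, 1, 0, 0, 0]
Educative.: [0, 0, 0, 1, 0, 0]
Happy: [0, 0, 0, 0, 1, 0]
learning.: [0, 0, 0, 0, 0, 1]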

Code explanation

  • Line 5: We use the split() method to split the text into a list of words. By default, it splits on whitespace, so punctuation stays attached to the adjacent word (e.g., "Hello!" and "Educative." are single tokens). The result is stored in the words list.

  • Line 8: We create a set called vocabulary to store unique words from the words list.

  • Lines 11–14: We create an empty list called one_hot_encoded. The code then enters a for loop to iterate through each unique word in the vocabulary. Inside the loop, we create a list called encoding with the same length as the vocabulary, filled with zeros. This list represents the one-hot encoded vector for the current word.

  • Line 17: We extract the index of the current word in the vocabulary set. The list(vocabulary) part converts the set into a list, allowing us to use the index() method to find the position of the current word. A set has no defined order, but iterating over the same unmodified set yields a consistent order within a single run, so every word gets a stable index here.

  • Line 20: We set the value at the index corresponding to the current word to 1, indicating the presence of that word in the one-hot encoded vector.

  • Line 21: We create a tuple with the word and its corresponding one-hot encoded vector, and this tuple is appended to the one_hot_encoded list.

  • Lines 24–25: Finally, we iterate through the one_hot_encoded list and print each word along with its one-hot encoded vector.

Advantages and disadvantages

One limitation of one-hot encoding is its high dimensionality, especially for larger vocabularies, which can lead to sparse vectors and computational inefficiency. Additionally, it doesn’t capture any inherent relationships or semantics between words.
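
A small numerical sketch of both limitations, assuming an illustrative vocabulary size of 50,000:

import numpy as np

# With a vocabulary of size V, every word vector has V components,
# only one of which is nonzero.
V = 50_000
cat = np.zeros(V)
cat[0] = 1
dog = np.zeros(V)
dog[1] = 1

# Each float64 vector takes V * 8 bytes, i.e., 400 KB per word.
print(cat.nbytes)  # 400000

# The dot product (and hence cosine similarity) between any two
# distinct one-hot vectors is 0: the encoding captures no notion
# of relatedness between words.
print(cat @ dog)  # 0.0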

However, despite its limitations, one-hot encoding is a fundamental technique in NLP, particularly in scenarios where the model requires categorical data input. It is often a crucial step in text preprocessing before applying more advanced NLP techniques, such as neural networks or word embeddings.
