Subword tokenization algorithms

Let's look at several subword tokenization algorithms that are used to create a vocabulary. Once the vocabulary is created, we can use it for tokenization. We'll go over the following three popular subword tokenization algorithms:

Byte pair encoding
Byte-level byte pair encoding
WordPiece

Byte pair encoding

Let's understand how byte pair encoding (BPE) works with the help of an example. Suppose we have a dataset. First, we extract all the words from the dataset along with their counts. Suppose the extracted words and their counts are (cost, 2), (best, 2), (menu, 1), (men, 1), and (camel, 1).
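The word-counting step can be sketched in Python as follows. This is a minimal illustration, not the lesson's own code: the toy corpus string and the helper name count_words are assumptions, chosen only so that the counts match the example above, and the words are split on whitespace.

```python
from collections import Counter

# Hypothetical toy corpus (an assumption), built so the counts match the example.
corpus = "cost best menu men camel cost best"

def count_words(text):
    """Return a word -> frequency mapping using simple whitespace splitting."""
    return Counter(text.lower().split())

word_counts = count_words(corpus)

# Print each word with its count.
for word, count in word_counts.items():
    print(word, count)
```

A real dataset would usually need more careful pre-tokenization (punctuation, casing, and so on) before counting; whitespace splitting is used here only to keep the sketch short.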

Splitting the words into characters

Now, we split each word into characters to create a character sequence. The following table shows each character sequence along with its word count:

Character sequence    Count
c o s t               2
b e s t               2
m e n u               1
m e n                 1
c a m e l             1
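The character split can be sketched in Python as well. This is an illustrative sketch under stated assumptions: the helper name split_into_characters is hypothetical, and each character sequence is stored as a tuple so it can serve as a dictionary key.

```python
# Word counts from the example above.
word_counts = {"cost": 2, "best": 2, "menu": 1, "men": 1, "camel": 1}

def split_into_characters(counts):
    """Map each word's character sequence (as a tuple) to the word's count."""
    return {tuple(word): count for word, count in counts.items()}

char_sequences = split_into_characters(word_counts)

# Print each character sequence, space-separated, with its count.
for chars, count in char_sequences.items():
    print(" ".join(chars), count)
```

Keeping the count next to each character sequence matters because BPE works with frequencies: a sequence that comes from a word seen twice should contribute twice as much as one seen once.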
