Byte Pair Encoding
Learn about the byte pair encoding algorithm used for the subword tokenization.
Subword tokenization algorithms
Let's learn about several interesting subword tokenization algorithms that are used to create the vocabulary. After creating the vocabulary, we can use it for tokenization. We'll go over the following three popularly used subword tokenization algorithms:
Byte pair encoding
Byte-level byte pair encoding
WordPiece
Byte pair encoding
Let's understand how Byte Pair Encoding (BPE) works with the help of an example. Let's suppose we have a dataset. First, we extract all the words from the dataset along with their count. Suppose the words extracted from the dataset along with the count are (cost, 2), (best, 2), (menu, 1), (men, 1), and (camel, 1).
Splitting the words into characters
Now, we split all the words into characters and create a character sequence. The following table shows the character sequence along with the wordcount:
Get hands-on with 1400+ tech skills courses.