...

/

Byte Pair Encoding

Byte Pair Encoding

Learn about the byte pair encoding algorithm used for the subword tokenization.

Subword tokenization algorithms

Let's learn about several interesting subword tokenization algorithms that are used to create the vocabulary. After creating the vocabulary, we can use it for tokenization. We'll go over the following three popularly used subword tokenization algorithms:

  • Byte pair encoding

  • Byte-level byte pair encoding

  • WordPiece


Byte pair encoding


Let's understand how Byte Pair Encoding (BPE) works with the help of an example. Let's suppose we have a dataset. First, we extract all the words from the dataset along with their count. Suppose the words extracted from the dataset along with the count are (cost, 2), (best, 2), (menu, 1), (men, 1), and (camel, 1).(cost, 2)\text{(cost, 2)}

Splitting the words into characters

Now, we split all the words into characters and create a character sequence. The following table shows the character sequence along with the wordcount:

Press + to interact
Character sequence with count
Character sequence with count

Defining vocabulary size

Next, we define a vocabulary size. Let's suppose we build a vocabulary of size 14. This implies that we create a vocabulary with 14 tokens. Now, let's understand how to create vocabulary using BPE.

Creating the vocabulary

First, we add all the unique characters present in the character sequence to the ...