Subword tokenization algorithms

Let's look at several subword tokenization algorithms that are used to create the vocabulary. Once the vocabulary is created, we can use it for tokenization. We'll go over the following three widely used subword tokenization algorithms:

  • Byte pair encoding

  • Byte-level byte pair encoding

  • WordPiece


Byte pair encoding


Let's understand how byte pair encoding (BPE) works with the help of an example. Suppose we have a dataset. First, we extract all the words from the dataset along with their counts. Suppose the words extracted from the dataset, along with their counts, are (cost, 2), (best, 2), (menu, 1), (men, 1), and (camel, 1).
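The word-extraction step can be sketched in a few lines of Python. This is only an illustration: the `corpus` string is a hypothetical toy dataset chosen so that the counts match the example above, and plain whitespace splitting stands in for whatever pre-tokenization a real pipeline would use.

```python
from collections import Counter

# Hypothetical toy dataset, chosen so the counts match the example above.
corpus = "cost best menu men camel cost best"

# Whitespace splitting is an assumption; real pipelines usually apply
# a more careful pre-tokenizer (lowercasing, punctuation handling, etc.).
word_counts = Counter(corpus.split())

for word, count in word_counts.items():
    print(word, count)
```

A `Counter` is used here simply because it maps each distinct word to the number of times it occurs, which is the (word, count) pairing the example relies on.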

Splitting the words into characters

Now, we split all the words into characters and create a character sequence. The following table shows the character sequence along with the word count:

  Character sequence    Count
  c o s t               2
  b e s t               2
  m e n u               1
  m e n                 1
  c a m e l             1
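As a minimal sketch, the split into characters could be written in Python as follows. The variable names `word_counts` and `char_sequences` are illustrative, and storing each sequence as a tuple of single-character symbols is one possible representation, not the only one.

```python
# Word counts taken from the example above.
word_counts = {"cost": 2, "best": 2, "menu": 1, "men": 1, "camel": 1}

# Represent each word as a tuple of its characters, keeping its count.
char_sequences = {tuple(word): count for word, count in word_counts.items()}

# Print each character sequence next to its word count.
for symbols, count in char_sequences.items():
    print(" ".join(symbols), count)
# c o s t 2
# b e s t 2
# m e n u 1
# m e n 1
# c a m e l 1
```

Keeping the symbols in a tuple rather than a single string makes it straightforward to treat multi-character symbols as single units later on.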
