Byte-Level Byte Pair Encoding and WordPiece Algorithms
Learn how to perform subword tokenization using the byte-level byte pair encoding and WordPiece algorithms.
Now, let's discuss two other subword tokenization algorithms: byte-level byte pair encoding and WordPiece.
Byte-level byte pair encoding algorithm
Byte-level byte pair encoding (BBPE) is another widely used algorithm. It works very similarly to BPE, but instead of using a character-level sequence, it uses a byte-level sequence.
Tokenizing with byte-level byte pair encoding
Let's understand how BBPE works with the help of an example.
Example: best
Let's suppose our input text consists of just the word 'best'. We know that in BPE, we convert the word into a character sequence, so we will have the following:
Character sequence: b e s t
Whereas in BBPE, instead of converting the word to a character sequence, we convert it to a byte-level sequence. Hence, we convert the word 'best' into a byte sequence:
Byte sequence: 62 65 73 74 (byte values in hexadecimal)
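The two conversions for 'best' can be reproduced with a minimal Python sketch. It assumes the text is encoded as UTF-8 and that each byte is displayed as a two-digit hexadecimal value. The helper names `to_char_sequence` and `to_byte_sequence` are hypothetical, chosen for illustration only, and are not part of any tokenizer library.

```python
def to_char_sequence(word):
    # BPE starts from characters: one symbol per Unicode character.
    return list(word)


def to_byte_sequence(word):
    # BBPE starts from bytes: encode the text (UTF-8 assumed here) and
    # show every byte as a two-digit hexadecimal string.
    return [f"{byte:02x}" for byte in word.encode("utf-8")]


chars = to_char_sequence("best")
byte_seq = to_byte_sequence("best")

print(chars)     # ['b', 'e', 's', 't']
print(byte_seq)  # ['62', '65', '73', '74']
```

For a word made up only of ASCII characters, such as 'best', both sequences have the same length because each character occupies exactly one byte.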
In this way, we convert the given input into a byte-level sequence instead of a character-level sequence. Each Unicode character is encoded as one or more bytes: in UTF-8, a single character takes 1 to 4 bytes.
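To see how the number of bytes varies from one character to another, the following sketch counts the bytes each character occupies. It assumes UTF-8 encoding. The helper `utf8_length` and the sample characters are illustrative choices, not taken from the lesson.

```python
def utf8_length(char):
    # Number of bytes the character occupies when encoded as UTF-8 (assumed encoding).
    return len(char.encode("utf-8"))


# Sample characters written as escapes so the code points are unambiguous.
samples = {
    "b": "b",                  # U+0062, basic Latin letter
    "e-acute": "\u00e9",       # U+00E9, Latin letter with an accent
    "ni": "\u4f60",            # U+4F60, CJK ideograph
    "emoji": "\U0001F600",     # U+1F600, emoji outside the Basic Multilingual Plane
}

lengths = {name: utf8_length(char) for name, char in samples.items()}
print(lengths)  # {'b': 1, 'e-acute': 2, 'ni': 3, 'emoji': 4}
```

A byte-level sequence is therefore at least as long as the corresponding character-level sequence, and longer whenever the text contains non-ASCII characters.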
Example: 你好
Let's consider one more example. Let's suppose that our input consists ...