Byte-Level Byte Pair Encoding and WordPiece Algorithms

Learn how to perform subword tokenization using the byte-level byte pair encoding and WordPiece algorithms.

Now, let's discuss two other subword tokenization algorithms: byte-level byte pair encoding and WordPiece.


Byte-level byte pair encoding algorithm

Byte-level byte pair encoding (BBPE) is another widely used algorithm. It works very similarly to BPE, but instead of using a character-level sequence, it uses a byte-level sequence.

Tokenizing with byte-level byte pair encoding

Let's understand how BBPE works with the help of an example.

Example: best

Let's suppose our input text consists of just the word 'best'. We know that in BPE, we convert the word into a character sequence, so we will have the following:

Character sequence: b e s t

In BBPE, instead of converting the word to a character sequence, we convert it to a byte-level sequence. Hence, we convert the word 'best' into a byte sequence:

Byte sequence (hexadecimal): 62 65 73 74

In this way, we convert the given input into a byte-level sequence instead of a character-level sequence. Each Unicode character is converted into one or more bytes. In UTF-8, a single character takes 1 to 4 bytes.
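The difference between the two views can be checked with a minimal Python sketch. It assumes UTF-8 as the encoding and shows each byte as two hexadecimal digits purely for display. The helper names `char_sequence` and `byte_sequence` are illustrative and do not come from any tokenizer library.

```python
def char_sequence(text):
    # Character-level view, as used by plain BPE.
    return list(text)


def byte_sequence(text):
    # Byte-level view, as used by BBPE: encode to UTF-8 (an assumption here),
    # then show every byte as two hexadecimal digits.
    return [format(b, "02x") for b in text.encode("utf-8")]


print(char_sequence("best"))  # ['b', 'e', 's', 't']
print(byte_sequence("best"))  # ['62', '65', '73', '74']

# The number of UTF-8 bytes per character ranges from 1 to 4.
samples = ["b", "é", "你", "😀"]
byte_counts = [len(ch.encode("utf-8")) for ch in samples]
print(byte_counts)  # [1, 2, 3, 4]
```

For an ASCII-only word such as 'best', the two sequences have the same length because every character fits in one byte. They diverge once the text contains characters outside the ASCII range.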

Example: 你好

Let's consider one more example. Let's suppose that our input consists ...