Byte-Level Byte Pair Encoding and WordPiece Algorithms
Learn how to perform subword tokenization using the byte-level byte pair encoding and WordPiece algorithms.
Now, let's discuss two other subword tokenization algorithms—Byte-level byte pair encoding and WordPiece.
Byte-level byte pair encoding algorithm
Byte-level byte pair encoding (BBPE) is another popularly used algorithm. It works very similar to BPE, but instead of using a character-level sequence, it uses a byte-level sequence.
Tokenizing with byte-level byte pair encoding
Let's understand how BBPE works with the help of an example.
Example: best
Let's suppose our input text consists of just the word 'best'. We know that in BPE, we convert the word into a character sequence, so we will have the following:
Get hands-on with 1400+ tech skills courses.