Getting Started with Google BERT/

...

Byte-Level Byte Pair Encoding and WordPiece Algorithms

Learn how to perform subword tokenization using the byte-level byte pair encoding and WordPiece algorithms.

We'll cover the following...

Byte-level byte pair encoding algorithm
- Tokenizing with byte-level byte pair encoding
  - Example: best
  - Example: 你好
WordPiece algorithm

Now, let's discuss two other subword tokenization algorithms—Byte-level byte pair encoding and WordPiece.

Byte-level byte pair encoding algorithm

Byte-level byte pair encoding (BBPE) is another popularly used algorithm. It works very similar to BPE, but instead of using a character-level sequence, it uses a byte-level sequence.

Tokenizing with byte-level byte pair encoding

Let's understand how BBPE works with the help of an example.

Example: best

Let's suppose our input text consists of just the word 'best'. We know that in BPE, we convert the word into a character sequence, so we will have the following:

Before We Start

Starting Off with BERT

A Primer on Transformers

Understanding the BERT Model

Getting Hands-On with BERT

Exploring BERT Variants

Different BERT Variants

BERT Variants—Based on Knowledge Distillation

Applications of BERT

Exploring BERTSUM for Text Summarization

Semantic Search with Transformers

Applying BERT to Other Languages

Exploring Sentence and Domain-Specific BERT

Working with VideoBERT, BART, and More

Conclusion

Similarity Detection in English Language Using RoBERTa

Byte-Level Byte Pair Encoding and WordPiece Algorithms

Byte-level byte pair encoding algorithm

Tokenizing with byte-level byte pair encoding

Example: best

Example: 你好