The WordPiece Tokenizer
Learn about the WordPiece tokenizer and how it works.
BERT uses a special type of tokenizer called a WordPiece tokenizer, which follows the subword tokenization scheme. Let's walk through an example to understand how it works. Consider the following sentence:
Tokenize the sentence
Now, if we tokenize the sentence using the WordPiece tokenizer, we obtain the tokens shown here:
We can observe that while tokenizing the sentence using the WordPiece tokenizer, the word 'pretraining' is split into the ...
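The splitting described above can be sketched in code. WordPiece looks up each word in its vocabulary; if the whole word is absent, it greedily carves off the longest prefix that is in the vocabulary and marks the remaining pieces with a `##` prefix. The sketch below uses a small hypothetical vocabulary for illustration (BERT's actual vocabulary has roughly 30,000 entries), so the names and vocabulary contents here are assumptions, not BERT's real tokenizer:

```python
# Minimal sketch of WordPiece's greedy longest-match-first splitting.
# VOCAB is a tiny hypothetical vocabulary, not BERT's real one.
VOCAB = {"let", "us", "start", "pre", "##train", "##ing", "the", "model"}

def wordpiece_tokenize(word, vocab=VOCAB):
    """Split one word into subword tokens via longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it appears in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no valid split: map the whole word to [UNK]
        tokens.append(match)
        start = end
    return tokens

print(wordpiece_tokenize("pretraining"))  # ['pre', '##train', '##ing']
print(wordpiece_tokenize("start"))        # ['start'] — whole word is in the vocabulary
```

Because 'pretraining' is not in the vocabulary but 'pre', '##train', and '##ing' are, the word is split into those three subword tokens, which is exactly the behavior the text describes.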