Japanese BERT

Learn about the Japanese BERT model along with its different variants.

The Japanese BERT model is pre-trained on Japanese Wikipedia text using whole word masking (WWM). We tokenize the Japanese text using MeCab, a morphological analyzer for Japanese. After tokenizing with MeCab, we apply the WordPiece tokenizer to obtain subwords. Instead of splitting the MeCab tokens into subwords with WordPiece, we can also split them into characters.
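As an illustration, the transformers library provides a BertJapaneseTokenizer that follows this two-step scheme: MeCab for word segmentation, then WordPiece for subwords. The following is a minimal sketch, assuming the cl-tohoku/bert-base-japanese-whole-word-masking checkpoint on the Hugging Face Hub and the fugashi and ipadic packages for MeCab support:

```python
# pip install transformers fugashi ipadic  (fugashi/ipadic provide MeCab support)
from transformers import BertJapaneseTokenizer

# Assumed checkpoint: the whole-word-masking subword variant published by Tohoku University
tokenizer = BertJapaneseTokenizer.from_pretrained(
    "cl-tohoku/bert-base-japanese-whole-word-masking"
)

# Internally, the sentence is first segmented into words with MeCab,
# then each word is split into WordPiece subwords (continuations get a ## prefix).
tokens = tokenizer.tokenize("私は自然言語処理を勉強しています。")
print(tokens)

# Map the subword tokens to their vocabulary IDs
print(tokenizer.convert_tokens_to_ids(tokens))
```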

Variants of Japanese BERT

Japanese BERT comes in two variants (a comparison sketch follows the list):

  • mecab-ipadic-bpe-32k: Tokenizes the text with the MeCab tokenizer and then splits it into subwords. The vocabulary size is 32K.

  • mecab-ipadic-char-4k: Tokenizes the text with the MeCab tokenizer and then splits it into characters. The vocabulary size is 4K.
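To see how the two variants differ, we can tokenize the same sentence with both. This is a sketch under the assumption that the variants correspond to the cl-tohoku/bert-base-japanese-whole-word-masking (subword) and cl-tohoku/bert-base-japanese-char (character) checkpoints on the Hugging Face Hub:

```python
from transformers import BertJapaneseTokenizer

sentence = "私は東京に住んでいます。"

# Subword variant: MeCab words split further by WordPiece (~32K vocabulary)
subword_tokenizer = BertJapaneseTokenizer.from_pretrained(
    "cl-tohoku/bert-base-japanese-whole-word-masking"
)

# Character variant: MeCab words split into individual characters (~4K vocabulary)
char_tokenizer = BertJapaneseTokenizer.from_pretrained(
    "cl-tohoku/bert-base-japanese-char"
)

print("subword tokens:", subword_tokenizer.tokenize(sentence))
print("char tokens   :", char_tokenizer.tokenize(sentence))
print("vocab sizes   :", subword_tokenizer.vocab_size, char_tokenizer.vocab_size)
```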

The pre-trained Japanese BERT models can be downloaded from GitHub. We can also load a pre-trained Japanese BERT model directly with the transformers library, as shown here:
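The following is a minimal sketch of loading the model and extracting contextual embeddings, assuming the cl-tohoku/bert-base-japanese-whole-word-masking checkpoint (the subword WWM variant) and an installed PyTorch backend:

```python
# pip install transformers torch fugashi ipadic
import torch
from transformers import BertJapaneseTokenizer, BertModel

model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"  # assumed Hub checkpoint

tokenizer = BertJapaneseTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Encode a sample sentence and obtain the contextual token embeddings
inputs = tokenizer("私は東京に住んでいます。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, 768)
```

The last_hidden_state tensor holds one 768-dimensional contextual embedding per token; the embedding of the [CLS] token is commonly used as a sentence-level representation.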
