Exploring the Pre-Trained BERT Model
Learn about the different types of pre-trained BERT models, their use cases, and how to use them to extract embeddings.
We pre-train the BERT model using the masked language modeling and next-sentence prediction tasks. However, pre-training BERT from scratch is computationally expensive, so instead of training it ourselves, we can download and use a pre-trained BERT model. Google has open-sourced the pre-trained BERT model, and we can download it from Google Research's GitHub repository. The pre-trained BERT model has been released in various configurations, shown in the following table, where L denotes the number of encoder layers and H denotes the size of the hidden unit (representation size):
Different configurations of pre-trained BERT
| | H=128 | H=256 | H=512 | H=768 |
|------|------|------|------|------|
| L=2 | 2/128 (BERT-tiny) | 2/256 | 2/512 | 2/768 |
| L=4 | 4/128 | 4/256 (BERT-mini) | 4/512 (BERT-small) | 4/768 |
| L=6 | 6/128 | 6/256 | 6/512 | 6/768 |
| L=8 | 8/128 | 8/256 | 8/512 (BERT-medium) | 8/768 |
| L=10 | 10/128 | 10/256 | 10/512 | 10/768 |
| L=12 | 12/128 | 12/256 | 12/512 | 12/768 (BERT-base) |
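To make the L and H notation concrete, here is a minimal sketch, assuming the Hugging Face transformers library (which the text itself does not mention), showing how the number of encoder layers and the hidden size map onto a BERT configuration. The model built here is randomly initialized; it only illustrates the configuration, not the pre-trained weights:

```python
# A minimal sketch (assuming the Hugging Face transformers library) of how
# L (number of encoder layers) and H (hidden size) define a BERT configuration.
from transformers import BertConfig, BertModel

# BERT-base from the table: L=12 encoder layers, H=768 hidden units
config = BertConfig(
    num_hidden_layers=12,   # L
    hidden_size=768,        # H
    num_attention_heads=12,
    intermediate_size=3072,
)

model = BertModel(config)  # randomly initialized; shown only to illustrate the configuration
print(config.num_hidden_layers, config.hidden_size)  # 12 768
```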
Different types of pre-trained BERT models
The pre-trained model is also available in BERT-uncased and BERT-cased formats. In BERT-uncased, all the tokens are lowercased, whereas in BERT-cased, the tokens are not lowercased and are used directly for training. Which pre-trained BERT model should we use: BERT-cased or BERT-uncased? BERT-uncased is the most commonly used model, but for certain tasks such as Named Entity Recognition (NER), where we have to preserve the case, we should use the BERT-cased model. Along with these, Google also released pre-trained BERT models trained using the whole word masking method. Okay, but how exactly can we use the pre-trained BERT model?
We can use the pre-trained model in the following two ways:
As a feature extractor by extracting embeddings.
By fine-tuning the pre-trained BERT model on downstream tasks such as text classification, question-answering, and more.
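As a quick illustration of the cased and uncased variants discussed above, the following is a hedged sketch that loads the pre-trained BERT-base checkpoints with the Hugging Face transformers library (an assumption; the text describes downloading from Google's GitHub repository instead). The identifiers bert-base-uncased and bert-base-cased are the standard names on the Hugging Face Hub:

```python
from transformers import BertTokenizer, BertModel

# Uncased variant: the tokenizer lowercases all input text
uncased_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
uncased_model = BertModel.from_pretrained("bert-base-uncased")

# Cased variant: case is preserved (useful for tasks such as NER)
cased_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
cased_model = BertModel.from_pretrained("bert-base-cased")

print(uncased_tokenizer.tokenize("Paris"))  # ['paris']
print(cased_tokenizer.tokenize("Paris"))    # ['Paris']
```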
Extracting embeddings from pre-trained BERT
Let's learn how to extract embeddings from pre-trained BERT with an example. Say we need to extract the contextual embedding of each word in a sentence. To do this, we first tokenize the sentence and feed the tokens to the pre-trained BERT model, which returns an embedding for each token. Apart from obtaining the token-level (word-level) representations, we can also obtain the sentence-level representation.
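Below is a minimal sketch of this workflow, assuming the Hugging Face transformers library and an illustrative input sentence (neither comes from the text). The last hidden state gives a contextual embedding for every token, and the representation of the [CLS] token is commonly used as the sentence-level representation:

```python
import torch
from transformers import BertTokenizer, BertModel

# Load the pre-trained BERT-base model and its tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "I love natural language processing"  # illustrative sentence, not from the text
inputs = tokenizer(sentence, return_tensors="pt")  # adds the [CLS] and [SEP] tokens

with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state        # shape: [1, sequence_length, 768]
cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token as a sentence-level representation

print(token_embeddings.shape)  # one 768-dimensional embedding per token
print(cls_embedding.shape)     # torch.Size([1, 768])
```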
Let's learn how exactly we can extract the word-level and sentence-level embeddings in detail.