Exploring the Pre-Trained BERT Model
Learn about the different types of pre-trained BERT models, their use cases, and how to use them to extract embeddings.
We pre-train the BERT model using the masked language modeling and next-sentence prediction tasks. However, pre-training BERT from scratch is computationally expensive, so instead of training it ourselves, we can download and use a pre-trained BERT model. Google has open-sourced the pre-trained BERT model, and we can download it from Google Research's GitHub repository. The pre-trained BERT model has been released in various configurations, shown in the following table, where L denotes the number of encoder layers and H denotes the size of the hidden unit (representation size):
Different configurations of pre-trained BERT
| | H=128 | H=256 | H=512 | H=768 |
|------|------|------|------|------|
| L=2 | 2/128 (BERT-tiny) | 2/256 | 2/512 | 2/768 |
| L=4 | 4/128 | 4/256 (BERT-mini) | 4/512 (BERT-small) | 4/768 |
| L=6 | 6/128 | 6/256 | 6/512 | 6/768 |
| L=8 | 8/128 | 8/256 | 8/512 (BERT-medium) | 8/768 |
| L=10 | 10/128 | 10/256 | 10/512 | 10/768 |
| L=12 | 12/128 | 12/256 | 12/512 | 12/768 (BERT-base) |
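To make the L and H notation concrete, here is a minimal sketch, assuming the Hugging Face transformers library (which the text itself does not mention), showing how the number of encoder layers and the hidden size map onto a BERT configuration. The model built here is randomly initialized; it only illustrates the configuration, not the pre-trained weights:

```python
# A minimal sketch (assuming the Hugging Face transformers library) of how
# L (number of encoder layers) and H (hidden size) define a BERT configuration.
from transformers import BertConfig, BertModel

# BERT-base from the table: L=12 encoder layers, H=768 hidden units
config = BertConfig(
    num_hidden_layers=12,   # L
    hidden_size=768,        # H
    num_attention_heads=12,
    intermediate_size=3072,
)

model = BertModel(config)  # randomly initialized; shown only to illustrate the configuration
print(config.num_hidden_layers, config.hidden_size)  # 12 768
```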
Different types of pre-trained BERT models
The pre-trained model is also available in BERT-uncased and BERT-cased formats. In BERT-uncased, all the tokens are lowercased, whereas in BERT-cased, the tokens are not lowercased and are used directly for training. Which pre-trained BERT model should we use: BERT-cased or BERT-uncased? BERT-uncased is the most commonly used model, but for certain tasks such as Named Entity Recognition (NER), where we have to preserve the case, we should use the BERT-cased model. Along with these, Google also released pre-trained BERT models trained using the whole word masking method. Okay, but how exactly can we use the pre-trained BERT model?
We can use the pre-trained model in the following two ways:
As a feature extractor by extracting embeddings.
By fine-tuning the pre-trained BERT model on downstream tasks such as text classification, question-answering, and more.
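As a quick illustration of the cased and uncased variants discussed above, the following is a hedged sketch that loads the pre-trained BERT-base checkpoints with the Hugging Face transformers library (an assumption; the text describes downloading from Google's GitHub repository instead). The identifiers bert-base-uncased and bert-base-cased are the standard names on the Hugging Face Hub:

```python
from transformers import BertTokenizer, BertModel

# Uncased variant: the tokenizer lowercases all input text
uncased_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
uncased_model = BertModel.from_pretrained("bert-base-uncased")

# Cased variant: case is preserved (useful for tasks such as NER)
cased_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
cased_model = BertModel.from_pretrained("bert-base-cased")

print(uncased_tokenizer.tokenize("Paris"))  # ['paris']
print(cased_tokenizer.tokenize("Paris"))    # ['Paris']
```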
Extracting embeddings from pre-trained BERT
Let's learn how to extract embeddings from pre-trained BERT with an example. Say we need to extract the contextual embedding of each word in a sentence. To do this, we first tokenize the sentence and feed the tokens to the pre-trained BERT model, which returns an embedding for each token. Apart from obtaining the token-level (word-level) representations, we can also obtain the sentence-level representation.
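Below is a minimal sketch of this workflow, assuming the Hugging Face transformers library and an illustrative input sentence (neither comes from the text). The last hidden state gives a contextual embedding for every token, and the representation of the [CLS] token is commonly used as the sentence-level representation:

```python
import torch
from transformers import BertTokenizer, BertModel

# Load the pre-trained BERT-base model and its tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "I love natural language processing"  # illustrative sentence, not from the text
inputs = tokenizer(sentence, return_tensors="pt")  # adds the [CLS] and [SEP] tokens

with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state        # shape: [1, sequence_length, 768]
cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token as a sentence-level representation

print(token_embeddings.shape)  # one 768-dimensional embedding per token
print(cls_embedding.shape)     # torch.Size([1, 768])
```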
Let's learn how exactly we can extract the word-level and sentence-level embeddings in detail.