...

Pre-Training the BERT Model

Learn how to apply different embeddings to the input sentence before feeding it to BERT.

In this lesson, we will learn how to pre-train the BERT model. But what does pre-training mean? Say we have a model, m. First, we train the model m with a huge dataset for a particular task and save the trained model. Now, for a new task, instead of initializing a new model with random weights, we will initialize the model with the weights of our already trained model, m (pre-trained model). That is, since the model m is already trained on a huge dataset, instead of training a new model from scratch for a new task, we use the pre-trained model, m, and adjust (fine-tune) its weights according to the new task. This is a type of transfer learning.
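
As a rough illustration of this pre-train-then-fine-tune workflow, here is a minimal PyTorch sketch. The model class, file name, and dimensions are hypothetical placeholders standing in for m, not BERT itself:

```python
import torch
import torch.nn as nn

# A small hypothetical model standing in for "m" (not BERT itself).
class SmallClassifier(nn.Module):
    def __init__(self, in_dim=32, num_classes=2):
        super().__init__()
        self.encoder = nn.Linear(in_dim, 16)
        self.head = nn.Linear(16, num_classes)

    def forward(self, x):
        return self.head(torch.relu(self.encoder(x)))

# 1. Pre-training: train m on a huge dataset for some task
#    (training loop omitted), then save the learned weights.
m = SmallClassifier()
torch.save(m.state_dict(), "pretrained_m.pt")

# 2. Fine-tuning: for a new task, initialize a new model with the
#    pre-trained weights instead of random ones, then keep training.
new_model = SmallClassifier()
new_model.load_state_dict(torch.load("pretrained_m.pt"))
# ... fine-tune new_model on the new task's (usually smaller) dataset ...
```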

The BERT model is pre-trained on a huge corpus using two interesting tasks, called masked language modeling and next sentence prediction. Following pre-training, we save the pre-trained BERT model. For a task, say question-answering, instead of training BERT from scratch, we will use the pre-trained BERT model. That is, we will use the pre-trained BERT model and adjust (fine-tune) its weights for the new task.
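
For instance, assuming we use the Hugging Face transformers library (an assumption for illustration; the lesson itself does not depend on it), the pre-trained weights can be loaded in a single line, and fine-tuning then starts from those weights:

```python
# Assumes the Hugging Face `transformers` library is installed.
from transformers import BertModel

# Load the pre-trained BERT-base weights instead of training from scratch.
model = BertModel.from_pretrained("bert-base-uncased")

# For a downstream task such as question answering, we would now fine-tune
# these pre-trained weights on the task-specific dataset.
```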

In this lesson, we will learn in detail how the BERT model is pre-trained. But before diving into pre-training, let's first take a look at how to structure the input data in a way that BERT accepts.

Input data representation

Before feeding the input to BERT, we convert the input into embeddings using the three embedding layers indicated here:

  • Token embedding

  • Segment embedding

  • Position embedding

Let's understand how each of these embedding layers works one by one.
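
Before that, it helps to see how the three results are eventually combined: BERT sums the token, segment, and position embeddings element-wise to form the final input representation. The sketch below illustrates this with typical BERT-base sizes and hypothetical token ids:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768         # typical BERT-base sizes

token_embedding = nn.Embedding(vocab_size, hidden)    # one vector per token id
segment_embedding = nn.Embedding(2, hidden)           # sentence A vs. sentence B
position_embedding = nn.Embedding(max_len, hidden)    # one vector per position

token_ids = torch.tensor([[1, 2, 3, 4]])              # hypothetical token ids
segment_ids = torch.zeros(1, 4, dtype=torch.long)     # all tokens from sentence A
position_ids = torch.arange(4).unsqueeze(0)           # positions 0..3

# BERT's input representation is the element-wise sum of the three embeddings.
input_repr = (token_embedding(token_ids)
              + segment_embedding(segment_ids)
              + position_embedding(position_ids))
print(input_repr.shape)   # torch.Size([1, 4, 768])
```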

Token embedding

First, we have a token embedding layer. Let's understand this with an example. Consider the following two sentences:

Sentence A: Paris is a beautiful city.

Sentence B: I love Paris.

First, we tokenize both sentences and obtain their tokens, as shown in the sketch below. Note that in our example, we have not lowercased the tokens:
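
Since the token list is what matters here, the following sketch reproduces it with a plain whitespace split (BERT itself relies on a WordPiece tokenizer; the split is used only to show the tokens for this example):

```python
sentence_a = "Paris is a beautiful city"
sentence_b = "I love Paris"

# A plain whitespace split, used here just to show the token list;
# BERT's actual tokenizer is WordPiece.
tokens_a = sentence_a.split()
tokens_b = sentence_b.split()
print(tokens_a + tokens_b)
# ['Paris', 'is', 'a', 'beautiful', 'city', 'I', 'love', 'Paris']
```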

Next, we add a new token, called the [CLS] token, only at the beginning of the first sentence.

Then, we add another new token, called [SEP], at the end of every sentence, as shown below:
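
Continuing the sketch above, the token list for our example now becomes:

```python
tokens_a = ["Paris", "is", "a", "beautiful", "city"]
tokens_b = ["I", "love", "Paris"]

# [CLS] goes only at the start of the first sentence;
# [SEP] goes at the end of every sentence.
tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
print(tokens)
# ['[CLS]', 'Paris', 'is', 'a', 'beautiful', 'city', '[SEP]', 'I', 'love', 'Paris', '[SEP]']
```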

Note that the [CLS] token is added only at the beginning of the first sentence, while the [SEP] token is added at the end of every sentence. The [CLS] token is used for classification tasks, and the [SEP] token is used ...