Pre-Training the BERT Model
Learn how to apply different embeddings to the input sentence before feeding it to BERT.
In this lesson, we will learn how to pre-train the BERT model. But what does pre-training mean? Say we have a model, m. First, we train the model m on a huge dataset for a particular task and save the trained model. Now, for a new task, instead of initializing a new model with random weights, we initialize the model with the weights of our already trained model, m (the pre-trained model). That is, since the model m is already trained on a huge dataset, instead of training a new model from scratch for the new task, we use the pre-trained model, m, and adjust (fine-tune) its weights according to the new task. This is a type of transfer learning.
The BERT model is pre-trained on a huge corpus using two interesting tasks, called masked language modeling and next sentence prediction. Following pre-training, we save the pre-trained BERT model. For a task, say question-answering, instead of training BERT from scratch, we will use the pre-trained BERT model. That is, we will use the pre-trained BERT model and adjust (fine-tune) its weights for the new task.
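To make this idea concrete, here is a minimal sketch of loading a pre-trained BERT checkpoint and attaching a task-specific head, so that only the new task's data is needed to adjust (fine-tune) the weights. The use of the Hugging Face transformers library, the `bert-base-uncased` checkpoint, and the two-class task are assumptions for illustration; the lesson itself does not prescribe them.

```python
# A minimal sketch of reusing a pre-trained BERT model for a new task.
# Assumes the Hugging Face "transformers" library and the public
# "bert-base-uncased" checkpoint; neither is prescribed by this lesson.
from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained weights instead of initializing randomly.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # hypothetical two-class downstream task
)

# Encode a toy input and run a forward pass; in practice, we would
# continue training (fine-tuning) these weights on the new task's data.
inputs = tokenizer("BERT is easy to fine-tune", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])
```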
In this lesson, we will learn in detail how the BERT model is pre-trained. Before diving into pre-training, let's first look at how to structure the input data in the format that BERT accepts.
Input data representation
Before feeding the input to BERT, we convert the input into embeddings using the three embedding layers indicated here:
Token embedding
Segment embedding
Position embedding
Let's understand how each of these embedding layers works one by one.
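Before going through the layers one by one, here is a rough sketch of how the three embeddings are combined: BERT sums the token, segment, and position embeddings element-wise to produce the final input representation. The sketch below uses plain PyTorch, and the vocabulary size, hidden size, and maximum length are hypothetical placeholders rather than values from this lesson.

```python
# A rough sketch of how BERT combines its three embedding layers.
# The vocabulary size, hidden size, and maximum length below are
# hypothetical placeholders, not values taken from this lesson.
import torch
import torch.nn as nn

vocab_size, hidden_size, max_len = 30522, 768, 512

token_embedding = nn.Embedding(vocab_size, hidden_size)
segment_embedding = nn.Embedding(2, hidden_size)          # sentence A or B
position_embedding = nn.Embedding(max_len, hidden_size)   # learned positions

# Suppose the input has 8 tokens, all belonging to the first sentence.
token_ids = torch.randint(0, vocab_size, (1, 8))
segment_ids = torch.zeros(1, 8, dtype=torch.long)
position_ids = torch.arange(8).unsqueeze(0)

# BERT's input representation is the element-wise sum of the three.
input_embeddings = (
    token_embedding(token_ids)
    + segment_embedding(segment_ids)
    + position_embedding(position_ids)
)
print(input_embeddings.shape)  # torch.Size([1, 8, 768])
```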
Token embedding
First, we have a token embedding layer. Let's understand this with an example. Consider the following two sentences:
First, we tokenize both sentences and obtain the tokens, as shown here. In our example, we have not lowercased the tokens:
Next, we add a new token, called the [CLS] token, only at the beginning of the first sentence:
And then we add a new token called [SEP] at the end of every sentence:
Note that the [CLS] token is added only at the beginning of the first sentence, while the [SEP] token is added at the end of every sentence. The [CLS] token is used for classification tasks, and the [SEP] token is used to indicate the end of every sentence.
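In practice, a pre-trained WordPiece tokenizer carries out these steps for us. The snippet below is a sketch using the Hugging Face transformers tokenizer with the cased BERT checkpoint (an assumption, chosen because the tokens in this example are not lowercased); the two sentences are hypothetical stand-ins, not the ones shown above.

```python
# A sketch of BERT-style tokenization with [CLS] and [SEP] tokens.
# Assumes the Hugging Face "transformers" library and the cased
# checkpoint; the two sentences are hypothetical examples.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sentence_a = "Paris is a beautiful city"
sentence_b = "I love Paris"

# encode() adds [CLS] at the start and [SEP] after each sentence.
token_ids = tokenizer.encode(sentence_a, sentence_b)
print(tokenizer.convert_ids_to_tokens(token_ids))
# Expected output, roughly:
# ['[CLS]', 'Paris', 'is', 'a', 'beautiful', 'city', '[SEP]',
#  'I', 'love', 'Paris', '[SEP]']
```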