...

Pre-Training the BERT Model

Learn how to apply different embeddings to the input sentence before feeding it to BERT.

In this lesson, we will learn how to pre-train the BERT model. But what does pre-training mean? Say we have a model, m. First, we train the model m with a huge dataset for a particular task and save the trained model. Now, for a new task, instead of initializing a new model with random weights, we will initialize the model with the weights of our already trained model, m (pre-trained model). That is, since the model m is already trained on a huge dataset, instead of training a new model from scratch for a new task, we use the pre-trained model, m, and adjust (fine-tune) its weights according to the new task. This is a type of transfer learning.
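
As a rough illustration of this pre-train-then-fine-tune workflow, here is a minimal PyTorch sketch. The model class, file name, and dimensions are hypothetical placeholders standing in for m, not BERT itself:

```python
import torch
import torch.nn as nn

# A small hypothetical model standing in for "m" (not BERT itself).
class SmallClassifier(nn.Module):
    def __init__(self, in_dim=32, num_classes=2):
        super().__init__()
        self.encoder = nn.Linear(in_dim, 16)
        self.head = nn.Linear(16, num_classes)

    def forward(self, x):
        return self.head(torch.relu(self.encoder(x)))

# 1. Pre-training: train m on a huge dataset for some task
#    (training loop omitted), then save the learned weights.
m = SmallClassifier()
torch.save(m.state_dict(), "pretrained_m.pt")

# 2. Fine-tuning: for a new task, initialize a new model with the
#    pre-trained weights instead of random ones, then keep training.
new_model = SmallClassifier()
new_model.load_state_dict(torch.load("pretrained_m.pt"))
# ... fine-tune new_model on the new task's (usually smaller) dataset ...
```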

The BERT model is pre-trained on a huge corpus using two interesting tasks, called masked language modeling and next sentence prediction. Following pre-training, we save the pre-trained BERT model. For a task, say question-answering, instead of training BERT from scratch, we will use the pre-trained BERT model. That is, we will use the pre-trained BERT model and adjust (fine-tune) its weights for the new task.
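
For instance, assuming we use the Hugging Face transformers library (an assumption for illustration; the lesson itself does not depend on it), the pre-trained weights can be loaded in a single line, and fine-tuning then starts from those weights:

```python
# Assumes the Hugging Face `transformers` library is installed.
from transformers import BertModel

# Load the pre-trained BERT-base weights instead of training from scratch.
model = BertModel.from_pretrained("bert-base-uncased")

# For a downstream task such as question answering, we would now fine-tune
# these pre-trained weights on the task-specific dataset.
```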

In this lesson, we will learn in detail how the BERT model is pre-trained. But before diving into pre-training, let's first take a look at how to structure the input data in a way that BERT accepts.

Input data representation

Before feeding the input to BERT, we convert the input into embeddings using the three embedding layers indicated here:

  • Token embedding

  • Segment embedding

  • Position embedding

Let's understand how each of these embedding layers works one by one.
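
Before that, it helps to see how the three results are eventually combined: BERT sums the token, segment, and position embeddings element-wise to form the final input representation. The sketch below illustrates this with typical BERT-base sizes and hypothetical token ids:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768         # typical BERT-base sizes

token_embedding = nn.Embedding(vocab_size, hidden)    # one vector per token id
segment_embedding = nn.Embedding(2, hidden)           # sentence A vs. sentence B
position_embedding = nn.Embedding(max_len, hidden)    # one vector per position

token_ids = torch.tensor([[1, 2, 3, 4]])              # hypothetical token ids
segment_ids = torch.zeros(1, 4, dtype=torch.long)     # all tokens from sentence A
position_ids = torch.arange(4).unsqueeze(0)           # positions 0..3

# BERT's input representation is the element-wise sum of the three embeddings.
input_repr = (token_embedding(token_ids)
              + segment_embedding(segment_ids)
              + position_embedding(position_ids))
print(input_repr.shape)   # torch.Size([1, 4, 768])
```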

Token embedding

First, we have a token embedding layer. Let's understand this with an example. Consider the following two sentences:

Sentence A: Paris is a beautiful city.

Sentence B: I love Paris.

First, we tokenize both sentences and obtain their tokens, as shown in the sketch below. Note that in our example, we have not lowercased the tokens:
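
Since the token list is what matters here, the following sketch reproduces it with a plain whitespace split (BERT itself relies on a WordPiece tokenizer; the split is used only to show the tokens for this example):

```python
sentence_a = "Paris is a beautiful city"
sentence_b = "I love Paris"

# A plain whitespace split, used here just to show the token list;
# BERT's actual tokenizer is WordPiece.
tokens_a = sentence_a.split()
tokens_b = sentence_b.split()
print(tokens_a + tokens_b)
# ['Paris', 'is', 'a', 'beautiful', 'city', 'I', 'love', 'Paris']
```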

Next, we add a new token, called the [CLS] token, only at the beginning of the first sentence.

Then, we add another new token, called [SEP], at the end of every sentence, as shown below:
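
Continuing the sketch above, the token list for our example now becomes:

```python
tokens_a = ["Paris", "is", "a", "beautiful", "city"]
tokens_b = ["I", "love", "Paris"]

# [CLS] goes only at the start of the first sentence;
# [SEP] goes at the end of every sentence.
tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
print(tokens)
# ['[CLS]', 'Paris', 'is', 'a', 'beautiful', 'city', '[SEP]', 'I', 'love', 'Paris', '[SEP]']
```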

Note that the [CLS] token is added only at the beginning of the first sentence, while the [SEP] token is added at the end of every sentence. The [CLS] token is used for classification tasks, and the [SEP] token is used ...