Pre-Training Procedure
Learn about the pre-training procedure of BERT.
BERT is pre-trained on the Toronto BookCorpus and English Wikipedia datasets. We have also learned that BERT is pre-trained using the masked language modeling (cloze) task and the NSP task. Now, how do we prepare the dataset to train BERT on these two tasks?
Preparing the dataset
First, we sample two sentences (two text spans) from the corpus. Let's say we sampled two sentences, A and B. The total number of tokens in sentences A and B combined should be less than or equal to 512. While sampling the two sentences (text spans), 50% of the time we sample sentence B as the follow-up sentence to sentence A, and the other 50% of the time we sample sentence B as a sentence that is not the follow-up to sentence A. A minimal sketch of this sampling rule is shown below.
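The following Python sketch makes the 50/50 sampling rule concrete. It assumes `documents` is a list of documents, each a list of already-tokenized sentences; the `sample_sentence_pair` helper is our own illustration, not BERT's actual data pipeline code:

```python
import random

def sample_sentence_pair(documents, max_tokens=512):
    # Pick a document and a starting sentence for sentence A
    doc = random.choice(documents)
    idx = random.randrange(len(doc) - 1)
    sentence_a = list(doc[idx])

    if random.random() < 0.5:
        # 50% of the time: sentence B really follows sentence A (isNext)
        sentence_b = list(doc[idx + 1])
        is_next = True
    else:
        # The other 50%: sentence B is a random sentence from another document (notNext)
        sentence_b = list(random.choice(random.choice(documents)))
        is_next = False

    # Keep the combined length (plus [CLS] and two [SEP] tokens) within 512
    while len(sentence_a) + len(sentence_b) + 3 > max_tokens:
        longer = sentence_a if len(sentence_a) > len(sentence_b) else sentence_b
        longer.pop()

    return sentence_a, sentence_b, is_next
```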
Suppose we sampled the following two sentences:
Tokenizing the sentences
First, we tokenize the sentences using a WordPiece tokenizer, add the [CLS] token to the beginning of the first sentence, and add the [SEP] token to the end of every sentence. So, our tokens become the following:
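Here is a minimal sketch of this step with the Hugging Face transformers WordPiece tokenizer, using two illustrative sentences of our own (not necessarily the ones sampled above):

```python
from transformers import BertTokenizer

# WordPiece tokenizer that ships with the pre-trained BERT-base model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence_a = "We played the game"
sentence_b = "It was fun"

# Tokenize each sentence, add [CLS] at the start and [SEP] after every sentence
tokens = (["[CLS]"] + tokenizer.tokenize(sentence_a) + ["[SEP]"]
          + tokenizer.tokenize(sentence_b) + ["[SEP]"])
print(tokens)
# ['[CLS]', 'we', 'played', 'the', 'game', '[SEP]', 'it', 'was', 'fun', '[SEP]']
```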
Masking the tokens
Next, we randomly mask 15% of the tokens according to the 80-10-10% rule: of the selected tokens, 80% are replaced with the [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. Suppose we masked the token game; then we have the following:
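A small sketch of the 80-10-10 rule is given below, assuming `tokens` is the token list from the previous step and `vocab` is a list of WordPiece tokens to draw random replacements from; `mask_tokens` is our own helper for illustration:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    labels = [None] * len(tokens)   # original token is recorded only at masked positions
    # Special tokens are never selected for masking
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    num_to_mask = max(1, round(len(candidates) * mask_prob))

    for i in random.sample(candidates, num_to_mask):
        labels[i] = tokens[i]       # the model must predict this original token
        r = random.random()
        if r < 0.8:
            tokens[i] = "[MASK]"              # 80%: replace with the [MASK] token
        elif r < 0.9:
            tokens[i] = random.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return tokens, labels
```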
Training BERT
Now, we feed the tokens to BERT and train it to predict the masked tokens and also to classify whether sentence B is the follow-up sentence to sentence A. That is, we train BERT on the masked language modeling and NSP tasks simultaneously. ...
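As a rough sketch of this joint objective using the Hugging Face transformers library (the sentence pair and the masked position are our own illustration; position 4 corresponds to the token game here):

```python
import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# Encode a sentence pair; the tokenizer adds [CLS]/[SEP] and the segment IDs for us
inputs = tokenizer("We played the game", "It was fun", return_tensors="pt")

# MLM labels: -100 means "ignore this position" in the loss
labels = torch.full_like(inputs["input_ids"], -100)
labels[0, 4] = inputs["input_ids"][0, 4]             # remember the original token ("game")
inputs["input_ids"][0, 4] = tokenizer.mask_token_id  # replace it with [MASK]

# next_sentence_label = 0 means sentence B is the follow-up to sentence A
outputs = model(**inputs, labels=labels,
                next_sentence_label=torch.tensor([0]))
loss = outputs.loss   # sum of the masked language modeling loss and the NSP loss
loss.backward()
```

The single `loss` value combines both objectives, so one backward pass updates BERT for the masked language modeling task and the NSP task at the same time.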