RoBERTa
Understand RoBERTa, an advanced BERT variant that enhances pre-training by replacing static with dynamic masking, removing the next sentence prediction task, and training with larger batches and more data. Explore how these changes improve natural language understanding and model efficiency.
Robustly Optimized BERT pre-training Approach (RoBERTa) is another interesting and popular variant of BERT. Its researchers observed that BERT was significantly undertrained and proposed several modifications to its pre-training procedure. RoBERTa is essentially BERT with the following changes in pre-training:
Use dynamic masking instead of static masking in the MLM task.
Remove the NSP task and train using only the MLM task.
Train with a large batch size.
Use byte-level BPE (BBPE) as a tokenizer.
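Before discussing each point, here is a minimal sketch (not from the original text) of loading a pre-trained RoBERTa checkpoint with the Hugging Face transformers library, assuming transformers and PyTorch are installed and using the publicly available roberta-base weights:

```python
from transformers import RobertaTokenizer, RobertaModel

# Load the pre-trained RoBERTa tokenizer and model (roberta-base checkpoint)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

# Tokenize a sentence and obtain its contextual representations
inputs = tokenizer("We arrived at the airport in time", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```

Since RoBERTa keeps BERT's architecture, the returned representations have the same shape as BERT's and can be used in the same way; the changes listed above affect only how the model is pre-trained.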
Now, let's look into the details and discuss each of the preceding points.
Using dynamic masking instead of static masking
We learned that we pre-train BERT using the MLM and NSP tasks. In the MLM task, we randomly mask 15% of the tokens and train the network to predict the masked tokens.
For instance, say we have the sentence 'We arrived at the airport in time'. Now, after tokenizing and adding [CLS] and [SEP] tokens, we have the following:
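tokens = [ [CLS], we, arrived, at, the, airport, in, time, [SEP] ]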
Next, we randomly mask 15% of the tokens:
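tokens = [ [CLS], we, [MASK], at, the, airport, in, time, [SEP] ]
Here, 15% of the nine tokens rounds to a single token, and the token 'arrived' happens to be the one selected.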
Now, we feed the tokens to BERT and train it to predict the masked tokens.
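To make the masking step concrete, here is an illustrative sketch; the helper mask_tokens is hypothetical (not part of any library) and simply replaces roughly 15% of the non-special tokens with [MASK]:

```python
import random

def mask_tokens(tokens, mask_prob=0.15):
    # Replace roughly `mask_prob` of the non-special tokens with [MASK]
    masked = list(tokens)
    candidates = [i for i, t in enumerate(tokens) if t not in ('[CLS]', '[SEP]')]
    num_to_mask = max(1, round(mask_prob * len(candidates)))
    for i in random.sample(candidates, num_to_mask):
        masked[i] = '[MASK]'
    return masked

tokens = ['[CLS]', 'we', 'arrived', 'at', 'the', 'airport', 'in', 'time', '[SEP]']
print(mask_tokens(tokens))
# Possible output: ['[CLS]', 'we', '[MASK]', 'at', 'the', 'airport', 'in', 'time', '[SEP]']
```

Note that in the full MLM objective, only 80% of the selected tokens are actually replaced with [MASK]; the rest are swapped for random tokens or left unchanged. The sketch above skips that detail for brevity.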
Note: The masking is done only once, during the preprocessing step, and we then train the model to predict the same masked tokens over several epochs. This is known as static masking.
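As a point of contrast with this one-off (static) masking, the sketch below (assuming the Hugging Face transformers library and PyTorch are installed) uses the library's MLM data collator, which re-samples the mask every time a batch is built, so the same sentence can be masked differently from epoch to epoch:

```python
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("We arrived at the airport in time")

# The mask is re-sampled on every call, so repeated calls on the same
# example usually mask different positions
print(tokenizer.decode(collator([encoding])["input_ids"][0]))
print(tokenizer.decode(collator([encoding])["input_ids"][0]))
```

This on-the-fly re-sampling is the behaviour that dynamic masking refers to.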