RoBERTa
Learn about the RoBERTa model and how it differs from the BERT model.
Robustly Optimized BERT Pre-training Approach (RoBERTa) is another interesting and popular variant of BERT. Its researchers observed that BERT is significantly undertrained and proposed several modifications to its pre-training procedure. RoBERTa is essentially BERT with the following changes in pre-training:
- Use dynamic masking instead of static masking in the MLM task.
- Remove the NSP task and train using only the MLM task.
- Train with a large batch size.
- Use byte-level BPE (BBPE) as the tokenizer (a short tokenizer comparison sketch follows this list).
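To give a quick feel for the last point, the sketch below is an illustration only, not part of RoBERTa itself. It uses the Hugging Face transformers library, assuming it is installed and the bert-base-uncased and roberta-base checkpoints can be downloaded, and compares how the two tokenizers split the same sentence:

```python
# A minimal comparison sketch using the Hugging Face transformers library
# (assumes `pip install transformers` and internet access for the checkpoints).
from transformers import BertTokenizer, RobertaTokenizer

sentence = "We arrived at the airport in time"

# BERT splits text with a WordPiece vocabulary.
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(bert_tokenizer.tokenize(sentence))

# RoBERTa splits text with a byte-level BPE (BBPE) vocabulary.
roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print(roberta_tokenizer.tokenize(sentence))
```

Byte-level BPE marks a preceding space with the special 'Ġ' character, which is why RoBERTa's tokens look slightly different from BERT's WordPiece tokens for the same sentence.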
Now, let's look into the details and discuss each of the preceding points.
Using dynamic masking instead of static masking
We learned that we pre-train BERT using the MLM and NSP tasks. In the MLM task, we randomly mask 15% of the tokens and let the network predict the masked tokens.
For instance, say we have the sentence 'We arrived at the airport in time'. Now, after tokenizing and adding the [CLS] and [SEP] tokens, we have the following:
tokens = [ [CLS], we, arrived, at, the, airport, in, time, [SEP] ]
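To make the distinction concrete, here is a minimal Python sketch, an illustration rather than the original BERT or RoBERTa implementation, and the mask_tokens helper is hypothetical. It contrasts static masking, where the mask positions are chosen once during preprocessing and the same masked copy is reused in every epoch, with dynamic masking, where new positions are sampled every time the sequence is fed to the model:

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace roughly 15% of the non-special tokens with [MASK]."""
    rng = random.Random(seed)
    masked = list(tokens)
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL_TOKENS]
    num_to_mask = max(1, round(mask_prob * len(candidates)))
    for i in rng.sample(candidates, num_to_mask):
        masked[i] = "[MASK]"
    return masked

tokens = ["[CLS]", "we", "arrived", "at", "the", "airport", "in", "time", "[SEP]"]

# Static masking (BERT): mask once during preprocessing and reuse
# the same masked copy in every epoch.
static_copy = mask_tokens(tokens, seed=0)
for epoch in range(3):
    print("static :", static_copy)

# Dynamic masking (RoBERTa): sample fresh mask positions each time the
# sequence is fed to the model, so the model sees differently masked
# versions of the same sentence across epochs.
for epoch in range(3):
    print("dynamic:", mask_tokens(tokens))
```

In practice, the same effect is commonly achieved by masking in the data-loading step; for example, Hugging Face's DataCollatorForLanguageModeling applies the masking when each batch is assembled, so every epoch sees a freshly masked copy of the data.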