RoBERTa
Understand RoBERTa, an advanced BERT variant that enhances pre-training by replacing static with dynamic masking, removing the next sentence prediction task, and training with larger batches and more data. Explore how these changes improve natural language understanding and model efficiency.
Robustly Optimized BERT pre-training Approach (RoBERTa) is another interesting and popular variant of BERT. Its researchers observed that BERT was significantly undertrained and proposed several modifications to its pre-training procedure. RoBERTa is essentially BERT with the following changes in pre-training:
Use dynamic masking instead of static masking in the MLM task.
Remove the NSP task and train using only the MLM task.
Train with a large batch size.
Use byte-level BPE (BBPE) as a tokenizer.
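Before discussing each point, here is a minimal sketch (not from the original text) of loading a pre-trained RoBERTa checkpoint with the Hugging Face transformers library, assuming transformers and PyTorch are installed and using the publicly available roberta-base weights:

```python
from transformers import RobertaTokenizer, RobertaModel

# Load the pre-trained RoBERTa tokenizer and model (roberta-base checkpoint)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

# Tokenize a sentence and obtain its contextual representations
inputs = tokenizer("We arrived at the airport in time", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```

Since RoBERTa keeps BERT's architecture, the returned representations have the same shape as BERT's and can be used in the same way; the changes listed above affect only how the model is pre-trained.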
Now, let's look into the details and discuss each of the preceding points.
Using dynamic masking instead of static masking
We learned that we pre-train BERT using the MLM and NSP tasks. In the MLM task, we randomly mask 15% of the tokens and train the network to predict the masked tokens.
For instance, say we have the sentence 'We arrived at the airport in time'. Now, after tokenizing and adding [CLS] and [SEP] tokens, we have the following:
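tokens = [ [CLS], we, arrived, at, the, airport, in, time, [SEP] ]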
Next, we randomly mask 15% of the tokens:
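tokens = [ [CLS], we, [MASK], at, the, airport, in, time, [SEP] ]
Here, 15% of the nine tokens rounds to a single token, and the token 'arrived' happens to be the one selected.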
Now, we feed the tokens to BERT and train it to predict the masked tokens.
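To make the masking step concrete, here is an illustrative sketch; the helper mask_tokens is hypothetical (not part of any library) and simply replaces roughly 15% of the non-special tokens with [MASK]:

```python
import random

def mask_tokens(tokens, mask_prob=0.15):
    # Replace roughly `mask_prob` of the non-special tokens with [MASK]
    masked = list(tokens)
    candidates = [i for i, t in enumerate(tokens) if t not in ('[CLS]', '[SEP]')]
    num_to_mask = max(1, round(mask_prob * len(candidates)))
    for i in random.sample(candidates, num_to_mask):
        masked[i] = '[MASK]'
    return masked

tokens = ['[CLS]', 'we', 'arrived', 'at', 'the', 'airport', 'in', 'time', '[SEP]']
print(mask_tokens(tokens))
# Possible output: ['[CLS]', 'we', '[MASK]', 'at', 'the', 'airport', 'in', 'time', '[SEP]']
```

Note that in the full MLM objective, only 80% of the selected tokens are actually replaced with [MASK]; the rest are swapped for random tokens or left unchanged. The sketch above skips that detail for brevity.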
Note: The masking is done only once, during the preprocessing step, and we then train the model to predict the same masked tokens over several epochs. This is known as static masking.
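As a point of contrast with this one-off (static) masking, the sketch below (assuming the Hugging Face transformers library and PyTorch are installed) uses the library's MLM data collator, which re-samples the mask every time a batch is built, so the same sentence can be masked differently from epoch to epoch:

```python
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("We arrived at the airport in time")

# The mask is re-sampled on every call, so repeated calls on the same
# example usually mask different positions
print(tokenizer.decode(collator([encoding])["input_ids"][0]))
print(tokenizer.decode(collator([encoding])["input_ids"][0]))
```

This on-the-fly re-sampling is the behaviour that dynamic masking refers to.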