Data Augmentation
Learn how to use the data augmentation method to obtain an augmented dataset.
To perform distillation at the fine-tuning step, that is, task-specific distillation, we need more task-specific data points. So we use a data augmentation method to obtain an augmented dataset, and we then fine-tune the general TinyBERT with this augmented dataset.
Steps for data augmentation
First, we will explore the data augmentation algorithm step by step, and then we will understand it more clearly with an example.
Suppose we have a sentence: 'Paris is a beautiful city'.
Step 1: Tokenizing the sentence
First, we tokenize the sentence using the BERT tokenizer and store the tokens in a list called X. After tokenizing, we have X = [paris, is, a, beautiful, city].
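As a minimal sketch of this step, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (the library and checkpoint are our assumptions; any BERT tokenizer works the same way):

```python
from transformers import BertTokenizer

# Load the pre-trained BERT-base tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

sentence = 'Paris is a beautiful city'

# Tokenize the sentence and store the tokens in a list called X
X = tokenizer.tokenize(sentence)
print(X)
# ['paris', 'is', 'a', 'beautiful', 'city']
```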
Step 2: Copying the tokens
We copy X to another list called X_masked. So now we have X_masked = [paris, is, a, beautiful, city].
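Continuing the sketch above, this step is just a plain copy of the token list:

```python
# Copy the tokens of X into a new list called X_masked
X_masked = X.copy()
print(X_masked)
# ['paris', 'is', 'a', 'beautiful', 'city']
```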
Step 3: Data augmentation
Now, for every element (word) X[i] in the list X, we do the following:
We check whether X[i] is a single-piece word. If it is a single-piece word, then we mask the corresponding word X_masked[i] with the [MASK] token. Next, we use the BERT-base model to predict the masked word. Instead of predicting only the single most likely word, we predict the K most likely words and store them in a list called candidates. Say K = 5; then we predict the 5 most likely words and store them in the candidates list.
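Here is a minimal sketch of this masking-and-prediction step, assuming the Hugging Face transformers library, K = 5, and that every token in X_masked is a single-piece word; the helper name get_candidates is our own, not part of the original method:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

K = 5  # number of candidate words to predict

def get_candidates(X_masked, i):
    """Mask the i-th token and return the K most likely replacements."""
    masked_tokens = X_masked.copy()
    masked_tokens[i] = tokenizer.mask_token  # '[MASK]'

    # Rebuild a sentence; the tokenizer adds [CLS]/[SEP] automatically
    inputs = tokenizer(' '.join(masked_tokens), return_tensors='pt')

    # Locate the [MASK] position in the input ids
    mask_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

    with torch.no_grad():
        logits = model(**inputs).logits

    # Take the K highest-scoring vocabulary ids at the masked position
    top_k_ids = logits[0, mask_index].topk(K).indices[0].tolist()
    return tokenizer.convert_ids_to_tokens(top_k_ids)

# Mask 'beautiful' (index 3) and predict its candidate replacements
candidates = get_candidates(X_masked, i=3)
print(candidates)
```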
If X[i] is not a single-piece word, then we will not mask it. Instead, we check for the ...