...

The Data Augmentation Methods

Learn different methods to perform task-agnostic data augmentation.

We use the following methods for performing task-agnostic data augmentation:

  • Masking

  • POS-guided word replacement

  • n-gram sampling

Let's take a look at each one of them.

Understanding the masking method

In the masking method, with probability $p_{\text{mask}}$, we randomly replace a word in the sentence with the [MASK] token, creating a new sentence. For instance, suppose we are performing a sentiment analysis task and our dataset contains the sentence 'I was listening to music'. Say the word 'music' gets masked; we then have a new sentence: 'I was listening to [MASK]'.

But how is this useful? With the [MASK] token in the sentence, our model cannot produce confident logits, since [MASK] is an unknown token that tells it nothing about the original word. The model therefore produces less confident logits for the masked sentence 'I was listening to [MASK]' than for the unmasked sentence 'I was listening to music'. This helps our model understand the contribution of each word to the label.
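The masking step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `mask_words` is hypothetical, and it assumes that the probability $p_{\text{mask}}$ is applied independently to each word of a whitespace-tokenized sentence.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_words(tokens, p_mask, rng=None):
    """Return a copy of `tokens` in which each token is replaced by
    [MASK] with probability `p_mask`.

    Assumption: the probability is applied to every token independently.
    """
    rng = rng or random.Random()
    return [MASK_TOKEN if rng.random() < p_mask else tok for tok in tokens]

# Whitespace tokenization is a simplification for this example.
tokens = "I was listening to music".split()
augmented = mask_words(tokens, p_mask=0.1, rng=random.Random(0))
print(" ".join(augmented))
```

Passing a seeded `random.Random` makes the augmentation reproducible, which is convenient when the augmented dataset has to be regenerated.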

Understanding the POS-guided word replacement method

In the POS-guided (part-of-speech-guided) word replacement method, with probability $p_{\text{pos}}$, we replace a word in a sentence with another word that has the same part of speech.

For example, consider the sentence 'Where did you go?' In this sentence, the word 'did' is a verb, so we can replace it with another verb. Our sentence then becomes 'Where do you go?' As you can see, we replaced the word 'did' with 'do' and obtained a new sentence. ...
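The replacement step can be sketched as follows. This is only an illustration under stated assumptions: `POS_LEXICON` is a hypothetical toy word-to-tag table standing in for a real part-of-speech tagger, the helper names are made up for this example, and $p_{\text{pos}}$ is assumed to be applied independently to each word.

```python
import random

# Hypothetical toy lexicon (word -> POS tag); a real pipeline would
# obtain tags from a POS tagger and candidates from the training vocabulary.
POS_LEXICON = {
    "where": "ADV",
    "did": "VERB", "do": "VERB", "go": "VERB", "run": "VERB",
    "you": "PRON", "we": "PRON", "they": "PRON",
}

def build_pos_index(lexicon):
    """Group the words of the lexicon by their POS tag."""
    index = {}
    for word, tag in lexicon.items():
        index.setdefault(tag, []).append(word)
    return index

def pos_guided_replace(tokens, p_pos, lexicon, rng=None):
    """With probability `p_pos`, replace each token with a different
    word that carries the same POS tag in `lexicon`."""
    rng = rng or random.Random()
    index = build_pos_index(lexicon)
    result = []
    for tok in tokens:
        tag = lexicon.get(tok.lower())
        # Candidates share the tag but differ from the original word.
        candidates = [w for w in index.get(tag, []) if w != tok.lower()]
        if candidates and rng.random() < p_pos:
            result.append(rng.choice(candidates))
        else:
            result.append(tok)  # unknown tag, no candidate, or not selected
    return result

sentence = ["Where", "did", "you", "go"]
print(pos_guided_replace(sentence, p_pos=0.5, lexicon=POS_LEXICON,
                         rng=random.Random(0)))
```

Words without a same-tag alternative in the lexicon (here, 'Where') are always left unchanged, so the sentence keeps its length and its sequence of POS tags.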