We use the following methods for performing task-agnostic data augmentation:

  • Masking

  • POS-guided word replacement

  • n-gram sampling

Let's take a look at each one of them.

Understanding the masking method

In the masking method, with probability $p_{\text{mask}}$, we randomly mask a word in the sentence with the [MASK] token, creating a new sentence that contains the masked token. For instance, suppose we are performing a sentiment analysis task and our dataset contains the sentence 'I was listening to music'. Now, with probability $p_{\text{mask}}$, we randomly mask a word. Say we mask the word 'music'; then we have a new sentence: 'I was listening to [MASK]'.

But how is this useful? With the [MASK] token in the sentence, the model cannot rely on the masked word, so it produces less confident logits for the sentence 'I was listening to [MASK]' than for the original sentence 'I was listening to music'. This helps our model understand the contribution of each word to the label.
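To make this concrete, here is a minimal sketch of the masking step in Python. The whitespace tokenization and the value of p_mask used here are simplifying assumptions, not the exact implementation:

```python
import random

# A minimal sketch of the masking step: each word is replaced with
# [MASK] with probability p_mask (whitespace tokenization is assumed).
def mask_words(words, p_mask=0.1):
    return ["[MASK]" if random.random() < p_mask else word for word in words]

print(mask_words("I was listening to music".split(), p_mask=0.3))
# Possible output: ['I', 'was', 'listening', 'to', '[MASK]']
```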

Understanding the POS-guided word replacement method

In the POS-guided (part-of-speech-guided) word replacement method, with probability $p_{\text{pos}}$, we replace a word in a sentence with another word that has the same part of speech.

For example, consider the sentence 'Where did you go?' We know that in this sentence, the word 'did' is a verb. Now we can replace the word 'did' with another verb. So our sentence becomes 'Where do you go?' As you can see, we replaced the word 'did' with 'do' and obtained a new sentence.
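As a rough illustration, the following sketch replaces words with other words of the same part of speech. The tiny hand-written POS lexicon and candidate lists are purely illustrative assumptions; a real implementation would run a POS tagger (for example, NLTK or spaCy) over the sentence and sample replacements of the same tag:

```python
import random

# Illustrative-only POS lexicon and same-POS candidate lists (assumptions).
POS_OF = {"did": "VERB", "do": "VERB", "go": "VERB", "music": "NOUN"}
SAME_POS = {"VERB": ["do", "did", "went"], "NOUN": ["music", "books"]}

def pos_guided_replace(words, p_pos=0.1):
    new_words = []
    for word in words:
        tag = POS_OF.get(word.lower())
        if tag and random.random() < p_pos:
            # Replace the word with another word that has the same POS tag.
            choices = [w for w in SAME_POS[tag] if w != word.lower()]
            new_words.append(random.choice(choices))
        else:
            new_words.append(word)
    return new_words

print(pos_guided_replace("Where did you go".split(), p_pos=0.5))
# Possible output: ['Where', 'do', 'you', 'go']
```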

Understanding the n-gram sampling method

In the n-gram sampling method, with probability $p_{\text{ng}}$, we just randomly sample an n-gram from a sentence, and the value of n is chosen randomly from 1 to 5.
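A minimal sketch of this step, assuming whitespace tokenization, might look as follows:

```python
import random

# A minimal sketch of n-gram sampling: keep only a randomly chosen
# contiguous n-gram of the sentence, with n drawn uniformly from 1 to 5.
def ngram_sample(words):
    n = min(random.randint(1, 5), len(words))   # guard against short sentences
    start = random.randint(0, len(words) - n)
    return words[start:start + n]

print(ngram_sample("Paris is a beautiful city".split()))
# Possible output: ['a', 'beautiful', 'city']
```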

We've learned three different methods for data augmentation. Now let's explore how we exactly apply them.

The data augmentation procedure

Say we have the sentence 'Paris is a beautiful city'. Let $w_1, w_2, \dots, w_i, \dots, w_n$ be the words in the sentence. Now, for each word $w_i$ in our sentence, we create a variable called $X_i$, where the value of $X_i$ is randomly sampled from the uniform distribution, $X_i \sim \text{Uniform}(0, 1)$. Based on the value of $X_i$, we do the following:

  • If $X_i < p_{\text{mask}}$, then we mask the word $w_i$.

  • If $p_{\text{mask}} \le X_i < p_{\text{mask}} + p_{\text{pos}}$, then we apply POS-guided word replacement to the word $w_i$.

Note that masking and POS-guided word replacement are mutually exclusive; if we apply one, then we can't apply the other.

After the preceding step, we will obtain a modified sentence (a synthetic sentence). Now, with probability $p_{\text{ng}}$, we apply n-gram sampling to our synthetic sentence and obtain the final synthetic sentence. Then we append the final synthetic sentence to a data_aug list.
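Putting the pieces together, here is a minimal, self-contained sketch of this procedure for a single pass over one sentence. The probability values are illustrative assumptions, and pos_guided_replace is just a placeholder for the POS-guided replacement sketched earlier:

```python
import random

# Illustrative probability values (assumptions, not the exact settings).
P_MASK, P_POS, P_NG = 0.1, 0.1, 0.25

def pos_guided_replace(word):
    # Placeholder: would return another word with the same POS tag.
    return word

def augment_sentence(words):
    synthetic = []
    for w in words:
        x = random.uniform(0, 1)          # X_i ~ Uniform(0, 1)
        if x < P_MASK:                    # mask the word
            synthetic.append("[MASK]")
        elif x < P_MASK + P_POS:          # POS-guided word replacement
            synthetic.append(pos_guided_replace(w))
        else:
            synthetic.append(w)           # keep the word unchanged
    if random.uniform(0, 1) < P_NG:       # n-gram sampling on the synthetic sentence
        n = min(random.randint(1, 5), len(synthetic))
        start = random.randint(0, len(synthetic) - n)
        synthetic = synthetic[start:start + n]
    return " ".join(synthetic)

data_aug = []
data_aug.append(augment_sentence("Paris is a beautiful city".split()))
print(data_aug)
```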

For every sentence, we perform the preceding steps $N$ times and obtain $N$ new synthetic sentences. Okay, but if we have sentence pairs instead of sentences, then how can we obtain synthetic sentence pairs?

Data augmentation for sentence pairs

For sentence pairs, we can create synthetic sentence pairs in a number of ways. Some of these are as follows:

  • We can create a synthetic sentence only from the first sentence and keep the second sentence unchanged.

  • We can keep the first sentence unchanged and create a synthetic sentence only from the second sentence.

  • We can create synthetic sentences from both the first and second sentences.

In this way, we can apply the data augmentation methods and obtain more data points. Then, we train our student network with the augmented data points.
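Finally, here is a minimal sketch of how synthetic sentence pairs could be created. The augment_sentence function below is only a placeholder standing in for the single-sentence procedure sketched earlier, and the example pair is an assumption:

```python
import random

def augment_sentence(sentence):
    # Placeholder for the single-sentence augmentation procedure sketched earlier.
    return sentence + " [augmented]"

def augment_pair(sent_a, sent_b):
    choice = random.choice(["first_only", "second_only", "both"])
    if choice == "first_only":
        return augment_sentence(sent_a), sent_b                 # keep the second sentence unchanged
    if choice == "second_only":
        return sent_a, augment_sentence(sent_b)                 # keep the first sentence unchanged
    return augment_sentence(sent_a), augment_sentence(sent_b)   # augment both sentences

print(augment_pair("Paris is a beautiful city", "It is the capital of France"))
```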
