Noising Techniques

Learn about the different noising techniques for corrupting text and how they compare, so that we can find the best one.

We've learned that we corrupt the text and feed it to the encoder of BART. But how exactly do we corrupt the text? Does corrupting only mean masking a few tokens? Not necessarily.

The researchers have proposed several interesting noising techniques for corrupting the text:

  • Token masking

  • Token deletion

  • Token infilling

  • Sentence shuffling

  • Document rotation

Let's take a closer look at each of these methods.

Token masking

In token masking, as the name suggests, we randomly mask a few tokens. That is, we randomly replace a few tokens with [MASK], just as we did in the BERT model. A simple example is shown in the following table:
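Token masking can also be sketched in a few lines of Python. This is only a minimal illustration, not the BART implementation: the helper name mask_tokens, the whitespace tokenization, the sample sentence, and the masking probability are all assumptions made for the example.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace each token with [MASK] independently with probability mask_prob.

    Hypothetical helper for illustration; mask_prob=0.15 is an assumed default.
    """
    rng = rng or random.Random()
    return [MASK_TOKEN if rng.random() < mask_prob else tok for tok in tokens]

# Assumption: simple whitespace tokenization instead of a real subword tokenizer.
tokens = "Chicago is a city in Illinois".split()

# A seeded generator makes the corruption reproducible across runs.
corrupted = mask_tokens(tokens, mask_prob=0.3, rng=random.Random(0))
print(" ".join(corrupted))
```

Note that the corrupted sequence has the same length as the original, because each masked token is replaced by exactly one [MASK] token.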
