

Understanding BERT

Let's learn about BERT.

We'll now explore BERT, one of the most influential and commonly used Transformer models. BERT was introduced in Google's research paper: Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.

What does BERT do exactly? The name stands for Bidirectional Encoder Representations from Transformers. To understand what BERT outputs, let's dissect the name:

  • Bidirectional: Training on the text data is bidirectional, which means the model uses the context on both the left and the right of each word in the input sentence.

  • Encoder: An encoder encodes the input sentence.

  • Representations: A representation is a word vector.

  • Transformers: The architecture is transformer-based.

BERT is essentially a trained transformer encoder stack. The input to BERT is a sentence, and the output is a sequence of word vectors. These word vectors are contextual: the vector assigned to a word depends on the sentence it appears in. In short, BERT outputs contextual word representations.
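To make "contextual" concrete, here is a minimal sketch of a single, unparameterized self-attention step in plain Python. It is not BERT: real BERT uses learned projections, multiple attention heads, and many stacked layers. The vectors in `TOY_VECTORS`, the helper names `softmax` and `contextualize`, and the two-word inputs are all assumptions made up for illustration.

```python
import math

# Made-up 2-dimensional starting vectors (hypothetical values, not real embeddings).
TOY_VECTORS = {
    "river": [0.0, 1.0],
    "money": [1.0, 0.0],
    "bank": [0.5, 0.5],
}

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(tokens):
    """One toy self-attention step: each output vector is a weighted
    average of all input vectors, weighted by dot-product similarity."""
    vectors = [TOY_VECTORS[t] for t in tokens]
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
        weights = softmax(scores)
        mixed = [
            sum(w * v[d] for w, v in zip(weights, vectors))
            for d in range(len(query))
        ]
        outputs.append(mixed)
    return outputs

# The output vector for "bank" depends on the words around it.
bank_near_river = contextualize(["river", "bank"])[1]
bank_near_money = contextualize(["money", "bank"])[1]
print(bank_near_river != bank_near_money)  # True
```

Because every output is a mixture of the whole input, the same word ends up with a different vector in a different sentence, which is the property the paragraph above attributes to BERT's output.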

We have already seen a number of issues that transformers aim to solve. Another problem that transformers address concerns word vectors. Earlier, we saw that static word vectors are context-free: the vector for a word is always the same, regardless of the sentence it is used in. The following diagram explains this problem:

Word vector for the word "bank"

Here, even though the word "bank" has two completely different meanings in these two sentences, the word vectors are the same because GloVe and fastText vectors are static. Each word has only one vector, and vectors are saved to a file following ...
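The static behavior described here can be sketched as a plain dictionary lookup. This is a toy illustration, not how GloVe or fastText are loaded in practice: the table `STATIC_VECTORS`, its values, the helper `lookup`, and the two example sentences are all hypothetical.

```python
# Toy static embedding table: exactly one fixed vector per word (made-up values).
STATIC_VECTORS = {
    "bank": [0.8, 0.1, 0.3],
    "river": [0.1, 0.9, 0.2],
    "money": [0.7, 0.0, 0.6],
}

def lookup(word, sentence):
    """Return the static vector for `word`.
    The `sentence` argument is deliberately ignored: a static table has no
    way to take the surrounding words into account."""
    return STATIC_VECTORS[word.lower()]

# Two hypothetical sentences that use "bank" in different senses.
sentence_a = "I sat on the river bank"
sentence_b = "I deposited money at the bank"

vec_a = lookup("bank", sentence_a)
vec_b = lookup("bank", sentence_b)
print(vec_a == vec_b)  # True: same vector in both sentences
```

Whatever the sentence says, the lookup key is only the word itself, so both senses of "bank" collapse to the same vector.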