Understanding BERT

Let's learn about BERT.

We'll now explore one of the most influential and commonly used Transformer models, BERT. BERT was introduced in Google's research paper: Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.

What does BERT do exactly? BERT stands for Bidirectional Encoder Representations from Transformers. To understand what BERT outputs, let's dissect the name:

  • Bidirectional: Training on the text data is bidirectional, which means each word is represented using the context on both its left and its right, rather than only the words that come before it.

  • Encoder: An encoder reads the input sentence and turns it into vectors.

  • Representations: A representation is a word vector.

  • Transformers: The architecture is transformer-based.

BERT is essentially a trained transformer encoder stack. The input to BERT is a sentence, and the output is a sequence of word vectors. The word vectors are contextual, which means that the vector assigned to a word depends on the sentence it appears in. In short, BERT outputs contextual word representations.
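To make "contextual" concrete, here is a minimal sketch in plain Python. It is not BERT: the two-dimensional vectors in `STATIC` are invented for illustration, and `contextual_vectors` is a hypothetical helper that runs a single self-attention pass with no learned weights. It only shows the general idea that a lookup table gives a word the same vector everywhere, while attention over the whole sentence (words to the left and to the right) gives the same word different vectors in different sentences.

```python
import math

# Hypothetical static embeddings, made up for this example.
STATIC = {
    "the": [0.1, 0.1],
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def static_vectors(tokens):
    """Context-free lookup: a word always maps to the same vector."""
    return [STATIC[t] for t in tokens]


def contextual_vectors(tokens):
    """One toy self-attention pass (assumption: no learned projections).

    Each output vector is a weighted average of ALL vectors in the
    sentence, so it depends on the words on both sides of the token.
    """
    vecs = static_vectors(tokens)
    dims = len(vecs[0])
    outputs = []
    for query in vecs:
        weights = softmax([dot(query, key) for key in vecs])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, vecs)) for d in range(dims)]
        )
    return outputs


sentence_1 = ["the", "river", "bank"]
sentence_2 = ["the", "money", "bank"]

# Vector for "bank" (position 2) in each sentence.
static_1 = static_vectors(sentence_1)[2]
static_2 = static_vectors(sentence_2)[2]
ctx_1 = contextual_vectors(sentence_1)[2]
ctx_2 = contextual_vectors(sentence_2)[2]

print("static vectors equal:", static_1 == static_2)  # True
print("contextual, 'river bank':", ctx_1)
print("contextual, 'money bank':", ctx_2)
```

In this toy setup, the contextual vector for "bank" leans toward "river" in the first sentence and toward "money" in the second. The real BERT stacks many such attention layers with learned weights, but the input and output shape is the same: one vector per word, each shaped by the rest of the sentence.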

We have already seen a number of issues that transformers aim to solve. Another problem that transformers address concerns word vectors. The word vectors we saw earlier are context-free: the vector for a word is always the same, regardless of the sentence it is used in. The following diagram explains this problem:
