Understanding BERT
Let's learn about BERT.
We'll now explore BERT, one of the most influential and commonly used transformer models. BERT was introduced in Google's research
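What exactly does BERT do? To understand what BERT outputs, let's dissect the name, Bidirectional Encoder Representations from Transformers:
Bidirectional: Training on the text data is bidirectional, which means the representation of each word is computed from the words on both its left and its right in the input sentence, rather than from one direction only.
Encoder: BERT uses the encoder part of the transformer, which encodes the input sentence into vectors.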
Representations: A representation is a word vector.
Transformers: The architecture is transformer-based.
BERT is essentially a trained transformer encoder stack. Its input is a sentence, and its output is a sequence of word vectors. These word vectors are contextual, which means the vector assigned to a word depends on the sentence in which the word appears. In short, BERT outputs contextual word representations.
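To make the idea of a contextual word vector concrete, here is a minimal sketch in plain Python. It is not BERT: it applies a single unmasked self-attention step with identity projections to a tiny hand-written embedding table. The `STATIC` table, its two-dimensional values, and the helper names are all assumptions made up for illustration. The real model uses learned projections, many attention heads and layers, subword tokens, and position information.

```python
import math

# Hypothetical 2-dimensional input embeddings, chosen by hand for illustration.
STATIC = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """One unmasked self-attention step with identity query/key/value projections.

    Every position attends to every other position, to its left and to its
    right, which is the sense in which an encoder is bidirectional.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot-product score between this position and every position.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The output is a weighted average of all input vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs


def contextual_vector(sentence, word):
    """Return the attention output for `word` within `sentence`."""
    tokens = sentence.split()
    vectors = [STATIC[token] for token in tokens]
    return self_attention(vectors)[tokens.index(word)]


river_bank = contextual_vector("river bank", "bank")
money_bank = contextual_vector("money bank", "bank")

print("input vector for 'bank':", STATIC["bank"])
print("'bank' in 'river bank': ", river_bank)
print("'bank' in 'money bank': ", money_bank)
```

The input vector for "bank" is the same in both sentences, but the output vector is pulled toward "river" in one case and toward "money" in the other, because each output is a mixture of all the vectors in the sentence.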
We have already seen a number of issues that transformers aim to solve. Another problem that transformers address concerns word vectors. The word vectors we saw earlier are context-free: the vector for a word is always the same, regardless of the sentence it is used in. The following diagram explains this problem: