Understanding BERT

Learn about BERT and its input processing.

Bidirectional Encoder Representations from Transformers (BERT) is one of the many transformer models that have emerged over the past few years.

BERT was introduced in the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” by Devlin et al. (https://arxiv.org/pdf/1810.04805.pdf). Transformer models are divided into two main families:

  • Encoder-based models

  • Decoder-based (autoregressive) models

In other words, these models are built on either the encoder or the decoder part of the transformer, rather than on both. The main difference between the two is how attention is used: encoder-based models use bidirectional attention, whereas decoder-based models use autoregressive (that is, left-to-right) attention.
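To make this distinction concrete, here is a minimal NumPy sketch (not part of the lesson itself) contrasting the two attention patterns as masks over a toy sequence: bidirectional attention lets every token attend to every other token, while autoregressive attention restricts each token to itself and the tokens to its left.

```python
import numpy as np

seq_len = 5  # a toy sequence of five tokens

# Bidirectional (encoder-style, as in BERT): every token may attend
# to every position, so the mask allows all pairs.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=int)

# Autoregressive (decoder-style): a token may only attend to itself
# and to positions on its left, giving a lower-triangular mask.
autoregressive_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

print("Bidirectional attention mask:\n", bidirectional_mask)
print("Autoregressive attention mask:\n", autoregressive_mask)
```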

BERT is an encoder-based transformer model. It takes an input sequence (a sequence of tokens) and produces an encoded output sequence. The figure below depicts the high-level architecture of BERT:

The high-level architecture of BERT

BERT takes a set of input tokens and, through several hidden layers, produces a sequence of hidden representations, one per input token.
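As a rough sketch of this input-to-representation mapping, the snippet below uses the Hugging Face transformers library (one possible implementation, not prescribed by the lesson) to feed a sentence through a pretrained BERT model and inspect the shape of its output: one hidden vector per token.

```python
from transformers import BertTokenizer, BertModel

# Load a pretrained BERT (bert-base-uncased: 12 hidden layers, 768-dim outputs)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize an example sentence and run it through the encoder
inputs = tokenizer("BERT encodes token sequences.", return_tensors="pt")
outputs = model(**inputs)

# One hidden representation per input token: (batch_size, seq_len, hidden_size)
print(outputs.last_hidden_state.shape)
```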

Now, let’s discuss a few details pertinent to BERT, such as inputs consumed by BERT and the tasks it’s designed to solve.

Input processing for BERT

When BERT takes an input, it inserts some special tokens into it. First, at the beginning, it inserts a [CLS] token (an abbreviation of classification), whose final hidden representation is used for certain types of tasks (such as sequence classification). It represents the output after attending to all the tokens in the sequence. Next, BERT inserts a [SEP] (short for “separator”) token depending on the type of input. The [SEP] token marks the end of a sequence and the boundary between different sequences in the input. For example, in question answering, the model takes a question and a context (such as a paragraph) that may contain the answer, and [SEP] is placed between the question and the context. Additionally, there is the [PAD] token, which is used to pad short sequences to a required length.
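To see these special tokens in practice, here is a small illustration using the Hugging Face transformers tokenizer (an assumption made for illustration; any BERT-compatible tokenizer behaves similarly). It encodes a question–context pair and pads it to a fixed length, so [CLS], [SEP], and [PAD] all appear in the result.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A question-answering style input: a question plus a context passage.
question = "Where is the Eiffel Tower?"
context = "The Eiffel Tower is located in Paris."

# Encode the pair, padding to a fixed length so [PAD] tokens appear too.
encoded = tokenizer(question, context, padding="max_length", max_length=24)

# Convert the IDs back to tokens to see the special tokens in place.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Expected structure (exact subword splits may vary):
# ['[CLS]', <question tokens>, '[SEP]', <context tokens>, '[SEP]', '[PAD]', ...]
```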

The [CLS] token is prepended to any input sequence fed to BERT. This denotes the beginning of the ...