Understanding BERT

Learn about BERT and its input processing.

We'll cover the following...

Input processing for BERT
Tasks solved by BERT
How BERT is pretrained
- Masked language modeling (MLM)
- Next sentence prediction (NSP)

Bidirectional Encoder Representations from Transformers (BERT) is one of the many transformer models that have emerged over the past few years.

BERT was introduced in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Devlin et al. (https://arxiv.org/pdf/1810.04805.pdf). Transformer models fall into two main categories:

Encoder-based models
Decoder-based (autoregressive) models

In other words, these models are built on either the encoder or the decoder part of the transformer, rather than on both. The main difference between the two is how attention is used. Encoder-based models use bidirectional attention, where every token can attend to the tokens on both its left and its right, whereas decoder-based models use autoregressive (that is, left-to-right) attention, where each token attends only to itself and the tokens before it.
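The difference between the two attention patterns can be pictured as a mask over token positions. The following is a minimal sketch in plain Python, not code from BERT itself; the helper names `bidirectional_mask` and `causal_mask` are hypothetical and exist only for this illustration. A 1 at row i, column j means that position i is allowed to attend to position j.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    tokens = ["the", "cat", "sat"]  # toy input, chosen only for illustration
    masks = {
        "bidirectional": bidirectional_mask(len(tokens)),
        "causal": causal_mask(len(tokens)),
    }
    for name, mask in masks.items():
        print(name)
        for token, row in zip(tokens, mask):
            print(f"  {token:>4}: {row}")
```

In the bidirectional mask every row is all ones, so the representation of "the" can depend on "sat". In the causal mask the first row allows only the first position, so "the" cannot see anything to its right.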

BERT is an encoder-based transformer model. It takes an input sequence (a collection of tokens) and produces an encoded output sequence. The figure below depicts the high-level architecture of BERT:

It takes a set of input tokens and produces a sequence of hidden representations generated using several hidden layers.

Now, let’s discuss a few details pertinent to BERT, such as inputs consumed by BERT and the tasks it’s designed to solve.

Input processing for BERT

When BERT takes an input, it inserts some special tokens into the input. First, at the beginning, it inserts a [CLS] (short for “classification”) token, whose final hidden representation is used for certain types of tasks (such as sequence classification). That representation is the output for the [CLS] position after attending to all the tokens in the sequence. Next, it inserts a [SEP] (meaning “separation”) token, depending on the type of input. The [SEP] token marks the end and beginning of different sequences in the input. For example, in question answering, the model takes as input a question and a context (such as a paragraph) that may contain the answer, and [SEP] is placed between the question and the context. Additionally, we have the [PAD] token, which can be used to pad short sequences to a required length.
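The special tokens described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not BERT's actual preprocessing: the helper `build_bert_input` and its `max_len` parameter are hypothetical, and the text is split on whitespace, whereas BERT uses a WordPiece subword tokenizer. The sketch also places a [SEP] after the last sequence, which matches the usual BERT input layout.

```python
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def build_bert_input(first, second=None, max_len=12):
    """Wrap one or two text segments in BERT-style special tokens.

    Hypothetical helper for illustration only; whitespace splitting stands
    in for the real subword tokenizer.
    """
    # [CLS] always comes first; [SEP] closes the first segment.
    tokens = [CLS] + first.split() + [SEP]
    if second is not None:
        # A second segment (for example, a context paragraph) gets its own [SEP].
        tokens += second.split() + [SEP]
    if len(tokens) > max_len:
        raise ValueError(f"input has {len(tokens)} tokens, but max_len is {max_len}")
    # Pad short inputs on the right up to the required length.
    return tokens + [PAD] * (max_len - len(tokens))


if __name__ == "__main__":
    # Single sequence, as in sequence classification.
    print(build_bert_input("great movie", max_len=6))
    # Question and context, as in question answering.
    print(build_bert_input("who wrote it", "alice wrote it", max_len=10))
```

For the question answering call, the question and the context end up on either side of the first [SEP], and a single [PAD] fills the remaining position.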

The [CLS] token is prepended to any input sequence fed to BERT. This denotes the beginning of the ...
