Architecture of BERT

Learn about the different components involved in the BERT transformer in this lesson.


BERT introduces bidirectional attention to transformer models. Bidirectional attention requires many other changes to the original transformer model. In this section, we will focus on the specific aspects of BERT models.
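To make the difference concrete, the short sketch below (an illustration assuming PyTorch; the tensor names are ours, not taken from the BERT codebase) contrasts the causal mask a unidirectional decoder uses with the all-ones mask a bidirectional encoder such as BERT uses:

```python
import torch

seq_len = 5

# Causal (unidirectional) mask: token i may only attend to positions <= i.
# This is what a GPT-style decoder uses.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Bidirectional mask: every token may attend to every other token
# (padding positions aside), which is what a BERT-style encoder uses.
bidirectional_mask = torch.ones(seq_len, seq_len)

print(causal_mask)
print(bidirectional_mask)
```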

We will focus on the evolutions designed by Devlin et al. (2018), who describe the encoder stack in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." The paper can be accessed at https://arxiv.org/abs/1810.04805.

First, we'll go through the encoder stack and the preparation of the pretraining input environment. Then, we will describe the two-step framework of BERT: pretraining and fine-tuning.
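As a rough illustration of the two steps, the following sketch assumes the Hugging Face transformers library and the bert-base-uncased checkpoint (not the original BERT training code): step 1 amounts to loading weights produced by pretraining, and step 2 fine-tunes them with a small task head on labeled data.

```python
import torch
# A minimal sketch of BERT's two-step framework, assuming the Hugging Face
# `transformers` library (not the original BERT codebase).
from transformers import BertForSequenceClassification, BertTokenizer

# Step 1 (pretraining) has already been done for us: we load a checkpoint
# pretrained with masked language modeling and next sentence prediction.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Step 2 (fine-tuning): the pretrained encoder plus a new classification
# head is trained on labeled data for the downstream task.
inputs = tokenizer("BERT adds bidirectional attention.", return_tensors="pt")
outputs = model(**inputs, labels=torch.tensor([1]))
loss = outputs.loss  # this loss would be backpropagated during fine-tuning
```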

Let’s first explore the encoder stack.

Structure

The first building block we will take from the original transformer model is an encoder layer. The encoder layer is shown in the diagram below:
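To complement the diagram, here is a minimal sketch of such an encoder layer built with PyTorch's generic transformer module; the sizes follow BERT-BASE (hidden size 768, 12 heads, feed-forward size 3072), but this is a stand-in, not BERT's exact implementation.

```python
import torch
import torch.nn as nn

# One encoder layer: multi-head self-attention followed by a feed-forward
# sublayer, each with a residual connection and layer normalization.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768,          # hidden size of each token representation
    nhead=12,             # number of attention heads
    dim_feedforward=3072,  # inner size of the feed-forward sublayer
    activation="gelu",    # BERT uses GELU rather than ReLU
    batch_first=True,
)

# BERT-BASE stacks 12 such layers into its encoder.
encoder_stack = nn.TransformerEncoder(encoder_layer, num_layers=12)

# Toy input: a batch of 1 sequence with 5 token embeddings of size 768.
x = torch.randn(1, 5, 768)
out = encoder_stack(x)
print(out.shape)  # torch.Size([1, 5, 768])
```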
