Architecture of BERT
Learn about the different components involved in the BERT transformer in this lesson.
BERT introduces bidirectional attention to transformer models. Bidirectional attention requires several other changes to the original transformer model. In this section, we will focus on the aspects that are specific to BERT models.
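The contrast with the original transformer's decoder can be made concrete with attention masks: bidirectional attention applies no mask, so every token can attend to the whole sequence, while a causal decoder masks out future positions. The following NumPy sketch illustrates this difference (the sequence length, dimensions, and function names here are illustrative assumptions, not from the paper):

```python
import numpy as np

def attention_weights(q, k, mask):
    """Scaled dot-product attention weights with an additive mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores + mask  # -inf blocks a position from attending
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d = 4, 8  # toy sizes for illustration
rng = np.random.default_rng(0)
q = rng.normal(size=(seq_len, d))
k = rng.normal(size=(seq_len, d))

# Bidirectional (BERT): every token may attend to every other token.
bidir_mask = np.zeros((seq_len, seq_len))

# Causal (original decoder): token i may only attend to positions <= i.
causal_mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

w_bidir = attention_weights(q, k, bidir_mask)
w_causal = attention_weights(q, k, causal_mask)

print(np.count_nonzero(w_bidir))   # 16: all positions visible
print(np.count_nonzero(w_causal))  # 10: only the lower triangle
```

Seeing every position in both directions is what lets BERT condition each token's representation on its full context, which in turn motivates the masked-language-model pretraining objective.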
We will focus on the design evolutions introduced by Devlin et al. (2018) in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," which describes the encoder stack. First, we'll go through the encoder stack and the preparation of the pretraining input environment. Then we will describe the two-step BERT framework: pretraining and fine-tuning.
Let’s first explore the encoder stack.
Structure
The first building block we will take from the original transformer model is an encoder layer. The encoder layer is shown in the diagram below:
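Conceptually, each encoder layer combines a self-attention sublayer and a position-wise feed-forward sublayer, each wrapped in a residual connection followed by layer normalization. The single-head NumPy sketch below shows this data flow; it is a simplification for illustration (the actual model uses multi-head attention, learned parameters, and dropout), and all names and sizes here are assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv, wo):
    """Single-head scaled dot-product self-attention (no mask: bidirectional)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ wo

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward network with a ReLU activation."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def encoder_layer(x, p):
    # Sublayer 1: self-attention, then residual connection and layer norm.
    x = layer_norm(x + self_attention(x, p["wq"], p["wk"], p["wv"], p["wo"]))
    # Sublayer 2: feed-forward, then residual connection and layer norm.
    return layer_norm(x + feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"]))

seq_len, d_model, d_ff = 5, 16, 64  # toy sizes for illustration
rng = np.random.default_rng(1)
p = {name: rng.normal(size=shape) * 0.1
     for name, shape in [("wq", (d_model, d_model)), ("wk", (d_model, d_model)),
                         ("wv", (d_model, d_model)), ("wo", (d_model, d_model)),
                         ("w1", (d_model, d_ff)), ("w2", (d_ff, d_model))]}
p["b1"] = np.zeros(d_ff)
p["b2"] = np.zeros(d_model)

out = encoder_layer(rng.normal(size=(seq_len, d_model)), p)
print(out.shape)  # (5, 16): same shape as the input
```

Because the output shape matches the input shape, identical layers can be stacked on top of one another, which is exactly how the BERT encoder stack is built.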