Teacher-Student Architecture
Learn about the teacher-student architecture of TinyBERT and how distillation happens in the TinyBERT model.
In TinyBERT, we use a two-stage learning framework in which distillation is applied in both the pre-training and fine-tuning stages.
 
But to understand how exactly TinyBERT works, let's first go over the premise and notation used. The following figure shows the teacher and student BERT:
We'll look at the teacher BERT first, and then we'll look at the student BERT.
Understanding the teacher BERT
From the above diagram, we can see that the teacher BERT consists of a stack of encoder layers followed by a prediction layer.
The prediction layer is essentially a feedforward network. If we are performing a masked language modeling task, the prediction layer returns one logit for every word in our vocabulary, scoring how likely that word is to be the masked word.
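To make the role of the prediction layer concrete, here is a minimal sketch in plain Python. It assumes the prediction layer is a single linear map from the hidden state at the masked position to one logit per vocabulary word; real implementations typically include additional transformations. The toy vocabulary, the dimension of 4, and the random weights are all hypothetical and chosen only for illustration.

```python
import math
import random


def prediction_layer(hidden, weight, bias):
    """Map a hidden vector of size d to one logit per vocabulary word."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weight, bias)
    ]


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)

# Hypothetical toy setup: a 5-word vocabulary and a hidden size of 4.
vocab = ["paris", "london", "cat", "blue", "runs"]
d = 4

# Stand-in for the encoder output at the masked position.
hidden_at_mask = [random.uniform(-1, 1) for _ in range(d)]

# One weight row and one bias per vocabulary word (random, untrained).
weight = [[random.uniform(-1, 1) for _ in range(d)] for _ in vocab]
bias = [0.0] * len(vocab)

logits = prediction_layer(hidden_at_mask, weight, bias)
probs = softmax(logits)

# The word with the highest logit is the layer's guess for the masked word.
predicted = vocab[logits.index(max(logits))]
print(predicted, [round(p, 3) for p in probs])
```

The key point is the shape: whatever the hidden size is, the output always has exactly one entry per vocabulary word.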
We use the pre-trained BERT-base model as the teacher BERT. The BERT-base model consists of 12 encoder layers and 12 attention heads, and the size of the representation it produces (the hidden state dimension) is 768.
Understanding the student BERT
From the above diagram, we can see that the student BERT has the same architecture as the teacher BERT but consists of fewer encoder layers.
We use a BERT model with 4 encoder layers as the student BERT, and we set the representation size (hidden state dimension) to 312.
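The size difference between the two models can be summarized in a small sketch. The `BertShape` class is a hypothetical helper, not part of any library. The teacher values (12 layers, hidden size 768) are those of BERT-base; the student depth of 4 comes from the text above, while the student hidden size of 312 is an assumption taken from the 4-layer configuration reported in the TinyBERT paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertShape:
    """Hypothetical container for the two sizes discussed in this lesson."""
    name: str
    num_layers: int   # number of encoder layers
    hidden_size: int  # hidden state dimension


teacher = BertShape("teacher (BERT-base)", num_layers=12, hidden_size=768)

# Assumption: 312 is the hidden size of the 4-layer TinyBERT configuration.
student = BertShape("student (4-layer BERT)", num_layers=4, hidden_size=312)

# How much shallower and narrower the student is than the teacher.
depth_ratio = teacher.num_layers / student.num_layers
width_ratio = teacher.hidden_size / student.hidden_size

for model in (teacher, student):
    print(f"{model.name}: {model.num_layers} layers, hidden size {model.hidden_size}")
print(f"teacher is {depth_ratio:.1f}x deeper and {width_ratio:.2f}x wider")
```

Because the student is both shallower and narrower, its layers and hidden states do not line up one-to-one with the teacher's, which is why the notation for the two models is kept separate.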