...


Teacher-Student Architecture

Learn about the teacher-student architecture of TinyBERT and how distillation happens in the TinyBERT model.

In TinyBERT, we use a two-stage learning framework where we apply distillation in both the pre-training and fine-tuning stages.


But to understand exactly how TinyBERT works, let's first go over the premise and the notation used. The following figure shows the teacher and student BERT:

Figure: Teacher-student architecture of TinyBERT

We'll look at the teacher BERT first, and then we'll look at the student BERT.

Understanding the teacher BERT

From the above diagram, we can see that the teacher BERT consists of N encoder layers. We take the input sentence and feed it to an embedding layer to get the input embeddings. Next, we pass the input embeddings to the encoder layers. The encoder layers learn the contextual relations of the input sentence using the self-attention mechanism and return the representation. Finally, we send this representation to the prediction layer.
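To make this flow concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither of which is prescribed by this section). It feeds a sentence through a pre-trained BERT and inspects the representation returned by the embedding layer and by each encoder layer:

```python
# Minimal sketch (assumes the Hugging Face transformers library): feeding a
# sentence through a pre-trained BERT and inspecting the layer-wise representations.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
teacher = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")

with torch.no_grad():
    outputs = teacher(**inputs, output_hidden_states=True, output_attentions=True)

# hidden_states[0] is the output of the embedding layer; hidden_states[1..N]
# are the representations returned by each of the N encoder layers.
hidden_states = outputs.hidden_states
print(len(hidden_states) - 1)      # N = 12 encoder layers for BERT-base
print(hidden_states[-1].shape)     # (batch, seq_len, 768): final representation
```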

The prediction layer is simply a feedforward network. If we are performing a masked language modeling task, the prediction layer returns logits over all the words in the vocabulary, indicating how likely each word is to be the masked word.
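As a hedged illustration of the prediction layer, the sketch below (again assuming the transformers library and bert-base-uncased) uses the masked language modeling head to obtain logits over the vocabulary at the masked position:

```python
# Sketch: the MLM prediction layer returns a logit for every word in the
# vocabulary at the masked position. Assumes the Hugging Face transformers library.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("Paris is the [MASK] of France.", return_tensors="pt")

with torch.no_grad():
    logits = mlm_model(**inputs).logits    # (batch, seq_len, vocab_size)

# Find the masked position and take the highest-scoring vocabulary word there.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode([predicted_id.item()]))  # likely "capital"
```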

We use the pre-trained BERT-base model as the teacher BERT. The BERT-base model consists of 12 encoder layers and 12 attention heads, and the size of the representation (hidden state dimension d) produced by it is 768. The teacher BERT contains 110 million parameters. Now that we understand the teacher BERT, let's have a look at the student BERT.
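These figures are easy to verify. The following sketch, assuming the transformers library, loads bert-base-uncased and prints its number of encoder layers, attention heads, hidden size, and total parameter count:

```python
# Sketch: inspecting the teacher (BERT-base) configuration.
# Assumes the Hugging Face transformers library and the bert-base-uncased checkpoint.
from transformers import BertModel

teacher = BertModel.from_pretrained("bert-base-uncased")
config = teacher.config

print(config.num_hidden_layers)     # 12 encoder layers (N)
print(config.num_attention_heads)   # 12 attention heads
print(config.hidden_size)           # d = 768
print(sum(p.numel() for p in teacher.parameters()))  # roughly 110 million parameters
```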

Understanding the student BERT

From the above diagram, we can see that the architecture of the student BERT is the same as that of the teacher BERT, but unlike the teacher BERT, the student BERT consists of M encoder layers. Note that N is greater than M ...
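As an illustrative sketch of such a student, the configuration below uses the sizes reported for the 4-layer TinyBERT model (M = 4 encoder layers, hidden size 312); the transformers API calls are an assumption for illustration, not TinyBERT's actual training code:

```python
# Sketch: a student BERT with M encoder layers and a smaller hidden size than
# the teacher. Sizes follow the 4-layer TinyBERT setting; the transformers
# library calls are an assumption for illustration.
from transformers import BertConfig, BertModel

student_config = BertConfig(
    num_hidden_layers=4,       # M = 4 encoder layers (teacher has N = 12)
    hidden_size=312,           # student hidden size (teacher uses d = 768)
    num_attention_heads=12,    # same number of attention heads as the teacher
    intermediate_size=1200,    # smaller feedforward dimension
)
student = BertModel(student_config)  # randomly initialized; learned via distillation

print(sum(p.numel() for p in student.parameters()))  # about 14.5 million parameters
```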