...


Teacher-Student Architecture

Learn about the teacher-student architecture of TinyBERT and how distillation happens in the TinyBERT model.

In TinyBERT, we use a two-stage learning framework where we apply distillation in both the pre-training and fine-tuning stages.


But to understand exactly how TinyBERT works, let's first go over the premise and the notation used. The following figure shows the teacher and student BERT:

Figure: Teacher-student architecture of TinyBERT

We'll look at the teacher BERT first, and then we'll look at the student BERT.

Understanding the teacher BERT

From the above diagram, we can see that the teacher BERT consists of N encoder layers. We take the input sentence and feed it to an embedding layer to get the input embeddings. Next, we pass the input embeddings to the encoder layers. The encoder layers learn the contextual relations of the input sentence using the self-attention mechanism and return the representation. Finally, we send this representation to the prediction layer.
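To make this flow concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither of which is prescribed by this section). It feeds a sentence through a pre-trained BERT and inspects the representation returned by the embedding layer and by each encoder layer:

```python
# Minimal sketch (assumes the Hugging Face transformers library): feeding a
# sentence through a pre-trained BERT and inspecting the layer-wise representations.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
teacher = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")

with torch.no_grad():
    outputs = teacher(**inputs, output_hidden_states=True, output_attentions=True)

# hidden_states[0] is the output of the embedding layer; hidden_states[1..N]
# are the representations returned by each of the N encoder layers.
hidden_states = outputs.hidden_states
print(len(hidden_states) - 1)      # N = 12 encoder layers for BERT-base
print(hidden_states[-1].shape)     # (batch, seq_len, 768): final representation
```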

The prediction layer is simply a feedforward network. If we are performing a masked language modeling task, the prediction layer returns logits over all the words in the vocabulary, indicating how likely each word is to be the masked word.
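As a hedged illustration of the prediction layer, the sketch below (again assuming the transformers library and bert-base-uncased) uses the masked language modeling head to obtain logits over the vocabulary at the masked position:

```python
# Sketch: the MLM prediction layer returns a logit for every word in the
# vocabulary at the masked position. Assumes the Hugging Face transformers library.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("Paris is the [MASK] of France.", return_tensors="pt")

with torch.no_grad():
    logits = mlm_model(**inputs).logits    # (batch, seq_len, vocab_size)

# Find the masked position and take the highest-scoring vocabulary word there.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode([predicted_id.item()]))  # likely "capital"
```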

We use the pre-trained BERT-base model as the teacher BERT. The BERT-base model consists of 12 encoder layers and 12 attention heads, and the size of the representation (hidden state dimension d) produced by it is 768. The teacher BERT contains 110 million parameters. Now that we understand the teacher BERT, let's have a look at the student BERT.
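These figures are easy to verify. The following sketch, assuming the transformers library, loads bert-base-uncased and prints its number of encoder layers, attention heads, hidden size, and total parameter count:

```python
# Sketch: inspecting the teacher (BERT-base) configuration.
# Assumes the Hugging Face transformers library and the bert-base-uncased checkpoint.
from transformers import BertModel

teacher = BertModel.from_pretrained("bert-base-uncased")
config = teacher.config

print(config.num_hidden_layers)     # 12 encoder layers (N)
print(config.num_attention_heads)   # 12 attention heads
print(config.hidden_size)           # d = 768
print(sum(p.numel() for p in teacher.parameters()))  # roughly 110 million parameters
```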

Understanding the student BERT

From the above diagram, we can see that the architecture of the student BERT is the same as that of the teacher BERT, but unlike the teacher BERT, the student BERT consists of M encoder layers. Note that N is greater than M ...
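As an illustrative sketch of such a student, the configuration below uses the sizes reported for the 4-layer TinyBERT model (M = 4 encoder layers, hidden size 312); the transformers API calls are an assumption for illustration, not TinyBERT's actual training code:

```python
# Sketch: a student BERT with M encoder layers and a smaller hidden size than
# the teacher. Sizes follow the 4-layer TinyBERT setting; the transformers
# library calls are an assumption for illustration.
from transformers import BertConfig, BertModel

student_config = BertConfig(
    num_hidden_layers=4,       # M = 4 encoder layers (teacher has N = 12)
    hidden_size=312,           # student hidden size (teacher uses d = 768)
    num_attention_heads=12,    # same number of attention heads as the teacher
    intermediate_size=1200,    # smaller feedforward dimension
)
student = BertModel(student_config)  # randomly initialized; learned via distillation

print(sum(p.numel() for p in student.parameters()))  # about 14.5 million parameters
```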