TinyBERT

Learn about the TinyBERT variant of BERT based on knowledge distillation.

TinyBERT is another interesting variant of BERT that also uses knowledge distillation. With DistilBERT, we learned how to transfer knowledge from the output layer of the teacher BERT to the student BERT. But apart from this, can we also transfer knowledge from other layers of the teacher BERT? Yes!

In TinyBERT, apart from transferring knowledge from the output layer (prediction layer) of the teacher to the student, we also transfer knowledge from embedding and encoder layers.

Let's understand this with an example. Suppose we have a teacher BERT with N encoder layers. For simplicity, only one encoder layer is shown in the following figure, which depicts the pre-trained teacher BERT model: we feed it a masked sentence, and it returns the logits of every word in our vocabulary being the masked word.

TinyBERT transferring knowledge from other layers

In DistilBERT, we took the logits produced by the output layer of the teacher BERT and trained the student BERT to produce the same logits (indicated by 1 in the diagram below).
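As a concrete illustration of this prediction-layer distillation, here is a minimal sketch in PyTorch. The function name and the temperature parameter are assumptions made for illustration; they are not prescribed by the text.

```python
import torch
import torch.nn.functional as F

def prediction_layer_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy between the teacher's and student's vocabulary distributions.

    temperature is an illustrative hyperparameter: higher values soften both
    distributions before comparing them.
    """
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the softened distributions (equivalent to soft
    # cross-entropy up to a constant); scaled by T^2 as is common in distillation
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (temperature ** 2)
```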

In TinyBERT, we also take the hidden states and attention matrices produced by the encoder layers of the teacher BERT and train the student BERT to produce the same hidden states and attention matrices (indicated by 2 in the diagram below). We then take the output of the embedding layer of the teacher BERT and train the student BERT to produce the same embeddings (indicated by 3 in the diagram below).
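The intermediate-layer transfer can be sketched in the same spirit. The snippet below assumes PyTorch; the learnable projection that maps the student's (typically smaller) hidden size into the teacher's space is an assumption for the case where the two dimensions differ, and all names are illustrative. The same hidden-state loss form can be reused for the embedding layer (step 3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenStateLoss(nn.Module):
    """MSE between teacher hidden states and (projected) student hidden states."""

    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        # Learnable projection mapping student hidden states into the teacher's space
        # (an assumption for when the student uses a smaller hidden size)
        self.proj = nn.Linear(student_dim, teacher_dim, bias=False)

    def forward(self, student_hidden, teacher_hidden):
        return F.mse_loss(self.proj(student_hidden), teacher_hidden)

def attention_loss(student_attn, teacher_attn):
    # MSE between the student's and the teacher's attention matrices
    return F.mse_loss(student_attn, teacher_attn)
```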

So, in TinyBERT, apart from transferring knowledge from the output layer of the teacher BERT to the student BERT, we also transfer knowledge from the intermediate layers.
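Putting the pieces together, a rough usage sketch that sums the losses defined above might look like the following. The equal weighting and the example dimensions are assumptions, and the student/teacher tensors (student_attn, teacher_hidden, and so on) are placeholders for the corresponding model outputs.

```python
# Illustrative combination of the distillation losses sketched above.
# Example dimensions only: a smaller student projected into a 768-dimensional teacher space.
hidden_loss = HiddenStateLoss(student_dim=312, teacher_dim=768)
embedding_loss = HiddenStateLoss(student_dim=312, teacher_dim=768)  # same form for embeddings

total_loss = (
    attention_loss(student_attn, teacher_attn)                # encoder attention matrices (2)
    + hidden_loss(student_hidden, teacher_hidden)             # encoder hidden states (2)
    + embedding_loss(student_emb, teacher_emb)                # embedding layer outputs (3)
    + prediction_layer_loss(student_logits, teacher_logits)   # output logits (1)
)
```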
