Training the Student Network

Learn how to transfer the knowledge from the teacher to the student network.

We'll cover the following...

The distillation loss
Difference between the soft target and hard target
Difference between soft prediction and hard prediction
The student loss
Computing student loss
Computing distillation loss
Final loss function

Okay, so how do we transfer the dark knowledge from the teacher to the student? In other words, how is the student network trained so that it acquires the teacher's knowledge?

Note: The student network is not pre-trained; only the teacher network is. The teacher network is pre-trained with softmax temperature.

As shown in the following figure, we feed the input sentence to both the teacher and the student networks and get a probability distribution as output from each. Because the teacher network is pre-trained, the probability distribution it returns serves as our target. The output of the teacher network is called a soft target, and the prediction made by the student network is called a soft prediction.
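
To make the two terms concrete, here is a minimal sketch of how a soft target and a soft prediction could be computed from raw network outputs. The logits, the four-class output size, and the temperature value T = 2.0 are made-up values for illustration, and the sketch assumes the same softmax temperature is applied to both networks; the helper name softmax_with_temperature is hypothetical and not taken from any library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input over a 4-class output (illustrative only).
teacher_logits = [4.0, 2.5, 0.5, -1.0]
student_logits = [1.0, 0.8, 0.2, 0.1]

T = 2.0  # assumed temperature, chosen only for this sketch

# Teacher output: the soft target the student learns to match.
soft_target = softmax_with_temperature(teacher_logits, T)

# Student output: the soft prediction.
soft_prediction = softmax_with_temperature(student_logits, T)

print("soft target:    ", [round(p, 3) for p in soft_target])
print("soft prediction:", [round(p, 3) for p in soft_prediction])
```

Raising the temperature flattens the distribution, so the teacher's less likely classes keep a visible share of the probability mass. That share is the dark knowledge the student is meant to pick up.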

The distillation loss