Training the Student Network

Learn how to transfer the knowledge from the teacher to the student network.

We'll cover the following...

The distillation loss
Difference between the soft target and hard target
Difference between soft prediction and hard prediction
The student loss
Computing student loss
Computing distillation loss
Final loss function

Okay, so how do we transfer the dark knowledge from the teacher to the student? How is the student network trained, and how does it acquire knowledge from the teacher?

Note: The student network is not pre-trained; only the teacher network is. The teacher network is pre-trained with softmax temperature.

As shown in the following figure, we feed the input sentence to both the teacher and student networks, and each returns a probability distribution as output. Because the teacher network is pre-trained, the probability distribution it returns serves as our target. The output of the teacher network is called a soft target, and the prediction made by the student network is called a soft prediction.
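To make the two terms concrete, here is a minimal sketch in plain Python. The logits, the three-word vocabulary, and the temperature value `T = 2.0` are all hypothetical values chosen for illustration, and `softmax_with_temperature` is a helper written for this example, not a function from any library. The final cross-entropy line is an assumption as well: it is one common way to measure how far a soft prediction is from a soft target, and the lesson text above does not define the loss itself.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into a probability distribution.

    Dividing by a temperature above 1 flattens the distribution,
    which exposes how the network ranks the non-top classes.
    """
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over a tiny 3-word vocabulary (illustrative only).
teacher_logits = [4.0, 2.0, 0.5]  # from the pre-trained teacher
student_logits = [2.5, 2.0, 1.0]  # from the untrained student

T = 2.0  # assumed temperature, applied to both networks in this sketch

soft_target = softmax_with_temperature(teacher_logits, T)      # teacher output
soft_prediction = softmax_with_temperature(student_logits, T)  # student output

# Assumption: cross-entropy between soft target and soft prediction,
# used here only as a way to quantify the gap between the two.
gap = -sum(p * math.log(q) for p, q in zip(soft_target, soft_prediction))

print("soft target:    ", [round(p, 3) for p in soft_target])
print("soft prediction:", [round(q, 3) for q in soft_prediction])
print("cross-entropy:  ", round(gap, 3))
```

Running the sketch with `T = 1.0` instead produces a sharper teacher distribution, which is one way to see why a higher temperature is used when the goal is to pass along information about the less likely classes.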
