Training the Student Network

Learn how to transfer the knowledge from the teacher to the student network.

Okay, so how do we transfer the dark knowledge from the teacher to the student? How is the student network trained, and how does it acquire knowledge from the teacher?

Note: Only the teacher network is pre-trained, and it is pre-trained with softmax temperature. The student network is not pre-trained.

As shown in the following figure, we feed the input sentence to both teacher and student networks and get the probability distribution as output. The teacher network is a pre-trained network, so the probability distribution returned by the teacher network will be our target. The output of the teacher network is called a soft target, and the prediction made by the student network is called a soft prediction.
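To make the terms concrete, here is a minimal sketch of how a soft target and a soft prediction could be produced from raw network outputs. The logit values, the temperature of 2.0, and the helper name `softmax_with_temperature` are illustrative assumptions, not values or code from this lesson; in practice the logits would come from the actual teacher and student networks.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input over three classes (made-up numbers).
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.0, 1.5, 0.2]
T = 2.0  # assumed temperature, shared by teacher and student

soft_target = softmax_with_temperature(teacher_logits, T)      # teacher output
soft_prediction = softmax_with_temperature(student_logits, T)  # student output

print("soft target:    ", [round(p, 3) for p in soft_target])
print("soft prediction:", [round(p, 3) for p in soft_prediction])
```

Both outputs are valid probability distributions, and raising `T` flattens them so that the smaller probabilities become more visible to the student.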

Teacher-student architecture

The distillation loss

Now, we compute the cross-entropy loss between the soft target and the soft prediction, and train the student network through backpropagation by minimizing this loss. The cross-entropy loss between the soft target and the soft prediction is also known as the distillation loss. We can also observe from the following figure that we keep the softmax temperature T ...
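The distillation loss described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the logits, the temperature of 2.0, the learning rate, and the function names are all made up for the example, and gradient descent is applied directly to the student's logits as a stand-in for backpropagation through a real student network.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(soft_target, soft_prediction, eps=1e-12):
    """Cross-entropy between the teacher's soft target and the student's soft prediction."""
    return -sum(p * math.log(q + eps) for p, q in zip(soft_target, soft_prediction))

T = 2.0                            # assumed temperature for both networks
teacher_logits = [4.0, 1.0, 0.5]   # hypothetical pre-trained teacher output
student_logits = [0.0, 0.0, 0.0]   # hypothetical untrained student output
learning_rate = 1.0                # assumed step size

soft_target = softmax_with_temperature(teacher_logits, T)
initial_loss = distillation_loss(
    soft_target, softmax_with_temperature(student_logits, T)
)

for _ in range(200):
    soft_prediction = softmax_with_temperature(student_logits, T)
    # Gradient of the cross-entropy with respect to each student logit
    # is (soft_prediction - soft_target) / T.
    grads = [(q - p) / T for p, q in zip(soft_target, soft_prediction)]
    student_logits = [z - learning_rate * g for z, g in zip(student_logits, grads)]

final_loss = distillation_loss(
    soft_target, softmax_with_temperature(student_logits, T)
)
print(f"loss before: {initial_loss:.4f}, loss after: {final_loss:.4f}")
```

As the loss falls, the student's soft prediction moves toward the teacher's soft target. The loss does not reach zero, because the cross-entropy is bounded below by the entropy of the soft target itself.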