Distillation of Embedding and Prediction Layer

Learn about the distillation of the embedding and prediction layers of TinyBERT.

Embedding layer distillation

In embedding layer distillation, we transfer knowledge from the embedding layer of the teacher to the embedding layer of the student. Let $E^S$ denote the embedding of the student and $E^T$ denote the embedding of the teacher. We train the network to perform embedding layer distillation by minimizing the mean squared error (MSE) between the student embedding $E^S$ and the teacher embedding $E^T$, as shown in the following:

$$L_{\text{embedding}} = \text{MSE}(E^S W_e, E^T)$$

Here, $W_e$ is a learnable matrix that projects the student embedding into the same space as the teacher embedding. This projection is needed because the student typically uses a smaller embedding dimension than the teacher, so the two embeddings cannot be compared directly.
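The following is a minimal PyTorch sketch of this loss. The dimensions (768 for the teacher, 312 for the student) follow the typical BERT-base/TinyBERT setup, and the random tensors are stand-ins for the actual embedding-layer outputs of the two models:

```python
import torch
import torch.nn as nn

# Assumed dimensions: the teacher (e.g., BERT-base) uses 768-dim embeddings,
# while the smaller student uses 312-dim embeddings (as in TinyBERT).
teacher_dim, student_dim = 768, 312
batch_size, seq_len = 8, 128

# Learnable projection W_e maps the student embedding into the
# teacher's embedding space so the two can be compared directly.
W_e = nn.Linear(student_dim, teacher_dim, bias=False)

mse = nn.MSELoss()

# Stand-ins for the embedding-layer outputs of the student and teacher.
E_student = torch.randn(batch_size, seq_len, student_dim)
E_teacher = torch.randn(batch_size, seq_len, teacher_dim)

# Embedding layer distillation loss: MSE(E^S W_e, E^T).
loss_embedding = mse(W_e(E_student), E_teacher)
loss_embedding.backward()  # gradients flow into W_e (and the student model)
```

In practice, `E_student` and `E_teacher` would come from forward passes of the student and teacher models on the same input batch, and this loss would be summed with the other TinyBERT distillation losses during training.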
