Summary: BERT Variants—Based on Knowledge Distillation
Let’s summarize what we have learned so far.
Key highlights
Summarized below are the main highlights of what we've learned in this chapter.
We started off by learning what knowledge distillation is and how it works.
We learned that knowledge distillation is a model compression technique in which a small model is trained to reproduce the behavior of a large pre-trained model. It is also referred to as teacher-student learning, where the large pre-trained model is the teacher and the small model is the student.
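To make the teacher-student idea concrete, here is a minimal sketch of a typical distillation objective in PyTorch: the student is trained to match the teacher's softened output distribution (via KL divergence with a temperature) while also fitting the ground-truth labels. The function name `distillation_loss` and the `temperature` and `alpha` hyperparameters are illustrative choices, not part of the lesson's code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Soft-target loss (student mimics the teacher) + hard-label loss."""
    # Soften both distributions with the temperature, then measure how far
    # the student's distribution is from the teacher's (KL divergence).
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    soft_loss = soft_loss * (temperature ** 2)  # rescale the soft-loss gradients

    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Weighted combination of the two objectives.
    return alpha * soft_loss + (1 - alpha) * hard_loss


# Toy usage: random logits for a batch of 4 examples over 3 classes.
student_logits = torch.randn(4, 3)
teacher_logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student_logits, teacher_logits, labels))
```

In practice, the teacher's logits are produced by running the frozen pre-trained model in evaluation mode, and only the student's parameters are updated with this combined loss.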