Distillation Techniques for Pre-training and Fine-tuning
Learn about performing distillation in the pre-training and fine-tuning stages.
In TinyBERT, we use a two-stage learning framework:
General distillation
Task-specific distillation
This two-stage learning framework enables distillation in both the pre-training and fine-tuning stages. Let's take a look at how each stage works in detail.
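To make the overall flow concrete, here is a minimal sketch of how the two stages chain together. The function names and arguments are hypothetical placeholders, not the actual TinyBERT training scripts.

```python
# Hypothetical outline of the two-stage framework (names are placeholders).

def general_distillation(teacher_bert, student, general_corpus):
    """Stage 1: distill the pre-trained teacher into the student on the
    unlabeled general corpus, producing a 'general TinyBERT'."""
    ...  # layer-wise distillation over the general dataset
    return student  # general TinyBERT


def task_specific_distillation(general_tinybert, fine_tuned_teacher, task_dataset):
    """Stage 2: distill a fine-tuned teacher into the general TinyBERT
    on the task-specific dataset."""
    ...  # distillation on the downstream task data
    return general_tinybert  # task-specific TinyBERT

# Overall pipeline (pre-training stage first, then fine-tuning stage):
# general_tinybert = general_distillation(bert_base, tinybert, general_corpus)
# task_tinybert = task_specific_distillation(general_tinybert, fine_tuned_bert, task_data)
```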
General distillation
General distillation is essentially the pre-training step. Here, we use the large pre-trained BERT (BERT-base) as the teacher and transfer its knowledge to the small student BERT (TinyBERT) by performing distillation. Note that we apply distillation across all the layers of the student.
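As one illustration of what a layer-wise distillation objective can look like, the sketch below matches a student layer's hidden states to a teacher layer's hidden states with a mean-squared-error loss, using a learned projection to bridge the different hidden sizes. The dimensions and the layer pairing are illustrative assumptions, not the full TinyBERT loss.

```python
import torch
import torch.nn as nn

# Minimal sketch of one layer-wise distillation loss: project the student's
# hidden states to the teacher's hidden size and match them with MSE.
d_student, d_teacher = 312, 768          # e.g., a small student vs. BERT-base
proj = nn.Linear(d_student, d_teacher)   # learned projection from student to teacher space

def layer_distillation_loss(student_hidden, teacher_hidden):
    """MSE between projected student hidden states and teacher hidden states.

    student_hidden: (batch, seq_len, d_student)
    teacher_hidden: (batch, seq_len, d_teacher)
    """
    return nn.functional.mse_loss(proj(student_hidden), teacher_hidden)

# Example with random tensors standing in for real model outputs.
s_h = torch.randn(2, 16, d_student)
t_h = torch.randn(2, 16, d_teacher)
loss = layer_distillation_loss(s_h, t_h)
```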
We know that the teacher BERT-base model is pre-trained on a general dataset (Wikipedia and the Toronto BookCorpus). So, while performing distillation, that is, while transferring knowledge from the teacher (BERT-base) to the student (TinyBERT), we use the same general dataset.
After distillation, our student BERT will contain the knowledge of the teacher, and we can call this pre-trained student BERT a general TinyBERT.
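The loop below is a hedged sketch of how general distillation might be run over the unlabeled general corpus. It assumes `teacher` and `student` return lists of per-layer hidden states, that `general_corpus_loader` yields batches of token IDs, and that `layer_loss` is something like the `layer_distillation_loss` sketched above; all of these names are placeholders, and only the student (plus its projection) is updated while the teacher stays frozen.

```python
import torch

# Hypothetical general-distillation loop; the teacher is frozen and only the
# student is trained on the unlabeled general corpus.
def general_distillation(teacher, student, general_corpus_loader,
                         layer_loss, optimizer, layer_map):
    teacher.eval()
    student.train()
    for batch in general_corpus_loader:            # batches of token IDs
        with torch.no_grad():
            t_hiddens = teacher(batch)             # teacher layer outputs (no gradients)
        s_hiddens = student(batch)                 # student layer outputs

        # Sum the layer-wise losses over the chosen student-to-teacher layer mapping,
        # e.g. layer_map = [(0, 2), (1, 5), (2, 8), (3, 11)].
        loss = sum(layer_loss(s_hiddens[s_idx], t_hiddens[t_idx])
                   for s_idx, t_idx in layer_map)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return student   # the pre-trained student: a "general TinyBERT"
```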