Training the Student BERT (DistilBERT)
Learn how to train the student BERT in DistilBERT and how DistilBERT differs from the BERT-base model.
We can train the student BERT with the same data we used for pre-training the teacher BERT (BERT-base). We know that the BERT-base model is pre-trained on English Wikipedia and the Toronto BookCorpus dataset, so we can use these same datasets to train the student BERT (the smaller BERT).
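As a rough illustration, these corpora can be pulled from the Hugging Face Hub with the datasets library. This is only a sketch: the dataset identifiers used here ("wikimedia/wikipedia" with the "20231101.en" snapshot and "bookcorpus") are assumptions about what is currently hosted on the Hub, not the exact dumps used to pre-train BERT-base.

```python
from datasets import load_dataset, concatenate_datasets

# English Wikipedia snapshot (identifier and config are assumptions about
# what is hosted on the Hub, not the exact dump used for BERT-base)
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")

# BookCorpus as hosted on the Hub, standing in for the Toronto BookCorpus
books = load_dataset("bookcorpus", split="train")

# Keep only the raw text column and merge both corpora into one training set
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])
corpus = concatenate_datasets([wiki, books])
print(corpus)
```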
Using training strategies from the RoBERTa model
We'll borrow a few training strategies from the RoBERTa model. Following RoBERTa, we train the student BERT with only the masked language modeling task, and during masked language modeling we use dynamic masking. We also use a large batch size in every iteration.
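A minimal sketch of this setup with the Hugging Face transformers library is shown below. Dynamic masking comes from DataCollatorForLanguageModeling, which picks a fresh set of tokens to mask every time a batch is drawn. The masking probability (15%), the loader batch size, and the tokenized_dataset variable (the tokenized Wikipedia + BookCorpus corpus from the previous step) are illustrative assumptions, not the exact DistilBERT hyperparameters.

```python
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Dynamic masking: the collator selects a random 15% of tokens to mask
# each time a batch is built, so the same sentence receives different
# masks across epochs (0.15 is the usual BERT masking probability).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# `tokenized_dataset` is assumed to be the tokenized corpus from above.
# A very large effective batch can be simulated by combining this loader
# batch size with gradient accumulation during the training loop.
loader = DataLoader(
    tokenized_dataset, batch_size=32, shuffle=True, collate_fn=collator
)
```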
Computing distillation loss
As shown in the following figure, we take a ...