Training the Student BERT (DistilBERT)

Learn how to train the student BERT in DistilBERT and how DistilBERT differs from the BERT-base model.

We can train the student BERT with the same data we used for pre-training the teacher BERT (BERT-base). We know that the BERT-base model is pre-trained on the English Wikipedia and Toronto BookCorpus datasets, so we can use these same datasets to train the student BERT (the small BERT).
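The following is a minimal sketch of assembling this training corpus, assuming the Hugging Face datasets library; the dataset identifiers ("wikipedia", "20220301.en", "bookcorpus") are assumptions and may differ across hub versions:

```python
from datasets import load_dataset, concatenate_datasets

# Load the same corpora used to pre-train BERT-base.
wikipedia = load_dataset("wikipedia", "20220301.en", split="train")
bookcorpus = load_dataset("bookcorpus", split="train")

# Keep only the raw text column and combine the two corpora
# into a single training corpus for the student BERT.
wikipedia = wikipedia.remove_columns(
    [col for col in wikipedia.column_names if col != "text"]
)
training_corpus = concatenate_datasets([wikipedia, bookcorpus])
```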

Using training strategies from the RoBERTa model

We'll borrow a few training strategies from the RoBERTa model. Following RoBERTa, we train the student BERT only on the masked language modeling task, we apply dynamic masking during masked language modeling, and we use a large batch size in every iteration.
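The sketch below illustrates dynamic masking, assuming the Hugging Face transformers library: DataCollatorForLanguageModeling re-samples the masked positions every time a batch is formed, so each pass over the data sees a different masking pattern. The 15% masking probability follows BERT's standard pre-training setup.

```python
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# The collator masks 15% of the tokens on the fly each time it is called,
# which gives us dynamic masking instead of a fixed, pre-computed mask.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

batch = data_collator([tokenizer("Paris is a beautiful city")])
print(batch["input_ids"])   # some tokens replaced by [MASK]
print(batch["labels"])      # original ids at masked positions, -100 elsewhere
```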

Computing distillation loss

As shown in the following figure, we take a masked sentence, feed it as input to both the teacher BERT (pre-trained BERT-base) and the student BERT, and obtain a probability distribution over the vocabulary as output from each. Next, we compute the distillation loss as the cross-entropy loss between the soft target and the soft prediction:
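The following is a minimal PyTorch sketch of this computation, assuming we already have the logits of both models at the masked positions. The soft target and soft prediction are obtained by applying a temperature softmax to the teacher and student logits; the function name and the temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Soft target: teacher logits passed through a temperature softmax.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    # Soft prediction: student logits passed through a temperature log-softmax.
    soft_predictions = F.log_softmax(student_logits / temperature, dim=-1)
    # Cross-entropy between the soft target and the soft prediction,
    # averaged over the positions in the batch.
    return -(soft_targets * soft_predictions).sum(dim=-1).mean()

# Example: a batch of 8 masked positions over a 30,522-token vocabulary.
teacher_logits = torch.randn(8, 30522)
student_logits = torch.randn(8, 30522)
loss = distillation_loss(teacher_logits, student_logits)
```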
