Knowledge distillation refers to transferring knowledge from a large entity to a smaller one. In deep learning terms, it is the process of transferring what a large model has learned to a smaller model.
State-of-the-art deep learning models continue to perform well on a wide variety of tasks. However, their computational requirements are a major obstacle to deploying them at scale. Knowledge distillation helps solve this problem by letting us build models that balance performance with real-world constraints: we train a large model on the dataset and then transfer its learning to a small model. The large model acts as the teacher model, and the small model acts as the student model.
Knowledge distillation comprises three components: knowledge, distillation, and architecture. Each component is important in the reliable knowledge transfer from one model to the other.
In deep learning, the knowledge is usually the weights and learned representations of the trained model. We can divide the knowledge into the following categories:
Response-based knowledge: This knowledge focuses on the final output of the model. While transferring it, the student is trained to mimic the outputs of the teacher's final layer.
Feature-based knowledge: This knowledge focuses on the feature representations of the model. While transferring it, the student is trained to mimic the activations of the teacher's hidden layers.
Relation-based knowledge: This knowledge focuses on the relationships between feature maps of the model. These can be captured through correlations, similarity matrices, or probability distributions (all three types are sketched in the example after this list).
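To make the three categories concrete, here is a minimal PyTorch sketch of one possible loss term for each. The random tensors, the temperature `T`, and the `pairwise_similarity` helper are illustrative assumptions, not the output of any specific library.

```python
import torch
import torch.nn.functional as F

# Hypothetical tensors standing in for teacher/student outputs on one batch.
teacher_logits = torch.randn(32, 10)      # final-layer outputs (response-based)
student_logits = torch.randn(32, 10)
teacher_features = torch.randn(32, 128)   # hidden-layer activations (feature-based)
student_features = torch.randn(32, 128)

# Response-based: match softened output distributions with KL divergence.
T = 4.0  # temperature, an assumed hyperparameter
response_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=1),
    F.softmax(teacher_logits / T, dim=1),
    reduction="batchmean",
) * (T * T)

# Feature-based: match intermediate representations directly.
feature_loss = F.mse_loss(student_features, teacher_features)

# Relation-based: match pairwise similarities between samples in a batch.
def pairwise_similarity(feats):
    feats = F.normalize(feats, dim=1)
    return feats @ feats.t()

relation_loss = F.mse_loss(
    pairwise_similarity(student_features),
    pairwise_similarity(teacher_features),
)
```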
The distillation methods can be divided into three categories as follows.
Offline distillation: In offline distillation, the teacher model is pre-trained and already contains the knowledge. It is transferred to the student model by comparing the outputs of the (frozen) teacher and the student within a standard training pipeline.
Online distillation: In online distillation, a sufficiently high-performing pre-trained teacher model is not available. Thus, the teacher and student models are trained simultaneously (a setup sketch follows this list).
Self-distillation: In self-distillation, the student model does not exist as a separate entity. Instead, the teacher model distills knowledge into a smaller version of itself. For instance, knowledge from the deeper sections of the network can be transferred to its shallow sections, which then act as the student model.
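In practice, the difference between offline and online distillation largely comes down to which parameters the optimizer updates. The sketch below illustrates this with placeholder models; the architectures, sizes, and learning rate are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Placeholder models; any teacher/student architectures could stand in here.
teacher = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))

# Offline distillation: the teacher is pre-trained and frozen;
# only the student's parameters are updated.
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)
offline_optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

# Online distillation: no strong pre-trained teacher exists, so both models
# stay trainable and a single optimizer updates them simultaneously.
for p in teacher.parameters():
    p.requires_grad_(True)
teacher.train()
online_optimizer = torch.optim.Adam(
    list(teacher.parameters()) + list(student.parameters()), lr=1e-3
)
```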
The student-teacher model is the standard architecture for the knowledge distillation pipelines. However, several ideas contribute toward making a better student-teacher architecture. These include:
Using an architecture as the teacher model and a shallower version of the same architecture as the student model (sketched in the example after this list)
Using a quantized version of the teacher model as the student model
Using neural architecture search for selecting the student model
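As an illustration of the first idea, the following sketch pairs a deeper convolutional teacher with a shallower student built from the same kind of blocks. The block structure, channel widths, and the 10-class output are arbitrary assumptions.

```python
import torch.nn as nn

# A reusable block so both models share the same architectural style.
def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())

# Deeper, wider network acting as the teacher.
teacher = nn.Sequential(
    conv_block(3, 64), conv_block(64, 128), conv_block(128, 256),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 10),
)

# Shallower, narrower version of the same design acting as the student.
student = nn.Sequential(
    conv_block(3, 32), conv_block(32, 64),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)
```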
Typically, deep learning models minimize the error on a dataset. In student-teacher architectures, however, the objective also includes minimizing the discrepancy between the two models; that is, the student is trained to resemble the teacher. Although there is no single fixed method, one approach is to feed the same data to both the student and teacher models and use two loss functions simultaneously. One loss minimizes the error on the dataset, while the other penalizes the difference between the student's and the teacher's outputs. This type of knowledge distillation uses online distillation to learn response-based knowledge, as shown in the sketch below.
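A minimal single-batch training step for this two-loss, online, response-based setup might look like the following. The linear models, the random batch, the temperature `T`, and the weighting factor `alpha` are all assumed placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder models and a random batch; real data and architectures
# would replace these.
teacher = nn.Linear(784, 10)
student = nn.Linear(784, 10)
x = torch.randn(32, 784)
labels = torch.randint(0, 10, (32,))

# Online distillation: one optimizer updates both models.
optimizer = torch.optim.Adam(
    list(teacher.parameters()) + list(student.parameters()), lr=1e-3
)
T, alpha = 4.0, 0.5  # temperature and loss weighting, both assumed

teacher_logits = teacher(x)
student_logits = student(x)

# Loss 1: both models minimize the usual error on the labeled data.
task_loss = F.cross_entropy(student_logits, labels) + F.cross_entropy(teacher_logits, labels)

# Loss 2: the student is pushed toward the teacher's softened outputs
# (response-based knowledge); detaching keeps this term from updating the teacher.
distill_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=1),
    F.softmax(teacher_logits.detach() / T, dim=1),
    reduction="batchmean",
) * (T * T)

loss = alpha * task_loss + (1 - alpha) * distill_loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```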
To sum up, in deep learning, knowledge distillation is the process of transferring knowledge from a large model to a small model. This helps create models that are less resource-intensive and memory-hungry without significantly compromising performance.