Knowledge distillation refers to transferring knowledge from a large entity to a smaller one. In deep learning terms, it is the process of transferring what a large model has learned to a smaller model.
State-of-the-art deep learning models continue to perform well on a wide variety of tasks. However, their computational requirements are a major obstacle to deploying them at scale. Knowledge distillation helps solve this problem by letting us build models that balance performance with real-world constraints: we train a large model on the dataset and then transfer its learning to a small model. The large model acts as the teacher model, and the small model acts as the student model.
Knowledge distillation comprises three components: knowledge, distillation, and architecture. Each component is important in the reliable knowledge transfer from one model to the other.
In deep learning, the knowledge is usually the weights and learned representations of the trained model. We can divide the knowledge into the following categories:
Response-based knowledge: This knowledge focuses on the final output of the model. While transferring it, the student is trained to mimic the outputs of the teacher's final layer.
Feature-based knowledge: This knowledge focuses on the feature representations of the model. While transferring it, the student is trained to mimic the activations of the teacher's hidden layers.
Relation-based knowledge: This knowledge focuses on the relationships between feature maps of the model. These can be captured through correlations, similarity matrices, or probability distributions (all three types are sketched in the example after this list).
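To make the three categories concrete, here is a minimal PyTorch sketch of one possible loss term for each. The random tensors, the temperature `T`, and the `pairwise_similarity` helper are illustrative assumptions, not the output of any specific library.

```python
import torch
import torch.nn.functional as F

# Hypothetical tensors standing in for teacher/student outputs on one batch.
teacher_logits = torch.randn(32, 10)      # final-layer outputs (response-based)
student_logits = torch.randn(32, 10)
teacher_features = torch.randn(32, 128)   # hidden-layer activations (feature-based)
student_features = torch.randn(32, 128)

# Response-based: match softened output distributions with KL divergence.
T = 4.0  # temperature, an assumed hyperparameter
response_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=1),
    F.softmax(teacher_logits / T, dim=1),
    reduction="batchmean",
) * (T * T)

# Feature-based: match intermediate representations directly.
feature_loss = F.mse_loss(student_features, teacher_features)

# Relation-based: match pairwise similarities between samples in a batch.
def pairwise_similarity(feats):
    feats = F.normalize(feats, dim=1)
    return feats @ feats.t()

relation_loss = F.mse_loss(
    pairwise_similarity(student_features),
    pairwise_similarity(teacher_features),
)
```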
The distillation methods can be divided into three categories as follows.
Offline distillation: In offline distillation, the teacher model is pre-trained and already contains the knowledge. It is transferred to the student model by comparing the outputs of the (frozen) teacher and the student within a standard training pipeline.
Online distillation: In online distillation, a sufficiently high-performing pre-trained teacher model is not available. Thus, the teacher and student models are trained simultaneously (a setup sketch follows this list).
Self-distillation: In self-distillation, the student model does not exist as a separate entity. Instead, the teacher model distills knowledge into a smaller version of itself. For instance, knowledge from the deeper sections of the network can be transferred to its shallow sections, which then act as the student model.
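In practice, the difference between offline and online distillation largely comes down to which parameters the optimizer updates. The sketch below illustrates this with placeholder models; the architectures, sizes, and learning rate are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Placeholder models; any teacher/student architectures could stand in here.
teacher = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))

# Offline distillation: the teacher is pre-trained and frozen;
# only the student's parameters are updated.
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)
offline_optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

# Online distillation: no strong pre-trained teacher exists, so both models
# stay trainable and a single optimizer updates them simultaneously.
for p in teacher.parameters():
    p.requires_grad_(True)
teacher.train()
online_optimizer = torch.optim.Adam(
    list(teacher.parameters()) + list(student.parameters()), lr=1e-3
)
```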
The student-teacher model is the standard architecture for the knowledge distillation pipelines. However, several ideas contribute toward making a better student-teacher architecture. These include:
Using an architecture as the teacher model and a shallower version of the same architecture as the student model (sketched in the example after this list)
Using a quantized version of the teacher model as the student model
Using neural architecture search for selecting the student model
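As an illustration of the first idea, the following sketch pairs a deeper convolutional teacher with a shallower student built from the same kind of blocks. The block structure, channel widths, and the 10-class output are arbitrary assumptions.

```python
import torch.nn as nn

# A reusable block so both models share the same architectural style.
def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())

# Deeper, wider network acting as the teacher.
teacher = nn.Sequential(
    conv_block(3, 64), conv_block(64, 128), conv_block(128, 256),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 10),
)

# Shallower, narrower version of the same design acting as the student.
student = nn.Sequential(
    conv_block(3, 32), conv_block(32, 64),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)
```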
Typically, deep learning models minimize the error on a dataset. In student-teacher architectures, however, the objective also includes minimizing the discrepancy between the two models; that is, the student is trained to resemble the teacher. Although there is no single fixed method, one approach is to feed the same data to both the student and teacher models and use two loss functions simultaneously. One loss minimizes the error on the dataset, while the other penalizes the difference between the student's and the teacher's outputs. This type of knowledge distillation uses online distillation to learn response-based knowledge, as shown in the sketch below.
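A minimal single-batch training step for this two-loss, online, response-based setup might look like the following. The linear models, the random batch, the temperature `T`, and the weighting factor `alpha` are all assumed placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder models and a random batch; real data and architectures
# would replace these.
teacher = nn.Linear(784, 10)
student = nn.Linear(784, 10)
x = torch.randn(32, 784)
labels = torch.randint(0, 10, (32,))

# Online distillation: one optimizer updates both models.
optimizer = torch.optim.Adam(
    list(teacher.parameters()) + list(student.parameters()), lr=1e-3
)
T, alpha = 4.0, 0.5  # temperature and loss weighting, both assumed

teacher_logits = teacher(x)
student_logits = student(x)

# Loss 1: both models minimize the usual error on the labeled data.
task_loss = F.cross_entropy(student_logits, labels) + F.cross_entropy(teacher_logits, labels)

# Loss 2: the student is pushed toward the teacher's softened outputs
# (response-based knowledge); detaching keeps this term from updating the teacher.
distill_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=1),
    F.softmax(teacher_logits.detach() / T, dim=1),
    reduction="batchmean",
) * (T * T)

loss = alpha * task_loss + (1 - alpha) * distill_loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```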
To sum up, in deep learning, knowledge distillation is the process of transferring knowledge from a large model to a small model. This helps create models that are less resource-intensive and memory-hungry without significantly compromising performance.