Distillation: The BYOL Algorithm

Learn about self-supervised learning via distillation and get an overview of the BYOL algorithm.

Distillation as similarity maximization

As shown in the figure below, distillation, in general, refers to transferring knowledge from a fixed (usually large) model, known as the teacher $f^{\text{teacher}}(\cdot)$, to a smaller one, known as the student $f^{\text{student}}(\cdot)$.

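To make the idea concrete, here is a minimal sketch of the classic distillation objective: the student is trained to match the teacher's temperature-softened output distribution. The names `distillation_loss`, `student_logits`, `teacher_logits`, and `temperature` are illustrative placeholders, not part of the original text.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    # Soften both output distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence from the (fixed) teacher distribution to the student's,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```
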
Distillation methods can also be seen as similarity maximization–based methods. Just like contrastive learning and clustering, distillation aims to prevent trivial solutions to $f(X) = f(\text{augment}(X))$. It does so by solving $f^{\text{student}}(X) = f^{\text{teacher}}(\text{augment}(X))$ ...
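
The sketch below illustrates this student-matches-teacher objective, assuming `f_student` and `f_teacher` are networks mapping a batch of inputs to embedding vectors and `augment` is a stochastic augmentation pipeline (all placeholder names). The teacher's output is treated as a fixed target via stop-gradient, which is what prevents the trivial collapsed solution.

```python
import torch
import torch.nn.functional as F

def similarity_loss(f_student, f_teacher, x, augment):
    # Student embeds the original view; teacher embeds an augmented view.
    student_out = f_student(x)
    with torch.no_grad():            # stop-gradient: the teacher acts as a fixed target
        teacher_out = f_teacher(augment(x))
    # Maximize cosine similarity between the two representations,
    # i.e., minimize the negative mean similarity over the batch.
    return -F.cosine_similarity(student_out, teacher_out, dim=-1).mean()
```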