Introduction to Masked Image Modeling
Learn a new self-supervised learning paradigm, Masked Image Modeling.
What is masked image modeling?
Inspired by NLP, masked image modeling (MIM) is a new self-supervised learning paradigm in which part of the input image is masked and the model learns to predict the masked signal. This approach achieves results competitive with approaches like contrastive learning.
Applying masked image modeling can be challenging for several reasons:
Pixels close to each other are highly correlated. As a result, an image can sometimes be reconstructed well enough simply by duplicating nearby pixels. This leads to trivial solutions and inefficient learning.
Pixel-level signals are very raw and carry only low-level information.
Signals in image data are also continuous, unlike text, where tokens are discrete.
Thus, masked image modeling must be designed carefully to avoid trivial solutions that exploit local pixel correlations.
The framework of masked image modeling
Masked image modeling aims to predict the original signals from a masked input. As illustrated below, the framework involves the following components:
Masking strategy: This component selects the area to mask and applies the mask to it. Masking is usually done at the image patch level rather than at the pixel level. Various strategies can be used, such as square-shaped masking, random masking, etc. The masked image is then used as the input to the neural network.
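As a concrete illustration of the random patch-level masking described above, the following sketch masks a fixed ratio of non-overlapping patches by zeroing them out (the patch size, mask ratio, and fill value are illustrative choices, not prescribed by any particular method):

```python
import numpy as np

def random_patch_mask(img, patch_size=16, mask_ratio=0.4, mask_value=0.0, seed=0):
    """Mask a random subset of non-overlapping patches in an image.

    img: (H, W, C) array; H and W are assumed divisible by patch_size.
    Returns the masked image and a boolean grid marking masked patches.
    """
    h, w = img.shape[0] // patch_size, img.shape[1] // patch_size
    rng = np.random.default_rng(seed)
    n_mask = int(h * w * mask_ratio)
    # Choose which patches to mask, uniformly at random.
    idx = rng.choice(h * w, size=n_mask, replace=False)
    grid = np.zeros(h * w, dtype=bool)
    grid[idx] = True
    grid = grid.reshape(h, w)

    masked = img.copy()
    for i in range(h):
        for j in range(w):
            if grid[i, j]:
                masked[i * patch_size:(i + 1) * patch_size,
                       j * patch_size:(j + 1) * patch_size] = mask_value
    return masked, grid
```

In practice, real implementations often replace masked patches with a learnable mask token rather than a constant value, but the patch-selection logic is the same.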
Encoder: This component is a neural network that takes the masked image as input and extracts latent representations useful for predicting the original signals at the masked areas. Transformer models such as the Vision Transformer and the Swin Transformer (discussed subsequently) are generally used as encoder architectures.
Prediction head: This component reconstructs the original signals at the masked regions of the input, given the encoder features.
Prediction target: This component defines the loss function computed on the prediction head's output. The loss can be a cross-entropy classification loss or a pixel regression loss. With pixel regression, the model predicts the raw pixel values of the masked regions of the input image.
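For the pixel-regression target just described, the loss is typically computed only over the masked patches, so the model is not rewarded for copying visible pixels. A minimal sketch, assuming a patch-level boolean mask like the one produced by a random masking strategy:

```python
import numpy as np

def masked_pixel_loss(pred, target, patch_mask, patch_size=16):
    """Mean squared error over masked patches only (pixel-regression target).

    pred, target: (H, W, C) arrays; patch_mask: (H//p, W//p) boolean grid
    marking which patches were masked. Unmasked pixels do not contribute.
    """
    # Expand the patch-level mask to pixel resolution.
    pix_mask = np.repeat(np.repeat(patch_mask, patch_size, axis=0),
                         patch_size, axis=1)
    pix_mask = pix_mask[..., None]  # broadcast over channels
    diff = (pred - target) ** 2
    # Average only over masked pixels (and channels).
    return (diff * pix_mask).sum() / (pix_mask.sum() * pred.shape[-1])
```

Restricting the loss to masked positions is a common design choice (e.g., in masked-autoencoder-style methods); some approaches instead compute the loss over the full image.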
Overview of vision transformers
Most approaches to masked image modeling use masking strategies that operate at the image patch level. Instead of masking individual pixels, they mask entire patches.
Patch embeddings
The first step is to represent the input image as a sequence of non-overlapping patches.
The next step is to project each of these patches (
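The patch-splitting and projection steps can be sketched as follows. The patch size, embedding dimension, and random projection matrix here are illustrative assumptions; in a real Vision Transformer the projection is a learned linear layer (often implemented as a strided convolution):

```python
import numpy as np

def patch_embed(img, patch_size=16, embed_dim=64, seed=0):
    """Split an image into non-overlapping patches and linearly project each.

    img: (H, W, C). Returns an array of shape (num_patches, embed_dim).
    """
    H, W, C = img.shape
    h, w = H // patch_size, W // patch_size
    # Reshape to (h, w, p, p, C), then flatten each patch to a p*p*C vector.
    patches = img.reshape(h, patch_size, w, patch_size, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(h * w, patch_size * patch_size * C)
    # Stand-in for the learned projection: a fixed random matrix.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((patch_size * patch_size * C, embed_dim)) * 0.02
    return patches @ proj
```

For a 64x64 RGB image with 16x16 patches, this yields 16 patch embeddings, one per patch, which form the token sequence fed to the transformer encoder.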