Simple Masked Image Modeling

Learn how to implement the Simple Masked Image Modeling (SimMIM) algorithm.

We'll cover the following...

Simple Masked Image Modeling (SimMIM) is a simple masked modeling framework that predicts the raw pixel values of randomly masked input image patches using a lightweight linear layer and a L1L_1 loss. The figure below illustrates the idea.

Masking strategy

SimMIM uses a patch-aligned random masking strategy where masking is randomly applied at a patch level (i.e., a patch is either fully visible or fully masked). By default, the algorithm uses a 32×3232\times 32 (N×NN \times N) patch size. Thus, given an image, XiX_i, we generate a random mask Mi{0,1}H×WM_i \in \{0,1\}^{H\times W} (HH and WW are the height and width of the image XiX_i). This 00 represents that the pixel/patch is masked and 11 represents that it’s not. The masked image MiXi ...