Predicting Relative Position of Patches

Learn to implement self-supervised learning via predicting relative patch positions.

Relative positioning of patches

Given an image $X_i$, this pretext task involves sampling a random pair of patches $(X_i^{p_1}, X_i^{p_2})$ in one of eight spatial configurations (shown in the figure below) and assigning a pseudo-label $P_i$ that denotes the position of patch $X_i^{p_2}$ relative to patch $X_i^{p_1}$. The neural network $f(\cdot)$ thus learns to predict $P_i$ given $(X_i^{p_1}, X_i^{p_2})$ as input, i.e., $f((X_i^{p_1}, X_i^{p_2})) = P_i$. Solving this task encourages the model to extract and understand the relative spatial arrangement of objects in an image, which is useful for recognition.
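The eight spatial configurations can be encoded as integer pseudo-labels for an 8-way classifier. A minimal sketch of one possible encoding (the row-major label ordering here is an assumption, not fixed by the text):

```python
# Map each of the 8 neighbor offsets (row, col) around the center patch
# to a pseudo-label P_i in {0, ..., 7}. The ordering is an assumed
# convention: top-left to bottom-right, row-major, skipping the center.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           (0, -1),           (0, 1),
           (1, -1),  (1, 0),  (1, 1)]

def pseudo_label(offset):
    """Return P_i for the position of patch X_i^{p_2} relative to X_i^{p_1}."""
    return OFFSETS.index(offset)
```

With this convention, `pseudo_label((-1, -1))` is `0` (top-left neighbor) and `pseudo_label((1, 1))` is `7` (bottom-right neighbor); the network's final layer would then output 8 logits, one per configuration.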

The figure below illustrates how a training example is generated. We first sample a center patch $X_i^{p_1}$ (shown in blue) uniformly within the image bounds. The second patch $X_i^{p_2}$ is then sampled randomly from one of the eight spatial configurations (shown in dotted red) relative to the center patch $X_i^{p_1}$.
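The sampling procedure above can be sketched as follows. This is a simplified illustration, not the paper's exact pipeline: the patch size, the grid-aligned (non-jittered) neighbor placement, and the offset-to-label ordering are all assumptions made for the sketch.

```python
import numpy as np

# Assumed patch size in pixels.
PATCH = 32

# Neighbor offsets (row, col) indexed by pseudo-label 0..7
# (assumed row-major ordering; repeated here so the snippet is self-contained).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           (0, -1),           (0, 1),
           (1, -1),  (1, 0),  (1, 1)]

def sample_pair(image, rng):
    """Sample (X_i^{p_1}, X_i^{p_2}, P_i) from an (H, W, C) image array."""
    h, w = image.shape[:2]
    # The center patch's top-left corner must leave a full patch of margin
    # on every side so that any of the 8 neighbors fits within the image.
    r = int(rng.integers(PATCH, h - 2 * PATCH + 1))
    c = int(rng.integers(PATCH, w - 2 * PATCH + 1))
    label = int(rng.integers(8))          # pick one of the 8 configurations
    dr, dc = OFFSETS[label]
    p1 = image[r:r + PATCH, c:c + PATCH]
    p2 = image[r + dr * PATCH:r + (dr + 1) * PATCH,
               c + dc * PATCH:c + (dc + 1) * PATCH]
    return p1, p2, label
```

A training set is built by applying `sample_pair` to each image; the patch pair becomes the network input and the label becomes the classification target.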
