Relative positioning of patches
Given an image Xi, this pretext task samples a random pair of patches (Xip1,Xip2) in one of the eight spatial configurations shown in the figure below and assigns a pseudo label Pi that denotes the position of patch Xip2 relative to patch Xip1. The neural network f(.) then learns to predict Pi given (Xip1,Xip2) as input, i.e., f((Xip1,Xip2))=Pi. Since there are eight possible configurations, this is an eight-way classification problem. Solving it encourages the model to capture the relative spatial arrangement of relevant objects in an image, which is useful for recognition.
The figure below illustrates how a training example is generated. We first sample a center patch Xip1 (shown in blue) uniformly at random within the image bounds. The second patch Xip2 is then sampled at random from one of the eight spatial configurations (shown in dotted red) relative to the center patch Xip1.
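The sampling procedure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not a reference implementation: the function name sample_patch_pair, the parameters patch_size and gap, the ordering of the eight labels, and the choice to restrict the center patch so that every neighbor fits inside the image are all assumptions made for the example.

```python
import numpy as np

# (row, col) offsets of the eight neighbors in a 3x3 grid around the center.
# The index into this list is used as the pseudo label (assumed ordering).
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def sample_patch_pair(image, patch_size=32, gap=4, rng=None):
    """Return (center_patch, neighbor_patch, label) for one training example.

    patch_size and gap are hypothetical parameters: the side length of each
    square patch and the spacing between adjacent patches, in pixels.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    stride = patch_size + gap  # distance between top-left corners of adjacent patches

    if min(height, width) < 2 * stride + patch_size:
        raise ValueError("image too small for a 3x3 grid of patches")

    # Top-left corner of the center patch, sampled uniformly from the region
    # that leaves room for a neighbor on every side (an assumption of this sketch).
    top = int(rng.integers(stride, height - stride - patch_size + 1))
    left = int(rng.integers(stride, width - stride - patch_size + 1))

    # Pick one of the eight configurations; its index is the pseudo label.
    label = int(rng.integers(len(NEIGHBOR_OFFSETS)))
    d_row, d_col = NEIGHBOR_OFFSETS[label]
    n_top = top + d_row * stride
    n_left = left + d_col * stride

    center = image[top:top + patch_size, left:left + patch_size]
    neighbor = image[n_top:n_top + patch_size, n_left:n_left + patch_size]
    return center, neighbor, label
```

Calling sample_patch_pair(image) on an array of shape (H, W, C) returns two patches of shape (patch_size, patch_size, C) and an integer label between 0 and 7, which can then serve as the input pair and the classification target for f(.).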