The CLIP Model

Learn about the CLIP model, a task-agnostic vision transformer.

Now, let’s go through the different task-agnostic models, starting with CLIP, another computer vision model.

Overview

Contrastive Language-Image Pre-Training (CLIP) follows the philosophy of transformers: it plugs sequences of data into its transformer-type layers. Instead of being fed text pairs, this time the model is fed text-image pairs. Once the data is tokenized, encoded, and embedded, CLIP, being task-agnostic, learns text-image pairs like any other sequence of data.
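To see what "tokenized, encoded, and embedded" looks like in practice, here is a minimal sketch that feeds a (caption, image) pair through CLIP. It assumes the Hugging Face Transformers implementation and an illustrative checkpoint name; the lesson's own code may use a different setup, and the image here is just a placeholder.

```python
# A minimal sketch, assuming the Hugging Face Transformers CLIP API;
# the checkpoint name and the blank image are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))   # placeholder image for the sketch
caption = "a photo of a cat"

# The processor tokenizes the caption and preprocesses the image into tensors,
# so both halves of the (caption, image) pair enter the model as sequences.
inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)

text_emb = model.get_text_features(
    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
)
image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Both embeddings land in the same joint space, where they can be compared.
print(text_emb.shape, image_emb.shape)
```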

The method is contrastive because the model learns by comparing pairs: the features of matching captions and images are pulled together, while mismatched pairs are pushed apart. This is similar to the method we humans use when solving magazine games where we have to find the differences, or contrasts, between two images.

Let’s first see the architecture of CLIP before looking into the code.

The basic architecture of CLIP

Contrastive means that the model learns how images and captions fit together through their differences and similarities. The images and their captions find their way toward each other through joint (text, image) pretraining. After pretraining, CLIP can learn new tasks.
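The pretraining objective can be sketched as a symmetric contrastive loss over the cosine similarities of a batch of pairs. The sketch below is conceptual, assuming image and text embeddings that the two encoders have already produced; the names, shapes, and temperature value are illustrative, not CLIP's exact training code.

```python
# A conceptual sketch of a CLIP-style contrastive objective.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities: entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs sit on the diagonal; pull them together, push mismatches apart.
    targets = torch.arange(image_emb.size(0))
    loss_images = F.cross_entropy(logits, targets)      # image -> caption direction
    loss_texts = F.cross_entropy(logits.t(), targets)   # caption -> image direction
    return (loss_images + loss_texts) / 2

# Example with random embeddings for a batch of 4 (caption, image) pairs.
loss = clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
print(loss)
```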

Like GPT models, CLIP is transferable because it can learn new visual concepts, such as action recognition in video sequences, without task-specific training. Natural-language captions lead to endless applications.
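This transferability shows up as zero-shot classification: candidate captions act as the labels, and the image is matched against them without any extra training. The sketch below again assumes the Hugging Face Transformers API with an illustrative checkpoint, placeholder image, and made-up captions.

```python
# A hedged sketch of zero-shot transfer with a CLIP checkpoint; the captions
# and the blank image are placeholders chosen only for illustration.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))
captions = ["a person riding a bicycle", "a person playing tennis", "an empty street"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability means the caption fits the image better.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```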

Like ViT, CLIP splits images into word-like patches (its image encoder can be a ViT). It jointly trains the text and image encoders on (caption, image) pairs to maximize the cosine similarity of the matching pairs, as shown in the figure below:
