Multimodal embeddings: What are they?

Multimodal embeddings are numerical representations of data that integrate information from multiple data types, such as text, images, video, and audio, into a common feature space. Because they capture the relationships and interactions across these data types, such unified representations enable modern search features, such as retrieving relevant content across modalities using an image, a text query, an audio clip, or a video. In this lesson, we'll work with images and text to keep things simple.
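To see what a shared feature space enables in practice, the short sketch below compares a text embedding against a few image embeddings using cosine similarity and ranks the images by relevance. The vectors here are random placeholders standing in for real model outputs; in practice they would come from a multimodal model such as CLIP.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors in the shared embedding space."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Placeholder embeddings: in practice these come from a multimodal model
# (e.g., CLIP) that maps both text and images into the same space.
rng = np.random.default_rng(0)
text_embedding = rng.normal(size=512)         # embedding of a text query
image_embeddings = rng.normal(size=(3, 512))  # embeddings of three candidate images

# Rank the images by how similar they are to the text query.
scores = [cosine_similarity(text_embedding, img) for img in image_embeddings]
ranking = np.argsort(scores)[::-1]
print("Images ranked by relevance to the query:", ranking.tolist())
```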

Multimodal embedding APIs

Several APIs provide pretrained models for generating multimodal embeddings, making it easier to integrate these capabilities into various applications. Some of the widely used multimodal embedding APIs are listed below.

  • OpenAI’s CLIP: OpenAI’s CLIP (Contrastive Language-Image Pretraining) is a model designed to understand and relate images and text. It typically generates 512-dimensional embeddings for both images and text. These embeddings align visual and textual information in a shared latent space, enabling tasks such as zero-shot classification and image-to-text retrieval (see the sketch after this list).

  • Google’s multimodal embeddings: Google’s multimodal embedding model generates 1408-dimensional vector embeddings from the inputs we provide, which can include a combination of images, text, and video data. These embedding vectors can be used for image and video search, classification, and ad or product recommendations given an image or a video.

  • Microsoft Azure’s AI Vision Image Analysis service: Azure provides a multimodal embedding model that generates 1024-dimensional vector embeddings for images (or video frames) and text. These embeddings support applications like digital asset management, security, forensic image retrieval, e-commerce, and fashion by enabling searches based on visual features and descriptions. However, the model is not designed for medical image analysis and should not be used for medical purposes.
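To make the CLIP entry above concrete, the following is a minimal zero-shot classification sketch using the openai/clip-vit-base-patch32 checkpoint from the Hugging Face transformers library (an assumed setup; the checkpoint name and the image path cat.jpg are illustrative). The model scores an image against a set of candidate text labels without any task-specific training.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP checkpoint (assumed here; other CLIP variants work similarly).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# "cat.jpg" is a placeholder path to a local image.
image = Image.open("cat.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```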

Note: We will use OpenAI's multimodal embedding API, CLIP, in our examples.

OpenAI’s multimodal embedding API: CLIP

CLIP is trained on a large image dataset with corresponding text descriptions. Each image and text pair is encoded into embeddings using two encoders: a text encoder and an image encoder, as shown in the illustration below.

  • The text encoder is a transformer model (e.g., a GPT-style transformer) that converts text descriptions into high-dimensional embeddings.

  • The image encoder is a vision model, typically a convolutional neural network (e.g., ResNet) or a Vision Transformer (ViT), that converts images into high-dimensional embeddings. A minimal sketch of both encoders in use follows this list.
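The snippet below is a rough sketch of both encoders in action, again using the Hugging Face transformers implementation of CLIP (an assumed setup; the checkpoint name and the image path puppy.jpg are illustrative). get_text_features runs the text encoder and get_image_features runs the image encoder, and for the base checkpoint both return 512-dimensional embeddings that can be compared directly with cosine similarity.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# "puppy.jpg" is a placeholder path to a local image.
image = Image.open("puppy.jpg")
text = "a puppy playing in the grass"

with torch.no_grad():
    # Image encoder: pixels -> 512-dimensional embedding.
    image_inputs = processor(images=image, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)

    # Text encoder: tokens -> 512-dimensional embedding.
    text_inputs = processor(text=[text], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

print(image_emb.shape, text_emb.shape)  # torch.Size([1, 512]) torch.Size([1, 512])

# Because both embeddings live in the same space, cosine similarity measures
# how well the caption describes the image.
similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
print(f"Image-text similarity: {similarity.item():.3f}")
```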
