Multimodal embeddings: What are they?

Multimodal embeddings are numerical representations that integrate information from multiple data types (text, images, video, and audio) into a common feature space. Because embeddings from different modalities live in the same space, they capture relationships and interactions across data types. This unified representation powers modern search features, such as retrieving relevant content in one modality using a query from another (for example, finding images with a text query). In this lesson, we'll work with images and text to keep things simple.
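To build intuition for what a common feature space buys us, here is a toy sketch (the vectors and their values are made up purely for illustration) that compares stand-in embeddings with cosine similarity, the same comparison a cross-modal search system would make on real embeddings.

```python
import numpy as np

# Toy 4-dimensional vectors standing in for embeddings of different modalities.
# Real models produce hundreds or thousands of dimensions, but the idea is the
# same: items with related meaning end up close together in the shared space.
text_embedding  = np.array([0.9, 0.1, 0.0, 0.2])   # e.g., the caption "a photo of a dog"
image_embedding = np.array([0.8, 0.2, 0.1, 0.1])   # e.g., a photo of a dog
audio_embedding = np.array([0.1, 0.9, 0.7, 0.0])   # e.g., an unrelated sound clip

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(text_embedding, image_embedding))  # high: same concept
print(cosine_similarity(text_embedding, audio_embedding))  # low: different concepts
```

Because all modalities are mapped into one space, the same similarity function works regardless of whether the query is text, an image, or audio.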

Multimodal embedding APIs

Several APIs provide pretrained models for generating multimodal embeddings, making it easier to integrate these capabilities into various applications. Some widely used multimodal embedding APIs are listed below.

  • OpenAI’s CLIP: OpenAI’s CLIP (Contrastive Language-Image Pretraining) is a model designed to understand and relate images and text. It typically generates 512-dimensional embeddings for both images and text. These embeddings align visual and textual information in a shared latent space, enabling tasks such as zero-shot classification and image-to-text retrieval (see the sketch after this list).

  • Google’s multimodal embeddings: Google’s multimodal embedding model generates 1408-dimensional vector embeddings from the inputs we provide, which can include a combination of images, text, and video data. These embedding vectors can be used for image and video search, classification, and ad or product recommendations given an image or a video.

  • Microsoft Azure’s AI Vision Image Analysis service: Azure provides a multimodal embedding model that generates 1024-dimensional vector embeddings for images (or video frames) and text. These embeddings support applications like digital asset management, security, forensic image retrieval, e-commerce, and fashion by enabling searches based on visual features and descriptions. However, the model is not designed for medical image analysis and should not be used for medical purposes.
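As a quick sketch of the first option, the snippet below loads the publicly released openai/clip-vit-base-patch32 checkpoint through Hugging Face's transformers library (a tooling assumption on our part, not necessarily what the lesson's later examples use; the file dog.jpg and the candidate captions are placeholders) and checks that both image and text embeddings are 512-dimensional.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the publicly released CLIP checkpoint (512-dimensional embeddings).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # placeholder local image file
texts = ["a photo of a dog", "a photo of a cat"]

# The processor tokenizes the text and preprocesses the image into tensors.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.image_embeds.shape)  # torch.Size([1, 512])
print(outputs.text_embeds.shape)   # torch.Size([2, 512])

# logits_per_image holds image-text similarity scores; a softmax over them
# gives zero-shot classification probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```

The similarity scores are simply scaled cosine similarities between the image embedding and each text embedding, which is what makes zero-shot classification over arbitrary captions possible.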

Note: We will use OpenAI's multimodal embedding API, CLIP, in our examples.

OpenAI’s multimodal embedding API: CLIP

CLIP is trained on a large dataset of images paired with text descriptions. Each image-text pair is encoded into embeddings using two encoders: a text encoder and an image encoder, as shown in the illustration below.

  • The text encoder is a transformer model (e.g., a GPT-style transformer) that converts text descriptions into high-dimensional embeddings.

  • The image encoder is a convolutional neural network (e.g., ResNet) or a vision transformer (ViT) that converts images into high-dimensional embeddings in the same shared space (see the sketch after this list).
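To make the two-encoder design concrete, the sketch below (again assuming the Hugging Face transformers wrapper around the public CLIP checkpoint, with dog.jpg as a placeholder image) calls the text encoder and the image encoder separately and compares their outputs in the shared space.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Text encoder: text description -> 512-dimensional embedding.
text_inputs = processor(text=["a photo of a dog"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_embedding = model.get_text_features(**text_inputs)

# Image encoder: image -> 512-dimensional embedding in the same space.
image_inputs = processor(images=Image.open("dog.jpg"), return_tensors="pt")  # placeholder file
with torch.no_grad():
    image_embedding = model.get_image_features(**image_inputs)

# Because both encoders map into the same latent space, cosine similarity
# between the two embeddings measures how well the text describes the image.
similarity = torch.nn.functional.cosine_similarity(text_embedding, image_embedding)
print(similarity.item())
```

Because CLIP is trained contrastively to pull matching image-text pairs together and push mismatched pairs apart, a higher similarity score indicates that the caption describes the image well.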
