The Need for Vector Databases for LLMs

Learn about the types of vector embeddings and how vector databases work for LLMs.

Types of embeddings

Embeddings can be applied to different types of raw data, including text, images, audio, and video. A short code sketch for each type follows the list.

  • Text embeddings: Imagine you have a box for words. Words with similar meanings, like "happy" and "joyful," get placed close together in the box. We use natural language processing (NLP) word embeddings such as word2vec, GloVe, and FastText to represent words as vectors.

  • Image embeddings: These embeddings are for pictures. For example, an image embedding for a cat might capture features like whiskers, pointed ears, and fur texture, while a dog embedding might capture floppy ears, a wagging tail, and a furry coat. These embeddings let algorithms find similar images based on shared features. We use convolutional neural networks (CNNs) to extract image features, which are then converted into vector embeddings.

  • Audio embeddings: These embeddings are for sounds. Similar sounds, like different instruments or animal noises, might be placed close together because they share acoustic qualities. Audio data can be represented as spectrogram images or as extracted features such as Mel-frequency cepstral coefficients (MFCCs).

  • Video embeddings: These embeddings are for moving pictures. Videos of people walking or animals running might be close together because they show similar movements. Video embeddings are typically derived from frame-level features extracted using CNNs or recurrent neural networks (RNNs).
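
As a quick illustration of the text case, the sketch below loads pretrained GloVe vectors through gensim's downloader (an assumption; any of the models named above would work) and checks that "happy" and "joyful" land close together:

```python
import gensim.downloader

# A minimal sketch of text embeddings, assuming the gensim library is
# installed; the first call downloads pretrained 50-d GloVe word vectors.
glove = gensim.downloader.load("glove-wiki-gigaword-50")

# Words with similar meanings sit close together in the vector space.
print(glove.similarity("happy", "joyful"))  # relatively high similarity
print(glove.similarity("happy", "table"))   # much lower similarity
print(glove["happy"].shape)                 # (50,) -- each word is a vector
```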
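
For images, one minimal sketch uses torchvision's pretrained ResNet-18 as the CNN feature extractor; the model choice and the random stand-in tensor are assumptions, not a prescribed pipeline:

```python
import torch
import torchvision

# A minimal sketch of CNN-based image embeddings, assuming torchvision is
# installed; ResNet-18 is just one convenient pretrained feature extractor.
cnn = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()  # drop the classifier head to expose raw features
cnn.eval()

image = torch.rand(1, 3, 224, 224)  # stand-in for a preprocessed photo of a cat
with torch.no_grad():
    embedding = cnn(image)  # shape (1, 512): a vector capturing visual features
print(embedding.shape)
```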
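
For audio, the MFCC route can be sketched with librosa, assuming it is installed; mean-pooling the coefficients over time is one simple way to get a fixed-length vector:

```python
import librosa
import numpy as np

# A minimal sketch of MFCC-based audio embeddings, assuming librosa is
# installed; librosa.example("trumpet") fetches a short bundled recording.
y, sr = librosa.load(librosa.example("trumpet"))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape (13, num_frames)

# Mean-pool over time to get one fixed-length vector for the whole clip.
audio_vec = np.mean(mfcc, axis=1)
print(audio_vec.shape)  # (13,)
```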
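
For video, a minimal approach consistent with the frame-level description above is to embed each frame with a CNN and pool over time; the random frames below stand in for a decoded clip, and mean-pooling is just one simple aggregation choice (an RNN over the frame sequence is a common alternative):

```python
import torch
import torchvision

# A minimal sketch of video embeddings from frame-level CNN features,
# reusing the same pretrained ResNet-18 extractor as the image sketch.
backbone = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose 512-d features per frame
backbone.eval()

frames = torch.rand(16, 3, 224, 224)  # stand-in for 16 decoded video frames
with torch.no_grad():
    frame_vecs = backbone(frames)   # (16, 512): one embedding per frame
video_vec = frame_vecs.mean(dim=0)  # (512,): one embedding for the whole clip
print(video_vec.shape)
```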

Unimodal vs. multimodal embeddings

Unimodal embeddings are generated from a single type of data, such as text, images, audio, or video. For example, a text embedding model is trained only on text data; it isn't trained on images, audio, or video.

Multimodal embeddings are generated from multiple types of data, such as text, audio, video, and images. For example, a multimodal model can be trained on different breeds of cats using a dataset containing textual descriptions, pictures, sounds, and videos of the cats' movements. Multimodal embeddings are beneficial because they integrate data from various sources, which leads to richer and more comprehensive representations. This improves performance in tasks such as image captioning and sentiment analysis, enables efficient cross-modal retrieval, and enhances robustness and generalization. Overall, they provide a more holistic understanding, leading to better user experiences in complex applications.
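
One widely used concrete example is CLIP, which maps text and images into a shared embedding space so cross-modal retrieval works directly. The sketch below assumes the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; the blank placeholder image stands in for a real cat photo:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# A minimal sketch of multimodal (text + image) embeddings with CLIP,
# assuming transformers is installed; the first call downloads the model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # stand-in for a real photo
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Text and images share one embedding space, so comparing the image
# against each caption is directly meaningful.
print(outputs.logits_per_image)  # image-vs-text similarity scores
```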
