Introduction to Vector Databases and Embeddings
Learn about vector databases, explore what embeddings are, and understand their connection with vector databases.
What is a vector database?
A vector database is a type of database that stores data such as text, images, audio, and video in a numerical form called a vector. A vector describes a data object along a number of dimensions. Each dimension captures some aspect of the object, so complex data can be represented in a structured numerical format. For example, a hand-crafted vector representing an image might include dimensions for pixel intensity, color channels (e.g., red, green, and blue), texture features, and spatial location. In vectors produced by machine learning models, individual dimensions usually don't have such a directly interpretable meaning.
Because every item is stored as a vector, a vector database can handle heterogeneous data types in a uniform way and process them efficiently with mathematical techniques such as distance and similarity calculations.
What are (vector) embeddings?
In general, a vector is simply an array of numbers, whatever those numbers represent. When we use vectors to capture meaningful information about the data, such as its structure, properties, semantic relationships, and context, we call them vector embeddings or simply embeddings.
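To make this concrete, the sketch below uses three hand-made 3-dimensional vectors as stand-ins for embeddings and compares them with cosine similarity. The values are invented for illustration and do not come from any real embedding model; real embeddings usually have hundreds or thousands of dimensions.

```python
import math

# Hypothetical toy embeddings: the numbers are made up for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

With these toy values, "cat" scores higher against "dog" than against "car", which is the kind of relationship a useful embedding is expected to preserve.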
Uses of embeddings in various domains
Vector embeddings have various applications in domains such as NLP, machine learning, computer vision, and data analysis. Some of them are listed below:
Natural language processing (NLP)
In NLP, embeddings are used to represent words, phrases, or sentences as dense numerical vectors. These embeddings capture semantic and syntactic relationships between words, allowing algorithms to better understand and process natural language data. Applications include sentiment analysis, language translation, named entity recognition, and text summarization.
Recommendation systems
In recommendation systems, embeddings represent users, items, or features. By encoding user preferences and item attributes into numerical vectors, recommendation algorithms can effectively suggest relevant items to users. Applications include product recommendations on e-commerce platforms, movie recommendations on streaming services, and friend or content recommendations on social media platforms.
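As a minimal sketch of this idea, the example below scores items for a user by taking the dot product of a user vector and each item vector. The three-dimensional preference space, the item names, and all numbers are hypothetical; real systems learn these vectors from interaction data.

```python
# Hypothetical preference space: (action, comedy, documentary).
user_vector = [0.9, 0.1, 0.3]

# Hypothetical item embeddings in the same space.
item_vectors = {
    "Space Battle": [0.95, 0.05, 0.0],
    "Laugh Factory": [0.1, 0.9, 0.0],
    "Ocean Life": [0.0, 0.1, 0.95],
}

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def recommend(user, items, k=2):
    """Return the names of the k items with the highest dot-product score."""
    ranked = sorted(items, key=lambda name: dot(user, items[name]), reverse=True)
    return ranked[:k]

recommendations = recommend(user_vector, item_vectors)
print(recommendations)
```

The dot product is only one possible scoring function; cosine similarity or a learned scoring model could be used in its place.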
Image processing
In computer vision, embeddings represent images or features as compact numerical vectors. These embeddings capture visual characteristics and semantic information of images, enabling algorithms to perform tasks such as image classification, object detection, image retrieval, and image similarity matching.
Graph analytics
In graph-based applications, embeddings represent nodes or edges in a graph structure as numerical vectors. These embeddings capture structural and relational information within the graph, facilitating tasks such as node classification, link prediction, and graph clustering. Applications include social network analysis, recommendation systems based on user interaction graphs, and knowledge graph embedding.
Anomaly detection
Embeddings can be used to represent normal and anomalous patterns in data. By encoding data points into numerical vectors, anomaly detection algorithms can identify deviations from normal behavior and detect unusual patterns or outliers in the data. Applications include fraud detection in financial transactions, intrusion detection in network security, and equipment failure prediction in industrial systems.
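One simple way to act on this idea, shown here only as a sketch, is to flag a vector as anomalous when it lies far from the centroid of known normal vectors. The 2-dimensional sample data and the threshold rule (twice the largest distance seen among normal points) are assumptions chosen for illustration, not a recommended setting.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical embeddings of transactions known to be normal.
normal = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.0]]
center = centroid(normal)

# Assumed rule: anything more than twice as far from the centroid
# as the farthest normal point is treated as an anomaly.
threshold = 2.0 * max(euclidean(v, center) for v in normal)

def is_anomaly(vector):
    return euclidean(vector, center) > threshold

print(is_anomaly([1.05, 0.95]))  # close to the normal cluster
print(is_anomaly([5.0, -3.0]))   # far from the normal cluster
```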
Generating embeddings from data
We have seen how embeddings are used in various fields. You might wonder where these embeddings come from and how we can generate them for our own data. A brief answer follows; we’ll explore the topic in detail in an upcoming lesson.
There are different embedding models tailored to different data types and needs. Embedding models are machine learning models that take input data objects such as documents and output embeddings, which are numerical representations of the essential features and characteristics of that data.
Note: We'll discuss popular embedding models that are used to generate word, sentence, document, image, video, and audio embeddings as we progress in the course.
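Whatever the model, the interface is the same: data goes in, and a fixed-length vector comes out. The sketch below imitates that interface with a hypothetical `toy_embed` function that hashes each word into one of a few buckets. It is not a learned model and captures only which words occur, not what they mean; the function name and the default of 8 dimensions are assumptions for illustration.

```python
import hashlib
import math

def toy_embed(text, dim=8):
    """Map text to a fixed-length vector by hashing each word into a bucket.

    A stand-in for a real embedding model: the output has unit length
    (or is all zeros for empty text) and ignores word order and meaning.
    """
    vector = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vector[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector

embedding = toy_embed("vector databases store embeddings")
print(len(embedding))
print(embedding)
```

A real embedding model replaces the hashing step with learned parameters, so that inputs with similar meaning end up with similar vectors.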
The relationship between vector databases and embeddings
The vector database serves as a repository for the embeddings, enabling efficient storage and retrieval. However, its significance extends beyond storage. Vector databases play a crucial role in supporting various data-driven applications by providing a platform for leveraging embeddings. Applications use the embeddings stored in the vector database for tasks such as similarity search, recommendation, and clustering.
Assume the application is “Find Similar Images.” How will it find similar images? The process generally consists of the following steps:
Generate embeddings: Images are fed into an image embedding model, which converts them into high-dimensional numerical vectors, capturing various features and characteristics of the images. Each image is represented by a unique embedding vector.
Store embeddings: These embedding vectors are stored in a vector database, which efficiently indexes and organizes them for quick retrieval.
Query processing: When a user submits a query image, it undergoes the same embedding process to generate a query embedding vector.
Similarity search: The vector database compares the query embedding vector with the stored embedding vectors using a similarity metric such as Euclidean distance or cosine similarity. Conceptually, the query is compared with every stored vector; in practice, the database's index narrows the comparison to a small set of likely candidates.
Return results: Images with embedding vectors most similar to the query embedding are retrieved from the database and returned to the user as similar images.
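The five steps above can be sketched in a few lines of Python. Everything here is a simplified assumption: `toy_image_embedding` is a hypothetical stand-in for an image embedding model (it builds a normalized histogram of grayscale intensities), a plain dictionary stands in for the vector database, the image names and pixel values are invented, and the search is brute force rather than index-based.

```python
import math

def toy_image_embedding(pixels, bins=4):
    """Hypothetical stand-in for an image embedding model.

    `pixels` is a non-empty flat list of grayscale values in 0..255;
    the result is a normalized intensity histogram with `bins` entries.
    """
    hist = [0.0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Steps 1 and 2: generate embeddings and store them.
images = {
    "night_sky.png": [5, 10, 20, 30, 15, 8],
    "dark_forest.png": [12, 25, 40, 70, 18, 9],
    "snow_field.png": [250, 240, 230, 245, 255, 235],
}
store = {name: toy_image_embedding(px) for name, px in images.items()}

# Step 3: the query image goes through the same embedding function.
query_embedding = toy_image_embedding([7, 14, 22, 35, 60, 11])

# Step 4: brute-force similarity search over every stored vector.
def search(query, store, k=2):
    ranked = sorted(
        store,
        key=lambda name: cosine_similarity(query, store[name]),
        reverse=True,
    )
    return ranked[:k]

# Step 5: return the most similar images.
results = search(query_embedding, store)
print(results)
```

With this toy data, the dark query image matches the two dark images and not the bright one. The brute-force loop in `search` is the part that a real vector database replaces with an index.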
Components of a vector database and how they work
Above, we defined a vector database, but we didn’t discuss its internal components. We first needed to introduce embeddings, the type of data that vector databases store and process. Now, we’ll discuss the different components of a vector database and how they work.
API and interface: Vector databases provide APIs and user interfaces for interacting with the database. The API allows users to store, retrieve, update, and delete data, as well as perform more specialized operations such as querying for similar vectors. The user interface provides graphical tools for managing database resources, configuring database settings and indexing parameters, and visualizing query results (not shown in the image).
Indexing layer: This layer organizes vector data for efficient search, especially for tasks like similarity searches. It creates specialized indexes or data structures designed to quickly retrieve vectors similar to a query vector. Various indexing techniques are employed, including:
Inverted file (IVF) indexes
Hierarchical Navigable Small World (HNSW) indexes
Tree-based structures (e.g., k-d trees, ball trees)
Hashing techniques (e.g., locality-sensitive hashing (LSH) indexes)
Graph-based structures (e.g., approximate nearest neighbor graphs)
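As one illustration of the hashing approach in the list above, the sketch below implements random-hyperplane locality-sensitive hashing, a common LSH scheme for cosine similarity. The function names, the choice of 4 hyperplanes, the fixed seed, and the sample vectors are assumptions made for this example; production indexes typically use several hash tables and tune these parameters.

```python
import random

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals; the seed makes the sketch repeatable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector is on its non-negative side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )

def build_index(vectors, hyperplanes):
    """Group vector ids into buckets keyed by their signature."""
    buckets = {}
    for key, vec in vectors.items():
        buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(key)
    return buckets

def candidates(query, buckets, hyperplanes):
    """Return only the ids that share the query's bucket."""
    return buckets.get(lsh_signature(query, hyperplanes), [])

hyperplanes = make_hyperplanes(num_planes=4, dim=3)
vectors = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 4.0, 6.0],     # same direction as "a"
    "c": [-1.0, -2.0, -3.0],  # opposite direction
}
buckets = build_index(vectors, hyperplanes)
print(candidates([0.5, 1.0, 1.5], buckets, hyperplanes))
```

Vectors that point in the same direction always share a bucket, while vectors at a wide angle are likely to be separated, so a query only needs to be compared against its own bucket. The trade-off is that nearby vectors can occasionally land in different buckets, which is why results are approximate.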
Storage layer: The storage layer in a vector database is responsible for efficiently storing vector data. It employs specialized data structures optimized for high-dimensional vector storage, supporting fast insertion, retrieval, and deletion operations. This layer supports data integrity and durability, typically through replication, and uses partitioning to distribute data as it grows. Additionally, it manages disk I/O operations to minimize latency and maximize throughput.
Query processing layer: The query processing layer handles incoming queries and orchestrates the search process to retrieve relevant vector data. It interacts with the indexing and storage layers to execute queries efficiently, filtering and ranking candidate vectors based on the query criteria before returning the results to the user or application.
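The filter-then-rank behavior described above can be sketched as follows. The record layout (a vector plus a metadata dictionary), the `run_query` function, and the sample data are hypothetical, and the sketch scans all records rather than consulting an index.

```python
import heapq
import math

# Hypothetical records: id -> (vector, metadata).
records = {
    "doc1": ([1.0, 0.0], {"lang": "en"}),
    "doc2": ([0.9, 0.1], {"lang": "de"}),
    "doc3": ([0.0, 1.0], {"lang": "en"}),
    "doc4": ([0.7, 0.7], {"lang": "en"}),
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def run_query(query_vector, records, metadata_filter, k):
    """Keep records matching every filter field, then return the top k by similarity."""
    # Filtering: drop records whose metadata does not match the query criteria.
    matching = (
        (rid, vec)
        for rid, (vec, meta) in records.items()
        if all(meta.get(field) == value for field, value in metadata_filter.items())
    )
    # Ranking: score the remaining candidates and keep the k best.
    scored = ((cosine_similarity(query_vector, vec), rid) for rid, vec in matching)
    return [(rid, score) for score, rid in heapq.nlargest(k, scored)]

hits = run_query([1.0, 0.0], records, {"lang": "en"}, k=2)
for rid, score in hits:
    print(rid, round(score, 3))
```

Here "doc2" is excluded by the metadata filter even though its vector is close to the query, which shows why filtering and ranking are handled together in this layer.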
Integrating these components allows a vector database to efficiently store, index, and retrieve high-dimensional vector data, enabling various data-driven applications such as similarity search, recommendation systems, and clustering.
Note: We'll explore various open-source vector databases, their designs, and prominent features in the next chapter.
What’s next?
Now that we have a basic introduction to vector databases and embeddings, the rest of this chapter explores some mathematical methods and algorithms for finding similarities between embeddings. Then, we’ll explore embedding models that generate embeddings for different data types (text, image, audio, and video), and we’ll generate embeddings for each type of data using a selected embedding model.