Introduction to Vector Databases
Get an introduction to vector databases, an essential component of LLM-based applications that enhances their capabilities and performance.
Vector databases are an essential component for efficiently handling and querying large-scale vector data, which can enhance the performance of applications leveraging LLMs. Before discussing vector databases, it's important to understand the type of data they store: vector data. You might wonder why we need to learn about vector data when LLMs generate data in the form of text, images, video, audio, etc. The answer is that all these data types are converted into vectorized forms, which offer several advantages, such as efficient data processing. Numerical data can be processed faster than other types, and mathematical operations can be performed on vector data. For instance, we can determine the similarity between data points by applying dot products or other similarity measures to their vectorized forms. Converting different types of data into a unified vector form allows LLMs to handle multimodal data effectively, enhancing their learning and inference capabilities.
Without any further ado, let’s see what vector data is and how it captures the essence of the actual data in numerical form.
What are vectors?
Vectors are arrays of numbers representing data points in a multi-dimensional space, capturing various features of the data. These vectors provide information about the magnitude (length) and direction of the data points. Visually, vectors can be represented as arrows in two or three dimensions, helping us understand their magnitude and direction. In a two-dimensional space, vectors are represented as 2-D points like (x, y) values, and in three-dimensional space, as (x, y, z) values. Similarly, in k-dimensional space, vectors are represented with k-valued points. Vectors support operations such as addition, subtraction, and scalar multiplication.
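To make these operations concrete, here is a minimal Python sketch (using NumPy, which this lesson does not necessarily use) showing addition, subtraction, scalar multiplication, magnitude, and a dot-product-based similarity between two 3-dimensional vectors:

```python
import numpy as np

# Two vectors in a 3-dimensional space
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Basic vector operations
print(a + b)   # addition              -> [5. 7. 9.]
print(a - b)   # subtraction           -> [-3. -3. -3.]
print(2 * a)   # scalar multiplication -> [2. 4. 6.]

# Magnitude (length) of a vector
print(np.linalg.norm(a))  # ~3.742

# Dot product as a simple similarity measure
print(np.dot(a, b))  # 32.0

# Cosine similarity: the dot product normalized by the vectors' magnitudes
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # ~0.975; values close to 1 mean the vectors point in similar directions
```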
For example, imagine pushing a toy car towards the door: an arrow can represent that push, where the arrow's direction shows where the car is heading and its length shows how fast it moves. This helps us understand the movement by looking at where the car is going and how quickly it gets there.
In the context of LLMs, a vector represents raw data (text, image, audio, video, etc.) as a point in a d-dimensional space. These vectors capture essential features and semantic information of the data. For text, vectors encapsulate the meaning and context of words, sentences, or documents. For other data types like images, video, and audio, vectors similarly encode relevant features, allowing the model to process and understand the data efficiently. This numerical representation of different types of data facilitates tasks such as similarity search, data clustering, and various other machine learning operations.
Vector embeddings
In machine learning, the numerical representations of different types of data have a specific name: vector embeddings. You might be curious about how these vector embeddings are generated. To generate embeddings, we use embedding models, which are machine learning models trained on large datasets. These models learn to capture and encode essential features and semantic relationships of the data, transforming it into a numerical form. The illustration below shows the transformation of raw data into vector embeddings using an embedding model.
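As a rough illustration of this transformation, the sketch below turns a couple of sentences into embeddings. It assumes the sentence-transformers package and the "all-MiniLM-L6-v2" model, neither of which is prescribed by this lesson; any embedding model would play the same role.

```python
from sentence_transformers import SentenceTransformer

# Load a pretrained embedding model (an assumption for this sketch).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Vector databases store embeddings.",
    "Embeddings capture the meaning of text.",
]

# Each sentence is transformed into a fixed-length numerical vector.
embeddings = model.encode(sentences)
print(embeddings.shape)  # e.g., (2, 384): two 384-dimensional embeddings
```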
Now that we have a basic understanding of vector embeddings, we can move on to discussing vector databases, which are specifically designed to efficiently store and manage these embeddings. To provide you with a clearer, hands-on understanding, we'll explore detailed examples and practical applications of embeddings in the next lesson.
Vector databases
Vector databases are designed to store multidimensional vectors efficiently and excel at operations on them, such as vector search, vector similarity calculations, and nearest neighbor searches. They handle unstructured data but store it in a structured manner optimized for vector operations, representing the data as numerical vectors. Using specialized indexing techniques and algorithms optimized for vector similarity search, they can identify the stored data most similar to a query.
Vector databases often employ specialized indexing techniques tailored to high-dimensional data, such as tree-based indexes or hash-based indexes. They are optimized for low-latency operations and high-throughput vector computations, enabling real-time analytics and interactive applications.
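To illustrate the operation these indexes accelerate, the sketch below performs a brute-force nearest-neighbor search over a handful of vectors with NumPy; the data is made up for illustration, and a real vector database replaces this linear scan with its specialized index structures.

```python
import numpy as np

# A tiny collection of stored embeddings (one per row) and one query embedding.
stored = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.8, 0.2, 0.1],
])
query = np.array([1.0, 0.0, 0.0])

# Cosine similarity between the query and every stored vector.
sims = stored @ query / (np.linalg.norm(stored, axis=1) * np.linalg.norm(query))

# Rank the stored vectors from most to least similar and keep the top 2.
top_k = np.argsort(-sims)[:2]
print(top_k, sims[top_k])  # indices of the two nearest neighbors and their scores
```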
To appreciate the need for vector databases, it's important to understand why traditional databases are not suitable for handling vector data.
Traditional databases
Traditional databases store structured data in tabular format. They often have a predefined schema, although many also support schema modifications and flexible data types like JSON. Each column of a table usually contains a specific type of data drawn from a set of predefined data types, and each row comprises a complete record of an entity (a person, place, event, object, etc.). These databases fall short of vector databases for massive multidimensional data because the storage space needed to represent each data point grows rapidly with the number of dimensions. They typically store data in table-like structures, and sometimes in hierarchies, mostly as text and numbers. Searching is based on keywords, so if a query uses different words than the data stored in the database, the results may not be valuable to users.
Traditional databases typically use B-tree or hash indexes optimized for one-dimensional or low-dimensional data. They may prioritize consistency and durability over low-latency performance.
Vector databases vs. traditional databases
Here is a brief comparison of both types of databases based on factors like performance, scalability, schema flexibility, support for complex data types, and security concerns:
| | Vector Databases | Traditional Databases |
| --- | --- | --- |
| Performance | Specifically designed for vector operations such as similarity search and nearest neighbor queries; perform well even when the data is very large. | Can handle large datasets but are not optimized for high-dimensional vector operations, which limits performance in use cases like similarity search. |
| Scalability | When more computing resources are needed because of growth in data or user queries, these databases scale horizontally by adding more servers or virtual machines alongside the existing ones. The database remains functional during this process. | When more computing resources are needed, these databases typically scale vertically, e.g., by adding more RAM, processors, or SSD storage to the already running machine(s). |
| Schema Flexibility | Some vector databases do not require a schema definition at the time of database creation, which is useful when data must be stored without a pre-designed schema. The schema can also be altered after the database is created. | These databases usually require a predefined schema to store data. Changing the schema after data has been stored is difficult and can cause significant downtime for large datasets. |
| Support for Complex Data Types | Vector databases are generally built to operate on numerical data only. Complex data like text, images, audio, or video must first be transformed into numerical vectors before operations such as similarity search, clustering, and classification can be performed. | Traditional databases also support complex data types like arrays and JSON objects, but they lack fast processing of high-dimensional data, which limits their performance on such workloads. |
| Security Concerns | This technology is newer, and its security methods and tools have not yet matured. Data can potentially be extracted through crafted similarity search queries, so these databases are not yet considered very secure. | Traditional databases have well-known security issues such as SQL injection, data breaches, and insider threats. Because these databases are a few decades old, mature security tools and methods exist to minimize these issues and control data theft. |
Try it yourself
In the widget below, you can see how we convert text into embeddings and then store them in ChromaDB. We will go through this code later in the course with a proper explanation. This is all set up for you to try!
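For reference, here is a minimal sketch of what such a workflow can look like, assuming the chromadb Python package with its default embedding function; the collection name, documents, and IDs are illustrative and not taken from the widget.

```python
import chromadb

# Create an in-memory ChromaDB client and a collection to hold the embeddings.
client = chromadb.Client()
collection = client.create_collection(name="docs")  # "docs" is an illustrative name

# ChromaDB embeds these documents automatically using its default embedding function.
collection.add(
    documents=[
        "Vector databases store embeddings efficiently.",
        "Traditional databases store structured tabular data.",
    ],
    ids=["doc1", "doc2"],
)

# Retrieve the document most similar to the query text.
results = collection.query(query_texts=["How are embeddings stored?"], n_results=1)
print(results["documents"])
```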
Note: Running the code may show a warning from pip version 24.1.2. In older versions, the warning does not pop up. If we reduce the pip version here, it may affect other installations. Because of this, we can proceed despite this warning.

Vector databases are optimized for efficiently handling large-scale, multi-dimensional data, such as vectors, text, and media, making them ideal for tasks like similarity search. They offer scalability and schema flexibility, allowing adjustments as data grows. However, since vector databases are still new, work is ongoing to improve their security and protect the data.