In today’s dynamic AI landscape, mastering advanced techniques like Retrieval-Augmented Generation (RAG) is crucial for Data Engineers, Data Scientists, and ML Engineers.
RAG combines information retrieval with natural language generation to make AI responses more accurate and context-aware. However, traditional RAG methods have limitations, which are now being addressed through an innovative technique: RAPTOR (Recursive Abstractive Processing for Tree Organized Retrieval).
Today, we'll explore the mechanics of RAG, its challenges, and its benefits. We'll then discuss how RAPTOR overcomes the challenges of traditional RAG and how to use LlamaIndex for a practical implementation.
RAG is a hybrid approach that combines the strengths of information retrieval and generative models to enhance the quality and relevance of generated text. Unlike traditional models that only use their training data, RAG utilizes additional context to give better responses.
RAG tackles the limitations of large language models (LLMs) by incorporating external knowledge into their generation process.
Here’s a breakdown of how it works:
Retrieval: The first step involves gathering relevant information. RAG acts like a skilled researcher when a user presents a question or prompt. It consults a vast knowledge base, which could be the entire internet, a company’s internal documents, or any other source of textual data. This retrieval process ensures that the LLM can access the most up-to-date and potentially relevant information to address the user’s query.
Augmentation: Imagine feeding the retrieved information directly to the LLM. It might be overwhelming! RAG employs various augmentation techniques to make this knowledge more digestible. These techniques can involve summarizing the key points of the retrieved passages or encoding them in a way the LLM can understand efficiently. This augmentation step improves the raw information so the LLM can use it better.
Generation: The LLM generates the response with its inherent understanding of language and the augmented knowledge from the retrieval stage. This response can take various forms depending on the user’s intent. It could be a direct answer to a question, a creative text format inspired by a prompt, or any other kind of textual output. By combining its language skills with augmented knowledge, the LLM aims to deliver a response that is not just creative but also grounded in factual accuracy.
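To make these three stages concrete, here is a minimal, illustrative sketch of a retrieve-augment-generate loop. The retriever.search() and llm.complete() helpers are placeholders for whatever search index and LLM client you use, not a specific library's API.

```python
def answer_with_rag(question, retriever, llm, top_k=3):
    """Retrieve supporting passages, fold them into the prompt, and let the LLM generate a grounded answer."""
    passages = retriever.search(question, top_k=top_k)   # Retrieval: pull relevant chunks
    context = "\n\n".join(p.text for p in passages)      # Augmentation: format the evidence for the LLM
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.complete(prompt)                          # Generation: response grounded in the retrieved context
```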
RAG offers several advantages over traditional generative models:
Enhanced accuracy and relevance: RAG incorporates external information during generation, leading to more accurate and contextually relevant responses. This reduces the risk of factual errors or irrelevant information commonly found in models trained on static data.
Improved knowledge coverage: Unlike models limited to their training data, RAG can access and leverage up-to-date information from external sources. This expands the model’s knowledge base and ensures responses reflect current information.
Reduced hallucinations: Generative models can sometimes generate believable but incorrect information (“hallucinations”). RAG, which uses real-world data, helps reduce this problem by promoting factual responses.
Increased adaptability: RAG models can be tailored to specific domains by incorporating relevant knowledge bases and retrieval techniques. This allows them to excel in areas like customer support (company policies) or legal research (case law).
Enhanced user trust: RAG’s ability to cite sources builds user trust by demonstrating transparency and accountability. Users can verify the information and dive deeper if desired.
Cost-effective development: RAG leverages pretrained generative models like GPT-3, GPT-4, Gemini, or Claude, reducing development costs compared to building a model from scratch. Additionally, the focus on retrieval allows for efficient updates by incorporating new information sources.
RAG is a powerful approach, but it faces challenges in certain situations:
Context deficiency with long documents: Dividing long documents into uniform chunks for retrieval (a common practice) disrupts information flow and makes it challenging for the LLM to grasp the overarching context. Essential relationships between concepts spread across chunks might be overlooked, leading to inaccurate or incomplete responses.
Flat retrieval structure: In standard RAG, all retrieved information is treated equally when generating responses. This method doesn’t recognize that important information could be buried deep within the documents. As a result, the LLM may struggle to prioritize and use that information effectively.
Limited reasoning and fact-checking: While RAG can access external information, its ability to reason over that information or perform robust fact-checking can be limited. This can lead to outputs that combine factual elements with inconsistencies or illogical connections.
Bias and fairness: The quality and bias inherent in the retrieved documents can be reflected in the RAG output.
Interpretability and explainability: Understanding how RAG generates its outputs can be difficult because of the complex interaction between retrieving information and generating responses. This lack of interpretability can make it harder to debug and build trust in the RAG system.
RAPTOR (Recursive Abstractive Processing for Tree Organized Retrieval) addresses these challenges by recursively clustering and summarizing text chunks into a hierarchical tree, so retrieval can draw on both fine-grained details and high-level summaries.
Here’s a breakdown of the RAPTOR algorithm step-by-step:
The document is segmented into smaller units like sentences or paragraphs. These units are then converted into dense vector embeddings, numerical representations capturing the document’s semantic meaning. This allows for efficient similarity comparisons during retrieval.
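For instance, with LlamaIndex (used later in this blog), the splitting step might look like the following sketch; the 100-token chunk size is only an example value.

```python
from llama_index.core.node_parser import SentenceSplitter

# Split a document into ~100-token chunks; each chunk later becomes a leaf node with its own embedding.
splitter = SentenceSplitter(chunk_size=100, chunk_overlap=0)
chunks = splitter.split_text(
    "RAPTOR recursively clusters and summarizes text chunks into a tree. "
    "Leaf nodes hold the original chunks, while higher levels hold summaries."
)
print(len(chunks), chunks[0])
```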
RAPTOR is specifically designed to work with textual data. This means it is best suited for processing and analyzing information presented in written format. Keep this in mind as you explore the capabilities of RAPTOR.
This iterative core process refines the document representation:
Clustering: A clustering algorithm, typically based on Gaussian Mixture Models (GMMs), groups similar text chunks together. This helps organize related information for better summarization.
Model-based summarization: Each cluster is sent to an LLM like GPT-3. The LLM generates a concise and informative summary of the text within the cluster.
Re-embedding: The summaries created by the LLM are then converted back into numerical representations suitable for further processing.
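As a rough illustration of one such round (not the official RAPTOR implementation), the sketch below clusters chunk embeddings with a Gaussian Mixture Model, summarizes each cluster, and re-embeds the summaries. The embed() and summarize() helpers are placeholders you would replace with a real embedding model and LLM call.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def embed(texts):
    # Placeholder: swap in a real embedding model (e.g., an OpenAI or SBERT embedder).
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def summarize(texts):
    # Placeholder: swap in an LLM call that condenses the cluster's combined text.
    return " ".join(texts)[:200]

def raptor_round(chunks, n_clusters=2):
    """One clustering/summarization/re-embedding pass: the summaries become the next tree level."""
    embeddings = embed(chunks)
    labels = GaussianMixture(n_components=n_clusters, random_state=0).fit_predict(embeddings)
    summaries = [summarize([c for c, l in zip(chunks, labels) if l == k]) for k in range(n_clusters)]
    return summaries, embed(summaries)
```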
After multiple rounds of clustering and summarization (controlled recursion depth), a hierarchical tree is built:
Leaf nodes: Original text chunks form the base of the tree.
Summary nodes: As you move up the tree, each node represents a concise summary of its children, capturing the essence of the sub-document it represents.
Hierarchical embeddings: Each node in the tree can also be associated with its own vector embedding, capturing the summarized meaning at that level.
This multi-layered representation, with both textual summaries and vector embeddings, allows for efficient retrieval at various levels of detail.
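One way to picture this structure in code (an assumed illustration, not the library's actual classes) is a simple node type that stores text, an embedding, and child links:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RaptorNode:
    text: str                                   # original chunk (leaf) or LLM summary (internal node)
    embedding: List[float]                      # vector used for similarity search at this level
    children: List["RaptorNode"] = field(default_factory=list)  # empty for leaf nodes
```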
Given a query, RAPTOR employs two primary retrieval mechanisms for navigating the tree and retrieving relevant information:
Tree traversal retrieval: This approach systematically explores the tree structure, starting from the root node and progressing down the branches.
Collapsed tree retrieval: This simplified approach views the tree as a single layer, directly comparing the query embedding to the vector embeddings of all leaf nodes (original text chunks) and summary nodes. This is suitable for factual, keyword-based queries where specific details are needed.
RAPTOR’s ability to choose the appropriate retrieval mechanism based on query complexity and utilize both textual summaries and vector embeddings empowers it to retrieve information at the optimal level of abstraction, satisfying diverse query needs.
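As an illustration of the collapsed-tree idea (a sketch, not RAPTOR's exact code), retrieval reduces to scoring the query embedding against every node embedding, leaves and summaries alike, and keeping the top matches:

```python
import numpy as np

def collapsed_tree_retrieve(query_vec, node_vecs, node_texts, top_k=2):
    """Rank all tree nodes (leaf chunks and summaries) by cosine similarity to the query embedding."""
    node_vecs = np.asarray(node_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    sims = node_vecs @ query_vec / (
        np.linalg.norm(node_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-10
    )
    best = np.argsort(-sims)[:top_k]
    return [(node_texts[i], float(sims[i])) for i in best]
```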
LlamaIndex is a powerful toolkit designed to enhance the capabilities of LLMs by enabling efficient data retrieval and manipulation from various sources. It allows LLMs to perform tasks such as question answering, text summarization, and knowledge base construction more accurately and effectively.
Here are some key features of LlamaIndex:
Data integration: Connects LLMs with diverse data sources like databases, APIs, and files.
Efficient retrieval: Optimizes data retrieval processes to ensure quick access to relevant information.
Custom indexing: Supports the creation of custom indexes tailored to specific tasks and datasets.
Scalability: Handles large volumes of data, making it suitable for extensive applications.
Flexible querying: Allows for complex queries to enhance LLMs’ understanding and response generation.
Ease of use: Provides user-friendly interfaces and tools for seamless integration with existing systems.
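To ground these features, here is a minimal LlamaIndex example following its standard quickstart pattern; the ./data directory and the question are placeholders.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()   # data integration: load local files
index = VectorStoreIndex.from_documents(documents)        # indexing: embed and store the documents
query_engine = index.as_query_engine()                    # flexible querying over the index
print(query_engine.query("What is this document about?"))
```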
This section explains the implementation of RAPTOR using LlamaIndex. The steps are as follows:
Setting up our environment and configuring the necessary parameters is crucial before diving into the implementation. This step ensures that we have all the required tools and settings for a smooth implementation process.
Let’s start by creating a configuration file to store sensitive information and adjustable parameters:
```yaml
openai_api_key: "your_api_key_here"
models:
  embedding: "text-embedding-3-small"
  llm: "gpt-3.5-turbo"
chunk_size: 400
chunk_overlap: 50
similarity_top_k: 2
mode: "tree_traversal"
temperature: 0.1
```
Code explanation:
Line 1: Specifies the API key for accessing OpenAI’s services.
Lines 2–4: Define the models used: "text-embedding-3-small" for embeddings and "gpt-3.5-turbo" for language tasks.
Line 5: Sets the size of text chunks to 400 tokens for processing.
Line 6: Indicates an overlap of 50 tokens between chunks to maintain context.
Line 7: Specifies retrieving the top 2 most similar chunks during searches.
Line 8: Uses a tree traversal method for navigating and processing data.
Line 9: Sets the language model’s randomness level, with a low value for more deterministic responses.
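Note that mode selects the retrieval strategy discussed earlier. LlamaIndex's RAPTOR pack also supports a "collapsed" mode, so switching strategies should be a one-line config change (treat the exact value as an assumption to verify against the pack's documentation):

```yaml
# Assumed alternative: compare the query against all nodes in a single flattened layer
mode: "collapsed"
```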
Now, let’s implement the setup code:
```python
import yaml
import os
import logging
from typing import Dict, Any

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def load_config(config_path: str = 'config.yaml') -> Dict[str, Any]:
    """Load configuration from YAML file."""
    try:
        with open(config_path, 'r') as file:
            config = yaml.safe_load(file)
        return config
    except FileNotFoundError:
        logger.error(f"Configuration file not found: {config_path}")
        raise
    except yaml.YAMLError as e:
        logger.error(f"Error parsing YAML configuration: {e}")
        raise

# Load configuration
config = load_config("config.yaml")

# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = config['openai_api_key']

# Install required packages
!pip install -q llama-index llama-index-packs-raptor llama-index-vector-stores-chroma

# Download the RAPTOR paper
!wget -q https://arxiv.org/pdf/2401.18059.pdf -O ./raptor_paper.pdf

logger.info("Setup completed successfully.")
```
Code explanation:
Lines 1–4: Import necessary libraries for the program to function. yaml is imported to read YAML configuration files, os to interact with the operating system (setting environment variables), logging for structured logging messages, and Dict, Any from typing for type hints.
Lines 7–8: Set up basic logging configuration. logging.basicConfig() configures logging to display messages with an INFO level or higher, formatted with a timestamp, logging level, and message content. logger = logging.getLogger(__name__) creates a logger object specific to the current module.
Lines 10–21: Define the load_config function, which loads configuration from a YAML file. The function attempts to open and parse the YAML file specified by config_path. It uses a try...except block to handle FileNotFoundError if the file is not found and yaml.YAMLError for errors during parsing. Successful loading returns the configuration as a dictionary (Dict[str, Any]).
Line 24: Calls the load_config function with a specific path to load the configuration into the config variable.
Line 27: Sets the OpenAI API key by retrieving openai_api_key from the config dictionary. This sets an environment variable needed for interactions with OpenAI services.
Line 30: Silently installs the required Python packages using pip install -q within a Jupyter Notebook environment.
Line 33: Silently downloads a PDF file from the specified URL and saves it locally as raptor_paper.pdf.
Line 35: Logs an informational message using the configured logger indicating that the setup process has been completed without errors. This helps track the procedure’s progress and status.
We’ll load the document and set up the vector store in this step. This is crucial in preparing our data for efficient retrieval and processing.
```python
import nest_asyncio
from llama_index.core import SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Apply nest_asyncio to allow asynchronous operations in Jupyter notebooks
nest_asyncio.apply()

def load_document(file_path: str):
    """Load document from file."""
    documents = SimpleDirectoryReader(input_files=[file_path]).load_data()
    logger.info(f"Document loaded successfully: {file_path}")
    return documents

def setup_vector_store(db_path: str, collection_name: str):
    """Set up ChromaDB vector store."""
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(collection_name)
    vector_store = ChromaVectorStore(chroma_collection=collection)
    logger.info(f"Vector store set up successfully: {collection_name}")
    return vector_store

# Load document
documents = load_document("./raptor_paper.pdf")

# Setup vector store
vector_store = setup_vector_store("./raptor_paper_db", "raptor")
```
Code explanation:
Lines 1–4: Import libraries essential for document processing and vector storage. nest_asyncio is imported to enable asynchronous operations in Jupyter Notebooks, SimpleDirectoryReader from llama_index.core is used to read documents from a directory, ChromaVectorStore from llama_index.vector_stores.chroma facilitates storing and retrieving document vectors via ChromaDB, and chromadb provides functionality for interacting with ChromaDB.
Line 7: Applies the nest_asyncio library to enable asynchronous operations in the current Jupyter Notebook environment. This setup allows concurrent task execution without blocking.
Lines 9–13: Define the load_document function, which loads a document from a specified file path. It uses SimpleDirectoryReader to read the document specified by file_path, logs a success message using the logger (logger.info()), and returns the loaded document (documents).
Lines 15–21: Define the setup_vector_store function, which sets up a ChromaDB vector store. It creates a PersistentClient object from chromadb to interact with the database located at db_path, then retrieves or creates a collection named collection_name using client.get_or_create_collection(). A ChromaVectorStore object is instantiated using the obtained collection, a success message is logged, and the created vector_store is returned.
Line 24: Invokes the load_document function with the path ./raptor_paper.pdf to load the RAPTOR paper's content into the documents variable.
Line 27: Calls setup_vector_store with database path ./raptor_paper_db and collection name raptor to initialize a ChromaDB vector store named "raptor". The resulting vector_store object is set up for storing and retrieving document vectors in ChromaDB.
Now that we have loaded our document and vector store, we’ll configure the RAPTOR pack. This step involves setting up the core components of the RAPTOR system.
```python
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.packs.raptor import RaptorPack

def create_raptor_pack(documents, vector_store, config):
    """Create and configure RAPTOR pack."""
    pack = RaptorPack(
        documents,
        embed_model=OpenAIEmbedding(model=config['models']['embedding']),
        llm=OpenAI(model=config['models']['llm'], temperature=config['temperature']),
        vector_store=vector_store,
        similarity_top_k=config['similarity_top_k'],
        mode=config['mode'],
        transformations=[SentenceSplitter(chunk_size=config['chunk_size'], chunk_overlap=config['chunk_overlap'])],
    )
    logger.info("RAPTOR pack created successfully.")
    return pack

# Create RAPTOR pack
raptor_pack = create_raptor_pack(documents, vector_store, config)
```
Code explanation:
Lines 1–4: Import components necessary for building and configuring a RAPTOR pack. SentenceSplitter from llama_index.core.node_parser facilitates document segmentation into sentences, OpenAI from llama_index.llms.openai provides an interface for using OpenAI's LLMs, OpenAIEmbedding from llama_index.embeddings.openai generates document embeddings using OpenAI's models, and RaptorPack from llama_index.packs.raptor represents a preconfigured workflow for RAPTOR.
Lines 6–18: Define the create_raptor_pack function. It takes three arguments:
documents: Represents the loaded documents intended for processing within the RAPTOR pack.
vector_store: Refers to the ChromaDB vector store object previously created, used for storing and retrieving document vectors.
config: A dictionary containing various configuration settings for customizing the RAPTOR pack.
Within this function:
RaptorPack(...) initializes a RaptorPack object using the following parameters:
| Parameter | Description |
| --- | --- |
| documents | Loaded documents for processing. |
| embed_model | Specifies the embedding model to use based on the configuration. |
| llm | Sets up the OpenAI LLM model with specified parameters like model name and temperature. |
| vector_store | Assigns the ChromaDB vector store object for managing document vectors. |
| similarity_top_k | Determines the number of most similar documents considered during retrieval based on the configuration. |
| mode | Defines the operational mode of the RAPTOR pack, influencing its behavior. |
| transformations | Specifies transformations to apply to documents, here using SentenceSplitter with the configured chunk size and overlap. |
logger.info("RAPTOR pack created successfully.")
: Logs an informational message confirming the successful creation of the RAPTOR pack.
return pack
: Returns the initialized RaptorPack object, ready for further use in specific tasks as defined by the configured mode.
Line 21: Calls the create_raptor_pack
function with documents
, vector_store
, and config
as arguments, resulting in instantiating a specific RAPTOR pack configured according to the provided documents, vector store, and configuration settings. This raptor_pack
instance is now ready for performing tasks such as document retrieval and processing within the specified environment.
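Before wiring up a full query engine, you can also retrieve nodes directly from the pack. The snippet below follows the RAPTOR pack's documented example usage; treat the exact return type as something to verify in your environment.

```python
# Retrieve nodes for a query straight from the pack (collapsed-tree retrieval in this call).
nodes = raptor_pack.run("What baselines is RAPTOR compared against?", mode="collapsed")
print(len(nodes))
if nodes:
    print(nodes[0].text[:200])
```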
In this final section, we’ll set up the retriever and query engine, which will allow us to perform queries on our processed document.
```python
from llama_index.packs.raptor import RaptorRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from typing import List, Tuple

def create_raptor_retriever(vector_store, config):
    """Create RAPTOR retriever."""
    retriever = RaptorRetriever(
        [],
        embed_model=OpenAIEmbedding(model=config['models']['embedding']),
        llm=OpenAI(model=config['models']['llm'], temperature=config['temperature']),
        vector_store=vector_store,
        similarity_top_k=config['similarity_top_k'],
        mode=config['mode'],
    )
    logger.info("RAPTOR retriever created successfully.")
    return retriever

def create_query_engine(retriever, config):
    """Create query engine."""
    query_engine = RetrieverQueryEngine.from_args(
        retriever,
        llm=OpenAI(model=config['models']['llm'], temperature=config['temperature'])
    )
    logger.info("Query engine created successfully.")
    return query_engine

def run_multiple_queries(query_engine, queries: List[str]) -> List[Tuple[str, str]]:
    """Run multiple queries and return results."""
    results = []
    for query in queries:
        response = query_engine.query(query)
        results.append((query, str(response)))
        logger.info(f"Query processed successfully: {query}")
    return results

# Create retriever and query engine
retriever = create_raptor_retriever(vector_store, config)
query_engine = create_query_engine(retriever, config)

# Example usage
queries = [
    "What baselines was RAPTOR compared against and why?",
    "What are the main advantages of RAPTOR over traditional retrieval methods?",
    "How does RAPTOR handle long documents?"
]

results = run_multiple_queries(query_engine, queries)

# Print results
for query, response in results:
    print(f"Query: {query}")
    print(f"Response: {response}")
    print("-" * 50)
```
Code explanation:
Lines 1–3: Import necessary components for setting up a retrieval system and processing queries. RaptorRetriever from llama_index.packs.raptor implements document retrieval within the RAPTOR framework, RetrieverQueryEngine from llama_index.core.query_engine represents a query engine that connects retrievers and language models for query processing, and List and Tuple from typing provide type hints.
Lines 5–16: Define the create_raptor_retriever function:
Takes vector_store, representing the ChromaDB vector store, and config, a dictionary holding configuration settings.
Creates a RaptorRetriever object (retriever) with the following parameters:
| Parameter | Description |
| --- | --- |
| [] | An empty list is passed as the documents argument because the previously populated vector store already holds the indexed data. |
| embed_model | Specifies the embedding model for generating document embeddings based on the configuration. |
| llm | Sets up the OpenAI language model using the model name and temperature from the configuration. |
| vector_store | Assigns the ChromaDB vector store for document vector management. |
| similarity_top_k | Determines the number of similar documents considered during retrieval. |
| mode | Defines the operational mode of the retriever based on the configuration. |
Logs successful creation of the retriever using logger.info.
Returns the initialized retriever object.
Lines 18–25: Define the create_query_engine function:
Takes retriever, the RaptorRetriever object, and config, the configuration dictionary.
Uses RetrieverQueryEngine.from_args() to instantiate a query engine (query_engine) with:
retriever: The previously created retriever object.
llm=OpenAI(model=config['models']['llm'], temperature=config['temperature']): Configures an OpenAI LLM for query processing.
Logs successful creation of the query engine using logger.info.
Returns the initialized query_engine object.
Lines 27–34: Define the run_multiple_queries function:
Takes query_engine, the query engine object, and queries, a list of strings representing user queries.
Initializes an empty list results to store query-response pairs.
Iterates through each query in queries:
Uses query_engine.query(query) to process each query, retrieving relevant documents and generating a response with the configured LLM.
Appends a tuple (query, str(response)) to results, where response is the generated response converted to a string.
Logs successful query processing using logger.info.
Returns results, a list of tuples containing the original queries and corresponding responses.
Lines 37–38: Call create_raptor_retriever with vector_store and config to instantiate the RaptorRetriever, then pass it to create_query_engine to build the query engine used in the following steps.
Lines 41–45: Define example queries about the RAPTOR research paper and store them in queries, covering different questions about RAPTOR's capabilities and comparisons.
Line 47: Calls run_multiple_queries with query_engine and queries to process each query using the configured retrieval and query engine setup.
Lines 50–53: Iterate through results, printing each query and its corresponding response:
Prints the original query with print(f"Query: {query}").
Prints the generated response from the query engine with print(f"Response: {response}").
Separates each query-response pair with a line of dashes, print("-" * 50), for clarity.
Here is the output generated by the RAPTOR code above:
```
Query: What baselines was RAPTOR compared against and why?
Response: RAPTOR was compared against BM25 and DPR as baselines. This comparison was conducted to showcase RAPTOR's superior performance in information retrieval tasks, particularly on datasets like QASPER. The reason for comparing against these baselines was to highlight RAPTOR's ability to outperform methods that can only extract top similar raw text chunks, as RAPTOR's hierarchical summarization approach allows it to capture a broader range of information, from general themes to specific details, leading to better overall performance.
--------------------------------------------------
Query: What are the main advantages of RAPTOR over traditional retrieval methods?
Response: RAPTOR's main advantages over traditional retrieval methods include its hierarchical tree structure that allows for synthesizing information across different sections of retrieval corpora, its ability to handle a wider range of questions by providing both original text and higher-level summaries for retrieval, and its effectiveness in leveraging the full tree structure for more efficient retrieval during the query phase. Additionally, RAPTOR outperforms traditional retrieval methods and sets new performance benchmarks on various question-answering tasks based on controlled experiments.
--------------------------------------------------
Query: How does RAPTOR handle long documents?
Response: RAPTOR handles long documents by segmenting the retrieval corpus into short, contiguous texts of a specific length, typically 100 tokens. If a sentence exceeds this limit, the entire sentence is moved to the next chunk to maintain contextual and semantic coherence. These chunks are then embedded using SBERT, forming the leaf nodes of a tree structure. RAPTOR employs a clustering algorithm to group similar text chunks, followed by summarization using a Language Model. This cycle of embedding, clustering, and summarization continues until further clustering becomes infeasible, resulting in a structured, multi-layered tree representation of the original documents.
```
Check out the official RAPTOR GitHub repository for more information and resources.
Here’s a comparison table highlighting why RAPTOR is considered superior to traditional RAG methods:
| Aspect | RAG with RAPTOR | Traditional RAG |
| --- | --- | --- |
| Retrieval structure | Hierarchical tree structure for synthesizing information across sections. | Linear retrieval of top similar chunks without hierarchical context. |
| Information synthesis | Combines original text and high-level summaries, providing a deeper understanding. | Focuses primarily on extracting top similar raw text chunks. |
| Handling long documents | Segments texts into manageable chunks, clusters, and summarizes, creating a layered tree. | Processes documents linearly, often struggling with lengthy texts. |
| Performance on QA tasks | Consistently outperforms traditional methods, setting new benchmarks on various datasets. | Relies on linear retrieval of top similar text chunks, which can result in less comprehensive answers. |
| Scalability | Scales linearly with document size in terms of build time and token use. | May struggle with scalability due to lack of hierarchical organization. |
| Flexibility | Adapts to different query complexities by selecting appropriate tree nodes. | Limited flexibility; retrieves based on direct text similarity. |
| Integration with retrievers | Enhances performance when combined with models like SBERT, outperforming standalone retrievers. | Does not inherently improve when combined with other retrievers. |
Now that you know a bit about RAPTOR, we hope you feel better equipped to master RAG and its various techniques.
Are you ready to gain more hands-on skills with RAG?
If so, here are a few courses you may find interesting:
You can also start building with RAG through Educative Projects, which guide you through creating tangible outcomes for your portfolio (without the setup):
Happy learning!