...
Parent Document Retrieval (PDR): Structuring Hierarchical Data
Learn about the parent document retrieval (PDR) technique, how it works, and its step-by-step implementation.
In RAG, effectively retrieving relevant source documents is crucial for generating high-quality, informative responses. Standard RAG methods often operate on smaller text chunks, which might not provide sufficient context for complex queries. Parent document retrieval (PDR) addresses this limitation by retrieving the complete parent documents associated with the most relevant child passages. This approach enhances RAG’s ability to handle intricate questions requiring a broader understanding of the source material.
What is parent document retrieval (PDR)?
Parent document retrieval (PDR) is a technique used in advanced RAG models to retrieve the full parent documents from which relevant child passages (snippets) are derived. This retrieval process improves the context available to the RAG model, leading to more comprehensive and informative responses, especially for complex or nuanced queries.
Here are the core steps of parent document retrieval in RAG models:
Data preprocessing: Split large documents into smaller chunks.
Create embeddings: Convert each chunk into a numerical representation for efficient search.
User query: The user submits a question.
Chunk retrieval: Search for the most relevant chunks based on the query’s embedding.
Identify parent documents: Find the original documents (or larger segments) for the shortlisted chunks.
Retrieve parent documents: Get the full parent documents for better context.
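The steps above can be sketched in plain Python before bringing in any libraries. This is a minimal illustration, not the implementation used later in this lesson: the toy corpus, the function names (split_into_chunks, embed, build_index, retrieve_parents), and the bag-of-words "embedding" are all hypothetical stand-ins for a real text splitter, embedding model, and vector store.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: parent_id -> full parent document text.
PARENTS = {
    "doc_cats": "Cats are small domesticated felines. They purr when content. "
                "Cats sleep for most of the day.",
    "doc_rockets": "Rockets burn propellant to produce thrust. "
                   "Orbital launches need multiple stages. "
                   "Reusable boosters land vertically.",
}

def split_into_chunks(text):
    """Data preprocessing: split a parent document into sentence-sized chunks."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]

def embed(text):
    """Create embeddings: a toy bag-of-words vector (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(parents):
    """Store every chunk with its embedding and the ID of its parent document."""
    index = []
    for parent_id, text in parents.items():
        for chunk in split_into_chunks(text):
            index.append((parent_id, chunk, embed(chunk)))
    return index

def retrieve_parents(query, index, parents, k=2):
    """Chunk retrieval, then parent lookup: return full parents of the top-k chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda e: cosine(query_vec, e[2]), reverse=True)[:k]
    parent_ids = []
    for parent_id, _chunk, _vec in ranked:
        if parent_id not in parent_ids:  # several chunks may share one parent
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]

index = build_index(PARENTS)
results = retrieve_parents("How do reusable boosters land?", index, PARENTS, k=1)
print(results[0])  # the full rockets document, not just the matching sentence
```

Note that the search runs over small chunks, but the function returns whole parent documents. Keeping the parent ID next to each chunk is what makes that lookup possible.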
Step-by-step implementation
The following are the steps to implement the PDR technique:
1. Prepare the data
We’ll begin by setting up the necessary environment and data for parent document retrieval (PDR) in our RAG system.
i) Import necessary modules
Next, we’ll import the required modules from the installed libraries to build our PDR system:
import os  # needed for os.environ when setting the API key

from langchain.schema import Document
from langchain.vectorstores import Chroma
from langchain.retrievers import ParentDocumentRetriever
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
These libraries and modules are essential for the subsequent steps in the process.
ii) Set up the OpenAI API key
We use an OpenAI LLM for generating responses, so we’ll need an OpenAI API key. Set the OPENAI_API_KEY environment variable with your key:
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"] = ""  # Add your OpenAI API key
if OPENAI_API_KEY == "":
    raise ValueError("Please set the OPENAI_API_KEY environment variable")
Code explanation
Line 1: Set the OPENAI_API_KEY variable to an empty string and assign it to the environment variable OPENAI_API_KEY using os.environ. This is where you ...