Parent Document Retrieval (PDR): Structuring Hierarchical Data

Learn about the parent document retrieval (PDR) technique, how it works, and its step-by-step implementation.

In retrieval-augmented generation (RAG), retrieving relevant source documents effectively is crucial for generating high-quality, informative responses. Standard RAG methods often operate on small text chunks, which might not provide sufficient context for complex queries. Parent document retrieval (PDR) addresses this limitation by retrieving the complete parent documents associated with the most relevant child passages. This approach enhances RAG’s ability to handle intricate questions that require a broader understanding of the source material.

What is parent document retrieval (PDR)?

Parent document retrieval (PDR) is a technique used in advanced RAG models to retrieve the full parent documents from which relevant child passages (snippets) are derived. This retrieval process improves the context available to the RAG model, leading to more comprehensive and informative responses, especially for complex or nuanced queries.

Here are the core steps of parent document retrieval in RAG models:

  • Data preprocessing: Split large documents into smaller chunks.

  • Create embeddings: Convert each chunk into a numerical representation for efficient search.

  • User query: The user submits a question.

  • Chunk retrieval: Search for the most relevant chunks based on the query’s embedding.

  • Identify parent documents: Find the original documents (or larger segments) for the shortlisted chunks.

  • Retrieve parent documents: Get the full parent documents for better context.
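To make the steps above concrete, here is a minimal, library-free sketch of the same flow. It is only an illustration under stated assumptions: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the helper names (`split_into_chunks`, `build_index`, `retrieve_parents`) and sample documents are hypothetical, not part of LangChain or of this lesson's implementation.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=8):
    """Data preprocessing: split a document into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy embedding (assumption): word counts instead of a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(parents):
    """Create embeddings: store (parent_id, chunk, vector) for every child chunk."""
    index = []
    for parent_id, text in parents.items():
        for chunk in split_into_chunks(text):
            index.append((parent_id, chunk, embed(chunk)))
    return index


def retrieve_parents(query, index, parents, top_k=2):
    """Chunk retrieval, then parent lookup: return full parents of the best chunks."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(query_vector, entry[2]), reverse=True)
    parent_ids = []
    for parent_id, _chunk, _vector in ranked[:top_k]:
        if parent_id not in parent_ids:  # several chunks may share one parent
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]


# Hypothetical sample data for illustration only.
parents = {
    "doc_cats": "Cats are small carnivorous mammals. They sleep for most of the day and hunt mice at night.",
    "doc_rag": "Retrieval augmented generation combines a retriever with a language model. The retriever supplies context passages.",
}
index = build_index(parents)
results = retrieve_parents("Which animals hunt mice?", index, parents, top_k=1)
```

The search runs over the small chunks, but the function returns the whole parent text, so the words surrounding the matching chunk are available as context. A real system would replace `embed` with an embedding model and the list-based index with a vector store.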

High-level overview of parent document retrieval (PDR)

Step-by-step implementation

The following are the steps to implement parent document retrieval (PDR) with LangChain, an open-source framework designed to simplify the development of applications that use large language models (LLMs):

Steps for implementing PDR

1. Prepare the data

We’ll begin by setting up the necessary environment and data for parent document retrieval (PDR) in our RAG system.

i) Import necessary modules

First, we’ll import the modules required to build our PDR system:

import os
from langchain.schema import Document
from langchain.vectorstores import Chroma
from langchain.retrievers import ParentDocumentRetriever
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings

These libraries and modules are essential for the subsequent steps in the process.

ii) Set up the OpenAI API key

We use an OpenAI LLM for generating responses, so we’ll need an OpenAI API key. Set the OPENAI_API_KEY environment variable with your key:

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"] = "" # Add your OpenAI API key
if OPENAI_API_KEY == "":
    raise ValueError("Please set the OPENAI_API_KEY environment variable")

Code explanation

  • Line 1: Set the OPENAI_API_KEY variable to an empty string and assign it to the environment variable OPENAI_API_KEY using os.environ. This is where you ...