...
Retrieval Strategies: Data Chunking
Learn how to split or chunk text for large documents.
Understanding text splitters
An essential element in the RAG retrieval process is the selective extraction of document segments.
This demands a series of transformation steps aimed at preparing documents for efficient retrieval. Among these steps, a fundamental task is dividing a large document into smaller sections, or chunks, because of the finite context windows of large language models. A context window defines the maximum amount of text an LLM can process in a single operation. Even though modern LLMs offer increasingly large context windows, they perform better on specific tasks when provided with smaller but more relevant segments of information. The challenge then becomes selecting the most relevant subset of the data for retrieval.
Ideally, the chunking strategy should preserve semantic cohesion within the text segments, although what counts as semantic cohesion varies with the type of text. For external documents, the initial breakdown into smaller chunks is essential for extracting nuanced features that are later encoded to capture their semantic essence. Chunks that are too large or too small both lead to suboptimal retrieval performance, making it essential to determine the ideal chunk size for the document text to improve the accuracy and relevance of retrieval outcomes.
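To make these mechanics concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. The chunk_text helper is a hypothetical name introduced for illustration, and the sketch deliberately ignores sentence boundaries, which real splitters would respect:

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character chunks, each sharing
    `overlap` characters with the chunk before it."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Production splitters refine this idea by preferring to break on paragraph, sentence, or word boundaries rather than at arbitrary character offsets, which better preserves semantic cohesion.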
Selecting an effective chunking strategy involves evaluating several key aspects: the nature of the content being indexed, the capabilities and ideal operational scale of the embedding model, the anticipated complexity of user queries, and the specific demands of the application consuming the retrieval results. The choice of chunking method may also vary with the document's length. Moreover, different embedding models behave differently across chunk sizes. For example, sentence-transformer models are optimized for single sentences, whereas models such as OpenAI's text-embedding-ada-002 are more efficient with 256 or 512 tokens.
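Because embedding models measure input in tokens rather than characters, a token-aware splitter is often more faithful to those limits. The following sketch assumes the tiktoken package is installed; cl100k_base is the tokenizer used by text-embedding-ada-002:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer for text-embedding-ada-002

def chunk_by_tokens(text: str, max_tokens: int = 256, overlap: int = 15) -> list[str]:
    """Split text into chunks of at most `max_tokens` tokens,
    with `overlap` tokens shared between consecutive chunks."""
    tokens = enc.encode(text)
    step = max_tokens - overlap
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), step)]
```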
Examples of text splitting
Let’s consider the two examples below to illustrate the concept of text chunking. Different colors represent different chunks, with the exception of dark green, which marks the overlapping text between consecutive chunks.
Example 1
Chunk size of 128 tokens with an overlap of 15 tokens. The total number of characters is 1004, and the number of chunks is 9.
We can observe here that the smaller the chunks, the less of each sentence they contain and the less context we will send to the LLM.
Example 2
Chunk size of 256 tokens with an overlap of 15 tokens. The total number of characters is 1004, and the number of chunks is 4.
We can observe here that as we increase the chunk size, more and more of each sentence is included in the chunks sent to the LLM.
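Both configurations can be reproduced with an off-the-shelf splitter. The sketch below assumes the langchain-text-splitters package and a document_text variable holding the 1004-character sample passage; note that RecursiveCharacterTextSplitter measures length in characters by default, so the resulting chunk counts are approximate:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = "..."  # placeholder for the 1004-character sample passage

for size in (128, 256):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=size,   # maximum chunk length
        chunk_overlap=15,  # text shared between consecutive chunks
    )
    chunks = splitter.split_text(document_text)
    print(f"chunk_size={size}: {len(chunks)} chunks")
# For the sample passage this should yield roughly 9 and 4 chunks, respectively.
```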
The complexity of user queries and the application’s unique needs, such as semantic search or question answering, also influence the chunking strategy. This selection is often guided by the token limits of the LLMs in use, which may require adapting chunk sizes. Achieving good retrieval results entails applying various chunking strategies flexibly, as no single approach is universally superior.
At a high ...