Retrieval Strategies: Data Chunking

Learn how to split or chunk text for large documents.

Understanding text splitters

An essential element in the RAG retrieval process is the selective extraction of document segments.

RAG workflow: Data chunking

This requires a series of transformation steps that prepare documents for efficient retrieval. A fundamental step is dividing a large document into smaller sections, or chunks, because large language models have finite context windows. A context window defines the maximum amount of text an LLM can process in a single operation. Although the context windows of modern LLMs keep growing, these models tend to perform better on specific tasks when given smaller, more relevant segments of information. The challenge then becomes selecting the most relevant subset of data for retrieval.

Ideally, a chunking strategy should preserve semantic cohesion within each text segment. What counts as semantic cohesion varies with the type of text. For external documents, the initial breakdown into smaller chunks is essential for extracting nuanced features that are later encoded to capture their meaning. Chunks that are too large or too small can both degrade retrieval performance, so it is important to find a chunk size that suits the document text and improves the accuracy and relevance of retrieval results.

Selecting an effective chunking strategy involves evaluating several key aspects: the nature of the content being indexed, the capabilities and ideal operating range of the embedding model, the anticipated complexity of user queries, and the specific demands of the application that uses the retrieval results. The choice of chunking strategy may also vary with the document’s length. Moreover, embedding models behave differently across chunk sizes. For example, sentence-transformer models are optimized for single sentences, whereas models such as OpenAI text-embedding-ada-002 tend to work better with chunks of 256 or 512 tokens.

Examples of text splitting

Let’s consider the two examples below to illustrate the concept of text chunking. We can see that different colors represent different chunks, with the exception of the dark green color, which represents the overlapping text between the chunks.

Example 1

Chunk size of 128 tokens with an overlapping text of 15 tokens. The total number of characters is 1004, and the number of chunks is 9.

Chunk size 128

We can observe here that the smaller the chunks are, the less of each sentence a chunk contains, and the less context we send to the LLM.

Example 2

Chunk size of 256 tokens with an overlapping text of 15 tokens. The total number of characters is 1004, and the number of chunks is 4.

Chunk size 256

We can observe here that as we increase the chunk size, more and more of each sentence is included in the chunks sent to the LLM.
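The effect of chunk size and overlap in the two examples can be sketched with a simple sliding window. This is a minimal illustration and not the splitter used to produce the figures: the function name `chunk_with_overlap` is hypothetical, and the "tokens" are placeholder strings rather than the output of a real tokenizer.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a list of tokens into windows of at most `chunk_size` tokens,
    where consecutive windows share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Placeholder tokens (assumption): a real pipeline would use a tokenizer.
tokens = [f"t{i}" for i in range(300)]

small = chunk_with_overlap(tokens, chunk_size=128, overlap=15)
large = chunk_with_overlap(tokens, chunk_size=256, overlap=15)

print(len(small), "chunks at size 128")
print(len(large), "chunks at size 256")
```

With the same input, the smaller chunk size produces more chunks, each carrying less surrounding context, while the overlap repeats the last few tokens of one chunk at the start of the next.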

The complexity of user queries and the application’s needs, such as semantic search or question answering, influence the chunking strategy. The choice is also often constrained by the token limits of the LLMs in use, which may require adjusting chunk sizes. Good retrieval results come from applying chunking strategies flexibly, as no single approach is universally superior.

At a high level, text splitters work as follows:

  • We split the text into small, semantically meaningful chunks.

  • We start combining these small chunks into a larger chunk until we reach ...
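The two steps listed so far can be sketched as follows. The stopping condition is truncated in the text above, so this sketch assumes a maximum chunk length measured in characters (`max_chars`). The helper names `split_sentences` and `merge_pieces` and the naive regex-based sentence split are also assumptions made for illustration, not the behavior of any particular library.

```python
import re


def split_sentences(text):
    """Step 1 (assumed approach): split text into small pieces at sentence ends."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in pieces if p]


def merge_pieces(pieces, max_chars):
    """Step 2: combine small pieces into larger chunks of at most `max_chars`
    characters (assumed limit). A single piece longer than the limit is kept
    whole rather than cut mid-sentence."""
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + " " + piece
        if not current or len(candidate) <= max_chars:
            current = candidate      # the piece still fits, keep growing
        else:
            chunks.append(current)   # limit reached, close the chunk
            current = piece          # start a new chunk with this piece
    if current:
        chunks.append(current)
    return chunks


text = (
    "RAG retrieves documents. Chunks keep context small. "
    "Overlap preserves continuity. Size affects recall."
)
chunks = merge_pieces(split_sentences(text), max_chars=60)
for chunk in chunks:
    print(repr(chunk))
```

Merging whole sentences rather than cutting at a fixed offset keeps each chunk readable on its own, at the cost of chunks that vary in length.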