
Document chunking strategies: beyond naive splitting

January 8, 2025 · 12 min read · RAG Pioneers Team

Document chunking—the process of splitting documents into smaller pieces for embedding and retrieval—is deceptively simple in concept but consequential in practice. Poor chunking decisions create cascading problems: relevant information split across chunks, context lost at boundaries, and retrieval failures that no amount of prompt engineering can fix.

Why chunking matters

Embedding models have context windows, typically 512 tokens for older models and up to 8,192 tokens for newer ones. Documents exceeding these limits must be split. But even within these limits, chunking decisions affect retrieval quality in non-obvious ways.

Consider a query about a specific contract clause. If the relevant clause spans two chunks, neither chunk alone may score highly enough to be retrieved. The embedding of each partial chunk captures incomplete semantics, reducing similarity to the query. This is the fundamental chunking problem: preserving semantic coherence while respecting technical constraints.

Common chunking strategies

Fixed-size chunking

The simplest approach: split text every N characters or tokens, optionally with overlap. This is what most tutorials demonstrate and what many production systems use by default.

Example

chunk_size = 1000  # characters per chunk
overlap = 200      # characters shared between consecutive chunks

chunks = []
step = chunk_size - overlap  # must be positive, or the loop never advances
for i in range(0, len(text), step):
    chunks.append(text[i:i + chunk_size])

Advantages: Simple to implement, predictable chunk sizes, works with any text.

Problems: Ignores document structure, splits sentences mid-word, breaks semantic units arbitrarily. The overlap parameter is a band-aid that increases storage and processing costs without solving the fundamental problem.

Sentence-based chunking

Split on sentence boundaries, grouping sentences until reaching a size threshold. This respects linguistic units and avoids mid-sentence splits.

Characteristics

  • Requires sentence boundary detection (spaCy, NLTK, or regex)
  • Variable chunk sizes based on sentence lengths
  • Better preservation of meaning within chunks
  • Still ignores paragraph and section structure
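A minimal sketch of sentence-based chunking, using a naive regex splitter in place of a real sentence boundary detector (in practice, spaCy or NLTK will handle abbreviations and edge cases far better). The function name and the 500-character default are illustrative:

```python
import re

def sentence_chunks(text, max_chars=500):
    """Group sentences into chunks no longer than max_chars.

    Uses a naive regex splitter: a split after ., ?, or ! followed by
    whitespace. A single sentence longer than max_chars still becomes
    its own oversized chunk.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip() if current else sent
    if current:
        chunks.append(current)
    return chunks
```

Note the greedy grouping: sentences accumulate until the next one would push the chunk past the limit, which keeps chunk sizes close to the target without splitting mid-sentence.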

Paragraph-based chunking

Use paragraph breaks as primary split points. Paragraphs typically represent coherent ideas, making them natural semantic units.

This works well for well-structured documents but fails when paragraphs are very long (exceeding context limits) or very short (losing context). Legal documents, academic papers, and technical manuals often have paragraph structures that do not align with retrieval needs.

Recursive chunking

A hierarchical approach: try to split on larger units first (sections), then fall back to smaller units (paragraphs, sentences) if chunks exceed size limits. LangChain's RecursiveCharacterTextSplitter popularized this approach.

Typical separator hierarchy

  1. Double newlines (paragraph breaks)
  2. Single newlines
  3. Sentence boundaries (periods, question marks)
  4. Word boundaries (spaces)
  5. Character-level (last resort)
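The separator hierarchy can be sketched as a simple recursive function. This is a simplified illustration of the idea behind LangChain's RecursiveCharacterTextSplitter, not its actual implementation (the real splitter also merges small pieces back toward the target size, which is omitted here):

```python
def recursive_split(text, max_size=1000, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing with finer
    separators for any piece that still exceeds max_size."""
    if len(text) <= max_size:
        return [text]
    if not separators:
        # Last resort: hard character-level split.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, max_size, rest))
    return chunks
```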

Structure-aware chunking

The strategies above treat documents as flat text, ignoring their inherent structure. For many enterprise documents, this is a significant limitation. Better approaches leverage document structure explicitly.

Markdown and HTML chunking

For structured formats, use heading hierarchy to define chunk boundaries. Each section becomes a chunk, with heading context preserved.

Strategy

  • Parse document structure (headings, lists, tables)
  • Create chunks at section boundaries
  • Include parent heading context in each chunk
  • Handle nested sections recursively
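The strategy above can be sketched for Markdown with a small heading-aware chunker. This is a minimal illustration assuming ATX-style headings (`#` through `######`); the function name and the `" > "` path separator are arbitrary choices:

```python
import re

def markdown_section_chunks(md_text):
    """Chunk a Markdown document at heading boundaries, pairing each
    chunk with its heading path so section context travels with it."""
    chunks, path, body = [], [], []

    def flush():
        if body and any(line.strip() for line in body):
            context = " > ".join(title for _, title in path)
            chunks.append((context, "\n".join(body).strip()))
        body.clear()

    for line in md_text.splitlines():
        m = re.match(r'^(#{1,6})\s+(.*)', line)
        if m:
            flush()
            level, title = len(m.group(1)), m.group(2).strip()
            # Keep only ancestors shallower than the new heading.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, title))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries a path like "Guide > Setup", which can be prepended to the chunk text before embedding or stored as metadata.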

PDF and document layout chunking

PDFs and scanned documents require layout analysis to identify structural elements. Tools like PyMuPDF, pdfplumber, or dedicated document AI services can extract:

  • Headers and footers (often noise to be excluded)
  • Section headings and their hierarchy
  • Tables (requiring special handling)
  • Multi-column layouts
  • Footnotes and references

Semantic chunking

An emerging approach uses embeddings themselves to identify semantic boundaries. The idea: compute embeddings for sliding windows and split where semantic similarity drops significantly.

Algorithm outline

  1. Embed sentences or small text windows
  2. Compute similarity between adjacent windows
  3. Identify "valleys" where similarity drops below threshold
  4. Split at these semantic boundaries

This approach is computationally expensive (requiring embeddings for boundary detection) but can identify topic shifts that structural approaches miss. It works well for documents without clear formatting but with distinct topical sections.
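The algorithm outline above can be sketched end to end. To keep the example self-contained, a bag-of-words count vector stands in for a real embedding model; in practice you would call a sentence-embedding model, and the 0.2 threshold would need tuning on your data:

```python
from collections import Counter
import math

def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    """Split a sentence list wherever similarity between adjacent
    sentences drops below the threshold (a similarity 'valley')."""
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, vec) < threshold:
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```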

Chunk size considerations

Beyond strategy, chunk size itself affects retrieval quality. There is no universal optimal size—it depends on your documents, queries, and embedding model.

Smaller chunks (100-300 tokens)

  • More precise retrieval for specific queries
  • Less noise in retrieved context
  • Risk of missing broader context
  • More chunks to store and search

Larger chunks (500-1000 tokens)

  • More context preserved in each chunk
  • Better for questions requiring broader understanding
  • Risk of diluted embeddings (too many topics in one chunk)
  • Consumes more of the LLM context window

Advanced techniques

Hierarchical indexing

Instead of choosing one chunk size, create multiple representations at different granularities. Index documents at section, paragraph, and sentence levels. At query time, retrieve at the appropriate level based on query characteristics.
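A minimal sketch of building such a multi-granularity index, assuming each document arrives as a list of section strings and using naive paragraph and sentence splits for brevity:

```python
def build_hierarchical_index(sections):
    """Index one document at three granularities. A query router can
    then pick the level that fits: broad questions retrieve sections,
    factoid questions retrieve sentences."""
    index = {"section": [], "paragraph": [], "sentence": []}
    for sec in sections:
        index["section"].append(sec)
        for para in sec.split("\n\n"):
            index["paragraph"].append(para)
            for sent in para.split(". "):
                if sent.strip():
                    index["sentence"].append(sent.strip())
    return index
```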

Parent-child relationships

Store small chunks for precise retrieval but return larger parent chunks to the LLM. This combines retrieval precision with context completeness. When a small chunk matches, expand to include surrounding context before generation.
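A minimal sketch of the parent-child pattern, with fixed-size children mapped back to their parents; the function names and sizes are illustrative, and a real system would key children by ID in a vector store rather than by text:

```python
def build_parent_child_index(parents, child_size=200):
    """Split each parent chunk into small children for precise matching,
    recording which parent each child came from."""
    child_to_parent = {}
    for pid, parent_text in enumerate(parents):
        for i in range(0, len(parent_text), child_size):
            child = parent_text[i:i + child_size]
            child_to_parent[child] = pid
    return child_to_parent

def retrieve_with_context(matched_child, child_to_parent, parents):
    """Given a matched child chunk, return the full parent chunk
    so the LLM sees the surrounding context."""
    return parents[child_to_parent[matched_child]]
```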

Metadata enrichment

Augment chunks with metadata that aids retrieval: document title, section heading, page number, document type. This metadata can be used for filtering, boosting, or included in the chunk text itself to improve embedding quality.
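A minimal sketch of chunk enrichment, assuming a simple dict record; the field names and the prepended-header format are illustrative choices, not a standard:

```python
def enrich_chunk(chunk_text, doc_title, section, page=None):
    """Package a chunk with metadata for filtering, plus an embedding
    text that prepends title and section so they influence similarity."""
    return {
        "text": chunk_text,
        "embedding_text": f"{doc_title} | {section}\n{chunk_text}",
        "metadata": {"title": doc_title, "section": section, "page": page},
    }
```

Embedding `embedding_text` rather than the raw chunk means a query mentioning the document or section name can match even when the chunk body never repeats it, while `metadata` remains available for exact filtering.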

Practical recommendations

  1. Start with structure-aware chunking if your documents have consistent formatting. Leverage headings, sections, and document hierarchy.
  2. Use sentence-based chunking as a fallback for unstructured text. Group sentences to target sizes around 300-500 tokens.
  3. Preserve context through metadata. Include document title and section headings in chunk text or metadata fields.
  4. Handle tables separately. Tables often require special chunking (row-by-row, or as complete units) and may benefit from structured extraction.
  5. Evaluate on your data. Create test queries and measure retrieval quality with different chunking strategies. What works for one corpus may fail for another.
  6. Consider hybrid approaches. Different document types in your corpus may warrant different chunking strategies. A contracts database may need different treatment than a technical documentation corpus.

Conclusion

Chunking is not a solved problem with a single best solution. The right approach depends on your document characteristics, query patterns, and retrieval requirements. Naive fixed-size splitting is rarely optimal, but more sophisticated approaches require understanding your specific use case.

Invest time in chunking strategy early in your RAG development. It is far easier to adjust chunking during development than to re-process and re-index a large document corpus after deployment. And remember: the best chunking strategy is the one that works for your users and your data, not the one that scores highest on academic benchmarks.

Need help with document processing?

Our workshop includes analysis of your document types and recommendations for optimal chunking strategies tailored to your retrieval requirements.

Schedule a RAG Workshop