
Chunking Strategies for RAG: Fixed, Semantic, Hierarchical and Late Chunking Compared

Blck Alpaca·9 June 2026

Definition

Chunking strategies for RAG determine how a document is split into searchable text segments (chunks) before embedding. The choice of strategy and chunk size largely determines retrieval quality: it decides whether a language model finds the right passage and whether that passage contains enough context for a correct, source-grounded answer.

Key Takeaways

✓ Chunking is not a detail but the lever with the greatest impact on retrieval quality: naive fixed-size chunking that ignores sentence, table and list boundaries is one of the most common causes of hallucinations and missing content.
✓ There is no universal optimum. The default starting point is 200-800 tokens with 10-20% overlap; the right strategy depends on document type, embedding model and query patterns.
✓ Hierarchical parent-child chunking decouples retrieval (small, precise child chunks) from generation (large parent chunks with context) and is often the best choice for long reports and manuals.
✓ Anthropic's Contextual Retrieval (as of 2024) prepends an LLM-generated context header to each chunk. Combined with contextual BM25, this reduces the top-20 retrieval failure rate by 49%, and by 67% when re-ranking is added; contextual embeddings alone achieve 35%.
✓ Depending on the tokeniser, German text with its compound words is roughly 1.3-1.7x more token-intensive than English. This reduces the effective chunk capacity and must be factored in when choosing the size.
✓ The right strategy is not guessed but measured: with a gold set and RAGAS metrics (Context Precision, Context Recall, Faithfulness), variants can be compared objectively in an A/B test.

Because strategy and chunk size decide whether the model finds the right passage and whether that passage carries enough context, chunking is the least conspicuous yet most powerful lever in the entire RAG stack.

Default recommendation: 200-800 tokens per chunk, 10-20% overlap, always preceded by layout-aware parsing.
Most common mistake: Naive fixed-size chunking cuts through sentences, tables and lists - one of the main causes of hallucinations.
Best all-rounder in 2026: Hierarchical parent-child chunking, optionally combined with Contextual Retrieval (Anthropic: up to 67% fewer retrieval failures when combined with re-ranking).

Why chunking decides between success and failure

In a RAG pipeline, every source document is split, each chunk is translated into a vector (embedding) and stored in a vector database. At runtime, the system searches for the chunks most similar to the user's question and places them in the language model's prompt. This means: anything that was not cleanly chunked cannot be found in the first place - and anything that is found but cut without context leads the model astray.

Two classes of error dominate in practice. First, naive fixed-size chunking ignores sentence, table and list boundaries; the result is hallucinations and missing table content. Second, the "lost-in-the-chunks" problem: a chunk contains the sentence "It increased by 12%" - without the surrounding context it remains unclear what increased. Both problems arise before embedding and cannot be repaired later by even the best re-ranking.

The six chunking strategies at a glance

Fixed-size chunking cuts text into fixed windows, for example 512 tokens with an overlap of 50 tokens. It is fast, deterministic and cost-effective - but content-blind. Suitable for uniform texts such as logs or transcripts.
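The windowing itself is only a few lines. The following is a minimal sketch, assuming the text has already been tokenised; the function name `fixed_size_chunks` and the synthetic token list are illustrative and stand in for real tokeniser output.

```python
def fixed_size_chunks(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic tokens stand in for real tokeniser output (an assumption).
tokens = [f"t{i}" for i in range(1200)]
chunks = fixed_size_chunks(tokens, size=512, overlap=50)
print([len(c) for c in chunks])  # [512, 512, 276]
```

Note that the cut positions depend only on the token count, which is exactly why this method is content-blind.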

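Recursive character / sentence splitting splits hierarchically along delimiters (paragraphs, then sentences, then words) and only falls back to the next finer delimiter when a piece is still too long, so natural boundaries are respected wherever possible. This is the generic default of many frameworks (for example LangChain's RecursiveCharacterTextSplitter) and a good starting point for mixed corpora.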
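To make the recursion concrete, here is a simplified sketch of the idea, not the implementation of any particular framework. It measures length in characters rather than tokens and drops the separator at chunk boundaries; both are simplifying assumptions, and `recursive_split` is an illustrative name.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first; recurse with finer ones if a piece is too long."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small pieces up to the limit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too long: retry with the next finer separator.
            chunks.extend(recursive_split(piece, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = ("Alpha beta gamma. " * 5).strip() + "\n\n" + ("Delta epsilon. " * 5).strip()
parts = recursive_split(sample, max_chars=60)
print([len(p) for p in parts])
```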

Sentence-window retrieval is a variant of sentence splitting: embedding and searching happen at sentence level, but a window of the surrounding sentences is passed to the LLM. It is a lightweight precursor to the parent-child principle.
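As a rough sketch of the data structure behind this, each sentence can be stored together with its window. The field names `embed_text` and `llm_context` and the default window of two sentences are assumptions for illustration.

```python
def sentence_windows(sentences, window=2):
    """Pair each sentence (the unit that gets embedded) with its surrounding context."""
    records = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        records.append({
            "embed_text": sentence,                    # searched at sentence level
            "llm_context": " ".join(sentences[lo:hi])  # passed to the LLM on a hit
        })
    return records


records = sentence_windows(["A.", "B.", "C.", "D."], window=1)
print(records[2]["llm_context"])  # B. C. D.
```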

Semantic chunking measures the cosine distance between consecutive sentences and places the cut where the topical coherence breaks (cosine drift). This produces coherent chunks for content-heterogeneous documents but is more compute-intensive during indexing, because each sentence must be embedded beforehand.
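A minimal sketch of the cut rule, assuming one embedding per sentence is already available. The toy two-dimensional vectors and the threshold of 0.5 are illustrative; real pipelines often derive the threshold from the distribution of distances in the document.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Start a new chunk wherever the distance between neighbouring sentences exceeds `threshold`."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


# Toy 2-d vectors stand in for real sentence embeddings (an assumption).
sentences = ["Revenue rose.", "Margins improved.", "The office moved.", "Parking is limited."]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
print(semantic_chunks(sentences, embeddings))
```

In this example the only large distance lies between the second and third sentence, so the text is split into a finance chunk and an office chunk.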

Hierarchical / parent-child chunking decouples retrieval from generation: small child chunks serve as a precise search target, while the associated large parent chunk is supplied to the generator as context. This strategy is often the best choice for long reports and technical manuals.

Late chunking is a more recent approach that reverses the order: instead of cutting the text before embedding, the entire (long) document is first processed by a long-context embedding model, and only the resulting token embeddings are subsequently pooled into chunk vectors. As a result, every chunk vector carries the context of the whole document within it. The approach pursues the same goal as Contextual Retrieval, namely solving the lost-in-the-chunks problem, but requires no additional LLM call per chunk. It is, however, comparatively young and less widely tested, and it requires a long-context embedding model; before using it in production, check the supported context length and pooling behaviour in the current documentation of your chosen model.
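The pooling step can be sketched as follows. This assumes mean pooling and that the embedding model exposes one vector per token; `late_chunk_vectors` and the tiny example vectors are hypothetical and only illustrate the order of operations.

```python
def late_chunk_vectors(token_embeddings, spans):
    """Mean-pool document-level token embeddings over each chunk's [start, end) token span."""
    vectors = []
    for start, end in spans:
        window = token_embeddings[start:end]
        dim = len(window[0])
        vectors.append([sum(tok[d] for tok in window) / len(window) for d in range(dim)])
    return vectors


# Hypothetical output of a long-context embedding model: one vector per token,
# computed over the whole document before any cutting takes place.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
spans = [(0, 2), (2, 4)]  # chunk boundaries expressed as token offsets
print(late_chunk_vectors(token_embeddings, spans))  # [[2.0, 1.0], [1.0, 3.0]]
```

The chunk boundaries themselves can come from any of the strategies above; only the point at which they are applied changes.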

Two further methods are also worth considering: propositional chunking, which breaks text down into atomic individual statements and is suitable for very precise fact extraction, and contextual chunking (see below), which prepends an explanatory context header to each chunk.

Comparison table: which strategy when?

Strategy	When suitable	Advantages	Disadvantages
Fixed-size (e.g. 512 tokens, overlap 50)	Logs, transcripts, uniform texts	Fast, deterministic, cheap	Content-blind, cuts through sentences/tables
Recursive / sentence splitter	Generic default, mixed corpora	Respects natural boundaries, robust	Misses semantic topic shifts
Sentence-window	Q&A over running text, short fact questions	Precise retrieval, some context via window	Limited context for long coherent passages
Semantic chunking	Content-heterogeneous documents	Coherent, topic-pure chunks	Compute-intensive during indexing
Hierarchical / parent-child	Long reports, manuals, contracts	Precise search + full context for the LLM	Higher pipeline complexity, double storage
Late chunking	Long documents, long-context embeddings (as of 2026, provider-dependent)	Global context per chunk without LLM header	Requires long-context embedding model; less tested
Contextual chunking (Anthropic, as of 2024)	Broad applicability	49% fewer retrieval failures (with contextual BM25), addresses lost-in-the-chunks	One LLM call per chunk during indexing

Tradeoffs: chunk size and overlap

Chunk size is a classic trade-off. Small chunks (e.g. 256 tokens) produce sharper embeddings and more precise hits for specific fact questions but risk losing context. Large chunks (e.g. 800 tokens and more) preserve more coherence but dilute the embedding and raise the risk that the relevant information is overlooked because of the LLM's "lost-in-the-middle" effect. Passing an oversized top-k set to the model makes this worse, so the recommendation is to condense the retrieved candidates to top_k = 5-10 with a re-ranker.
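The condensing step can be sketched like this. The scoring function `overlap_score` is a deliberately crude placeholder based on shared words; in a real system a cross-encoder or a re-ranking API would take its place, and all names here are illustrative.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Re-score retrieved candidates and keep only the `top_k` best for the prompt."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_k]


def overlap_score(query, chunk):
    # Placeholder relevance score: number of shared lowercase words.
    return len(set(query.lower().split()) & set(chunk.lower().split()))


candidates = [
    "pricing table for 2024",
    "chunk size and overlap tradeoffs",
    "overlap between chunks",
    "office address",
]
print(rerank("chunk size overlap", candidates, overlap_score, top_k=2))
```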

The overlap (overlapping tokens between adjacent chunks) prevents a thought from being torn apart exactly at the cut edge. 10-20% of the chunk size is a proven rule of thumb. More overlap increases redundancy, storage requirements and indexing costs without improving quality proportionally.
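How quickly overlap inflates the index can be estimated with a small calculation. The sketch below counts the windows needed for a corpus of 100,000 tokens at a chunk size of 500; both numbers are assumed purely for illustration.

```python
import math


def chunk_count(total_tokens, size, overlap):
    """Number of windows needed to cover `total_tokens` with the given size and overlap."""
    if total_tokens <= size:
        return 1
    return 1 + math.ceil((total_tokens - size) / (size - overlap))


for overlap in (50, 100, 250):  # 10%, 20% and 50% of a 500-token chunk
    print(overlap, chunk_count(100_000, size=500, overlap=overlap))
# 50 223
# 100 250
# 250 399
```

Going from 10% to 50% overlap almost doubles the number of chunks to embed and store, which is why the 10-20% range is usually sufficient.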

A factor often underestimated for DACH corpora: German compound words such as "Datenschutz-Folgenabschätzung" generate considerably more tokens than their English equivalent, depending on the tokeniser (BPE/SentencePiece). As a rule of thumb, German is roughly 1.3-1.7x more token-intensive depending on the model. This reduces the effective content capacity of a 512-token chunk - a size defined in tokens therefore carries less substantive content in German than in English.
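As a back-of-the-envelope check, the rule of thumb can be turned into an "English-equivalent" capacity. This is only a rough estimate based on the 1.3-1.7x range above; the function name is illustrative, and the real factor should be measured with your tokeniser on your corpus.

```python
def english_equivalent_tokens(chunk_tokens, intensity_factor):
    """Rough content capacity of a chunk, expressed in English-equivalent tokens."""
    return round(chunk_tokens / intensity_factor)


for factor in (1.3, 1.5, 1.7):
    print(factor, english_equivalent_tokens(512, factor))
# 1.3 394
# 1.5 341
# 1.7 301
```

A 512-token chunk of German text therefore carries roughly as much content as a 300-400 token chunk of English text.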

A concrete example with numbers

Suppose you are indexing 10,000 technical PDF pages (contracts and product manuals).

Parsing: Layout-aware with a parser such as Docling or Unstructured, so that tables, lists and headings are preserved.
Chunking: Parent-child. Parent chunks of around 1,500 tokens (entire sections), child chunks of around 300 tokens with 15% overlap.
Embedding target: the child chunks; the respective parent is passed to the LLM.
Optional contextual header: A short LLM-generated context sentence per chunk before embedding. According to Anthropic (as of 2024), contextual embeddings alone reduce the top-20 retrieval failure rate by 35%; combined with contextual BM25, the rate falls from 5.7% to 2.9% (-49%), and to 1.9% when re-ranking is added (-67%). The indexing costs are around 1.02 US dollars per 1 million document tokens with prompt caching (Anthropic pricing, as of September 2024; update before use).
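To get a feeling for the one-off cost of the contextual header in this example, the quoted price can be applied to the corpus size. The average of 500 tokens per page is an assumption that is not part of the example above; measure it on your own documents.

```python
PRICE_PER_MILLION_TOKENS_USD = 1.02  # Anthropic figure quoted above (September 2024)


def contextual_header_cost(pages, tokens_per_page, price_per_million=PRICE_PER_MILLION_TOKENS_USD):
    """Estimated one-off indexing cost in US dollars for generating context headers."""
    document_tokens = pages * tokens_per_page
    return document_tokens / 1_000_000 * price_per_million


# tokens_per_page=500 is an assumed average, not a figure from the example.
print(round(contextual_header_cost(10_000, 500), 2))  # 5.1
```

Under these assumptions the header generation for the whole corpus costs only a few dollars, so the decision is driven by pipeline complexity rather than by price.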

Pseudocode for the hierarchical split:

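```
# Parents follow section boundaries: headings first, then paragraphs
parents = split(document, size=1500, separators=["\n## ", "\n\n"])
for p in parents:
    children = split(p, size=300, overlap=45)  # 45 tokens = 15% of 300
    for c in children:
        # Optional: LLM-generated context header for the child chunk
        c.context = llm("What is this section about?", p)
        # Embed header + child text; keep a reference to the parent for generation
        index.upsert(embed(c.context + "\n" + c.text), parent_ref=p.id)
```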
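The query side of the same pattern can be sketched as follows: child hits are mapped back to their parents, and each parent is passed to the LLM only once. The `parent_ref` field mirrors the pseudocode above; `parent_store`, `max_parents` and the sample data are hypothetical.

```python
def parents_for_hits(child_hits, parent_store, max_parents=3):
    """Map ranked child hits to their parent chunks, dropping duplicate parents."""
    seen, contexts = set(), []
    for hit in child_hits:  # assumed to be sorted by similarity, best first
        parent_id = hit["parent_ref"]
        if parent_id in seen:
            continue  # several children of the same parent matched
        seen.add(parent_id)
        contexts.append(parent_store[parent_id])
        if len(contexts) == max_parents:
            break
    return contexts


parent_store = {"p1": "Section 1 text", "p2": "Section 2 text"}
child_hits = [{"parent_ref": "p2"}, {"parent_ref": "p2"}, {"parent_ref": "p1"}]
print(parents_for_hits(child_hits, parent_store))  # ['Section 2 text', 'Section 1 text']
```

Deduplicating parents matters because neighbouring children frequently match the same query, and repeating the parent would waste prompt budget.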

How to test the right strategy

Chunking decisions are not guessed but measured. The procedure:

Build a gold set: 50-200 real user questions, each annotated with the correct source passage. Without a gold set, LLM-as-judge (RAGAS reference-free) or a synthetically generated test set helps.
Index variants: Chunk the same corpus multiple times - for example Fixed-512, Recursive-300, Semantic, Parent-Child - with an otherwise identical pipeline.
Measure with RAGAS: The central metrics are Context Precision and Context Recall (does retrieval return the right passages?) as well as Faithfulness (does the answer genuinely rely on the context?) and Answer Relevancy (does the answer address the question?).
A/B in production: Roll out the variant leading in the offline test against the existing one and observe real signals (clicks, user feedback).
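Before wiring up RAGAS, a plain retrieval check against the gold set already separates good from bad variants. The sketch below computes precision@k and recall@k; these are simplified stand-ins, not the RAGAS definitions of Context Precision and Context Recall, and the field names `question` and `relevant_ids` are assumptions.

```python
def evaluate_variant(gold_set, retrieve, k=5):
    """Average precision@k and recall@k of a retriever over a gold set."""
    precisions, recalls = [], []
    for item in gold_set:
        retrieved = retrieve(item["question"])[:k]
        relevant = set(item["relevant_ids"])
        hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
        precisions.append(hits / len(retrieved) if retrieved else 0.0)
        recalls.append(hits / len(relevant))
    n = len(gold_set)
    return {"precision@k": sum(precisions) / n, "recall@k": sum(recalls) / n}


# Toy gold set and a fake retriever standing in for one indexed chunking variant.
gold_set = [
    {"question": "q1", "relevant_ids": ["a", "b"]},
    {"question": "q2", "relevant_ids": ["c"]},
]
fake_results = {"q1": ["a", "x", "b", "y"], "q2": ["z", "c"]}
print(evaluate_variant(gold_set, lambda q: fake_results[q], k=2))
# {'precision@k': 0.5, 'recall@k': 0.75}
```

Because chunk IDs differ between chunking variants, relevance is in practice matched against the annotated source passage (for example by character offsets) rather than against fixed IDs as in this toy example.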

Important: if the embedding model changes, a full re-indexing is required - embeddings from different models are not comparable. Therefore, tag your indices with a version string.
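One way to build such a version string is to derive it from every setting that invalidates the stored vectors. This is a sketch of one possible convention, not a standard; the function name, the tag layout and the model names in the usage example are hypothetical.

```python
import hashlib


def index_version(embedding_model, strategy, chunk_size, overlap):
    """Build a deterministic index tag from the settings that determine the stored vectors."""
    config = f"{embedding_model}|{strategy}|{chunk_size}|{overlap}"
    digest = hashlib.sha256(config.encode("utf-8")).hexdigest()[:8]
    return f"{strategy}-{chunk_size}-{overlap}-{digest}"


# "embed-model-v1" and "embed-model-v2" are placeholder model names.
old_tag = index_version("embed-model-v1", "parent-child", 300, 45)
new_tag = index_version("embed-model-v2", "parent-child", 300, 45)
print(old_tag != new_tag)  # True: a model change yields a new index tag
```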

For agencies and B2B decision-makers

Chunking is not a setup detail that you configure once and forget, but the adjusting screw with the highest leverage on the answer quality of your RAG system, and thereby on the trust that customers and employees place in it. Anyone who starts here with naive fixed-size chunking will pay later with hallucinations, support escalations and expensive rework. At Blck Alpaca, an agency based in Vienna, we design RAG pipelines for DACH companies and evaluate them measurably against RAGAS metrics, including German-language tokenisation, EU-compliant hosting and a chunking strategy that fits your documents and questions. Talk to us before you go into production, rather than repairing afterwards.

FAQ

What chunk size should I use for RAG?

As a default starting point, 200-800 tokens with 10-20% overlap has proven effective. Smaller chunks (e.g. 256 tokens) deliver more precise hits for fact-oriented questions, while larger chunks (e.g. 800 tokens) preserve more context for coherent explanations. Because German texts are, depending on the tokeniser, around 1.3-1.7x more token-intensive than English, you should test the effective size for your corpus rather than adopting a fixed value.

What is the difference between fixed-size and semantic chunking?

Fixed-size chunking cuts text mechanically after a fixed number of tokens - regardless of content, often mid-sentence or inside a table. Semantic chunking measures the cosine distance between consecutive sentences and places the cut where the topical coherence breaks. Fixed-size is fast and predictable, whereas semantic chunking produces more coherent chunks for content-heterogeneous documents but is more compute-intensive during indexing.

What is parent-child or hierarchical chunking?

Hierarchical chunking decouples retrieval from generation. Small, precise child chunks are embedded and searched, but the associated large parent chunk is passed to the language model. This way the system finds the exact location while the LLM receives enough surrounding context for a complete answer. It is particularly suitable for long reports, technical manuals and contracts.

What is Contextual Retrieval and how much does it help?

With Contextual Retrieval (Anthropic, as of 2024), a short, LLM-generated context header is prepended to each chunk before embedding, explaining where the segment originates. This addresses the lost-in-the-chunks problem, for example when a chunk only contains "It increased by 12%". According to Anthropic, contextual embeddings combined with contextual BM25 reduce the top-20 retrieval failure rate by 49%, and by 67% when re-ranking is added.

How do I find the right chunking strategy for my project?

Not by guessing, but by measuring. Build a gold set from real user questions together with the respective correct source passages, index the same corpus with several chunking variants and compare them using RAGAS metrics such as Context Precision, Context Recall and Faithfulness. Only this controlled A/B comparison reveals which strategy actually works best for your documents and query patterns.
