4.2 · Intermediate · 7 min

RAG Architecture: Ingestion, Retrieval, Generation, Reranking

Blck Alpaca
Definition

RAG architecture is the two-phase structure of a retrieval-augmented generation system: in the ingestion path, documents are loaded, chunked, embedded into vectors and indexed; in the query path, passages relevant to the request are retrieved, re-sorted (reranking) and passed to an LLM as context for answer generation.

Key Takeaways

  • A RAG pipeline consists of two separate paths: the offline ingestion/indexing path (loading, chunking, embedding, upsert into the vector DB) and the online query path (query embedding, retrieval, filtering, reranking, context assembly, generation).
  • Reranking with a cross-encoder is not an optional extra: Anthropic's Contextual Retrieval reduces the retrieval failure rate by 49 per cent, and by 67 per cent when combined with reranking (as of September 2024).
  • Hybrid search combines dense vector search (HNSW index) with sparse BM25 and captures exact codes, IDs and proper names that pure embeddings miss.
  • Metadata filters (tenant_id, ACL) belong in every chunk: they are a technical prerequisite for multi-tenancy and for the tenant separation and erasability required by the DSK (Datenschutzkonferenz, the conference of the German data protection authorities).
  • The most common architecture mistakes are naive fixed-size chunking, missing reranking, missing evaluation and missing source citations.
  • Sovereign building blocks for DACH deployments with EU hosting are Qdrant (Berlin), Weaviate (Amsterdam) and Haystack/deepset (Berlin), along with embeddings from Jina, Mistral or Aleph Alpha.

RAG architecture describes the technical structure of a retrieval-augmented generation system in two clearly separated phases. The first phase, ingestion (also indexing), runs offline and transforms source documents into a searchable index. The second phase, query time, runs online per request and retrieves the matching content, re-sorts it and passes it to a language model for generation. Anyone wishing to run RAG in production must understand and operate both paths separately.

  • Two paths, not one: Ingestion (loading, chunking, embedding, upsert) runs in batch; the query path (embedding, retrieval, filtering, reranking, generation) runs per request.
  • Reranking is mandatory, not a luxury: A cross-encoder after the coarse search reduces the retrieval failure rate measurably (by up to 67 per cent when combined with Contextual Retrieval, as of September 2024).
  • Metadata determines compliance: tenant_id, ACL and source reference per chunk are the technical basis for multi-tenancy, tenant separation and erasability.
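The metadata point can be made concrete with a minimal sketch. The `Chunk` fields and the helpers `visible_chunks` and `erase_tenant` are illustrative assumptions, not the schema or API of any particular vector database; in a real system the same conditions would be expressed as a filter in the database query.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    """One indexed unit; field names are illustrative, not a vector-DB schema."""
    chunk_id: str
    text: str
    tenant_id: str
    source: str  # source reference, later used for citations
    acl: frozenset = field(default_factory=frozenset)  # roles allowed to read


def visible_chunks(chunks, tenant_id, user_roles):
    """Keep only chunks of the caller's tenant that share at least one role."""
    roles = frozenset(user_roles)
    return [c for c in chunks if c.tenant_id == tenant_id and c.acl & roles]


def erase_tenant(chunks, tenant_id):
    """Erasure: drop every chunk that belongs to the given tenant."""
    return [c for c in chunks if c.tenant_id != tenant_id]


chunks = [
    Chunk("a-1", "Pricing for plan A", "tenant_a", "pricing.pdf", frozenset({"sales"})),
    Chunk("a-2", "Internal roadmap", "tenant_a", "roadmap.docx", frozenset({"management"})),
    Chunk("b-1", "Pricing for plan B", "tenant_b", "pricing.pdf", frozenset({"sales"})),
]

print([c.chunk_id for c in visible_chunks(chunks, "tenant_a", ["sales"])])  # ['a-1']
print([c.chunk_id for c in erase_tenant(chunks, "tenant_a")])               # ['b-1']
```

Without `tenant_id` on each record, neither the visibility check nor the erasure could be expressed at all, which is why the metadata has to be captured during ingestion.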

Phase 1: Ingestion / Indexing (offline)

The ingestion path is the groundwork that pre-determines the quality and latency of later operation. It consists of five steps:

1. Loading / connectors. Sources such as SharePoint, S3, Confluence or a database are connected via connectors. For each document, stable IDs and metadata are already captured here (source, timestamp, tenant, access rights).

2. Parsing. Raw documents, especially PDFs and contracts, are transformed into structured text. Layout-aware parsers preserve table, list and header structure, which later prevents hallucinations. Tools for this include Docling (IBM, open source), Unstructured.io, LlamaParse and Azure Document Intelligence.

3. Chunking. The text is broken into retrievable units. Naive fixed-size chunking (e.g. 512 tokens, overlap 50) is simple but cuts through sentences and tables. Better options are recursive/semantic chunking (cutting at cosine drift between sentences) or hierarchical parent-child chunking (small chunks for retrieval, large parent chunks as context). Rule of thumb: 200 to 800 tokens, 10 to 20 per cent overlap.

4. Embedding. Each chunk is converted into a vector by an embedding model. For German-language corpora, multilingual models such as Cohere Embed v4, Voyage-3, Gemini Embedding 002 or BGE-M3 dominate; sovereign options are jina-embeddings-v3 (Berlin), Mistral Embed (EU) and Aleph Alpha Pharia-1-Embed (on-prem). Important: depending on the tokeniser, German is roughly 1.3 to 1.7 times more token-intensive than English.

5. Indexing / upsert. The vectors are written to a vector database together with their metadata. HNSW (Malkov & Yashunin 2016) is the standard index algorithm almost everywhere; its parameters M, ef_construction and ef_search trade recall against speed. Optionally, a BM25 index is built in parallel for hybrid search, and an LLM-generated context header is prepended to each chunk before embedding (Contextual Retrieval).
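The chunking step can be sketched as follows. This is a simplified, sentence-aware variant under stated assumptions: whitespace-separated words stand in for tokens (a real pipeline would count with the embedding model's tokeniser), sentences are split with a naive regular expression, and the function name `chunk_sentences` is hypothetical. A single sentence longer than `max_tokens` is kept whole rather than cut.

```python
import re


def chunk_sentences(text, max_tokens=500, overlap_tokens=50):
    """Pack whole sentences into chunks of at most max_tokens words.

    Trailing sentences of each chunk (up to overlap_tokens words) are
    repeated at the start of the next chunk as overlap.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        size = sum(len(s.split()) for s in current)
        if current and size + len(sentence.split()) > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences over as overlap for the next chunk.
            carried, carried_size = [], 0
            for prev in reversed(current):
                carried_size += len(prev.split())
                if carried_size > overlap_tokens:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_sentences(text, max_tokens=6, overlap_tokens=3):
    print(chunk)
```

With the toy limits above, each chunk holds two sentences and shares one sentence with its neighbour. Cutting only at sentence boundaries is what distinguishes this from naive fixed-size chunking; table-aware or semantic splitting would need a layout-aware parser or an embedding model on top.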

Phase 2: Query time (online)

The query path runs once per user request. Its retrieval steps take from a few milliseconds to a few hundred milliseconds; LLM generation comes on top of that:

1. Query preprocessing (optional). Query rewriters, HyDE or decomposers reformulate complex questions or break them down. In multi-tenant systems, AuthN/Z and tenant resolution additionally take place here.

2. Query embedding. The question is converted into a vector using the same embedding model as the chunks. In parallel, a BM25 query is created.

3. Retrieval / vector search. The vector DB returns the most similar chunks (typically top_k = 50 to 100). In hybrid search, dense and sparse searches run in parallel and are combined via rank fusion (Reciprocal Rank Fusion).

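4. Filtering. Metadata filters (tenant_id, ACL, date) narrow down the candidates. Where the vector DB supports it, the filter is applied inside the search call rather than afterwards, so that top_k is not depleted by discarded hits. This is not only about performance but about compliance: the DSK RAG guidance requires tenant separation and a rights and roles concept.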

5. Reranking. A cross-encoder (Cohere Rerank v3, BGE Reranker, Jina Reranker v2) scores query and document together and reduces the 50 to 100 candidates to the most relevant top_k = 5 to 10. Cross-encoders are more accurate but slower than the bi-encoders of the coarse search, which is why they only run after the cheap recall.

6. Context assembly & prompting. The final chunks are inserted into a prompt template with source references. The most important passages belong at the beginning or end of the context, not in the middle, to avoid the lost-in-the-middle problem.

7. LLM generation. The language model generates the answer, ideally with citation forcing and a downstream faithfulness guardrail that checks whether the answer is supported by the sources.
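The rank fusion from step 3 can be sketched in a few lines. Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over all result lists; k = 60 is the constant commonly used in the literature, and the chunk IDs below are made up for illustration.

```python
from collections import defaultdict


def rrf_fuse(*rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["c3", "c1", "c7"]   # hypothetical chunk IDs from vector search
sparse = ["c9", "c3", "c1"]  # hypothetical chunk IDs from BM25

print(rrf_fuse(dense, sparse))  # ['c3', 'c1', 'c9', 'c7']
```

Because RRF only uses ranks, it needs no normalisation between cosine similarities and BM25 scores, which live on different scales. Chunks found by both searches (c3, c1) move ahead of chunks found by only one.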

Table: RAG pipeline by phase, task and tools

| Phase | Task | Typical tools (as of 2026) |
|---|---|---|
| Ingestion | Loading / connectors | SharePoint, S3, Confluence connectors |
| Ingestion | Parsing | Docling, Unstructured.io, LlamaParse, Azure Document Intelligence |
| Ingestion | Chunking | LangChain splitters, semantic/hierarchical chunking, contextual chunking |
| Ingestion | Embedding | Cohere Embed v4, Voyage-3, Gemini Embedding 002, jina-v3, BGE-M3, Aleph Alpha |
| Ingestion | Indexing / upsert | Qdrant, Weaviate, Pinecone, Milvus, pgvector (HNSW + optional BM25) |
| Query | Query embedding | same embedding model as ingestion |
| Query | Retrieval / vector search | HNSW ANN search, hybrid (dense + BM25), RRF fusion |
| Query | Filtering | metadata filters (tenant_id, ACL) in Qdrant/Weaviate/pgvector |
| Query | Reranking | Cohere Rerank v3, BGE Reranker, Jina Reranker v2, LLM-as-Judge |
| Query | Generation | LLM (Claude, GPT, Gemini, Mistral, Aleph Alpha Pharia) + faithfulness guardrail |
| Cross-cutting | Evaluation / observability | RAGAS, TruLens, DeepEval, Arize Phoenix, LangSmith |

Data flow in words

A document travels offline through loader, parser and chunker, is vectorised by the embedding model and lands together with its metadata as a record in the vector DB. When a question arrives online, it is embedded with the same model and matched against the chunk vectors by approximate nearest-neighbour search over the HNSW index, which yields a coarse selection without comparing against every vector. Metadata filters remove what the user is not allowed to see. The reranker condenses the coarse selection down to the best hits. These are assembled into the prompt with source citations, the LLM generates the answer from them, and a guardrail checks fidelity to the source. Crucially, ingestion and query embedding must use the same model, otherwise the vectors are not comparable.

Concrete example with numbers

A support bot for a SaaS product indexes 50,000 documents. With semantic chunking at around 500 tokens per chunk, roughly 300,000 vectors are created. Indexing with Contextual Retrieval costs, according to Anthropic, about 1.02 US dollars per 1 million document tokens with prompt caching (as of September 2024). In operation, each query retrieves top_k = 50 candidates; the Cohere reranker reduces this to top_k = 5.
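As a rough plausibility check, the figures above can be multiplied out. This is a back-of-the-envelope estimate, not a quote: it assumes every chunk has exactly 500 tokens and ignores overlap between chunks, so the token count (and with it the cost) is likely somewhat overestimated.

```python
documents = 50_000
chunks = 300_000
tokens_per_chunk = 500          # "around 500 tokens" from the example
usd_per_million_tokens = 1.02   # Anthropic figure with prompt caching

chunks_per_document = chunks / documents
total_tokens = chunks * tokens_per_chunk  # ignores overlap between chunks
indexing_cost_usd = total_tokens / 1_000_000 * usd_per_million_tokens

print(f"{chunks_per_document:.0f} chunks per document")
print(f"{total_tokens / 1_000_000:.0f} million tokens")
print(f"about {indexing_cost_usd:.0f} US dollars for contextualised indexing")
```

Under these assumptions, the corpus comes to about 6 chunks per document, 150 million tokens and roughly 153 US dollars in one-off contextualisation cost.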

The Anthropic eval shows the effect of the building blocks on the top-20 retrieval failure rate: without optimisation, it stood at 5.7 per cent. With contextual embeddings alone, it fell to 3.7 per cent (minus 35 per cent), with full Contextual Retrieval (embeddings plus contextual BM25) to 2.9 per cent (minus 49 per cent), and with additional reranking to 1.9 per cent (minus 67 per cent). The percentages in brackets are relative reductions against the 5.7 per cent baseline. In terms of latency, a hybrid retrieval query including reranking takes roughly 100 to 800 milliseconds.
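The relative reductions can be recomputed from the absolute failure rates quoted above; the short check below uses only those four numbers.

```python
baseline = 5.7  # top-20 retrieval failure rate in per cent, no optimisation

stages = {
    "contextual embeddings": 3.7,
    "contextual embeddings + contextual BM25": 2.9,
    "contextual retrieval + reranking": 1.9,
}

# Relative reduction against the baseline, rounded to whole per cent.
reductions = {
    name: round((baseline - rate) / baseline * 100)
    for name, rate in stages.items()
}

for name, reduction in reductions.items():
    print(f"{name}: minus {reduction} per cent")
```

The results are 35, 49 and 67 per cent, matching the figures in the text.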

Pseudocode of the query path:

```
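tenant = {"tenant_id": tenant_id}                     # resolved during AuthN/Z
q_vec  = embed(query)                                 # same model as ingestion
dense  = vdb.search(q_vec, top_k=50, filter=tenant)   # ANN search, tenant-filtered
sparse = bm25.search(query, top_k=50, filter=tenant)  # keyword search, same filter
cands  = rrf_fuse(dense, sparse)                      # hybrid search via RRF
top5   = rerank(query, cands)[:5]                     # cross-encoder
prompt = template(query, top5, cite=True)             # numbered sources for citations
answer = llm.generate(prompt)
if faithfulness(answer, top5) <= 0.8:                 # guardrail
    raise ValueError("answer not supported by sources")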
```
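The `template(query, top5, cite=True)` step of the pseudocode could look roughly like this. It is a minimal sketch: the prompt wording, the `(source, text)` tuple format and the sample passages are assumptions made for illustration, not a prescribed format.

```python
def template(query, chunks, cite=True):
    """Build a prompt from (source, text) pairs; numbering enables citations."""
    lines = ["Answer the question using only the sources below."]
    if cite:
        lines.append("Cite the supporting source after each claim, e.g. [1].")
    lines.append("")
    for i, (source, text) in enumerate(chunks, start=1):
        lines.append(f"[{i}] ({source}) {text}")
    lines.append("")
    lines.append(f"Question: {query}")
    return "\n".join(lines)


top_chunks = [  # hypothetical reranker output, best hit first
    ("handbook.pdf", "Refunds are possible within 30 days of purchase."),
    ("faq.md", "Refunds are paid to the original payment method."),
]

print(template("How long can I request a refund?", top_chunks))
```

Numbering the sources gives the model something concrete to cite and lets the downstream guardrail map each claim back to a chunk.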

Common architecture mistakes

  • Naive fixed-size chunking ignores sentence and table boundaries and leads to missing content and hallucinations.
  • Wrong embedding model for the language (EN embeddings for a DE corpus) destroys recall on German queries.
  • No reranking: hits are semantically similar but not relevant.
  • Lost-in-the-chunks: without a context header, a chunk such as "It increased by 12 per cent" does not know what it is about.
  • Pure semantics without BM25: exact codes and IDs are not found.
  • No metadata filters: no multi-tenant, no ACL protection; a chunk from tenant A can be retrieved for tenant B.
  • Oversized top_k to the LLM: lost-in-the-middle and higher costs.
  • No evaluation and no source citations: silent quality regression and compliance risk.
  • Embedding drift on model change: without a version string and full re-indexing, old and new vectors are incompatible.
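The last point, embedding drift, can be guarded against with a version string per vector. The sketch below is an assumption about how such a check might look; the record layout, the exception name and the model names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedVector:
    chunk_id: str
    vector: tuple
    embedding_model: str  # version string stored at ingestion time


class EmbeddingModelMismatch(Exception):
    """Raised when index and query would use different embedding models."""


def check_index_compatible(records, query_model):
    """Refuse to query an index that contains vectors from another model."""
    stale = sorted({r.embedding_model for r in records} - {query_model})
    if stale:
        raise EmbeddingModelMismatch(
            f"index contains vectors from {stale}; "
            f"re-index before querying with {query_model!r}"
        )


records = [
    IndexedVector("c1", (0.1, 0.2), "embed-model-v1"),
    IndexedVector("c2", (0.3, 0.4), "embed-model-v1"),
]
check_index_compatible(records, "embed-model-v1")  # passes silently

mixed = records + [IndexedVector("c3", (0.5, 0.6), "embed-model-v2")]
try:
    check_index_compatible(mixed, "embed-model-v1")
except EmbeddingModelMismatch as err:
    print(err)
```

Failing loudly is the point here: mixed vectors still return results, only worse ones, so the problem would otherwise surface as silent quality regression.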

For agencies and B2B

A clean RAG architecture is not a weekend prototype but a pipeline with ingestion, hybrid retrieval, reranking, evaluation and an erasure concept. For agencies, this means that the value lies not in the LLM but in the data strategy, in chunking and in tenant separation. For DACH B2B decision-makers, sovereignty counts as well: Qdrant, Weaviate, Haystack and EU hosting provide the technical basis for running RAG in a GDPR-compliant way. Blck Alpaca designs and operates such pipelines end-to-end, including RAGAS evaluation and an erasure pipeline in line with the DSK guidance. Get in touch if your AI pilot is to become a production-grade, verifiable knowledge system.

FAQ

What is the difference between the ingestion phase and the query phase in a RAG architecture?
The ingestion phase runs offline and in batch: sources are connected, parsed, broken into chunks, converted into vectors with an embedding model and written to the vector database (upsert). The query phase runs online per request: the user question is embedded, matching chunks are retrieved, filtered, re-sorted by a reranker, inserted into the prompt and processed by the LLM into an answer.
Is reranking in a RAG pipeline really necessary?
In production systems, almost always. Vector search returns semantically similar but not necessarily relevant hits. A cross-encoder reranker scores query and document together and re-sorts the top 50 down to the best top 5 to top 10. According to Anthropic, this reduces the retrieval failure rate by up to 67 per cent when combined with Contextual Retrieval (as of September 2024). Without reranking, too much noise ends up in the prompt.
What is the right chunk size for a RAG system?
There is no universal value. As a rule of thumb, 200 to 800 tokens with 10 to 20 per cent overlap apply. More important than the exact number is layout-aware parsing that preserves tables, lists and headings, as well as semantic or hierarchical chunking instead of rigid fixed-size cutting. For German-language corpora, note that German is roughly 1.3 to 1.7 times more token-intensive than English depending on the model, which lowers the effective chunk capacity.
What is hybrid search and why is pure vector search not enough?
Hybrid search combines dense retrieval (semantic vector search over an HNSW index) with sparse retrieval (classic BM25 keyword matching) and fuses both hit lists, typically via Reciprocal Rank Fusion. Pure vector search often misses exact codes, article numbers, proper names or IDs such as TS-999, because these carry little semantic signal. BM25 captures precisely these cases.
Which tools do I need for an EU-sovereign RAG architecture?
For the vector database, Qdrant (Berlin) or Weaviate (Amsterdam/EU); as a framework, Haystack from deepset (Berlin); for embeddings, jina-embeddings-v3 (Berlin), Mistral Embed (EU) or Aleph Alpha Pharia-1-Embed (Heidelberg, on-prem capable); and for hosting, STACKIT, IONOS, OVHcloud or Open Telekom Cloud. This keeps data within the EU and makes tenant separation and erasability in line with the DSK guidance implementable.
