RAG Systems Explained
How RAG systems supply LLMs with external knowledge — retrieval, embeddings, vector databases and accurate answers.
Retrieval-Augmented Generation (RAG) is a method in which a language model deliberately retrieves relevant content from an external knowledge source before generating an answer (retrieval) and inserts it into the prompt (augmentation). A RAG system consists of three mandatory elements: an external knowledge store with an index, a retriever, and a generator LLM whose prompt is augmented with the retrieved passages. This keeps answers source-grounded, verifiable, and up to date without re-training.
Key Takeaways
- ✓ RAG combines a generative language model with a retriever that deliberately pulls knowledge from an external source before every answer; the original definition comes from Lewis et al. (Facebook AI, NeurIPS 2020) as a combination of parametric and non-parametric memory.
- ✓ Anthropic's Contextual Retrieval reduces the retrieval failure rate by 49 percent, and by 67 percent in combination with reranking (Anthropic, 09/2024); the reranker is thus the single highest-ROI improvement in the pipeline.
- ✓ Hybrid Search combines dense retrieval (vectors) with sparse retrieval (BM25) and captures exact codes, IDs, and names that pure embeddings miss; for German, hybrid typically delivers 5–15 nDCG@10 points more than dense alone.
- ✓ HNSW (Malkov and Yashunin, 2016) is the standard index algorithm in nearly all production vector databases and delivers 95–99 percent recall at good latency up to about 100 million vectors.
- ✓ Long-context models do not replace RAG: Gemini 1.5 Pro reaches up to 99.7 percent recall on the single-needle test at 1 million tokens, but drops to around 60 percent on realistic multi-needle retrieval, at 30–60× higher latency and roughly 1250× higher cost per query (Tian Pan 2026; arXiv:2407.01370).
- ✓ For German-language corpora, German embedding quality is decisive, not the English MTEB rank; multilingual models such as Cohere Embed v4, BGE-M3 (MIT, MIRACL-SOTA), and Jina v4 (Berlin, Apache 2.0) lead in 2026.
- ✓ GDPR relevance (informational, not legal advice): embeddings of personal data are most likely considered personal under EDPB Opinion 28/2024; inversion attacks reconstruct up to 92 percent of 32-token inputs (Morris et al., EMNLP 2023), which is why Art. 17 must also be applied to embeddings and chunks.
- ✓ Sovereign DACH/EU options allow on-prem operation: Qdrant (Berlin), Weaviate (Amsterdam), Haystack/deepset (Berlin), SAP HANA Cloud Vector Engine, pgvector on STACKIT/IONOS/OTC, as well as Aleph Alpha and Jina as DACH-native providers.
What is RAG? A clear definition
Retrieval-Augmented Generation (RAG) denotes a method in which a Large Language Model (LLM) deliberately retrieves external knowledge content before generating an answer (retrieval) and embeds it into the prompt (augmentation), in order to ground the generation in verifiable sources. The original definition comes from Lewis et al. (Facebook AI Research, NeurIPS 2020) and describes RAG as a combination of parametric memory (the trained language model) and non-parametric memory (an external, searchable index).
Across the various canonical definitions, a consensus minimum can be established: a RAG system consists of three mandatory elements — (i) an external knowledge store with an index, (ii) a retriever that finds relevant passages, and (iii) a generator LLM whose prompt is augmented with the retrieved content. The central benefit: RAG delivers source-grounded, traceable answers, measurably reduces hallucinations, and keeps knowledge up to date without the model having to be re-trained.
Why RAG? The problem it solves
LLMs have three structural weaknesses: they hallucinate, their knowledge is frozen at the training cutoff, and their answers are not traceable. RAG addresses all three. Instead of training the model on new data with expensive fine-tuning, the relevant knowledge is loaded at runtime per query. An update is thus as simple as re-indexing; citations are natively possible via chunk IDs; and the answer remains bound to verifiable material.
It is precisely these properties that make RAG the de facto standard pattern for enterprise AI in the DACH region, where traceability, currency, and data control are not optional but required by regulation and by trust.
The RAG pipeline: indexing and query path
A production RAG system has two paths. The indexing path (offline/batch) processes the knowledge sources: connectors load documents from SharePoint, S3, Confluence, or databases; a parser (e.g. Docling) extracts text, tables, and structures in a layout-faithful way; a chunker breaks the content apart; an embedding model generates vectors; and an upsert writes these along with metadata (tenant ID, ACL, source, timestamp) into the vector database — optionally accompanied by a parallel BM25 index.
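The indexing path can be sketched in a few lines. The following is a minimal in-memory illustration, not a production connector: `toy_embed`, `IndexRecord`, `upsert`, and `index_document` are hypothetical names, the example document and source URI are made up, and the hash-based embedding is only a stand-in for a real embedding model.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in for a real embedding model (assumption): hashes the text
    into a deterministic pseudo-vector. It carries no semantic meaning."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


@dataclass
class IndexRecord:
    chunk_id: str
    text: str
    vector: list[float]
    tenant_id: str
    acl: frozenset      # groups allowed to read this chunk
    source: str
    indexed_at: str     # ISO 8601 timestamp


def upsert(index: dict, record: IndexRecord) -> None:
    """Insert or overwrite by chunk ID, so re-indexing is idempotent."""
    index[record.chunk_id] = record


def index_document(index, doc_id, chunks, tenant_id, acl, source):
    """Embed each chunk and write it together with its metadata."""
    now = datetime.now(timezone.utc).isoformat()
    for i, text in enumerate(chunks):
        upsert(index, IndexRecord(
            chunk_id=f"{doc_id}#{i}",
            text=text,
            vector=toy_embed(text),
            tenant_id=tenant_id,
            acl=frozenset(acl),
            source=source,
            indexed_at=now,
        ))


index = {}
index_document(
    index, "policy-7",
    ["Travel expenses are reimbursed monthly.",
     "Travel expenses above 500 EUR require approval."],
    tenant_id="acme", acl={"finance"}, source="sharepoint://policies/7",
)
```

Keying the upsert on a stable chunk ID is what makes a knowledge update a plain re-index: running `index_document` again for the same document overwrites its chunks instead of duplicating them.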
The query path (online) begins with the user query, optionally rewritten (query rewriting, HyDE). Then a hybrid retrieval runs (typically top_k = 50–100), followed by a re-ranker that condenses to the most relevant 5–10 passages. These are inserted into a prompt template with source citation and passed to the LLM; a faithfulness check can verify the answer against the sources.
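The last step of the query path, inserting the reranked passages into a prompt template with source citation, might look roughly like this. The template wording and the function name `build_prompt` are assumptions for illustration; the passages are invented sample data.

```python
def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Assemble a grounded prompt. `passages` holds (chunk_id, text) pairs,
    already reranked, most relevant first."""
    context = "\n".join(f"[{chunk_id}] {text}" for chunk_id, text in passages)
    return (
        "Answer the question using only the sources below. "
        "Cite the chunk ID in square brackets after each claim. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What is the approval limit for travel expenses?",
    [("policy-7#1", "Travel expenses above 500 EUR require approval."),
     ("policy-7#0", "Travel expenses are reimbursed monthly.")],
)
print(prompt)
```

Because every passage carries its chunk ID into the prompt, a downstream faithfulness check can map each cited ID back to the exact source text.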
Embeddings and vector databases
Embeddings are numerical vector representations of text in which semantic proximity is mapped to geometric proximity (cosine similarity is the default for normalized vectors). For German-language corpora, an important rule applies: the English MTEB rank is not the German rank. Models that lead in English often lose 5–15 nDCG@10 points on German compounds, technical jargon, and long words. What counts are German or multilingual benchmarks (MMTEB, MIRACL-de, MTEB-DE). German compounds also produce more tokens. As a rule of thumb, depending on the model, German is roughly 1.3–1.7× more token-intensive than English.
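To make "semantic proximity as geometric proximity" concrete, here is a small sketch of cosine similarity in plain Python. The two-dimensional vectors are toy values chosen for readability; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


query = [1.0, 0.0]
near = [0.8, 0.6]    # points in almost the same direction as the query
far = [0.0, 1.0]     # orthogonal to the query

print(cosine_similarity(query, near))
print(cosine_similarity(query, far))
```

For unit-length vectors the denominator is 1, so cosine similarity reduces to a plain dot product. This is why many systems normalize once at ingest time and then use the cheaper inner product at query time.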
Vector databases store embeddings and answer similarity queries via approximate-nearest-neighbour indexes. The standard algorithm is HNSW (Hierarchical Navigable Small World, Malkov & Yashunin 2016), which runs in nearly all production systems — from Qdrant through Weaviate, Milvus, and pgvector to SAP HANA. Practical benchmark: HNSW delivers 95–99 % recall and is comfortable up to ~100 million vectors; beyond that, quantization (halfvec, SQ8) or disk-based methods such as DiskANN help.
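The recall figures quoted for approximate indexes are measured against an exact (brute-force) search over the same data. A minimal sketch of that measurement follows; the function name and the chunk IDs are made up for illustration.

```python
def recall_at_k(approx_ids: list[str], exact_ids: list[str], k: int) -> float:
    """Share of the true top-k neighbours that the ANN index also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)


exact = ["c1", "c2", "c3", "c4", "c5"]     # ground truth from brute-force search
approx = ["c1", "c3", "c2", "c9", "c5"]    # what the approximate index returned

print(recall_at_k(approx, exact, k=5))     # 4 of the 5 true neighbours were found
```

Order within the top k does not matter for this metric; only membership counts, which is why the swapped positions of `c2` and `c3` are not penalized.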
| Vector DB | Origin | License | Hybrid Search | DACH/EU hosting |
|---|---|---|---|---|
| Qdrant | Berlin (DE) | Apache 2.0 | yes (BM25/SPLADE) | yes (on-prem, STACKIT, air-gapped) |
| Weaviate | Amsterdam (NL/EU) | BSD-3 | yes | yes (EU region) |
| pgvector | OSS (Postgres) | PostgreSQL license | via tsvector/ParadeDB | anywhere Postgres runs |
| SAP HANA Vector | DE (SAP) | commercial | with full-text | yes (BTP, Sovereign Cloud, Delos) |
| Pinecone | New York (US) | proprietary SaaS | yes | EU region, but no on-prem |
For the DACH mid-market under ~10–50 million vectors, pgvector on a managed Postgres (IONOS, STACKIT, OTC, Hetzner) is the pragmatic sovereign default: one database, one backup story, one DPA chain. Corporations typically run two or three stores in parallel — SAP HANA Vector for SAP-resident data plus a dedicated vector DB (Qdrant, Weaviate) for unstructured documents.
Chunking: how knowledge is broken apart
Chunking decisively determines retrieval quality. Naive fixed-size chunking (e.g. 512 tokens, 50-token overlap) is robust but cuts through sentences, tables, and lists — a common anti-pattern. Better strategies:
- Recursive/semantic chunking respects document structure or cuts at content-based jump points.
- Hierarchical (parent-child) uses small chunks for retrieval and large parent chunks as generator context.
- Contextual Retrieval (Anthropic, 09/2024) prepends a short, LLM-generated context header to each chunk before embedding. Result: −49 % retrieval errors, −67 % with an additional reranker. The price is one LLM call per chunk at ingest time.
- Late Chunking (Jina, 2024) reverses the order: first the entire document is embedded with a long-context embedder, then the token embeddings are mean-pooled within each chunk's boundaries, so the chunk vectors retain the document context. Late chunking is practically free at ingest time (no additional LLM call) and, in the Jina study, ~24 % better than naive chunking. For cost-disciplined DACH projects, late chunking with Jina v3/v4 or BGE-M3 is often the more rewarding default.
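Two of the strategies above can be sketched briefly: naive fixed-size chunking with overlap, and the pooling step of late chunking. This is an illustration under assumptions, not the Jina implementation. Both function names are hypothetical, the tokens are whitespace-split letters, and the token vectors are toy numbers standing in for the output of a long-context embedding model run over the whole document.

```python
def fixed_size_chunks(tokens: list[str], size: int = 512,
                      overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) token spans of at most `size` tokens, where each
    span repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    spans = []
    start = 0
    while start < len(tokens):
        end = min(start + size, len(tokens))
        spans.append((start, end))
        if end == len(tokens):
            break
        start = end - overlap
    return spans


def late_chunk_vectors(token_vectors: list[list[float]],
                       spans: list[tuple[int, int]]) -> list[list[float]]:
    """Late chunking, pooling step only: mean-pool the document-level token
    embeddings inside each chunk span to obtain one vector per chunk."""
    pooled = []
    for start, end in spans:
        window = token_vectors[start:end]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window)
                       for d in range(dim)])
    return pooled


tokens = "a b c d e f g h i j".split()
spans = fixed_size_chunks(tokens, size=4, overlap=1)
print(spans)

toy_token_vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
print(late_chunk_vectors(toy_token_vectors, [(0, 2), (1, 3)]))
```

The difference between the two approaches lies in where the vectors come from. With naive chunking each span is embedded in isolation, whereas with late chunking the token vectors were computed with the full document in view before pooling.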
For layout-heavy PDFs (contracts, government correspondence, IFRS reports), layout-aware parsers (Docling, Marker) win — and, prospectively, multimodal approaches such as ColPali/ColQwen or Jina v4, which render each page as an image and thus circumvent OCR and layout errors.
Hybrid Search and reranking
Hybrid Search combines dense retrieval (embeddings, semantic proximity) with sparse retrieval (BM25 or learned sparse models such as SPLADE/ELSER) and fuses the results, usually via Reciprocal Rank Fusion (RRF). The reason: pure embeddings miss exact codes, IDs, file reference numbers, SAP material numbers, or IBANs — precisely the tokens that matter in DACH B2B practice. BM25 captures these. For German with compounds and technical jargon, hybrid consistently delivers 5–15 nDCG@10 points more than dense-only.
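Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores, which is why it can merge BM25 and cosine results without score calibration. A minimal sketch follows; `k = 60` is a commonly used default, and the chunk IDs are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per
    document, with rank starting at 1. Higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = ["c3", "c1", "c7"]     # semantic neighbours from the vector index
sparse = ["c9", "c3", "c1"]    # BM25 hits, e.g. c9 contains an exact material number
fused = reciprocal_rank_fusion([dense, sparse])
print([doc_id for doc_id, _ in fused])
```

Documents found by both retrievers (`c3`, `c1`) rise to the top, while the exact-match hit `c9`, which the dense retriever missed entirely, still stays in the candidate list for the reranker.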
Reranking is the second, more precise sorting stage: a cross-encoder encodes query and document jointly and re-scores the top candidates. This is the single highest-ROI improvement in the entire pipeline, typically +5–15 percentage points recall@5. The Anthropic study quantified the combined effect: contextual embeddings plus contextual BM25 yield −49 % retrieval failures compared to the embeddings-only baseline; adding a reranker on top brings this to −67 %.
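The reranking stage has a simple shape: score every (query, passage) pair, sort, cut. The sketch below keeps the scoring function pluggable. `toy_overlap_score` is a deliberately crude stand-in for a cross-encoder (an assumption for the sake of a runnable example); in a real pipeline a model such as BGE Reranker M3 would be called there instead.

```python
from typing import Callable


def rerank(query: str,
           candidates: list[tuple[str, str]],
           score_fn: Callable[[str, str], float],
           top_n: int = 5) -> list[tuple[str, float]]:
    """Score every (query, passage) pair jointly and keep the best top_n.
    `candidates` holds (chunk_id, text) pairs from first-stage retrieval."""
    scored = [(chunk_id, score_fn(query, text)) for chunk_id, text in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder (assumption): fraction of query terms
    that occur in the passage. A real reranker would run a model here."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


candidates = [
    ("c1", "reimbursement happens monthly"),
    ("c2", "the approval limit for travel is 500 EUR"),
    ("c3", "travel booking portal"),
]
top = rerank("approval limit travel", candidates, toy_overlap_score, top_n=2)
print(top)
```

The cost profile follows directly from the loop: one scoring call per candidate, which is why the reranker only sees the 50–100 first-stage hits and never the whole corpus.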
Latency budget for a DACH standard pipeline (10 million vectors): BM25 and dense first stage 10–50 ms each, RRF under 1 ms, cross-encoder reranker (e.g. BGE Reranker M3 on a GPU) 100–300 ms — a total of 150–500 ms before LLM generation. With hard sub-100-ms SLAs, the reranker is the first candidate to drop, against a recall loss of 5–15 points.
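As a quick plausibility check, the per-stage figures above can be added up. The sketch below sums only the itemized stages, once with BM25 and dense retrieval running in parallel and once sequentially. Treating "under 1 ms" as the range 0–1 ms is a simplification.

```python
# Per-stage latency ranges in milliseconds, taken from the paragraph above.
STAGES_MS = {
    "bm25": (10, 50),
    "dense": (10, 50),
    "rrf": (0, 1),          # "under 1 ms"
    "reranker": (100, 300),
}


def budget(parallel_first_stage: bool) -> tuple[int, int]:
    """Sum best-case and worst-case latency over the itemized stages."""
    totals = []
    for bound in (0, 1):    # 0 = lower bound, 1 = upper bound
        bm25 = STAGES_MS["bm25"][bound]
        dense = STAGES_MS["dense"][bound]
        first = max(bm25, dense) if parallel_first_stage else bm25 + dense
        totals.append(first + STAGES_MS["rrf"][bound] + STAGES_MS["reranker"][bound])
    return totals[0], totals[1]


print(budget(parallel_first_stage=True))     # (110, 351)
print(budget(parallel_first_stage=False))    # (120, 401)
```

The itemized stages thus account for roughly 110–400 ms. The remainder of the stated 150–500 ms total is not broken down in the text; it plausibly covers query embedding, network hops, and serialization, but that is an assumption. Either way, the reranker dominates the budget, which is consistent with it being the first stage to drop under a hard sub-100-ms SLA.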
RAG vs. alternatives: when to use what?
RAG is not the only strategy. The choice among RAG, fine-tuning, long-context, and prompt engineering depends on the objective:
| Dimension | RAG | Fine-Tuning | Long-Context |
|---|---|---|---|
| Knowledge update | very easy (re-index) | laborious (re-train) | expensive per query |
| Source citation | native (chunk IDs) | not possible | possible, but unreliable |
| Hallucination risk | low (with rerank + faithfulness) | medium (frozen knowledge) | medium-high (lost-in-the-middle) |
| GDPR controllability | good (ACL, deletion pipeline) | problematic (knowledge in the model) | problematic with closed API |
An open but increasingly settled debate is long-context vs. RAG. Modern models offer huge context windows (Gemini 2.5: 1 million tokens, Claude: 200k). On the classic single-needle test, Gemini 1.5 Pro reaches up to 99.7 % recall at 1 million tokens — but on realistic multi-needle retrieval the value drops to around 60 % (arXiv:2407.01370). Add to that ~30–60× higher latency and ~1250× higher cost per query compared to a RAG pipeline (qualitative comparison, Tian Pan 2026). The 2026 consensus: long-context complements RAG for narrowly scoped workloads, but rarely replaces it for multi-needle, multi-tenant, and cost-sensitive scenarios.
GDPR and sovereign operation in the DACH region
Note: the following statements are informational and do not constitute legal advice.
One of the most consequential GDPR questions in 2025/2026 is: are embeddings personal data? The honest answer is "most likely yes, insofar as they are derived from personal data." Inversion attacks reconstruct up to 92 % of 32-token inputs exactly (Morris et al., EMNLP 2023), so an embedding is not a safe pseudonymization. EDPB Opinion 28/2024 calls for a case-by-case re-identification risk assessment; the CJEU ruling C-413/23 P (September 2025) clarifies that pseudonymized data are not automatically personal for every recipient, which narrows the obligations but does not abolish them.
Practical consequences for RAG architectures, oriented on the DSK guidance document on RAG and EDPB requirements:
- Right to erasure (Art. 17): embeddings and chunks must also be deletable. HNSW graphs support point deletion to varying degrees — deletion semantics are a hard procurement criterion (pgvector and Qdrant delete efficiently).
- Tenant separation: tenant ID and ACL in the metadata, filter on every query, defense-in-depth (auth + filter + re-check + audit log). A shared index without a tenant filter is a GDPR accident waiting to happen.
- Data residency: prefer EU-region hosting; with US cloud providers, assess CLOUD Act / FISA 702 residual risk (SCC + TIA). Transferring an embedding to a US-hosted vector DB is a third-country transfer.
- Minimization before embedding: where possible, remove names, emails, and IDs before embedding, or replace them with stable pseudonyms; encrypt vectors with customer-managed keys (CMK/BYOK).
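The first two points of the list above, erasure and tenant separation, can be illustrated with an in-memory store. This is a sketch under assumptions, not a vector-database API: `Chunk`, `visible_chunks`, and `erase_document` are hypothetical names, the data is invented, and a real system would apply the same filter inside the vector search itself rather than afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_id: str
    tenant_id: str
    acl: frozenset      # groups allowed to read this chunk
    text: str


def visible_chunks(store: dict, tenant_id: str, user_groups) -> list:
    """Mandatory filter on every query: only chunks of the caller's tenant
    whose ACL shares at least one group with the caller."""
    groups = set(user_groups)
    return [c for c in store.values()
            if c.tenant_id == tenant_id and c.acl & groups]


def erase_document(store: dict, tenant_id: str, doc_id: str) -> list:
    """Art. 17 sketch: delete every chunk of one document (and with it the
    embedding stored alongside); returns the deleted IDs for the audit log."""
    doomed = [cid for cid, c in store.items()
              if c.tenant_id == tenant_id and c.doc_id == doc_id]
    for cid in doomed:
        del store[cid]
    return doomed


store = {c.chunk_id: c for c in [
    Chunk("hr-1#0", "hr-1", "acme", frozenset({"hr"}), "Salary bands 2026"),
    Chunk("fin-2#0", "fin-2", "acme", frozenset({"finance"}), "Q1 forecast"),
    Chunk("hr-9#0", "hr-9", "globex", frozenset({"hr"}), "Onboarding guide"),
]}

visible = visible_chunks(store, "acme", ["hr"])
deleted = erase_document(store, "acme", "hr-1")
print([c.chunk_id for c in visible], deleted)
```

Deleting by document ID only works if every chunk carries that ID in its metadata from ingest onward, so the deletion pipeline is a design decision made at indexing time, not an afterthought.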
For regulated workloads (BFSI under MaRisk/DORA, health under MDR/IVDR, KRITIS under NIS2, public sector under OZG), the rule is that every layer must be deployable under sovereign control. The sovereign DACH/EU landscape is robust in 2026: Qdrant (Berlin), Weaviate (Amsterdam), Haystack/deepset (Berlin), SAP HANA Cloud Vector Engine, pgvector on STACKIT/IONOS/OTC/Hetzner, as well as Aleph Alpha (Heidelberg) and Jina (Berlin) as DACH-native model providers. Note on the EU AI Act (as of 05/2026): the political agreement on the Digital Omnibus of 7 May 2026 proposes to postpone the high-risk rules to 2 December 2027, but it is provisional and not yet formally adopted; the Art. 50 transparency obligations remain unchanged at 2 August 2026.
Quality assurance with RAGAS
A RAG system without evaluation regresses silently. The de facto standard is RAGAS with the core metrics faithfulness (fidelity to the source), answer relevancy, context precision, and context recall — complemented by TruLens ("RAG Triad") or DeepEval. The sensible approach is a gold set plus LLM-as-judge plus A/B tests in production, firmly anchored in the CI pipeline. Citation forcing, faithfulness guardrails, and answer refusal at low scores are the standard means against hallucinations despite RAG.
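To show what a gold set measures, here are deliberately simplified, ID-based analogues of context precision and context recall, plus a refusal guardrail. These are not the RAGAS implementations (which rely on an LLM judge), and the function names, the 0.8 threshold, and the sample IDs are assumptions for illustration.

```python
def simple_context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that appear in the gold set."""
    if not retrieved:
        return 0.0
    return sum(1 for cid in retrieved if cid in relevant) / len(retrieved)


def simple_context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of gold chunks that were actually retrieved."""
    if not relevant:
        return 1.0
    return len(relevant & set(retrieved)) / len(relevant)


def answer_or_refuse(answer: str, faithfulness: float,
                     threshold: float = 0.8) -> str:
    """Guardrail: refuse when the faithfulness score (supplied by an external
    judge, not computed here) falls below a hypothetical threshold."""
    if faithfulness < threshold:
        return "I cannot answer this reliably from the available sources."
    return answer


gold = {"c2", "c5"}                       # chunks a human marked as relevant
retrieved = ["c2", "c3", "c5", "c8"]      # what the pipeline returned

print(simple_context_precision(retrieved, gold))   # 2 of 4 retrieved are relevant
print(simple_context_recall(retrieved, gold))      # 2 of 2 relevant were retrieved
```

Run over a fixed gold set in CI, even metrics this simple catch the silent regressions mentioned above, for example when a chunking or embedding change drops recall.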
Outlook and practical note
Vector databases and embedding models have largely commoditized at the API level — HNSW is everywhere, hybrid search is standard, and multimodality (ColPali, Jina v4) is the new frontier. In parallel, RAG continues to evolve along the stages Naive → Advanced → Modular → Agentic RAG, with retrieval increasingly understood as a dynamic tool of an agent (Singh et al. 2025).
The pragmatic entry point for a DACH project: start with BM25 + dense (BGE-M3 or Jina v4) + cross-encoder reranker (BGE Reranker M3) on a sovereign Postgres/pgvector base, classify personal data before embedding, and measure quality from day one with RAGAS. What dominates architecturally in 2026 is no longer raw performance, but the question of where the embeddings sit, who can reach them, and whether the stack can be pulled on-prem if in doubt — a RAG stack planned to be sovereign, German-language, and open-source-oriented is the robust answer.
All Articles in this Topic
14 Articles
RAG Architecture: Ingestion, Retrieval, Generation, Reranking
RAG architecture is the two-phase structure of a retrieval-augmented generation system: in the ingestion path, documents are loaded, chunked, embedded into vectors and indexed; in the query path, passages relevant to the request are retrieved, re-sorted (reranking) and passed to an LLM as context for answer generation.
Embedding Models 2026 Compared: text-embedding-3, Cohere, BGE-M3, Voyage & Jina
An embedding model comparison evaluates models such as OpenAI text-embedding-3, Cohere Embed v4, BGE-M3, Voyage and Jina by dimensions, context length, MTEB/MMTEB benchmarks, multilinguality, cost, self-hosting and licence. For German-language RAG systems, what counts is not the English MTEB rank but the demonstrated quality on MMTEB, MIRACL and MTEB-DE.
Vector Database Comparison: Pinecone, Weaviate, Qdrant, Milvus, pgvector & Co. in the Enterprise Check
A vector database comparison evaluates vector databases based on hosting, scaling, metadata filtering, hybrid search, consistency, cost and maturity. In the DACH enterprise environment, the 2026 choice is primarily a sovereignty and GDPR decision: pgvector covers most cases below around 50 million vectors, while Qdrant is regarded as the DACH-proximate champion.
Pinecone vs. Weaviate vs. Qdrant: Vector DB Comparison from a DACH/EU Hosting Perspective
Pinecone, Weaviate and Qdrant are the three most widely used vector databases for RAG systems. From a DACH perspective, the deciding factor is less performance than hosting sovereignty: Qdrant (Berlin, Apache 2.0) and Weaviate (Amsterdam, BSD-3) are self-hostable and EU-native, while Pinecone is a US managed SaaS with no on-prem option.
Chunking Strategies for RAG: Fixed, Semantic, Hierarchical and Late Chunking Compared
Chunking strategies for RAG determine how a document is split into searchable text segments (chunks) before embedding. The choice of strategy and chunk size largely determines retrieval quality: it decides whether a language model finds the right passage and whether that passage contains enough context for a correct, source-grounded answer.
Hybrid Search in RAG: Combining BM25 and Vector Similarity Correctly
Hybrid search in RAG combines lexical search (BM25/keyword matching) with dense vector similarity. Both retrievers run in parallel, and their result lists are merged into one outcome via rank fusion (usually Reciprocal Rank Fusion). This lets the system find both semantically similar passages and exact terms, proper nouns and codes that pure embeddings miss.
Reranking in RAG: Cross-Encoder vs. Bi-Encoder
Reranking is the second retrieval stage in a RAG pipeline: a cross-encoder re-scores the top candidates found by the fast bi-encoder and sorts them by genuine query relevance. According to Anthropic, reranking significantly lowers the retrieval failure rate compared with pure vector retrieval, at the cost of higher latency.
Graph RAG: When Relationships Matter More Than Similarity
Graph RAG is a retrieval-augmented generation approach that stores knowledge not (only) as vectors, but as a knowledge graph of entities and their relationships. Instead of purely semantic similarity, the system uses the graph structure to answer multi-hop questions and connect information across many documents.
Agentic RAG vs. classic RAG: what is the difference?
Agentic RAG is a RAG variant in which an AI agent dynamically decides whether, what and how often knowledge is retrieved. Retrieval becomes a tool that the agent calls reflectively, in multiple steps and from several sources. Classic RAG, by contrast, follows a fixed, one-off pipeline without any decision logic.
Corrective RAG and Self-RAG: Self-Correcting Retrieval Patterns for Fewer Hallucinations
Corrective RAG (CRAG) and Self-RAG are self-correcting retrieval patterns. CRAG assesses the relevance of retrieved results and switches to a web-search fallback when quality is poor. Self-RAG lets the model decide for itself, via reflection tokens, whether to retrieve at all and whether its own answer is supported by the sources.
Multimodal RAG: Retrieving Images, PDFs and Tables
Multimodal RAG extends classic Retrieval-Augmented Generation to non-textual content: images, scanned PDFs, tables, charts and diagrams are indexed and made retrievable. Instead of searching plain text only, the system retrieves visual and structured information via multimodal embeddings, vision-LLM descriptions or layout-aware parsing and feeds it source-grounded into the answer prompt.
RAG Evaluation: RAGAS, TruLens and DeepEval Compared
RAG evaluation is the systematic, measurable quality assessment of a retrieval-augmented generation system. It separately assesses whether retrieval finds the right documents and whether the generated answer is faithfully grounded in those sources. The core metrics are Faithfulness, Answer Relevance, Context Precision and Context Recall, measured with frameworks such as RAGAS, TruLens, DeepEval or LangSmith.
Building GDPR-Compliant RAG Systems: A Practical Guide
A GDPR-compliant RAG system processes personal data in source documents, the vector index and embeddings only on a secure legal basis, with data minimisation, EU hosting, tenant isolation and a deletion pipeline that removes chunks and embeddings together. According to supervisory practice, embeddings count as pseudonymous personal data, not as anonymous.
RAG on-premise vs. EU cloud: A decision matrix for hosting options
RAG on-premise vs. cloud refers to the hosting decision for a retrieval-augmented generation system: on-premise (self-hosted) runs on your own hardware with maximum data control and CapEx, while EU cloud uses managed services in EU data centres with OpEx and faster scaling. The choice depends on data sensitivity, compliance, cost and operational know-how.