Hybrid Search in RAG: Combining BM25 and Vector Similarity Correctly
Hybrid search in RAG combines lexical search (BM25/keyword matching) with dense vector similarity. Both retrievers run in parallel, and their result lists are merged into one outcome via rank fusion (usually Reciprocal Rank Fusion). This lets the system find both semantically similar passages and exact terms, proper nouns and codes that pure embeddings miss.
Key Takeaways
- Hybrid search unites two complementary retrievers: BM25 captures exact terms, proper nouns, product IDs and codes, while dense vector search captures synonyms and meaning. Pure vector RAG demonstrably misses exact codes such as an article number TS-999.
- Reciprocal Rank Fusion (RRF) is the most robust fusion method because it uses only rank positions rather than incomparable raw scores, and therefore requires no score normalisation.
- Anthropic's Contextual Retrieval combines Contextual Embeddings with Contextual BM25 and reduces the top-20 retrieval failure rate by 49 percent (5.7 to 2.9 percent), with additional reranking by 67 percent (as of 09/2024).
- Hybrid retrieval plus Gemini 2.5 Flash achieves over 85 percent accuracy in the Agri-Query benchmark and clearly beats naive long-context prompts (arXiv:2508.18093, 2025).
- Hybrid search is supported by all production-ready vector databases, including the DACH/EU-sovereign options Qdrant (Berlin) and Weaviate (Amsterdam).
Hybrid retrieval has been a fixed component of advanced RAG since 2023 and belongs in any production pipeline that searches more than generic running text.
- What is combined: sparse retrieval (BM25, exact tokens) plus dense retrieval (vector embeddings, meaning) into a single result set.
- How it is combined: rank fusion, usually Reciprocal Rank Fusion (RRF), more rarely weighted score fusion with prior normalisation.
- Why it pays off: hybrid beats pure vector RAG wherever exact terms matter, and measurably improves recall.
The two pillars: lexical versus semantic
BM25 (sparse / lexical). BM25 is the proven standard of full-text search. The algorithm scores documents by exact term match, weights rare terms more heavily (inverse document frequency) and normalises for document length. BM25 is deterministic, fast, language-agnostic in its mechanism and needs no model inference. Its strength is at the same time its limit: it only finds what occurs literally (or after tokenisation and stemming). Synonyms, rephrasings and paraphrases escape it.
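To make the scoring concrete, here is a minimal sketch of BM25 in plain Python. The whitespace tokeniser, the parameter values `k1 = 1.5` and `b = 0.75`, the particular IDF variant and the three toy documents are illustrative assumptions, not a description of any specific search engine.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive whitespace tokeniser (assumption); real systems add stemming etc.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # purely lexical: no literal match, no contribution
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long documents are penalised via b.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Toy corpus (hypothetical).
docs = [
    "warranty terms for model ts-999 pump",
    "general warranty terms for all pumps",
    "how to end a contract early",
]
scores = bm25_scores("ts-999 warranty", docs)
```

The first document scores highest because it contains both query terms, including the rare token `ts-999`; the third shares no token with the query and scores zero, no matter how related it might be in meaning.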
Dense vector similarity (semantic). Dense retrieval embeds query and document in the same vector space and measures proximity, usually via cosine similarity. An approximate nearest-neighbour index (in almost all vector databases HNSW, introduced by Malkov and Yashunin) makes the search scalable. Dense retrieval captures meaning: "notice period" and "contract-end notification" land close to one another, even without shared words. The downside: embeddings smooth out exact character strings. An article number, an error code or a rarely seen proper noun becomes blurred in semantic space.
This is precisely where the need for combination arises. Pure semantics without BM25 is a documented anti-pattern: a query for an exact code such as "TS-999" is regularly missed by the vector retriever, whereas BM25 delivers it immediately. The two methods fail on different queries and therefore cover for one another.
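The proximity measure itself is simple. The following sketch computes cosine similarity over hand-made three-dimensional vectors; these toy vectors are an assumption standing in for real embeddings, which would come from an embedding model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (assumption), not output of a real embedding model.
toy_embeddings = {
    "notice period": [0.9, 0.1, 0.0],
    "contract-end notification": [0.8, 0.2, 0.1],
    "pump maintenance": [0.0, 0.2, 0.9],
}

query_vec = toy_embeddings["notice period"]
# Brute-force ranking; an ANN index such as HNSW approximates this at scale.
ranked = sorted(
    toy_embeddings,
    key=lambda name: cosine_similarity(query_vec, toy_embeddings[name]),
    reverse=True,
)
```

With these vectors, "contract-end notification" ranks directly behind the query phrase itself although the two share no words, while "pump maintenance" ends up last.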
When hybrid beats classic vector RAG
Hybrid search is not mandatory for every corpus, but is clearly superior in the following cases:
- Exact codes and IDs: product, article and order numbers, SKUs, error and status codes, ticket references.
- Proper nouns and technical terms: people, companies, product names, legal paragraphs, rare technical terms that are underrepresented in the embedding training corpus.
- Code and log search: function names, variables, configuration keys — here exact matching is often more important than meaning.
- Mixed queries: natural language plus embedded exact tokens ("What warranty applies to model TS-999?").
For homogeneous, narrative texts without hard identifiers, pure dense retrieval may suffice. But as soon as structured terms, numbers or regulated terminology are involved — and in everyday B2B that is the rule — hybrid wins.
Fusion: RRF, weighted scores and the normalisation problem
The tricky part is not the parallel searching, but the merging. BM25 and vector scores live in incomparable value ranges: BM25 scores are unbounded and corpus-dependent, while cosine similarity lies between minus one and one. Simply adding them together lets the scale with the larger values drown out the other.
Reciprocal Rank Fusion (RRF) solves this elegantly by ignoring raw scores and using only rank positions. Each document receives a score per result list according to the formula:
```
RRF_score(d) = sum over all lists of 1 / (k + rank(d))
```
Here rank(d) is the position of the document in the respective list and k is a small constant (often 60) that dampens the influence of very high ranks. A document that ranks high in both lists accumulates the highest overall score. Because RRF is scale-free, no normalisation is needed, which makes it robust and almost parameter-free.
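The formula translates almost directly into code. The sketch below assumes each retriever returns an ordered list of document IDs, with rank 1 for the first entry; the function name and the document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists of the two retrievers.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here `doc_b` (ranks 2 and 1) ends up ahead of `doc_a` (ranks 1 and 3), and both outrank the documents that appear in only one list. No raw score of either retriever is ever read.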
Weighted score fusion is the alternative when you deliberately want to give one modality more weight (for example 0.7 * dense + 0.3 * sparse). However, it strictly requires normalisation: both score lists must first be brought to a common range (e.g. min-max to zero to one) before they are linearly combined. This is more powerful, but more sensitive to tuning and corpus drift.
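A small sketch shows both the problem with raw addition and the normalised alternative. The scores, the document IDs and the choice to treat a document missing from one list as 0.0 after normalisation are assumptions made for illustration.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Degenerate case: all scores equal, so treat every hit the same.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(dense: dict[str, float], sparse: dict[str, float],
                    w_dense: float = 0.7, w_sparse: float = 0.3) -> list[str]:
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    fused = {
        # A document missing from one list contributes 0.0 there (assumption).
        doc: w_dense * dense_n.get(doc, 0.0) + w_sparse * sparse_n.get(doc, 0.0)
        for doc in set(dense_n) | set(sparse_n)
    }
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical raw scores: cosine similarities versus unbounded BM25 scores.
dense_scores = {"doc_a": 0.82, "doc_b": 0.80, "doc_c": 0.55}
sparse_scores = {"doc_b": 4.1, "doc_c": 12.7, "doc_d": 9.3}

# Naive addition: the BM25 scale decides the order almost alone.
naive = {
    doc: dense_scores.get(doc, 0.0) + sparse_scores.get(doc, 0.0)
    for doc in set(dense_scores) | set(sparse_scores)
}
naive_order = sorted(naive, key=naive.get, reverse=True)

weighted_order = weighted_fusion(dense_scores, sparse_scores)
```

With raw addition the order simply follows the BM25 magnitudes. After min-max normalisation and a 0.7/0.3 weighting, the dense ranking leads, as intended. Note that min-max depends on the extremes of each result list, which is one reason this method is more sensitive to tuning and drift.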
| Aspect | Reciprocal Rank Fusion (RRF) | Weighted score fusion |
|---|---|---|
| Input quantity | Rank positions only | Raw scores of both retrievers |
| Normalisation required | No | Yes (e.g. min-max) |
| Parameters | One constant k (often 60) | Weights per retriever + normalisation |
| Robustness | High, corpus-independent | Medium, tuning- and drift-prone |
| Controllability | Low (equal-weight fusion) | High (modality can be weighted deliberately) |
| Recommendation | Default for most setups | When one modality should demonstrably dominate |
Pipeline architecture
A hybrid retrieval pipeline runs schematically as follows:
```
Query ──┬─► Dense Encoder ──► ANN index (HNSW) ──► top_n_dense
        │
        └─► Tokenizer ──────► BM25 index ────────► top_n_sparse
                                                        │
                      Rank fusion (RRF / weighted) ◄────┘
                      → unified top_k
```
In practice, re-ranking follows. Hybrid retrieval typically delivers 50 to 100 high-recall candidates (top_k = 50–100). A cross-encoder reranker (such as Cohere Rerank, BGE-Reranker or the DACH model Jina Reranker v2) then scores each query-document pair jointly and reduces the set to the five to ten most precise hits that go to the LLM. Cross-encoders are more accurate but slower, which is why they run only after the cheap hybrid recall stage.
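The two stages can be sketched as one function that takes the retrievers and the reranker as plain callables. Everything named here is hypothetical: the `Retriever` and `Reranker` signatures, the stub functions and their fixed results stand in for real index lookups and a real cross-encoder, and the retrievers are called one after the other for simplicity rather than in parallel.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]        # (query, top_n) -> ranked doc IDs
Reranker = Callable[[str, list[str]], list[str]]   # (query, candidates) -> reordered IDs


def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def hybrid_search(query: str, dense: Retriever, sparse: Retriever, rerank: Reranker,
                  top_n: int = 100, top_k: int = 50, final_k: int = 5) -> list[str]:
    # Stage 1: cheap, high-recall candidate generation via both retrievers.
    candidates = rrf([dense(query, top_n), sparse(query, top_n)])[:top_k]
    # Stage 2: expensive, precise reranking of the fused candidates only.
    return rerank(query, candidates)[:final_k]


# Stubs with fixed results (assumption), in place of real indexes and models.
def fake_dense(query: str, top_n: int) -> list[str]:
    return ["d1", "d2", "d3"][:top_n]


def fake_sparse(query: str, top_n: int) -> list[str]:
    return ["d3", "d4"][:top_n]


def fake_rerank(query: str, candidates: list[str]) -> list[str]:
    # Made-up cross-encoder relevance scores per document.
    relevance = {"d4": 0.9, "d3": 0.8, "d1": 0.2, "d2": 0.1}
    return sorted(candidates, key=lambda d: relevance.get(d, 0.0), reverse=True)


results = hybrid_search("What warranty applies to model TS-999?",
                        fake_dense, fake_sparse, fake_rerank, final_k=2)
```

In this toy run the reranker promotes `d4`, which only the sparse retriever found, to the top. That is the division of labour described above: fusion secures recall, the reranker decides precision.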
Practical example with figures
The most relevant solid data point comes from Anthropic's Contextual Retrieval (as of 09/2024). The approach extends exactly the hybrid idea: each chunk is prefixed with a short, LLM-generated context header before embedding and before BM25 indexing (Contextual Embeddings plus Contextual BM25).
- Contextual Embeddings alone: top-20 retrieval failure rate from 5.7 to 3.7 percent — minus 35 percent.
- Contextual Embeddings plus Contextual BM25 (i.e. hybrid): from 5.7 to 2.9 percent — minus 49 percent.
- Additionally reranking: to 1.9 percent — minus 67 percent.
Read differently: simply adding the BM25 component to contextualised dense retrieval lowers the failure rate from 3.7 to 2.9 percent, a noticeable recall gain that would not be achievable without lexical search. Indexing costs around 1.02 US dollars per one million document tokens with prompt caching (as of 09/2024). These figures are a vendor evaluation and should be read accordingly.
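The percentages are relative reductions of the failure rate, which a few lines of arithmetic confirm. The function and variable names are illustrative; the rates are the ones quoted above.

```python
def relative_reduction(before: float, after: float) -> float:
    # Relative drop in percent, e.g. 5.7 -> 2.9 is roughly a 49 percent drop.
    return (before - after) / before * 100


baseline = 5.7  # top-20 retrieval failure rate in percent, without contextualisation

reductions = {
    "contextual embeddings": round(relative_reduction(baseline, 3.7)),         # 35
    "plus contextual BM25": round(relative_reduction(baseline, 2.9)),          # 49
    "plus reranking": round(relative_reduction(baseline, 1.9)),                # 67
    "BM25 on top of contextual dense": round(relative_reduction(3.7, 2.9)),    # 22
}

for step, percent in reductions.items():
    print(f"{step}: minus {percent} percent")
```

The last entry isolates the lexical component: measured against contextualised dense retrieval alone, adding BM25 removes roughly a fifth of the remaining failures.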
A second, independent piece of evidence: in the Agri-Query benchmark (arXiv:2508.18093, 2025), hybrid retrieval in combination with Gemini 2.5 Flash achieves over 85 percent accuracy across several languages and clearly beats naive long-context prompts. Hybrid search is therefore not just a recall trick, but part of the answer to the ongoing "RAG versus long-context" debate: for realistic, multi-part queries, hybrid RAG remains the more cost-rational and more accurate architecture.
Tooling: hybrid search in the vector DB
As of 2026, all production-ready vector databases support hybrid search natively or via extension. Qdrant (Berlin) relies on sparse vectors and BM42, Weaviate (Amsterdam) combines BM25 with dense, Pinecone offers sparse-dense, and Vespa as well as the established search stacks Elastic and OpenSearch combine classic BM25 with kNN. pgvector achieves hybrid via additional BM25 extensions. For DACH projects with a sovereignty requirement, Qdrant and Weaviate are the obvious, EU-hosted options with granular metadata filtering — relevant for tenant separation in line with GDPR.
For agencies and B2B
Hybrid search is the lever that turns a RAG prototype into a robust production solution — especially in industries with product catalogues, standards, file references, tariffs or technical manuals, where exact terms are business-critical. Any agency building RAG-based assistants or knowledge portals for DACH clients should plan hybrid retrieval with RRF plus re-ranking as a default, not as an afterthought. Blck Alpaca designs and implements such retrieval pipelines in an EU-sovereign way, from tool selection through fusion strategy to RAGAS-based evaluation. Talk to us if you want your RAG system to find the right documents reliably.