
Hybrid Search in RAG: Combining BM25 and Vector Similarity Correctly

Blck Alpaca·9 June 2026

Definition

Hybrid search in RAG combines lexical search (BM25/keyword matching) with dense vector similarity. Both retrievers run in parallel, and their result lists are merged into one outcome via rank fusion (usually Reciprocal Rank Fusion). This lets the system find both semantically similar passages and exact terms, proper nouns and codes that pure embeddings miss.

Key Takeaways

✓ Hybrid search unites two complementary retrievers: BM25 captures exact terms, proper nouns, product IDs and codes, while dense vector search captures synonyms and meaning. Pure vector RAG tends to miss exact codes such as the article number TS-999.
✓ Reciprocal Rank Fusion (RRF) is the most robust fusion method because it uses only rank positions rather than incomparable raw scores, and therefore requires no score normalisation.
✓ Anthropic's Contextual Retrieval combines Contextual Embeddings with Contextual BM25 and reduces the top-20 retrieval failure rate by 49 percent (from 5.7 to 2.9 percent); with reranking added, the reduction reaches 67 percent (as of 09/2024).
✓ Hybrid retrieval plus Gemini 2.5 Flash achieves over 85 percent accuracy in the Agri-Query benchmark and clearly beats naive long-context prompts (arXiv:2508.18093, 2025).
✓ Hybrid search is supported by practically all production-ready vector databases, including the DACH/EU-sovereign options Qdrant (Berlin) and Weaviate (Amsterdam).

Hybrid retrieval has been a fixed component of advanced RAG since 2023 and belongs in any production pipeline that searches more than generic running text.

What is combined: sparse retrieval (BM25, exact tokens) plus dense retrieval (vector embeddings, meaning) into a single result set.
How it is combined: rank fusion, usually Reciprocal Rank Fusion (RRF), more rarely weighted score fusion with prior normalisation.
Why it pays off: hybrid beats pure vector RAG wherever exact terms matter, and measurably improves recall.

The two pillars: lexical versus semantic

BM25 (sparse / lexical). BM25 is the proven standard of full-text search. The algorithm scores documents by exact term match, weights rare terms more heavily (inverse document frequency) and normalises for document length. BM25 is deterministic, fast, language-agnostic in its mechanism and needs no model inference. Its strength is at the same time its limit: it only finds what occurs literally (or after tokenisation and stemming). Synonyms, rephrasings and paraphrases escape it.
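To make the scoring concrete, here is a minimal sketch of Okapi BM25 over pre-tokenised documents. The parameter values k1 = 1.5 and b = 0.75, the Lucene-style IDF variant and the toy documents are assumptions chosen for illustration, not taken from any specific engine configuration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenised document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # BM25 only rewards literal matches
            # Rare terms receive a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: b controls how strongly long documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus, invented for illustration
docs = [
    "warranty terms for model ts-999".split(),
    "notice period for the contract".split(),
    "general warranty information".split(),
]
scores = bm25_scores("ts-999 warranty".split(), docs)
```

The first document matches both query terms and scores highest, the third matches only the common term "warranty", and the second shares no token with the query and scores zero.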

Dense vector similarity (semantic). Dense retrieval embeds query and document in the same vector space and measures proximity, usually via cosine similarity. An approximate nearest-neighbour index (HNSW, introduced by Malkov and Yashunin, in almost all vector databases) makes the search scalable. Dense retrieval captures meaning: "notice period" and "contract-end notification" land close to one another, even without shared words. The downside: embeddings smooth out exact character strings. An article number, an error code or a rarely seen proper noun becomes blurred in semantic space.
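As a small illustration of the proximity measure, the following sketch computes cosine similarity by hand. The three-dimensional vectors are invented stand-ins; real embedding models produce vectors with hundreds or thousands of dimensions, and no particular model is assumed here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, invented for illustration
notice_period = [0.9, 0.1, 0.2]
contract_end_notification = [0.8, 0.2, 0.3]
article_number = [0.1, 0.9, 0.1]

sim_related = cosine_similarity(notice_period, contract_end_notification)
sim_unrelated = cosine_similarity(notice_period, article_number)
```

With these toy values, the two contract-related vectors point in nearly the same direction, while the third vector is far from both.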

This is precisely where the need for combination arises. Pure semantics without BM25 is a documented anti-pattern: a query for an exact code such as "TS-999" is frequently missed by the vector retriever, whereas BM25 delivers it immediately. The two methods fail on different queries, and therefore cover for one another.

When hybrid beats classic vector RAG

Hybrid search is not mandatory for every corpus, but is clearly superior in the following cases:

Exact codes and IDs: product, article and order numbers, SKUs, error and status codes, ticket references.
Proper nouns and technical terms: people, companies, product names, legal paragraphs, rare technical terms that are underrepresented in the embedding training corpus.
Code and log search: function names, variables and configuration keys, where exact matching is often more important than meaning.
Mixed queries: natural language plus embedded exact tokens ("What warranty applies to model TS-999?").

For homogeneous, narrative texts without hard identifiers, pure dense retrieval may suffice. But as soon as structured terms, numbers or regulated terminology are involved, and in everyday B2B that is the rule, hybrid wins.

Fusion: RRF, weighted scores and the normalisation problem

The tricky part is not the parallel searching, but the merging. BM25 and vector scores live in incomparable value ranges: BM25 scores are unbounded and corpus-dependent, while cosine similarity lies between minus one and one. Simply adding them together lets the scale with the larger values drown out the other.

Reciprocal Rank Fusion (RRF) solves this elegantly by ignoring raw scores and using only rank positions. Each document receives a score per result list according to the formula:

```
RRF_score(d) = sum over all lists of 1 / (k + rank(d))
```

Here rank(d) is the position of the document in the respective list, counted from 1, and k is a constant (often 60) that dampens the influence of the top ranks. A document that appears in only one list simply receives no contribution from the other. A document that ranks high in both lists accumulates the highest overall score. Because RRF is scale-free, no score normalisation is needed, which makes it robust and almost parameter-free.
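The formula translates into a few lines of code. The following is a minimal sketch, assuming each retriever returns an ordered list of document IDs; the function name and the example IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists of document IDs using only their rank positions."""
    scores = defaultdict(float)
    for results in result_lists:
        # Ranks are 1-based; a document missing from a list adds nothing for it.
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical result lists from the two retrievers
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

In this example doc_a (ranks 1 and 2) ends up ahead of doc_c (ranks 3 and 1), and both outrank the documents that appear in only one list.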

Weighted score fusion is the alternative when you deliberately want to give one modality more weight (for example 0.7 * dense + 0.3 * sparse). However, it strictly requires normalisation: both score lists must first be brought to a common range (e.g. min-max to zero to one) before they are linearly combined. This is more powerful, but more sensitive to tuning and corpus drift.
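A minimal sketch of this variant, assuming each retriever returns a mapping from document ID to raw score. The weights 0.7 and 0.3 follow the example in the text; the scores, the function names and the choice to treat a document missing from one list as 0 after normalisation are illustrative assumptions.

```python
def min_max_normalise(scores):
    """Rescale raw scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: if all scores are equal, map every document to 1.0
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_fusion(dense, sparse, w_dense=0.7, w_sparse=0.3):
    """Linearly combine normalised dense and sparse scores."""
    dense_n = min_max_normalise(dense)
    sparse_n = min_max_normalise(sparse)
    fused = {
        # A document missing from one list contributes 0 for that retriever.
        doc_id: w_dense * dense_n.get(doc_id, 0.0) + w_sparse * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Invented raw scores: cosine similarities versus unbounded BM25 values
dense_scores = {"doc_a": 0.82, "doc_b": 0.79, "doc_c": 0.40}
sparse_scores = {"doc_c": 14.2, "doc_a": 6.1, "doc_d": 2.3}

ranking = weighted_fusion(dense_scores, sparse_scores)
```

Note that min-max depends on the extremes of each result list, so a single outlier score shifts every other document. This is one reason the approach is more sensitive to tuning than RRF.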

Aspect	Reciprocal Rank Fusion (RRF)	Weighted score fusion
Input quantity	Rank positions only	Raw scores of both retrievers
Normalisation required	No	Yes (e.g. min-max)
Parameters	One constant k (often 60)	Weights per retriever + normalisation
Robustness	High, corpus-independent	Medium, tuning- and drift-prone
Controllability	Low (equal-weight fusion)	High (modality can be weighted deliberately)
Recommendation	Default for most setups	When one modality should demonstrably dominate

Pipeline architecture

A hybrid retrieval pipeline runs schematically as follows:

```
Query ──┬─► Dense Encoder ──► ANN index (HNSW) ──► top_n_dense ───┐
        │                                                         │
        └─► Tokenizer ──────► BM25 index ────────► top_n_sparse ──┤
                          Rank fusion (RRF / weighted) ◄──────────┘
                                       ▼
                                 unified top_k
```

In practice, re-ranking follows. Hybrid retrieval typically delivers 50 to 100 candidates with high recall (top_k = 50–100). A cross-encoder reranker (such as Cohere Rerank, BGE-Reranker or the DACH model Jina Reranker v2) then scores each query-document pair jointly and narrows the candidates down to the five to ten most relevant hits that go to the LLM. Cross-encoders are more accurate but slower, which is why they run only after the cheap hybrid recall.
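The re-ranking stage can be sketched independently of any specific model. In the following minimal example, score_pair is a hypothetical callable that stands in for a cross-encoder; the toy token-overlap scorer exists only to make the sketch runnable and is not how a real reranker works.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Score every (query, document) pair and keep the top_n documents."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in scored[:top_n]]

def toy_overlap_score(query, doc):
    """Stand-in for a cross-encoder: share of query tokens found in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(doc.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens)

# Hypothetical candidates as they might come out of hybrid retrieval
candidates = [
    "General warranty information for all products",
    "Warranty terms that apply to model TS-999",
    "Shipping times for model TS-100",
]
top = rerank("warranty model TS-999", candidates, toy_overlap_score, top_n=2)
```

In a real pipeline, score_pair would wrap a call to the chosen reranker model, and candidates would be the 50 to 100 fused hits.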

Practical example with figures

The most relevant solid data point comes from Anthropic's Contextual Retrieval (as of 09/2024). The approach extends exactly the hybrid idea: each chunk is prefixed with a short, LLM-generated context header before embedding and before BM25 indexing (Contextual Embeddings plus Contextual BM25).

Contextual Embeddings alone: top-20 retrieval failure rate from 5.7 to 3.7 percent, minus 35 percent.
Contextual Embeddings plus Contextual BM25 (i.e. hybrid): from 5.7 to 2.9 percent, minus 49 percent.
With reranking added: from 5.7 to 1.9 percent, minus 67 percent.

Read differently: simply adding the BM25 component to contextualised dense retrieval lowers the failure rate from 3.7 to 2.9 percent, a noticeable recall gain that would not be achievable without lexical search. Indexing costs around 1.02 US dollars per one million document tokens with prompt caching (as of 09/2024). These figures are a vendor evaluation and should be read accordingly.
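The percentages are relative reductions against the baseline failure rate, which a short calculation confirms. The helper name is hypothetical; the failure rates are the ones quoted above.

```python
def relative_reduction(baseline, new):
    """Relative reduction of a failure rate, as a fraction of the baseline."""
    return (baseline - new) / baseline

baseline = 5.7  # top-20 retrieval failure rate in percent, without contextualisation
steps = [
    ("Contextual Embeddings", 3.7),
    ("+ Contextual BM25", 2.9),
    ("+ reranking", 1.9),
]
for label, rate in steps:
    print(f"{label}: {relative_reduction(baseline, rate):.0%}")

# Isolated contribution of the BM25 component on top of contextual embeddings
bm25_gain = relative_reduction(3.7, 2.9)
print(f"BM25 on top of contextual embeddings: {bm25_gain:.0%}")
```

Measured against the 3.7 percent of contextualised dense retrieval alone, the lexical component removes roughly a fifth of the remaining failures.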

A second, independent piece of evidence: in the Agri-Query benchmark (arXiv:2508.18093, 2025), hybrid retrieval in combination with Gemini 2.5 Flash achieves over 85 percent accuracy across several languages and clearly beats naive long-context prompts. Hybrid search is therefore not just a recall trick, but part of the answer to the ongoing "RAG versus long-context" debate: for realistic, multi-part queries, hybrid RAG remains the more cost-rational and more accurate architecture.

Tooling: hybrid search in the vector DB

As of 2026, all production-ready vector databases support hybrid search natively or via extension. Qdrant (Berlin) relies on sparse vectors and BM42, Weaviate (Amsterdam) combines BM25 with dense, Pinecone offers sparse-dense, and Vespa as well as the established search stacks Elastic and OpenSearch combine classic BM25 with kNN. pgvector achieves hybrid via additional BM25 extensions. For DACH projects with a sovereignty requirement, Qdrant and Weaviate are the obvious, EU-hosted options with granular metadata filtering, relevant for tenant separation in line with GDPR.

For agencies and B2B

Hybrid search is the lever that turns a RAG prototype into a robust production solution, especially in industries with product catalogues, standards, file references, tariffs or technical manuals, where exact terms are business-critical. Any agency building RAG-based assistants or knowledge portals for DACH clients should plan hybrid retrieval with RRF plus re-ranking as a default, not as an afterthought. Blck Alpaca designs and implements such retrieval pipelines in an EU-sovereign way, from tool selection through fusion strategy to RAGAS-based evaluation. Talk to us if you want your RAG system to find the right documents reliably.

FAQ

When does hybrid search beat pure vector RAG?

Whenever exact character strings matter: product and article numbers, error codes, legal paragraphs, proper nouns, tickets or function names in code. Dense embeddings smooth such tokens semantically and regularly miss them. A query for an article number like TS-999 often returns nothing relevant with pure semantics, whereas BM25 finds the exact hit immediately. The lexical component also helps with rare technical terms that are underrepresented in the embedding training corpus.

What is Reciprocal Rank Fusion (RRF)?

RRF is a fusion method that merges multiple result lists into one combined ranking. Each document receives, per list, a score of 1 divided by (k plus rank position), and the scores are summed. Crucially, RRF uses only the rank position, not the raw scores of the retrievers. This eliminates the difficult normalisation between incomparable BM25 and cosine values, and the method is considered robust and low on parameters.

Why can't you simply add BM25 and vector scores together?

Because they lie in completely different value ranges. BM25 scores are unbounded and corpus-dependent, while cosine similarity lies between minus one and one. A direct addition would let one scale dominate. Anyone wanting weighted score fusion must first normalise both lists, for example via min-max to zero to one. RRF avoids this problem by working exclusively with rank positions.

Do I still need re-ranking after hybrid search?

For high-quality answers, yes. Hybrid search typically delivers 50 to 100 candidates with good recall. A cross-encoder reranker, which scores query and document jointly, then selects the best five to ten of these and forwards only those to the LLM. According to Anthropic, the combination of Contextual Retrieval and reranking reduces retrieval errors by up to 67 percent (as of 09/2024).

Which vector databases support hybrid search natively?

As of 2026, practically all production-ready systems support hybrid search: Qdrant (BM42/sparse vectors, Berlin), Weaviate (BM25 plus dense, Amsterdam), Pinecone (sparse-dense), Vespa, Elastic and OpenSearch (classic BM25 plus kNN), MongoDB Atlas, Milvus and Redis. pgvector can implement hybrid via additional BM25 extensions. For DACH/EU-sovereign setups, Qdrant and Weaviate are the obvious options.
