
Reranking in RAG: Cross-Encoder vs. Bi-Encoder

Blck Alpaca

Definition

Reranking is the second retrieval stage in a RAG pipeline: a cross-encoder re-scores the top candidates found by the fast bi-encoder and sorts them by genuine query relevance. According to Anthropic, reranking significantly lowers the retrieval failure rate compared with pure vector retrieval, at the cost of higher latency.

Key Takeaways

  • Bi-encoders encode query and document separately and are fast enough for the initial retrieval across millions of chunks; cross-encoders process query and document jointly and are more accurate, but too expensive for the first pass.
  • Reranking is a post-retrieval stage: hybrid retrieval delivers the top 50 to top 100 candidates, and the reranker reduces them to the most relevant top 5 to top 10 for the LLM prompt.
  • Anthropic's Contextual Retrieval reduces the top-20 retrieval failure rate by 49 per cent (from 5.7 to 2.9 per cent); combined with reranking, the reduction reaches 67 per cent (from 5.7 to 1.9 per cent).
  • Practically relevant models (as of 2026): Cohere Rerank v3/v3.5 (API, multilingual including German), BGE-Reranker and Jina Reranker v2 (open source), Voyage rerank-2, and LLM-as-judge rerankers.
  • The trade-off is latency versus precision: hybrid retrieval plus reranking typically sits at around 100 to 800 milliseconds of total latency.
  • Passing too large a top-k to the LLM causes the 'lost-in-the-middle' effect and raises token costs; reranking plus a top-k of 5 to 10 is the standard countermeasure.

In production systems, reranking is not an optional nice-to-have but the most reliable lever between "semantically similar" and "actually relevant".

  • What it does: Reranking re-orders the candidates found by the first retrieval before they reach the LLM prompt.
  • Why two stages: The fast bi-encoder provides recall across millions of chunks, while the precise cross-encoder provides precision on the final top-k.
  • Concrete effect: Anthropic reports a reduction of the top-20 retrieval failure rate by up to 67 per cent when Contextual Retrieval is combined with reranking.

Why single-stage retrieval is not enough

Classic vector retrieval works with a bi-encoder. It encodes each document chunk into a vector in advance and stores it in an ANN index (usually HNSW). At query time, the query is projected into the same vector space, and relevance is estimated via cosine similarity. This is extremely fast and scales to millions of entries, but it is an approximation: query and document never "see" each other jointly, and the model only compares two independently produced condensations.
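The first stage can be sketched in a few lines. This is a minimal illustration, assuming hand-written three-dimensional toy vectors in place of real embeddings and a brute-force scan in place of an ANN index; the names `retrieve_top_k` and `DOC_VECTORS` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, doc_vecs, k):
    """Score every stored document vector against the query, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors computed "in advance"; a real system would get them from an
# embedding model and store them in an ANN index instead of a dict.
DOC_VECTORS = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.8, 0.6, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
    "chunk-d": [0.0, 0.0, 1.0],
}

top_hits = retrieve_top_k([0.9, 0.1, 0.0], DOC_VECTORS, k=2)
```

Note that the document vectors never change with the query. That is what makes the stage fast, and also why it can only approximate relevance.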

The consequence is a typical anti-pattern: the top-k hits are semantically similar, but not necessarily relevant to the specific question. This is exactly where reranking comes in. It is a post-retrieval stage that applies a more precise – but slower – mechanism to a small set of candidates, rather than to the entire corpus.

Bi-encoder vs. cross-encoder: the central difference

The architectural core of reranking is the switch from the bi-encoder to the cross-encoder.

A bi-encoder processes query and document separately. Each is encoded into its own vector, and relevance only emerges downstream from the distance between these vectors. Because document embeddings can be computed and indexed in advance, the bi-encoder is the only practical choice for the initial retrieval across large corpora.

A cross-encoder processes query and document jointly in a single forward pass. The model sees both texts together and can account for word-by-word interactions, negations, and subtle differences in meaning. The result is a much more precise relevance assessment – however, a computation has to be run for every query-document pair. Across millions of chunks this would be prohibitively expensive; across 50 to 100 pre-filtered candidates it is acceptable.
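The cost structure of this second stage becomes visible in a short sketch. It assumes a pluggable `score_pair` function; the `toy_pair_scorer` below is only a token-overlap placeholder and not a cross-encoder. In a real pipeline a trained cross-encoder model would take its place.

```python
from typing import Callable, List, Tuple

def rerank(query: str,
           candidates: List[str],
           score_pair: Callable[[str, str], float],
           top_k: int) -> List[Tuple[str, float]]:
    """Score each (query, candidate) pair jointly and keep the best top_k."""
    # One scoring call per pair: cheap for 50 to 100 candidates,
    # prohibitive for millions of chunks.
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

def toy_pair_scorer(query: str, doc: str) -> float:
    """Placeholder scorer: share of query tokens that occur in the document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens)

candidates = [
    "Reset your password in the account settings",
    "Password policy for administrators",
    "How to reset a forgotten password by email",
]
best = rerank("reset forgotten password", candidates, toy_pair_scorer, top_k=2)
```

The important design point is the signature: the scorer receives query and document together, so nothing about the document can be precomputed.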

| Property | Bi-encoder | Cross-encoder |
| --- | --- | --- |
| Input | Query and document separately | Query and document jointly |
| Pre-indexing | Yes (document vectors storable) | No (pairwise at runtime) |
| Speed | Very fast | Slow (forward pass per pair) |
| Precision | Approximation | High |
| Role in the pipeline | Initial retrieval (recall) | Reranking (precision) |
| Scales to | Millions of chunks | Dozens to a few hundred candidates |

In between sits the late-interaction approach (ColBERT): it stores token-level embeddings per document and computes relevance via a MaxSim operation between query and document tokens. This offers more accuracy than a classic bi-encoder with better scalability than a full cross-encoder, but costs considerably more memory for the token vectors.
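The MaxSim operation itself is compact. The following sketch assumes tiny two-dimensional toy token vectors and plain dot products as the similarity; the function name `maxsim_score` is chosen for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    """For each query token take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Two query tokens, and two documents with token-level vectors.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
doc_b = [[0.7, 0.7], [0.6, 0.8]]

score_a = maxsim_score(query_tokens, doc_a)  # 1.0 + 1.0 = 2.0
score_b = maxsim_score(query_tokens, doc_b)  # 0.7 + 0.8 = 1.5
```

The document token vectors can be stored in advance, which explains both the better scalability and the higher memory footprint: every document keeps one vector per token rather than one vector in total.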

Position in the pipeline and top-k strategy

Reranking is a firmly defined stage in the query path. The canonical arrangement looks like this:

```
[Hybrid Retrieval, top_k = 50-100]
-> [Cross-Encoder Reranker (Cohere Rerank / BGE / Jina)]
-> [top_k = 5-10]
-> [Prompt + source citation]
-> [LLM]
```

The decisive point: the reranker comes only after the cost-effective bi-encoder recall. The initial hybrid retrieval (dense plus BM25, fused via Reciprocal Rank Fusion) deliberately delivers a generous candidate pool of 50 to 100 entries in order to maximise recall. The cross-encoder then reduces this pool to the 5 to 10 truly relevant chunks that reach the prompt.
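Reciprocal Rank Fusion can be sketched as follows. The constant `k = 60` is the commonly used default and an assumption here, as are the toy chunk ids; each input list is a ranking from one retriever, best hit first.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked lists: each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return fused if top_n is None else fused[:top_n]

dense_ranking = ["c1", "c2", "c3", "c4"]   # from the vector index
bm25_ranking = ["c3", "c1", "c5"]          # from keyword search

candidate_pool = reciprocal_rank_fusion([dense_ranking, bm25_ranking], top_n=3)
```

Chunks that appear in both rankings (here `c1` and `c3`) rise to the top of the pool, while chunks found by only one retriever still remain candidates for the reranker.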

This narrowing is not just a matter of cost. Too large a top-k passed to the LLM leads to the "lost-in-the-middle" effect: relevant information in the middle of a long context is processed less well, and token costs rise. Reranking plus a tight top-k of 5 to 10 is the standard countermeasure – with the additional recommendation to place the most important chunks at the front of the prompt.
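How the narrowed result reaches the prompt can be sketched like this. The prompt wording, the `Chunk` tuple layout, and the sample scores are illustrative assumptions, not a prescribed format.

```python
from typing import List, Tuple

Chunk = Tuple[str, str, float]  # (source, text, reranker score)

def build_prompt(question: str, chunks: List[Chunk], top_k: int = 5) -> str:
    """Keep the top_k highest-scoring chunks and list the best one first."""
    best = sorted(chunks, key=lambda chunk: chunk[2], reverse=True)[:top_k]
    context_lines = [
        f"[{i}] ({source}) {text}"
        for i, (source, text, _score) in enumerate(best, start=1)
    ]
    return (
        "Answer using only the numbered sources and cite them by number.\n\n"
        + "\n".join(context_lines)
        + f"\n\nQuestion: {question}"
    )

reranked = [
    ("faq.md", "Passwords expire after 90 days.", 0.41),
    ("policy.md", "Reset links are valid for 24 hours.", 0.93),
    ("intro.md", "Welcome to the portal.", 0.05),
]
prompt = build_prompt("How long is a reset link valid?", reranked, top_k=2)
```

With `top_k=2`, the low-scoring `intro.md` chunk never reaches the LLM, and the strongest chunk sits at position 1 of the context.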

Reranking models at a glance (as of 2026)

| Model | Type | Hosting / Licence | Languages | Note |
| --- | --- | --- | --- | --- |
| Cohere Rerank v3 / v3.5 | Cross-encoder | API | Multilingual incl. German | API standard |
| BGE-Reranker (large / v2-m3) | Cross-encoder | Open source, Apache 2.0 | Multilingual | On-prem capable |
| Jina Reranker v2 | Cross-encoder | Open source | Multilingual | DACH provider (Berlin) |
| Voyage rerank-2 | Cross-encoder | API | Multilingual | API alternative |
| LLM-as-judge (Haystack LLMRanker) | LLM prompt | Own GPT/Claude prompt | Any | Flexible, but more expensive |

For DACH scenarios with sovereignty requirements, the open-source options BGE-Reranker and the Berlin-developed Jina Reranker v2 are relevant, as they can be run on-prem or in EU regions. Anyone who accepts an API and needs multilingual quality including German turns in practice to Cohere Rerank. The LLM-as-judge approach (such as the LLMRanker in Haystack) is the most flexible, but the most expensive per request, and should only be used where reranking quality is critical and the latency budget is generous.

Latency/quality trade-off

Reranking is the most expensive single stage in the online path, because the cross-encoder runs a full forward pass per candidate. In production systems, the total latency of hybrid retrieval plus reranking typically lies at around 100 to 800 milliseconds. This range can be controlled directly via three factors:

  • Size of the input pool: Reranking 100 candidates costs more than 30. Depending on the corpus, a smaller pool is often sufficient.
  • Model choice: API rerankers such as Cohere are throughput-optimised; an LLM-as-judge is considerably slower.
  • Hardware: Open-source rerankers benefit strongly from GPU inference; on CPU the latency rises considerably.

The trade-off is therefore not "reranking yes or no", but "how many candidates at which latency budget". For most enterprise use cases, the precision gain is worth the additional latency.
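The question "how many candidates at which latency budget" can be turned into simple arithmetic. The sketch below assumes a reranker that processes candidates in sequential batches of fixed cost; all numbers are illustrative placeholders that would have to be measured on the actual model and hardware.

```python
def max_rerank_candidates(budget_ms: float,
                          retrieval_ms: float,
                          batch_ms: float,
                          batch_size: int) -> int:
    """Largest candidate pool that fits the budget, in whole batches."""
    remaining_ms = budget_ms - retrieval_ms
    if remaining_ms <= 0:
        return 0  # retrieval alone already exhausts the budget
    full_batches = int(remaining_ms // batch_ms)
    return full_batches * batch_size

# Hypothetical measurements: 60 ms hybrid retrieval, 40 ms per batch of 16 pairs.
pool_size = max_rerank_candidates(budget_ms=300, retrieval_ms=60,
                                  batch_ms=40, batch_size=16)
```

Under these assumed figures, a 300 ms budget leaves room for 6 batches, that is 96 candidates. Halving the budget or moving to slower hardware shrinks the pool accordingly.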

Numerical example: precision increase

In the publication "Introducing Contextual Retrieval" (19 September 2024), Anthropic reported concrete figures from its own benchmarks (Code, Fiction, ArXiv papers, Science). These figures should be read as a vendor evaluation with self-interest, but the methodology is documented:

  • Baseline (pure retrieval): top-20 retrieval failure rate of 5.7 per cent.
  • Contextual Embeddings alone: reduction by 35 per cent to 3.7 per cent.
  • Contextual Retrieval (embeddings plus BM25): reduction by 49 per cent to 2.9 per cent.
  • Contextual Retrieval plus reranking: reduction by 67 per cent to 1.9 per cent.

In concrete terms: of the original 57 errors per 1,000 queries, only 19 remain with the full pipeline. Reranking alone, as the step from 2.9 to 1.9 per cent, eliminates roughly another third of the remaining errors in this setup. For a production RAG system, every avoided retrieval error means one fewer chunk that tempts the LLM into a wrong or unsubstantiated answer. This is exactly why RAG practice lists "no reranking" as a clear anti-pattern: without it, the top-k results are semantically similar but not necessarily relevant.
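The percentages can be re-derived from the failure rates quoted above. The helper name `relative_reduction` is chosen for illustration; the input values are the published rates.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction in per cent when a rate drops from before to after."""
    return (before - after) / before * 100.0

BASELINE = 5.7  # top-20 retrieval failure rate in per cent, pure retrieval

failure_rates = {
    "contextual embeddings": 3.7,
    "contextual retrieval (embeddings + BM25)": 2.9,
    "contextual retrieval + reranking": 1.9,
}

# Reduction of each stage relative to the baseline: about 35.1, 49.1, 66.7.
reductions = {name: relative_reduction(BASELINE, rate)
              for name, rate in failure_rates.items()}

# Contribution of the reranking step on its own: about 34.5 per cent.
reranking_only = relative_reduction(2.9, 1.9)
```

The last value is the "another third" mentioned above: it is measured against the 2.9 per cent that remain after Contextual Retrieval, not against the 5.7 per cent baseline.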

For agencies and B2B decision-makers

Anyone setting up a production RAG system in the DACH region should plan reranking as a fixed pipeline stage from the outset – not as downstream tuning. The leverage is large, the implementation effort manageable: an additional API call (Cohere) or a self-hosted open-source reranker (BGE, Jina) between retrieval and LLM. For agencies building knowledge assistants or internal search tools for clients, reranking offers the best ratio of quality gain to effort – and a measurable argument in the pitch, because the precision increase can be demonstrated in evaluation frameworks such as RAGAS (Context Precision). Blck Alpaca designs such two-stage retrieval architectures including EU-compliant model choice and latency budgeting. Have your existing RAG pipeline checked to see whether the right chunks really do arrive at the front.

FAQ

What is the difference between a bi-encoder and a cross-encoder?
A bi-encoder encodes query and document independently into separate vectors; the similarity is only computed afterwards, typically as cosine similarity. This is fast and can be indexed in advance, making it ideal for the initial retrieval. A cross-encoder processes query and document jointly in a single forward pass and thereby delivers a much more precise relevance assessment, but it is too compute-intensive to search through millions of chunks. That is why the cross-encoder is only deployed during reranking on a small number of candidates.

Do I need reranking if I already use hybrid search?
In most production cases, yes. Hybrid search improves recall, meaning more relevant documents end up in the candidate pool. Reranking improves precision at the top: it ensures that the truly most relevant chunks are placed right at the front and make it into the limited LLM context. According to Anthropic, reranking adds measurable quality beyond Contextual Retrieval: the reduction in the top-20 retrieval failure rate rises from 49 to 67 per cent.

Which reranking models are relevant in 2026?
Cohere Rerank v3 and v3.5 are considered the API standard, multilingual including German. Open-source options are the BGE-Reranker (large or v2-m3, Apache 2.0) and the Jina Reranker v2 from Berlin. Voyage rerank-2 is another API variant. More flexible but more expensive is an LLM-as-judge reranker, such as the LLMRanker in Haystack. All details as of 2026.

How much latency does reranking cost?
Reranking is the most expensive single stage in the query path, because the cross-encoder runs a full forward pass per candidate. In practice, the total latency of hybrid retrieval plus reranking typically lies at around 100 to 800 milliseconds. This is controllable via the number of reranked candidates: the smaller the input pool, the lower the latency and cost.

What is ColBERT or late interaction?
ColBERT is a late-interaction approach that sits between bi- and cross-encoder. Instead of a single vector per document, it stores token-level embeddings and computes relevance via a MaxSim operation between query and document tokens. This delivers higher accuracy than pure bi-encoders with better scalability than cross-encoders, but requires more memory for the token vectors.
