4.13 · Advanced · 7 min

RAG Evaluation: RAGAS, TruLens and DeepEval Compared

Blck Alpaca
Definition

RAG evaluation is the systematic, measurable quality assessment of a retrieval-augmented generation system. It separately assesses whether retrieval finds the right documents and whether the generated answer is faithfully grounded in those sources. The core metrics are Faithfulness, Answer Relevance, Context Precision and Context Recall, measured with frameworks such as RAGAS, TruLens, DeepEval or LangSmith.

Key Takeaways

  • RAG evaluation separates two failure sources: poor retrieval (wrong contexts) and unfaithful generation (the model deviates despite correct sources). Both require dedicated metrics.
  • The four core metrics are Faithfulness/Groundedness, Answer Relevance, Context Precision and Context Recall. RAGAS, TruLens and DeepEval implement them with an LLM as judge.
  • RAGAS is regarded as the de facto standard and supports reference-free measurement; TruLens focuses on the RAG triad; DeepEval brings Pytest-style tests; LangSmith delivers end-to-end tracing plus dataset eval.
  • An eval dataset (gold set) made up of question, reference context and ideal answer is the foundation. Without it, quality remains unmeasurable and regressions only surface in production.
  • Evaluation belongs in the CI/CD pipeline: every pipeline change is tested against the gold set before it goes live. Otherwise silent quality regression looms (anti-pattern AP5).

RAG evaluation is the systematic, measurable quality assessment of a retrieval-augmented generation system. It answers two separate questions: does retrieval find the right documents, and is the generated answer faithfully grounded in precisely those documents? Without this separation, it remains unclear whether a poor answer stems from retrieval or from the language model. Frameworks such as RAGAS, TruLens, DeepEval and LangSmith automate this measurement with standardised metrics.

  • What is measured? Four core metrics: Faithfulness/Groundedness, Answer Relevance, Context Precision and Context Recall.
  • With what? RAGAS (de facto standard), TruLens (RAG triad), DeepEval (Pytest style), LangSmith (tracing + dataset eval) as well as Arize Phoenix (OSS tracing).
  • Why? Without evaluation, silent quality regression sets in: a pipeline change degrades the answers without anyone noticing, until users complain.

Why RAG has two failure sources

A RAG system can fail in two mutually independent ways. First, retrieval can deliver the wrong or incomplete contexts, in which case the model has no chance of a correct answer at all. Second, the model can hallucinate despite correct context, i.e. make statements that are not present in the retrieved material. This second phenomenon is called context-unfaithful generation: the model cites sources but deviates from them in substance.

This is precisely why a single overall score is not enough. RAG evaluation breaks quality down into a retrieval axis and a generation axis. Only this separation makes debugging possible: if Context Recall falls, you optimise chunking, embeddings or hybrid search. If Faithfulness falls despite good context, you optimise the prompt, citation forcing or guardrails.
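The two-axis split can be sketched as a small triage helper. This is an illustration only: the function name `triage`, the metric keys and the hint texts are assumptions, not part of any framework.

```python
# Hypothetical helper: map failing metrics to the axis that needs work.
RETRIEVAL_METRICS = {"context_precision", "context_recall"}
GENERATION_METRICS = {"faithfulness", "answer_relevancy"}


def triage(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one hint per axis that has a metric below its threshold."""
    failing = {
        name for name, limit in thresholds.items()
        if scores.get(name, 0.0) < limit
    }
    hints = []
    if failing & RETRIEVAL_METRICS:
        hints.append("retrieval: review chunking, embeddings or hybrid search")
    if failing & GENERATION_METRICS:
        hints.append("generation: review prompt, citation forcing or guardrails")
    return hints
```

A missing metric is treated as 0.0 here, so an incomplete eval run shows up as a failure rather than passing silently.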

The four core RAG metrics

RAGAS measures RAG quality along Faithfulness, Answer Relevancy, Context Precision and Context Recall. These four terms have become a widely shared vocabulary, even if individual tools name them differently (TruLens, for example, speaks of Groundedness rather than Faithfulness).

Faithfulness / Groundedness (generation axis)

Measures whether every single statement in the answer is supported by the retrieved context. An LLM-as-judge breaks the answer down into atomic claims and checks each one against the context. The value is the proportion of substantiated claims. Low Faithfulness means hallucination despite RAG, the most common trust trap in production systems.
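The arithmetic behind the score is simple once the judge has delivered its verdicts. The following is a minimal sketch, assuming the claim extraction and the per-claim verdicts already exist; `ClaimVerdict` and `faithfulness_score` are hypothetical names, not a framework API.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # verdict of the judge model, assumed to be given


def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """Proportion of atomic claims that the retrieved context supports."""
    if not verdicts:
        raise ValueError("the answer contains no claims to check")
    return sum(v.supported for v in verdicts) / len(verdicts)


# Invented example: three of four claims are backed by the context.
verdicts = [
    ClaimVerdict("The device supports firmware updates over USB.", True),
    ClaimVerdict("Updates take about five minutes.", True),
    ClaimVerdict("A reboot is required afterwards.", True),
    ClaimVerdict("The warranty is extended by the update.", False),
]
print(faithfulness_score(verdicts))  # 0.75
```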

Answer Relevance (generation axis)

Measures whether the answer actually addresses the question asked and does not digress or remain incomplete. An answer can be faithful but irrelevant, for example if it correctly references a secondary aspect but fails to address the core question. Both metrics must therefore always be considered together.

Context Precision (retrieval axis)

Measures whether the relevant chunks appear in the top ranks of the retrieval results. High precision means little noise, with the important passages coming first. This matters because irrelevant context pushes the useful passages into the middle of a long prompt, where models tend to overlook them (the lost-in-the-middle effect), which lowers generation quality.
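One common way to turn "relevant chunks first" into a number is a rank-weighted average of precision@k, similar in spirit to the formulation in the RAGAS documentation. The sketch below assumes the per-chunk relevance verdicts are already available; the function name is hypothetical and the exact formula of a given framework version may differ.

```python
def context_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision.

    relevance[i] is True if the chunk at rank i + 1 was judged relevant.
    Averages precision@k over the ranks that hold a relevant chunk.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank
    return weighted_sum / hits if hits else 0.0


# Same two relevant chunks, different ranking:
print(context_precision([True, True, False, False]))  # 1.0
print(context_precision([False, False, True, True]))  # about 0.42
```

Both result lists contain the same number of relevant chunks, but the score drops when they are ranked behind the noise.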

Context Recall (retrieval axis)

Measures whether all the information needed for the ideal answer was retrieved at all. Recall requires a reference answer (ground truth): each piece of information in the reference is checked for whether it appears in the retrieved contexts. Low recall is a classic symptom of poor chunking or an embedding model unsuited to the language.

In addition, the frameworks offer further metrics such as Answer Correctness, Noise Sensitivity (RAGAS) or Hallucination (DeepEval).

RAG evaluation tools compared

The following frameworks are the established tools in 2026. All work predominantly on the LLM-as-judge principle, i.e. a second model assesses the output.

| Tool | Core metrics | Distinctive feature |
| --- | --- | --- |
| RAGAS | Faithfulness, Answer Relevancy, Context Precision, Context Recall, Noise Sensitivity, Answer Correctness | De facto standard; reference-free measurement possible (source: docs.ragas.io v0.1.21). |
| TruLens | Groundedness, Answer Relevance, Context Relevance (the "RAG triad") | Compact triad scheme that covers both retrieval and generation faithfulness at once (TruLens-Eval). |
| DeepEval | G-Eval, Faithfulness, Hallucination, Contextual Precision/Recall | Evals in Pytest style, therefore directly integrable as unit tests in CI. |
| Arize Phoenix | LLM tracing plus eval | Open source, OpenTelemetry-compatible; strong for observability and trace inspection. |
| LangSmith | End-to-end tracing, dataset eval | Commercial (LangChain Inc.); first choice in LangChain/LangGraph stacks. |

Note as of 2026: version and metric designations evolve rapidly. The RAGAS metrics named here refer to the documentation for v0.1.21; before production use, check the current documentation of the respective tool.

Building an eval dataset (gold set)

Every robust evaluation needs a gold set, i.e. a curated dataset of representative questions. An entry typically consists of four fields:

  • question: a real or realistic user question
  • ground_truth: the ideal, factually correct answer
  • reference_contexts (optional): the chunks that substantiate the answer
  • metadata (optional): tenant, source, difficulty level
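As an illustration, the four fields could be modelled as follows. This is a sketch under assumptions: the class name `GoldEntry` and all example values are invented, and the JSON layout is one possible choice rather than a format any framework prescribes.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GoldEntry:
    question: str
    ground_truth: str
    reference_contexts: list[str] = field(default_factory=list)  # optional
    metadata: dict[str, str] = field(default_factory=dict)       # optional


# Invented example entry, for illustration only.
entry = GoldEntry(
    question="How do I reset the controller to factory settings?",
    ground_truth="Hold the reset button for ten seconds until the status LED blinks.",
    reference_contexts=[
        "Factory reset: hold the reset button for 10 s until the status LED blinks."
    ],
    metadata={"source": "manual.pdf", "difficulty": "easy"},
)

# Serialised like this, entries can live in a versioned JSON file in the repository.
print(json.dumps(asdict(entry), indent=2))
```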

Practical approach:

  1. Curate manually: collect 30 to 100 real questions from support logs, sales enquiries or specialist departments and annotate them with expert answers. Quality beats quantity.
  2. Supplement synthetically: RAGAS and DeepEval offer test-set generators that automatically produce question-answer pairs from your own document corpus. Be sure to check these manually on a sample basis.
  3. Plan for edge cases: questions whose answer is not in the corpus at all (the system should then answer "I don't know"), as well as questions about exact codes or IDs, which pure embeddings often miss.
  4. Version it: the gold set belongs in the repository. Every change becomes traceable, and eval results remain comparable over time.

Where a complete gold set is missing, reference-free metrics (Faithfulness, Answer Relevancy) plus implicit signals (clicks, thumbs up/down) can be used as an interim solution.

Concrete example: eval in the development loop

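A mid-sized company in the DACH region runs an internal RAG assistant on its technical documentation. The gold set comprises 60 questions. The evaluation runs automatically on every pull request (a sketch against the RAGAS v0.1.x API; check the signatures against the version you use):

```
# Needs a configured judge LLM (for example an API key in the environment).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

# 60 entries; assumed to already contain the pipeline's answers and contexts
# in the columns question, answer, contexts and ground_truth.
goldset = Dataset.from_json("eval/goldset_v3.json")
results = evaluate(
    dataset=goldset,
    metrics=[faithfulness, answer_relevancy,
             context_precision, context_recall],
)

# The build breaks if a threshold is undershot.
assert results["faithfulness"] >= 0.90
assert results["context_recall"] >= 0.80
assert results["answer_relevancy"] >= 0.85
```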

In the initial state, the pipeline delivers Faithfulness 0.88 and Context Recall 0.71. A developer adds a cross-encoder reranker and a contextual header per chunk. Anthropic puts the effect of Contextual Retrieval at a reduction of the retrieval failure rate by 49 per cent, and in combination with reranking by 67 per cent (Anthropic, as of 09/2024). In the gold set, Context Recall subsequently rises to 0.86 and Faithfulness to 0.93, all thresholds are met, and the build passes.

Three weeks later, someone experimentally lowers top_k from 8 to 3. Faithfulness remains stable, but Context Recall falls to 0.74. The automated eval run blocks the merge before the degradation ever reaches a user. This is exactly the point of evaluation in CI: making silent quality regression visible (anti-pattern AP5: deployment without faithfulness measurement).
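Plain assertions stop at the first failing metric. A gate that reports every undershot threshold and then returns an exit code is often easier to read in a CI log. The following framework-independent sketch uses the thresholds from this example; the function names are hypothetical, and the Answer Relevancy value of 0.88 is an assumption, since the scenario does not state it.

```python
THRESHOLDS = {
    "faithfulness": 0.90,
    "context_recall": 0.80,
    "answer_relevancy": 0.85,
}


def failed_metrics(scores, thresholds=THRESHOLDS):
    """Map each metric below its threshold to (score, threshold)."""
    return {
        name: (scores[name], limit)
        for name, limit in thresholds.items()
        if scores[name] < limit
    }


def gate(scores) -> int:
    """Print every failure and return a process exit code (0 = pass)."""
    failures = failed_metrics(scores)
    for name, (value, limit) in failures.items():
        print(f"FAIL {name}: {value:.2f} < {limit:.2f}")
    return 1 if failures else 0


# Scores after top_k was lowered from 8 to 3 (answer_relevancy is assumed).
exit_code = gate(
    {"faithfulness": 0.93, "context_recall": 0.74, "answer_relevancy": 0.88}
)
# Prints "FAIL context_recall: 0.74 < 0.80"; exit_code is 1.
```

In a CI job, the returned code would be passed to `sys.exit` so that the merge is blocked.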

Faithfulness as a runtime guardrail

Evaluation does not end at deployment. The same faithfulness measurement can be used as a runtime guardrail: if the faithfulness score of a specific answer falls below a threshold, the system refuses the answer or escalates to a human, rather than delivering a possibly hallucinated response. Combined with citation forcing (every statement must name a chunk source), this creates a double safeguard against hallucinations, which is mandatory in regulated industries.
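A runtime guardrail of this kind might look like the following sketch. The scoring function is injected as a parameter because the actual judge call depends on the framework in use; `guarded_answer`, `GuardedAnswer`, the refusal text and the default threshold of 0.9 are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "I can't answer this reliably from the available sources."


@dataclass
class GuardedAnswer:
    text: str
    delivered: bool  # False means refused or escalated to a human
    score: float


def guarded_answer(
    answer: str,
    contexts: list[str],
    score_fn: Callable[[str, list[str]], float],
    threshold: float = 0.9,
) -> GuardedAnswer:
    """Deliver the answer only if its faithfulness score meets the threshold."""
    score = score_fn(answer, contexts)
    if score < threshold:
        return GuardedAnswer(text=REFUSAL, delivered=False, score=score)
    return GuardedAnswer(text=answer, delivered=True, score=score)
```

Because the judge call adds latency and cost to every request, the threshold and the choice of judge model are worth tuning against the gold set before the guardrail goes live.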

For agencies and B2B

For marketing agencies and B2B decision-makers, RAG evaluation is the difference between a demo-ready gimmick and a production-ready knowledge system. Anyone selling a RAG assistant to a client should be able to demonstrate quality in numbers, not in anecdotes. A documented gold set plus a CI gate on Faithfulness and Context Recall is a concrete, verifiable quality promise and at the same time a differentiator in the pitch. Blck Alpaca from Vienna supports DACH companies in building evaluable RAG pipelines, from gold-set curation through tool selection to integrating evaluation into the development loop. Get in touch if you want to make your RAG system measurable.

FAQ

What is the difference between Faithfulness and Answer Relevance?
Faithfulness (also called Groundedness) measures whether every statement in the answer is supported by the retrieved context, i.e. that the model is not hallucinating. Answer Relevance, by contrast, measures whether the answer actually addresses the question asked. An answer can be faithful but irrelevant, and vice versa. That is why you need both metrics in parallel.
Do I always need a reference dataset for RAG evaluation?
No. RAGAS and comparable frameworks can compute many metrics reference-free via LLM-as-judge, such as Faithfulness and Answer Relevancy, because they only compare question, context and answer. Context Recall and Answer Correctness, however, require a reference (ground truth). For robust regression testing, a curated gold set is nevertheless recommended.
Which RAG evaluation framework should I choose?
RAGAS is the de facto standard with broad metric coverage and a reference-free option. TruLens fits when the RAG triad of Groundedness, Answer Relevance and Context Relevance is the priority. DeepEval is suitable if you want to write evals like unit tests in Pytest style within CI. LangSmith is the first choice for LangChain and LangGraph stacks that need end-to-end tracing.
How do I integrate RAG evaluation into the development loop?
Create a versioned gold set, define thresholds per metric (e.g. Faithfulness >= 0.9) and run the evaluation against the gold set on every pull request. If a metric falls below the threshold, the build is blocked. This way regressions become visible before deployment instead of only in production (anti-pattern AP5).
What is the RAG triad in TruLens?
The RAG triad of TruLens comprises three assessment axes: Context Relevance (are the retrieved contexts relevant to the question), Groundedness (is the answer supported by the contexts) and Answer Relevance (does the answer address the question). Together they cover retrieval quality and generation faithfulness in one compact scheme.
