
AI Agent Evaluation: Which Metrics Matter

Blck Alpaca·9 June 2026

Definition

AI agent evaluation measures whether an AI agent reliably accomplishes its intended task. The core metrics are task success rate, trajectory and tool-call correctness, groundedness or hallucination rate, latency, cost and HITL escalation rate. Measurement happens offline against an eval dataset and online in production.

Key Takeaways

✓ A single number such as task success rate is not enough: quality, tool correctness, groundedness, latency, cost and escalation rate must be considered together.
✓ Reliability (pass^k, consistency across multiple runs) is often more important than single-run accuracy. An agent with 90% success that fails unpredictably is worse than one with 80% and predictable failure behaviour.
✓ Offline evals against a locked golden set provide upfront evidence; online monitoring catches drift and real-world failures. Both are mandatory, not alternatives.
✓ LLM-as-judge scales the assessment but is no substitute for error analysis. Binary pass/fail with justification beats 1-5 scales; judges must be calibrated against human annotators.
✓ Agent benchmarks such as SWE-bench or GAIA demonstrate capability, not fitness for a specific use case. The proof for production deployment is your own German-language eval dataset.
✓ A golden set of 200-500 examples per use case is the highest-leverage component of the entire eval programme and the foundation for EU AI Act Art. 72 monitoring.

Evaluation is the discipline that decides whether an agent is allowed to leave the lab. With the EU AI Act, ISO/IEC 42001 and the model-risk expectations of BaFin and FINMA, it has shifted from a research practice to a compliance artefact.

The most important insight up front: a single figure on a leaderboard is no longer proof that an agent works. The Princeton team behind the Holistic Agent Leaderboard (HAL) shows that 18 months of capability progress brought only small improvements in reliability, while pure accuracy rose steadily. Anyone wanting to run agents in production and in a compliant manner needs a multi-dimensional set of metrics.

Quick answers:

The task success rate is the single most important figure, but never sufficient on its own: quality, tool correctness, cost and reliability belong with it.
Offline evals (fixed dataset, LLM-as-judge) provide upfront evidence; online monitoring catches drift and real-world failures. Both are mandatory.
Agent benchmarks demonstrate capability, not fitness for your use case; the proof is your own German-language eval dataset.

The core metric families for AI agents

A complete eval programme produces at least one figure from each relevant category per release and plots the trade-off between cost, quality and safety as a Pareto front, rather than collapsing it into a single score.

Metric	What it shows	Method
Task success rate (binary)	Proportion of tasks solved correctly against the actual outcome, the central headline figure	Outcome scorer: exact match, execution check or LLM judge against a reference
Trajectory/step correctness	Whether each step in the multi-step process was correct, not just the final result	Per-step scoring (PRM style), trajectory matching against the canonical solution path
Tool-call correctness	Whether the right tool was called with correct arguments in the right order	AST/JSON schema match (BFCL approach) or execution comparison; plus recovery rate on tool errors
Groundedness / hallucination rate	Proportion of statements supported by retrieved context or world knowledge	Claim extraction and per-claim verification (RAGAS faithfulness, MiniCheck, Vectara HHEM)
Latency (P50/P95/P99)	Response time per task, incl. tail latency for user-facing agents	Wall-clock measurement per task; tail percentiles instead of the mean
Cost	Input/output/reasoning tokens and €-per-task or €-per-1,000-tasks	Token counting per span; reasoning tokens separately (can dominate with reasoning models)
HITL/escalation rate	Proportion of tasks where the agent hands off to a human	Counting human-in-the-loop handovers per task
Consistency / pass^k	Whether N repeated runs on the same input yield the same result	pass^k (all k runs successful), standard deviation across 3-5 runs
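To make the tool-call correctness row concrete, here is a minimal sketch of an order-sensitive exact-match scorer. It is a simplified stand-in for schema-based matching, not the BFCL implementation; `ToolCall` and `tool_call_correctness` are hypothetical names, and the penalty for surplus calls is an assumption.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    """One tool invocation: tool name plus its JSON-like arguments (hypothetical type)."""
    name: str
    arguments: dict[str, Any]


def tool_call_correctness(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Share of positions where the agent called the expected tool with equal arguments.

    Order-sensitive: calls are compared position by position.
    Assumption: missing or surplus calls count against the score, because the
    denominator is the longer of the two sequences.
    """
    if not expected and not actual:
        return 1.0
    matched = sum(1 for exp, act in zip(expected, actual) if exp == act)
    return matched / max(len(expected), len(actual))


reference = [ToolCall("search_orders", {"customer_id": 42}),
             ToolCall("refund", {"order_id": 7, "amount": 19.9})]
agent_run = [ToolCall("search_orders", {"customer_id": 42}),
             ToolCall("refund", {"order_id": 7, "amount": 29.9})]
score = tool_call_correctness(reference, agent_run)  # second call has a wrong argument
```

An exact match is deliberately strict. Where argument order or equivalent values should be tolerated, an execution comparison of the resulting state is the more forgiving alternative named in the table.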

Two points deserve emphasis. First, the escalation rate is not a purely negative signal: a well-tuned escalation to a human is a feature, not a defect, and what matters is that the agent hands off the right cases. Second, reliability is often more important than accuracy. An agent that succeeds 90% of the time but fails unpredictably in the remaining 10% is frequently worse in DACH production than one with 80% success and predictable, recoverable failure behaviour. Sierra's τ-bench shows that pass^k, the probability that all k runs of the same task succeed, drops markedly as k rises to 8 for many models: the same agent, the same task, materially different results across runs. Anyone reporting only a single-seed run hides this variance. Minimum standard: mean plus standard error across 3-5 runs.
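The two reliability figures can be sketched as follows. The pass^k estimate uses the combinatorial form C(c, k) / C(n, k) per task (c successes out of n trials), averaged over tasks; the function names and the example numbers are illustrative assumptions, not measurements.

```python
from math import comb, sqrt
from statistics import mean, stdev


def pass_hat_k(successes_per_task: list[int], n: int, k: int) -> float:
    """Estimate pass^k: the chance that k independent runs of a task ALL succeed.

    successes_per_task holds, for each task, how many of its n trials succeeded.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(c, k) is 0 when c < k, so tasks with too few successes contribute 0.
    return mean([comb(c, k) / comb(n, k) for c in successes_per_task])


def mean_and_standard_error(run_scores: list[float]) -> tuple[float, float]:
    """Mean success rate across repeated runs plus the standard error of that mean."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return mean(run_scores), stdev(run_scores) / sqrt(len(run_scores))


# Illustrative data: three tasks, eight trials each, with 8, 6 and 4 successes.
successes = [8, 6, 4]
p1 = pass_hat_k(successes, n=8, k=1)  # 0.75, the ordinary average success rate
p8 = pass_hat_k(successes, n=8, k=8)  # only the first task passes all 8 runs
avg, se = mean_and_standard_error([0.8, 0.9, 0.7])
```

In this toy data the average success rate is 75%, yet only one task in three survives eight consecutive runs. That gap is what a single-seed report conceals.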

Offline evals vs. online monitoring

The fundamental dichotomy of the discipline is offline versus online, and mature programmes use both sides.

Offline evaluation runs batched against a fixed, versioned dataset. It is deterministic, repeatable and provides the upfront evidence that EU AI Act Art. 15 requires for high-risk systems: accuracy must be measured and declared, not asserted aspirationally. Offline evals are the place for LLM-as-judge against reference answers and for systematically working through known edge cases.

Online evaluation assesses real production traffic. It is the bridge between "we have deployed" and "we can demonstrate under Art. 72 that the system continues to operate within tolerances". Proven deployment patterns:

Shadow mode: the new system runs in parallel with the same inputs, but its outputs are not served; instead they are logged and compared. A de facto standard before any high-risk rollout.
Canary deployment: 1-5% of traffic goes to the new system, with further rollout dependent on real-time metrics. Important: the gate metrics must include eval scores, not just latency and error rate.
Eval gates: automatic thresholds that stop the rollout (e.g. "halt promotion if faithfulness drops below 0.85 within a 4-hour window").
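The eval gate from the example above could look roughly like this. It is a minimal sketch: `should_halt_promotion` is a hypothetical name, the 0.85 threshold and 4-hour window come from the example, and the `min_samples` guard is an added assumption so that a handful of traces cannot stop a rollout on its own.

```python
from datetime import datetime, timedelta


def should_halt_promotion(
    scores: list[tuple[datetime, float]],
    now: datetime,
    threshold: float = 0.85,
    window: timedelta = timedelta(hours=4),
    min_samples: int = 20,
) -> bool:
    """Return True if mean faithfulness inside the trailing window is below threshold.

    scores: (timestamp, faithfulness score in [0, 1]) pairs from sampled traces.
    Assumption: with fewer than min_samples scores in the window the gate does
    not fire; a stricter policy could treat missing data as a reason to halt.
    """
    recent = [score for ts, score in scores if now - window <= ts <= now]
    if len(recent) < min_samples:
        return False
    return sum(recent) / len(recent) < threshold


now = datetime(2026, 6, 9, 12, 0)
degraded = [(now - timedelta(minutes=i), 0.80) for i in range(30)]
halt = should_halt_promotion(degraded, now)  # mean 0.80 is below 0.85
```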

Since running LLM judges on every trace is prohibitively expensive, sampling is used. A sensible starting configuration for the Mittelstand: 1-5% stratified random sampling plus 100% anomaly-driven sampling (long sessions, many retries, explicit negative feedback). Cheap heuristics (PII check, format check, refusal detection) run synchronously inline; expensive judge evals run asynchronously on the sample. This sync/async split is the most important cost lever: without it, the eval overhead can exceed inference costs by a factor of 2 to 5.
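A sampling decision along these lines can be sketched in a few lines. All names, strata and cut-offs below are hypothetical: anomalous traces are always sent to the judge, while the rest are sampled at a per-stratum rate using a hash of the trace ID, so the decision is reproducible for a given trace.

```python
import hashlib
from typing import Optional

# Hypothetical per-stratum rates (for example by use case); "default" is the fallback.
DEFAULT_RATES = {"default": 0.02}


def uniform_bucket(trace_id: str) -> float:
    """Map a trace ID deterministically to a number in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def should_run_judge(
    trace_id: str,
    stratum: str,
    *,
    retries: int,
    turns: int,
    negative_feedback: bool,
    rates: Optional[dict[str, float]] = None,
    max_retries: int = 3,   # assumed anomaly cut-off
    max_turns: int = 20,    # assumed cut-off for "long sessions"
) -> bool:
    """Decide whether the expensive asynchronous judge evaluates this trace."""
    # Anomaly-driven sampling: always evaluate suspicious traces.
    if negative_feedback or retries >= max_retries or turns >= max_turns:
        return True
    # Stratified random sampling: each stratum has its own rate.
    active = DEFAULT_RATES if rates is None else rates
    rate = active.get(stratum, active.get("default", 0.0))
    return uniform_bucket(trace_id) < rate


flagged = should_run_judge("trace-001", "billing", retries=5, turns=2,
                           negative_feedback=False)
```

Hashing the trace ID instead of calling a random generator means a re-processed trace receives the same decision, which keeps the eval store consistent.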

Online evaluation should additionally be anchored in real user behaviour, not just judge scores: copy-paste rate, edit-after-paste, retry rate and abandonment rate are slow but ground-truth signals. In DACH, the click rate on thumbs-up/down is culturally lower than in the US, which dampens the informativeness of explicit feedback signals.

LLM-as-judge: useful, but it does not run itself

The canonical reference (Zheng et al., 2023) shows that strong judge models achieve over 80% agreement with controlled human evaluations, the same level as human-to-human agreement. This applies to opinion-based tasks, however; for fact-critical German-language content (law, medicine, finance), human evaluation remains the gold standard.

Practical guidelines from the field:

Binary instead of Likert. Pass/fail with an articulated critique beats 1-5 scales, because the 3-versus-4 boundary is unstable across judges and runs.
Mitigate known biases. Position bias by swapping and averaging both orderings; verbosity bias through length control; self-enhancement bias through judges from a different model family or a judge ensemble.
Calibrate. Report judge-versus-human agreement as Cohen's kappa or Krippendorff's alpha before a judge is trusted in production.
Cap costs. Rule of thumb: budget 10-30% of inference costs for evaluation; with reasoning judges this can rise above 50%. Pattern: reasoning judges for the validation set, cheap specialist judges (such as MiniCheck-style models) for production sampling.
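For the calibration step, Cohen's kappa between judge and human labels can be computed directly. The sketch below is a plain-Python version for illustration; the labels are invented, and treating the degenerate single-label case as perfect agreement is an assumption.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labelled independently at their own base rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label]
                   for label in count_a.keys() | count_b.keys()) / n**2
    if expected == 1.0:
        # Assumption: both raters used one identical label throughout; kappa is
        # formally undefined here and is reported as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "pass"]
kappa = cohens_kappa(human, judge)  # observed 0.75, expected 0.50, kappa 0.5
```

Raw agreement here is 75%, but kappa is only 0.5, because half of that agreement would be expected by chance. This is why the guideline asks for kappa or alpha rather than a plain agreement percentage.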

Crucially: LLM-as-judge is no substitute for error analysis. The practitioner consensus is that 60-80% of the eval effort should continue to flow into looking at data, finding failure modes and translating them into pass/fail asserts.

How to build the eval dataset and eval loop

A robust golden set has documented provenance, a labelling rubric, inter-annotator agreement and explicit coverage of the real use cases. Practical recipe:

Sample 200-500 real user traces; redact PII in line with GDPR Art. 32.
Have two domain experts label each trace with outcome plus critique; compute Cohen's kappa, resolve conflicts.
Add 50-100 expert-written adversarial edge cases at policy boundaries.
Add 100-200 synthetic cases across documented dimensions (features × scenarios × personas).
Split off 30% as a locked test set that is never used for tuning; 70% for iteration.
Version it, document a dataset card, refresh quarterly.
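The 70/30 split from the recipe can be made reproducible by deriving it from a hash of each example ID rather than from a random shuffle. This is one possible approach under stated assumptions: `assign_split`, the salt value and the ID format are hypothetical.

```python
import hashlib


def assign_split(example_id: str, test_fraction: float = 0.30,
                 salt: str = "golden-set-v1") -> str:
    """Assign an example to 'locked_test' or 'iteration' deterministically.

    The same ID always lands in the same split, so examples added in a quarterly
    refresh never move existing items out of the locked test set.
    Assumption: the salt stays fixed for the lifetime of the dataset.
    """
    digest = hashlib.sha256(f"{salt}:{example_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:6], "big") / 2**48  # uniform in [0, 1)
    return "locked_test" if bucket < test_fraction else "iteration"


ids = [f"trace-{i:04d}" for i in range(2000)]
splits = {example_id: assign_split(example_id) for example_id in ids}
locked_share = sum(s == "locked_test" for s in splits.values()) / len(ids)
```

The resulting share is only approximately 30%, since each example is assigned independently. If an exact count is required, a seeded shuffle with a stored ID list is the alternative.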

For a DACH Mittelstand pilot, a 500-item golden set initially requires roughly 3-6 weeks of expert time and thereafter about one week per quarter. This is the highest-leverage component of the entire eval programme.

The continuous eval loop connects both worlds: user request → agent run → trace-span emission (OpenTelemetry) → sampling layer → eval layer (LLM judges plus heuristics plus small-model judges) → eval store → dashboard → alert on threshold breach → feedback into the golden set. Mature programmes turn this into a CI/CD gate: every pull request that changes a prompt, model choice, retrieval configuration or tool definition triggers the eval suite; the merge fails if the pass rate drops below the threshold. A provider's model version change should also trigger an automatic re-eval before the new model is allowed into production routing. A locked golden set should be re-evaluated against the live system at least weekly, and daily for production-critical agents, in order to separate model/prompt regression from pure environment drift.

For GDPR-compliant data residency, self-hostable tools are suitable, such as the German-founded Langfuse (self-host via ClickHouse and Postgres) or the OTel-native Arize Phoenix; RAGAS is the de facto library for RAG metrics (faithfulness, context precision, answer relevancy), while DeepEval and Inspect AI cover CI/CD-adjacent and safety-relevant evals. (As of 2026; the eval tooling market is currently consolidating via M&A such as Cisco/Galileo.)

Distinction from model benchmarks

A common and costly fallacy is to confuse benchmark scores with proof of suitability. A model-card eval is the provider's marketing or research artefact: a snapshot of capability across public benchmarks. A compliance eval is the operator's regulatory artefact: upfront plus ongoing measurement against the actual use case, dataset and risk profile, with documented methodology, versioned scorers and an audit trail. These are different artefacts with different audiences, and most regulators do not accept a model card as a substitute.

Three reasons why benchmark scores only provide context:

Contamination. Any benchmark whose dataset existed publicly before 2024 is presumably in the training data. Example: Claude Opus 4.5 achieves 80.9% on SWE-bench Verified, but only 45.9% on the contamination-resistant SWE-bench Pro: the same model, the same task family, 35 points of difference (as of 2026).
Scaffold dependence. "GAIA leaderboard leader" is meaningless without naming the scaffold: between scaffolded (Princeton HAL with Claude Sonnet 4.5: 74.6%) and bare scores there is a spread of roughly 30 percentage points.
Reward hacking. A UC Berkeley RDI paper (April 2026) showed that an automated scanning agent was able to hack all eight major agent benchmarks to near-perfect scores without solving a single task.

The defensible position for a DACH compliance dossier: cite several benchmarks across families, report cost and variance, supplement with customer-private golden sets, and acknowledge the limits in writing. English-language agent benchmarks (SWE-bench, GAIA, τ-bench, BFCL, OSWorld) are useful for capability baselining; the actual proof remains the German-language eval dataset, ideally mapped to the BSI criteria catalogues.

For agencies and B2B decision-makers

Anyone planning or operating AI agents for DACH clients should understand evaluation not as a downstream test, but as an end-to-end discipline: a German-language golden set as the foundation, offline gates before go-live, online monitoring with drift detection afterwards, and a metric set that takes reliability and cost as seriously as the bare success rate. This very eval foundation is at the same time the prerequisite for robust EU AI Act Art. 72 monitoring. Blck Alpaca supports agencies and B2B companies from Vienna in setting up such an eval programme, from the golden set through tool selection to an audit-proof compliance dossier. Get in touch if you want your agents to go into production in a measurable and compliant way.

FAQ

What is the most important metric for AI agent evaluation?

The task success rate, i.e. the proportion of tasks solved correctly measured against the actual outcome, is the single most important figure. But it is never sufficient on its own: without tool-call correctness, groundedness, latency, cost and reliability (consistency across multiple runs), a high success rate often conceals unpredictable failure behaviour.

What is the difference between offline and online evaluation?

Offline evaluation runs batched against a fixed, locked eval dataset and provides the upfront evidence before go-live (EU AI Act Art. 15). Online evaluation assesses real production traffic on a sample basis and detects drift as well as real-world failures in operation (Art. 72). Mature programmes use both.

Are benchmark scores such as SWE-bench or GAIA sufficient as proof of quality?

No. Agent benchmarks demonstrate capability under laboratory conditions, not fitness for a specific use case. They are prone to contamination and scaffold-dependent: GAIA shows roughly 30 percentage points of spread between scaffolded and bare scores. The robust proof for production deployment is your own German-language golden set.

How reliable is LLM-as-judge?

According to Zheng et al. (2023), strong judge models achieve over 80 percent agreement with human evaluations on opinion-based tasks, i.e. on a par with human-to-human agreement. However, they have known biases (position, length, self-enhancement) and must be calibrated against human annotators. For fact-critical content, human evaluation remains the gold standard.

How do you build an eval dataset for AI agents?

Practical recipe: sample 200-500 real user traces and redact PII, have two domain experts label each with outcome and critique, check Cohen's kappa, add 50-100 adversarial edge cases and 100-200 synthetic cases, then split off 30 percent as a locked test set, version it and refresh quarterly.
