10.14 · Intermediate · 8 min

Measuring Agency KPIs Better with AI Agents

Blck Alpaca
Definition

Agency KPIs for AI Agents are the metrics with which an agency demonstrates the value and economic viability of agent-based services: client-side output, quality, conversion and time-to-value; internally utilisation, margin, token costs, plus error and HITL rates. The key is separating adoption vanity metrics from auditable value metrics.

Key Takeaways

  • Adoption is necessary but not sufficient: the number of pilot projects, deployed tools and active users does not correlate with value creation. Steer towards outcome metrics (cost-out, cycle time, conversion, margin).
  • Report both leading and lagging indicators, but compensate the team on lagging KPIs. Firms that reward adoption get high adoption and no value.
  • Calculate ROI conservatively: the credible reference point is the Brynjolfsson study with a 14 percent productivity gain (34 percent for newcomers), not 10x. GitHub Copilot's 55 percent applies only to isolated coding tasks; end-to-end, Bain shows 10 to 15 percent.
  • Self-reported time savings are unreliable: in the METR field trial, experienced developers were in fact 19 percent slower yet believed they had been 20 percent faster. Telemetry and outcome metrics beat self-reports.
  • The largest hidden cost block is human-in-the-loop review: often 30 to 60 percent of the gross saving. Anyone who fails to measure HITL costs overestimates the margin.
  • Concentration beats proliferation: according to BCG, AI leaders focus on 3.5 use cases versus 6.1 among laggards and expect 2.1x the ROI. Hard kill gates at 6 and 12 months are more valuable than soft governance.

For an agency, the key is to separate fast-growing adoption figures from the few auditable value metrics that a CFO can follow.

  • Measure the client side, not just usage: success rate, quality, conversion/cost-out and time-to-value demonstrate value to the client – licence utilisation alone does not.
  • Look at the margin internally: token cost per task, utilisation and above all the HITL escalation rate determine whether an agent project is profitable.
  • Calculate conservatively: the credible reference point is a productivity increase of around 14 percent, not the tenfold figure often promised.

Why most agency dashboards celebrate the wrong number

The most common KPI mistake in DACH projects in 2026 is to pin success on adoption alone. A programme with full licence utilisation that has never moved a P&L line is a programme without impact. The empirical picture is unambiguous: the number of pilot projects launched, the number of use cases identified and the number of AI tools deployed do not correlate with value creation. BCG provides the clearest counterexample in its AI Radar – AI leaders concentrate on an average of 3.5 use cases, laggards spread across 6.1, and the leaders nonetheless expect 2.1x the ROI. Concentration beats proliferation. On top of this: around 60 percent of the companies surveyed do not define or monitor any financial KPI for their AI value at all.

For an agency this means: the reporting must cleanly separate two layers – adoption metrics (necessary, not sufficient) and outcome metrics (the ones that count).

Client-side KPIs: output, quality, conversion, time-to-value

For the client, what matters is what the agent achieves in the process. Four dimensions matter most:

  • Output / success rate: the share of tasks the agent correctly brings to an end state – the single most important headline figure. Complemented by task completion (does the agent even reach a terminal state?) to distinguish "gave up" from "answered incorrectly".
  • Quality: faithfulness (the share of statements covered by retrieved context or world knowledge) or its complement, the hallucination rate, and – especially for law, medicine, finance – citation accuracy at the statement level. Consistency across repeated runs belongs here too: an agent that solves 90 percent of cases but fails unpredictably on the other 10 percent is often worse than one that solves 80 percent and fails in a predictable, remediable way.
  • Conversion / impact: depending on the process, lead-to-quote rate, deflection rate in service, CSAT/NPS, cycle-time reduction (case-to-close), defect or error rate.
  • Time-to-value: how quickly does the first measurable ROI emerge? Realistic DACH expectations are 3 to 6 months for service tier-1 augmentation, 6 to 9 for CRM-embedded sales/marketing copilots, 6 to 12 for internal knowledge/search agents, and 9 to 15 for document-heavy back-office processes. Blanket three-month promises are not credible.
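The distinction between "gave up" and "answered incorrectly", and the consistency check across repeated runs, can be sketched as follows. This is a minimal illustration, not a reference implementation: the `TaskRun` record, its field names and the existence of an outcome scorer that supplies `correct` are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRun:
    """One agent run on one task (hypothetical record layout)."""
    task_id: str
    reached_terminal_state: bool  # did the agent finish at all?
    correct: bool                 # verdict of an assumed outcome scorer


def outcome_metrics(runs: list[TaskRun]) -> dict[str, float]:
    """Split failures into 'gave up' and 'answered incorrectly'."""
    if not runs:
        raise ValueError("no runs to score")
    counts: Counter[str] = Counter()
    for run in runs:
        if not run.reached_terminal_state:
            counts["gave_up"] += 1
        elif run.correct:
            counts["success"] += 1
        else:
            counts["wrong_answer"] += 1
    total = len(runs)
    return {
        "task_completion": (counts["success"] + counts["wrong_answer"]) / total,
        "success_rate": counts["success"] / total,
        "gave_up_rate": counts["gave_up"] / total,
        "wrong_answer_rate": counts["wrong_answer"] / total,
    }


def consistency(runs: list[TaskRun]) -> float:
    """Share of tasks whose repeated runs all ended with the same verdict."""
    if not runs:
        raise ValueError("no runs to score")
    verdicts: dict[str, set[bool]] = {}
    for run in runs:
        ok = run.reached_terminal_state and run.correct
        verdicts.setdefault(run.task_id, set()).add(ok)
    return sum(len(v) == 1 for v in verdicts.values()) / len(verdicts)
```

Reporting `task_completion` next to `success_rate` shows whether a weak headline figure comes from runs that never finish or from runs that finish with a wrong result. The two call for different fixes.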

Internal KPIs: utilisation, margin, token costs, error/HITL rate

The second KPI family protects the economic viability of the agency itself:

  • Utilisation of the scarce roles, above all the AI product managers who are accountable for use cases and outcomes.
  • Project margin as a lagging metric, validated by Finance.
  • Token/inference cost per task: input tokens, output tokens and – decisive since 2025 – reasoning tokens, which with reasoning models can dominate the cost. Reporting units: euros per task and euros per 1,000 tasks, plus latency P50/P95/P99.
  • Error and HITL rate: the human-in-the-loop escalation rate is the direct lever on the margin. Every escalation ties up review time, which often consumes 30 to 60 percent of the gross deflection saving – the largest hidden cost block in service and document agents. Important for interpretation: a high escalation rate is not automatically bad; well-calibrated escalation in high-risk processes is a feature. What matters is the downward trend at stable quality.
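As a rough sketch of the reporting units named above (euros per task, euros per 1,000 tasks, latency P50/P95/P99), the following computes them from per-task usage records. The token prices in `PRICE_EUR_PER_M` are hypothetical placeholders, not real vendor rates, and the `TaskUsage` layout is an assumption for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskUsage:
    """Usage of one agent task (hypothetical record layout)."""
    input_tokens: int
    output_tokens: int
    reasoning_tokens: int
    latency_s: float


# Hypothetical prices in euros per 1 million tokens, not real vendor rates.
PRICE_EUR_PER_M = {"input": 2.0, "output": 8.0, "reasoning": 8.0}


def task_cost_eur(u: TaskUsage) -> float:
    """Inference cost of one task, reasoning tokens included."""
    return (
        u.input_tokens * PRICE_EUR_PER_M["input"]
        + u.output_tokens * PRICE_EUR_PER_M["output"]
        + u.reasoning_tokens * PRICE_EUR_PER_M["reasoning"]
    ) / 1_000_000


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for p between 1 and 100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def cost_report(tasks: list[TaskUsage]) -> dict[str, float]:
    """Aggregate cost and latency figures for a batch of tasks."""
    if not tasks:
        raise ValueError("no tasks")
    total = sum(task_cost_eur(t) for t in tasks)
    reasoning = (
        sum(t.reasoning_tokens for t in tasks)
        * PRICE_EUR_PER_M["reasoning"] / 1_000_000
    )
    latencies = [t.latency_s for t in tasks]
    per_task = total / len(tasks)
    return {
        "eur_per_task": per_task,
        "eur_per_1000_tasks": per_task * 1000,
        "reasoning_share_of_cost": reasoning / total if total else 0.0,
        "latency_p50_s": percentile(latencies, 50),
        "latency_p95_s": percentile(latencies, 95),
        "latency_p99_s": percentile(latencies, 99),
    }
```

Tracking `reasoning_share_of_cost` separately makes it visible when reasoning tokens, rather than input or output tokens, drive the bill.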

Leading vs. lagging – and the compensation rule

Leading indicators move quickly and are controllable; lagging indicators show the business outcome with a delay, but credibly.

| KPI | Definition | Source | Category/Direction |
|---|---|---|---|
| Adoption rate (WAU/MAU) | Active users per feature / licences | Product telemetry | Leading – high, but only a precondition |
| Tasks per user/day | Agent interactions used | Product telemetry | Leading – high |
| Eval pass rate | Asserts/judge checks passed per release | Eval pipeline (CI/CD) | Leading – keep high |
| HITL escalation rate | Share of tasks with human review | Agent logs/tracing | Leading – controlled decline |
| Success rate | Share of correctly completed tasks | Outcome scorer | Outcome – high |
| Hallucination rate | Share of unsupported statements | LLM judge / MiniCheck | Outcome – low |
| Cycle-time reduction | Median lead time vs. baseline | Process data | Lagging – falling |
| Cost-out / margin | Saving or contribution, confirmed by Finance | Finance | Lagging – rising |
| NPS/CSAT | Satisfaction in client-facing processes | Survey | Lagging – stable/rising |
| Token cost/task | Inference cost in € per case | Observability/gateway | Efficiency – low |

The discipline must be made explicit: report both, but compensate on lagging. Every study cited shows the same pattern – teams rewarded for adoption deliver high adoption and no value; teams rewarded for outcomes deliver measurable value.
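The rule "report both, compensate on lagging" can be made explicit in the KPI registry itself. The sketch below is one possible encoding, with the `Kind` enum and the `Kpi` record as assumptions for illustration; the entries are a subset of the KPIs discussed in this section.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    LEADING = "leading"
    OUTCOME = "outcome"
    LAGGING = "lagging"
    EFFICIENCY = "efficiency"


@dataclass(frozen=True)
class Kpi:
    name: str
    kind: Kind
    source: str


KPIS = [
    Kpi("Adoption rate (WAU/MAU)", Kind.LEADING, "Product telemetry"),
    Kpi("HITL escalation rate", Kind.LEADING, "Agent logs/tracing"),
    Kpi("Success rate", Kind.OUTCOME, "Outcome scorer"),
    Kpi("Cycle-time reduction", Kind.LAGGING, "Process data"),
    Kpi("Cost-out / margin", Kind.LAGGING, "Finance"),
    Kpi("NPS/CSAT", Kind.LAGGING, "Survey"),
    Kpi("Token cost/task", Kind.EFFICIENCY, "Observability/gateway"),
]


def reported(kpis: list[Kpi]) -> list[str]:
    """Every KPI appears in the report."""
    return [k.name for k in kpis]


def compensated(kpis: list[Kpi]) -> list[str]:
    """Only lagging KPIs may be tied to objectives and bonuses."""
    return [k.name for k in kpis if k.kind is Kind.LAGGING]
```

With this encoding, adoption can never slip into a bonus scheme by accident: it is reported, but `compensated` filters it out.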

Calculate ROI conservatively: 14 percent, not 10x

The most credible documented productivity figure for AI in the enterprise comes from the study by Brynjolfsson, Li and Raymond (NBER WP 31161, 2023; QJE 2025): 14 percent more cases resolved per hour in customer service on average, 34 percent for newcomers and low-skilled workers, with barely any effect on experienced professionals. That is the upper end of what is credible for service – and a strategic signal: AI distributes value down the skill curve.

Two corrections belong in every agency calculation:

  • The GitHub Copilot classic "55 percent faster" applies only to tightly specified, isolated coding tasks – not to end-to-end delivery. There, Bain (Technology Report 2025) typically finds 10 to 15 percent, and the time saved is often not redirected into higher-value work because review, test and deployment bottlenecks remain downstream.
  • Self-reports are unreliable. In the METR field trial (arXiv 2507.09089, 2025), 16 experienced open-source developers were 19 percent slower with AI, yet they had forecast a 24 percent speed-up beforehand and afterwards still believed they had been 20 percent faster; ML and economics experts even predicted a 38 to 39 percent speed-up. Boardroom translation: steer on telemetry and outcome metrics, not on self-reports.

From this follows the clean bottom-up formula that a CFO can check:

```
Gross saving = time saving % × annual volume × fully-loaded cost per case

Net ROI      = gross saving
               - licence/platform
               - deployment/integration
               - observability/eval
               - HITL review (30-60% of the gross saving)
```
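The same bottom-up calculation can be written as a small function, so that every assumption is an explicit, checkable input. This is a sketch: the `UseCase` record and its field names are illustrative, and the figures in the usage example are hypothetical, not taken from a real project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    """Inputs of the bottom-up calculation (illustrative field names)."""
    time_saving_share: float      # e.g. 0.14 for 14 percent
    annual_volume: int            # cases per year
    cost_per_case_eur: float      # fully loaded
    licence_eur: float            # licence/platform per year
    deployment_eur: float         # deployment/integration
    observability_eur: float      # observability/eval
    hitl_share_of_gross: float    # 0.30 to 0.60 per the text


def gross_saving(uc: UseCase) -> float:
    return uc.time_saving_share * uc.annual_volume * uc.cost_per_case_eur


def net_value(uc: UseCase) -> float:
    """Gross saving minus platform, delivery, eval and HITL review costs."""
    gross = gross_saving(uc)
    hitl_cost = uc.hitl_share_of_gross * gross
    return (
        gross
        - uc.licence_eur
        - uc.deployment_eur
        - uc.observability_eur
        - hitl_cost
    )


# Hypothetical example: 14 percent saving on 100,000 cases at EUR 10 each.
example = UseCase(
    time_saving_share=0.14,
    annual_volume=100_000,
    cost_per_case_eur=10.0,
    licence_eur=20_000,
    deployment_eur=30_000,
    observability_eur=10_000,
    hitl_share_of_gross=0.45,
)
print(f"gross: {gross_saving(example):,.0f} EUR, net: {net_value(example):,.0f} EUR")
```

In this hypothetical case a gross saving of about €140,000 shrinks to roughly €17,000 net, which illustrates why the HITL share belongs in the formula rather than in a footnote.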

One must stay honest about the "ROI-is-not-measurable" problem: when LLM costs are small relative to total OpEx, the gain shows up as faster work, not as a measurable cost reduction at line-item level. With broad horizontal copilots the ROI is often not detectable as a line item – that is not a failure if the bet was deliberately declared as a capability investment.

Example dashboard: service tier-1 agent of a mid-sized company

Assumptions: 120,000 service cases per year, fully-loaded cost of €6 per case, deflection rate of 40 percent, inference cost of €0.30 per conversation (range reported in research: €0.10 to €1.00).

| Metric | Value | Category/Direction |
|---|---|---|
| Deflection rate | 40 % | Outcome – rising |
| Success rate (resolved) | 88 % | Outcome – high |
| Hallucination rate | 1.8 % | Quality – low |
| HITL escalation rate | 17 % → trend ↓ | Leading – falling |
| CSAT vs. baseline | +3 points | Lagging – stable/rising |
| Token cost/case | €0.30 | Efficiency – low |
| Latency P95 | 4.1 s | UX – low |
| Gross saving/year | 120,000 × 40 % × €6 = €288,000 | Calculation |
| – HITL recapture (~45 %) | −€130,000 | Margin-relevant |
| – Licence/operations/eval | −€90,000 | Margin-relevant |
| Net value year 1 | ≈ €68,000 | Lagging, Finance-validated |

The dashboard makes two things visible: the seemingly expensive item (token costs) is not the cost driver – HITL and operations are. And the HITL rate is the lever on which the net value is decided over the following quarters.
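How strongly the net value depends on the HITL share can be checked with a short sensitivity calculation on the dashboard assumptions. Two points in this sketch are assumptions that the example itself does not state: that every incoming case passes through the agent (used for the annual token cost), and that the €90,000 for licence, operations and eval stays fixed while the HITL share varies.

```python
ANNUAL_CASES = 120_000
COST_PER_CASE_EUR = 6.0
DEFLECTION_RATE = 0.40
TOKEN_COST_PER_CASE_EUR = 0.30
FIXED_COSTS_EUR = 90_000  # licence/operations/eval, assumed constant


def gross_saving() -> float:
    """Deflected cases times fully-loaded cost per case."""
    return ANNUAL_CASES * DEFLECTION_RATE * COST_PER_CASE_EUR


def net_value(hitl_share: float) -> float:
    """Net value for a given share of the gross saving recaptured by HITL."""
    gross = gross_saving()
    return gross - hitl_share * gross - FIXED_COSTS_EUR


def annual_token_cost() -> float:
    """Assumes every incoming case runs through the agent once."""
    return ANNUAL_CASES * TOKEN_COST_PER_CASE_EUR


for share in (0.30, 0.45, 0.60):
    print(f"HITL recapture {share:.0%}: net value {net_value(share):,.0f} EUR")
print(f"Annual token cost: {annual_token_cost():,.0f} EUR")
```

Across the 30 to 60 percent corridor the net value moves from roughly €111,600 down to roughly €25,200. At exactly 45 percent it comes to about €68,400, which the dashboard rounds to €68,000 by rounding the HITL item to €130,000. Under the stated assumption, the token cost of about €36,000 per year stays well below the HITL item.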

OKR form that the CFO can check

Objective: build credible agent capability in the core revenue function.

  • KR1: 70 %+ active weekly adoption of the agent in the function within 9 months.
  • KR2: 25 %+ reduction in the median cycle time of the target process against baseline within 12 months.
  • KR3: NPS/CSAT with no deterioration (or +5 %) over the period.
  • KR4: HITL escalation rate <20 % within 12 months with a measurable downward trend.
  • KR5: net P&L contribution validated by Finance within 18 months.

This form forces outcomes that are auditable – and kill discipline. Hard gates are more valuable than soft governance: at 6 months without a clear ROI path (adoption flat below 30 percent, no measurable improvement) and at 12 months without a quantitative ROI signal, end the project and reclaim the budget rather than let it live on as a zombie. Every agent programme needs an explicit kill criterion in its founding charter.
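A kill criterion is only hard if it can be evaluated mechanically. The following sketch encodes the two gates as a function. The `ProgrammeStatus` fields are illustrative, and reading the 6-month gate as "adoption flat below 30 percent and no measurable improvement" (both conditions together) is an interpretation, not something the charter wording settles.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    KILL = "kill"


@dataclass(frozen=True)
class ProgrammeStatus:
    """Snapshot of an agent programme at review time (illustrative fields)."""
    months_running: int
    weekly_adoption: float          # share of target users active per week
    adoption_trend_rising: bool
    measurable_improvement: bool    # e.g. cycle time better than baseline
    quantitative_roi_signal: bool   # a number Finance can see


def kill_gate(status: ProgrammeStatus) -> Decision:
    """Apply the 6-month and 12-month gates."""
    # 12-month gate: no quantitative ROI signal means the project ends.
    if status.months_running >= 12 and not status.quantitative_roi_signal:
        return Decision.KILL
    # 6-month gate: no clear ROI path, read here as flat adoption below
    # 30 percent combined with no measurable improvement (an assumption).
    if status.months_running >= 6:
        adoption_flat_low = (
            status.weekly_adoption < 0.30 and not status.adoption_trend_rising
        )
        if adoption_flat_low and not status.measurable_improvement:
            return Decision.KILL
    return Decision.CONTINUE
```

Writing the gate into the founding charter in this form removes the room for renegotiation when the review date arrives.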

For agencies and B2B decision-makers

Those who sell agent services will in future sell outcomes, not tool access. An agency that provides its clients with a two-layer KPI model – leading early indicators for steering, lagging value metrics for compensation – and calculates ROI conservatively with 14-percent logic instead of 10x promises wins trust in the boardroom. Blck Alpaca builds exactly such measurement and eval setups: from the client-side outcome dashboard (success rate, quality, conversion, time-to-value) to the internal margin and HITL cost accounting. If you want to set up an auditable KPI framework for AI agent projects for your agency or your company, talk to us – we define the metrics before deployment, with a baseline and Finance sign-off.

FAQ

Which KPIs should an agency measure at minimum for AI agent services?
Client-side: completion or success rate, quality (faithfulness/hallucination rate, citation accuracy), conversion or cost-out or cycle time, and time-to-value. Internally: utilisation of the AI product managers, project margin, token/inference cost per task, latency (P50/P95/P99), plus error and HITL escalation rate. At minimum, track one metric from each of the categories task, quality, cost and reliability per release.

What is the difference between leading and lagging KPIs for AI agents?
Leading indicators are early indicators that move quickly and are controllable: adoption rate, tasks per user, eval pass rate, AI literacy ratio, HITL escalation rate. Lagging indicators measure the business outcome with a delay: revenue lift, cost-out, NPS/CSAT, retention, margin. Rule of thumb: report both, but couple objectives and bonuses to lagging KPIs so that adoption without value is not rewarded.

How does an agency calculate the ROI of AI agents credibly?
Bottom-up per use case: time saving in percent times volume times fully-loaded unit cost yields the gross saving; from this subtract licence, deployment, observability and above all HITL costs (often 30 to 60 percent of the saving). As the productivity assumption, use the evidenced figure of around 14 percent (Brynjolfsson, Li, Raymond), not 10x promises. With broad horizontal copilots, communicate openly that the ROI at line-item level is often not measurable.

What are typical vanity metrics that should be avoided?
Number of pilot projects launched, number of use cases identified and number of AI tools deployed. According to BCG, none of these figures correlates with value creation; leaders focus on 3.5 use cases instead of 6.1 and expect 2.1x the ROI. Pure self-reported time savings are also risky, because users systematically overestimate their gain (METR: developers believed they were 20 percent faster but were actually 19 percent slower).

Why is the HITL rate such an important metric?
The human-in-the-loop escalation rate directly governs the margin. Every escalation ties up human review time, which often consumes 30 to 60 percent of the gross saving. A falling HITL rate at stable quality is therefore a central value signal. Important: a high escalation rate is not bad per se; well-calibrated escalation in high-risk or client-facing processes is a feature, not a defect.
