Pricing Models for Agent Infrastructure: Retainer, Project, Outcome
Pricing for AI agents in an agency setting comprises four models: retainer (monthly flat fee), project/fixed (fixed price per deliverable), outcome-based (payment per result) and hybrid. The decisive factor is value-based rather than hourly pricing, since AI efficiency decouples working time from the result, alongside margin protection against volatile token costs.
Key Takeaways
- Four base models: retainer, project/fixed, outcome-/performance-based and hybrid - each with its own margin risk and appropriate use case.
- Hourly rates collapse under AI efficiency: value-based pricing decouples the fee from the reduced time spent and protects the margin.
- Token and infrastructure costs are only 30-50 percent of true TCO - pass-through with a transparent margin (typically 30-50 percent) beats the hidden flat fee.
- Outcome pricing (e.g. per resolved ticket) aligns incentives but shifts the volume and cost risk entirely onto the agency.
- Margin protection needs hard token caps per workflow, caching (50-90 percent savings on input) and eval-driven model selection - not the supplier discount.
- Token economics are volatile in 2026: every flat fee needs a price-adjustment clause and a quarterly cost review.
Anyone billing agent infrastructure like a classic service by the hour systematically gives away value and underestimates the cost structure.
- Retainer for ongoing operation, monitoring and continued development; project/fixed for scoped implementations; outcome-based only where results are measurable and stable.
- Token and infrastructure costs are only 30-50 percent of true TCO - the rest is engineering, eval, compliance and human-in-the-loop.
- Token economics are volatile in 2026: every flat fee needs caps, caching and a price-adjustment clause, otherwise a model switch eats the margin.
Why hourly pricing fails for AI agents
The classic agency model sells time. AI agents break this logic because they decouple the result from the time spent. A research or classification task that used to take days as manual work runs as an agent workflow in minutes. Continuing to bill by the hour penalises your own efficiency: the better the agency automates, the less it earns for the same output. Value-based pricing turns this around - what gets measured is the business value for the client, not the time consumed internally.
At the same time, the cost side has become trickier. In 2026, a single user request no longer corresponds to a single model call but typically to 5-20 LLM calls (planner, tool selection, tool result, critique, revision, verification). Agentic workflows have increased token consumption per request by 5 to 50 times compared with the simple chatbot pattern. Sub-agent cascades multiply this again by 3 to 10 times. A flat fee calculated on the old "one prompt, one answer" picture quickly loses its margin here.
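The effect of these multipliers on a flat fee can be sketched in a few lines. This is a rough illustration only: `BASELINE_COST_USD` and `FLAT_FEE_PER_REQUEST_USD` are assumed values that do not come from the text; only the multiplier ranges (5 to 50 times, sub-agents 3 to 10 times) do.

```python
# Illustrative sketch: both constants below are assumptions, not measured values.
BASELINE_COST_USD = 0.002         # assumed cost of one "one prompt, one answer" request
FLAT_FEE_PER_REQUEST_USD = 0.05   # assumed flat fee, priced on the old picture


def cost_per_request(agentic_multiplier: float, subagent_multiplier: float = 1.0) -> float:
    """Scale the baseline cost by the agentic and sub-agent multipliers."""
    return BASELINE_COST_USD * agentic_multiplier * subagent_multiplier


scenarios = [
    ("chatbot", 1, 1),
    ("agentic, low end", 5, 1),
    ("agentic, high end", 50, 1),
    ("agentic + sub-agents", 50, 3),
]

for label, agentic, subagent in scenarios:
    cost = cost_per_request(agentic, subagent)
    margin = FLAT_FEE_PER_REQUEST_USD - cost
    print(f"{label:22s} cost={cost:.3f} USD  margin={margin:+.3f} USD")
```

Under these assumed numbers, the flat fee is still profitable at the low end of the agentic range and turns negative at the high end, before sub-agents are even considered.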
The four pricing models at a glance
Retainer (monthly flat fee). The client pays a fixed monthly fee for operation, monitoring, eval iteration and continuous improvement of the agents. Ideal for ongoing agent infrastructure that needs to be maintained and adapted to model updates. The retainer secures predictable revenue and amortises the compliance set-up over the term - a weighty argument in the DACH region, because multi-year engagements justify the DPA and sub-processor effort.
Project / fixed (fixed price). A scoped deliverable - such as the implementation of a voice agent or a service workflow - at a fixed price. Easy to communicate and popular for initial projects. The margin risk lies in the scope: underestimated token cascades, retry loops or integration effort into SAP-heavy DACH stacks erode the calculation. Fixed prices need a buffer and a clean change-request process.
Outcome- / performance-based. Payment is made per result - per resolved ticket, qualified lead or completed transaction. The structural advantage: provider and client interests are both tied to success. The structural risk: the agency carries the cost risk of every transaction. If the success rate falls below the assumption, the agency loses money on each result. In the market, this model is becoming established above all in customer service - Intercom Fin charges USD 0.99 per resolved conversation, HubSpot lowered its price to USD 0.50 in April 2026, Zendesk charges USD 1.50 (committed) to USD 2.00 (pay-as-you-go) per resolution, and Salesforce Agentforce charges USD 0.10 per action or USD 2.00 per conversation (all as of 2026). Sierra does not publish prices; third-party estimates cite year-1 total costs of USD 200,000 to 350,000 and more. The prerequisite for outcome pricing is a robust, measured baseline of the success rate - without it, the agency is flying blind.
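The break-even logic behind this risk can be sketched as follows. The sketch assumes that cost is incurred on every handled conversation while revenue is earned only on resolved ones; the cost of USD 0.40 per conversation is a hypothetical figure, and the price of USD 0.99 is simply borrowed from the list above.

```python
def margin_per_conversation(price: float, cost: float, resolution_rate: float) -> float:
    """Expected margin per handled conversation: paid only when resolved, cost always incurred."""
    return resolution_rate * price - cost


def break_even_rate(price: float, cost: float) -> float:
    """Resolution rate at which the expected margin is exactly zero."""
    return cost / price


price = 0.99   # USD per resolved conversation (figure quoted in the text)
cost = 0.40    # USD per handled conversation (hypothetical)

print(f"break-even resolution rate: {break_even_rate(price, cost):.1%}")
for rate in (0.7, 0.5, 0.3):
    print(f"rate={rate:.0%}  margin={margin_per_conversation(price, cost, rate):+.3f} USD")
```

Any measured resolution rate below the break-even rate means a loss on every conversation handled, which is why the baseline has to be measured rather than assumed.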
Hybrid. The de facto norm in 2026: a fixed base (retainer or setup) plus a usage- or outcome-dependent component plus passed-through token costs. Hybrid combines a predictable contribution margin with a fair distribution of load and is the most robust structure for most agency engagements.
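How the three components of a hybrid engagement add up on a monthly invoice might look like this. It is a minimal sketch: `HybridTerms`, `monthly_invoice` and all figures in the example are hypothetical, and the 40 percent mark-up is just one value from the 30-50 percent range named in this article.

```python
from dataclasses import dataclass


@dataclass
class HybridTerms:
    base_fee: float            # fixed monthly base (retainer), in USD
    price_per_outcome: float   # billed per measured outcome, in USD
    token_markup: float        # mark-up on passed-through token costs, e.g. 0.40


def monthly_invoice(terms: HybridTerms, outcomes: int, token_costs: float) -> dict:
    """Split the invoice into base, outcome and pass-through lines."""
    outcome_fee = outcomes * terms.price_per_outcome
    pass_through = token_costs * (1 + terms.token_markup)
    return {
        "base": terms.base_fee,
        "outcome": outcome_fee,
        "token_pass_through": pass_through,
        "total": terms.base_fee + outcome_fee + pass_through,
    }


# Hypothetical month: 5,000 outcomes and USD 500 of tracked token costs.
terms = HybridTerms(base_fee=3000.0, price_per_outcome=0.50, token_markup=0.40)
for line, amount in monthly_invoice(terms, outcomes=5000, token_costs=500.0).items():
    print(f"{line:20s} {amount:10.2f} USD")
```

Keeping the three lines separate on the invoice is what makes the token pass-through traceable for the client.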
Model comparison: when sensible, pros and cons, margin risk
| Model | When sensible | Pro / con | Margin risk |
|---|---|---|---|
| Retainer | Ongoing operation, monitoring, continuous development; multi-year engagements | Pro: predictable revenue, amortises compliance setup. Con: token consumption not covered, scope creep | Medium - rises with volatile token costs without a cap; price-adjustment clause needed |
| Project / fixed | Scoped implementation with a clear deliverable; initial projects | Pro: clearly communicable, clear expectation. Con: rigid limits, add-ons difficult | High - underestimated cascades, retry loops (+20-50 %) and integration eat the fixed price |
| Outcome / performance | Measurable, stable result (resolved ticket, lead); known success rate | Pro: incentives aligned, high willingness to pay. Con: agency carries cost and volume risk | Very high - loss per result with too low a success rate or usage spikes |
| Hybrid | Standard case: base + usage/outcome + token pass-through | Pro: robust contribution margin, fair distribution of load. Con: more complex billing | Low to medium - risk shared proportionally; best margin profile |
Token costs: pass-through or flat fee
The item that looks cheapest - pure API compute - is not the one where the costs sit. Direct model costs account for only 30-50 percent of total TCO in a typical agentic workload. The rest is distributed across tool cascades, sub-agents, sandbox compute, vector DB, observability, compliance ops and operations labour. An agency that only calculates the token list overlooks half the bill.
There are two clean approaches to handling token costs:
- Pass-through: Direct API and platform costs are tracked per client (e.g. via Helicone or Portkey by key) and re-billed with a transparent mark-up - the industry standard being a 30-50 percent margin on the passed-through costs. The volatility risk lies with the client, and the billing is traceable.
- Flat fee: Token costs are priced into the fixed price or retainer. Client-friendly and predictable, but only viable with hard `max_tokens`, `max_iterations` and `max_tool_calls` caps per workflow, aggressive caching from day one and a safety buffer.
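A hard cap per workflow, as required for the flat-fee approach, could be enforced with a small budget guard like the one below. This is a sketch under assumptions: `WorkflowCaps`, `WorkflowBudget` and `CapExceeded` are hypothetical names, and `max_tokens` is treated here as a budget for the whole workflow run, not as the per-call parameter of any provider API.

```python
from dataclasses import dataclass


@dataclass
class WorkflowCaps:
    max_tokens: int        # token budget for the whole workflow run (assumption)
    max_iterations: int    # maximum number of agent steps
    max_tool_calls: int    # maximum number of tool invocations


class CapExceeded(RuntimeError):
    """Raised as soon as a workflow exceeds one of its caps."""


@dataclass
class WorkflowBudget:
    caps: WorkflowCaps
    tokens: int = 0
    iterations: int = 0
    tool_calls: int = 0

    def record(self, tokens: int = 0, tool_calls: int = 0) -> None:
        """Record one agent iteration and raise once any cap is exceeded."""
        self.iterations += 1
        self.tokens += tokens
        self.tool_calls += tool_calls
        usage = {
            "max_tokens": self.tokens,
            "max_iterations": self.iterations,
            "max_tool_calls": self.tool_calls,
        }
        for name, used in usage.items():
            limit = getattr(self.caps, name)
            if used > limit:
                raise CapExceeded(f"{name} exceeded: {used} > {limit}")


budget = WorkflowBudget(WorkflowCaps(max_tokens=20_000, max_iterations=8, max_tool_calls=5))
try:
    for step in range(10):                      # simulated agent loop
        budget.record(tokens=3_000, tool_calls=1)
except CapExceeded as exc:
    print(f"workflow stopped: {exc}")
```

The agent loop would call `record` after every step, so a runaway retry loop stops at a known maximum cost instead of silently eating the flat fee.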
Caching is the biggest margin lever here: Anthropic grants a 90 percent discount on cache reads (as of 2026). For Claude Sonnet 4.6, the input drops from USD 3.00 to USD 0.30 per million tokens; an 80 percent hit rate lowers the effective input costs by 70-80 percent. Eval-driven model selection - the cheapest model that passes the eval - saves a further 30-60 percent. Stacked, a well-instrumented FinOps approach delivers a 60-80 percent cost reduction compared with the unoptimised baseline. That is the headroom from which the agency margin arises - not from the supplier discount.
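The caching arithmetic can be checked with a short calculation. It uses the figures from the paragraph above (USD 3.00 base input price, 90 percent discount on cache reads, 80 percent hit rate) and, as a simplifying assumption, ignores any surcharge for cache writes.

```python
def effective_input_price(base_price: float, cache_read_discount: float, hit_rate: float) -> float:
    """Blended input price per million tokens for a given cache hit rate."""
    cached_price = base_price * (1 - cache_read_discount)
    return hit_rate * cached_price + (1 - hit_rate) * base_price


base = 3.00   # USD per million input tokens (figure from the text)
blended = effective_input_price(base, cache_read_discount=0.90, hit_rate=0.80)
reduction = 1 - blended / base

print(f"blended input price: {blended:.2f} USD per million tokens")
print(f"reduction vs. uncached: {reduction:.0%}")
```

With these inputs the blended price comes out at USD 0.84 per million tokens, a reduction of 72 percent, which sits inside the 70-80 percent range stated above.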
Example calculation: customer-service agent with outcome pricing
An agency operates a service agent for a DACH client that handles 5,000 tickets per month. Architecture: Claude Sonnet 4.6 as the executor with active prompt caching on the system prompt and tool definitions.
Assumptions per resolved ticket (as of 2026, illustrative):
- Direct LLM consumption per conversation: around 15,000 input tokens (predominantly cached) and 1,500 output tokens across several agent steps.
- Cached input at USD 0.30 per million tokens → approx. USD 0.0045; output at USD 15 per million tokens → approx. USD 0.0225.
- Direct model costs: around USD 0.03 per ticket.
- Plus tool cascade, retry risk and indirect costs (embeddings, tool definitions): realistically USD 0.08-0.12 per ticket all-in on the infrastructure side.
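The direct model costs in these assumptions can be reproduced as follows. As a simplification, the sketch treats all 15,000 input tokens as cached, whereas the assumptions above only say "predominantly cached"; the variable names are illustrative.

```python
INPUT_TOKENS = 15_000        # per conversation, treated as fully cached (simplification)
OUTPUT_TOKENS = 1_500        # per conversation
CACHED_INPUT_PRICE = 0.30    # USD per million tokens
OUTPUT_PRICE = 15.00         # USD per million tokens


def token_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD for a token count at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


input_cost = token_cost(INPUT_TOKENS, CACHED_INPUT_PRICE)
output_cost = token_cost(OUTPUT_TOKENS, OUTPUT_PRICE)
direct_cost = input_cost + output_cost

print(f"input:  {input_cost:.4f} USD")
print(f"output: {output_cost:.4f} USD")
print(f"direct: {direct_cost:.4f} USD per ticket")
```

The sum is USD 0.027, which the list above rounds to around USD 0.03. Output tokens account for most of it, even though they are only a tenth of the token volume.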
Pseudocode for the margin calculation:
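```
tickets_per_month = 5000    # assumes every ticket is resolved and billed
cost_direct = 0.03          # USD per ticket, direct model costs (cached); for reference only
cost_allin = 0.10           # USD per ticket incl. cascade, retry, indirect (contains cost_direct)
eu_uplift = 1.10            # 10 % EU-region surcharge
cost_dach = cost_allin * eu_uplift                        # ≈ 0.11
outcome_price = 0.50        # USD billed per resolved ticket
margin_per_ticket = outcome_price - cost_dach             # ≈ 0.39
contribution_pm = margin_per_ticket * tickets_per_month   # ≈ 1,950 USD/month
print(f"{contribution_pm:,.0f} USD/month")
```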
At 5,000 tickets, this yields a monthly contribution margin of around USD 1,950 - provided the success rate is stable. If the resolution rate drops or retry loops raise costs by 20-50 percent, the margin per ticket melts away quickly. This is precisely why outcome pricing needs a measured baseline and a floor retainer that covers the fixed costs (eval iteration, compliance, monitoring) independently of volume. Note: token economics are volatile in 2026 - cheap open-weight models such as DeepSeek V4 Flash sit at USD 0.14 per million input tokens and are thus around 36 times cheaper than GPT-5.5; a model switch can shift the calculation in either direction.
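How quickly the margin reacts can be shown with a small scenario grid on the example's own numbers (USD 0.50 per resolved ticket, USD 0.11 all-in cost, 5,000 tickets). The sketch assumes that costs are incurred on every handled ticket, that only resolved tickets are billed, and that retry loops raise the all-in cost proportionally; the resolution rates shown are hypothetical.

```python
TICKETS = 5000
PRICE = 0.50   # USD per resolved ticket
COST = 0.11    # USD all-in per handled ticket, incl. EU surcharge


def contribution(resolution_rate: float, retry_uplift: float) -> float:
    """Monthly contribution margin for a resolution rate and a retry cost uplift."""
    revenue = TICKETS * resolution_rate * PRICE
    costs = TICKETS * COST * (1 + retry_uplift)
    return revenue - costs


for rate in (1.0, 0.8, 0.6):
    for uplift in (0.0, 0.2, 0.5):
        print(f"resolution={rate:.0%}  retry=+{uplift:.0%}  "
              f"contribution={contribution(rate, uplift):8.0f} USD")
```

In this grid, a resolution rate of 80 percent combined with 50 percent more retry cost already cuts the contribution from about USD 1,950 to about USD 1,175, before any fixed costs are covered.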
Margin protection and DACH reality
Three levers secure the margin regardless of the chosen model:
- Price-adjustment clause and caps. Every retainer and fixed price contains an adjustment clause for token cost changes and hard usage limits per workflow. Token economics in 2026 are a moving target.
- Engineering-first instead of contract-first. Caching, routing, batch (50 percent discount with a 24-hour SLA) and open-weight fallback for long-tail workloads determine 50-80 percent of the bill - more than any supplier discount.
- Price in DACH overhead explicitly. The EU region costs a 10 percent surcharge at OpenAI and Anthropic, sovereign hosting costs 1.5 to 3 times as much, and the DPA chain adds EUR 5-20k per year per provider plus EUR 10-50k for onboarding. In total, DACH factors raise TCO by 15-35 percent compared with a US workload. Multi-year engagements amortise this compliance effort - an argument for retainers and against one-off projects.
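The amortisation argument in the last lever can be made concrete with a rough calculation. The sketch assumes that onboarding is a one-off cost for the whole engagement and that DPA costs recur per provider and year; the chosen figures (EUR 30k onboarding, EUR 10k DPA cost, two providers) are hypothetical mid-range values.

```python
def monthly_compliance_overhead(
    onboarding_eur: float,
    dpa_per_provider_per_year_eur: float,
    providers: int,
    contract_years: int,
) -> float:
    """Compliance cost per contract month, with onboarding spread over the term."""
    recurring = dpa_per_provider_per_year_eur * providers * contract_years
    return (onboarding_eur + recurring) / (12 * contract_years)


for years in (1, 2, 3):
    overhead = monthly_compliance_overhead(
        onboarding_eur=30_000,
        dpa_per_provider_per_year_eur=10_000,
        providers=2,
        contract_years=years,
    )
    print(f"{years}-year term: {overhead:,.0f} EUR per month")
```

Under these assumptions the monthly compliance overhead falls from roughly EUR 4,167 on a one-year term to EUR 2,500 on a three-year term, because only the onboarding share shrinks while the recurring DPA cost stays constant.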
For agencies and B2B decision-makers
For agencies, this means: leave hourly logic behind. Build a hybrid model with a floor retainer, transparent token pass-through (30-50 percent margin, tracked per client) and an optional outcome component where you know the success rate from measurement. Anchor caps, caching and a price-adjustment clause contractually - this protects the margin when token prices shift.
For B2B decision-makers buying agent services: ask about the cost structure behind the price, not just the hourly rate. A serious quote sets out token pass-through, caching strategy and DACH compliance effort transparently. If you are planning an agent infrastructure and want to develop a robust, margin-secure pricing model for your use case, talk to us - we calculate token economics, TCO and model selection along your real workload.