Pillar 23

Building AI Agent Infrastructure

How to build production-ready AI agent infrastructure: frameworks, RAG, MCP, orchestration, monitoring and security.

For: DevOps engineers, platform teams, CTOs

Definition

AI agent infrastructure encompasses the entire technical and organizational foundation on which production AI agents run: the deployment topology (cloud, on-prem, hybrid, or EU-sovereign), the inference and orchestration stack, the network and identity layer, monitoring/observability, as well as cost and security governance. For DACH organizations, it is the point at which decisions are made about data sovereignty, latency, compliance (GDPR, BSI C5, EU AI Act), and the actual operating costs. Unlike a pure chatbot, the infrastructure determines whether an agent is regulation-proof, latency-capable, and economically viable.

Key Takeaways

✓"Frankfurt region" does not equal "sovereign": An EU region of a US hyperscaler delivers data residency, not data sovereignty - the parent company remains subject to the US CLOUD Act (2018). True sovereignty requires dedicated sovereign clouds, partner stacks (e.g., T-Systems x Google), or non-US providers.
✓Hybrid is the dominant DACH pattern: Sensitive documents, embeddings, and the vector store remain on-prem or in the sovereign cloud, and only the generation step calls a hyperscaler API via an egress-controlled proxy.
✓The EU-sovereign market matured significantly in 2025/2026: Microsoft completed the EU Data Boundary on 26 February 2025, AWS is launching the European Sovereign Cloud in Brandenburg (EUR 7.8 billion investment, German legal entity), and DACH-native providers such as STACKIT, Open Telekom Cloud/T Cloud, IONOS, Swisscom, and Infomaniak offer concrete alternatives.
✓On the inference side, the stack has shifted: Hugging Face placed TGI into maintenance mode on 11 December 2025 and recommends vLLM or SGLang; vLLM is the de facto standard for self-hosted production in 2026, with NIM being the most pragmatic on-prem path for mid-market companies.
✓Cost factor token economics: Agentic workflows multiply token consumption per request by 5x to 50x (planner, tool calls, critique, verification); at scale, the API tokens are typically less than half of the total TCO.
✓Caching is the biggest FinOps lever in 2026: Anthropic grants a 90% discount on cache reads, OpenAI bills cached input at 10% of the base price; a well-instrumented FinOps program (caching, routing, batch, open-weight fallback, eval-driven model selection) cuts costs by 60-80% compared to an unoptimized baseline.
✓Security baseline for agents: mTLS between components, OIDC/SAML federation, workload identity (no static credentials), HYOK against customer-owned HSMs (Utimaco, Thales), and deny-by-default egress with an allowlist - the elevated blast radius of an agent demands one service account per (agent x tool) pair.
✓DACH-specific premium: Sovereign hosting costs roughly 1.5-3x the US cloud price, EU regions add a 10% uplift at OpenAI and Anthropic, and compliance ops plus co-determination drive total costs 15-35% higher than a comparable US workload; since 1 July 2025, BSI C5 Type 2 attestation has been mandatory for cloud processing of patient data.

What AI agent infrastructure is - and why it determines success or failure

AI agent infrastructure is the sum of all technical and organizational building blocks on which production AI agents run: the deployment topology (where and under which legal jurisdiction the agent is operated), the inference and orchestration stack, the network and identity layer, monitoring/observability, as well as cost and security governance. Unlike a simple chatbot, an agent is a multi-stage, tool-using system that generates east-west traffic between the orchestrator, tool servers, memory, and vector store, thereby placing entirely different demands on network, identity, and observability.

For DACH decision-makers, the infrastructure is the point at which decisions are made about data sovereignty, latency, and compliance. A conceptual clarification upfront, because it is the most common confusion in DACH projects: data residency refers to the physical storage/processing location, while data sovereignty refers to the legal jurisdiction including extraterritorial reach (such as the US CLOUD Act of 2018). A "Frankfurt region" of a US hyperscaler delivers residency, not sovereignty. In DACH mid-market parlance, "on-prem" usually does not even mean an in-house server room, but a dedicated environment in a German/Austrian/Swiss carrier-neutral colocation data center.

Cloud vs. on-prem vs. hybrid: the EU-sovereign topology question

The topology rarely results from a single option - most production stacks span at least two. Five drivers determine the choice, roughly in this order of weighting: data sensitivity/regulatory class, latency SLO, sovereignty requirement, cost predictability, and existing in-house platform know-how.

Topology	Sovereignty position	Typical DACH use case
Public Cloud (hyperscaler EU region)	Residency yes, sovereignty no (CLOUD Act remains)	Greenfield, low data sensitivity
Sovereign Cloud (hyperscaler-sovereign + DACH-native)	CLOUD Act-resistant depending on the model	BFSI, public sector, regulated industries
Private Cloud (managed/self-managed)	"Azure-like without Azure jurisdiction"	Mid-market with managed-services partner
On-prem / colocation	Full audit authority	Industry, defense-adjacent, BFSI with regulator requirement
Hybrid	Data gravity separately controllable	The dominant DACH pattern

The EU-sovereign market matured considerably in 2025/2026. Microsoft completed the EU Data Boundary on 26 February 2025 and committed to keeping end-to-end AI data processing for EU customers within this boundary, unless the customer determines otherwise. AWS is launching its European Sovereign Cloud with the first region in Brandenburg (announced for the end of 2025, EUR 7.8 billion investment, operated by a German legal entity with an EU citizen as managing director; around 90 of more than 240 services at launch - AWS whitepaper, September 2025).

Alongside this stands a distinct DACH-native category that is usually absent from generic English-language enterprise AI literature: STACKIT (Schwarz Digits, with a data center in Austria as well; EUR 11 billion announced for an AI DC expansion, target up to 100,000 GPUs), Open Telekom Cloud / T Cloud Public (Deutsche Telekom/T-Systems, "Sovereignty by Design," together with NVIDIA the Munich Industrial AI Cloud with up to 10,000 Blackwell GPUs from Q1 2026), IONOS (AI Model Hub with Teuken-7B and Llama 3.3, first Legal AI Factory with Noxtua), Swisscom (Swiss AI Platform, deployment partner for the open Swiss LLM Apertus), and Infomaniak (fully Swiss-controlled, FADP- and GDPR-compliant). T-Systems has publicly pledged to close the feature gap to the hyperscalers by the end of 2026 - to be read as a roadmap commitment, not as today's actual state.

The dominant DACH pattern remains hybrid: Sensitive documents, embeddings, and the vector store remain on-prem or in the sovereign cloud, and only the generation step calls a hyperscaler API - often via an egress-controlled proxy. In addition, confidential-computing patterns (model in the EU region, customer holds the keys via HYOK) and cloud bursting to GPU specialists for peak loads are becoming established.

Orchestration and inference stack

The inference stack is the most volatile layer. A clear industry signal: Hugging Face placed TGI into maintenance mode on 11 December 2025 and directs new deployments to vLLM or SGLang. For self-hosted production, vLLM (PagedAttention, broadest hardware support, OpenAI-compatible endpoints) is the de facto standard in 2026; SGLang scores with multi-turn chat and structured output (according to the report, around 29% higher throughput on 7B-8B models on H100). NVIDIA NIM - pre-built, optimized microservices, portable across cloud, data center, and RTX workstations - is considered the most pragmatic on-prem path in the DACH mid-market.

Above the inference engine, the AI gateway has established itself as a distinct architectural component. It handles multi-provider failover, virtual keys, team budgets, observability, guardrails, and PII redaction. Practical shortlist: LiteLLM (open source, self-hosted, OpenAI-compatible for 100+ providers - ideal when audit authority matters), Portkey (managed and on-prem, strong observability and governance), and Kong AI Gateway (when Kong is already the standard anyway). At the orchestration level, the spectrum ranges from frameworks such as LangGraph/CrewAI/AutoGen to vendor stacks such as Microsoft Foundry Agents or the sovereign Pharia platform (Aleph Alpha, part of a combined entity with a roughly USD 20 billion valuation since the Cohere connection reported in April 2026; verify product names at the time of publication).

Architecturally central is latency: Co-located inference achieves single-digit milliseconds, while a transatlantic call (Frankfurt agent to a US East API) adds, according to the report, around 80-130 ms one-way. With multiple tool-call rounds, this multiplies - for sub-second agent UX, transatlantic API calls are not practical.

Monitoring and observability

Agentic workloads cannot be steered productively without observability. What is required: trace standards (OpenTelemetry for LLMs, OpenInference), token-accurate cost attribution, and eval harnesses. DACH-residency-compliant backends are available - Langfuse self-hosted in the EU (according to the FinOps report, already on a ~EUR 50/month VPS), Datadog EU, or Honeycomb EU. On the cost side, observability typically accounts for 2-8% of total TCO, ranging from practically zero (Helicone Free, self-hosted Langfuse) up to EUR 5,000-50,000/month for Datadog LLM Observability at enterprise scale.

Two points are regulatorily relevant (informational, not legal advice): First, for systems classified as high-risk, the EU AI Act under Art. 12 requires event logging of inputs, outputs, and decisions with auditable granularity - the report puts the infrastructure cost for this at EUR 100,000-500,000 for enterprise implementation plus ongoing storage costs. Second, model versions should be pinned and accompanied by a documented rollback plan, since managed APIs change their versions on the provider's schedule.

Costs, FinOps, and token economics

2026 is the first year in which AI agent workloads demand genuine FinOps discipline. Two structural breaks coincide: Agentic workflows multiply token consumption per request by 5x to 50x (planner, tool call, critique, revision, verification), and the pricing ladder has split - the entry class (Haiku/Mini/Flash) has fallen 10x to 100x since 2023, while the frontier class lingers at around USD 5/25-30 per million tokens. The consequence: The list price no longer correlates with the monthly bill - the gap between vendor list price and production TCO typically amounts to 2x to 10x.

The decisive point: At scale, the API tokens are usually less than half of the total TCO. Direct model costs account for 30-50%, plus tool-use cascades (+50% to +200% on the direct API line), sub-agent fan-out (3x to 10x multiplier), compute/sandbox (10-25%), vector DB/embedding (5-15%), observability (2-8%), compliance/governance (5-20%), and operations labor (10-30%).

The most effective levers lie below the API line:

Caching is the single biggest lever. Anthropic grants a 90% discount on cache reads (cached input on Sonnet 4.6: 0.30 instead of 3.00 USD/M), OpenAI bills cached input at 10% of the base price. At an 80% cache hit rate, input costs drop by 70-80%.
Model routing: cheap model for simple tasks, expensive only for complex ones. Anthropic's advisor-tool benchmark (Sonnet + Opus advisor) reached 74.8% on SWE-bench Multilingual at 11.9% lower cost than Opus alone.
Batch API: a flat 50% discount with a 24-hour SLA, stackable with caching.
Open-weight fallback for long-tail workloads (DeepSeek V4 Flash, Mistral Ministral, Qwen 3) - GDPR-compliant only via EU-hosted routes (Together AI EU, DeepInfra Frankfurt, STACKIT/OVHcloud), not via China-hosted direct APIs.

Stacked, a well-instrumented FinOps program delivers a 60-80% cost reduction compared to the unoptimized baseline. The DACH reality adds further cost: EU regions add a 10% uplift at OpenAI and Anthropic, sovereign hosting costs roughly 1.5x to 3x the US cloud price (SAP Joule AI Units approx. 1.5-2x), and compliance ops plus co-determination drive total TCO 15-35% higher than a comparable US workload. Per vendor, EUR 5,000-20,000/year in ongoing DPA/sub-processor costs apply; Bitkom figures for 2026 underscore the sovereignty pressure: 68% of Germans consider Germany too dependent on the US and China for AI, and 60% want less dependence on US AI providers.

Security and identity

Identity and key management is the lever that makes a non-sovereign hyperscaler region into something defensible under DACH compliance (informational, not legal advice; the detailed GDPR/DPA treatment belongs in the sister topics). The architectural baseline:

mTLS between all agent components - also a typical piece of evidence in BSI C5 and ISO 27001 audits.
OIDC/SAML federation for enterprise SSO (Entra ID, Okta, KeyCloak); the agent exchanges the user token for short-lived tokens for tool calls.
Workload identity (Azure Managed Identity, AWS IRSA, GCP Workload Identity Federation, in sovereign clouds OpenStack Keystone / K8s service accounts) - no static credentials in the code.
KMS/HSM with BYOK/HYOK: With BYOK, the provider continues to operate the key; with HYOK, the cloud calls the customer-owned HSM (Utimaco/Aachen, Thales) for every crypto operation - the strongest sovereignty statement, which according to the report withstands both legal scrutiny and a BSI C5/TISAX audit.

An agent has an unusually high blast radius because it can call many tools. Best practice: one service account per (agent x tool) pair (not a shared account), just-in-time elevation, all credentials from Vault or KMS rather than from environment variables, and an audit trail that binds back to the user identity via a token-exchange chain. On the network side, deny-by-default egress with an explicit allowlist of the model API FQDNs has become established - it prevents unwanted data outflows, provides audit evidence, and forces all model traffic through the gateway, where rate limits, PII filters, and budgets reside.

DACH compliance notes and outlook

Several DACH-specific rules drive real architecture decisions (informational, not legal advice): Since 1 July 2025, BSI C5 Type 2 attestation has been mandatory for cloud processing of patient data (DigiG / § 393 SGB V). Switzerland does not follow the GDPR but FADP/revDSG (in force since 1 September 2023); the "privatim" tightening reported in November 2025 recommends international SaaS for sensitive data only with end-to-end encryption and customer-owned keys. The EU AI Act phases in on a staggered basis (prohibitions from February 2025, GPAI rules from August 2025, high-risk from August 2026); the specific deadlines must be checked depending on the provider and classification and are in some cases still in flux.

Practical note: Do not start with the procurement question "cloud or on-prem?", but with data classification and the latency SLO - these determine the topology. Build the AI gateway, deny-by-default egress, and eval-driven model selection in from day one, because it is precisely these "pilot gaps" that typically break at production launch. For the mid-market, an M365-anchored hybrid with an EU data zone and a small on-prem RAG layer is the pragmatic default; for regulated industries, the path leads via STACKIT/Open Telekom Cloud with HYOK and sovereign inference. Since sovereign-cloud roadmaps shift on a quarterly basis, every architecture decision should be accompanied by a date stamp ("as of: ...") and a documented migration trigger.

All Articles in this Topic

5 Articles

10.2

On-Premise vs. EU Cloud for AI Agents: The Decision Matrix for the DACH Region

On-premise vs. EU cloud for AI agents describes the choice of operating model for production AI agents: dedicated in-house hardware in a German, Austrian or Swiss data centre (on-premise), sovereign EU cloud providers or a hybrid combination. The decisive factors are data sensitivity, GDPR sovereignty, cost, latency, scaling and existing operational expertise.

Intermediate·7 min

10.3

Deploying AI Agents on Kubernetes: Architecture, Scaling and When K8s Pays Off

Deploying AI agents on Kubernetes means running the components of an agent system - agent service, tool or MCP server, vector store, inference engine and message queue - as containerised workloads on a K8s cluster. Kubernetes provides scaling, GPU scheduling, state handling, secrets management and observability for production, EU-sovereign agent operations.

Advanced·10 min

10.4

Observability for AI Agents: Tracing, Metrics, Logs and Evals

AI agent observability makes the inner workings of an autonomous agent visible: through tracing (spans across reasoning and tool calls), metrics (latency, tokens, cost, success rate), structured logs and continuous evals. It answers why an agent made a particular decision, and it is the prerequisite for being able to debug, secure and audit multi-step agents in production at all.

Intermediate·8 min

10.5

Token Economics: How AI Agent Costs Really Arise

Token economics for AI agents describes the cost mechanics whereby every agent run is billed by the tokens consumed: input, output, cached and reasoning tokens. Unlike a chatbot, agents multiply consumption through multi-step loops, tool calls and sub-agents - the list price deviates from real production costs by a factor of 2 to 10.

Intermediate·7 min

10.6

AI Agent Evaluation: Which Metrics Matter

AI agent evaluation measures whether an AI agent reliably accomplishes its intended task. The core metrics are task success rate, trajectory and tool-call correctness, groundedness or hallucination rate, latency, cost and HITL escalation rate. Measurement happens offline against an eval dataset and online in production.

Intermediate·7 min