Deploying AI Agents on Kubernetes: Architecture, Scaling and When K8s Pays Off
Deploying AI agents on Kubernetes means running the components of an agent system - agent service, tool or MCP server, vector store, inference engine and message queue - as containerised workloads on a K8s cluster. Kubernetes provides scaling, GPU scheduling, state handling, secrets management and observability for production, EU-sovereign agent operations.
Key Takeaways
- A production agent system on Kubernetes consists of several decoupled services: agent orchestrator, tool/MCP server, inference engine (vLLM, SGLang or NVIDIA NIM), vector store and a message queue for asynchronous jobs.
- Scaling runs along two axes: stateless agent pods scale elastically via HPA, while GPU inference pods are scheduled onto dedicated GPU nodes via node selectors, taints/tolerations and the NVIDIA device plugin and stay warm rather than elastic for cost reasons.
- Agents are stateful: conversation memory and task plans belong in an external store (Redis, Postgres), not in the pod. The vector store runs as a StatefulSet with PersistentVolumes.
- Kubernetes only pays off at high, steady load (rule of thumb: from around 8-12 H100-equivalents of continuous inference) or under strict sovereignty/latency requirements - below that, managed APIs and serverless are usually cheaper and faster to go live.
- Sovereignty does not arise from the region alone: a Frankfurt region of a US provider delivers data residency, not CLOUD Act-resistant sovereignty. Strict cases require sovereign clouds (STACKIT, Open Telekom Cloud, Infomaniak, Exoscale) with managed Kubernetes.
- Plan realistically: a self-operated GPU cluster needs 6-9 months of lead time and a platform team that masters vLLM, GPU scheduling and NCCL failure patterns in 24x7 operations.
Kubernetes is powerful, but not an end in itself: this article shows the architecture, the scaling mechanics and, above all, when the effort actually pays off.
- Architecture: Break the agent system down into decoupled services - orchestrator, tool/MCP server, inference engine, vector store, queue - and run each as its own workload.
- Scaling: Stateless agent pods scale elastically via the Horizontal Pod Autoscaler; GPU inference pods are scheduled onto dedicated GPU nodes and stay warm for cost reasons.
- Decision: Kubernetes pays off at high, continuous load or under strict sovereignty/latency requirements - below that, managed APIs and serverless are faster to go live and cheaper.
The container architecture of an agent system
A production AI agent is not a monolith but a federation of microservices. This division is not an end in itself but follows the differing scaling, state and security profiles of the individual components.
Agent service (orchestrator)
The agent service is the brain: it executes the orchestration logic, decides on tool calls and plans steps. It is typically built on a framework such as LangGraph, CrewAI or AutoGen, or on a vendor platform such as PhariaAI (Heidelberg) or Azure AI Foundry Agents. Architecturally decisive: this service must remain stateless so that Kubernetes can freely scale, restart and distribute it across multiple replicas. All state migrates to external stores.
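The statelessness requirement can be illustrated with a minimal sketch. It assumes a hypothetical `SessionStore` interface and uses an in-memory stand-in where a real deployment would talk to Redis or Postgres; `handle_turn` and the placeholder reply are illustrative names, not part of any agent framework.

```python
from dataclasses import dataclass, field
from typing import Protocol


class SessionStore(Protocol):
    """Hypothetical interface to an external state store (e.g. Redis or Postgres)."""

    def load(self, session_id: str) -> list[str]: ...

    def save(self, session_id: str, history: list[str]) -> None: ...


@dataclass
class InMemoryStore:
    """Stand-in for the external store, used only to keep the sketch runnable."""

    _data: dict[str, list[str]] = field(default_factory=dict)

    def load(self, session_id: str) -> list[str]:
        return list(self._data.get(session_id, []))

    def save(self, session_id: str, history: list[str]) -> None:
        self._data[session_id] = list(history)


def handle_turn(store: SessionStore, session_id: str, user_message: str) -> str:
    """Process one conversation turn without keeping state in the process."""
    history = store.load(session_id)
    history.append(f"user: {user_message}")
    # Placeholder for the actual model / orchestration call.
    turn_number = len(history) // 2 + 1
    reply = f"turn {turn_number}: received '{user_message}'"
    history.append(f"agent: {reply}")
    store.save(session_id, history)
    return reply
```

Because every call loads and saves the history through the store, any replica can serve any turn of a session, which is what allows Kubernetes to restart or reschedule pods freely.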
Tool and MCP server
The Model Context Protocol (MCP) has established itself as the standard interface through which agents address external tools and data sources. In a DACH enterprise environment, MCP servers usually run as in-VPC services, co-located with the agent runtime and connected via mTLS - this is the default pattern for integrations with SAP, Salesforce, ServiceNow, M365 and internal databases. For factory or branch scenarios, MCP servers are deployed at the edge, close to the data, and the central orchestration calls them via dedicated links (ExpressRoute, Direct Connect). For regulated workloads the rule is: dedicated MCP servers per business unit with separate audit trails rather than a multi-tenant server.
Inference engine
This is where the language model runs. As of 2026, the selection is clearly defined:
- vLLM (originating from UC Berkeley) is the de facto standard for production self-hosting - PagedAttention, the broadest hardware support, OpenAI-compatible endpoints.
- SGLang (LMSYS) delivers, according to LMSYS benchmarks, around 29 per cent higher throughput on 7B-8B models on H100 and better tail latency; ideal for multi-turn chat, RAG-heavy and structured workloads.
- NVIDIA NIM packages vLLM/TensorRT-LLM/SGLang into pre-built containers and is the most pragmatic on-prem route in the DACH mid-market - tied, however, to the NVIDIA AI Enterprise licence.
- TensorRT-LLM delivers peak throughput on NVIDIA hardware but is NVIDIA-only and operationally demanding.
Important note for existing systems: Hugging Face moved TGI (Text Generation Inference) into maintenance mode on 11 December 2025 and points new deployments to vLLM or SGLang. Anyone running on TGI today need not migrate in a panic, but should no longer build new developments on it.
Vector store and message queue
The vector store (Qdrant, Weaviate or pgvector) holds embeddings for RAG. It is a classic StatefulSet candidate with PersistentVolumes, because embeddings plus original chunks reach terabyte scale and need stable pod identity as well as persistent storage. A message queue (such as RabbitMQ, NATS or Redis Streams) decouples long-running, asynchronous agent jobs - document processing, batch analyses - from the synchronous request path and enables worker pools that scale independently.
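The decoupling idea can be sketched in a few lines. The example below uses Python's in-process `queue.Queue` and threads purely as a stand-in for a real broker such as RabbitMQ, NATS or Redis Streams; the function name `run_worker_pool` and the "processed:" placeholder are assumptions for illustration.

```python
import queue
import threading


def run_worker_pool(jobs: list[str], num_workers: int) -> dict[str, str]:
    """Enqueue jobs and let an independently sized worker pool drain the queue."""
    job_queue: queue.Queue = queue.Queue()
    results: dict[str, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            job = job_queue.get()
            if job is None:  # sentinel: no more work for this worker
                break
            outcome = f"processed:{job}"  # placeholder for document processing
            with lock:
                results[job] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    # The synchronous request path only enqueues and returns immediately.
    for job in jobs:
        job_queue.put(job)
    # One sentinel per worker shuts the pool down once the queue is drained.
    for _ in threads:
        job_queue.put(None)
    for thread in threads:
        thread.join()
    return results
```

The number of workers is independent of the number of producers, which mirrors how a worker Deployment can be scaled separately from the agent service.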
Component mapping: what belongs on which K8s resource
The following table is the core reference for an agent deployment. It maps each component to the appropriate Kubernetes resource and names the decisive architectural note.
| Component | K8s resource | Note |
|---|---|---|
| Agent service / orchestrator | Deployment + HPA + Service | Keep stateless; scales elastically via CPU/custom metrics |
| Tool/MCP server | Deployment (or DaemonSet at the edge) + Service | mTLS via service mesh; one service account per agent-tool pair |
| Inference engine (vLLM/NIM) | Deployment with GPU resource limit + Service | Node selector/tolerations on GPU nodes; keep pods warm, not elastic |
| Vector store (Qdrant/Weaviate) | StatefulSet + PersistentVolumeClaim | Persistent storage; geo-replication for failover if needed |
| Message queue + worker | StatefulSet (broker) + Deployment (worker) | Decouples async jobs; scale workers via KEDA on queue depth |
| Session/conversation state | External Redis / Cosmos (often managed) | Not in the pod; geo-replication depending on RPO |
| Secrets / tool credentials | External Secrets + Vault / Cloud KMS | No static keys in the pod; HSM seal for BFSI |
| AI gateway (LiteLLM/Portkey) | Deployment + Service + Ingress | Multi-provider failover, budgets, PII redaction, central egress control |
| Identity (pod → tool/model) | Workload Identity (IRSA / Managed Identity) | Token exchange to short-lived tokens; no credentials in code |
| Observability | Sidecar/agent + OpenTelemetry collector | Traces, token costs, health; EU-resident backend (e.g. Langfuse self-hosted) |
Scaling: HPA, GPU scheduling and the two axes
Agent stacks scale along two completely different axes, and this is the most common stumbling block.
Stateless axis (CPU): Agent service, MCP server and AI gateway are inexpensive CPU workloads. They scale elastically via the Horizontal Pod Autoscaler (HPA) - replicas are added during load spikes and removed when idle. For queue-driven workers, KEDA is suitable, scaling on queue depth rather than just CPU.
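How the replica count on the stateless axis is derived can be sketched as follows. The first function follows the ratio rule documented for the Horizontal Pod Autoscaler (current replicas times current metric divided by target metric, rounded up); the second is a simplified view of queue-depth scaling. Both omit details of the real controllers such as tolerance bands and stabilisation windows, and the function names are illustrative.

```python
import math


def hpa_desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Ratio-based scaling: ceil(current * currentMetric / targetMetric), clamped."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))


def queue_desired_replicas(
    queue_depth: int,
    target_per_replica: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Queue-driven scaling: one worker per `target_per_replica` waiting messages."""
    desired = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, desired))
```

With three replicas at 90 per cent average CPU against a 60 per cent target, the first function asks for five replicas; an empty queue with a minimum of zero lets the worker pool scale down to nothing.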
GPU axis (inference): A different logic applies here. GPU nodes are advertised to the cluster via the NVIDIA device plugin; pods request GPUs via the resource limit nvidia.com/gpu. With node selectors, taints and tolerations, inference pods land specifically on the expensive GPU nodes, while stateless workloads remain on CPU nodes. The central difference: GPU capacity usually stays warm rather than elastic, because idle GPUs are the most expensive capex of all. True elasticity via the cluster autoscaler only works where the cloud provider supplies GPU nodes quickly enough - with dedicated, allocation-driven Blackwell capacity (B200/GB200) this is not the case.
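The scheduling mechanics described above map onto a handful of manifest fields. The sketch below builds a Deployment manifest as a Python dictionary and serialises it to JSON, which `kubectl apply -f` accepts alongside YAML. The node label `node-pool: gpu`, the taint key `nvidia.com/gpu` and the image and model names are assumptions for illustration and depend on how the cluster and its GPU node pool are set up.

```python
import json


def inference_deployment(name: str, image: str, model: str, gpus: int = 1) -> dict:
    """Build a Deployment manifest that pins an inference pod to GPU nodes."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,  # kept warm; not driven by an autoscaler here
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    # Assumed label on the GPU node pool.
                    "nodeSelector": {"node-pool": "gpu"},
                    # Assumed taint on GPU nodes that keeps CPU workloads away.
                    "tolerations": [
                        {
                            "key": "nvidia.com/gpu",
                            "operator": "Exists",
                            "effect": "NoSchedule",
                        }
                    ],
                    "containers": [
                        {
                            "name": "inference",
                            "image": image,
                            "args": ["--model", model],
                            # GPUs are requested via the extended resource
                            # advertised by the NVIDIA device plugin.
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                },
            },
        },
    }


manifest = inference_deployment(
    name="llm-inference",
    image="vllm/vllm-openai:latest",    # placeholder image tag
    model="example-org/example-model",  # hypothetical model name
    gpus=2,
)
manifest_json = json.dumps(manifest, indent=2)
```

Stateless workloads simply omit the toleration and the GPU limit, so the scheduler never places them on the tainted GPU nodes.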
The GPU memory maths determines how many models fit per node. Rule of thumb for the weights: parameters times bytes per parameter. At BF16 (2 bytes) a 70B model needs around 140 GB, at FP8 around 70 GB, at AWQ-INT4 around 35 GB - plus the KV cache, which grows with batch size and sequence length (in the order of 10-40 GB in production). In practice, 70B at BF16 only nominally fits on a single H200 (141 GB): the weights alone take around 140 GB, which leaves almost no headroom for the KV cache. Production batch sizes therefore typically need tensor parallelism across at least 2x H100, or more GPUs of a node connected via NVLink.
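The rule of thumb can be written down as a small calculator. This is a rough sizing sketch only: it treats one billion parameters at one byte each as 1 GB, takes the KV cache size as an input rather than deriving it, and ignores activation memory and framework overhead. The function and dictionary names are illustrative.

```python
import math

# Bytes per parameter for the precisions mentioned in the text.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint: parameters (in billions) times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def gpus_needed(
    params_billion: float,
    precision: str,
    kv_cache_gb: float,
    gpu_memory_gb: float,
) -> int:
    """Minimum number of GPUs so that weights plus KV cache fit in total GPU memory."""
    total_gb = weight_memory_gb(params_billion, precision) + kv_cache_gb
    return math.ceil(total_gb / gpu_memory_gb)
```

For a 70B model at BF16 with 20 GB of KV cache, the sketch yields two 80 GB GPUs; at FP8 with the same cache, a single 141 GB GPU is enough.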
State, memory and EU region
Unlike classic web apps, agents are stateful - conversation memory, task plans and tool-call history must survive even when a pod restarts or a region failover occurs. Architecturally this means: session state in a regional Redis or Cosmos DB with geo-replication, long-term memory and vector store replicated synchronously or asynchronously, depending on the required RPO. The vector store is the bulk-data challenge here, because embeddings and chunks make up the largest data volumes.
On the subject of EU region and sovereignty, the most important trap lurks: a Frankfurt region of a US hyperscaler delivers data residency, not CLOUD Act-resistant sovereignty - the operator remains a US legal entity. For managed Kubernetes with genuine sovereignty, sovereign clouds come into play: Infomaniak (Geneva/Zurich, full Swiss control, FADP plus GDPR) and Exoscale (Switzerland, OpenStack, managed K8s) offer managed Kubernetes without hyperscaler exposure; Swisscom has a Kubermatic-based sovereign K8s service; STACKIT (Schwarz Digits, with a data centre in Austria) and Open Telekom Cloud rely on OpenStack-based platforms with GPU instances. For workloads without strict sovereignty requirements, managed Kubernetes on a hyperscaler (AKS/EKS/GKE) in an EU region remains the pragmatic default - which is also how the reference architecture for lean scale-ups solves it.
Secrets, tool access and observability
Agents have an unusually large blast radius: a compromised agent can call many tools. Best practice is therefore one service account per agent-tool pair rather than a shared account, just-in-time elevation for sensitive operations, and audit trails that reach back via a token-exchange chain to the user identity.
Concretely on Kubernetes:
- Workload Identity instead of static credentials: Azure Managed Identity, AWS IRSA (IAM Roles for Service Accounts in EKS), GCP Workload Identity Federation - or Keystone/K8s service accounts for sovereign clouds. The goal is identical: no key in the code, no human in the credential path.
- Secrets backbone: HashiCorp Vault is the most widespread secrets and PKI layer in DACH platform stacks; in BFSI and the public sector with an HSM seal against a Utimaco (Aachen) or Thales HSM. The External Secrets pattern synchronises these into K8s Secrets without storing them in the repo.
- Egress control: The pattern established in DACH BFSI is deny-by-default with an explicit allowlist of the model API FQDNs, logged at the gateway. This prevents accidental data exfiltration and forces all model traffic through the AI gateway (LiteLLM, Portkey, Kong), where rate limits, PII filters and budgets reside.
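The deny-by-default rule from the egress bullet can be reduced to a small decision function. In a cluster the enforcement would sit in network policies or an egress gateway; the sketch below only illustrates the allowlist logic. The hostnames are hypothetical placeholders, and the function name is an assumption.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of model API hostnames; replace with the real FQDNs.
ALLOWED_MODEL_HOSTS = frozenset(
    {"models.example-eu.internal", "api.example-provider.eu"}
)


def is_egress_allowed(url: str, allowlist: frozenset = ALLOWED_MODEL_HOSTS) -> bool:
    """Deny by default: only HTTPS calls to an exactly matching hostname pass."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    # Exact match only, so look-alike suffixes such as
    # "api.example-provider.eu.evil.com" are rejected.
    return host in allowlist
```

Matching on the exact hostname rather than on a substring is the design choice that matters here: substring checks are a common way for allowlists to be bypassed.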
For health and observability, Kubernetes provides the foundation: liveness and readiness probes per pod ensure that only healthy inference pods receive traffic - which is critical given long model load times (the readiness probe must only turn green once the model is loaded). On top of this sits a tracing layer via OpenTelemetry, with token-accurate cost attribution and an EU-resident backend such as Langfuse (self-hosted).
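The difference between the two probes can be sketched with the standard library alone. The example assumes hypothetical probe paths `/healthz` (liveness) and `/ready` (readiness); real inference engines expose their own health endpoints, so this illustrates the pattern, not a particular engine's API.

```python
import threading
from http.server import BaseHTTPRequestHandler


class ModelState:
    """Tracks whether the model has finished loading."""

    def __init__(self) -> None:
        self._loaded = threading.Event()

    def mark_loaded(self) -> None:
        self._loaded.set()

    def liveness_status(self) -> int:
        # The process is up, even while the model is still loading.
        return 200

    def readiness_status(self) -> int:
        # 503 keeps the pod out of the Service endpoints until loading is done.
        return 200 if self._loaded.is_set() else 503


def make_handler(state: ModelState):
    """Create a request handler class bound to the given state."""

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path == "/healthz":
                status = state.liveness_status()
            elif self.path == "/ready":
                status = state.readiness_status()
            else:
                status = 404
            self.send_response(status)
            self.end_headers()

    return ProbeHandler
```

The handler can be served with `http.server.HTTPServer(("", 8080), make_handler(state))`, with the model loader calling `state.mark_loaded()` once the weights are in GPU memory. Keeping liveness green during loading avoids restart loops on models that take minutes to load.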
When Kubernetes - and when not
Here is the honest complexity warning. Kubernetes with self-hosted GPU inference is not a weekend project. It pays off when at least one of these drivers applies:
- High, steady load: A rule of thumb circulating in DACH platform teams holds that self-hosted inference becomes cheaper per token than managed APIs from around 8-12 H100-equivalents of continuous load - but with 6-9 months of engineering lead time. Below that, managed APIs dominate the TCO.
- Strict latency: Latency budgets below 200-500 ms with several tool-call rounds demand co-located inference; transatlantic API calls (Frankfurt → US-East: ~80-120 ms one-way) are then ruled out.
- Contractual sovereignty: When the legal department does not accept CLOUD Act exposure or BSI C5 Type 2 is binding.
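The latency driver can be checked with simple arithmetic. The sketch assumes that every tool-call round includes one model call with a full network round trip, and it deliberately ignores inference and tool execution time, so it gives a lower bound on the network overhead. The function names are illustrative.

```python
def network_overhead_ms(tool_call_rounds: int, one_way_latency_ms: float) -> float:
    """Lower bound on network time: one round trip (2x one-way) per round."""
    return tool_call_rounds * 2 * one_way_latency_ms


def fits_latency_budget(
    tool_call_rounds: int, one_way_latency_ms: float, budget_ms: float
) -> bool:
    """True if the network overhead alone still fits within the budget."""
    return network_overhead_ms(tool_call_rounds, one_way_latency_ms) <= budget_ms
```

Three rounds at 100 ms one-way already cost 600 ms of pure network time, which exceeds a 500 ms budget before any token is generated; with co-located inference at around 1 ms one-way (an assumed figure) the same three rounds cost about 6 ms.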
On the other side, managed APIs and serverless win when the load is spiky (idle GPUs are the most expensive form of capex), when high model variety is needed (Azure Foundry alone added DeepSeek R1, GPT-4.1, Mistral Large 3, Claude Opus 4.5 and Llama 4 in 2025) or simply when no platform team is available. Few DACH mid-market companies can operate vLLM, GPU scheduling and NCCL failure patterns in-house in 24x7 mode - and it is precisely this shortfall, not the technology itself, that is the most common reason self-hosting projects fail.
Concrete example: lean cluster for a B2B agent
A typical scale-up stack (modelled on the "Lean Cloud" reference architecture) looks like this:
```text
Managed Kubernetes (AKS/EKS in EU region or Exoscale/Infomaniak sovereign)
├── Deployment: agent-orchestrator (LangGraph, 3 replicas, HPA 2-10, CPU nodes)
├── Deployment: mcp-tools-sap (mTLS, 1 service account per tool)
├── Deployment: ai-gateway-litellm (failover: Azure OpenAI EU -> Mistral La Plateforme)
├── StatefulSet: qdrant (3 replicas, PVC, vector store)
├── StatefulSet: redis (session state, geo-replication)
└── Worker-Deployment: doc-processor (KEDA, scales on RabbitMQ queue depth)
Model inference: managed API (no own GPU node) -> via gateway
```
No self-operated GPU runs here - inference sits with a managed API in an EU geo, abstracted via the gateway. This is deliberate: at modest load, pay-per-token is cheaper and goes live in weeks rather than months. Only once the monthly API spend exceeds the run rate of around 10 H100-equivalents in a sovereign cloud - or a new regulatory requirement demands a control that the managed API cannot demonstrate - does migration to your own GPU nodes with vLLM or NIM on the same cluster pay off. This is exactly what Kubernetes is ideal for: the architecture stays the same, only the inference component moves from "managed API behind the gateway" to "GPU deployment in the cluster".
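The migration trigger described above can be written down as an explicit rule, which makes it easier to document and review. The sketch below is an assumption-laden illustration: the monthly cost per H100-equivalent is an input that has to come from the actual provider quote, and the function name and parameters are hypothetical.

```python
def should_migrate_to_self_hosting(
    monthly_api_spend: float,
    monthly_cost_per_h100_equiv: float,
    threshold_h100_equiv: float = 10.0,
    regulatory_gap: bool = False,
) -> bool:
    """Trigger when API spend exceeds the GPU run rate or a control is missing.

    `regulatory_gap` stands for a requirement that the managed API cannot
    demonstrate; it triggers migration regardless of cost.
    """
    gpu_run_rate = threshold_h100_equiv * monthly_cost_per_h100_equiv
    return regulatory_gap or monthly_api_spend > gpu_run_rate
```

With an assumed 2,500 per month per H100-equivalent, the threshold sits at 25,000 per month of API spend; below it the function returns `False` unless the regulatory flag is set.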
For agencies and B2B decision-makers
Kubernetes is the right foundation for agent systems that will grow over the long term, remain EU-sovereign or meet strict latency requirements - but the entry point should almost always be via managed APIs and managed Kubernetes, with a clearly documented migration trigger for the switch to your own GPU inference. Anyone who does not define this threshold cleanly will either build an expensive GPU cluster too early or too late, once the API bill is already spiralling out of control. Blck Alpaca supports DACH companies and marketing agencies with exactly this architectural decision - from the make-buy-rent assessment per component to the sovereign, observability-ready cluster design. Talk to us before the first GPU is ordered.