
Deploying AI Agents on Kubernetes: Architecture, Scaling and When K8s Pays Off

Blck Alpaca
Definition

Deploying AI agents on Kubernetes means running the components of an agent system - agent service, tool or MCP server, vector store, inference engine and message queue - as containerised workloads on a K8s cluster. Kubernetes provides scaling, GPU scheduling, state handling, secrets management and observability for production, EU-sovereign agent operations.

Key Takeaways

  • A production agent system on Kubernetes consists of several decoupled services: agent orchestrator, tool/MCP server, inference engine (vLLM, SGLang or NVIDIA NIM), vector store and a message queue for asynchronous jobs.
  • Scaling runs along two axes: stateless agent pods scale elastically via HPA, while GPU inference pods are scheduled onto dedicated GPU nodes via node selectors, taints/tolerations and the NVIDIA device plugin. For cost reasons, the GPU pods stay warm rather than elastic.
  • Agents are stateful: conversation memory and task plans belong in an external store (Redis, Postgres), not in the pod. The vector store runs as a StatefulSet with PersistentVolumes.
  • Kubernetes only pays off at high, steady load (rule of thumb: from around 8-12 H100-equivalents of continuous inference) or under strict sovereignty/latency requirements - below that, managed APIs and serverless are usually cheaper and faster to go live.
  • Sovereignty does not arise from the region alone: a Frankfurt region of a US provider delivers data residency, not CLOUD Act-resistant sovereignty. Strict cases require sovereign clouds (STACKIT, Open Telekom Cloud, Infomaniak, Exoscale) with managed Kubernetes.
  • Plan realistically: a self-operated GPU cluster needs 6-9 months of lead time and a platform team that masters vLLM, GPU scheduling and NCCL failure patterns in 24x7 operations.

Kubernetes is powerful, but not an end in itself. This article shows the architecture, the scaling mechanics and, above all, when the effort actually pays off.

  • Architecture: Break the agent system down into decoupled services - orchestrator, tool/MCP server, inference engine, vector store, queue - and run each as its own workload.
  • Scaling: Stateless agent pods scale elastically via the Horizontal Pod Autoscaler; GPU inference pods are scheduled onto dedicated GPU nodes and stay warm for cost reasons.
  • Decision: Kubernetes pays off at high, continuous load or under strict sovereignty/latency requirements - below that, managed APIs and serverless are faster to go live and cheaper.

The container architecture of an agent system

A production AI agent is not a monolith but a federation of microservices. This division is not an end in itself but follows the differing scaling, state and security profiles of the individual components.

Agent service (orchestrator)

The agent service is the brain: it executes the orchestration logic, decides on tool calls and plans steps. It is typically built on a framework such as LangGraph, CrewAI or AutoGen, or on a vendor platform such as PhariaAI (Heidelberg) or Azure AI Foundry Agents. The decisive architectural constraint: this service must remain stateless so that Kubernetes can freely scale, restart and distribute it across multiple replicas. All state moves to external stores.
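The statelessness requirement can be sketched in a few lines. This is a minimal illustration, not the API of any framework named above: `InMemorySessionStore` and `handle_turn` are hypothetical names, and the in-memory dictionary merely stands in for an external store such as Redis.

```python
class InMemorySessionStore:
    """Stand-in for an external store such as Redis (hypothetical interface)."""

    def __init__(self) -> None:
        self._data: dict[str, list[str]] = {}

    def load(self, session_id: str) -> list[str]:
        return list(self._data.get(session_id, []))

    def save(self, session_id: str, history: list[str]) -> None:
        self._data[session_id] = list(history)


def handle_turn(store: InMemorySessionStore, session_id: str, user_message: str) -> str:
    """Process one turn without keeping any state on the replica itself."""
    history = store.load(session_id)        # state comes in from outside
    history.append(f"user: {user_message}")
    reply = f"ack #{len(history)}"          # placeholder for planning / model call
    history.append(f"agent: {reply}")
    store.save(session_id, history)         # state goes back out
    return reply


store = InMemorySessionStore()
# The two calls stand for two different replicas serving the same session.
first = handle_turn(store, "s-1", "hello")
second = handle_turn(store, "s-1", "status?")
print(first, second, len(store.load("s-1")))
```

Because the handler reads and writes everything through the store, it does not matter which replica receives the next request, which is what allows Kubernetes to restart or reschedule pods freely.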

Tool and MCP server

The Model Context Protocol (MCP) has established itself as the standard interface through which agents address external tools and data sources. In a DACH enterprise environment, MCP servers usually run as in-VPC services, co-located with the agent runtime and connected via mTLS - this is the default pattern for integrations with SAP, Salesforce, ServiceNow, M365 and internal databases. For factory or branch scenarios, MCP servers are deployed at the edge, close to the data, and the central orchestration calls them via dedicated links (ExpressRoute, Direct Connect). For regulated workloads the rule is: dedicated MCP servers per business unit with separate audit trails rather than a multi-tenant server.

Inference engine

This is where the language model runs. As of 2026, the field of options is well defined:

  • vLLM (originating from UC Berkeley) is the de facto standard for production self-hosting - PagedAttention, the broadest hardware support, OpenAI-compatible endpoints.
  • SGLang (LMSYS) delivers, according to LMSYS benchmarks, around 29 per cent higher throughput on 7B-8B models on H100 and better tail latency; ideal for multi-turn chat, RAG-heavy and structured workloads.
  • NVIDIA NIM packages vLLM/TensorRT-LLM/SGLang into pre-built containers and is the most pragmatic on-prem route in the DACH mid-market - tied, however, to the NVIDIA AI Enterprise licence.
  • TensorRT-LLM delivers peak throughput on NVIDIA hardware but is NVIDIA-only and operationally demanding.

Important note for existing systems: Hugging Face moved TGI (Text Generation Inference) into maintenance mode on 11 December 2025 and points new deployments to vLLM or SGLang. Anyone running on TGI today need not migrate in a panic, but should no longer build new developments on it.

Vector store and message queue

The vector store (Qdrant, Weaviate or pgvector) holds embeddings for RAG. It is a classic StatefulSet candidate with PersistentVolumes, because embeddings plus original chunks reach terabyte scale and need stable pod identity as well as persistent storage. A message queue (such as RabbitMQ, NATS or Redis Streams) decouples long-running, asynchronous agent jobs - document processing, batch analyses - from the synchronous request path and enables worker pools that scale independently.
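The decoupling via a queue can be illustrated with the Python standard library. In this sketch `queue.Queue` only stands in for a real broker such as RabbitMQ or NATS, and `submit` and `worker` are hypothetical names; the point is that the request path returns immediately while a separately sized worker pool does the slow work.

```python
import queue
import threading

jobs: queue.Queue = queue.Queue()   # stand-in for the message broker
results: list[str] = []
results_lock = threading.Lock()


def worker() -> None:
    """Long-running consumer; in a cluster this would be a worker pod."""
    while True:
        job = jobs.get()
        if job is None:             # sentinel: shut this worker down
            jobs.task_done()
            break
        with results_lock:
            results.append(f"processed {job}")
        jobs.task_done()


def submit(document_id: str) -> str:
    """Synchronous request path: enqueue and return immediately."""
    jobs.put(document_id)
    return "accepted"


workers = [threading.Thread(target=worker) for _ in range(3)]
for thread in workers:
    thread.start()

statuses = [submit(f"doc-{i}") for i in range(5)]
jobs.join()                         # wait until all queued jobs are processed

for _ in workers:
    jobs.put(None)
for thread in workers:
    thread.join()

print(statuses[0], sorted(results))
```

The number of workers (three here) is independent of the number of callers, which mirrors how worker Deployments scale separately from the agent service.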

Component mapping: what belongs on which K8s resource

The following table is the core reference for an agent deployment. It maps each component to the appropriate Kubernetes resource and names the decisive architectural note.

| Component | K8s resource | Note |
| --- | --- | --- |
| Agent service / orchestrator | Deployment + HPA + Service | Keep stateless; scales elastically via CPU/custom metrics |
| Tool/MCP server | Deployment (or DaemonSet at the edge) + Service | mTLS via service mesh; one service account per agent-tool pair |
| Inference engine (vLLM/NIM) | Deployment with GPU resource limit + Service | Node selector/tolerations on GPU nodes; keep pods warm, not elastic |
| Vector store (Qdrant/Weaviate) | StatefulSet + PersistentVolumeClaim | Persistent storage; geo-replication for failover if needed |
| Message queue + worker | StatefulSet (broker) + Deployment (worker) | Decouples async jobs; scale workers via KEDA on queue depth |
| Session/conversation state | External Redis / Cosmos (often managed) | Not in the pod; geo-replication depending on RPO |
| Secrets / tool credentials | External Secrets + Vault / Cloud KMS | No static keys in the pod; HSM seal for BFSI |
| AI gateway (LiteLLM/Portkey) | Deployment + Service + Ingress | Multi-provider failover, budgets, PII redaction, central egress control |
| Identity (pod → tool/model) | Workload Identity (IRSA / Managed Identity) | Token exchange to short-lived tokens; no credentials in code |
| Observability | Sidecar/agent + OpenTelemetry collector | Traces, token costs, health; EU-resident backend (e.g. Langfuse self-hosted) |

Scaling: HPA, GPU scheduling and the two axes

Agent stacks scale along two completely different axes, and this is the most common stumbling block.

Stateless axis (CPU): Agent service, MCP server and AI gateway are inexpensive CPU workloads. They scale elastically via the Horizontal Pod Autoscaler (HPA) - replicas are added during load spikes and removed when idle. For queue-driven workers, KEDA is suitable, scaling on queue depth rather than just CPU.
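How the stateless axis reacts to load can be made concrete with the replica calculation documented for the Horizontal Pod Autoscaler: desired replicas are the current replicas scaled by the ratio of observed to target metric, rounded up, with no change inside a tolerance band. The sketch below is a simplification (it ignores stabilisation windows and unready pods), and the queue-depth variant is only a rough model of how a queue-based scaler sizes a worker pool; function names and example numbers are assumptions.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, min_replicas: int,
                         max_replicas: int, tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(current * observed / target), then clamp."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas          # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def workers_for_queue(queue_depth: int, jobs_per_worker: int,
                      min_workers: int, max_workers: int) -> int:
    """Rough queue-depth sizing: one worker per `jobs_per_worker` waiting jobs."""
    wanted = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, wanted))


# Illustrative numbers: 3 replicas at 90 % CPU against a 60 % target.
scale_up = hpa_desired_replicas(3, 90, 60, min_replicas=2, max_replicas=10)
within_tolerance = hpa_desired_replicas(3, 63, 60, min_replicas=2, max_replicas=10)
capped = hpa_desired_replicas(3, 300, 60, min_replicas=2, max_replicas=10)
queue_workers = workers_for_queue(23, jobs_per_worker=5, min_workers=1, max_workers=20)
print(scale_up, within_tolerance, capped, queue_workers)
```

The clamp to `min_replicas` and `max_replicas` is what keeps a traffic spike from scheduling an unbounded number of pods.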

GPU axis (inference): A different logic applies here. GPU nodes are advertised to the cluster via the NVIDIA device plugin; pods request GPUs via the resource limit nvidia.com/gpu. With node selectors, taints and tolerations, inference pods land specifically on the expensive GPU nodes, while stateless workloads remain on CPU nodes. The central difference: GPU capacity is usually run as a warm, fixed-size pool rather than scaled elastically. Because idle GPUs are the most expensive capex of all, the pool is sized for steady load and kept busy instead of being scaled up and down. True elasticity via the cluster autoscaler only works where the cloud provider supplies GPU nodes quickly enough - with dedicated, allocation-driven Blackwell capacity (B200/GB200) this is not the case.
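The scheduling fields mentioned above can be shown as a manifest built in Python and printed as JSON, which `kubectl apply -f` also accepts. This is a minimal sketch under assumptions: the deployment name, the node label `gpu-pool: inference` and the use of `nvidia.com/gpu` as taint key are illustrative choices that depend on how the cluster's GPU nodes are labelled and tainted, and the image tag is a placeholder.

```python
import json


def inference_deployment(name: str, image: str, gpus_per_pod: int,
                         node_label: dict[str, str], taint_key: str) -> dict:
    """Build a Deployment that requests GPUs and targets tainted GPU nodes."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,  # warm pool: fixed size, no HPA attached
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    # Only nodes carrying this label are candidates.
                    "nodeSelector": dict(node_label),
                    # Allows the pod onto nodes tainted to repel other workloads.
                    "tolerations": [{
                        "key": taint_key,
                        "operator": "Exists",
                        "effect": "NoSchedule",
                    }],
                    "containers": [{
                        "name": "engine",
                        "image": image,
                        # GPUs are requested via limits on the extended resource.
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
                    }],
                },
            },
        },
    }


manifest = inference_deployment(
    name="inference-engine",                 # hypothetical name
    image="vllm/vllm-openai:latest",         # placeholder tag
    gpus_per_pod=2,
    node_label={"gpu-pool": "inference"},    # assumed node label
    taint_key="nvidia.com/gpu",              # assumed taint key
)
print(json.dumps(manifest, indent=2))
```

Stateless workloads need neither the toleration nor the selector, so they cannot land on the GPU nodes as long as those nodes carry the taint.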

The GPU memory maths determines how many models fit per node. Rule of thumb for the weights: parameters times bytes per parameter. At BF16 (2 bytes) a 70B model needs around 140 GB, at FP8 around 70 GB, at AWQ-INT4 around 35 GB - plus the KV cache, which grows with batch size and sequence length (in the order of 10-40 GB in production). In practice, 70B at BF16 only nominally fits on a single H200 (141 GB): the weights alone take around 140 GB, which leaves almost nothing for the KV cache. For production batch sizes it therefore needs at least 2x H100 with tensor parallelism across the GPUs of a node via NVLink, or a quantised variant.
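The rule of thumb translates directly into a small calculator. The sketch below uses decimal gigabytes and deliberately ignores activation memory and framework overhead, so it is a lower bound rather than a sizing tool; the KV-cache figures passed in are assumptions taken from the 10-40 GB range quoted above.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weights only: 1e9 parameters times bytes per parameter, in GB."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, gpu_memory_gb: float,
         gpus: int, kv_cache_gb: float) -> bool:
    """True if weights plus the assumed KV cache fit into the pooled GPU memory."""
    needed = weight_gb(params_billion, precision) + kv_cache_gb
    return needed <= gpu_memory_gb * gpus


for precision in ("bf16", "fp8", "int4"):
    print(precision, weight_gb(70, precision), "GB")

# 70B at BF16 with a modest 10 GB KV cache:
single_h200 = fits(70, "bf16", gpu_memory_gb=141, gpus=1, kv_cache_gb=10)
two_h100 = fits(70, "bf16", gpu_memory_gb=80, gpus=2, kv_cache_gb=10)
# 70B at FP8 with a large 40 GB KV cache on one H200:
fp8_h200 = fits(70, "fp8", gpu_memory_gb=141, gpus=1, kv_cache_gb=40)
print(single_h200, two_h100, fp8_h200)
```

Even with only 10 GB reserved for the cache, two H100s are left with little headroom at BF16, which is why quantisation is often the cheaper lever.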

State, memory and EU region

Unlike classic web apps, agents are stateful - conversation memory, task plans and tool-call history must survive even when a pod restarts or a region failover occurs. Architecturally this means: session state in a regional Redis or Cosmos DB with geo-replication, long-term memory and vector store replicated synchronously or asynchronously, depending on the required RPO. The vector store is the bulk-data challenge here, because embeddings and chunks make up the largest data volumes.

The most important trap concerns EU region and sovereignty: a Frankfurt region of a US hyperscaler delivers data residency, not CLOUD Act-resistant sovereignty - the operator remains a US legal entity. For managed Kubernetes with genuine sovereignty, sovereign clouds come into play: Infomaniak (Geneva/Zurich, full Swiss control, FADP plus GDPR) and Exoscale (Switzerland, OpenStack, managed K8s) offer managed Kubernetes without hyperscaler exposure; Swisscom has a Kubermatic-based sovereign K8s service; STACKIT (Schwarz Digits, with a data centre in Austria) and Open Telekom Cloud rely on OpenStack-based platforms with GPU instances. For workloads without strict sovereignty requirements, managed Kubernetes on a hyperscaler (AKS/EKS/GKE) in an EU region remains the pragmatic default - which is also how the reference architecture for lean scale-ups solves it.

Secrets, tool access and observability

Agents have an unusually large blast radius: a compromised agent can call many tools. Best practice is therefore one service account per agent-tool pair rather than a shared account, just-in-time elevation for sensitive operations, and audit trails that reach back via a token-exchange chain to the user identity.
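The "one service account per agent-tool pair" rule is easy to automate. The following sketch generates one ServiceAccount manifest per pair; the agent and tool names, the `agents` namespace and the `example.test/...` annotation keys are hypothetical, and the name sanitising is kept deliberately conservative.

```python
import re


def service_account_name(agent: str, tool: str) -> str:
    """Derive a lowercase, hyphen-separated name from an agent-tool pair."""
    raw = f"{agent}-{tool}".lower()
    return re.sub(r"[^a-z0-9]+", "-", raw).strip("-")


def service_accounts(pairs: list[tuple[str, str]], namespace: str) -> list[dict]:
    """One ServiceAccount per pair, so access can be granted and revoked per tool."""
    return [
        {
            "apiVersion": "v1",
            "kind": "ServiceAccount",
            "metadata": {
                "name": service_account_name(agent, tool),
                "namespace": namespace,
                # Hypothetical annotations that tie audit entries back to the pair.
                "annotations": {
                    "example.test/agent": agent,
                    "example.test/tool": tool,
                },
            },
        }
        for agent, tool in pairs
    ]


pairs = [
    ("invoice-agent", "sap"),
    ("invoice-agent", "m365"),
    ("support-agent", "servicenow"),
]
accounts = service_accounts(pairs, namespace="agents")
names = [account["metadata"]["name"] for account in accounts]
print(names)
```

With separate accounts, revoking the SAP access of one agent leaves its other tools and all other agents untouched.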

Concretely on Kubernetes:

  • Workload Identity instead of static credentials: Azure Managed Identity, AWS IRSA (IAM Roles for Service Accounts in EKS), GCP Workload Identity Federation - or Keystone/K8s service accounts for sovereign clouds. The goal is identical: no key in the code, no human in the credential path.
  • Secrets backbone: HashiCorp Vault is the most widespread secrets and PKI layer in DACH platform stacks; in BFSI and the public sector with an HSM seal against a Utimaco (Aachen) or Thales HSM. The External Secrets pattern synchronises these into K8s Secrets without storing them in the repo.
  • Egress control: The pattern established in DACH BFSI is deny-by-default with an explicit allowlist of the model API FQDNs, logged at the gateway. This prevents accidental data exfiltration and forces all model traffic through the AI gateway (LiteLLM, Portkey, Kong), where rate limits, PII filters and budgets reside.
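The deny-by-default egress rule boils down to a membership check plus an audit entry. The sketch below shows only that decision logic, not a network policy or gateway configuration; the FQDNs use the reserved `.test` domain and are hypothetical, as are the function names.

```python
ALLOWED_MODEL_FQDNS = frozenset({
    "models.provider-a.test",      # hypothetical model API endpoint
    "fallback.provider-b.test",    # hypothetical failover endpoint
})


def normalise(fqdn: str) -> str:
    """Lowercase and drop whitespace and a trailing dot before comparing."""
    return fqdn.strip().lower().rstrip(".")


def allow_egress(fqdn: str, allowlist: frozenset[str],
                 audit_log: list[tuple[str, bool]]) -> bool:
    """Deny by default: only exact allowlist matches pass; every decision is logged."""
    host = normalise(fqdn)
    allowed = host in allowlist
    audit_log.append((host, allowed))
    return allowed


audit_log: list[tuple[str, bool]] = []
ok = allow_egress("Models.Provider-A.test.", ALLOWED_MODEL_FQDNS, audit_log)
blocked = allow_egress("paste.unknown-site.test", ALLOWED_MODEL_FQDNS, audit_log)
print(ok, blocked, audit_log)
```

Exact matching is intentional here: suffix or wildcard matching would widen the allowlist beyond the explicitly approved endpoints.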

For health and observability, Kubernetes provides the foundation: liveness and readiness probes per pod ensure that only healthy inference pods receive traffic - which is critical given long model load times (the readiness probe must only turn green once the model is loaded). On top of this sits a tracing layer via OpenTelemetry, with token-accurate cost attribution and an EU-resident backend such as Langfuse (self-hosted).
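The difference between the two probes during a long model load can be sketched as follows. `ModelState` is a hypothetical helper; in a real service its two methods would back whatever HTTP paths the pod's liveness and readiness probes are configured to call.

```python
import threading


class ModelState:
    """Tracks whether the model has finished loading (hypothetical helper)."""

    def __init__(self) -> None:
        self._loaded = threading.Event()

    def mark_loaded(self) -> None:
        self._loaded.set()

    def liveness_status(self) -> int:
        # The process is up and responsive: do not restart it while it loads.
        return 200

    def readiness_status(self) -> int:
        # Only accept traffic once the weights are actually in memory.
        return 200 if self._loaded.is_set() else 503


state = ModelState()
before = (state.liveness_status(), state.readiness_status())
state.mark_loaded()                 # called once loading has completed
after = (state.liveness_status(), state.readiness_status())
print(before, after)
```

Keeping liveness green while readiness is still red prevents Kubernetes from restarting a pod that is merely busy loading weights.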

When Kubernetes - and when not

Here is the honest complexity warning. Kubernetes with self-hosted GPU inference is not a weekend project. It pays off when at least one of these drivers applies:

  • High, steady load: A rule of thumb circulating in DACH platform teams holds that self-hosted inference becomes cheaper per token than managed APIs from around 8-12 H100-equivalents of continuous load - but with 6-9 months of engineering lead time. Below that, managed APIs dominate the TCO.
  • Strict latency: Latency budgets below 200-500 ms with several tool-call rounds demand co-located inference; transatlantic API calls (Frankfurt → US-East: ~80-120 ms round trip) are then ruled out.
  • Contractual sovereignty: When the legal department does not accept CLOUD Act exposure or BSI C5 Type 2 is binding.

Conversely, managed APIs and serverless win when the load is spiky (idle GPUs are the worst capex), when high model variety is needed (Azure Foundry alone added DeepSeek R1, GPT-4.1, Mistral Large 3, Claude Opus 4.5 and Llama 4 in 2025) or simply when no platform team is available. Few DACH mid-market companies have in-house mastery of vLLM, GPU scheduling and NCCL failure patterns in 24x7 operations - and it is precisely this shortfall, not the technology itself, that is the most common reason self-hosting projects fail.

Concrete example: lean cluster for a B2B agent

A typical scale-up stack (modelled on the "Lean Cloud" reference architecture) looks like this:

```text
Managed Kubernetes (AKS/EKS in EU region or Exoscale/Infomaniak sovereign)
├── Deployment: agent-orchestrator (LangGraph, 3 replicas, HPA 2-10, CPU nodes)
├── Deployment: mcp-tools-sap (mTLS, 1 service account per tool)
├── Deployment: ai-gateway-litellm (failover: Azure OpenAI EU -> Mistral La Plateforme)
├── StatefulSet: qdrant (3 replicas, PVC, vector store)
├── StatefulSet: redis (session state, geo-replication)
└── Deployment: doc-processor (worker, KEDA, scales on RabbitMQ queue depth)
Model inference: managed API (no self-operated GPU nodes) -> via gateway
```

No self-operated GPU runs here - inference sits with a managed API in an EU geo, abstracted via the gateway. This is deliberate: at modest load, pay-per-token is cheaper and goes live in weeks rather than months. Only once the monthly API spend exceeds the run rate of around 10 H100-equivalents in a sovereign cloud - or a new regulatory requirement demands a control that the managed API cannot demonstrate - does migration to your own GPU nodes with vLLM or NIM on the same cluster pay off. This is exactly what Kubernetes is ideal for: the architecture stays the same, only the inference component moves from "managed API behind the gateway" to "GPU deployment in the cluster".
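The migration trigger described above can be written down as an explicit rule, which makes it easy to review regularly. The sketch is a minimal version with hypothetical names; the euro amounts in the example are placeholders, not price data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationCheck:
    monthly_api_spend_eur: float      # current pay-per-token bill
    gpu_run_rate_eur: float           # assumed monthly cost of the GPU alternative
    unmet_regulatory_control: bool    # a control the managed API cannot demonstrate

    def should_migrate(self) -> bool:
        """Either driver alone is enough to trigger the move to own GPU nodes."""
        return (self.unmet_regulatory_control
                or self.monthly_api_spend_eur > self.gpu_run_rate_eur)


# Placeholder figures for illustration only.
stay = MigrationCheck(12_000, 60_000, unmet_regulatory_control=False)
cost_driven = MigrationCheck(75_000, 60_000, unmet_regulatory_control=False)
compliance_driven = MigrationCheck(12_000, 60_000, unmet_regulatory_control=True)
print(stay.should_migrate(), cost_driven.should_migrate(),
      compliance_driven.should_migrate())
```

Documenting the threshold as data rather than as a gut feeling is what turns the trigger into something a platform team can actually monitor.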

For agencies and B2B decision-makers

Kubernetes is the right foundation for agent systems that will grow over the long term, remain EU-sovereign or meet strict latency requirements - but the entry point should almost always be via managed APIs and managed Kubernetes, with a clearly documented migration trigger for the switch to your own GPU inference. Anyone who does not define this threshold cleanly will build an expensive GPU cluster either too early or too late, when the API bill is already spiralling out of control. Blck Alpaca supports DACH companies and marketing agencies with exactly this architectural decision - from the make-buy-rent assessment per component to the sovereign, observability-ready cluster design. Talk to us before the first GPU is ordered.

FAQ

When does Kubernetes make sense for AI agents - and when does it not?
Kubernetes pays off when you run self-hosted inference at high, steady load (rule of thumb: from around 8-12 H100-equivalents of continuous load, self-hosting becomes cheaper per token than managed APIs), when latency budgets below 200-500 ms with several tool-call rounds are required, or when strict sovereignty (CLOUD Act-resistant, BSI C5 Type 2) is contractually necessary. For spiky, low-sensitivity load or in the absence of a platform team, managed APIs (Azure OpenAI Data Zone EUR, Bedrock EU, Mistral La Plateforme) and serverless are faster to go live and cheaper - budget 6-9 months of lead time for your own GPU cluster.

Which inference engine should you deploy on Kubernetes in 2026?
As of 2026, vLLM is the de facto standard for production self-hosting deployments: PagedAttention, broad hardware support, OpenAI-compatible endpoints. According to LMSYS benchmarks, SGLang delivers around 29 per cent higher throughput on 7B-8B models on H100 and is suited to multi-turn chat and structured outputs. NVIDIA NIM is the most pragmatic on-prem route in the DACH mid-market (pre-built containers, fast deployment) but ties you to the NVIDIA AI Enterprise licence. Hugging Face's TGI has been in maintenance mode since 11 December 2025 - new deployments are pointed to vLLM or SGLang.

How are agent state and memory handled on Kubernetes?
Agent pods must remain stateless so the HPA can freely scale and restart them. Conversation memory, task plans and tool-call history belong in an external store - typically Redis (session state) and Postgres (persistent conversations), with geo-replication for cross-region failover. The vector store (Qdrant, Weaviate, pgvector) runs as a StatefulSet with PersistentVolumeClaims, because embeddings and original chunks can reach terabyte scale and need stable identity.

How do agents on Kubernetes get secure access to secrets and tools?
Static credentials in the pod or in environment variables are off-limits. Workload Identity (Azure Managed Identity, AWS IRSA, GCP Workload Identity Federation, or Keystone/K8s service accounts for sovereign clouds) binds an identity to the pod without distributing keys. Secrets are provided by HashiCorp Vault, in regulated cases with an HSM seal against a Utimaco or Thales HSM. Best practice is one service account per agent-tool pair rather than a shared account, because a compromised agent has a large blast radius.

How does GPU scheduling for inference pods work?
GPU nodes are advertised to the cluster via the NVIDIA device plugin; pods request GPUs via a resource limit (nvidia.com/gpu). With node selectors, taints and tolerations, inference pods land specifically on the expensive GPU nodes, while stateless agent pods run on cheap CPU nodes. Important: a GPU is not arbitrarily divisible - depending on quantisation, a 70B model needs around 140 GB (BF16), 70 GB (FP8) or 35 GB (INT4) for the weights alone, plus KV cache. GPU pools usually stay warm rather than elastic, because idle GPUs are the most expensive capex.
