2.14 · Intermediate · 8 min

LLM router: when to use a large frontier model, when small, when open source?

Blck Alpaca·9 June 2026

Definition

An LLM router is routing logic that automatically assigns each agent step to the appropriate model: large frontier models for complex reasoning, small low-cost models for simple steps, and open-source or EU-hosted models for sovereignty and cost control. The choice follows four criteria: quality, cost, latency and compliance.

Key Takeaways

✓ Model choice in 2026 is no longer a one-off decision but per-step routing logic: depending on the sub-task, an agent can switch between a frontier, a workhorse and a small speed model.
✓ According to the research, the capability gap between the best open-weight models (Kimi K2.6, DeepSeek V4 Pro, Mistral Large 3) and the closed frontier models (Claude Opus 4.7, GPT-5.5 Pro, Gemini 3.1 Pro) has shrunk from 12-18 months (2024) to 3-6 months (2026).
✓ The cost lever is real: according to the research, closed frontier models remain 8-100x more expensive on output than low-cost open-weight models (as of 2026), so routing simple steps to small models substantially reduces the token bill.
✓ According to the research, open vs. proprietary is not a binary question but a portfolio question along four dimensions: weights control, hosting sovereignty, customization path, cost profile.
✓ For DACH B2B, hosting sovereignty is a hard routing criterion: workforce-defining and customer-data-processing agents belong on a sovereign EU substrate, while capability-bound analysis workloads can run on US closed frontier.
✓ Chinese open-weight (DeepSeek, Kimi, Qwen) is permissive in licensing terms but, according to the research, carries a geopolitical tail risk and requires a deliberate workload limitation.

An LLM router is routing logic that automatically assigns each step of an AI agent to the appropriate model. Large frontier models handle complex reasoning and tool orchestration, small low-cost models take care of simple steps such as classification or extraction, and open-source or EU-hosted models cover sovereignty and cost requirements. Instead of a one-off decision ("which model do we use?"), model choice becomes a continuous mapping of task to model along four criteria: quality, cost, latency and compliance.

The central insight for 2026: the question "large or small model, proprietary or open source?" is, according to the underlying research, no longer a binary decision but a portfolio allocation. A production agent rarely runs on a single model.

Three quick answers

When large (frontier closed)? For complex reasoning, agentic coding at the top tier, very-long-context tasks, premium multimodal and rare high-value analysis steps. In DACH corporations this category typically covers 15-35% of token volume, but it carries 60-80% of the perceived strategic value.
When small (workhorse/speed)? For simple, frequent, well-defined steps: classification, extraction, summarisation, the routing decisions themselves, standard formatting. Here the frontier premium is, according to the research, no longer economically compelling.
When open source / EU hosting? When sovereignty, data residency, cost control at high volume or co-determination requirements dominate - particularly for workforce-defining and customer-data-processing agents.

Why the binary model question no longer holds in 2026

According to the research, the capability gap between the best open-weight and frontier closed has narrowed from 12-18 months (2024) to 3-6 months (2026); on individual workloads it is zero or negative. Concretely: Kimi K2.6 (1T parameters, Modified MIT, open-weight) ranks 4th overall on the Artificial Analysis Intelligence Index - behind only Anthropic, Google and OpenAI - and achieves parity with GPT-5.5 on SWE-Bench Pro at 58.6%. DeepSeek V4 Pro achieves 80.6% on SWE-Bench Verified and a Codeforces rating of 3,206, the highest competition coding rating ever published.

At the same time, the frontier premium for certain steps remains real and non-trivial: on FrontierMath, GPT-5.5 Pro achieves the best public math result; on GPQA Diamond, Gemini 3.1 Pro leads with 94.3%; Claude Opus 4.7 stands at 87.6% on SWE-Bench Verified. For rare, high-value reasoning workloads - legal research, scientific hypothesis formation, complex financial analysis - frontier closed remains materially superior.

For model choice this means: the honest question is not "open or closed?", but "which step belongs on which tier?".

The cost lever: why routing pays off

The economic driver behind LLM routing is the price spread. According to the research, closed frontier remains 8-100x more expensive on output than low-cost open-weight models. Anyone who sends every agent step - even the simple classification of an email - to the frontier model pays premium prices for quality the step does not even require.

The following overview shows representative list prices (as of April-May 2026, USD per million tokens input/output). Prices and model versions change quickly and must be verified before any multi-year commitment.

Model	Tier	Price in/out (USD/1M tok., as of 2026)	Sovereignty profile
Claude Opus 4.7	Frontier closed	5 / 25	US jurisdictional (EU region available)
GPT-5.5 Pro	Frontier closed	30 / 180	US jurisdictional (Azure EU)
Gemini 3.1 Pro	Frontier closed	2 / 12 (>200K: 4 / 18)	US jurisdictional (Vertex EU)
Claude Sonnet 4.6	Workhorse closed	approx. 3 / 15	US jurisdictional
Claude Haiku 4.5	Speed closed	approx. 1 / 5	US jurisdictional
Mistral Large 3 (675B/41B active)	Frontier-near, Apache 2.0	0.50 / 1.50	EU sovereign (FR)
Ministral 3	Speed, open-weight	0.15 / 0.40	EU sovereign
Kimi K2.6 (1T/32B active)	Frontier-near, Modified MIT	0.60 / 2.50 (Moonshot)	CN origin, geopolitical tail risk
DeepSeek V4 Flash	Workhorse, MIT-derived	0.14 / 0.28	CN origin, geopolitical tail risk

The spread is drastic: in the typical comparison, the output price differs by a factor of 8 to 100. Set the most expensive frontier model against the cheapest workhorse, and GPT-5.5 Pro costs around 640 times as much on output as DeepSeek V4 Flash (180 vs. 0.28 USD per million tokens). It is precisely this gap that makes routing the economic standard, for business reasons rather than ideological ones.
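
The ratios quoted here can be reproduced directly from the output prices in the table. The following is a minimal sketch; the dictionary and function names are illustrative choices, and the prices are the list prices from the table, which may already be outdated.

```python
# Output list prices in USD per million tokens, copied from the price table.
OUTPUT_PRICE_USD = {
    "Claude Opus 4.7": 25.0,
    "GPT-5.5 Pro": 180.0,
    "Gemini 3.1 Pro": 12.0,
    "Mistral Large 3": 1.50,
    "Ministral 3": 0.40,
    "Kimi K2.6": 2.50,
    "DeepSeek V4 Flash": 0.28,
}


def output_price_ratio(expensive: str, cheap: str) -> float:
    """Return how many times more the first model costs per output token."""
    return OUTPUT_PRICE_USD[expensive] / OUTPUT_PRICE_USD[cheap]


if __name__ == "__main__":
    pairs = [
        ("Gemini 3.1 Pro", "Mistral Large 3"),
        ("Claude Opus 4.7", "DeepSeek V4 Flash"),
        ("GPT-5.5 Pro", "DeepSeek V4 Flash"),
    ]
    for expensive, cheap in pairs:
        ratio = output_price_ratio(expensive, cheap)
        print(f"{expensive} vs. {cheap}: {ratio:.0f}x")
```

The first two pairs land at roughly 8x and 89x, which brackets the 8-100x range; the third pair is the extreme case of around 640x.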

The router pattern: task to model

An LLM router assigns each incoming step to a complexity and compliance class, then selects the cheapest model that meets the requirements of that class. In practice, a small, fast model often handles the classification of the task itself before the actual work goes to the appropriate target model. The routing table is the heart of it:

Scenario	Model type	Rationale
Complex multi-step reasoning, agentic coding plan	Frontier closed (Claude Opus 4.7, GPT-5.5 Pro)	Capability premium real; the best result on the hardest tasks justifies the high output price
Frontier math, scientific hypothesis formation	Frontier closed (GPT-5.5 Pro)	Best public math result according to the research; rare, high-value workload
Tool orchestration, terminal/shell tasks	Frontier closed (GPT-5.5 leads Terminal-Bench, Claude Opus strong)	Agentic capability determines the success rate
Standard coding, code refactoring at scale	Frontier-near open-weight (DeepSeek V4 Pro, Kimi K2.6)	Math/coding parity with frontier closed; significantly cheaper
Classification, extraction, summarisation (batch)	Workhorse/speed open-weight (DeepSeek V4 Flash, Ministral)	50-80% of frontier capability sufficient; cost advantage dominates
German-language workhorse workflows	EU open-weight (Mistral, Aleph Alpha Pharia, Cohere Aya, Teuken-7B)	German performance a structural differentiator; US open-weight (Llama, approx. 8% non-English) weaker
Workforce-defining agent (HR bot, internal knowledge)	Sovereign EU (Mistral/Aleph Alpha on STACKIT, OVHcloud, T-Systems)	Co-determination, GDPR and reputation compound; compliance simplifier
Customer-data-processing agent (EU data)	Sovereign EU or at least US hyperscaler EU region with DPA	EU region reduces GDPR friction but does not eliminate US jurisdiction
Capability-bound analysis without personal data	US closed frontier acceptable	Capability premium economically substantial, no sovereignty constraint
Latency-critical interactive response	Speed tier (Claude Haiku, Groq/Cerebras-hosted)	Sub-second TTFT; quality secondary to response time
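
The selection rule "cheapest model that meets the requirement" can be sketched in a few lines. This is an illustration under assumptions, not a reference implementation: the numeric `capability` scale (1 = speed/workhorse, 2 = frontier-near, 3 = frontier) is a hypothetical simplification of the tiers, the `eu_sovereign` flag is read off the sovereignty column of the price table, and the catalogue contains only a subset of the models listed there.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    capability: int      # hypothetical scale: 1 speed/workhorse, 2 frontier-near, 3 frontier
    eu_sovereign: bool   # from the sovereignty column of the price table
    output_price: float  # USD per million output tokens


# Subset of the price table; the capability ranks are an assumption.
CATALOGUE = [
    Model("Claude Opus 4.7", 3, False, 25.0),
    Model("Mistral Large 3", 2, True, 1.50),
    Model("Kimi K2.6", 2, False, 2.50),
    Model("Ministral 3", 1, True, 0.40),
    Model("DeepSeek V4 Flash", 1, False, 0.28),
]


def pick_model(min_capability: int, needs_eu_sovereign: bool) -> Optional[Model]:
    """Return the cheapest model that satisfies both constraints, or None."""
    eligible = [
        m for m in CATALOGUE
        if m.capability >= min_capability
        and (m.eu_sovereign or not needs_eu_sovereign)
    ]
    return min(eligible, key=lambda m: m.output_price, default=None)


if __name__ == "__main__":
    print(pick_model(1, needs_eu_sovereign=True))   # batch classification on EU data
    print(pick_model(3, needs_eu_sovereign=False))  # complex reasoning, no personal data
    print(pick_model(3, needs_eu_sovereign=True))   # no match in this catalogue
```

Returning `None` when no model qualifies makes the conflict between capability and sovereignty explicit, so the caller has to decide whether to relax a constraint instead of silently falling back.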

The four trade-off dimensions

The research structures model choice along four dimensions that a good router maps together:

Quality (weights control & capability): How much reasoning depth does the step really need? And who controls the model - does it sit as a closed API solely within the vendor roadmap, or is it portable as open-weight to another inference provider or to self-hosting?
Cost (cost profile): Per-token closed (premium, fully variable), per-token open-weight via an inference provider (mid tier) or self-hosting (fixed costs, marginal token costs close to zero at high utilisation). According to the research, self-hosting only becomes truly attractive from a constant 5-50 million tokens/day onwards and only with existing MLOps capacity.
Latency: Speed specialists such as Groq (LPU) and Cerebras (wafer-scale) deliver sub-second time-to-first-token for selected models - relevant for interactive agents, irrelevant for nightly batch processing.
Compliance (hosting sovereignty): An EU region on a US hyperscaler is, according to the research, not the same as sovereign EU. It reduces latency and GDPR friction but does not eliminate US jurisdiction (CLOUD Act, OFAC). Structurally outside US reach are only sovereign EU stacks (STACKIT, OVHcloud, T-Systems Open Telekom Cloud, IONOS, Hetzner) and on-prem.
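
The compliance dimension in particular lends itself to an explicit allow-list, because it is a hard constraint rather than a trade-off. The sketch below encodes the three workload classes from the routing table; the enum and function names are hypothetical, and the mapping is a simplified reading of the table, not a legal assessment.

```python
from enum import Enum


class Hosting(Enum):
    SOVEREIGN_EU = "sovereign EU stack or on-prem"
    US_HYPERSCALER_EU_REGION = "EU region on a US hyperscaler"
    US_CLOSED = "US-hosted closed API"


class Workload(Enum):
    WORKFORCE_DEFINING = "workforce-defining agent"
    CUSTOMER_DATA = "customer-data-processing agent"
    ANALYSIS_NO_PERSONAL_DATA = "capability-bound analysis without personal data"


# Simplified reading of the routing table; the EU-region option for
# customer data additionally assumes a data processing agreement (DPA).
ALLOWED_HOSTING = {
    Workload.WORKFORCE_DEFINING: {Hosting.SOVEREIGN_EU},
    Workload.CUSTOMER_DATA: {Hosting.SOVEREIGN_EU, Hosting.US_HYPERSCALER_EU_REGION},
    Workload.ANALYSIS_NO_PERSONAL_DATA: set(Hosting),
}


def hosting_allowed(workload: Workload, hosting: Hosting) -> bool:
    """Check whether a hosting class is acceptable for a workload class."""
    return hosting in ALLOWED_HOSTING[workload]


if __name__ == "__main__":
    for workload in Workload:
        options = sorted(h.name for h in Hosting if hosting_allowed(workload, h))
        print(f"{workload.value}: {', '.join(options)}")
```

A router would typically apply this check first and only then optimise for quality, cost and latency among the hosting options that remain.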

Open source vs. proprietary: the sober assessment

This is where a closer look pays off, because "open" is multidimensional. Open-weight means that the weights are downloadable - not automatically that the licence is unrestricted. Mistral Large 3 (Apache 2.0, EU sovereign) and Phi-4 (MIT) are permissive. Llama 4, by contrast, is subject to the Llama Community License with a 700-million-MAU threshold and an EU multimodal restriction - a direct regulatory response to the EU AI Act; the OSI explicitly classifies Llama as not open source.

Chinese open-weight models (DeepSeek V4 under an MIT-derived licence, Kimi K2.6 under Modified MIT, Qwen 3.6-27B under Apache 2.0) are genuinely permissive in licensing terms, more so than Llama. According to the research, however, a permissive licence cannot neutralise the jurisdiction of origin. For DACH corporations with a US subsidiary, obligations to US contractual partners or critical infrastructure, this option carries an undetermined geopolitical tail risk (export controls, reputational risk) that must be explicitly limited per workload: acceptable, for example, for coding agents on public code or batch classification of publicly available texts, but problematic with sensitive customer data.

Sovereignty is, moreover, a measurable procurement criterion in 2026: according to Bitkom data, 89% of German digital importers consider themselves dependent, and 72% of the population regard Germany as too dependent on the USA for AI. In works council negotiations, a sovereign EU substrate creates structurally less friction because audit rights, data residency and vendor jurisdiction are easier to agree on.

A brief but important note: the compliance and licensing statements here are not legal advice. One threshold from the research is relevant for routing: anyone who substantially fine-tunes an open-weight model (indicative EU AI Act threshold: more than a third of the base pretraining compute, default 3.33 x 10^22 FLOPs) can themselves become a GPAI provider. LoRA/QLoRA typically lies far below this; continued pretraining almost always exceeds the threshold. This is an argument for relying on RAG plus prompt engineering in the customization pipeline rather than on heavy fine-tuning, and a reason to have the specific legal assessment reviewed by a qualified party.
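
The threshold arithmetic can be sketched as a simple check. Two assumptions are built in: the default of 3.33 x 10^22 FLOPs is read here as the fallback when the base model's pretraining compute is not known, and the compute figures in the usage lines are invented purely for illustration. The function name is hypothetical, and the result is an indicative screening value, not a legal determination.

```python
from typing import Optional

# Indicative default threshold quoted in the text (FLOPs).
DEFAULT_THRESHOLD_FLOPS = 3.33e22


def exceeds_finetune_threshold(
    finetune_flops: float,
    base_pretraining_flops: Optional[float] = None,
) -> bool:
    """Return True if fine-tuning compute exceeds the indicative threshold.

    Uses one third of the base pretraining compute when it is known and
    falls back to the default otherwise (assumed reading of the text).
    """
    if base_pretraining_flops is None:
        threshold = DEFAULT_THRESHOLD_FLOPS
    else:
        threshold = base_pretraining_flops / 3
    return finetune_flops > threshold


if __name__ == "__main__":
    # Hypothetical compute budgets, for illustration only.
    print(exceeds_finetune_threshold(1e20))                               # small adapter run
    print(exceeds_finetune_threshold(5e22))                               # large continued pretraining
    print(exceeds_finetune_threshold(4e22, base_pretraining_flops=3e23))  # known base compute
```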

Practical example: a support agent with mixed routing

A customer service agent handles 100,000 requests per day. Without a router, everything runs on Claude Opus 4.7 (5/25 USD per million tokens). With a router, the mapping looks like this:

Step 1 - intent classification (70% of the load): A speed open-weight model such as Ministral 3 (0.15/0.40 USD) on EU-region inference (e.g. DeepInfra Frankfurt) classifies the request. The step is EU-hosted and therefore sovereignty-compliant, and it is around 60 times cheaper on output than Opus (0.40 vs. 25 USD).
Step 2 - standard answer from the knowledge base (20% of the load): An EU workhorse such as Mistral Large 3 (0.50/1.50 USD, Apache 2.0, EU sovereign) generates the answer via RAG - strong German performance, data-resident.
Step 3 - escalation with complex contract reasoning (10% of the load): Here the frontier model (Claude Opus 4.7) takes over, because answer quality is business-critical and the premium is justified.

The result: the expensive frontier path now carries only a tenth of the load, while the predominant share runs on low-cost, EU-sovereign models. Pseudocode of the routing decision:

```
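# classify, rag and the model calls are placeholders for real clients
def route(request):
    intent = classify(request)  # Ministral 3, EU region
    if intent in SIMPLE:
        return mistral_large_3(rag(request))  # EU sovereign workhorse
    if intent == CONTRACT_COMPLEX:
        return claude_opus_47(request)  # Frontier only where needed
    # Assumption: unknown intents are surfaced instead of silently dropped
    raise ValueError(f"no route for intent: {intent}")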
```
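
To put a rough number on this mix, the blended output price can be computed from the three list prices. The sketch assumes that the load shares (70/20/10) are shares of output tokens, which the example does not state explicitly; input tokens and the cost of the classification call on escalated requests are ignored, so the result is an order-of-magnitude estimate only.

```python
# Output prices in USD per million tokens, from the price table.
PRICE_OUT = {
    "Ministral 3": 0.40,
    "Mistral Large 3": 1.50,
    "Claude Opus 4.7": 25.0,
}

# Assumption: the load shares from the example are shares of output tokens.
ROUTED_MIX = {
    "Ministral 3": 0.70,
    "Mistral Large 3": 0.20,
    "Claude Opus 4.7": 0.10,
}


def blended_output_price(mix: dict) -> float:
    """Weighted average output price (USD per million tokens) for a model mix."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(PRICE_OUT[model] * share for model, share in mix.items())


if __name__ == "__main__":
    baseline = blended_output_price({"Claude Opus 4.7": 1.0})
    routed = blended_output_price(ROUTED_MIX)
    print(f"Everything on Opus: {baseline:.2f} USD per million output tokens")
    print(f"Routed mix:         {routed:.2f} USD per million output tokens")
    print(f"Reduction factor:   {baseline / routed:.1f}x")
```

Under these assumptions the blended price drops from 25 USD to about 3.08 USD per million output tokens, and the frontier share still accounts for most of the remaining bill.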

So that routing does not become a vendor trap, the research recommends a portability layer (such as LiteLLM or OpenRouter for multi-provider routing) and at least a thin open-weight migration path for the most important workloads - plus a continuous eval pipeline against a held-out test set, because closed-API updates happen automatically and silent capability regressions have been documented several times.
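
The continuous eval pipeline mentioned above can start very small. The following sketch checks a model against a held-out test set and flags a regression relative to a stored baseline. Everything in it is an assumption for illustration: the model is represented as a plain callable, exact-match accuracy is used as the metric, the tolerance of 0.02 is arbitrary, and the stub classifier stands in for a real API call through whatever portability layer is in use.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (prompt, expected label)


def accuracy(model: Callable[[str], str], test_set: Sequence[Example]) -> float:
    """Share of held-out examples for which the model returns the expected label."""
    if not test_set:
        raise ValueError("test set must not be empty")
    correct = sum(1 for prompt, expected in test_set if model(prompt) == expected)
    return correct / len(test_set)


def has_regressed(
    model: Callable[[str], str],
    test_set: Sequence[Example],
    baseline_accuracy: float,
    tolerance: float = 0.02,  # arbitrary illustrative value
) -> bool:
    """Flag a regression if accuracy falls below the baseline minus the tolerance."""
    return accuracy(model, test_set) < baseline_accuracy - tolerance


# Hypothetical held-out set and stub classifier, standing in for a real model call.
HELD_OUT = [
    ("refund please", "billing"),
    ("reset password", "account"),
    ("cancel contract", "contract"),
    ("invoice copy", "billing"),
]


def stub_model(prompt: str) -> str:
    return "billing" if "refund" in prompt or "invoice" in prompt else "account"


if __name__ == "__main__":
    print(f"accuracy: {accuracy(stub_model, HELD_OUT):.2f}")
    print("regressed vs. 0.90 baseline:", has_regressed(stub_model, HELD_OUT, 0.90))
```

Running such a check on a schedule, and after every announced model update, is one way to notice a silent capability change before users do.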

For agencies and B2B decision-makers

A well-thought-out LLM router is not a nice-to-have in 2026 but the lever on which an agent's cost, answer quality and GDPR compliance all hang simultaneously. The right architecture is almost always hybrid - the actual work lies in deciding cleanly, per workload, which step needs frontier quality, which can run on a low-cost EU model and where sovereignty is binding. This is exactly the routing logic, vendor portability and the appropriate sovereign-EU or closed-frontier mix that Blck Alpaca, as a Vienna-based agency for AI agents, designs together with DACH companies. If you want to know which model mix fits your workloads, your compliance profile and your budget, talk to us.

FAQ

What is an LLM router and why does an agent need one?

An LLM router decides, for each agent step, which model handles the request. Instead of sending every step to the most expensive frontier model, it directs simple tasks (classification, extraction, formatting) to small, low-cost models and reserves large models for complex reasoning or tool orchestration. This optimises cost, latency and compliance simultaneously, without sacrificing answer quality on the demanding steps.

When is a large frontier model worth it compared to a small one?

According to the research, the frontier premium remains real for a minority of workloads: agentic coding at the Sonnet-4.6+ tier, very-long-context reasoning, premium multimodal and frontier math (here GPT-5.5 Pro achieves the best public result). Empirically, in DACH corporations typically 15-35% of token volume falls into this category, but it carries 60-80% of the perceived strategic value. For classification, extraction, summarisation and standard coding, the premium is no longer economically compelling.

Is an open-source LLM good enough for production agents in 2026?

For pure text and coding workloads, the open-weight gap is, according to the research, closed or minimal: Kimi K2.6 ranks 4th overall on the Artificial Analysis Intelligence Index and achieves parity with GPT-5.5 on SWE-Bench Pro at 58.6%. For premium vision/audio/video and frontier math, closed frontier remains ahead. On the German workhorse tier, EU models (Mistral, Aleph Alpha, Cohere Aya, Teuken) are structurally stronger than US open-weight such as Llama, which was trained on only about 8% non-English data.

What role do EU hosting and sovereignty play in model choice?

Hosting sovereignty is a routing dimension in its own right. An EU region on a US hyperscaler reduces, according to the research, latency and GDPR friction but does not eliminate US jurisdiction (CLOUD Act, OFAC). True sovereignty is offered only by sovereign EU stacks such as STACKIT, OVHcloud, T-Systems or on-prem. For regulated agents that process customer or employee data, a sovereign EU substrate is a structural compliance simplifier and creates less friction in works council negotiations.

What does wrong routing actually cost?

According to the research, closed frontier models remain 8-100x more expensive on output than low-cost open-weight models. Example as of 2026: GPT-5.5 Pro costs USD 30 input / USD 180 output per million tokens, DeepSeek V4 Flash 0.14 / 0.28 USD, Mistral Large 3 0.50 / 1.50 USD. Anyone who sends every classification or extraction step to the frontier model pays a multiple for quality the step does not need. Prices change quickly and must be verified before any commitment.
