LLM router: when to use a large frontier model, when small, when open source?
An LLM router is a routing logic that automatically assigns each agent step to the appropriate model: large frontier models for complex reasoning, small low-cost models for simple steps, open-source or EU-hosted models for sovereignty and cost control. The choice follows four criteria: quality, cost, latency and compliance.
Key Takeaways
- Model choice in 2026 is no longer a one-off decision but a per-step routing logic: depending on the sub-task, an agent can switch between a frontier, a workhorse and a small speed model.
- According to the research, the capability gap between the best open-weight (Kimi K2.6, DeepSeek V4 Pro, Mistral Large 3) and frontier closed (Claude Opus 4.7, GPT-5.5 Pro, Gemini 3.1 Pro) has shrunk from 12-18 months (2024) to 3-6 months (2026).
- The cost lever is real: according to the research, closed frontier remains 8-100x more expensive on output than low-cost open-weight models (as of 2026) - routing simple steps to small models massively reduces the token bill.
- According to the research, open vs. proprietary is not a binary question but a portfolio question along four dimensions: weights control, hosting sovereignty, customization path, cost profile.
- For DACH B2B, hosting sovereignty is a hard routing criterion: workforce-defining and customer-data-processing agents belong on a sovereign EU substrate, while capability-bound analysis workloads can run on US closed frontier.
- Chinese open-weight (DeepSeek, Kimi, Qwen) is permissive in licensing terms but, according to the research, carries a geopolitical tail risk and requires a deliberate workload limitation.
An LLM router is a routing logic that automatically assigns each step of an AI agent to the appropriate model. Large frontier models handle complex reasoning and tool orchestration, small low-cost models take care of simple steps such as classification or extraction, and open-source or EU-hosted models cover sovereignty and cost requirements. Instead of a one-off decision of "which model do we use?", model choice thus becomes a continuous mapping of task to model along four criteria: quality, cost, latency and compliance.
The central insight for 2026: the question "large or small model, proprietary or open source?" is, according to the underlying research, no longer a binary decision but a portfolio allocation. A production agent rarely runs on a single model.
Three quick answers
- When large (frontier closed)? For complex reasoning, agentic coding at the top tier, very-long-context tasks, premium multimodal and rare high-value analysis steps. In DACH corporations this category typically covers 15-35% of token volume, but it carries 60-80% of the perceived strategic value.
- When small (workhorse/speed)? For simple, frequent, well-defined steps: classification, extraction, summarisation, the routing decisions themselves, standard formatting. Here the frontier premium is, according to the research, no longer economically compelling.
- When open source / EU hosting? When sovereignty, data residency, cost control at high volume or co-determination requirements dominate - particularly for workforce-defining and customer-data-processing agents.
Why the binary model question no longer holds in 2026
According to the research, the capability gap between the best open-weight and frontier closed has narrowed from 12-18 months (2024) to 3-6 months (2026); on individual workloads it is zero or negative. Concretely: Kimi K2.6 (1T parameters, Modified MIT, open-weight) ranks 4th overall on the Artificial Analysis Intelligence Index - behind only Anthropic, Google and OpenAI - and achieves parity with GPT-5.5 on SWE-Bench Pro at 58.6%. DeepSeek V4 Pro achieves 80.6% on SWE-Bench Verified and a Codeforces rating of 3,206, the highest competition coding rating ever published.
At the same time, the frontier premium for certain steps remains real and non-trivial: on FrontierMath, GPT-5.5 Pro achieves the best public math result; on GPQA Diamond, Gemini 3.1 Pro leads with 94.3%; Claude Opus 4.7 stands at 87.6% on SWE-Bench Verified. For rare, high-value reasoning workloads - legal research, scientific hypothesis formation, complex financial analysis - frontier closed remains materially superior.
For model choice this means: the honest question is not "open or closed?", but "which step belongs on which tier?".
The cost lever: why routing pays off
The economic driver behind LLM routing is the price spread. According to the research, closed frontier remains 8-100x more expensive on output than low-cost open-weight models. Anyone who sends every agent step - even the simple classification of an email - to the frontier model pays premium prices for quality the step does not even require.
The following overview shows representative list prices (as of April-May 2026, USD per million tokens input/output). Prices and model versions change quickly and must be verified before any multi-year commitment.
| Model | Tier | Price in/out (USD per 1M tokens, as of 2026) | Sovereignty profile |
|---|---|---|---|
| Claude Opus 4.7 | Frontier closed | 5 / 25 | US jurisdictional (EU region available) |
| GPT-5.5 Pro | Frontier closed | 30 / 180 | US jurisdictional (Azure EU) |
| Gemini 3.1 Pro | Frontier closed | 2 / 12 (>200K: 4 / 18) | US jurisdictional (Vertex EU) |
| Claude Sonnet 4.6 | Workhorse closed | approx. 3 / 15 | US jurisdictional |
| Claude Haiku 4.5 | Speed closed | approx. 1 / 5 | US jurisdictional |
| Mistral Large 3 (675B/41B active) | Frontier-near, Apache 2.0 | 0.50 / 1.50 | EU sovereign (FR) |
| Ministral 3 | Speed, open-weight | 0.15 / 0.40 | EU sovereign |
| Kimi K2.6 (1T/32B active) | Frontier-near, Modified MIT | 0.60 / 2.50 (Moonshot) | CN origin, geopolitical tail risk |
| DeepSeek V4 Flash | Workhorse, MIT-derived | 0.14 / 0.28 | CN origin, geopolitical tail risk |
The spread is drastic: in the typical comparison, the output price differs by a factor of 8 to 100. Set the most expensive frontier model against the cheapest workhorse, and GPT-5.5 Pro costs around 640 times as much on output as DeepSeek V4 Flash (180 vs. 0.28 USD per million tokens). It is precisely this gap that makes routing the economic standard, for business rather than ideological reasons.
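The ratios quoted here can be recomputed from the output prices in the table above. The following is a minimal sketch: the dictionary and function names are illustrative, and the figures are simply the list prices quoted in this article, which may have changed since.

```python
# Output list prices in USD per million tokens, copied from the table above.
# Names of the dictionary and helper are illustrative, not from any library.
OUTPUT_PRICE_USD_PER_M = {
    "Claude Opus 4.7": 25.00,
    "GPT-5.5 Pro": 180.00,
    "Gemini 3.1 Pro": 12.00,
    "Mistral Large 3": 1.50,
    "Ministral 3": 0.40,
    "Kimi K2.6": 2.50,
    "DeepSeek V4 Flash": 0.28,
}


def output_price_ratio(expensive: str, cheap: str) -> float:
    """Return how many times more `expensive` costs on output than `cheap`."""
    return OUTPUT_PRICE_USD_PER_M[expensive] / OUTPUT_PRICE_USD_PER_M[cheap]


if __name__ == "__main__":
    pairs = [
        ("GPT-5.5 Pro", "DeepSeek V4 Flash"),
        ("Claude Opus 4.7", "Ministral 3"),
        ("Claude Opus 4.7", "Mistral Large 3"),
    ]
    for expensive, cheap in pairs:
        ratio = output_price_ratio(expensive, cheap)
        print(f"{expensive} vs. {cheap}: {ratio:.1f}x on output")
```

Only output prices are compared; a real bill also depends on the input/output token mix of each step.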
The router pattern: task to model
An LLM router assigns each incoming step to a complexity class and a compliance class, then selects the cheapest model that meets both requirements. In practice, a small, fast model often handles the classification of the task itself before the actual work goes to the appropriate target model. The routing table is the heart of it:
| Scenario | Model type | Rationale |
|---|---|---|
| Complex multi-step reasoning, agentic coding plan | Frontier closed (Claude Opus 4.7, GPT-5.5 Pro) | Capability premium real; the best result on the hardest tasks justifies the high output price |
| Frontier math, scientific hypothesis formation | Frontier closed (GPT-5.5 Pro) | Best public math result according to the research; rare, high-value workload |
| Tool orchestration, terminal/shell tasks | Frontier closed (GPT-5.5 leads Terminal-Bench, Claude Opus strong) | Agentic capability determines the success rate |
| Standard coding, code refactoring at scale | Frontier-near open-weight (DeepSeek V4 Pro, Kimi K2.6) | Math/coding parity with frontier closed; significantly cheaper |
| Classification, extraction, summarisation (batch) | Workhorse/speed open-weight (DeepSeek V4 Flash, Ministral) | 50-80% of frontier capability sufficient; cost advantage dominates |
| German-language workhorse workflows | EU open-weight (Mistral, Aleph Alpha Pharia, Cohere Aya, Teuken-7B) | German performance a structural differentiator; US open-weight (Llama, approx. 8% non-English) weaker |
| Workforce-defining agent (HR bot, internal knowledge) | Sovereign EU (Mistral/Aleph Alpha on STACKIT, OVHcloud, T-Systems) | Co-determination, GDPR and reputation compound; compliance simplifier |
| Customer-data-processing agent (EU data) | Sovereign EU or at least US hyperscaler EU region with DPA | EU region reduces GDPR friction but does not eliminate US jurisdiction |
| Capability-bound analysis without personal data | US closed frontier acceptable | Capability premium economically substantial, no sovereignty constraint |
| Latency-critical interactive response | Speed tier (Claude Haiku, Groq/Cerebras-hosted) | Sub-second TTFT; quality secondary to response time |
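The rule "cheapest model that meets the requirement" can be sketched in a few lines. This is an illustration under stated assumptions, not a production router: the `Tier` ordering, the `Model` fields and the shortened catalogue are hypothetical simplifications, and treating capability as a single ordinal tier is cruder than the scenario table above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical ordinal capability tiers (higher = more capable)."""
    SPEED = 1
    WORKHORSE = 2
    FRONTIER_NEAR = 3
    FRONTIER = 4


@dataclass(frozen=True)
class Model:
    name: str
    tier: Tier
    output_usd_per_m: float  # output list price from the price table
    eu_sovereign: bool       # sovereignty profile from the price table


# Shortened catalogue; prices and sovereignty flags follow the price table.
CATALOG = [
    Model("Claude Opus 4.7", Tier.FRONTIER, 25.00, False),
    Model("Claude Haiku 4.5", Tier.SPEED, 5.00, False),
    Model("Mistral Large 3", Tier.FRONTIER_NEAR, 1.50, True),
    Model("Ministral 3", Tier.SPEED, 0.40, True),
    Model("DeepSeek V4 Flash", Tier.WORKHORSE, 0.28, False),
]


def route(min_tier: Tier, needs_eu_sovereign: bool) -> Model:
    """Pick the cheapest catalogue model that satisfies both constraints."""
    candidates = [
        m for m in CATALOG
        if m.tier >= min_tier and (m.eu_sovereign or not needs_eu_sovereign)
    ]
    if not candidates:
        raise LookupError("no model satisfies the requested tier and sovereignty")
    return min(candidates, key=lambda m: m.output_usd_per_m)


if __name__ == "__main__":
    print(route(Tier.SPEED, needs_eu_sovereign=True).name)
    print(route(Tier.FRONTIER, needs_eu_sovereign=False).name)
```

Raising an error when no model qualifies is a deliberate choice in this sketch: a frontier-grade step with a hard sovereignty constraint should surface as a design decision rather than silently fall back to a non-compliant model.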
The four trade-off dimensions
The research structures model choice along four dimensions that a good router maps together:
- Quality (weights control & capability): How much reasoning depth does the step really need? And who controls the model - does it sit as a closed API solely within the vendor roadmap, or is it portable as open-weight to another inference provider or to self-hosting?
- Cost (cost profile): Per-token closed (premium, fully variable), per-token open-weight via an inference provider (mid tier) or self-hosting (fixed costs, marginal token costs close to zero at high utilisation). According to the research, self-hosting only becomes truly attractive from a constant 5-50 million tokens/day onwards and only with existing MLOps capacity.
- Latency: Speed specialists such as Groq (LPU) and Cerebras (wafer-scale) deliver sub-second time-to-first-token for selected models - relevant for interactive agents, irrelevant for nightly batch processing.
- Compliance (hosting sovereignty): An EU region on a US hyperscaler is, according to the research, not the same as sovereign EU. It reduces latency and GDPR friction but does not eliminate US jurisdiction (CLOUD Act, OFAC). Structurally outside US reach are only sovereign EU stacks (STACKIT, OVHcloud, T-Systems Open Telekom Cloud, IONOS, Hetzner) and on-prem.
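The self-hosting threshold in the cost dimension follows from a simple break-even: fixed infrastructure cost divided by the per-token API price it replaces. The sketch below assumes marginal self-hosting cost of zero and ignores MLOps staffing; the 3,000 USD monthly figure in the usage example is a made-up placeholder, not a number from the research.

```python
def break_even_tokens_per_day(
    fixed_cost_usd_per_month: float,
    api_price_usd_per_m_tokens: float,
    days_per_month: int = 30,
) -> float:
    """Daily token volume at which a fixed self-hosting cost equals the API bill.

    Simplifying assumptions: marginal self-hosting cost is zero and the
    API price is a single blended figure per million tokens.
    """
    if api_price_usd_per_m_tokens <= 0:
        raise ValueError("API price must be positive")
    daily_fixed_cost = fixed_cost_usd_per_month / days_per_month
    return daily_fixed_cost / api_price_usd_per_m_tokens * 1_000_000


if __name__ == "__main__":
    # Hypothetical: 3,000 USD/month of GPU capacity vs. a 2.00 USD/M blended API price.
    tokens = break_even_tokens_per_day(3000, 2.00)
    print(f"Break-even at {tokens:,.0f} tokens/day")
```

Below the break-even volume, per-token pricing through an inference provider stays cheaper; the lower the replaced API price, the higher the volume needed to justify fixed capacity.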
Open source vs. proprietary: the sober assessment
This is where a closer look pays off, because "open" is multidimensional. Open-weight means that the weights are downloadable - not automatically that the licence is unrestricted. Mistral Large 3 (Apache 2.0, EU sovereign) and Phi-4 (MIT) are permissive. Llama 4, by contrast, is subject to the Llama Community License with a 700-million-MAU threshold and an EU multimodal restriction - a direct regulatory response to the EU AI Act; the OSI explicitly classifies Llama as not open source.
Chinese open-weight (DeepSeek V4 under MIT-derived, Kimi K2.6 under Modified MIT, Qwen 3.6-27B under Apache 2.0) is genuinely permissive in licensing terms - more permissive than Llama. But, according to the research, the origin jurisdiction cannot be neutralised by licence permissiveness. For DACH corporations with a US subsidiary, US contractual partner obligations or critical infrastructure, this option carries an undetermined geopolitical tail risk (export controls, reputational risk) that must be explicitly limited per workload - acceptable, for example, for coding agents on public code or batch classification on publicly available texts, problematic with sensitive customer data.
Sovereignty is, moreover, a measurable procurement criterion in 2026: according to Bitkom data, 89% of German companies that import digital technologies or services consider themselves dependent on those imports, and 72% of the population regard Germany as too dependent on the USA for AI. In works council negotiations, a sovereign EU substrate creates structurally less friction because audit rights, data residency and vendor jurisdiction are easier to agree on.
A brief but important note: the compliance and licensing statements here are not legal advice. One threshold from the research is relevant for routing: anyone who substantially fine-tunes an open-weight model (indicative EU AI Act threshold: more than a third of the base pretraining compute, default 3.33 × 10^22 FLOPs) can themselves become a GPAI provider. LoRA/QLoRA typically lies far below this; continued pretraining almost always exceeds the threshold. This is an argument for relying on RAG plus prompt engineering in the customization pipeline rather than on heavy fine-tuning, and a reason to have the specific legal assessment reviewed by a qualified party.
Practical example: a support agent with mixed routing
A customer service agent handles 100,000 requests per day. Without a router, everything runs on Claude Opus 4.7 (5/25 USD per million tokens). With a router, the mapping looks like this:
- Step 1 - intent classification (70% of the load): A speed open-weight model such as Ministral 3 (0.15/0.40 USD) on EU-region inference (e.g. DeepInfra Frankfurt) classifies the request. Because inference is EU-hosted, the step meets the sovereignty requirement, and it is around 60 times cheaper on output than Opus.
- Step 2 - standard answer from the knowledge base (20% of the load): An EU workhorse such as Mistral Large 3 (0.50/1.50 USD, Apache 2.0, EU sovereign) generates the answer via RAG - strong German performance, data-resident.
- Step 3 - escalation with complex contract reasoning (10% of the load): Here the frontier model (Claude Opus 4.7) takes over, because answer quality is business-critical and the premium is justified.
The result: the expensive frontier path now carries only a tenth of the load, while the predominant share runs on low-cost, EU-sovereign models. Pseudocode of the routing decision:
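```
def route(request):
    intent = classify(request)                # Ministral 3, EU region
    if intent in SIMPLE:
        return mistral_large_3(rag(request))  # EU sovereign workhorse
    elif intent == CONTRACT_COMPLEX:
        return claude_opus_47(request)        # Frontier only where needed
    # Assumption: other intents stay on the EU workhorse (not specified in the example)
    return mistral_large_3(rag(request))
```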
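The load split from the three steps above can be turned into a blended output price and compared with the all-Opus baseline. This is a rough sketch under two simplifying assumptions that are not in the example itself: each step's share of the load is treated as its share of output tokens, and input tokens are ignored.

```python
# (share of load, output price in USD per million tokens) for steps 1-3 above.
ROUTED_MIX = [
    (0.70, 0.40),   # Ministral 3
    (0.20, 1.50),   # Mistral Large 3
    (0.10, 25.00),  # Claude Opus 4.7
]
BASELINE_OUTPUT_PRICE = 25.00  # everything on Claude Opus 4.7


def blended_output_price(mix: list[tuple[float, float]]) -> float:
    """Weighted average output price; shares must sum to 1."""
    total_share = sum(share for share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("load shares must sum to 1")
    return sum(share * price for share, price in mix)


if __name__ == "__main__":
    blended = blended_output_price(ROUTED_MIX)
    print(f"Blended output price: {blended:.2f} USD per million tokens")
    print(f"Baseline is {BASELINE_OUTPUT_PRICE / blended:.1f}x the routed price")
```

Under these assumptions the blended price comes to about 3.08 USD per million output tokens, roughly an eighth of the baseline, and the frontier step still accounts for most of the remaining bill.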
So that routing does not become a vendor trap, the research recommends a portability layer (such as LiteLLM or OpenRouter for multi-provider routing) and at least a thin open-weight migration path for the most important workloads - plus a continuous eval pipeline against a held-out test set, because closed-API updates happen automatically and silent capability regressions have been documented several times.
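The continuous eval pipeline mentioned above boils down to re-scoring each routed model against a held-out test set and flagging drops against a stored baseline. A minimal sketch, assuming exact-match scoring and a hypothetical `predict` callable that wraps whichever provider is under test; the 2-point tolerance is an arbitrary illustration.

```python
from typing import Callable, Sequence


def accuracy(
    predict: Callable[[str], str],
    test_set: Sequence[tuple[str, str]],
) -> float:
    """Exact-match accuracy of `predict` over (prompt, expected) pairs."""
    if not test_set:
        raise ValueError("test set must not be empty")
    correct = sum(
        1 for prompt, expected in test_set if predict(prompt).strip() == expected
    )
    return correct / len(test_set)


def has_regressed(baseline: float, current: float, tolerance: float = 0.02) -> bool:
    """Flag a regression when the score drops more than `tolerance` below baseline."""
    return current < baseline - tolerance


if __name__ == "__main__":
    # Stub standing in for a real model call; labels are made up.
    answers = {"refund request": "billing", "reset password": "account"}
    held_out = [("refund request", "billing"), ("reset password", "account")]
    score = accuracy(lambda prompt: answers[prompt], held_out)
    print(score, has_regressed(baseline=0.95, current=score))
```

Running such a check on a schedule, per provider and per workload, is what turns a silent closed-API update into a visible alert.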
For agencies and B2B decision-makers
A well-thought-out LLM router is not a nice-to-have in 2026 but the lever on which an agent's cost, answer quality and GDPR compliance all hang simultaneously. The right architecture is almost always hybrid - the actual work lies in deciding cleanly, per workload, which step needs frontier quality, which can run on a low-cost EU model and where sovereignty is binding. This is exactly the routing logic, vendor portability and the appropriate sovereign-EU or closed-frontier mix that Blck Alpaca, as a Vienna-based agency for AI agents, designs together with DACH companies. If you want to know which model mix fits your workloads, your compliance profile and your budget, talk to us.
FAQ
What is an LLM router and why does an agent need one?
When is a large frontier model worth it compared to a small one?
Is open-source LLM good enough for production agents in 2026?
What role do EU hosting and sovereignty play in model choice?
What does wrong routing actually cost?