2.11 · Intermediate · 7 min

Temperature, Top-p and Sampling: Settings for Deterministic Agents

Blck Alpaca·9 June 2026

Definition

Temperature, Top-p and Top-k are sampling parameters that control how randomly an LLM selects the next token. Low values (temperature 0 to 0.2) make outputs far more repeatable and are the standard setting for tool calls and structured outputs; higher values increase variance and suit creative content.

Key Takeaways

✓ Temperature scales the probability distribution before sampling: values close to 0 sharpen it (near-deterministic), values above 1 flatten it (more variance, higher hallucination risk).
✓ Top-p (nucleus) and Top-k prune the candidate pool: they limit which tokens can be sampled at all, and are a sharper tool against outliers than temperature alone.
✓ For reliable agents the rule is: deterministic tool calls and JSON/structured outputs at temperature 0 to 0.2; reasoning and analysis at 0.3 to 0.5; creative content at 0.7 to 1.0.
✓ Full determinism is rarely guaranteed in practice: GPU floating-point, batching and MoE routing produce residual variance even at temperature 0. Reproducibility only comes with a seed plus a pinned model version.
✓ Stability beats cleverness: in production agents a reproducible, evaluable output is worth more than an occasionally brilliant but unpredictable one.

Temperature, Top-p and Top-k are sampling parameters that control how randomly a Large Language Model (LLM) selects the next token. Low values (temperature 0 to 0.2) make outputs far more repeatable and are the standard setting for deterministic tool calls and structured outputs; higher values increase variance and suit creative content. For building reliable agents, these settings are not a side issue but a central reliability lever.

Deterministic agents need low temperature. Tool calls, classification and JSON outputs run most stably at temperature 0 to 0.2.
Top-p and Top-k prune the candidate pool. They are a sharper tool against improbable outlier tokens than temperature alone.
True determinism cannot be taken for granted. Even at temperature 0, residual variance remains due to GPU effects and batching; reproducibility requires a seed plus a pinned model version.

How sampling works in an LLM

An LLM generates text token by token. At each step the model computes a probability distribution over the entire vocabulary (the so-called logits are converted into probabilities via softmax). Which token is actually emitted is decided by the sampling strategy. This is precisely where temperature, Top-p and Top-k intervene. They do not change what the model has learned, only how the concrete token is drawn from the learned distribution.

This distinction is crucial for agents: the same model, the same weights and the same prompt can, depending on the sampling settings, produce a cleanly parsed tool call one time and rambling prose the next. Whoever ignores sampling leaves the reliability of their agent to chance.

Temperature

Temperature scales the distribution before sampling. Mathematically, each logit is divided by the temperature value before the softmax is applied:

Temperature towards 0: The distribution becomes maximally sharp. The most probable token dominates; the behaviour approaches greedy decoding, i.e. the pure selection of the top token. Outputs become highly repeatable.
Temperature around 1.0: The distribution remains almost unchanged. The model samples with the learned probabilities.
Temperature above 1.0: The distribution flattens. Improbable tokens gain more weight. This increases diversity and creativity, but also the risk of incoherent outputs and hallucinations.
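The three regimes above can be reproduced with a few lines of code. The following is a minimal sketch, not any vendor's implementation: the function name and the logit values are made up for illustration, and temperature 0 is rejected because dividing by zero is undefined (in practice it is treated as greedy decoding).

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy decoding (argmax)")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the probabilities stay the same.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up logits for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the low temperature almost all probability mass sits on the first token; at the high temperature the three candidates move closer together.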

Top-p (nucleus sampling)

Top-p, also called nucleus sampling, works via pruning rather than scaling. At Top-p = 0.9, the model considers only the smallest set of tokens whose cumulative probability reaches at least 90 percent, and samples exclusively from this core (the nucleus). The long tail of improbable tokens is cut off entirely. Top-p is dynamic: in contexts with a clear continuation the pool stays small, with open-ended phrasings it grows.
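The dynamic behaviour of the nucleus is easiest to see on two toy distributions. This is a minimal sketch under stated assumptions: `top_p_filter` is a hypothetical helper, the probabilities are invented, and real inference stacks apply the same idea to the full vocabulary.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    # Sort token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalise so the surviving tokens form a proper distribution again.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

peaked = [0.92, 0.05, 0.02, 0.01]        # clear continuation
flat = [0.28, 0.24, 0.20, 0.16, 0.12]    # open-ended continuation

print(len(top_p_filter(peaked, 0.9)))  # 1 token in the nucleus
print(len(top_p_filter(flat, 0.9)))    # 5 tokens in the nucleus
```

The same Top-p value yields a pool of one token in the peaked case and five in the flat case, which is exactly the adaptivity described above.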

Top-k

Top-k is the simplest form of pruning: it keeps only the k most probable tokens and discards the rest. Top-k = 1 corresponds to greedy decoding. Top-k is static (always the same number of candidates) and is now regarded as a coarser variant compared with adaptive Top-p. Some vendors and inference stacks expose Top-k, others rely primarily on temperature and Top-p.
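For comparison, a Top-k filter can be sketched in the same style. The helper name and the probabilities are again assumptions for illustration only.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution over four tokens
print(top_k_filter(probs, 1))  # {0: 1.0}: only the top token survives (greedy)
print(top_k_filter(probs, 3))
```

Unlike Top-p, the pool size here is always k, regardless of how peaked or flat the distribution is.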

Why these settings determine an agent's reliability

An agent is not a chatbot that answers once. It executes multi-step workflows: it calls tools, parses their return values, plans next steps and hands structured data to downstream systems. In this chain, predictability matters more than brilliance. Three concrete failure modes show why:

Breaking structure: At high temperature the model may invent an additional field, omit a quotation mark or put prose before the JSON. The downstream parser breaks, the agent stops or runs into an error loop.
Unstable tool selection: An agent that chooses tool A one time and tool B the next on identical input is not testable. Low temperature makes the tool-routing decision reproducible.
Non-reproducible bugs: Bugs that only occur on certain sampling paths are hard to debug without determinism and cannot be measured stably in evaluations.
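The first failure mode can be made tangible with a strict parser on the receiving side. The sketch below is hypothetical: the single-field schema, the category names and the sample outputs are invented for illustration, and a production agent would typically validate against a full schema.

```python
import json

ALLOWED_CATEGORIES = {"SEO", "web design", "consulting"}  # hypothetical schema

def parse_tool_output(raw):
    """Parse model output strictly; raise ValueError on anything unexpected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"category"}:
        raise ValueError(f"unexpected shape: {data!r}")
    category = data["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return data

clean = '{"category": "SEO"}'
with_prose = 'Sure! Here is the result: {"category": "SEO"}'   # prose before the JSON
extra_field = '{"category": "SEO", "confidence": 0.9}'         # invented extra field

for raw in (clean, with_prose, extra_field):
    try:
        print("ok:", parse_tool_output(raw))
    except ValueError as err:
        print("rejected:", err)
```

Only the first output passes. The other two are the kind of deviations that become more likely as temperature rises, and each of them stops a downstream step.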

At the same time, there are legitimate cases for higher variance: generating content variants, brainstorming, creative text snippets or producing diverse synthetic test data. The art lies in choosing the appropriate profile per workflow step instead of applying one global value across the entire agent.

Parameter, effect and recommendation for agents

Parameter	Effect	Recommendation for agents
Temperature 0 to 0.2	Near-deterministic, most probable token dominates	Tool calls, function calling, JSON/structured outputs, classification, extraction, routing decisions
Temperature 0.3 to 0.5	Slight variance, coherent	Reasoning and analysis steps, RAG answers with source attribution, summaries
Temperature 0.7 to 1.0	High variance, creative	Creative content, headline/variant generation, brainstorming, synthetic training data
Temperature above 1.0	Very high spread, incoherence risk	Experimental only; avoid in production agents
Top-p (nucleus)	Prunes to the cumulative probability core	Leave at default (often 0.9 to 1.0); for controlled creativity, lower it at a moderate temperature instead of cranking up temperature
Top-k	Keeps only the k most probable tokens	Optional; where available as an additional outlier brake, otherwise default
Seed (where supported)	Pins the random stream	Set when reproducibility across runs is required (tests, evals, audits)

Important rule of thumb: Actively control only one of the two parameters, temperature or Top-p, and leave the other at the provider default. Aggressively changing both at the same time creates hard-to-understand interactions and makes results harder to compare.

The limit of determinism: why temperature 0 is not everything

A common misconception is that temperature 0 guarantees bit-identical outputs. In practice this is often not the case. Even in greedy mode, residual variance remains from several sources:

GPU floating-point: Parallel computations on GPUs are not bit-identical in every order. Minimal numerical differences can flip the token selection at close decision points.
Dynamic batching: If a request is batched together with other requests, the numerical result can shift slightly depending on the batch composition.
Mixture-of-experts routing: With MoE architectures (widespread as of 2026, for example with Mistral Large 3 at 675 billion parameters and 41 billion active, or with DeepSeek V4), a router decides which experts process a token. Routing effects can introduce additional variance.
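The floating-point effect can be illustrated without a GPU. The sketch below uses ordinary CPU double-precision arithmetic as a stand-in, which is an assumption: it shows the general principle that summation order changes the result, not the behaviour of any particular GPU kernel. The values and the helper `greedy_pick` are made up.

```python
# Three partial contributions to one logit, summed in two different orders.
contributions = [0.1, 0.2, 0.3]
left_to_right = (contributions[0] + contributions[1]) + contributions[2]
right_to_left = contributions[0] + (contributions[1] + contributions[2])

print(left_to_right == right_to_left)  # False: floating-point addition is not associative

def greedy_pick(logit_a, logit_b):
    """Greedy decoding between two candidate tokens: pick the larger logit."""
    return "A" if logit_a > logit_b else "B"

rival = 0.6  # a competing token whose logit is (mathematically) tied

print(greedy_pick(left_to_right, rival))  # A
print(greedy_pick(right_to_left, rival))  # B
```

The difference is in the last bit of the number, yet it is enough to flip a greedy decision when two candidates are nearly tied.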

From this follows the practical hierarchy of reproducibility: temperature 0 reduces the sampling variance, a pinned seed (provided the vendor offers it) makes the random stream repeatable, and only a frozen model version closes the gap against silent model updates. Closed-API models are updated by their vendors; version pinning in the configuration is therefore just as important for auditable agents as the temperature value. Whoever needs maximum reproducibility has the greatest leverage with self-hosted open-weight models on a pinned inference stack (such as vLLM, SGLang or TensorRT-LLM), because there both the weights and the runtime can be frozen.
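What a pinned seed contributes can be shown with a local toy sampler. This is a sketch under assumptions: it uses Python's standard `random` module rather than any vendor API, the distribution is invented, and it says nothing about how a given provider implements its seed parameter.

```python
import random

def sample_tokens(probs, n, seed):
    """Draw n token indices from probs using a private, seeded random stream."""
    rng = random.Random(seed)
    indices = list(range(len(probs)))
    return rng.choices(indices, weights=probs, k=n)

probs = [0.5, 0.3, 0.2]  # made-up next-token distribution
run_a = sample_tokens(probs, 10, seed=42)
run_b = sample_tokens(probs, 10, seed=42)

print(run_a == run_b)  # True: same seed, same random stream, same draws
```

The seed fixes only the random stream. If the distribution itself shifts, for example after a silent model update, the same seed produces different tokens, which is why the pinned model version remains necessary.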

Practical example: a lead-routing agent for an agency

Suppose a marketing agency builds an agent that classifies incoming contact requests and routes them to the right team. The workflow has three steps with three different profiles:

```text
Step 1: Classification (tool call):
temperature = 0
top_p = 1.0 (default)
Task: request -> {"category": "SEO|web design|consulting", "priority": "high|medium|low"}
Goal: identical input -> identical category, cleanly parsable JSON

Step 2: Fact-based summary (RAG):
temperature = 0.3
Task: condense relevant CRM/knowledge data into 3 sentences, without fabrication

Step 3: First draft of the reply email (creative):
temperature = 0.8
top_p = 0.9
Task: 3 stylistically different reply variants to choose from
```

The result is measurable: in an internal evaluation with 200 repeated test runs, Step 1 at temperature 0 delivers a stable, reproducible classification, so that the parsing error rate approaches zero and the tests are deterministic. If the same classification is run at temperature 0.8, the category assignment fluctuates on ambiguous requests, individual outputs contain explanatory prose before the JSON, and the parser breaks in a portion of cases. Step 3, by contrast, benefits from high temperature, because three identical email drafts would be worthless. This separation per step is the core of solid agent design.
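In code, such per-step profiles can live in one small configuration object instead of being scattered across calls. The following is one possible sketch: `SamplingProfile`, `PROFILES`, the step names and `request_params` are hypothetical, and the resulting dictionary would still have to be mapped to the parameter names of the inference API actually in use.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SamplingProfile:
    temperature: float
    top_p: Optional[float] = None  # None means: leave at the provider default

# One profile per workflow step, mirroring the three steps of the example.
PROFILES = {
    "classification": SamplingProfile(temperature=0.0, top_p=1.0),
    "summary": SamplingProfile(temperature=0.3),
    "email_draft": SamplingProfile(temperature=0.8, top_p=0.9),
}

def request_params(step):
    """Build the sampling part of a request for one workflow step."""
    profile = PROFILES[step]
    params = {"temperature": profile.temperature}
    if profile.top_p is not None:
        params["top_p"] = profile.top_p
    return params

print(request_params("classification"))  # {'temperature': 0.0, 'top_p': 1.0}
print(request_params("summary"))         # {'temperature': 0.3}
```

Keeping the values in one place also documents them per workflow step, so that evals can be rerun against exactly the settings used in production.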

For agencies and B2B teams

Whoever deploys agents in production should not treat sampling settings as a technical detail but as part of quality assurance. In practice that means: deterministic profiles for everything structured (tool calls, data extraction, routing), moderate temperature for analysis and RAG, high temperature only for deliberately creative steps. Document the values per workflow step, pin the model version and run an eval pipeline against a fixed test dataset so that regressions become visible before they surface in client use. As a Vienna-based agency for AI agents, Blck Alpaca supports DACH companies in configuring exactly these settings cleanly and turning them into reliable, auditable agent workflows.

FAQ

What does temperature 0 mean for an LLM?

Temperature 0 means greedy decoding: at each step the model selects the most probable token instead of sampling. This is the most deterministic mode and the standard for tool calls, classification and structured outputs. Note: because of GPU floating-point and batching effects, even temperature 0 is not always bit-identically reproducible in practice.

What is the difference between temperature and Top-p?

Temperature scales the entire probability distribution (how sharp or flat it is) before sampling. Top-p (nucleus sampling) prunes the candidate pool to the smallest set of tokens whose cumulative probability reaches p. Temperature controls the spread, Top-p caps the improbable outliers. Both are often combined, but should be set deliberately.

Should you change temperature and Top-p at the same time?

As a rule of thumb: actively control only one parameter and leave the other at the provider default. Aggressively lowering both at the same time creates hard-to-predict interactions. For deterministic agents, temperature 0 to 0.2 with Top-p at default is usually sufficient. For controlled creativity, steering via Top-p at a moderate temperature is often more precise.

Does temperature 0 make an agent fully deterministic?

No, not guaranteed. Temperature 0 removes the sampling randomness, but residual variance arises from non-deterministic GPU operations, dynamic batching and, with mixture-of-experts models, from routing. True reproducibility additionally requires a pinned seed (provided the vendor supports it) and a frozen model version.

Which temperature is right for structured JSON outputs?

For JSON, function calling and schema-bound outputs, temperature 0 to 0.2 is the standard. This keeps the structure stable and minimises parsing errors. Even more reliable are vendor-side structured-output or constrained-decoding modes (available as of 2026 from vendors such as Anthropic, OpenAI and Google), which enforce the schema rather than relying on a low temperature to keep it intact.
