What Is Temperature in AI Text Generation?

Temperature controls how creative or predictable AI text is. Learn the math behind it, how top-k and top-p sampling work, and when to use which setting.

When you use ChatGPT, Claude, or any language model, there's a hidden knob controlling how creative the output is. It's called temperature, and understanding it turns you from a casual user into someone who can precisely control AI output.

What the model actually outputs

After processing your prompt through dozens of transformer layers, an LLM doesn't output words. It outputs a vector of raw scores — one number for every token in its vocabulary (~100,000 numbers). These raw scores are called logits.

For the prompt "The capital of France is," the logits might look like:

| Token | Raw Logit |
|-------|-----------|
| Paris | 8.2 |
| the | 3.1 |
| a | 2.8 |
| located | 2.3 |
| Lyon | 1.9 |
| known | 1.5 |

These logits need to be converted into probabilities. That's where softmax comes in — and temperature modifies the logits before softmax is applied.

The math: A concrete example

The temperature formula:

adjusted_logits = logits / temperature
probabilities = softmax(adjusted_logits)

Let's run the numbers for three temperatures, treating these six tokens as the entire vocabulary for simplicity (a real model normalizes over all ~100,000 tokens, which shifts a little more probability to the tail):

Temperature = 0.2 (very focused): Divide all logits by 0.2 → Paris becomes 41.0, the rest become 15.5, 14.0, 11.5... After softmax: Paris ≈ 100%, everything else ≈ 0%

Temperature = 1.0 (neutral): Logits unchanged → After softmax: Paris = 98.4%, the = 0.6%, a = 0.4%, located = 0.3%, Lyon = 0.2%...

Temperature = 1.5 (creative): Divide by 1.5 → Paris becomes 5.47, the becomes 2.07... After softmax: Paris = 90.4%, the = 3.0%, a = 2.5%, located = 1.8%, Lyon = 1.4%...

See the pattern? Low temperature amplifies the gap between the top choice and everything else. High temperature compresses it, giving lower-ranked tokens a real shot.

At temperature = 0, dividing by zero is undefined, so implementations treat it as a special case: the model always picks the single highest-logit token (an argmax). This is called greedy decoding — perfectly deterministic, zero randomness.
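The arithmetic above can be checked in a few lines of Python. This is a minimal sketch that, like the worked example, restricts the softmax to the six example tokens — a real model normalizes over its full vocabulary, so exact percentages differ:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max so exp() never overflows
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["Paris", "the", "a", "located", "Lyon", "known"]
logits = [8.2, 3.1, 2.8, 2.3, 1.9, 1.5]

for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{tok}={p:.1%}" for tok, p in zip(tokens, probs)))
```

As T → 0 the distribution collapses onto the argmax (greedy decoding), which is why implementations special-case T = 0 instead of actually dividing by zero.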

Beyond temperature: Top-k and top-p sampling

Temperature isn't the only way to control output randomness. Modern LLMs typically combine it with two other techniques:

Top-k sampling

After applying temperature, keep only the k highest-probability tokens and renormalize their probabilities so they sum to 1. Everything else gets zero probability.

  • k = 1: Same as greedy decoding (only the top token)
  • k = 50: The model samples from the 50 most likely tokens
  • k = 500: Very diverse output, rare tokens can appear

The problem with top-k: for some predictions, the model is very confident and only 3-4 tokens make sense. For others, 200 tokens are all reasonable. A fixed k doesn't adapt.
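A top-k filter can be sketched in a few lines (an illustrative toy over a plain probability list, not any particular library's API):

```python
def top_k_filter(probs, k):
    """Zero out everything outside the k highest-probability tokens, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

# k = 2 keeps only the two most likely tokens and renormalizes their mass:
print([round(x, 3) for x in top_k_filter([0.5, 0.3, 0.1, 0.06, 0.04], 2)])
# → [0.625, 0.375, 0.0, 0.0, 0.0]
```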

Top-p (nucleus) sampling

Instead of a fixed count, keep the smallest set of tokens whose cumulative probability exceeds p. This adapts to the model's confidence:

  • If the model is 95% sure of one token and p = 0.9, only that token is selected
  • If probability is spread across 50 tokens, all 50 might be included

p = 0.9 (common default) means: take the top tokens until you've covered 90% of the probability mass, then sample from that set.

Top-p is generally preferred over top-k because it naturally adapts. When the model is confident, the nucleus is small. When it's uncertain, the nucleus expands.
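The same toy setup shows the adaptive behavior. This sketch keeps adding tokens in probability order until the cumulative mass reaches p:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in nucleus)
    return [probs[i] / total if i in nucleus else 0.0 for i in range(len(probs))]

confident = [0.95, 0.03, 0.01, 0.01]  # one dominant token
uncertain = [0.30, 0.25, 0.25, 0.20]  # mass spread evenly

# Same p = 0.9, very different nucleus sizes:
print(sum(x > 0 for x in top_p_filter(confident, 0.9)))  # → 1
print(sum(x > 0 for x in top_p_filter(uncertain, 0.9)))  # → 4
```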

How they work together

In practice, many APIs let you set all three: temperature, top-k, and top-p. When set, they're applied in sequence:

  1. Divide logits by temperature
  2. Apply softmax to get probabilities
  3. Filter to top-k tokens (if set)
  4. Filter to top-p nucleus (if set)
  5. Sample from the remaining distribution

Most practitioners set temperature + top-p and leave top-k unset. A common high-quality configuration: temperature = 0.7, top-p = 0.9.
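The five steps above can be sketched end-to-end in plain Python. This is an illustrative toy, not a production decoder — real implementations operate on tensors and usually filter logits before the softmax, which is equivalent after renormalization:

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=None, top_p=None, rng=random):
    # 1. Temperature scaling
    scaled = [x / temperature for x in logits]
    # 2. Softmax (max-subtracted for numerical stability)
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token indices from most to least likely
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # 3. Top-k filter (if set)
    if top_k is not None:
        ranked = ranked[:top_k]
    # 4. Top-p nucleus filter (if set)
    if top_p is not None:
        nucleus, cumulative = [], 0.0
        for i in ranked:
            nucleus.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = nucleus
    # 5. Sample from the surviving tokens (choices() renormalizes the weights)
    return rng.choices(ranked, weights=[probs[i] for i in ranked])[0]

# The common configuration from above: temperature = 0.7, top-p = 0.9, top-k unset
logits = [8.2, 3.1, 2.8, 2.3, 1.9, 1.5]
print(sample_next_token(logits, temperature=0.7, top_p=0.9))
# → 0 (with these logits the nucleus collapses to just "Paris")
```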

Decoding strategies compared

Temperature and sampling aren't the only ways to generate text. Here's how the major strategies compare:

| Strategy | How It Works | Pros | Cons |
|----------|-------------|------|------|
| Greedy (T=0) | Always pick the top token | Fast, deterministic | Repetitive, can miss better sequences |
| Beam search | Track top-N candidate sequences | Finds high-probability sequences | Tends toward generic, short outputs |
| Sampling (T>0) | Randomly sample from distribution | Natural, diverse text | Can go off-rails at high T |
| Top-p sampling | Sample from probability nucleus | Adaptive randomness | Slightly slower than greedy |
| Speculative decoding | Draft with small model, verify with large | 2-3x faster inference | Complex to implement |

Beam search was dominant in early NLP (machine translation, summarization) but has fallen out of favor for general text generation. Modern chatbots almost universally use sampling with temperature + top-p.

When to use which temperature

| Task | Recommended T | Why |
|------|--------------|-----|
| Factual Q&A | 0.0 – 0.2 | One right answer, consistency matters |
| Code generation | 0.0 – 0.3 | Syntax errors from randomness are costly |
| Summarization | 0.3 – 0.5 | Faithful to source, slight variation |
| Conversational AI | 0.6 – 0.8 | Natural-sounding, avoids robotic repetition |
| Creative writing | 0.8 – 1.2 | Surprise and variety |
| Brainstorming | 1.0 – 1.5 | Maximize diversity of ideas |
| Deliberate weirdness | 1.5 – 2.0 | Experimental, expect incoherence |

The key insight

Temperature doesn't make the model smarter or dumber. It doesn't change what the model knows. It changes its risk tolerance when choosing among the options it sees.

At low temperature, the model plays it safe — always picking the most likely continuation. At high temperature, it takes creative risks, sometimes brilliantly, sometimes incoherently. The model's knowledge is identical; only its selection strategy changes.

This is why the same prompt with temperature 0 always gives the same output, while temperature 1.0 gives different outputs each time. The model "sees" the same probabilities — it just picks differently.

Try it yourself

Experiment with our interactive temperature slider — adjust the temperature in real time and watch the probability distribution shift from a sharp spike to a flat spread. See exactly which tokens gain or lose probability as you drag the slider.