Interactive Explainer

How LLMs Work

From raw text to intelligent-sounding predictions — scroll through each stage of a large language model, and interact with every step along the way.

[Interactive demo: next-token probabilities for "The cat sat on the", with candidate tokens mat, roof, warm, floor, and bed shown as live probability bars. Scroll to begin ↓]

Step 1: Tokenization

Before a language model can understand your text, it needs to break it into small pieces called tokens. But here's the thing — the model doesn't split on words the way you'd expect. It uses an algorithm called byte-pair encoding (BPE) that finds the most efficient way to represent text using a fixed vocabulary of about 50,000 subword pieces.

Common words like "the" and "is" get their own token. But rarer words get split into pieces: "unforgettable" might become ["un", "forget", "table"]. This is why LLMs can handle any word — even made-up ones — without needing an infinitely large vocabulary. Every token maps to a unique integer ID (like "the" → 464 in GPT-2).

BPE works by starting with individual characters and repeatedly merging the most common adjacent pairs. After thousands of merges, you get a vocabulary that balances character-level flexibility against word-level compactness. English text averages about 4 characters per token.

This has real consequences. Tokenization is why models struggle to count letters in words (they see "strawberry" as ["straw", "berry"], not individual characters), why some languages cost more tokens than others, and why code sometimes tokenizes strangely. Try it below — switch between Simple Split and BPE to see the difference:

[Interactive tokenizer: "The quick brown fox jumps over the lazy dog." splits into 10 tokens (~4.4 chars/token) against a 50,257-token vocabulary. Token IDs: The [1858], quick [2068], brown [7586], fox [21831], jumps [23660], over [625], the [464], lazy [16931], dog [9703], . [13]. Tokens are color-coded as full word, subword, punctuation, or number.]
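The merge loop described above can be sketched in a few lines of Python. This is a toy version of BPE training on a hypothetical five-word corpus, not the production GPT-2 tokenizer, but the algorithm is the same: count adjacent pairs, merge the most frequent one, repeat.

```python
from collections import Counter

def bpe_train(words, num_merges):
    # Start from characters: each word is a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Tiny illustrative corpus: "we" is the most common adjacent pair,
# so it becomes the first learned merge.
merges, vocab = bpe_train(["low", "lower", "lowest", "newer", "newest"], 6)
```

Run on a real corpus for tens of thousands of merges and the learned merge list becomes the vocabulary the tokenizer uses at inference time.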

Step 2: Embeddings

Each token gets converted into an embedding — a dense vector of hundreds or thousands of numbers that encodes the token's meaning. GPT-3's embeddings are 12,288 dimensions long. These aren't hand-crafted; the model learns them during training by adjusting the numbers to minimize prediction errors.

Here's the fascinating part: words with similar meanings end up near each other in this high-dimensional space. "Cat" and "dog" are close together. "King" and "queen" are close. But "cat" and "democracy" are far apart. The space organizes itself into semantic neighborhoods — without anyone telling it to.

Even more remarkably, the directions in embedding space encode relationships. The vector from "man" to "king" points in roughly the same direction as the vector from "woman" to "queen". This means you can do vector arithmetic: king − man + woman ≈ queen. The model has discovered gender, royalty, tense, and hundreds of other concepts — just from predicting text.

Below is a 2D projection of a real embedding space. Explore the clusters, try vector arithmetic, and check how similar different word pairs are:

[Interactive scatter plot: a 2D projection (Dimension 1 vs Dimension 2) of a real embedding space, with clusters of animals (cat, dog, fish, bird, horse, mouse, whale, eagle), royalty (king, queen, prince, princess, throne, crown, kingdom), food (apple, banana, bread, pizza, rice, steak, sushi), technology (computer, phone, laptop, tablet, server, robot, internet), emotions (happy, sad, angry, love, fear, joy, hope), and actions (run, walk, jump, swim, fly, climb, dance).]

Hover to see nearest neighbors. Words with similar meanings cluster together.
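The vector arithmetic can be demonstrated directly. The 4-dimensional embeddings below are hypothetical toy values chosen for illustration (real models learn vectors with thousands of dimensions), but the mechanics of cosine similarity and the king − man + woman analogy are the same:

```python
import math

# Toy 4-dimensional embeddings (hypothetical, hand-picked for illustration).
emb = {
    "king":  [0.9, 0.8, 0.1, 0.2],
    "queen": [0.9, 0.1, 0.1, 0.2],
    "man":   [0.1, 0.8, 0.0, 0.1],
    "woman": [0.1, 0.1, 0.0, 0.1],
    "cat":   [0.0, 0.1, 0.9, 0.8],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Vector arithmetic: king - man + woman should land near queen.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
nearest = max(emb, key=lambda word: cosine(emb[word], target))
```

In a trained model the analogy is only approximate, but the nearest neighbor of the arithmetic result is still, famously, "queen".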

Step 3: Attention

This is where the magic happens. The attention mechanism is the core innovation of the transformer architecture — and arguably the most important idea in modern AI. It lets each token look at every other token in the sequence and decide which ones are relevant for understanding it.

Here's how it works mechanically: for each token, the model computes three vectors called Query, Key, and Value. The query asks "what am I looking for?", the key says "here's what I contain", and the value says "here's the information to pass along." Attention weights are computed by comparing each query against all keys, then used to create a weighted combination of values.

But there's a twist: modern models don't use just one attention pattern. They use multi-head attention — typically 32 to 128 heads running in parallel, each learning to detect different types of relationships. One head might track syntax (linking subjects to verbs). Another resolves pronouns ("she" → "Alice"). Another handles positional patterns. Together, they build a rich understanding of the input.

This is what allows the model to understand that "bank" means different things in "river bank" vs "bank approved the loan." Try the different sentences below — click a word to see its attention pattern, and toggle between heads to see different perspectives:

A simple sentence. Watch how 'sat' attends to 'cat' (who sat?) and 'mat' attends to 'on' (where?).

Click a word to see what it pays attention to

Select a word above to see attention weights
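The query/key/value mechanics described above fit in a short function. This is a single-head, pure-Python sketch of scaled dot-product attention; real implementations use learned projection matrices and batched tensor math, which are omitted here:

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention for a single head.
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Compare this query ("what am I looking for?") against every key
        # ("here's what I contain"), scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)    # attention weights, summing to 1
        # Weighted combination of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# A query aligned with the first key pulls in the first value almost entirely.
out = attention([[10.0, 0.0]],
                [[10.0, 0.0], [0.0, 10.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```

Multi-head attention simply runs several copies of this function, each with its own learned projections, and concatenates the results.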

Step 4: The Transformer Stack

A single round of attention isn't enough. Modern LLMs stack dozens to over a hundred transformer blocks, each one refining the model's understanding. GPT-3 has 96 layers. GPT-4 is estimated to have even more. Each layer consists of the same four operations: multi-head attention, add & normalize, feed-forward network, add & normalize.

What makes this work is the residual connection — a seemingly simple idea with profound consequences. Instead of replacing the input with the attention output, the model adds the attention output to the original input. This creates a "skip highway" that lets information flow through all 96 layers without degrading. Without residuals, training deep networks is nearly impossible.

The feed-forward network in each block is where most of the model's "knowledge" lives. It's a simple two-layer neural network, but when you multiply it by 96 layers with billions of parameters, it stores an enormous amount of factual associations, linguistic patterns, and reasoning heuristics. Researchers have found that specific neurons activate for specific concepts.

As information flows through the layers, it transforms from shallow surface patterns (punctuation, word shapes) to deep semantic understanding (reasoning, world knowledge). Explore the pipeline below, toggle residual connections on and off, and use the depth slider to see what the model has learned at each level:

Inside one transformer block:

[Interactive diagram: input from the previous layer flows through multi-head attention, add & normalize, feed-forward network, and add & normalize before passing to the next layer.]

What the model learns at each depth:

[Depth slider, layer 1 to layer 96. At ~layer 24 (semantic meaning): word meaning in context, basic world knowledge, entity types. Examples: bank (river) vs bank (finance), Paris → France, Einstein → physics.]
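The block structure (attention, add & normalize, feed-forward, add & normalize) can be sketched per token vector. This is a simplified, pure-Python skeleton: the sublayers are passed in as plain functions, whereas a real block operates on whole sequences with learned weight matrices. The key detail is that each sublayer's output is added to its input, not substituted for it:

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and unit variance.
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def transformer_block(x, attention_fn, feed_forward_fn):
    # Residual connection: ADD the sublayer output to the input.
    a = attention_fn(x)
    x = layer_norm([xi + ai for xi, ai in zip(x, a)])
    f = feed_forward_fn(x)
    x = layer_norm([xi + fi for xi, fi in zip(x, f)])
    return x

def run_stack(x, blocks):
    # The same structure repeated, e.g. 96 times in GPT-3.
    for attn, ffn in blocks:
        x = transformer_block(x, attn, ffn)
    return x

# With zero-output sublayers, the residual path alone carries the input
# through: information survives the block instead of being replaced.
h = transformer_block([1.0, 2.0, 3.0],
                      lambda v: [0.0, 0.0, 0.0],
                      lambda v: [0.0, 0.0, 0.0])
```

Because of the addition, even a sublayer that contributes nothing leaves the input intact, which is exactly the "skip highway" that makes very deep stacks trainable.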

Step 5: Next-Token Prediction

After passing through all those transformer layers, the model's final job is deceptively simple: predict the next token. The last layer's output — a vector of 12,288 numbers — gets projected to vocabulary size (50,257 for GPT-2) through a linear transformation, producing a raw score (logit) for every possible next token.

These logits are then passed through a softmax function that converts them into a probability distribution — every token gets a probability between 0 and 1, and they all sum to 1. For factual prompts like "The capital of France is", the distribution is extremely sharp — "Paris" gets 92% of the probability mass. For open-ended prompts like "The meaning of life is", it's much flatter — many tokens are plausible.

This shape — sharp vs flat — is quantified by a measure called entropy. Low entropy means the model is confident. High entropy means many tokens are equally plausible and the model is "unsure." Notice how different prompt types produce radically different distributions:

Try the different prompts below. Pay attention to the long tail — even the top 10 predictions rarely account for 100% of the probability. The other 50,000+ tokens each get a tiny sliver:

Choose a prompt: "The capital of France is"

Top predictions:
1. Paris (92.0%)
2. a (2.2%)
3. the (1.5%)
4. located (0.8%)
5. known (0.6%)
6. one (0.4%)
7. not (0.3%)
8. also (0.2%)
9. home (0.2%)
10. called (0.1%)

Distribution: sharp (confident), H = 0.6 bits

The model assigns a probability to every possible next token. Higher bars = more likely.
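Both the softmax step and the entropy measure are a few lines each. The logit values below are made up for illustration; what matters is the shape of the resulting distributions:

```python
import math

def softmax(logits):
    # Convert raw scores into probabilities that sum to 1.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy_bits(probs):
    # Shannon entropy in bits: low = confident, high = unsure.
    return -sum(p * math.log2(p) for p in probs if p > 0)

sharp = softmax([8.0, 2.0, 1.0, 0.5])   # one logit dominates (a "Paris" case)
flat = softmax([1.0, 0.9, 0.8, 0.7])    # logits nearly equal (open-ended)
```

The sharp distribution concentrates almost all probability mass on one token and has entropy near zero; the flat one spreads mass across all four tokens and approaches the 2-bit maximum for a 4-token vocabulary.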

Step 6: Temperature & Sampling

You've seen how the model produces a probability distribution over next tokens. But how does it choose which one to use? This is where temperature comes in — the one parameter that you, the user, actually get to control.

Temperature works by scaling the logits (raw scores) before the softmax function. Mathematically, it's simple: divide every logit by the temperature value T. Low T (like 0.2) makes the differences between logits larger, sharpening the distribution so the top token dominates. High T (like 1.5) compresses the differences, flattening the distribution so more tokens have a chance of being selected.

At temperature 0 (or very close to it), the model always picks the single most probable token — this is called greedy decoding. The output is completely deterministic: same input, same output, every time. At temperature 1.0, the probabilities are used as-is. Above 1.0, you start getting increasingly surprising and creative — but potentially incoherent — output.

In practice, most APIs also offer top-k and top-p (nucleus) sampling to further shape the distribution. Top-k only considers the k most likely tokens. Top-p keeps the smallest set of tokens whose cumulative probability exceeds a threshold. Both prevent the model from picking extremely unlikely tokens. Drag the slider below to see how temperature reshapes the distribution in real time:

[Temperature slider, 0 to 2.0, set to 0.70 (Balanced): the default setting for most tasks, producing natural-sounding text. Resulting distribution: the (78.1%), a (16.2%), an (2.5%), this (1.2%), that (0.8%), one (0.4%), my (0.3%), his (0.1%), her (0.1%), their (0.1%), with our, some, every, no, and another each near 0.0%.]

Same prompt, different temperatures:

T=0.2 (Focused)

The cat sat on the warm windowsill, watching the rain fall gently against the glass. It had been a long day, and the soft patter of raindrops was the only sound in the quiet house.

T=0.8 (Balanced)

The cat perched on the dusty windowsill, eyes tracking a pigeon on the fire escape. Thunder rumbled somewhere over Brooklyn, and she flicked her tail twice — a private verdict on the weather.

T=1.5 (Creative)

The cat dissolved into the windowsill like butter on warm asphalt. Rain? No — the sky was weeping crystallized jazz notes. She blinked sideways at Tuesday and decided gravity was optional.
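The "divide every logit by T" rule is genuinely this simple in code. The logit values here are invented for illustration; the sharpening and flattening behavior is the general mechanism:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def apply_temperature(logits, T):
    # Scale logits by 1/T before softmax: T < 1 sharpens the
    # distribution, T > 1 flattens it.
    return softmax([l / T for l in logits])

logits = [4.0, 2.5, 1.0, 0.5]
cold = apply_temperature(logits, 0.2)   # near-greedy: top token dominates
warm = apply_temperature(logits, 1.5)   # flatter: more tokens plausible
```

At T = 0.2 the top token takes essentially all of the probability mass; at T = 1.5 the same logits leave a substantial share for the alternatives, which is where the "creative" behavior comes from.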

Step 7: The Generation Loop

Here's how it all comes together. To generate text, the model takes your prompt, runs the entire pipeline (tokenize → embed → attend → transform → predict → sample), selects one token, appends it to the sequence, and repeats the entire process from scratch. Each new token is fed back in as input. This is called autoregressive generation — the output feeds back into the input.

This means that for a 100-token response, the model runs the full inference pipeline 100 times. Each pass is independent — the model has no "memory" of previous computations beyond what's in the token sequence itself. Every new token requires re-processing the entire sequence from the beginning (though in practice, KV caching avoids redundant computation for earlier tokens).

At each step, there's not just one possible next token — there are thousands of plausible continuations. The model picks one based on the probability distribution and temperature settings. A different random seed or slightly different temperature would produce a completely different text. The "ghost" alternatives you'll see below represent paths not taken — parallel universes of text that could have been.

Generation continues until the model produces a special end-of-sequence (EOS) token, or until it hits a maximum length. Watch the process step by step below — and try clicking alternative tokens to explore how a single different choice cascades into a completely different continuation:

[Interactive generation: "The cat sat", step 1 of 15. The current token "sat" was sampled at 35%, with clickable alternatives shown at each step and an end-of-sequence probability of 0.1%.]

Each step: the model reads everything so far, predicts the next token, appends it, and repeats. Click an alternative to explore a different path.
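The loop itself is short. In the sketch below, a hypothetical toy bigram table stands in for the real network (which would run the full tokenize → embed → attend → transform → predict pipeline inside `predict_fn`); everything else is the actual autoregressive structure:

```python
import random

def generate(prompt_tokens, predict_fn, max_tokens=20, eos="<eos>", seed=0):
    # Autoregressive loop: predict, sample, append, repeat.
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        vocab, probs = predict_fn(tokens)          # full pipeline, every step
        token = rng.choices(vocab, weights=probs)[0]
        if token == eos:                           # stop on end-of-sequence
            break
        tokens.append(token)
    return tokens

# Hypothetical stand-in for the model: a fixed bigram table that only
# looks at the last token. A real LLM conditions on the whole sequence.
BIGRAMS = {
    "The": (["cat", "dog"], [0.6, 0.4]),
    "cat": (["sat", "ran"], [0.5, 0.5]),
    "dog": (["sat", "ran"], [0.5, 0.5]),
    "sat": (["<eos>"], [1.0]),
    "ran": (["<eos>"], [1.0]),
}

def toy_predict(tokens):
    return BIGRAMS.get(tokens[-1], (["<eos>"], [1.0]))

out = generate(["The"], toy_predict)
```

Changing the seed changes which branch gets sampled at each step, which is exactly the "parallel universes of text" effect described above.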

How It Learned: Training

Everything you've seen so far is inference — what happens when you use a trained model. But how did the model learn all this? Training is where billions of parameters get their values, and it happens in two major phases.

Phase 1: Pre-training. The model is shown hundreds of billions of tokens of text — web pages, books, code, Wikipedia, academic papers. For each token in the training data, the model tries to predict what comes next. When it's wrong, the error signal flows backward through the network via backpropagation, nudging all 175 billion parameters slightly in the direction that would have produced a better prediction. Repeat this trillions of times.

Phase 2: Fine-tuning (RLHF). The pre-trained model can continue text, but it doesn't follow instructions well. Fine-tuning with Reinforcement Learning from Human Feedback transforms it into a helpful assistant. Human evaluators rank model outputs, a reward model learns their preferences, and the LLM is trained to maximize that reward. This is what makes it answer questions instead of just continuing text.
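The Phase 1 update can be sketched in miniature. For a softmax output, the gradient of the cross-entropy loss with respect to the logits is simply (probabilities − one-hot target); real training backpropagates that signal through billions of parameters, whereas this toy example updates a 3-token logit vector directly:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def train_step(logits, target, lr=0.5):
    # Cross-entropy loss on the correct next token...
    probs = softmax(logits)
    loss = -math.log(probs[target])
    # ...and its gradient w.r.t. the logits: probs minus one-hot(target).
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Gradient descent: nudge toward a better prediction.
    new_logits = [l - lr * g for l, g in zip(logits, grad)]
    return new_logits, loss

logits = [0.0, 0.0, 0.0]   # initially uniform: the model has no idea
losses = []
for _ in range(50):
    logits, loss = train_step(logits, target=2)
    losses.append(loss)
```

After a few dozen of these nudges the probability of the correct token climbs toward 1 and the loss falls, which is the same dynamic the loss curve below shows at vastly larger scale.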

The loss curve below shows how prediction error decreases over training. Early on, the model is nearly random. As training progresses, it first learns common words, then grammar, then facts, then reasoning. The comparison panel shows the dramatic difference between a pre-trained base model and one that's been fine-tuned:

Training by the numbers (GPT-3 scale):
🧠 175B: 175 billion learned weights
📚 300B: 300 billion tokens of text
💰 $4.6M: in GPU compute time
⏱️ ~34 days: on 1,024 A100 GPUs

[Chart: training loss falling over ~500K training steps.]

What the model learns along the way:
Step 1K (common words): learns 'the', 'a', 'is', 'of', the most frequent tokens.
Step 10K (grammar basics): subject-verb agreement, basic sentence structure.
Step 50K (world knowledge): 'Paris is the capital of France', common facts emerge.
Step 150K (reasoning): can follow simple instructions, basic cause-and-effect.
Step 300K (near human-level): nuanced language, style adaptation, complex reasoning.

Pre-trained vs Fine-tuned

Prompt: What is the capital of France?
Pre-trained (base)

What is the capital of Germany? What is the capital of Spain? What is the capital of Italy?

Fine-tuned (RLHF)

The capital of France is Paris. Paris is the largest city in France, located in the north-central part of the country along the Seine River.

What It Can't Do

Understanding how LLMs work also means understanding their fundamental limitations. These aren't bugs to be fixed — they're inherent to how the technology works. A next-token predictor trained on text patterns will always have certain blind spots.

Hallucinations. LLMs generate text that is statistically plausible, not factually verified. If a confident-sounding answer fits the pattern, the model will produce it — even when it's completely fabricated. It might invent people, cite papers that don't exist, or state false facts with absolute confidence. The model has no internal mechanism to distinguish what it "knows" from what it's making up.

Limited context. Every model has a finite context window — a maximum number of tokens it can process at once. Anything beyond that window is completely invisible. In a long conversation, the model literally cannot see what was said earlier. Newer models have expanded this dramatically (up to 200K tokens), but it's still a hard limit.

Reasoning failures. LLMs are surprisingly bad at certain tasks that seem easy to humans. Counting letters in words, basic arithmetic, logic puzzles — these often trip them up because the model is pattern-matching, not actually computing. The classic bat-and-ball problem fools most LLMs because the wrong answer appears far more often in training data than the right one. Explore these limitations interactively below:

Prompt:

Tell me about the 2019 Nobel Prize in Mathematics.

LLM Response (the fabricated details appear as interactive highlights):

"The 2019 [ ] was awarded to [ ] for her groundbreaking work on topological quantum field theory and its applications to number theory. Her [ ] was described by the committee as 'a once-in-a-century achievement.'"

Click the highlighted phrases to reveal what's wrong, starting with the fact that there is no Nobel Prize in Mathematics.

Why does this happen? The model generates text that is statistically plausible — it sounds right because it matches patterns from training data. But it has no way to verify facts against reality. It's predicting likely next tokens, not looking things up.

The Big Picture

Let's zoom out. Every time ChatGPT writes a sentence, Claude answers a question, or Copilot suggests code, this entire pipeline runs — for every single token. Text comes in, gets tokenized, embedded, processed through attention layers, and a single next token comes out. Then it loops. A 500-word response is roughly 670 tokens, so it requires roughly 670 complete passes through the entire network.

The "intelligence" isn't in any one step — it's in the billions of trained parameters that shape how attention flows, how representations transform layer by layer, and which tokens get predicted. An LLM is, at its core, the world's most sophisticated autocomplete. But "just autocomplete" undersells it — the depth of pattern recognition across 96 layers, trained on most of the internet, produces emergent capabilities that continue to surprise researchers.

What's remarkable is that the entire architecture is built from simple, differentiable operations — matrix multiplications, additions, softmax. There's no special "understanding" module, no knowledge database, no explicit reasoning engine. Just prediction pressure, scale, and data. Whether that constitutes genuine understanding or merely the most convincing imitation of it — that's a question for philosophers.

Click any stage in the pipeline below to jump back to that section. And remember: every token you've read in this explainer could have been generated by the very process it describes.

Inference vs Training:

Inference is what happens when you chat with an LLM. The weights are frozen — the model reads your prompt, runs the pipeline once per token, and generates a response.

~360 GFLOPs per token · ~30 ms per token · ~$0.002 per 1K tokens

Frequently Asked Questions

What is a large language model (LLM)?

A large language model is a type of artificial intelligence trained on massive amounts of text data. It learns patterns in language and can generate human-like text by predicting the most likely next word (token) given a sequence of input words. Examples include GPT-4, Claude, Llama, and Gemini.

What is tokenization in NLP?

Tokenization is the process of breaking text into smaller units called tokens. Modern LLMs use byte-pair encoding (BPE), which splits text into subword pieces — common words stay whole ('the' → [the]) while rare words split into fragments ('unforgettable' → [un, forget, table]). This lets the model handle any text with a fixed vocabulary of ~50,000 tokens.

What are embeddings in AI?

Embeddings are numerical representations of tokens as high-dimensional vectors — lists of hundreds or thousands of numbers. Words with similar meanings end up close together in this vector space. Famously, vector arithmetic works: king - man + woman ≈ queen. The model learns these representations during training.

How does the attention mechanism work?

The attention mechanism lets each token in a sequence look at every other token to understand context. It computes query, key, and value vectors for each token, then calculates attention weights that determine how much each token should 'pay attention to' others. Multiple attention heads run in parallel, each learning to detect different patterns — syntax, coreference, semantic similarity, and more.

What is next-token prediction?

Next-token prediction is the core task that LLMs are trained on. Given a sequence of tokens, the model outputs a probability distribution over all possible next tokens (often 50,000+). During generation, one token is selected from this distribution, appended to the sequence, and the process repeats. This is called autoregressive generation.

What is temperature in AI text generation?

Temperature is a parameter that controls randomness in text generation. It scales the logits (raw scores) before the softmax function. Low temperature (e.g., 0.2) sharpens the distribution, making the model pick the most probable tokens for predictable, factual output. High temperature (e.g., 1.5) flattens the distribution, producing more creative and varied — but potentially less coherent — output.

How are LLMs trained?

LLMs are trained in two main phases. Pre-training exposes the model to billions of tokens of text from the internet, books, and code — the model learns to predict the next token and in doing so absorbs grammar, facts, and reasoning patterns. Fine-tuning (often using RLHF — reinforcement learning from human feedback) then aligns the model to follow instructions and be helpful, harmless, and honest.

What is RLHF (Reinforcement Learning from Human Feedback)?

RLHF is a technique used to fine-tune LLMs after pre-training. Human evaluators rank model outputs by quality, and a reward model is trained on these preferences. The LLM is then fine-tuned using reinforcement learning to maximize the reward model's score. This is what transforms a base model (which just continues text) into a helpful assistant that follows instructions.

Why do LLMs hallucinate?

LLMs hallucinate because they generate text by predicting statistically likely next tokens, not by looking up facts. If a plausible-sounding answer exists in the pattern space, the model will produce it confidently — even if it's wrong. The model has no mechanism to verify claims against reality. It's extrapolating patterns, not reasoning from a knowledge base.

What is a context window in LLMs?

The context window is the maximum number of tokens an LLM can process at once. Early models had 1K-4K token windows; modern models like GPT-4 Turbo (128K) and Claude 3.5 (200K) can handle much longer inputs. Anything beyond the context window is invisible to the model — it simply can't reference earlier text.

What is top-p (nucleus) sampling?

Top-p sampling (also called nucleus sampling) is an alternative to temperature for controlling generation randomness. Instead of considering all tokens, it only considers the smallest set of tokens whose cumulative probability exceeds a threshold p (e.g., 0.9). When the model is confident, fewer tokens pass the filter; when uncertain, more do. This adapts automatically to the situation.
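The nucleus filter described above is straightforward to implement. This is a sketch of the filtering step only (sampling from the filtered distribution would follow), with a made-up 4-token distribution for illustration:

```python
def top_p_filter(probs, p=0.9):
    # Keep the smallest set of highest-probability tokens whose cumulative
    # probability reaches p; zero out the rest and renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = set(), 0.0
    for i in order:
        kept.add(i)
        total += probs[i]
        if total >= p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0
            for i in range(len(probs))]

# Toy distribution: the 0.05 tail token is cut, the rest are renormalized.
filtered = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)
```

Note the adaptive behavior: a sharp distribution passes only one or two tokens through the filter, while a flat one passes many, which is why top-p often behaves better than a fixed top-k.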

Keep exploring