
How LLM Training Works: From Raw Text to ChatGPT



Training a large language model costs millions of dollars, takes months on thousands of GPUs, and consumes enough electricity to power a small town. But the core idea is deceptively simple: predict the next word. Over and over. Trillions of times. Until patterns emerge that look like intelligence.

Here's how the full pipeline works.

Phase 1: Pre-training (the expensive part)

Pre-training is where the model learns language, facts, reasoning patterns, and world knowledge. The setup:

  1. Gather a massive corpus — web pages, books, code, scientific papers, Wikipedia, forums. GPT-3 was trained on ~300 billion tokens. GPT-4 likely used several trillion. Llama 2 used 2 trillion tokens.

  2. Feed the model text sequences and ask it to predict the next token at every position. For the sentence "The cat sat on the mat," the model gets:

    • Input: The → Predict: cat
    • Input: The cat → Predict: sat
    • Input: The cat sat → Predict: on
    • ...and so on for every position in every document.
  3. Compute the loss — how wrong was the prediction? The standard loss function is cross-entropy loss: the negative log probability the model assigned to the correct next token. If the model gave the right token a 90% probability, the loss is low. If it gave it 1%, the loss is high.

  4. Update the weights via backpropagation and an optimizer (typically AdamW). Each update nudges billions of parameters slightly to make better predictions next time.

  5. Repeat for hundreds of thousands of gradient steps.
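The loop above, shrunk to a single position with a toy five-word vocabulary, shows what step 3's cross-entropy looks like in practice; the vocabulary and probabilities here are invented for illustration:

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy for one position: -log p(correct next token)."""
    return -math.log(probs[target])

# Toy vocabulary; after "The cat sat on the", the correct next token is "mat".
vocab = {"cat": 0, "sat": 1, "on": 2, "the": 3, "mat": 4}

confident = [0.02, 0.02, 0.01, 0.05, 0.90]  # model puts 90% on "mat"
uncertain = [0.30, 0.30, 0.20, 0.19, 0.01]  # model puts only 1% on "mat"

print(cross_entropy(confident, vocab["mat"]))  # ≈ 0.105 (low loss)
print(cross_entropy(uncertain, vocab["mat"]))  # ≈ 4.605 (high loss)
```

In real training this loss is averaged over every position of every sequence in a batch, and the gradient of that average drives the AdamW update in step 4.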

The scale is staggering

Training GPT-4 reportedly involved:

  • ~1.8 trillion parameters across a mixture-of-experts architecture
  • ~13 trillion tokens of training data
  • ~25,000 NVIDIA A100 GPUs running for ~90-100 days
  • Estimated cost: $63-100 million in compute alone

Llama 2 70B (a smaller model) still required 1,720,320 GPU-hours — equivalent to running a single GPU for 196 years.

Learning rate schedules

You can't just set a single learning rate and forget it. Modern LLM training uses carefully designed schedules:

  • Warmup: Start with a tiny learning rate and linearly increase it over the first few thousand steps. This prevents the randomly initialized model from making destructive early updates.
  • Cosine decay: After warmup, gradually decrease the learning rate following a cosine curve. This lets the model make large adjustments early and fine-grained adjustments later.
  • Typical peak learning rates: on the order of 3e-4 for models up to a few billion parameters, lower (roughly 1e-4 to 1.5e-4) for the largest models, with warmup over about 2,000 steps.
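A minimal sketch of such a schedule, using the 3e-4 peak and 2,000-step warmup mentioned above; the total step count and the minimum-rate floor are assumptions for illustration:

```python
import math

def lr_schedule(step, peak_lr=3e-4, warmup_steps=2000,
                total_steps=100_000, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_schedule(0))        # 0.0      (start of warmup)
print(lr_schedule(2000))     # 3e-4     (peak, warmup complete)
print(lr_schedule(100_000))  # 3e-5     (fully decayed floor)
```

Frameworks provide this out of the box (e.g. cosine schedulers in PyTorch), but the shape is simple enough that many training codebases just implement it directly.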

The loss curve

If you plot the training loss over time, you see a characteristic curve: it drops steeply at first (the model learns basic grammar and common words), then gradually flattens (diminishing returns as it learns subtler patterns). But the curve never truly plateaus for large models — there's always more to learn with more data and more compute.

This is the empirical basis of scaling laws: researchers at OpenAI and DeepMind have shown that loss decreases predictably as a power-law function of model size, dataset size, and compute. Double the compute, and you get a predictable improvement. This relationship has held across many orders of magnitude, and it is why the industry keeps building bigger models.
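The "predictable improvement" can be made concrete with a toy power-law fit; the constants here are invented, not a published result:

```python
def power_law_loss(compute, coeff=2.0, alpha=0.05):
    """Toy scaling law L(C) = coeff * C**(-alpha); constants are illustrative."""
    return coeff * compute ** -alpha

# The key property: doubling compute shrinks loss by the SAME fixed factor
# (2**-alpha) no matter where you start on the curve.
print(power_law_loss(2e18) / power_law_loss(1e18))  # ≈ 0.966
print(power_law_loss(2e24) / power_law_loss(1e24))  # ≈ 0.966, same ratio
```

That scale-invariant ratio is what lets labs fit the law on small, cheap runs and extrapolate the loss of a flagship model before spending the millions to train it.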

Phase 2: Supervised Fine-Tuning (SFT)

A pre-trained model is impressive but not useful as a chatbot. It's been trained to predict text, not to follow instructions. Ask it "What's the capital of France?" and it might continue with "What's the capital of Germany? What's the capital of Italy?" — because that's a likely continuation of such text on the internet.

Supervised fine-tuning fixes this with a curated dataset of (prompt, ideal_response) pairs. Human contractors write thousands of high-quality responses demonstrating the desired behavior: answering questions, following instructions, refusing harmful requests, thinking step by step.

The model is trained on these examples using the same next-token prediction objective, but now the "text" is formatted conversations. After SFT, the model understands that when given a question, it should answer — not just continue the pattern.

SFT typically uses 10,000–100,000 examples, a much smaller dataset than pre-training, and runs for only a few epochs. It's fine-tuning, not retraining.
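A sketch of how (prompt, ideal_response) pairs might become training text. The chat-template tokens below are made up (every model family defines its own), and the masking helper shows the common practice of computing the loss only on response tokens:

```python
def format_example(prompt, response):
    """Render one (prompt, ideal_response) pair as a training string.
    The <|user|>/<|assistant|> markers are a hypothetical template."""
    return f"<|user|>\n{prompt}\n<|assistant|>\n{response}<|end|>"

def response_loss_mask(prompt_tokens, response_tokens):
    """0 = prompt token (ignored by the loss), 1 = response token (trained on).
    This keeps the model from learning to imitate prompts."""
    return [0] * len(prompt_tokens) + [1] * len(response_tokens)

text = format_example("What's the capital of France?", "Paris.")
mask = response_loss_mask(["What", "'s", "the", "capital"], ["Paris", "."])
print(text)
print(mask)  # [0, 0, 0, 0, 1, 1]
```

The underlying objective is unchanged from pre-training: next-token prediction over the formatted string, just restricted (via the mask) to the assistant's turns.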

Phase 3: RLHF (Reinforcement Learning from Human Feedback)

SFT teaches format. RLHF teaches quality and safety. The process:

Step 1: Collect comparison data. Show human raters two model outputs for the same prompt and ask which is better. Repeat thousands of times across diverse prompts.

Step 2: Train a reward model. Using the comparison data, train a separate neural network that takes a (prompt, response) pair and outputs a scalar score predicting how much humans would prefer it.

Step 3: Optimize the LLM against the reward model. Using Proximal Policy Optimization (PPO) or similar algorithms, update the LLM to generate responses that score higher on the reward model. A KL divergence penalty keeps the model from straying too far from the SFT baseline (preventing reward hacking).
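The reward model in Step 2 is typically trained with a pairwise (Bradley-Terry) objective on the comparison data; a minimal sketch of that loss:

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Low when the reward model already scores the human-preferred
    response higher than the rejected one; high when it has them backwards."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(pairwise_loss(3.0, 1.0))  # ≈ 0.127 (correct ordering, wide margin)
print(pairwise_loss(1.0, 3.0))  # ≈ 2.127 (wrong ordering, penalized)
```

Notice that only the difference between the two scores matters, which is why reward-model scores are meaningful relatively but have no absolute scale.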

RLHF is what makes models helpful, harmless, and honest. It's also what teaches them to say "I don't know" instead of confidently hallucinating, to refuse harmful requests, and to be concise rather than rambling.

Some newer approaches skip the reward model entirely. Direct Preference Optimization (DPO) directly optimizes the language model from preference pairs, avoiding the instability of RL training. Constitutional AI (CAI) uses the model itself to generate and evaluate its own training data using a set of principles.
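The DPO objective is compact enough to sketch directly. It needs only the log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference (the SFT model); the example values below are illustrative:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss from sequence log-probs. beta plays the role of the KL
    penalty in RLHF, controlling how far the policy may drift from the
    frozen SFT reference."""
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy has raised the chosen response (and lowered the rejected one)
# relative to the reference -> low loss:
print(dpo_loss(-5.0, -10.0, -7.0, -7.0))   # ≈ 0.474
# Policy prefers the rejected response -> high loss:
print(dpo_loss(-10.0, -5.0, -7.0, -7.0))   # ≈ 0.974
```

No reward model, no sampling loop, no PPO machinery: the preference pairs supervise the policy directly, which is why DPO training is far more stable and cheaper to run.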

Emergent capabilities

One of the most surprising findings in LLM research: abilities appear suddenly at scale.

Small models can't do arithmetic, analogical reasoning, or chain-of-thought problem solving. As you scale up — more parameters, more data, more compute — these abilities don't improve gradually. They're absent, absent, absent... then suddenly present.

This phenomenon, called emergence, is still debated. Some researchers argue it's an artifact of how we measure (threshold effects on discrete tasks). Others believe it reflects genuine phase transitions in what the model's representations can encode.

Either way, the practical implication is real: you often can't predict what a larger model will be able to do based on what a smaller model can do. This is both exciting and concerning — exciting because it means breakthroughs might be one scaling step away, concerning because it means we can't fully predict what we're building.

The full pipeline, summarized

Raw text (trillions of tokens)
  → Pre-training (next-token prediction, months of compute)
    → Base model (powerful but not helpful)
      → SFT (instruction following, thousands of examples)
        → Fine-tuned model (helpful but rough)
          → RLHF/DPO (human preferences, safety alignment)
            → Final model (helpful, harmless, honest)

Each phase is cheaper and faster than the last, but all are necessary. Pre-training provides raw capability. SFT provides format. RLHF provides quality and alignment.

See the training process visualized

Explore our interactive training section — watch the loss curve drop in real time, see how gradient updates flow through the network, and understand how each training phase builds on the last.