🌰 seedling
RL Scaling Follows Pre-Training - The Generalization Inflection Ahead


The analogy

Pre-training’s trajectory:

  1. GPT-1 (2018) — narrow corpus, poor generalization
  2. GPT-2 (2019) — broad internet data, sudden generalization across tasks
  3. GPT-3 onward — scaling laws hold, capability improves log-linearly with compute

RL’s current trajectory:

  1. Narrow RL (2024-2025) — trained on math competitions, coding challenges, specific verifiable tasks
  2. Broad RL (expected) — training on diverse task distributions, sudden generalization across domains
  3. RL scaling laws — already visible on AIME benchmarks, where performance improves log-linearly with training duration

The inflection (step 1→2) happened for pre-training when the data distribution broadened. Amodei expects the same transition for RL as task distributions broaden beyond math and coding.
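
To make "log-linear" concrete: performance gains a roughly constant increment each time compute (or training duration) is multiplied by a fixed factor. A minimal sketch of fitting such a curve, using synthetic, illustrative numbers only (the scores and compute budgets below are invented, not real benchmark results):

```python
import numpy as np

# Synthetic, illustrative data only -- not real benchmark results.
compute = np.array([1e21, 1e22, 1e23, 1e24, 1e25])  # training compute (arbitrary units)
score = np.array([22.0, 35.0, 47.0, 61.0, 73.0])    # hypothetical benchmark score

# Log-linear law: score ~= a * log10(compute) + b, fit by least squares.
a, b = np.polyfit(np.log10(compute), score, deg=1)
print(f"slope: {a:.1f} points per 10x of compute, intercept: {b:.1f}")

# The scaling-law bet is that the line keeps holding as compute grows:
print(f"extrapolated score at 1e26: {a * 26 + b:.1f}")
```

The note's claim is that this functional form, long observed for pre-training, is now appearing for RL training duration on benchmarks like AIME.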

Why this matters

Pre-training scaling gave us general language capability. RL scaling gives us general action capability — the ability to pursue goals, make multi-step decisions, and recover from errors across diverse domains. The combination produces agents that both understand context (pre-training) and execute toward objectives (RL).

For verifiable tasks (math, coding, formal proofs): RL scaling already produces state-of-the-art results. Amodei predicts full end-to-end software engineering capability within 1-2 years.

For non-verifiable tasks (writing, strategy, planning): more uncertainty, but the generalization pattern from pre-training suggests that once the RL task distribution broadens enough, capability extends to these domains too.

The sample efficiency question

Rich Sutton’s objection: if models had a “true core of human learning,” they wouldn’t need billions of dollars of compute to learn simple tasks. Amodei’s response: models start from random weights and must do the equivalent of both evolutionary learning and individual learning during training. Humans start with brains shaped by millions of years of evolution. Once trained, models with million-token context windows show genuine in-context adaptation comparable to weeks of human reading.
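
The "weeks of human reading" comparison is rough arithmetic; the conversion factors below (words per token, reading speed) are common rules of thumb I'm assuming, not figures from the note:

```python
# Rough arithmetic behind "a million-token context ~ weeks of reading".
# Assumed rules of thumb, not figures from the note.
tokens = 1_000_000
words = tokens * 0.75             # ~0.75 English words per token
words_per_minute = 250            # typical adult reading speed
hours = words / words_per_minute / 60
print(f"~{hours:.0f} hours of reading")       # ~50 hours
print(f"~{hours / 3:.0f} days at 3 h/day")    # ~17 days, i.e. weeks
```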

The implication: high upfront training cost, low marginal deployment cost. Each new model amortizes its training cost across all users. This favors a few large labs running expensive training with cheap inference, consistent with the oligopoly structure described in Two Exponentials - AI Capability vs Economic Diffusion.
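
A back-of-envelope sketch of the amortization claim; every cost figure below is hypothetical, chosen only to show how the fixed training cost vanishes per query at scale:

```python
# Hypothetical, illustrative figures only -- not actual lab economics.
training_cost = 1_000_000_000       # one-time training cost ($)
inference_cost_per_query = 0.002    # marginal cost to serve one query ($)
queries_per_user = 1_000            # assumed usage per user

for users in (1_000, 1_000_000, 100_000_000):
    queries = users * queries_per_user
    # Average cost per query = amortized training + marginal inference.
    avg = training_cost / queries + inference_cost_per_query
    print(f"{users:>11,} users -> ${avg:,.4f} per query")
```

At a thousand users the training bill dominates; at a hundred million users the per-query cost collapses to roughly the inference cost, which is what makes the few-large-labs structure stable.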


Connected Notes