🌰 seedling
RL Scaling Follows Pre-Training - The Generalization Inflection Ahead


The analogy

Pre-training’s trajectory:

  1. GPT-1 (2018) — narrow corpus, poor generalization
  2. GPT-2 (2019) — broad internet data, sudden generalization across tasks
  3. GPT-3 onward — scaling laws hold, capability improves log-linearly with compute

RL’s current trajectory:

  1. Narrow RL (2024-2025) — trained on math competitions, coding challenges, specific verifiable tasks
  2. Broad RL (expected) — training on diverse task distributions, sudden generalization across domains
  3. RL scaling laws — already visible on AIME benchmarks, where performance improves log-linearly with training duration

The inflection (step 1→2) happened for pre-training when the data distribution broadened. Amodei expects the same transition for RL as task distributions broaden beyond math and coding.
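
To make "log-linear" concrete: performance gains a roughly constant increment each time compute (or training duration) is multiplied by a fixed factor. A minimal sketch of fitting such a curve, using synthetic, illustrative numbers only (the scores and compute budgets below are invented, not real benchmark results):

```python
import numpy as np

# Synthetic, illustrative data only -- not real benchmark results.
compute = np.array([1e21, 1e22, 1e23, 1e24, 1e25])  # training compute (arbitrary units)
score = np.array([22.0, 35.0, 47.0, 61.0, 73.0])    # hypothetical benchmark score

# Log-linear law: score ~= a * log10(compute) + b, fit by least squares.
a, b = np.polyfit(np.log10(compute), score, deg=1)
print(f"slope: {a:.1f} points per 10x of compute, intercept: {b:.1f}")

# The scaling-law bet is that the line keeps holding as compute grows:
print(f"extrapolated score at 1e26: {a * 26 + b:.1f}")
```

The note's claim is that this functional form, long observed for pre-training, is now appearing for RL training duration on benchmarks like AIME.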

Why this matters

Pre-training scaling gave us general language capability. RL scaling gives us general action capability — the ability to pursue goals, make multi-step decisions, and recover from errors across diverse domains. The combination produces agents that both understand context (pre-training) and execute toward objectives (RL).

For verifiable tasks (math, coding, formal proofs): RL scaling already produces state-of-the-art results. Amodei predicts full end-to-end software engineering capability within 1-2 years.

For non-verifiable tasks (writing, strategy, planning): more uncertainty, but the generalization pattern from pre-training suggests that once the RL task distribution broadens enough, capability extends to these domains too.

The sample efficiency question

Rich Sutton’s objection: if models had a “true core of human learning,” they wouldn’t need billions of dollars of compute to learn simple tasks. Amodei’s response: models start from random weights and must do the equivalent of both evolutionary learning and individual learning during training. Humans start with brains shaped by millions of years of evolution. Once trained, models with million-token context windows show genuine in-context adaptation comparable to weeks of human reading.
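
The "weeks of human reading" comparison is rough arithmetic; the conversion factors below (words per token, reading speed) are common rules of thumb I'm assuming, not figures from the note:

```python
# Rough arithmetic behind "a million-token context ~ weeks of reading".
# Assumed rules of thumb, not figures from the note.
tokens = 1_000_000
words = tokens * 0.75             # ~0.75 English words per token
words_per_minute = 250            # typical adult reading speed
hours = words / words_per_minute / 60
print(f"~{hours:.0f} hours of reading")       # ~50 hours
print(f"~{hours / 3:.0f} days at 3 h/day")    # ~17 days, i.e. weeks
```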

The implication: high upfront training cost, low marginal deployment cost. Each new model amortizes its training cost across all users. This favors a few large labs running expensive training with cheap inference, consistent with the oligopoly structure described in Two Exponentials - AI Capability vs Economic Diffusion.
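
A back-of-envelope sketch of the amortization claim; every cost figure below is hypothetical, chosen only to show how the fixed training cost vanishes per query at scale:

```python
# Hypothetical, illustrative figures only -- not actual lab economics.
training_cost = 1_000_000_000       # one-time training cost ($)
inference_cost_per_query = 0.002    # marginal cost to serve one query ($)
queries_per_user = 1_000            # assumed usage per user

for users in (1_000, 1_000_000, 100_000_000):
    queries = users * queries_per_user
    # Average cost per query = amortized training + marginal inference.
    avg = training_cost / queries + inference_cost_per_query
    print(f"{users:>11,} users -> ${avg:,.4f} per query")
```

At a thousand users the training bill dominates; at a hundred million users the per-query cost collapses to roughly the inference cost, which is what makes the few-large-labs structure stable.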


Connected Notes