Auto Research — Agents as Overnight Experimentation Engines
The setup
The pattern is deceptively simple (a spec sketch follows the list):
- Objective: Define a measurable goal. Minimize validation loss on this benchmark. Maximize code pass rate on this test suite. Shrink the model without losing accuracy below threshold X.
- Boundaries: Define what the agent can and cannot change. You may modify hyperparameters, loss weights, and optimizer config. You may not change the architecture or training data.
- Budget: Define compute, time, and cost limits.
- Leave it running. Overnight, over a weekend, or however long the budget allows.
- Review at the end. The agent returns a ranked list of candidate improvements with evidence.
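A minimal sketch of what such a run spec might look like, assuming a hypothetical harness that consumes it; the class name and fields are illustrative, not any real tool's API:

```python
# Hypothetical spec for an overnight auto-research run.
# Names and fields are illustrative, not a real harness API.
from dataclasses import dataclass

@dataclass
class AutoResearchSpec:
    # Objective: a single number the agent can rank experiments by.
    objective: str = "minimize val_loss on the held-out benchmark"

    # Boundaries: what the agent may and may not touch.
    may_change: tuple = ("hyperparameters", "loss_weights", "optimizer_config")
    may_not_change: tuple = ("architecture", "training_data")

    # Budget: hard limits enforced by the harness, not by the agent.
    max_gpu_hours: float = 64.0
    max_wall_clock_hours: float = 12.0   # overnight
    max_cost_usd: float = 500.0

spec = AutoResearchSpec()
# Hand the spec to the harness, leave it running, review the ranked results in the morning.
```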
The discipline is staying out of the loop. Every time the human nudges the agent toward a “promising” direction, the experimentation loop slows and inherits the human’s biases.
The Karpathy GPT-2 example
Karpathy applied this pattern to his own GPT-2 training repo, code he had hand-tuned over years. The agent found:
- Weight decay on value embeddings was misconfigured (the agent discovered this by ablating)
- Adam beta parameters were poorly tuned for his particular setup
- Interactions between hyperparameters that he had missed because he’d only ever tuned one at a time
The humbling result: code refined over years of careful manual work still had low-hanging fruit an agent found in one overnight run. Most hand-tuned research code probably harbors more of this than researchers want to admit.
His framing: “I shouldn’t be a bottleneck.”
What auto research needs to work
- Clean metrics. If the objective resists reduction to a single number (or a small vector), the agent has no way to rank experiments. Research domains with noisy, subjective, or multi-dimensional outcomes are harder to automate.
- Fast iteration cycle. If each experiment takes a week, auto research becomes scheduled-batch research. The payoff scales with how many cycles you can run per budget.
- Reproducibility. The agent needs to trust that a given config produces the same result twice, or it can’t reason about which change caused which improvement.
- Bounded search space. Open-ended “go improve this repo” tasks fail because the agent wanders. Constrained “tune these 12 hyperparameters” tasks succeed because the search is small enough to sweep.
- Safety/cost guardrails. The harness must enforce compute budgets; asking the agent nicely fails. A sketch of such a loop follows this list.
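A minimal sketch of a harness loop that reflects most of these constraints: a bounded search space, a single ranking metric, a fixed seed per config, and a hard wall-clock budget. `run_experiment` is a hypothetical stand-in for “train with this config and report the metric”; nothing here is a real harness API.

```python
# Minimal budget-enforcing sweep loop (illustrative only).
import itertools
import random
import time

SEARCH_SPACE = {                       # bounded: a few knobs, not "improve the repo"
    "lr": [1e-4, 3e-4, 1e-3],
    "weight_decay": [0.0, 0.01, 0.1],
    "adam_beta2": [0.95, 0.99, 0.999],
}
WALL_CLOCK_BUDGET_S = 8 * 3600         # hard limit enforced by the harness

def run_experiment(config: dict, seed: int = 0) -> float:
    """Hypothetical stand-in: train with `config`, return validation loss (lower is better)."""
    return random.random()             # replace with a real training + eval call

def sweep():
    deadline = time.monotonic() + WALL_CLOCK_BUDGET_S
    candidates = [dict(zip(SEARCH_SPACE, values))
                  for values in itertools.product(*SEARCH_SPACE.values())]
    random.shuffle(candidates)         # cheap hedge against ordering bias
    results = []
    for config in candidates:
        if time.monotonic() > deadline:   # hard stop: the budget beats curiosity
            break
        # Fixed seed per config so a rerun reproduces the same number.
        results.append((run_experiment(config, seed=0), config))
    # Ranked list of candidate improvements, best metric first.
    return sorted(results, key=lambda r: r[0])
```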
What changes when this works
- Human hours per experiment approach zero. The researcher’s job shifts from running experiments to specifying them and reviewing results.
- Research throughput becomes compute-bound. The question becomes “how much compute can we throw at the loop,” because researcher iteration speed no longer limits it.
- Hand-tuned artifacts become suspicious. If an agent improves an expert’s careful work in one night, any artifact that has skipped auto-research probably harbors the same hidden low-hanging fruit.
- Researchers move up the stack. Instead of tuning models, researchers define objectives, curate evaluation sets, and design the search spaces. The mechanical work drops below them.
The limits to watch
- Metric hacking. Agents optimize what you measure. If the metric has a loophole, the agent finds it. Requires careful metric design and adversarial review (one guard is sketched after this list).
- Compute asymmetry. Auto research favors whoever has more compute. This may accelerate the gap between well-resourced labs and everyone else — though distributed approaches like Bittensor-style training may counter this.
- Non-verifiable domains. Tasks without clean metrics (writing quality, research judgment, taste) stay human-led even as mechanical experimentation goes automated.
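One possible guard, assuming the audit metric and tolerance are yours to choose: rank candidates with the metric the agent optimizes, but accept only those that also hold up on a held-out audit metric the agent never sees during the run. A sketch, with hypothetical callables:

```python
# `agent_metric` and `audit_metric` are hypothetical callables returning a loss
# (lower is better). The audit metric is hidden from the agent during the run.
def accept(candidate, baseline, agent_metric, audit_metric, tolerance: float = 0.01) -> bool:
    improved = agent_metric(candidate) < agent_metric(baseline)
    holds_up = audit_metric(candidate) <= audit_metric(baseline) * (1 + tolerance)
    return improved and holds_up
```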
Related Notes
- Research Org as Tunable Program
- Token Throughput as the New Coding Bottleneck
- Distributed Open Source AI Training as Orthogonal Threat
- Karpathy - No Priors Code Agents Autoresearch (source)
- Claws - Persistent Looping Agents as App Replacement — both are autonomous agent loop patterns; auto-research applies the loop to experimentation