🌰 seedling
Auto Research - Agents as Overnight Experimentation Engines

The setup

The pattern is deceptively simple:

  1. Objective: Define a measurable goal. Minimize validation loss on this benchmark. Maximize code pass rate on this test suite. Shrink the model without losing accuracy below threshold X.
  2. Boundaries: Define what the agent can and cannot change. You may modify hyperparameters, loss weights, and optimizer config. You may not change the architecture or training data.
  3. Budget: Define compute, time, and cost limits.
  4. Leave it running. Overnight, over a weekend, or however long the budget allows.
  5. Review at the end. The agent returns a ranked list of candidate improvements with evidence.
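The five steps above can be sketched as a single loop. This is a hypothetical minimal harness; the function and config names are illustrative, not from any real system:

```python
import random

def auto_research(objective, mutate, baseline_config, budget_runs, seed=0):
    """Sketch of the pattern: score a baseline, mutate it within fixed
    boundaries, and return candidates ranked by the objective."""
    rng = random.Random(seed)
    best = [(objective(baseline_config), baseline_config)]
    for _ in range(budget_runs):             # 3. hard budget, not a suggestion
        candidate = mutate(best[0][1], rng)  # 2. boundaries live inside mutate()
        score = objective(candidate)         # 1. one measurable number
        best.append((score, candidate))
        best.sort(key=lambda t: t[0])        # minimize; flip the sign to maximize
    return best[:5]                          # 5. ranked candidates for human review

# Toy usage: search for a learning rate near an (unknown) optimum of 0.01.
obj = lambda cfg: (cfg["lr"] - 0.01) ** 2
mut = lambda cfg, rng: {"lr": cfg["lr"] * rng.uniform(0.5, 2.0)}
ranked = auto_research(obj, mut, {"lr": 0.1}, budget_runs=50)
```

Step 4, leaving it running, is just the size of `budget_runs`; the human appears only at the return value.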

The discipline is staying out. Every time the human nudges the agent toward a “promising” direction, the experimentation loop slows and inherits the human’s biases.

The Karpathy GPT-2 example

Karpathy applied this pattern to his own GPT-2 training repo, code he had hand-tuned over years. The agent found:

  • Weight decay on value embeddings was misconfigured (the agent discovered this by ablating)
  • Adam beta parameters were poorly tuned for his particular setup
  • Interactions between hyperparameters that he had missed because he’d only tuned one at a time

The humbling result: code refined over years of careful manual work still had low-hanging fruit an agent found in one overnight run. Most hand-tuned research code probably harbors more of this than researchers want to admit.

His framing: “I shouldn’t be a bottleneck.”

What auto research needs to work

  • Clean metrics. If the objective resists reduction to a single number (or a small vector), the agent has no way to rank experiments. Research domains with noisy, subjective, or multi-dimensional outcomes are harder to automate.
  • Fast iteration cycle. If each experiment takes a week, auto research becomes scheduled-batch research. The payoff scales with how many cycles you can run per budget.
  • Reproducibility. The agent needs to trust that a given config produces the same result twice, or it can’t reason about which change caused which improvement.
  • Bounded search space. Open-ended “go improve this repo” tasks fail because the agent wanders. Constrained “tune these 12 hyperparameters” tasks succeed because the search is small enough to sweep.
  • Safety/cost guardrails. The harness must enforce compute budgets; asking the agent nicely fails.
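Three of these requirements, bounded search, reproducibility, and harness-enforced budgets, can be illustrated in one sketch (a hypothetical harness, not a real library; the names are mine):

```python
import itertools
import random
import time

def bounded_sweep(objective, search_space, max_seconds, seed=0):
    """Enumerate an explicit, finite search space under a wall-clock budget
    that the harness enforces itself rather than trusting the agent to stop."""
    random.seed(seed)  # fix RNG so a stochastic objective scores a config the same way twice
    deadline = time.monotonic() + max_seconds
    results = []
    keys = list(search_space)
    # itertools.product over explicit value lists = a search small enough to sweep
    for values in itertools.product(*(search_space[k] for k in keys)):
        if time.monotonic() > deadline:  # the guardrail lives in the harness
            break
        config = dict(zip(keys, values))
        results.append((objective(config), config))
    return sorted(results, key=lambda t: t[0])

# Toy usage: two hyperparameters, four configs, ranked by a toy objective.
space = {"lr": [0.1, 0.01], "wd": [0.0, 0.1]}
ranked = bounded_sweep(lambda c: c["lr"] + c["wd"], space, max_seconds=10)
```

The key design choice is that the deadline check sits in the loop, not in the objective: an agent that ignores instructions still cannot outspend the budget.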

What changes when this works

  • Human hours per experiment approach zero. The researcher’s job shifts from running experiments to specifying them and reviewing results.
  • Research throughput becomes compute-bound. The question becomes “how much compute can we throw at the loop,” because researcher iteration speed no longer limits it.
  • Hand-tuned artifacts become suspicious. If an agent improves an expert’s careful work in one night, any artifact that has skipped auto research probably harbors the same hidden low-hanging fruit.
  • Researchers move up the stack. Instead of tuning models, researchers define objectives, curate evaluation sets, and design the search spaces. The mechanical work drops below them.

The limits to watch

  • Metric hacking. Agents optimize what you measure. If the metric has a loophole, the agent finds it. Requires careful metric design and adversarial review.
  • Compute asymmetry. Auto research favors whoever has more compute. This may accelerate the gap between well-resourced labs and everyone else — though distributed approaches like Bittensor-style training may counter this.
  • Non-verifiable domains. Tasks without clean metrics (writing quality, research judgment, taste) stay human-led even as mechanical experimentation goes automated.
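One concrete adversarial-review check against metric hacking is to pin the evaluation itself: refuse a score if the evaluation files changed, so an agent can't raise a pass rate by editing the tests it is measured on. A minimal sketch (hypothetical guard, not a standard tool):

```python
import hashlib
import pathlib

def guarded_score(metric, eval_dir, expected_hash):
    """Run metric() only if the evaluation directory still hashes to the
    value recorded before the agent was allowed anywhere near the repo."""
    digest = hashlib.sha256()
    for path in sorted(pathlib.Path(eval_dir).rglob("*")):
        if path.is_file():
            digest.update(path.read_bytes())
    if digest.hexdigest() != expected_hash:
        raise RuntimeError("evaluation suite modified; score rejected")
    return metric()
```

This closes only one loophole; metrics can also be gamed without touching the eval files, which is why careful metric design still matters.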
Connected Notes