🌰 seedling
Auto Research - Agents as Overnight Experimentation Engines

The setup

The pattern is deceptively simple:

  1. Objective: Define a measurable goal. Minimize validation loss on this benchmark. Maximize code pass rate on this test suite. Shrink the model without losing accuracy below threshold X.
  2. Boundaries: Define what the agent can and cannot change. You may modify hyperparameters, loss weights, and optimizer config. You may not change the architecture or training data.
  3. Budget: Define compute, time, and cost limits.
  4. Leave it running. Overnight, over a weekend, or however long the budget allows.
  5. Review at the end. The agent returns a ranked list of candidate improvements with evidence.
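The five steps above can be sketched as a single loop. This is a hypothetical minimal harness; the function and config names are illustrative, not from any real system:

```python
import random

def auto_research(objective, mutate, baseline_config, budget_runs, seed=0):
    """Sketch of the pattern: score a baseline, mutate it within fixed
    boundaries, and return candidates ranked by the objective."""
    rng = random.Random(seed)
    best = [(objective(baseline_config), baseline_config)]
    for _ in range(budget_runs):             # 3. hard budget, not a suggestion
        candidate = mutate(best[0][1], rng)  # 2. boundaries live inside mutate()
        score = objective(candidate)         # 1. one measurable number
        best.append((score, candidate))
        best.sort(key=lambda t: t[0])        # minimize; flip the sign to maximize
    return best[:5]                          # 5. ranked candidates for human review

# Toy usage: search for a learning rate near an (unknown) optimum of 0.01.
obj = lambda cfg: (cfg["lr"] - 0.01) ** 2
mut = lambda cfg, rng: {"lr": cfg["lr"] * rng.uniform(0.5, 2.0)}
ranked = auto_research(obj, mut, {"lr": 0.1}, budget_runs=50)
```

Step 4, leaving it running, is just the size of `budget_runs`; the human appears only at the return value.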

The discipline is staying out. Every time the human nudges the agent toward a “promising” direction, the experimentation loop slows and inherits the human’s biases.

The Karpathy GPT-2 example

Karpathy applied this pattern to his own GPT-2 training repo, code he had hand-tuned over years. The agent found:

  • Weight decay on value embeddings was misconfigured (the agent discovered this by ablating)
  • Adam beta parameters were poorly tuned for his particular setup
  • Interactions between hyperparameters that he had missed because he’d only tuned one at a time

The humbling result: code refined over years of careful manual work still had low-hanging fruit an agent found in one overnight run. Most hand-tuned research code probably harbors more of this than researchers want to admit.

His framing: “I shouldn’t be a bottleneck.”

What auto research needs to work

  • Clean metrics. If the objective resists reduction to a single number (or a small vector), the agent has no way to rank experiments. Research domains with noisy, subjective, or multi-dimensional outcomes are harder to automate.
  • Fast iteration cycle. If each experiment takes a week, auto research becomes scheduled-batch research. The payoff scales with how many cycles you can run per budget.
  • Reproducibility. The agent needs to trust that a given config produces the same result twice, or it can’t reason about which change caused which improvement.
  • Bounded search space. Open-ended “go improve this repo” tasks fail because the agent wanders. Constrained “tune these 12 hyperparameters” tasks succeed because the search is small enough to sweep.
  • Safety/cost guardrails. The harness must enforce compute budgets; asking the agent nicely fails.
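Three of these requirements, bounded search, reproducibility, and harness-enforced budgets, can be illustrated in one sketch (a hypothetical harness, not a real library; the names are mine):

```python
import itertools
import random
import time

def bounded_sweep(objective, search_space, max_seconds, seed=0):
    """Enumerate an explicit, finite search space under a wall-clock budget
    that the harness enforces itself rather than trusting the agent to stop."""
    random.seed(seed)  # fix RNG so a stochastic objective scores a config the same way twice
    deadline = time.monotonic() + max_seconds
    results = []
    keys = list(search_space)
    # itertools.product over explicit value lists = a search small enough to sweep
    for values in itertools.product(*(search_space[k] for k in keys)):
        if time.monotonic() > deadline:  # the guardrail lives in the harness
            break
        config = dict(zip(keys, values))
        results.append((objective(config), config))
    return sorted(results, key=lambda t: t[0])

# Toy usage: two hyperparameters, four configs, ranked by a toy objective.
space = {"lr": [0.1, 0.01], "wd": [0.0, 0.1]}
ranked = bounded_sweep(lambda c: c["lr"] + c["wd"], space, max_seconds=10)
```

The key design choice is that the deadline check sits in the loop, not in the objective: an agent that ignores instructions still cannot outspend the budget.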

What changes when this works

  • Human hours per experiment approach zero. The researcher’s job shifts from running experiments to specifying them and reviewing results.
  • Research throughput becomes compute-bound. The question becomes “how much compute can we throw at the loop,” because researcher iteration speed no longer limits it.
  • Hand-tuned artifacts become suspicious. If an agent improves an expert’s careful work in one night, any artifact that has skipped auto research probably harbors the same hidden low-hanging fruit.
  • Researchers move up the stack. Instead of tuning models, researchers define objectives, curate evaluation sets, and design the search spaces. The mechanical work drops below them.

The limits to watch

  • Metric hacking. Agents optimize what you measure. If the metric has a loophole, the agent finds it. Requires careful metric design and adversarial review.
  • Compute asymmetry. Auto research favors whoever has more compute. This may accelerate the gap between well-resourced labs and everyone else — though distributed approaches like Bittensor-style training may counter this.
  • Non-verifiable domains. Tasks without clean metrics (writing quality, research judgment, taste) stay human-led even as mechanical experimentation goes automated.
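One concrete adversarial-review check against metric hacking is to pin the evaluation itself: refuse a score if the evaluation files changed, so an agent can't raise a pass rate by editing the tests it is measured on. A minimal sketch (hypothetical guard, not a standard tool):

```python
import hashlib
import pathlib

def guarded_score(metric, eval_dir, expected_hash):
    """Run metric() only if the evaluation directory still hashes to the
    value recorded before the agent was allowed anywhere near the repo."""
    digest = hashlib.sha256()
    for path in sorted(pathlib.Path(eval_dir).rglob("*")):
        if path.is_file():
            digest.update(path.read_bytes())
    if digest.hexdigest() != expected_hash:
        raise RuntimeError("evaluation suite modified; score rejected")
    return metric()
```

This closes only one loophole; metrics can also be gamed without touching the eval files, which is why careful metric design still matters.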
Connected Notes