🌰 seedling
GAN-Inspired Agent Architecture - Generator Evaluator Loops

The self-evaluation problem

When agents evaluate their own work, they reliably skew positive. Even when output quality is obviously mediocre to a human observer, the agent will confidently praise its results. This is especially pronounced for subjective tasks like visual design where there is no binary pass/fail equivalent to a software test.

Even on tasks with verifiable outcomes, agents exhibit poor judgment about their own work. The underlying dynamic: an LLM generating output and then assessing that same output within the same context is structurally biased toward approval.

The architectural fix

Separate the roles into distinct agents:

Agent | Role | Key behavior
--- | --- | ---
Generator | Produces output (code, design, content) | Iterates based on evaluator feedback
Evaluator | Grades output against criteria, writes critiques | Tuned for skepticism and thoroughness

The separation does not eliminate leniency on its own; the evaluator is still an LLM inclined to be generous. But tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work. Once external feedback exists, the generator has something concrete to iterate against.
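
A minimal sketch of the loop, assuming a generic `call_llm` client; the function name, prompts, and 0-10 scoring scheme are illustrative, not an interface described in the source:

```python
# Minimal sketch of a generator-evaluator loop. `call_llm`, the prompts, and
# the 0-10 scoring scheme are illustrative assumptions.
from dataclasses import dataclass


def call_llm(system: str, user: str) -> str:
    """Placeholder: route to whatever model client is in use."""
    raise NotImplementedError


@dataclass
class Review:
    score: float   # evaluator's grade against the criteria
    critique: str  # concrete feedback the generator can act on


def evaluate(output: str, criteria: str) -> Review:
    # Separate agent, separate context, system prompt tuned for skepticism.
    critique = call_llm(
        system="You are a strict reviewer. Find concrete flaws. Do not praise.",
        user=(
            f"Criteria:\n{criteria}\n\nOutput:\n{output}\n\n"
            "List specific problems, then end with 'SCORE: <0-10>'."
        ),
    )
    score = float(critique.rsplit("SCORE:", 1)[-1].strip())
    return Review(score=score, critique=critique)


def generate_with_feedback(task: str, criteria: str, max_rounds: int = 5) -> str:
    output, feedback = "", ""
    for _ in range(max_rounds):
        # The generator only ever sees external critique, never its own judgment.
        output = call_llm(
            system="You produce the requested artifact. Address all feedback.",
            user=f"Task:\n{task}\n\nEvaluator feedback so far:\n{feedback or 'none'}",
        )
        review = evaluate(output, criteria)
        if review.score >= 9:  # arbitrary acceptance threshold
            break
        feedback = review.critique
    return output
```

The design point is the context boundary: the critique is produced outside the generator's context, so the generator iterates against feedback it did not write itself.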

Source: Harness design for long-running application development (Prithvi Rajasekaran, Anthropic, March 2026)

Iteration dynamics

Across runs, evaluator assessments improve over iterations before plateauing, with headroom remaining. The pattern is not always linear:

  • Later implementations tend to be better as a whole
  • Middle iterations sometimes beat the final one on specific dimensions
  • Implementation complexity increases across rounds as the generator reaches for more ambitious solutions
  • Even the first iteration outperforms a no-prompting baseline, suggesting the criteria themselves steer the model before any feedback loop begins
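
Because a middle iteration can beat the final one on a given dimension, it helps to persist every round's evaluator scores and select afterwards rather than keeping only the last output. A minimal sketch, with invented dimensions and numbers:

```python
# Sketch of keeping every iteration's evaluator scores instead of assuming the
# last round is the best one. Dimensions and numbers below are made up.
from statistics import mean


def pick_best(iteration_scores: list[dict[str, float]]) -> int:
    """Index of the iteration with the highest mean score across dimensions."""
    return max(range(len(iteration_scores)),
               key=lambda i: mean(iteration_scores[i].values()))


scores = [
    {"layout": 5.0, "accessibility": 4.0},  # iteration 1
    {"layout": 8.0, "accessibility": 6.0},  # iteration 2: best on layout
    {"layout": 7.5, "accessibility": 8.5},  # iteration 3: best overall
]
print(pick_best(scores))  # -> 2; per-dimension winners can still differ
```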

In one example, a Dutch art museum website was refined through nine iterations of a polished dark-themed landing page. On the tenth cycle, the generator scrapped the approach entirely and reimagined the site as a 3D spatial experience with CSS perspective rendering and doorway-based navigation, a creative leap not seen from single-pass generation.

Scaling to three agents

For full-stack development, the pattern extends to a planner-generator-evaluator triad:

  1. Planner - expands a short prompt into a full product spec (ambitious scope, high-level technical design, avoids specifying granular implementation details that could cascade errors)
  2. Generator - implements features sprint by sprint, self-evaluates before QA handoff
  3. Evaluator - uses browser automation to interact with the running application, grades against sprint contracts, files specific bugs
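
A sketch of how the triad could be wired together as a sprint loop; `call_llm`, `run_browser_session`, and the sprint-contract format are placeholder assumptions rather than an interface described in the source:

```python
# Sketch of the planner-generator-evaluator triad as a sprint loop. Function
# names, prompts, and the sprint-contract format are illustrative assumptions.
def call_llm(system: str, user: str) -> str:
    """Placeholder model client."""
    raise NotImplementedError


def run_browser_session(instructions: str) -> str:
    """Placeholder for the evaluator's browser automation: drive the running
    app and return an interaction transcript."""
    raise NotImplementedError


def build_application(short_prompt: str, n_sprints: int = 5) -> None:
    # 1. Planner: ambitious scope and high-level design, but no granular
    #    implementation details that could cascade errors downstream.
    spec = call_llm(
        system=("Expand the idea into a product spec: features, high-level "
                "technical design, and a contract for each sprint. Stay high level."),
        user=short_prompt,
    )
    bug_reports = ""
    for sprint in range(1, n_sprints + 1):
        # 2. Generator: implement this sprint, self-check before QA handoff.
        call_llm(
            system=("Implement the current sprint against the spec. Fix open "
                    "bug reports. Self-review before handing off to QA."),
            user=f"Spec:\n{spec}\n\nSprint: {sprint}\n\nOpen bugs:\n{bug_reports or 'none'}",
        )
        # 3. Evaluator: exercise the running app, grade it against the sprint
        #    contract, and file specific, reproducible bugs.
        transcript = run_browser_session(
            f"Exercise every feature promised by sprint {sprint} of the spec."
        )
        bug_reports = call_llm(
            system=("Grade the application against the sprint contract. File "
                    "specific bugs. Do not accept partially working features."),
            user=f"Spec:\n{spec}\n\nSprint: {sprint}\n\nTranscript:\n{transcript}",
        )
```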

The three-agent version produced applications that were dramatically more functional than single-agent baselines: the solo agent’s core feature was broken, while the harness version’s core features worked.

Cost and tradeoffs

The harness is expensive. A retro game maker comparison:

Approach | Duration | Cost
--- | --- | ---
Solo agent | 20 min | $9
Full harness | 6 hr | $200

Over 20x more expensive, but the quality gap justified it: the solo run produced a broken core feature while the harness run produced a working, polished application with AI integration.


Connected Notes