Agent Harnesses

What Is a Harness?

The runtime orchestration layer that wraps a reasoning loop and coordinates:

Tool execution — file editing, bash, web search, MCP servers
Context management — what the model sees, when, and how much
Safety enforcement — permissions, dangerous command detection, approval flows
Session persistence — memory, state, conversation history
Channel integration — how humans interact (CLI, Telegram, Discord, web)

Pasted image 20260413185447

From @vtrivedy10

Per the arXiv formalization (2603.05344):

Scaffolding = constructing the agent BEFORE the first prompt (CLAUDE.md, skills, rules, context files)
Harness = everything that happens AFTER (tool execution, context management, safety, persistence)

The moat is the harness, not the model.

“Fat Skills, Thin Harness, Deterministic App”

Pasted image 20260413192412

From @GarryTan

Why the Harness Matters More Than the Model

Empirical evidence

LangChain’s coding agent: 52.8% → 66.5% on Terminal Bench 2.0 by changing only the harness. Same model, same prompts. The 14-point jump came entirely from orchestration changes.

Market confirmation

OpenAI open-sourced codex-plugin-cc (March 30, 2026) to invoke Codex from inside Claude Code. If models were the moat, this wouldn’t exist. Models are swappable plugins. The harness is the platform.

The Claude Code leak

When Anthropic accidentally published Claude Code’s full source (512K lines TypeScript), the community ignored the model integration and reverse-engineered the 12-layer harness system, 40+ tools, and orchestration patterns. Claw Code replicated the architecture in Python overnight. 100K GitHub stars in 24 hours.

The Two Concepts That Matter Most

1. Context Hygiene (from Dex Horthy - Context Engineering for AI Coding Workflows)

The “dumb zone”: past ~40% context window fill, output quality drops. Everything after that gets worse.

Worst context (ranked):

Incorrect information (worst)
Missing information
Noise

Trajectory matters. If an agent keeps getting corrected, the conversation history trains the next token toward more mistakes. Errors compound.

The fix: compact early and often. When context fills or goes sideways, compress state to a markdown summary and start fresh. Use sub-agents for context isolation (dispatch to read a large area, return one sentence). Don’t use them as role analogues.

2. Research → Plan → Implement

Context hygiene produces a natural workflow:

Research — read the actual codebase, compress into truthful snapshot
Plan — exact file names, line snippets, test steps. “The dumbest model won’t screw this up.”
Implement — follow the plan with deliberately low context

This maps directly to Spec-Driven Development and AI-Native SDLC - 2026 Analysis — specs are the plan artifact, and the harness executes against them.

Architectural Philosophies

Gateway (OpenClaw, 345K stars)

A persistent process managing routing, permissions, channels, skills, and connections. The model is one of many tools it orchestrates. 15+ platform integrations. Skills marketplace.

Memory-First (Hermes, 21.6K stars)

An agent that knows your preferences, writes its own skills, and retains context across sessions beats one that’s well-routed but forgetful. Honcho user modeling, self-improving skill library.

Minimal (Pi)

A tiny extensible harness you adapt to your workflows. No opinions, no batteries — TypeScript extensions, skills, templates. Created by Armin Ronacher. For people who want to understand and control every layer.

Orchestration (Paperclip, 43.9K stars)

The harness IS the company — org charts, budgets, governance, multiple heterogeneous agents as “employees.” Not a replacement for coding harnesses; a layer above them.

Clean-Room (Claw Code, 114K stars)

Replicate Claude Code’s harness in open source. Python/Rust rewrite, model-agnostic, independently audited.

Proxy (Claude Code Router, 31.1K stars)

Intercept the harness instead of replacing it. Route different task types to different models (cheap for simple, expensive for complex). Zero migration, full compatibility.

The Current Landscape (April 2026)

Harness	Stars	Lang	Model Lock-in	Memory	Channels	Skills
Claude Code	Closed	TS	Anthropic	File-based	Via plugins	12 (ours)
OpenClaw	345K	TS	Agnostic	Persistent	15+ native	Marketplace
Claw Code	114K	Rust/Py	Agnostic	Replicated	TBD	Compatible
Hermes	21.6K	Py	Agnostic	Self-improving	15+ native	Self-evolving
Codex CLI	1M devs	Rust	OpenAI	Session	CLI	Via plugins
Gemini CLI	Google	-	Google	Session	CLI	Native
Pi	-	TS	Agnostic	Extensions	CLI	Templates
IronClaw	-	Rust	Agnostic	Local encrypted	Fork of OpenClaw	-
Paperclip	43.9K	Node	Agent-agnostic	DB-backed	Via agents	Injected
CCR	31.1K	TS	Routes any	Unchanged	Unchanged	Unchanged

Recent Events (March-April 2026)

March 31: Claude Code source leak → Claw Code born, 100K stars in 24 hours
April: Anthropic ships Claude Code Channels (native Telegram/Discord)
April: Anthropic cuts off OpenClaw from Claude subscriptions
March 30: OpenAI open-sources codex-plugin-cc (invoke Codex from Claude Code)
March 11: Gemini CLI ships Plan Mode

Skills as the Knowledge Layer

From The Complete Guide to Building Skills for Claude - Anthropic:

Skills sit on top of the harness as the knowledge layer. The harness provides tools (what Claude can do). Skills provide workflows (how Claude should do it).

Three-level progressive disclosure:

YAML frontmatter — always loaded, tells Claude when to use the skill
SKILL.md body — loaded when relevant, full instructions
Linked files — discovered on demand

From Boris Cherny - Hidden Claude Code Features Thread (Claude Code lead):

Skills + loops = powerful automation (/loop 5m /babysit)
Hooks for deterministic lifecycle logic (SessionStart, PreToolUse, PermissionRequest, Stop)
Worktrees for parallel execution (claude -w)
/batch for massive parallelized changesets

What to actually do

Spend time on the harness, not the model. Model improvements arrive via API updates. Harness improvements require your work.
Watch your context window. Past 40% fill, output quality drops. Compact early, use sub-agents for isolation.
Write skills and context files. They encode what you’ve learned about your domain. This is what compounds.
Treat the knowledge graph as the durable asset. Harnesses will be replaced. Your accumulated context won’t.
Don’t lock into one model. Claude Code Router exists because models are swappable. The harness persists.
Research → Plan → Implement — the workflow that emerges from taking context seriously

2026-04 Update — Harness Governance as SpecOps

From Spec-Driven Development – Adoption at Enterprise Scale (Hari Krishnan, InfoQ, Feb 2026), cross-linked into Spec-Driven Development and AI-Native SDLC - 2026 Analysis:

Spec authoring is a first-class engineering surface. Quality practices, version control, review processes, and continuous improvement should apply to specs the same way they apply to production code. Leigh and Ray called this “SpecOps” in their related InfoQ piece.
Bugs are spec defects, not code defects. Two gap types: spec-to-implementation (implementation diverged from a clear spec → strengthen validation agents in the harness) and intent-to-spec (use case was missed during deliberation → improve spec elicitation). Either way, the harness itself is the thing being improved, not the individual bug.
Quality engineering moves upstream. The job shifts from validating implementations to validating the context harnesses that guide agent execution.
Directing agent swarms is an organizational capability, not an individual one. Adrian Cockcroft framing at QCon SF: “directing swarms of agents” is a different skill than managing individual contributors, and enterprises have to install it as a capability at team level, not at developer level.

2026-04 Update — Karpathy’s Frame on Agents, Throughput, and program.md

From Karpathy - No Priors Code Agents Autoresearch:

Token Throughput as the New Coding Bottleneck — Dec 2024 inflection, “Peter Steinberg mode” (10 agents in parallel), orchestration as the new developer skill
Claws - Persistent Looping Agents as App Replacement — Dobby the elf experiment, apps become APIs, agent is the customer
Auto Research - Agents as Overnight Experimentation Engines — give an objective, a budget, and walk away; agents found hyperparameter bugs in Karpathy’s hand-tuned GPT-2 repo
Research Org as Tunable program dot md — the research org itself becomes a markdown document you can version, A/B test, and meta-optimize
Three Waves of AI Opportunity - Unhobbling, Physical Interface, Robotics — 1M-order-of-magnitude gap between bits and atoms; almost all 2026 value is in Wave 1

2026-04 Update — The Folder Is the Agent (44-Agent Production Setup)

From The Folder Is the Agent - Kieran Klaassen (Every, April 2026):

The Folder Is the Agent - Context Accumulation as Specialization — an agent is not a framework or protocol. It’s a folder with CLAUDE.md + accumulated docs + skills. Change the folder, not the model, and you get a different specialist. Klaassen runs 44 agents across a handful of folders — same model (Opus 4.6), different folder = different specialist (Rails engineer vs ops engineer). Dispatch layer is a Ruby daemon with file-based messaging — no custom networking, no agent-to-agent protocol. Key principle: “Build it, use it, trust it, then orchestrate it” — don’t hand a folder to automation until you’ve used it yourself and trust its flows.

2026-04 Update — AXI: Agent-Computer Interface Design, Benchmarked

From AXI Agent eXperience Interface (Kun Chen, April 2026 — 915 benchmark runs):

AXI Principles - Agent-Ergonomic Interface Design — 10 principles for agent-ergonomic CLI (Efficiency / Robustness / Discoverability / Fallback). Benchmarked across 915 runs: same model (Claude Sonnet 4.6), only variable is interface design, 1.6x variance in cost and duration. AXI hits 100% success at lowest cost/turns/duration on both browser and GitHub benchmarks. Mechanisms: TOON format (40% token savings over JSON), combined operations (navigate + snapshot in one call), next-step hints after output.

2026-04 Update — Agent-Native Architecture Principles

From Agent-native Architectures (Dan Shipper + Claude, Every):

Agent-Native Architecture - Five Principles for Building After Code Ends — five principles for building apps where agents are first-class: Parity (agent matches UI), Granularity (atomic tools, outcome-based features), Composability (new features = new prompts), Emergent Capability (handles the unanticipated), Improvement Over Time (context compounds). Files as the universal agent interface. The “ultimate test”: can the agent accomplish an outcome you didn’t build a feature for?

Harness as a Service - Infrastructure for the Agent Era — HaaS: the commercialization of harnesses as a rentable infrastructure category (Cursor SDK, Managed Agents, Foundry)
Most PMs Cant Touch Production - Gus Trigos via Aman Khan — Runtime (YC): sandboxed agent sessions so PMs ship PRs directly, engineering becomes quality gate
Agent Harness Landscape - April 2026 Update — post-leak landscape (Claw Code, Channels, OpenClaw cutoff)
Agent Harness Comparison - Hermes vs Paperclip vs Claude Code Router — detailed feature comparison
Dex Horthy - Context Engineering for AI Coding Workflows — context hygiene, dumb zone, Research → Plan → Implement
Carl Vellotti - Claude Code Mastery Context Engineering — sub-agents for context isolation, builder-validator pattern
Boris Cherny - Hidden Claude Code Features Thread — 15 features from Claude Code’s lead dev
5 Claude Code Patterns - Sequential to Headless — 5 escalating usage patterns
Anthropic - Claude Managed Agents Launch — cloud-hosted meta-harness (brain/hands architecture)
The Complete Guide to Building Skills for Claude - Anthropic — official skill architecture
Spec-Driven Development and AI-Native SDLC - 2026 Analysis — specs as the planning layer
Karpathy - LLM Wiki Pattern for Personal Knowledge Bases — the knowledge layer above the harness
Karpathy - No Priors Code Agents Autoresearch — auto-research, claws, program.md pattern
GAN-Inspired Agent Architecture - Generator Evaluator Loops — multi-agent generator/evaluator pattern inspired by GANs, from Anthropic’s long-running harness work
Criteria-Based Grading for Subjective Agent Output — making subjective quality (design, UX) gradable through decomposed criteria with explicit weightings
Sprint Contracts - Negotiated Agent Agreements — generator and evaluator negotiate “done” definition before each sprint
Harness Simplification as Models Improve — the discipline of stripping scaffolding as model capabilities increase
The 90 to 100 Percent Automation Gap — 95% automation requires 100% monitoring; only 100% lets the human walk away
Cat Wu - Head of Product Claude Code Cowork at Anthropic — insider context on Claude Code leak (human error), OpenClaw cutoff (capacity), anti-pattern of over-customizing MCPs
LLM Document Degradation — agentic tool use does not mitigate document corruption in delegated workflows (DELEGATE-52)
Delegation Readiness and the Jagged Frontier — domain-specific trust calibration for AI delegation
Agent Runtime - The Execution Layer Nobody Is Building — the runtime hosts the harness; compute substrate, sandbox, state model, lifecycle controller
Runtime Shift - A New Form of Distributional Shift for Agents — agents learn the runtime during training; moving runtimes causes silent quality regressions
Sandbox Isolation Primitives for Agent Systems — five isolation primitives (containers, Firecracker, gVisor, Kata, V8) with trade-offs for agent workloads