Why AI Agent Reliability Depends More on the Harness Than the Model

Written by epappas | Published 2026/02/25
Tech Story Tags: artificial-intelligence | agentic-systems | mlops | llms | harness-engineering | ai-agent-harness | context-management | state-persistence

I keep hearing the same question at every engineering offsite, Slack thread, and investor pitch: “What’s the best model right now — GPT, Claude, or Gemini?” I spent the last several months building and debugging agent-based systems, and I think this is the wrong question entirely. The evidence is now overwhelming: what determines whether an AI agent succeeds in production is not the model underneath it, but the infrastructure wrapped around it.

I am going to lay out my hypothesis, test it against three independent case studies with published data, and show you exactly where the industry is converging. Every claim in this article is backed by a published source — engineering blogs, peer-reviewed papers, or reporting from outlets with direct access.

My hypothesis: Agent harness engineering — the design of context management, tool selection, error recovery, and state persistence — is the primary determinant of agent reliability, not model capability. Past a capability threshold, improving the harness yields better returns than swapping the model.


TL;DR

  • The APEX-Agents benchmark tested frontier models on real professional tasks (banking, consulting, law). Best pass@1: 24.0%. Pass@8: ~40%. Failures are primarily orchestration problems, not knowledge gaps [1].
  • Vercel removed 80% of their agent’s tools (15 down to 2). On a 5-query benchmark, accuracy jumped from 80% to 100%, tokens dropped 37%, speed improved 3.5x. Small sample, but the direction is striking [2].
  • Manus rebuilt their agent framework four times, and the biggest gains came from removing user-facing complexity while adding targeted infrastructure (context compaction, logit masking). They average ~50 tool calls per task and use the filesystem as external memory [3].
  • OpenAI, Anthropic, and Manus (acquired by Meta in late 2025 [9][19]) all independently converged on the same insight: simpler harnesses plus better models beat complex orchestration [4][5][6].
  • Verdict: The hypothesis holds with one important qualification — it applies above a model capability floor. Below that floor, no harness compensates for insufficient reasoning. Above it, harness engineering dominates outcomes.

1. Defining the Harness

Before going further, let me define what I mean by harness. OpenAI recently published a blog post explicitly titled “Harness Engineering” [4], and Martin Fowler published an analysis of the concept [7]. The term is gaining traction, but here is a precise technical definition:

An agent harness is the infrastructure layer that wraps a foundation model and controls five things:

  1. Context management — what enters the model’s context window, in what order, and what gets evicted
  2. Tool selection — which capabilities the model can invoke, and how those interfaces are designed
  3. Error recovery — how the system handles failed tool calls, reasoning dead-ends, and retry logic
  4. State management — how the agent persists progress across turns, sessions, and context window boundaries
  5. External memory — how information is stored and retrieved beyond the context window

Think of the model as the engine and the harness as the car. The industry has spent years arguing about who has the best engine. Almost nobody has been building a car that can stay on the road.

2. The Benchmark That Broke the Illusion

The disconnect between benchmark scores and real-world performance has been a running joke in the industry. Models score above 90% on coding puzzles and multiple-choice tests, then fail at the kind of work an analyst does on a Tuesday morning.

In January 2026, Mercor published APEX-Agents [1], a benchmark that does something different: it tests agents on real professional work. Not coding puzzles. Not trivia. The actual tasks that investment banking analysts, management consultants, and corporate lawyers perform — the kind of work that takes a human 1–2 hours and involves navigating documents, spreadsheets, PDFs, email, and calendars across multi-day engagements.

The benchmark consists of 480 tasks across 33 distinct “worlds” — 10 banking, 11 consulting, 12 legal — each simulating a 5–10 day client engagement with an average of 166 files per world.

The Results

The best model's single-attempt (pass@1) score was 24.0%. With eight attempts (pass@8), it climbed to only ~40%. Depending on the agent configuration, zero-score rates — where the agent failed every rubric criterion — ranged from 40% to 62% across tested configurations. Timeout rates (exceeding 250 steps) reached up to 30% for some models.

These numbers come from the APEX-Agents evaluation framework (“Archipelago”), which runs each agent in a sandboxed environment with standardized tool access, a 250-step limit, and rubric-based scoring by domain experts. Pass@1 reflects a single attempt; pass@8 takes the best of eight independent runs. The scores above represent best-case results across tested configurations — individual harness setups produced significant variance.

The critical finding: these failures were predominantly not knowledge failures. The models had the information and could reason through the problems in isolation. The failures were execution and orchestration problems — agents getting lost after too many steps, looping back to failed approaches, and losing track of their objectives mid-task.

This is exactly the failure pattern that harness engineering addresses: context management (losing track), error recovery (looping on failures), and state management (forgetting objectives).

3. Vercel’s Counterintuitive Discovery: Fewer Tools, Better Results

This case study is the one that challenged my own intuitions most directly.

Vercel had a text-to-SQL agent called d0. The architecture was standard and, honestly, was what I would have built: specialized tools for every stage of the pipeline [2].

The Old Architecture: 15 Specialized Tools

GetEntityJoins     LoadCatalog           RecallContext
LoadEntityDetails  SearchCatalog         ClarifyIntent
SearchSchema       GenerateAnalysisPlan  JoinPathFinder
FinalizeQueryPlan  FinalizeNoData        SyntaxValidator
FinalizeBuild      ExecuteSQL            FormatResults

Each tool had structured inputs, validation, error handling, and prompt engineering around it. This is how most teams build agents — the instinct is to constrain the model, mediate its interactions, and provide specialized interfaces for every operation.

It worked 80% of the time (4/5 on their benchmark).

The New Architecture: 2 Tools

Then they did something radical: they deleted most of it. The new agent has exactly two tools:

  1. ExecuteCommand — bash access in a Vercel Sandbox
  2. ExecuteSQL — direct query execution

The agent now uses grep, cat, find, and ls to explore YAML, Markdown, and JSON files representing the Cube semantic layer. Standard Unix utilities that every developer already knows.

The Numbers

The worst case under the old system: 724 seconds, 145,463 tokens, 100 steps — and it still failed. The filesystem agent completed the same query in 141 seconds using 67,483 tokens across 19 steps — successfully.

The model they used: Claude Opus 4.5, running inside a Vercel Sandbox with access to the Vercel AI Gateway.

Vercel’s team published an open-source tool (bash-tool) and a companion post on building agents with filesystems and bash [8]. Their conclusion: “The best agents might be the ones with the fewest tools.”

Why This Works

The insight is not that tools are bad. It is that specialized tools become bottlenecks when the model is already capable enough to use general-purpose interfaces. Each specialized tool is a constraint point — the model must learn its schema, handle its errors, and decide when to use it versus alternatives. With 15 tools, the model spends more tokens choosing than doing.

General-purpose tools (bash, file access) map directly to how models are trained. Most frontier models have seen enormous amounts of shell interaction in their training data. They know how to grep. They do not know how to call GetEntityJoins with the right parameters.
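The entire tool surface of the new architecture fits in a few lines. This sketch uses the two tool names from the article (ExecuteCommand, ExecuteSQL); the dispatcher and the SQL stub are my own illustrative assumptions — in production, ExecuteCommand runs inside an isolated sandbox, never the host shell, and ExecuteSQL hits a real database.

```python
import subprocess

# Sketch of the two-tool surface from the Vercel case study. The dispatcher
# and the SQL stub are illustrative; production code would sandbox the shell.

def execute_command(cmd: str, timeout: int = 30) -> str:
    """ExecuteCommand: general-purpose shell access (grep, cat, find, ls, ...)."""
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
    return out.stdout if out.returncode == 0 else f"exit {out.returncode}: {out.stderr}"

def execute_sql(query: str) -> str:
    """ExecuteSQL: direct query execution (stubbed here for illustration)."""
    return f"would run: {query}"  # placeholder; the real version queries the database

TOOLS = {"ExecuteCommand": execute_command, "ExecuteSQL": execute_sql}

def dispatch(tool: str, **kwargs) -> str:
    """Route a model-issued tool call to one of the two tools."""
    return TOOLS[tool](**kwargs)
```

The model's decision space collapses from fifteen schemas to two, and everything it already knows about Unix transfers directly.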

4. Manus: Four Rebuilds and a $2B Lesson

Manus went viral in early 2025 as a general-purpose AI agent. Then they did something most companies avoid: they published their mistakes. In their blog post “Context Engineering for AI Agents” [3], Yichao “Peak” Ji detailed how they rebuilt their framework four times, each time discovering a better approach to context management.

In December 2025, Meta acquired Manus for a reported ~$2 billion according to CNBC and TechCrunch [9][19] — validation that the harness architecture they built had significant production value beyond the underlying model.

What They Removed

Each rebuild followed a pattern: removing user-facing complexity that seemed necessary but was degrading performance, while investing in targeted internal infrastructure (compaction, caching, logit masking) that improved the model’s operating environment.

  • A complex document retrieval system — replaced by direct file access
  • Fancy routing logic between specialized sub-agents — replaced by structured handoffs
  • Specialized tools for each operation — replaced by general-purpose shell execution

What They Kept and Refined

Filesystem-as-memory: Instead of stuffing everything into the context window, the agent writes key information to files and reads it when needed. As they describe it, files are “unlimited in size, persistent by nature, and directly operable by the agent” [3].

Todo-list mechanism: The agent maintains a persistent progress file, reciting its objectives at the end of the context to combat the “lost-in-the-middle” attention degradation [10].
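The recitation trick is simple enough to sketch in one function: re-append the current objectives at the tail of the context, where attention is strongest, on every turn. The helper name and checklist format here are mine, not Manus's actual implementation.

```python
# Illustrative sketch of objective recitation: place the todo list at the end
# of the context each turn, where the model attends most strongly.

def recite(context: str, todo: list[tuple[str, bool]]) -> str:
    """Append current objectives (item, done) to the tail of the context."""
    lines = "\n".join(f"[{'x' if done else ' '}] {item}" for item, done in todo)
    return context + "\n\n## Current objectives\n" + lines
```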

Context compaction: With an input-to-output ratio of approximately 100:1, they implemented a compaction hierarchy:

  1. Raw context (preferred) — full tool output
  2. Compaction — swap full results for compressed versions while preserving restoration paths (URLs, file paths)
  3. Summarization (last resort) — only when compaction no longer yields sufficient space
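The middle step of the hierarchy, compaction, can be sketched as swapping a full tool result for a compressed stub that preserves its restoration path, so the agent can re-fetch the original on demand. The function and field names here are illustrative, not Manus's code.

```python
# Sketch of the compaction step: replace a large tool result with a stub that
# keeps the restoration path (file path or URL). Names are illustrative.

def compact(entry: dict, max_chars: int = 200) -> dict:
    """Compact one context entry, keeping the source needed to restore it."""
    content = entry["content"]
    if len(content) <= max_chars:
        return entry  # small enough: keep raw context (level 1)
    return {
        "source": entry["source"],                # restoration path
        "content": content[:max_chars] + " ...",  # short preview only
        "compacted": True,
        "original_chars": len(content),
    }

def restore(entry: dict, fetch) -> str:
    """Re-read the full content from its source when the agent needs it back."""
    if entry.get("compacted"):
        return fetch(entry["source"])
    return entry["content"]
```

Unlike summarization, nothing is lost: the stub is reversible as long as the source still exists.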

KV-cache optimization: By maintaining stable prompt prefixes, append-only contexts, and deterministic serialization, they achieved 10x cost savings on cached tokens ($0.30/MTok vs $3/MTok uncached with Claude Sonnet) [3].
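KV-cache hits require byte-identical prefixes, which is why all three properties matter together. This sketch shows the principle under my own naming — a fixed system prefix, append-only history, and deterministic JSON serialization (sorted keys, no timestamps) so consecutive requests share the longest possible cached prefix.

```python
import json

# Sketch of KV-cache-friendly context construction: stable prefix, append-only
# history, deterministic serialization. Illustrative, not Manus's actual code.

SYSTEM_PREFIX = "You are an agent. Tools: ExecuteCommand, ExecuteSQL."  # never changes

def serialize(history: list[dict]) -> str:
    """Deterministic: the same history always yields byte-identical output."""
    return SYSTEM_PREFIX + "\n" + "\n".join(
        json.dumps(msg, sort_keys=True, separators=(",", ":")) for msg in history
    )

def append_turn(history: list[dict], role: str, content: str) -> list[dict]:
    """Append-only: never rewrite earlier turns, so their cached prefix stays valid."""
    return history + [{"role": role, "content": content}]
```

A single timestamp or reordered key early in the prompt invalidates the cache for everything after it, which is why determinism is a hard requirement, not a style preference.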

Tool management via logits masking: Rather than dynamically adding and removing tools from the prompt, they use a context-aware state machine that constrains tool selection through logit-level masking. Three modes: Auto (model chooses), Required (unconstrained), Specified (subset selection via prefilling).
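True logit masking happens at decode time inside the model server; what can be sketched here is only the outer state machine that decides which tools are selectable in each mode. The three modes follow the article's description; the function, the prefix convention, and the tool names are my own illustrative assumptions.

```python
# Approximation of the tool-gating state machine. Real logit masking constrains
# token probabilities during decoding; this captures only the outer decision:
# which tools may be selected, and whether a tool call is mandatory.

def allowed_tools(all_tools: list[str], mode: str, prefix: str = "") -> tuple[bool, list[str]]:
    """Return (must_call_tool, permitted_tools) for the current state."""
    if mode == "Auto":        # model chooses: tool call is optional
        return False, all_tools
    if mode == "Required":    # must call a tool, any tool
        return True, all_tools
    if mode == "Specified":   # must call a tool from a constrained subset
        return True, [t for t in all_tools if t.startswith(prefix)]
    raise ValueError(f"unknown mode: {mode}")
```

Because the tool definitions themselves never leave the prompt, the KV cache stays warm; only the selection constraint changes per step.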

The Production Scale

Their agents average approximately 50 tool calls per task. Even with large context windows (200k+ tokens), performance degraded past a threshold — not because the model “forgot” earlier content, but because the signal-to-noise ratio in the context window collapsed. Important instructions at the beginning get buried under hundreds of intermediate tool results.

This aligns with the “Lost in the Middle” research by Liu et al. [10], which demonstrated that LLMs exhibit a U-shaped attention pattern — they attend strongly to the beginning and end of context but poorly to the middle. Greg Kamradt’s “Needle in a Haystack” tests [11] confirmed this empirically across multiple frontier models.

5. Three Architectures, One Convergence

The three most production-tested agent harnesses right now are OpenAI Codex, Claude Code, and Manus. They were built independently by different teams with different philosophies. They converged on the same core insight.

OpenAI Codex: Harness Engineering as a Discipline

OpenAI published “Harness Engineering” [4] and “Unlocking the Codex Harness” [12] — describing how a small team built and shipped a million-line production system in five months using Codex agents. Per their blog, the engineers wrote no source code directly; they shifted from writing code to designing harness environments, specifying intent, and reviewing agent-generated pull requests.

Their architecture enforces a strict layered dependency model: code may only depend forward through the layer stack, never backward. Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. This is classical layered architecture applied to agent-generated code; the harness enforces the constraints that keep the agent productive.

Claude Code: Minimal Tools, Maximum Model Intelligence

Anthropic’s approach with Claude Code is deliberately minimal. The core tool set centers on:

  • Read a file
  • Write/Edit a file
  • Run bash commands
  • Search (grep/glob)

Most of the intelligence lives in the model. Extensibility comes through MCP (Model Context Protocol) [13] — an open protocol for connecting Claude to external tool servers — and project-level instructions via CLAUDE.md files.

Anthropic published a companion guide on “Effective Harnesses for Long-Running Agents” [5], recommending a two-agent pattern:

  1. Initializer Agent — sets up the environment on first run (init.sh, progress file, feature tracking)
  2. Coding Agent — handles incremental work, reading progress files at session start

Their key state management artifacts: an init.sh script for reproducible environments, a claude-progress.txt file for work logging, and git for version control and rollback. The constraint: one feature per session, incremental progress, leave code in a mergeable state.
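The session-start/session-end ritual is the essence of the pattern and fits in a few lines. The file name follows the article (claude-progress.txt); the helper functions are my own sketch, assuming the progress file lives in the working directory.

```python
from pathlib import Path

# Sketch of the session bootstrap from Anthropic's guide: read the progress
# file at session start, append one feature's outcome at session end.
# The helpers are illustrative; only the file name comes from the article.

PROGRESS = Path("claude-progress.txt")

def session_start() -> str:
    """Read prior progress so the agent resumes instead of restarting."""
    return PROGRESS.read_text() if PROGRESS.exists() else "(fresh environment)"

def session_end(feature: str, status: str) -> None:
    """Log exactly one feature's outcome, leaving the repo in a mergeable state."""
    with PROGRESS.open("a") as f:
        f.write(f"{feature}: {status}\n")
```

Paired with git for rollback, this is the simplest possible state management that survives context window boundaries.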

Manus: Reduce, Offload, Isolate

Manus’s approach can be summarized in three words:

  1. Reduce — aggressively shrink context through compaction and eviction
  2. Offload — use the filesystem for persistent memory beyond the context window
  3. Isolate — delegate heavy sub-tasks to sub-agents and pull back summaries

(Figure: Manus architecture diagram)

The Convergence

Three independent architectures, same direction: strip specialized tooling down to a few general-purpose primitives, push state and memory out to the filesystem, and let the model's capability carry the rest.

6. The Bitter Lesson, Applied

Richard Sutton’s “The Bitter Lesson” [14], published in March 2019, is one of the most cited essays in modern AI. The core argument: “The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.”

Sutton was writing about search and learning methods. But the pattern maps directly to agent harnesses: hand-built orchestration layers are the "human knowledge" approach, while minimal general-purpose harnesses that lean on model capability are the "computation" approach.

Every Vercel tool that was removed, every Manus retrieval system that was deleted, every routing layer that was replaced with a simple handoff — these are instances of the Bitter Lesson playing out in real time.

I want to be honest about a tension here. A strict reading of Sutton’s argument would predict that harness engineering itself will eventually be obsoleted by sufficiently capable models — that we should just scale models until they handle long-horizon tasks end-to-end without orchestration scaffolding. That counterargument is real and I take it seriously. Manus had to rebuild their harness four times as models evolved, which is itself evidence that model improvements erode harness value.

My position is that multi-step execution tasks have irreducible coordination requirements — context management, state persistence, error recovery — that are not reasoning problems for the model to solve but infrastructure problems for the system to handle. A model does not need to be “smarter” to save its progress to disk; it needs a harness that persists state. The harness is itself a general method: it manages context and recovers from errors in ways that scale with model capability. The key distinction is that the harness should get simpler as models improve, not more complex.

The practical implication: if every model upgrade makes you add more hand-coded logic, routing, or pipeline steps, you are swimming against the current. Build for deletion. Every piece of harness logic should be something you can remove when the model no longer needs it. If your infrastructure keeps getting more complicated as models improve, you are over-engineering.

This is not a theoretical argument. Anthropic’s “Building Effective Agents” guide [6] explicitly recommends starting with simple patterns (augmented LLM, prompt chaining) before reaching for complex agent frameworks. LangChain’s evolution from heavily-abstracted chains (v0.1–0.2) to the simpler graph-based composition of LangGraph [15] is another instance of this pattern. The industry is learning the Bitter Lesson in real time.

The Smartphone Analogy

In early smartphones, the processor was the story — faster chips meant better phones. Eventually processors crossed a sufficiency threshold and the difference stopped mattering to users. Differentiation moved to the operating system (iOS vs. Android), the camera software (computational photography, not the sensor), and the developer ecosystem.

Raw compute power became a commodity. Value moved to the infrastructure layer.

The same pattern played out in cloud computing: server hardware commoditized, and value moved to AWS’s infrastructure abstractions. In databases: raw storage commoditized, and value moved to query optimization and transaction management. In GPUs: raw FLOPS commoditized across NVIDIA SKUs, and value moved to the CUDA/cuDNN/PyTorch software stack.

In the agent era, the harness is the operating system. The teams and companies that build great harnesses will maintain their advantage as the underlying models keep changing. This is why OpenAI built Codex as a harness product, not just a model. This is why Meta reportedly paid ~$2 billion for Manus’s harness [9][19], not a foundation model.

7. Production Engineering Realities

The case studies above are compelling, but they focus on capability. Production systems must also address reliability, observability, cost, and security. These are harness concerns that the published literature often underemphasizes.

This is not a new insight in ML systems. Sculley et al.’s “Hidden Technical Debt in Machine Learning Systems” [16] demonstrated in 2015 that ML model code is a small fraction of a production ML system — the surrounding infrastructure dominates. Agent harnesses are the latest manifestation of the same pattern.

Context Window Economics

Context is not free. As of mid-2025, Claude Sonnet uncached input tokens cost $3/MTok. Manus's approximately 100:1 input-to-output ratio means context management directly determines cost. Their KV-cache optimization (stable prefixes, append-only context, deterministic serialization) cuts this to $0.30/MTok for cached tokens [3]. That is a 10x cost reduction from a pure harness optimization, with zero model changes. (Pricing is time-sensitive; verify current rates before applying these figures to your own cost models.)

For a system averaging 50 tool calls per task, naive context management can easily push a single task to 200k+ tokens. At $3/MTok uncached, that is $0.60 per task. At $0.30/MTok cached, it is $0.06. Across millions of tasks, this is the difference between a viable product and an unsustainable cost structure.
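The arithmetic above is worth making explicit, using the article's mid-2025 rates (verify current pricing before reusing these numbers).

```python
# The per-task cost arithmetic from the paragraph above, at the article's
# mid-2025 Claude Sonnet input rates: $3/MTok uncached, $0.30/MTok cached.
# Pricing is time-sensitive; check current rates before reusing.

def task_cost(tokens: int, price_per_mtok: float) -> float:
    """Input cost in dollars for one task."""
    return tokens / 1_000_000 * price_per_mtok

uncached = task_cost(200_000, 3.00)   # naive context management -> $0.60/task
cached   = task_cost(200_000, 0.30)   # KV-cache-optimized       -> $0.06/task
```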

Failure Mode Taxonomy

From the APEX-Agents analysis, the Manus blog post, and Anthropic's harness guide, a consistent taxonomy of agent failure modes emerges: context loss (the agent gets lost after too many steps), failure loops (retrying approaches that already failed), objective drift (forgetting the goal mid-task), and timeouts (exhausting the step budget). Every one of these is a harness problem, not a knowledge problem.

Observability

One production tradeoff of the “fewer tools” approach: specialized tools produce structured telemetry (tool=search_code, query=X, results=N, latency=Yms). Bash commands produce unstructured output that requires parsing to extract equivalent signals.

Production harnesses need a structured logging layer regardless of tool design:

  • Per-tool-call telemetry: tool name, input hash, output size, latency, success/failure
  • Context utilization tracking: tokens used vs budget, cache hit rate, compaction events
  • Task-level metrics: total steps, total tokens, wall-clock time, outcome
  • Distributed tracing: OpenTelemetry spans across multi-turn agent workflows
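A minimal version of that logging layer is a wrapper that emits one structured record per tool invocation, covering the per-call fields listed above. The wrapper and record shape are my own sketch; a production system would ship these records to a tracing backend rather than an in-memory list.

```python
import json, time, hashlib

# Sketch of a structured logging layer wrapping every tool call.
# The wrapper and record shape are illustrative assumptions.

def instrumented(tool_name: str, fn, log: list):
    """Wrap a tool so each invocation emits one structured telemetry record."""
    def wrapper(**kwargs):
        start = time.monotonic()
        record = {
            "tool": tool_name,
            "input_hash": hashlib.sha256(
                json.dumps(kwargs, sort_keys=True).encode()).hexdigest()[:12],
        }
        try:
            result = fn(**kwargs)
            record.update(success=True, output_size=len(str(result)))
            return result
        except Exception as exc:
            record.update(success=False, error=type(exc).__name__)
            raise
        finally:
            record["latency_ms"] = round((time.monotonic() - start) * 1000, 2)
            log.append(record)
    return wrapper
```

This works identically for a bash tool and a specialized tool, which is the point: the telemetry lives in the harness, so tool design can stay minimal.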

Security Considerations

The “give it bash” approach has an obvious security surface. Vercel addresses this with sandboxed execution (Vercel Sandbox). Manus uses full VM isolation. Claude Code runs locally with user-controlled permissions.

For production deployments:

  • Sandbox everything: Shell access without isolation is a vulnerability, not a feature
  • Principle of least privilege: The agent should have access to exactly what it needs for the current task
  • Audit logging: Every tool invocation should be logged for compliance and forensics
  • Input/output filtering: Sensitive data in context windows requires handling at the harness level
  • Egress controls: A manipulated agent could use legitimate tool calls to exfiltrate data — for example, encoding sensitive context into web search query parameters. Egress monitoring and content inspection on tool inputs are necessary
  • Secret management: API keys and credentials required by tools must be injected at the harness level, never exposed in the context window where they could leak through model outputs
  • Data governance: When using filesystem-as-memory patterns, apply retention policies and data classification. Agent-written files may contain PII, proprietary data, or intermediate reasoning that requires the same governance as any other data store

8. Where My Assumptions Broke

Assumption 1: “More tools means more capability”

What I found: The Vercel case study directly contradicts this. 15 specialized tools produced 80% accuracy. 2 general-purpose tools produced 100%. The model is not constrained by tool availability — it is constrained by tool complexity. Each additional tool increases the decision space and the probability of misrouting.

Assumption 2: “Context windows are big enough now”

What I found: Even 200k+ token windows degrade under production workloads. Manus’s 50-tool-call sessions generate enough intermediate content to drown the signal. The “Lost in the Middle” research [10] and Needle-in-a-Haystack evaluations [11] confirm this is not just anecdotal. Context window size is necessary but not sufficient — what matters is context quality, which is a harness responsibility.

Assumption 3: “The Bitter Lesson means you should not build infrastructure”

What I found: This is a misreading of Sutton’s argument. The Bitter Lesson says general methods that scale with compute win. It does not say do nothing and wait for better models. A good harness is itself a general method — it manages context, recovers from errors, and persists state in ways that scale with model capability. The key is that the harness should get simpler as models improve, not more complex. Build infrastructure that can be progressively deleted.

Assumption 4: “Benchmark scores predict production performance”

What I found: APEX-Agents exposed this comprehensively. Models scoring 90%+ on traditional benchmarks achieved 24% on professional tasks. The gap is not intelligence — it is execution infrastructure. Benchmarks that test isolated reasoning tell you about the engine. Production tells you about the car.

9. Was My Hypothesis Correct?

Verdict: Correct, with one important qualification.

Where It Holds

For any production agent system where the underlying model meets a capability floor (roughly, a model that can reliably follow multi-step instructions, use tools via structured function calling, and recover from single-step errors — GPT-4-class and above), the following hold. Operationally, you can test for the floor by running your agent on ten representative tasks and checking whether the failures are reasoning errors or orchestration errors:

  • Harness engineering yields higher marginal returns than model selection
  • Simplifying the harness improves outcomes more often than adding complexity
  • Context management, error recovery, and state persistence are the primary failure points, not model reasoning
  • The Vercel (80% to 100%), Manus (iterative simplification), and APEX-Agents (~24% despite high benchmark scores) data all support this

Where It Breaks

Below a model capability threshold, no harness compensates for insufficient reasoning. You cannot harness-engineer GPT-3.5 into solving APEX-Agents consulting tasks. The harness amplifies model capability — it does not replace it.

Also, for tasks that are purely reasoning-bound (mathematical proofs, novel algorithm design), model capability dominates. The harness thesis applies most strongly to long-horizon, tool-using, multi-step execution tasks — which is exactly the category where agents are being deployed in production.

What I Recommend

  1. Run the Vercel experiment on your own system. Strip your agent to bash + file access. Run your eval suite. If performance improves, your specialized tools were net-negative. If it drops, your task genuinely requires structured interfaces.
  2. Add a progress file. Have your agent maintain a persistent todo list that it reads at the start of each action and writes to at the end. This is the simplest possible state management, and both Manus and Claude Code use variants of it.
  3. Measure your context budget. Instrument your agent to track tokens consumed per task. Set a budget. When you hit it, you have a harness problem, not a model problem.
  4. Build for deletion. Every piece of harness logic should have an expiration date. If the next model can handle something without your scaffolding, delete the scaffolding.
  5. Adopt MCP for tool interfaces. Anthropic’s Model Context Protocol [13] is becoming a de facto standard for connecting agents to external tools. Clean tool interfaces are cheaper to maintain than custom integrations.
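Recommendation 3 is the easiest to start with. A minimal sketch of the budget instrumentation, with my own class name and threshold convention — real token counts would come from the provider's usage metadata:

```python
# Minimal sketch of per-task context budgeting (recommendation 3).
# The class and threshold convention are illustrative; feed it the token
# counts reported in your provider's API usage metadata.

class ContextBudget:
    def __init__(self, budget_tokens: int):
        self.budget = budget_tokens
        self.used = 0

    def record(self, tokens: int) -> None:
        """Accumulate tokens consumed by one model call or tool result."""
        self.used += tokens

    @property
    def over_budget(self) -> bool:
        """When this trips, you have a harness problem, not a model problem."""
        return self.used > self.budget
```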

Final Thoughts

2025 was the year of agents. 2026 is the year of harnesses.

If you think Opus is the best coding model right now, notice that it behaves differently in Claude Code versus Cursor versus the API with a custom harness. The model is the same. The harness changes everything.

The biggest AI companies are all telling you this. OpenAI published “Harness Engineering.” Anthropic published guides on effective harnesses. Manus published their context engineering lessons (and Meta reportedly paid ~$2 billion for the result [9][19]). The evidence is not subtle.

Choose your harness carefully — whether you are using an agent or building one. The model will change every few months. The harness is what makes it work.

References

[1] Mercor. “APEX-Agents.” arXiv:2601.14242, January 2026. https://arxiv.org/abs/2601.14242. Benchmark: https://www.mercor.com/apex/

[2] Vercel. “We removed 80% of our agent’s tools.” Vercel Blog, 2025. https://vercel.com/blog/we-removed-80-percent-of-our-agents-tools

[3] Ji, Yichao “Peak”. “Context Engineering for AI Agents: Lessons from Building Manus.” Manus Blog, 2025. https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus

[4] OpenAI. “Harness Engineering: Leveraging Codex in an Agent-First World.” OpenAI Blog, 2025. https://openai.com/index/harness-engineering/

[5] Anthropic. “Effective Harnesses for Long-Running Agents.” Anthropic Engineering, 2025. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents

[6] Anthropic. “Building Effective Agents.” Anthropic Research, December 2024. https://www.anthropic.com/research/building-effective-agents

[7] Fowler, Martin. “Harness Engineering.” MartinFowler.com, 2025. https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html

[8] Vercel. “How to Build Agents with Filesystems and Bash.” Vercel Blog, 2025. https://vercel.com/blog/how-to-build-agents-with-filesystems-and-bash

[9] CNBC. “Meta acquires intelligent agent firm Manus, capping year of aggressive AI moves.” December 30, 2025. https://www.cnbc.com/2025/12/30/meta-acquires-singapore-ai-agent-firm-manus-china-butterfly-effect-monicai.html

[10] Liu, Nelson F. et al. “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172, 2023. https://arxiv.org/abs/2307.03172

[11] Kamradt, Greg. “Needle in a Haystack — Pressure Testing LLMs.” GitHub, 2023. https://github.com/gkamradt/LLMTest_NeedleInAHaystack

[12] OpenAI. “Unlocking the Codex Harness: How We Built the App Server.” OpenAI Blog, 2025. https://openai.com/index/unlocking-the-codex-harness/

[13] Anthropic. “Model Context Protocol.” 2024–2025. https://modelcontextprotocol.io/

[14] Sutton, Richard S. “The Bitter Lesson.” March 13, 2019. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[15] LangChain. “LangGraph Documentation.” https://langchain-ai.github.io/langgraph/

[16] Sculley, D. et al. “Hidden Technical Debt in Machine Learning Systems.” NeurIPS 2015. https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html

[17] OpenAI. “OpenAI Agents SDK.” GitHub, 2025. https://github.com/openai/openai-agents-python

[18] OpenAI. “A Practical Guide to Building Agents.” January 2025. https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf

[19] TechCrunch. “Meta just bought Manus, an AI startup everyone has been talking about.” December 29, 2025. https://techcrunch.com/2025/12/29/meta-just-bought-manus-an-ai-startup-everyone-has-been-talking-about/


Written by epappas | Software Architect & Engineer - Multi-exit startup founder too.