Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

Written by aimodels44 | Published 2026/02/24
Tech Story Tags: ai | agents.md | ai-context-files | repository-level-context | ai-code-assistants | instruction-following | agentbench | tool-overuse

TL;DR: A new study suggests AGENTS.md-style repo context files can reduce coding-agent success while raising inference cost. Here's why, and what to do instead.

The assumption nobody questions

Every major AI coding framework tells developers the same thing: create an AGENTS.md file for your repository. Document the structure. Explain the conventions. Tell the agent how to be successful in your codebase. It sounds obvious. When you're giving instructions to someone—or something—you should be detailed and thorough, right? That's just good communication.

This practice has become universal because the logic seems airtight. Coding agents operate with limited context windows. Detailed guidance fills in gaps. Clearer instructions should produce better outcomes. The entire ecosystem of Claude, Codex, and other AI coding assistants actively recommends this approach. Popular frameworks bake it directly into their workflows. When consensus converges this completely, you stop questioning whether the thing works. You just assume it does.

But what if the assumption is backwards?

A new paper from arXiv presents something counterintuitive: repository context files like AGENTS.md actually reduce task success rates while increasing inference costs by over 20%. The carefully crafted instructions that developers believe guide their AI helpers toward success often guide them away from it. The agents aren't malfunctioning. They're following instructions perfectly. Which is precisely the problem.

Setting up the experiment

To answer whether context files help, you need to measure something real. It's not enough to ask "do developers think this helps?" or "does it seem reasonable?" You need actual tasks, consistent measurements, and a way to isolate the effect of context files specifically.

The researchers built this isolation using two complementary datasets. The first uses established SWE-bench tasks from popular repositories with LLM-generated context files created according to agent-developer recommendations. The second is newer: AGENTbench, a collection of over 200 real issues from repositories where developers had already committed their own context files. This second dataset is crucial because it represents real-world belief in action. When actual developers write context files and commit them, they're betting that these files help. That bet becomes testable data.

The experimental design is simple but rigorous: take the same task, run the same agent, and vary only whether a context file is present. Three conditions for each task: no repository context (just the task description), context generated by an LLM following standard prompts, or context written by actual developers.

Overview of the evaluation pipeline. The researchers begin with real-world repositories and tasks derived from past pull requests. For each repository, they generate three settings: one without context files, one with LLM-generated context files, and one with developer-provided context files.

The AGENTbench dataset spans 12 open-source GitHub repositories, each containing context files that developers had already written.

Distribution of AGENTbench instances across 12 open-source GitHub repositories, each containing context files.

The main finding

The results appear across multiple models and both datasets. Agents with no repository context outperform agents given context files. The effect is consistent and substantial. On SWE-bench Lite, GPT-4o achieves 33.5% resolution without context files, 32% with LLM-generated context, and 29.6% with developer-written context. The pattern repeats across other models. On AGENTbench, where real-world issues and developer-written context files live, the gap widens. GPT-4o reaches 32.5% resolution without context files but only 24.2% with developer-provided context.

This isn't a small effect. Context files consistently harm performance, and the harm is larger when the context comes from real developers rather than LLMs.

Resolution rate for four different models, without context files, with LLM-generated context files, and with developer-written context files, on SWE-bench Lite (left) and AGENTbench (right).

The cost penalty accompanies this performance loss. Agents working with context files spend significantly more reasoning tokens—the computational budget used for thinking rather than acting.

Number of reasoning tokens spent on average by GPT-4o and GPT-4o mini, without context files, with LLM-generated context files, and with developer-written context files, on SWE-bench Lite (left) and AGENTbench (right).

So context files make tasks harder to solve and more expensive to run. Understanding why requires looking at what agents actually do when they receive these files.

Why this happens

The behavioral data reveals the mechanism. When given context files, agents explore more. They look at more files. They run more tests. They call more tools. They ask more questions. By almost every metric of "thoroughness," agents become more thorough.

The problem is that this thoroughness doesn't translate into problem-solving. Instead, it delays it. Agents take more steps before they interact with the files that actually need fixing.

Number of steps before the agent first interacts with a file included in the PR patch (lower is better). The count is generally lower without context files than with LLM-generated or developer-written context files.

Context files act like a checklist that agents take seriously. When a file says "make sure to test thoroughly," agents do more thorough testing. When it says "this repo has these key directories," agents spend more time exploring them. The agents aren't misbehaving. They're following instructions precisely. The problem is that following instructions about how to be thorough makes them worse at the actual task.

This shows up clearly in tool usage patterns. Agents call functions more frequently when context files are present. They search for more information, run more commands, and invoke more tools.

Increase in average tool use when including LLM-generated (bright green) or developer-provided (dark green) context files, compared to the average tool use without context files.

But the increase in activity isn't random. When agents receive context files that mention specific tools or frameworks, they use those tools more frequently. There's a direct correlation between being told about a tool and using that tool, regardless of whether using it actually helps solve the problem.

Average number of tool calls depending on whether the tool name is mentioned in the context files. Tools mentioned in context files get used more frequently.

This systematic pattern emerges from a simple principle: agents follow what they're told to do. When context files include suggestions about exploration, testing, or tool usage, agents implement those suggestions faithfully. They're optimizing for instruction-following rather than problem-solving. In this case, instruction-following and problem-solving point in opposite directions.
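The mention-versus-usage correlation is straightforward to measure from agent traces. The sketch below is an assumed analysis, not the paper's code: the trace format (a dict with a `context` string and a `tool_calls` list) is invented for illustration.

```python
# Hypothetical sketch: for each tool, compare the average number of calls in
# runs whose context file mentions the tool against runs whose file does not.
# The trace format here is an assumption, not the paper's actual format.
from collections import defaultdict

def mention_effect(traces):
    """traces: list of dicts with 'context' (str) and 'tool_calls' (list[str])."""
    tools = {name for tr in traces for name in tr["tool_calls"]}
    counts = defaultdict(lambda: {"mentioned": [], "unmentioned": []})
    for tr in traces:
        for tool in tools:
            bucket = "mentioned" if tool in tr["context"] else "unmentioned"
            counts[tool][bucket].append(tr["tool_calls"].count(tool))
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    # Average call count per tool, split by whether the context mentioned it.
    return {t: {b: avg(v) for b, v in bkts.items()} for t, bkts in counts.items()}
```

If the paper's finding holds in your own traces, the "mentioned" averages will sit consistently above the "unmentioned" ones, independent of whether the tool was relevant to the fix.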

The details that lock it in

This mechanism holds across different models and different prompts for generating context files. When researchers generated context files using different LLMs or different prompting strategies, the harmful effects persisted. The specific model used to create context files mattered less than the fact that context files existed at all.

Interestingly, larger models sometimes generate better context files than smaller ones, yet even these larger-model-generated files still reduce task success rates. The generation quality doesn't solve the fundamental problem.

On SWE-bench Lite, performance is improved with context files generated by GPT-4o compared to using the model underlying the agent, while on AGENTbench performance is degraded.

The behavioral patterns also vary by repository, but the overall trend remains consistent. Some repositories show larger performance drops than others, but the direction doesn't change.

Resolution rate grouped by repository for four different models: without context files, with LLM-generated context files, and with developer-written context files.

When context files might actually help

There's one scenario where the paper hints at context files becoming useful. When researchers removed documentation files from the repository before running agents, LLM-generated context files performed better than developer-written ones. This suggests that context files help most when they're replacing information that's hard to find in the codebase itself.

When removing all documentation-related files from the codebase, LLM-generated context files tend to outperform developer-provided ones on AGENTbench.

The implication is significant. Redundant information becomes noise. If a repository already has a thorough README, architecture documentation, and code comments that explain conventions, adding that same information to a context file doesn't help. It just gives the agent more to process before focusing on the actual problem. But if the repository has gaps in its documentation, if critical information is scattered or missing, then a context file that provides that information could genuinely fill a void.

This transforms the lesson from "don't use context files" to something more nuanced: use them minimally and strategically, as supplements to incomplete documentation rather than summaries of complete documentation.

The real lesson

The paper's conclusion is direct: "unnecessary requirements from context files make tasks harder, and human-written context files should describe only minimal requirements."

This reframes the entire practice. The problem isn't context files themselves. The problem is everything developers put into them. Long, detailed context files that list all the directories in a project, explain the testing framework, describe the team's coding philosophy, and encourage thorough exploration make agents less likely to succeed. Minimal context files that point to the actual requirements might make agents more likely to succeed.

Consider what a minimal context file might contain: "Fix the bug in line 42 of module.py. The issue occurs when processing null values. Ensure the fix handles edge cases." Compare that to what developers often write: "This repository uses Jest for testing with 95% coverage requirements. We have three main directories. Follow this architectural pattern. Review the testing infrastructure before making changes. Ensure all tests pass."

One points directly toward the problem. The other creates a checklist of things to do before solving the problem. Agents work through the checklist first, and by the time they reach the actual bug, they've already spent their reasoning budget on exploration that didn't move toward a solution.
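As a concrete illustration of the minimal style, a trimmed-down context file might look like the fragment below. The file contents and paths are invented for this example, not taken from the paper:

```markdown
# AGENTS.md

Run `pytest tests/` before committing; no other process steps are required.
Known gotcha: `config/loader.py` silently drops null values; handle them explicitly.
```

Everything in it is either a hard requirement or a fact the agent cannot easily discover from the code. Anything an agent could learn by reading the repository stays out.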

This finding matters beyond the specific practice of writing AGENTS.md files. It reveals something fundamental about how instruction-following systems behave. When we assume that more guidance is always better, we're applying human intuition to a different kind of agent. Humans asked to "explore thoroughly" develop metacognitive awareness about when to stop exploring. They develop a sense of purpose that keeps them directed even when instructions suggest broader exploration.

AI agents don't have that metacognitive layer. They optimize for what they're literally asked to do. When you ask them to be thorough, they become thorough. When you ask them to follow guidelines, they follow them. They're not lazy or stubborn. They're extremely literal.

The broader pattern here extends beyond coding agents. As more AI systems become part of development workflows, the principle holds: constraint can be as valuable as guidance. The temptation is always to add one more helpful instruction, one more piece of context, one more guideline. This paper suggests that impulse should be resisted. The goal isn't to create perfect instructions. The goal is to create minimal instructions that point directly at what matters.

For developers using coding agents today, the research offers practical guidance. If you're creating a context file, make it small. If you're maintaining one, remove the parts that document information already visible in the codebase. If you're considering adding guidelines about process or exploration, reconsider whether that guidance actually helps agents solve problems or just makes them busier before they start solving them.

This is a Plain English Papers summary of a research paper called Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?. If you like these kinds of analyses, join AIModels.fyi or follow us on Twitter.



Published by HackerNoon on 2026/02/24