The Delta Brief: Why the Smartest AI-Assisted Teams Send Less, Not More
How sending only what changed, instead of entire files or codebases, cuts token costs by up to 90% without sacrificing output quality.
Introduction: The Case for the Delta Brief
In the first generation of AI-assisted coding, the dominant instinct was to throw everything at the model. Developers copied entire files into ChatGPT, pasted hundreds of lines of error logs, and asked Claude to "look at my whole codebase" by concatenating directories into a single prompt. The reasoning was intuitive: more context means better answers. That reasoning was wrong, and expensive.[1]
The Delta Brief workflow inverts that instinct. Instead of sending the full state of a codebase with every interaction, teams practising the approach send only what changed: the specific function being modified, the diff of the last edit, the precise error output from a failing test, or a structured specification of the delta between current and desired behaviour. The workflow emerged organically across the AI-assisted development community between 2023 and 2025, driven by tooling like Aider, Claude Code, and Cursor, each of which implements some form of minimal-context editing. It is not a single protocol but a family of practices united by one principle: the AI model needs less than you think.[2][3]
“Adding a bunch of files that are mostly irrelevant to the task at hand will often distract or confuse the LLM. The LLM will give worse coding results, and sometimes even fail to correctly edit files.”
— Paul Gauthier, creator of Aider (2024) [2]
How the Delta Brief Works in Practice
The workflow’s efficiency is easiest to see in the edit-format benchmarks published by Paul Gauthier’s Aider project. When an AI coding tool uses a “whole file” format, returning an entirely new copy of each modified file, every edit costs the full token price of the file, regardless of how small the change is. Aider’s “diff” format, by contrast, asks the model to return only the lines that changed, wrapped in a search-and-replace block. Gauthier found that “GPT-4 gets comparable results with the whole and diff edit formats, but using whole significantly increases costs and latency compared to diff.” On a file of even a few hundred lines, the token difference compounds rapidly across a multi-turn coding session.[2]
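The size gap between the two reply styles can be illustrated with a rough sketch. It uses a unified diff from Python's standard difflib module as a stand-in for a diff-style reply (this is not Aider's actual search-and-replace format), and the four-characters-per-token estimate and the 200-line sample file are assumptions made purely for illustration.

```python
import difflib


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def changed_hunks(original: str, updated: str, context: int = 1) -> str:
    """Return only the changed regions as a unified diff with a little context."""
    diff = difflib.unified_diff(
        original.splitlines(),
        updated.splitlines(),
        fromfile="original",
        tofile="updated",
        n=context,
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical 200-line file in which exactly one line changes.
original = "\n".join(f"value_{i} = compute({i})" for i in range(200))
updated = original.replace(
    "value_57 = compute(57)", "value_57 = compute(57, cached=True)"
)

whole_cost = estimate_tokens(updated)                          # model returns the full file
diff_cost = estimate_tokens(changed_hunks(original, updated))  # model returns only the hunk

print(f"whole-file reply: ~{whole_cost} tokens")
print(f"diff-only reply:  ~{diff_cost} tokens")
```

The absolute numbers depend on the tokenizer and the file. The point of the sketch is only that the whole-file cost grows with file length while the diff cost grows with the size of the change.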
The real-world impact is clearest in production data. Dod Fraser, CEO of Mainstay, a company running AI-assisted workflows across approximately 30,000 property tax and HOA portals, reported that with optimised model routing and minimal-context prompting, “GPT-5.4 achieved a 95% success rate on the first attempt and 100% within three attempts, compared to ~73–79% with prior CUA models. It also completed sessions ~3x faster while using ~70% fewer tokens, materially improving reliability and cost efficiency at scale.”[4] The 70% token reduction was not a trade-off against quality. It was a reflection of a system that reached correct answers with less back-and-forth.
The academic literature on inference cost points the same way. In 2023, Stanford researchers Lingjiao Chen, Matei Zaharia, and James Zou published FrugalGPT, a cascade routing framework demonstrating that querying cheaper models first and escalating only on low-confidence responses could match GPT-4’s performance with up to 98% cost reduction. In 2024, UC Berkeley and Anyscale published RouteLLM, an open-source routing framework trained on 80,000 Chatbot Arena conversations, which reduced costs by over 85% on MT Bench while maintaining 95% of GPT-4’s performance.[5][6] Both results concern model selection rather than context size, but they reinforce the same architectural principle: most real-world AI interactions do not require the maximal option, whether that is the largest model or the largest context, and systems that treat every prompt identically pay an unnecessary and compounding premium.
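The cascade idea described above can be sketched in a few lines. This is a minimal illustration rather than the FrugalGPT implementation: the two stub models, their relative costs, and the word-count confidence heuristic are all hypothetical stand-ins for real API calls and for the learned scoring function the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    cost_per_call: float                        # hypothetical relative cost
    answer: Callable[[str], Tuple[str, float]]  # returns (reply, confidence in [0, 1])


def cascade(query: str, tiers: List[Tier], threshold: float = 0.8) -> Tuple[str, str, float]:
    """Try tiers from cheapest to most expensive; stop at the first confident reply."""
    spent = 0.0
    reply, used = "", ""
    for tier in tiers:
        reply, confidence = tier.answer(query)
        spent += tier.cost_per_call
        used = tier.name
        if confidence >= threshold:
            break
    return reply, used, spent


# Stub models standing in for real API calls (hypothetical behaviour).
def small_model(query: str) -> Tuple[str, float]:
    confident = len(query.split()) <= 6  # pretend that short queries are easy
    return f"small:{query}", 0.9 if confident else 0.4


def large_model(query: str) -> Tuple[str, float]:
    return f"large:{query}", 0.99


tiers = [Tier("small", 1.0, small_model), Tier("large", 30.0, large_model)]

print(cascade("rename this variable", tiers))
print(cascade("refactor the billing module to support multi-currency invoices safely", tiers))
```

In this sketch an escalated query pays for both calls, so a cascade only saves money when most queries stop at the cheap tier.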
Where the Delta Brief Breaks Down
Minimal-context prompting is not universally better. It is conditionally better, and the conditions are stricter than many teams initially assume. The most common failure mode is under-contextualisation: the AI receives a diff or a narrow slice of code, but the change it needs to make depends on conventions, abstractions, or type definitions that live in files the model has never seen. The result is syntactically correct code that violates the architecture of the surrounding system.[7]
Anthropic’s engineering team addressed this tension directly in their 2024 guide to building effective agents. In a section on tool format design, Erik S. and Barry Zhang noted: “Writing a diff requires knowing how many lines are changing in the chunk header before the new code is written. Writing code inside JSON (compared to markdown) requires extra escaping of newlines and quotes.” Their advice was to “give the model enough tokens to think before it writes itself into a corner” and to “keep the format close to what the model has seen naturally occurring in text on the internet.”[7] The implication is that a Delta Brief that is too brief, one that economises tokens past the point where the model can reason, produces worse output, not cheaper output.
“Writing a diff requires knowing how many lines are changing in the chunk header before the new code is written. Some formats are much more difficult for an LLM to write than others. Give the model enough tokens to think before it writes itself into a corner.”
— Erik S. & Barry Zhang, Anthropic Engineering, “Building Effective Agents” (2024) [7]
The second failure mode is more subtle: models that are perfectly capable of producing quality code in a whole-file format sometimes struggle with the structural precision that diff formats demand. Gauthier observed this in his own benchmarks, noting that some models “use the diff format in a pathological manner”, placing the entire original source file in the “ORIGINAL” block and the entire updated file in the “UPDATED” block, effectively negating any token savings while adding format overhead.[2] The tool format that saves tokens on paper can cost more tokens in practice if the model is not well-suited to it.
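One way to notice the pathological pattern described above is to measure how much of the file a returned search block covers. The sketch below is a hypothetical check, not part of Aider: the function names and the 80% limit are assumptions chosen for illustration.

```python
def search_coverage(original_file: str, search_block: str) -> float:
    """Fraction of the file's non-blank lines that the search block reproduces."""
    file_lines = [line for line in original_file.splitlines() if line.strip()]
    block_lines = [line for line in search_block.splitlines() if line.strip()]
    if not file_lines:
        return 0.0
    return min(1.0, len(block_lines) / len(file_lines))


def is_pathological(original_file: str, search_block: str, limit: float = 0.8) -> bool:
    """Flag edits whose 'original' side is effectively the whole file."""
    return search_coverage(original_file, search_block) >= limit


# Hypothetical 10-line file and two candidate edits.
source_file = "\n".join(f"line_{i} = {i}" for i in range(10))
targeted_edit = "line_3 = 3"
whole_file_edit = source_file

print(is_pathological(source_file, targeted_edit))
print(is_pathological(source_file, whole_file_edit))
```

A tool could use a check like this to fall back to a whole-file format for that model, since the diff wrapper is then adding overhead without saving anything.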
The Optimised Version: The Structured Delta Brief
The teams that get the most from the Delta Brief workflow do not simply send less. They send less, more deliberately. They treat prompt construction as an engineering discipline rather than an improvisation, applying three structural practices that compound on each other.[3][7]
- Context Triage: Before every AI interaction, the developer explicitly separates what the model must see from what it does not need to see. What it must see is the relevant function or component, the test that defines success, and any type signatures or interfaces the change depends on. Everything else is excluded. This is the same instinct that Gauthier describes in Aider’s usage guidance: “Just add the files that need to be changed to the chat. Too much irrelevant code will distract and confuse the LLM.”[2]
- Format-Aware Prompting: The edit format (diff, whole-file, or structured JSON) is chosen based on the model’s known strengths, not convenience. Anthropic’s guidance is to “put yourself in the model’s shoes. Is it obvious how to use this tool, based on the description and parameters, or would you need to think carefully about it?”[7] Stronger reasoning models (Claude Sonnet, GPT-5.4 and above) handle diff and structured edit formats reliably. Weaker or older models often perform better with whole-file formats despite the token overhead, because the effort of formatting the diff degrades their output quality.
- Iterative Refinement Over One-Shot Requests: The Delta Brief is not about getting everything right in a single prompt. It is about short, focused iterations where each turn sends only the delta from the previous turn. Simon Willison described this pattern in 2024: “I’ll often iterate on these a lot. I’ll say, ‘I don’t like the variable names you used there. Change those.’ Or ‘Refactor that to remove the duplication.’ I call it my weird intern, because it really does feel like you’ve got this intern who is screamingly fast…and they make mistakes and they don’t realise them. But crucially, they never get tired.”[3] Each iteration is cheap in tokens because each carries only the specific change request, not the full state.
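Context triage, the first of the practices above, can be sketched as a small prompt builder. This is a minimal illustration under stated assumptions: the sample module, the function names, and the layout of the brief are hypothetical, and taking the first line of a function as its signature only works for single-line signatures. The only library calls used are ast.parse and ast.get_source_segment from the Python standard library.

```python
import ast
from typing import List

# Hypothetical module: one function to change, one dependency, one irrelevant function.
MODULE = '''\
def tax_rate(region: str) -> float:
    return {"UK": 0.2, "DE": 0.19}.get(region, 0.0)


def total(price: float, region: str) -> float:
    return price + tax_rate(region)


def unrelated_report() -> str:
    return "lots of code the model does not need"
'''


def extract_function(source: str, name: str) -> str:
    """Return the source text of one top-level function, or raise if it is absent."""
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                return segment
    raise KeyError(f"function {name!r} not found")


def build_delta_brief(source: str, target: str, dependencies: List[str],
                      failing_test: str, request: str) -> str:
    """Assemble a prompt from the target function, dependency signatures, and the test."""
    # Only the signature line of each dependency is included, not its body.
    signatures = [extract_function(source, dep).splitlines()[0] for dep in dependencies]
    parts = [
        "Change request:\n" + request,
        "Function to modify:\n" + extract_function(source, target),
        "Signatures it depends on:\n" + "\n".join(signatures),
        "Failing test output:\n" + failing_test,
    ]
    return "\n\n".join(parts)


brief = build_delta_brief(
    MODULE,
    target="total",
    dependencies=["tax_rate"],
    failing_test="AssertionError: total(100, 'UK') == 100.2, expected 120.0",
    request="Apply tax as a percentage of price, not as a flat amount.",
)
print(brief)
```

The brief carries the function under change, the signature it relies on, and the test that defines success, while the unrelated function never reaches the model.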
When these three practices are combined, the cumulative token savings are structural, not incidental. Anthropic’s prompt caching feature adds a fourth lever: for any repeated prompt structure, such as the same system prompt and coding conventions applied across dozens of sessions, cached input tokens can cost up to 90% less than uncached ones. Simon Last, co-founder of Notion, described the practical impact: “We’re excited to use prompt caching to make Notion AI faster and cheaper, all while maintaining state-of-the-art quality.”[8] Context triage, format-aware prompting, iterative refinement, and prompt caching compound on each other. Teams that address only one leave most of the available savings on the table.
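The effect of caching a shared prefix can be worked through with a short calculation. The figures here are assumptions for illustration: a 4,000-token shared prefix, a 300-token delta per call, 50 calls, a 90% discount on cached reads, a 25% premium on the first call that writes the cache, and a cache that stays warm for the whole run. Costs are expressed in uncached-input-token units rather than currency.

```python
def input_cost_units(prefix_tokens: int, delta_tokens: int, calls: int,
                     use_cache: bool, cached_discount: float = 0.90,
                     cache_write_premium: float = 0.25) -> float:
    """Input cost in 'uncached token' units for a run of calls sharing one prefix."""
    if not use_cache:
        return float(calls * (prefix_tokens + delta_tokens))
    # First call writes the prefix to the cache (assumed to cost a premium).
    first = prefix_tokens * (1 + cache_write_premium) + delta_tokens
    # Later calls read the prefix at the discounted rate and pay full price for the delta.
    later = (calls - 1) * (prefix_tokens * (1 - cached_discount) + delta_tokens)
    return first + later


uncached = input_cost_units(4_000, 300, 50, use_cache=False)
cached = input_cost_units(4_000, 300, 50, use_cache=True)

print(f"uncached: {uncached:,.0f} units")
print(f"cached:   {cached:,.0f} units")
print(f"saving:   {1 - cached / uncached:.0%}")
```

Under these assumptions the overall saving is lower than the headline discount, because the per-call delta is always billed at the full rate. The smaller the delta relative to the prefix, the closer the saving gets to the cached-read discount.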
Key Lessons for Your Business
Whether your team is building software with AI assistance or procuring it from vendors who do, the Delta Brief workflow offers three transferable principles for any organisation that pays for inference tokens.
More Context Is Not Always Better Context
The instinct to send entire files and full conversation histories is a carryover from human communication norms, not an evidence-based prompting strategy. FrugalGPT and RouteLLM show that most AI interactions can be served at a fraction of the maximal cost without losing quality, and Aider’s usage guidance warns that excess context actively degrades both cost efficiency and, in many cases, output accuracy.[2][5][6]
Edit Format Is a Strategic Decision, Not an Implementation Detail
The choice between whole-file, diff, and structured edit formats has measurable effects on token consumption, model accuracy, and iteration speed. Gauthier’s benchmarks show that the wrong format for a given model can produce “pathological” behaviour that negates all intended savings. Anthropic’s guidance reinforces this: tool design deserves as much engineering attention as prompt design.[2][7]
Token Efficiency Compounds. Token Waste Compounds Faster.
A single whole-file edit on a 500-line file might cost 700–1,000 tokens. Across a multi-turn session with 20 iterations, that becomes 14,000–20,000 tokens. Across a team of five developers working daily with AI assistance, the gap between a disciplined Delta Brief workflow and an unstructured “send everything” approach can run to millions of tokens per month. Mainstay’s 70% token reduction was achieved not through a single architectural change but through deliberate attention to what the model receives at every interaction.[4]
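The arithmetic above can be checked, and extended to a monthly team figure, with a short calculation. The 700–1,000 token range, the 20 iterations, and the team of five come from the text. The 60-token delta reply, four sessions per developer per day, and 21 working days per month are assumptions added for illustration.

```python
def session_tokens(tokens_per_edit: int, iterations: int) -> int:
    """Output tokens for one session in which every iteration returns one edit."""
    return tokens_per_edit * iterations


ITERATIONS = 20
whole_low = session_tokens(700, ITERATIONS)     # whole-file reply, lower bound
whole_high = session_tokens(1_000, ITERATIONS)  # whole-file reply, upper bound

# Assumption: a diff-style reply for a small change costs about 60 tokens.
delta_session = session_tokens(60, ITERATIONS)

# Assumptions: 5 developers (from the text), 4 sessions each per day, 21 working days.
sessions_per_month = 5 * 4 * 21
monthly_gap_low = sessions_per_month * (whole_low - delta_session)
monthly_gap_high = sessions_per_month * (whole_high - delta_session)

print(f"whole-file session: {whole_low:,} to {whole_high:,} tokens")
print(f"delta session:      {delta_session:,} tokens")
print(f"monthly team gap:   {monthly_gap_low:,} to {monthly_gap_high:,} tokens")
```

The monthly figure is sensitive to the assumed number of sessions, so a team applying this should substitute its own usage numbers.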
Conclusion: Structured Prompts Over Dumped Context
The evidence from academia, tooling, and production deployments converges on a single finding: in AI-assisted software development, the teams that send the most context are not the ones producing the best output. They are the ones paying the highest inference bills. The FrugalGPT and RouteLLM research established that intelligent routing and minimal-context strategies could match frontier-model quality at a fraction of the cost. Aider’s edit-format benchmarks proved that the same principle applies to the structure of individual code edits. Mainstay’s production data confirmed that it works at scale.[2][4][5]
What separates the teams that succeed with the Delta Brief from those that struggle is not technical sophistication. It is discipline. The workflow requires that someone (a developer, a tech lead, or an AI workflow architect) makes a deliberate decision about what the model sees before every interaction. That decision takes thirty seconds and saves thousands of tokens. Repeated across a team and a quarter, it is the difference between an AI bill that scales linearly with usage and one that scales geometrically with carelessness.
“The boring yet crucial secret behind good system prompts is test-driven development. You don’t write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.”
— Amanda Askell, Anthropic (2024) [9]
For SMBs building software with AI assistance, the first step is simple and costs nothing: audit one day of your team’s AI-assisted coding sessions. Count how many tokens were spent resending context the model already had, repeating full files for single-line changes, or including irrelevant code that the model never needed. The number will almost certainly be larger than you expect. From there, introduce a single rule: before each prompt, ask what the model actually needs to see. That rule alone, applied consistently, captures the majority of the available savings before any tooling change is required.
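A first pass at the audit described above can be automated. The sketch below assumes the day's prompts are available as a list of strings in the order they were sent, which is a hypothetical log format. It measures repetition at the level of whole lines and in characters rather than tokens, so it is only a rough proxy.

```python
from typing import List, Set


def resent_fraction(prompts: List[str]) -> float:
    """Share of prompt characters on lines that an earlier prompt already contained."""
    seen: Set[str] = set()
    total = 0
    repeated = 0
    for prompt in prompts:
        lines = [line.strip() for line in prompt.splitlines() if line.strip()]
        for line in lines:
            total += len(line)
            if line in seen:
                repeated += len(line)
        # Update after scanning, so duplicates inside one prompt are not counted.
        seen.update(lines)
    return repeated / total if total else 0.0


# Hypothetical session: the same file is pasted twice with different requests.
pasted_file = "def f():\n    return 1"
session = [
    pasted_file + "\nFix bug A",
    pasted_file + "\nFix bug B",
]

print(f"resent share: {resent_fraction(session):.0%}")
```

A high share does not prove waste on its own, since some repetition is necessary context, but it shows where to look first.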
Sources & References (10 cited)
- Simon Willison. “Things we learned about LLMs in 2024.” simonwillison.net, December 31, 2024.
- Paul Gauthier. “GPT Code Editing Benchmarks.” Aider Documentation, 2023–2025. Multiple edit format benchmarks comparing whole-file, diff, and function-call approaches across GPT-3.5 and GPT-4 model families.
- Simon Willison. “Notes on Using LLMs for Code.” simonwillison.net, September 20, 2024. Transcript highlights from TWIML podcast appearance on AI-assisted development workflows.
- OpenAI. “Introducing GPT-5.4.” March 2026. Includes production benchmarks and customer testimony from Dod Fraser, CEO of Mainstay, on token efficiency gains.
- Lingjiao Chen, Matei Zaharia, and James Zou. “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.” Stanford University, 2023. Demonstrated that cascade routing to cheaper models could match GPT-4 quality with up to 98% cost reduction.
- Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M. Waleed Kadous, Ion Stoica. “RouteLLM: Learning to Route LLMs with Preference Data.” UC Berkeley & Anyscale, July 2024. Open-source routing framework achieving over 85% cost reduction on MT Bench while maintaining 95% of GPT-4’s performance. arXiv:2406.18665.
- Erik S. and Barry Zhang. “Building Effective Agents.” Anthropic Engineering Blog, December 19, 2024. Guidance on tool format design, agent architectures, and prompt engineering best practices from production deployments.
- Anthropic. “Prompt Caching.” Claude Blog, August 14, 2024. Announcement of prompt caching on the Anthropic API, with up to 90% cost reduction and 85% latency reduction for cached prompts. Includes testimonial from Simon Last, Co-founder at Notion.
- Amanda Askell, quoted in Simon Willison. “Things we learned about LLMs in 2024.” simonwillison.net, December 31, 2024. On test-driven development for system prompts.
- Paul Gauthier. “Aider FAQ: How can I add ALL the files to the chat?” Aider Documentation, 2024–2025. Guidance on minimal file selection and the risks of overloading LLM context windows.
The Delta Brief Isn’t Just for Code.
It’s a Token Efficiency Discipline.
We help SMBs audit their AI-assisted development workflows, implement structured prompt engineering disciplines, and build inference cost structures that scale predictably, whether your team writes code or manages the people who do. The difference between sending everything and sending what matters compounds every quarter.