TL;DR: This post introduces a novel, efficient code editing mechanism that builds on existing hash-anchored edits. It uses a stateful backend, single-token anchors, and the Myers diff algorithm to re-anchor only changed lines. It currently ships in Dirac and has proven well suited to AI agent tooling.
Premises
Some basic premises to put things into context.
- Reading and editing code are the two most frequent operations in agentic coding, with reads outnumbering edits by a significant margin.
- Output tokens cost 5-6x more than input tokens, and 50-60x more than cached read tokens (source: current Anthropic, Gemini, and OpenAI pricing).
- The penalty for a failed tool call grows with context size.
How current tooling handles code edits (Claude Code, GeminiCLI, Codex, etc.)
Pretty much every popular coding agent/harness/scaffolding uses some variant of search and replace for code edits. Claude and Gemini use plain search and replace, while OpenAI uses the `apply_patch` tool, which relies on the V4A diff format.
Consider a file like:

```python
# type: ignore
# ...previous_100_lines
def complex_payment_processor(transaction_data):
    logger.info("Starting processing")
    # ...[46 lines of validations, API calls, and db queries] ...
    logger.info("Payment successful")
    return {"status": "success"}
```
Suppose the task requires modifying most of the lines of this function. Here's roughly how each of OpenAI, Gemini, and Claude will emit the edit-file tool call (approximated based on specs, for illustration only).
OpenAI
```json
{
  "tool_calls": [
    {
      "type": "function",
      "function": {
        "name": "apply_patch",
        "arguments": "{\n \"operations\":[\n {\n \"type\": \"update_file\",\n \"path\": \"payments.py\",\n \"diff\": \"*** Update File: payments.py\\n-def complex_payment_processor(transaction_data):\\n- # ... [48 lines of ORIGINAL code generated here prefixed with -] ...\\n- return {\\\"status\\\": \\\"success\\\"}\\n+def complex_payment_processor(transaction_data, strict_mode=True):\\n+ # ... [48 lines of UPDATED code generated here prefixed with +] ...\\n+ return {\\\"status\\\": \\\"success\\\", \\\"strict\\\": True}\\n\"\n }\n ]\n}"
      }
    }
  ],
  "usage": {
    "approx_output_tokens": 1150
  }
}
```
Claude (Gemini uses a similar format)
```json
{
  "type": "tool_use",
  "name": "replace_in_file",
  "input": {
    "file_path": "payments.py",
    "old_text": "def complex_payment_processor(transaction_data):\n logger.info(\"Starting processing\")\n # ...[46 lines of ORIGINAL code generated here] ...\n logger.info(\"Payment successful\")\n return {\"status\": \"success\"}",
    "new_text": "def complex_payment_processor(transaction_data, strict_mode=True):\n logger.info(\"Starting processing in strict mode\")\n # ... [46 lines of UPDATED code generated here] ...\n logger.info(\"Strict payment successful\")\n return {\"status\": \"success\", \"strict\": True}"
  },
  "usage": {
    "approx_output_tokens": 1080
  }
}
```
The rule of thumb: each of these formats generates output tokens proportional to the existing code to search for (S) plus the new code to replace it with (R), i.e. asymptotic complexity O(S+R). S also carries a uniqueness constraint: the model has to make sure the search block it emits matches exactly one place in the target file (or the tool call fails), which typically forces it to include a few lines of context before and after S.
Two obvious shortcomings of this format:
- The model has to reproduce the entire old code, plus the surrounding context lines, token-for-token without a mistake, even when all it really wants to say is "replace lines X to Y with this."
- The larger the search block grows, the higher the probability of a mistake, and a single mistake anywhere in the search block fails the entire tool call, forcing the model to generate the whole thing again.
Enter Hash Anchored Edits
This amazing post, "The Harness Problem" by Can Bölük, explains the mechanism really well; in fact, it's where I first learned about the idea.
The key insight: what if we enable the model to pinpoint a specific line or range of lines instead of asking it to write the entire search block?
The obvious requirement for such a mechanism is some kind of line identifier, added as control-plane information to the file content each time the LLM asks to read a file. So instead of the example file shown above, the model reads something like:
```
# type: ignore
# ...previous_100_lines
101:x9|def complex_payment_processor(transaction_data):
102:b4|    logger.info("Starting processing")
...[46 lines of validations, API calls, and db queries with their 4-token prefixes] ...
149:m2|    logger.info("Payment successful")
150:k3|    return {"status": "success"}
```
The format the linked article proposed was `{line_number}:{2_character_hash}|{code}`. Let's see how this handles the edit above.
```json
{
  "tool_calls": [
    {
      "type": "function",
      "function": {
        "name": "edit_file",
        "arguments": "{\n \"file_path\": \"payments.py\",\n \"start_anchor\": \"101:x9|\",\n \"end_anchor\": \"150:k3|\",\n \"replacement\": \"def complex_payment_processor(transaction_data, strict_mode=True):\\n logger.info(\\\"Starting processing in strict mode\\\")\\n # ...[46 lines of UPDATED raw code generated here] ...\\n logger.info(\\\"Strict payment successful\\\")\\n return {\\\"status\\\": \\\"success\\\", \\\"strict\\\": True}\"\n}"
      }
    }
  ],
  "usage": {
    "approx_output_tokens": 540
  }
}
```
The key differentiator: the output token count reduces to just the new code to replace with (R), asymptotic complexity O(R). And since output tokens cost far more than input tokens, this lowers the overall cost of edits. The difference is especially large when the model simply wants to delete hundreds of lines of code, since the replacement is then empty.
Issues with this approach
The protocol works like this: for an edit to validate successfully, the code currently at the given line number has to match the hash on that line. If it doesn't, the tool returns an error telling the LLM that the code has changed since its last read.
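A minimal sketch of that validation step, assuming the 2-character hash is derived from the line's content (the exact hash function in the linked post may differ):

```python
import hashlib

def line_hash(line: str) -> str:
    # Hypothetical 2-character content hash; the real scheme may differ.
    return hashlib.sha1(line.encode("utf-8")).hexdigest()[:2]

def validate_anchor(lines: list[str], anchor: str) -> bool:
    # An anchor like "101:x9" is a 1-based line number plus that line's hash.
    num, expected = anchor.split(":")
    idx = int(num) - 1
    return 0 <= idx < len(lines) and line_hash(lines[idx]) == expected
```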
There are two issues in particular here:
- For every line read, regardless of whether it is ever edited, this approach carries a 4-token overhead (5 tokens once the line number goes beyond 999): one or two for the line number, one for `:`, one for the 2-character hash, and one for `|`. This tradeoff only pays off if the edits are proportionately large.
- More importantly, because the protocol is tied to line numbers, a single edit at the top of the file invalidates the hashes of all subsequent lines. So if the LLM makes a 5-line change at the top of a 2,000-line file, the whole file has to be re-read before further edits are possible.
A new Hash Anchor-based solution
Let's rethink our requirements from the ground up and try to improve on hash anchoring. What do we need for a functional anchor-based editing tool?
- Something that pinpoints any line in any file (Anchor)
- Something that separates the Anchor from the corresponding code (Delimiter)
- Something that validates the LLM-proposed edits (Validator)
- Something that keeps track of which line of which file is associated with which anchor (State Manager)
- Something that can reconcile a file after edits are made, allocating new anchors to only changed lines (Reconciler)
Anchor
An anchor is simply a label; nothing in our requirements forces line numbers to be part of it. If we track the state separately, nothing stops us from using plain English words as anchors: 'Cherry' can serve just as well as '101:ab'.
So, instead of generating hashes or words myself, I simply asked tiktoken to do it for me (`tiktoken.get_encoding("o200k_base")`); then, through iterative refinement, I ended up with about 1,700 suitable single-word, single-token anchors. These anchors ship as an asset with the agent code.
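For a flavor of how such a list could be mined, here is a rough sketch against tiktoken's vocabulary; the filters (alphabetic, capitalized, at least 4 characters) are my guesses at "suitable", not Dirac's actual criteria:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

anchors = []
for token_id in range(enc.n_vocab):
    try:
        word = enc.decode([token_id]).strip()
    except Exception:
        continue  # skip ids that don't decode cleanly
    # Keep clean, word-like tokens that round-trip to a single token.
    if (word.isalpha() and word[0].isupper() and len(word) >= 4
            and len(enc.encode(word)) == 1):
        anchors.append(word)

print(len(anchors), anchors[:5])
```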
Delimiter
Any simple symbol that rarely appears in code would suffice; I chose § for Dirac.
For the example above, Dirac reads something like:

```
# type: ignore
# ...previous_100_lines
Moderator§def complex_payment_processor(transaction_data):
Qualifier§    logger.info("Starting processing")
...[46 lines of validations, API calls, and db queries with their 2-token prefixes] ...
Ripple§    logger.info("Payment successful")
Corona§    return {"status": "success"}
```
Validator
In lieu of the hashes, Dirac's edit tool asks the model to echo the full anchored line as the start/end anchor, so the backend can validate with a simple string match.
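A minimal sketch of that check, assuming the State Manager (next section) keeps an anchor-to-line map per file; the names here are illustrative, not Dirac's actual API:

```python
def validate(anchor_map: dict[str, str], anchored_line: str) -> bool:
    # anchored_line looks like 'Corona§    return {"status": "success"}'
    anchor, sep, code = anchored_line.partition("§")
    return sep == "§" and anchor_map.get(anchor) == code
```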
State Manager
Note that we moved away from the previous architecture in one significant way: the proposed architecture is no longer stateless. The old scheme could validate `101:x9|def complex_payment_processor(transaction_data):` purely from its content; any file anywhere whose line 101 contains that exact code will always yield the exact same hash.
In the proposed architecture, we need a State Manager to keep track of which line of which file is currently assigned which anchor. This was a deliberate decision: statelessness is not a prized attribute in AI agents, which already track a huge number of state variables.
The State Manager essentially works like a bucket manager. It keeps task-scoped maps of every file that is read (integrated into all read-related tools). It starts with the aforementioned ~1,700-anchor list and assigns anchors to lines, keeping a separate list of 'used' anchors so the same anchor is never reused within the same file. When it runs out, it moves on to 2-token anchors. Everything goes through the Reconciler to keep a 'current' view of every file.
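An illustrative sketch of that bookkeeping, following the description above; the class and method names are mine, not Dirac's:

```python
class StateManager:
    """Task-scoped anchor bookkeeping (a sketch, not Dirac's code)."""

    def __init__(self, anchor_pool: list[str]):
        self.free = list(anchor_pool)                # the ~1,700 single-token words
        self.used: dict[str, set[str]] = {}          # path -> anchors ever used there
        self.files: dict[str, dict[str, str]] = {}   # path -> {anchor: current line}

    def assign(self, path: str) -> str:
        used_here = self.used.setdefault(path, set())
        for i, anchor in enumerate(self.free):
            if anchor not in used_here:              # never reuse an anchor within a file
                used_here.add(anchor)
                return self.free.pop(i)
        raise RuntimeError("pool exhausted; fall back to 2-token anchors here")

    def register_read(self, path: str, lines: list[str]) -> list[str]:
        # Called from read tools: assign a fresh anchor to every line.
        mapping = {self.assign(path): line for line in lines}
        self.files[path] = mapping
        return [f"{anchor}§{line}" for anchor, line in mapping.items()]

    def release(self, path: str, anchor: str) -> None:
        # An anchor freed by the Reconciler may serve other files, never this one again.
        if self.files.get(path, {}).pop(anchor, None) is not None:
            self.free.append(anchor)
```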
Reconciler
What happens when an edit succeeds? Keeping the anchor state coherent in all cases is the Reconciler's job, and there is a well-established algorithm for exactly this: the Myers diff algorithm. It's Git's default diff algorithm, so you've likely seen it in action on GitHub.
The Reconciler runs the Myers diff over any changed file to work out which lines now need new anchors and assigns them accordingly. If the user manually edits the file in between, the Reconciler fires from a file-update hook.
After edits are validated and applied successfully by the tool, we send back the updated anchors in the response to the LLM.
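A sketch of the reconciliation step. A production version would use a true Myers-diff implementation; Python's `difflib.SequenceMatcher` (a different algorithm) stands in here because it ships with the standard library:

```python
import difflib

def reconcile(old_lines: list[str], new_lines: list[str],
              old_anchors: list[str], assign) -> list[str]:
    """Return one anchor per line of new_lines, reusing anchors of unchanged lines."""
    anchors: list[str] = []
    sm = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            anchors.extend(old_anchors[i1:i2])                # unchanged lines keep anchors
        else:
            anchors.extend(assign() for _ in range(j2 - j1))  # changed lines get fresh ones
    return anchors
```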
Does it work?
This is already live in Dirac's VS Code extension and CLI (`npm install -g dirac-cli`) and has been working flawlessly.
See it in action with Deepseek-v4-flash here.
Results
In the test tasks, most of which require editing many files, this approach makes a significant difference, as proxied by the cost numbers below.
| Task (Repo) | Files* | Cline | Kilo | Ohmypi | Opencode | Pimono | Roo | Dirac |
|---|---|---|---|---|---|---|---|---|
| DynamicCache (transformers) | 8 | 🟢 (diff) [$0.37] | 🔴 (diff) [N/A] | 🟡 (diff) [$0.24] | 🟢 (diff) [$0.20] | 🟢 (diff) [$0.34] | 🟢 (diff) [$0.49] | **🟢 (diff) [$0.13]** |
| IOverlayWidget (vscode) | 21 | 🟢 (diff) [$0.67] | 🟡 (diff) [$0.78] | 🟢 (diff) [$0.63] | 🟢 (diff) [$0.40] | 🟢 (diff) [$0.48] | 🟡 (diff) [$0.58] | **🟢 (diff) [$0.23]** |
| addLogging (vscode) | 12 | 🟡 (diff) [$0.42] | 🟢 (diff) [$0.70] | 🟢 (diff) [$0.64] | 🟢 (diff) [$0.32] | 🟢 (diff) [$0.25] | 🟡 (diff) [$0.45] | **🟢 (diff) [$0.16]** |
| datadict (django) | 14 | 🟢 (diff) [$0.36] | 🟢 (diff) [$0.42] | 🟡 (diff) [$0.32] | 🟢 (diff) [$0.24] | 🟡 (diff) [$0.24] | 🟢 (diff) [$0.17] | **🟢 (diff) [$0.08]** |
| extensionswb_service (vscode) | 3 | 🔴 (diff) [N/A] | 🟢 (diff) [$0.71] | 🟢 (diff) [$0.43] | 🟢 (diff) [$0.53] | 🟢 (diff) [$0.50] | 🟢 (diff) [$0.36] | **🟢 (diff) [$0.17]** |
| latency (transformers) | 25 | 🟢 (diff) [$0.87] | 🟡 (diff) [$1.51] | 🟢 (diff) [$0.94] | 🟢 (diff) [$0.90] | 🟢 (diff) [$0.52] | 🟢 (diff) [$1.44] | **🟢 (diff) [$0.34]** |
| sendRequest (vscode) | 13 | 🟡 (diff) [$0.51] | 🟢 (diff) [$0.77] | 🟢 (diff) [$0.74] | 🟢 (diff) [$0.67] | 🟡 (diff) [$0.45] | 🟢 (diff) [$1.05] | **🟢 (diff) [$0.25]** |
| stoppingcriteria (transformers) | 3 | 🟢 (diff) [$0.25] | 🟢 (diff) [$0.19] | 🟢 (diff) [$0.17] | 🟢 (diff) [$0.26] | 🟢 (diff) [$0.23] | 🟢 (diff) [$0.29] | **🟢 (diff) [$0.12]** |
| Total Correct | - | 5/8 | 5/8 | 6/8 | 8/8 | 6/8 | 6/8 | **8/8** |
| Avg Cost | - | $0.49 | $0.73 | $0.51 | $0.44 | $0.38 | $0.60 | $0.18 |
Full results, including all diffs, are available on the GitHub repository.
Also, Dirac scored 65.2% on the Terminal Bench 2.0 leaderboard using gemini-flash-3-preview (although a lot of that is attributable to other tools as well).
A Word on Token Efficiency
This and all the other hard work that went into building Dirac are not about saving a few pennies here and there. I believe improving the token efficiency of AI agents is a rare outcome that's good for everyone involved. Better-curated context means better model capability (as LLM reasoning ability degrades with context length), lower API costs (a win for the user), better capacity utilization (a win for capacity-constrained labs), and the environmental win of lower electricity consumption overall. The individual task savings per session are pennies. Compounded across the industry, it's a meaningful number worth optimizing for.