Dirge: the coding agent that fits in your pocket and punches above its weight
Dirge is an agentic harness that I've been developing for my own use, and it's getting to the point where it's becoming generally useful. In this post, I'll discuss some of the rationale behind it and the interesting features it provides which differentiate it from other agentic harnesses.
The first thing to note is its performance. Most existing coding tools like OpenCode are rather memory-intensive, often using around 300 MB of RAM even when sitting idle. Dirge is written in Rust and compiles to a single binary of about 30 MB. At idle it needs only around 8 MB of RAM, and working on tasks pushes that up to roughly 15 MB. So you could run about twenty busy copies of Dirge in the memory that a single idle instance of OpenCode takes up.
However, lean size alone is not the main point, and there are other Rust-based harnesses to choose from. What makes Dirge actually interesting is how it helps less capable models perform at their best. The conventional wisdom is that intelligence resides in the model, and the harness is just minimal plumbing. Its job is to give the model a tool loop along with a system prompt, and then to stay out of the way. In this view, the only way of getting a better agent is to get a bigger, and typically more expensive, model.
Spending some time working with models such as DeepSeek and Qwen inside a harness has changed my mind on the subject. It turns out that much of what makes an agent effective in practice lies in how well the harness meets the expectations of the model. The model usually knows what it wants, and can figure out how to get there and what actions it needs to take. What makes one setup feel cutting-edge and another feel frustrating is everything that surrounds the model. A harness needs to guide it before it acts, to correct mechanical errors, and to tell the model exactly what went wrong. It should also remember what has been learned from each attempt and manage the context intelligently. Frontier labs build much of this into post-training and tune their own harnesses to fit the strengths and weaknesses of their model. While Dirge cannot change how a model was trained, it can close the performance gap by meeting the model where it is. Once you invest a bit of work in the harness capabilities, a cheaper open model starts to behave like one that costs much more.
The gap appears at three different time scales, and Dirge invests in all three. The first is the single turn. Each time the model makes a tool call, it can either succeed or fail. Maybe the call is malformed. Maybe it edits a file and introduces a syntax error. Or maybe it gets stuck retrying the same failing command over and over. In each case, a failed step consumes time and tokens without advancing the task, and these failures quickly accumulate to fill up the context window with noise, leading the model to lose the thread of what it's doing. The second is the session. The longer a session runs, the worse the model gets at following instructions, because as the window nears its limit, earlier instructions and corrections get truncated or forgotten. So the model keeps repeating mistakes or ignoring earlier context. The third is the gap between sessions, where things get even worse, since each new agent starts with amnesia and has to orient itself to how your project works. The model doesn't remember any past decisions, file structures, or problems you've already solved. Every session has to rebuild its understanding of the codebase from scratch.
Let's take a look at how Dirge attacks each of these problems. The pieces are presented in order, and each one builds on its neighbor. As is often the case, the whole is more than the sum of its parts.
How Dirge works
Dirge is essentially a state machine wrapped around the model. Each step runs the model, classifies the reply, and then validates and executes any tool calls. Before the model is allowed to stop, the loop verifies that the job has really been done. This loop by itself is just the core plumbing. What makes it a real force multiplier are three layers of apparatus wrapped around it, each of which corresponds to one of the time scales we just discussed.
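To make the shape of that loop concrete, here is a minimal sketch of it. This is not Dirge's actual source: the names Reply, Outcome, and run_loop are made up for illustration, and the model, the tool executor, and the verifier are all stubbed out as closures.

```rust
// Hypothetical sketch of a harness loop; names and types are illustrative only.

/// What the (stubbed) model replied on one turn.
#[derive(Debug, Clone, PartialEq)]
pub enum Reply {
    ToolCall(String), // the model wants to run a tool
    Done,             // the model claims the task is finished
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Finished { turns: usize },
    OutOfTurns,
}

/// Run the model, classify the reply, execute tool calls, and accept
/// "done" only once `verified` agrees that the job is really finished.
pub fn run_loop(
    mut model: impl FnMut(&[String]) -> Reply,
    mut execute: impl FnMut(&str) -> String,
    mut verified: impl FnMut() -> bool,
    max_turns: usize,
) -> Outcome {
    let mut context: Vec<String> = Vec::new();
    for turn in 1..=max_turns {
        match model(&context[..]) {
            Reply::ToolCall(call) => {
                let result = execute(&call);
                context.push(format!("{call} -> {result}"));
            }
            Reply::Done => {
                if verified() {
                    return Outcome::Finished { turns: turn };
                }
                // The model stopped too early: tell it so and keep going.
                context.push("not done yet: verification failed".to_string());
            }
        }
    }
    Outcome::OutOfTurns
}
```

The important detail is the Done arm: a claim of completion is treated as just another reply to be checked, not as an exit condition.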
A steering-and-repair layer ensures that each turn lands. A long-horizon layer ensures continuity within a session despite the limitation of the context window’s size. A learning layer transfers hard-won knowledge between sessions. That knowledge is stored in one SQLite database associated with the project. On top of that sits a plugin system which lets you reach into any part of the agentic loop you want. Let's take a look at these features in order.
Making each turn land
You might have heard that open models aren't good at tool calling and that you have to pay for a top-tier model trained on API contracts to get reliable results. All tool calling means is that the model has to output structured data, like JSON, in a specific format. Frontier models, like GPT-4 or Claude, are directly trained on thousands of API contracts to get them to produce outputs that match function signatures and parameter rules. Open models are typically trained for general text generation rather than structured output tasks, and are less reliable at producing such precise outputs. But that's precisely where the harness can close the gap, through how the model's output is parsed, repaired, and verified.
The first thing that can be adjusted is steering itself. Dirge includes a set of instructions modeled on what is known to work well in mature harnesses like OpenCode. These are baked into the system prompt and the loop, and they push the model to finish what it starts by checking itself against an explicit definition of done. The system also prompts the model to plan before working, and for longer tasks it maintains short progress notes that help the model keep track. At the start of each task, Dirge pulls up a few relevant tool-call examples (found by word matching) and prepends them to the prompt. In-context examples are a big deal for weaker models, since they give the model concrete patterns to emulate. On top of that, model-aware steering adjusts the guidance according to which model is running. The harness has special prompt fragments for DeepSeek in particular, which deal with oddities like over-exposition and special formatting requirements.
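Word matching for example retrieval can be very simple. The sketch below is an assumption about how such a lookup might work, not Dirge's implementation: it scores each stored example by how many words it shares with the task and keeps the top few. The function name pick_examples and the example strings are hypothetical.

```rust
use std::collections::HashSet;

/// Lowercased alphanumeric words of a text.
fn words(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Return up to `k` examples that share the most words with the task.
/// Examples with no overlap at all are dropped.
pub fn pick_examples<'a>(task: &str, examples: &[&'a str], k: usize) -> Vec<&'a str> {
    let task_words = words(task);
    let mut scored: Vec<(usize, &'a str)> = examples
        .iter()
        .map(|ex| (words(ex).intersection(&task_words).count(), *ex))
        .filter(|(score, _)| *score > 0)
        .collect();
    // Stable sort: highest overlap first, ties keep their original order.
    scored.sort_by(|a, b| b.0.cmp(&a.0));
    scored.into_iter().take(k).map(|(_, ex)| ex).collect()
}
```

For a task like "run cargo test and fix failures", an example that mentions both cargo and test outranks one that only mentions Cargo.toml, and an example with no shared words never reaches the prompt.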
Good steering reduces the failure rate, but does not prevent failures altogether, and a small number of malformed tool calls account for most of the “this model can’t do tools” complaints. Say a null appears where a field should have been omitted, or a JSON array comes in as a string. Dirge first tries the input just as it is, and only then attempts to correct the parts that the schema rejected, so that valid inputs never get rewritten. This design avoids the silent-corruption trap that besets naive preprocess-then-validate designs, where rewriting a correct input corrupts it. Dirge also automatically flattens nested tool schemas, which models handle poorly. Finally, it scavenges tool calls that the model mentioned in its reasoning text but failed to put into the structured field, extracting those calls out of the surrounding prose.
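The validate-first, repair-second ordering is easier to see in code. The following is a toy sketch under heavy assumptions: it uses a tiny hand-rolled Value type instead of a real JSON library, a schema that only knows two kinds of field, and it handles only the two failure cases mentioned above. None of these names come from Dirge.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Value { Null, Str(String), List(Vec<String>) }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind { Str, List }

pub type Args = BTreeMap<String, Value>;
pub type Schema = BTreeMap<String, (Kind, bool)>; // field -> (kind, required)

/// Names of the fields whose value does not match the schema.
pub fn rejected(args: &Args, schema: &Schema) -> Vec<String> {
    let mut bad = Vec::new();
    for (name, (kind, required)) in schema {
        let ok = match (args.get(name), kind) {
            (None, _) => !*required,
            (Some(Value::Str(_)), Kind::Str) => true,
            (Some(Value::List(_)), Kind::List) => true,
            _ => false,
        };
        if !ok { bad.push(name.clone()); }
    }
    bad
}

/// Toy parser for strings like `["a", "b"]`; a real harness would use a JSON library.
fn parse_string_list(s: &str) -> Option<Vec<String>> {
    let inner = s.trim().strip_prefix('[')?.strip_suffix(']')?;
    if inner.trim().is_empty() { return Some(Vec::new()); }
    inner
        .split(',')
        .map(|item| {
            let unquoted = item.trim().strip_prefix('"')?.strip_suffix('"')?;
            Some(unquoted.to_string())
        })
        .collect()
}

/// Validate first; touch only the rejected fields, so valid input is never rewritten.
pub fn validate_then_repair(mut args: Args, schema: &Schema) -> Result<Args, Vec<String>> {
    for name in rejected(&args, schema) {
        let (kind, required) = schema[&name];
        match (args.get(&name).cloned(), kind) {
            // A null where an optional field should simply be absent.
            (Some(Value::Null), _) if !required => {
                args.remove(&name);
            }
            // An array that arrived as a string.
            (Some(Value::Str(s)), Kind::List) => {
                if let Some(items) = parse_string_list(&s) {
                    args.insert(name, Value::List(items));
                }
            }
            _ => {}
        }
    }
    let still_bad = rejected(&args, schema);
    if still_bad.is_empty() { Ok(args) } else { Err(still_bad) }
}
```

Because the repair loop only iterates over what rejected returned, a string field whose content happens to look like an array is left alone as long as the schema accepted it. A preprocess-then-validate design would have rewritten it.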
For editing, Dirge uses tree-sitter to check syntax before a write, edit, or apply_patch touches the disk. Tree-sitter parses the code into a syntax tree according to the language's grammar and marks the places where the parse failed. Syntactically broken code is rejected with errors that point to the exact line and column. A missing token is named directly from the grammar, e.g., “missing }”. For brace and Lisp languages, a balance summary that understands comments, strings, and char literals points to the exact unclosed bracket. Telling the model immediately and exactly what's wrong works much better than letting it save broken code and discover the break three steps later.
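The balance summary is the part that is easy to illustrate without pulling in tree-sitter. Here is a simplified sketch that is not Dirge's code: it assumes a brace language with // line comments and single-line double-quoted strings, and it ignores char literals and block comments, which a real checker has to handle. The Balance type and check_balance function are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub enum Balance {
    Ok,
    Unclosed { bracket: char, line: usize, col: usize },
    Unexpected { bracket: char, line: usize, col: usize },
}

/// Check bracket balance, skipping brackets inside strings and `//` comments.
/// Lines and columns are 1-based; columns count characters.
pub fn check_balance(src: &str) -> Balance {
    let mut stack: Vec<(char, usize, usize)> = Vec::new();
    for (i, text) in src.lines().enumerate() {
        let line = i + 1;
        let mut in_string = false;
        let mut escaped = false;
        let mut prev = '\0';
        for (j, c) in text.chars().enumerate() {
            let col = j + 1;
            if in_string {
                if escaped {
                    escaped = false;
                } else if c == '\\' {
                    escaped = true;
                } else if c == '"' {
                    in_string = false;
                }
                continue;
            }
            match c {
                '"' => in_string = true,
                '/' if prev == '/' => break, // rest of the line is a comment
                '(' | '[' | '{' => stack.push((c, line, col)),
                ')' | ']' | '}' => {
                    let open = match c { ')' => '(', ']' => '[', _ => '{' };
                    match stack.pop() {
                        Some((b, _, _)) if b == open => {}
                        // A closer that does not match means the innermost opener was never closed.
                        Some((b, l, k)) => return Balance::Unclosed { bracket: b, line: l, col: k },
                        None => return Balance::Unexpected { bracket: c, line, col },
                    }
                }
                _ => {}
            }
            prev = c;
        }
    }
    match stack.pop() {
        Some((bracket, line, col)) => Balance::Unclosed { bracket, line, col },
        None => Balance::Ok,
    }
}
```

When a closing bracket does not match, the sketch blames the innermost opener rather than the closer, because that is usually the position the model needs to look at.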
Some problems cannot be solved by a single mechanism of this sort because they span more than a single call. For example, weak models often repeat the same failing attempt over and over. A circuit breaker stops this and forces a reflect-then-pivot: the model has to name the false belief behind its failures and propose a different approach. A separate metacognitive monitor watches for a string of different failures; after a certain number, it points out what the recent failures have in common and makes the model name the single false belief that has been causing them. An intra-session memory records abandoned approaches and prevents the model from quietly walking back into the same blind alley. When repair fails or a pre-write syntax check fails, the turn may be escalated to a stronger model, which then hands control back to the cheaper one.
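A circuit breaker of this kind needs very little state. The sketch below is illustrative only and assumes the simplest possible rule: count failures per identical call and trip once a limit is reached. The names CircuitBreaker and Verdict, and the idea of keying on the raw call string, are my assumptions for the example.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Continue,
    ReflectThenPivot, // stop retrying and make the model rethink its approach
}

pub struct CircuitBreaker {
    limit: u32,
    failures: HashMap<String, u32>,
}

impl CircuitBreaker {
    pub fn new(limit: u32) -> Self {
        Self { limit, failures: HashMap::new() }
    }

    /// Record the outcome of a call. Trips once the same call has failed
    /// `limit` times without a success in between.
    pub fn record(&mut self, call: &str, succeeded: bool) -> Verdict {
        if succeeded {
            self.failures.remove(call);
            return Verdict::Continue;
        }
        let count = self.failures.entry(call.to_string()).or_insert(0);
        *count += 1;
        if *count >= self.limit {
            Verdict::ReflectThenPivot
        } else {
            Verdict::Continue
        }
    }
}
```

With a limit of three, the third identical failure returns ReflectThenPivot, which is the point where the harness would inject the reflection prompt instead of letting the model try a fourth time.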
Finally, the loop doesn't just take the model's word that it has finished. A pre-finalization gate checks whether the code was changed and whether a build or test ran and passed. If not, it injects a soft nudge to get the model to actually finish the task. When you set up a critic_provider, substantive runs escalate to a bounded second opinion. This judge asks: is this really complete and correct? If not, the run goes back into the loop for more work. For headless runs, you can give the agent a clear --goal like “all tests pass and changes committed.” An independent judge keeps the run open until this goal is achieved or the turn budget is exhausted. This composes nicely with a two-model setup where a capable model does the work and a cheap, fast model judges the state of progress.
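The gate itself can be expressed as a small decision table. This sketch is a guess at the logic rather than a copy of it: in particular, it assumes that a run which changed no code is allowed to stop without a build or test, which the description above does not spell out. RunState, Gate, and the nudge wording are all hypothetical.

```rust
#[derive(Debug, Default, Clone, Copy)]
pub struct RunState {
    pub code_changed: bool,
    pub check_ran: bool,    // a build or test was executed after the last change
    pub check_passed: bool, // and it succeeded
}

#[derive(Debug, PartialEq)]
pub enum Gate {
    AllowStop,
    Nudge(&'static str),
}

/// Decide whether the model may stop, or should be nudged to keep working.
pub fn pre_finalization_gate(state: RunState) -> Gate {
    match (state.code_changed, state.check_ran, state.check_passed) {
        // Assumption: nothing was changed, so there is nothing to verify.
        (false, _, _) => Gate::AllowStop,
        (true, false, _) => Gate::Nudge("You changed code but never ran a build or test."),
        (true, true, false) => Gate::Nudge("The last build or test failed; fix it before finishing."),
        (true, true, true) => Gate::AllowStop,
    }
}
```

The nudge is deliberately soft: it is a message added to the context, and the model still decides what to do next.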
While none of these features might seem remarkable by itself, together they address much of the mechanical noise: malformed calls that introduce garbage data, interrupted edits that corrupt state, and retry loops that multiply errors. These common problems would otherwise consume context and drag the model into a vicious cycle. Each feature takes care of one particular kind of failure, and collectively they prevent these failures from compounding and getting out of control. Without these rails, the model loses clear context and begins to emit increasingly bad outputs, which cause more failures, and so on in a downward spiral.
Holding the thread over a long task
Steering and correcting each turn keeps individual steps on track. But actual work is done across many turns, which means that eventually the context window fills up. The obvious way of dealing with this, summarizing the conversation when it gets too long, doesn't actually work all that well in practice. A monolithic summary written near the limit of capacity often loses important information. And what's worse is that the model writes that summary just when its ability to compress is at its most degraded.
Dirge's long-horizon architecture is inspired by MiMo-Code. When the conversation becomes too long, old history is folded into a structured summary, and that summary is stored as part of a persistent session checkpoint. Each conversation is assigned a stable ID, which remains constant even across internal rotations during a fold. The checkpoint pairs the latest summary with the original request, creating a write-once anchor that prevents goal drift when the model repeatedly re-summarizes the body.
Two things keep this system running smoothly during long sessions. First, the checkpoint is updated incrementally in the background at regular intervals as the context window fills. This ensures that each fold happens while the model is still able to summarize effectively, avoiding emergency compaction. Second, resume really does pick up exactly where the user left off, because Dirge maps any ID to the live end of its chain.
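Both ideas, the write-once anchor and resolving an ID to the live end of its chain, fit in a few lines. The sketch below is a hypothetical data model, not Dirge's schema: it assumes a checkpoint is just two strings and that rotations are stored as a map from an old ID to the ID that replaced it.

```rust
use std::collections::HashMap;

pub struct Checkpoint {
    original_request: String, // write-once anchor: set in `new`, never updated
    summary: String,          // replaced on every fold
}

impl Checkpoint {
    pub fn new(request: &str) -> Self {
        Self { original_request: request.to_string(), summary: String::new() }
    }

    /// Replace the running summary; the original request is left untouched.
    pub fn fold(&mut self, new_summary: &str) {
        self.summary = new_summary.to_string();
    }

    /// Text that would be placed at the head of the fresh context.
    pub fn render(&self) -> String {
        format!(
            "Original request: {}\nProgress so far: {}",
            self.original_request, self.summary
        )
    }
}

/// Follow "rotated to" links until reaching the live end of the chain.
pub fn resolve_live<'a>(rotations: &'a HashMap<String, String>, id: &'a str) -> &'a str {
    let mut current = id;
    let mut hops = 0;
    while let Some(next) = rotations.get(current) {
        current = next.as_str();
        hops += 1;
        if hops > rotations.len() {
            break; // guard against an accidental cycle
        }
    }
    current
}
```

However many times the summary is rewritten, the rendered checkpoint still opens with the request exactly as the user typed it, which is what keeps the goal from drifting.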
The end result is that a long-running task stays on the rails. The model is always working against an up-to-date, intent-anchored picture of where things are, rather than a degraded transcript.
An agent that learns your project
The third time scale is the most annoying thing about modern agents. Since they have no memory, each session begins from scratch. A fresh agent will not remember that your project uses eslint-config-custom or that the integration-test mock server needs to be started with --feature=test-utils. It will also forget that you spent 45 minutes last week fixing a race condition in the auth middleware. This amnesia means that context and instructions have to be rebuilt every time a session begins.
Dirge addresses the problem by building an active learning system for each project, borrowing the memory architecture from Hermes Agent and adapting it to coding. A memory store collects facts about the project, such as build commands, coding conventions, and library oddities. The store also keeps a log of mistakes that come up during coding sessions, tracking things that have been tried and found not to work. Alongside it, a skills system accumulates know-how for specific types of task within this codebase. When a session ends, a background process forks the agent, leaving the fork with only its memory and skill tools, and asks it what it learned. These findings are saved to storage without perturbing your session. A curator agent keeps the library healthy by merging overlapping entries and archiving stale ones. There's also a global layer that spans projects. Say you always want TDD, or want commit messages to stay terse; these directives go into this layer, which keeps them separate from project-specific facts.
Most agents rely heavily on markdown files such as MEMORY.md, which the agent reads into context and edits during the session. It's a fragile process because these files quickly grow out of control: every byte rides along in the prompt whether it's relevant or not, and the agent can't search them effectively. On top of that, it's easy for these files to get out of sync with the actual state of the project, at which point their misleading information can steer the agent down the wrong path. Dirge instead uses a per-project SQLite database to store memories, skills metadata, the full session history, and the long-horizon checkpoints.
Memory injection is handled by a two-tier system. First, hot entries that are highly relevant get placed directly into the system prompt. Once that inline space runs out, the remaining entries drop down to a one-line breadcrumb index that the agent can look up when needed. This approach keeps the prompt small and stable in cache, no matter how much the project grows. And since there's a database backing the data, SQLite's FTS5 full-text search engine replaces wrangling markdown files. The agent can now query its own history (e.g., “how did we fix this last time?”) instead of reading through a wall of markdown text. Each entry also has a salience score: using an entry boosts its salience, while disuse causes it to drop. When space gets low, the least useful entries get pruned. Removing an entry leaves a tombstone, so archived material can be resurrected if needed. And so, you no longer have to maintain text files and constantly remind the agent to keep them up to date. The process runs consistently and transparently in the background of the session, and each new session starts with an up-to-date picture of how the project works.
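The two-tier split and the salience score can be sketched together. What follows is an illustration under assumptions, not the real implementation: salience is a plain float, a use adds 1.0, disuse multiplies by 0.9, and the inline budget is measured in bytes of entry body. All names and constants are made up.

```rust
#[derive(Debug, Clone)]
pub struct Memory {
    pub title: String,
    pub body: String,
    pub salience: f64,
}

impl Memory {
    /// The entry was used: boost it. (The increment is an arbitrary choice.)
    pub fn touch(&mut self) { self.salience += 1.0; }
    /// The entry went unused: let it fade. (The factor is an arbitrary choice.)
    pub fn decay(&mut self) { self.salience *= 0.9; }
}

/// Split memories into full inline entries that fit in `budget` bytes and
/// one-line breadcrumbs (titles only) for everything after the budget runs out.
pub fn inject(memories: &[Memory], budget: usize) -> (Vec<String>, Vec<String>) {
    let mut sorted: Vec<&Memory> = memories.iter().collect();
    sorted.sort_by(|a, b| b.salience.total_cmp(&a.salience)); // hottest first
    let mut inline = Vec::new();
    let mut breadcrumbs = Vec::new();
    let mut used = 0;
    let mut full = false;
    for m in sorted {
        if !full && used + m.body.len() <= budget {
            used += m.body.len();
            inline.push(format!("{}: {}", m.title, m.body));
        } else {
            full = true; // once the inline tier is full, the rest become breadcrumbs
            breadcrumbs.push(m.title.clone());
        }
    }
    (inline, breadcrumbs)
}
```

Because the inline tier is capped by a fixed budget, the prompt size stays bounded no matter how many entries the project accumulates; only the breadcrumb list grows, one short line at a time.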
Programmable: the harness as a platform
The real power of Dirge, though, comes from the ability to reach inside any part of its agentic loop. Most agents are barely customizable. They give you a config file with a few flags, or an MCP model where every extension runs as its own process that the agent communicates with via JSON-RPC, and where the model itself has to find and invoke each of these tools. While MCP is fully supported, and can be great for hooking up external tools, it's a roundabout way to change the behavior of the agent itself.
Pi got the right basic idea here. It exposes the agent lifecycle as hooks, so users can write code against them using a full-blown language. Dirge takes a similar approach with Janet, an embeddable Lisp with a runtime of about 1 MB. Users can drop plugins into ~/.config/dirge/plugins/ or into .dirge/plugins/ within a specific project folder. This way you can create plugins that target individual projects in addition to providing global tooling. Plugins can be either individual .janet files or folders. Each one runs on its own worker thread, so if a plugin misbehaves, it can't starve the whole session.
A minimal plugin looks as follows:
# Show a notice whenever the prompt mentions "security".
(defn on-prompt [ctx]
  (when (string/find "security" (ctx :prompt))
    (harness/notify "running with security mindset" :info)))
Plugins get the full harness API. They can intercept any tool call, block it, alter it, or replace it entirely. They can even add new slash commands and tools or pop up dialogs in the TUI. Anything from augmenting the system prompt to modifying the message context before each call is directly available to plugins. All this happens via lifecycle hooks like on-init, on-prompt, on-tool-start, on-tool-end, and prepare-next-run. Each example plugin in the project repository does something useful in under 100 lines of code. The one I use the most is an nREPL plugin that connects to a running Clojure REPL and gives the agent an nrepl_eval tool that lets it check its own changes live as it works on the program. The plugin replaces an entire MCP server with just a few lines of code. For power users, the agent becomes a platform which they can extend and modify to fit their needs.
The rest of the package
Dirge ships with everything you'd expect from a modern coding agent and a few things you might not. You can use OpenRouter, OpenAI, Anthropic, Gemini, DeepSeek, GLM, and Ollama out of the box, or configure any OpenAI-compatible endpoint. Role-based routing allows different parts of the system to use different models. The main loop, summarizer, critic, and subagents can each point to their own model. So one model can handle the conversation while another reviews the code.
Dirge also connects to language servers, with support for Rust, TypeScript, Python, Go, Clojure, Java, C/C++, and Ruby. Each one provides real-time code diagnostics and suggestions. There is tree-sitter support for 11 languages, which powers list_symbols to show all symbols in a file, get_symbol_body to extract function or class bodies, and find_callers/find_callees to trace function calls.
A unified permission engine controls system access with four modes. Per-tool glob patterns determine which files each tool can read or write. Session allowlists grant temporary permissions.
Dirge also provides support for git worktrees with the /worktree command to create a detached branch workspace, /wt-merge to merge changes back in, and /wt-exit to clean up. Using worktrees allows each task to stay in its own isolated branch.
Dirge is free and open source under the GPL-3.0 license. You can install it via cargo install dirge-agent or brew install dirge-code/dirge/dirge. Once installed, set your API key as an environment variable and run dirge to start the agent. Dirge targets the most common complaints about other agents, such as excessive memory usage and frequent tool-call failures, and it learns your project over time, so it gets better the more you use it. If you found this post interesting, I hope you'll give Dirge a shot and see if it improves your workflow as it has mine.