Strategic Forgetting: A Cognitive Architecture for Long Horizon Autonomy
I have been building software for close to two decades. In that time, the single most consistent pattern I have observed in my own work is this: I do not solve hard problems by remembering everything. I solve them by maintaining a clear mental model of where I am, what I have tried, what I have learned, and what I plan to do next. The raw details, the exact error messages, the specific terminal output from three hours ago, those I let go. When I need them again, I look them up. I re-read the file. I re-run the test. I check the logs.
This is not a limitation of human memory. It is, I have come to believe, a prerequisite for sustained coherent thought.
When I started building an autonomous coding agent in mid 2024, the prevailing assumption in the industry was that the path forward was more memory. Bigger context windows. 32K, then 128K, then 200K tokens. The implicit promise was simple and appealing: if the model can see everything, it will understand everything. Just keep accumulating the conversation history, let the context window grow, and the model’s attention mechanism will sort out what matters.
This was before context window caching was widely available as a product feature. Before the industry settled on the current approaches to conversation compaction. The question I was asking was not “how do we compete with what exists?” It was a more basic question: if I were designing a cognitive architecture for an agent that needed to work autonomously for hours on complex problems, what would that architecture look like?
The answer I kept coming back to was not a bigger memory. It was a curated one.
What Strategic Forgetting Actually Is
Let me be precise about this, because the name invites misunderstanding. Strategic Forgetting is not about throwing information away. It is about transforming information from one form into another.
When an agent works on a problem, it generates a stream of raw experience. Commands executed. Files read. Error messages encountered. Test results observed. Hypotheses formed and tested. This stream is rich in detail and utterly hostile to coherent reasoning once it grows beyond a certain density. It is a tape recording of everything that happened, in order, with no editorial judgment applied to any of it.
Strategic Forgetting is a structured process in which the agent periodically pauses, reflects on its recent experience, and extracts the essential knowledge from that experience into a dense, organized format. Then it releases the raw history and continues working from the distilled understanding.
This is not a single summarization call. It is a multi-pass reflection process that extracts different categories of knowledge separately:
Actions and outcomes. What was done. What the result was. What worked and what did not. This is the factual record, stripped of noise.
Current state. The present condition of the project, the code, the system under work. Not what it looked like four iterations ago. What it looks like now.
Reasoning and decisions. Why specific choices were made. What alternatives were considered and rejected, and the reasons for rejection. This is perhaps the most critical category, because without it the agent is liable to reconsider paths it has already ruled out.
Open questions. What remains uncertain. What assumptions have not yet been verified. What needs to be checked before proceeding.
The output of this process reads nothing like a conversation summary. It reads like the notes a senior engineer writes to themselves before stepping away from a problem for the evening. Dense, structured, opinionated, and focused entirely on enabling the next phase of work.
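The multi-pass structure described above can be sketched in a few lines. This is an illustrative Python outline, not NonBioS's implementation: the `summarize` callable stands in for an LLM call whose real interface I am not showing, and the category prompts are placeholders.

```python
from dataclasses import dataclass, field

# Hypothetical prompts for the four knowledge categories described above.
CATEGORIES = {
    "actions_and_outcomes": "What was done, what the result was, what worked.",
    "current_state": "The present condition of the project and code.",
    "reasoning_and_decisions": "Why choices were made; alternatives rejected and why.",
    "open_questions": "What remains uncertain or unverified.",
}

@dataclass
class Reflection:
    """Distilled working memory produced by one reflection cycle."""
    sections: dict = field(default_factory=dict)

def reflect(raw_history: str, summarize) -> Reflection:
    """Run one extraction pass per category over the raw history.

    `summarize(prompt, history)` is an assumed stand-in for an LLM call.
    After this returns, the caller releases `raw_history` and continues
    working from the Reflection alone.
    """
    reflection = Reflection()
    for name, instruction in CATEGORIES.items():
        prompt = f"Extract only: {instruction}"
        reflection.sections[name] = summarize(prompt, raw_history)
    return reflection
```

The point of the separate passes is that each category gets its own focused extraction rather than competing for attention in a single summarization call.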
And here is the principle that makes the entire architecture viable: the agent does not need to remember what is in a file. It has cat and grep and every other tool at its disposal. It does not need to remember which tests pass. It can run them. It does not need to carry forward the raw output of every command it has ever executed. It needs to carry forward what it learned from those commands. The ground truth is always one tool call away. The understanding of what that ground truth means is what Strategic Forgetting preserves.
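The "one tool call away" principle amounts to re-deriving facts on demand rather than recalling them. A minimal sketch, assuming a POSIX shell is available; `ground_truth` is a name I am inventing for illustration, not part of NonBioS:

```python
import subprocess

def ground_truth(command: str) -> str:
    """Re-fetch a fact from the environment instead of recalling it.

    The agent's curated memory records what a command *taught* it;
    the raw output itself is re-derivable whenever it is needed again.
    """
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout

# e.g. rather than remembering a file's contents across fifty turns:
# contents = ground_truth("cat src/main.py")
# rather than remembering which tests pass:
# report = ground_truth("pytest -q")
```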
There is a phrase we use internally that captures what the reflection process really is, and I think it communicates the idea more directly than any technical description: the agent is curating its own brain.
Every reflection cycle is the agent making editorial decisions about its own cognition. What is worth keeping in active memory. What can be let go. What needs to be reframed because the original understanding was incomplete or wrong. It is not an external process being done to the agent. It is the agent doing something to itself, reshaping its own working context so that what remains is maximally useful for what comes next.
This is what distinguishes Strategic Forgetting from any approach where compression happens as a mechanical response to running out of space. Compacting because you have hit a token limit is plumbing. Continuously curating your own cognitive state because that is how sustained intelligent work actually functions is architecture. The difference is not in the mechanism. Both involve summarization. The difference is in the intent, the frequency, the rigor, and the understanding of why it matters.
An agent that curates its own brain does not wait until it is overwhelmed. It maintains clarity as an ongoing discipline. The result is an agent whose working context at turn fifty is as focused and coherent as it was at turn five. Not because it has forgotten fifty turns of work, but because it has distilled fifty turns of work into precisely the understanding needed to execute turn fifty-one.
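The distinction between curation-as-discipline and compaction-as-overflow-handling comes down to the trigger condition. A toy sketch; the cadence and limit numbers here are assumptions for illustration, not NonBioS's actual tuning:

```python
def should_reflect(turn: int, context_tokens: int,
                   cadence: int = 10, hard_limit: int = 180_000) -> bool:
    """Proactive curation: reflect on a fixed cadence, long before the
    context window is anywhere near full.

    The token limit acts only as a safety net, never as the primary
    trigger. Compacting only at the limit would be the 'plumbing'
    approach; the cadence is the 'architecture' approach.
    """
    if context_tokens >= hard_limit:  # safety net, rarely reached
        return True
    return turn > 0 and turn % cadence == 0
```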
The Cognitive Architecture Was the Starting Point
I want to emphasize something that I think distinguishes this work from approaches that have emerged since: the cognitive science parallel was not something we noticed after the fact. It was the design framework from day one.
Human cognition operates on multiple memory systems, and the interactions between those systems are well studied. Working memory is small and actively maintained. It holds what you are thinking about right now. Episodic memory is the raw experiential record, rich in sensory detail and temporal context. Semantic memory is distilled knowledge, decontextualized, stripped of the specific episode in which it was acquired. And external memory is everything you have written down, filed away, or can look up on demand.
The critical insight, and this comes from decades of cognitive science research, is that humans do not solve complex multi-day problems through perfect episodic recall. The person who remembers every detail of every debugging session they have ever conducted is not a better engineer. The person who has distilled those sessions into deep understanding of how systems fail is.
The transition from episodic to semantic memory is not information loss. It is information refinement. It is the process by which raw experience becomes usable knowledge.
Strategic Forgetting implements this transition explicitly. The agent’s raw experience (terminal output, file contents, error messages) is episodic memory. The reflection process converts it to semantic memory (structured understanding of state, progress, and reasoning). The file system and toolchain serve as external memory, always available for re-consultation when specific details are needed.
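The mapping onto memory systems can be made concrete with a toy structure. The class and method names below are illustrative, not NonBioS internals:

```python
class AgentMemory:
    """Toy model of the memory systems described above.

    - episodic: raw turn-by-turn observations (terminal output, etc.)
    - semantic: distilled understanding produced by reflection
    - external memory (filesystem, toolchain) is deliberately absent:
      it lives in the environment and is consulted via tools on demand.
    """

    def __init__(self):
        self.episodic = []    # raw experiential record
        self.semantic = None  # distilled knowledge

    def observe(self, event: str):
        self.episodic.append(event)

    def consolidate(self, distill):
        """The episodic-to-semantic transition: refine, then release."""
        self.semantic = distill(self.episodic)
        self.episodic.clear()  # strategic forgetting: the raw record is let go
```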
We did not set out to make our agent more human. We set out to make it more effective at sustaining coherent behavior over long task horizons. The cognitive architecture gave us a principled framework for thinking about what that requires. Almost two years later, it remains the best framework we have found.
What One Year of NonBioS in Production Has Taught Us
We have run Strategic Forgetting across thousands of sessions. Complex debugging tasks. Multi-file refactors. Feature implementations that touch dozens of modules. Infrastructure work. Migration projects. The kind of work that takes hours and demands sustained understanding of an evolving codebase.
Here is what we have learned.
What Works
The core mechanism is remarkably robust. Over the full history of NonBioS operating under Strategic Forgetting, there has been no major bug that the agent has been unable to solve. I want to be careful about how I frame this. We have not run controlled benchmarks against alternative approaches. We do not have a peer-reviewed paper with ablation studies. What we have is the accumulated evidence of daily, real-world use across a wide variety of codebases and problem types, and that evidence is consistent: the agent maintains coherence over task durations that, in our experience, cause agents relying on raw accumulated context to degrade.
It does not go in circles. It does not re-attempt approaches that have already failed. It does not lose track of its own reasoning. When it encounters a dead end, the reflection process captures that dead end explicitly, and the agent does not revisit it.
What Surprised Us
The quality threshold for the reflection process is far higher than we initially expected. This was our most expensive lesson. Early versions used lightweight summarization. A single pass, a simple prompt asking the model to condense the recent history. The results were worse than no summarization at all.
The failure mode was insidious. A reflection pass that drops a single critical detail (the test failed because of a timezone mismatch, not because of a missing import) produces a confident, well-structured summary that points the agent in exactly the wrong direction. And because the raw history has been released, there is no safety net. The agent follows the flawed summary until it re-discovers the correct information through its own investigation, which can take many turns.
Getting the reflection pipeline right, making it consistently extract the right information at the right level of detail, took far more iteration than we anticipated. It is, in many ways, the hardest engineering problem in the entire system.
The second surprise was about metacognition. After a reflection cycle, the agent is working from a summary, not from raw observation. This means it must recognize when the summary tells it something was done but does not include the specific details of how. It needs to know, at the right moment, to re-read a file and verify that a previous edit actually landed correctly, rather than assuming it did because the summary says so.
This capacity, the ability to reason about the boundaries of your own current knowledge, is something that only works with genuinely capable models. We expected the “forgetting” part to be the hard engineering challenge. It turned out that the “knowing what you have forgotten” part was harder and more important.
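The verification behavior described above might look something like this in practice: when the summary says an edit was made but does not carry the diff, re-read the file and check rather than trusting the summary. A hypothetical sketch that shells out to `grep`; the function name is mine, not the agent's:

```python
import subprocess

def verify_edit(path: str, expected_fragment: str) -> bool:
    """Confirm that a previously summarized edit actually landed.

    After a reflection cycle the agent holds 'the fix was applied',
    not the fix itself, so it re-derives the detail from ground truth.
    grep exits 0 when the fragment is found in the file.
    """
    result = subprocess.run(["grep", "-F", expected_fragment, path],
                            capture_output=True, text=True)
    return result.returncode == 0
```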
What Is Still Hard
I believe in being honest about limitations, particularly when the underlying idea is one I am asking people to take seriously.
The current frontier challenge for Strategic Forgetting is extended multi-feature sessions. When the agent works through a series of feature requests on the same codebase, operating over many hours, it can lose track of which features have been completed and which remain. The individual feature work is solid. Each feature, taken on its own, is handled with the coherence and quality we expect. But the meta-level tracking of “I have finished features one through four and I am now working on feature five” can degrade over very long horizons.
We understand the mechanism behind this. We have a targeted fix in development that we expect to ship within the next month. But I mention it here because it illustrates something important about the nature of this work: Strategic Forgetting is not a technique you implement once and declare finished. It is a living architecture that reveals new edge cases and new requirements as you push it into harder territory.
Our most recent releases, nonbios-1.135 and 1.141, shipped last week with significant upgrades to the cognitive architecture. Internal testing shows a substantial step up in capability. But we are still learning. Roughly every other month, we discover something new about how reflection should work, what information needs to be preserved differently, how the structure of the summary should adapt to different types of tasks. Each discovery gets modeled into the architecture. The pace of improvement has not slowed down.
Why This Requires Frontier-Class Models
There is a finding from our work that I think has implications beyond NonBioS, and it is worth stating directly: Strategic Forgetting creates a sharp capability threshold in the underlying model.
The reflection process asks the model to read a dense, interleaved stream of terminal output, code snippets, error messages, and tool results, and to extract precisely the information that matters while discarding everything that does not. This is not a routine language task. It requires genuine comprehension of code, of error semantics, of the logical relationships between actions and outcomes. A model that cannot deeply understand the work cannot accurately reflect on it.
Similarly, operating effectively from summary context after reflection requires the model to identify its own knowledge gaps and proactively fill them through tool use. This is metacognition: reasoning about what you know and what you do not know, and acting accordingly.
What we have observed is that models above a certain capability threshold produce high-fidelity reflections and operate naturally from summarized context. Models below that threshold produce lossy summaries and then act confidently on incomplete information. The degradation is not gradual. It is a cliff. There is very little middle ground.
I find this encouraging rather than limiting. As frontier models improve, the quality of Strategic Forgetting improves automatically. Better reasoning produces better reflection, which produces better memory curation, which produces better task outcomes. The architecture is designed to leverage exactly the capabilities that the field is improving fastest. What works well today with the best available models will work even better tomorrow.
This is also why the engineering behind Strategic Forgetting is more constrained than most people would expect. It is not enough to have a good summarization prompt. The entire agent architecture has to be built around the principle that every token in the system prompt, every instruction, every piece of scaffolding exists under extreme budget pressure.
The NonBioS agent uses zero native tool calls.
None. Every interaction with the environment happens through raw shell commands composed by the model and executed directly. This was not a shortcut or a simplification. It was a deliberate architectural choice made to keep the system prompt as small and tightly controlled as possible.
Native tool call schemas are expensive. Each tool definition, with its parameter descriptions, type annotations, and usage instructions, consumes tokens in the system prompt. An agent with fifteen registered tools might spend thousands of tokens just describing its own capabilities before it does any actual work. In a conventional architecture with a 200K token window, this cost is negligible. In an architecture built around Strategic Forgetting, where the working context is deliberately kept lean and every token must justify its presence, it is a cost we are unwilling to pay.
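The arithmetic behind that claim is simple. Assuming a rough 250 tokens per tool schema, which is a guess on my part since real schemas vary widely:

```python
def schema_overhead(num_tools: int, avg_tokens_per_schema: int = 250) -> int:
    """Back-of-envelope cost of native tool schemas in a system prompt.

    The per-schema figure is an assumption: parameter descriptions,
    type annotations, and usage instructions all count against it.
    """
    return num_tools * avg_tokens_per_schema

# fifteen registered tools spend thousands of tokens before any work begins
```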
So the agent operates on raw commands. It writes bash. It reads files with cat and grep. It modifies code with sed and direct file writes. It runs tests by invoking the test runner. It navigates the project with find and ls.
This includes file editing, which deserves specific mention because anyone who has worked on AI coding agents knows it is one of the hardest problems in the space. Nearly every agent framework has landed on some form of custom tool call for file edits, typically involving a specialized diff format that the model generates and the framework applies. These formats exist because getting a model to reliably edit files through raw commands is genuinely difficult. The escaping, the edge cases, the multi-line modifications, the risk of corrupting a file with a malformed sed command. It is a real engineering challenge and most teams have understandably decided to abstract it away.
NonBioS does not use a tool call even for file edits.
It does not rely on a diff format. The agent edits files through the same bash commands it uses for everything else. Every edit is a real command that you can read, understand, and verify. There is no translation layer, no custom syntax, no proprietary format sitting between the model’s intent and the change on disk. If you are using NonBioS, you can see exactly what the agent is doing at every step, because every step is a transparent operation in a language you already know.
Getting this to work reliably was one of the hardest engineering challenges in the entire system. It took extensive iteration and it demanded a level of model capability that simply was not available until recently. But today it works correctly very close to 100% of the time. That was a milestone we did not take for granted when we reached it.
This places a higher burden on the model. It must know how to compose correct commands without the guardrails of a structured tool interface. But it also means the system prompt stays small, the agent’s instructions stay focused, and the entire token budget is available for what actually matters: the curated memory and the current task. And it means the human watching the agent work never has to wonder what just happened. Every action is legible. Every edit is a command you could run yourself.
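Concretely, an edit in this style is nothing more than a shell command the model composes and a harness executes. The harness below is an illustrative sketch, not NonBioS code; the edit command shown is one example of what such a transparent operation looks like:

```python
import subprocess

def run_edit(command: str) -> int:
    """Execute a model-composed shell command directly.

    There is no diff format and no translation layer: the command
    itself is the full record of the edit, readable and replayable
    by the human watching the agent work.
    """
    return subprocess.run(command, shell=True).returncode

# e.g. a whole-file write via a heredoc-style command -- every character
# of the change is visible in the command itself:
# run_edit("cat > config.ini <<'EOF'\n[server]\nport = 8080\nEOF")
```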
This is the kind of tradeoff that only makes sense when you take the cognitive architecture seriously as a design constraint rather than treating it as an optimization to apply after the fact. Every part of the NonBioS agent, from the system prompt to the tool interface to the reflection pipeline, was designed together as a single coherent system in service of one goal: maximizing the density and quality of what occupies the agent’s working context at any given moment.
Where This Is Going
Context windows will continue to grow. Caching and pricing optimizations will make raw token storage increasingly cheap. Many teams are working on agent memory from different angles, and some of that work is very good. I do not claim Strategic Forgetting is the only viable approach to the problem of long-horizon agent coherence.
What I do claim is that it is a fundamentally sound one.
There is another reasonable and increasingly popular approach to this problem that many agent frameworks have adopted. Let the context window fill up naturally, sometimes to 150K or 200K tokens. Take advantage of prompt caching, which makes carrying forward that massive history remarkably cheap per turn. When the window finally approaches capacity, run a compaction pass to summarize and compress, then continue. The logic is straightforward: give the model access to as much raw history as possible for as long as possible, and only compress when you have no other choice.
This is a viable architecture. I want to be clear about that. For many tasks, particularly shorter ones that resolve in a dozen or so turns, it works well. The model has full access to every observation it has made and every action it has taken. Nothing has been lost or transformed. The full fidelity of the raw experience is preserved.
But it is a fundamentally different bet than the one we made. That approach bets that raw fidelity and volume are what matter most, that a model attending to 150K tokens of unfiltered history will be able to extract what it needs through the sheer power of its attention mechanism. Strategic Forgetting bets that curation and density matter more, that a smaller body of deliberately refined knowledge will outperform a larger body of raw, unstructured experience, particularly as tasks stretch longer and the ratio of noise to signal in that raw history grows.
Both are coherent positions. We chose ours for reasons rooted in how we believe sustained cognition actually works, and two years of daily use have not given us cause to reconsider.
The core hypothesis, that active memory curation produces better sustained reasoning than passive memory accumulation, has held up through every revision of the architecture, every edge case we have encountered, and every new class of problem we have thrown at it. The implementation details have changed dozens of times. Some early design choices look naive in retrospect. But the foundational principle, that an agent should transform its experience into understanding rather than simply carrying it forward as raw data, has never once been undermined by what we have observed in practice.
The work ahead is substantial. The multi-feature tracking problem is the immediate priority. Beyond that, there are open questions about hierarchical goal management over very long sessions, about how the reflection process should adapt its behavior to fundamentally different types of tasks, about how to make the architecture more self-correcting when a reflection pass does miss something important.
These are hard problems. We are making measurable progress on them every month. The last year has taught us that the path forward is not to abandon the approach when it encounters a new limitation, but to study the limitation closely and model the solution into the architecture. Every time we have done this, the result has been a meaningful capability improvement. I see no reason to expect that pattern to stop.
Two years ago, we asked a simple question: what if the key to an agent that never loses the plot is one that deliberately lets go of everything except the plot? We are still refining the answer. But the question, I am more convinced than ever, was the right one to ask.

