Large language models have finite context windows. When a conversation or agent session exceeds the window, something must be discarded. Most frameworks handle this through context compaction (also called compression or summarisation): a model summarises the conversation history to make room for new content.
This works reasonably well for conversation content. It works badly for standing instructions — the system prompt, role definitions, behavioural policies, and operating procedures that define how an agent should behave.
The failure mode is subtle. After compaction, the agent continues to function. It answers questions, uses tools, produces outputs. But it drifts. Policies it was told to follow are no longer in context. Safety constraints that were stated explicitly are now at best implied by a summary. The agent doesn't know it has forgotten anything, because the compacted summary reads as a coherent (if abbreviated) conversation.
This is not a theoretical concern. It is observed across every major agent framework. We have seen it in our own production systems, and the broader community documents the same patterns across providers:
OpenAI's Codex, for example, relies on a server-side /responses/compact endpoint that returns an encrypted compaction item carrying forward prior state. This is architecturally more sophisticated than client-side summarisation. But users still report that compaction obscures session context and silently drops prior work — confirmations like "Bug X is fixed" vanish, and the agent re-runs expensive steps it has already completed because it no longer remembers doing them.

The obvious solutions all have the same structural flaw: they treat the system prompt as conversation content.
System message role. All major LLM APIs provide a mechanism for system-level instructions separate from conversation messages — Anthropic's Messages API has a top-level system parameter, OpenAI's Responses API has instructions, and so on. But in most agent frameworks, this is assembled once at the start of the session and then becomes part of the token sequence that gets managed (and compacted) as a unit. The API distinguishes the system prompt; the framework's conversation management typically does not.
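The API-level separation, and the flattening that erases it, can be seen in the request shape itself. A minimal sketch (the field names follow Anthropic's Messages API; the flattening step is illustrative of what a framework's conversation manager typically does):

```python
# Anthropic-style request: the system prompt is a top-level field,
# structurally separate from the conversation messages.
api_request = {
    "model": "claude-sonnet-4-5",
    "system": "You are a careful coding agent. Follow the style guide.",
    "messages": [
        {"role": "user", "content": "Refactor the parser module."},
    ],
}

# What many frameworks then do internally: flatten everything into one
# message list that is stored, grown, and compacted as a single unit.
flattened = [{"role": "system", "content": api_request["system"]}]
flattened += api_request["messages"]

# Once flattened, a compactor that summarises "old messages" has no
# structural way to tell standing instructions from conversation.
assert flattened[0]["role"] == "system"
```

The API distinguishes the two; the flattened list does not, and it is the flattened list that gets compacted.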
Instructions files (CLAUDE.md, AGENTS.md, rules files). Both Claude Code and OpenAI's Codex use instruction files — CLAUDE.md and AGENTS.md respectively — for project-specific instructions. These are injected near the top of the conversation history before the user prompt. Anthropic's documentation is candid about the limitation: instructions from early in the conversation can get lost after compaction. The recommended mitigation is to put persistent rules in the instructions file — but these files are read at session start and become tokens in the conversation. They help new sessions; they do not survive compaction within a long session.
The community experience bears this out. A GitHub issue cataloguing compaction-related context loss in Claude Code lists at least eight open reports describing different symptoms of the same root cause: memory loss after auto-compact, repository paths forgotten, skills context lost, agent teams vanishing mid-session. OpenAI's Codex community reports the same pattern: compaction replaces the full history with a single bridge containing only user prompts plus a summary, so assistant confirmations and prior work vanish.
Common workarounds include: manually quitting and restarting sessions at natural breakpoints; backing up session files before risky operations so they can be restored if compaction destroys context; using customisable compaction prompts to steer what gets preserved in the summary; and writing important state to external files (scratchpads, TODO lists) rather than relying on conversation memory.
These are all coping strategies. None of them address the structural problem.
Framework-level summarisation (LangGraph, ADK). Agent orchestration frameworks like LangChain's LangGraph and Google's ADK offer built-in context management through summarisation, sliding windows, and memory stores. These are more principled than ad-hoc compaction, but they operate on the same fundamental model: the conversation (including system instructions) is a mutable token sequence that gets compressed as it grows. The LangGraph approach groups strategies into "write, select, compress, and isolate" — all operating on conversation content. None of them distinguish structurally between standing instructions and conversation history.
RAG retrieval. Retrieval-augmented generation can fetch relevant knowledge per query. But standing instructions are not "relevant to a query" — they are always relevant. RAG is the wrong mechanism for content that should be permanently present.
Checkpointing and session restart. Save the conversation state to a file, start a new session, re-read instructions from disk. This works but is manual, loses conversational continuity, and doesn't help within a long session.
The solution is to treat the system prompt as infrastructure rather than conversation content. The prompt is not in the conversation. It is prepended to the conversation by the backend on every API call.
The implementation has three components:
The system prompt is assembled from stable, versioned sources:
┌───────────────────────────┐
│ Base system prompt │ ← static, identical across all calls
├───────────────────────────┤
│ Role-specific instructions│ ← fixed for the agent's role
├───────────────────────────┤
│ Context documents │ ← reference material the agent will need
├───────────────────────────┤
│ Task-specific instructions│ ← varies per task, but fixed within a task
└───────────────────────────┘
The "context documents" layer deserves a brief explanation, because it is the least obvious. When you know in advance that an agent will need certain reference material — a style guide, a requirements document, a dataset schema, a client brief — you can pre-fetch that material and include it in the system prompt rather than having the agent retrieve it via tool calls during the conversation. The content is identical on every API call within the task, so it caches efficiently. And because it is part of the system prompt rather than the conversation, it is never compacted. An agent that needs to refer to a 50-page specification on its fifteenth tool call sees the same specification it saw on its first.
The alternative — letting the agent fetch documents as needed — places the content in the conversation history as tool results. Those tool results are subject to compaction. On a long task, the agent may lose access to the very material it was asked to work with. Worse, because the agent decides when and in what order to fetch documents, the content appears at unpredictable positions in the conversation. This defeats prefix caching entirely: if the same document appears at position 3,000 in one call and position 8,000 in the next, there is no shared prefix to cache.
This stack is stored and managed by the backend infrastructure, not by the conversation. It is not a message in the conversation history.
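Assembled deterministically, the stack might look like this (a minimal sketch: the layer names mirror the diagram above, and `build_system_prompt` is a hypothetical helper, not a real API):

```python
def build_system_prompt(base: str, role: str,
                        context_docs: list[str], task: str) -> str:
    """Assemble the prompt stack in a fixed, deterministic order.

    The same inputs always produce byte-identical output, which is what
    keeps the prefix cacheable across every call within a task.
    """
    layers = [base, role, *context_docs, task]
    return "\n\n".join(layers)

# Identical inputs -> identical prompt -> stable cache prefix.
p1 = build_system_prompt("Base prompt.", "Role: editor.",
                         ["Spec v1."], "Task: review chapter 3.")
p2 = build_system_prompt("Base prompt.", "Role: editor.",
                         ["Spec v1."], "Task: review chapter 3.")
assert p1 == p2
```

The point of the sketch is the determinism, not the concatenation: any change to ordering or content between calls produces a different prefix and invalidates the cache.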
When the conversation history needs to be compacted, the compressor receives only the user/assistant messages. System prompt components (SystemPromptPart, cache markers, injected documents) are stripped before the compressor sees them.
The compressor summarises the conversation — what was discussed, what was decided, what tools were called, what results came back. It does not summarise the instructions, because the instructions are not in its input.
After compaction produces a summarised conversation, the backend prepends the full prompt stack to the summarised history before the next API call. The agent receives:
[Full system prompt — fresh, complete, unmodified]
[Compacted conversation summary]
[Recent messages preserved verbatim]
The system prompt is structurally identical to what the agent received on its first call. No degradation. No drift. No summarisation artefacts.
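The strip-compact-reinject cycle described above can be sketched as a pure function over a message list (illustrative names; `summarise` stands in for whatever model call produces the conversation summary):

```python
def compact(history, system_prompt, summarise, keep_recent=4):
    """Compact conversation history while leaving the system prompt untouched.

    Only user/assistant messages reach the summariser; the system prompt
    is held aside and prepended afterwards, fresh and unmodified.
    """
    # Strip infrastructure content: only conversation messages are compacted.
    convo = [m for m in history if m["role"] in ("user", "assistant")]
    old, recent = convo[:-keep_recent], convo[-keep_recent:]
    summary = {"role": "user", "content": summarise(old)}
    # Reassemble: full system prompt + summary + recent messages verbatim.
    return {"system": system_prompt, "messages": [summary, *recent]}

fake_summarise = lambda msgs: f"[Summary of {len(msgs)} earlier messages]"
history = [{"role": r, "content": f"msg {i}"}
           for i, r in enumerate(["user", "assistant"] * 5)]
result = compact(history, "SYSTEM PROMPT v1", fake_summarise)

assert result["system"] == "SYSTEM PROMPT v1"   # never summarised
assert len(result["messages"]) == 5             # 1 summary + 4 recent
```

The instructions cannot degrade because they never enter the summariser's input; only the conversation does.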
This pattern was not originally designed to prevent drift. It was designed to optimise prompt caching.
LLM providers cache input token sequences based on prefix matching. Anthropic's prompt caching, for example, processes request components in a fixed order — tools, then system message, then message history — and caches based on the longest matching prefix. Changes to components higher in the hierarchy invalidate all downstream caches. If two consecutive API calls share the same prefix, the cached portion is served at a fraction of the cost — typically 90% cheaper, with cache entries refreshing on each hit rather than expiring at a fixed time.
An agent that makes 15 tool calls in a step sends the system prompt 15 times. Without caching, you pay full price 15 times. With prefix caching, you pay full price once and 10% for the remaining 14 calls.
But caching only works if the prefix is stable. If the system prompt changes between calls — or if compaction rearranges the early tokens — the cache is invalidated and you pay full price again.
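The arithmetic from the 15-call example, as a sketch (prices are illustrative, using the ~90%-cheaper cache-read figure quoted above; real provider pricing also charges a premium on cache writes, which this ignores):

```python
def step_cost(calls: int, prompt_tokens: int, price_per_token: float,
              cache_hit_discount: float = 0.9) -> tuple[float, float]:
    """Compare input cost for a step with and without prefix caching.

    Without caching, every call pays full price for the prompt. With a
    stable prefix, the first call pays full price and the remaining
    calls pay the discounted cache-read rate.
    """
    uncached = calls * prompt_tokens * price_per_token
    cached = (prompt_tokens * price_per_token                  # first call: cache miss
              + (calls - 1) * prompt_tokens * price_per_token
              * (1 - cache_hit_discount))                      # later calls: cache hits
    return uncached, cached

uncached, cached = step_cost(calls=15, prompt_tokens=1000, price_per_token=0.001)
assert uncached == 15.0
assert abs(cached - 2.4) < 1e-9   # full price once + 10% for the other 14
```

A single unstable token in the prefix pushes every call back to the uncached column.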
The structural separation guarantees prefix stability: the prompt stack is assembled from the same versioned sources in the same order on every call, so the early tokens are byte-identical from one call to the next and the cached prefix remains valid.
The cost impact is significant. In production, we observe 83–95% cache hit rates on system prompt tokens. For a step with 15 tool calls and a large system prompt (roughly 1.6 million system prompt tokens sent in total across the step's calls), the difference between 0% and 95% cache hits is roughly 10× in cost.
The drift prevention is a structural side effect of getting the caching right.
The pattern works because it inverts the usual assumption. Most frameworks treat the system prompt as "the first thing the model sees" — early tokens in a long sequence. This pattern treats it as "something the infrastructure provides" — managed outside the conversation, injected fresh every time.
Of the major frameworks, Google's ADK comes closest to this thinking. Their context engineering architecture explicitly separates "Session" (the full, structured state) from "Working Context" (the compiled view sent to the model), with a pipeline of processors that transform one into the other. OpenAI's Codex takes a different approach: their compaction endpoint returns an encrypted, opaque item that preserves the model's latent understanding — a server-side solution that avoids the summarisation fidelity problem but is provider-specific and not inspectable. Block's Goose has a notably sophisticated context management system — customisable compaction prompts, background tool-call summarisation, audience filtering — and assembles system prompts from model-specific templates via a prompt manager that is separate from conversation messages. Of the open-source agents, Goose comes closest to having the right separation in place, though we have not been able to confirm from the available source and documentation whether it explicitly strips system-level content before compaction and re-injects it after.
The practical requirements are:
A client that manages the system prompt separately from the conversation. This does not require a backend server, an orchestration framework, or any sophisticated infrastructure. Any client that calls an LLM API — a desktop app, a CLI tool, a simple script — can do it. The system prompt is held aside. When it is time to compact, the client strips it from the conversation, compacts the conversation, and prepends the system prompt again. That is the entire pattern. It is not clear why this is not standard practice.
A compressor that respects the boundary. The compaction mechanism must strip infrastructure content (system prompts, cache markers, injected documents) before summarising, and re-inject after. This is a simple filter, but it must be explicit.
Stable, ordered prompt assembly. The prompt stack should be constructed from versioned sources in a deterministic order. If the order changes between calls, caching breaks. If the content changes, both caching and behavioural consistency break.
Context documents as part of the prompt, not the conversation. If an agent needs reference material, pre-inject it into the system prompt rather than having the agent fetch it via tool calls. Tool call results go into the conversation history and are subject to compaction. System prompt content is preserved indefinitely.
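Put together, the four requirements fit in a few lines of client code. A sketch under stated assumptions: `call_model`, `summarise`, and `token_count` are placeholders for a real API call, a real summarisation call, and a real tokeniser, and the simulated versions below exist only so the loop can run offline:

```python
def run_turn(system_prompt, history, user_msg,
             call_model, summarise, token_count,
             budget=100_000, keep_recent=4):
    """One conversational turn for a client that holds the system prompt aside.

    `history` contains only user/assistant messages. If it outgrows the
    budget it is compacted in place; the system prompt never enters the
    history and is supplied fresh on every model call.
    """
    history.append({"role": "user", "content": user_msg})
    if token_count(history) > budget:
        old, recent = history[:-keep_recent], history[-keep_recent:]
        history[:] = [{"role": "user", "content": summarise(old)}, *recent]
    reply = call_model(system=system_prompt, messages=history)
    history.append({"role": "assistant", "content": reply})
    return reply

# Simulated model, summariser, and tokeniser (offline stand-ins).
fake_model = lambda system, messages: f"reply {len(messages)}"
fake_summarise = lambda old: f"[summary of {len(old)} messages]"
rough_tokens = lambda h: sum(len(m["content"]) for m in h)

history = []
for i in range(8):
    run_turn("SYSTEM PROMPT", history, f"question {i}",
             fake_model, fake_summarise, rough_tokens, budget=120)
```

After any number of compactions, `history` still contains no system-role message: the prompt is infrastructure, passed as the `system` argument on every call.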
The prediction that prompt-stack preservation prevents specification drift is empirically testable. The experiment is straightforward: same task, same model, same agent — one condition with the task specification in the conversation history (subject to compaction), the other with the specification pinned in the prompt stack (immune to compaction). Run both through enough turns to trigger multiple compaction cycles and measure specification compliance over time. We intend to run this experiment and publish the results as part of a broader research programme on structural trust properties in multi-agent systems.
This pattern preserves the system prompt perfectly. It does not preserve conversation content — the compactor still summarises the conversation, and information is still lost. Important decisions, findings, and intermediate results should be persisted to external storage (a document store, database, or file system) rather than relying on conversation memory.
The pattern also requires the system prompt to fit within the context window alongside the compacted conversation and recent messages. If the system prompt itself is very large (millions of tokens of context documents), there may be little room left for conversation. This is a context window budgeting problem, not a flaw in the pattern.
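The budgeting itself is simple arithmetic. A sketch with illustrative numbers (the component names follow the reassembly layout above; none of the figures are from the source):

```python
def conversation_budget(window: int, system_tokens: int,
                        summary_tokens: int, recent_tokens: int,
                        output_reserve: int) -> int:
    """Tokens left for conversation growth before the next compaction.

    The system prompt, the compacted summary, the verbatim recent
    messages, and the reserved output space all come off the top of
    the context window.
    """
    return window - system_tokens - summary_tokens - recent_tokens - output_reserve

# The same 200K window leaves far less room for conversation when the
# prompt stack carries heavy context documents.
assert conversation_budget(200_000, 120_000, 5_000, 10_000, 8_000) == 57_000
assert conversation_budget(200_000, 20_000, 5_000, 10_000, 8_000) == 157_000
```

If the result approaches zero, the prompt stack has to shrink; the pattern cannot conjure window space, only stop it from being wasted.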
Context compaction degrades system prompts because it treats them as conversation content. The fix is architectural: separate the prompt from the conversation, strip it before compaction, re-inject it after. This preserves behavioural consistency across arbitrarily long sessions and, as a side effect, enables efficient prompt caching.
The pattern is simple. The implementation is straightforward. The surprising thing is that it is not standard practice.
This design note describes patterns implemented in the Perseverance Composition Engine (PCE), a multi-agent document composition framework developed by the Leith Document Company.