§ release note

Context compaction for long-running agents. Manual, auto, and the agent asking for it itself.

2026-06-12 · by Dennis Gubsky · ~8 min read

Yesterday's v0.26 → v0.29 interactive terminal made it possible to drive a loomcycle agent for hours from the Web UI. Close the tab, come back, the run is still alive. That changes how you use the runtime. It also surfaces a new problem: a multi-hour conversation eventually crowds the model's context window. By the time the gauge reads ctx 178k / 200k (89%), every additional turn pays the prompt-caching cost of re-shipping a near-full history. Once you actually hit the wall, the next turn just fails.

v0.32.0 (today) ships the answer: a context-compaction subsystem with three coordinated triggers around one shared summarizer. The operator can compact manually via a one-click button. The runtime can compact automatically when the prompt footprint crosses a configurable threshold. And the agent can compact itself, having first looked at its own context usage via an augmented introspection tool. Per-agent settings, spawn-tree inheritance, durable + replayable, one summarizer shared across all three paths.

The three triggers

All three call the same internal function (loop.Summarize: one-shot, no-tools, target-percentage-parameterized). All three replace the middle of the conversation with a summary, keeping a pinned head and a recent tail. All three respect the same per-agent settings. They differ only in who decides and when the compaction happens.

Manual: the Compact button (#460)

A neutral "Compact" pill sits in the run terminal header, next to the (now redesigned, dark-red) "✕ Stop" button. One click on a long interactive session and the runtime summarizes everything that has happened so far; the session then continues from the summary. The operator sees a ↯ context compacted (N→M) marker in the transcript where the cut happened. The token footprint drops by 70-90% in a single step.

The endpoint is POST /v1/runs/{run_id}/compact (scope runs:create). It's gated to a safe boundary: a live interactive run must be parked at awaiting_input; calling mid-turn returns 409. (More on why in the "engineering moments" section below.)

Auto - at a configurable threshold

Set autocompact_at_pct: 80 on an agent and the runtime watches the per-iteration context footprint. At the top of each iteration, when the previous turn's prompt size crosses 80% of the model's window, the loop summarizes inline and replaces the conversation before the next provider call. Self-debounces: the +1-iteration guard prevents an immediately-still-large request from triggering again. Skips when the window is unknown (Ollama doesn't report a window, so the gauge can't compute a percentage - better to skip than to fire on a fabricated number).

Off by default. The operator opts in per agent, with a value chosen for the workload - a long-running autonomous agent might use 70; a chatty interactive session might use 85.
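The trigger check is small enough to sketch. The version below is one reading of the rules above, not the runtime's code: the type and function names are hypothetical, the inclusive comparison is an assumption, and the debounce is modeled as "skip the iteration right after a compaction".

```go
package main

import "fmt"

// autoCompactState is a hypothetical holder for the loop's debounce bookkeeping.
type autoCompactState struct {
	lastCompactIter int // iteration at which the last compaction ran; -1 if never
}

// shouldAutoCompact reports whether the loop should summarize at the top of
// iteration iter. prevPromptTokens is the previous turn's prompt footprint,
// maxTokens is the model window (0 means unknown), and thresholdPct is
// autocompact_at_pct (nil means the agent has not opted in).
func shouldAutoCompact(s *autoCompactState, iter, prevPromptTokens, maxTokens int, thresholdPct *int) bool {
	if thresholdPct == nil {
		return false // off by default
	}
	if maxTokens <= 0 {
		return false // unknown window: skip rather than fire on a made-up percentage
	}
	if s.lastCompactIter >= 0 && iter <= s.lastCompactIter+1 {
		return false // +1-iteration guard: do not re-trigger right after compacting
	}
	usedPct := float64(prevPromptTokens) * 100 / float64(maxTokens)
	return usedPct >= float64(*thresholdPct) // inclusive comparison is an assumption
}

func main() {
	threshold := 80
	s := &autoCompactState{lastCompactIter: -1}

	fmt.Println(shouldAutoCompact(s, 5, 162453, 200000, &threshold)) // true: 81.2% of the window
	s.lastCompactIter = 5
	fmt.Println(shouldAutoCompact(s, 6, 170000, 200000, &threshold)) // false: debounced
	fmt.Println(shouldAutoCompact(s, 7, 170000, 200000, &threshold)) // true: guard has expired
	fmt.Println(shouldAutoCompact(s, 7, 170000, 0, &threshold))      // false: window unknown
}
```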

Self - `Context op=compact`

The agent itself can ask. A new operation on the existing Context tool sets the loop's compact-request flag; at the next safe boundary the loop summarizes and replaces. This is the trigger for autonomous long runs, where no operator is watching the gauge. An agent may reach the end of a long task with its context still mostly empty, or it may plow through and fill the window. A well-prompted agent knows when to ask.
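The "set a flag now, act at the boundary" split can be sketched with an atomic boolean. The type and method names here are hypothetical stand-ins, not the runtime's own; the point is only that the tool call records intent and the loop consumes it exactly once.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// loopState is a hypothetical stand-in for the runtime's per-run loop state.
type loopState struct {
	compactRequested atomic.Bool
}

// RequestCompact is what a Context op=compact handler might call mid-turn.
// It only records intent; the conversation is not touched here.
func (l *loopState) RequestCompact() {
	l.compactRequested.Store(true)
}

// takeCompactRequest is meant to run at the top of an iteration (a safe
// boundary). It clears the flag and reports whether a request was pending.
func (l *loopState) takeCompactRequest() bool {
	return l.compactRequested.Swap(false)
}

func main() {
	var l loopState
	fmt.Println(l.takeCompactRequest()) // false: nothing requested yet
	l.RequestCompact()
	fmt.Println(l.takeCompactRequest()) // true: consumed at the boundary
	fmt.Println(l.takeCompactRequest()) // false: flag already cleared
}
```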

The mechanism that makes this conscious rather than guessed is the augmented introspection tool.

The agent knows how full it is. Context op=self reports usage.

Before today's release, an agent had no signal about how much of its context window was used. The Web UI showed it (the context-size gauge added yesterday in v0.29.0). The model itself had no idea - it just kept seeing more turns until something broke.

Context op=self now returns a context object alongside the existing identity bundle and the resolved compaction settings:

{
  "agent": "long-run-analyst",
  "run_id": "r_…",
  "provider": "anthropic-oauth-dev",
  "model": "claude-sonnet-4-6",
  "sampling": { "temperature": 0.4 },
  "compaction": {
    "enabled": true,
    "target_percentage": 20,
    "keep_last_n": 8,
    "keep_first": true,
    "autocompact_at_pct": 80,
    "model": "claude-haiku-4-5"
  },
  "context": {
    "used_tokens": 162453,
    "max_tokens":  200000,
    "used_pct":    81.2
  }
}

So an agent's system prompt can include something like:

"Periodically call Context op=self. If context.used_pct ≥ compaction.autocompact_at_pct, immediately call Context op=compact before continuing the current task. The summarized history will keep your original instructions and your last N turns; you can continue from there."

No new external observability is needed. The agent reads its own state through the same introspection seam it uses to read its name and provider. Compaction becomes a decision the agent participates in, not just a thing that happens to it.
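The agent applies that rule in natural language, but it helps to see it as code, for example in a harness that checks an agent's behavior. The sketch below parses a trimmed copy of the payload above and applies the same comparison. The struct and function names are hypothetical, and treating a missing context object as "do nothing" is an assumption that follows from the fields being omitted before the first turn.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// selfReport models only the fields of the Context op=self payload that the
// decision needs. Pointers distinguish "omitted" from zero values.
type selfReport struct {
	Compaction *struct {
		Enabled          *bool `json:"enabled"`
		AutocompactAtPct *int  `json:"autocompact_at_pct"`
	} `json:"compaction"`
	Context *struct {
		UsedTokens int     `json:"used_tokens"`
		MaxTokens  int     `json:"max_tokens"`
		UsedPct    float64 `json:"used_pct"`
	} `json:"context"`
}

// wantsCompact applies the prompt's rule: compact when
// context.used_pct >= compaction.autocompact_at_pct.
func wantsCompact(raw []byte) (bool, error) {
	var r selfReport
	if err := json.Unmarshal(raw, &r); err != nil {
		return false, err
	}
	if r.Context == nil || r.Compaction == nil || r.Compaction.AutocompactAtPct == nil {
		return false, nil // no usage yet, unknown window, or no threshold set
	}
	if r.Compaction.Enabled != nil && !*r.Compaction.Enabled {
		return false, nil // compaction explicitly switched off
	}
	return r.Context.UsedPct >= float64(*r.Compaction.AutocompactAtPct), nil
}

func main() {
	payload := []byte(`{
	  "agent": "long-run-analyst",
	  "compaction": {"enabled": true, "autocompact_at_pct": 80},
	  "context": {"used_tokens": 162453, "max_tokens": 200000, "used_pct": 81.2}
	}`)
	ok, err := wantsCompact(payload)
	fmt.Println(ok, err) // true <nil>
}
```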

The footprint is honest: used_tokens is computed as input + cache_read + cache_creation tokens (the same true-prompt-footprint formula the v0.29.0 gauge uses), not just input_tokens (which undercounts under prompt caching). max_tokens comes from the provider's Capabilities().MaxContextTokens. Both are omitted before the first turn completes, and when the window is unknown.
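As arithmetic, the gauge is a sum and a division. The sketch below reproduces the 162,453 / 200,000 = 81.2% figure from the sample payload; the struct names are hypothetical, the split of the total across the three counters is invented for the example, and rounding to one decimal place is an assumption based on how the sample is printed.

```go
package main

import (
	"fmt"
	"math"
)

// usage holds the three prompt-side token counters of one provider response.
type usage struct {
	InputTokens         int
	CacheReadTokens     int
	CacheCreationTokens int
}

// contextGauge mirrors the "context" object returned by Context op=self.
type contextGauge struct {
	UsedTokens int
	MaxTokens  int
	UsedPct    float64
}

// footprint computes the true prompt footprint. The second result is false
// when the gauge should be omitted: no completed turn yet (modeled here as a
// zero footprint) or an unknown window.
func footprint(u usage, maxTokens int) (contextGauge, bool) {
	used := u.InputTokens + u.CacheReadTokens + u.CacheCreationTokens
	if used == 0 || maxTokens <= 0 {
		return contextGauge{}, false
	}
	pct := math.Round(float64(used)/float64(maxTokens)*1000) / 10 // one decimal place
	return contextGauge{UsedTokens: used, MaxTokens: maxTokens, UsedPct: pct}, true
}

func main() {
	// Hypothetical split: most of the prompt is served from the cache.
	u := usage{InputTokens: 2453, CacheReadTokens: 150000, CacheCreationTokens: 10000}
	g, ok := footprint(u, 200000)
	fmt.Println(g.UsedTokens, g.MaxTokens, g.UsedPct, ok) // 162453 200000 81.2 true

	// Counting input_tokens alone would report about 1.2% for the same turn.
	fmt.Printf("%.1f\n", float64(u.InputTokens)*100/200000)
}
```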

Per-agent settings

A new compaction block on each AgentDef, with all-pointer fields so absence means "unset, inherit" (mirroring the v0.28.0 sampling block):

Field	Range / type	What it does
`enabled`	bool	Master switch. Off by default.
`target_percentage`	10..50	Roughly what fraction of context the summary should consume - the summarizer is asked to produce output sized for the target.
`keep_last_n`	int	Preserve the last N messages verbatim. The "recent tail" that the agent needs intact to continue.
`keep_first`	bool	Pin the original task message verbatim. Almost always yes - the task is the thread.
`autocompact_at_pct`	50..95	Auto-compaction trigger threshold. Above this %, the loop summarizes at the next safe boundary.
`model`	string	Override model for the summary call. A cheaper same-provider model (e.g. claude-haiku-4-5) gets the summary done at a fraction of the cost of the conversation's working model.

Per-agent settings round-trip through every AgentDef mirror - mergedDef + applyOverlay + lookup.SubstrateAgentDef + ToConfigDef - and through the content hash (a fork that changes a compaction field mints a new content_sha256; the omitempty + normalize-collapse ensures a no-compaction agent hashes byte-identical to pre-feature rows). Plus a per-run override on POST /v1/runs with the same field shape. MergeCompaction is per-field - set what you want to change, leave the rest unset.
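The all-pointer shape is what makes a per-field merge possible: nil means "unset, inherit", anything else wins. Here is a minimal sketch of that idea. The struct mirrors the table above, but the Go field names, the function name mergeCompaction, and its signature are assumptions rather than the runtime's actual MergeCompaction.

```go
package main

import "fmt"

// Compaction mirrors the per-agent block; every field is a pointer so that
// nil can mean "unset, inherit".
type Compaction struct {
	Enabled          *bool
	TargetPercentage *int
	KeepLastN        *int
	KeepFirst        *bool
	AutocompactAtPct *int
	Model            *string
}

// pick returns the override when it is set, otherwise the base value.
func pick[T any](override, base *T) *T {
	if override != nil {
		return override
	}
	return base
}

// mergeCompaction overlays override onto base one field at a time.
func mergeCompaction(base, override Compaction) Compaction {
	return Compaction{
		Enabled:          pick(override.Enabled, base.Enabled),
		TargetPercentage: pick(override.TargetPercentage, base.TargetPercentage),
		KeepLastN:        pick(override.KeepLastN, base.KeepLastN),
		KeepFirst:        pick(override.KeepFirst, base.KeepFirst),
		AutocompactAtPct: pick(override.AutocompactAtPct, base.AutocompactAtPct),
		Model:            pick(override.Model, base.Model),
	}
}

// ptr is a small helper for building pointer fields from literals.
func ptr[T any](v T) *T { return &v }

func main() {
	agentDef := Compaction{Enabled: ptr(true), TargetPercentage: ptr(20), KeepLastN: ptr(8)}
	perRun := Compaction{AutocompactAtPct: ptr(70)} // the run changes one field only

	m := mergeCompaction(agentDef, perRun)
	fmt.Println(*m.Enabled, *m.TargetPercentage, *m.KeepLastN, *m.AutocompactAtPct, m.Model == nil)
	// true 20 8 70 true
}
```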

Configurable via the Web UI today through the advanced JSON/YAML overlay (the v0.29.0 form addition); dedicated form controls are a follow-up.

Spawn inheritance

Compaction is the first agent-setting that flows DOWN the spawn tree. Memory scopes, sampling, evaluation scopes, channels - all of those are each agent's own. A parent agent's sampling temperature doesn't reach the child agent it spawns. But a parent that knows it needs aggressive compaction almost certainly wants its children compacted too - otherwise a fan-out generates N sub-conversations each blowing their own context windows, and the parent's careful budget is meaningless.

The precedence chain on a sub-agent's effective policy:

Per-spawn override on the Agent.spawn or Agent.parallel_spawn call (a new compaction object in the input). Highest priority - the parent can target one child specifically.
Parent's effective policy - what flowed down from the parent's own resolved settings. Middle priority.
Child def's own settings - fills any gaps the parent left unset. Lowest priority.

The blend is per-field, so a parent that sets only autocompact_at_pct leaves the child's own keep_last_n and model intact. tools.WithCompactionPolicy stamps the resolved policy onto ctx at every top-level dispatch (RunOnce, handleRuns, handleMessages, the resume path); runSubAgent re-stamps on the child's subCtx, so grandchildren inherit recursively. The chain works for arbitrary depth.
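The three-level chain can be sketched as "first non-nil wins, per field". The struct below is a trimmed, hypothetical policy with three fields, and resolveChild is an illustrative name; the runtime's own resolution goes through tools.WithCompactionPolicy and runSubAgent as described above.

```go
package main

import "fmt"

// policy is a trimmed, hypothetical compaction policy; nil means unset.
type policy struct {
	AutocompactAtPct *int
	KeepLastN        *int
	Model            *string
}

// first returns the first non-nil pointer, scanning from highest priority.
func first[T any](vals ...*T) *T {
	for _, v := range vals {
		if v != nil {
			return v
		}
	}
	return nil
}

// resolveChild applies the precedence chain field by field:
// per-spawn override, then the parent's effective policy, then the child def.
func resolveChild(spawnOverride, parentEffective, childDef policy) policy {
	return policy{
		AutocompactAtPct: first(spawnOverride.AutocompactAtPct, parentEffective.AutocompactAtPct, childDef.AutocompactAtPct),
		KeepLastN:        first(spawnOverride.KeepLastN, parentEffective.KeepLastN, childDef.KeepLastN),
		Model:            first(spawnOverride.Model, parentEffective.Model, childDef.Model),
	}
}

func ptr[T any](v T) *T { return &v }

func main() {
	parent := policy{AutocompactAtPct: ptr(70)} // parent sets the threshold only
	childDef := policy{AutocompactAtPct: ptr(90), KeepLastN: ptr(12), Model: ptr("claude-haiku-4-5")}

	child := resolveChild(policy{}, parent, childDef)
	fmt.Println(*child.AutocompactAtPct, *child.KeepLastN, *child.Model) // 70 12 claude-haiku-4-5

	// Grandchildren resolve against the child's effective policy, so the
	// parent's threshold keeps flowing down.
	grandchild := resolveChild(policy{KeepLastN: ptr(4)}, child, policy{})
	fmt.Println(*grandchild.AutocompactAtPct, *grandchild.KeepLastN) // 70 4
}
```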

That's the right shape because compaction is a resource-management decision, not an identity decision. Identity flows per-agent (memory scopes, sampling); resource management flows with the call stack.

Three engineering moments

Clean-boundary discipline (again)

Compaction is the third subsystem in two months to insist on snapping its action to a clean iteration boundary, never mid-tool-cycle:

Cooperative pause (v0.28.0 F41/RFC X) - parks in-flight runs at iteration boundaries.
Mid-run steering (v0.26.0) - drains queued operator input at the top of each iteration, never between a tool_use assistant turn and its tool_results user turn (an orphaned tool_use makes the provider reject the request with a 400).
Compaction (today) - replaces the conversation only at safe boundaries. The manual endpoint requires awaiting_input; auto-compaction triggers at top-of-iteration; self-compact sets a flag that fires at the next boundary. A new helper, loop.CompactionSplit, snaps the cut point to a clean user-turn boundary so a tool_use/tool_results pair is never split across the summarized middle.

Same lesson, third time: the LLM message graph has structural constraints that the runtime has to respect. The provider's protocol is part of the runtime's contract, not just the model's preference. When you write a feature that mutates the conversation, you write it as "wait for the boundary, then act." Three subsystems built on that discipline already; future ones will keep using it.
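To make the snapping concrete, here is a sketch of a split function in the spirit of loop.CompactionSplit. It is not that helper: the message type, the name splitIndex, and the choice to move the cut earlier (keeping a longer tail) are all assumptions. The one property it illustrates is that the verbatim tail always starts on a plain user turn, so no tool_use is separated from its tool_results.

```go
package main

import "fmt"

// message is a simplified transcript entry.
type message struct {
	Role       string // "user" or "assistant"
	ToolUse    bool   // assistant turn that requests a tool call
	ToolResult bool   // user turn that carries tool results
}

// splitIndex returns the index where the verbatim tail starts. It begins at
// len(msgs)-keepLastN and walks backwards until the tail opens on a user
// turn that is not a tool_results turn. Moving earlier keeps more messages
// verbatim, which is the conservative direction.
func splitIndex(msgs []message, keepLastN int) int {
	if keepLastN <= 0 {
		return len(msgs) // no tail requested
	}
	cut := len(msgs) - keepLastN
	if cut < 0 {
		cut = 0
	}
	for cut > 0 && !(msgs[cut].Role == "user" && !msgs[cut].ToolResult) {
		cut--
	}
	return cut
}

func main() {
	msgs := []message{
		{Role: "user"},                      // 0: original task
		{Role: "assistant"},                 // 1
		{Role: "user"},                      // 2
		{Role: "assistant", ToolUse: true},  // 3: tool_use
		{Role: "user", ToolResult: true},    // 4: tool_results for 3
		{Role: "assistant"},                 // 5
		{Role: "user"},                      // 6
		{Role: "assistant"},                 // 7
	}
	// A naive cut at index 4 would keep the tool_results but summarize away
	// the tool_use that produced them; the snap moves it back to index 2.
	fmt.Println(splitIndex(msgs, 4)) // 2
	fmt.Println(splitIndex(msgs, 2)) // 6
}
```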

One summarizer, three triggers

An earlier draft of the feature (commit history shows #460 shipping manual compaction first, then #461 the auto + self surface) had a separate server.summarizeConversation for the manual path and a different loop path for auto. The #461 PR consolidates to a single loop.Summarize - one-shot, no-tools, target-percentage-parameterized - called by all three triggers. Identical behavior across paths, half the test surface, and any future improvement (better summary prompt, different model selection logic, a smarter keep-last-N policy) lands once and reaches everywhere.

Worth noticing: the manual handler used to do its own one-shot provider call. Now it composes the shared summarizer with the same resolveCompactionSettings chain the loop uses. Two paths converged on one implementation; that's the right direction.

Keep-last-N + keep-first, not drop-everything

The naive shape would be: replace the whole conversation with a summary, lose the last few turns of detail. That ruins recency: the agent forgets what it just said. A better shape is what the runtime now ships:

compacted = [
  pinned_task,        // keep_first: the original instruction, verbatim
  summary,            // the new turn pair: a user message stating "what was summarized" + the assistant's compact summary
  ...last_N_messages  // keep_last_n: the recent tail, verbatim
]

The pinned task survives so the agent never forgets what it's doing. The recent tail survives so the agent never loses what it just said and did. The middle - the part most likely to be redundant after a long conversation - gets compressed. The summary itself is a user+assistant pair (starts with user, ends with assistant) so the next operator turn alternates cleanly with the existing convention.

And the durability: a persisted EventContextCompaction marker (with trigger, keep_n, keep_first, before, after) means replayTranscript rebuilds the same compacted form on crash-recovery, resume, or continuation. The full transcript is retained (non-destructive audit), but every reconstruction sees the compacted shape. OTEL gets a context.compaction span event for the same trigger/before/after triple.

What this unlocks

The substrate now manages context-window pressure as a first-class concern, with the agent able to participate in the decision rather than just bumping into the wall.

Long interactive sessions stop dying at the 200k-token wall. Autonomous long-running agents stop hitting "context length exceeded" mid-task. Fan-out orchestrators (exp5 agent ensembles) can let their sub-agents inherit the parent's compaction discipline. Self-evolving meta-agents (exp6.5) can survive across generations without each generation's run filling its window.

And - the bit that took the most thinking to get right - the agent participates. Context op=self shows the gauge; the prompt instructs the agent to check it; Context op=compact lets the agent act. Compaction stops being something the runtime does to the agent and becomes something the agent does with the runtime.

Latest release is v0.32.0 (today). The compaction block goes in the agent's yaml or AgentDef overlay; the Compact button is in the run terminal header next to "✕ Stop"; the help topic is at Context op=help topic=compaction.

Companion reading from the operator-via-MCP series: the interactive terminal (the feature this naturally pairs with - the multi-hour conversations compaction was built to support), self-evolving agents + snapshot/resume (the long-autonomous-run shape that benefits most from self-compact and spawn inheritance), agent ensembles (where spawn inheritance becomes operationally important - a parent that needs compaction wants its fan-out children compacted too).
