§ field report

133 minutes on a local Qwen, after four fixes.

2026-06-16 · by Dennis Gubsky · ~9 min read

Cloud LLMs are wonderful when you have a credit card and a clean API. They're a different proposition when the workload is "let an agent run unattended for two hours on hardware I already own." That's the local-model use case. It's also where I'd done the least real stress testing on loomcycle.

So two days ago I sat down to fix that. The plan was simple. Take a code-reviewer agent, point it at Qwen3.6:27b via ollama-local, give it a real task, and watch it run long enough to trigger multiple context auto-compactions. If the whole pipeline works, the runtime is honest about local inference. If it doesn't, I find out where the cloud-shaped defaults break.

It didn't work. I found four real bugs in a row, plus a fifth that surfaced after I'd fixed the first four. Six minor releases shipped between then and now: v0.34.3, v0.34.4, v0.34.5, v0.35.0, v0.36.0, v0.37.0. The final run is the 133 minutes in the title.

Here's the saga.

Bug 1: the compaction gauge lied for one turn (v0.34.3)

The first thing you watch on a long run is the context-size gauge in the Web UI terminal. It reads used / max (pct), so on a slow local model where every iteration costs minutes you watch it climb and decide when to compact.

Compaction worked. The gauge kept showing the pre-compaction size for one full turn after a successful compaction. So you'd see 164k / 82%, click Compact, the actual outbound request would shrink to ~1500 tokens on the wire, but the gauge would still say 164k / 82% until the next turn finished and updated the cached footprint.

This wasn't a correctness bug. The compaction did its job. But on a slow model you'd hit Compact twice in a row because the gauge didn't acknowledge the first one, and the second compaction would be a no-op you couldn't tell apart from a stuck UI.

Root cause: lastCtxTokens (the value Context op=self reports as used_tokens, and what the auto-compact threshold reads) was only ever refreshed from a completed provider turn's usage. So the loop knew about its own compaction but kept reporting the stale pre-compaction number until the next turn finished.

Fix (PR #486, v0.34.3): refresh lastCtxTokens at every compaction site (the parked-run path, the running-run path, and the inline auto/self path), and stamp the footprint below the compaction block so a same-turn op=self is correct too. One fail-before regression test (TestRun_Interactive_ContextUsageRefreshedAfterCompaction).

Bug 2: the Ollama context window was a lie in both directions (v0.34.4 → v0.34.5)

Onto the gauge for ollama-local. It showed the absolute used size and nothing else. No percentage. No max. Just a token count climbing.

Two reasons. First, Capabilities().MaxContextTokens was hard-coded as 0 for the Ollama driver. The driver already tracked WithNumCtx (the operator-pinned LOOMCYCLE_OLLAMA_LOCAL_NUM_CTX), so I just had to report it. Done.

Except LOOMCYCLE_OLLAMA_LOCAL_NUM_CTX ALSO went out on every request as options.num_ctx, which caps the actual window. So the environment variable was doing two things at once: capping what Ollama could load AND reporting what we said it was.

Then the real edge. qwen3.6:27b is trained for 256K and Ollama loads it at 128K via OLLAMA_CONTEXT_LENGTH. If you set LOOMCYCLE_OLLAMA_LOCAL_NUM_CTX=32768 you force/report 32K, wasting the model's actual capacity. If you unset it, Ollama loads at the real 128K, but loomcycle reports "unknown" because the driver has no way to ask.

Fix (PR #495, v0.34.5): the Ollama driver now reads the model's actual loaded context from GET /api/ps once the model is in VRAM (Ollama publishes this only after load, as context_length). Precedence: an explicit operator-pinned num_ctx still wins (exact, no probe); otherwise the live /api/ps value; otherwise "unknown." Cached per-model with a 5-minute TTL, 2s probe timeout, best-effort. Feeds the gauge only, never correctness.

So now you can set OLLAMA_CONTEXT_LENGTH=131072 once on the Ollama side, unset the loomcycle override, and the gauge reads the real loaded window without any cross-system coupling.

Bug 3: cold local models blew past cloud-shaped timeouts (v0.34.4)

Default time-to-first-byte: 60s. Default idle: 90s. Fine for an Anthropic call. Death for a cold local model. Ollama disk-loads weights into VRAM (anywhere from 5s for a small model to ~60s for a 27B at q4), then evaluates the prompt on consumer hardware, then starts streaming.

If you don't get the first byte in 60s, the run dies before producing anything.

Fix (PR #488, v0.34.4): the ollama-local registration gets its own timeout pair, default 300s / 300s, configurable via LOOMCYCLE_OLLAMA_LOCAL_HEADER_TIMEOUT_MS and LOOMCYCLE_OLLAMA_LOCAL_IDLE_TIMEOUT_MS. Cloud Ollama keeps the cloud defaults. The difference is the deployment shape: ollama-local is the one that runs on your hardware; everything else is a cloud endpoint and should behave like one.

Bug 4: compaction "succeeded" but didn't fit (v0.37.0 / PR #503)

This was the deep one.

A code-reviewer run filled its context to ~84%, the auto-compact threshold tripped, the loop summarized the older turns, the marker went into the transcript. Everything looked clean. Next request went out: request body too large for context window. Run died.

Investigation: the compacted form is [pinned task + summary, ack] ++ last-N kept verbatim. The last-N is a clean user-turn boundary, picked so a tool_use / tool_result pair is never split. Sensible default.

But a code-reviewer reads files. Every tool result is a Read block returning 5-50KB of source. The last-N boundary captured 20 of those, and the kept-verbatim tail alone was 153.8k tokens. Still over the 131k window. Compaction had folded the older history into a 20.4k summary (good), but the tail was bigger than what fit.

So compaction "succeeded" while leaving the agent in a worse state than no-compaction. The next prefill blew the window. The run died.

Fix (PR #503): when the provider reports a window, advance the cut forward. Fold the OLDEST kept-verbatim turns into the summarized span (one at a time, snapping to fresh-user-turn boundaries) until the kept tail fits ~half the window (compactionKeptTailBudgetPct = 50). The summary captures the folded turns. No content is lost; it just moves from the verbatim tail into the summarized middle.

Three edges worth naming. A single irreducible over-budget turn (one giant tool_result) is KEPT, not dropped to empty. The agent keeps its most recent context. Budget is estimate-based (chars/4, which overcounts dense JSON and code) so it errs toward keeping LESS, the safe direction for a slow local model's prefill cost. And window<=0 (a provider that reports no window) means no cap, behavior unchanged.

Bug 5: the heartbeat died on slow turns (v0.37.0 / PR #502)

Companion to #503. Even with the tail-cap fix in place, a single iteration on a slow local model could block for ~10 minutes on the actual model call. The prefill itself is what's slow.

The stale-run sweeper noticed no heartbeat for more than 10 minutes and reaped the live run as failed / heartbeat_timeout. Twice in a row, while the run was visibly still working in the terminal.

Root cause: OnHeartbeat fired only at the START of each iteration. Any iteration that blocks longer than the sweeper's threshold (a long prefill, a long tool, retry backoff after a transient error) exceeds 10 minutes with no pulse. The sweeper's job is detecting a crashed process. A slow-but-alive call is not crashed.

Fix (PR #502): a run-lifetime heartbeat ticker. A goroutine pulses OnHeartbeat every 30 seconds for as long as the run goroutine is alive, IN ADDITION to the per-iteration pulse. The HTTP header/idle timeouts remain the authority on a genuinely stalled provider; the heartbeat just reflects "this run's goroutine is alive."

Cleanly stops on Run return (defer close) or ctx cancel. No goroutine leak. Safe to call concurrently with the per-iteration pulse: the callback is UpdateHeartbeat, a fire-and-forget DB write. One fail-before regression test (TestRun_HeartbeatPulsedDuringSlowCall) covers a one-iteration run whose Call blocks 120ms at a 15ms cadence: without the ticker, the heartbeat fires exactly once; with the ticker, several times.

The 133-minute run

After all four fixes landed (#486, #488, #495, #502, #503), I ran the code-reviewer on Qwen3.6:27b through ollama-local against a real codebase for 133 minutes straight.

Multiple auto-compactions fired correctly at the configured threshold.

The tail-cap kept the post-compaction request under the 131k window every time.

The heartbeat ticker kept the run alive through every long prefill.

The gauge reported honest used_pct after each compaction without a stale-by-one-turn delay.

The Ollama driver reported the real 131k loaded window via /api/ps.

No reaper. No failed prefill. No stale gauge.

The agent finished its task.

Why this matters

This wasn't a clean planned feature line. It was four real bugs that surfaced from running on hardware that's been mainstream for 18 months and that loomcycle had never properly stress-tested. The cloud-shaped defaults made the runtime fragile to the realities of local inference: assumed-fast model calls, assumed-known context windows, assumed-completed turns within minutes.

Each fix is small. A goroutine. An extra check. A 2s probe. A re-stamp. None of them touch the wire shape. None of them introduce a new primitive. But each was load-bearing for "run a 27B model on commodity hardware for two hours and don't die."

The other thing this exercise gave me is a real local-model config recipe. docs/CONFIGURATION.md §6b now documents the slow-local-model lessons: the two Ollama providers, the num_ctx knob, the header / idle timeouts, the heartbeat-during-call behaviour, compaction tuned for prefill cost (compact early, small keep_last_n), and the interactive knobs. There's also a new loomcycle.local-interactive.example.yaml in the repo: a focused, ready-to-run config for steering interactive agents on local models. Aliases-first (v0.35.0 made bare-string aliases work in tier candidates), local-first tiers with cloud fallback, and a coder agent wiring unbounded_iterations, early/small-tail compaction, and interruption.

If you're running loomcycle on Ollama and hit something else, file an issue. The in-product introspection (Context op=self reporting used_pct, the gauge in the terminal, the sandbox introspection added in v0.36.0 reporting jailed_dir and host_allowlist to the agent itself) is meant to make these rough patches visible from inside the runtime, not just in the operator's logs.

Apache-2.0. brew install denn-gubsky/loomcycle/loomcycle or pull the v0.37.0 Docker image. Release notes for all six versions in REVISIONS.md.

Companion reading: the context-compaction subsystem itself (the feature these bugs surfaced in), the interactive terminal (where the gauge lives), the 8-hour stability soak (the cloud-side equivalent of this kind of long-running stress test, on Anthropic).