loomcycle
§ release note

The MCP server wedged the IDE on a list — head-of-line blocking, and why killing the process was the only release.

Yesterday's post set up the experiment frame: we drove a fresh Claude Code session as the operator, over MCP, with no internal shortcuts. Today's incident happened mid-series. The operator's IDE (Claude Code, talking to loomcycle via the plugin's stdio MCP server) hung on a list_runs tool confirmation. The user approved the call. Nothing happened. The only way out was to kill the loomcycle mcp process from another terminal.

A "list" is a cheap read. It should return in milliseconds. So why did it hang for tens of minutes? Reading the source of the v0.22.0 MCP server (internal/api/mcp/server.go) turned up a single load-bearing comment that explained everything:

// server.go, lines 102-115 (paraphrased)
// Frames are dispatched SEQUENTIALLY on a single goroutine.
// Concurrent tools/call (long-running spawn_run vs short list_runs
// on the same connection) is a v0.9.x optimisation — not implemented.

Not implemented. A comment is a contract; this one said "we acknowledge the failure mode and will fix it later." Later was today.

The bug — sequential dispatch + unbounded handler

F17 · v0.22.0 · stdio MCP

One spawn_run handler blocks the entire MCP transport. Every subsequent frame, including a cheap list_runs and even a cancel_run, sits unread in the stdin pipe behind it.

Serve is a plain for scanner.Scan() { handleFrame(...) } — no go, no worker pool. Subsequent frames stay buffered in the OS pipe until the in-flight handler returns. ctx.Done() is checked only between frames, never mid-handler. So nothing preempts an in-flight call short of process death.
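
To make the loop shape concrete, here is a minimal sketch of sequential dispatch. It is not the loomcycle source: handleFrame, serveSequential and listRunsDelayed are hypothetical names, frames are reduced to bare tool names, and spawn_run is simulated with a sleep.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
	"time"
)

// handleFrame is a hypothetical stand-in for the real handler:
// "spawn_run" blocks for slow, every other frame returns at once.
func handleFrame(frame string, slow time.Duration) {
	if frame == "spawn_run" {
		time.Sleep(slow)
	}
}

// serveSequential mirrors the loop shape described above: one goroutine,
// one frame at a time. It records, per frame, how long after the start
// of the loop its handler began.
func serveSequential(in io.Reader, slow time.Duration) map[string]time.Duration {
	started := make(map[string]time.Duration)
	begin := time.Now()
	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		frame := scanner.Text()
		started[frame] = time.Since(begin)
		handleFrame(frame, slow) // the next frame is not read until this returns
	}
	return started
}

// listRunsDelayed reports whether list_runs had to wait for the whole
// spawn_run that was queued ahead of it.
func listRunsDelayed() bool {
	slow := 100 * time.Millisecond
	in := strings.NewReader("spawn_run\nlist_runs\n")
	started := serveSequential(in, slow)
	return started["list_runs"] >= slow
}

func main() {
	fmt.Println("list_runs waited behind spawn_run:", listRunsDelayed())
}
```

In this sketch list_runs does no work of its own, yet it cannot start until the simulated spawn_run has finished. Scale the 100 ms sleep up to an hour-long run and you have the incident.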

The handlers that can block:

| Meta-tool | Blocks on | Bounded? | Worst case |
|---|---|---|---|
| spawn_run | whole run (RunOnce) | NO per-call timeout | run duration; ~1h if interruption-held; unbounded if provider-stalled |
| subscribe_channel | long-poll | yes (ChannelsLongPollCapMS) | the cap (seconds) |
| peek_channel | long-poll | yes (same cap) | the cap |
| stream_user_run_states | event wait | yes (timeout_ms, default 30s) | 30s |

spawn_run is the dangerous one — the only blocking handler with no upper bound. Everything else self-releases within seconds.

Three amplifiers turned "slow" into "wedged for an hour":

  1. F15 (cross-runtime interruption wake, from yesterday's post). A run started via the plugin MCP, with an Interruption hold, can only be woken by a resolve on the same in-process bus. A resolve from another runtime writes the DB row but never wakes the loop — so the spawn_run handler blocks to the 1-hour interruption timeout. The whole stdio transport is wedged that entire hour.
  2. Provider stall. A provider outage (this incident coincided with an Opus 4.7/4.8 incident on Anthropic's side) plus retries plus stream-idle handling can keep spawn_run alive for tens of minutes.
  3. F16 (resource pressure). Held SSE clients, accumulated background processes, heartbeating interruption runs piling up. Not the root cause, but additive.

Why the list hung (the symptom that prompted the analysis)

list_runs is fast. It returns in milliseconds on its own. It hung because it was queued behind an already-in-flight spawn_run on the single dispatch goroutine. The list was the victim, not the cause. The operator was confirming the next read, the read was waiting for the loop to free, and the loop wasn't going to free until the prior occupier returned (or the process died). Classic head-of-line blocking on a single-consumer stream.

Why killing the process was the only release

Because ctx cancellation is polled only between frames, cancelling the parent context doesn't preempt an in-flight handler. There is no client-visible cancel path that helps either — a cancel_run frame would itself be HOL-blocked behind the occupier. SIGTERM on the loomcycle mcp PID ends the process; stdin closes; the wedged call dies with it; the IDE sees the MCP transport drop and reconnects.
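
A small sketch of why cancellation does not help, assuming a loop that polls the context only between frames. The names serveFrames and cancelMidHandler are hypothetical, and the handlers deliberately take no context, which is the property that makes them impossible to preempt.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// serveFrames checks ctx only between frames. The handlers receive no
// context, so nothing can interrupt one that is already running.
// It returns how many handlers ran to completion.
func serveFrames(ctx context.Context, handlers []func()) int {
	done := 0
	for _, h := range handlers {
		select {
		case <-ctx.Done():
			return done
		default:
		}
		h()
		done++
	}
	return done
}

// cancelMidHandler cancels the context 10ms into a 100ms handler and
// reports how many handlers ran and how long serveFrames took to return.
func cancelMidHandler() (int, time.Duration) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	slow := func() { time.Sleep(100 * time.Millisecond) }
	fast := func() {}

	time.AfterFunc(10*time.Millisecond, cancel)
	begin := time.Now()
	n := serveFrames(ctx, []func(){slow, fast})
	return n, time.Since(begin)
}

func main() {
	n, elapsed := cancelMidHandler()
	fmt.Printf("handlers completed: %d, returned after >= 100ms: %v\n",
		n, elapsed >= 100*time.Millisecond)
}
```

The cancellation lands at 10 ms but is only observed once the slow handler has returned on its own. With a handler that never returns, it is never observed at all, which leaves process death as the only exit.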

The release worked. The fix worked. Both are real. The runtime wasn't supposed to need either.

The three-leg fix in v0.23.0 — RFC O + P + R

Three coordinated PRs in v0.23.0 close the HOL class, the unbounded-handler class, and the cross-process-coordination class that amplified them. None of the three would have been sufficient alone.

RFC O — concurrent stdio dispatch (#377)

The headline. Serve now dispatches each tools/call on its own goroutine, bounded by a small semaphore. The writeMu for stdout framing was already present (only used by the notification path); now it serializes every response. initialize / initialized stay sequential — protocol ordering. Independent tool calls run concurrently.
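
The shape of that change can be sketched as follows. This is an illustration under assumptions, not the code from #377: the server type, its fields other than writeMu, the semaphore size of 4, and the plain-text frames are all hypothetical, and the sequential handling of initialize / initialized is left out.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
	"sync"
	"time"
)

// server is a hypothetical, stripped-down stand-in for the MCP server.
type server struct {
	writeMu sync.Mutex    // serializes every response written to out
	out     io.Writer     // stands in for stdout
	sem     chan struct{} // bounds the number of in-flight handlers
}

// handle simulates a tool call; only "spawn_run" is slow.
func (s *server) handle(frame string, slow time.Duration) {
	if frame == "spawn_run" {
		time.Sleep(slow)
	}
	s.writeMu.Lock()
	defer s.writeMu.Unlock()
	fmt.Fprintln(s.out, frame)
}

// serveConcurrent reads frames in order but runs each handler on its
// own goroutine, so a slow call no longer holds up the reader.
func (s *server) serveConcurrent(in io.Reader, slow time.Duration) {
	var wg sync.WaitGroup
	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		frame := scanner.Text()
		s.sem <- struct{}{} // blocks only when every slot is taken
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-s.sem }()
			s.handle(frame, slow)
		}()
	}
	wg.Wait()
}

// responseOrder sends spawn_run followed by list_runs and returns the
// order in which the responses were written.
func responseOrder() []string {
	var buf strings.Builder
	s := &server{out: &buf, sem: make(chan struct{}, 4)}
	s.serveConcurrent(strings.NewReader("spawn_run\nlist_runs\n"), 200*time.Millisecond)
	return strings.Fields(buf.String())
}

func main() {
	fmt.Println(responseOrder())
}
```

In this sketch the list_runs response is written while spawn_run is still sleeping, so responses can now arrive out of request order. That is why a single write mutex around response framing matters once handlers run concurrently.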

What this removes: the entire HOL class. A long spawn_run can no longer block a parallel list_runs or cancel_run. The "we'll fix it later" comment is gone from server.go — replaced by the actual concurrent dispatch.

RFC P — bounded spawn_run timeout (#380)

Defense in depth. spawn_run now wraps its run in a transport-level context.WithTimeout, configurable per operator. The default is generous (longer than any realistic LLM-driven run), but finite. A provider stall or unkillable run no longer keeps the handler alive forever — the timeout fires, the handler returns an error, the transport is free for the next frame.
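
A minimal sketch of the pattern, assuming the run honors context cancellation. The function names and durations are hypothetical; only context.WithTimeout comes from the description above, and runOnce here is a simulated stand-in rather than the real RunOnce.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runOnce is a hypothetical stand-in for the run: it blocks until the
// simulated work finishes or ctx is done, whichever comes first.
func runOnce(ctx context.Context, work time.Duration) error {
	timer := time.NewTimer(work)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// spawnRun wraps the run in a transport-level timeout so the handler
// always returns, even when the run itself would never finish.
func spawnRun(parent context.Context, work, limit time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, limit)
	defer cancel()
	if err := runOnce(ctx, work); err != nil {
		return fmt.Errorf("spawn_run: %w", err)
	}
	return nil
}

// stalledRunTimesOut reports whether a run that would take an hour is
// cut off by a 50ms limit with a deadline error.
func stalledRunTimesOut() bool {
	err := spawnRun(context.Background(), time.Hour, 50*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

// fastRunSucceeds reports whether a run well inside the limit returns
// without error.
func fastRunSucceeds() bool {
	return spawnRun(context.Background(), time.Millisecond, time.Second) == nil
}

func main() {
	fmt.Println("stalled run timed out:", stalledRunTimesOut())
	fmt.Println("fast run succeeded:", fastRunSucceeds())
}
```

The caveat in this sketch is the usual one for context timeouts: the deadline only frees the handler if everything the run blocks on also watches ctx.Done().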

Composed with RFC O, this means a hostile or buggy spawn_run can't keep its own goroutine alive forever either. The bound is hard.

RFC R — thin-client MCP topology (#381 + breaking change #383)

The structural fix to the F15 amplifier. loomcycle mcp --upstream <runtime-url> launches a stdio MCP server that proxies to a single full runtime over its existing HTTP / gRPC surface. The thin client owns no providers, no scheduler, no in-process bus — it's a transport adapter.

Every tool call lands on the one runtime that owns the in-process bus. The cross-process bus-Notify problem dissolves: interruption_resolve from a thin client, the IDE's MCP, or a CLI all hit the same in-process bus the blocking ask waits on. F15 stops being reachable in normal use.

The breaking change in the same release: loomcycle mcp --no-http is removed. The pattern that needed it — running two full runtimes side-by-side with HTTP suppressed on one of them — is the pattern that caused F15. Documenting the footgun is less honest than removing it.

Why all three landed together

Concurrent dispatch alone (RFC O) would have prevented the HOL, but a stalled spawn_run could still tie up a goroutine slot indefinitely under load. A timeout alone (RFC P) would have bounded the wait, but every operator would still hit "list hung for 5 minutes" before the timeout fired. The thin-client topology alone (RFC R) would have removed the cross-process amplifier, but a single long spawn_run on a single runtime would still HOL the transport.

The three together remove the failure class at three different levels. No single PR is structurally responsible. That's the right shape — failure-mode-elimination, not bug-fixing.

One more leg, days later — durable cross-runtime wake

The thin-client topology dissolves F15 for everyone who adopts it. But the original code path — bus.Notify failing across runtimes — is still wrong, and operators running genuine multi-host HA topologies (two full runtimes for failover) need a real fix, not a topology workaround.

#400 ships the durable cross-runtime wake: when interruption_resolve writes the DB row, it also publishes to the _system/interrupts/resolved channel; the run's owning runtime subscribes and wakes its blocked ask on the channel notification. The thin-client pattern stays the recommended default; multi-runtime topologies now work too.
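
The difference between "write the row" and "write the row and publish" can be modeled in a few lines. This is a toy, in-memory model under assumptions, not the #400 implementation: sharedStore, its methods, the interruption id "int-1" and the 100 ms wait are all hypothetical, and the subscriber list stands in for the _system/interrupts/resolved channel.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sharedStore stands in for the database plus a durable channel that
// every runtime can reach.
type sharedStore struct {
	mu       sync.Mutex
	resolved map[string]bool
	subs     []chan string // subscribers to resolved-interruption events
}

// subscribe registers a new listener, as the owning runtime would.
func (s *sharedStore) subscribe() <-chan string {
	ch := make(chan string, 16)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.subs = append(s.subs, ch)
	return ch
}

// resolve writes the row and, when publish is true, also notifies
// subscribers. publish=false models a resolve that only touches the DB.
func (s *sharedStore) resolve(id string, publish bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.resolved[id] = true
	if !publish {
		return
	}
	for _, ch := range s.subs {
		select {
		case ch <- id:
		default: // sketch only: drop the event if the subscriber is full
		}
	}
}

// waitForResolve models the blocked ask on the owning runtime.
func waitForResolve(events <-chan string, id string, limit time.Duration) bool {
	timer := time.NewTimer(limit)
	defer timer.Stop()
	for {
		select {
		case got := <-events:
			if got == id {
				return true
			}
		case <-timer.C:
			return false
		}
	}
}

// wakes reports whether a resolve issued from elsewhere wakes the waiter.
func wakes(publish bool) bool {
	store := &sharedStore{resolved: make(map[string]bool)}
	events := store.subscribe()
	go store.resolve("int-1", publish)
	return waitForResolve(events, "int-1", 100*time.Millisecond)
}

func main() {
	fmt.Println("DB write only wakes the waiter:", wakes(false))
	fmt.Println("DB write plus publish wakes the waiter:", wakes(true))
}
```

In the first case the row is written but the waiter sleeps until its own timeout, which is the F15 shape. In the second the published event reaches the waiter regardless of which caller issued the resolve.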

The engineering lesson worth keeping

"We'll fix that in vN.x" comments are technical debt that compounds. The original sequential-dispatch comment in server.go was honest — it told you exactly which optimization wasn't there. It was also load-bearing in a way the author didn't realize: every operator who built tooling on top of the loomcycle MCP wrote code that assumed tool calls didn't HOL-block each other, because that's how every other transport behaves. The runtime's documentation said the limit; the operators' code didn't read documentation. The first real workload that touched the limit looked like a complete system failure.

The discipline going forward: a "to be implemented" comment on a load-bearing concurrency property is a P1 bug, not a roadmap item. Concurrent dispatch should have shipped in v0.8.x when the comment was written; instead it shipped in v0.23.0 after a real operator's IDE hung on a list. That's the cost of the wait.

And on the topology side: when a feature has multiple failure modes that all stem from "we support running multiple coordinated full runtimes," the right answer is sometimes "let's stop supporting that topology." RFC R's thin-client mode isn't a workaround — it's a topological invariant that makes a whole class of cross-process bugs unreachable. Less surface, fewer ways to be wrong, simpler operator mental model.

Next post: the multi-agent refine loop, 0.92 → 0.98 in 5 hops — and the silent default-deny that almost made it look like the agents weren't talking. exp3 surfaces a different class of footgun: capability gates that fail closed but never announce themselves at boot.