§ release note

We drove a fresh Claude Code session as the operator. The first experiment found a DeepSeek bug we'd shipped without noticing.

2026-06-05 · by Dennis Gubsky · ~8 min read

To pressure-test loomcycle from the outside, we set up an isolated sandbox, installed the brew binary, and drove it through a fresh Claude Code session as the operator - talking to loomcycle over MCP, with no internal shortcuts and no API access we wouldn't expose to a real third party. Every experiment was designed in advance by us; every step was executed by Claude through the same tool surface (`spawn_run`, `list_agents`, `agentdef`, `interruption_resolve`, …) a community operator would see.

The point of running the experiments through Claude rather than a hand-written script is that the agent surfaces gaps a script wouldn't. A script does what the author wrote. An agent does what the system tells it is possible - and an honest agent surfaces the moments where the system's docs, error messages, and capability gates don't line up. We get an evaluation of the operator-facing wire surface at the same time we get the experiment's result.

The series is four experiments, each escalating one notch in surface area:

exp1 - tool access. Can a coding agent actually use the built-in tools (Read / Write / Bash / WebSearch) the operator enabled? (This post.)
exp2 - interruption. Can an agent ask a Yes/No question, hold its run while the human decides, and branch on the answer? (This post.)
exp3 - multi-agent loop. Three agents iterating a shared artifact over Channels + Memory + Evaluation, with a third aggregating. (Tomorrow's post.)
exp4 - real dev workflow. Real Gitea webhooks, third-party MCP, Telegram - closed-loop coder → PR → reviewer → merge. (Day after.)

Exp1 + exp2 ran on v0.22.0. Fixes shipped in v0.23.0. Headline finding: F10, a DeepSeek adapter bug. Would have stayed invisible without a real agent exercising the loop on a non-Anthropic provider.

Experiment 1 - can the agent actually use the built-in tools?

The simplest possible test: a generic code-guru agent ( allowed tools [Read, Write, Edit, Bash, Grep, Glob, WebSearch, WebFetch]; model deepseek-v4-pro) is asked to write a Python program generating the first 100 prime numbers, save it as exp1/generate_primes.py, run it, verify the output, and report PASS/FAIL.

We picked DeepSeek deliberately - Anthropic-OAuth is the "easy" path we already exercise during internal development, and provider-adapter bugs hide there. Real operators run mixed fleets.

Run 1 - failed at `mkdir`

F10 · v0.22.0 · openai-compat adapterThe first Bash call (mkdir -p exp1) returned empty stdout. The DeepSeek API returned 400: messages[3] missing field content. The loop died.

Tracing it: the agent's first tool_use was a Bash{mkdir -p exp1}. mkdir -p on success prints nothing - empty stdout. Loomcycle's openai-compat path (used for DeepSeek's API) was serializing the tool-result message with no content field when the content was empty, on the assumption that the OpenAI message schema treats content as optional. DeepSeek's implementation does not - it requires content on every message, even if it's the empty string. So the loop's second LLM call returned 400 mid-conversation, and the agent never got to write the file.

The agent had working tool access - the first Bash call dispatched, executed, returned. The provider adapter aborted the next round-trip. From the operator's perspective this looks like "the tools don't work"; from the runtime's perspective the tools worked perfectly.

The fix (#379, RFC Q in v0.23.0): the openai-compat serializer now always emits content on tool messages, defaulting to the empty string when the actual content is empty. The regression test makes the empty-content path explicit: a Bash that produces no stdout no longer trips DeepSeek. This unblocks every tool-using agent on DeepSeek - every other provider tolerated the missing field, so the bug had been invisible against Anthropic / OpenAI / Gemini.

Run 2 - PASS, with independent verification

Workaround on v0.22.0 was simple: make every shell command print something. The agent then completed the loop cleanly - wrote generate_primes.py (trial division to isqrt(n), even-skip), ran it, read the CSV back, reported PASS. Independent verification by recomputing primes in a separate process: header index,prime, exactly 100 data rows, indices 1..100, values match the first 100 primes (0 mismatches), first/last = 2 / 541. Total tokens: 38,317 in / 2,884 out.

Built-in tool access is real. Two real bugs surfaced along the way:

F10 (above): DeepSeek empty-content - fixed v0.23.0.
F13: the Write tool resolved relative paths against process CWD, not LOOMCYCLE_WRITE_ROOT. With the server launched from the project root, Write exp1/… resolved to ./exp1/ at project root (parent missing), while Bash's mkdir used BASH_CWD=./work - the paths split. Operationally fixed by launching loomcycle from the jail directory (cd ./work && loomcycle …) so all the tool roots align.

Plus one already-known gap that bit:

F11: agentdef create with privileged tools refused with "caller's effective allowed_tools not on ctx", even with LOOMCYCLE_MCP_ALLOW_PRIVILEGED_TOOLS=1 set. The flag did not populate the operator ceiling the create checks against. We fell back to static-yaml authoring for exp1 - and filed a fix that landed later (#393 in v0.23.2). More on this in the post about exp3.

Experiment 2 - interruption (human-in-the-loop) gating

The second experiment: a yesno-agent asks one Yes/No question ("Approve the deployment?" - options Yes/No), blocks until answered, and outputs RESULT: PROCEED on Yes or RESULT: ABORT on No. Three checks:

Pass 1 - Claude (the operator) self-answers the question. The agent should wake and branch.
Pass 2 - Claude marshals the question to a human via the IDE UI; the human answers; the agent should wake on the human's answer.
Timeout hold - confirm the agent is genuinely suspended (not crashed) while waiting.

Both passes worked end-to-end. The agent held cleanly, the answer arrived, the branch was correct. The interesting moment was which resolve endpoint we used.

F15 - cross-runtime wake fails silently

F15 · v0.22.0 · in-process busPass 1, first attempt: we resolved the interrupt via the Claude Code plugin's MCP server. The DB row flipped to resolved. The run never woke. After 85 seconds the run was still running - the only events on its SSE stream were : keepalive.

Root cause: Interruption.ask blocks on channels.Bus.Wait("intr:<id>") - an in-process bus. The resolve handler calls bus.Notify on whichever process received the resolve. The plugin MCP server and the HTTP server (:8788) on v0.22.0 were two separate processes sharing the SQLite store but each with its own in-process bus. The plugin's bus.Notify woke no one; the HTTP server's blocking ask never received the signal.

Resolving via the HTTP server's own MCP endpoint (POST /v1/_mcp → interruption_resolve) woke the agent immediately and the branch was correct. So the feature works - the bug was that which process you resolved from mattered, and the only error indication was the run silently never waking until the 1-hour timeout.

What shipped in v0.23.0: instead of fixing the cross-process notify directly, the runtime got a different topology - the thin-client MCP mode (#381, RFC R). loomcycle mcp --upstream <runtime-url> launches a stdio MCP server that proxies to a single full runtime, instead of starting its own. Every tool call lands on the one process that owns the in-process bus. The cross-runtime topology that triggered F15 stops arising in normal use; the operator-facing pattern is "one full runtime, N thin-client MCP entry points."

The breaking change in the same release (#383): loomcycle mcp --no-http is removed. The pattern that needed it - running two full runtimes side-by-side, one with HTTP suppressed - is the pattern that caused F15. Removing the option is more honest than documenting the footgun.

A real fix for the original code path - bus.Notify via durable cross-runtime wake - shipped two days later (#400) for the rare topology that genuinely needs multiple full runtimes (HA across separate hosts). The thin-client topology is the supported answer for everything else.

The engineering lesson

Provider-adapter correctness is load-bearing in a way unit tests miss. The F10 bug had been live in internal/providers/openai/serialize.go for months. Unit tests against the OpenAI API (which tolerates the missing field) passed; integration against DeepSeek (which doesn't) had never been exercised on a Bash that returns empty stdout. The shape of the bug - "every other provider tolerated it" - is the hardest class to catch without a real agent driving the loop. Hand-written integration tests would have had to know to send an empty content deliberately; an agent doing mkdir -p sends it accidentally on the first call.

The discipline going forward: every provider in provider_priority must pass an end-to-end smoke test where the agent's first tool call returns empty stdout. The smoke test is now in the v0.23.x CI matrix. We don't ship a new openai-compat adapter without it.

And on F15 / F11 - the right answer to "this topology has subtle failure modes" is sometimes "stop supporting that topology." The thin-client pattern (RFC R) collapses what used to be a multi-process coordination problem into a single-process invariant. Less code, fewer ways to be wrong, and the failure mode that bit us in exp2 dissolves rather than being patched.

Next post: the MCP server wedged the IDE on a list - head-of-line blocking, and why killing the process was the only release. The same operator-driven framing, now finding a wedge bug that turned a long spawn_run into "your whole MCP transport is frozen for an hour."