§ hardening notes

3000 agents + 2000 memories + 2000 channels in one stress test.

2026-05-26 · by Dennis Gubsky · ~5 min read

Three thousand agent runs spawned. Two thousand memory entries written. Two thousand named channels subscribed. Seven thousand entities total, all tracked by loomcycle's v0.12.x substrate without a structural failure. The agents themselves were a different story - they spent most of the run starving for provider capacity, because both Anthropic and Ollama cap parallel calls at roughly 120 each, far below the ~1000 we needed at peak. The video below is exactly that: the substrate holding everything correctly while the agents wait in line for an LLM slot to free up.

Recording of the x1000 run. Watch the activity tab: substrate-side bookkeeping stays steady; agent-side throughput is paced by how fast provider slots come back.

This was the first of the load-testing exercises promised in yesterday's multi-replica HA post - the work between *"the architectural pieces are in"* and *"we'll cut a v1.0 tag."* This run was single-replica; cluster-mode load testing is a separate exercise queued up next. The point of today was to drive the substrate hard enough to discover the failure modes that only emerge under N-way contention, and fix them before they show up at a customer site. Five real substrate bugs surfaced. All shipped before the test session ended.

The circuit

Each "circuit" is a three-agent pipeline that touches every cross-cutting substrate primitive in sequence:

Researcher answers a question, writes the answer to Memory.set scope=user, publishes research-done/c{N}.
Editor blocks on Channel.subscribe research-done/c{N}, wakes when the researcher commits, reads the memory entry back, tightens the prose, writes it back, publishes editing-done/c{N}.
Evaluator blocks on Channel.subscribe editing-done/c{N}, wakes, reads both memory versions, scores 0.0-1.0 with a one-sentence rationale, writes the scored version, and calls Evaluation.submit against the editor's run_id.

All three agents are POSTed at T+0. The editor and evaluator block on the channel; the channel signal is the only coordination mechanism. This is deliberate - the test is for the Channel tool, not for orchestration logic. Multiple circuits per user_id exercise the per-user fairness path; the per-circuit key namespacing keeps memory isolated even when ten circuits share a user.
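The coordination shape can be sketched in a few lines of Go. This is a minimal in-process illustration, not loomcycle code: the `Memory` type, the `runCircuit` function, and the key names are hypothetical, plain Go channels stand in for the named research-done/c{N} and editing-done/c{N} channels, and string trimming stands in for the LLM calls. What it does preserve is the ordering guarantee: all three goroutines start together, and the channel signal is the only thing sequencing them.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Memory is a hypothetical stand-in for the Memory tool: a mutex-guarded map.
type Memory struct {
	mu   sync.Mutex
	data map[string]string
}

func (m *Memory) Set(key, value string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = value
}

func (m *Memory) Get(key string) string {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.data[key]
}

// runCircuit starts researcher, editor, and evaluator at the same time.
// The two signal channels are the only coordination between them.
func runCircuit(n int, question string) (draft, edited string) {
	mem := &Memory{data: map[string]string{}}
	prefix := fmt.Sprintf("c%d/", n) // per-circuit key namespace
	researchDone := make(chan struct{})
	editingDone := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(3)

	go func() { // evaluator: blocks until the editor publishes
		defer wg.Done()
		<-editingDone
		draft = mem.Get(prefix + "draft")
		edited = mem.Get(prefix + "edited")
	}()

	go func() { // editor: blocks until the researcher publishes
		defer wg.Done()
		<-researchDone
		tightened := strings.TrimSpace(mem.Get(prefix + "draft"))
		mem.Set(prefix+"edited", tightened)
		close(editingDone)
	}()

	go func() { // researcher: writes first, then publishes
		defer wg.Done()
		mem.Set(prefix+"draft", "  answer to: "+question+"  ")
		close(researchDone)
	}()

	wg.Wait()
	return draft, edited
}

func main() {
	draft, edited := runCircuit(1, "what held?")
	fmt.Printf("draft=%q edited=%q\n", draft, edited)
}
```

Starting the evaluator first in the sketch is deliberate: it shows that launch order does not matter when the only sequencing comes from the signals.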

What held - substrate-side

Through x300 circuits, every pipeline completed cleanly - 300 of 300 successful, 100% pipeline completion, mean score 0.96 across all evaluations. At x300 that is 900 agent invocations processed without a single structural failure, all while running at the natural provider-imposed concurrency ceiling (~100 in-flight runs).

Scale	Pipelines completed	p50 latency	p95 latency
x10	10/10	22 s	29 s
x50	50/50	22 s	29 s
x100	100/100	23 s	34 s
x300	300/300	20 s	58 s

The interesting line on that table is the flatness of p50 across a 30x increase in scale. Median latency barely budges from x10 → x300 (22 s to 20 s); what grows is p95 (29 s to 58 s), and the growth is mostly queue wait, not execution. The substrate doesn't degrade with contention - it queues, fairly, and the queue drains at the rate the upstream provider allows.

At x1000 (100 users × 10 circuits per user), the substrate spawned and tracked every one of the 7000 entities cleanly - 3000 agent records, 2000 memory entries with their version chains, 2000 named channels with their subscribers and message buffers. No dropped writes, no orphaned subscribers, no lost evaluations. The bookkeeping held. What changed at x1000 was the agent-side throughput, not the substrate-side correctness.

What didn't hold - agent-side

With 1000 circuits running and three agents per circuit, peak demand reached around 1000 simultaneous LLM calls in flight at the model layer. Combined upstream capacity was nowhere near that:

Anthropic (OAuth-dev MAX subscription): ~120 parallel calls before the per-account rate limit kicks in
Ollama (cloud + local): ~120 parallel calls before the same backend starts 429-ing
Combined via fallback resolver: roughly 240 parallel - still ¼ of demand

So at x1000 the agents starved. Not for substrate resources - the run-admit semaphore worked correctly, the per-user fairness counter distributed evenly, the channel bus signalled cleanly. The agents starved for provider slots, and that's exactly what the video captures: dozens of agents at any given moment sitting in RUNNING state inside loomcycle, waiting for the upstream HTTP call to come back from a provider that's holding them in its own queue.

This is the right kind of starvation to see. It means the substrate is no longer the bottleneck for any realistic deployment - provider capacity is. For multi-tenant production deployments the fix is on the operator side (paid-API-key tier, or aggregator routing across multiple providers / regions), not the substrate side. The thing v1.0 is meant to claim - *"loomcycle can run as many agents as your provider quota allows"* - is what x1000 demonstrated.

The five substrate bugs the contention exposed

Getting to "the substrate held the 7000 entities cleanly" wasn't free - it required catching and fixing five real bugs that only surface when contention crosses a threshold. All shipped before the test session ended. The two most interesting:

The Channel.subscribe long-poll race (PR #232). At x100, around half of all subscribers were doing extra work - a publish that committed between a subscriber's initial read and its waker registration was being lost. Notify fired against an empty waiter slice; the subscriber blocked until the long-poll timeout expired, retried, and found the message on the second attempt. Functionally correct, but the latency was wrong by a long-poll interval. The fix is a new Bus.Register / Bus.Unregister API that registers the waker before the initial read - structurally closes the check-then-wait race rather than patching around it. This is the cleanest kind of concurrency bug: invisible at low scale, dominant at high scale, fixable structurally rather than via retry.

Treating 429 as 5xx (PR #235). The deepest bug of the day. The first time x500/x1000 was attempted, eighty circuits succeeded cleanly, the next forty hit Anthropic's 429 ("rate limited, try again"), and then the resolver matrix flagged the model as stalled for the next 15-minute probe interval. Every subsequent admit returned 503 in under a millisecond. A single rate-limit storm took down the entire pipeline for the rest of the run. Root cause: MarkStalled didn't distinguish between *"upstream is down"* (5xx, should stall) and *"upstream is asking us to slow down"* (429, should retry with backoff). The fix is a new typed predicate, providers.IsRateLimit(err): the loop now skips MarkStalled at both 429-classification call sites, while 5xx still stalls because that's still the right call.

The other three were smaller: a residual subscribe-re-read race covered with bounded retry (#234), a prompt-phrasing bug where strict-serialization prompts caused agents to end-turn prematurely (#233), and an operator-workflow trap where the test runner cached a stale binary (#239) - costing roughly ten million tokens of false-positive debugging before someone noticed.

The pattern across all five bugs: each one was structurally invisible until the contention crossed a threshold. Channel.subscribe was correct at x10. MarkStalled on 429 was correct when nothing ever 429'd. Prompt phrasing was correct when there was nothing to wait for. The classic distributed-systems shape - every bug a race that doesn't fire until the load is high enough that the window matters.

Why two providers in fallback didn't fix the starvation

The naïve assumption going into the test was that adding Ollama as a fallback provider would absorb the overflow when Anthropic 429'd. That's not what happened. Ollama's own concurrency ceiling turned out to sit at roughly the same ~120 parallel calls - the cloud backend started rate-limiting at almost the identical threshold. So the resolver matrix would fall through from Anthropic to Ollama and find another full queue, not a relief valve.

The structural lesson: multi-provider fallback only helps when the providers have uncorrelated capacity ceilings. Two providers with similar parallel-call limits on a subscription tier give you two queues that fill up together: combined capacity goes from ~120 to ~240, which is still about a quarter of a ~1000-call peak. To cleanly prove the substrate at full x1000 demand, the next session needs a paid-API-key tier on at least one provider - paid Anthropic, DeepSeek, or one of the aggregators - where the parallel-call cap scales with what you pay rather than what the subscription tier allows.

What's still ahead before v1.0

Four open hardening items, ranked by what would teach us most:

x1000 cleanly, with multi-provider fallback so the substrate hits its own ceiling rather than the provider's.
Cluster-mode load test: the same circuit harness, but against a two-replica deployment from docker-compose.cluster.yaml. Today's exercise was single-replica throughout; the multi-replica primitives (Postgres LISTEN/NOTIFY backplane, advisory-locked singletons, cross-replica cancel) have unit-test coverage but no contention exposure yet.
Soak test: x100 sustained for four hours. RSS, goroutine count, DB connection-pool stats. Looking for slow drift - the failure modes that only emerge over hours, not minutes.
Failure injection: kill the upstream mid-call, restart loomcycle mid-pipeline, bounce Postgres. Each tests a specific recovery path that today's run didn't exercise.
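For the soak item, the measuring side can be sketched with the Go standard library alone. This is a hypothetical sampler, not part of the loomcycle harness: `Sample`, `takeSample`, `goroutineDrift`, and `heapDrift` are illustrative names. One caveat is stated up front: the Go runtime does not report RSS directly, so the sketch records `HeapAlloc` from `runtime.MemStats` as a rough proxy, and it leaves DB connection-pool stats out entirely.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Sample is one point-in-time reading of process health.
type Sample struct {
	At             time.Time
	Goroutines     int
	HeapAllocBytes uint64
}

// takeSample reads the goroutine count and live heap size right now.
func takeSample() Sample {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return Sample{
		At:             time.Now(),
		Goroutines:     runtime.NumGoroutine(),
		HeapAllocBytes: ms.HeapAlloc,
	}
}

// goroutineDrift is the change in goroutine count from first to last sample.
func goroutineDrift(samples []Sample) int {
	if len(samples) < 2 {
		return 0
	}
	return samples[len(samples)-1].Goroutines - samples[0].Goroutines
}

// heapDrift is the change in live heap bytes from first to last sample.
func heapDrift(samples []Sample) int64 {
	if len(samples) < 2 {
		return 0
	}
	return int64(samples[len(samples)-1].HeapAllocBytes) - int64(samples[0].HeapAllocBytes)
}

func main() {
	// A real soak would sample every few seconds for hours; three quick
	// samples are enough to show the shape.
	var samples []Sample
	for i := 0; i < 3; i++ {
		samples = append(samples, takeSample())
		time.Sleep(10 * time.Millisecond)
	}
	fmt.Printf("goroutine drift: %d, heap drift: %d bytes\n",
		goroutineDrift(samples), heapDrift(samples))
}
```

Over a four-hour run the interesting signal is a drift that keeps its sign: a goroutine count that only ever climbs is the slow-leak shape the soak is looking for.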

None of these are blockers in the sense that the substrate has a known structural problem. They're blockers in the sense that not having run them means the v1.0 tag would be carrying claims it hasn't earned. The fun part of distributed systems is that the bugs only show up when the boring tests do. Today proved that for the Channel race and the MarkStalled classification; the rest of the list is the same shape of work, against the surfaces that didn't get exercised today.

Seven thousand entities tracked cleanly. Five real substrate bugs found and shipped. The bottleneck moved from "loomcycle internals" to "upstream provider concurrency limits" - which is the result an agentic runtime is supposed to produce. A clear list of what to attack next. Reasonable shape for one Tuesday.

Companion writeups: Multi-replica HA - the seven phases that get loomcycle close to v1.0 (yesterday - the architectural work that this load test exercises), and When the agent is in one container and its definition is in another (the substrate primitives the circuits compose against).