Reliable under stress, sustainable for hours: seven load experiments in two days.
Yesterday's writeup ended on a promissory note. The synthetic mock provider had lifted the cost ceiling. The launch-storm and cooldown-cascade bugs were fixed. Yesterday's substrate had absorbed 15,000 agent runs in one shot — and named the next-deepest bottleneck (pgxpool contention) without solving it. The closing line was the queue of follow-ups: cluster, Linux, sustained, find the real saturation knee.
Two days later — yesterday and today, on a 32-thread Linux box — we ran the seven independent experiments that turn that promise into measured data. Single-binary baseline. Cluster burst across r ∈ {1, 2, 3, 4}. Cancel + crash injectors. 15-minute sustained. 30-minute sustained with a live OpenTelemetry stack alongside. A capacity ramp to x5000. A saturation ramp to x10000. Roughly 280,000 circuits across the campaign; over 200,000 of them completed cleanly. This post is the synthesis: what loomcycle does under stress, what it does over hours, where it actually breaks, and what shape the break has.
The short answer is the title. The long answer is the table.
The seven experiments at a glance
| # | Experiment | Replicas | Scale / Duration | Headline result |
|---|---|---|---|---|
| 1 | Single-binary Linux baseline | 1 | x10,000 burst | 100 % / 10,000 · p99 = 4.0 s · peak active ~1,500 |
| 2 | Cluster burst sweep | {1, 2, 3, 4} | x5,000 per run | Sharp r=2 → r=3 phase transition · p99 collapses 15× (54 s → 3.6 s) |
| 3 | Cancel + crash injection | 3 | x5,000 + 127 cancels / 1 docker-kill | Cancel ack p99 = 130 ms · crash reaper at T+20 s · zero zombies, zero quota leak |
| 4 | 15-min sustained | 4 | x1000 → x2000 → x1000 · 46 waves | 56,000 / 168,000 runs · 100 % · 0.7 % load spread · no drift |
| 5 | 30-min sustained + live OTEL | 4 | same shape, 90 waves | 109,000 / 327,000 runs · 100 % · 0.5 % spread · ~180 runs/sec · OTEL adds ~8 % p50 |
| 6 | Capacity probe | 4 | x3000 → x4000 → x5000 ramp | Wave wall perfectly linear · slope = 16 s per +1000 circuits · p99 flat 3.6–4.2 s |
| 7 | Saturation probe | 4 | x6000 → x8000 → x10000 ramp | Knee found x6000 → x8000 · soft ceiling ~6,650 circuits/wave · no cliff |
Hardware was the same across all of them: a single 32-thread Intel Xeon E5-2697A v4, 62 GB RAM, Linux 6.17, Docker 29.5, Postgres 16.14. One box. Where the prior session used macOS (where /proc-based RSS and CPU sampling returns zero — the platform-blind gap from yesterday's post), this entire campaign ran on Linux with the substrate's process-resource sampler fully populated. Per-replica pool sized so the cluster total stayed under Postgres' max_connections with headroom for psql + sweeper + the LISTEN/NOTIFY backplane.
Experiment 1 — single-binary on Linux: the reference number
Same workload that ended yesterday's writeup with a 36-second p50 on macOS at launchSem=500. Re-running it on Linux with real /proc sampling: x10,000 circuits → 100 % completion · p99 = 4.0 s · peak active runs ≈ 1,500. The launchSem-500 number from yesterday (62 s p99 on a CPU-starved macOS host) shrinks to 4.0 s once the runtime gets the cores it expects. That's the reference number every cluster experiment after this is measured against.
The shape worth noting: peak active runs sat at ~1,500 — that's where the harness's lifted launch gate finally pushed in-flight admissions, and where the substrate's own admission ceiling (max_concurrent_runs: 2000) still never engaged. The drain rate kept up. The pgxpool stayed tight. No silent regressions. No leak signature.
Experiment 2 — the cluster's sharp phase transition
Same workload, this time pushed through N replicas behind nginx, sharing one Postgres — the full v0.12.x cluster substrate: LISTEN/NOTIFY backplane, replicas table, advisory-locked singletons, DB-backed quotas and hooks. Single hardware, different replica counts:
| Replicas | Scale | Completion | p50 | p99 | Cross-replica | Silent regr. |
|---|---|---|---|---|---|---|
| 1 | 5,000 | 40 % | 2.5 s | 56.7 s | n/a | 39 |
| 2 | 5,000 | 75 % | 2.6 s | 53.8 s | 75 % | 135 |
| 3 | 5,000 | 100 % | 2.5 s | 3.6 s | 89 % | 0 |
| 4 | 5,000 | 100 % | 2.5 s | 3.5 s | 94 % | 0 |
Three observations the table only half-tells. First, r=1 cluster mode is worse than the single-binary host — 40 % completion versus the baseline's 100 %, on the same hardware. The cluster substrate adds real per-replica overhead (LISTEN sessions, DB-backed quotas, hook bookkeeping) that consumes connection-pool budget the non-cluster binary keeps for application work. Cluster mode is not free; the choice to enable it is a deployment decision, not a free upgrade.
Second, the transition between r=2 and r=3 isn't smooth. It's a step function around the per-replica saturation knee — at r=2 each replica is at its own knee and the queue tail explodes; at r=3 the load splits enough that no replica is at its knee, and p99 drops back to a neighborhood of the single-host reference. This is the practical lesson for operators: "add a replica" doesn't help linearly; it helps in cliffs. Three replicas was the smallest number that crossed the knee for this workload on this hardware.
Third, cross-replica fan-out works at scale. At r ≥ 3, 89–94 % of circuits had agents that landed on different replicas — meaning the editor and evaluator subscribed to channels via Postgres LISTEN/NOTIFY from a different process than the publisher, and every one of those traversals succeeded. The cluster isn't just admitting requests across replicas; the substrate primitives compose correctly across the wall.
Experiment 3 — cancel and crash, named in the data
Two injectors on top of the r=3 cluster, each exercising a specific cluster-mode primitive:
The cancel coordinator. 127 of the 5,000 in-flight runs were cancelled from a different replica than the one running them. The cancel flow goes: HTTP DELETE /v1/runs/{id} at one replica → NOTIFY loomcycle.cancel on Postgres → owning replica receives the LISTEN, terminates the loop, fires NOTIFY loomcycle.cancel.ack back. Result: 127 / 127 cancels acked. p50 ack = 21 ms, p99 = 130 ms, max = 176 ms — roughly 38× headroom under the default 5-second LOOMCYCLE_CANCEL_ACK_TIMEOUT_MS. The Postgres backplane is fast enough that "wait for an ack from another replica" reads as a local operation.
The replica-died reaper. One replica was docker killed mid-run. The cluster left zero zombies, zero quota leak, and fired the dead-replica reaper as a single batched UPDATE around T+20 s. The literal in the data is stop_reason='replica_died' (from coord/replicas_sweeper.go) — distinct from 'owner_replica_dead', which is a different code path (coord/cancel_coordinator.go) that fires only when a cancel POST races a replica death. Two reaper paths, both observable, both named. The first report draft's "open question" about which path fires when is now resolved by the test data itself.
One operational gotcha worth surfacing: pre-fix runs lost 3–50 % of circuits to nginx 500s caused by the LB container's default nofile=1024. Fixed by adding explicit ulimits.nofile=65536 + nginx worker_rlimit_nofile=65536 + worker_connections=16384. The lesson generalises beyond loomcycle: under any high-concurrency workload, the LB's file-descriptor ceiling is operator-supplied, and a forgotten ulimit bottlenecks the rest of the system invisibly.
Experiment 4 — 15 minutes of nothing happening
r=4 cluster, three back-to-back phases over fifteen minutes: x1000 sustained for 5 min → x2000 sustained for 5 min → x1000 sustained for 5 min, with the orchestrator firing the circuit-stress driver in back-to-back waves until each phase elapsed. The point: does the cluster drift under continuous pressure?
56,000 circuits. 168,000 agent runs. 46 waves. 100 % completion. Zero silent regressions. Zero stuck rows. Zero quota leak. Load split across the four replicas: 42,112 / 41,954 / 42,117 / 41,817 — a 0.7 % spread.
The single cleanest "cluster keeping up" indicator is in that chart. If the cluster were saturating under the x2000 spike, phase 2 walls would diverge upward. They don't — they sit flat at twice phase 1, which is exactly what linear scaling looks like. If the cluster carried residual state from phase 2 into phase 3, phase-3 walls would be elevated. They aren't — they're statistically indistinguishable from phase 1. RSS oscillates with the GC sawtooth but the rolling-mean envelope is flat across the entire 15 minutes. Goroutines stay bounded. Throughput holds at ~178 agent runs per second sustained.
The chart looks dull. That's the point. "Nothing interesting happens" is, for a sustained-load curve, the ideal shape — and a non-trivial property to demonstrate, because every leak and every queue-collapse bug shows up as drift over time.
Experiment 5 — 30 minutes, with a live observability stack alongside
Same shape, double the duration, and this time the full Profile A OTEL stack (Grafana + Tempo + Prometheus + Loki + OTEL Collector) was brought up in the same compose, each replica exporting OTLP spans to a shared collector at sampler ratio 1.0 (every span exported). Two questions this answered that the prior run couldn't: does the cluster stay healthy at 2× duration, and does the OTEL pipeline itself hold up under the same load it's observing?
109,000 circuits. 327,000 agent runs. 90 waves. 100 % completion across all of them. Zero silent regressions. Throughput identical to the 15-minute run — ~180 runs/sec sustained for the full thirty minutes. Load split across r=1/2/3/4 = 81,818 / 81,990 / 81,650 / 81,542 — a 0.5 % spread, tighter than 15-min (more waves = more averaging).
agent_name shows the step-up from phase 1 → phase 2; goroutines + RSS + concurrency all elevate proportionally; tool call rate doubles. Captured live while the run was in flight; no panel turned red.The OTEL stack cost about +8 % p50 compared to the 15-min run without it — most plausibly the cost of serialising every span and shipping it over OTLP/HTTP at sampler ratio 1.0. The cluster absorbed it cleanly inside its capacity envelope: no degradation in completion, throughput, or load distribution. A production deployment would drop the sampler ratio to 0.1 or lower, eliminating that overhead while still capturing trace samples for slow-path investigation. The point of running at 1.0 here was to stress the export path, not to ship the production config.
Experiment 6 — perfectly linear up to x5000
Same r=4 + OTEL setup, but instead of holding scale flat, a five-minute weighted ramp: x3000 for 1 min → x4000 for 2 min → x5000 for 2 min. The question wasn't "does it complete" — sustained had already shown that. It was "how does wall time grow as per-wave scale grows."
6 / 6 waves at 100 % completion. 24,000 circuits / 72,000 agent runs in 388 s. The wall-time slope across x1000 → x5000 came out to 16 seconds per +1000 circuits, a perfectly straight line. p99 stayed within 3.6–4.2 s across the entire range — meaning the cluster absorbed 5× the per-wave load by spending more wall on each wave, not by slowing individual circuits.
That's the strongest "cluster keeping up" signature possible. If the cluster were close to its keep-up boundary, p99 would diverge upward at the largest scale. It doesn't. The substrate had substantial headroom we hadn't yet characterised — the 15-min and 30-min runs at x1000/x2000 were running at ~30 % of capacity, not 70–80 %.
Experiment 7 — finding the wall, finally
The natural next ramp: x6000 for 3 min → x8000 for 4 min → x10000 for 5 min. Twelve minutes, thirteen waves, and the keep-up boundary finally announced itself.
The picture is clear once you read it. Phase 1 (x6000): 100 % completion, p99 ~4.5 s, walls at 98–111 s — the linear regime still holds, just barely. Phase 2 (x8000): 83–84 % completion on the real waves (~6,650 circuits successfully drained per wave), p99 13 s, and a bimodal pattern emerges — every "real" wave at ~115 s is followed by a "collapse" wave at ~1–5 s that fires while the cluster is still draining the previous admission. Phase 3 (x10000): 66–67 % completion on real waves, same bimodal shape, and the real-wave wall does not scale further. It stays ~115 s regardless of whether the requested scale is x8000 or x10000.
The cluster has hit its drain-rate ceiling at ~115 s per wave. Above that, additional circuits per wave don't drain faster; they spill into the launch-collapse waves or fail as silent regressions (the same shape PR #246 fixed at the launch-storm tier, resurfacing here at higher scale). At x10000, 87 silent regressions in the worst wave — out of 10,000 attempted, ~6,650 still completed cleanly.
The shape of the ceiling matters as much as the ceiling itself. Saturation here is soft, not catastrophic — the cluster keeps absorbing a remarkably consistent ~6,650 circuits per wave above the knee, and the overflow degrades gracefully (drop a circuit, or admit fewer next wave) rather than failing the whole run. This is the difference between "the cluster has a capacity" and "the cluster falls over past a capacity." Operators can size load up to the knee with high confidence; past it, the cluster declines visibly and the data tells you you've crossed the line.
What the campaign establishes
Two claims, each backed by specific data points:
Reliable under stress. The single-binary baseline holds 10,000 circuits at p99=4.0 s on this hardware. The cluster's r=2 → r=3 phase transition pulls p99 from 54 s back down to 3.6 s — and from r=3 upward, every burst experiment was 100 % completion with zero silent regressions through x5000. Cross-replica cancel acks in 130 ms p99 (38× under default timeout). Crash recovery completes in ~20 s with the reaper path named in the run data itself. The cluster doesn't merely scale; its failure modes are tractable, observable, and bounded.
Sustainable for hours. 15 minutes of continuous load: 56,000 circuits, zero drift, phase 3 indistinguishable from phase 1. 30 minutes of the same shape with a live OTEL stack underneath: 109,000 circuits, ~180 runs/sec, load distributed across the four replicas within 0.5 %, the OTEL pipeline itself holding up at sampler ratio 1.0. No leak signature in RSS, no growth in goroutine count, no slow drift in p99. The capacity probe extended the same envelope to x5000 per wave at perfectly linear wall, and the saturation probe finally named the soft ceiling between x6000 and x8000. Half an hour of continuous load is a non-trivial property to demonstrate; the runtime did it without changing shape.
The numbers are tied to this hardware, this Postgres, this pool sizing, this provider profile (mock at 15 % injected 429). Scaling them to your deployment requires re-running on your hardware — that's what the harness is for, and what the OTEL stack lets you observe at production wire shape. What the campaign establishes is the method of measurement and the shapes to expect: linear up to a knee; soft saturation past it; flat over time within capacity; cluster mode worth its overhead only above r=2.
This is what "closer to v1.0" looks like in evidence rather than vibes. A week ago the reliability claim for the v1.0 launch was qualitative — Apache-2.0, multi-replica HA shipped, OTEL native, the architectural pieces in place. After this campaign it's quantitative: linear capacity up to a named knee, soft saturation past it, half an hour of zero drift, cross-replica primitives working at scale, and failure modes that are tractable rather than catastrophic. The remaining work below — multi-hour runs, real-provider RTT, the pgxpool fix paths, heterogeneous-replica fairness — is sizing, not load-bearing unknowns. We feel meaningfully more confident in loomcycle's reliability and sustainability than at the start of the week, and the v1.0 milestone correspondingly closer.
What's left to verify
- Longer than 30 minutes. The 30-min run captured no drift; multi-hour runs are the next item — the question is whether anything we missed in 30 minutes accumulates over six.
- Real provider RTT in the loop. Mock zero-latency calls maximise the load on the substrate but compress the time spent in the agentic loop. A real-provider re-run (~1–3 s per call) stretches per-circuit time ~25×, which pushes the same pool ceiling out to "much higher concurrency at much lower QPS" terrain — closer to typical production shape.
- Heterogeneous replica sizes. All four replicas here were identical. Production clusters often run mixed-size pods; the load-split fairness story needs verifying under unequal capacity.
- The pgxpool fix paths from yesterday. The substrate's pool contention is still the next-deepest ceiling; the three options from yesterday's writeup (bump
pg_max_open_conns, shard agent → connection, per-replica hot-path cache) remain the immediate PR work.
None of these are blockers in the sense of "the runtime has a known structural problem." They're blockers in the sense that the v1.0 tag would otherwise carry claims it hasn't tested. The campaign so far is the foundation; the remaining items are what makes the claim production-grade rather than benchmark-grade.
Seven experiments, two days, one Xeon. The cluster scales across replicas — in steps, not curves. It stays flat under sustained load. The OTEL stack you'd run in production held up under the load it was observing. The saturation knee is named, the failure mode past it is gentle, and the next ceiling to attack is already on the workbench. That's what "reliable under stress and sustainable for hours" means in measured numbers — and it's the foundation the v1.0 launch claim sits on.
Companion writeups: 15,000 agents on a synthetic provider (yesterday — the launchSem + mock-provider work this campaign built on), Multi-replica HA — the seven phases that get loomcycle close to v1.0 (the cluster primitives the burst sweep exercises), and Three MCP tokens in one run (the per-run-credentials work that ships alongside this one).