§ substrate note

Scheduled runs at 30,000 fires. The double-fire we caught at the ceiling.

2026-05-28 · by Dennis Gubsky · ~7 min read

Earlier today's post about the named credentials map closed on a promise: "the next post covers the same map on the scheduled side." This is that post. v0.12.7 ships the second half of the pair - RFC E, ScheduleDef as a substrate primitive - and pushes it through a stress test designed to find the seams.

The shape is the obvious one. Operators declare a cron in yaml; the sweeper fires a real RunInput on each tick; on_complete hooks deliver the result via a channel, a memory write, or an outbound MCP call; and for templates that fan across users, each fork carries its own user_credentials map - the wire shape from RFC F, persisted as substrate state rather than supplied on the request. That's the small story. The interesting story is what happened when we asked the new compound test to handle 30,000 of these in a single test run.

The shape - ScheduleDef as the fourth primitive

Loomcycle's substrate already had three append-only versioned primitives - AgentDef (v0.8.5), SkillDef (v0.8.22), and MCPServerDef (v0.9.2). ScheduleDef is the fourth, structurally identical: every mutation creates a new version with full lineage, fork-from-parent for per-user customisation, a 5-op CRUD tool (create, fork, get, list, retire), 4-transport admin (HTTP, gRPC, MCP, TS adapter), and a drift test pinning the schema parity between the three sides of each round-trip.

Two yaml entry styles share the same struct. A template entry has no user_id and offers per-tier cron defaults - orchestrators fork it per user at signup time. A standalone entry has an explicit user_id and a single schedule: - the operator owns the cron directly.

scheduled_runs:
  # Template - orchestrator forks per user
  job-search-template:
    agent: job-search-batch
    user_tier_schedules:
      low:    "0 6 1,11,21 * *"       # 3×/month
      middle: "0 6 1,8,15,22 * *"     # 4×/month
      high:   "0 6 * * *"             # daily
    required_credentials: [jobs, slack, telegram]
    timezone: "Europe/Berlin"
    enabled: true
    on_complete:
      - kind: mcp.call
        server: telegram
        tool:   send_message
        args:
          chat_id: "{{user.telegram_chat_id}}"
          text:    "{{run.final_text}}"

  # Standalone - operator-owned cron
  alarm-summary-weekly:
    agent: alarm-summarizer
    user_id: [email protected]
    user_credentials_from_env:
      slack: LOOMCYCLE_OPERATOR_SLACK_BEARER
    schedule: "0 9 * * 1"             # Monday 09:00
    timezone: "UTC"
    enabled: true
    on_complete:
      - kind: channel.publish
        channel: _system/operator-digest
        payload: { text: "{{run.final_text}}" }
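The two entry styles can be told apart mechanically. Here is a minimal sketch of that check, assuming a hypothetical ScheduleEntry struct whose fields mirror the yaml keys above; the type, the Kind method, and the exact rejection rules are illustrative assumptions, not loomcycle's actual types.

```go
package main

import (
	"errors"
	"fmt"
)

// ScheduleEntry is a hypothetical mirror of one scheduled_runs: entry.
// Field names follow the yaml keys in the post; the real struct may differ.
type ScheduleEntry struct {
	Agent             string
	UserID            string            // empty for templates
	Schedule          string            // single cron, standalone entries only
	UserTierSchedules map[string]string // per-tier crons, template entries only
}

// Kind classifies an entry as "template" or "standalone". It returns an
// error when the entry mixes the two styles or satisfies neither.
func (e ScheduleEntry) Kind() (string, error) {
	isTemplate := e.UserID == "" && e.Schedule == "" && len(e.UserTierSchedules) > 0
	isStandalone := e.UserID != "" && e.Schedule != "" && len(e.UserTierSchedules) == 0
	switch {
	case e.Agent == "":
		return "", errors.New("agent is required")
	case isTemplate:
		return "template", nil
	case isStandalone:
		return "standalone", nil
	default:
		return "", errors.New("entry must be a template (tier schedules, no user_id) or standalone (user_id plus schedule)")
	}
}

func main() {
	entries := []ScheduleEntry{
		{Agent: "job-search-batch", UserTierSchedules: map[string]string{"high": "0 6 * * *"}},
		{Agent: "alarm-summarizer", UserID: "operator", Schedule: "0 9 * * 1"},
	}
	for _, e := range entries {
		kind, err := e.Kind()
		fmt.Println(e.Agent, kind, err)
	}
}
```

Rejecting mixed entries outright, rather than picking a winner, keeps the "same struct, two styles" arrangement from turning into a silent precedence rule.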

on_complete is a closed set: three hook kinds, no others. channel.publish writes into a loomcycle channel; memory.set persists into the run's memory layer; mcp.call fires an outbound MCP tool. Closed-set means the validator refuses unknown kinds at boot, and the substrate-side add_hook / remove_hook ops (PR #270) refuse them too. No catch-all, no plugin surface, no place for a hook author to slip an arbitrary HTTP call past the trust boundary.
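A closed set is easy to enforce in code. The sketch below shows one way a boot-time validator could refuse unknown kinds; the Hook type and validateHooks function are hypothetical names for illustration, and only the three kind strings come from the post.

```go
package main

import "fmt"

// allowedHookKinds is the closed set named in the post. Anything else is refused.
var allowedHookKinds = map[string]struct{}{
	"channel.publish": {},
	"memory.set":      {},
	"mcp.call":        {},
}

// Hook is a hypothetical minimal shape for one on_complete entry.
type Hook struct {
	Kind string
}

// validateHooks returns an error naming the first hook whose kind is not
// in the closed set, or nil when every hook is allowed.
func validateHooks(hooks []Hook) error {
	for i, h := range hooks {
		if _, ok := allowedHookKinds[h.Kind]; !ok {
			return fmt.Errorf("on_complete[%d]: unknown hook kind %q", i, h.Kind)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateHooks([]Hook{{Kind: "mcp.call"}, {Kind: "memory.set"}}))
	fmt.Println(validateHooks([]Hook{{Kind: "http.post"}}))
}
```

Because the set is a fixed map rather than a registry, there is no registration call for a plugin to hook into, which is the property the paragraph above describes.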

Configuring the scheduler

The sweeper is off by default. Operators with no scheduled_runs: entries see no scheduler activity. Operators who do have entries opt in explicitly:

# .env / process environment

# Master switch. Default: false.
LOOMCYCLE_SCHEDULER_ENABLED=true

# How often the sweeper polls schedule_run_state for due rows.
# Default: 30s. The compound test below uses 100ms to make
# burst-fire scenarios reproducible; production deployments stay
# at the default, where the over-fire shape doesn't manifest.
LOOMCYCLE_SCHEDULER_TICK_SECONDS=30

# Per-fire cap on the agent run. Default: 600s (10 min). Reaching
# the cap cancels via ctx and records last_status=failed.
LOOMCYCLE_SCHEDULER_FIRE_TIMEOUT_SECONDS=600

# Comma-separated allowlist for user_credentials_from_env keys
# that schedules may read. Empty allowlist (default) disables
# env-credential resolution entirely. Safe-by-default posture.
LOOMCYCLE_SCHEDULER_ENV_ALLOWLIST=LOOMCYCLE_OPERATOR_SLACK_BEARER,LOOMCYCLE_OPERATOR_JOBS_BEARER
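To make the allowlist posture concrete, here is a rough sketch of how a resolver might gate user_credentials_from_env lookups. The function names and the injected getenv parameter are assumptions for illustration; the post only states that an empty allowlist disables env-credential resolution.

```go
package main

import (
	"fmt"
	"strings"
)

// parseAllowlist splits the comma-separated allowlist value into a set,
// trimming whitespace and ignoring blanks. An empty value yields an empty set.
func parseAllowlist(raw string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, name := range strings.Split(raw, ",") {
		if name = strings.TrimSpace(name); name != "" {
			set[name] = struct{}{}
		}
	}
	return set
}

// resolveEnvCredential reads envName through getenv only when the name is
// allowlisted. getenv is injected so the sketch stays testable; a real
// caller would pass os.Getenv.
func resolveEnvCredential(allow map[string]struct{}, envName string, getenv func(string) string) (string, error) {
	if _, ok := allow[envName]; !ok {
		return "", fmt.Errorf("env var %q is not in the scheduler allowlist", envName)
	}
	value := getenv(envName)
	if value == "" {
		return "", fmt.Errorf("env var %q is allowlisted but unset", envName)
	}
	return value, nil
}

func main() {
	env := map[string]string{"LOOMCYCLE_OPERATOR_SLACK_BEARER": "demo-token"}
	getenv := func(key string) string { return env[key] }

	allow := parseAllowlist("LOOMCYCLE_OPERATOR_SLACK_BEARER, LOOMCYCLE_OPERATOR_JOBS_BEARER")
	fmt.Println(resolveEnvCredential(allow, "LOOMCYCLE_OPERATOR_SLACK_BEARER", getenv))

	// Empty allowlist: every lookup is refused, even for variables that are set.
	fmt.Println(resolveEnvCredential(parseAllowlist(""), "LOOMCYCLE_OPERATOR_SLACK_BEARER", getenv))
}
```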

One in-config knob is worth knowing about: MaxConcurrentFires. It bounds the goroutine pool the sweeper spawns inside a single tick when a cron crossing makes hundreds of forks due in the same second. Default: runtime.NumCPU() * 4. The tick still waits for the whole batch to drain - the "one tick at a time" invariant is preserved - but the fires inside a tick run in parallel up to the cap. Larger values trade memory and store pressure for tighter cascading at burst moments; the default is sized for production-shape hardware and should not need tuning unless your cron pattern produces single-second avalanches.
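The "parallel inside a tick, one tick at a time" behaviour can be sketched with a buffered channel used as a semaphore plus a WaitGroup. This is an illustrative reconstruction, not the sweeper's code: fireBatch and its peak-concurrency return value are hypothetical, and only the NumCPU() * 4 default comes from the post.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// fireBatch runs fire once per due ID with at most maxConcurrent fires in
// flight, and returns only after the whole batch has drained. The return
// value is the peak concurrency observed, included here for demonstration.
func fireBatch(due []string, maxConcurrent int, fire func(id string)) int64 {
	if maxConcurrent <= 0 {
		maxConcurrent = runtime.NumCPU() * 4 // default quoted in the post
	}
	slots := make(chan struct{}, maxConcurrent)
	var wg sync.WaitGroup
	var running, peak int64

	for _, id := range due {
		slots <- struct{}{} // blocks while the pool is full
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			defer func() { <-slots }() // release the slot when the fire ends

			n := atomic.AddInt64(&running, 1)
			for { // record a new peak if this fire raised it
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			fire(id)
			atomic.AddInt64(&running, -1)
		}(id)
	}
	wg.Wait() // the tick does not return until every fire has finished
	return atomic.LoadInt64(&peak)
}

func main() {
	due := make([]string, 100)
	for i := range due {
		due[i] = fmt.Sprintf("def-%d", i)
	}
	var fired int64
	peak := fireBatch(due, 4, func(string) { atomic.AddInt64(&fired, 1) })
	fmt.Println("fired:", fired, "peak within cap:", peak <= 4)
}
```

Acquiring the slot before spawning the goroutine means a burst of hundreds of due rows never creates more than maxConcurrent goroutines at once, which is where the memory and store-pressure trade-off comes from.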

The yaml shape and the env knobs together mean a working scheduled-run deployment has three things in place: a yaml block, the env switch flipped to true, and a tier picker (for template forks) or a static cron (for standalone). The cluster scheduler (per-def advisory locks across N replicas) is on the v0.12+ roadmap; the single-replica sweeper in v0.12.7 is what shipped today.

The compound test - proving the three substrates compose

v0.12.7 binds three v1.x substrates together for the first time: RFC E (this one), RFC F (per-run credentials), and MCP per-server bearer substitution (which both rely on). Each of the three had its own isolation tests. None had been exercised together.

PR #271 ships the compound test that gates the release. It seeds 310 schedules across three phases (10 at T+0, 100 at T+1s, 200 at T+2s), watches the cascade, and asserts:

All 310 runs complete with status=completed.
Each of two MCP servers received exactly 310 calls - not 309, not 311.
Zero bearer mismatches across both servers - 620 substitution checks, 0 cross-fork credential races.
Per-user isolation - each user_id appears exactly once on each server, no parallel-fire cross-contamination.
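The per-server assertions above amount to three checks over the calls a mock MCP server records. The sketch below shows that shape, assuming a hypothetical Call record and an expected user-to-bearer map; it is not the code from PR #271.

```go
package main

import "fmt"

// Call is one request as a mock MCP server might record it (hypothetical shape).
type Call struct {
	UserID string
	Bearer string
}

// checkServer applies the per-server assertions: the call count matches the
// number of schedules, every bearer matches its user's expected value, and
// each expected user appears exactly once. It returns a list of violations.
func checkServer(calls []Call, expected map[string]string) []string {
	var problems []string
	if len(calls) != len(expected) {
		problems = append(problems, fmt.Sprintf("got %d calls, want %d", len(calls), len(expected)))
	}
	seen := make(map[string]int)
	for _, c := range calls {
		seen[c.UserID]++
		if want, ok := expected[c.UserID]; !ok || c.Bearer != want {
			problems = append(problems, fmt.Sprintf("bearer mismatch for user %s", c.UserID))
		}
	}
	for user := range expected {
		if seen[user] != 1 {
			problems = append(problems, fmt.Sprintf("user %s seen %d times, want 1", user, seen[user]))
		}
	}
	return problems
}

func main() {
	expected := map[string]string{"user-1": "bearer-1", "user-2": "bearer-2"}
	clean := []Call{{"user-1", "bearer-1"}, {"user-2", "bearer-2"}}
	doubled := append(append([]Call{}, clean...), clean...) // simulated double-fire

	fmt.Println("clean violations:", len(checkServer(clean, expected)))
	fmt.Println("double-fire violations:", len(checkServer(doubled, expected)))
}
```

Note that a double-fire trips the count and once-per-user checks while leaving the bearer check clean, which matches the failure the compound test later reported.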

Default scale is 310. The -scale=N flag preserves the phase ratios while pushing the absolute numbers up. We ran the test at scale ∈ {100, 1000, 3000, 5000, 10000, 20000, 30000, 50000, 100000} to characterise where the wall-time bends, where the substrate's actual ceiling lives, and whether bearer-substitution correctness holds the whole way up. The answers turned out to be "linear through 50,000," "a real double-fire race at 30,000," and "yes, every single one of 200,000+ MCP calls got the right bearer."

The bug at x30,000 - every schedule fired twice

At scale=30000 the test failed loudly: 60,000 MCP calls per server instead of 30,000. Every schedule fired exactly twice. Zero credential mismatches even at double-fire - the bearer plumbing held perfectly - but the wall ballooned to 163 seconds because the runtime was doing 2× the work.

Root cause: when a fire's RecordResult write took longer than the tick interval, the same row still appeared as "due" on the next tick. The sweeper's only guard was ctx.Done() - no suppression of "this row is already firing." Under heavy concurrent load RecordResult writes were slower than 100 ms, so each row fired on every tick during its in-flight window. At x30,000 that worked out to almost exactly 2× call counts.

The fix (PR #272): an in-process sync.Map tracker on the Scheduler struct. Before slot-acquire, the tick atomically LoadOrStores the def_id; loaded keys mean "previous fire is still running," and the tick skips them. The fire goroutine's deferred cleanup deletes the entry (running after recover(), so panicking fires still clear their slot), and the ctx.Done() path explicitly deletes the reserved entry so a cancelled tick doesn't strand a def "stuck" in-flight forever.

// internal/scheduler/scheduler.go
type Scheduler struct {
    // ...
    inFlight sync.Map // key: def_id, value: time fired
}

// tick():
if _, alreadyFiring := s.inFlight.LoadOrStore(row.DefID, time.Now()); alreadyFiring {
    continue
}
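For readers who want to see the whole reserve-and-release cycle, here is a self-contained sketch built around the same LoadOrStore guard. The tryFire and Wait methods and the embedded WaitGroup are assumptions added for the demo, and the ctx.Done() cleanup path described above is omitted for brevity.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Scheduler keeps only the in-flight tracker from the snippet above, plus a
// WaitGroup (an addition for this sketch) so callers can drain fires.
type Scheduler struct {
	inFlight sync.Map // key: def_id, value: time the fire started
	wg       sync.WaitGroup
}

// tryFire reserves defID and runs fire in a goroutine. It returns false,
// without firing, when a previous fire for the same defID is still running.
func (s *Scheduler) tryFire(defID string, fire func()) bool {
	if _, alreadyFiring := s.inFlight.LoadOrStore(defID, time.Now()); alreadyFiring {
		return false
	}
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		// Deferred calls run last-in, first-out: the recover below runs
		// first, then this Delete, so a panicking fire still clears its slot.
		defer s.inFlight.Delete(defID)
		defer func() { _ = recover() }()
		fire()
	}()
	return true
}

// Wait blocks until every started fire has finished and released its slot.
func (s *Scheduler) Wait() { s.wg.Wait() }

func main() {
	s := &Scheduler{}
	release := make(chan struct{})

	first := s.tryFire("def-1", func() { <-release }) // holds the slot
	second := s.tryFire("def-1", func() {})           // skipped: still in flight
	close(release)
	s.Wait()

	third := s.tryFire("def-1", func() { panic("boom") }) // slot is free again
	s.Wait()
	fourth := s.tryFire("def-1", func() {}) // the panic did not strand the slot
	s.Wait()

	fmt.Println(first, second, third, fourth) // prints: true false true true
}
```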

A regression test pins it: 1 schedule, a fake runner that sleeps 300 ms per fire, and 5 back-to-back ticks at 50 ms intervals (4 fall during the in-flight window). It asserts calls == 1. Without the fix the test reports calls == 2, the exact pattern the compound test caught at scale.
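The timing in that regression scenario can be reproduced in a few lines. This is a simplified simulation under stated assumptions, not the project's test: countFires and regressionCalls are hypothetical helpers, the schedule is treated as always due, and only the guarded (post-fix) path is modelled.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// countFires drives `ticks` sweeper ticks, `interval` apart, against a single
// always-due schedule whose fake runner sleeps for fireDuration. It returns
// how many fires actually started with the in-flight guard in place.
func countFires(ticks int, interval, fireDuration time.Duration) int64 {
	const defID = "def-1"
	var inFlight sync.Map
	var wg sync.WaitGroup
	var calls int64

	for i := 0; i < ticks; i++ {
		if i > 0 {
			time.Sleep(interval)
		}
		if _, busy := inFlight.LoadOrStore(defID, time.Now()); busy {
			continue // previous fire still running: skip this tick
		}
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer inFlight.Delete(defID)
			atomic.AddInt64(&calls, 1)
			time.Sleep(fireDuration) // fake runner
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&calls)
}

// regressionCalls runs the scenario with the numbers quoted in the post:
// 5 ticks, 50 ms apart, 300 ms per fire.
func regressionCalls() int64 {
	return countFires(5, 50*time.Millisecond, 300*time.Millisecond)
}

func main() {
	fmt.Println("calls with guard:", regressionCalls())
}
```

The last tick lands about 200 ms in, well inside the 300 ms fire, so all four later ticks find the slot reserved and skip.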

What the scale curve actually looks like

Post-fix, the scheduler holds a linear curve from x100 to x50,000 on the M1 laptop the test ran on. Past x10,000 the wall is dominated by sustained mock-provider latency × calls / cores; throughput plateaus at roughly 1,000 MCP calls per second. The x100,000 run hit the test's 5-minute hard deadline at ~75 % completion (151,459 of 200,000 expected calls), consistent with SQLite :memory: single-writer contention on the per-fire RecordResult updates.

scale	pre-fix wall	pre-fix calls (per server)	post-fix wall	post-fix calls (per server)
100	2.74 s	100 ✓	2.74 s	100 ✓
1,000	3.78 s	1,000 ✓	3.78 s	1,000 ✓
5,000	10.05 s	5,000 ✓	10.05 s	5,000 ✓
10,000	18.47 s	10,000 ✓	18.47 s	10,000 ✓
20,000	58.02 s	20,000 ✓	58.02 s	20,000 ✓
30,000	163.36 s	60,000 ✗	57.89 s	30,000 ✓
50,000	-	-	115.00 s	50,000 ✓
100,000	-	-	300 s (deadline)	~75 % complete

Numbers are from an Apple M1 (8 cores, 16 GB RAM) with SQLite :memory:, a mock LLM provider with 20 ms latency + 30 ms jitter, two in-process httptest MCP servers, and a tick interval of 100 ms (an aggressive test-only knob - production default is 30 s, which sidesteps the over-fire shape entirely). They're a laptop floor, not the production ceiling. The Xeon-class hardware the multi-replica sustained-load research ran on (32 threads, 62 GB RAM, Postgres backend) will push every number in this table substantially higher.
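The plateau figure can be re-derived from the table with one division per row. The sketch below assumes total calls are twice the per-server count (two MCP servers) and uses the post-fix wall times; the throughput helper is a hypothetical name.

```go
package main

import "fmt"

// throughput returns MCP calls per second given the total calls across
// both servers and the wall time in seconds.
func throughput(totalCalls int, wallSeconds float64) float64 {
	return float64(totalCalls) / wallSeconds
}

func main() {
	rows := []struct {
		label string
		calls int     // total across both MCP servers
		wall  float64 // post-fix wall time in seconds, from the table
	}{
		{"x10,000", 2 * 10000, 18.47},
		{"x30,000", 2 * 30000, 57.89},
		{"x50,000", 2 * 50000, 115.00},
		{"x100,000 (partial)", 151459, 300},
	}
	for _, r := range rows {
		fmt.Printf("%-20s %6.0f calls/s\n", r.label, throughput(r.calls, r.wall))
	}
}
```

At x30,000 this works out to a little over 1,000 calls per second, and the partial x100,000 run to roughly half that, which fits the writer-contention reading of the deadline miss.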

What we did not find

The headline guarantee of the compound test is the credential one. Across the full sweep - more than 200,000 MCP calls in the post-fix scale-out, plus the double-fired 60,000 per server at x30,000 pre-fix - every single outbound MCP request carried the substituted Authorization header matching its fork's user_id. Zero bearer mismatches. Zero cross-fork credential leaks. The substitution path is thread-safe under cascading load through at least x50,000 and likely well beyond.

No silent regressions, either - every fire's last_status=completed check passed at every scale, and the scheduler's Stop() cleanly drains the in-flight set on teardown (no goroutine leaks). The bug we found was a volume bug - extra correct work - not a correctness bug.

The pattern matters more than the number. Stress tests that produce only round numbers - "x scheduled fires per minute" - are less informative than the ones that surface a shape. The double-fire ceiling at 30,000 wasn't on anyone's list; it became visible only because the compound test was structured to count outgoing MCP calls per server, not just check that the runs completed. The release gate caught the bug; the in-flight tracker closed it; the next ceiling sits beyond x50,000 on commodity hardware and looks like a writer-contention shape rather than a logic flaw. That's a release we can ship.

What you can do with it today

If you're on v0.12.7: scheduled_runs: in your yaml + LOOMCYCLE_SCHEDULER_ENABLED=true in your env is the entire opt-in. A static schedule: string for operator-owned crons; a user_tier_schedules: block for templates that fan across users, with required_credentials declaring the keys forks must populate. The /ui/schedules admin tab (PR #269) gives you the list / fork / pause / resume / run-now / retire surface live; the same operations are available on the HTTP, gRPC, MCP, and TS-adapter sides. Hooks can be edited per-fork via add_hook / remove_hook ops without rewriting the parent template.

Companion reading: Three MCP tokens in one run (the credential-map shape this scheduler consumes), Reliable under stress, sustainable for hours (the seven-experiment campaign the scheduler stress test joins), and Multi-replica HA - the seven phases (the cluster substrate the per-def advisory-locked scheduler will land on next).