loomcycle
§ release note

Multi-tenant authorization shipped — and the four bugs adversarial QA caught before v0.17.0.

v0.17.0 tagged today. The headline is RFC L — OSS multi-tenant authorization and isolation: the seventh substrate primitive (OperatorTokenDef), an authoritative principal resolved from the bearer instead of from the wire, per-route and per-RPC scope enforcement, a tenant-scoped read boundary across both the HTTP API and the Web UI, and a role-aware workspace where super-admins see all tenants and a tenant principal sees only its own. The single LOOMCYCLE_AUTH_TOKEN shared secret is now one option among many; production deployments mint per-principal lct_… bearers with explicit scopes, and the substrate stops trusting tenant_id on the wire.

That's the feature. Three PRs (#323 substrate, #324 identity threading, #325 cache + invalidation + docs). They shipped clean — tests green, contract drift checks green, manual end-to-end verification clean. The interesting story is what happened next: an adversarial-QA pass found four authorization gaps the feature PRs missed. One was CRITICAL (a gRPC authorization bypass that let any narrow token mint substrate:admin tokens). Three were HIGH (cross-principal session continuation, last-admin-retire-into-open-mode, per-route scope-map typos). All four closed before tag with regression-grade tests. None of them affected authentication; all four affected authorization.

This post is the honest engineering account: the feature, the four bugs, the discipline that caught them, and the strategic shift the release locks in. Authentication is the easy half. Authorization is what needs the second pass.

What RFC L actually ships

Pre-v0.17, loomcycle authenticated every request against one shared secret. LOOMCYCLE_AUTH_TOKEN in env, auth.CompareBearer in middleware, constant-time compare — either you had the secret or you didn't. The substrate then trusted the caller to populate tenant_id and user_id on the wire. v0.10.1's per-tenant fairness primitive shipped, but its key was caller-asserted; a single token with the right wire payload could impersonate any tenant at no cost.

v0.17 makes tenant identity authority-derived. The auth middleware resolves the inbound bearer to an auth.Principal{TenantID, Subject, Scopes, TokenDefID, TokenSuffix} from the OperatorTokenDef substrate, stamps it into ctx via auth.WithPrincipal, and the principal's TenantID + Subject override anything the wire claims. The v0.10.1 fairness primitive that has been waiting for trustworthy tenant identity is now load-bearing.

Five things ship together:

  1. OperatorTokenDef substrate (PR #323) — the seventh content-addressed Def, alongside AgentDef / SkillDef / MCPServerDef / ScheduleDef / WebhookDef / MemoryBackendDef. Same 5-op CRUD shape (create / rotate / retire / get / list). Per-name advisory-locked versioning. Token plaintext (lct_-prefixed, 256-bit CSPRNG-minted, peppered-SHA-256 stored — not argon2id, indexable, matching the existing auth.CompareBearer primitive) shown to the operator exactly once at creation. loomcycle operator-token CLI verb with the five subcommands. File-based audit log appending {ts, actor_token_suffix, action, target_def_id, target_tenant, scopes_before, scopes_after} on every mutation.
  2. Authoritative principal + identity threading (PR #324) — refactored internal/auth/middleware.go resolves the bearer to a typed Principal via the new operator_token_def_active lookup. The principal threads through ctx on HTTP, gRPC, MCP, and the TS adapter (one ctx-stamping seam, four transports). The runs table gets a denormalized tenant_id column (migration 0036) so tenant-scoped list reads don't need a JOIN; store.RunIdentity gains TenantID; the four run-creation sites (RunOnce / handleRuns / handleMessages / runSubAgent) thread the principal's tenant into the identity. The wire-side tenant_id is now ignored when it disagrees with the token's tenant — logged as kind=tenant_id_overridden for operator triage, never silently honored.
  3. Token cache + invalidation + --copy-from-env + docs (PR #325) — the auth hot path can't roundtrip to Postgres on every request. A per-replica tokenCache stores {hash → Principal} with a 30-second TTL, plus cluster-wide invalidation via Postgres LISTEN/NOTIFY on loomcycle.operator_token_changed (same backplane the rest of the cluster substrate uses). The loomcycle operator-token create --copy-from-env migration command is the zero-disruption path for v0.x operators: it creates the first OperatorTokenDef from the existing LOOMCYCLE_AUTH_TOKEN value, marks it as the admin token, and the legacy env-var path stays valid for any operator who didn't migrate.
  4. Tenant-scoped read boundary (PR #334) — `principalTenantScope` (for list reads) and `tenantVisible` (for single-row reads) wired across the read endpoints. A tenant principal sees only its own tenant's runs and agents; a super-admin sees all, or focuses one via ?tenant=. GET /v1/_me exposes the authoritative principal to the UI: {tenant_id, subject, scopes, is_admin, legacy, open_mode?}. Within-tenant model is whole-tenant (a tenant login sees its tenant's whole workspace, subjects collaborate); cross-tenant is opaque-404 (no enumeration oracle).
  5. Login page + role-aware Web UI + tenant-focus switcher (PRs #335 + #336) — the SPA gates the shell behind GET /v1/_me, redirects to /ui/login on 401, gates operator-global nav (library / channels / schedules / audit / etc.) behind is_admin. A super-admin-only tenant-focus switcher in the topbar threads ?tenant= into the user picker and run lists. Tenants don't see the switcher and can't widen through the wire (the backend forces their tenant regardless).
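To make the token-storage choice in item 1 concrete, here is a rough Go sketch of minting a 256-bit lct_ bearer and deriving a deterministic, indexable digest from it. The post does not say how the pepper is mixed in or how the random bytes are encoded; HMAC-SHA-256 keyed by the pepper and base64url encoding are assumptions chosen for illustration, and MintToken / HashToken are hypothetical names.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

const tokenPrefix = "lct_"

// MintToken returns a new bearer: the prefix plus 32 bytes (256 bits)
// read from the operating system's CSPRNG.
func MintToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	// Assumption: base64url without padding; the real encoding is not stated.
	return tokenPrefix + base64.RawURLEncoding.EncodeToString(buf), nil
}

// HashToken derives the value that would be stored and indexed.
// Assumption: "peppered SHA-256" is modelled here as HMAC-SHA-256 keyed by
// the pepper. The digest is deterministic, so a lookup by hash is possible,
// which a salted password hash such as argon2id would not allow.
func HashToken(pepper []byte, token string) string {
	mac := hmac.New(sha256.New, pepper)
	mac.Write([]byte(token))
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	pepper := []byte("example-pepper-from-config") // placeholder value
	tok, err := MintToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("token (shown once):", tok[:8]+"...")
	fmt.Println("stored digest:    ", HashToken(pepper, tok))
}
```

Because the bearer already carries 256 bits of entropy, a fast keyed hash is enough to resist offline guessing, and the deterministic digest keeps the hot-path lookup a single indexed read.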

Migration is one CLI command:

# On any v0.17.0 instance with LOOMCYCLE_AUTH_TOKEN still set:
$ loomcycle operator-token create \
    --copy-from-env \
    --tenant=acme --subject=admin --scopes=substrate:admin

Token: lct_8mEd7n…  (shown ONCE; copy now or rotate later)
Wrote: operator_token_defs row #1 (tenant=acme subject=admin)
Legacy LOOMCYCLE_AUTH_TOKEN: still valid as a fallback admin token
Audit: /var/loomcycle/audit.log appended { action: create, … }

The legacy LOOMCYCLE_AUTH_TOKEN path remains valid for single-operator deployments that don't need multi-tenancy — RFC L is additive, not a breaking change. But operators provisioning per-principal tokens stop trusting tenant_id on the wire from the first create onward.

The four bugs adversarial QA caught

After the three feature PRs landed and tests were green, the adversarial-QA pass treated the merged code as the target. The premise of adversarial QA isn't "find bugs the unit tests missed"; it's "assume the principal is hostile and try to do something the design says is impossible." Mint a narrow token. Try to act outside its scope. Try to read another tenant's data. Try to escalate to admin. Try to lock the operator out. The four bugs below all fell out of that exercise; all four were closed before v0.17.0 tagged.

Worth noting up front: none of the four were authentication bugs. The middleware authenticated correctly in every case — the right principal was always stamped into ctx after PR #324. All four were authorization bugs: "the wrong principal could do X", not "authentication was bypassable."

Bug #1 — gRPC interceptor authenticated but didn't scope-check

CRITICAL · PR #327 · Any valid token could mint substrate:admin tokens over gRPC.

PR #324 stamped the resolved principal into ctx over both HTTP and gRPC. On HTTP, the middleware then ran requiredScopeFor(method, path) against the principal's scopes — a narrow runs:read token couldn't reach POST /v1/_operatortokendef because the scope check rejected it. On gRPC, the interceptor only authenticated. It stamped the principal into ctx, then handed the call to the substrate dispatcher (substrateGRPCCtx), which stamped OperatorTokenDefPolicy{Admin:true} + wildcard channel/def policies regardless of the caller's actual token scopes.

Net effect: a token scoped to runs:read could call OperatorTokenDef.create over gRPC and mint itself a substrate:admin token. The HTTP and gRPC transports gated the same operation differently, and the looser transport silently won.

The fix added requiredScopeForRPC(fullMethod) with a deny-by-default-to-admin posture: unmapped RPCs require substrate:admin, so any newly-added admin RPC is protected even if the maintainer forgets to map it. Only the explicit consumer RPCs (Run / Continue / CancelAgent → runs:create, the reads → runs:read, the channel RPCs → channel:*) get a lesser scope. enforceScope() runs in both the unary and stream interceptors right after authenticate. The regression test (TestGrpcEnforceScope_AdminRPCDeniedForNarrowToken) verifies a runs:read token is denied OperatorTokenDef + PauseRuntime over gRPC, and that neutralizing enforceScope reproduces the exact pre-fix bypass.
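The deny-by-default posture is easy to show in isolation. The sketch below is an approximation under stated assumptions: the gRPC full-method strings are invented for the example, hasScope's rule that substrate:admin satisfies every requirement is an assumption, and the real enforceScope runs inside the interceptors rather than taking a scope slice directly.

```go
package main

import "fmt"

const scopeAdmin = "substrate:admin"

// rpcScopes lists only the RPCs that get a lesser scope.
// The full-method names are hypothetical placeholders.
var rpcScopes = map[string]string{
	"/loomcycle.v1.Loomcycle/Run":         "runs:create",
	"/loomcycle.v1.Loomcycle/Continue":    "runs:create",
	"/loomcycle.v1.Loomcycle/CancelAgent": "runs:create",
	"/loomcycle.v1.Loomcycle/GetRun":      "runs:read",
}

// requiredScopeForRPC returns the mapped scope, or substrate:admin for
// anything unmapped, so a forgotten mapping fails closed.
func requiredScopeForRPC(fullMethod string) string {
	if s, ok := rpcScopes[fullMethod]; ok {
		return s
	}
	return scopeAdmin
}

// hasScope reports whether the token carries the wanted scope.
// Assumption: an admin token satisfies every requirement.
func hasScope(scopes []string, want string) bool {
	for _, s := range scopes {
		if s == want || s == scopeAdmin {
			return true
		}
	}
	return false
}

// enforceScope is what an interceptor would call right after authentication,
// passing the method name it was handed for the call.
func enforceScope(scopes []string, fullMethod string) error {
	need := requiredScopeForRPC(fullMethod)
	if !hasScope(scopes, need) {
		return fmt.Errorf("permission denied: %s requires %s", fullMethod, need)
	}
	return nil
}

func main() {
	narrow := []string{"runs:read"}
	fmt.Println(enforceScope(narrow, "/loomcycle.v1.Loomcycle/GetRun"))
	fmt.Println(enforceScope(narrow, "/loomcycle.v1.Loomcycle/CreateOperatorTokenDef"))
}
```

The design choice worth noting is the direction of the default: the map enumerates what is allowed to be weaker, not what must be protected.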

Why this was missed in the feature PRs. The HTTP middleware was the design center; gRPC was a "we also intercept over gRPC" footnote. The principal-stamping seam was correctly identical across both transports — but the scope-checking seam was HTTP-only. Two transports, two enforcement code paths, the looser one shipped.

Bug #2 — cross-principal session continuation

HIGH · PR #328 · Session continuation trusted session-id-as-secret.

A session continuation (POST /v1/sessions/{id}/messages, POST /v1/runs with a session id, the gRPC Continue path, GET /v1/sessions/{id}/transcript) runs under the session's stored (tenant, subject). That's deliberate — fairness keys, the run row, the threaded RunIdentity, and the memory-tenancy TenantID all read from sess.UserID / sess.TenantID so a continuation lands on the original principal's resources.

Nothing checked that the calling principal owned the session. Session ids aren't secrets — they're returned to the caller, logged, shown in the UI, embedded in events and transcripts. So a token from principal-A holding runs:create could POST to a session id belonging to principal-B and have the run execute under B's identity: cross-tenant memory read/write, B's full prior transcript replayed back to A over SSE, B's fairness cap spent (evasion + DoS amplifier), and the run attributed to B in the audit trail.

The fix is sessionOwnershipOK(ctx, sess) in auth_principal.go: a non-legacy principal may act on a session only when sess.TenantID == p.TenantID. The same-tenant-different-subject case (sess.UserID != p.Subject) is allowed, because whole-tenant collaboration is the v0.17 model. Wired at all four session-identity sites. Cross-tenant mismatch returns opaque 404 / ErrSessionNotFound, so there is no oracle for which session ids exist, matching the RFC H trust-boundary discipline.

Why this was missed. The session-as-resource model was older than RFC L; sessions had been "the caller knows the id, the caller continues it" for the whole v0.x line. RFC L introduced authoritative principal identity, but the session continuation code still read from sess.* as if the caller and the session owner were the same entity. They aren't anymore. The fix is one helper called at four sites.

Bug #3 — retiring the last admin token silently dropped into open mode

HIGH · PR #329 · Lockout-prevention had a fail-OPEN twin nobody noticed.

Loomcycle's authConfigured() returns false when LOOMCYCLE_AUTH_TOKEN is unset AND the active-admin-token count is zero — and the auth middleware treats !authConfigured as open mode (pass-through, no principal stamped, no scope check). That was the dev-only path: a fresh checkout with no env var lets the local dev UI work without making the operator mint a token first.

RFC L's Decision 12 guarded the bootstrap-in direction: the substrate's first create call disables the legacy fallback gracefully. It didn't guard the retire-out direction. An operator who migrated off the legacy env var (LOOMCYCLE_AUTH_TOKEN unset, all admin auth via OperatorTokenDef) and retired their last admin-scoped token silently dropped the entire server into open mode. The runtime didn't even surface a warning — it just stopped checking auth.

Worse than the lockout this guard was supposed to prevent.

The fix is a refusal at execRetire: an admin-scoped token can be retired only when (a) a legacy fallback is set (LegacyTokenSet=true, wired from cfg.Env.AuthToken), OR (b) it's not the last active admin (OperatorTokenDefCountActiveAdmin > 1). Rotate is exempt by construction — it mints a replacement admin before retiring the prior, so the count never hits zero.

Why this was missed. The "open mode" predicate was a single function shared between bootstrap and steady state. It correctly answered "is auth configured right now?" — yes when a legacy env var or any admin token exists. The mistake was treating that predicate as a binary capability gate rather than a state-transition guard. Bootstrap-in and retire-out are different transitions with different safety properties; one predicate can't speak for both.

Bug #4 — per-route scope map had typos that left mutating routes ungated

HIGH · PR #330 · Multiple sensitive routes fell through to the any-authenticated default.

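requiredScopeFor(method, path) mapped each route to the scope a caller's principal needed. Unmapped routes returned "", the any-authenticated default. The map had four real gaps: the run-cancel route, the interrupt resolve and list routes, the per-user channel surface, and /metrics.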

Each typo was small. Together they meant a runs:read-only token — the kind operators would mint for app-side read-only access — could cancel any run, resolve any interrupt, publish to any channel, and scrape /metrics.

The fix added the missing cases: POST cancel → runs:create; interrupt resolve → runs:create, list → runs:read; the per-user channel surface → channel:publish (writes) / channel:read (peek), placed before the generic /v1/users/ read case so peek resolves correctly; /metrics → substrate:admin. The dead DELETE /v1/agents/ case was removed. The regression test (TestRequiredScopeFor) was extended with the cancel / interrupt / channel / metrics cases and fails-before: the pre-fix scope map returns "" or runs:read for them.

Why this was missed. The scope-map function is a long switch statement keyed on the route. Reviewers reading it scan for the case they're checking; they don't enumerate the cases that aren't there. The fix wasn't a design change; it was the discipline of grepping the route table for every route and ensuring each one had a mapping.
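One way to turn that grep into a standing check is to make "deliberately any-authenticated" distinguishable from "nobody mapped this", then walk the route table. The sketch below is a toy under loud assumptions: the two-value return, unmappedRoutes, and most of the route paths (the cancel and interrupt paths in particular) are invented for illustration and are not the project's actual route table or signature.

```go
package main

import (
	"fmt"
	"strings"
)

// route is one registered endpoint.
type route struct{ Method, Path string }

// requiredScopeFor returns the scope for a route and whether the route was
// mapped at all. An empty scope with mapped=true means "any authenticated
// caller, on purpose". Paths here are illustrative placeholders.
func requiredScopeFor(method, path string) (scope string, mapped bool) {
	switch {
	case path == "/metrics":
		return "substrate:admin", true
	case method == "POST" && path == "/v1/_operatortokendef":
		return "substrate:admin", true
	// Specific cases sit before the generic prefix cases below them.
	case method == "POST" && strings.HasPrefix(path, "/v1/runs/") && strings.HasSuffix(path, "/cancel"):
		return "runs:create", true
	case method == "POST" && path == "/v1/runs":
		return "runs:create", true
	case method == "GET" && strings.HasPrefix(path, "/v1/runs"):
		return "runs:read", true
	case method == "GET" && path == "/v1/_me":
		return "", true // assumption: intentionally any-authenticated
	}
	return "", false
}

// unmappedRoutes returns every registered route that has no explicit case.
// A test can assert the result is empty, so the absence of a case is loud.
func unmappedRoutes(table []route) []route {
	var missing []route
	for _, r := range table {
		if _, ok := requiredScopeFor(r.Method, r.Path); !ok {
			missing = append(missing, r)
		}
	}
	return missing
}

func main() {
	table := []route{
		{"POST", "/v1/runs"},
		{"POST", "/v1/runs/abc/cancel"},
		{"POST", "/v1/interrupts/abc/resolve"}, // no case above: should be reported
		{"GET", "/metrics"},
	}
	for _, r := range unmappedRoutes(table) {
		fmt.Println("unmapped:", r.Method, r.Path)
	}
}
```

A check of this kind inverts the review burden: a reviewer no longer has to notice a missing case, because the table walk reports it.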

One smaller fix worth naming: the dead scope vocabulary

PR #333 removed memory:read and memory:write from the scope catalog. They had been grantable from the catalog but enforced by no route. The HTTP memory surface (/v1/_memory/*) is operator-admin (substrate:admin); per-tenant memory read/write is the agent-facing Memory tool, gated by the run's memory policy — not an HTTP scope.

A scope the runtime never checks is a false boundary. An operator creating a token with memory:read believed they were narrowing access; they weren't. Removing the scopes is more honest than the alternative — graduating them onto an admin route would have weakened the route's actual gate.

The pattern of all five fixes is the same: "never silently allow what the operator believed was restricted." The gRPC bypass let a narrow token do admin work it thought it couldn't. The session-ownership gap let a token act on a resource it thought it didn't own. The last-admin-retire let a deliberate operator action have an outcome it didn't intend. The scope-map typos let routes the operator believed were protected fall through unchecked. The dead scopes let operators believe they were narrowing access they weren't actually narrowing. Every fix made the runtime more honest about what authorization actually means at each seam.

One more fix worth naming: cache outage-negative + size bound

PR #331 closed two issues in the per-replica token cache that the QA pass surfaced. They're not authorization bugs in the same sense as the four above, but they're worth naming because they shape availability under stress.

Cache amplified DB outages into sticky lockouts. The cache lookup miss handler (resolvePrincipalUncached) mapped any non-ErrNotFound store error to (_, false) — including a transient DB outage. So a valid token that tried to resolve during a Postgres blip got a negative cache entry for the full 30-second TTL: locked out for up to 30 seconds after the DB recovered. A blip amplified into a sticky lockout.

The fix: resolvePrincipalUncached now returns a third cacheable flag — false only on a transient store outage (fail closed for this request, re-probe next time), true for definitive outcomes (hit / legacy fallback / genuine not-found-and-expired). resolvePrincipal caches only when cacheable is true.

Cache had no size bound — negative-spray amplifier. Negative resolution results (unknown bearer → not-found) were cached for the TTL, with no max size and no proactive eviction. Spraying distinct random bearers at any authenticated route grew the cache map without bound for 30 seconds. Memory-amplification vector; each 256-bit bearer was a fresh map key.

The fix: tokenCache gains a maxSize bound (16384 entries). put() at the cap sweeps expired entries first, then skips caching if still full. Correctness is preserved — the next request does the direct lookup — and the negative-spray memory cost is capped.

The strategic shift: what v1.0 means now

For most of the last two months, RFC L was framed as the v1.0 capstone: the headline feature that unlocked team-and-small-VPS adoption before v1.0 launched. The framing changed with v0.17.0.

RFC L is shipped, tested, hardened. The four authorization gaps are closed. The Web UI multi-tenant boundary is real (not just operator-config hiding). The Mem9 / pluggable-backend per-tenant isolation re-points onto the authoritative TenantID instead of caller-asserted user_id. The strategic centerpiece of v1.0 has, instead, landed in v0.17.

So v1.0 reframes as a pure hardening + distribution milestone:

No new substrate primitives in v1.0. No new wire-shape changes. The seven Defs we have are the seven Defs v1.0 ships with. The job between v0.17.0 and v1.0 is to make what we have land cleanly in operator hands — install paths, distribution polish, and one more round of adversarial QA against the rest of the surface.

What you can do with it today

On a v0.17.0 binary with no migration:

$ export LOOMCYCLE_AUTH_TOKEN="your-existing-secret"
$ loomcycle  # works exactly like v0.16
              # single-operator deployments need no migration

To migrate to multi-principal auth:

$ loomcycle operator-token create --copy-from-env \
    --tenant=acme --subject=admin --scopes=substrate:admin
$ loomcycle operator-token create \
    --tenant=acme --subject=alice --scopes=runs:create,runs:read,channel:read
$ loomcycle operator-token create \
    --tenant=acme --subject=app-prod --scopes=runs:create
# Each create prints lct_… ONCE; copy now or rotate later.
# Distribute via your team's secret manager.

The operator-token verb is available over four transports: HTTP (POST /v1/_operatortokendef), gRPC, MCP meta-tool, TypeScript adapter (@loomcycle/client 0.17.0). The audit log appends a structured JSONL line on every mutation; rotate via the file-rotation tool of your choice.

The role-aware Web UI is live at /ui. GET /v1/_me returns the authoritative principal — paste a token, see what it actually authorizes. Super-admins get the tenant-focus switcher in the topbar; tenants see only their workspace.

Companion reading: Two memory interfaces — flat KV and the layered paradigm honest about its shape (RFC I + K, the per-tenant memory boundary RFC L makes load-bearing) and Input webhooks — the signed-by-default front door for external events (RFC H, the trust-boundary discipline this feature inherits — verify-before-parse, never silently degrade, opaque 404 on cross-tenant probes).