Self-evolving agents — a genetic algorithm over forkable AgentDefs.
The sixth experiment in the operator-via-MCP series turned the substrate inward. The previous five had agents talking to tools, channels, memory, schedulers, and each other. Experiment 6 had agents talking to themselves across generations — specifically, rewriting their own system prompts based on an advisor's evaluation, with the rewrites inherited by their offspring via AgentDef.fork.
A meta-agent (the breeder) ran a small genetic algorithm over a population of solver agents. Each solver carried a three-gene persona baked as literal text into its system_prompt. An advisor agent judged each solver's output with the Evaluation tool. The breeder forked survivors, asked them to propose their own child genes, applied the proposed mutations as overlays on the forked def, and promoted the winner when one crossed a fitness threshold.
Static variant: 4 generations, mean score 0.537 → 0.828, winner found at gen 3. Fully-dynamic variant (every role agent runtime-authored): the first run surfaced one real runtime gap, F40 (the AgentDef overlay was silently dropping the *_def_scopes capability family). After the fix it re-validated: 5 generations, mean 0.495 → 0.86, a 16-version solver lineage forked entirely by a runtime-authored breeder.
Worth saying up front: this is the first experiment in the series that needed no new substrate primitives. The full self-evolution apparatus was already on the wire — AgentDef.fork, AgentDef.promote, parent_def_id lineage, Agent.parallel_spawn, Evaluation.submit, Memory, Context.lineage. The experiment's job was to prove the substrate composes into self-evolution end-to-end. It did. The one gap it found was an overlay-round-trip closure that the F14 fix had missed.
The design — persona as genotype
A "self-evolving agent" here is one that rewrites its own system prompt to score better on a task, and whose improved prompt is inherited by its offspring. The unit of selection is the AgentDef version; the unit of inheritance is the parent_def_id lineage chain; the unit of variation is whatever changes between a parent's system_prompt and its forked child's.
To make the variation observable and the lineage interpretable, we picked three named integer genes (0–10), baked as literal text into each solver's prompt. The values are the heritable material:
| Gene | Low (0–3) | High (7–10) |
|---|---|---|
| creativity | literal, conventional | bold, lateral, vivid leaps |
| courage | hedge, qualify, avoid committing | commit decisively to one answer, no hedging |
| caution (the playful "self-doubt") | confident, minimal caveats | self-critical, piles on caveats |
One of the three drives a real runtime knob: effort = creativity ≥ 7 ? high : creativity ≥ 4 ? medium : low, plumbed into the provider's reasoning budget. The other two are expressed purely through the prompt; loomcycle exposes no per-agent temperature / top_p (those are tier-level), so "creativity-as-sampling" wasn't available. That is the experiment's honest rough edge.
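The creativity-to-effort rule can be written out as a small function. This is a minimal sketch: the name effort_for is hypothetical, and the range check is an added assumption based on the 0–10 gene range stated earlier.

```python
def effort_for(creativity: int) -> str:
    """Map the creativity gene (0-10) to an effort tier.

    Thresholds follow the rule in the text: >=7 high, >=4 medium, else low.
    The function name and the range check are illustrative assumptions.
    """
    if not 0 <= creativity <= 10:
        raise ValueError(f"creativity must be in 0..10, got {creativity}")
    if creativity >= 7:
        return "high"
    if creativity >= 4:
        return "medium"
    return "low"


for gene in (3, 4, 7):
    print(gene, effort_for(gene))
```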
The advisor's rubric was engineered to reward novelty + decisiveness + correctness — an implicit optimum at creativity↑, courage↑, caution↓. The genes are interpretable so you can read the gene drift back from the store and check whether the population actually moved toward the optimum (it did).
Topology — spawn-based, no channels, no scheduler
Three agents, one driver per generation, lineage carried by the substrate:
POST /v1/runs (driver: one breeder run per generation g)
│
exp6-breeder (depth 0 — GA controller)
gen 0 ─ author task (advisor) → Memory task:spec
seed 4 diverse variants via AgentDef.fork
┌──────────────────────────────────────────────────────────┐
each generation g:
1. SOLVE Agent.parallel_spawn → 4 × exp6-solver (depth 1),
pinned by def_id, prompt = task + g/i.
each solver: Context.self → run_id
Memory.set gen:g:var:i:result
2. SCORE spawn exp6-advisor (depth 1): reads each result,
Evaluation.submit run_id score∈[0,1]
dimensions={novelty, decisiveness, correctness}
writes gen:g:var:i:eval
3. SELECT breeder reads the evals →
gen:g:summary {scores, mean, best_var, best_def_id}
4. STOP? best ≥ 0.90 OR g == MAX_GEN-1 →
AgentDef.promote(best) + result:summary → STOP
5. MUTATE elitism keeps the best; each other child = survivor
spawned in REFLECT mode (proposes its own child
genes, Δ ≤ ±3) → AgentDef.fork(
parent_def_id=survivor,
overlay={system_prompt:<rebuilt>, effort:…})
→ gen:g+1:var:j
└──────────────────────────────────────────────────────────┘
Constants: POP=4, MAX_GEN=5, THRESHOLD=0.90, user_id=exp6. The external driver (work/exp6_run.sh) is a thin generation-stepper: it loops g, blocks on each breeder run over SSE, and reads the memory ledger to detect the stop signal. All mutation, scoring, and forking are done by agents. The "self-evolving" content (each survivor proposing its own gene tweak) is agent-authored; the breeder only applies it as a fork, so the AgentDef lineage chain is a real artifact of the run, not a side-channel.
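The SELECT and STOP steps reduce to a few lines of ordinary logic. The sketch below is a plain-Python approximation, not the breeder's actual prompt or code: Variant, summarize, and next_action are hypothetical names, the scores are made up for illustration, and the threshold is passed in as a parameter (0.9 in the demo, matching the stop annotation in the results tables). Choosing which survivor each forked child descends from is left out, since the text does not specify it.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Variant:
    var: int        # slot index within the generation
    def_id: str     # AgentDef version the solver was pinned to
    score: float    # advisor score in [0, 1]


def summarize(variants):
    """Build the gen:<g>:summary record from the per-variant scores."""
    best = max(variants, key=lambda v: v.score)
    scores = [v.score for v in variants]
    return {
        "scores": scores,
        "mean": round(mean(scores), 3),
        "best_var": best.var,
        "best_def_id": best.def_id,
    }


def next_action(variants, gen, max_gen, threshold):
    """Decide between promoting the best def and mutating for gen+1."""
    summary = summarize(variants)
    if max(summary["scores"]) >= threshold or gen == max_gen - 1:
        return {"action": "promote", "def_id": summary["best_def_id"]}
    # Elitism: the best def carries over; every other slot becomes a fork.
    fork_slots = [v.var for v in variants if v.var != summary["best_var"]]
    return {"action": "mutate", "elite": summary["best_def_id"],
            "fork_slots": fork_slots}


# Illustrative scores only; these are not taken from the experiment.
population = [
    Variant(0, "def_a", 0.61), Variant(1, "def_b", 0.74),
    Variant(2, "def_c", 0.55), Variant(3, "def_d", 0.70),
]
print(summarize(population))
print(next_action(population, gen=1, max_gen=5, threshold=0.9))
print(next_action(population, gen=4, max_gen=5, threshold=0.9))
```

The last call shows the second stop condition: at the final generation the best def is promoted even though no score reached the threshold.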
The three roles
Each role agent is a single AgentDef. The static variant declares them in yaml; the fully-dynamic variant authors all three at runtime via POST /v1/_agentdef.
exp6-solver — the evolving lineage
- Allowed tools: [Context, Memory], the permanent ceiling. Fork can only narrow.
- Memory scopes: [user].
- Base genes: 5 / 5 / 5 (the unmutated baseline; gen-0 seeds vary around it).
- Two prompt-driven modes:
  - SOLVE: answer the task in-character (per the persona baked into the prompt), self-report run_id + answer to Memory.set gen:<g>:var:<i>:result.
  - REFLECT: given its score + advisor feedback, emit a JSON proposal for its child's genes (Δ ≤ ±3 from its own).
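A REFLECT proposal has to respect two bounds: each gene moves at most 3 from the parent's value, and stays within 0–10. The sketch below enforces both by clamping. Clamping (rather than rejecting the proposal) is an assumption, as are the function name apply_proposal and the flat JSON shape; the text does not say how out-of-range proposals are handled.

```python
import json

GENES = ("creativity", "courage", "caution")
MAX_DELTA = 3          # per-gene step limit from the text
GENE_MIN, GENE_MAX = 0, 10


def apply_proposal(parent: dict, proposal_json: str) -> dict:
    """Return child genes, clamped to the step limit and the gene range.

    Genes missing from the proposal are inherited unchanged.
    """
    proposal = json.loads(proposal_json)
    child = {}
    for gene in GENES:
        wanted = int(proposal.get(gene, parent[gene]))
        low = max(GENE_MIN, parent[gene] - MAX_DELTA)
        high = min(GENE_MAX, parent[gene] + MAX_DELTA)
        child[gene] = min(high, max(low, wanted))
    return child


parent = {"creativity": 5, "courage": 5, "caution": 5}
proposal = '{"creativity": 9, "courage": 7, "caution": 1}'
print(apply_proposal(parent, proposal))
```

Here the creativity and caution requests overshoot the step limit and are pulled back to 8 and 2; the courage request is within bounds and passes through as 7.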
exp6-advisor — task-giver + fitness judge
- Allowed tools: [Evaluation, Memory, Context].
- Evaluation scopes: [submit_any, read_any]. It scores sibling solver runs (submit_any because submit_siblings is inert today; rough edge documented).
- Authors one creative task per experiment, whose rubric implicitly rewards creativity↑ courage↑ caution↓. Judges the output, never the genes: the advisor doesn't know what genes a solver carries, only what it produced.
exp6-breeder — GA controller, the meta-agent
- Allowed tools: [Agent, AgentDef, Evaluation, Memory, Context].
- agent_def_scopes: [named:exp6-solver]. This is the capability gate that lets it fork the solver lineage and only the solver lineage. Default-deny without it.
- Evaluation scopes: [read_any] (reads scores, doesn't submit them).
- max_concurrent_children: 6. Bounds the per-generation fan-out under load.
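The breeder's gate is default-deny: with no agent_def_scopes at all, every AgentDef op is refused, and a named:<name> entry grants access to exactly that lineage. A minimal sketch of that check follows. The function name may_touch_def and the return shape are hypothetical, and only the named: form is modelled; the wildcard form is omitted because its syntax is not given in the text.

```python
def may_touch_def(agent_def_scopes, target_name):
    """Return (allowed, reason) for an AgentDef op against target_name.

    Illustrative only: models default-deny plus the named:<name> scope.
    """
    if not agent_def_scopes:
        # Empty or missing scopes means the capability was never granted.
        return False, "agent has no agent_def_scopes (default-deny)"
    for scope in agent_def_scopes:
        kind, _, name = scope.partition(":")
        if kind == "named" and name == target_name:
            return True, "ok"
    return False, f"no scope grants access to {target_name!r}"


print(may_touch_def(["named:exp6-solver"], "exp6-solver"))
print(may_touch_def(["named:exp6-solver"], "exp6-advisor"))
print(may_touch_def(None, "exp6-solver"))
```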
The memory ledger
All shared state lives in scope: user Memory under user_id=exp6. Every role agent reads and writes the same prefix; the external verifier reads the same ledger to independently re-derive the result. The keys form a small generation-tree:
task:spec = {task, rubric} (gen 0, advisor)
gen:<g>:var:<i> = {def_id, genes, parent, gen, var} (genotype record)
gen:<g>:var:<i>:result = {run_id, genes, answer} (solver self-report)
gen:<g>:var:<i>:eval = {score, dimensions, rationale} (advisor verdict)
gen:<g>:summary = {scores, mean, best_var, best_def_id}
result:summary = {generations, best_score, winner_def_id, stopped}
The independent verifier (work/exp6_run.sh verify) walks this ledger over REST, recomputes per-generation mean + max from the per-variant evals, checks that every gen>0 variant's parent resolves to a real prior-gen def_id, and asserts the promotion landed on the winning lineage. The breeder's self-report is checked against the substrate state, not trusted.
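Two of the verifier's checks, the recomputed per-generation mean and the parent lineage, can be sketched against an in-memory copy of the ledger. This is an approximation under stated assumptions: the real verifier reads the ledger over REST, whereas here it is a plain dict; verify and demo_ledger are hypothetical names; the demo data is invented; and the promotion check is left out.

```python
from statistics import mean


def verify(ledger, generations, pop, tol=1e-3):
    """Recompute per-generation means and check parent lineage.

    Returns a list of problem strings; an empty list means both checks passed.
    """
    problems = []
    prior_defs = set()
    for g in range(generations):
        scores, gen_defs = [], set()
        for i in range(pop):
            record = ledger[f"gen:{g}:var:{i}"]
            scores.append(ledger[f"gen:{g}:var:{i}:eval"]["score"])
            gen_defs.add(record["def_id"])
            # Every gen>0 variant must descend from a def seen earlier.
            if g > 0 and record["parent"] not in prior_defs:
                problems.append(
                    f"gen {g} var {i}: unknown parent {record['parent']}")
        reported = ledger[f"gen:{g}:summary"]["mean"]
        if abs(mean(scores) - reported) > tol:
            problems.append(f"gen {g}: summary mean {reported} does not match")
        prior_defs |= gen_defs
    return problems


def demo_ledger(bad_parent=False):
    """Tiny two-generation ledger with invented ids and scores."""
    rows = [  # (gen, var, def_id, parent, score)
        (0, 0, "def_a", None, 0.50),
        (0, 1, "def_b", None, 0.60),
        (1, 0, "def_c", "def_b", 0.70),
        (1, 1, "def_d", "def_x" if bad_parent else "def_b", 0.80),
    ]
    ledger = {}
    for g, i, def_id, parent, score in rows:
        ledger[f"gen:{g}:var:{i}"] = {
            "def_id": def_id, "parent": parent, "gen": g, "var": i}
        ledger[f"gen:{g}:var:{i}:eval"] = {"score": score}
    ledger["gen:0:summary"] = {"mean": 0.55}
    ledger["gen:1:summary"] = {"mean": 0.75}
    return ledger


print(verify(demo_ledger(), generations=2, pop=2))
print(verify(demo_ledger(bad_parent=True), generations=2, pop=2))
```

The second call reports the variant whose parent does not resolve to any prior-generation def.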
Static variant — 4 generations to a winner
The static variant (loomcycle.exp6.macos.yaml on v0.25.2) declares the three role agents in yaml; the population variants are still runtime-forked, because evolution is intrinsically dynamic. The breeder's agent_def_scopes is in yaml, so the capability gate is in place from boot.
Seeded with a deliberately sub-optimal gen 0 (low creativity, high caution — the rubric's opposite), the population climbed steadily as the self-reflective mutations dragged the mean gene vector toward the optimum:
| gen | n | mean score | max score | mean genes {creativity, courage, caution} |
|---|---|---|---|---|
| 0 | 4 | 0.537 | 0.65 | {3.0, 3.5, 7.0} ← sub-optimal seeds |
| 1 | 4 | 0.758 | 0.87 | {4.2, 5.8, 5.5} |
| 2 | 4 | 0.790 | 0.88 | {6.2, 6.2, 4.2} |
| 3 | 4 | 0.828 | 0.90 | {7.8, 6.5, 3.8} ← crossed 0.90 → STOP |
Winner: exp6-solver v12 (def_6041f0019c32e432), with parent_def_id = def_c6fd8c558ea89288 (a genuinely-evolved descendant, not a gen-0 seed). Its baked-in genes were creativity=7, courage=6, caution=3 — the self-evolution had discovered the rubric's optimum persona and written it into its own system prompt. Promotion confirmed: the active exp6-solver def is the winner.
Independent re-derivation from the store passed cleanly: improvement (mean(gen3)=0.828 ≥ mean(gen0)=0.537), lineage integrity (every gen>0 variant's parent_def_id resolves to a known prior def), and promotion.
The fully-dynamic variant — the gap and the fix
The fully-dynamic variant (loomcycle.exp6-dynamic.macos.yaml, routing-only) authors all three role agents at runtime via POST /v1/_agentdef. The point: prove the substrate can express self-evolution from a fully runtime-authored meta-agent — no static yaml scaffolding.
On v0.25.2, it didn't. Here's what happened.
F40 · v0.25.2 · AgentDef overlay drops *_def_scopes
The dynamic exp6-advisor worked. Its evaluation_scopes round-tripped (F14 holds for tool/eval agents; it authored the task + rubric cleanly). The dynamic exp6-breeder was created fine: Context op=permissions showed allowed_tools: [..., AgentDef, ...] and evaluation_scopes: [read_any] intact. But agent_def_scopes: null. Every AgentDef op returned is_error: "agent has no agent_def_scopes (default-deny); add agent_def_scopes: [...] to the agent yaml". The breeder seeded zero variants and self-reported {"status":"blocked","reason":"agent_def_scopes=null prevents AgentDef fork"}.
Root cause: the AgentDef create/fork overlay's mergedDef struct (internal/tools/builtin/agentdef.go) plus its applyOverlay round-tripped channels, evaluation_scopes, interruption (the F14 closure for "a complete dynamic agent"). It had no field for agent_def_scopes — nor for schedule_def_scopes / skill_def_scopes / a2a_*_def_scopes. The read side (lookup.SubstrateAgentDef) mirrored the same omission, consistently. So an agent created via POST /v1/_agentdef always had empty *_def_scopes; the AgentDef tool's checkScopeForName default-denied; the substrate refused.
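The failure mode is easy to reproduce in miniature. The sketch below is a Python analogy, not the Go code in agentdef.go: it assumes the overlay is applied by copying only the fields the merged struct declares, so a field the struct lacks is dropped without any error. The class names, the reduced field list, and apply_overlay's signature are all illustrative.

```python
from dataclasses import dataclass, field, fields


@dataclass
class MergedDefBefore:
    """Fields the overlay round-tripped before the fix (abridged)."""
    system_prompt: str = ""
    channels: list = field(default_factory=list)
    evaluation_scopes: list = field(default_factory=list)


@dataclass
class MergedDefAfter(MergedDefBefore):
    """Same struct with one of the missing capability fields added."""
    agent_def_scopes: list = field(default_factory=list)


def apply_overlay(cls, overlay: dict):
    """Copy only the keys the target struct declares; ignore the rest."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in overlay.items() if k in known})


overlay = {
    "system_prompt": "breeder",
    "evaluation_scopes": ["read_any"],
    "agent_def_scopes": ["named:exp6-solver"],
}
before = apply_overlay(MergedDefBefore, overlay)
after = apply_overlay(MergedDefAfter, overlay)
print(getattr(before, "agent_def_scopes", None))  # field silently lost
print(after.agent_def_scopes)                     # field survives
```

Nothing fails at create time in the first case, which is why the gap only showed up later as a default-deny on the first AgentDef op.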
A runtime-authored meta-agent — any agent that authors, forks, schedules, or otherwise mutates the substrate's def plane — was therefore impossible. Only statically-declared agents could carry the capability. That's exactly the kind of static-vs-dynamic asymmetry the v0.20.0 → v0.23.5 work was supposed to have closed; F40 was the last seam from the F14 family.
The fix (#436, v0.26.2, RFC W): round-trip the five *_def_scopes capability gates through mergedDef + applyOverlay + staticToMergedDef on the write side and lookup.SubstrateAgentDef / ToConfigDef on the read side, pinned by the lookup drift test (TestAgent_DriftDetection) plus a new TestAgentDefTool_CreateRoundTripsDefScopes that fails-before on a missing field. The *_def_scopes values are deliberately not part of content_sha256 — ACLs are authority, not content. An existing agent row stays byte-stable; a pure fork-scope change doesn't mint a new version (just like retry_attempts didn't, back at RFC L).
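Keeping ACLs out of the content hash means two defs that differ only in their *_def_scopes hash identically, so a scope change alone does not look like a new version. A minimal sketch of that property, under assumptions: the real content_sha256 input is not described in the text, so the canonical-JSON serialization here is a stand-in, and the suffix match on _def_scopes is an illustrative way to select the family.

```python
import hashlib
import json


def content_sha256(agent_def: dict) -> str:
    """Hash only content fields; *_def_scopes keys are treated as authority.

    Canonical JSON (sorted keys, compact separators) is an assumption
    made so the digest is stable across key order.
    """
    content = {k: v for k, v in agent_def.items()
               if not k.endswith("_def_scopes")}
    blob = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


base = {"name": "exp6-breeder", "system_prompt": "breed solvers"}
scoped = dict(base, agent_def_scopes=["named:exp6-solver"])
edited = dict(base, system_prompt="breed solvers carefully")

print(content_sha256(base) == content_sha256(scoped))  # scope change only
print(content_sha256(base) == content_sha256(edited))  # prompt change
```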
agent_def_scopes and skill_def_scopes are consumed in the loop (server.go substratePoliciesForAgent / skillDefPolicyForAgent), so a runtime-authored breeder is now a full substrate participant. schedule_* and a2a_* def-scopes round-trip for parity, though their in-loop policy wiring is a separate, pre-existing gap to close later.
Re-validation on v0.26.2
The same dynamic variant, on the post-fix build, ran end-to-end. Different seed, same convergence shape, slightly different cadence (the dynamic breeder needed one extra generation):
| gen | n | mean score | max score | mean genes {creativity, courage, caution} |
|---|---|---|---|---|
| 0 | 4 | 0.495 | 0.72 | {3.0, 3.5, 7.0} ← seeds |
| 1 | 4 | 0.672 | 0.82 | {3.0, 5.2, 5.2} |
| 2 | 4 | 0.755 | 0.87 | {3.5, 6.5, 3.8} |
| 3 | 4 | 0.780 | 0.82 | {4.5, 7.0, 3.0} |
| 4 | 4 | 0.860 | 0.90 | {5.8, 8.0, 2.0} ← crossed 0.90 → STOP |
Winner: exp6-solver v16 (def_8268f616a9c91351), parent_def_id = def_6e9a23fb3148334a. The runtime-authored breeder forked its entire 16-version solver lineage itself. No static scaffolding. The breeder, the solver lineage, the advisor — every entity exists only as substrate defs. Mean score 0.495 → 0.86; gene drift toward the optimum confirmed; lineage integrity confirmed; promotion confirmed.
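The gene-drift claim can be checked directly from the two results tables. The sketch below measures how far each generation's mean gene vector sits from the rubric's favoured corner. Treating that corner as exactly (10, 10, 0) and using a sum-of-absolute-differences distance are both assumptions made for illustration; the text only says the optimum is creativity↑, courage↑, caution↓.

```python
# Mean genes per generation, copied from the static and dynamic tables.
STATIC = [(3.0, 3.5, 7.0), (4.2, 5.8, 5.5), (6.2, 6.2, 4.2), (7.8, 6.5, 3.8)]
DYNAMIC = [(3.0, 3.5, 7.0), (3.0, 5.2, 5.2), (3.5, 6.5, 3.8),
           (4.5, 7.0, 3.0), (5.8, 8.0, 2.0)]

# Assumed optimum corner: max creativity, max courage, min caution.
OPTIMUM = (10.0, 10.0, 0.0)


def distance(genes, target=OPTIMUM):
    """Sum of absolute per-gene gaps to the target vector."""
    return sum(abs(g - t) for g, t in zip(genes, target))


def strictly_decreasing(values):
    return all(a > b for a, b in zip(values, values[1:]))


for label, run in (("static", STATIC), ("dynamic", DYNAMIC)):
    gaps = [round(distance(genes), 1) for genes in run]
    print(label, gaps, strictly_decreasing(gaps))
```

Under these assumptions the gap shrinks at every generation in both runs, even in dynamic gen 3, where the max score dipped.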
The static-vs-dynamic asymmetry for meta-agents is now closed. The substrate can be authored, evolved, and re-shaped entirely at runtime.
Issues and rough edges (the honest residual scope)
Beyond F40, the experiment surfaced four documented rough edges. None are blockers; all are worth knowing about if you build something like this.
- No per-agent temperature / top_p. Sampling controls are tier-level only, so "creativity-as-sampling" isn't available. The experiment rode effort + the prompt-baked gene; the gene's influence on output was almost certainly weaker than a temperature dial would have been. A real evolution study would want both.
- parallel_spawn doesn't return child run_ids directly. Worked around with each solver self-reporting its own Context.self-derived run_id into Memory. The breeder reads it back from the ledger. Not load-bearing, just slightly awkward.
- agent_def_scopes has no descendants mode yet. The choice today is [named:<name>] (this specific lineage) or effectively any via wildcard. A "descendants of def_id" mode would let a breeder evolve a lineage without naming it explicitly. Noted as a future scope.
- submit_siblings evaluation scope is inert today. The advisor uses submit_any because it's scoring sibling solver runs and submit_siblings's gate doesn't yet recognize the parallel-spawn relationship. submit_any is broader than ideal; a real production rubric would want the narrower gate.
Plus the F-note documented in the experiment writeup: loomcycle has no prompt templating, so the gene values are baked as literal text into the forked def's system_prompt. That's actually a feature for an evolution experiment — the genotype is observable by reading the def back — but a more conventional template substitution ({{ genes.creativity }} resolved at fork time) would be cleaner for production prompt-engineering workflows. A future RFC.
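For comparison, fork-time template substitution of the {{ genes.<name> }} form could look like the sketch below. loomcycle has no such feature today, so everything here is hypothetical: the render function, the placeholder grammar beyond the one example in the text, and the choice to raise on an unknown gene rather than leave the placeholder in place.

```python
import re

# Matches {{ genes.creativity }} with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*genes\.(\w+)\s*\}\}")


def render(template: str, genes: dict) -> str:
    """Replace each {{ genes.<name> }} placeholder with the gene's value."""
    def substitute(match):
        name = match.group(1)
        if name not in genes:
            raise KeyError(f"unknown gene: {name}")
        return str(genes[name])
    return _PLACEHOLDER.sub(substitute, template)


template = ("Persona: creativity={{ genes.creativity }}, "
            "courage={{genes.courage}}, caution={{ genes.caution }}.")
print(render(template, {"creativity": 7, "courage": 6, "caution": 3}))
```

The rendered output is the same literal text the experiment bakes in by hand, so the genotype would stay readable from the stored def either way.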
The engineering lesson worth keeping
The substrate was already enough. Forkable AgentDefs, parent-pointer lineage, parallel fan-out, an evaluation primitive that can score someone else's run, shared user-scope memory — all of these had shipped over the past several months for other reasons (RFC self-evolution, the multi-agent loop, the JS code-agent's stateless replay). exp6 didn't need a single new wire-shape primitive. The genetic algorithm composes from primitives that already existed and were already tested against other use cases.
What it did need was a closure that prior work missed. The F14 closure ("an agent created over MCP can carry channels, evaluation_scopes, interruption") was the right shape — but it stopped at the capability-gates that govern who can author other substrate entities. *_def_scopes was the family of fields F14 left out. The fix is small (a struct-field round-trip on five fields plus a drift test) but structural: a runtime-authored meta-agent is now a full substrate participant.
The discipline this surfaces: when you add a capability-gate family to a substrate primitive, the static-vs-dynamic seam is one round-trip per field, not one closure across the family. F14 closed channels/eval/interruption; F40 closed the def-scopes family. The next time we add a capability gate, the test matrix has to ride the new field through create + fork + read + drift-detection before the feature is "complete."
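One cheap guard for that discipline is a field-parity check between the static config shape and the runtime shape, so a capability field added to one side fails a test until it is added to the other. The sketch below shows the idea with Python dataclasses. It is not the project's TestAgent_DriftDetection; the class names, the abridged field lists, and missing_on_runtime are illustrative assumptions.

```python
from dataclasses import dataclass, field, fields


@dataclass
class StaticAgentConfig:
    """What a yaml-declared agent can carry (abridged, illustrative)."""
    system_prompt: str = ""
    evaluation_scopes: list = field(default_factory=list)
    agent_def_scopes: list = field(default_factory=list)
    skill_def_scopes: list = field(default_factory=list)


@dataclass
class RuntimeMergedDef:
    """What a runtime-authored agent round-trips (abridged, illustrative)."""
    system_prompt: str = ""
    evaluation_scopes: list = field(default_factory=list)
    agent_def_scopes: list = field(default_factory=list)


def missing_on_runtime(static_cls, runtime_cls):
    """Names declared on the static side but absent on the runtime side."""
    static_names = {f.name for f in fields(static_cls)}
    runtime_names = {f.name for f in fields(runtime_cls)}
    return sorted(static_names - runtime_names)


drift = missing_on_runtime(StaticAgentConfig, RuntimeMergedDef)
print(drift)  # a non-empty list is the signal that a field was left behind
```

A parity check like this only covers the struct shape; the create, fork, and read paths still each need their own round-trip assertion per field.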
What this experiment unlocks: a meta-agent that authors, evolves, promotes, and retires the substrate's def plane — entirely from runtime, with no static yaml scaffolding. That's the building block for prompt evolution, auto-tuning, A/B routing of competing agent versions, and self-improving pipelines in general. The substrate is ensemble-shaped (per RFC S, v0.25) and now meta-agent-capable (RFC W, v0.26.2). Together those are the shape a production agentic system can grow within.
Companion reading from the operator-via-MCP series: exp1+2 — tool access and interruption · exp3 side analysis — the MCP wedge · exp3 main — the multi-agent refine loop (where F14 closed channels/eval/interruption) · exp4 — Gitea + Telegram + secret redaction · exp5 — agent ensembles + RFC S. And the upstream design lock: doc-internal/rfcs/meta-agent-def-scopes.md (RFC W).