SWE-Bench-CL Architecture

Legend: control / data flow · memory read/write (plugin-* only) · per-task result → CL metrics

Keith Mazanec's pipeline · PR #178

Stage matrix & memory-source selection

Each pipeline run drives one stage over one sequence against the persistent per-version substrate. Stages compose across separate invocations through that store rather than in one mega-run. `plugin-accum` and `plugin-dreamed` copy-seed their fresh namespace from a prior run's `_memory/` via `--source-memory` (when the flag is omitted, the newest matching run is auto-discovered); the other stages start blank or carry nothing.

| Stage | memory_mode | Starts from | What it measures |
| --- | --- | --- | --- |
| `base` | `off` | no memory | the memoryless baseline (control) |
| `builtin` | `builtin` | Claude Code's native file memory | native session memory over prior runs |
| `plugin-blank` | `plugin-real` | blank shared substrate | memory built up within this sequence only |
| `plugin-accum` (seeded) | `plugin-real` | copy-seeded from a prior run (`--source-memory`) | accumulation: does carried memory help? |
| `plugin-dreamed` (seeded) | `plugin-real` | copy-seeded, then one dream pass before eval | does consolidating the carried memory help? |
| `plugin-primed` | `plugin-real` | shared substrate (natural recall) | natural-recall prompt + primed stream-json invocation |
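The stage matrix above can be read as a small lookup table. The sketch below is illustrative only: `StageSpec`, its field names, and `accepts_source_memory` are hypothetical and not taken from the pipeline's code. Only the stage names, the `memory_mode` values, and the seeded/dream distinctions come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageSpec:
    """Hypothetical description of one stage (not the pipeline's real type)."""
    memory_mode: str                  # value from the memory_mode column
    seeded: bool = False              # copy-seeded from a prior run?
    dream_before_eval: bool = False   # one dream pass before evaluation?


# Transcribed from the stage matrix; the dict layout itself is an assumption.
STAGES = {
    "base": StageSpec("off"),
    "builtin": StageSpec("builtin"),
    "plugin-blank": StageSpec("plugin-real"),
    "plugin-accum": StageSpec("plugin-real", seeded=True),
    "plugin-dreamed": StageSpec("plugin-real", seeded=True, dream_before_eval=True),
    "plugin-primed": StageSpec("plugin-real"),
}

# Derived set of stages that take a memory source (the seeded rows).
SOURCE_MEMORY_STAGES = frozenset(
    name for name, spec in STAGES.items() if spec.seeded
)


def accepts_source_memory(stage: str) -> bool:
    """Return True if the stage is one of the copy-seeded stages."""
    return stage in SOURCE_MEMORY_STAGES
```

Deriving the seeded set from the table keeps the two from drifting apart; a real pipeline may just as well hard-code the set.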

🌱 Copy-seeding via `--source-memory`

Only `plugin-accum` and `plugin-dreamed` are in `_SOURCE_MEMORY_STAGES`. `_seed_source_memory` resolves the source (a path to `_memory` / `.cookbook-memory`, or a results version slug), checks that it has durable memories, then copies it with `shutil.copytree` into this run's own versioned namespace before evaluation. When `--source-memory` is omitted, `_memory_source_candidates` auto-discovers the newest prior run for the same benchmark and sequence that has durable items. The run must write to a different version than its source.
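A minimal sketch of the copy-seeding flow described above, under stated assumptions. It is not the pipeline's `_seed_source_memory`: the helper names are hypothetical, "durable memories" is approximated as "the directory contains at least one file", and "newest" is approximated by directory modification time. Only the use of `shutil.copytree` and the rule that source and destination must differ come from the text.

```python
import shutil
from pathlib import Path
from typing import Iterable, Optional


def has_durable_memories(memory_dir: Path) -> bool:
    """Assumption: a source is usable if it holds at least one file."""
    return memory_dir.is_dir() and any(p.is_file() for p in memory_dir.rglob("*"))


def newest_candidate(candidates: Iterable[Path]) -> Optional[Path]:
    """Pick the most recently modified candidate that has durable memories.

    Assumption: recency is judged by directory mtime; the real discovery
    logic may order runs differently.
    """
    durable = [c for c in candidates if has_durable_memories(c)]
    return max(durable, key=lambda p: p.stat().st_mtime, default=None)


def seed_memory(source: Path, dest: Path) -> Path:
    """Copy a prior run's memory directory into a fresh namespace."""
    if source.resolve() == dest.resolve():
        raise ValueError("run must write to a different version than its source")
    if not has_durable_memories(source):
        raise ValueError(f"no durable memories found in {source}")
    if dest.exists():
        raise FileExistsError(f"destination namespace already exists: {dest}")
    # copytree creates dest (and any missing parents) and copies recursively.
    shutil.copytree(source, dest)
    return dest
```

Copying instead of mounting the source in place means the seeded run can write new memories without altering the run it inherited from.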

♻️ In-place vs seeded

`plugin-blank` starts from a blank shared substrate and builds memory only within the sequence it runs. The seeded stages instead inherit a prior run's experience, so they isolate the value of carrying (`plugin-accum`) and consolidating (`plugin-dreamed`) memory across runs. The memoryless `base` stage and the native-memory `builtin` stage provide the controls against which the lift is measured.

Why continual learning

Memory accumulates over the sequence

A sequence is a strictly ordered chain of real coding tasks from one repo. As the agent solves each task, the daydream pass writes what it learned to the shared substrate; the next task's recall reads it back. The benchmark scores whether later tasks benefit (plasticity / FWT) without regressing the earlier ones (stability / forgetting / BWT). This is the over-time learning signal that a benchmark of single, independent tasks cannot capture.
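To make the metric names concrete, here is a sketch using one common formulation from the continual-learning literature; the benchmark's exact definitions may differ. It assumes a hypothetical result matrix `R`, where `R[i][j]` is the score on task `j` after the agent has worked through task `i`, and a hypothetical `baseline[j]` giving the score on task `j` with no carried memory.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def backward_transfer(R: Matrix) -> float:
    """Mean change on earlier tasks between first solving them and the end.

    Negative values indicate regression on earlier tasks.
    """
    T = len(R)
    if T < 2:
        raise ValueError("need at least two tasks")
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


def forward_transfer(R: Matrix, baseline: Sequence[float]) -> float:
    """Mean gain on each task, before working on it, over the baseline."""
    T = len(R)
    if T < 2:
        raise ValueError("need at least two tasks")
    return sum(R[i - 1][i] - baseline[i] for i in range(1, T)) / (T - 1)


def forgetting(R: Matrix) -> float:
    """Mean drop from each earlier task's best past score to its final score."""
    T = len(R)
    if T < 2:
        raise ValueError("need at least two tasks")
    drops = (
        max(R[k][i] for k in range(i, T - 1)) - R[T - 1][i]
        for i in range(T - 1)
    )
    return sum(drops) / (T - 1)


# Toy three-task example (made-up numbers, purely illustrative).
R = [
    [1.0, 0.0, 0.0],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
baseline = [0.0, 0.0, 0.0]

print(backward_transfer(R))           # -0.5
print(forward_transfer(R, baseline))  # 0.25
print(forgetting(R))                  # 0.5
```

In this toy matrix the agent loses half its score on each earlier task by the end (BWT of -0.5, forgetting of 0.5) while carrying some benefit forward to the third task (FWT of 0.25).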

