SWE-Bench-CL is the project's primary CODE benchmark (alongside VISTA). It reorganizes
SWE-bench Verified into 8 per-repo sequences (273 tasks) and scores how a coding agent
learns, retains, and transfers across the strictly-ordered tasks of each sequence. The harness drives a
real ClaudeCodeAgent task-by-task, grades each patch by applying it and running the gold
tests, and — across a sequence — lets memory accumulate so later tasks benefit (plasticity) without
regressing earlier ones (stability / forgetting). Keith Mazanec's single-stage pipeline (PR #178) selects which
memory substrate each run starts from, so the stages compose across separate invocations through a persistent
store rather than within one run.
Each pipeline run drives one stage over one sequence against the
persistent per-version substrate. Stages compose across separate invocations through that store, rather than in
one mega-run. plugin-accum and plugin-dreamed copy-seed their fresh namespace from a
prior run's _memory/ via --source-memory (auto-discovering the newest matching run
when omitted); the others start blank or carry nothing.
| Stage | memory_mode | Starts from | What it measures |
|---|---|---|---|
base |
off |
no memory | the memoryless baseline (control) |
builtin |
builtin |
Claude Code's native file memory | native session memory over prior runs |
plugin-blank |
plugin-real |
blank shared substrate | memory built up within this sequence only |
plugin-accum seeded |
plugin-real |
copy-seeded from a prior run (--source-memory) |
accumulation — does carried memory help? |
plugin-dreamed seeded |
plugin-real |
copy-seeded, then one dream pass before eval | does consolidating the carried memory help? |
plugin-primed |
plugin-real |
shared substrate (natural recall) | natural-recall prompt + primed stream-json invocation |
--source-memoryOnly plugin-accum and plugin-dreamed are in _SOURCE_MEMORY_STAGES.
_seed_source_memory resolves the source (a path to _memory / .cookbook-memory,
or a results version slug), checks it has durable memories, then shutil.copytrees it into this
run's own versioned namespace before evaluation. When --source-memory is omitted,
_memory_source_candidates auto-discovers the newest prior run for the same benchmark+sequence
that has durable items. The run must write to a different version than its source.
plugin-blank starts from a blank shared substrate and only builds memory within the
sequence it runs. The seeded stages instead inherit a prior run's experience, so they isolate the value of
carrying (accum) and consolidating (dreamed) memory across runs. The
memoryless base and native-builtin builtin stages provide the controls the lift is
measured against.
A sequence is a strictly-ordered chain of real coding tasks from one repo. As the agent solves each task, the daydream pass writes what it learned to the shared substrate; the next task's recall reads it back. The benchmark scores whether later tasks benefit (plasticity / FWT) without regressing the earlier ones (stability / forgetting / BWT) — the over-time learning signal that a single independent-task benchmark cannot capture.
SWE-Bench-CL on the benchmarks page → General architecture →