SWE-Bench-CL Architecture

SWE-Bench-CL is the project's primary CODE benchmark (alongside VISTA). It reorganizes SWE-bench Verified into 8 per-repo sequences (273 tasks) and scores how a coding agent learns, retains, and transfers across the strictly ordered tasks of each sequence. The harness drives a real ClaudeCodeAgent task by task and grades each patch by applying it and running the gold tests. Across a sequence, memory accumulates so that later tasks can benefit (plasticity) without regressing earlier ones (stability / forgetting). Keith Mazanec's single-stage pipeline (PR #178) selects which memory substrate each run starts from, so the stages compose across separate invocations through a persistent store rather than within one run.
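The loop described above can be sketched in a few lines. This is a minimal illustration, not the harness itself: only `group_id` and `Task.order` come from this page, while `MemoryStore`, `run_sequence`, `solve`, and `grade` are hypothetical stand-ins for the agent, the grader, and the memory substrate.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Task:
    group_id: str  # sequence (repo) the task belongs to
    order: int     # strict position within the sequence
    issue: str     # issue text shown to the agent


@dataclass
class MemoryStore:
    """Hypothetical stand-in for the persistent substrate: a list of notes."""
    notes: List[str] = field(default_factory=list)

    def recall(self, query: str) -> List[str]:
        # Assumption: no ranking, every stored note is returned.
        return list(self.notes)

    def write(self, note: str) -> None:
        self.notes.append(note)


def run_sequence(
    tasks: List[Task],
    solve: Callable[[Task, List[str]], str],
    grade: Callable[[Task, str], bool],
    memory: Optional[MemoryStore] = None,
) -> List[bool]:
    """Run tasks in strict order; memory=None models the memoryless base stage."""
    results = []
    for task in sorted(tasks, key=lambda t: t.order):
        # Recall happens before the agent edits anything.
        recalled = memory.recall(task.issue) if memory is not None else []
        patch = solve(task, recalled)
        resolved = grade(task, patch)
        results.append(resolved)
        # The write happens between tasks, so the next task can read it.
        if memory is not None:
            memory.write(f"{task.group_id}#{task.order}: resolved={resolved}")
    return results
```

Passing `memory=None` gives the control run; passing a shared `MemoryStore` lets each task see the notes left by every earlier task in the same sequence.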

Figure: SWE-Bench-CL architecture (diagram rendered as text).

① SWE-Bench-CL sequence: ordered per-repo task chain (group_id, strict Task.order)
  - 8 sequences · 273 tasks, e.g. django · sympy · sphinx · matplotlib · scikit-learn · astropy · xarray · pytest
  - Tasks t₁ (order 1), t₂ (order 2), t₃ (order 3), ..., t_N (order N)
  - Each task = one SWE-bench Verified instance (repo · base commit · issue · gold patch · gold test_patch)
  - For each task, in order → ②

② Per-task execution loop
  - ClaudeCodeAgent runs the `claude` CLI in a sandboxed config dir
  - model: haiku-4-5 / sonnet-4-6 / opus-4-8
  - Reads the repo, edits files → produces a patch
  - code_mode: agentic (default) · blind
  - cookbook-memory recall (plugin-* stages only, mode = plugin-real): BEFORE editing, the agent calls the MCP `recall` tool with the issue text and retrieves prior fixes for this repo. The base stage has no memory (mode = off).
  - Output: model_patch (unified diff) → ③

③ Grader: apply + test (real resolve rate)
  - Graders: SwebenchHostGrader (default `swebench`, Docker-free) · LocalExecGrader (`local`/`auto`, host venv) · overlap · none
  - Steps: 1 · revert any agent edits to test files; 2 · apply model_patch; 3 · apply gold test_patch; 4 · run tests
  - Shared venv prewarmed once per sequence (one repo@version)
  - resolved? pass / fail → next task in sequence (advance Task.order)
  - write / carry → memory: the plugin Stop hook fires the Daydream pass between tasks and extracts what to remember

④ Persistent memory substrate
  - Location: results/v{version}/_memory/.cookbook-memory
  - Stores: Markdown store · SQLite-vector (memory.db · items) · Graph (graph.db · nodes)
  - Carries across the whole sequence: recall reads it before each task; daydream writes to it after each task.
  - Dream (night) pass: TTL prune, dedup, contradiction + governance consolidation.

⑤ Native continual-learning metrics: SWEBenchCLNativeEvaluator (arXiv:2507.00014)
  - Three result vectors per sequence: initial_pass a(i,i) [mem-ON, first solve] · final_state a(N,j) [mem-ON end-of-sequence re-test] · mem-OFF zero-shot baseline
  - Forgetting (F) = Σ (initial − final) / (N − 1); lower is better
  - BWT (backward transfer) = −F
  - FWT (forward transfer) = mem-ON − mem-OFF
  - AULC = area under the learning curve
  - CL-Plasticity = mean diagonal a(i,i) (learns new tasks)
  - CL-Stability = 1 − F (retains old tasks)
  - ACC = (1/N) Σ final_state
  - CL-F1 = harmonic mean(plasticity, stability)
  - CL-Score = ACC − F + FWT + BWT + AULC (all λ = 1.0)
  - Also: per-difficulty strata (1–4) · memory-ON vs memory-OFF accuracy contrast · tool-use efficiency · snapshot-phase (initial vs final)
  - Headline numbers macro-average each metric across the 8 sequences.
  - The native evaluator runs its own mem-on / re-test / mem-off A/B and assumes a per-sequence reset; the pipeline's shared accumulating substrate deliberately breaks that, so its native CL report is captured as comparative, not paper-faithful (resolve-rate accuracy is primary).

Data flow: ④ → ② (recall reads) · ③ → ④ (daydream writes) · ③ → ⑤ (per-task resolved? → CL vectors)

Legend: control / data flow · memory read/write (plugin-* only) · per-task result → CL metrics
Keith Mazanec's pipeline · PR #178
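The metric definitions in panel ⑤ can be worked through on three 0/1 result vectors. The sketch below follows the formulas as stated on this page; three points are assumptions of the sketch, not confirmed behavior of SWEBenchCLNativeEvaluator: the forgetting sum runs over the first N − 1 tasks, FWT is the difference of mean accuracies, and AULC is taken as the mean running accuracy over initial_pass. The function names `cl_metrics` and `macro_average` are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence


def cl_metrics(
    initial_pass: Sequence[float],
    final_state: Sequence[float],
    mem_off: Sequence[float],
) -> Dict[str, float]:
    """Per-sequence CL metrics from three equally long 0/1 result vectors."""
    n = len(initial_pass)
    if n < 2 or len(final_state) != n or len(mem_off) != n:
        raise ValueError("need three vectors of equal length N >= 2")

    # Assumption: sum over the first N-1 tasks, matching the 1/(N-1) factor.
    forgetting = sum(
        i - f for i, f in zip(initial_pass[:-1], final_state[:-1])
    ) / (n - 1)
    bwt = -forgetting
    plasticity = mean(initial_pass)        # mean diagonal a(i,i)
    stability = 1 - forgetting
    fwt = plasticity - mean(mem_off)       # assumption: difference of means
    acc = mean(final_state)
    # Assumption: AULC as the mean of the running accuracy curve.
    aulc = mean(sum(initial_pass[:k]) / k for k in range(1, n + 1))
    denom = plasticity + stability
    cl_f1 = 2 * plasticity * stability / denom if denom > 0 else 0.0
    cl_score = acc - forgetting + fwt + bwt + aulc  # all lambda = 1.0
    return {
        "forgetting": forgetting, "bwt": bwt, "fwt": fwt, "aulc": aulc,
        "plasticity": plasticity, "stability": stability,
        "acc": acc, "cl_f1": cl_f1, "cl_score": cl_score,
    }


def macro_average(per_sequence: List[Dict[str, float]]) -> Dict[str, float]:
    """Headline numbers: average each metric across sequences."""
    return {k: mean(m[k] for m in per_sequence) for k in per_sequence[0]}
```

Because BWT is defined here as −F, forgetting enters CL-Score twice with the same sign, so a sequence that regresses earlier tasks is penalized on both terms.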

Stage matrix & memory-source selection

Each pipeline run drives one stage over one sequence against the persistent per-version substrate. Stages compose across separate invocations through that store, rather than in one mega-run. plugin-accum and plugin-dreamed copy-seed their fresh namespace from a prior run's _memory/ via --source-memory (auto-discovering the newest matching run when omitted); the others start blank or carry nothing.

| Stage | memory_mode | Starts from | What it measures |
|---|---|---|---|
| base | off | no memory | the memoryless baseline (control) |
| builtin | builtin | Claude Code's native file memory | native session memory over prior runs |
| plugin-blank | plugin-real | blank shared substrate | memory built up within this sequence only |
| plugin-accum (seeded) | plugin-real | copy-seeded from a prior run (--source-memory) | accumulation: does carried memory help? |
| plugin-dreamed (seeded) | plugin-real | copy-seeded, then one dream pass before eval | does consolidating the carried memory help? |
| plugin-primed | plugin-real | shared substrate (natural recall) | natural-recall prompt + primed stream-json invocation |
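The stage matrix above maps naturally onto a small lookup structure. The sketch below is illustrative only: the stage names and memory_mode values are taken from the table, while the `Stage` dataclass, the `STAGES` dict, and `accepts_source_memory` are hypothetical names, not the pipeline's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    memory_mode: str                 # "off", "builtin", or "plugin-real"
    seeded: bool                     # copy-seeded from a prior run's _memory/
    dream_before_eval: bool = False  # one dream pass before evaluation


# One entry per row of the stage matrix.
STAGES = {
    s.name: s
    for s in (
        Stage("base", "off", seeded=False),
        Stage("builtin", "builtin", seeded=False),
        Stage("plugin-blank", "plugin-real", seeded=False),
        Stage("plugin-accum", "plugin-real", seeded=True),
        Stage("plugin-dreamed", "plugin-real", seeded=True,
              dream_before_eval=True),
        Stage("plugin-primed", "plugin-real", seeded=False),
    )
}


def accepts_source_memory(stage_name: str) -> bool:
    """True for the stages that take --source-memory (the seeded ones)."""
    return STAGES[stage_name].seeded
```

Keeping the seeded flag on the stage record makes it easy to reject --source-memory for stages that start blank or carry nothing.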

🌱 Copy-seeding via --source-memory

Only plugin-accum and plugin-dreamed are in _SOURCE_MEMORY_STAGES. _seed_source_memory resolves the source (a path to _memory / .cookbook-memory, or a results version slug), checks that it contains durable memories, then copies it with shutil.copytree into this run's own versioned namespace before evaluation. When --source-memory is omitted, _memory_source_candidates auto-discovers the newest prior run for the same benchmark+sequence that has durable items. The run must write to a different version than its source.
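The copy-seeding flow can be sketched as follows. This is not the body of _seed_source_memory or _memory_source_candidates: the helper names are hypothetical, "durable" is assumed to mean "at least one non-empty file", "newest" is assumed to mean most recent directory mtime, and the benchmark+sequence matching is left out for brevity.

```python
import shutil
from pathlib import Path
from typing import Optional


def has_durable_memories(memory_dir: Path) -> bool:
    # Assumption: durable means at least one non-empty file under the dir.
    return memory_dir.is_dir() and any(
        p.is_file() and p.stat().st_size > 0 for p in memory_dir.rglob("*")
    )


def newest_source(results_root: Path, exclude_version: str) -> Optional[Path]:
    """Pick the most recently modified prior _memory dir with durable items."""
    candidates = [
        d / "_memory"
        for d in results_root.iterdir()
        if d.is_dir()
        and d.name != exclude_version
        and has_durable_memories(d / "_memory")
    ]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def seed_memory(source: Path, results_root: Path, target_version: str) -> Path:
    """Copy a prior run's memory into this run's own versioned namespace."""
    target = results_root / target_version / "_memory"
    if source.resolve() == target.resolve():
        raise ValueError("a run must write to a different version than its source")
    if not has_durable_memories(source):
        raise ValueError(f"no durable memories found in {source}")
    # dirs_exist_ok requires Python 3.8+.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

Copying rather than pointing at the source keeps the prior run's store untouched, so the same source can seed several later runs.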

♻️ In-place vs seeded

plugin-blank starts from a blank shared substrate and builds memory only within the sequence it runs. The seeded stages instead inherit a prior run's experience, so they isolate the value of carrying (accum) and consolidating (dreamed) memory across runs. The memoryless base stage and the native-memory builtin stage provide the controls against which the lift is measured.

Why continual learning

Memory accumulates over the sequence

A sequence is a strictly ordered chain of real coding tasks from one repo. As the agent solves each task, the daydream pass writes what it learned to the shared substrate; the next task's recall reads it back. The benchmark scores whether later tasks benefit (plasticity / FWT) without regressing the earlier ones (stability / forgetting / BWT). This is the learning-over-time signal that a benchmark of independent tasks cannot capture.
