Overview / Plan

Two-week plan, four people

Codename: Cookbook Memory. The problem, the technical approach, the scope, who owns what, and a ten-day timeline — four parallel workstreams locked behind a shared interface, converging on a single end-to-end run.

Why this exists

The problem

There is no standard, model-agnostic layer that decides what to remember, where to store it, how to retrieve it fast, and how to keep it consistent over time for long-running agents.

💾

What

signal vs. noise

🗂️

Where

the right store

Fast

no context flood

♻️

Consistent

dedup, no conflicts

How we'll build it

Technical approach

A pluggable four-module memory harness over three indexed storage backends, wired into OpenCode and measured on existing public benchmarks — no home-grown evals.

💾

Persistence

Decides what/when/where to save; tags every item.

🧭

Router

Routes each query to the single best backend.

🔎

Orchestrator

Ranks by recency × relevancy, dedups, returns.

🌙

Dreaming

While agents sleep: dedup, conflict resolution, governance.
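
The Orchestrator's rank, dedup, return step can be sketched roughly as follows. This is a minimal illustration, not the harness implementation: the `Hit` fields, the exponential half-life decay used for recency, and the case/whitespace normalisation used as the dedup key are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    text: str
    relevance: float    # backend similarity score, assumed to lie in [0, 1]
    age_seconds: float  # time since the item was written


def recency(age_seconds: float, half_life: float = 86_400.0) -> float:
    """Exponential decay: 1.0 for a new item, 0.5 after one half-life (assumed: one day)."""
    return 0.5 ** (age_seconds / half_life)


def rank(hits: list[Hit], top_k: int = 3) -> list[Hit]:
    """Sort by recency x relevance, drop exact duplicates, return the top_k survivors."""
    scored = sorted(hits, key=lambda h: recency(h.age_seconds) * h.relevance, reverse=True)
    seen: set[str] = set()
    out: list[Hit] = []
    for hit in scored:
        key = " ".join(hit.text.lower().split())  # normalise case and whitespace
        if key in seen:
            continue  # a higher-scored copy was already kept
        seen.add(key)
        out.append(hit)
        if len(out) == top_k:
            break
    return out
```

Multiplying the two scores means a stale item needs very high relevance to outrank a fresh one; a weighted sum would be an equally plausible choice.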

Three storage backends, each indexed

  • Markdown + YAML — literal recall · inverted keyword index.
  • SQLite + vectors — semantic search · HNSW / FAISS ANN.
  • Graph store — relationships & conflicts · typed traversal.
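
One way to picture "three backends behind one interface" is a small protocol plus one toy backend. The sketch below is illustrative only: the `MemoryStore` protocol, its `put`/`search` methods, and the in-memory `KeywordStore` (standing in for the Markdown backend's inverted keyword index) are hypothetical names, not the project's actual contract.

```python
from collections import defaultdict
from typing import Protocol


class MemoryStore(Protocol):
    """Hypothetical contract every backend adapter would satisfy."""

    def put(self, item_id: str, text: str) -> None: ...
    def search(self, query: str, limit: int = 5) -> list[str]: ...


class KeywordStore:
    """Toy inverted index: token -> ids of the items containing it."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}
        self._index: dict[str, set[str]] = defaultdict(set)

    @staticmethod
    def _tokens(text: str) -> set[str]:
        return set(text.lower().split())

    def put(self, item_id: str, text: str) -> None:
        # Drop stale postings first if the item is being overwritten.
        for token in self._tokens(self._docs.get(item_id, "")):
            self._index[token].discard(item_id)
        self._docs[item_id] = text
        for token in self._tokens(text):
            self._index[token].add(item_id)

    def search(self, query: str, limit: int = 5) -> list[str]:
        # Score each item by how many query tokens it contains.
        counts: dict[str, int] = defaultdict(int)
        for token in self._tokens(query):
            for item_id in self._index.get(token, ()):
                counts[item_id] += 1
        ranked = sorted(counts, key=lambda i: (-counts[i], i))
        return ranked[:limit]


store: MemoryStore = KeywordStore()  # any adapter with the same methods would fit here
```

A vector or graph adapter would implement the same two methods, which is what lets the router treat backends as interchangeable.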

Validated, not invented

Integrated into OpenCode. Per benchmark, a controlled grid runs Haiku + harness vs. Haiku, Opus 4.8 and Sonnet — prompts and task order fixed, every trajectory logged.

Full architecture & data flow →

Boundaries

Scope

What the sprint commits to — and what it doesn't.

In scope

  • The memory harness: all four modules, end to end.
  • Three storage adapters behind one interface, each indexed.
  • The intelligent router (rule-based; learned-classifier hook).
  • The async dreaming component: dedup, conflict resolution, governance, retention.
  • OpenCode integration for write/read on each agent step.
  • Eval harness on five benchmarks; baselines for Haiku, Opus 4.8, Sonnet.
  • Metric definitions, a reproducible protocol, and a results dashboard.

Out of scope (non-goals)

  • Building our own benchmark or dataset — we use public ones.
  • Training or fine-tuning any model.
  • Production hardening: multi-tenant infra, auth, SLAs, polished UI.
  • Distributed / scaled deployment — single-node is enough for the sprint.
  • Validating non-coding agents — design aims to generalize, not tested now.

Timeline at a glance

Sprint map

Week 1 = days 1–5 (build against frozen contracts; cheap no-memory baselines start ~D4). Week 2 = days 6–10 (wire together, run the Haiku+harness treatment on sharded keys, ship).

Sprint map, D1–D10:

| Stream | Week 1 (D1–D5) | Week 2 (D6–D10) |
|---|---|---|
| Keith · Harness | Abstraction + persistence | Orchestrator · OpenCode e2e |
| Ken · Eval infra | Datasets · run harness · metrics | LongMemEval shard + dashboard |
| Brent · Store + Retrieve | 3 backends + indexes | Adapters + router |
| Scott B. · Dreaming | Dedup + conflict rules | Async worker + governance |
| All · Eval (own keys) | No-mem baselines | Haiku+mem · E2E |

Legend: Keith · P1, Ken · P2, Brent · P3, Scott B. · P4. Eval: baselines wk1 · treatment wk2 (sharded keys).

Who owns what

Roles & deliverables

Keith · Person 1

🏗️ Harness Architecture & OpenCode Integration

Critical path — the contracts everyone builds against.

Week 1
  • Design the four-module abstraction & storage interface (swappable backends).
  • Build the persistence layer: write logic, versioning, metadata tagging.
  • Instrument OpenCode to write memory on each agent step.
Week 2
  • Build the retrieval orchestrator (rank → dedup → return).
  • Wire in Brent's router; run an end-to-end task.
  • Captain the SWE-Bench-CL runs on his own API key.
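
The persistence layer's write path (write logic, versioning, metadata tagging) might look roughly like this. It is a sketch under assumptions: the `MemoryItem` fields and the append-only `PersistenceLog` are hypothetical, and the real memory-item schema is whatever gets frozen on Day 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    item_id: str
    text: str
    tags: frozenset
    version: int


class PersistenceLog:
    """Append-only write path: each write adds a new version, nothing is overwritten."""

    def __init__(self) -> None:
        self._history: dict[str, list[MemoryItem]] = {}

    def write(self, item_id: str, text: str, tags=()) -> MemoryItem:
        history = self._history.setdefault(item_id, [])
        item = MemoryItem(item_id, text, frozenset(tags), version=len(history) + 1)
        history.append(item)
        return item

    def latest(self, item_id: str) -> MemoryItem:
        return self._history[item_id][-1]

    def versions(self, item_id: str) -> list[MemoryItem]:
        return list(self._history[item_id])
```

Keeping old versions gives the dreaming worker something to compare when it resolves conflicts later.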
Ken · Person 2

📊 Evaluation Infrastructure & Coordination

Shared runner, metrics, dashboard — and captains one benchmark.

Week 1
  • Download & parse all five benchmark datasets; loaders + trajectory logging (with Scott B.).
  • Define the four metrics; build the shared run harness: run(benchmark, model, memory) → metrics.
  • Stand up the cost/budget tracker and per-key config.
Week 2
  • Captain the LongMemEval runs on his own API key.
  • Aggregate every captain's results; stats + results & cost dashboard; document the protocol.
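
The shared entry point run(benchmark, model, memory) → metrics could take a shape like the one below. This is only a sketch: the extra `agent` parameter, the `Metrics` fields, and the single accuracy figure are placeholders invented for illustration, since the four real metrics are defined elsewhere.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Metrics:
    tasks: int
    solved: int

    @property
    def accuracy(self) -> float:
        return self.solved / self.tasks if self.tasks else 0.0


# Hypothetical agent signature: (task, model, memory_enabled) -> solved?
Agent = Callable[[str, str, bool], bool]


def run(benchmark: list[str], model: str, memory: bool, agent: Agent) -> Metrics:
    """Run every task once under one (model, memory) config and tally the outcome."""
    solved = sum(1 for task in benchmark if agent(task, model, memory))
    return Metrics(tasks=len(benchmark), solved=solved)
```

Because every captain calls the same function, results from different keys stay directly comparable when they are aggregated.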
Brent · Person 3

🗄️ Storage Implementation & Router

Three fast backends + the dispatch layer.

Week 1
  • SQLite + vector pipeline (embedding model, HNSW/FAISS index).
  • Markdown store with YAML frontmatter + inverted index.
  • Graph store schema (Neo4j) with typed traversal index.
Week 2
  • Adapters for all three backends behind Keith's interface.
  • Implement the router; performance-test each backend.
  • Captain the SWE-ContextBench & ContextBench runs on his own API key.
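
A rule-based router with a hook for a learned classifier might be sketched like this. The backend labels, the regular expressions, and the `classifier` callback are all illustrative assumptions; the actual rules are Brent's to define.

```python
import re
from typing import Callable, Optional

BACKENDS = ("markdown", "vector", "graph")

# Hypothetical hook: a learned classifier returns a backend name, or None to abstain.
Classifier = Callable[[str], Optional[str]]


def route(query: str, classifier: Optional[Classifier] = None) -> str:
    """Pick exactly one backend for a query."""
    if classifier is not None:
        choice = classifier(query)
        if choice in BACKENDS:
            return choice  # trust the learned model when it gives a valid answer
    # Rule 1: quoted strings or backticked identifiers suggest literal recall.
    if re.search(r'"[^"]+"|`[^`]+`', query):
        return "markdown"
    # Rule 2: relationship or conflict wording suggests the graph store.
    if re.search(r"\b(depends on|related to|conflicts? with|caused by)\b", query.lower()):
        return "graph"
    # Default: semantic search.
    return "vector"
```

Falling back to the rules when the classifier abstains or returns an unknown label keeps the rule-based path usable before any classifier exists.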
Scott B. · Person 4

🌙 Dreaming Component & Memory Governance

The offline engine that keeps memory clean.

Week 1
  • Deduplication logic (exact / semantic / near-duplicate).
  • Conflict detection + reconciliation rules (recency, confidence, source).
  • Session-filter & blacklist semantics.
  • Co-build trajectory logging with Ken — the dreaming worker consumes it.
Week 2
  • Async offline scheduler; must-know / must-do extraction.
  • Retention & pruning; observability logs; integration with Keith & Brent.
  • Captain the MemoryAgentBench runs on his own API key.
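
The dedup and reconciliation rules could start from something as simple as the sketch below. It covers exact and near-duplicate checks only (semantic dedup would need an embedding model), and the Jaccard threshold, the `Fact` fields, the source trust ordering, and the recency-then-confidence-then-source precedence are all assumptions, not agreed rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    timestamp: float
    confidence: float
    source: str


SOURCE_RANK = {"user": 2, "tool": 1, "model": 0}  # hypothetical trust ordering


def is_exact_duplicate(a: str, b: str) -> bool:
    return a.lower().split() == b.lower().split()


def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    return jaccard(a, b) >= threshold


def _priority(fact: Fact) -> tuple:
    # Assumed precedence: newest first, then most confident, then most trusted source.
    return (fact.timestamp, fact.confidence, SOURCE_RANK.get(fact.source, -1))


def reconcile(facts: list[Fact]) -> dict[str, Fact]:
    """Keep one winning fact per key."""
    winners: dict[str, Fact] = {}
    for fact in facts:
        current = winners.get(fact.key)
        if current is None or _priority(fact) > _priority(current):
            winners[fact.key] = fact
    return winners
```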

RACI, lightweight

Who owns what

Every deliverable has one accountable owner. The interface is frozen on Day 3 so the four streams never block on each other.

P1 Keith  ·  P2 Ken  ·  P3 Brent  ·  P4 Scott B.

| Area / deliverable | Owner | Supporting |
|---|---|---|
| Storage interface & memory-item schema | P1 | P3 |
| Persistence layer (write path, tagging, versioning) | P1 | |
| Retrieval orchestrator (rank · dedup · return) | P1 | P3 |
| OpenCode integration (write/read each step) | P1 | |
| Three storage backends + per-backend indexes | P3 | P1 |
| Intelligent router (rules → learned) | P3 | P1 |
| Backend performance testing | P3 | P2 |
| Dreaming worker (dedup · conflict · governance · retention) | P4 | P1, P3 |
| Memory semantics ("what is good memory") | P4 | P2 |
| Datasets, loaders & trajectory logging | P2 | P4 |
| Metric defs, shared run harness & cost gates | P2 | P4 |
| Results + cost dashboard, stats, aggregation | P2 | |
| Run SWE-Bench-CL eval (own key) | P1 | |
| Run LongMemEval eval (own key) | P2 | |
| Run SWE-ContextBench eval (own key) | P3 | |
| Run MemoryAgentBench eval (own key) | P4 | |
| Run ContextBench eval (own key) | P3 | |
| Final end-to-end integration | All | |

Running the evals

Divide & conquer the benchmark runs

~20 runs (5 benchmarks × 4 configs) is too much cost and time for one person on one key. Each teammate captains the benchmark(s) that stress their own component, on their own API budget — runs go wide, not deep.

| Benchmark | Captain | Why them | Configs |
|---|---|---|---|
| SWE-Bench-CL | Keith | Drives the coding agent end-to-end; his OpenCode wheelhouse. | 4 |
| LongMemEval | Ken | Eval lead; recency / temporal reasoning. | 4 |
| SWE-ContextBench | Brent | Context reuse exercises his retrieval + router. | 4 |
| MemoryAgentBench | Scott B. | Tests conflict resolution, the job of his dreaming component. | 4 |
| ContextBench | Brent | Retrieval quality (gold contexts); same retrieval/router he owns. | 4 |

Whoever can best debug a bad number is the one watching it. Ken owns the shared runner and aggregates results.
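
As a quick check on the arithmetic, the run grid and each captain's share can be enumerated directly from the captain assignments. The short config labels are just names chosen for this sketch.

```python
from collections import Counter
from itertools import product

# Benchmark -> captain, as assigned in this plan.
CAPTAINS = {
    "SWE-Bench-CL": "Keith",
    "LongMemEval": "Ken",
    "SWE-ContextBench": "Brent",
    "MemoryAgentBench": "Scott B.",
    "ContextBench": "Brent",
}
CONFIGS = ("Haiku+mem", "Haiku", "Sonnet", "Opus")

# Every (benchmark, config) cell: 5 x 4 = 20 runs.
GRID = list(product(CAPTAINS, CONFIGS))

# How many of those cells land on each captain's key.
RUNS_PER_CAPTAIN = Counter(CAPTAINS[benchmark] for benchmark, _ in GRID)
```

The count makes one imbalance visible: Brent captains two benchmarks, so his key carries 8 of the 20 cells while the others carry 4 each.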

Cost & throughput controls

🔑 Sharded keys

Each captain runs on a separate API key/account. Across four captains that gives roughly 4× the rate limit of a single key in aggregate, isolated per-benchmark budgets, and no single-key throttle.

💸 Cheapest-first + early-exit

Order configs Haiku+mem → Haiku → Sonnet → Opus. Opus (priciest) runs last, as the confirmation run against the target. The early exit applies to Sonnet: skip it if the cheaper tiers already settle the question.

🎯 Dev slice → full

Iterate on a fixed ~10–15% stratified subset, then do one full run per config. Use LongMemEval_S (~115k tokens of history per question) to iterate; reserve LongMemEval_M (~1.5M tokens) for a single final confirmation.
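
A fixed, stratified dev slice can be drawn deterministically along these lines. The sketch assumes each task carries a stratum label (for example a difficulty or question type); `dev_slice`, its parameters, and the seed are hypothetical.

```python
import random
from collections import defaultdict


def dev_slice(tasks: list[tuple[str, str]], fraction: float = 0.1, seed: int = 0) -> list[str]:
    """Pick about `fraction` of the task ids from every stratum, reproducibly.

    `tasks` is a list of (task_id, stratum) pairs.
    """
    by_stratum: dict[str, list[str]] = defaultdict(list)
    for task_id, stratum in tasks:
        by_stratum[stratum].append(task_id)

    rng = random.Random(seed)  # fixed seed, so the slice is the same on every machine
    chosen: list[str] = []
    for stratum in sorted(by_stratum):
        ids = sorted(by_stratum[stratum])  # sort so input order does not matter
        k = min(len(ids), max(1, round(len(ids) * fraction)))
        chosen.extend(rng.sample(ids, k))
    return sorted(chosen)
```

Sampling per stratum keeps rare task types represented in the slice, which a plain random 10% could miss.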

♻️ Cache + resume

Content-hash every (task, model, config) call and checkpoint per task, so a crash or re-run never re-pays — critical on long SWE-bench trajectories.
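
A content-hashed cache with per-task checkpoints could be as small as the sketch below. The `ResultCache` class, the one-JSON-file-per-call layout, and the `fn` callback are assumptions for illustration; results are assumed to be JSON-serialisable.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def call_key(task: str, model: str, config: dict) -> str:
    """Stable SHA-256 over the (task, model, config) triple."""
    payload = json.dumps({"task": task, "model": model, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResultCache:
    """One JSON file per call; a finished call is never paid for twice."""

    def __init__(self, root: Path) -> None:
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def get_or_run(self, task: str, model: str, config: dict, fn: Callable) -> dict:
        path = self.root / f"{call_key(task, model, config)}.json"
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))  # resume: reuse checkpoint
        result = fn(task, model, config)
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(result), encoding="utf-8")
        tmp.replace(path)  # rename into place so a crash never leaves a half-written file
        return result
```

Sorting the JSON keys before hashing means two configs that differ only in key order map to the same cache entry.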

🚧 Hard budget gates

Per-benchmark $ and token ceiling in the runner; abort and log partial results on overrun — no silent cost blow-ups.
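
A hard gate in the runner might work like this sketch. `BudgetGate`, `run_with_gate`, and the `step` callback returning (usd, tokens, output) are hypothetical; how cost is actually metered depends on the provider's usage reporting.

```python
import logging
from typing import Callable

# Hypothetical per-task step: returns (usd_spent, tokens_used, output).
Step = Callable[[str], tuple]


class BudgetGate:
    def __init__(self, max_usd: float, max_tokens: int) -> None:
        self.max_usd = max_usd
        self.max_tokens = max_tokens
        self.usd = 0.0
        self.tokens = 0

    def charge(self, usd: float, tokens: int) -> None:
        self.usd += usd
        self.tokens += tokens

    def exhausted(self) -> bool:
        return self.usd >= self.max_usd or self.tokens >= self.max_tokens


def run_with_gate(tasks: list[str], gate: BudgetGate, step: Step):
    """Run tasks until the ceiling is hit; return (partial results, stopped_early)."""
    done = []
    for task in tasks:
        if gate.exhausted():
            logging.warning(
                "budget gate hit after %d tasks: $%.2f, %d tokens",
                len(done), gate.usd, gate.tokens,
            )
            return done, True
        usd, tokens, output = step(task)
        gate.charge(usd, tokens)
        done.append((task, output))
    return done, False
```

The check happens before each task, so the last task that was started is always completed and recorded rather than cut off mid-trajectory.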

🕒 Baselines in week 1

No-memory baselines need only the model + dataset, not the harness — start them ~D4 to flatten the week-2 spike and surface API/dataset issues early.

⚖️

Must-run cells: Haiku + harness (treatment) and Opus 4.8 (target). Haiku no-memory is cheap and stays; Sonnet is optional if budget is tight.

Critical path

Dependency map

What blocks what across the two weeks. The Day-3 interface freeze unblocks every stream; the bold chain is the critical path to ship.

Dependency map: the Day-3 interface freeze unblocks the datasets/harness, storage backends, and persistence+orchestrator work; backends → router → orchestrator, backends → dreaming, all converging on Integration (D8) then Ship (D10). Critical path: freeze → backends → router → orchestrator → integration → ship.
Legend: Keith · Ken · Brent · Scott B. · Gate / all · Critical path

Dependencies, in words

  • Keith + Brent (D1–D3): co-author and freeze the storage interface + memory-item schema, so persistence and the adapters build against one contract.
  • Brent → Keith: the router lands before the orchestrator can route in week 2.
  • Ken → all (by D5): datasets, trajectory logging and the shared run harness ready, so no-memory baselines can start ~D4–D6.
  • Keith + Brent → Scott B.: the dreaming worker integrates once real storage exists (D8+).
  • Sharded keys: each captain runs on a separate API budget — baselines week 1, the Haiku+harness treatment week 2.
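
The dependency map can also be written down as data and checked mechanically. The node names and the exact edge set below are one reading of the map, not an authoritative encoding; the sketch uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# node -> set of nodes it waits on (one interpretation of the dependency map)
DEPENDS_ON = {
    "datasets + run harness": {"interface freeze"},
    "storage backends": {"interface freeze"},
    "persistence": {"interface freeze"},
    "router": {"storage backends"},
    "orchestrator": {"interface freeze", "router"},
    "dreaming": {"storage backends"},
    "integration": {
        "datasets + run harness", "persistence", "orchestrator", "dreaming",
    },
    "ship": {"integration"},
}

CRITICAL_PATH = [
    "interface freeze", "storage backends", "router",
    "orchestrator", "integration", "ship",
]


def is_dependency_chain(path: list[str], depends_on: dict[str, set]) -> bool:
    """True if every step in `path` directly depends on the step before it."""
    return all(prev in depends_on.get(cur, set()) for prev, cur in zip(path, path[1:]))


# A valid build order; raises graphlib.CycleError if the map contains a cycle.
BUILD_ORDER = list(TopologicalSorter(DEPENDS_ON).static_order())
```

Keeping the map as data makes it cheap to re-check the critical path whenever a dependency is added or dropped mid-sprint.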
⚠️

The hardest risk is semantic, not structural: what counts as "good memory." Ken and Scott B. must align on memory semantics in week 1, or the harness will store everything and retrieve nothing useful.

Milestones

D3 · contract freeze
  • Storage interface + memory-item schema locked.
  • Metric definitions agreed.
D5 · end of week 1
  • Three backends read/write; run harness + loaders ready.
  • No-memory baselines started on sharded keys.
D8 · integration start
  • Router + orchestrator connected.
  • Dreaming worker running against real stores.
D10 · ship
  • All five benchmarks complete across the four captains' sharded keys (Haiku + harness).
  • Four metrics aggregated vs. baselines.

Next: the architecture

See how the modules connect, how the router decides, and how indexing keeps retrieval fast.

View architecture →