2-Week Plan · AI Agent Memory Harness

Why this exists

The problem

There is no standard, model-agnostic layer that decides what to remember, where to store it, how to retrieve it fast, and how to keep it consistent over time for long-running agents.

💾 What: signal vs. noise
🗂️ Where: the right store
⚡ Fast: no context flood
♻️ Consistent: dedup, no conflicts

How we'll build it

Technical approach

A pluggable four-module memory harness over three indexed storage backends, wired into OpenCode and measured on existing public benchmarks — no home-grown evals.

💾 Persistence: decides what/when/where to save; tags every item.
🧭 Router: routes each query to the single best backend.
🔎 Orchestrator: ranks by recency × relevancy, dedups, returns.
🌙 Dreaming: while agents sleep, handles dedup, conflict resolution, and governance.
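As a rough illustration of the orchestrator's "rank by recency × relevancy, dedup, return" step, here is a minimal sketch. The `Hit` type, the exponential half-life decay, and the whitespace/case normalisation used for dedup are all assumptions made for the example; the plan does not specify the scoring function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    text: str
    relevancy: float  # similarity score in [0, 1] from a backend (assumed)
    age_hours: float  # time since the item was written


def recency(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Exponential decay: 1.0 for a fresh item, 0.5 after one half-life."""
    return 0.5 ** (age_hours / half_life_hours)


def orchestrate(hits: list[Hit], k: int = 3) -> list[Hit]:
    """Rank by recency x relevancy, drop repeated texts, return the top k."""
    ranked = sorted(
        hits, key=lambda h: recency(h.age_hours) * h.relevancy, reverse=True
    )
    seen: set[str] = set()
    out: list[Hit] = []
    for hit in ranked:
        key = " ".join(hit.text.lower().split())  # normalise case and spacing
        if key in seen:
            continue  # a higher-scoring copy was already kept
        seen.add(key)
        out.append(hit)
    return out[:k]
```

Because dedup runs after ranking, the copy that survives is always the highest-scoring one. A multiplicative score also means a stale item needs a much stronger match to outrank a fresh one.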

Three storage backends, each indexed

Markdown + YAML — literal recall · inverted keyword index.
SQLite + vectors — semantic search · HNSW / FAISS ANN.
Graph store — relationships & conflicts · typed traversal.
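One way the rule-based router with a learned-classifier hook might look is sketched below. The cue patterns and the `classifier` callback signature are hypothetical; a real rule set would be tuned on logged queries.

```python
import re
from enum import Enum
from typing import Callable, Optional


class Backend(Enum):
    MARKDOWN = "markdown"  # literal recall via the inverted keyword index
    VECTOR = "vector"      # semantic search via the ANN index
    GRAPH = "graph"        # relationships and conflicts via typed traversal


# Hypothetical cues, chosen only for illustration.
RELATION_CUES = re.compile(
    r"\b(depends on|related to|conflicts? with|calls|imports)\b", re.I
)
LITERAL_CUES = re.compile(
    r'"[^"]+"|`[^`]+`|\b[\w./-]+\.(py|ts|md|json|yaml)\b'
)


def route(
    query: str,
    classifier: Optional[Callable[[str], Optional[Backend]]] = None,
) -> Backend:
    """Pick exactly one backend; the classifier, if given, gets first say."""
    if classifier is not None:
        choice = classifier(query)
        if choice is not None:
            return choice
    if RELATION_CUES.search(query):
        return Backend.GRAPH
    if LITERAL_CUES.search(query):
        return Backend.MARKDOWN
    return Backend.VECTOR  # default: semantic search
```

Returning `None` from the classifier falls through to the rules, so a learned model can be introduced gradually without replacing the rule set.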

Validated, not invented

Integrated into OpenCode. Per benchmark, a controlled grid compares Haiku + harness against three no-memory baselines (Haiku, Sonnet and Opus 4.8), with prompts and task order fixed and every trajectory logged.

Full architecture & data flow →

Boundaries

Scope

What the sprint commits to — and what it doesn't.

In scope

The memory harness: all four modules, end to end.
Three storage adapters behind one interface, each indexed.
The intelligent router (rule-based; learned-classifier hook).
The async dreaming component: dedup, conflict resolution, governance, retention.
OpenCode integration for write/read on each agent step.
Eval harness on five benchmarks; baselines for Haiku, Opus 4.8, Sonnet.
Metric definitions, a reproducible protocol, and a results dashboard.

Out of scope (non-goals)

Building our own benchmark or dataset — we use public ones.
Training or fine-tuning any model.
Production hardening: multi-tenant infra, auth, SLAs, polished UI.
Distributed / scaled deployment — single-node is enough for the sprint.
Validating non-coding agents — design aims to generalize, not tested now.

Timeline at a glance

Sprint map

Week 1 = days 1–5 (build against frozen contracts; cheap no-memory baselines start ~D4). Week 2 = days 6–10 (wire together, run the Haiku+harness treatment on sharded keys, ship).

Stream	Week 1 (D1–D5)	Week 2 (D6–D10)
Keith · Harness	Abstraction + persistence	Orchestrator · OpenCode e2e
Ken · Eval infra	Datasets · run harness · metrics	LongMemEval shard + dashboard
Brent · Store + Retrieve	3 backends + indexes	Adapters + router
Scott B. · Dreaming	Dedup + conflict rules	Async worker + governance
All · Eval (own keys)	No-mem baselines	Haiku+mem · E2E

P1 Keith · P2 Ken · P3 Brent · P4 Scott B. Eval: baselines in week 1, treatment in week 2 (sharded keys).

Who owns what

Roles & deliverables

Keith · Person 1

🏗️ Harness Architecture & OpenCode Integration

Critical path — the contracts everyone builds against.

Week 1

Design the four-module abstraction & storage interface (swappable backends).
Build the persistence layer: write logic, versioning, metadata tagging.
Instrument OpenCode to write memory on each agent step.
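The storage interface and write path could be sketched roughly as follows. `MemoryItem`, `MemoryStore`, and `InMemoryStore` are illustrative names rather than the frozen Day-3 contract, and the toy adapter only exists to exercise the interface (Python 3.9+ assumed).

```python
import time
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass(frozen=True)
class MemoryItem:
    key: str
    content: str
    tags: tuple[str, ...] = ()  # metadata tagging
    version: int = 1            # assigned by the store on write
    created_at: float = field(default_factory=time.time)


class MemoryStore(Protocol):
    """The single contract every backend adapter would implement."""

    def write(self, item: MemoryItem) -> MemoryItem: ...
    def read(self, key: str) -> Optional[MemoryItem]: ...
    def search(self, query: str, k: int = 5) -> list[MemoryItem]: ...


class InMemoryStore:
    """Toy adapter: keeps every version of every key in a dict."""

    def __init__(self) -> None:
        self._history: dict[str, list[MemoryItem]] = {}

    def write(self, item: MemoryItem) -> MemoryItem:
        versions = self._history.setdefault(item.key, [])
        stored = MemoryItem(
            item.key, item.content, item.tags, len(versions) + 1, item.created_at
        )
        versions.append(stored)  # append-only, so older versions stay readable
        return stored

    def read(self, key: str) -> Optional[MemoryItem]:
        versions = self._history.get(key)
        return versions[-1] if versions else None

    def search(self, query: str, k: int = 5) -> list[MemoryItem]:
        needle = query.lower()
        latest = (versions[-1] for versions in self._history.values())
        return [m for m in latest if needle in m.content.lower()][:k]
```

Using a `Protocol` means adapters satisfy the interface structurally, so each backend can be developed and swapped without inheriting from a shared base class.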

Week 2

Build the retrieval orchestrator (rank → dedup → return).
Wire in Brent's router; run an end-to-end task.
Captain the SWE-Bench-CL runs on his own API key.

Ken · Person 2

📊 Evaluation Infrastructure & Coordination

Shared runner, metrics, dashboard — and captains one benchmark.

Week 1

Download & parse all five datasets; loaders + trajectory logging (with Scott B.).
Define the four metrics; build the shared run harness: run(benchmark, model, memory) → metrics.
Stand up the cost/budget tracker and per-key config.
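A minimal sketch of the `run(benchmark, model, memory) → metrics` shape is shown below. The callable types and the single success-rate metric are placeholders; the plan's four metrics are not defined here, so none of them are assumed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Stand-ins, not a real client API:
# a model is (task, context) -> answer; a memory is task -> context.
Model = Callable[[str, str], str]
Memory = Callable[[str], str]


@dataclass
class Metrics:
    tasks: int
    solved: int

    @property
    def success_rate(self) -> float:
        return self.solved / self.tasks if self.tasks else 0.0


def run(
    benchmark: Iterable[tuple[str, str]],  # (task, expected answer) pairs
    model: Model,
    memory: Optional[Memory] = None,       # None means a no-memory baseline
) -> Metrics:
    tasks = solved = 0
    for task, expected in benchmark:
        context = memory(task) if memory else ""
        tasks += 1
        solved += int(model(task, context) == expected)
    return Metrics(tasks, solved)
```

Keeping `memory` optional lets the same entry point serve both the no-memory baselines and the Haiku + harness treatment.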

Week 2

Captain the LongMemEval runs on his own API key.
Aggregate every captain's results; stats + results & cost dashboard; document the protocol.

Brent · Person 3

🗄️ Storage Implementation & Router

Three fast backends + the dispatch layer.

Week 1

SQLite + vector pipeline (embedding model, HNSW/FAISS index).
Markdown store with YAML frontmatter + inverted index.
Graph store schema (Neo4j) with typed traversal index.
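For the Markdown store, the front matter split and inverted index might look like the sketch below. To stay dependency-free it handles only flat `key: value` header lines, which is a simplification; a real store would use a proper YAML parser. All function names are illustrative.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9_]+")


def split_front_matter(doc: str) -> tuple[dict[str, str], str]:
    """Split a '---' delimited header from the Markdown body."""
    if not doc.startswith("---\n"):
        return {}, doc
    header, _, body = doc[4:].partition("\n---\n")
    meta: dict[str, str] = {}
    for line in header.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, body


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each body token to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, doc in docs.items():
        _, body = split_front_matter(doc)
        for token in TOKEN.findall(body.lower()):
            index[token].add(doc_id)
    return index


def lookup(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents containing every query token (AND semantics)."""
    tokens = TOKEN.findall(query.lower())
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))
```

Lookup cost depends on the posting lists for the query tokens rather than on the size of the corpus, which is what keeps literal recall fast.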

Week 2

Adapters for all three backends behind Keith's interface.
Implement the router; performance-test each backend.
Captain the SWE-ContextBench & ContextBench runs on his own API key.

Scott B. · Person 4

🌙 Dreaming Component & Memory Governance

The offline engine that keeps memory clean.

Week 1

Deduplication logic (exact / semantic / near-duplicate).
Conflict detection + reconciliation rules (recency, confidence, source).
Session-filter & blacklist semantics.
Co-build trajectory logging with Ken — the dreaming worker consumes it.
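The dedup and reconciliation rules could be prototyped along these lines. This sketch covers exact and near-duplicate detection only (semantic dedup would need an embedding model and is left out), and the Jaccard threshold, the `Fact` type, and the source trust order are assumptions, not decisions from the plan.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    value: str
    timestamp: float
    confidence: float
    source: str


SOURCE_RANK = {"user": 2, "tool": 1, "agent": 0}  # hypothetical trust order


def exact_key(text: str) -> str:
    """Hash of the case- and whitespace-normalised text."""
    return hashlib.sha256(" ".join(text.lower().split()).encode("utf-8")).hexdigest()


def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0


def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    return exact_key(a) == exact_key(b) or jaccard(a, b) >= threshold


def _rank(fact: Fact) -> tuple[float, float, int]:
    # Recency first, then confidence, then source trust as the final tie-break.
    return (fact.timestamp, fact.confidence, SOURCE_RANK.get(fact.source, -1))


def reconcile(facts: list[Fact]) -> dict[str, Fact]:
    """Keep one winning fact per subject."""
    winners: dict[str, Fact] = {}
    for fact in facts:
        current = winners.get(fact.subject)
        if current is None or _rank(fact) > _rank(current):
            winners[fact.subject] = fact
    return winners
```

Tuple comparison gives a strict priority order (recency, then confidence, then source), which is only one possible reading of the rule list; a weighted score would be an equally valid choice.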

Week 2

Async offline scheduler; must-know / must-do extraction.
Retention & pruning; observability logs; integration with Keith & Brent.
Captain the MemoryAgentBench runs on his own API key.

RACI, lightweight

Who owns what

Every deliverable has one accountable owner. The interface is frozen on Day 3 so the four streams never block on each other.

P1 Keith · P2 Ken · P3 Brent · P4 Scott B.

Area / deliverable	Owner	Supporting
Storage interface & memory-item schema	P1	P3
Persistence layer (write path, tagging, versioning)	P1	—
Retrieval orchestrator (rank · dedup · return)	P1	P3
OpenCode integration (write/read each step)	P1	—
Three storage backends + per-backend indexes	P3	P1
Intelligent router (rules → learned)	P3	P1
Backend performance testing	P3	P2
Dreaming worker (dedup · conflict · governance · retention)	P4	P1, P3
Memory semantics ("what is good memory")	P4	P2
Datasets, loaders & trajectory logging	P2	P4
Metric defs, shared run harness & cost gates	P2	P4
Results + cost dashboard, stats, aggregation	P2	—
Run SWE-Bench-CL eval (own key)	P1	—
Run LongMemEval eval (own key)	P2	—
Run SWE-ContextBench eval (own key)	P3	—
Run MemoryAgentBench eval (own key)	P4	—
Run ContextBench eval (own key)	P3	—
Final end-to-end integration	All	—

Running the evals

Divide & conquer the benchmark runs

~20 runs (5 benchmarks × 4 configs) is too much cost and time for one person on one key. Each teammate captains the benchmark(s) that stress their own component, on their own API budget — runs go wide, not deep.

Benchmark	Captain	Why them	Configs
SWE-Bench-CL	Keith	Drives the coding agent end-to-end — his OpenCode wheelhouse.	4
LongMemEval	Ken	Eval lead; recency / temporal reasoning.	4
SWE-ContextBench	Brent	Context reuse exercises his retrieval + router.	4
MemoryAgentBench	Scott B.	Tests conflict resolution — his dreaming component.	4
ContextBench	Brent	Retrieval-quality (gold contexts) — same retrieval/router he owns.	4

Whoever can best debug a bad number is the one watching it. Ken owns the shared runner and aggregates results.
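The run grid from the table above can be written down explicitly, which makes the per-captain load easy to check. The config labels are placeholders for whatever identifiers the runner ends up using.

```python
from collections import Counter
from itertools import product

CAPTAINS = {
    "SWE-Bench-CL": "Keith",
    "LongMemEval": "Ken",
    "SWE-ContextBench": "Brent",
    "MemoryAgentBench": "Scott B.",
    "ContextBench": "Brent",
}
CONFIGS = ["haiku+harness", "haiku", "sonnet", "opus-4.8"]  # placeholder labels

# Every (benchmark, config) cell, in table order.
GRID = list(product(CAPTAINS, CONFIGS))

# How many cells land on each captain's key.
RUNS_PER_CAPTAIN = Counter(CAPTAINS[benchmark] for benchmark, _ in GRID)
```

With two benchmarks, Brent's key carries 8 of the 20 cells while the other three carry 4 each, which is worth keeping in mind when setting per-key budgets.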

Cost & throughput controls

🔑 Sharded keys

Each captain runs on a separate API key/account — roughly 4× the aggregate rate limit, isolated per-benchmark budgets, no single-key throttle.

💸 Cheapest-first + early-exit

Order configs Haiku+mem → Haiku → Sonnet → Opus. Opus (priciest) runs last, only to confirm a signal — skip it if the cheaper tier already settles the question.
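A sketch of the cheapest-first loop with early exit follows. The `settled` predicate is hypothetical: the plan does not say what "settles the question", so the caller has to supply that rule.

```python
from typing import Callable


def run_cheapest_first(
    configs: list[str],                          # ordered cheapest to priciest
    run_config: Callable[[str], float],          # returns a score for one config
    settled: Callable[[dict[str, float]], bool], # caller-defined stopping rule
) -> dict[str, float]:
    """Run configs in order and stop as soon as the results are conclusive."""
    results: dict[str, float] = {}
    for name in configs:
        results[name] = run_config(name)
        if settled(results):
            break  # remaining, pricier configs are skipped
    return results
```

Any config absent from the returned dict was skipped, so the dashboard can report it as "not run" rather than as a missing result.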

🎯 Dev slice → full

Iterate on a fixed ~10–15% stratified subset, then do one full run per config. Use LongMemEval_S (~115k tokens per history) to iterate; reserve LongMemEval_M (~1.5M tokens per history) for a single final confirmation.
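One way to draw a fixed stratified dev slice is sketched below. The `stratum_of` function (for example, task difficulty or repository) and the seed are assumptions; the point is that a seeded sampler over sorted inputs yields the same subset on every machine.

```python
import random
from collections import defaultdict
from typing import Callable


def stratified_slice(
    tasks: list[str],
    stratum_of: Callable[[str], str],
    fraction: float = 0.1,
    seed: int = 0,
) -> list[str]:
    """Sample roughly `fraction` of each stratum, at least one task per stratum."""
    rng = random.Random(seed)  # fixed seed keeps the slice reproducible
    groups: dict[str, list[str]] = defaultdict(list)
    for task in tasks:
        groups[stratum_of(task)].append(task)
    subset: list[str] = []
    for name in sorted(groups):          # stable stratum order
        members = sorted(groups[name])   # stable member order
        n = max(1, round(len(members) * fraction))
        subset.extend(rng.sample(members, n))
    return subset
```

Sorting before sampling removes any dependence on dataset load order, so every captain iterates on the same tasks.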

♻️ Cache + resume

Content-hash every (task, model, config) call and checkpoint per task, so a crash or re-run never re-pays — critical on long SWE-bench trajectories.
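The cache-and-resume idea can be sketched with a content hash and one JSON file per call. The on-disk layout and the `call` signature are assumptions for illustration; results must be JSON-serialisable for this version to work.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable


def call_key(task: str, model: str, config: dict[str, Any]) -> str:
    """Stable hash of the (task, model, config) triple."""
    payload = json.dumps(
        {"task": task, "model": model, "config": config}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(
    cache_dir: Path,
    task: str,
    model: str,
    config: dict[str, Any],
    call: Callable[[str, str, dict[str, Any]], Any],
) -> Any:
    """Return the cached result if present; otherwise call, persist, and return."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{call_key(task, model, config)}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = call(task, model, config)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(result))
    tmp.replace(path)  # rename into place so a crash never leaves a partial file
    return result
```

`sort_keys=True` makes the hash independent of dict ordering, and the write-then-rename step means a resumed run only ever sees complete checkpoints.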

🚧 Hard budget gates

Per-benchmark $ and token ceiling in the runner; abort and log partial results on overrun — no silent cost blow-ups.
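A hard budget gate in the runner might look like the following sketch. The class names and the `(outcome, usd, tokens)` return shape of `run_task` are hypothetical; actual costs would come from the provider's usage reporting.

```python
from dataclasses import dataclass
from typing import Any, Callable


class BudgetExceeded(RuntimeError):
    """Raised when a benchmark crosses its dollar or token ceiling."""


@dataclass
class BudgetGate:
    max_usd: float
    max_tokens: int
    spent_usd: float = 0.0
    spent_tokens: int = 0

    def charge(self, usd: float, tokens: int) -> None:
        self.spent_usd += usd
        self.spent_tokens += tokens
        if self.spent_usd > self.max_usd or self.spent_tokens > self.max_tokens:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} and {self.spent_tokens} tokens"
            )


def run_with_gate(
    tasks: list[str],
    run_task: Callable[[str], tuple[Any, float, int]],  # (outcome, usd, tokens)
    gate: BudgetGate,
) -> tuple[list[Any], bool]:
    """Run tasks until done or over budget; always return the partial results."""
    results: list[Any] = []
    for task in tasks:
        outcome, usd, tokens = run_task(task)
        results.append(outcome)  # keep the result that was already paid for
        try:
            gate.charge(usd, tokens)
        except BudgetExceeded:
            return results, True  # aborted on overrun
    return results, False
```

The result that tips the run over the ceiling is recorded before aborting, since its cost has already been incurred.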

🕒 Baselines in week 1

No-memory baselines need only the model + dataset, not the harness — start them ~D4 to flatten the week-2 spike and surface API/dataset issues early.

⚖️ Must-run cells: Haiku + harness (treatment) and Opus 4.8 (target). Haiku no-memory is cheap and stays; Sonnet is optional if budget is tight.

Critical path

Dependency map

What blocks what across the two weeks. The Day-3 interface freeze unblocks every stream; the bold chain is the critical path to ship.

[Dependency diagram. Legend: Keith · Ken · Brent · Scott B. · Gate / all · Critical path]

Dependencies, in words

Keith + Brent (D1–D3): co-author and freeze the storage interface + memory-item schema, so persistence and the adapters build against one contract.
Brent → Keith: the router lands before the orchestrator can route in week 2.
Ken → all (by D5): datasets, trajectory logging and the shared run harness ready, so no-memory baselines can start ~D4–D6.
Keith + Brent → Scott B.: the dreaming worker integrates once real storage exists (D8+).
Sharded keys: each captain runs on a separate API budget — baselines week 1, the Haiku+harness treatment week 2.

⚠️ The hardest risk is semantic, not structural: what counts as "good memory." Ken and Scott B. must align on memory semantics in week 1, or the harness will store everything and retrieve nothing useful.

Milestones

D3 · contract freeze

Storage interface + memory-item schema locked.
Metric definitions agreed.

D5 · end of week 1

Three backends read/write; run harness + loaders ready.
No-memory baselines started on sharded keys.

D8 · integration start

Router + orchestrator connected.
Dreaming worker running against real stores.

D10 · ship

All five benchmarks complete across the four key shards (Haiku + harness).
Four metrics aggregated vs. baselines.

Next: the architecture

See how the modules connect, how the router decides, and how indexing keeps retrieval fast.

View architecture →
