The contracts the four workstreams build against — memory-item schema, storage interface, per-backend schemas, router and dreaming contracts, eval protocol. Pseudocode is illustrative.
Every memory — wherever it is stored — shares one shape. This is what the persistence layer writes and the dreaming component reconciles.
# A memory item, shown as Markdown YAML frontmatter
id: mem_8f2a1c
content: "Forward Windows env vars into WSL login shells via WSLENV."
created_at: 2026-06-16T22:14:05Z
session: sess_2026-06-16_a
relevancy: 0.82          # 0..1, decays over time
source: agent_step       # agent_step | user | tool | reconciled
tags: [wsl, env, secrets]
backends: [markdown, sqlite]
status: active           # active | superseded | blacklisted
supersedes: null         # id of the item this replaces, if any
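The item shape above could be mirrored in Python roughly as follows. This is a minimal sketch, not the project's actual `schema.py`: the class name `MemoryItem` comes from the storage interface, but the default values and the validation in `__post_init__` are assumptions added for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Allowed values copied from the comments in the frontmatter example.
SOURCES = {"agent_step", "user", "tool", "reconciled"}
STATUSES = {"active", "superseded", "blacklisted"}


@dataclass
class MemoryItem:
    id: str
    content: str
    created_at: datetime
    session: str
    relevancy: float = 1.0                 # 0..1; this default is an assumption
    source: str = "agent_step"
    tags: list[str] = field(default_factory=list)
    backends: list[str] = field(default_factory=list)
    status: str = "active"
    supersedes: Optional[str] = None       # id of the item this replaces

    def __post_init__(self) -> None:
        # Validation rules are illustrative, not part of the stated contract.
        if not 0.0 <= self.relevancy <= 1.0:
            raise ValueError("relevancy must be in [0, 1]")
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")


item = MemoryItem(
    id="mem_8f2a1c",
    content="Forward Windows env vars into WSL login shells via WSLENV.",
    created_at=datetime(2026, 6, 16, 22, 14, 5, tzinfo=timezone.utc),
    session="sess_2026-06-16_a",
    relevancy=0.82,
    tags=["wsl", "env", "secrets"],
    backends=["markdown", "sqlite"],
)
```

Rejecting bad enum values at construction time means every backend can assume a well-formed item, which is one way to keep the shape identical wherever it is stored.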
The single abstraction every backend implements. Freezing this on day 3 is what lets the four streams build in parallel.
class MemoryStore(Protocol):
    def write(self, item: MemoryItem) -> str: ...                 # returns id
    def get(self, id: str) -> MemoryItem | None: ...
    def search(self, query: Query, k: int = 8) -> list[Hit]: ...
    def delete(self, id: str) -> None: ...
    def all(self) -> Iterable[MemoryItem]: ...                    # for the offline dreaming sweep
Backends are swappable: each store satisfies this interface, so the orchestrator and dreaming worker never depend on a concrete store.
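To make the swappability claim concrete, here is a hedged sketch of a dict-backed store that satisfies the interface structurally. `InMemoryStore`, `remember`, the trimmed `MemoryItem`, and the fields of `Query` and `Hit` are all hypothetical stand-ins; the source names these types but does not define them, and the word-overlap scoring is a toy.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional, Protocol


@dataclass
class MemoryItem:          # trimmed stand-in for the shared item shape
    id: str
    content: str


@dataclass
class Query:               # hypothetical: the source does not define Query
    text: str


@dataclass
class Hit:                 # hypothetical: the source does not define Hit
    item: MemoryItem
    score: float


class MemoryStore(Protocol):
    def write(self, item: MemoryItem) -> str: ...
    def get(self, id: str) -> Optional[MemoryItem]: ...
    def search(self, query: Query, k: int = 8) -> list[Hit]: ...
    def delete(self, id: str) -> None: ...
    def all(self) -> Iterable[MemoryItem]: ...


class InMemoryStore:
    """Dict-backed store; satisfies MemoryStore structurally (no inheritance)."""

    def __init__(self) -> None:
        self._items: dict[str, MemoryItem] = {}

    def write(self, item: MemoryItem) -> str:
        self._items[item.id] = item
        return item.id

    def get(self, id: str) -> Optional[MemoryItem]:
        return self._items.get(id)

    def search(self, query: Query, k: int = 8) -> list[Hit]:
        # Toy scoring: fraction of query words present in the content.
        words = query.text.lower().split()
        hits = []
        for item in self._items.values():
            text = item.content.lower()
            matched = sum(1 for w in words if w in text)
            if matched:
                hits.append(Hit(item, matched / len(words)))
        hits.sort(key=lambda h: h.score, reverse=True)
        return hits[:k]

    def delete(self, id: str) -> None:
        self._items.pop(id, None)

    def all(self) -> Iterable[MemoryItem]:
        return list(self._items.values())


def remember(store: MemoryStore, item: MemoryItem) -> str:
    # Callers depend only on the protocol, never on a concrete backend.
    return store.write(item)


store = InMemoryStore()
remember(store, MemoryItem("mem_1", "Forward env vars via WSLENV"))
remember(store, MemoryItem("mem_2", "Use tmux for long jobs"))
top = store.search(Query("wslenv env"))
```

Because `remember` is typed against the protocol, any other backend with the same five methods can be passed in without changing the caller.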
CREATE TABLE memory (
  id         TEXT PRIMARY KEY,
  content    TEXT NOT NULL,
  embedding  BLOB,              -- float32[d] → HNSW/FAISS
  created_at TEXT,
  relevancy  REAL,
  session    TEXT,
  status     TEXT DEFAULT 'active'
);
CREATE INDEX idx_session ON memory(session);
Embed on write, mirror the vector into an HNSW/FAISS ANN index, hydrate rows by id on hit.
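The write and read path described above might look like the sketch below, using only the standard library. Two pieces are stand-ins and should be read as assumptions: `embed` is a toy character-bucket embedding rather than a real model, and the exhaustive cosine scan takes the place of the HNSW/FAISS index, which would return candidate ids instead. The table is a trimmed version of the schema.

```python
import math
import sqlite3
import struct


def embed(text: str, dim: int = 8) -> list[float]:
    # Hypothetical toy embedding (character-bucket counts), NOT a real model.
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


def pack(vec: list[float]) -> bytes:
    return struct.pack(f"<{len(vec)}f", *vec)       # float32[d] blob


def unpack(blob: bytes) -> list[float]:
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE memory (id TEXT PRIMARY KEY, content TEXT NOT NULL, "
    "embedding BLOB, status TEXT DEFAULT 'active')"
)


def write(mem_id: str, content: str) -> None:
    # Embed on write; a real backend would also add the vector to the ANN index.
    db.execute(
        "INSERT OR REPLACE INTO memory (id, content, embedding) VALUES (?, ?, ?)",
        (mem_id, content, pack(embed(content))),
    )


def search(query: str, k: int = 8) -> list[tuple[str, str, float]]:
    q = embed(query)
    # Exhaustive scan stands in for the HNSW/FAISS lookup, which returns ids.
    rows = db.execute("SELECT id, embedding FROM memory WHERE status = 'active'")
    scored = sorted(
        ((cosine(q, unpack(blob)), mem_id) for mem_id, blob in rows),
        reverse=True,
    )[:k]
    hits = []
    for score, mem_id in scored:
        # Hydrate the full row by id on hit.
        (content,) = db.execute(
            "SELECT content FROM memory WHERE id = ?", (mem_id,)
        ).fetchone()
        hits.append((mem_id, content, score))
    return hits


write("mem_1", "Forward Windows env vars into WSL login shells via WSLENV.")
write("mem_2", "zzzz")
results = search("Forward Windows env vars into WSL login shells via WSLENV.", k=1)
```

Keeping the vector in the row as well as in the index means the index can be rebuilt from SQLite alone if it is lost or its parameters change.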
(:Memory {id, content, created_at})
(:Memory)-[:RELATES_TO]->(:Memory)
(:Memory)-[:CONTRADICTS]->(:Memory)
(:Memory)-[:SUPERSEDES]->(:Memory)
Typed traversal index over node/edge types; bound traversals by depth or relevancy.
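A depth-bounded traversal over one edge type can be sketched without a graph database. The edge list, `build_index`, and `traverse` below are hypothetical names for illustration; a real graph backend would express the same bound in its own query language.

```python
from collections import deque

# Hypothetical in-memory edge list: (source id, edge type, target id).
EDGES = [
    ("mem_a", "RELATES_TO", "mem_b"),
    ("mem_b", "RELATES_TO", "mem_c"),
    ("mem_c", "RELATES_TO", "mem_d"),
    ("mem_a", "CONTRADICTS", "mem_x"),
]


def build_index(edges):
    """Typed traversal index: (source, edge type) -> [targets]."""
    index = {}
    for src, kind, dst in edges:
        index.setdefault((src, kind), []).append(dst)
    return index


def traverse(index, start, kind, max_depth):
    """Breadth-first walk along one edge type, stopping at max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    reached = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue                      # depth bound: do not expand further
        for nxt in index.get((node, kind), []):
            if nxt not in seen:
                seen.add(nxt)
                reached.append((nxt, depth + 1))
                queue.append((nxt, depth + 1))
    return reached


INDEX = build_index(EDGES)
near = traverse(INDEX, "mem_a", "RELATES_TO", max_depth=2)
```

With `max_depth=2` the walk reaches `mem_b` and `mem_c` but never expands to `mem_d`, and the `CONTRADICTS` edge is ignored because the index is keyed by edge type.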
memories/<session>/mem_8f2a1c.md
inverted_index.json # keyword → [file paths]
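One way the keyword index could be built from the Markdown files is sketched below. The tokenizer (lowercase alphanumeric runs), the paths being relative to a common root, and the index file sitting next to `memories/` are all assumptions; the source only gives the two path patterns.

```python
import json
import re
import tempfile
from pathlib import Path


def build_inverted_index(root: Path) -> dict[str, list[str]]:
    """Map each lowercase keyword to the sorted memory files containing it."""
    index: dict[str, set[str]] = {}
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8").lower()
        for word in set(re.findall(r"[a-z0-9_]+", text)):
            index.setdefault(word, set()).add(path.relative_to(root).as_posix())
    return {word: sorted(paths) for word, paths in sorted(index.items())}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    note = root / "memories" / "sess_a" / "mem_8f2a1c.md"
    note.parent.mkdir(parents=True)
    note.write_text("Forward env vars into WSL via WSLENV.", encoding="utf-8")

    index = build_inverted_index(root)
    (root / "inverted_index.json").write_text(json.dumps(index, indent=2))
    hits = index["wslenv"]
```

Literal recall then becomes a dictionary lookup followed by reading the listed files, with no embedding call involved.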
def route(query: Query) -> Backend:
    if query.kind == "relationship":
        return GRAPH
    if query.needs_semantic:
        return VECTOR
    return MARKDOWN            # literal recall
Start rule-based; later swap the body for a learned classifier over the query embedding + metadata. Same signature, no caller changes.
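A runnable version of the rule-based router, with the types it needs filled in, might look like this. The `Backend` enum values, the `text` field and defaults on `Query`, and the `dispatch` helper are assumptions introduced so the sketch executes; only `kind`, `needs_semantic`, and the three routes come from the pseudocode.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Backend(Enum):
    GRAPH = "graph"
    VECTOR = "vector"
    MARKDOWN = "markdown"


@dataclass
class Query:
    text: str
    kind: str = "recall"            # "relationship" triggers the graph route
    needs_semantic: bool = False


def route(query: Query) -> Backend:
    if query.kind == "relationship":
        return Backend.GRAPH
    if query.needs_semantic:
        return Backend.VECTOR
    return Backend.MARKDOWN         # literal recall


Router = Callable[[Query], Backend]


def dispatch(query: Query, router: Router = route) -> str:
    # Callers hold a Router, so a learned classifier with the same
    # signature can replace `route` without touching this function.
    return router(query).value


choices = [
    dispatch(Query("what contradicts mem_1?", kind="relationship")),
    dispatch(Query("how do I pass secrets to WSL?", needs_semantic=True)),
    dispatch(Query("WSLENV")),
]
```

Passing the router as a parameter is one way to honor "same signature, no caller changes": the swap happens at the call site's default, not inside the callers.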
def dream(store: MemoryStore) -> Governance:
    items = list(store.all())
    merge_duplicates(items)        # exact·semantic·near
    resolve_conflicts(items)       # recency·conf·source
    gov = session_rules(items)     # know·do·blacklist
    prune(items)                   # decay + caps
    return gov
Runs asynchronously when the agent is idle; emits a conflict-free state + the per-session governance object the next run loads.
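Two of the steps in the sweep, duplicate merging and pruning, are sketched below in their simplest form. This covers exact duplicates only (not semantic or near matches), skips conflict resolution and governance, and uses a hypothetical `Item` with an `age_days` field plus placeholder `tau_days` and `cap` values. It should be read as an illustration of "decay + caps", not as the dreaming component's logic.

```python
import math
from dataclasses import dataclass


@dataclass
class Item:                      # trimmed stand-in for the memory item
    id: str
    content: str
    age_days: float              # hypothetical: time since created_at
    relevancy: float = 1.0
    status: str = "active"


def merge_exact_duplicates(items):
    """Keep the newest copy of each normalized content; supersede the rest."""
    newest = {}
    for item in items:
        key = " ".join(item.content.lower().split())
        kept = newest.get(key)
        if kept is None or item.age_days < kept.age_days:
            if kept is not None:
                kept.status = "superseded"
            newest[key] = item
        else:
            item.status = "superseded"


def prune(items, tau_days=30.0, cap=2):
    """Decay relevancy by age, then keep at most `cap` active items."""
    active = [i for i in items if i.status == "active"]
    for item in active:
        item.relevancy *= math.exp(-item.age_days / tau_days)
    active.sort(key=lambda i: i.relevancy, reverse=True)
    return active[:cap]


items = [
    Item("mem_1", "Use WSLENV to forward env vars", age_days=10),
    Item("mem_2", "use  wslenv to forward env vars", age_days=1),
    Item("mem_3", "Prefer tmux for long jobs", age_days=60),
    Item("mem_4", "Run tests before pushing", age_days=3),
]
merge_exact_duplicates(items)
kept = [i.id for i in prune(items)]
```

Here `mem_1` is superseded by its newer duplicate `mem_2`, and the 60-day-old `mem_3` falls below the cap after decay, leaving `mem_2` and `mem_4`.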
Day Dream (in-session) is fired automatically by the plugin's
Stop hook at session end: it reads the session .jsonl logs, tracks
processed-vs-delta state (sending only unprocessed entries), calls an
OpenRouter model to decide what to remember, then calls the Orchestrator to save it.
Dream (night) above is the deep cross-session sweep over the whole store.
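The processed-vs-delta bookkeeping for the session logs could work along these lines. This sketch assumes a byte-offset cursor stored in a small JSON state file; the function name `read_delta`, the state file layout, and the log contents are hypothetical, and the model call and Orchestrator hand-off are left out.

```python
import json
import tempfile
from pathlib import Path


def read_delta(log_path: Path, state_path: Path) -> list[dict]:
    """Return entries appended since the last call and advance the cursor."""
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    offset = state.get(str(log_path), 0)

    entries = []
    with log_path.open("rb") as fh:
        fh.seek(offset)
        for raw in fh:
            if not raw.endswith(b"\n"):
                break                      # partial line still being written
            offset += len(raw)
            if raw.strip():
                entries.append(json.loads(raw))

    state[str(log_path)] = offset
    state_path.write_text(json.dumps(state))
    return entries


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "session.jsonl"
    state = Path(tmp) / "dream_state.json"

    log.write_text('{"step": 1}\n{"step": 2}\n')
    first = read_delta(log, state)       # both entries are new

    with log.open("a") as fh:
        fh.write('{"step": 3}\n')
    second = read_delta(log, state)      # only the appended entry
    third = read_delta(log, state)       # nothing new
```

Tracking a byte offset rather than a line count keeps each call proportional to the new data, and stopping at an unterminated line avoids parsing an entry that is still being written.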
score(item, q)     = w_r · recency(item) + w_s · relevancy(item, q)
recency(item)      = exp(-Δt / τ)             # freshness decay
relevancy(item, q) = cos(emb_item, emb_q)
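The scoring formula above translates directly into a few lines of Python. The weights `w_r`, `w_s` and the one-week `tau` below are placeholder values chosen for the example, not settings from the source, and the two-dimensional embeddings are made up.

```python
import math


def recency(age_seconds: float, tau: float) -> float:
    return math.exp(-age_seconds / tau)          # freshness decay


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(emb_item, emb_q, age_seconds, w_r=0.3, w_s=0.7, tau=7 * 86400.0):
    # Weights and tau are placeholder values, not tuned settings.
    return w_r * recency(age_seconds, tau) + w_s * cosine(emb_item, emb_q)


fresh_match = score([1.0, 0.0], [1.0, 0.0], age_seconds=0.0)
stale_match = score([1.0, 0.0], [1.0, 0.0], age_seconds=7 * 86400.0)
fresh_miss = score([0.0, 1.0], [1.0, 0.0], age_seconds=0.0)
```

With these placeholders a week-old exact match still outranks a brand-new unrelated item, because after one `tau` the recency term has only dropped to about 37% of its weight while the similarity term is unchanged.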
< 10% overhead. > 0.7.
One controlled grid per benchmark, run cheapest-first. The cell that matters: does Haiku + harness clear the Opus 4.8 baseline?
| Run | Model | Memory | Role | Order / cost |
|---|---|---|---|---|
| Treatment | Haiku | on | The hypothesis under test | 1st · must-run |
| Lower bound | Haiku | off | How far memory lifts the small model | 2nd · cheap |
| Reference | Sonnet | off | Middle-tier sanity check | 3rd · optional |
| Target | Opus 4.8 | off | The bar to beat | 4th · confirm only |
Hold prompts, tools and task order fixed across runs; log every trajectory (written / retrieved / used) for reproducibility, and report deltas with a significance test.
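Because task order is held fixed across runs, per-task outcomes are paired, and a paired test fits. The source does not say which significance test is used; the sign-flip permutation test below is one reasonable choice, shown as a sketch. The outcome lists are made up for illustration and are not results.

```python
import random
from statistics import mean


def paired_permutation_test(treatment, baseline, n_resamples=10_000, seed=0):
    """Two-sided sign-flip test on per-task score differences."""
    if len(treatment) != len(baseline):
        raise ValueError("runs must cover the same tasks in the same order")
    diffs = [t - b for t, b in zip(treatment, baseline)]
    observed = mean(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        flipped = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return observed, (extreme + 1) / (n_resamples + 1)


# Made-up per-task pass/fail outcomes, for illustration only.
with_memory = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
without_memory = [0, 1, 0, 0, 1, 0, 0, 1, 0, 1]
delta, p_value = paired_permutation_test(with_memory, without_memory)
```

A fixed seed makes the reported p-value reproducible alongside the logged trajectories.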
Execution — local Claude Code CLI. Benchmarks run through the
Claude Code CLI (subscription auth — API keys stripped, no API billing), comparing Claude Code's
built-in memory vs our plugin memory:
python -m memeval.claudecode.run_bench --benchmark all --mode all --model claude-haiku-4-5
Each run saves a per-benchmark, versioned file
results/{vX.Y}/{bench-name}-{timestamp}.json (vX.Y = the memory-system
version, starts v0.1) and appends to the aggregate ledger the
Results page reads.
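The file naming and ledger append could be sketched as follows. Several details are assumptions because the source leaves them open: the timestamp format, the ledger's name and location (`results/ledger.jsonl` here), its one-JSON-line-per-run layout, and the `longmemeval` slug. The payload is an empty placeholder rather than a real result.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_result(root: Path, version: str, bench: str, payload: dict) -> Path:
    # Timestamp format is an assumption; the source only says {timestamp}.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out = root / "results" / version / f"{bench}-{stamp}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(payload, indent=2), encoding="utf-8")

    # Hypothetical ledger: one JSON line per run, appended, never rewritten.
    ledger = root / "results" / "ledger.jsonl"
    entry = {"version": version, "bench": bench, "file": out.name}
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return out


with tempfile.TemporaryDirectory() as tmp:
    path = save_result(Path(tmp), "v0.1", "longmemeval", {"accuracy": None})
    relative = path.relative_to(tmp).as_posix()
    ledger_lines = (Path(tmp) / "results" / "ledger.jsonl").read_text().splitlines()
```

An append-only ledger means a crashed run can never corrupt earlier entries, and the per-run files remain the source of truth if the ledger needs rebuilding.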
Runs are sharded by benchmark captain, each on a separate API key — Keith (SWE-Bench-CL), Ken (LongMemEval), Brent (SWE-ContextBench + ContextBench), Scott B. (MemoryAgentBench). Iterate on a ~10–15% stratified dev slice, cache + checkpoint every call, and enforce a per-benchmark budget gate. Full cost strategy on the plan.
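The dev slice and budget gate mentioned above could be sketched like this. The stratification key (`difficulty`), the 12.5% fraction (inside the stated 10–15% range), and the dollar-based `BudgetGate` are assumptions for illustration; the actual strata and cost accounting are defined in the plan, not here.

```python
import random
from collections import defaultdict


def stratified_slice(tasks, key, fraction=0.125, seed=0):
    """Sample about `fraction` of each stratum, keeping at least one task."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for task in tasks:
        strata[key(task)].append(task)
    picked = []
    for name in sorted(strata):
        group = strata[name]
        n = max(1, round(len(group) * fraction))
        picked.extend(rng.sample(group, n))
    return picked


class BudgetGate:
    """Refuses further calls once the per-benchmark spend cap is reached."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        if self.spent_usd + cost_usd > self.cap_usd:
            raise RuntimeError("budget gate: cap would be exceeded")
        self.spent_usd += cost_usd


# Hypothetical task records; "difficulty" is an invented stratification key.
tasks = [{"id": i, "difficulty": "easy" if i % 4 else "hard"} for i in range(80)]
dev = stratified_slice(tasks, key=lambda t: t["difficulty"])
```

Sampling within each stratum keeps the dev slice's mix close to the full benchmark's, so iteration on the slice is less likely to overfit to one task type; checking the gate before each call stops a runaway loop before it spends past the cap.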
agent-memory-harness/
├── *.html assets/{css,js,img}/ # this site (GitHub Pages)
├── architecture.md · plan.md · prd.md # docs / contract
├── eval/ # the eval harness (Python pkg `memeval`)
│ ├── memeval/
│ │ ├── schema.py · protocols.py # frozen contract
│ │ ├── harness.py · agent.py · cli.py · results.py
│ │ ├── loaders/ · metrics.py · cost.py · trajectory.py · models.py # Ken
│ │ ├── stores/ · router.py # backends + dispatch (Brent)
│ │ ├── dreaming/ # async curation (Scott B.)
│ │ ├── opencode/ # OpenCode memory framework (Keith)
│ │ └── claudecode/ # run benchmarks via Claude Code CLI + plugin
│ ├── tests/ # offline smoke tests (stdlib-only)
│ └── pyproject.toml
├── results/{vX.Y}/ # per-benchmark versioned result files
└── README.md