Implementation · AI Agent Memory Harness

The unit of memory

Memory-item schema

Every memory — wherever it is stored — shares one shape. This is what the persistence layer writes and the dreaming component reconciles.

# A memory item, shown as Markdown YAML frontmatter
id: mem_8f2a1c
content: "Forward Windows env vars into WSL login shells via WSLENV."
created_at: 2026-06-16T22:14:05Z
session: sess_2026-06-16_a
relevancy: 0.82            # 0..1, decays over time
source: agent_step        # agent_step | user | tool | reconciled
tags: [wsl, env, secrets]
backends: [markdown, sqlite]
status: active            # active | superseded | blacklisted
supersedes: null          # id of the item this replaces, if any
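
For code that handles items in memory, the same shape can be mirrored as a dataclass. This is a sketch, not the project's schema.py: field names follow the frontmatter above, while the `Literal` aliases, the default values and the range check are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

Source = Literal["agent_step", "user", "tool", "reconciled"]
Status = Literal["active", "superseded", "blacklisted"]


@dataclass
class MemoryItem:
    id: str
    content: str
    created_at: str                    # ISO 8601 UTC timestamp
    session: str
    relevancy: float = 0.0             # 0..1, decays over time
    source: Source = "agent_step"
    tags: list[str] = field(default_factory=list)
    backends: list[str] = field(default_factory=list)
    status: Status = "active"
    supersedes: Optional[str] = None   # id of the item this replaces, if any

    def __post_init__(self) -> None:
        # Assumed guard: reject values outside the documented 0..1 range.
        if not 0.0 <= self.relevancy <= 1.0:
            raise ValueError(f"relevancy out of range: {self.relevancy}")


item = MemoryItem(
    id="mem_8f2a1c",
    content="Forward Windows env vars into WSL login shells via WSLENV.",
    created_at="2026-06-16T22:14:05Z",
    session="sess_2026-06-16_a",
    relevancy=0.82,
    tags=["wsl", "env", "secrets"],
    backends=["markdown", "sqlite"],
)
```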

Keith · Person 1 owns this

Storage interface

The single abstraction every backend implements. Freezing this on day 3 is what lets the four streams build in parallel.

from __future__ import annotations   # keeps the MemoryItem / Query / Hit annotations unevaluated

from typing import Iterable, Protocol

class MemoryStore(Protocol):
    def write(self, item: MemoryItem) -> str: ...        # returns id
    def get(self, id: str) -> MemoryItem | None: ...
    def search(self, query: Query, k: int = 8) -> list[Hit]: ...
    def delete(self, id: str) -> None: ...
    def all(self) -> Iterable[MemoryItem]: ...           # for the offline dreaming sweep

Backends are swappable: each store satisfies this interface, so the orchestrator and dreaming worker never depend on a concrete store.
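
To make the swap concrete, here is a minimal dict-backed store that exposes the same five methods. It is a sketch for tests and local experiments, not one of the planned backends. The `MemoryItem`, `Query` and `Hit` classes are trimmed, hypothetical stand-ins (the source does not define `Query` or `Hit`), and the word-overlap scoring is a placeholder.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class MemoryItem:          # trimmed stand-in for the full schema
    id: str
    content: str


@dataclass
class Query:               # hypothetical shape
    text: str


@dataclass
class Hit:                 # hypothetical shape
    item: MemoryItem
    score: float


class InMemoryStore:
    """Dict-backed store exposing the five MemoryStore methods."""

    def __init__(self) -> None:
        self._items: dict[str, MemoryItem] = {}

    def write(self, item: MemoryItem) -> str:
        self._items[item.id] = item
        return item.id

    def get(self, id: str) -> Optional[MemoryItem]:
        return self._items.get(id)

    def search(self, query: Query, k: int = 8) -> list[Hit]:
        # Placeholder scoring: count query words found in the content.
        words = query.text.lower().split()
        hits = []
        for item in self._items.values():
            text = item.content.lower()
            score = sum(1 for w in words if w in text)
            if score > 0:
                hits.append(Hit(item, float(score)))
        hits.sort(key=lambda h: h.score, reverse=True)
        return hits[:k]

    def delete(self, id: str) -> None:
        self._items.pop(id, None)      # deleting a missing id is a no-op

    def all(self) -> Iterable[MemoryItem]:
        return list(self._items.values())
```

Because a `Protocol` is matched structurally, a class like this needs no inheritance from `MemoryStore` to satisfy a type checker.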

Brent · Person 3 owns this

Per-backend schema & index

SQLite + vectors

CREATE TABLE memory (
  id         TEXT PRIMARY KEY,
  content    TEXT NOT NULL,
  embedding  BLOB,        -- float32[d] → HNSW/FAISS
  created_at TEXT,
  relevancy  REAL,
  session    TEXT,
  status     TEXT DEFAULT 'active'
);
CREATE INDEX idx_session ON memory(session);

Embed on write, mirror the vector into an HNSW/FAISS ANN index, hydrate rows by id on hit.
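
The write-then-hydrate flow can be sketched with the standard library alone. Two assumptions to flag: the embedding is supplied by the caller (no embedding model is named in the source), and a brute-force cosine scan stands in for the HNSW/FAISS index, which would replace `search` in a real backend. The table is trimmed to three columns.

```python
import math
import sqlite3
from array import array

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memory ("
    "id TEXT PRIMARY KEY, content TEXT NOT NULL, embedding BLOB)"
)


def pack(vec):
    # Store the vector as raw float32 bytes, matching the BLOB column.
    return array("f", vec).tobytes()


def unpack(blob):
    out = array("f")
    out.frombytes(blob)
    return out.tolist()


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def write(id_, content, vec):
    conn.execute(
        "INSERT OR REPLACE INTO memory (id, content, embedding) VALUES (?, ?, ?)",
        (id_, content, pack(vec)),
    )


def search(query_vec, k=8):
    # Stand-in for the ANN lookup: score every row, return the top k.
    rows = conn.execute("SELECT id, content, embedding FROM memory").fetchall()
    scored = [
        (cosine(query_vec, unpack(blob)), id_, content)
        for id_, content, blob in rows
    ]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:k]
```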

Graph store (Cypher-style)

(:Memory {id, content, created_at})

(:Memory)-[:RELATES_TO]->(:Memory)
(:Memory)-[:CONTRADICTS]->(:Memory)
(:Memory)-[:SUPERSEDES]->(:Memory)

Typed traversal index over node/edge types; bound traversals by depth or relevancy.
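
Bounding a traversal by depth might look like the breadth-first walk below. It is a plain-Python sketch over a hypothetical edge list, not a query against a real graph database. The edge type names come from the schema above; the ids and the `neighbors` helper are invented for illustration.

```python
from collections import deque

# Hypothetical edge list: (source id, edge type, target id).
EDGES = [
    ("m1", "RELATES_TO", "m2"),
    ("m2", "RELATES_TO", "m3"),
    ("m3", "RELATES_TO", "m4"),
    ("m1", "CONTRADICTS", "m5"),
]


def neighbors(start, edge_type, max_depth):
    """Breadth-first walk along one edge type, stopping at max_depth hops."""
    adjacency = {}
    for src, kind, dst in EDGES:
        if kind == edge_type:
            adjacency.setdefault(src, []).append(dst)

    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue                   # depth bound reached: do not expand
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order
```

Filtering the adjacency map by edge type before walking is the "typed" part: a CONTRADICTS query never touches RELATES_TO edges.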

Markdown

memories/<session>/mem_8f2a1c.md
inverted_index.json   # keyword → [file paths]
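
A keyword-to-paths index of this kind can be built in a few lines. The sketch below assumes a simple lowercase word tokenizer, which the source does not specify; the second file name is hypothetical.

```python
import json
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase keyword to the sorted file paths containing it."""
    index = defaultdict(set)
    for path, text in docs.items():
        for word in re.findall(r"[a-z0-9_]+", text.lower()):
            index[word].add(path)
    return {word: sorted(paths) for word, paths in sorted(index.items())}


docs = {
    "memories/sess_a/mem_8f2a1c.md": "Forward Windows env vars into WSL via WSLENV.",
    "memories/sess_a/mem_000002.md": "WSL login shells read env files.",  # hypothetical
}
index = build_index(docs)
serialized = json.dumps(index, indent=2)   # one possible inverted_index.json body
```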

Brent · Person 3

Router contract

def route(query: Query) -> Backend:
    if query.kind == "relationship":
        return GRAPH
    if query.needs_semantic:
        return VECTOR
    return MARKDOWN   # literal recall

Start rule-based; later swap the body for a learned classifier over the query embedding + metadata. Same signature, no caller changes.
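
A runnable version of the rule-based router, for reference. The `Backend` enum and the `Query` fields are assumptions that mirror the names used in the contract above; the real types live in the frozen contract.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    GRAPH = "graph"
    VECTOR = "vector"
    MARKDOWN = "markdown"


@dataclass
class Query:                       # hypothetical fields
    text: str
    kind: str = "lookup"
    needs_semantic: bool = False


def route(query: Query) -> Backend:
    # Rules are checked in order, so the first match wins.
    if query.kind == "relationship":
        return Backend.GRAPH
    if query.needs_semantic:
        return Backend.VECTOR
    return Backend.MARKDOWN        # literal recall
```

Note the ordering: a relationship query goes to the graph store even when it also sets `needs_semantic`.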

Scott B. · Person 4

Dreaming worker

def dream(store: MemoryStore) -> Governance:
    items = list(store.all())
    merge_duplicates(items)      # exact·semantic·near
    resolve_conflicts(items)     # recency·conf·source
    gov = session_rules(items)   # know·do·blacklist
    prune(items)                 # decay + caps
    return gov

Runs asynchronously when the agent is idle; emits a conflict-free state + the per-session governance object the next run loads.
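
Two of the steps are easy to sketch in isolation. The code below covers only the exact-match tier of duplicate merging and a decay-plus-cap prune. The `Item` shape, the "keep the freshest copy" rule, and the `tau_days`, `floor` and `cap` values are all assumptions, not the worker's actual policy.

```python
import math
from dataclasses import dataclass


@dataclass
class Item:                        # hypothetical, trimmed item shape
    id: str
    content: str
    age_days: float
    relevancy: float
    status: str = "active"


def _key(item):
    # Normalize case and whitespace so trivially different copies collide.
    return " ".join(item.content.lower().split())


def merge_duplicates(items):
    """Exact tier only: keep the freshest copy, mark the rest superseded."""
    best = {}
    for item in items:
        key = _key(item)
        if key not in best or item.age_days < best[key].age_days:
            best[key] = item
    for item in items:
        if best[_key(item)] is not item:
            item.status = "superseded"


def prune(items, tau_days=30.0, floor=0.1, cap=1000):
    """Decay relevancy by age, drop items below the floor, then cap the count."""
    kept = []
    for item in items:
        if item.status != "active":
            continue
        decayed = item.relevancy * math.exp(-item.age_days / tau_days)
        if decayed >= floor:
            kept.append((decayed, item))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in kept[:cap]]
```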

Day Dream (in-session) is fired automatically by the plugin's Stop hook at session end. It reads the session .jsonl logs, tracks processed-vs-delta state so that only unprocessed entries are sent, calls an OpenRouter model to decide what to remember, then calls the Orchestrator to save the result. Dream (night), above, is the deep cross-session sweep over the whole store.
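
The processed-vs-delta bookkeeping could be as small as a per-log line cursor. The sketch below is one assumed design, not the plugin's code: it stores the number of lines already handled in a JSON state file, and it leaves out the model call and the Orchestrator save entirely.

```python
import json
from pathlib import Path


def unprocessed_entries(log_path: Path, state_path: Path) -> list:
    """Return log entries not yet seen, then advance the saved line cursor."""
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    done = state.get(log_path.name, 0)          # lines handled on earlier runs

    lines = log_path.read_text().splitlines()
    delta = [json.loads(line) for line in lines[done:] if line.strip()]

    state[log_path.name] = len(lines)
    state_path.write_text(json.dumps(state))
    return delta
```

A production hook would more likely advance the cursor only after the save succeeds, so that a failed run retries the same delta.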

Keith + Ken · Person 1 + Person 2

Ranking & metric definitions

Retrieval ranking

score(item, q) = w_r · recency(item)
              + w_s · relevancy(item, q)

recency(item)      = exp(-Δt / τ)            # freshness decay; Δt = item age, τ = decay constant
relevancy(item, q) = cos(emb_item, emb_q)    # cosine similarity of item and query embeddings
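
As a worked sketch of the formula: the weights `w_r` and `w_s` and the one-week `tau` below are placeholder values, since the source does not fix them.

```python
import math

WEEK_SECONDS = 7 * 24 * 3600


def recency(age_seconds, tau):
    # Exponential freshness decay: 1.0 for a brand-new item, toward 0 with age.
    return math.exp(-age_seconds / tau)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(item_emb, query_emb, age_seconds, w_r=0.3, w_s=0.7, tau=WEEK_SECONDS):
    return w_r * recency(age_seconds, tau) + w_s * cosine(item_emb, query_emb)
```

With equal similarity, the fresher of two items always scores higher, which is the behavior the Recency metric checks for.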

The four metrics

Recency — % of queries where the freshest relevant item is ranked #1.
Efficiency — memory tokens ÷ total tokens; target < 10% overhead.
Relevancy — mean cosine of retrieved items to query; target > 0.7.
Accuracy — task success rate, memory-on vs memory-off.
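
The four definitions reduce to small aggregate functions. The input shapes below (id pairs per query, lists of cosines, lists of pass/fail booleans) are assumptions about what the harness logs, chosen to keep the sketch self-contained.

```python
def recency_at_1(results):
    """results: one (top_ranked_id, freshest_relevant_id) pair per query."""
    if not results:
        return 0.0
    return sum(1 for top, fresh in results if top == fresh) / len(results)


def efficiency(memory_tokens, total_tokens):
    # Share of the token budget spent on memory; the target is below 0.10.
    return memory_tokens / total_tokens if total_tokens else 0.0


def mean_relevancy(cosines):
    # Mean cosine of retrieved items to their query; the target is above 0.7.
    return sum(cosines) / len(cosines) if cosines else 0.0


def accuracy_lift(on, off):
    """Task success rate with memory on minus the rate with memory off."""
    return sum(on) / len(on) - sum(off) / len(off)
```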

Ken builds the harness · captains run the shards

Evaluation protocol

One controlled grid per benchmark, run cheapest-first. The cell that matters: does Haiku + harness clear the Opus 4.8 baseline?

Run          Model     Memory  Role                                  Order / cost
Treatment    Haiku     on      The hypothesis under test             1st · must-run
Lower bound  Haiku     off     How far memory lifts the small model  2nd · cheap
Reference    Sonnet    off     Middle-tier sanity check              3rd · optional
Target       Opus 4.8  off     The bar to beat                       4th · confirm only

Hold prompts, tools and task order fixed across runs; log every trajectory (written / retrieved / used) for reproducibility, and report deltas with a significance test.
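
The protocol does not name a specific test. Because the runs share the same tasks in the same order, one reasonable option for pass/fail outcomes is an exact McNemar test on the paired results, sketched here with the standard library. The choice of test and the function name are assumptions.

```python
from math import comb


def mcnemar_exact(on, off):
    """Two-sided exact McNemar p-value for paired pass/fail outcomes."""
    b = sum(1 for x, y in zip(on, off) if x and not y)   # only memory-on passed
    c = sum(1 for x, y in zip(on, off) if y and not x)   # only memory-off passed
    n = b + c
    if n == 0:
        return 1.0                                       # no discordant pairs
    # Binomial(n, 0.5) tail at or below the smaller discordant count.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only tasks where the two runs disagree carry information here; tasks that both runs pass or both fail drop out of the calculation.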

Execution: local Claude Code CLI. Benchmarks run through the Claude Code CLI (subscription auth; API keys stripped, no API billing), comparing Claude Code's built-in memory vs our plugin memory:

python -m memeval.claudecode.run_bench --benchmark all --mode all --model claude-haiku-4-5

Each run saves a per-benchmark, versioned file results/{vX.Y}/{bench-name}-{timestamp}.json (vX.Y = the memory-system version, starting at v0.1) and appends to the aggregate ledger the Results page reads.

Runs are sharded by benchmark captain, each on a separate API key — Keith (SWE-Bench-CL), Ken (LongMemEval), Brent (SWE-ContextBench + ContextBench), Scott B. (MemoryAgentBench). Iterate on a ~10–15% stratified dev slice, cache + checkpoint every call, and enforce a per-benchmark budget gate. Full cost strategy on the plan.
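
A stratified dev slice can be drawn deterministically so every captain iterates on the same subset. In this sketch the stratum key (`difficulty`), the seed, and the "at least one task per stratum" rule are assumptions; each benchmark would pick its own stratification field.

```python
import random
from collections import defaultdict


def stratified_slice(tasks, key, fraction=0.1, seed=0):
    """Sample about `fraction` of the tasks from each stratum, at least one each."""
    groups = defaultdict(list)
    for task in tasks:
        groups[key(task)].append(task)

    rng = random.Random(seed)          # fixed seed keeps the slice reproducible
    picked = []
    for name in sorted(groups):        # sorted so the draw order is stable
        members = groups[name]
        n = max(1, round(len(members) * fraction))
        picked.extend(rng.sample(members, n))
    return picked


tasks = [{"id": i, "difficulty": "easy" if i < 80 else "hard"} for i in range(100)]
dev = stratified_slice(tasks, key=lambda t: t["difficulty"], fraction=0.1)
```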

Where things live

Suggested repo layout

agent-memory-harness/
├── *.html  assets/{css,js,img}/  # this site (GitHub Pages)
├── architecture.md · plan.md · prd.md  # docs / contract
├── eval/                          # the eval harness (Python pkg `memeval`)
│   ├── memeval/
│   │   ├── schema.py · protocols.py     # frozen contract
│   │   ├── harness.py · agent.py · cli.py · results.py
│   │   ├── loaders/ · metrics.py · cost.py · trajectory.py · models.py  # Ken
│   │   ├── stores/ · router.py          # backends + dispatch (Brent)
│   │   ├── dreaming/                    # async curation (Scott B.)
│   │   ├── opencode/                    # OpenCode memory framework (Keith)
│   │   └── claudecode/                  # run benchmarks via Claude Code CLI + plugin
│   ├── tests/                          # offline smoke tests (stdlib-only)
│   └── pyproject.toml
├── results/{vX.Y}/                # per-benchmark versioned result files
└── README.md

See the results template →
